Posts Tagged Partition

Teradata Complex CASE_N MLPPI

In addition to RANGE_N partitioning mechanism, CASE_N provides a good choice to distribute data with complex splitting criteria. It is much harder to effectively express the following sample situation with ORACLE’s range-based partitioning syntax.

  • FACT_WEB_CLICKS is a huge click-stream table
  • 40% of clicks are generated by non-members or unsigned-in sessions
    • 50% of the non-member clicks are generated by robots and crawlers
    • 40% of the non-member clicks are associated with the home page (PAGE_ID = 12345)
  • 60% of clicks are generated by members or signed-in sessions
    • 40% of the member clicks go to the 15 most popular PAGE_IDs
    • 50% of the traffic has TRACKING_ID populated, the other 50% does not
    • 2% of the member clicks are generated by robots or crawlers
  • For non-member clicks, negative random MEMBER_ID were generated to avoid PI skewness
  • Most analytical queries only care about the clicks with MEMBER_ID > 0
  • PAGE_ID is highly skewed, because the popular pages have a lot more clicks than the long tail
  • Some analytical queries only care about the clicks with TRACKING_ID > 0
  • Daily partition at the first level. This is commented out in the sample DDL, so we can map each partition’s row count to “PARTITION ID” later

DDL1: Catch-All Bucket without Explicit Criteria

 CREATE MULTISET TABLE FACT_WEB_CLICKS ,NO FALLBACK ,
     NO BEFORE JOURNAL,
     NO AFTER JOURNAL,
     FREESPACE = 0 PERCENT,
     CHECKSUM = DEFAULT,
     DEFAULT MERGEBLOCKRATIO
     (
      WEB_CLICK_ID BIGINT NOT NULL,
      CLICK_DATE DATE FORMAT 'YYYY-MM-DD',
      CLICK_TIME TIMESTAMP(3),
      MEMBER_ID BIGINT NOT NULL,
      PAGE_ID INTEGER NOT NULL,
      SESSION_ID BIGINT NOT NULL,
      TRACKING_ID INTEGER NOT NULL,
      CLICK_SEQ INTEGER,
      LOCALE_ID INTEGER NOT NULL,
      TOTAL_TIME INTEGER,
      PAGE_VIEW_SEQ INTEGER,
      IS_CRAWLER CHAR(1) CHARACTER SET LATIN NOT CASESPECIFIC DEFAULT 'N')
PRIMARY INDEX ( MEMBER_ID )
PARTITION BY (
  --RANGE_N(CLICK_DATE BETWEEN DATE '2012-07-01' AND DATE + 10 EACH INTERVAL '1' DAY ),
  CASE_N(
    MEMBER_ID <= 0  AND  IS_CRAWLER =  'Y',
    MEMBER_ID <= 0  AND  IS_CRAWLER =  'N' AND  PAGE_ID =  12345 ,
    MEMBER_ID <= 0  AND  IS_CRAWLER =  'N', -- catch_all_bucket for other PAGE_ID(s)

    MEMBER_ID >  0  AND  PAGE_ID  IN (12345 ,12355 ),
    MEMBER_ID >  0  AND  PAGE_ID  IN (20001 ,20010 ,20011 ),
    MEMBER_ID >  0  AND  PAGE_ID  IN (70001 ,70008 ),
    MEMBER_ID >  0  AND  PAGE_ID  IN (80001 ,80002 ),
    MEMBER_ID >  0  AND  PAGE_ID  IN (90001 ,90002 ),
    MEMBER_ID >  0  AND  PAGE_ID  IN (60001 ,60002 ),
    MEMBER_ID >  0  AND  PAGE_ID  IN (50001 ,50002 ),
    MEMBER_ID >  0  -- catch_all_bucket for other PAGE_ID(s)
  ),
  CASE_N(TRACKING_ID <=  0 , TRACKING_ID >  0 )
);

Data Distribution Among Each Partition

select Partition, count(*)
from FACT_WEB_CLICKS
group by 1 order by 1;
row count for each partition

The data are distributed among each partition as expected. The 4 catch-all bucket partitions (5,6,21,22) contain the rows considered as “NO CASE OR UNKNOWN”.

The data are distributed among each partition as expected. But the query optimizer cannot take full advantage of the partition elimination yet.

Explain Plan

collect stats FACT_WEB_CLICKS column(partition);
collect stats FACT_WEB_CLICKS index(member_id);
collect stats FACT_WEB_CLICKS column(page_id);
collect stats FACT_WEB_CLICKS column(partition,member_id);

DIAGNOSTIC VERBOSEEXPLAIN ON FOR SESSION;
DIAGNOSTIC HELPSTATS ON FOR SESSION;

explain
select * from FACT_WEB_CLICKS
where member_id > 0
and tracking_id > 0
and page_id in (70001)
and click_date = date '2013-02-05';

Explanation
-----------------------------------------------------------------------
1) First, we lock a distinct fact_web."pseudo table" for read on a
RowHash to prevent global deadlock for fact_web.FACT_WEB_CLICKS.
2) Next, we lock fact_web.FACT_WEB_CLICKS for read.
3) We do an all-AMPs RETRIEVE step from 2 partitions of
fact_web.FACT_WEB_CLICKS with a condition of (
"(fact_web.FACT_WEB_CLICKS.CLICK_DATE = DATE '2013-02-05')
AND ((fact_web.FACT_WEB_CLICKS.PAGE_ID = 70001) AND
((fact_web.FACT_WEB_CLICKS.MEMBER_ID > 0) AND
(fact_web.FACT_WEB_CLICKS.TRACKING_ID > 0)))") into
Spool 1 (group_amps), which is built locally on the AMPs with
Field1 ("UniqueId"). The size of Spool 1 is estimated with no
confidence to be 116,833,856 rows (14,253,730,432 bytes). Spool
Asgnlist:

With the precise filter clause, the explain plan should hit only 1 partition (partition 12 in this case), but the actual plan also wants to scan the catch-all partition 22. There should be a way to improve partition elimination.

DDL2: Catch-All Bucket with Explicit Criteria

 CREATE MULTISET TABLE FACT_WEB_CLICKS2 ,NO FALLBACK ,
     NO BEFORE JOURNAL,
     NO AFTER JOURNAL,
     FREESPACE = 0 PERCENT,
     CHECKSUM = DEFAULT,
     DEFAULT MERGEBLOCKRATIO
     (
      WEB_CLICK_ID BIGINT NOT NULL,
      CLICK_DATE DATE FORMAT 'YYYY-MM-DD',
      CLICK_TIME TIMESTAMP(3),
      MEMBER_ID BIGINT NOT NULL,
      PAGE_ID INTEGER NOT NULL,
      SESSION_ID BIGINT NOT NULL,
      TRACKING_ID INTEGER NOT NULL,
      CLICK_SEQ INTEGER,
      LOCALE_ID INTEGER NOT NULL,
      TOTAL_TIME INTEGER,
      PAGE_VIEW_SEQ INTEGER,
      IS_CRAWLER CHAR(1) CHARACTER SET LATIN NOT CASESPECIFIC DEFAULT 'N')
PRIMARY INDEX ( MEMBER_ID )
PARTITION BY (
  --RANGE_N(CLICK_DATE BETWEEN DATE '2012-07-01' AND DATE + 10 EACH INTERVAL '1' DAY ),
  CASE_N(
    MEMBER_ID <= 0  AND  IS_CRAWLER =  'Y',
    MEMBER_ID <= 0  AND  IS_CRAWLER =  'N' AND  PAGE_ID =  12345 ,
    MEMBER_ID <= 0  AND  IS_CRAWLER =  'N' AND  PAGE_ID <> 12345 , -- catch_all_bucket for other PAGE_ID(s)

    MEMBER_ID >  0  AND  PAGE_ID  IN (12345 ,12355 ),
    MEMBER_ID >  0  AND  PAGE_ID  IN (20001 ,20010 ,20011 ),
    MEMBER_ID >  0  AND  PAGE_ID  IN (70001 ,70008 ),
    MEMBER_ID >  0  AND  PAGE_ID  IN (80001 ,80002 ),
    MEMBER_ID >  0  AND  PAGE_ID  IN (90001 ,90002 ),
    MEMBER_ID >  0  AND  PAGE_ID  IN (60001 ,60002 ),
    MEMBER_ID >  0  AND  PAGE_ID  IN (50001 ,50002 ),
    MEMBER_ID >  0  AND  PAGE_ID not in (12345,12355,20001,......,50002)  -- catch_all_bucket for other PAGE_ID(s)
  ),
  CASE_N(TRACKING_ID <=  0 , TRACKING_ID >  0 )
);

explain
select * from FACT_WEB_CLICKS2
where member_id > 0
and tracking_id > 0
and page_id in (12345)
and click_date = date '2013-02-05';

 Explanation
 --------------------------------------------------------------------------
   1) First, we lock a distinct fact_web."pseudo table" for read on a
      RowHash to prevent global deadlock for SCRATCH_DS.FACT_WEB_CLICKS2.
   2) Next, we lock fact_web.FACT_WEB_CLICKS2 for read.
   3) We do an all-AMPs RETRIEVE step from a single partition of
      fact_web.FACT_WEB_CLICKS2 with a condition of (
      "(fact_web.FACT_WEB_CLICKS2.PAGE_ID = 12345) AND
   ......

Now the query hit a single partition. So even though the first DDL achieve exactly the same data distribution among each partition as the second DDL, the following explicit criteria for the catch-all bucket will help optimizer to choose the optimal partition elimination:

  • PAGE_ID <> 12345
  • PAGE_ID not in (12345,12355,20001,……,50002)

, , , , , , ,

Leave a comment