How I Defined a Time Dimension Using a Surrogate Key

 

Jaromir D.B. Nemec

 

The usage of surrogate keys is a general acceptable practice in DWH modelling.

Time dimension (or better a calendar dimension as the considered granularity is a day) seems to be no exception to this rule.

In practice in most cases it turns out that the time dimension

- has a dominant role in the physical design (defining the partitioning of the fact tables) and

- is a frequent used constraint in reports

 

If those two aspects are met a careful compromise between a theoretically clean design and usability must be done.

This leads in most cases to use of natural keys in the time dimension.

The natural time key is some times called a "smart surrogate key" to avoid a clash with the methodologists.

This paper illustrates some aspects leading to this decisions based on Oracle Database 11gR2.

 

Setup

 

Let define the fact table and a time dimension table in two versions. The first design uses a surrogate key time_id as a dimension key.

The second design uses a natural key (DATE format) as dimension key.

The fact table is partitioned of the key of the time dimension, i.e. on surrogate key in the first case, on the natural key in the second one.

 

Lets populate both versions with sample data - see Appendix A bellow.

 

Partitioning Schema

 

The first thing we realise is that the surrogate key approach leads to partitioning of the fact table based on numeric values. This eliminates the benefits of interval partitioning feature for non-trivial intervals such as partitioning on monthly basis.

 

The second thing is even more critical. While defining the range partitioning we practically assume that the surrogate key is sorted in the same way as the natural key.

Anyway if the sorting order of the "meaningless" surrogate key would be random, the range partition would collapse to hash partitioning. This would disable for example the rolling window approach on the fact table.

 

The assumption about the sort order of the surrogate key in the time dimension is the first step towards the "smart surrogate key".

 

Data Access

 

We will investigate following basic queries:

 

A) select data for a particular date

B) select data for a range of dates

 

One-Day Data Access using Surrogate Keys

 

To get the transaction from the fact table for one day following join of the time dimension and the fact table should be done.

 

select t.time_value,dim1_id,measure1

from tab_fact f, time_dim t

where t.time_value = to_date('10.05.2011','dd.mm.yyyy') and

t.time_id = f.time_id;

 

The execution plan is as follows.

--------------------------------------------------------------------------------------------------------------

| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop |

--------------------------------------------------------------------------------------------------------------

| 0 | SELECT STATEMENT | | | | 14 (100)| | | |

| 1 | NESTED LOOPS | | 20008 | 468K| 14 (0)| 00:00:01 | | |

| 2 | TABLE ACCESS BY INDEX ROWID| TIME_DIM | 1 | 13 | 2 (0)| 00:00:01 | | |

|* 3 | INDEX RANGE SCAN | TIME_DIM_IDX1 | 1 | | 1 (0)| 00:00:01 | | |

| 4 | PARTITION RANGE ITERATOR | | 19999 | 214K| 12 (0)| 00:00:01 | KEY | KEY |

|* 5 | TABLE ACCESS FULL | TAB_FACT | 19999 | 214K| 12 (0)| 00:00:01 | KEY | KEY |

--------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):

---------------------------------------------------

3 - access("T"."TIME_VALUE"=TO_DATE(' 2011-05-10 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))

5 - filter("T"."TIME_ID"="F"."TIME_ID")

The execution plan is fine, only the relevant partition of the fact table is accessed.

Runtime partition pruning is used - KEY - KEY.

One Day Data Access using Natural Keys

 

The select is very similar to the previous one. The only difference is that the natural key (DATE) is used as a join condition.

 

select t.time_value,dim1_id,measure1

from tab_fact2 f, time_dim2 t

where t.time_value = to_date('10.05.2011','dd.mm.yyyy') and

t.time_value = f.time_value;

 

The execution plan is not very different from the previousone. Only the relevant partition is accessed.

 

--------------------------------------------------------------------------------------------------------

| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop |

--------------------------------------------------------------------------------------------------------

| 0 | SELECT STATEMENT | | | | 15 (100)| | | |

| 1 | NESTED LOOPS | | 9999 | 214K| 15 (0)| 00:00:01 | | |

|* 2 | INDEX UNIQUE SCAN | SYS_C0047888 | 1 | 8 | 1 (0)| 00:00:01 | | |

| 3 | PARTITION RANGE SINGLE| | 9999 | 136K| 14 (0)| 00:00:01 | 1 | 1 |

|* 4 | TABLE ACCESS FULL | TAB_FACT2 | 9999 | 136K| 14 (0)| 00:00:01 | 1 | 1 |

--------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):

---------------------------------------------------

2 - access("T"."TIME_VALUE"=TO_DATE(' 2011-05-10 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))

4 - filter("F"."TIME_VALUE"=TO_DATE(' 2011-05-10 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))

 

The subtle but important difference is that in the second case (natural key) the partition pruning is static. The accessed partition (partition 1) is known on parsing time.

This is possible due to the following reasoning made by Oracle Optimizer:

 

t.time_value = to_date('10.05.2011','dd.mm.yyyy') and

t.time_value = f.time_value

 

leads to ===>

 

f.time_value = to_date('10.05.2011','dd.mm.yyyy')

 

This fact allows to estimate correct the cardinality of the result set.

9999 rows is the right answer. Compare with the estimation using the surrogate key leading to a doubled result.

In this case the Optimizer doesn't know the exact partition at parse time (due to the surrogate key) and use as estimation the average number of rows of all partitions.

We see in the simples case there is no big difference, but the natural key behaves better.

 

Date Range Access using Surrogate Keys

 

To get the transaction from the fact table for range of days following join of the time dimension and the fact table should be done.

 

select t.time_value,dim1_id,measure1

from tab_fact f, time_dim t

where t.time_value between to_date('11.05.2010','dd.mm.yyyy') and to_date('10.05.2011','dd.mm.yyyy') and

t.time_id = f.time_id;

 

Lets consider first the execution plan of Oracle 10gR2 to see the main problem of this approach.

 

--------------------------------------------------------------------------------------------------------------

| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop |

--------------------------------------------------------------------------------------------------------------

| 0 | SELECT STATEMENT | | | | 49 (100)| | | |

|* 1 | HASH JOIN | | 58374 | 1311K| 49 (5)| 00:00:01 | | |

| 2 | TABLE ACCESS BY INDEX ROWID| TIME_DIM | 366 | 4392 | 3 (0)| 00:00:01 | | |

|* 3 | INDEX RANGE SCAN | TIME_DIM_IDX1 | 366 | | 2 (0)| 00:00:01 | | |

| 4 | PARTITION RANGE ALL | | 58503 | 628K| 45 (3)| 00:00:01 | 1 | 6 |

| 5 | TABLE ACCESS FULL | TAB_FACT | 58503 | 628K| 45 (3)| 00:00:01 | 1 | 6 |

--------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):

---------------------------------------------------

1 - access("T"."TIME_ID"="F"."TIME_ID")

3 - access("T"."TIME_VALUE">=TO_DATE('2010-05-11 00:00:00', 'yyyy-mm-dd hh24:mi:ss') AND

"T"."TIME_VALUE"<=TO_DATE('2011-05-10 00:00:00', 'yyyy-mm-dd hh24:mi:ss'))

This execution plan is a catastrophic one. There is no partition pruning on the fact table (all partition 1 to 6 are accessed).

Why?

Our assumption of ordered surrogate keys is of course only "gentleman agreement" - Oracle has no knowledge about it and can't use this it for pruning.

 

The situation is somehow better in Oracle 11gR2 - see the execution plan below.

 

---------------------------------------------------------------------------------------------------------------

| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop |

---------------------------------------------------------------------------------------------------------------

| 0 | SELECT STATEMENT | | | | 78 (100)| | | |

|* 1 | HASH JOIN | | 59834 | 1402K| 78 (2)| 00:00:01 | | |

| 2 | PART JOIN FILTER CREATE | :BF0000 | 366 | 4758 | 3 (0)| 00:00:01 | | |

| 3 | TABLE ACCESS BY INDEX ROWID| TIME_DIM | 366 | 4758 | 3 (0)| 00:00:01 | | |

|* 4 | INDEX RANGE SCAN | TIME_DIM_IDX1 | 366 | | 2 (0)| 00:00:01 | | |

| 5 | PARTITION RANGE JOIN-FILTER | | 59997 | 644K| 74 (0)| 00:00:01 |:BF0000|:BF0000|

| 6 | TABLE ACCESS FULL | TAB_FACT | 59997 | 644K| 74 (0)| 00:00:01 |:BF0000|:BF0000|

---------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):

---------------------------------------------------

1 - access("T"."TIME_ID"="F"."TIME_ID")

4 - access("T"."TIME_VALUE">=TO_DATE(' 2010-05-11 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND

"T"."TIME_VALUE"<=TO_DATE(' 2011-05-10 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))

Optimization using the Bloom Filter tries to re-establish the lost information cased by usage of surrogate keys. See [4] for further details.

Date Range Access using Natural Keys

 

The analogous select using natural key leads to straightforward execution plan.

 

select t.time_value,dim1_id,measure1

from tab_fact2 f, time_dim2 t

where t.time_value between to_date('11.05.2010','dd.mm.yyyy') and to_date('10.05.2011','dd.mm.yyyy') and

t.time_value = f.time_value;

 

--------------------------------------------------------------------------------------------------------

| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop |

--------------------------------------------------------------------------------------------------------

| 0 | SELECT STATEMENT | | | | 15 (100)| | | |

| 1 | NESTED LOOPS | | 1 | 22 | 15 (7)| 00:00:01 | | |

| 2 | PARTITION RANGE SINGLE| | 9999 | 136K| 14 (0)| 00:00:01 | 1 | 1 |

|* 3 | TABLE ACCESS FULL | TAB_FACT2 | 9999 | 136K| 14 (0)| 00:00:01 | 1 | 1 |

|* 4 | INDEX UNIQUE SCAN | SYS_C0047898 | 1 | 8 | 0 (0)| | | |

--------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):

---------------------------------------------------

3 - filter(("F"."TIME_VALUE">=TO_DATE(' 2010-05-11 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND

"F"."TIME_VALUE"<=TO_DATE(' 2011-05-10 00:00:00', 'syyyy-mm-dd hh24:mi:ss')))

4 - access("T"."TIME_VALUE"="F"."TIME_VALUE")

filter(("T"."TIME_VALUE">=TO_DATE(' 2010-05-11 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND

"T"."TIME_VALUE"<=TO_DATE(' 2011-05-10 00:00:00', 'syyyy-mm-dd hh24:mi:ss')))

 

The partition pruning is done as expected, even in the pre 11 Oracle releases.

The estimation of the cardinality of the partition is exact, unfortunately the join cardinality is wrong - should be 9999 as well.

 

If we use a direct query constraining the fact table with interval range we see even that problem disappears:

 

select f.time_value,dim1_id,measure1

from tab_fact2 f

where f.time_value between to_date('11.05.2010','dd.mm.yyyy') and to_date('10.05.2011','dd.mm.yyyy');

 

 

----------------------------------------------------------------------------------------------------

| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop |

----------------------------------------------------------------------------------------------------

| 0 | SELECT STATEMENT | | | | 14 (100)| | | |

| 1 | PARTITION RANGE SINGLE| | 9999 | 136K| 14 (0)| 00:00:01 | 1 | 1 |

|* 2 | TABLE ACCESS FULL | TAB_FACT2 | 9999 | 136K| 14 (0)| 00:00:01 | 1 | 1 |

----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):

---------------------------------------------------

2 - filter(("F"."TIME_VALUE">=TO_DATE(' 2010-05-11 00:00:00', 'syyyy-mm-dd hh24:mi:ss')

AND "F"."TIME_VALUE"<=TO_DATE(' 2011-05-10 00:00:00', 'syyyy-mm-dd hh24:mi:ss')))

 

In a pre-11g Oracle Version a join with a surrogated time dimension for a time interval leads to an unacceptable execution plan accessing the whole fact table. The natural key based dimension design enforces the correct partition pruning of the fact table.

 

Calendar Holes

 

The second important consideration while designing the time dimension is if you reserve "holes" in the surrogate key schema of the time dimension table to be able to insert "new appearing" days in the future. Remember the ordering assumption discussed above, so if your surrogate key schema is dense you'll get problems to insert a "new day" in the correct order.

Anyway in most cases the answer is no - the next Gregorian Reform will come inevitable, but not in the very near future.

 

Accepting the dense surrogate key schema is the second step towards the smart surrogate keys.

Putting the two assumptions together we get a 1:1 mapping between the surrogate and natural key.

One example of this mapping is illustrated in the select below:

 

select rownum time_id, to_date('01.01.1900','dd.mm.yyyy') +rownum -1 time_value

from dual connect by level < 365.25 * 200;

 

TIME_ID TIME_VALUE

---------------------- -------------------------

1 01.01.1900

2 02.01.1900

3 03.01.1900

4 04.01.1900

. . . .

 

From the mathematical point of view structures that can be mapped 1:1 are isomorphic and therefore exchangeable.

There is no difference between the natural and surrogate key.

Conclusion

 

Should the time (calendar) dimension be considered in the design of a data warehouse?

Sure, the important parts of it are the non-trivial attributes such as holydays.

The time dimension is based on the natural key (DATE format).

 

Motivation - if you can choose between two identical (isomorphic) things choose the one that better fits the requirements and better align with the current state of the technology.

Call the natural key a "smart surrogate key" to avoid idle design discussions.

 

By the way, even the Kimball Group made a move in this design topic towards this pragmatic solution. Read between the lines in

[1], [2] and [3].

 

References

 

[1] http://www.kimballgroup.com/html/designtipsPDF/DesignTips2000/KimballDT5SurrogateKeys.pdf

[2] http://www.kimballgroup.com/html/designtipsPDF/DesignTips2004/KimballDT51LatestThinking.pdf

[3] http://www.kimballgroup.com/html/designtipsPDF/DesignTips2006/KU85SmartDateKeysPartitionFactTables.pdf

[4] http://www.trivadis.com/uploads/tx_cabagdownloadarea/OOW_Partitioning_Your_DWH.pdf

 

 

Last revision 15.5.2011

Jaromir D.B. Nemec is a freelancer specializing in data warehouse and integration solutions. He can be reached at http://www.db-nemec.com


Apendix A Setup of Surrogate Key Dimension

 

-- create time dimesion (day granularity) wuth surrogate key

---

set echo on

create table time_dim

(time_id number not null,

time_value date not null /* smart time value */

);

alter table time_dim add primary key (time_id) rely;

create index time_dim_idx1 on time_dim(time_value);

--

insert into time_dim

select rownum, to_date('01.01.1900','dd.mm.yyyy') +rownum -1

from dual connect by level < 365.25 * 200;

commit;

--

-- fact table

create table tab_fact (

time_id number not null,

dim1_id number not null,

dim2_id number not null,

measure1 number,

measure2 number

)

partition by range (time_id)

(

partition p20110510 values less than (40673),

partition p20110511 values less than (40674),

partition p20110512 values less than (40675),

partition p20110513 values less than (40676),

partition p20110514 values less than (40677),

partition pmax values less than (MAXVALUE)

);

alter table tab_fact

add constraint fk_time_dim

foreign key (time_id) referencing time_dim (time_id) rely disable novalidate;

-- fill 3 partitions

insert into tab_fact

select

40672, -- 10.05.2011

mod(rownum,2),

mod(rownum,7),

100,

200

from dual connect by level < 10000

union all

select

40673, -- 11.05.2011

mod(rownum,2),

mod(rownum,7),

100,

200

from dual connect by level < 20000

union all

select

40674, -- 12.05.2011

mod(rownum,2),

mod(rownum,7),

100,

200

from dual connect by level < 30000

;

commit;

--

exec dbms_stats.gather_table_stats(ownname=>user, tabname=>'time_dim');

exec dbms_stats.gather_table_stats(ownname=>user, tabname=>'tab_fact',granularity=>'all',cascade=>TRUE);

 

 

Setup of Time Dimension with Natural Key

 

-- create time dimesion (day granularity) wuth surrogate key

---

create table time_dim2

(

time_value date not null /* smart time value */

);

alter table time_dim2 add primary key (time_value) rely;

--

insert into time_dim2

select to_date('01.01.1900','dd.mm.yyyy') +rownum -1

from dual connect by level < 365.25 * 200;

--

commit;

--

-- fact table

create table tab_fact2 (

time_value date not null,

dim1_id number not null,

dim2_id number not null,

measure1 number,

measure2 number

)

partition by range (time_value)

(

partition p20110510 values less than (to_date('11.05.2011','dd.mm.yyyy')),

partition p20110511 values less than (to_date('12.05.2011','dd.mm.yyyy')),

partition p20110512 values less than (to_date('13.05.2011','dd.mm.yyyy')),

partition p20110513 values less than (to_date('14.05.2011','dd.mm.yyyy')),

partition p20110514 values less than (to_date('15.05.2011','dd.mm.yyyy')),

partition pmax values less than (MAXVALUE)

);

alter table tab_fact2

add constraint fk_time_dim2

foreign key (time_value) referencing time_dim2 (time_value) rely disable novalidate;

-- fill 3 partitions

insert into tab_fact2

select

to_date('10.05.2011','dd.mm.yyyy'),

mod(rownum,2),

mod(rownum,7),

100,

200

from dual connect by level < 10000

union all

select

to_date('11.05.2011','dd.mm.yyyy'),

mod(rownum,2),

mod(rownum,7),

100,

200

from dual connect by level < 20000

union all

select

to_date('12.05.2011','dd.mm.yyyy'),

mod(rownum,2),

mod(rownum,7),

100,

200

from dual connect by level < 30000

;

commit;

--

exec dbms_stats.gather_table_stats(ownname=>user, tabname=>'time_dim2');

exec dbms_stats.gather_table_stats(ownname=>user, tabname=>'tab_fact2',granularity=>'all',cascade=>TRUE);