Jaromir D.B. Nemec
The usage of surrogate keys
is a general acceptable practice in DWH modelling.
Time dimension (or better a
calendar dimension as the considered granularity is a day) seems to be no
exception to this rule.
In practice in most cases it
turns out that the time dimension
- has a dominant role in the
physical design (defining the partitioning of the fact tables) and
- is a frequent used
constraint in reports
If those two aspects are met
a careful compromise between a theoretically clean design and usability must be
done.
This leads in most cases to
use of natural keys in the time dimension.
The natural time key is some
times called a "smart surrogate key" to avoid a clash with the
methodologists.
This paper illustrates some
aspects leading to this decisions based on Oracle Database 11gR2.
Let define the fact table and
a time dimension table in two versions. The first design uses a surrogate key
time_id as a dimension key.
The second design uses a natural
key (DATE format) as dimension key.
The fact table is partitioned
of the key of the time dimension, i.e. on surrogate key in the first case, on
the natural key in the second one.
Lets populate both versions
with sample data - see Appendix A bellow.
The first thing we realise is
that the surrogate key approach leads to partitioning of the fact table based
on numeric values. This eliminates the benefits of interval partitioning
feature for non-trivial intervals such as partitioning on monthly basis.
The second thing is even more
critical. While defining the range partitioning we practically assume that the
surrogate key is sorted in the same way as the natural key.
Anyway if the sorting order
of the "meaningless" surrogate key would be random, the range
partition would collapse to hash partitioning. This would disable for example
the rolling window approach on the fact table.
The assumption about the sort
order of the surrogate key in the time dimension is the first step towards the
"smart surrogate key".
We will investigate following
basic queries:
A) select data for a
particular date
B) select data for a range of
dates
To get the transaction from
the fact table for one day following join of the time dimension and the fact
table should be done.
select t.time_value,dim1_id,measure1
from tab_fact f, time_dim t
where t.time_value =
to_date('10.05.2011','dd.mm.yyyy') and
t.time_id = f.time_id;
The execution plan is as follows.
--------------------------------------------------------------------------------------------------------------
| Id |
Operation | Name | Rows | Bytes | Cost (%CPU)| Time
| Pstart| Pstop |
--------------------------------------------------------------------------------------------------------------
| 0 | SELECT
STATEMENT | | | | 14 (100)| | | |
| 1 | NESTED LOOPS |
| 20008 | 468K| 14
(0)| 00:00:01 | | |
| 2 | TABLE ACCESS BY INDEX ROWID| TIME_DIM |
1 | 13 | 2
(0)| 00:00:01 | | |
|* 3 | INDEX RANGE SCAN | TIME_DIM_IDX1 |
1 | | 1
(0)| 00:00:01 | | |
| 4 | PARTITION RANGE ITERATOR | | 19999 |
214K| 12 (0)| 00:00:01 | KEY | KEY |
|* 5 | TABLE ACCESS FULL | TAB_FACT | 19999 | 214K| 12
(0)| 00:00:01 | KEY | KEY |
--------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
3 -
access("T"."TIME_VALUE"=TO_DATE(' 2011-05-10 00:00:00',
'syyyy-mm-dd hh24:mi:ss'))
5 -
filter("T"."TIME_ID"="F"."TIME_ID")
The execution plan is fine,
only the relevant partition of the fact table is accessed.
Runtime partition pruning is
used - KEY - KEY.
The select is very similar to
the previous one. The only difference is that the natural key (DATE) is used as
a join condition.
select t.time_value,dim1_id,measure1
from tab_fact2 f, time_dim2 t
where t.time_value =
to_date('10.05.2011','dd.mm.yyyy') and
t.time_value = f.time_value;
The execution plan is not
very different from the previousone. Only the relevant partition is accessed.
--------------------------------------------------------------------------------------------------------
| Id |
Operation | Name | Rows | Bytes | Cost (%CPU)| Time
| Pstart| Pstop |
--------------------------------------------------------------------------------------------------------
| 0 | SELECT
STATEMENT | | | | 15 (100)| | | |
| 1 | NESTED LOOPS |
| 9999 | 214K|
15 (0)| 00:00:01 | |
|
|* 2 | INDEX UNIQUE SCAN | SYS_C0047888 | 1
| 8 | 1 (0)| 00:00:01
| | |
| 3 | PARTITION RANGE SINGLE| | 9999 | 136K| 14
(0)| 00:00:01 | 1 | 1 |
|* 4 | TABLE ACCESS
FULL | TAB_FACT2 |
9999 | 136K| 14
(0)| 00:00:01 | 1 | 1 |
--------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 -
access("T"."TIME_VALUE"=TO_DATE(' 2011-05-10 00:00:00',
'syyyy-mm-dd hh24:mi:ss'))
4 -
filter("F"."TIME_VALUE"=TO_DATE(' 2011-05-10 00:00:00',
'syyyy-mm-dd hh24:mi:ss'))
The subtle but important
difference is that in the second case (natural key) the partition pruning is
static. The accessed partition (partition 1) is known on parsing time.
This is possible due to the following reasoning made by Oracle Optimizer:
t.time_value = to_date('10.05.2011','dd.mm.yyyy') and
t.time_value = f.time_value
leads to ===>
f.time_value = to_date('10.05.2011','dd.mm.yyyy')
This fact allows to estimate
correct the cardinality of the result set.
9999 rows is the right
answer. Compare with the estimation using the surrogate key leading to a
doubled result.
In this case the Optimizer
doesn't know the exact partition at parse time (due to the surrogate key) and
use as estimation the average number of rows of all partitions.
We see in the simples case
there is no big difference, but the natural key behaves better.
To get the transaction from
the fact table for range of days following join of the time dimension and the
fact table should be done.
select t.time_value,dim1_id,measure1
from tab_fact f, time_dim t
where t.time_value between to_date('11.05.2010','dd.mm.yyyy')
and to_date('10.05.2011','dd.mm.yyyy') and
t.time_id = f.time_id;
Lets consider first the
execution plan of Oracle 10gR2 to see the main problem of this approach.
--------------------------------------------------------------------------------------------------------------
| Id |
Operation | Name | Rows | Bytes | Cost (%CPU)| Time
| Pstart| Pstop |
--------------------------------------------------------------------------------------------------------------
| 0 | SELECT
STATEMENT | | | | 49 (100)| | | |
|* 1 | HASH JOIN |
| 58374 | 1311K| 49
(5)| 00:00:01 | | |
| 2 | TABLE ACCESS BY INDEX ROWID| TIME_DIM |
366 | 4392 | 3
(0)| 00:00:01 | | |
|* 3 | INDEX RANGE SCAN | TIME_DIM_IDX1 |
366 | | 2
(0)| 00:00:01 | | |
| 4 | PARTITION RANGE ALL | | 58503
| 628K| 45 (3)| 00:00:01
| 1 | 6 |
| 5 | TABLE ACCESS FULL | TAB_FACT | 58503 | 628K| 45
(3)| 00:00:01 | 1 | 6 |
--------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 -
access("T"."TIME_ID"="F"."TIME_ID")
3 -
access("T"."TIME_VALUE">=TO_DATE('2010-05-11 00:00:00',
'yyyy-mm-dd hh24:mi:ss') AND
"T"."TIME_VALUE"<=TO_DATE('2011-05-10 00:00:00',
'yyyy-mm-dd hh24:mi:ss'))
This execution plan is a
catastrophic one. There is no partition pruning on the fact table (all
partition 1 to 6 are accessed).
Why?
Our assumption of ordered
surrogate keys is of course only "gentleman agreement" - Oracle has
no knowledge about it and can't use this it for pruning.
The situation is somehow
better in Oracle 11gR2 - see the
execution plan below.
---------------------------------------------------------------------------------------------------------------
| Id |
Operation |
Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop |
---------------------------------------------------------------------------------------------------------------
| 0 | SELECT
STATEMENT | | | | 78 (100)| | | |
|* 1 | HASH JOIN |
| 59834 | 1402K| 78
(2)| 00:00:01 | | |
| 2 | PART JOIN FILTER CREATE | :BF0000 | 366 | 4758 |
3 (0)| 00:00:01 | |
|
| 3 | TABLE ACCESS BY INDEX ROWID| TIME_DIM |
366 | 4758 | 3
(0)| 00:00:01 | | |
|* 4 | INDEX RANGE SCAN | TIME_DIM_IDX1 | 366 |
| 2 (0)| 00:00:01 |
| |
| 5 | PARTITION RANGE JOIN-FILTER | | 59997 | 644K|
74 (0)| 00:00:01
|:BF0000|:BF0000|
| 6 | TABLE ACCESS FULL | TAB_FACT | 59997 | 644K| 74
(0)| 00:00:01 |:BF0000|:BF0000|
---------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 -
access("T"."TIME_ID"="F"."TIME_ID")
4 -
access("T"."TIME_VALUE">=TO_DATE(' 2010-05-11 00:00:00',
'syyyy-mm-dd hh24:mi:ss') AND
"T"."TIME_VALUE"<=TO_DATE(' 2011-05-10 00:00:00',
'syyyy-mm-dd hh24:mi:ss'))
Optimization using the Bloom
Filter tries to re-establish the lost information cased by usage of surrogate
keys. See [4] for further details.
The analogous select using
natural key leads to straightforward execution plan.
select t.time_value,dim1_id,measure1
from tab_fact2 f, time_dim2 t
where t.time_value between to_date('11.05.2010','dd.mm.yyyy')
and to_date('10.05.2011','dd.mm.yyyy') and
t.time_value = f.time_value;
--------------------------------------------------------------------------------------------------------
| Id |
Operation | Name | Rows | Bytes | Cost (%CPU)| Time
| Pstart| Pstop |
--------------------------------------------------------------------------------------------------------
| 0 | SELECT
STATEMENT | | | | 15 (100)| | | |
| 1 | NESTED LOOPS |
| 1 | 22 |
15 (7)| 00:00:01 | |
|
| 2 | PARTITION RANGE SINGLE| | 9999 | 136K| 14
(0)| 00:00:01 | 1 | 1 |
|* 3 | TABLE ACCESS FULL | TAB_FACT2 | 9999 |
136K| 14 (0)| 00:00:01 | 1 | 1 |
|* 4 | INDEX UNIQUE SCAN | SYS_C0047898 | 1
| 8 | 0 (0)| | | |
--------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
3 -
filter(("F"."TIME_VALUE">=TO_DATE(' 2010-05-11
00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
"F"."TIME_VALUE"<=TO_DATE(' 2011-05-10 00:00:00',
'syyyy-mm-dd hh24:mi:ss')))
4 -
access("T"."TIME_VALUE"="F"."TIME_VALUE")
filter(("T"."TIME_VALUE">=TO_DATE(' 2010-05-11
00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
"T"."TIME_VALUE"<=TO_DATE(' 2011-05-10 00:00:00',
'syyyy-mm-dd hh24:mi:ss')))
The partition pruning is done
as expected, even in the pre 11 Oracle releases.
The estimation of the
cardinality of the partition is exact, unfortunately the join cardinality is
wrong - should be 9999 as well.
If we use a direct query
constraining the fact table with interval range we see even that problem
disappears:
select f.time_value,dim1_id,measure1
from tab_fact2 f
where f.time_value between
to_date('11.05.2010','dd.mm.yyyy') and to_date('10.05.2011','dd.mm.yyyy');
----------------------------------------------------------------------------------------------------
| Id |
Operation | Name | Rows
| Bytes | Cost (%CPU)| Time |
Pstart| Pstop |
----------------------------------------------------------------------------------------------------
| 0 | SELECT
STATEMENT | | | | 14 (100)| | | |
| 1 | PARTITION RANGE SINGLE| |
9999 | 136K| 14
(0)| 00:00:01 | 1 | 1 |
|* 2 | TABLE ACCESS FULL | TAB_FACT2 | 9999
| 136K| 14 (0)| 00:00:01
| 1 | 1 |
----------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 -
filter(("F"."TIME_VALUE">=TO_DATE(' 2010-05-11
00:00:00', 'syyyy-mm-dd hh24:mi:ss')
AND "F"."TIME_VALUE"<=TO_DATE(' 2011-05-10
00:00:00', 'syyyy-mm-dd hh24:mi:ss')))
In a pre-11g Oracle Version a
join with a surrogated time dimension for a time interval leads to an
unacceptable execution plan accessing the whole fact table. The natural key
based dimension design enforces the correct partition pruning of the fact table.
The second important
consideration while designing the time dimension is if you reserve
"holes" in the surrogate key schema of the time dimension table to be
able to insert "new appearing" days in the future. Remember the
ordering assumption discussed above, so if your surrogate key schema is dense
you'll get problems to insert a "new day" in the correct order.
Anyway in most cases the
answer is no - the next Gregorian Reform will come inevitable, but not in the
very near future.
Accepting the dense surrogate
key schema is the second step towards the smart surrogate keys.
Putting the two assumptions
together we get a 1:1 mapping between the surrogate and natural key.
One example of this mapping
is illustrated in the select below:
select rownum time_id, to_date('01.01.1900','dd.mm.yyyy') +rownum -1 time_value
from dual connect by level < 365.25 * 200;
TIME_ID
TIME_VALUE
---------------------- -------------------------
1
01.01.1900
2
02.01.1900
3
03.01.1900
4
04.01.1900
. .
. .
From the mathematical point
of view structures that can be mapped 1:1 are isomorphic and therefore
exchangeable.
There is no difference
between the natural and surrogate key.
Should the time (calendar)
dimension be considered in the design of a data warehouse?
Sure, the important parts of it
are the non-trivial attributes such as holydays.
The time dimension is based
on the natural key (DATE format).
Motivation - if you can
choose between two identical (isomorphic) things choose the one that better
fits the requirements and better align with the current state of the
technology.
Call the natural key a
"smart surrogate key" to avoid idle design discussions.
By the way, even the Kimball
Group made a move in this design topic towards this pragmatic solution. Read
between the lines in
[1], [2] and [3].
References
[1] http://www.kimballgroup.com/html/designtipsPDF/DesignTips2000/KimballDT5SurrogateKeys.pdf
[2] http://www.kimballgroup.com/html/designtipsPDF/DesignTips2004/KimballDT51LatestThinking.pdf
[4] http://www.trivadis.com/uploads/tx_cabagdownloadarea/OOW_Partitioning_Your_DWH.pdf
Last revision 15.5.2011
Jaromir D.B. Nemec is a freelancer specializing in data warehouse and integration solutions. He can be reached at http://www.db-nemec.com
Apendix A – Setup of Surrogate Key Dimension
-- create time
dimesion (day granularity) wuth surrogate key
---
set echo on
create table time_dim
(time_id number not
null,
time_value date not null /* smart time value
*/
);
alter table time_dim
add primary key (time_id) rely;
create index
time_dim_idx1 on time_dim(time_value);
--
insert into time_dim
select rownum,
to_date('01.01.1900','dd.mm.yyyy') +rownum -1
from dual connect by
level < 365.25 * 200;
commit;
--
-- fact table
create table tab_fact
(
time_id number not
null,
dim1_id number not
null,
dim2_id number not
null,
measure1 number,
measure2 number
)
partition by range
(time_id)
(
partition p20110510
values less than (40673),
partition p20110511
values less than (40674),
partition p20110512
values less than (40675),
partition p20110513
values less than (40676),
partition p20110514
values less than (40677),
partition pmax values
less than (MAXVALUE)
);
alter table tab_fact
add constraint
fk_time_dim
foreign key (time_id)
referencing time_dim (time_id) rely disable
novalidate;
-- fill 3 partitions
insert into tab_fact
select
40672, -- 10.05.2011
mod(rownum,2),
mod(rownum,7),
100,
200
from dual connect by
level < 10000
union all
select
40673, -- 11.05.2011
mod(rownum,2),
mod(rownum,7),
100,
200
from dual connect by
level < 20000
union all
select
40674, -- 12.05.2011
mod(rownum,2),
mod(rownum,7),
100,
200
from dual connect by
level < 30000
;
commit;
--
exec
dbms_stats.gather_table_stats(ownname=>user, tabname=>'time_dim');
exec
dbms_stats.gather_table_stats(ownname=>user,
tabname=>'tab_fact',granularity=>'all',cascade=>TRUE);
Setup of Time Dimension with Natural Key
-- create time
dimesion (day granularity) wuth surrogate key
---
create table
time_dim2
(
time_value date not null /* smart time value
*/
);
alter table time_dim2
add primary key (time_value) rely;
--
insert into time_dim2
select to_date('01.01.1900','dd.mm.yyyy') +rownum
-1
from dual connect by
level < 365.25 * 200;
--
commit;
--
-- fact table
create table
tab_fact2 (
time_value date not
null,
dim1_id number not
null,
dim2_id number not
null,
measure1 number,
measure2 number
)
partition by range
(time_value)
(
partition p20110510
values less than (to_date('11.05.2011','dd.mm.yyyy')),
partition p20110511
values less than (to_date('12.05.2011','dd.mm.yyyy')),
partition p20110512
values less than (to_date('13.05.2011','dd.mm.yyyy')),
partition p20110513
values less than (to_date('14.05.2011','dd.mm.yyyy')),
partition p20110514
values less than (to_date('15.05.2011','dd.mm.yyyy')),
partition pmax values
less than (MAXVALUE)
);
alter table tab_fact2
add constraint
fk_time_dim2
foreign key
(time_value) referencing time_dim2 (time_value) rely disable novalidate;
-- fill 3 partitions
insert into tab_fact2
select
to_date('10.05.2011','dd.mm.yyyy'),
mod(rownum,2),
mod(rownum,7),
100,
200
from dual connect by
level < 10000
union all
select
to_date('11.05.2011','dd.mm.yyyy'),
mod(rownum,2),
mod(rownum,7),
100,
200
from dual connect by
level < 20000
union all
select
to_date('12.05.2011','dd.mm.yyyy'),
mod(rownum,2),
mod(rownum,7),
100,
200
from dual connect by
level < 30000
;
commit;
--
exec
dbms_stats.gather_table_stats(ownname=>user, tabname=>'time_dim2');
exec
dbms_stats.gather_table_stats(ownname=>user,
tabname=>'tab_fact2',granularity=>'all',cascade=>TRUE);