We have seen how
to disconnect dimension and fact object processing to increase
performance
With this strategy the question of referential integrity arises.
Referential integrity can be impacted by a number of factors not least – late arriving reference data for dimensions – incorrect data in facts or dimensions resulting in an invalid key.
Let us consider the dimension object account address keyed on account number
Account no |
Address |
Startpoint |
Natural_key |
Generation |
1 |
1 Acacia Avenue |
01/01/2000 |
c4ca4238a0b923820dcc509a6f75849b |
1 |
2 |
7 Block Lane |
01/01/2000 |
c81e728d9d4c2f636f067f89cc14862c |
1 |
1 |
101 High Street |
02/02/2010 |
c4ca4238a0b923820dcc509a6f75849b |
2 |
Presented as
Account no |
Address |
Startpoint |
Natural_key |
Generation |
2 |
7 Block Lane |
01/01/2000 |
c81e728d9d4c2f636f067f89cc14862c |
1 |
1 |
101 High Street |
02/02/2010 |
c4ca4238a0b923820dcc509a6f75849b |
2 |
And a fact object of
Account no |
invoice |
Tax point |
Value |
Value to date |
1 |
1 |
01/02/2000 |
100 |
100 |
2 |
1 |
01/02/2000 |
20 |
20 |
3 |
1 |
02/03/2010 |
30 |
30 |
When the fact is processed the resultant object of
Account no Hkey |
invoice |
Tax point |
Value |
Value to date |
c81e728d9d4c2f636f067f89cc14862c |
1 |
01/02/2000 |
100 |
100 |
c4ca4238a0b923820dcc509a6f75849b |
1 |
01/02/2000 |
20 |
20 |
eccbc87e4b5ce2fe28308fd9f2a7baf3 |
1 |
02/03/2010 |
30 |
30 |
We now have a referential integrity issue as the account 3 has no instance in the dimension object so joining the two objects would exclude account 3 from the results.
Buckets
To eliminate referential integrity issues we use buckets
Buckets are simply objects that contain the natural key elements and the hash key as generated by the fact for a specific dimension object as well as the date or timestamp of the fact instance.
In this simple case the account address bucket object would take the form of:
Account no Hkey |
Account No |
Start date |
c81e728d9d4c2f636f067f89cc14862c |
1 |
01/02/2000 |
c4ca4238a0b923820dcc509a6f75849b |
2 |
01/02/2000 |
eccbc87e4b5ce2fe28308fd9f2a7baf3 |
3 |
02/03/2010 |
As each fact is processed the bucket is populated with all the keys and natural key elements
So if there are a number of facts processed that are related to this dimension then we may end up after they are all completed with an object containing elements of the following:
Account no Hkey |
Account No |
Start date |
c81e728d9d4c2f636f067f89cc14862c |
1 |
01/02/2000 |
c4ca4238a0b923820dcc509a6f75849b |
2 |
01/02/2000 |
eccbc87e4b5ce2fe28308fd9f2a7baf3 |
3 |
02/03/2010 |
c81e728d9d4c2f636f067f89cc14862c |
1 |
01/02/2000 |
c4ca4238a0b923820dcc509a6f75849b |
2 |
01/02/2000 |
eccbc87e4b5ce2fe28308fd9f2a7baf3 |
3 |
02/04/2011 |
Dummies
Using the bucket we take a unique list of elements where the hkey does not exist in the dimension object thus accepting the earliest date or timestamp available :
Account no Hkey |
Account No |
Start date |
eccbc87e4b5ce2fe28308fd9f2a7baf3 |
3 |
02/03/2010 |
We then create the missing instance into the dimension object as a dummy instance with the earliest start date (or current timestamp if not available) with a generation of 0
Account no |
Address |
Startpoint |
DHkey |
Generation |
1 |
1 Acacia Avenue |
01/01/2000 |
c4ca4238a0b923820dcc509a6f75849b |
1 |
2 |
7 Block Lane |
01/01/2000 |
c81e728d9d4c2f636f067f89cc14862c |
1 |
1 |
101 High Street |
02/02/2010 |
c4ca4238a0b923820dcc509a6f75849b |
2 |
3 |
DUMMY |
02/03/2010 |
eccbc87e4b5ce2fe28308fd9f2a7baf3 |
0 |
The Dimension object would thus be presented as
Account no |
Address |
Startpoint |
DHkey |
Generation |
2 |
7 Block Lane |
01/01/2000 |
c81e728d9d4c2f636f067f89cc14862c |
1 |
1 |
101 High Street |
02/02/2010 |
c4ca4238a0b923820dcc509a6f75849b |
2 |
3 |
DUMMY |
02/03/2010 |
eccbc87e4b5ce2fe28308fd9f2a7baf3 |
0 |
In this way we have eliminated the referential integrity issue and we can easily identify the exceptions by the generation number.
Late Arriving Reference
Having resolved the referential integrity issue we may or indeed should receive in the future the correct reference data
Instances derived from the dimension object build process always have a positive non zero generation applied to them.
So say we receive the data for account 3 to use the same example – the process would create a valid instance in the dimension object with the process timestamp as the start point
Account no |
Address |
Startpoint |
DHkey |
Generation |
1 |
1 Acacia Avenue |
01/01/2000 |
c4ca4238a0b923820dcc509a6f75849b |
1 |
2 |
7 Block Lane |
01/01/2000 |
c81e728d9d4c2f636f067f89cc14862c |
1 |
1 |
101 High Street |
02/02/2010 |
c4ca4238a0b923820dcc509a6f75849b |
2 |
3 |
DUMMY |
02/03/2010 |
eccbc87e4b5ce2fe28308fd9f2a7baf3 |
0 |
3 |
12 Acme Road |
03/04/2011 |
eccbc87e4b5ce2fe28308fd9f2a7baf3 |
1 |
This object would then be presented out as :
Account no |
Address |
Startpoint |
DHkey |
Generation |
2 |
7 Block Lane |
01/01/2000 |
c81e728d9d4c2f636f067f89cc14862c |
1 |
1 |
101 High Street |
02/02/2010 |
c4ca4238a0b923820dcc509a6f75849b |
2 |
3 |
12 Acme Road |
03/04/2011 |
eccbc87e4b5ce2fe28308fd9f2a7baf3 |
1 |
Now when joining the related fact objects those elements that previously returned DUMMY address data would now return the valid address element without any reprocessing.