Referential Integrity – Buckets and Dummies


We have seen how to disconnect dimension and fact object processing to increase performance

With this strategy the question of referential integrity arises.



Referential integrity can be impacted by a number of factors not least – late arriving reference data for dimensions – incorrect data in facts or dimensions resulting in an invalid key.



Let us consider the dimension object account address keyed on account number


Account no

Address

Startpoint

Natural_key

Generation

1

1 Acacia Avenue

01/01/2000

c4ca4238a0b923820dcc509a6f75849b

1

2

7 Block Lane

01/01/2000

c81e728d9d4c2f636f067f89cc14862c

1

1

101 High Street

02/02/2010

c4ca4238a0b923820dcc509a6f75849b

2



Presented as

Account no

Address

Startpoint

Natural_key

Generation

2

7 Block Lane

01/01/2000

c81e728d9d4c2f636f067f89cc14862c

1

1

101 High Street

02/02/2010

c4ca4238a0b923820dcc509a6f75849b

2



And a fact object of

Account no

invoice

Tax point

Value

Value to date

1

1

01/02/2000

100

100

2

1

01/02/2000

20

20

3

1

02/03/2010

30

30



When the fact is processed the resultant object of

 

Account no Hkey

invoice

Tax point

Value

Value to date

c81e728d9d4c2f636f067f89cc14862c

1

01/02/2000

100

100

c4ca4238a0b923820dcc509a6f75849b

1

01/02/2000

20

20

eccbc87e4b5ce2fe28308fd9f2a7baf3

1

02/03/2010

30

30

 

We now have a referential integrity issue as the account 3 has no instance in the dimension object so joining the two objects would exclude account 3 from the results.

 

Buckets

To eliminate referential integrity issues we use buckets

Buckets are simply objects that contain the natural key elements and the hash key as generated by the fact for a specific dimension object as well as the date or timestamp of the fact instance.

In this simple case the account address bucket object would take the form of:



Account no Hkey

Account No

Start date

c81e728d9d4c2f636f067f89cc14862c

1

01/02/2000

c4ca4238a0b923820dcc509a6f75849b

2

01/02/2000

eccbc87e4b5ce2fe28308fd9f2a7baf3

3

02/03/2010



As each fact is processed the bucket is populated with all the keys and natural key elements

So if there are a number of facts processed that are related to this dimension then we may end up after they are all completed with an object containing elements of the following:

Account no Hkey

Account No

Start date

c81e728d9d4c2f636f067f89cc14862c

1

01/02/2000

c4ca4238a0b923820dcc509a6f75849b

2

01/02/2000

eccbc87e4b5ce2fe28308fd9f2a7baf3

3

02/03/2010

c81e728d9d4c2f636f067f89cc14862c

1

01/02/2000

c4ca4238a0b923820dcc509a6f75849b

2

01/02/2000

eccbc87e4b5ce2fe28308fd9f2a7baf3

3

02/04/2011



Dummies



Using the bucket we take a unique list of elements where the hkey does not exist in the dimension object thus accepting the earliest date or timestamp available :



Account no Hkey

Account No

Start date

eccbc87e4b5ce2fe28308fd9f2a7baf3

3

02/03/2010



We then create the missing instance into the dimension object as a dummy instance with the earliest start date (or current timestamp if not available) with a generation of 0



Account no

Address

Startpoint

DHkey

Generation

1

1 Acacia Avenue

01/01/2000

c4ca4238a0b923820dcc509a6f75849b

1

2

7 Block Lane

01/01/2000

c81e728d9d4c2f636f067f89cc14862c

1

1

101 High Street

02/02/2010

c4ca4238a0b923820dcc509a6f75849b

2

3

DUMMY

02/03/2010

eccbc87e4b5ce2fe28308fd9f2a7baf3

0

The Dimension object would thus be presented as

 

Account no

Address

Startpoint

DHkey

Generation

2

7 Block Lane

01/01/2000

c81e728d9d4c2f636f067f89cc14862c

1

1

101 High Street

02/02/2010

c4ca4238a0b923820dcc509a6f75849b

2

3

DUMMY

02/03/2010

eccbc87e4b5ce2fe28308fd9f2a7baf3

0



In this way we have eliminated the referential integrity issue and we can easily identify the exceptions by the generation number.


Late Arriving Reference

Having resolved the referential integrity issue we may or indeed should receive in the future the correct reference data

Instances derived from the dimension object build process always have a positive non zero generation applied to them.

 

So say we receive the data for account 3 to use the same example – the process would create a valid instance in the dimension object with the process timestamp as the start point


Account no

Address

Startpoint

DHkey

Generation

1

1 Acacia Avenue

01/01/2000

c4ca4238a0b923820dcc509a6f75849b

1

2

7 Block Lane

01/01/2000

c81e728d9d4c2f636f067f89cc14862c

1

1

101 High Street

02/02/2010

c4ca4238a0b923820dcc509a6f75849b

2

3

DUMMY

02/03/2010

eccbc87e4b5ce2fe28308fd9f2a7baf3

0

3

12 Acme Road

03/04/2011

eccbc87e4b5ce2fe28308fd9f2a7baf3

1



This object would then be presented out as :

 

Account no

Address

Startpoint

DHkey

Generation

2

7 Block Lane

01/01/2000

c81e728d9d4c2f636f067f89cc14862c

1

1

101 High Street

02/02/2010

c4ca4238a0b923820dcc509a6f75849b

2

3

12 Acme Road

03/04/2011

eccbc87e4b5ce2fe28308fd9f2a7baf3

1

 

Now when joining the related fact objects those elements that previously returned DUMMY address data would now return the valid address element without any reprocessing.

 

Contents   Next