In part 1 of this post, I introduced the idea of map tables. These tables serve as an abstraction between the source systems and the entities in the data warehouse. In this post, I will describe how you can use the maps to correct for source key errors.
Using the taxonomy of key pathologies described earlier, I will walk you through some examples of map usage.
Some conventions I will use in the populated tables illustrated in this blog:
- Source Keys are in italics
- Entity Keys (in the Entities schema) are coloured
- Keys and column values that are irrelevant to the argument have been grayed out so you can't read them
- Values that receive UPDATE statements are marked in green highlight.
In order of the taxonomy, let us start with the composite key.
Fixing Composite Keys
Recall that we use the Entities schema to hold the “master key” for every entity and that we don’t rely on any source to be a master. The map table is a way to transform any key from a source to that master key.
For example, let us assume our sales system supplies Promotion keys as a composite of SKU and PromotionDate, because some sneaky programmer couldn't be bothered to create a new key for a promotion and instead argued that a single SKU is only ever on promotion on one date at a time (this, as it happens, is a good example of someone arguing for doing the wrong thing).
In this case, we would have a map like this:
```sql
CREATE TABLE Maps.Promotion_Sales (
    Map_ID INT IDENTITY(1,1) NOT NULL
  , SKU INT NOT NULL
  , PromotionDate DATETIME NOT NULL
  , ValidFrom DATETIME NOT NULL
  , ValidTo DATETIME NOT NULL
  , IsCurrent BIT NOT NULL
  , IsDeleted BIT NOT NULL
  , EK_Promotion INT NOT NULL
)

/* Index to enforce source uniqueness */
ALTER TABLE Maps.Promotion_Sales
ADD CONSTRAINT UK_SKU_PromotionDate
UNIQUE CLUSTERED (SKU, PromotionDate, ValidFrom)

/* Add standard indexes here as per other blog entry */
```
And our Dimension/Entity table will be:
```sql
CREATE TABLE Dim.Promotion (
    EK_Promotion INT NOT NULL PRIMARY KEY
  , Name VARCHAR(50) NOT NULL DEFAULT ('')
  , CreatedDate DATETIME NOT NULL DEFAULT (GETDATE())
)
```
And no, we will NOT store the SKU or PromotionDate in this table: they are either redundant (the SKU is already in the sales data) or we don't want to maintain the muddle-headed design of the source in the warehouse.
Here is an example population of the tables if we receive the first promotion data on 2011-01-01:
Dim.Promotion
| EK_Promotion | Name             | CreatedDate |
|--------------|------------------|-------------|
| 1            | Big Week Sale    | 2011-01-01  |
| 2            | Good Deals Today | 2011-01-01  |
Maps.Promotion_Sales
| Map_ID | SKU | PromotionDate | ValidFrom  | ValidTo    | IsCurrent | IsDeleted | EK_Promotion |
|--------|-----|---------------|------------|------------|-----------|-----------|--------------|
| 5      | 43  | 2011-01-01    | 2011-01-01 | 2011-01-07 | true      | false     | 1            |
| 6      | 43  | 2011-01-01    | 2011-01-01 | 2011-01-02 | true      | false     | 2            |
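To see how the map is used downstream, here is a minimal sketch of how a fact load could resolve the composite source key to the entity key. The Staging.Sales table and its column names are my assumptions for illustration, not part of the original design:

```sql
/* Resolve the composite source key (SKU, PromotionDate) to EK_Promotion.
   Staging.Sales and its columns are assumed here for illustration only. */
SELECT s.SKU
     , s.TransactionDate
     , m.EK_Promotion
FROM Staging.Sales AS s
JOIN Maps.Promotion_Sales AS m
    ON  m.SKU = s.SKU
    AND m.PromotionDate = s.PromotionDate
    AND s.TransactionDate >= m.ValidFrom
    AND s.TransactionDate <  m.ValidTo
```

The fact table then stores only EK_Promotion, never the composite source key.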
Apart from making the fact tables linked to this entity smaller, we can now work around our poorly arguing programmer when the business wants to have more than one SKU on the same promotion. Let us say that SKU = 87 is now also on the "Good Deals Today" promotion. We can now update the Maps, without wreaking havoc with the Entities:
Maps.Promotion_Sales (updated to reflect the new reality)
| Map_ID | SKU | PromotionDate | ValidFrom  | ValidTo    | IsCurrent | IsDeleted | EK_Promotion |
|--------|-----|---------------|------------|------------|-----------|-----------|--------------|
| 5      | 43  | 2011-01-01    | 2011-01-01 | 2011-01-07 | true      | false     | 1            |
| 6      | 43  | 2011-01-01    | 2011-01-01 | 2011-01-02 | true      | false     | 2            |
| 7      | 87  | 2011-01-01    | 2011-01-01 | 2011-01-02 | true      | false     | 2            |
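Expressed as SQL, the fix is a single insert into the map; a sketch matching the new row above:

```sql
/* Put SKU 87 on the existing "Good Deals Today" promotion (EK_Promotion = 2).
   Only the map changes - Dim.Promotion is left untouched. */
INSERT INTO Maps.Promotion_Sales
    (SKU, PromotionDate, ValidFrom, ValidTo, IsCurrent, IsDeleted, EK_Promotion)
VALUES
    (87, '2011-01-01', '2011-01-01', '2011-01-02', 1, 0, 2)
```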
Fixing Zombie Keys
Assume our inventory system has zombie keys. Recall that they use EAN as the key over there. At the initial load of our warehouse, let us say on 2010-06-01, we receive this staging table from them:
Staging.Products_Inventory (on 2010-06-01)
| EAN            | Weight |
|----------------|--------|
| 978-0465026562 | 1.0    |
| 978-0201485417 | 2.2    |
This will cause the following population of the map:
Maps.Products_Inventory (on 2010-06-01)
| Map_ID | EAN            | ValidFrom  | ValidTo    | IsCurrent | IsDeleted | EK_Product |
|--------|----------------|------------|------------|-----------|-----------|------------|
| 51     | 978-0465026562 | 2010-06-01 | 9999-01-01 | true      | false     | 1          |
| 52     | 978-0201485417 | 2010-06-01 | 9999-01-01 | true      | false     | 2          |
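A sketch of how this initial population could be automated. How EK_Product values are allocated depends on your Entities schema; here I assume a hypothetical #NewKeys temp table already holds the (EAN, EK_Product) pairs for entities created during this load:

```sql
/* Add map rows for EANs not seen before, with an open-ended interval.
   #NewKeys (EAN, EK_Product) is a hypothetical intermediate step. */
INSERT INTO Maps.Products_Inventory
    (EAN, ValidFrom, ValidTo, IsCurrent, IsDeleted, EK_Product)
SELECT n.EAN, '2010-06-01', '9999-01-01', 1, 0, n.EK_Product
FROM #NewKeys AS n
WHERE NOT EXISTS
    (SELECT * FROM Maps.Products_Inventory AS m WHERE m.EAN = n.EAN)
```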
Now, let us assume that EAN = 978-0465026562 goes out of stock on 2010-08-01 and that the inventory system cleans up the record. We will now receive this staged data:
Staging.Products_Inventory (on 2010-08-01)
| EAN            | Weight |
|----------------|--------|
| 978-0201485417 | 2.2    |
We must conclude that the key is dead, and we update the map to reflect this.
Maps.Products_Inventory (on 2010-08-01)
| Map_ID | EAN            | ValidFrom  | ValidTo    | IsCurrent | IsDeleted | EK_Product |
|--------|----------------|------------|------------|-----------|-----------|------------|
| 51     | 978-0465026562 | 2010-06-01 | 2010-08-01 | false     | false     | 1          |
| 52     | 978-0201485417 | 2010-06-01 | 9999-01-01 | true      | false     | 2          |
Let us assume we have agreed with the source that it exhibits zombie key behaviour. Because of this, we leave IsDeleted = false: we know that EK_Product = 1 is supposed to be dead, but it may return from the grave.
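A minimal sketch of this close-out step, assuming the staged snapshot is complete (any current EAN missing from Staging.Products_Inventory has disappeared from the source):

```sql
/* Close the validity interval for keys that vanished from staging.
   IsDeleted stays 0: for this zombie source, the key may rise again. */
UPDATE m
SET   m.ValidTo = '2010-08-01'
    , m.IsCurrent = 0
FROM Maps.Products_Inventory AS m
WHERE m.IsCurrent = 1
  AND NOT EXISTS
      (SELECT * FROM Staging.Products_Inventory AS s WHERE s.EAN = m.EAN)
```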
The supplier for EAN = 978-0465026562 delivers us new stock three months later. On 2010-11-01 we again observe the EAN in the staging tables, and we reflect this in the map like this:
Maps.Products_Inventory (on 2010-11-01)
| Map_ID | EAN            | ValidFrom  | ValidTo    | IsCurrent | IsDeleted | EK_Product |
|--------|----------------|------------|------------|-----------|-----------|------------|
| 51     | 978-0465026562 | 2010-06-01 | 2010-08-01 | false     | false     | 1          |
| 52     | 978-0201485417 | 2010-06-01 | 9999-01-01 | true      | false     | 2          |
| 53     | 978-0465026562 | 2010-11-01 | 9999-01-01 | true      | false     | 1          |
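The resurrection itself is a plain insert of a new row with a fresh interval; a sketch matching Map_ID = 53 above:

```sql
/* The EAN returns from the grave: a NEW row with a new interval is added.
   Map_ID = 51 keeps recording the period in which the key was dead. */
INSERT INTO Maps.Products_Inventory
    (EAN, ValidFrom, ValidTo, IsCurrent, IsDeleted, EK_Product)
VALUES
    ('978-0465026562', '2010-11-01', '9999-01-01', 1, 0, 1)
```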
At this point you may ask: why didn't we just update the ValidTo date of Map_ID = 51 and set it to 2010-11-01, instead of adding a new row? The reasons we track the "dead state" of the key are twofold:
- We want to be able to detect wrong data from the inventory system. If a fact arrives when the key should be dead, we can flag this (see the sketch after this list).
- We want to be able to handle any dolly keys arriving between 2010-08-01 and 2010-11-01.
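For the first point, here is a sketch of such a data quality check, assuming a hypothetical Staging.Inventory_Facts table with EAN and FactDate columns:

```sql
/* Flag facts that reference an EAN during its dead interval: the key is
   known to the map, but no row is valid at the fact's date.
   Staging.Inventory_Facts and its columns are assumed for illustration. */
SELECT f.EAN, f.FactDate
FROM Staging.Inventory_Facts AS f
WHERE EXISTS
    (SELECT * FROM Maps.Products_Inventory AS m WHERE m.EAN = f.EAN)
  AND NOT EXISTS
    (SELECT * FROM Maps.Products_Inventory AS m
     WHERE m.EAN = f.EAN
       AND f.FactDate >= m.ValidFrom
       AND f.FactDate <  m.ValidTo)
```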
Recall that we have a unique index in place that enforces the rule: "a source key cannot have TWO different meanings at the same point in time". Let us see this rule in action…
Fixing Dolly Keys
After our warehouse goes into production, the guys down in inventory find an error in the reports. After some investigation, it is discovered that on 2010-10-13 a new guy unpacked a crate containing a book called "Gödel, Escher, Bach: Ein Endloses Geflochtenes Band". The new guy, understanding German, concluded that this was most likely the book with EAN = 978-0465026562, so he created a new entry in the inventory system, temporarily making the EAN appear in the system – but pointing to the WRONG entity. The book was returned to the supplier the day after, because the inventory guys realised it was in the wrong language; after this, the batch clean process got rid of the EAN again.
However, the guys down in inventory care about returns: they want to track them and report on errors. We are faced with the problem of an EAN temporarily representing the wrong entity. Think this example is far-fetched? I have seen it several times, on keys a lot of people would personally care about.
Fortunately, our Map structure can handle this. We start by looking up the entity key of the German edition of Gödel, Escher, Bach – let us assume it is EK_Product = 42. We can now handle our Dolly key like this:
Maps.Products_Inventory (updated after 2010-11-01)
| Map_ID | EAN            | ValidFrom  | ValidTo    | IsCurrent | IsDeleted | EK_Product |
|--------|----------------|------------|------------|-----------|-----------|------------|
| 51     | 978-0465026562 | 2010-06-01 | 2010-08-01 | false     | false     | 1          |
| 52     | 978-0201485417 | 2010-06-01 | 9999-01-01 | true      | false     | 2          |
| 53     | 978-0465026562 | 2010-11-01 | 9999-01-01 | true      | false     | 1          |
| 54     | 978-0465026562 | 2010-10-13 | 2010-10-14 | false     | true      | 42         |
Notice that, unlike the zombies that are the norm for this source, we marked the Dolly key as IsDeleted = true: we don't expect this sheep to come back from the grave (zombie sheep – now THERE is an idea for a new game in the app store).
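The correction itself can be entered by hand; a sketch matching Map_ID = 54 above:

```sql
/* Retroactively record that for one day the EAN pointed at the German
   edition (EK_Product = 42). IsDeleted = 1: this sheep stays dead. */
INSERT INTO Maps.Products_Inventory
    (EAN, ValidFrom, ValidTo, IsCurrent, IsDeleted, EK_Product)
VALUES
    ('978-0465026562', '2010-10-13', '2010-10-14', 0, 1, 42)
```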
Merging Keys Together – Handling Multi Keys
To wrap up this blog entry about the usage of map tables, let me illustrate another use for them. There are cases where the source system carries two different keys for the same entity. This CAN be the result of an error (often the case with customer data), but there are situations where having two keys for an entity is a legitimate use case.
For example, our inventory system may have multiple EAN numbers for the "same" book. This happens because books can exist in multiple editions: paperback, hardcover, reprints and new editions. From the perspective of the inventory system, and often the customer, we may even care about the different editions. But perhaps this is deemed irrelevant from a business reporting perspective. If I drill into Douglas Hofstadter in my Product Dimension, I will probably WANT to see all English editions of "Gödel, Escher, Bach" as the same row.
Fortunately, the merging of source rows can be handled by maps. It WAS actually the case that “Gödel, Escher, Bach” had a paperback edition in 1999 (which I happen to own) with another EAN = 1313353195. We can reflect this in our map:
Maps.Products_Inventory (updated after 2010-11-01, old paperback added)
| Map_ID | EAN            | ValidFrom  | ValidTo    | IsCurrent | IsDeleted | EK_Product |
|--------|----------------|------------|------------|-----------|-----------|------------|
| 51     | 978-0465026562 | 2010-06-01 | 2010-08-01 | false     | false     | 1          |
| 52     | 978-0201485417 | 2010-06-01 | 9999-01-01 | true      | false     | 2          |
| 53     | 978-0465026562 | 2010-11-01 | 9999-01-01 | true      | false     | 1          |
| 54     | 978-0465026562 | 2010-10-13 | 2010-10-14 | false     | true      | 42         |
| 55     | 1313353195     | 1999-01-01 | 9999-01-01 | true      | true      | 1          |
Notice that we leave IsCurrent = true. This is a perfectly valid EAN to receive from the source system; it may simply have been lingering on the dusty shelves for a long time.
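As SQL, the merge is again just a map row pointing the second EAN at the existing entity; a sketch matching Map_ID = 55 above:

```sql
/* Map the 1999 paperback EAN onto the same entity as the current edition
   (EK_Product = 1), mirroring the flag values in the table above. */
INSERT INTO Maps.Products_Inventory
    (EAN, ValidFrom, ValidTo, IsCurrent, IsDeleted, EK_Product)
VALUES
    ('1313353195', '1999-01-01', '9999-01-01', 1, 1, 1)
```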
Summary – Transforming Source Keys to Real Keys
In this blog entry, I have shown you how you can use map tables to handle key errors in the sources. For those of you who currently use type 2 dimension history for this purpose, you may wish to consider the implications of polluting your end-user data with this type of "source noise". How would you use a type 2 dimension table when sources are replaced, or when errors occur that you don't want reflected in the end-user data?
Consider also what it means for the maintainability of the warehouse to adopt the poor key choices in the source as an integrated part of the end-user reporting data. I think I have argued that there has to be a de-coupling between the "history reporting" that end-users want, and the tracking and handling of source keys. It is my claim that the data model is simplified by not mixing these two concepts together in the same table. It is also my claim that you should always adopt the map tables, with the structures described here, as an integral part of the ETL process in the data warehouse.
As Jørgen Guldmann pointed out in a previous comment, having the structures in place for mapping does not mean you can find a reliable source of the data needed to handle the transformation. The maps allow you to change the sources to the format they should have had in the first place, even if that means manually updating the map tables or hiring someone to correct errors by adding rows to the map (as we saw with the Dolly key example). It is worth observing that if you can classify which key pathologies a source has, you can often automate the maintenance of the map, and only manually insert the exceptions you find as you iterate towards the completely cleansed Enterprise Data Warehouse.
At a point in the future, you may find the holy grail and end up with no source pathologies at all, maybe even a "master source" for all entities. The maps can serve as the requirement definition for that process. Furthermore, the maps can be used as update structures for sources that want to comply with a better key structure. Some people would call this master data management, and by proactively building maps as part of your data warehouse model, you have placed yourself in a position to lead that argument with a data-driven approach.
Just don’t be surprised that you end up BEING the master data system once you start building mapping structures. But this is a good thing, the problem does not go away just because you push it around – with maps, you have a way to quantify and encapsulate it that does not ruin your brilliant data model.
PS: The book I have used as an example happens to be one of the best books ever written on the relationship between music, philosophy, art and mathematics. I really cannot recommend it enough.