Why Data Warehouse?
Datawarehouse is important in modern technology world for several reasons as below:
- To ensure the consistency and quality of data.
- To ensure the governance of data for a large enterprises.
- To analyze data from different sources(homo and hetero)
- To separate data from transactional to analytical model
- To improve the speed and efficiency of accessing data
- To help in arriving decision making system
Transactional Versus Reporting
|Transactional Database Design||Reporting Database Design (DW)|
|Performance is more about Data manipulation statement||Performance is about retrieving Data|
|Locking/Blocking must be minimal||Locking is not a concern|
|Small set of Data retrieval||Large set of Data retrieval|
|Normalization of data||De Normalized form of data|
“Dimensional Modelling is a design technique for databases intended to support end-user queries in a data warehouse.”
— Ralph Kimball
Dimensional modelling features as best to answer the questions the business would like to understand; not on the operational system. For eg. business might want to get the sales order for last year or growth over a period of times etc. Dimensional modelling ensures an easy and efficient way of data retrieval.
Dimensional Modelling primarily consists of two types of entities: Dimension and Fact.
A measurable matrix that is normally used to calculate the number of sales, quantity or sales order etc. It contains the Primary key of Dimension associated with. A Fact can contain facts and pointer to dimension.
A descriptive information of a measurement in Fact. There will be more number of Dimension tables than Fact. A dimension can contain many dimension attributes.
How do we understand the Dimension & Fact
One of the widely appreciated way is based on 7Ws of Data Warehouse design on a data story. The question words are:
How, What, When , Who, Where and Why – determines Dimensions.
How many/much – determines Fact.
A pictorial representation of star schema as below. Few points to be noted are:
- Fact is pointing to Dimension.
- Dimensions are descriptive and denormalized form, not pointing to another dimension.
- Dimension Keys are normally surrogate keys, not generated by source, but load process. This is to ensure the change in the source is not going to impact the data warehouse system.
- Dimension attributes determines the granularity called grains of Facts.
- Fact is pointing to a Dimension.
- Dimension can point to another Dimension as shown in the below picture.
- Good for Transaction Database design, however not so efficient one for Data warehouse (unless its proven).
- It can incur performance issues due to several join requirement to get the required data that business asked in reporting purposes.
Bus matrix is a key item in the requirement phase of Data warehouse systems. It provides information and relation of dimensions and subject areas in a matrix form. The matrix is generated from various sources of truth by asking questions like user interviews/surveys etc. It provides a clear picture of shared/common Dimensions used by various subject areas and its dependencies easily. This is important for any DW project to keep this ready and up-to-date.
The below picture is a representation of bus matrix on AdventureWorks DW prepared in an Excel.
Changing the Dimension values & its significance
Occasionally there could be some changes happen even for Dimension, though its rare. So, its important to understand how the system should behave to the modification and these are called slowly changing dimensions.
3 Different Types of SCD (Slowly Changing Dimensions)
- Type 1
This is the most simplest form of SCD. Its is easier to implement. In principle, it just updates the current value with new value. The disadvantage is that we will lose the ability of understanding the previous data, history data, as the update is in-place.
- Type 2
This type tracks the historical data as a snapshot. For eg: This can be implemented with versioning concepts with start date and end date, hence at any point of times, the reports can roll up the data accordingly based on the versioning. The best way to get historical information, however, more complex to be implemented.
- Type 3
This type is an easier one than Type2, while keeping the historical data. As an example, keeping the old and new values by adding an extra column for the changed dimension value.
If you enjoyed this blog post, feel free to share it with your friends!