Overview of ETL Architecture
In a data warehouse, one of the main parts of the entire system is the ETL process. ETL is the system that reads data from the source system, transforms the data according to the business logic, and finally loads it into the warehouse. While fetching data from the sources can seem to be an easy task, it isn't always the case. Some systems are made up of various data sources, which make the overall ETL architecture quite complex to be implemented and maintained.
One of the main challenges that I often face while working as a BI Engineer is how to design the correct architecture for the ETL process. Well, to be honest, there is nothing called the correct architecture. The best architecture actually depends on the requirements of how the other components are designed or set up. In my opinion, whenever there is a need to design an ETL process, the most important part would be to understand what are the data sources and how are they going to be connected.
In ETL architecture, there are a different number of ways in which we can design the SSIS packages. One such approach is to build the entire package in a modular way, such that we keep the dependencies for each data marts to the minimum. This is usually achieved by building separate child packages for each data marts and then integrating all these packages into a master package that will handle the overall package execution. The main advantage of building a modular architecture is that the individual child packages can be executed independently in case there are some data quality issues with one of the marts without having to execute any other packages.
For the purpose of the demonstration, I'll take an example of a simple ETL application that is going to populate a data warehouse. The warehouse has two main subject areas (Sales and Orders) that we want to populate. It is possible that there might be some dimensions that are specific to each of the marts and also shared dimensions as well. We will design the packages for each mart in such a way that the shared dimensions will be included in both the packages. However, in this case, an important thing to make sure is that there must be some logic for the dimensions which restrict them from re-populating the same data again and again if the packages are re-executed.
The structure for all the child packages will be similar which will follow the execution sequence as below:
- Extract - In this phase, we will fetch all data from each of the sources into our staging layer.
- Preparation - This is a validation phase before implementing the business logic to ensure data quality is good. In this phase, the source data is being cleaned and validated that there aren't any orphan records, which might break referential integrity.
- Load & Transform - In this phase, the data is transformed and then loaded into the physical warehouse tables.
- Process OLAP - Once the data is available in the warehouse, it is now time we can process the OLAP cubes which will be the presentation layer for the entire analytics application.
This sequence diagram explains how the entire ETL process is executed from the master package. At first, the master package is executed and it triggers the Extract phase for all the individual child packages. This will ensure that we have all the data from the source available in our staging layer before further processing. Then, the Validation is started, which is followed by the Transform & Load phase. Finally, the individual cubes are processed from the master package which refreshes the presentation layer with the latest data.
Now, if we encounter any data issues with the Sales mart, we can simply go ahead and clean up the entire Sales data from the warehouse and the staging layer. This time, instead of executing the master package, we will execute only the Sales package. This will repopulate the data into the warehouse without affecting any other data marts.
This is a very simple example of how modularity is achieved using SSIS. The child packages can be executed sequentially or in parallel which again depends upon the infrastructure and also the business requirements. Building ETL packages in a modular way help to maintain less dependency on other packages and also reduces the maintenance of the packages in the long run.