In this article, I'm going to demonstrate integrating heterogeneous systems to build a BI solution, with a focus on DELTA load issues and how to overcome them. Specifically, I will show how to compare the source and target tables using the SSIS ETL tool when there is no reliable way to identify changes in the source table.
SAP S/4HANA is an Enterprise Resource Planning (ERP) software package meant to cover all day-to-day processes of an enterprise (e.g., order-to-cash, procure-to-pay, finance and controlling, and request-to-service) along with core capabilities.
SAP HANA is a column-oriented, in-memory relational database that combines OLAP and OLTP operations into a single system.
SAP Landscape Transformation (SLT) Replication is a trigger-based data replication method in the HANA system. It is well suited for real-time or schedule-based replication from SAP and non-SAP sources.
Azure SQL Database is a fully managed platform-as-a-service (PaaS) database engine that handles most database management functions, including backups, patching, upgrading, and monitoring, with minimal user involvement.
SQL Server Integration Services (SSIS) is a platform for building enterprise-level data integration and data transformation solutions. It is used to establish ETL pipelines and solve complex business problems by copying or downloading files, loading data warehouses, and cleansing and mining data.
Power BI is an interactive data visualization software product developed by Microsoft with a primary focus on business intelligence.
Let me first describe the business requirement. We receive Point-of-Sale (POS) data from more than 20 different online retailers, such as Target, Walmart, Amazon, Macy's, Kohl's, and JC Penney. Apart from this, the major business transactions happen in SAP S/4HANA, and business users require BI reports for analysis purposes.
There are multiple source systems, SAP S/4HANA (OLTP) and flat files; the target system is Azure SQL (OLAP); the integration/ETL tools used are SAP SLT and MS SSIS; and the reporting tool is Power BI.
During this end-to-end process (from source data to reporting), several problems arose because of the heterogeneous systems. The main hurdle was duplicate records in the target system and in the Power BI reports, which prevented users from making the right decisions at the right time.
The main reason is that there is no direct DELTA capture mechanism for the data coming from SAP S/4HANA: not all tables have a timestamp, and even where one exists, hard deletions happen in SAP. Up to SQL On-Prem (in the diagram above), the data remains correct, because SAP SLT takes care of hard deletions in the source and adjusts the data in SQL On-Prem accordingly. In the outbound flow to Azure SQL, however, the tool is SSIS, and it will not adjust the data automatically.
After some research, I found that there are four methods to capture source data changes:
1. Timestamp-based changes
2. Database Trigger based
3. Compare the source and target table and find the changes
4. Database log based
In this article, I'm going to talk about the third method.
What are the reasons to select this approach?
1. There is no timestamp in the source table, nor any other way to identify changes in it
2. Due to limitations on the source side, we are unable to configure database triggers to capture the changes
Benefits of this approach
1. Minimal dependency on the DBMS
2. Reduced loading time, i.e., delta loads instead of full loads
3. Improved user confidence when analyzing data in the reports
4. Easy maintenance
The main challenge of this approach is that it is difficult to use with very large tables, since every run reads and compares both tables in full.
For example, consider the source table below, for which there is no proper method to identify changes. The following sections explain, step by step, how to implement the source and target table comparison.
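For illustration, assume the source is a simple customer table like the following; the table and column names here are hypothetical, since the actual structure is only shown in the image:

```sql
-- Hypothetical example source table (names assumed for illustration)
CREATE TABLE dbo.CustomerDetails (
    CustomerID INT NOT NULL PRIMARY KEY,  -- unique key, used later for ordering
    FirstName  NVARCHAR(50),
    LastName   NVARCHAR(50),
    City       NVARCHAR(50)
);
```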
As a first step, we need to create an SSIS project and a package. Then I will explain how to configure the source connection.
OLE DB Source for the source table: This brings the source table data into the SSIS data flow; the required configuration is explained below. In the SQL command, we select all required columns and use the HASHBYTES function to compute a hash value for each record. The dataset must be ordered by a UNIQUE or PRIMARY KEY; in this example, I used the PRIMARY KEY. The NOLOCK hint in the SQL command avoids taking shared locks on the table (at the cost of possible dirty reads).
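As a sketch, the source SQL command could look like the following, assuming the hypothetical dbo.CustomerDetails table described above and SQL Server 2017+ for CONCAT_WS; the exact column list depends on your table:

```sql
SELECT CustomerID,
       FirstName,
       LastName,
       City,
       -- One hash per record over the non-key columns; a change in any
       -- of these columns produces a different hash value
       HASHBYTES('SHA2_256',
                 CONCAT_WS('|', FirstName, LastName, City)) AS HashValue
FROM dbo.CustomerDetails WITH (NOLOCK)  -- no shared locks; dirty reads possible
ORDER BY CustomerID;                    -- Merge Join requires sorted input
```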
Then the OLE DB Source output needs to be marked as sorted, with the sort order configured per column, in the Advanced Editor window of the component. The following images explain the required changes.
OLE DB Source for the target table: Here we follow the same steps as for the source table. The only difference is that we do not need to select all columns; only the unique key and the hash value are required in the SQL command.
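The target-side command can then return just the key and the hash, computed the same way as on the source side (again assuming the hypothetical column names):

```sql
SELECT CustomerID,
       -- Must use the same algorithm, column list, and separator
       -- as the source query, or every row will look changed
       HASHBYTES('SHA2_256',
                 CONCAT_WS('|', FirstName, LastName, City)) AS HashValue
FROM dbo.Dim_CustomerDetails WITH (NOLOCK)
ORDER BY CustomerID;  -- same sort key as the source query
```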
MERGE JOIN: The source and target components are then joined using the Merge Join transformation; the required configuration is explained in the following images. SSIS matches the join keys automatically based on the column order configured in the OLE DB Source components. Select FULL OUTER JOIN as the join type, then select all source and target columns.
CONDITIONAL SPLIT: We use a Conditional Split transformation to separate the input dataset into new records, updated records, and deleted records, using the custom conditions shown in the following image.
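The Merge Join plus Conditional Split combination implements the same logic as the set-based query below. This is only an illustration of the three conditions under the assumed column names, not something you would run inside the package:

```sql
-- src = source query output, tgt = target query output
SELECT COALESCE(s.CustomerID, t.CustomerID) AS CustomerID,
       CASE
           WHEN t.CustomerID IS NULL       THEN 'NEW'      -- only in source
           WHEN s.CustomerID IS NULL       THEN 'DELETED'  -- only in target
           WHEN s.HashValue <> t.HashValue THEN 'UPDATED'  -- hash mismatch
       END AS ChangeType
FROM src s
FULL OUTER JOIN tgt t
    ON s.CustomerID = t.CustomerID
WHERE t.CustomerID IS NULL
   OR s.CustomerID IS NULL
   OR s.HashValue <> t.HashValue;
```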
OLE DB Destination tasks for the destination tables: Now we need to configure three OLE DB Destination components, one each for new, updated, and deleted records. The following three tables are used to store these datasets:
- Dim_CustomerDetails (Final Table): used for the new records.
- Dim_CustomerDetails_Updates: used for keeping updated data until merged with the final table.
- Dim_CustomerDetails_Deletes: used for keeping deleted records until the corresponding records are deleted from the final table.
The required steps are explained in the following images.
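The two staging tables can mirror the final table's structure; here is a minimal sketch, assuming the hypothetical columns used throughout this article:

```sql
-- Staging table for updated records (truncated after each merge)
CREATE TABLE dbo.Dim_CustomerDetails_Updates (
    CustomerID INT NOT NULL,
    FirstName  NVARCHAR(50),
    LastName   NVARCHAR(50),
    City       NVARCHAR(50)
);

-- Staging table for deleted records; strictly, only the key is needed
CREATE TABLE dbo.Dim_CustomerDetails_Deletes (
    CustomerID INT NOT NULL
);
```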
We can now execute the Data Flow Task and validate the steps completed so far, observing whether they work correctly. This is the initial execution, so all source data is treated as new records and inserted into the destination table, because the destination table was empty.
I'm going to make the following changes to the source using SQL commands:
- Insert a new record
- Update the existing record
- Delete a record from the table
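As hypothetical examples (assuming the dbo.CustomerDetails source table sketched earlier), the three changes could be made like this:

```sql
-- 1. Insert a new record
INSERT INTO dbo.CustomerDetails (CustomerID, FirstName, LastName, City)
VALUES (101, 'Jane', 'Doe', 'Chicago');

-- 2. Update an existing record
UPDATE dbo.CustomerDetails
   SET City = 'Dallas'
 WHERE CustomerID = 5;

-- 3. Delete a record from the table
DELETE FROM dbo.CustomerDetails
 WHERE CustomerID = 7;
```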
You can see the difference between the source and destination tables.
The Data Flow Task is executed again. You can then observe that all the changes were inserted into the relevant tables described earlier in the document.
Now, we need to create a stored procedure that merges the changes into the final table and then truncates the staging tables. We can then add an Execute SQL Task after the Data Flow Task to run the procedure.
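A minimal sketch of such a procedure, assuming the hypothetical table and column names used earlier (the real procedure must list your actual columns):

```sql
CREATE OR ALTER PROCEDURE dbo.usp_Merge_CustomerDetails
AS
BEGIN
    SET NOCOUNT ON;

    -- Apply updates from the staging table to the final table
    UPDATE f
       SET f.FirstName = u.FirstName,
           f.LastName  = u.LastName,
           f.City      = u.City
      FROM dbo.Dim_CustomerDetails f
      JOIN dbo.Dim_CustomerDetails_Updates u
        ON f.CustomerID = u.CustomerID;

    -- Remove rows that were hard-deleted in the source
    DELETE f
      FROM dbo.Dim_CustomerDetails f
      JOIN dbo.Dim_CustomerDetails_Deletes d
        ON f.CustomerID = d.CustomerID;

    -- Clear the staging tables for the next run
    TRUNCATE TABLE dbo.Dim_CustomerDetails_Updates;
    TRUNCATE TABLE dbo.Dim_CustomerDetails_Deletes;
END;
```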
After executing the procedure, you can observe in the following image that all the changes have been merged into the final table.
When we work with heterogeneous systems like SAP S/4HANA, Azure SQL, and ETL tools, it is always good to check for this kind of scenario to avoid duplicate data in the database and reports. This approach helped a lot without any additional cost or complex configuration (such as change logs), and it is also easy to use and maintain.
I hope this article is helpful in designing efficient ETL systems/pipelines and will save you a lot of time, money, and effort.