SQLServerCentral Article

ETL Performance Auditing - Part 1: Introduction to ETL Auditing

,

Introduction

You are a SQL Developer or BI architect, and you get news that your scheduled production ETL Job(s) that were running “perfectly” are taking “longer” and “longer” to run. The problem is there are over 300 individual child SSIS packages that stage data, load a Data Warehouse (WH), load Data Marts,  process SSAS Cubes and manage your Database and Cube Partitions.

Some of the questions that could be asked would be: What is “longer”, or what is the benchmark for that complaint? Which of the ETL phases and individual packages within the ETL phases are causing the job to take longer? Which SSIS tasks and pipeline components on the individual Dataflow tasks are causing the packages to take longer?  Or could this simply be as a result of increase in amount of data being pulled?

ETL Auditing is a broad topic that include aspect like validating and monitoring the amount of data as it is moved through various phases. In these series however, I will explain how to setup a comprehensive system to generate, capture monitor ETL execution cycle audit trail for the simplest to the most complex ETL setup you may have to enable you answer some of the questions outlined above if they come up.

This is a three part series on ETL auditing. The various articles cover:

  • Part 1 – Introduction to ETL Auditing
  • Part 2 – Auditing Simple Setups 
  • Part 3 – Auditing Complex Setups  

Complexities of ETL Architecture

ETL setups can be as simple as one SSIS package with multiple tasks running in parallel or sequentially to some of the complex setups very common especially in BI Systems. In BI parlance the term end-to-end is very prevalent, what this normally means is that, by definition, a complete BI project often employs most or the entire SQL Server suite (Engine, SSIS, SSAS and SSRS). What this further means is that it is possible to have SSIS packages moving data through various BI phases and also processing tasks on or for these servers. Typical end-to-end ETL or SSIS package setups for BI systems may include one or more of the ETL Phases shown in Figure 1 below. From a Source to a Warehouse (WH) to mart and to a cube.

Fig 1: Showing a Sample BI ETL setup

The simple ETL setup in fig 1 above can easily become complicated in enterprise BI systems. This is because in enterprise BI systems, ETL architecture may include various staging phases not outlined in the diagram above. Secondly, since such system normally involves various businesses and disparate data sources, ETL processes in each of the ETL Phases may need to be executed multiple times and monitored for the various systems either in parallel or sequentially. As a result not only would you have execution cycles for ETL phases, SSIS packages, tasks and pipeline components to capture and monitor, but depending on the latency of your BI system, in only a day you could have numerous amounts of such executions and cycles to try and piece together.  This is why in certain cases you might not have choice but to have a system that can help make sense of such complex ETL setups regardless of how you deploy your SSIS Packages.

Runtime Execution Intervals

Besides monitoring the data being moved, in ETL audit, primarily what you will want to do is to keep track of how long ETL phases, batches, SSIS Packages, SSIS Tasks and SSIS Pipeline components takes to complete during executions.

I have seen people capture these metrics for ETL Batches and Packages by adding steps to the SSIS packages and tasks to capture start- time and end-times of cycles within ETL processes. Essentially, to find Runtime Interval for batches and packages, various audit steps are incorporated in SSIS Packages and Tasks as shown in Figs 2 and 3 below to capture start-times and end-times for later evaluation.

ETL Batch Cycles

    

Figure 2: ETL setup showing audit steps to capture Runtime Batch execution Intervals

Package Execution Cycles

 Figure 3: Package Container showing audit steps to capture Package Runtime execution Interval

The problem with this approach is that it is difficult to setup across many ETL Phases especially where each phase has many packages. It is also nearly impossible to capture execution interval at SSIS Tasks and Pipeline Component level using only this approach.

The approach I am going to demonstrate will use mostly SQL scripts and SSIS log table(s) to effectively accomplish such endeavors not matter how complicated your ETL architecture is.

Gathering SSIS Package Information

There is a wealth of information that can be obtained from SSIS log tables; the problem is that data in the log tables are not normalized. The secret to unlocking this wealth of information is creating tables with information on all your SSIS packages and tasks. Armed with this information, extracting information from log tables is very easy.

SSIS creates a unique id (GUID) for every package and task at design-time. These IDs are the SourceIds logged into SSIS log tables for events enabled for that particular SSIS element. This means that to effectively establish the hierarchy of events in the log table you must create a table with combinations of SSIS Element Names and Ids to reflect their relationships as they exist in your package structure as shown in the examples in Tables 1 and 2  below.

Table1:  SSISPackage table showing a list of package Name and their IDs (Package Guid)

Tables 2:  SSISPackageTask table showing Package-Task relations.

For smaller set of packages this can be done by copying the names and ids from the Package or Task directly. To manually copy this information right click on the SSIS item, click on Properties and copy the item Name and id (GUID) as shown in fig 4 below.

Fig 4: Showing a Package ID and Name Properties

For bigger ETL setups with many of packages and tasks, you might need a system that loops through all your packages to obtain this information. For instance you can loop through the .dtsx files with XML or use C# or VB and the SISS object Model to achieve this objective, a topic I will leave for another discussion.

Enabling Logging to Capture ETL Execution Cycle.

We are going to use the out-of-box SSIS logging mechanism to capture Runtime Execution Intervals of SSIS elements (Packages, Tasks and Pipeline Components) within your ETL setups.

Many events can be enabled for logging but for these demonstrations, we are going to focus on the minimal events that will enable us to estimate runtime execution intervals for most SSIS package ETL setups.  The three events we will use to estimate execution times are the OnPreExecute, OnPostExecute and OnError Events. At least these three events must be enabled for all SSIS packages and tasks that will be part of your ETL auditing and monitoring process. The the OnPreValidate and OnPostValidate Events are fired once at the beginning and at the end respectively for each SSIS element. These two events will be complementary to the auditing steps described in figs 3 and 4 above, in other words we will use log times of OnPreValidate and OnPostValidate Events to capture Starttime and Endtimes respectively. 

As you will notice later, any other Events that is also enabled that either fires before or after these two events will also be incorporated in our estimation process. We will use the OnError Event to identify failed executions. I will address Pipeline Event logging later in the series.

Conclusion

By the the end of this Part you should have enabled logging using the SQL Server log Provider, and also gathered the Names and IDs of your packages and tasks.

In the subsequent parts of this series I am going to demonstrate how to use SQL scripts and the information above to capture all ETL execution cycles for any SSIS ETL setup you may have. The approach does not require adding various audit task steps to your SSIS Packages, it is not impacted by the number of ETL Phases and Packages within your setup and does not limit Tasks and Pipeline Component level auditing.

Stay tuned.

This is a three part series on ETL auditing. The various articles cover:

  • Part 1 – Introduction to ETL Auditing
  • Part 2 – Auditing Simple Setups 
  • Part 3 – Auditing Complex Setups  

Rate

3.58 (40)

You rated this post out of 5. Change rating

Share

Share

Rate

3.58 (40)

You rated this post out of 5. Change rating