SQLServerCentral Article

ETL Performance Auditing - Part 3: Auditing Data Warehouse Phase

,

Introduction

You are a SQL Developer or BI architect, and you get news that your scheduled production ETL Job(s) that were running “perfectly” are taking “longer” and “longer” to run. The problem is there are over 300 individual child SSIS packages that stage data, load a Data Warehouse (WH), load Data Marts,  process SSAS Cubes and manage your Database and Cube Partitions.

Some of the questions that could be asked would be: What is “longer”, or what is the benchmark for that complaint? Which of the ETL phases and individual packages within the ETL phases are causing the job to take longer? Which SSIS tasks and pipeline components on the individual Dataflow tasks are causing the packages to take longer?  Or could this simply be as a result of increase in amount of data being pulled?

In part 1, we looked at some prerequisites for setting up a comprehensive ETL Audit system namely:

  • Gathering all your SSIS Package information such as packages, tasks and pipeline component names and IDs.
  • Enabling the minimal necessary logging required to capture the ETL execution cycle.

In part 2, we looked at the auditing the Staging Phase of our hypothetical ETL architecture. In Part 3 we are going to look at the Warehouse Phase of this same ETL architecture. Before we proceed, let’s revisit the sample ETL architecture that we using for these series.

This is a three part series on ETL auditing. The various articles cover:

Sample ETL Project

To demonstrate some complexities in setting up ETL audit systems using the SSIS logging option, we are going to assume an ETL Process that loads a Data Warehouse (WH) for a company with two businesses, one in Singapore and another in the UK, each with a separate data source. From the two sources, data is first staged on separate databases before it is finally moved into the WH. The ETL setup for the two businesses is as shown fig 2 below.

Fig 1: Showing our hypothetical ETL Setup.

Auditing the WH Phase

In this part of the series we are going to focus on the ETL processes in the WH phase of the architecture shown in fig 1. From the Staging Phase, data is pulled from two separate staging databases into the WH database. From the ETL architecture shown on fig 1, let’s further analyze and summarize what’s involved in the WH phase.

WH Phase

The WH phases consists of

  • One set of WH Packages is deployed to the WH Server and is scheduled to sequentially load data from the UK and Singapore staging databases into the WH in no particular order.
  • Logging is enabled on the WH Packages into one SSIS log table sysssislog_WH on the WH database.

The key point to note here is that in the WH Phase, events are logged into one SSIS log table (sysssislog_WH) from both the UK and Singapore staging database by the same set of packages in no particular order.

The problem is that, after a couple of day if we query the sysssislog_WH table and outline consecutive batches we cannot tell the batches that belong to Singapore and those that belong to the UK system. Why? Because on some days Singapore was loaded into the WH first and on other days UK was loaded first. Such irregular load sequence could be even more complicated if you have a couple more system participating in this flip-flop ETL process?

Establishing Execution Framework within the WH Phase

To be able to enable a well-coordinated audit system we must outline the ETL process lifecycles within the WH phase in our ETL set up. For the purposes of this exercise, we are going further assume that;

  • The WH is of a daily latency. This means that in a day we expect two separate and successful batch executions in the WH Phase, one from the Singapore staging database to the WH and another from the UK staging database to the WH.
  • A successful batch execution is when both the first and last package loading the WH phase successfully completes together in a run.
  • ETL batch execution is restarted if a package fails before the completion of the first and last package together.

All source, staging and WH databases are SQL Server databases.

Estimating Batch Execution Cycles 

As defined above, a successful batch execution is when both the first and last package in the WH Phase successfully completes together in a run. To determine execution intervals for successful WH Batches, we will estimate the time between the first event logged by the “First Package” and the last event logged by the “Last Package” excluding failed execution.

For instance, the logic in listing 1 below will output all successful batches in any of the WH ETL Phases.

Listing 1 WH Batch Execution Cycles

 with FailedExecs
          as
          (
                   select Distinct
                             s.executionid
                   from dbo.sysssislog_WH  s
                   where event='OnError'
          ),
          FirstPackage
          as
          (        
                  select
                     ROW_NUMBER() over ( partition by s.sourceid
                     order by min(s.Starttime) ) Batchid
                    ,min(s.Starttime) as  StartTime
                    ,cast(convert(char(8),min(s.Starttime) ,112) as int) as StartDatekey
                    ,s.executionid
                    ,s.source
                  from dbo.sysssislog_WH s
                  where s.source like 'First Package Name'
                   and event not like ('User:%')
                   and s.executionid not in (select executionid from FailedExecs)
                   group by s.executionid ,s.source,s.sourceid
          ),
          LastPackage
          as
          (
          select
                      ROW_NUMBER() over ( partition by s.source
                      order by max(s.Endtime) ) Batchid
                    ,max(s.Endtime) as EndTime
                    ,cast(convert(char(8),min(s.Endtime) ,112) as int) as EndDatekey
                   ,s.executionid
                   ,s.source
          from dbo.sysssislog_WH s
          where s.source like 'Last Package Name'
          and event not like ('User:%')
          and s.executionid not in (select executionid from FailedExecs)
          group by s.executionid ,s.source,s.sourceid
          )
          select Distinct
                   N'Warehouse' as [ETLPhase]
                   ,f.Batchid as BatchId
                   ,f.StartTime as StartTime
                   ,f.StartDatekey as StartDatekey
                   ,l.EndTime         as        EndTime  
                   ,l.EndDatekey      as EndDatekey
                   --,d.DayNumberDesc
                   --,d.WeekNumberDesc
                   --,d.MonthNumberDesc
                   -- ,d.YearNumberDesc        
                   ,datediff (second,f.StartTime,l.EndTime ) as EstBatchExecTime_Sec
                  ,datediff (MINUTE,f.StartTime,l.EndTime ) as EstBatchExecTime_Min
          from FirstPackage f
          inner join LastPackage l
          on f.Batchid=l.Batchid
     
          --inner join [Dim].[Date] d
          --on d.DateISOName=f.StartDatekey

Let’s look at the two normal and Possible WH SSIS package setups:

  1. Where you have only one package called “OnePackage” in the WH Phase executing various tasks sequentially or in parallel to load the WH.
  2. Where you have a one parent package called “ParentPackage” in the WH Phase calling and processing many child packages either sequentially or in parallel to load the WH.

In the first case the name “First Package Name” and “Last Package Name” in the logic in listing1 are both replaced by “OnePackage”. Similarly in the second case by replacing “First  Package Name” and “Last Package Name” with the parent package name “ParentPackage”. The logic will capture the execution cycles for all the child packages called by the parent.

Based on the WH execution framework defined above, in 10 days the logic in Listing 1 above will isolate 20 batches each representing a successful daily execution for the UK and Singapore systems. The logic also will returns the estimated time it took for each of the batches to complete, the day the execution occurred, and the actual time each batch started and completed. However, there is still the question of which batch belongs to which system.

Identifying System Batches

To be able to associate the batches outline in the WH above with right source systems, we are going to identify successful executionids that belongs to each of the source systems and log it into the table below at runtime. We will then use the table to map batches to their respective source systems.

CREATE TABLE [Aud].[SystemExecIdentifier](
                [SystemName] [nvarchar](128) NULL,
                [ExecutionGuid] uniqueidentifier NULL,
                PackageName [int] NULL,
>) ON [PRIMARY]

To implement this step, run the script above to create Aud.SystemExecIdentifier table. After that each time we run the “ParentPackage” or the “OnePackage” Package (refer to the “Estimating Batch Execution Cycles” section above), we will log the executionid and the associated source system into the Aud.SystemExecIdentifier table. This executionid and the source system combination should be enough to help us identify which batches that belongs to the UK and those that belong to the Singapore systems from the log table.

To log the executionid and the associated source system into this table we are going to use the simple stored procedure (SP) in listing 3 below. This SP will be called once by the parent packages above during each execution. A good place to call this SP is on the PostExecute event of the package through a SQL Task.

Listing 3: Store Procedure to Capture package ExecutionInstanceGUID and Source Load System

Create procedure [Aud].[usp_LogSystemExecGuid]
        @ExecutionGuid uniqueidentifier,
        @PackageName nvarchar(200,
        @SourceSystem nvarchar(200)
AS
 BEGIN
  insert into [Aud].[SystemExecIdentifier](SourceSystem, ExecutionGuid, PackageName)  
   Values (@ExecutionGuid,@PackageName, @SourceSystem)
 END;

When setting up the SQL Task to call the SP, two system variables and one user variable should be passed to the task that will call the SP. The two system variables passed are the “ExecutionInstanceGUID” and “PackageName”, which should be passed as the @ExecutionGuid, @PackageName parameters respectively. A third user variable indicating which source system running the package is passed to the SP as the third parameter.

With the necessary data captured in the Aud.SystemExecIdentifier table, the logic in listing 1 can be modified to associated all Batchids with their respect source system by joining to the Aud.WHSystemExecIdentifier table as shown modified version in Listing 2 below

Listing 2 WH batch execution cycle by source system

with FailedExecs
as
(
select Distinct
s.executionid
from dbo.sysssislog_WH s
where event='OnError'
),
FirstPackage
as
( 
select
ROW_NUMBER() over ( partition by s.sourceid 
order by min(s.Starttime) ) Batchid
,min(s.Starttime) as StartTime
,cast(convert(char(8),min(s.Starttime) ,112) as int) as StartDatekey
,s.executionid
,s.source
,m.SourceSystem
from dbo.sysssislog_WH s
inner join Aud.SystemExecIdentifier m
on s.executionid=m.executionid 
where s.source like 'First Package Name'
and event not like ('User:%')
and s.executionid not in (select executionid from FailedExecs)
group by s.executionid ,s.source,s.sourceid
),
LastPackage
as
(
select 
ROW_NUMBER() over ( partition by s.source
order by max(s.Endtime) ) Batchid
,max(s.Endtime) as EndTime
,cast(convert(char(8),min(s.Endtime) ,112) as int) as EndDatekey
,s.executionid
,s.source
from dbo.sysssislog_WH s
where s.source like 'Last Package Name'
and event not like ('User:%')
and s.executionid not in (select executionid from FailedExecs)
group by s.executionid ,s.source,s.sourceid
)
--insert into aud.ETLPhaseBatches
select Distinct
f.SourceSystem
,N'Warehouse' as [ETLPhase]
,f.Batchid as BatchId
,f.StartTime as StartTime
,f.StartDatekey as StartDatekey
,l.EndTime as EndTime 
,l.EndDatekey as EndDatekey
--,d.DayNumberDesc
--,d.WeekNumberDesc
--,d.MonthNumberDesc
-- ,d.YearNumberDesc
,datediff (second,f.StartTime,l.EndTime ) as EstBatchExecTime_Sec
,datediff (MINUTE,f.StartTime,l.EndTime ) as EstBatchExecTime_Min
from FirstPackage f
inner join LastPackage l
on f.Batchid=l.Batchid
--inner join [Dim].[Date] d
--on d.DateISOName=f.StartDatekey

The key pieces of infomation in the revised logic are the lines where the WH logging table is joined to the Aud.SystemExecIdentifier to extract the source system for each batch as shown below.

 ,m.SourceSystem                          

from dbo.sysssislog_WH s                 

 inner join Aud.WHSystemExecIdentifier m      

  on s.executionid=m.executionid

After identifying successfull batches and the associated source systems, all package and task executions can also be tied to the right source system.

WH Batch Reports

The logic in listing 2 above can be joined to a time dimension table (commented out in the logic) to generate various batch reports. By joining the logic to a time dimension will allow you to generate a time parameterized report for various phases and sources.

Fig 2 shows the report parameters for the “Singapore” and “UK” systems in the WH ETL phase.

Fig 2:  EH Batch Cycle Report Parameters

The resulting sample batch report graph is displayed in fig 3 below.


Fig 3: Parameterized report showing daily batch execution cycles for the WH ETL Phase and source systems within a period of time.

The sample report illustrates daily batch execution cycles for the “Singapore” and “UK” systems in the WH ETL phase. The graph shows the batch executions within a 3-week period for both system.  

Estimating  Package Execution Cycles

 In this section we will estimate the time intervals for all Package executions within the WH.  To estimate this metric for packages we will estimate the time between the first and last event logged by each package. The script to estimate these time intervals is shown in listing 3 below.

Listing 3: SSIS Package execution cycles in the WH

with FailedExecs
as
(
select Distinct
s.executionid
from sysssislog_WH s
where event='OnError'
),
PkgBatches
as
(
select
ROW_NUMBER() over ( partition by s.sourceid
order by min(s.Starttime) ) Batchid
,cast(convert(char(8),min(s.Starttime) ,112) as int) as Datekey
,s.sourceid
,s.executionid
,p.PackageName
from sysssislog_WH s
inner join Aud.DimSSISPackage p
on s.sourceid =p.PackageId
where s.executionid not in (select executionid from FailedExecs)
and s.event not like ('User:%')
group by s.executionid,s.sourceid ,p.PackageName
)
-- insert into aud.PackageBatches
select
N'Warehouse' as [ETLPhase]
,p.PackageName
,s.sourceid
,s.executionid
,P.Batchid as BatchId
,min(s.starttime) as starttime
,Max(s.endtime) as endtime
,p.Datekey
--,d.DayNumberDesc
--,d.WeekNumberDesc
--,d.MonthNumberDesc
-- ,d.YearNumberDesc 
,datediff (second, min(s.starttime) , Max(s.endtime)) as EstExecTime_Sec
,datediff (minute, min(s.starttime) , Max(s.endtime)) as EstExecTime_Min
from sysssislog_WH s
inner join PkgBatches p
on s.executionid=p.executionid
and s.sourceid =p.sourceid
--inner join [Dim].[Date] d
-- on d.DateISOName=p.Datekey
where s.executionid not in (select executionid from FailedExecs)
group by
p.Datekey
,s.sourceid
,s.executionid
,p.PackageName
,p.Batchid
,year(s.StartTime)
,month(s.StartTime)
,day(s.StartTime)
--,d.DayNumberDesc
--,d.WeekNumberDesc
--,d.MonthNumberDesc
-- ,d.YearNumberDesc 

The key pieces of information in Listing 3 is that we join the log table to the Aud.DimSSISPackage table, which holds package information such as packages names and their related IDs (Refer to the “Gathering SSIS Package Information” section in Part 1) of our ETL setup.

Note that the package execution cycles in Listing 3 above are not associated with any source system at this point. We can do this by associating package execution cycles with the batch-source system cycles from listing 2. Now assuming the batch exec cycle results from listing 2 is inserted into the table aud.ETLPhaseBatches  and the package exec cycle results from listing 3 are inserted into the table aud.PackageBatches. By joining the two tables, package executions can be matched to source system-batches as shown in listing 4 below.

Listing 4:  Matching Package Exec Cycles  to System Batches Cycles.

--insert into aud.SystemPackageBatches
select   
         a.ETLPhase 
        ,b.SourceSystem
        ,a.PackageName
        ,b.Batchid as SystemPackageBatchId
        ,a.starttime
        ,a.endtime
        ,a.Datekey
        ,a.executionId
        ,a.d.DayNumberDesc
        ,a.d.WeekNumberDesc
        ,a.MonthNumberDesc
        ,a.YearNumberDesc   
        ,a.EstExecTime_Sec as PackageEstExecTime_Sec        ,b.BatchID as PhaseBatchID
From aud.PackageBatches  a
inner join from aud.ETLPhaseBatches  b
  On a.starttime>=b.starttime
  and a.endtime <=b.Endtime

  

WH Package Reports

The logic in listings has been joined to a time dimension table (commented out in the logic) to help generate various system-package execution reports.  Refer to part 2 for examples of possible system-package batch reports.

Estimating Task level Execution Intervals

To estimate the execution cycle interval for tasks, we will estimate the time between the first event logged by each task and the last event logged by the same task. Note that we will exclude all failed executions in both cases. The script to do this is in Listing 5 below. 

Listing 5: SSIS Task Batches cycles from the WH

with FailedExecs
          as
          (
                   select Distinct
                             s.executionid
                   from sysssislog_WH s
                    where event='OnError'
          ),
          TaskBatches
          as
          (
                   select
                             N'Warehouse' as [ETLPhase]
                             ,ROW_NUMBER() over ( partition by s.sourceid
                              order by min(s.Starttime) ) Batchid
                              ,cast(convert(char(8),min(s.Starttime) ,112) as int) as Datekey
                            ,s.sourceid
                             ,s.executionid
                             ,t.TaskName
                             ,p.PackageName
                   from sysssislog_Singapore s
                   inner join Aud.DimSSISPackageTasks t
                      on t.TaskId=s.sourceid
                   inner join Aud.DimSSISPackage p
                       on t.PackageId=p.PackageId
                   where s.executionid not in (select executionid from FailedExecs)
                       and s.event not like ('User:%')
                   group by s.executionid,s.sourceid ,t.TaskName,p.PackageName
          )
--insert into Aud.ETLTaskBatches
select
           t.[ETLPhase]
          ,p.[SourceSystem]
          ,t.TaskName
          ,t.PackageName
          --,s.sourceid
          --,s.executionid
          ,t.Batchid as BatchId
          ,min(s.starttime) as starttime
          ,Max(s.endtime) as endtime
          ,t.Datekey
          --,d.DayNumberDesc
          --,d.WeekNumberDesc
          --,d.MonthNumberDesc
          --,d.YearNumberDesc
          ,datediff (second, min(s.starttime) , Max(s.endtime)) as EstExecTime_Sec
          ,datediff (minute, min(s.starttime) , Max(s.endtime)) as EstExecTime_Min
from sysssislog_WH s
inner join TaskBatches t
    on s.executionid=t.executionid
    and s.sourceid =t.sourceid
inner join aud.SystemPackageBatches p
on p.executionid=t.executionid
-- inner join [Dim].[Date] d
 -- on d.DateISOName=t.Datekey
group by
          t.Datekey
          ,s.sourceid
          ,s.executionid
          ,t.Batchid
          ,t.TaskName
          ,t.PackageName
          ,year(s.StartTime)
          ,month(s.StartTime)
          ,day(s.StartTime)
          --,d.DayNumberDesc
          --,d.WeekNumberDesc
          --,d.MonthNumberDesc
           --,d.YearNumberDesc

The key lines in Listing 5 are where we join the WH log table to the Aud.DimSSISPackage table, which holds all packages names and their related IDs and also to the Aud.DimSSISPackageTasks table, which holds all task information such as task names, task IDs and their related package IDs (Refer to the “Gathering SSIS Package Information” section in Part 1) of our ETL setup.

We finally join to the aud.SystemPackageBatches table loaded in listing 4 to associate the task executions cycles to the appropriate packages and source systems.

WH Task Reports

The logic in Listing 5 has been joined to a time dimension table (commented out in the logic) to help generate various task execution cycle reports. Refer to part 2 for samples of such reports.

Conclusion

In this 3 part Series we've looked at how to setup as audit system to capture SSIS ETL execution cycles.

The examples and logic in these series shows that it is possible to setup ETL execution audit using SSIS logs no matter how complicated your ETL setup, even though some setup may require a few extra steps in addition to those outlined here. Note that the logic oulined in this document can be rewritten in many different ways. They are seperated into various step in this document to help clearly explain the various executions cycles and systems involved.

A good audit trail, like the type discussed in this series, enables you to do both selective and tagerted ETL performance analysis. Troubleshhooting all ETL performance issues may however require generation and capture of some other performance metrics in your audit system. Audit trails captured and stored through processes outlined in this series will help you identify the SSIS elements performing poorly. 

Depending on the nature or the size your ETL setup, such a comprehensive system might not be necessary. On the other hand, if your an architect or developer overseeing an SSIS ETL setup with numerious phases and packages loading data daily to streaming data in real-time, then this could be a solution to monitoring your ETL execution cycles and performance.

Rate

4.86 (7)

You rated this post out of 5. Change rating

Share

Share

Rate

4.86 (7)

You rated this post out of 5. Change rating