SQLServerCentral Article

SSIS - Reusable Benchmark Harness


You need to run a number of benchmarks, testing each with different parameters, data sizes, and algorithms, while documenting run times and run parameters. Overnight unattended operation is a given, and by now the task is getting more than a bit painful. If you're not fully automated (hey, it has certainly happened to me), you're probably starting to lose run results to exhaustion or boredom.

To the rescue, this Swiss Army Knife Gizmo provides a framework that gives you a compact way to specify what needs to happen, and then executes the benchmarks accordingly. In keeping with its Swissness, it can of course be equally useful for any production work that has similar requirements. Let's cut our teeth on:

A Minimal Benchmark

Requirements

  • Run a benchmark twice, using different settings for the variable NumberOfRecords
  • Document the runs by writing benchmark and system information, as well as elapsed times to a CSV text file

The solution builds upon a previous article, so if the Getting silly for loop doesn't ring a bell, you probably want to check that out. The reusable gizmo consists of a for loop and the Logic script:

Benchmark implementation

  • Pop our Benchmark into the gizmo - for demo purposes this is just a script displaying the value of NumberOfRecords
  • Add the variable NumberOfRecords to the for loop scope and to Logic.ReadWriteVariables
  • Create the Precedence constraint with the expression: (@BitMask & 2) != 0
  • Add text describing the benchmark and output to the variables SystemInfo, RunInfo, BenchmarkInfo, and ReportBaseFilename
  • Add only the code marked in green to Logic (the script has additional supporting code, described later). Bits.Benchmark must match the value "2" used in the precedence constraint.
    Enum Bits
        NoBitsSet = 0
        ExitLoop = 1
        Benchmark = 2
    End Enum
    Public Sub Main()
        If FSM.Empty Then
            FSM.SetVariablesToLog("NumberOfRecords")
            FSM.AddState("Benchmark", Bits.Benchmark, "NumberOfRecords", 100000)
            FSM.AddState("Benchmark", Bits.Benchmark, "NumberOfRecords", 500000)
        End If
        FSM.ExecuteState()  ' Perform the actions specified for the current state
        Dts.TaskResult = Dts.Results.Success
    End Sub
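The precedence-constraint expressions are plain bit tests against the BitMask variable. A quick sketch of the same logic (plain Python, outside SSIS; the task names mirror the package but are illustrative):

```python
# Bit values matching the Enum Bits in the Logic script
EXIT_LOOP, BENCHMARK = 1, 2

def tasks_enabled(bitmask):
    """Return which tasks the precedence constraints would trigger."""
    enabled = []
    if bitmask & BENCHMARK != 0:   # same test as the expression (@BitMask & 2) != 0
        enabled.append("Benchmark")
    if bitmask & EXIT_LOOP != 0:
        enabled.append("ExitLoop")
    return enabled

print(tasks_enabled(2))  # ['Benchmark']
print(tasks_enabled(1))  # ['ExitLoop']
```

Because each task gets its own bit, a single integer variable can gate any combination of tasks per iteration.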

On package execution:

  • The for loop does two iterations, corresponding to the two FSM.AddState() calls
  • The Logic script will ensure Benchmark runs twice, setting NumberOfRecords appropriately each time
  • The Logic script also outputs descriptive info and elapsed times into the file specified with ReportBaseFilename:
    StartTime,CurrentTime,Type,NumberOfRecords,ElapsedTime,Info
    2005-09-08 17:47:01.000000,2005-09-08 17:47:01.667174,"Start",0,,""
    2005-09-08 17:47:01.000000,2005-09-08 17:47:01.667174,"RunInfo",0,,"Article final run."
    2005-09-08 17:47:01.000000,2005-09-08 17:47:01.667174,"BenchmarkInfo",0,,"Run a script 
        twice with different values for the variable NumberOfRecords."
    2005-09-08 17:47:01.000000,2005-09-08 17:47:01.667174,"SystemInfo",0,,"Windows XP SP2. 
        SQL Server 2005 Developer Edition, June CTP. Single CPU AMD 1800+. 1.5GB memory. 
        Data and log files on a single disks, simple recovery model."
    2005-09-08 17:47:01.000000,2005-09-08 17:47:03.910400,"Benchmark",100000,2.2231968,""
    2005-09-08 17:47:01.000000,2005-09-08 17:47:05.662920,"Benchmark",500000,1.75252,""
    2005-09-08 17:47:01.000000,2005-09-08 17:47:05.662920,"FinishElapsed",500000,4.66292,""
Tip: By default, the date and time the package started is added to the output filename, creating a new output file for each run. To make multiple runs output to the same file, simply remove the date and time portion from the filename (i.e. edit the Logic script where LogFilename is initialized).

Tip: The results are easy to import into other tools, such as Excel, for further processing and presentation, although you might need to adjust the date format to match what your tool expects.
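As an alternative to Excel, the report is trivial to post-process with a script. A minimal sketch (plain Python; the rows are taken from the sample output above):

```python
import csv
import io

# Sample rows in the harness's CSV report format
report = """StartTime,CurrentTime,Type,NumberOfRecords,ElapsedTime,Info
2005-09-08 17:47:01.000000,2005-09-08 17:47:03.910400,"Benchmark",100000,2.2231968,""
2005-09-08 17:47:01.000000,2005-09-08 17:47:05.662920,"Benchmark",500000,1.75252,""
"""

def benchmark_times(csv_text):
    """Return {NumberOfRecords: ElapsedTime} for rows of type 'Benchmark'."""
    result = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["Type"] == "Benchmark":
            result[int(row["NumberOfRecords"])] = float(row["ElapsedTime"])
    return result

print(benchmark_times(report))  # {100000: 2.2231968, 500000: 1.75252}
```

With the dictionary in hand, comparing runs or charting elapsed time against data size is a one-liner in your tool of choice.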

A Larger Benchmark

This simulates a more realistic set of benchmarks, with placeholders for creating and deleting test data, resetting the environment between benchmarks, and varying multiple benchmark parameters, including using a property expression on a key data flow property.

Larger requirements

  • Run Initialize, First Benchmark and Second Benchmark with different data sizes (100000, 400000, 700000, 1000000 records)
  • For each data size, measure which value of the data flow property DefaultBufferMaxRows provides the best performance (20000, 110000, 200000 rows)
  • Run Reset Environment before running each individual benchmark
  • Run Cleanup after running all other tasks
  • Document the runs by writing benchmark and system information, as well as elapsed times to a CSV text file

Laid out serially without reuse, even this fairly simple example would require 53 tasks!

Larger benchmark implementation

  • Add our own five tasks to the gizmo, appending a unique power-of-2 number to each name
  • Add the variables NumberOfRecords and DefaultBufferMaxRows to the for loop scope and to Logic.ReadWriteVariables
  • Create the precedence constraint expressions, substituting the power-of-2 number you picked for each task above in place of "1024": (@BitMask & 1024) != 0
  • Create a property expression on First Benchmark, mapping the variable DefaultBufferMaxRows to the property DefaultBufferMaxRows
  • Add text describing the benchmark and output to the variables SystemInfo, RunInfo, BenchmarkInfo, and ReportBaseFilename
  • Add only the code marked in green to Logic. The Bits must match in value with the number used for the precedence constraints.
    Enum Bits   ' Separate the harness states from your benchmark states
        NoBitsSet = 0
        ExitLoop = 1
        Initialize = 2
        Cleanup = 4
        ResetEnvironment = 8
        FirstBenchmark = 1024
        SecondBenchmark = 2048
    End Enum
    Public Sub Main()
        If FSM.Empty Then
            FSM.SetVariablesToLog("NumberOfRecords", "DefaultBufferMaxRows")
            Dim records, bufferRows As Int32
            For records = 100000 To 1000000 Step 300000
                FSM.AddState(Nothing, Bits.Initialize, "NumberOfRecords", records)
                For bufferRows = 20000 To 200000 Step 90000
                    FSM.AddState(Nothing, Bits.ResetEnvironment)
                    FSM.AddState("FirstBenchmark", Bits.FirstBenchmark, _
                        "DefaultBufferMaxRows", bufferRows)
                    FSM.AddState(Nothing, Bits.ResetEnvironment)
                    FSM.AddState("SecondBenchmark", Bits.SecondBenchmark)
                Next
            Next
            FSM.AddState(Nothing, Bits.Cleanup)
        End If
        FSM.ExecuteState()  ' Perform the actions specified for the current state
        Dts.TaskResult = Dts.Results.Success
    End Sub

On package execution:

  • The for loop does 53 iterations, corresponding to the 53 FSM.AddState() calls that are made on the first iteration
  • For each iteration, FSM.ExecuteState() sets a bit in the variable BitMask to make the corresponding precedence expression and task trigger
  • The Logic script also outputs descriptive info and elapsed times into the file specified with ReportBaseFilename (not shown here)
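The state count follows directly from the nested loops in Main(): for each of the four record sizes, one Initialize plus three buffer-size passes of (Reset, First, Reset, Second), and a single Cleanup at the end. A quick arithmetic check:

```python
# Ranges matching the For loops in Main()
record_sizes = range(100000, 1000001, 300000)  # 100000, 400000, 700000, 1000000
buffer_sizes = range(20000, 200001, 90000)     # 20000, 110000, 200000

states = 0
for _ in record_sizes:
    states += 1          # Initialize
    for _ in buffer_sizes:
        states += 4      # ResetEnvironment, FirstBenchmark, ResetEnvironment, SecondBenchmark
states += 1              # Cleanup
print(states)  # 53
```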

As we've seen, very little effort is needed to configure, run, and output benchmark information. If you're absolutely dying to start using this, skip ahead and download the gizmo; otherwise, check out the gory details that make this possible, i.e. the:

Support Code in Logic

NB: The word State is sprinkled throughout the code - a state can be seen simply as one iteration of the for loop, together with the variables to set etc. for that iteration.

The Logic script needs to remember settings between invocations by the for loop. The script task does not allow static script variables that keep their values between invocations, so instead we use an object of class FSMData, stored in a package variable called User::FSMData, to keep the necessary information.

Tip: User::FSMData has the extremely useful type System.Object, which can store objects of any type. The Data() property below shows how to allocate and type cast the object.

Tip: With StateList As New List(Of State), on the other hand, we can use generics, new in .NET 2.0, removing the need for type casting when accessing the objects.

FSMData.StateList has one item of Class State for each state added with FSM.AddState(), and the FSMData.CurrentState() property is the iterator for these states.

    Private Shared ReadOnly Property Data() As FSMData
        Get
            If Dts.Variables("FSMData").Value.GetType().ToString() = "System.Object" Then
                Dts.Variables("FSMData").Value = New FSMData()
            End If
            Return CType(Dts.Variables("FSMData").Value, FSMData)
        End Get
    End Property
    Friend Class FSMData
        Private CurrentStateValue As Int32
        Friend StateList As New List(Of State)
        Friend StateStartTime As DateTime
        Friend VariablesToLog As String()
        Friend LogFilename As String
        Sub New()
            CurrentState = 0
            LogFilename = Dts.Variables("ReportBaseFilename").Value.ToString() & _
                CDate(Dts.Variables("StartTime").Value).ToString("yyyy-MM-dd_HH_mm_ss") _
                & ".txt"
        End Sub
        Public Property CurrentState() As Int32
            Get
                Return CurrentStateValue
            End Get
            Set(ByVal value As Int32)
                CurrentStateValue = value
            End Set
        End Property
    End Class
    Friend Class State
        Friend Name As String
        Friend BitMask As Bits
        Friend Variables As Object()
        Sub New(ByVal _n As String, ByVal _b As Bits, ByVal ParamArray _v As Object())
            Name = _n
            BitMask = _b
            Variables = _v
        End Sub
    End Class

In A Minimal Benchmark we saw how AddState() was used to add benchmark runs and variable values. Note that the setVariables parameter can contain many <Variable name, Variable value> pairs, even with different value types!

ExecuteState() is responsible for calling appropriate logging subroutines, setting package variables as specified by AddState() calls, and incrementing the state iterator CurrentState.

    ' stateName = Nothing => Don't report elapsed time for this state
    Public Shared Sub AddState(ByVal stateName As String, ByVal bitMask As Bits, _
            ByVal ParamArray setVariables As Object())
        Data.StateList.Add(New State(stateName, bitMask, setVariables))
    End Sub
    Public Shared Sub ExecuteState()
        If Data.CurrentState = 0 Then
            LogInitial()
        End If
        ' Log elapsed time of previous state
        If Not FSM.Empty And Data.CurrentState > 0 Then
            LogLine(Data.StateList(Data.CurrentState - 1).Name, _
                FSM.ElapsedTime.ToString(), "")
        End If
        If Data.CurrentState >= Data.StateList.Count Then
            LogLine("FinishElapsed", (Now() - _
                CDate(Dts.Variables("StartTime").Value)).TotalSeconds.ToString(), "")
            Dts.Variables("BitMask").Value = Bits.ExitLoop
        Else
            Dts.Variables("BitMask").Value = Data.StateList(Data.CurrentState).BitMask
            Dim i As Int32
            For i = 0 To Data.StateList(Data.CurrentState).Variables.Length - 1 Step 2
            Dts.Variables(Data.StateList(Data.CurrentState).Variables(i).ToString()) _
                    .Value = Data.StateList(Data.CurrentState).Variables(i + 1)
            Next
            Data.StateStartTime = Now()
            Data.CurrentState += 1    ' Set the next state
        End If
    End Sub

That covers the core functionality. Only a few additional helper routines remain, such as the logging of info to disk; check the Logic script for those, or download the whole package.
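Stripped of the SSIS plumbing, the pattern AddState()/ExecuteState() implement is compact. A minimal stand-alone sketch (plain Python; names mirror the Logic script, and a dictionary stands in for the package variables):

```python
EXIT_LOOP = 1  # matches Bits.ExitLoop

class Harness:
    def __init__(self):
        self.states = []      # (name, bitmask, (var, value, var, value, ...))
        self.current = 0
        self.variables = {}   # stands in for the SSIS package variables

    def add_state(self, name, bitmask, *set_variables):
        self.states.append((name, bitmask, set_variables))

    def execute_state(self):
        """Set BitMask for the current state, apply its variable pairs, advance."""
        if self.current >= len(self.states):
            self.variables["BitMask"] = EXIT_LOOP   # no states left: stop the loop
            return
        name, bitmask, pairs = self.states[self.current]
        self.variables["BitMask"] = bitmask
        for i in range(0, len(pairs), 2):           # consume <name, value> pairs
            self.variables[pairs[i]] = pairs[i + 1]
        self.current += 1

fsm = Harness()
fsm.add_state("Benchmark", 2, "NumberOfRecords", 100000)
fsm.add_state("Benchmark", 2, "NumberOfRecords", 500000)
fsm.execute_state()
print(fsm.variables)  # {'BitMask': 2, 'NumberOfRecords': 100000}
```

Each call to execute_state() corresponds to one iteration of the for loop; once the state list is exhausted, setting BitMask to ExitLoop makes the loop's exit constraint fire.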

Super Gizmo

More ways to use or extend the framework:

  • Currently the number and range of variable values is controlled solely by the Logic script. For maximum flexibility, the from/to/step values in the Logic.Main() for loops could themselves be controlled with package variables, allowing them to easily be specified differently for each Super Gizmo package invocation.
  • Integration Services provides a configuration type, the Parent Package Variable configuration, for passing values from parent to child packages. Using that, together with an Execute Package Task, you can benchmark external packages.
  • I elected to output results to a text file using the File class. You might instead consider using the built-in logging facilities or storing output in the database; check out the Dts.Log() method.
  • Load states, variables etc. from an external file or database.
  • Property expressions can only be set on tasks, not on Data Flow components. This means you can't use a variable to change say the CacheType of a particular Lookup transform inside a data flow task. You can of course use the task switching abilities we've covered to switch between different data flow tasks, thereby optimizing performance.
  • To go back to previous states depending on some script code you add, I suggest:
    • Have AddState() return the number of the state added, and use that for any branching logic you add
    • Add a parameter nextState to AddState() that specifies the number of the next state (instead of the currently implemented monotonically increasing state number), or a predefined constant such as State.Next, State.Same, or State.Previous
    • Either add your logic directly in Main(), or get funky adding delegates to states
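A hypothetical sketch of the suggested nextState extension (plain Python; the names add_state, step, and next_state are illustrative, not part of the gizmo): add_state() returns the index of the state it added, and each state may name the state to jump to instead of always advancing by one.

```python
class BranchingFSM:
    def __init__(self):
        self.states = []   # (bitmask, next_state or None)
        self.current = 0

    def add_state(self, bitmask, next_state=None):
        self.states.append((bitmask, next_state))
        return len(self.states) - 1   # index, usable later as a branch target

    def step(self):
        """Return the current state's bitmask and jump to its successor."""
        bitmask, next_state = self.states[self.current]
        self.current = next_state if next_state is not None else self.current + 1
        return bitmask

fsm = BranchingFSM()
warmup = fsm.add_state(2)
fsm.add_state(4, next_state=warmup)   # loop back to the warm-up state
print(fsm.step(), fsm.step(), fsm.step())  # 2 4 2
```

Real branching logic would of course decide next_state at run time (in Main(), or via delegates attached to states, per the bullets above).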

Conclusion

  • Currently, out-of-the-box SSIS lacks some obvious features for reuse, but makes up for it with easy-to-use flexibility, allowing you to create the functionality you need
  • The described framework is very useful for running multiple tasks many times with different variable settings
  • It can also be used for production work, in which case controlling the variable from/to/step values from outside the Logic script is possibly a very useful extension

In short, don't leave home without it !-)


Kristian Wedberg is a Business Intelligence consultant living outside London, UK. He can be reached at this email address: trk2061 [at] wedberg.name


Disclaimer: Opinions and comments expressed herein are my own, and do not necessarily represent those of my employer.
