Speeding Up Incremental ETL Proceses in SSIS by Using MD5 Hashes

  • Good and very clear article. But described solution - to "speed up" ETL - is like shooting mosquito from cannon ball. Much more faster is using HASHBYTES sql server native function. And for Goodness sake, dont recommend using OLEDB Command to speed up anything. Use data sets and staging tables.

  • djphatic (6/25/2012)


    For testing purposes I created a spreadsheet with 3 columns, Column A, B and C each with the values A B C respectively in ~60,000 rows and used this as my data source.

    Using the script below:

    ' Microsoft SQL Server Integration Services Script Component

    ' Write scripts using Microsoft Visual Basic 2008.

    ' ScriptMain is the entry point class of the script.

    Imports System

    Imports System.Data

    Imports System.Math

    Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper

    Imports Microsoft.SqlServer.Dts.Runtime.Wrapper

    Imports Microsoft.SqlServer.Dts.Pipeline

    Imports System.Text

    Imports System.Security.Cryptography

    <Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute()> _

    <CLSCompliant(False)> _

    Public Class ScriptMain

    Inherits UserComponent

    Private InputBuffer As PipelineBuffer

    Public Overrides Sub ProcessInput(ByVal InputID As Integer, ByVal Buffer As PipelineBuffer)

    InputBuffer = Buffer

    MyBase.ProcessInput(InputID, Buffer)

    End Sub

    Public Overrides Sub PreExecute()

    MyBase.PreExecute()

    '

    ' Add your code here for preprocessing or remove if not needed

    ''

    End Sub

    Public Overrides Sub PostExecute()

    MyBase.PostExecute()

    '

    ' Add your code here for postprocessing or remove if not needed

    ' You can set read/write variables here, for example:

    ' Me.Variables.MyIntVar = 100

    ''

    End Sub

    Public Overrides Sub InputSource_ProcessInputRow(ByVal Row As InputSourceBuffer)

    Dim counter As Integer = 0

    Dim values As New StringBuilder

    For counter = 0 To inputBuffer.ColumnCount - 1

    Dim value As Object

    value = inputBuffer.Item(counter)

    values.Append(value)

    Next

    Row.Concat = values.ToString

    Row.HashValue = CreateHash(values.ToString())

    End Sub

    Public Shared Function CreateHash(ByVal data As String) As String

    Dim dataToHash As Byte() = (New UnicodeEncoding()).GetBytes(data)

    Dim md5 As MD5 = New MD5CryptoServiceProvider()

    Dim hashedData As Byte() = md5.ComputeHash(dataToHash)

    RNGCryptoServiceProvider.Create().GetBytes(dataToHash)

    Dim s As String = Convert.ToBase64String(hashedData, Base64FormattingOptions.None)

    Return s

    End Function

    End Class

    I get the following results:

    RowRow 1 onwards

    Column AA

    Column BB

    Column CC

    HashValuecWd0tTqSuYUlu6F5oVnMcQ==

    ConcatABC

    RowRow 11167 onwards

    Column AA

    Column BB

    Column CC

    HashValueEP2rhBgzqDEc0qmEjWlQBA==

    ConcatABCcWd0tTqSuYUlu6F5oVnMcQ==ABC

    RowRow 24194 onwards

    Column AA

    Column BB

    Column CC

    HashValuedLHcMhE2RJZ6ew7Jd8oezQ==

    ConcatABCEP2rhBgzqDEc0qmEjWlQBA==ABCcWd0tTqSuYUlu6F5oVnMcQ==ABC

    RowRow 37221 onwards

    Column AA

    Column BB

    Column CC

    HashValue93k+mMDI1x5OoZ0cjtz7Hw==

    ConcatABCdLHcMhE2RJZ6ew7Jd8oezQ==ABCEP2rhBgzqDEc0qmEjWlQBA==ABCcWd0tTqSuYUlu6F5oVnMcQ==ABC

    RowRow 50248 onwards

    Column AA

    Column BB

    Column CC

    HashValueRHGVCjMX9usJb160IvP6RQ==

    ConcatABC93k+mMDI1x5OoZ0cjtz7Hw==ABCdLHcMhE2RJZ6ew7Jd8oezQ==ABCEP2rhBgzqDEc0qmEjWlQBA==ABCcWd0tTqSuYUlu6F5oVnMcQ==ABC

    RowRow 58881 onwards

    Column AA

    Column BB

    Column CC

    HashValuecWd0tTqSuYUlu6F5oVnMcQ==

    ConcatABC

    Looking at the concat results the previous buffers HashValue is getting include in the InputSource for some reason. When I add the line Row.ColumnCount = InputBuffer.ColumnCount the value is 5 for each row even though I've only selected 3 columns as part of InputSource.

    Hello, I tried to replicate your example. I created dbo.testMD5_Source and testMD5_Target tables, loaded 60000 records in source with same values A,B,C using below code:

    CREATE TABLE dbo.testMD5_Source(ID INT IDENTITY(1,1), A VARCHAR(1),B VARCHAR(1),C VARCHAR(1))

    CREATE TABLE dbo.testMD5_Target(ID INT, A VARCHAR(1),B VARCHAR(1),C VARCHAR(1),CheckSum NVARCHAR(35));

    DECLARE @Counter INT=1

    WHILE @Counter<=6000

    BEGIN

    INSERT INTO dbo.testMD5_Source(A,B,C) VALUES('A','B','C');

    INSERT INTO dbo.testMD5_Source(A,B,C) VALUES('A','B','C');

    INSERT INTO dbo.testMD5_Source(A,B,C) VALUES('A','B','C');

    INSERT INTO dbo.testMD5_Source(A,B,C) VALUES('A','B','C');

    INSERT INTO dbo.testMD5_Source(A,B,C) VALUES('A','B','C');

    INSERT INTO dbo.testMD5_Source(A,B,C) VALUES('A','B','C');

    INSERT INTO dbo.testMD5_Source(A,B,C) VALUES('A','B','C');

    INSERT INTO dbo.testMD5_Source(A,B,C) VALUES('A','B','C');

    INSERT INTO dbo.testMD5_Source(A,B,C) VALUES('A','B','C');

    INSERT INTO dbo.testMD5_Source(A,B,C) VALUES('A','B','C');

    SELECT @Counter=@Counter+1;

    END

    SELECT COUNT(1) FROM dbo.testMD5_Source --60,000

    --after 1st execution of package

    SELECT COUNT(1) FROM dbo.testMD5_Target --60,000

    SELECT DISTINCT A,B,C,CHECKSUM FROM dbo.testMD5_Target --A B C pLWCYR7/9KIyBsc7004Ftg==

    --after 2nd execution of package

    SELECT COUNT(1) FROM dbo.testMD5_Target --1,20,000

    SELECT DISTINCT A,B,C,CHECKSUM FROM dbo.testMD5_Target --A B C pLWCYR7/9KIyBsc7004Ftg==

    --after 3rd execution of package

    SELECT COUNT(1) FROM dbo.testMD5_Target --1,80,000

    SELECT DISTINCT A,B,C,CHECKSUM FROM dbo.testMD5_Target --A B C pLWCYR7/9KIyBsc7004Ftg==

    I didn't observe any issues. Same checksum was calculated for all records. I tested this by executing the package thrice.

    I have made below changes to the code to compute md5 checksum:

    1) Appended delimiter after each column. This is avoid computing same hash for (A NULL NULL), (NULL A NULL) and (NULL NULL A).

    2) Converted string to upper case before computing checksum. This is to avoid computing different hash for (A B C) and (a b c)

    Below is the complete code to compute md5 checksum:

    #Region "Help: Introduction to the Script Component"

    ' The Script Component allows you to perform virtually any operation that can be accomplished in

    ' a .Net application within the context of an Integration Services data flow.

    ' Expand the other regions which have "Help" prefixes for examples of specific ways to use

    ' Integration Services features within this script component.

    #End Region

    #Region "Imports"

    Imports System

    Imports System.Data

    Imports System.Math

    Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper

    Imports Microsoft.SqlServer.Dts.Runtime.Wrapper

    Imports Microsoft.SqlServer.Dts.Pipeline

    Imports System.Text

    Imports System.Security.Cryptography

    #End Region

    ' This is the class to which to add your code. Do not change the name, attributes, or parent

    ' of this class.

    <Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute> _

    <CLSCompliant(False)> _

    Public Class ScriptMain

    Inherits UserComponent

    #Region "Help: Using Integration Services variables and parameters"

    'To use a variable in this script, first ensure that the variable has been added to

    'either the list contained in the ReadOnlyVariables property or the list contained in

    'the ReadWriteVariables property of this script component, according to whether or not your

    'code needs to write into the variable. To do so, save this script, close this instance of

    'Visual Studio, and update the ReadOnlyVariables and ReadWriteVariables properties.

    'To use a parameter in this script, follow the same steps. Parameters are always read-only.

    'Example of reading from a variable or parameter:

    ' startTime = Variables.MyStartTime;

    'Example of writing to a variable:

    ' Variables.myStringVariable = "new value";

    #End Region

    #Region "Help: Using Integration Services Connnection Managers"

    'Some types of connection managers can be used in this script component. See the help topic

    '"Working with Connection Managers Programatically" for details.

    'To use a connection manager in this script, first ensure that the connection manager has

    'been added to either the list of connection managers on the Connection Managers page of the

    'script component editor. To add the connection manager, save this script, close this instance of

    ' Visual Studio, and add the Connection Manager to the list.

    'If the component needs to hold a connection open while processing rows, override the

    'AcquireConnections and ReleaseConnections methods.

    'Example of using an ADO.Net connection manager to acquire a SqlConnection:

    ' Dim rawConnection As Object = Connections.SalesDB.AcquireConnection(Transaction)

    ' Dim salesDBConn As SqlConnection = CType(rawConnection, SqlConnection)

    'Example of using a File connection manager to acquire a file path:

    ' Dim rawConnection As Object = Connections.Prices_zip.AcquireConnection(Transaction)

    ' Dim filePath As String = CType(rawConnection, String)

    'Example of releasing a connection manager:

    ' Connections.SalesDB.ReleaseConnection(rawConnection)

    #End Region

    #Region "Help: Firing Integration Services Events"

    'This script component can fire events.

    'Example of firing an error event:

    ' ComponentMetaData.FireError(10, "Process Values", "Bad value", "", 0, cancel)

    'Example of firing an information event:

    ' ComponentMetaData.FireInformation(10, "Process Values", "Processing has started", "", 0, fireAgain)

    'Example of firing a warning event:

    ' ComponentMetaData.FireWarning(10, "Process Values", "No rows were received", "", 0)

    #End Region

    Private inputBuffer As PipelineBuffer

    Public Overrides Sub ProcessInput(ByVal InputID As Integer, ByVal Buffer As Microsoft.SqlServer.Dts.Pipeline.PipelineBuffer)

    inputBuffer = Buffer

    MyBase.ProcessInput(InputID, Buffer)

    End Sub

    Public Shared Function CreateHash(ByVal data As String) As String

    Dim dataToHash As Byte() = (New UnicodeEncoding()).GetBytes(data)

    Dim md5 As MD5 = New MD5CryptoServiceProvider()

    Dim hashedData As Byte() = md5.ComputeHash(dataToHash)

    RNGCryptoServiceProvider.Create().GetBytes(dataToHash)

    Dim s As String = Convert.ToBase64String(hashedData, Base64FormattingOptions.None)

    Return s

    End Function

    'This method is called once, before rows begin to be processed in the data flow.

    '

    'You can remove this method if you don't need to do anything here.

    Public Overrides Sub PreExecute()

    MyBase.PreExecute()

    '

    ' Add your code here

    '

    End Sub

    ' This method is called after all the rows have passed through this component.

    '

    ' You can delete this method if you don't need to do anything here.

    Public Overrides Sub PostExecute()

    MyBase.PostExecute()

    '

    ' Add your code here

    '

    End Sub

    'This method is called once for every row that passes through the component from InputSource.

    '

    'Example of reading a value from a column in the the row:

    ' zipCode = Row.ZipCode

    '

    'Example of writing a value to a column in the row:

    ' Row.ZipCode = zipCode

    Public Overrides Sub InputSource_ProcessInputRow(ByVal Row As InputSourceBuffer)

    Dim counter As Integer = 0

    Dim values As New StringBuilder

    For counter = 0 To inputBuffer.ColumnCount - 1

    Dim value As Object

    value = inputBuffer.Item(counter)

    values.Append(value)

    values.Append("Þ")

    Next

    Row.HashValue = CreateHash(values.ToString().ToUpper())

    End Sub

    End Class

    Thanks,

    Swapnil

  • Hi,

    I am new to SSIS. Trying to use the given logic in the Script compment.

    But i am getting the following error.

    'HashValue' is not a member of 'vbproj.InputSourceBuffer'

    Could you please let me know the fix for it.

    Many thanks

  • hi,

    Can u Please tell me how to find the deleted records from source.

  • Thanks for this tutorial, it was very helpful for my ETL project. I put instances of the task in at various places, it did work great. Now finally, I get a "System.IndexOutOfRangeException" claiming my index were out of bounds for the array, processing ProcessInput on the pipeline buffer - for the second instance of the task I have in a data flow container (the first one works fine). Is there anything I could look at to find the problem? I never used VB outside of this project. Another data flow container works fine with a similar arrangement of the two stacked tasks.
    Should I post my script (minor tweaks to the original converting it to SHA256 and using column seperators)?
    My error message is in German, I can post it as well but the gist should be covered above, I guess.
    Hope anyone is still monitoring this thread...

    Cheers,
    Oliver.

  • sabyasm - Thursday, September 2, 2010 10:22 AM

    [font="Comic Sans MS"]This is an excellent article.However - I am facing a peculier problem that sometime the generated MD5 hashval is NOT same for the same data. This is happening intermittantly - say for 5 days it is working fine and not working next day. Also - I have noticed this to happen only if the data volume is high (5M). For around first 50K records it generated the hashValue same as previous day (say day 0) - but for rest of records it has generated different hasValues. It shows rest of the 5M records to be delta. Next day - it rectifies itself and create hashValue same as what we had had for day 0.It looks like:(columnA) (columnB) (columnC) columnHashA,B,C, hashValue1 ----> day 0A,B,C, hashValue2 ----> day 1 (record inserted)(I forced all insert even if the hashvalue same)A,B,C, hashValue1 ----> day 2( I have 16+ columns - combination of varchar, int, datetime. Some of the columns contain Null value and the hashValue column is of nvarchar(35)I was unable to replicate this from one environment to another. I loaded those 3 days data into TEST and got desired result (generated ~800 delta records).This got me thinking of any environment related issue? Is it related to some sort of buffer overflow (note around ~50K records generated correct hashvalue)? I have loaded more than 500 files ranging from 1000 records to 5M reocrds and the problem has happened only 3 times with the files with 5M records. However - I have loaded the 5M file around 20 times without any issue.Any clue - or suggestion for doing the debug will be highly appreciated.[/font]

    So use a different algorithm than MDF. 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • o.schoen - Friday, July 27, 2018 6:25 AM

    Thanks for this tutorial, it was very helpful for my ETL project. I put instances of the task in at various places, it did work great. Now finally, I get a "System.IndexOutOfRangeException" claiming my index were out of bounds for the array, processing ProcessInput on the pipeline buffer - for the second instance of the task I have in a data flow container (the first one works fine). Is there anything I could look at to find the problem? I never used VB outside of this project. Another data flow container works fine with a similar arrangement of the two stacked tasks.
    Should I post my script (minor tweaks to the original converting it to SHA256 and using column seperators)?
    My error message is in German, I can post it as well but the gist should be covered above, I guess.
    Hope anyone is still monitoring this thread...

    Cheers,
    Oliver.

    After a weekend's break, some researching and half a day playing around: it seems the problem is that I copied the script task in the first place. Removing all copied duplicates and generating the same things over from scratch solved the pipline "index out of range" errors.

    Another note on the same script: I found that the columns are ordered in the system not by the position in the Inputs list but by the position in the original table. So if you use the same task to generate the same hash on columns with the same data, but assembled by SQL requests or whatever that alter the column order, you will have different hashes. Since altering the lookups somewhere upstairs in the data path is not the most transparent way to solve this (and who knows, rules for ordering might change at some point) I went back to explicitly calling the input columns in the script - in the end this eliminated the need for the pipeline buffer redirection. I did not check whether that affects throughput (or above problem), but it simplifies the script. By now it looks like this:

    #Region "References"

    ' This script has been developed inline with the SQL Server Central tutorial

    'http://www.sqlservercentral.com/articles/Integration+Services+(SSIS)/69766/

    ' by Brett Flippin

    ' Also mind some notes from thisarticle by Joost van Rossum:

    'http://microsoft-ssis.blogspot.com/2017/10/calculating-hash-values-in-ssis.html

    ' Changes and errors are all mine:O.Schoen, 2018-07-30

    #End Region

     

    #Region "Imports"

    Imports System.Text

    Imports System.Security.Cryptography

    #End Region

     

    ' This is the class to which to addyour code.  Do not change the name,attributes, or parent

    ' of this class.

    <Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute>

    <CLSCompliant(False)>

    Public Class ScriptMain

       Inherits UserComponent

     

       'Calculating hash string from input string

       Public Shared Function CreateHash(ByVal Data As String) As String

           Dim DataToHash As Byte() = (NewUnicodeEncoding()).GetBytes(Data)

           Dim Coder As SHA256 = NewSHA256CryptoServiceProvider()

           Dim HashedData As Byte() =Coder.ComputeHash(DataToHash)

           RandomNumberGenerator.Create().GetBytes(DataToHash)

           Dim S As String = Convert.ToBase64String(HashedData,Base64FormattingOptions.None)

           Return S

       End Function

     

       'Thismethod is called once for every row that passes through the component fromInColumns.

       Public Overrides SubInColumns_ProcessInputRow(ByVal Row As InColumnsBuffer)

           Dim Value As Object

           Dim Values As New StringBuilder

           Value = Row.IDCol1

           Values.Append(Value)

           Values.Append("|") ' as separator against field overlap

           Value = Row.IDCol2

           Values.Append(Value)

           Values.Append("|") ' as separator against field overlap

           Value = Row.IDCol3

           Values.Append(Value)

           Row.HashValue = CreateHash(Values.ToString())

           'Row.HashValue = Values.ToString() 'for testing

       End Sub

     

    End Class


    There may be more elegant ways to go in a determined way through a number of columns, at least this is robust.
    I set the "IDCol<n>" headers as aliases in the cript task's input columns assignment, alternatively rename these to the straight input columns names.

Viewing 7 posts - 61 through 66 (of 66 total)

You must be logged in to reply to this topic. Login to reply