Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 
        
Home       Members    Calendar    Who's On


Add to briefcase «««34567

Speeding Up Incremental ETL Proceses in SSIS by Using MD5 Hashes Expand / Collapse
Author
Message
Posted Sunday, June 9, 2013 8:23 AM
Forum Newbie

Forum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum Newbie

Group: General Forum Members
Last Login: Tuesday, August 12, 2014 6:48 AM
Points: 1, Visits: 8
Good and very clear article. But described solution - to "speed up" ETL - is like shooting mosquito from cannon ball. Much more faster is using HASHBYTES sql server native function. And for Goodness sake, dont recommend using OLEDB Command to speed up anything. Use data sets and staging tables.
Post #1461338
Posted Wednesday, October 16, 2013 12:47 AM
Grasshopper

GrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopper

Group: General Forum Members
Last Login: Monday, August 4, 2014 5:29 AM
Points: 23, Visits: 46
djphatic (6/25/2012)
For testing purposes I created a spreadsheet with 3 columns, Column A, B and C each with the values A B C respectively in ~60,000 rows and used this as my data source.

Using the script below:

' Microsoft SQL Server Integration Services Script Component
' Write scripts using Microsoft Visual Basic 2008.
' ScriptMain is the entry point class of the script.

Imports System
Imports System.Data
Imports System.Math
Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper
Imports Microsoft.SqlServer.Dts.Runtime.Wrapper
Imports Microsoft.SqlServer.Dts.Pipeline
Imports System.Text
Imports System.Security.Cryptography

<Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute()> _
<CLSCompliant(False)> _
Public Class ScriptMain
Inherits UserComponent
Private InputBuffer As PipelineBuffer

Public Overrides Sub ProcessInput(ByVal InputID As Integer, ByVal Buffer As PipelineBuffer)
InputBuffer = Buffer
MyBase.ProcessInput(InputID, Buffer)
End Sub

Public Overrides Sub PreExecute()
MyBase.PreExecute()
'
' Add your code here for preprocessing or remove if not needed
''
End Sub

Public Overrides Sub PostExecute()
MyBase.PostExecute()
'
' Add your code here for postprocessing or remove if not needed
' You can set read/write variables here, for example:
' Me.Variables.MyIntVar = 100
''
End Sub

Public Overrides Sub InputSource_ProcessInputRow(ByVal Row As InputSourceBuffer)
Dim counter As Integer = 0
Dim values As New StringBuilder

For counter = 0 To inputBuffer.ColumnCount - 1
Dim value As Object
value = inputBuffer.Item(counter)
values.Append(value)
Next
Row.Concat = values.ToString
Row.HashValue = CreateHash(values.ToString())

End Sub

Public Shared Function CreateHash(ByVal data As String) As String
Dim dataToHash As Byte() = (New UnicodeEncoding()).GetBytes(data)
Dim md5 As MD5 = New MD5CryptoServiceProvider()
Dim hashedData As Byte() = md5.ComputeHash(dataToHash)
RNGCryptoServiceProvider.Create().GetBytes(dataToHash)
Dim s As String = Convert.ToBase64String(hashedData, Base64FormattingOptions.None)
Return s
End Function

End Class

I get the following results:

Row Row 1 onwards
Column A A
Column B B
Column C C
HashValue cWd0tTqSuYUlu6F5oVnMcQ==
Concat ABC

Row Row 11167 onwards
Column A A
Column B B
Column C C
HashValue EP2rhBgzqDEc0qmEjWlQBA==
Concat ABCcWd0tTqSuYUlu6F5oVnMcQ==ABC

Row Row 24194 onwards
Column A A
Column B B
Column C C
HashValue dLHcMhE2RJZ6ew7Jd8oezQ==
Concat ABCEP2rhBgzqDEc0qmEjWlQBA==ABCcWd0tTqSuYUlu6F5oVnMcQ==ABC

Row Row 37221 onwards
Column A A
Column B B
Column C C
HashValue 93k+mMDI1x5OoZ0cjtz7Hw==
Concat ABCdLHcMhE2RJZ6ew7Jd8oezQ==ABCEP2rhBgzqDEc0qmEjWlQBA==ABCcWd0tTqSuYUlu6F5oVnMcQ==ABC

Row Row 50248 onwards
Column A A
Column B B
Column C C
HashValue RHGVCjMX9usJb160IvP6RQ==
Concat ABC93k+mMDI1x5OoZ0cjtz7Hw==ABCdLHcMhE2RJZ6ew7Jd8oezQ==ABCEP2rhBgzqDEc0qmEjWlQBA==ABCcWd0tTqSuYUlu6F5oVnMcQ==ABC

Row Row 58881 onwards
Column A A
Column B B
Column C C
HashValue cWd0tTqSuYUlu6F5oVnMcQ==
Concat ABC

Looking at the concat results the previous buffers HashValue is getting include in the InputSource for some reason. When I add the line Row.ColumnCount = InputBuffer.ColumnCount the value is 5 for each row even though I've only selected 3 columns as part of InputSource.


Hello, I tried to replicate your example. I created dbo.testMD5_Source and testMD5_Target tables, loaded 60000 records in source with same values A,B,C using below code:
CREATE TABLE dbo.testMD5_Source(ID INT IDENTITY(1,1), A VARCHAR(1),B VARCHAR(1),C VARCHAR(1))
CREATE TABLE dbo.testMD5_Target(ID INT, A VARCHAR(1),B VARCHAR(1),C VARCHAR(1),CheckSum NVARCHAR(35));
DECLARE @Counter INT=1
WHILE @Counter<=6000
BEGIN
INSERT INTO dbo.testMD5_Source(A,B,C) VALUES('A','B','C');
INSERT INTO dbo.testMD5_Source(A,B,C) VALUES('A','B','C');
INSERT INTO dbo.testMD5_Source(A,B,C) VALUES('A','B','C');
INSERT INTO dbo.testMD5_Source(A,B,C) VALUES('A','B','C');
INSERT INTO dbo.testMD5_Source(A,B,C) VALUES('A','B','C');
INSERT INTO dbo.testMD5_Source(A,B,C) VALUES('A','B','C');
INSERT INTO dbo.testMD5_Source(A,B,C) VALUES('A','B','C');
INSERT INTO dbo.testMD5_Source(A,B,C) VALUES('A','B','C');
INSERT INTO dbo.testMD5_Source(A,B,C) VALUES('A','B','C');
INSERT INTO dbo.testMD5_Source(A,B,C) VALUES('A','B','C');
SELECT @Counter=@Counter+1;
END
SELECT COUNT(1) FROM dbo.testMD5_Source --60,000

--after 1st execution of package
SELECT COUNT(1) FROM dbo.testMD5_Target --60,000
SELECT DISTINCT A,B,C,CHECKSUM FROM dbo.testMD5_Target --A B C pLWCYR7/9KIyBsc7004Ftg==

--after 2nd execution of package
SELECT COUNT(1) FROM dbo.testMD5_Target --1,20,000
SELECT DISTINCT A,B,C,CHECKSUM FROM dbo.testMD5_Target --A B C pLWCYR7/9KIyBsc7004Ftg==

--after 3rd execution of package
SELECT COUNT(1) FROM dbo.testMD5_Target --1,80,000
SELECT DISTINCT A,B,C,CHECKSUM FROM dbo.testMD5_Target --A B C pLWCYR7/9KIyBsc7004Ftg==

I didn't observe any issues. Same checksum was calculated for all records. I tested this by executing the package thrice.
I have made below changes to the code to compute md5 checksum:
1) Appended delimiter after each column. This is avoid computing same hash for (A NULL NULL), (NULL A NULL) and (NULL NULL A).
2) Converted string to upper case before computing checksum. This is to avoid computing different hash for (A B C) and (a b c)

Below is the complete code to compute md5 checksum:
#Region "Help:  Introduction to the Script Component"
' The Script Component allows you to perform virtually any operation that can be accomplished in
' a .Net application within the context of an Integration Services data flow.

' Expand the other regions which have "Help" prefixes for examples of specific ways to use
' Integration Services features within this script component.
#End Region

#Region "Imports"
Imports System
Imports System.Data
Imports System.Math
Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper
Imports Microsoft.SqlServer.Dts.Runtime.Wrapper
Imports Microsoft.SqlServer.Dts.Pipeline
Imports System.Text
Imports System.Security.Cryptography
#End Region

' This is the class to which to add your code. Do not change the name, attributes, or parent
' of this class.
<Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute> _
<CLSCompliant(False)> _
Public Class ScriptMain
Inherits UserComponent

#Region "Help: Using Integration Services variables and parameters"
'To use a variable in this script, first ensure that the variable has been added to
'either the list contained in the ReadOnlyVariables property or the list contained in
'the ReadWriteVariables property of this script component, according to whether or not your
'code needs to write into the variable. To do so, save this script, close this instance of
'Visual Studio, and update the ReadOnlyVariables and ReadWriteVariables properties.
'To use a parameter in this script, follow the same steps. Parameters are always read-only.

'Example of reading from a variable or parameter:
' startTime = Variables.MyStartTime;

'Example of writing to a variable:
' Variables.myStringVariable = "new value";
#End Region

#Region "Help: Using Integration Services Connnection Managers"
'Some types of connection managers can be used in this script component. See the help topic
'"Working with Connection Managers Programatically" for details.

'To use a connection manager in this script, first ensure that the connection manager has
'been added to either the list of connection managers on the Connection Managers page of the
'script component editor. To add the connection manager, save this script, close this instance of
' Visual Studio, and add the Connection Manager to the list.

'If the component needs to hold a connection open while processing rows, override the
'AcquireConnections and ReleaseConnections methods.

'Example of using an ADO.Net connection manager to acquire a SqlConnection:
' Dim rawConnection As Object = Connections.SalesDB.AcquireConnection(Transaction)
' Dim salesDBConn As SqlConnection = CType(rawConnection, SqlConnection)

'Example of using a File connection manager to acquire a file path:
' Dim rawConnection As Object = Connections.Prices_zip.AcquireConnection(Transaction)
' Dim filePath As String = CType(rawConnection, String)

'Example of releasing a connection manager:
' Connections.SalesDB.ReleaseConnection(rawConnection)
#End Region

#Region "Help: Firing Integration Services Events"
'This script component can fire events.

'Example of firing an error event:
' ComponentMetaData.FireError(10, "Process Values", "Bad value", "", 0, cancel)

'Example of firing an information event:
' ComponentMetaData.FireInformation(10, "Process Values", "Processing has started", "", 0, fireAgain)

'Example of firing a warning event:
' ComponentMetaData.FireWarning(10, "Process Values", "No rows were received", "", 0)

#End Region

Private inputBuffer As PipelineBuffer

Public Overrides Sub ProcessInput(ByVal InputID As Integer, ByVal Buffer As Microsoft.SqlServer.Dts.Pipeline.PipelineBuffer)
inputBuffer = Buffer
MyBase.ProcessInput(InputID, Buffer)
End Sub

Public Shared Function CreateHash(ByVal data As String) As String
Dim dataToHash As Byte() = (New UnicodeEncoding()).GetBytes(data)
Dim md5 As MD5 = New MD5CryptoServiceProvider()
Dim hashedData As Byte() = md5.ComputeHash(dataToHash)
RNGCryptoServiceProvider.Create().GetBytes(dataToHash)
Dim s As String = Convert.ToBase64String(hashedData, Base64FormattingOptions.None)
Return s
End Function
'This method is called once, before rows begin to be processed in the data flow.
'
'You can remove this method if you don't need to do anything here.
Public Overrides Sub PreExecute()
MyBase.PreExecute()
'
' Add your code here
'
End Sub

' This method is called after all the rows have passed through this component.
'
' You can delete this method if you don't need to do anything here.
Public Overrides Sub PostExecute()
MyBase.PostExecute()
'
' Add your code here
'
End Sub

'This method is called once for every row that passes through the component from InputSource.
'
'Example of reading a value from a column in the the row:
' zipCode = Row.ZipCode
'
'Example of writing a value to a column in the row:
' Row.ZipCode = zipCode
Public Overrides Sub InputSource_ProcessInputRow(ByVal Row As InputSourceBuffer)
Dim counter As Integer = 0
Dim values As New StringBuilder

For counter = 0 To inputBuffer.ColumnCount - 1
Dim value As Object
value = inputBuffer.Item(counter)
values.Append(value)
values.Append("Þ")
Next
Row.HashValue = CreateHash(values.ToString().ToUpper())
End Sub

End Class

Thanks,
Swapnil
Post #1505077
Posted Thursday, March 27, 2014 7:00 AM
Forum Newbie

Forum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum Newbie

Group: General Forum Members
Last Login: Wednesday, June 11, 2014 1:21 AM
Points: 1, Visits: 19
Hi,
I am new to SSIS. Trying to use the given logic in the Script compment.
But i am getting the following error.

'HashValue' is not a member of 'vbproj.InputSourceBuffer'

Could you please let me know the fix for it.

Many thanks
Post #1555403
Posted Wednesday, July 30, 2014 3:33 AM
Forum Newbie

Forum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum Newbie

Group: General Forum Members
Last Login: Saturday, October 18, 2014 1:33 AM
Points: 2, Visits: 55
hi,

Can u Please tell me how to find the deleted records from source.
Post #1597631
« Prev Topic | Next Topic »

Add to briefcase «««34567

Permissions Expand / Collapse