SQLServerCentral is supported by Red Gate Software Ltd.
 
Speeding Up Incremental ETL Processes in SSIS by Using MD5 Hashes
Posted Thursday, September 2, 2010 2:12 PM


Old Hand


Group: General Forum Members
Last Login: Tuesday, November 4, 2014 4:16 AM
Points: 346, Visits: 519


Thanks!

I am going to do that too!

In the meantime, I am removing four columns from the pipeline that will always be fixed strings, to see whether that helps at all.

Let me thank you once more for this article. I saw it through the RSS feed and bookmarked it a while ago. It came in really handy when we needed to compare 5M rows against 5M rows, where any of 15-16 column values could change. It made the end-to-end process run in less than 30 minutes, where it used to take more than 16 hours!
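For anyone skimming the thread, the compare-by-hash approach being discussed can be sketched like this (Python rather than the article's VB.NET; the keys, names, and delimiter choice are illustrative, not from the article):

```python
import hashlib

def row_hash(values):
    """MD5 over a row's column values, joined with a delimiter so
    adjacent columns cannot run together ("AB","C" vs "A","BC")."""
    joined = "|".join("" if v is None else str(v) for v in values)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

# Stored hashes in the target, keyed by business key (sample data).
target = {1: row_hash(["Alice", "NY"]), 2: row_hash(["Bob", "LA"])}

# Incoming source rows.
source = {1: ["Alice", "NY"], 2: ["Bob", "SF"], 3: ["Carol", "TX"]}

inserts, updates = [], []
for key, cols in source.items():
    h = row_hash(cols)
    if key not in target:
        inserts.append(key)      # unknown key: new row
    elif h != target[key]:
        updates.append(key)      # known key, different hash: changed row
    # equal hash: skip the row without comparing 15-16 columns
```

Comparing one hash per row instead of every column is where the speedup comes from; only the changed and new rows ever touch the target.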




--
Sabya
Post #979885
Posted Friday, September 3, 2010 5:23 PM
SSC Journeyman


Group: General Forum Members
Last Login: Thursday, December 18, 2014 1:06 PM
Points: 86, Visits: 219
Nice article.

Consider adding the checksum column as an INCLUDE column on the table's PK index. Your mileage may vary, because:
1) The Lookup transformation component then only needs to read the PK index, not the actual data rows. [A plus for this process.]
2) The PK index is larger because of the checksum values. [A minus: it hurts storage and any access that uses the PK.]
3) The PK index entry must be updated with a new checksum value whenever the data is updated. Since the checksum value is always the same size, this will not fragment the index pages unless you use the "update trigger sets checksum column to NULL" idea. [A minus for any process that updates the data.]


David Lathrop
DBA
WA Dept of Health
Post #980555
Posted Thursday, September 23, 2010 2:48 PM
Forum Newbie


Group: General Forum Members
Last Login: Tuesday, September 28, 2010 1:26 PM
Points: 3, Visits: 22
I wonder if you can do an incremental load from MySQL to SQL Server 2008. I used an ADO.NET source connecting to MySQL and could not connect it to a Script Component. I also read another article, Anatomy of an Incremental Load, but I cannot use the Lookup transformation editor: it produced a "cannot link data because of data type mismatch" error when I tried to match the MySQL key to a SQL Server 2008 table. How can I convert an unsigned 4-byte integer from MySQL to a bigint key in MS SQL?

thank you!
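On the type-mismatch question: I believe the usual SSIS fix is a Data Conversion transformation that casts the MySQL key from DT_UI4 to DT_I8 before the Lookup, since the Lookup requires exactly matching types on the join columns. The widening itself is always lossless, because every unsigned 32-bit value fits in a signed 64-bit bigint; a quick Python check of that claim:

```python
import struct

def uint32_to_bigint(raw: bytes) -> int:
    """Read 4 little-endian bytes as an unsigned 32-bit integer;
    the result always fits in a signed 64-bit (bigint) column."""
    (value,) = struct.unpack("<I", raw)
    return value

# The largest possible unsigned 32-bit key...
key = uint32_to_bigint(b"\xff\xff\xff\xff")
assert key == 4294967295
# ...is still far below the signed 64-bit maximum, so no overflow.
assert key <= 2**63 - 1
```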



Post #992367
Posted Friday, September 24, 2010 1:59 PM
Forum Newbie


Group: General Forum Members
Last Login: Monday, February 11, 2013 8:43 AM
Points: 5, Visits: 138
Nice article; I wish you could write SSIS packages for our team.
Post #993075
Posted Friday, September 24, 2010 9:30 PM
Forum Newbie


Group: General Forum Members
Last Login: Thursday, December 18, 2014 12:38 PM
Points: 8, Visits: 735
I actually resolved the problem. I'm not sure about performance yet; I'll add my comments later when I get results. Thank you all.
Post #993182
Posted Monday, March 19, 2012 3:56 PM
Grasshopper


Group: General Forum Members
Last Login: Wednesday, November 19, 2014 3:58 PM
Points: 10, Visits: 217
Awesome article. However, I have a different scenario: what if the source and target tables are really large, close to 1 billion records? I am not sure it's realistic to go through each record with a Script Component to calculate the hash value. Also, in the Lookup component, a full cache of even two columns of the target table would pretty much eat all the memory, since it contains over 1 billion records.

Any comments on large tables?

Thanks,

Hui
Post #1269256
Posted Monday, March 19, 2012 4:02 PM
SSC Rookie


Group: General Forum Members
Last Login: Friday, May 10, 2013 4:31 PM
Points: 49, Visits: 135
huishi.ca (3/19/2012)
Awesome article. However, I have a different scenario: what if the source and target tables are really large, close to 1 billion records? I am not sure it's realistic to go through each record with a Script Component to calculate the hash value. Also, in the Lookup component, a full cache of even two columns of the target table would pretty much eat all the memory, since it contains over 1 billion records.

Any comments on large tables?

Thanks,

Hui


For extremely large tables, it might make more sense to implement change capture in the source system to limit the number of records you have to work with in the package. If that's not an option, this method will only really work on an SSIS box with a ton of memory. You can also modify the caching mode on the Lookup to take pressure off the memory.

Also, I'd highly recommend using SSIS Multiple Hash instead of this method; it's much easier to incorporate into your ETL infrastructure than a custom script.
Post #1269258
Posted Monday, June 25, 2012 8:39 AM
Forum Newbie


Group: General Forum Members
Last Login: Wednesday, December 10, 2014 9:39 AM
Points: 6, Visits: 414
sabyasm (9/2/2010)

Brett,

Thanks for the quick reply.

The problem is, I don't have control of the environment I am working in and am not allowed to use any custom components, so I am stuck with a VB.NET Script Component.

Do you have any idea why this might have happened? Anything related to the memory buffer?

Can you provide any alternate code snippet that might help?


I am having the same issue with this script: in each buffer, rows with the same values get a different MD5 value from the rows in the previous buffer.

I've requested that the Multiple Hash component be installed, but in the meantime it would be great to get this script working correctly.
Post #1320667
Posted Monday, June 25, 2012 11:27 AM
Forum Newbie


Group: General Forum Members
Last Login: Wednesday, December 10, 2014 9:39 AM
Points: 6, Visits: 414
For testing purposes I created a spreadsheet with three columns (Column A, B and C, containing the values A, B and C respectively) in ~60,000 rows and used this as my data source.

Using the script below:

' Microsoft SQL Server Integration Services Script Component
' Write scripts using Microsoft Visual Basic 2008.
' ScriptMain is the entry point class of the script.

Imports System
Imports System.Data
Imports System.Math
Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper
Imports Microsoft.SqlServer.Dts.Runtime.Wrapper
Imports Microsoft.SqlServer.Dts.Pipeline
Imports System.Text
Imports System.Security.Cryptography

<Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute()> _
<CLSCompliant(False)> _
Public Class ScriptMain
    Inherits UserComponent

    Private InputBuffer As PipelineBuffer

    Public Overrides Sub ProcessInput(ByVal InputID As Integer, ByVal Buffer As PipelineBuffer)
        InputBuffer = Buffer
        MyBase.ProcessInput(InputID, Buffer)
    End Sub

    Public Overrides Sub PreExecute()
        MyBase.PreExecute()
        ' No preprocessing needed.
    End Sub

    Public Overrides Sub PostExecute()
        MyBase.PostExecute()
        ' No postprocessing needed.
    End Sub

    Public Overrides Sub InputSource_ProcessInputRow(ByVal Row As InputSourceBuffer)
        Dim values As New StringBuilder

        ' Note: ColumnCount covers every column in the pipeline buffer,
        ' including this component's own output columns (HashValue and
        ' Concat), not just the columns selected as inputs.
        For counter As Integer = 0 To InputBuffer.ColumnCount - 1
            values.Append(InputBuffer.Item(counter))
        Next

        Row.Concat = values.ToString()
        Row.HashValue = CreateHash(values.ToString())
    End Sub

    Public Shared Function CreateHash(ByVal data As String) As String
        Dim dataToHash As Byte() = (New UnicodeEncoding()).GetBytes(data)
        Dim md5 As MD5 = New MD5CryptoServiceProvider()
        Dim hashedData As Byte() = md5.ComputeHash(dataToHash)
        ' This call overwrites dataToHash with random bytes *after* the
        ' hash has already been computed; it has no effect on the result.
        RNGCryptoServiceProvider.Create().GetBytes(dataToHash)
        Return Convert.ToBase64String(hashedData, Base64FormattingOptions.None)
    End Function

End Class

I get the following results:

Rows 1 onwards:
    Column A:  A
    Column B:  B
    Column C:  C
    HashValue: cWd0tTqSuYUlu6F5oVnMcQ==
    Concat:    ABC

Rows 11167 onwards:
    Column A:  A
    Column B:  B
    Column C:  C
    HashValue: EP2rhBgzqDEc0qmEjWlQBA==
    Concat:    ABCcWd0tTqSuYUlu6F5oVnMcQ==ABC

Rows 24194 onwards:
    Column A:  A
    Column B:  B
    Column C:  C
    HashValue: dLHcMhE2RJZ6ew7Jd8oezQ==
    Concat:    ABCEP2rhBgzqDEc0qmEjWlQBA==ABCcWd0tTqSuYUlu6F5oVnMcQ==ABC

Rows 37221 onwards:
    Column A:  A
    Column B:  B
    Column C:  C
    HashValue: 93k+mMDI1x5OoZ0cjtz7Hw==
    Concat:    ABCdLHcMhE2RJZ6ew7Jd8oezQ==ABCEP2rhBgzqDEc0qmEjWlQBA==ABCcWd0tTqSuYUlu6F5oVnMcQ==ABC

Rows 50248 onwards:
    Column A:  A
    Column B:  B
    Column C:  C
    HashValue: RHGVCjMX9usJb160IvP6RQ==
    Concat:    ABC93k+mMDI1x5OoZ0cjtz7Hw==ABCdLHcMhE2RJZ6ew7Jd8oezQ==ABCEP2rhBgzqDEc0qmEjWlQBA==ABCcWd0tTqSuYUlu6F5oVnMcQ==ABC

Rows 58881 onwards:
    Column A:  A
    Column B:  B
    Column C:  C
    HashValue: cWd0tTqSuYUlu6F5oVnMcQ==
    Concat:    ABC

Looking at the Concat results, the previous buffer's HashValue is getting included in the input for some reason. When I add the line Row.ColumnCount = InputBuffer.ColumnCount, the value is 5 for each row, even though I've only selected 3 columns as part of InputSource.
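Those Concat values make the failure mode visible: the loop walks every column in the pipeline buffer, and that buffer also carries the component's own output columns (hence a column count of 5 with only 3 inputs selected). Because SSIS reuses buffer memory, a later buffer arrives with the previous rows' HashValue and Concat cells still populated, and those stale cells get hashed along with the inputs. A simplified Python model of the bug and the fix (the dict stands in for a pipeline buffer; the layout is invented for illustration):

```python
import base64
import hashlib

def md5_b64(s: str) -> str:
    """MD5 digest rendered as Base64, like the script's output."""
    return base64.b64encode(hashlib.md5(s.encode("utf-8")).digest()).decode()

def process_row_buggy(buf):
    # BUG: concatenates *every* cell in the buffer, including the
    # component's own HashValue and Concat output columns.
    concat = "".join(str(v) for v in buf.values())
    buf["Concat"] = concat
    buf["HashValue"] = md5_b64(concat)
    return buf["HashValue"]

def process_row_fixed(buf, input_cols=("A", "B", "C")):
    # Fix: hash only the columns actually selected as inputs.
    return md5_b64("".join(str(buf[c]) for c in input_cols))

# First buffer: the output cells start empty, so the hash looks right.
buf = {"A": "A", "B": "B", "C": "C", "HashValue": "", "Concat": ""}
h1 = process_row_buggy(buf)

# Reused buffer: the same input row now picks up the stale outputs,
# reproducing the "ABC<previous hash><previous concat>" pattern above.
h2 = process_row_buggy(buf)
assert h1 != h2          # identical inputs, different hashes

# The fixed version is stable across buffers.
assert process_row_fixed(buf) == process_row_fixed(buf)
```

In the SSIS script itself, the equivalent fix is to build the string from the strongly typed Row properties (or a fixed list of input column indexes) instead of looping over InputBuffer.ColumnCount.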
Post #1320806
Posted Tuesday, October 30, 2012 9:11 AM
Grasshopper


Group: General Forum Members
Last Login: Monday, December 8, 2014 1:14 PM
Points: 10, Visits: 53
I have implemented the code suggested here and am running into some trouble: only every third or fourth run of the package produces the same hash for columns that have not changed. Shouldn't I get the same hash every time I run the package if the columns have not changed?

These tables are outside of any OLTP environment, so there is no change in the source data.

I appreciate any help.
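It may help to note that MD5 itself is fully deterministic, so unstable hashes across runs mean the string being hashed is changing, not the algorithm; the stale-output-column problem described earlier in this thread would produce exactly this run-to-run behaviour. A quick Python check of the determinism (the sample bytes are arbitrary):

```python
import hashlib

# The same input bytes always produce the same MD5 digest, run after run.
a = hashlib.md5(b"Alice|NY|2010-09-02").hexdigest()
b = hashlib.md5(b"Alice|NY|2010-09-02").hexdigest()
assert a == b
```

So the place to look is how the input string is assembled: which columns are concatenated, in what order, and whether anything stale leaks in.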
Post #1378840