﻿<?xml version='1.0' encoding='UTF-8'?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/"><channel><title>SQLServerCentral / Article Discussions / Article Discussions by Author / Discuss content posted by Brett Flippin  / Speeding Up Incremental ETL Proceses in SSIS by Using MD5 Hashes / Latest Posts</title><generator>InstantForum.NET v2.9.0</generator><description>SQLServerCentral</description><link>http://www.sqlservercentral.com/Forums/</link><webMaster>notifications@sqlservercentral.com</webMaster><lastBuildDate>Wed, 19 Jun 2013 11:45:53 GMT</lastBuildDate><ttl>20</ttl><item><title>RE: Speeding Up Incremental ETL Proceses in SSIS by Using MD5 Hashes</title><link>http://www.sqlservercentral.com/Forums/Topic933776-2712-1.aspx</link><description>Good and very clear article. But described solution - to "speed up" ETL - is like shooting mosquito from cannon ball. Much more faster is using HASHBYTES sql server native function. And for Goodness sake, dont recommend using OLEDB Command to speed up anything. Use data sets and staging tables.</description><pubDate>Sun, 09 Jun 2013 08:23:17 GMT</pubDate><dc:creator>r.krylowicz</dc:creator></item><item><title>RE: Speeding Up Incremental ETL Proceses in SSIS by Using MD5 Hashes</title><link>http://www.sqlservercentral.com/Forums/Topic933776-2712-1.aspx</link><description>I have implemented the code suggested and I am running into some trouble where every 3rd or 4th time I run the package, I get the same hash for columns that have not changed.  Shouldn't I get the same hash "every" time I run the package if the columns have not changed?These tables outside of any OLTP environment so there is no change in the source data.I appreciate any help.</description><pubDate>Tue, 30 Oct 2012 09:11:36 GMT</pubDate><dc:creator>radro23</dc:creator></item><item><title>RE: Speeding Up Incremental ETL Proceses in SSIS by Using MD5 Hashes</title><link>http://www.sqlservercentral.com/Forums/Topic933776-2712-1.aspx</link><description>For testing purposes I created a spreadsheet with 3 columns, Column A, B and C each with the values A B C respectively in ~60,000 rows and used this as my data source.Using the script below:[code="vb"]' Microsoft SQL Server Integration Services Script Component' Write scripts using Microsoft Visual Basic 2008.' ScriptMain is the entry point class of the script.Imports SystemImports System.DataImports System.MathImports Microsoft.SqlServer.Dts.Pipeline.WrapperImports Microsoft.SqlServer.Dts.Runtime.WrapperImports Microsoft.SqlServer.Dts.PipelineImports System.TextImports System.Security.Cryptography&amp;lt;Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute()&amp;gt; _&amp;lt;CLSCompliant(False)&amp;gt; _Public Class ScriptMain    Inherits UserComponent    Private InputBuffer As PipelineBuffer    Public Overrides Sub ProcessInput(ByVal InputID As Integer, ByVal Buffer As PipelineBuffer)        InputBuffer = Buffer        MyBase.ProcessInput(InputID, Buffer)    End Sub    Public Overrides Sub PreExecute()        MyBase.PreExecute()        '        ' Add your code here for preprocessing or remove if not needed        ''    End Sub    Public Overrides Sub PostExecute()        MyBase.PostExecute()        '        ' Add your code here for postprocessing or remove if not needed        ' You can set read/write variables here, for example:        ' Me.Variables.MyIntVar = 100        ''    End Sub    Public Overrides Sub InputSource_ProcessInputRow(ByVal Row As InputSourceBuffer)        Dim counter As Integer = 0        Dim values As New StringBuilder        For counter = 0 To inputBuffer.ColumnCount - 1            Dim value As Object            value = inputBuffer.Item(counter)            values.Append(value)        Next        Row.Concat = values.ToString        Row.HashValue = CreateHash(values.ToString())    End Sub    Public Shared Function CreateHash(ByVal data As String) As String        Dim dataToHash As Byte() = (New UnicodeEncoding()).GetBytes(data)        Dim md5 As MD5 = New MD5CryptoServiceProvider()        Dim hashedData As Byte() = md5.ComputeHash(dataToHash)        RNGCryptoServiceProvider.Create().GetBytes(dataToHash)        Dim s As String = Convert.ToBase64String(hashedData, Base64FormattingOptions.None)        Return s    End FunctionEnd Class[/code]I get the following results:Row	Row 1 onwardsColumn A	AColumn B	BColumn C	CHashValue	cWd0tTqSuYUlu6F5oVnMcQ==Concat	ABC	Row	Row 11167 onwardsColumn A	AColumn B	BColumn C	CHashValue	EP2rhBgzqDEc0qmEjWlQBA==Concat	ABCcWd0tTqSuYUlu6F5oVnMcQ==ABC	Row	Row 24194 onwardsColumn A	AColumn B	BColumn C	CHashValue	dLHcMhE2RJZ6ew7Jd8oezQ==Concat	ABCEP2rhBgzqDEc0qmEjWlQBA==ABCcWd0tTqSuYUlu6F5oVnMcQ==ABC	Row	Row 37221 onwardsColumn A	AColumn B	BColumn C	CHashValue	93k+mMDI1x5OoZ0cjtz7Hw==Concat	ABCdLHcMhE2RJZ6ew7Jd8oezQ==ABCEP2rhBgzqDEc0qmEjWlQBA==ABCcWd0tTqSuYUlu6F5oVnMcQ==ABC	Row	Row 50248 onwardsColumn A	AColumn B	BColumn C	CHashValue	RHGVCjMX9usJb160IvP6RQ==Concat	ABC93k+mMDI1x5OoZ0cjtz7Hw==ABCdLHcMhE2RJZ6ew7Jd8oezQ==ABCEP2rhBgzqDEc0qmEjWlQBA==ABCcWd0tTqSuYUlu6F5oVnMcQ==ABC	Row	Row 58881 onwardsColumn A	AColumn B	BColumn C	CHashValue	cWd0tTqSuYUlu6F5oVnMcQ==Concat	ABCLooking at the concat results the previous buffers HashValue is getting include in the InputSource for some reason. When I add the line Row.ColumnCount = InputBuffer.ColumnCount the value is 5 for each row even though I've only selected 3 columns as part of InputSource.</description><pubDate>Mon, 25 Jun 2012 11:27:02 GMT</pubDate><dc:creator>djphatic</dc:creator></item><item><title>RE: Speeding Up Incremental ETL Proceses in SSIS by Using MD5 Hashes</title><link>http://www.sqlservercentral.com/Forums/Topic933776-2712-1.aspx</link><description>[quote][b]sabyasm (9/2/2010)[/b][hr][font="Comic Sans MS"]Brett,Thanks for the quick reply. Problem is - I don't have the control on the environment I am working and I am not allowed to use any custom component. So I am stuck with VB.NET script component.Do you have any idea why this might have happened? Anything related to memory buffer?Can you provide any alternate code snippet that might help.[/font][/quote]I am having the same issue with this script. In each buffer, rows with the same values have a different MD5 value to the previous buffer.I've requested the Multiple Hash component to be installed but in the meantime it would be great to get this to work correctly.</description><pubDate>Mon, 25 Jun 2012 08:39:25 GMT</pubDate><dc:creator>djphatic</dc:creator></item><item><title>RE: Speeding Up Incremental ETL Proceses in SSIS by Using MD5 Hashes</title><link>http://www.sqlservercentral.com/Forums/Topic933776-2712-1.aspx</link><description>[quote][b]huishi.ca (3/19/2012)[/b][hr]awesome article , however, I have a different scenario. What about the source and target tables are really large? close to 1 billion records? I am not sure if it's realistic to go through each record using script component to calcuate the HASH value. Also in lookup component, just full cache on even two columns of the target tables would prettymuch suck most of the memory as it contains over 1 billions records.  Any comments on large tables?thanksHui[/quote]For extremely large tables it might make more sense to implement change capturing in the source system to limit the number of records you have to work with in the package. If that's not an option this method will only really work with an SSIS box that has a ton of memory. You can modify the caching in the lookup as well to take pressure of the memory.Also I'd highly recommend using SSIS Multiple Hash instead of this method, it's much easier to incorporate into your ETL infrastructure than a custom script.</description><pubDate>Mon, 19 Mar 2012 16:02:40 GMT</pubDate><dc:creator>Brett Flippin</dc:creator></item><item><title>RE: Speeding Up Incremental ETL Proceses in SSIS by Using MD5 Hashes</title><link>http://www.sqlservercentral.com/Forums/Topic933776-2712-1.aspx</link><description>awesome article , however, I have a different scenario. What about the source and target tables are really large? close to 1 billion records? I am not sure if it's realistic to go through each record using script component to calcuate the HASH value. Also in lookup component, just full cache on even two columns of the target tables would prettymuch suck most of the memory as it contains over 1 billions records.  Any comments on large tables?thanksHui</description><pubDate>Mon, 19 Mar 2012 15:56:31 GMT</pubDate><dc:creator>huishi.ca</dc:creator></item><item><title>RE: Speeding Up Incremental ETL Proceses in SSIS by Using MD5 Hashes</title><link>http://www.sqlservercentral.com/Forums/Topic933776-2712-1.aspx</link><description>I actually resolved the problem. not sure about performance. I'll add my comments later when I get results.  thank you all.</description><pubDate>Fri, 24 Sep 2010 21:30:05 GMT</pubDate><dc:creator>dellabayr</dc:creator></item><item><title>RE: Speeding Up Incremental ETL Proceses in SSIS by Using MD5 Hashes</title><link>http://www.sqlservercentral.com/Forums/Topic933776-2712-1.aspx</link><description>Nice article, wish you could write SSIS for our team</description><pubDate>Fri, 24 Sep 2010 13:59:41 GMT</pubDate><dc:creator>bolyard</dc:creator></item><item><title>RE: Speeding Up Incremental ETL Proceses in SSIS by Using MD5 Hashes</title><link>http://www.sqlservercentral.com/Forums/Topic933776-2712-1.aspx</link><description>i wonder if you can do incremental load from MysQL to SQl server 2008. I used ADO .NET source connecting to MySQL and I could not do connection to Script component. I read another article Anatomy of an Incremental Load and i cannot use lookup transformation editor , it produced "cannot link data because of data type mismatch" error when I try mach MYSQl key to SQL server 2008 table? how I can convert unsigned 4 bite unsigned interger from mySQL to bigint key in MS SQL? thank you!</description><pubDate>Thu, 23 Sep 2010 14:48:54 GMT</pubDate><dc:creator>ddbvizel</dc:creator></item><item><title>RE: Speeding Up Incremental ETL Proceses in SSIS by Using MD5 Hashes</title><link>http://www.sqlservercentral.com/Forums/Topic933776-2712-1.aspx</link><description>Nice article.Consider adding the checksum column as an INCLUDE column to the PK index on the table. Your milage may vary because:1) The lookup transformation component only needs to read the PK index, not the actual data rows. [good for this process.]2) The PK index is larger because of the checksum values. [a minus--hurts storage and any access using the PK.]3) The PK index entry must be updated with a new checksum value whenever the data gets updated. Since the checksum value is always the same size, this will not fragment the index pages unless you use the "update trigger sets checksum column to null" idea. [a minus for any process that updates the data.]</description><pubDate>Fri, 03 Sep 2010 17:23:25 GMT</pubDate><dc:creator>DLathrop</dc:creator></item><item><title>RE: Speeding Up Incremental ETL Proceses in SSIS by Using MD5 Hashes</title><link>http://www.sqlservercentral.com/Forums/Topic933776-2712-1.aspx</link><description>[font="Comic Sans MS"]Thanks!I am going to do that too!Meantime I am removing 4 columns on pipeline which will always be fixed string and see if this helps anyway.Let me thank you once more for this article. I saw it through RSS feed and bookmarked a while ago. It came really handy when we needed to use this 5M vs 5M data compare where any of these 15-16 column value can change. This made the end to end process run in less than 30 mins while it used to take more than 16 hrs![/font]</description><pubDate>Thu, 02 Sep 2010 14:12:49 GMT</pubDate><dc:creator>sabyasm</dc:creator></item><item><title>RE: Speeding Up Incremental ETL Proceses in SSIS by Using MD5 Hashes</title><link>http://www.sqlservercentral.com/Forums/Topic933776-2712-1.aspx</link><description>I will do a little more testing on it during this long holiday weekend and see what I might be able to come up with.One thing you could try is using a different type of hash such as SHA instead of MD5. I provided an example earlier in the thread.I don't think it's a buffer issue, though it could be if you are experiencing problems at the same threshold every time.</description><pubDate>Thu, 02 Sep 2010 13:51:35 GMT</pubDate><dc:creator>Brett Flippin</dc:creator></item><item><title>RE: Speeding Up Incremental ETL Proceses in SSIS by Using MD5 Hashes</title><link>http://www.sqlservercentral.com/Forums/Topic933776-2712-1.aspx</link><description>[font="Comic Sans MS"]Brett,Thanks for the quick reply. Problem is - I don't have the control on the environment I am working and I am not allowed to use any custom component. So I am stuck with VB.NET script component.Do you have any idea why this might have happened? Anything related to memory buffer?Can you provide any alternate code snippet that might help.[/font]</description><pubDate>Thu, 02 Sep 2010 13:44:42 GMT</pubDate><dc:creator>sabyasm</dc:creator></item><item><title>RE: Speeding Up Incremental ETL Proceses in SSIS by Using MD5 Hashes</title><link>http://www.sqlservercentral.com/Forums/Topic933776-2712-1.aspx</link><description>I've noticed that sometimes in certain character fields this can be a problem.  The best solution I've found, and what I'm currently using in my environment is the SSIS Multiple Hash data flow task.  It seems to be a little more reliable when dealing with these fields.The only change you'll have to make is to change the HashValue columns to the varbinary data type of whatever length of whatever hash you choose to use.  And you'll need to convert the new binary hash value to text in the conditional split for comparison as the conditional split cannot compare binary values.[url=http://ssismhash.codeplex.com/][/url]</description><pubDate>Thu, 02 Sep 2010 10:36:18 GMT</pubDate><dc:creator>Brett Flippin</dc:creator></item><item><title>RE: Speeding Up Incremental ETL Proceses in SSIS by Using MD5 Hashes</title><link>http://www.sqlservercentral.com/Forums/Topic933776-2712-1.aspx</link><description>[font="Comic Sans MS"]This is an excellent article.However - I am facing a peculier problem that sometime the generated MD5 hashval is NOT same for the same data. This is happening intermittantly - say for 5 days it is working fine and not working next day. Also - I have noticed this to happen only if the data volume is high (5M). For around first 50K records it generated the hashValue same as previous day (say day 0) - but for rest of records it has generated different hasValues. It shows rest of the 5M records to be delta. Next day - it rectifies itself and create hashValue same as what we had had for day 0.It looks like:(columnA) (columnB) (columnC) columnHashA,B,C, hashValue1 ----&amp;gt; day 0A,B,C, hashValue2 ----&amp;gt; day 1 (record inserted)(I forced all insert even if the hashvalue same)A,B,C, hashValue1 ----&amp;gt; day 2( I have 16+ columns - combination of varchar, int, datetime. Some of the columns contain Null value and the hashValue column is of nvarchar(35)I was unable to replicate this from one environment to another. I loaded those 3 days data into TEST and got desired result (generated ~800 delta records).This got me thinking of any environment related issue? Is it related to some sort of buffer overflow (note around ~50K records generated correct hashvalue)? I have loaded more than 500 files ranging from 1000 records to 5M reocrds and the problem has happened only 3 times with the files with 5M records. However - I have loaded the 5M file around 20 times without any issue.Any clue - or suggestion for doing the debug will be highly appreciated.[/font]</description><pubDate>Thu, 02 Sep 2010 10:22:16 GMT</pubDate><dc:creator>sabyasm</dc:creator></item><item><title>RE: Speeding Up Incremental ETL Proceses in SSIS by Using MD5 Hashes</title><link>http://www.sqlservercentral.com/Forums/Topic933776-2712-1.aspx</link><description>Thanks for the quick response.</description><pubDate>Wed, 14 Jul 2010 20:53:26 GMT</pubDate><dc:creator>Pravasis</dc:creator></item><item><title>RE: Speeding Up Incremental ETL Proceses in SSIS by Using MD5 Hashes</title><link>http://www.sqlservercentral.com/Forums/Topic933776-2712-1.aspx</link><description>[quote][b]pravasis (7/14/2010)[/b][hr]Its really a nice artice, but I think can fail in the following scenario.1)There are more than two columns 2)Atleast two records are there.3)Alternately each column has null values and the other column has the value contained in the previous record.Ex.Col-A	Col-B	Col-C	Col-D2	1	111	NULL3	1	NULL	111Both the records will generate same hashcode.[/quote]You can prevent this by using coalesce or isnull for the columns that could contain NULL values.</description><pubDate>Wed, 14 Jul 2010 05:33:23 GMT</pubDate><dc:creator>Brett Flippin</dc:creator></item><item><title>RE: Speeding Up Incremental ETL Proceses in SSIS by Using MD5 Hashes</title><link>http://www.sqlservercentral.com/Forums/Topic933776-2712-1.aspx</link><description>Its really a nice artice, but I think can fail in the following scenario.1)There are more than two columns 2)Atleast two records are there.3)Alternately each column has null values and the other column has the value contained in the previous record.Ex.Col-A	Col-B	Col-C	Col-D2	1	111	NULL3	1	NULL	111Both the records will generate same hashcode.</description><pubDate>Wed, 14 Jul 2010 03:07:45 GMT</pubDate><dc:creator>Pravasis</dc:creator></item><item><title>RE: Speeding Up Incremental ETL Proceses in SSIS by Using MD5 Hashes</title><link>http://www.sqlservercentral.com/Forums/Topic933776-2712-1.aspx</link><description>Hello Brett,Thank you for the wonderful article.I was trying to install your example but for some reason its not working for me. Is there any possibility that you can share the full code in your script component with us.Thanks</description><pubDate>Tue, 13 Jul 2010 07:56:00 GMT</pubDate><dc:creator>Neeraj Gupta-485504</dc:creator></item><item><title>RE: Speeding Up Incremental ETL Proceses in SSIS by Using MD5 Hashes</title><link>http://www.sqlservercentral.com/Forums/Topic933776-2712-1.aspx</link><description>Hey, great article, just two quick comments, there is a SSISMultipleHash task on Codeplex that is very nice for this, (I'm not the developer, but I use it), you can find it here [url=http://ssismhash.codeplex.com/]http://ssismhash.codeplex.com/[/url].Then on the idea of doing the hash in the source, instead of in SSIS flow.  I think that is a really good idea, and can make it so that you don't have to store the hash in the database table.  There are some considerations to look at to make sure that the hash algorithm is always the same([url=http://weblogs.sqlteam.com/mladenp/archive/2009/04/28/Comparing-SQL-Server-HASHBYTES-function-and-.Net-hashing.aspx]http://weblogs.sqlteam.com/mladenp/archive/2009/04/28/Comparing-SQL-Server-HASHBYTES-function-and-.Net-hashing.aspx[/url]).  Also, if you store the hash in the database, then you might want to add a trigger to the table that will set the hash to null if a change is made to the record, that doesn't change the hash.  ie. a change made outside of your ssis package.  If the hash doesn't get changed, then the ssis package won't recognize that the two records are out of sync.  Hope that makes sense!</description><pubDate>Tue, 15 Jun 2010 22:20:14 GMT</pubDate><dc:creator>mwpage</dc:creator></item><item><title>RE: Speeding Up Incremental ETL Proceses in SSIS by Using MD5 Hashes</title><link>http://www.sqlservercentral.com/Forums/Topic933776-2712-1.aspx</link><description>[quote][b]Todd McDermid (6/14/2010)[/b][hr][quote][b]Jason Lees-299789 (6/14/2010)[/b][hr]...a better option would be a custom ssis task that allows you to select all the columns you want to be Hashed, and attach the hashed column on the output stream.[/quote]Check out the Multiple Hash component: [url=http://ssismhash.codeplex.com/]SSIS Multiple Hash Component[/url].[/quote]Thanks for the pointer Todd, I'll check it out.</description><pubDate>Mon, 14 Jun 2010 22:39:41 GMT</pubDate><dc:creator>Jason-299789</dc:creator></item><item><title>RE: Speeding Up Incremental ETL Proceses in SSIS by Using MD5 Hashes</title><link>http://www.sqlservercentral.com/Forums/Topic933776-2712-1.aspx</link><description>[quote][b]Jason Lees-299789 (6/14/2010)[/b][hr]...a better option would be a custom ssis task that allows you to select all the columns you want to be Hashed, and attach the hashed column on the output stream.[/quote]Check out the Multiple Hash component: [url=http://ssismhash.codeplex.com/]SSIS Multiple Hash Component[/url].</description><pubDate>Mon, 14 Jun 2010 15:22:35 GMT</pubDate><dc:creator>Todd McDermid</dc:creator></item><item><title>RE: Speeding Up Incremental ETL Proceses in SSIS by Using MD5 Hashes</title><link>http://www.sqlservercentral.com/Forums/Topic933776-2712-1.aspx</link><description>[quote][b]Nadrek (6/14/2010)[/b][hr][quote][b]Brett Flippin (6/10/2010)[/b]...Also with a traditional truncate/load ETL process you lose the ability of the ETL process to fail gracefully and still retain your data in the destination.  ...[/quote]If you actually build a second table, work on that, then drop the older table and sp_rename the newer (or use synonyms if you're not still on SQL Server 2000), then you can indeed fail gracefully and have full availability during almost the entire update.[/quote]That could work, but having database object or table structure manipulation be part of your ETL process wouldn't be optimal in my book. Would you really want to rebuild indexes or recreate statistics after each ETL run? What about if you're partitioning?I could see that working in a very very limited set of circumstances but I don't know that even then I'd recommend it.</description><pubDate>Mon, 14 Jun 2010 12:49:26 GMT</pubDate><dc:creator>Brett Flippin</dc:creator></item><item><title>RE: Speeding Up Incremental ETL Proceses in SSIS by Using MD5 Hashes</title><link>http://www.sqlservercentral.com/Forums/Topic933776-2712-1.aspx</link><description>[quote][b]Brett Flippin (6/10/2010)[/b]...Also with a traditional truncate/load ETL process you lose the ability of the ETL process to fail gracefully and still retain your data in the destination.  ...[/quote]If you actually build a second table, work on that, then drop the older table and sp_rename the newer (or use synonyms if you're not still on SQL Server 2000), then you can indeed fail gracefully and have full availability during almost the entire update.</description><pubDate>Mon, 14 Jun 2010 12:25:33 GMT</pubDate><dc:creator>Nadrek</dc:creator></item><item><title>RE: Speeding Up Incremental ETL Proceses in SSIS by Using MD5 Hashes</title><link>http://www.sqlservercentral.com/Forums/Topic933776-2712-1.aspx</link><description>UMG, I've used the Binary_CheckSum function as well and normally keep it as a calculated column that forms part of the Business key Index as an include.However there is a down side in using the Binary_Checksum, in that if you are dealing with Decimals that change by a fact of 10, eg 1001.00, becomes 100.10 or 10.01 (or the reverse) then the BCS doesnt recognise these as changes thus the Checksum remains the same. Not an ideal situation when you are checking financials.The MD5, or SHA1 are much better, however my concern is in regard to reusability of the code you have to cut and paste this in to every data stream, a better option would be a custom ssis task that allows you to select all the columns you want to be Hashed, and attach the hashed column on the output stream.</description><pubDate>Mon, 14 Jun 2010 04:10:27 GMT</pubDate><dc:creator>Jason-299789</dc:creator></item><item><title>RE: Speeding Up Incremental ETL Proceses in SSIS by Using MD5 Hashes</title><link>http://www.sqlservercentral.com/Forums/Topic933776-2712-1.aspx</link><description>[quote][b]da-zero (6/10/2010)[/b][hr]The Truncate/insert method will only work if your source data also keeps history. If not, you're stuck with incremental load.Furthermore, at the beginning truncate/insert will surely work faster than the incremental load, but what if you have 300,000 records a day? After two years, the incremental load will be faster as it will only have to process those 300,000 records. Your insert statement will have to insert in the worst case 100+ billion records.[/quote]Agreed on these points. Also with a traditional truncate/load ETL process you lose the ability of the ETL process to fail gracefully and still retain your data in the destination.  An incremental ETL process will be slower in some cases but your data integrity and availability is much much higher.  This article is about making the incremental ETL as fast as it can be by checking only a hash for changes instead of checking every column.</description><pubDate>Thu, 10 Jun 2010 08:28:27 GMT</pubDate><dc:creator>Brett Flippin</dc:creator></item><item><title>RE: Speeding Up Incremental ETL Proceses in SSIS by Using MD5 Hashes</title><link>http://www.sqlservercentral.com/Forums/Topic933776-2712-1.aspx</link><description>[quote][b]aruzhans (6/9/2010)[/b][hr]Excellent article! Ready to use to mentor development team!Great elegant use of Conditional Split in Incremental loadsHave you compared the performance of this method with more traditional checksum method. The checksum column is created in destination table and compared with run time checksum calc of input data. The match ignored, mismatch results in update of entire row. What do you think are the advantages of this method vs checksumThanks[/quote]The advantages in using MD5 or SHA-1 over traditional checksum is simply the lower chance of hash collisions, which is going to be small either way if you are using the Hash + ID column to identify your records.</description><pubDate>Thu, 10 Jun 2010 08:21:49 GMT</pubDate><dc:creator>Brett Flippin</dc:creator></item><item><title>RE: Speeding Up Incremental ETL Proceses in SSIS by Using MD5 Hashes</title><link>http://www.sqlservercentral.com/Forums/Topic933776-2712-1.aspx</link><description>The Truncate/insert method will only work if your source data also keeps history. If not, you're stuck with incremental load.Furthermore, at the beginning truncate/insert will surely work faster than the incremental load, but what if you have 300,000 records a day? After two years, the incremental load will be faster as it will only have to process those 300,000 records. Your insert statement will have to insert in the worst case 100+ billion records.</description><pubDate>Thu, 10 Jun 2010 04:27:44 GMT</pubDate><dc:creator>Koen Verbeeck</dc:creator></item><item><title>RE: Speeding Up Incremental ETL Proceses in SSIS by Using MD5 Hashes</title><link>http://www.sqlservercentral.com/Forums/Topic933776-2712-1.aspx</link><description>Hi,I have tried out this. I still observed truncate\insert is much faster that incremental load using MD5 process. For 800000 records, truncate\insert took 33sec to load while incremental process took 60sec. Please comment on this, if i am missing anything while following incremental process posted.ThanksVinay</description><pubDate>Thu, 10 Jun 2010 04:06:28 GMT</pubDate><dc:creator>vinay-474901</dc:creator></item><item><title>RE: Speeding Up Incremental ETL Proceses in SSIS by Using MD5 Hashes</title><link>http://www.sqlservercentral.com/Forums/Topic933776-2712-1.aspx</link><description>Excelente article.Congratulations</description><pubDate>Wed, 09 Jun 2010 15:15:50 GMT</pubDate><dc:creator>mario.felix</dc:creator></item><item><title>RE: Speeding Up Incremental ETL Proceses in SSIS by Using MD5 Hashes</title><link>http://www.sqlservercentral.com/Forums/Topic933776-2712-1.aspx</link><description>Excellent article! Ready to use to mentor development team!Great elegant use of Conditional Split in Incremental loadsHave you compared the performance of this method with more traditional checksum method. The checksum column is created in destination table and compared with run time checksum calc of input data. The match ignored, mismatch results in update of entire row. What do you think are the advantages of this method vs checksumThanks</description><pubDate>Wed, 09 Jun 2010 12:16:31 GMT</pubDate><dc:creator>aruzhans</dc:creator></item><item><title>RE: Speeding Up Incremental ETL Proceses in SSIS by Using MD5 Hashes</title><link>http://www.sqlservercentral.com/Forums/Topic933776-2712-1.aspx</link><description>[quote][b]Neeraj Gupta-485504 (6/8/2010)[/b][hr]Could you please elaborate the use of binary checksum function.[/quote]Here is a link to BOL for it: [url=http://msdn.microsoft.com/en-us/library/ms173784.aspx]BINARY_CHECKSUM[/url]Essentially you could do something like:[code="SQL"]UPDATE ASET LastName = B.LastName FROM Table1 A  INNER JOIN Table2 B    ON A.PK = B.PKWHERE  (SELECT BINARY_CHECKSUM(*) FROM Table1 WHERE PK = A.PK) &amp;lt;&amp;gt; (SELECT BINARY_CHECKSUM(*) FROM Table2 WHERE PK = A.PK);[/code]That probably isn't the best way to do it, you could get rid of the sub-queries by specifying column names like:[code="SQL"]UPDATE ASET LastName = B.LastNameFROM Table1 A  INNER JOIN Table2 B    ON A.PK = B.PKWHERE  BINARY_CHECKSUM(A.LastName) &amp;lt;&amp;gt; BINARY_CHECKSUM(B.LastName);[/code]You could use the same idea with a MERGE statement to handle the whole load process.</description><pubDate>Wed, 09 Jun 2010 10:37:58 GMT</pubDate><dc:creator>UMG Developer</dc:creator></item><item><title>RE: Speeding Up Incremental ETL Proceses in SSIS by Using MD5 Hashes</title><link>http://www.sqlservercentral.com/Forums/Topic933776-2712-1.aspx</link><description>Interesting read with relation to this topic:     [url=http://sqlblog.com/blogs/michael_coles/archive/2009/04/12/let-s-hash-a-blob.aspx][i]Let's Hash a BLOB[/i] by Michael Coles[/url]Perhaps more applicable to hashing techniques outside of ETL loading, or those paranoid about hash collisions, but still interesting stuff.  Mr. Coles also includes some .NET code for using stronger hashes than the MD5 and SHA1 covered thus far.</description><pubDate>Wed, 09 Jun 2010 07:08:25 GMT</pubDate><dc:creator>Wes W.</dc:creator></item><item><title>RE: Speeding Up Incremental ETL Proceses in SSIS by Using MD5 Hashes</title><link>http://www.sqlservercentral.com/Forums/Topic933776-2712-1.aspx</link><description>I've used a similar approach like this before and found one slight problem with this...It is all well when your destination table - that is the table in the lookup - is rather small (record count wise).To cache a table with 1 million records won't take that long. But now if you take a table with 500 mil records...Try to cache that. This step can and will cause some major slow downs.So take this scenario:You're doing a daily load - With a delta size of +- 100,000 records. To pull 100,000 records is fairly fast.Your destination table already have 500 mil records. You can use the lookup component in one of couple of ways: 1. Cached (which means that the full 500 mil records will be cached into memory) This will take most of your memory - and will take some time. By the time that you've cached everything you could have loaded the 100,000 records...2. No cache - this approach might be a bit faster in this scenario - but you will that your network traffic can jump up a bit (depends on where your package is running)3. Partial cache - I've only looked at this once or so - so can't comment on performance on this.So in my experience this works like a dream on smaller tables. But very big tables it might not work that well.So if you have a workaround, or a some advice on how to cater for big tables - please shout!</description><pubDate>Wed, 09 Jun 2010 06:08:06 GMT</pubDate><dc:creator>Wandrag</dc:creator></item><item><title>RE: Speeding Up Incremental ETL Proceses in SSIS by Using MD5 Hashes</title><link>http://www.sqlservercentral.com/Forums/Topic933776-2712-1.aspx</link><description>Greate article and I also use the Multiple Hash component that Todd stated [u][url=http://ssismhash.codeplex.com/]http://ssismhash.codeplex.com/[/url][/u] but I like the script component use.  This would be great for a simple Type 1 SCD, but obviously would need to be expanded upon for a Type 2 and Todd has a great component for that as well [u][url=http://kimballscd.codeplex.com/]http://kimballscd.codeplex.com/[/url][/u].  If you were doing a Type 1 you could typically leverage some sort of last update field or possibly change data capture and then simply update the changed records and just bypass the use of a hash.</description><pubDate>Wed, 09 Jun 2010 04:52:56 GMT</pubDate><dc:creator>denglishbi</dc:creator></item><item><title>RE: Speeding Up Incremental ETL Proceses in SSIS by Using MD5 Hashes</title><link>http://www.sqlservercentral.com/Forums/Topic933776-2712-1.aspx</link><description>An untested idea for minimizing hash collisions could be, having 2 hash columns. One for hashing columns in ascending and one in descending order. If you have a hash collisions in ascending order the chances for also having it in the descending order for the collision rows, should be nearly impossible, unless the values are the same in all columns</description><pubDate>Wed, 09 Jun 2010 04:47:01 GMT</pubDate><dc:creator>Ligtorn</dc:creator></item><item><title>RE: Speeding Up Incremental ETL Proceses in SSIS by Using MD5 Hashes</title><link>http://www.sqlservercentral.com/Forums/Topic933776-2712-1.aspx</link><description>An untested idea for minimizing hash collisions could be, having 2 hash columns. One for hashing columns in ascending and one in descending order. If you have a hash collisions in ascending order the chances for also having it in the descending order for the collision rows, should be nearly impossible, unless the values are the same in all columns</description><pubDate>Wed, 09 Jun 2010 04:47:00 GMT</pubDate><dc:creator>Ligtorn</dc:creator></item><item><title>RE: Speeding Up Incremental ETL Proceses in SSIS by Using MD5 Hashes</title><link>http://www.sqlservercentral.com/Forums/Topic933776-2712-1.aspx</link><description>Could you please elaborate the use of binary checksum function.Thank you</description><pubDate>Tue, 08 Jun 2010 18:56:22 GMT</pubDate><dc:creator>Neeraj Gupta-485504</dc:creator></item><item><title>RE: Speeding Up Incremental ETL Proceses in SSIS by Using MD5 Hashes</title><link>http://www.sqlservercentral.com/Forums/Topic933776-2712-1.aspx</link><description>[quote][b]Neeraj Gupta-485504 (6/8/2010)[/b][hr]Very nice article.We could also use Merge statement in Sql Server 2008 to achieve this. Any thoughts on that.[/quote]I think you are correct, if you have direct access to both as a table you could use the MERGE statement along with the BINARY_CHECKSUM function to do the entire process of INSERTing, UPDATEing, and DELETEing.</description><pubDate>Tue, 08 Jun 2010 18:05:03 GMT</pubDate><dc:creator>UMG Developer</dc:creator></item><item><title>RE: Speeding Up Incremental ETL Proceses in SSIS by Using MD5 Hashes</title><link>http://www.sqlservercentral.com/Forums/Topic933776-2712-1.aspx</link><description>Very nice article.We could also use Merge statement in Sql Server 2008 to achieve this. Any thoughts on that.</description><pubDate>Tue, 08 Jun 2010 17:33:00 GMT</pubDate><dc:creator>Neeraj Gupta-485504</dc:creator></item></channel></rss>