SQLServerCentral is supported by Red Gate Software Ltd.
 
Incremental load using Hashvalue
Posted Tuesday, January 28, 2014 1:50 AM
Hi everybody,

Background: we receive customer data every day, including new customers and changes to existing customers. The new customers are inserted into our CRM system using the CozyRoc component. Changes to existing customer data must be analysed and applied to the CRM only if they are relevant.
A relevant change, for us, is a change in a relevant column. The customer table where we store the incoming data has 10 columns, and only 5 are relevant to the CRM system. So if we receive updates only in the non-relevant columns, we should discard those rows during our incremental load.

Initial solution: add a column to store a hash value computed over the relevant columns, then calculate the hash value of the incoming data and compare the two:

hashvalue =
Convert(nvarchar(35), hashbytes('MD5', ISNULL(Col1,'') + ISNULL(Col2, '2000-01-01') + ... + ISNULL(Col5, '')))

The data type of the hashvalue column is nvarchar(35).

The strange thing is that the stored values are weird, something like " ೢムό䜨غ䟊", and when new data arrive, the newly calculated hash value never matches the stored one.

The question:
I tried a hashvalue column of type binary(16) and it seems to work, but my question is: why can't I use the nvarchar solution?

It is also painful because I cannot compare binary values using a conditional split, so I have to convert the binary values back to DT_WSTR for the comparison.
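For reference, here is a small sketch of what the two storage options do with the same MD5 value (the input string here is just illustrative). HASHBYTES returns varbinary, so converting it straight to nvarchar reinterprets the 16 raw bytes as 8 UTF-16 characters rather than producing readable text, which is where the weird characters come from:

```sql
DECLARE @hash VARBINARY(16) = HASHBYTES('MD5', N'some customer data');

-- Reinterprets the 16 raw bytes as 8 UTF-16 code units: unreadable and fragile.
SELECT CONVERT(NVARCHAR(35), @hash) AS reinterpreted;

-- Renders the bytes as a 32-character hex string instead (binary style 2,
-- available on newer SQL Server versions): safe to store and compare as text.
SELECT CONVERT(CHAR(32), @hash, 2) AS hex_string;

-- Or simply store and compare the raw bytes in a BINARY(16) column.
```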

Any comment would be appreciated.

Kind Regards,


Paul Hernández
http://hernandezpaul.wordpress.com/
https://twitter.com/paul_eng
Post #1535332
Posted Tuesday, January 28, 2014 7:57 AM
I'm a bit confused as to why you need the hash value. Why not select those 5 columns in your lookup, and test for changes to any of them in your conditional split?
Post #1535460
Posted Tuesday, January 28, 2014 9:37 AM
Paul,

Isn't it the CONVERT that's throwing it off? When I've done this sort of thing before, I've always had the hash as a computed column (obviously persisted and indexed appropriately), such as in the simplified example below.

I'm confused as to why you'd do a conditional split; wouldn't it be more performant to do an update based on the key value where the hash doesn't match?

Apologies if I'm missing something.

IF OBJECT_ID('dbo.Emp') IS NOT NULL
    DROP TABLE dbo.Emp;
GO

CREATE TABLE dbo.Emp
(
    Id        INT PRIMARY KEY
   ,Forename  VARCHAR(50)
   ,Surname   VARCHAR(50)
   ,JobTitle  VARCHAR(50)
    -- Hash computed over the relevant columns; persist and index it in a real table.
   ,HashCol   AS HASHBYTES('MD5', Forename + Surname + JobTitle)
);
GO

INSERT INTO dbo.Emp
VALUES (1, 'Bob', 'Smith', 'IT Manager')
      ,(2, 'Harry', 'Hill', 'DBA')
      ,(3, 'Chris', 'Brown', 'Programmer');
GO

-- Locate the row by comparing the incoming hash against the stored one.
UPDATE dbo.Emp
SET Forename = 'Robert'
WHERE HASHBYTES('MD5', 'BobSmithIT Manager') = HashCol;
GO

SELECT * FROM dbo.Emp;
GO
Post #1535539
Posted Tuesday, January 28, 2014 10:09 PM


Perhaps you could try this.

1) Send all of your potential updates to a staging table.

2) Do a MERGE from the staging table to perform the update, using a source something like

(Select * from staging)
Except
(Select relevant cols from target join staging on col = col)

The except condition should remove the identical rows without fuss.
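Put together, the MERGE source might look something like this (table and column names are placeholders, not the OP's actual schema). Note that EXCEPT treats NULLs as equal, so rows that are identical in every listed column drop out of the source without any explicit null handling:

```sql
MERGE dbo.Customer AS tgt
USING (
        SELECT CustomerId, Col1, Col2, Col3, Col4, Col5 FROM dbo.CustomerStaging
        EXCEPT
        SELECT CustomerId, Col1, Col2, Col3, Col4, Col5 FROM dbo.Customer
      ) AS src
   ON tgt.CustomerId = src.CustomerId
WHEN MATCHED THEN          -- existing customer with a relevant change
    UPDATE SET Col1 = src.Col1, Col2 = src.Col2, Col3 = src.Col3,
               Col4 = src.Col4, Col5 = src.Col5
WHEN NOT MATCHED BY TARGET THEN   -- brand-new customer
    INSERT (CustomerId, Col1, Col2, Col3, Col4, Col5)
    VALUES (src.CustomerId, src.Col1, src.Col2, src.Col3, src.Col4, src.Col5);
```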



Help us to help you. For better, quicker and more-focused answers to your questions, consider following the advice in this link.

When you ask a question (and please do ask a question: "My T-SQL does not work" just doesn't cut it), please provide enough information for us to understand its context.
Post #1535705
Posted Wednesday, January 29, 2014 12:50 AM
Hi Folks,

Thanks for your answers.

@Nevyn: my case is actually more complex. As an example I said 10 columns, 5 relevant and 5 non-relevant, but I really have a table with more than 40 columns, at least 25 of them relevant.

@Farlzy: that's an elegant solution, and I can solve almost everything in the database, I mean, more in SQL Server and less in SSIS.

@Phil: you always give useful answers. I have chosen your solution; in fact, I already have a staging DB to control the information exchange between our Data Mart and the CRM. I think your solution is not only simple to implement but also performs better. I'm a fan of the MERGE command.

Kind Regards,


Paul Hernández
Post #1535741
Posted Wednesday, January 29, 2014 10:30 AM
I agree with Phil's approach.

SELECT
EXCEPT
SELECT

is a tried-and-true way of doing a column-specific "table diff". I use hashing as a last resort, for extremely large and wide dimensions with many Type 2 columns. Personally, I love Todd McDermid's SCD Transform (free on CodePlex), but in this situation simplicity rules, especially if there are not many on your team who know and can support SSIS.

Unfortunately, SQL Server does NOT include the ANSI-standard CORRESPONDING BY functionality yet... maybe someday.
Post #1536030
Posted Wednesday, January 29, 2014 12:47 PM


I've done this before, also with the persisted computed column for only the relevant fields, though I usually use a key/checksum combination.

What I typically do is load the key/checksum combination into a lookup component, then feed in changes from the source determined by their delta detection (usually a datetime field and a 'last run' field in my configurations) and find those that don't match on the combination. In a lookup, the unmatched error rows can easily be redirected, and those are what I feed to my staging table. The matching rows are allowed to rot in /dev/null.

After that, once they're in the staging structure, I do what Phil recommends and MERGE those I know have changed. It eases up on the processing, and since I have to move the data between source and target anyway, a bit of transformation along the way is far lighter computationally than processing 2-3x the rows on the target.

Just keep the lookup tight. If you're past 30 bytes across or so, it's far too large for easy caching on umpteen-million-row targets, at least on my equipment.

EDIT: I should mention that yes, I know checksum can have varchar concatenation concerns, where 'abc','def' gives the same value as 'ab','cdef'. If I can't separate my varchars with integers, I'll usually append a pipe in the calculation, so:

CheckSumVal AS CHECKSUM( VCol1, IntCol1, VCol2, '|' + VCol3, '|' + VCol4)
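As a concrete sketch of that column in context (table and column names here are made up for illustration):

```sql
CREATE TABLE dbo.CustomerDelta
(
    CustomerId  INT PRIMARY KEY
   ,VCol1       VARCHAR(50)
   ,IntCol1     INT
   ,VCol2       VARCHAR(50)
   ,VCol3       VARCHAR(50)
   ,VCol4       VARCHAR(50)
    -- Pipe separators follow the convention above to guard adjacent varchars;
    -- PERSISTED means the checksum is computed once at write time, not per read.
   ,CheckSumVal AS CHECKSUM(VCol1, IntCol1, VCol2, '|' + VCol3, '|' + VCol4) PERSISTED
);
```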




- Craig Farrell

Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

For better assistance in answering your questions | Forum Netiquette
For index/tuning help, follow these directions. |Tally Tables

Twitter: @AnyWayDBA
Post #1536080
Posted Wednesday, January 29, 2014 1:05 PM


There's one issue with using the EXCEPT approach as the data source for a subsequent MERGE: it becomes impossible to tell from within the MERGE if rows have been deleted from the data source (as they are combined with rows that are identical across both the source and target). If that's okay (you may not care about deletes or you may know that the source never has deletes) then fine, but if not, you are left with the problem of dealing with deletes separately.

An alternative is to use MERGE itself to do the work for you. You end up doing something like the following:

MERGE INTO B
USING A
    ON A.key = B.key
WHEN MATCHED AND (
       (A.x IS NULL AND B.x IS NOT NULL)
    OR (A.x IS NOT NULL AND B.x IS NULL)
    OR A.x <> B.x
    OR (A.y IS NULL AND B.y IS NOT NULL)
    OR (A.y IS NOT NULL AND B.y IS NULL)
    OR A.y <> B.y
)
THEN UPDATE SET B.x = A.x, B.y = A.y
WHEN NOT MATCHED BY TARGET
THEN INSERT (key, x, y) VALUES (A.key, A.x, A.y)
WHEN NOT MATCHED BY SOURCE
THEN DELETE;

Okay, so that's more code to write, until you take into account the additional code needed to handle deletes under the other approach.

I generally don't favour calculating a checksum across the columns and comparing checksums. SQL Server has to calculate the checksum across all of the columns every time, whereas with the column-by-column comparison it can "short circuit" as soon as it finds a mismatch. If you calculate the checksum while you are loading the data, it should improve the performance of the subsequent MERGE, but at the cost of a slower load and having to store the checksum; it's a trade-off. The approach of using MERGE to determine the changes will work either way.
Post #1536088
Posted Wednesday, January 29, 2014 1:19 PM


Bruce W Cassidy (1/29/2014)

I generally don't favour calculating a checksum across the columns and comparing checksums. SQL Server has to calculate the checksum across all of the columns every time, whereas with the column by column comparison, it can "short circuit" the comparison as soon as it finds a mis-match.

A persisted calculated column is relatively lightweight, and doesn't require continuous computation. The key being it's persisted. Also INTs are small.

If you calculate the checksum while you are loading the data, it should improve the performance of the subsequent MERGE, but at the cost of a slower load and having to store the checksum; it's a trade-off. The approach with using MERGE to determine the changes will work either way.

True, it depends on the expected change volume of the required columns vs. the columns that aren't transferred. If it's low, I probably wouldn't pursue removing them. If it's high, removing them entirely is relatively inexpensive.



- Craig Farrell

Post #1536091
Posted Wednesday, January 29, 2014 1:32 PM
Checksum by itself is risky.

Checksum in combination with the business key is less risky.

HASHBYTES is the best for determining total and complete uniqueness if you are willing to deal with the overhead.

Many approaches exist for SCDs depending on data volumes, load windows, and a few other factors.
Post #1536096