Click here to monitor SSC
SQLServerCentral is supported by Redgate
 
Log in  ::  Register  ::  Not logged in
 
 
 


Fast normalization of a denormalized data source, using surrogate IDs


Fast normalization of a denormalized data source, using surrogate IDs

Author
Message
Adam Aspin
Adam Aspin
SSC-Enthusiastic
SSC-Enthusiastic (183 reputation)SSC-Enthusiastic (183 reputation)SSC-Enthusiastic (183 reputation)SSC-Enthusiastic (183 reputation)SSC-Enthusiastic (183 reputation)SSC-Enthusiastic (183 reputation)SSC-Enthusiastic (183 reputation)SSC-Enthusiastic (183 reputation)

Group: General Forum Members
Points: 183 Visits: 1028
Comments posted to this topic are about the item Fast normalization of a denormalized data source, using surrogate IDs
admin-499013
admin-499013
Grasshopper
Grasshopper (23 reputation)Grasshopper (23 reputation)Grasshopper (23 reputation)Grasshopper (23 reputation)Grasshopper (23 reputation)Grasshopper (23 reputation)Grasshopper (23 reputation)Grasshopper (23 reputation)

Group: General Forum Members
Points: 23 Visits: 114
Very interesting approach. Without more test data I cant say is its any faster than straight inserts (below) but a nicely written article. Many thanks for taking the time to share.


INSERT INTO dbo.NT_Client(ClientName,Town)
   SELECT DISTINCT ClientName,Town FROM dbo.NT_SourceData
   
INSERT INTO dbo.NT_OrderHeader (Client_ID,OrderNumber,OrderDate)
   SELECT DISTINCT c.ID,d.OrderNumber,d.OrderDate
      FROM dbo.NT_SourceData d INNER JOIN dbo.NT_Client c ON c.ClientName=d.ClientName AND c.Town=d.Town
      
      
INSERT INTO dbo.NT_OrderDetail (OrderHeader_ID,ItemNo,ItemQuantity,ItemCost,ItemTotal)
   SELECT DISTINCT h.ID,   d.ItemID,d.ItemQty,d.ItemCost,d.ItemTotal
      FROM dbo.NT_SourceData d INNER JOIN dbo.NT_OrderHeader h ON d.OrderNumber=h.OrderNumber
vk-kirov
vk-kirov
Hall of Fame
Hall of Fame (3.5K reputation)Hall of Fame (3.5K reputation)Hall of Fame (3.5K reputation)Hall of Fame (3.5K reputation)Hall of Fame (3.5K reputation)Hall of Fame (3.5K reputation)Hall of Fame (3.5K reputation)Hall of Fame (3.5K reputation)

Group: General Forum Members
Points: 3494 Visits: 4408
Nice example of set-based data processing.

I have a few minor notes here:

1)
SELECT @Client_ID = ISNULL(IDENT_CURRENT('NT_Client'),0) + 1
SELECT @OrderHeader_ID = ISNULL(IDENT_CURRENT('NT_OrderHeader'),0) + 1

Note the use of ISNULL - this allows you to provide an initial identity value even if the table is empty.

The IDENT_CURRENT() function returns the identity seed value if a table is empty. It can be shown by this example:
create table #test (id int identity (10,3))
select ident_current('#test') -- the table is empty, the result is 10
insert #test default values -- the table has one row, the result is 10
select ident_current('#test')
insert #test default values
select ident_current('#test') -- the table has two rows, the result is 13


This behavior is identical at least in MSSQL 2000, 2005, and 2008.

2) It's better to insert data into the tables in one transaction:
BEGIN TRANSACTION
INSERT INTO dbo.NT_Client ...
INSERT INTO dbo.NT_OrderHeader ...
INSERT INTO dbo.NT_OrderDetail ...
COMMIT TRANSACTION


If something fails during the inserts, we'll not get incomplete data (order headers without order details etc.).

3) If data are loaded to the system that is currently in use and someone adds new rows to the tables, you may receive an error like this: "Violation of PRIMARY KEY constraint 'PK_NT_OrderHeader'. Cannot insert duplicate key in object 'dbo.NT_OrderHeader'."

For example, you have orders with header IDs = 1, 2, 3, get @OrderHeader_ID = 4, then prepare #Tmp_SourceData. The value '4' will be amongst the header IDs in #Tmp_SourceData. At the same time some user creates a new order with Header_ID = 4. You'll receive the above error message while inserting into dbo.NT_OrderHeader.

This issue may be avoided in this way (supposing that no more than 1000 clients and 10000 orders may be created by other users during our data load):
SELECT @Client_ID = IDENT_CURRENT('NT_Client') + 1000
SELECT @OrderHeader_ID = IDENT_CURRENT('NT_OrderHeader') + 10000



Edit: grammar error
sgtwilko
sgtwilko
SSC Rookie
SSC Rookie (32 reputation)SSC Rookie (32 reputation)SSC Rookie (32 reputation)SSC Rookie (32 reputation)SSC Rookie (32 reputation)SSC Rookie (32 reputation)SSC Rookie (32 reputation)SSC Rookie (32 reputation)

Group: General Forum Members
Points: 32 Visits: 177
I have to say that my first reaction to reading this article was that it was complete overkill and that the solution should be to insert into the client table then either join onto that table to do the following inserts (as has been proposed above by admin-499013) or by updating the source data table with the IDs using a join from NT_Client onto NT_SourceData so to still keep it all set based.

This just gets repeated for the following tables, but is only more useful than the approach by admin-499013 if you are either wanting to use the source data again, or if you are after more error checking so each step can be checked as being correct and compared against the source data.

This approach also allows the system to carry on running without you requiring exclusive access to stop other users from upsetting the existing IDs as you are running in the data.


Adam Aspin, can you explain why myself and admin-499013 would be wrong in our approach?

I can see the advantage of your approach compared to a cursor, but it does come with a serious impact upon the users of the system being denied whilst the data is being run-in. This is not a problem with the distinct insert and the re-join approach.



SQLRNNR
SQLRNNR
SSC-Insane
SSC-Insane (21K reputation)SSC-Insane (21K reputation)SSC-Insane (21K reputation)SSC-Insane (21K reputation)SSC-Insane (21K reputation)SSC-Insane (21K reputation)SSC-Insane (21K reputation)SSC-Insane (21K reputation)

Group: General Forum Members
Points: 21071 Visits: 18259
Thanks for the article. Nice idea and well written.



Jason AKA CirqueDeSQLeil
I have given a name to my pain...
MCM SQL Server, MVP


SQL RNNR

Posting Performance Based Questions - Gail Shaw

Adam Aspin
Adam Aspin
SSC-Enthusiastic
SSC-Enthusiastic (183 reputation)SSC-Enthusiastic (183 reputation)SSC-Enthusiastic (183 reputation)SSC-Enthusiastic (183 reputation)SSC-Enthusiastic (183 reputation)SSC-Enthusiastic (183 reputation)SSC-Enthusiastic (183 reputation)SSC-Enthusiastic (183 reputation)

Group: General Forum Members
Points: 183 Visits: 1028
Thanks to all for the comments, and for isolating the points needing clarification.

Certainly, this is not the only way to do things, and it is only a suggestion, wich probably only suits a very limited set of circumstances. So let me clarify the circumstances:

* This was part of an ETL process, where it was the only process in a sequence using the server - so no other active users. I certainly would not use it in a multi-user scenario. Locking & blocking would be inevitable.
* The box had many processors & large amounts of memory - and as is recommended practice, one tempdb file per core on a fast array

Sorry to have to be vague about exact 'specs, my client is nervy about any disclosure, and I have to get articles vetted...(!)

And some more context - the destination tables contain 100 million-odd records at this stage in the process.

So, isolating the ID sequencing to the source data (and avoiding joins to destination tables), as well as use of temp tables was more hard work to set up - but faster in this (possibly rare) context.

What I will do if & when I get the time is try some comparisons between the two approaches, and see if varying the context (source file no of records, No of records in destination tables etc) provides any thing decent to comment on. However I cannot promise this for anytime soon, I am afraid.

Thanks again for taking the time to comment - and I am sorry that this context was not made clearer in the article.

Adam
astmart
astmart
Forum Newbie
Forum Newbie (5 reputation)Forum Newbie (5 reputation)Forum Newbie (5 reputation)Forum Newbie (5 reputation)Forum Newbie (5 reputation)Forum Newbie (5 reputation)Forum Newbie (5 reputation)Forum Newbie (5 reputation)

Group: General Forum Members
Points: 5 Visits: 321
Great Article..natural keys is not feasible in my current project and it offers another dimension to an otherwise problematic area
astmart
astmart
Forum Newbie
Forum Newbie (5 reputation)Forum Newbie (5 reputation)Forum Newbie (5 reputation)Forum Newbie (5 reputation)Forum Newbie (5 reputation)Forum Newbie (5 reputation)Forum Newbie (5 reputation)Forum Newbie (5 reputation)

Group: General Forum Members
Points: 5 Visits: 321
Also tried out the solution by admin-499013 and Grasshoper which is along the lines of which I'm used to when it comes to loading Normalized Tables...but without the Surrogate keys..their solution also happens to another effective solution..
Go


Permissions

You can't post new topics.
You can't post topic replies.
You can't post new polls.
You can't post replies to polls.
You can't edit your own topics.
You can't delete your own topics.
You can't edit other topics.
You can't delete other topics.
You can't edit your own posts.
You can't edit other posts.
You can't delete your own posts.
You can't delete other posts.
You can't post events.
You can't edit your own events.
You can't edit other events.
You can't delete your own events.
You can't delete other events.
You can't send private messages.
You can't send emails.
You can read topics.
You can't vote in polls.
You can't upload attachments.
You can download attachments.
You can't post HTML code.
You can't edit HTML code.
You can't post IFCode.
You can't post JavaScript.
You can post emoticons.
You can't post or upload images.

Select a forum

































































































































































SQLServerCentral


Search