Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 
        
Home       Members    Calendar    Who's On


Add to briefcase

Fast normalization of a denormalized data source, using surrogate IDs Expand / Collapse
Author
Message
Posted Saturday, January 23, 2010 10:03 PM
SSC Journeyman

SSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC Journeyman

Group: General Forum Members
Last Login: Today @ 5:04 AM
Points: 81, Visits: 902
Comments posted to this topic are about the item Fast normalization of a denormalized data source, using surrogate IDs
Post #852674
Posted Monday, January 25, 2010 2:20 AM
Grasshopper

GrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopper

Group: General Forum Members
Last Login: Tuesday, September 3, 2013 3:02 PM
Points: 21, Visits: 108
Very interesting approach. Without more test data I cant say is its any faster than straight inserts (below) but a nicely written article. Many thanks for taking the time to share.


INSERT INTO dbo.NT_Client(ClientName,Town)
SELECT DISTINCT ClientName,Town FROM dbo.NT_SourceData

INSERT INTO dbo.NT_OrderHeader (Client_ID,OrderNumber,OrderDate)
SELECT DISTINCT c.ID,d.OrderNumber,d.OrderDate
FROM dbo.NT_SourceData d INNER JOIN dbo.NT_Client c ON c.ClientName=d.ClientName AND c.Town=d.Town


INSERT INTO dbo.NT_OrderDetail (OrderHeader_ID,ItemNo,ItemQuantity,ItemCost,ItemTotal)
SELECT DISTINCT h.ID, d.ItemID,d.ItemQty,d.ItemCost,d.ItemTotal
FROM dbo.NT_SourceData d INNER JOIN dbo.NT_OrderHeader h ON d.OrderNumber=h.OrderNumber
Post #852881
Posted Monday, January 25, 2010 2:59 AM
Hall of Fame

Hall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of Fame

Group: General Forum Members
Last Login: Friday, May 9, 2014 12:47 AM
Points: 3,448, Visits: 4,407
Nice example of set-based data processing.

I have a few minor notes here:

1)
SELECT @Client_ID = ISNULL(IDENT_CURRENT('NT_Client'),0) + 1
SELECT @OrderHeader_ID = ISNULL(IDENT_CURRENT('NT_OrderHeader'),0) + 1

Note the use of ISNULL - this allows you to provide an initial identity value even if the table is empty.

The IDENT_CURRENT() function returns the identity seed value if a table is empty. It can be shown by this example:
create table #test (id int identity (10,3))
select ident_current('#test') -- the table is empty, the result is 10
insert #test default values -- the table has one row, the result is 10
select ident_current('#test')
insert #test default values
select ident_current('#test') -- the table has two rows, the result is 13

This behavior is identical at least in MSSQL 2000, 2005, and 2008.

2) It's better to insert data into the tables in one transaction:
BEGIN TRANSACTION
INSERT INTO dbo.NT_Client ...
INSERT INTO dbo.NT_OrderHeader ...
INSERT INTO dbo.NT_OrderDetail ...
COMMIT TRANSACTION

If something fails during the inserts, we'll not get incomplete data (order headers without order details etc.).

3) If data are loaded to the system that is currently in use and someone adds new rows to the tables, you may receive an error like this: "Violation of PRIMARY KEY constraint 'PK_NT_OrderHeader'. Cannot insert duplicate key in object 'dbo.NT_OrderHeader'."

For example, you have orders with header IDs = 1, 2, 3, get @OrderHeader_ID = 4, then prepare #Tmp_SourceData. The value '4' will be amongst the header IDs in #Tmp_SourceData. At the same time some user creates a new order with Header_ID = 4. You'll receive the above error message while inserting into dbo.NT_OrderHeader.

This issue may be avoided in this way (supposing that no more than 1000 clients and 10000 orders may be created by other users during our data load):
SELECT @Client_ID = IDENT_CURRENT('NT_Client') + 1000
SELECT @OrderHeader_ID = IDENT_CURRENT('NT_OrderHeader') + 10000

Edit: grammar error
Post #852890
Posted Monday, January 25, 2010 12:07 PM
SSC Rookie

SSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC Rookie

Group: General Forum Members
Last Login: Wednesday, April 30, 2014 4:17 AM
Points: 27, Visits: 160
I have to say that my first reaction to reading this article was that it was complete overkill and that the solution should be to insert into the client table then either join onto that table to do the following inserts (as has been proposed above by admin-499013) or by updating the source data table with the IDs using a join from NT_Client onto NT_SourceData so to still keep it all set based.

This just gets repeated for the following tables, but is only more useful than the approach by admin-499013 if you are either wanting to use the source data again, or if you are after more error checking so each step can be checked as being correct and compared against the source data.

This approach also allows the system to carry on running without you requiring exclusive access to stop other users from upsetting the existing IDs as you are running in the data.


Adam Aspin, can you explain why myself and admin-499013 would be wrong in our approach?

I can see the advantage of your approach compared to a cursor, but it does come with a serious impact upon the users of the system being denied whilst the data is being run-in. This is not a problem with the distinct insert and the re-join approach.



Post #853215
Posted Monday, January 25, 2010 4:00 PM


SSC-Insane

SSC-InsaneSSC-InsaneSSC-InsaneSSC-InsaneSSC-InsaneSSC-InsaneSSC-InsaneSSC-InsaneSSC-InsaneSSC-InsaneSSC-Insane

Group: General Forum Members
Last Login: Today @ 8:38 AM
Points: 21,739, Visits: 15,428
Thanks for the article. Nice idea and well written.



Jason AKA CirqueDeSQLeil
I have given a name to my pain...
MCM SQL Server


SQL RNNR

Posting Performance Based Questions - Gail Shaw
Posting Data Etiquette - Jeff Moden
Hidden RBAR - Jeff Moden
VLFs and the Tran Log - Kimberly Tripp
Post #853368
Posted Tuesday, January 26, 2010 3:20 AM
SSC Journeyman

SSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC Journeyman

Group: General Forum Members
Last Login: Today @ 5:04 AM
Points: 81, Visits: 902
Thanks to all for the comments, and for isolating the points needing clarification.

Certainly, this is not the only way to do things, and it is only a suggestion, wich probably only suits a very limited set of circumstances. So let me clarify the circumstances:

* This was part of an ETL process, where it was the only process in a sequence using the server - so no other active users. I certainly would not use it in a multi-user scenario. Locking & blocking would be inevitable.
* The box had many processors & large amounts of memory - and as is recommended practice, one tempdb file per core on a fast array

Sorry to have to be vague about exact 'specs, my client is nervy about any disclosure, and I have to get articles vetted...(!)

And some more context - the destination tables contain 100 million-odd records at this stage in the process.

So, isolating the ID sequencing to the source data (and avoiding joins to destination tables), as well as use of temp tables was more hard work to set up - but faster in this (possibly rare) context.

What I will do if & when I get the time is try some comparisons between the two approaches, and see if varying the context (source file no of records, No of records in destination tables etc) provides any thing decent to comment on. However I cannot promise this for anytime soon, I am afraid.

Thanks again for taking the time to comment - and I am sorry that this context was not made clearer in the article.

Adam


Post #853483
Posted Tuesday, December 14, 2010 8:21 AM
Forum Newbie

Forum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum Newbie

Group: General Forum Members
Last Login: Yesterday @ 3:16 AM
Points: 3, Visits: 307
Great Article..natural keys is not feasible in my current project and it offers another dimension to an otherwise problematic area
Post #1034463
Posted Wednesday, December 15, 2010 2:27 AM
Forum Newbie

Forum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum Newbie

Group: General Forum Members
Last Login: Yesterday @ 3:16 AM
Points: 3, Visits: 307
Also tried out the solution by admin-499013 and Grasshoper which is along the lines of which I'm used to when it comes to loading Normalized Tables...but without the Surrogate keys..their solution also happens to another effective solution..
Post #1034978
« Prev Topic | Next Topic »

Add to briefcase

Permissions Expand / Collapse