Fast normalization of a denormalized data source, using surrogate IDs

Question

Fast normalization of a denormalized data source, using surrogate IDs

Adam Aspin

SSCarpal Tunnel

Points: 4900
More actions
January 23, 2010 at 10:03 pm

#215954

Comments posted to this topic are about the item Fast normalization of a denormalized data source, using surrogate IDs

Viewing 8 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply

admin-499013 Mr or Mrs. 500 Points: 543 More actions · Answer 1

Very interesting approach. Without more test data I cant say is its any faster than straight inserts (below) but a nicely written article. Many thanks for taking the time to share.

INSERT INTO dbo.NT_Client(ClientName,Town)

SELECT DISTINCT ClientName,Town FROM dbo.NT_SourceData

INSERT INTO dbo.NT_OrderHeader (Client_ID,OrderNumber,OrderDate)

SELECT DISTINCT c.ID,d.OrderNumber,d.OrderDate

FROM dbo.NT_SourceData d INNER JOIN dbo.NT_Client c ON c.ClientName=d.ClientName AND c.Town=d.Town

INSERT INTO dbo.NT_OrderDetail (OrderHeader_ID,ItemNo,ItemQuantity,ItemCost,ItemTotal)

SELECT DISTINCT h.ID,d.ItemID,d.ItemQty,d.ItemCost,d.ItemTotal

FROM dbo.NT_SourceData d INNER JOIN dbo.NT_OrderHeader h ON d.OrderNumber=h.OrderNumber

Pete

vk-kirov SSCertifiable Points: 7686 More actions · Answer 2

Nice example of set-based data processing.

I have a few minor notes here:

1)

SELECT @Client_ID = ISNULL(IDENT_CURRENT('NT_Client'),0) + 1
SELECT @OrderHeader_ID = ISNULL(IDENT_CURRENT('NT_OrderHeader'),0) + 1
Note the use of ISNULL - this allows you to provide an initial identity value even if the table is empty.

The IDENT_CURRENT() function returns the identity seed value if a table is empty. It can be shown by this example:

create table #test (id int identity (10,3))

select ident_current('#test') -- the table is empty, the result is 10

insert #test default values -- the table has one row, the result is 10

select ident_current('#test')

insert #test default values

select ident_current('#test') -- the table has two rows, the result is 13

This behavior is identical at least in MSSQL 2000, 2005, and 2008.

2) It's better to insert data into the tables in one transaction:

BEGIN TRANSACTION

INSERT INTO dbo.NT_Client ...

INSERT INTO dbo.NT_OrderHeader ...

INSERT INTO dbo.NT_OrderDetail ...

COMMIT TRANSACTION

If something fails during the inserts, we'll not get incomplete data (order headers without order details etc.).

3) If data are loaded to the system that is currently in use and someone adds new rows to the tables, you may receive an error like this: "Violation of PRIMARY KEY constraint 'PK_NT_OrderHeader'. Cannot insert duplicate key in object 'dbo.NT_OrderHeader'."

For example, you have orders with header IDs = 1, 2, 3, get @OrderHeader_ID = 4, then prepare #Tmp_SourceData. The value '4' will be amongst the header IDs in #Tmp_SourceData. At the same time some user creates a new order with Header_ID = 4. You'll receive the above error message while inserting into dbo.NT_OrderHeader.

This issue may be avoided in this way (supposing that no more than 1000 clients and 10000 orders may be created by other users during our data load):

SELECT @Client_ID = IDENT_CURRENT('NT_Client') + 1000

SELECT @OrderHeader_ID = IDENT_CURRENT('NT_OrderHeader') + 10000

Edit: grammar error

sgtwilko Old Hand Points: 301 More actions · Answer 3

I have to say that my first reaction to reading this article was that it was complete overkill and that the solution should be to insert into the client table then either join onto that table to do the following inserts (as has been proposed above by admin-499013) or by updating the source data table with the IDs using a join from NT_Client onto NT_SourceData so to still keep it all set based.

This just gets repeated for the following tables, but is only more useful than the approach by admin-499013 if you are either wanting to use the source data again, or if you are after more error checking so each step can be checked as being correct and compared against the source data.

This approach also allows the system to carry on running without you requiring exclusive access to stop other users from upsetting the existing IDs as you are running in the data.

Adam Aspin, can you explain why myself and admin-499013 would be wrong in our approach?

I can see the advantage of your approach compared to a cursor, but it does come with a serious impact upon the users of the system being denied whilst the data is being run-in. This is not a problem with the distinct insert and the re-join approach.

SQLRNNR SSC Guru Points: 281336 More actions · Answer 4

Thanks for the article. Nice idea and well written.

Jason...AKA CirqueDeSQLeil
_______________________________________________
I have given a name to my pain...MCM SQL Server, MVP
SQL RNNR
Posting Performance Based Questions - Gail Shaw[/url]
Learn Extended Events

Adam Aspin SSCarpal Tunnel Points: 4900 More actions · Answer 5

Thanks to all for the comments, and for isolating the points needing clarification.

Certainly, this is not the only way to do things, and it is only a suggestion, wich probably only suits a very limited set of circumstances. So let me clarify the circumstances:

* This was part of an ETL process, where it was the only process in a sequence using the server - so no other active users. I certainly would not use it in a multi-user scenario. Locking & blocking would be inevitable.

* The box had many processors & large amounts of memory - and as is recommended practice, one tempdb file per core on a fast array

Sorry to have to be vague about exact 'specs, my client is nervy about any disclosure, and I have to get articles vetted...(!)

And some more context - the destination tables contain 100 million-odd records at this stage in the process.

So, isolating the ID sequencing to the source data (and avoiding joins to destination tables), as well as use of temp tables was more hard work to set up - but faster in this (possibly rare) context.

What I will do if & when I get the time is try some comparisons between the two approaches, and see if varying the context (source file no of records, No of records in destination tables etc) provides any thing decent to comment on. However I cannot promise this for anytime soon, I am afraid.

Thanks again for taking the time to comment - and I am sorry that this context was not made clearer in the article.

Adam

astmart SSC Journeyman Points: 75 More actions · Answer 6

Great Article..natural keys is not feasible in my current project and it offers another dimension to an otherwise problematic area

astmart SSC Journeyman Points: 75 More actions · Answer 7

Also tried out the solution by admin-499013 and Grasshoper which is along the lines of which I'm used to when it comes to loading Normalized Tables...but without the Surrogate keys..their solution also happens to another effective solution..