Manipulating data in a 48 million record table


Abu Dina
I have the following query, which reads from a 48 million record table, transforms several columns using SQL CLR (C#) functions, then inserts the results into a new table.

Initially I ran the code against the entire table, but after 3 hours it failed because of some dodgy data in one of the columns! So now I want to process the table in chunks of 1 million records at a time.

However, as the Loading_B2C_keys_ table fills up, each insert gets slower and slower!

I've tried adding a non-clustered index on the ID column on both tables, but it's not having much effect. Is there a better way to do this?

By the way, the ID column appears to contain a mixture of integers and GUIDs. Should I create a new Identity column on the big table and use that as my ID? Would that help?

INSERT INTO dbo.Loading_B2C_keys_ (
    ID,
    mkNameKey,
    mkName1,
    mkName2,
    mkName3,
    mkNormalizedName,
    mkOrganizationKey,
    mkNormalizedOrganization,
    mkOrgName1,
    mkOrgName2,
    mkorgName3,
    mkPostIn,
    mkPostOut,
    mkPhoneticStreet,
    mkPremise,
    mkPhoneticTown,
    mkEmailAddress)
SELECT TOP (1000000)
    ID,
    -- Name keys
    dbo.clrFn_DoubleMetaphone(dbo.clrFn_GetLastWord(SURNAME)) + LEFT(FIRSTNAME, 1),
    dbo.clrFn_DoubleMetaphone(dbo.clrFn_GetLastWord(SURNAME)),
    dbo.clrFn_DoubleMetaphone(dbo.clrFn_GetFirstWord(FIRSTNAME)),
    dbo.clrFn_DoubleMetaphone(dbo.clrFn_GetFirstWord(SECOND_INIT_NAME)),
    UPPER(dbo.clrFn_GetLastWord(SURNAME)) + ',' + UPPER(dbo.clrFn_GetFirstWord(FIRSTNAME)) + ',' + UPPER(dbo.clrFn_GetFirstWord(SECOND_INIT_NAME)),
    -- Organization keys are not populated
    NULL,
    NULL,
    NULL,
    NULL,
    NULL,
    -- Address keys
    dbo.clrFn_SplitUKPostCode(POSTCODE, 2),
    dbo.clrFn_SplitUKPostCode(POSTCODE, 1),
    dbo.clrFn_DoubleMetaphone(dbo.clrFn_RemoveDigits(ISNULL(AD1, '') + ' ' + ISNULL(AD2, ''))),
    dbo.clrFn_GetDigits(ISNULL(AD1, '') + ' ' + ISNULL(AD2, '')),
    dbo.clrFn_DoubleMetaphone(dbo.clrFn_RemoveDigits(TOWN)),
    Email
FROM dbo.Loading_B2C AS a
-- Skip rows that an earlier batch has already inserted
WHERE NOT EXISTS (SELECT 1 FROM dbo.Loading_B2C_keys_ AS b WHERE a.ID = b.ID)



---------------------------------------------------------


It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
David Edwards - Media lens

Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
Howard Zinn
salliven
Please check this code:
-- Session 1: copy rows in batches of 1,000,000, keyed on an integer id
declare
    @from_id int = 0
   ,@to_id int = 0
   ,@rowcnt int = -1

-- Global temp table so that a second session can monitor progress
create table ##tmp_move_date (id int primary key, from_id int, rowcnt int)
insert into ##tmp_move_date (id, from_id, rowcnt) values (1, 0, 0)
select @from_id = from_id from ##tmp_move_date where id = 1

while (@rowcnt <> 0)
begin
    -- Upper bound of this batch: the highest id among the next 1,000,000 rows
    select @to_id = max(id)
    from (select top (1000000) id from old_table where id >= @from_id order by id) as batch

    if @to_id is null break -- no rows left to copy

    begin tran
        insert into new_table <fields>
        select <fields> from old_table where id between @from_id and @to_id

        set @rowcnt = @@ROWCOUNT
        set @from_id = @to_id + 1

        -- Record progress for the monitoring session
        update ##tmp_move_date
        set from_id = @from_id,
            rowcnt = rowcnt + @rowcnt
        where id = 1
    commit
end

-- Session 2: Monitoring
select * from ##tmp_move_date with (nolock)


sturner
Abu Dina (2/22/2013)

By the way, the ID column appears to contain a mixture of integers and GUIDs. Should I create a new Identity column on the big table and use that as my ID? Would that help?


Yes, and I would do the same on the destination table, and make it the clustered index, because you want to make sure all the inserts go at the end of the table.

Also, if you had an identity ID on the big table, you could change the "WHERE NOT EXISTS (SELECT 1 FROM dbo.Loading_B2C_keys_ as b WHERE a.ID = b.ID)" in your inserts to "WHERE ID > @NextChunkID" instead. You would increment @NextChunkID by 1000000 (or whatever your TOP (N) is) in each batch loop.
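A minimal sketch of that range-based loop, not the code anyone in the thread actually ran. It assumes the big table has been given an integer IDENTITY column, called RowID here (a hypothetical name), and that the destination ID column stores that value. The column list is shortened to two columns for illustration. Because TOP without ORDER BY does not guarantee which rows come back, and identity values can have gaps, the sketch bounds each batch with an explicit ID range instead of relying on TOP.

```sql
-- Assumption: dbo.Loading_B2C has an integer IDENTITY column named RowID (hypothetical),
-- and dbo.Loading_B2C_keys_.ID holds that integer value.
DECLARE @BatchSize   int = 1000000;
DECLARE @NextChunkID int = 0;      -- upper bound of the range already processed
DECLARE @MaxID       int;

SELECT @MaxID = MAX(RowID) FROM dbo.Loading_B2C;

-- If the table is empty, @MaxID is NULL and the loop body never runs.
WHILE @NextChunkID < @MaxID
BEGIN
    INSERT INTO dbo.Loading_B2C_keys_ (ID, mkEmailAddress)   -- column list shortened for illustration
    SELECT a.RowID, a.Email
    FROM dbo.Loading_B2C AS a
    WHERE a.RowID >  @NextChunkID
      AND a.RowID <= @NextChunkID + @BatchSize;

    SET @NextChunkID = @NextChunkID + @BatchSize;
END;
```

With this shape each batch is a range seek on RowID, so the cost per batch should not grow as the destination table fills up. A batch may hold fewer than @BatchSize rows if the identity sequence has gaps.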

If I were doing this, though, I would consider doing the transformations and writing the rows to a CSV file, then bulk inserting the end results back into a table. I think that would be faster than the way you are doing it.
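For the load step, something along these lines might work. This is a sketch under assumptions: the file path is hypothetical, the file has no header row, and its columns are in the same order as the table. It uses a pipe rather than a comma as the field delimiter, because mkNormalizedName in the original query is itself built from comma-separated parts.

```sql
-- Assumption: the transformed rows were written to C:\load\b2c_keys.txt (hypothetical path),
-- pipe-delimited, one row per line, no header, columns in table order.
BULK INSERT dbo.Loading_B2C_keys_
FROM 'C:\load\b2c_keys.txt'
WITH (
    FIELDTERMINATOR = '|',
    ROWTERMINATOR   = '\n',
    BATCHSIZE       = 1000000,   -- commit every 1,000,000 rows
    TABLOCK                      -- table-level lock; can allow minimal logging, depending on recovery model
);
```

With BATCHSIZE set, a failure part-way through only rolls back the current batch, which addresses the original problem of losing 3 hours of work to one bad row.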

The probability of survival is inversely proportional to the angle of arrival.
Abu Dina
Thanks for this suggestion.

I managed to find a workaround for this. The original ID column was a varchar column with a mixture of INTs and GUIDs (that's how the data was supplied), so I created a new table with an added IDENTITY column and then indexed that column. I am now using this new ID in my NOT EXISTS clause.

After inserting 10 million rows as an initial load I can now do 10k rows in 2 seconds.
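The exact statements are not shown in the thread. A sketch of that kind of change might look like the following, where the table name Loading_B2C_new, the column names RowID and SourceID, the data types, and the shortened column list are all assumptions made for illustration.

```sql
-- Hypothetical names and types throughout; only a few source columns are shown.
CREATE TABLE dbo.Loading_B2C_new (
    RowID    int IDENTITY(1,1) NOT NULL,   -- new surrogate key
    SourceID varchar(50)  NULL,            -- the original mixed INT/GUID identifier
    SURNAME  varchar(100) NULL,
    Email    varchar(255) NULL
);

-- Copy the source rows; RowID is generated automatically.
INSERT INTO dbo.Loading_B2C_new (SourceID, SURNAME, Email)
SELECT ID, SURNAME, Email
FROM dbo.Loading_B2C;

-- Index the new key so that lookups and range scans on RowID are cheap.
CREATE UNIQUE CLUSTERED INDEX IX_Loading_B2C_new_RowID
    ON dbo.Loading_B2C_new (RowID);
```

Comparing 4-byte integers in the NOT EXISTS check is cheaper than comparing varchar values, which is consistent with the speed-up reported above.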

Abu Dina
sturner (2/22/2013)
Abu Dina (2/22/2013)

By the way, the ID column appears to contain a mixture of integers and GUIDs. Should I create a new Identity column on the big table and use that as my ID? Would that help?


Yes, and I would do the same on the destination table, and make it the clustered index, because you want to make sure all the inserts go at the end of the table.

Also, if you had an identity ID on the big table, you could change the "WHERE NOT EXISTS (SELECT 1 FROM dbo.Loading_B2C_keys_ as b WHERE a.ID = b.ID)" in your inserts to "WHERE ID > @NextChunkID" instead. You would increment @NextChunkID by 1000000 (or whatever your TOP (N) is) in each batch loop.


Heh, I saw your reply come in just as I was about to post my previous one!

The new identity column seems to have done the trick, although I'm still using the NOT EXISTS check. I will try your method to see if I can get better performance.
