Manipulating data in a 48 million record table

I have the following query, which reads data from a 48 million record table, transforms several columns using SQL CLR C# functions, and then puts the results into a new table.

Initially I ran the code against the entire table, but after 3 hours it failed due to some dodgy data in one of the columns! So now I want to process the table in chunks of 1 million records at a time.

However, as the Loading_B2C_keys_ table accumulates more and more data, each insert is getting slower and slower!

I've tried adding a nonclustered index on the ID column of both tables (a sketch of the index is shown after the query below), but it's not having much effect. Is there a better way to do this?

    By the way, the ID column appears to contain a mixture of integers and GUIDs. Should I create a new Identity column on the big table and use that as my ID? Would that help?

    INSERT INTO dbo.Loading_B2C_keys_ (
        ID,
        mkNameKey,
        mkName1,
        mkName2,
        mkName3,
        mkNormalizedName,
        mkOrganizationKey,
        mkNormalizedOrganization,
        mkOrgName1,
        mkOrgName2,
        mkOrgName3,
        mkPostIn,
        mkPostOut,
        mkPhoneticStreet,
        mkPremise,
        mkPhoneticTown,
        mkEmailAddress)
    SELECT TOP 1000000
        ID,
        dbo.clrFn_DoubleMetaphone(dbo.clrFn_GetLastWord(SURNAME)) + LEFT(FIRSTNAME, 1),
        dbo.clrFn_DoubleMetaphone(dbo.clrFn_GetLastWord(SURNAME)),
        dbo.clrFn_DoubleMetaphone(dbo.clrFn_GetFirstWord(FIRSTNAME)),
        dbo.clrFn_DoubleMetaphone(dbo.clrFn_GetFirstWord(SECOND_INIT_NAME)),
        UPPER(dbo.clrFn_GetLastWord(SURNAME)) + ',' + UPPER(dbo.clrFn_GetFirstWord(FIRSTNAME)) + ',' + UPPER(dbo.clrFn_GetFirstWord(SECOND_INIT_NAME)),
        NULL,
        NULL,
        NULL,
        NULL,
        NULL,
        dbo.clrFn_SplitUKPostCode(POSTCODE, 2),
        dbo.clrFn_SplitUKPostCode(POSTCODE, 1),
        dbo.clrFn_DoubleMetaphone(dbo.clrFn_RemoveDigits(ISNULL(AD1, '') + ' ' + ISNULL(AD2, ''))),
        dbo.clrFn_GetDigits(ISNULL(AD1, '') + ' ' + ISNULL(AD2, '')),
        dbo.clrFn_DoubleMetaphone(dbo.clrFn_RemoveDigits(TOWN)),
        Email
    FROM dbo.Loading_B2C AS a
    WHERE NOT EXISTS (SELECT 1 FROM dbo.Loading_B2C_keys_ AS b WHERE a.ID = b.ID)
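
    For reference, a minimal sketch of the kind of nonclustered index mentioned above (the index names are my own, not from the original schema):

    -- Assumed index names; one index on each side of the NOT EXISTS check
    CREATE NONCLUSTERED INDEX IX_Loading_B2C_ID ON dbo.Loading_B2C (ID);
    CREATE NONCLUSTERED INDEX IX_Loading_B2C_keys_ID ON dbo.Loading_B2C_keys_ (ID);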


  • Please check this code:

    -- Session 1: copy the table in batches of 1 million rows
    declare
        @from_id int = 0
        ,@to_id int = 0
        ,@rowcnt int = -1

    -- global temp table (##) so a second session can watch progress
    create table ##tmp_move_date (id int primary key, from_id int, rowcnt int)
    insert into ##tmp_move_date (id, from_id, rowcnt) values (1, 0, 0)

    -- pick up the starting point (lets a restarted run resume where it stopped)
    select @from_id = from_id from ##tmp_move_date where id = 1

    while (@rowcnt <> 0)
    begin
        -- find the upper id boundary of the next batch of 1 million rows
        -- (>= rather than >, so a row whose id equals @from_id is never skipped)
        select top 1000000 @to_id = id from old_table where id >= @from_id order by id

        begin tran

        insert into new_table (<fields>)
        select <fields> from old_table where id between @from_id and @to_id
        -- order by id

        set @rowcnt = @@ROWCOUNT
        set @from_id = @to_id + 1

        -- record progress for the monitoring session
        update ##tmp_move_date
        set from_id = @from_id,
            rowcnt = rowcnt + @rowcnt
        where id = 1

        commit
    end

    -- Session 2: Monitoring
    select * from ##tmp_move_date with (nolock)
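
    One design note on the sketch above: the double pound sign makes ##tmp_move_date a global temporary table, which is what allows the second session to query the loop's progress while it runs. Each batch also commits in its own transaction, so a failure part-way through only loses the current chunk.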

  • Abu Dina (2/22/2013)


    By the way, the ID column appears to contain a mixture of integers and GUIDs. Should I create a new Identity column on the big table and use that as my ID? Would that help?

Yes... and I would do the same on the destination table... and make it a clustered index, because you want to make sure all the inserts go at the end of the table.

Also, if you had an identity ID on the big table, you could change the "WHERE NOT EXISTS (SELECT 1 FROM dbo.Loading_B2C_keys_ as b WHERE a.ID = b.ID)" in your inserts to "WHERE ID > @NextChunkID" instead. You would increment @NextChunkID by 1000000 (or whatever your TOP (N) is) in each batch loop, as in the sketch below.
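
    A minimal sketch of that batch loop, assuming the new identity column is called RowID (the names here are illustrative, not the actual schema):

    -- RowID is an assumed name for the new identity column on dbo.Loading_B2C
    DECLARE @NextChunkID INT = 0;
    DECLARE @MaxID INT = (SELECT MAX(RowID) FROM dbo.Loading_B2C);

    WHILE @NextChunkID < @MaxID
    BEGIN
        INSERT INTO dbo.Loading_B2C_keys_ (ID, mkNameKey /* ...remaining columns... */)
        SELECT ID,
               dbo.clrFn_DoubleMetaphone(dbo.clrFn_GetLastWord(SURNAME)) + LEFT(FIRSTNAME, 1)
               /* ...remaining expressions... */
        FROM dbo.Loading_B2C
        WHERE RowID >  @NextChunkID
          AND RowID <= @NextChunkID + 1000000;  -- one 1M-row range per iteration

        SET @NextChunkID = @NextChunkID + 1000000;
    END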

If I were doing this, though, I would consider doing the transformations and writing the rows to a CSV file, then bulk inserting the end result back into a table. That would be faster than the way you are doing it, I think.
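
    The bulk-load step would look something along these lines (the file path and options are assumptions, not values from the thread):

    -- Load the transformed CSV back into the keys table; path and options are illustrative
    BULK INSERT dbo.Loading_B2C_keys_
    FROM 'C:\temp\b2c_keys.csv'
    WITH (
        FIELDTERMINATOR = ',',
        ROWTERMINATOR = '\n',
        TABLOCK,             -- allows a minimally logged load where possible
        BATCHSIZE = 1000000
    );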


  • Thanks for this suggestion.

I managed to find a workaround for this. The original ID column was a varchar column with a mixture of INTs and GUIDs (that's how the data was supplied), so I created a new table, added an IDENTITY column, and then indexed that column. I am now using this new ID in my EXISTS clause (a sketch of the idea is shown below).

After inserting 10 million rows as an initial load, I can now do 10k rows in 2 seconds.
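
    A minimal sketch of the kind of table described (the table and column names are assumptions, not the actual schema):

    -- New staging copy of the big table with a surrogate key; names are illustrative
    CREATE TABLE dbo.Loading_B2C_staged (
        RowID INT IDENTITY(1,1) NOT NULL,
        ID    VARCHAR(50)       NOT NULL,  -- original key: a mix of INTs and GUIDs as supplied
        -- ...remaining source columns...
        CONSTRAINT PK_Loading_B2C_staged PRIMARY KEY CLUSTERED (RowID)
    );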


  • sturner (2/22/2013)


    Abu Dina (2/22/2013)


    By the way, the ID column appears to contain a mixture of integers and GUIDs. Should I create a new Identity column on the big table and use that as my ID? Would that help?

Yes... and I would do the same on the destination table... and make it a clustered index, because you want to make sure all the inserts go at the end of the table.

Also, if you had an identity ID on the big table, you could change the "WHERE NOT EXISTS (SELECT 1 FROM dbo.Loading_B2C_keys_ as b WHERE a.ID = b.ID)" in your inserts to "WHERE ID > @NextChunkID" instead. You would increment @NextChunkID by 1000000 (or whatever your TOP (N) is) in each batch loop.

    heh.. I saw your reply come in as I was about to post my previous one!

The new identity column seems to have done the trick, although I'm still using the EXISTS. I'll try your method to see if I can get better performance.

