Big data: transfer x-number of rows per batch?

  • Hi all,

    I must transfer a big table (231k rows, which is not that big, but 735 GB because of a blob column) to another location by INSERT INTO ... SELECT ... FROM.

    It appears that the whole source table is being loaded first, before any record is actually inserted into the destination table. But there's not enough disk space, because tempdb just grows and grows.

    So, to find a solution for this, I thought: if I loop through batches of, let's say, 1000 or 5000 records, maybe tempdb wouldn't grow that much? Or do I need the space (double the table size?) anyway?

    So, how do I set a counter from record 1 to 1000, select those records and insert them, then advance the counter to 1001 through 2000, and so on?

    Thanks in advance!

  • Super_Grover (10/23/2016)


    Hi all,

    I must transfer a big table (231k rows, which is not that big, but 735 GB because of a blob column) to another location by INSERT INTO ... SELECT ... FROM.

    It appears that the whole source table is being loaded first, before any record is actually inserted into the destination table. But there's not enough disk space, because tempdb just grows and grows.

    So, to find a solution for this, I thought: if I loop through batches of, let's say, 1000 or 5000 records, maybe tempdb wouldn't grow that much? Or do I need the space (double the table size?) anyway?

    So, how do I set a counter from record 1 to 1000, select those records and insert them, then advance the counter to 1001 through 2000, and so on?

    Thanks in advance!

    Do you have a unique key on the table? If so, then it's easy to use that key for the batch sizes; something like the sketch below.

    😎

    231k rows totaling 735 GB works out to roughly 3.25 MB per row; what kind of data are you storing there?
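
    Something like this is what I have in mind, untested and with placeholder names (dbo.SourceTable, dbo.TargetTable and an integer key column KeyCol are only stand-ins, and both tables are assumed to have the same column layout):

    DECLARE @BatchSize int = 1000;
    DECLARE @LastKey int = NULL;                    -- NULL means nothing has been copied yet

    WHILE 1 = 1
    BEGIN
        -- copy the next slice of rows in key order
        INSERT INTO dbo.TargetTable
        SELECT TOP (@BatchSize) *
        FROM dbo.SourceTable
        WHERE @LastKey IS NULL OR KeyCol > @LastKey
        ORDER BY KeyCol;

        IF @@ROWCOUNT = 0 BREAK;                    -- nothing left to copy

        -- highest key copied so far (cheap if the key is indexed on the destination)
        SELECT @LastKey = MAX(KeyCol) FROM dbo.TargetTable;
    END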

  • Eirikur Eiriksson (10/23/2016)


    Do you have a unique key on the table? If so, then it's easy to use that key for the batch sizes.

    😎

    231k rows totaling 735 GB works out to roughly 3.25 MB per row; what kind of data are you storing there?

    Oh sorry, I forgot to describe the table.

    It's for testing/staging/comparison purposes, and it's a first rough version (not under our control/design), so there are no keys or indexes defined. Nothing.

    It's just 3 columns:

    - doc_number (nvarchar(20))

    - doc_name (nvarchar(255))

    - blob (varbinary(max))

    Some data:

    doc_number | doc_name | BLOB

    PSD1 | | 0x025546955565D31546B..........

    PSD2 | | 0x025546955565D31546B...........

    PSD3 | | 0x025546955565D31546B............

    PSD4 | | 0x025546955565D31546B............

    I'm not concerned with the actual data, but these are healthcare images.

    So: no ID, no index, nothing. I think it must be done by row count or something? Any ideas on how I can loop through batches of records?

  • Is doc_number unique?

    If you haven't even tried to resolve your issue, please don't expect the hard-working volunteers here to waste their time providing links to answers which you could easily have found yourself.

  • Phil Parkin (10/23/2016)


    Is doc_number unique?

    It is unique as far as I can see, but it's not an unbroken sequence. In other words, it does not stay in sync with the row numbers (some doc_numbers are missing, so to speak).

  • Super_Grover (10/23/2016)


    Phil Parkin (10/23/2016)


    Is doc_number unique?

    It is unique as far as I can see, but it's not an unbroken sequence. In other words, it does not stay in sync with the row numbers (some doc_numbers are missing, so to speak).

    OK, then you could easily create a temp table with an identity column and insert all of the doc_numbers into that. Then join from the temp table on doc_number and you have your numbering scheme.

    Untested code to get you started

    Create table #doc_numbers (RowNum int identity(1,1) primary key clustered, doc_number nvarchar(20) not null);

    insert #doc_numbers (doc_number)
    select doc_number
    from tbl;

    If you haven't even tried to resolve your issue, please don't expect the hard-working volunteers here to waste their time providing links to answers which you could easily have found yourself.

  • Phil Parkin (10/23/2016)


    OK, then you could easily create a temp table with an identity column and insert all of the doc_numbers into that. Then join from the temp table on doc_number and you have your numbering scheme.

    Untested code to get you started

    Create table #doc_numbers (RowNum int identity(1,1) primary key clustered, doc_number nvarchar(20) not null);

    insert #doc_numbers (doc_number)
    select doc_number
    from tbl;

    Thank you, Phil. But wouldn't that cost me even more disk space, since I'd be creating and filling another table?

    I think I'm more looking for a way to do the INSERT INTO ... SELECT ... FROM in small batches.

    What I'm trying to achieve is something like this untested (and incorrect) code (I just wrote it down):

    DECLARE @counter INT = 0
    DECLARE @RwCnt INT = @@rowcount

    WHILE @counter <= @RwCnt
    BEGIN
        INSERT INTO SELECT TOP (1000) FROM .........
        SET @counter = @counter + 1000
    END

    The only problem is that I don't know how to specify the record batch in the SELECT statement, i.e. how to link the SELECT to @counter.

    I'm trying to make loops of 1000 records.

  • Super_Grover (10/23/2016)


    Thank you, Phil. But wouldn't that cost me even more disk space, since I'd be creating and filling another table?

    I think I'm more looking for a way to do the INSERT INTO ... SELECT ... FROM in small batches.

    I think you missed my point somewhat. Yes, it will use disk space, of course. But not much. And once you have the temp table populated, you can do something like this (also untested):

    set nocount, xact_abort on;

    declare @MinId int, @MaxId int;

    select @MinId = min(RowNum), @MaxId = max(RowNum)
    from #doc_numbers;

    declare @Increment int = 1000;
    declare @StartRange int = @MinId;
    declare @EndRange int = @StartRange + @Increment;
    declare @NumInserted int;

    while @StartRange <= @MaxId
    begin
        begin transaction;

        insert target (col1, col2)
        select t.col1, t.col2
        from tbl t
        join #doc_numbers n on t.doc_number = n.doc_number
        where n.RowNum >= @StartRange and n.RowNum < @EndRange;

        commit transaction;

        set @StartRange = @EndRange;
        set @EndRange = @StartRange + @Increment;
    end
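
    Note that @NumInserted is declared but not used above; if you want per-batch progress, you could set it from @@ROWCOUNT straight after the insert (again untested):

        set @NumInserted = @@rowcount;   -- rows copied in this batch
        raiserror('Copied %d rows (RowNum < %d)', 0, 1, @NumInserted, @EndRange) with nowait;   -- progress message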

    If you haven't even tried to resolve your issue, please don't expect the hard-working volunteers here to waste their time providing links to answers which you could easily have found yourself.

    Thanks again, Phil. Yes, I missed your point; sorry for that.

    Now with the extra code example I think it might be an excellent solution.

    I'll start with it right away and hope I can get the code right.

    I'll let you know how it goes.

  • Okay, so I created the temp table and populated it. No problem.

    I altered the other script so that, in my opinion, it should work. But now it states 'Column name or number of supplied values does not match table definition'. So basically it says that the source and destination tables do not match.

    But I used the exact same INSERT query that worked fine.

    Is it possible that the extra join clause causes this?

    Here's my original working statement:

    INSERT INTO [P2WIDB002].[CS_ConversieAanlevering_VWB].[dbo].[MULTIMED_INHOUD]
    SELECT * FROM [NLDSLASSQLTST].[CS_ConversieAanlevering].[dbo].[MULTIMED_INHOUD]

    and I just altered it to this:

    INSERT INTO [P2WIDB002].[CS_ConversieAanlevering_VWB].[dbo].[MULTIMED_INHOUD]
    SELECT * FROM [NLDSLASSQLTST].[CS_ConversieAanlevering].[dbo].[MULTIMED_INHOUD] AS ca
    JOIN #doc_numbers AS dn ON ca.DOCUMENTNUMMER = dn.doc_number
    WHERE dn.RowNum >= @StartRange and dn.RowNum < @EndRange

    So is it the extra join?

  • You've just found a pretty good example of why SELECT * is often regarded as bad practice.

    Usually, I would suggest that you specify all of the column names explicitly. But in this case, the shortcut is changing this line

    SELECT * FROM [NLDSLASSQLTST].[CS_ConversieAanlevering].[dbo].[MULTIMED_INHOUD] AS ca

    to this

    SELECT ca.* FROM [NLDSLASSQLTST].[CS_ConversieAanlevering].[dbo].[MULTIMED_INHOUD] AS ca

    Qualify the * with the source table alias and the temp table's columns will no longer be included in the SELECT.
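
    Put together with the loop from earlier, the statement inside each batch would then look something like this (still untested, using the names from your own posts):

    INSERT INTO [P2WIDB002].[CS_ConversieAanlevering_VWB].[dbo].[MULTIMED_INHOUD]
    SELECT ca.*
    FROM [NLDSLASSQLTST].[CS_ConversieAanlevering].[dbo].[MULTIMED_INHOUD] AS ca
    JOIN #doc_numbers AS dn ON ca.DOCUMENTNUMMER = dn.doc_number
    WHERE dn.RowNum >= @StartRange AND dn.RowNum < @EndRange;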

    If you haven't even tried to resolve your issue, please don't expect the hard-working volunteers here to waste their time providing links to answers which you could easily have found yourself.

  • Phil Parkin (10/23/2016)


    You've just found a pretty good example of why SELECT * is often regarded as bad practice.

    Usually, I would suggest that you specify all of the column names explicitly. But in this case, the shortcut is changing this line

    SELECT * FROM [NLDSLASSQLTST].[CS_ConversieAanlevering].[dbo].[MULTIMED_INHOUD] AS ca

    to this

    SELECT ca.* FROM [NLDSLASSQLTST].[CS_ConversieAanlevering].[dbo].[MULTIMED_INHOUD] AS ca

    Qualify the * with the source table alias and the temp table's columns will no longer be included in the SELECT.

    Thank you again! Yes, I knew it was bad practice, but it was stronger than me 😉 Learned my lesson, I guess 🙂

    Unfortunately, I ran into another message, which is (I guess) of a whole other kind: The operation could not be performed because OLE DB provider "SQLNCLI11" for linked server "P2WIDB002" was unable to begin a distributed transaction.

    The destination server is a linked server over a VPN tunnel, but it worked fine before.

    I found some info on it, but it probably has to do with the server configuration, so I need to see our sysadmin. On the remote linked server we cannot administer anything...
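
    One thing I might try in the meantime (just a guess on my part, not a confirmed fix): my original single-statement INSERT worked without an explicit transaction, so perhaps it's the BEGIN/COMMIT TRANSACTION around the remote insert that forces the promotion to a distributed transaction. Since each batch is a single INSERT anyway, the loop could let every batch auto-commit on its own:

    while @StartRange <= @MaxId
    begin
        -- no explicit transaction: each INSERT commits on its own
        INSERT INTO [P2WIDB002].[CS_ConversieAanlevering_VWB].[dbo].[MULTIMED_INHOUD]
        SELECT ca.*
        FROM [NLDSLASSQLTST].[CS_ConversieAanlevering].[dbo].[MULTIMED_INHOUD] AS ca
        JOIN #doc_numbers AS dn ON ca.DOCUMENTNUMMER = dn.doc_number
        WHERE dn.RowNum >= @StartRange AND dn.RowNum < @EndRange;

        set @StartRange = @EndRange;
        set @EndRange = @StartRange + @Increment;
    end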

  • Super_Grover (10/23/2016)


    I knew it was bad practice ...

    As with everything else in SQL Server, "It Depends".

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

    Okay, we're still trying to solve this, but I'll mark Phil's post as the solution.

    Thanks again!
