Bulk inserts in batches

  • Problem:

    I have a table with 370 million rows and 50+ columns. I need to change the data type of one column from character to numeric. Here's what that column contains:

    40 million rows with numbers I want to keep; the rest I just want to set to null:

    4 thousand rows with alpha characters

    55 million rows with other numbers

    275 million rows with empty strings

    An ALTER COLUMN statement fails not just on the alpha characters but on the empty strings. So I tried a couple of things on a test database to get an idea of the time it would take:

    An update statement to clear out the non-numeric data is too slow (~1.5 days, batched 10,000 rows at a time). I think I should probably create a new column anyway, though, so I'm going to copy the data to a new table, since that would be faster than adding a new column to the original table.

    An INSERT ... SELECT ... takes about 12 hours; adding WITH (TABLOCK) didn't seem to have any effect, and I'm not sure how to batch it. The recovery model is SIMPLE.

    A SELECT ... INTO ... only takes about 1 hour, but it can't be batched.

    Using a 3rd-party ETL tool takes about 5 hours, batched.

    I wanted to batch it to minimize the impact on other queries, but primarily to limit transaction log growth. Is there any way to do a fast batched bulk transfer within SSMS?

    Thanks.
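    For what it's worth, on SQL Server 2012 or later the cleanup rule itself can be written with TRY_CONVERT. A minimal sketch, using hypothetical table/column names (dbo.SourceTable, CharCol) and an assumed target precision; it nulls out the alpha values and the empty strings, while filtering the 55 million unwanted numbers would need whatever rule distinguishes them from the 40 million keepers:

        -- NULLIF turns '' into NULL explicitly; TRY_CONVERT returns NULL
        -- instead of raising an error for the remaining alpha values.
        SELECT TRY_CONVERT(numeric(18,4), NULLIF(CharCol, '')) AS NumCol
        FROM   dbo.SourceTable;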

  • I tested trace flag 610 and it runs over 6 times faster than the TABLOCK option. (It was 3 times faster when I still had the TABLOCK hint in the query, but improved again when I took it out.) And I found a key I can batch on. So that's the route I'll pursue for now.
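    For reference, the trace flag doesn't have to be enabled server-wide; a sketch of scoping it to just the session running the load:

        DBCC TRACEON (610);    -- session-scoped; add -1 for server-wide
        -- ... run the batched INSERT ... SELECT here ...
        DBCC TRACEOFF (610);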

  • Consider using much larger batches, as well. I would recommend a batch size of 2 million rows and I strongly recommend that the batches be executed in the same order as the clustered index. That way, you're only locking a part of the table at once instead of possibly many parts. And I wouldn't go very much over two million rows because every system has a straw that breaks the camel's back where what you would think should only take several seconds suddenly takes several hours.
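    A minimal sketch of that batching pattern, assuming for illustration an integer clustered key named Id and the hypothetical table/column names from the earlier sketch:

        DECLARE @BatchSize int = 2000000;
        DECLARE @LastKey   int = 0;
        DECLARE @MaxKey    int = (SELECT MAX(Id) FROM dbo.SourceTable);

        WHILE @LastKey < @MaxKey
        BEGIN
            -- Each batch covers one contiguous slice of the clustered key,
            -- so only that part of the table is locked at a time.
            INSERT INTO dbo.TargetTable (Id, NumCol)
            SELECT Id, TRY_CONVERT(numeric(18,4), NULLIF(CharCol, ''))
            FROM   dbo.SourceTable
            WHERE  Id >  @LastKey
              AND  Id <= @LastKey + @BatchSize   -- range scan; key gaps are fine
            ORDER  BY Id;                        -- same order as the clustered index

            SET @LastKey = @LastKey + @BatchSize;
        END;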

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
        Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden


    Consider using much larger batches, as well. I would recommend a batch size of 2 million rows and I strongly recommend that the batches be executed in the same order as the clustered index. [...]

    Thanks very much for the tips. I'm no expert at this, just learning as I go. But yes, after I looked at the numbers I realized I should have used much larger batches. I did some testing with 100k and it sped things up a little bit. Then I used 1 million for my final test.

    Then I looked at the indexes for batching purposes. The clustered PK consists of a date and a useless alphanumeric GUID, so I tried batching on just the date alone. There are between 2 and 6 million rows per date. It ran OK, taking about twice as long as the SELECT INTO, which is about what I expected but not quite as fast as I had hoped. I would probably have to create another index to get more control, unless there's another way.

    The other reason I thought about just using the date is I'd really like to partition this table on that date. Unfortunately SQL Server's partitioning is a bit limited and would require more work to maintain than I had hoped, but I'll still probably pursue it.

    I should mention that in all cases I was only inserting into a heap. We normally compress this table, so when I added the clustered primary key and compression, testing took 3-4 times as long. I assume that's still faster than compressing it afterwards, but I haven't tested that yet.
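    A sketch of the batching-by-date approach described above, again with hypothetical names (SnapshotDate, RowGuid); one batch per snapshot date, walked in date order:

        DECLARE @d date = (SELECT MIN(SnapshotDate) FROM dbo.SourceTable);

        WHILE @d IS NOT NULL
        BEGIN
            INSERT INTO dbo.TargetTable (SnapshotDate, RowGuid, NumCol)
            SELECT SnapshotDate, RowGuid,
                   TRY_CONVERT(numeric(18,4), NULLIF(CharCol, ''))
            FROM   dbo.SourceTable
            WHERE  SnapshotDate = @d;

            -- MIN over an empty set returns NULL, which ends the loop.
            SELECT @d = MIN(SnapshotDate) FROM dbo.SourceTable WHERE SnapshotDate > @d;
        END;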

  • craig.g.smith (6/17/2015)


    The other reason I thought about just using the date is I'd really like to partition this table on that date.

    That was going to be my next recommendation. Since you mention a "date" column, is this some form of "audit" or "log" table? Also, which Edition of SQL Server are you using? Standard, Enterprise, or ???

    --Jeff Moden



  • Jeff Moden (6/17/2015)


    That was going to be my next recommendation. Since you mention a "date" column, is this some form of "audit" or "log" table? Also, which Edition of SQL Server are you using? Standard, Enterprise, or ???

    We're using Enterprise Edition. This is a data warehouse table that stores weekly snapshots of a table in our application database.

    With the primary key defined, inserting by date only took twice as long as without it, instead of 3-4x. I thought I was inserting in the order of the clustered index, so I'm still not sure why it adds so much time. But I just realized that the key of the old table has the GUID first and the date second; when I query by a single date the rows should still come out in date order, and I would assume the GUIDs would still come out in order too. I'll try adding explicit sorts.
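    Making the sort explicit is just an ORDER BY on the INSERT/SELECT that matches the new table's clustered key (date first, then the GUID); hypothetical names and an example date value:

        DECLARE @d date = '20150614';   -- one batch's snapshot date (example value)

        INSERT INTO dbo.TargetTable (SnapshotDate, RowGuid, NumCol)
        SELECT SnapshotDate, RowGuid,
               TRY_CONVERT(numeric(18,4), NULLIF(CharCol, ''))
        FROM   dbo.SourceTable
        WHERE  SnapshotDate = @d
        ORDER  BY SnapshotDate, RowGuid;   -- matches the target clustered index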

  • Finally partitioned the table by date. I was hoping it would be faster since the clustered index didn't have the GUID in it, but it took slightly longer. Oh well.

  • Still wondering why an insert select is slower than a select into -- why don't they do the same thing? I wish there were a way to turn off logging altogether.

    Also wondering why having a clustered index makes it twice as slow, when I'm inserting in order. My understanding is that a clustered index really is the table itself, so it's not writing any additional data. And it's not having to re-order anything, but I guess there's overhead in verifying that it actually is in order?

    But the strange thing is that when I partitioned the table by snapshot date it didn't get any faster, and was actually a little slower. Partitioning removed the unique key from the clustered index and just left the snapshot date. So it didn't really have to order anything, but rather just had to figure out what partition it should be in. And the entire batch of inserts had the same date. I didn't think that would add much overhead.

    Would it help to insert into a staging table and then swap the partition? I'll test it shortly ...
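    A sketch of the staging-and-switch idea, assuming a hypothetical partition function pfSnapshotDate and a staging table dbo.Staging that matches dbo.TargetTable in columns, indexes, compression, and filegroup, with a CHECK constraint pinning SnapshotDate to the target partition's range:

        -- The switch is a metadata-only operation, so it is effectively instant.
        ALTER TABLE dbo.Staging
            SWITCH TO dbo.TargetTable
            PARTITION $PARTITION.pfSnapshotDate('20150614');   -- example date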

  • craig.g.smith (6/24/2015)


    Still wondering why an insert select is slower than a select into -- why don't they do the same thing? [...] Would it help to insert into a staging table and then swap the partition? I'll test it shortly ...

    You can make INSERT/SELECT run nearly as fast as SELECT/INTO, and you don't need to be in the BULK LOGGED or SIMPLE recovery model. Create your target table and then run your INSERT/SELECT, making sure you use WITH (TABLOCK) on the target table in the INSERT. If you don't have any indexes on the table, it will run nearly as fast as SELECT/INTO with similar, seemingly (though not actually) "minimal logging". If you're in the BULK LOGGED model, turn on Trace Flag 610 for the session and you can do multiple inserts with the same effect.
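    A sketch of that pattern with the same hypothetical names as before (an unindexed heap target, TABLOCK on the INSERT):

        INSERT INTO dbo.TargetTable WITH (TABLOCK)
               (SnapshotDate, RowGuid, NumCol)
        SELECT SnapshotDate, RowGuid,
               TRY_CONVERT(numeric(18,4), NULLIF(CharCol, ''))
        FROM   dbo.SourceTable;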

    As for the clustered index, it should be a unique clustered index with the date as the first column and the former PK column as the second. SQL Server REALLY likes unique indexes, and not adding that second column prevents uniqueness because dates, obviously, are not unique.
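    As DDL, that suggestion would look something like this (names hypothetical):

        -- Date first for range queries and partitioning; the former PK
        -- column second makes the key unique.
        CREATE UNIQUE CLUSTERED INDEX CIX_TargetTable
            ON dbo.TargetTable (SnapshotDate, RowGuid);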

    --Jeff Moden



  • Jeff Moden (6/24/2015)

    You can make INSERT/SELECT run nearly as fast as SELECT/INTO, and you don't need to be in the BULK LOGGED or SIMPLE recovery model. [...] If you're in the BULK LOGGED model, turn on Trace Flag 610 for the session and you can do multiple inserts with the same effect.

    Thanks. I actually already tried the TABLOCK hint and it didn't help; it was still more than 10x slower. Of course, I believe it only works on an empty table, so it won't work with a batched insert, but it also didn't help when I did it all in one shot, and this was with no indexes at all on the table. TF 610 helped dramatically, however, and we are using SIMPLE. It's still twice as slow as SELECT INTO, though. I haven't tried changing the model to BULK LOGGED yet. I'm debating whether to keep the trace flag on or just use it for this one-time script; I need to test it with our regular ETL loads first.

    Still wondering if it's normal for it to be 2x slower on top of that when adding the partitioning though.

    Jeff Moden (6/24/2015)

    As for the clustered index, it should be a unique clustered index with the date as the first column and the former PK column as the second. [...]

    OK, thanks. I thought I had read that the partitioned index had to exactly align with the partition function in terms of arguments and data types.
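    For what it's worth, alignment only requires that the index be built on the partition scheme with the partitioning column somewhere in its key; extra key columns such as the GUID are allowed. A sketch with hypothetical names and example boundary dates:

        CREATE PARTITION FUNCTION pfSnapshotDate (date)
            AS RANGE RIGHT FOR VALUES ('20150607', '20150614');  -- one boundary per weekly snapshot

        CREATE PARTITION SCHEME psSnapshotDate
            AS PARTITION pfSnapshotDate ALL TO ([PRIMARY]);

        -- Unique, aligned, and still carries the GUID as the second key column.
        CREATE UNIQUE CLUSTERED INDEX CIX_TargetTable
            ON dbo.TargetTable (SnapshotDate, RowGuid)
            ON psSnapshotDate (SnapshotDate);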
