Split large CSV

  • I know this is probably not the right slot... so apologies.

    But do any of you guys know of an easy way to split a large CSV-type file into manageable chunks on, say, 10-million-row boundaries?

  • Yep... don't split it at all... if the format of the file is stable (and a couple of other nuances hold), use Bulk Insert. In SQL Server 2000, I used it to import a 20-column file at the rate of 5.1 million rows per minute. Haven't used the 2k5 version yet, but it has some new capabilities that BCP has always had, like being able to route bad rows to a holding file where they can be reworked without causing the whole import to fail.

    So, why not use BCP? You certainly could! I've just found that, at least in the previous version, nothing is faster than Bulk Insert. BCP does come darned close, though.

    Some will recommend using SSIS and the Bulk Insert task... that will also work just fine but, if you can do it all in T-SQL, then why bother?
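
    For reference, a bare-bones sketch of what that looks like in T-SQL (the table name, file path, and delimiters below are placeholders, not from the thread):

    ```sql
    -- Minimal BULK INSERT sketch: load a comma-delimited file straight into an
    -- existing table instead of splitting the file first. Names/paths are hypothetical.
    BULK INSERT dbo.ImportTarget
    FROM 'D:\Import\bigfile.csv'
    WITH (
        FIELDTERMINATOR = ',',
        ROWTERMINATOR   = '\n',
        TABLOCK          -- table lock allows a minimally logged load under SIMPLE/BULK_LOGGED recovery
    );
    ```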

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Thanks Jeff.

    Well, the file might be 200GB. Not sure what the ordering is going to be like, but the original table is clustered differently to the final table. The original table is 600GB - just under 2bn rows.

    Additionally, for now it is still stuck on 2k... and whether I can squeeze it into 2k8 I am not sure at this point.

    My plan was to pump it into a table in sensible batches allowing for checkpoints and defrags in between.
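
    A rough sketch of that batch idea, assuming one big file and a 10-million-row batch size (the table name, path, and the 2bn cutoff are placeholders based on the numbers above):

    ```sql
    -- Hypothetical "sensible batches" loop: carve 10-million-row slices out of one large
    -- file with FIRSTROW/LASTROW and CHECKPOINT between loads. Note that each pass still
    -- has to re-read the file from the top to skip rows, so it trades speed for restartability.
    DECLARE @first INT, @last INT, @sql VARCHAR(1000);
    SELECT @first = 1, @last = 10000000;

    WHILE @first <= 2000000000   -- stop once past the expected ~2bn rows
    BEGIN
        SELECT @sql = 'BULK INSERT dbo.ImportTarget FROM ''D:\Import\bigfile.csv'' '
                    + 'WITH (FIELDTERMINATOR = '','', ROWTERMINATOR = ''\n'', TABLOCK, '
                    + 'FIRSTROW = ' + CAST(@first AS VARCHAR(12))
                    + ', LASTROW = ' + CAST(@last AS VARCHAR(12)) + ')';
        EXEC (@sql);

        CHECKPOINT;   -- the pause point for checkpoints/defrags between batches
        SELECT @first = @last + 1, @last = @last + 10000000;
    END;
    ```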

  • It is all about how much time you have available for the operation.

    I am with Jeff: DO NOT split the file. Just make sure that you have enough space on the destination, that the DB is in BULK_LOGGED mode, AND that there are no NON-CLUSTERED indexes on the destination.

    IF you can create the file (using bcp with queryout) in the destination ORDER, then you can use the ORDER hint to speed up the load.

    There are many other things you can do to further speed up the process, like loading multiple files in parallel into a table with no indexes at all, etc., but your time constraints are what dictate your choice.
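
    In T-SQL terms, that checklist might look something like this (database, table, and column names are placeholders; the bcp command line is shown as a comment since it runs outside T-SQL):

    ```sql
    -- Switch to bulk-logged recovery and make sure the destination has no nonclustered
    -- indexes before the load; switch back (and back up) when done. Names are hypothetical.
    ALTER DATABASE TargetDB SET RECOVERY BULK_LOGGED;

    -- Export in destination (clustered-key) order from a command prompt, e.g.:
    --   bcp "SELECT * FROM SourceDB.dbo.BigSource ORDER BY KeyCol" queryout D:\Export\bigfile.dat -S MyServer -T -c -t","

    BULK INSERT TargetDB.dbo.ImportTarget
    FROM 'D:\Export\bigfile.dat'
    WITH (
        FIELDTERMINATOR = ',',
        ROWTERMINATOR   = '\n',
        TABLOCK,            -- minimally logged under BULK_LOGGED recovery
        ORDER (KeyCol ASC)  -- only helps if the target's clustered index is on KeyCol and the file really is sorted
    );

    ALTER DATABASE TargetDB SET RECOVERY FULL;
    ```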

    -Noel

  • FYI: I have successfully bulk imported 1.3-billion-row files without problems before.

    The ORDER and TABLOCK hints, and no type conversions, were the dominant factors in the matter.


    * Noel

  • RichardB (1/15/2009)


    Thanks Jeff.

    Well, the file might be 200GB. Not sure what the ordering is going to be like, but the original table is clustered differently to the final table. The original table is 600GB - just under 2bn rows.

    Additionally, for now it is still stuck on 2k... and whether I can squeeze it into 2k8 I am not sure at this point.

    My plan was to pump it into a table in sensible batches allowing for checkpoints and defrags in between.

    Well, you should be able to do this with DTS. Not that I am any fan of DTS (yech!).

    The real issue with any of these possible answers, though, is "are they compatible with your CSV file?" Because, to put it bluntly, SQL Server's support of CSV-formatted files sucks. So before you get too heavily into any of these, you should take a few hundred rows and see if you can get them in at all.
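
    A minimal way to run that smoke test, assuming the target table already exists (all names and paths are hypothetical):

    ```sql
    -- Load only the first couple of hundred rows into an empty scratch copy of the target
    -- to see whether the CSV's delimiters and quoting survive SQL Server's parser at all.
    SELECT * INTO dbo.ImportTarget_Test FROM dbo.ImportTarget WHERE 1 = 0;  -- empty copy, no indexes

    BULK INSERT dbo.ImportTarget_Test
    FROM 'D:\Import\bigfile.csv'
    WITH (
        FIELDTERMINATOR = ',',
        ROWTERMINATOR   = '\n',
        LASTROW         = 200   -- stop after the first 200 data rows
    );

    SELECT TOP 20 * FROM dbo.ImportTarget_Test;  -- eyeball what actually landed
    ```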

    -- RBarryYoung, (302)375-0451, blog: MovingSQL.com, Twitter: @RBarryYoung
    Proactive Performance Solutions, Inc.
    "Performance is our middle name."

  • RichardB (1/15/2009)


    Thanks Jeff.

    Well, the file might be 200GB. Not sure what the ordering is going to be like, but the original table is clustered differently to the final table. The original table is 600GB - just under 2bn rows.

    Additionally, for now it is still stuck on 2k... and whether I can squeeze it into 2k8 I am not sure at this point.

    My plan was to pump it into a table in sensible batches allowing for checkpoints and defrags in between.

    Then, my first recommendation stands... use Bulk Insert from T-SQL... if you really need some logging or collection of bad rows, then use BCP, which is almost as fast. Noeld also has the correct suggestions... no clustered index if you can help it, unless it's on an IDENTITY column not included in the import or the file is in precise order of the clustered index for all rows. Of course, if it's an extraneous IDENTITY column, that will mean you'll need a format file, but that won't slow things down a bit.
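
    Purely to illustrate the format-file point (the columns below are invented, not from the thread): if the target's first column is an IDENTITY that isn't in the file, a SQL 2000 (8.0) non-XML format file can map the file's fields to table columns 2 and 3 so the IDENTITY is simply skipped.

    ```sql
    -- Contents of a hypothetical format file, D:\Import\bigfile.fmt, for a two-field
    -- data file going into a table whose column 1 is an IDENTITY not present in the file:
    --
    --   8.0
    --   2
    --   1   SQLCHAR   0   8000   ","      2   CustomerName   SQL_Latin1_General_CP1_CI_AS
    --   2   SQLCHAR   0   8000   "\r\n"   3   Amount         ""
    --
    BULK INSERT dbo.ImportTarget
    FROM 'D:\Import\bigfile.csv'
    WITH (
        FORMATFILE = 'D:\Import\bigfile.fmt',
        TABLOCK
    );
    ```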

    It would probably be a good thing if you attached, say, the first 10 or 20 rows of the file (DO NOT PUBLISH ANY PRIVATE INFORMATION PLEASE), a record layout, and the CREATE TABLE statement for the target table. You'd be surprised at how many folks will be able to help you if you do that.

    By the way... how many rows are in your 200GB file?

    --Jeff Moden

  • noeld (1/15/2009)


    FYI: I have successfully bulk imported 1.3-billion-row files without problems before.

    The ORDER and TABLOCK hints, and no type conversions, were the dominant factors in the matter.

    Do you remember about how long that bad boy took?

    --Jeff Moden

  • Ok, thought I'd let you know how this ended up.

    1 - Re-exported the data into 400-million-row chunks. With 5 exports in parallel, the export took a little over 5 hrs.

    2 - Imported in parallel into 3 separate tables, around 2 hrs per file.

    3 - The PK took 5 hrs to apply to the biggest table, but somehow there's a duplicate in one table... I think it's just the one record though, so I probably mucked up the boundaries somewhere! (A duplicate check like the one sketched below should confirm.)

    Total 1.9bn records.
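
    For the stray duplicate in step 3, a quick check along these lines (table and key column names are placeholders) should pin down the offending record:

    ```sql
    -- List any key values that appear more than once in the freshly loaded table.
    SELECT KeyCol1, KeyCol2, COUNT(*) AS Dupes
    FROM dbo.ImportTarget
    GROUP BY KeyCol1, KeyCol2
    HAVING COUNT(*) > 1;
    ```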

    Now I have to do the bigger table 🙁
