Loading more than 1 billion records from CSV File

  • Hi Guys

    Need help. I need to load a very big CSV file into SQL Server 2016. The file contains more than 1 billion records. Does anyone have any idea how to load this huge file fast? I tried SSIS, using its bulk load option. The package has been running for more than 12 hours now and it is not even halfway done.

    Any idea how to load the file fast?

    Thanks in Advance

    L

    See the following article.  Even though it's a slightly bastardized version (missing some information and some very helpful graphics, which is another proof that change isn't always for the better) of a nearly decade-old document, it's still a "MUST READ" in my opinion.
    https://docs.microsoft.com/en-us/previous-versions/sql/sql-server-2008/dd425070(v=sql.100)#integration-services-data-destinations

    Contrary to one section of that article, I strongly recommend that you do NOT try to write your own compiled importer.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Thanks Jeff Moden

    I will read through this article.

    Thank you so much

  • learningsqlkid - Tuesday, July 17, 2018 5:47 PM

    Hi Guys

    Need help. I need to load a very big CSV file into SQL Server 2016. The file contains more than 1 billion records. Does anyone have any idea how to load this huge file fast? I tried SSIS, using its bulk load option. The package has been running for more than 12 hours now and it is not even halfway done.

    Any idea how to load the file fast?

    Thanks in Advance

    L

    Does your target table contain any indexes or primary keys? Are they clustered? Does it have any foreign keys?

    If you haven't even tried to resolve your issue, please don't expect the hard-working volunteers here to waste their time providing links to answers which you could easily have found yourself.

  • Sounds like there may be some bottlenecks when the data lands such as indexes, primary keys and so forth if it's taking that long.

    Regardless, what's stopping you from splitting this data on disk before you load it? What I do is use Python to split up the data into small chunks then use SSIS to loop through those chunks. For example, turning one file into 1,000 files. Then you can parallel load say 10 files at a time over 1,000 files. SSIS should be able to chunk up this data if Python is not an option. I would imagine the same is true for Powershell.
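
    For example, a rough sketch of that splitting step in plain Python (just an untested sketch; the file names and the 1,000,000-row chunk size are placeholders, and it assumes the file has a header row and no line breaks embedded inside quoted fields):

    ROWS_PER_FILE = 1_000_000   # illustrative chunk size

    with open('bigfile.csv', 'r', encoding='utf-8') as src:
        header = src.readline()            # keep the header for every chunk file
        part, rows, out = 0, 0, None
        for line in src:
            if out is None or rows == ROWS_PER_FILE:
                if out:
                    out.close()
                part += 1
                rows = 0
                out = open(f'bigfile_part{part:04d}.csv', 'w', encoding='utf-8')
                out.write(header)
            out.write(line)
            rows += 1
        if out:
            out.close()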

  • xsevensinzx - Wednesday, July 18, 2018 5:54 AM

    Sounds like there may be some bottlenecks when the data lands such as indexes, primary keys and so forth if it's taking that long.

    Regardless, what's stopping you from splitting this data on disk before you load it? What I do is use Python to split up the data into small chunks then use SSIS to loop through those chunks. For example, turning one file into 1,000 files. Then you can parallel load say 10 files at a time over 1,000 files. SSIS should be able to chunk up this data if Python is not an option. I would imagine the same is true for Powershell.

    I'm not saying that what you're doing is right or wrong.  I'm just identifying a possible problem that I've run into in the past.

    Just like people who think that "going parallel" on backups increases the speed of the backups and will reduce their duration, but who have actually made them slower (and the more parallel they went, the slower it got) because they didn't understand the action of the Read/Write heads on the single disk they were writing to, I've found that "going parallel" for imports can result in similar problems.

    By all means, if you can control the imports to write to more than one disk, then go parallel.  If you can't, then going parallel may actually slow the imports down.  I've repaired many import performance problems (both in SSIS and other methods like BULK INSERT) by shifting from a parallel load to a serial load and then applying techniques like Minimal Logging and, if you need one, importing to a table with the Clustered Index in place.  With both SSIS and BULK INSERT (my favorite, in case you couldn't tell from some of my other posts), and since SQL Server 2008, minimal logging is very possible even with multiple batches (batch size controls this), especially if the data being imported happens to be in the same order as the Clustered Index (always a HUGE win!).

    That notwithstanding, and as with all else in SQL Server, "It Depends", and one good test for a specific situation is always worth a thousand expert opinions. 😀
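
    Just to illustrate the kind of serial, minimally logged load I mean, here's a rough sketch driven from Python via pyodbc (the server, file, table and sort column names are all made up, and it assumes the recovery model and target table actually qualify for minimal logging):

    import pyodbc

    # Illustrative names only; adjust the connection string, file, table and column.
    conn = pyodbc.connect(
        'DRIVER={ODBC Driver 17 for SQL Server};SERVER=MyServer;'
        'DATABASE=MyDb;Trusted_Connection=yes', autocommit=True)

    # TABLOCK + a batch size + data pre-sorted in clustered index order is what
    # allows the insert to be minimally logged (given a suitable recovery model).
    conn.cursor().execute("""
        BULK INSERT dbo.BigStaging
        FROM 'D:\\Import\\bigfile.csv'
        WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\\n',
              FIRSTROW = 2, BATCHSIZE = 1000000, TABLOCK,
              ORDER (SomeKeyCol ASC));
    """)
    conn.close()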

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I would ask "why?"

    Why do you need to load 1 billion records regularly?

    Is it a "refresh data" task?

    _____________
    Code for TallyGenerator

  • Phil Parkin - Wednesday, July 18, 2018 5:36 AM

    learningsqlkid - Tuesday, July 17, 2018 5:47 PM

    Hi Guys

    Need help. I need to load a very big CSV file into SQL Server 2016. The file contains more than 1 billion records. Does anyone have any idea how to load this huge file fast? I tried SSIS, using its bulk load option. The package has been running for more than 12 hours now and it is not even halfway done.

    Any idea how to load the file fast?

    Thanks in Advance

    L

    Does your target table contain any indexes or primary keys? Are they clustered? Does it have any foreign keys?

    Hi. Nope, no indexes, no constraints and no checks. Simply a straight dump of the file into a table.

  • xsevensinzx - Wednesday, July 18, 2018 5:54 AM

    Sounds like there may be some bottlenecks when the data lands such as indexes, primary keys and so forth if it's taking that long.

    Regardless, what's stopping you from splitting this data on disk before you load it? What I do is use Python to split up the data into small chunks then use SSIS to loop through those chunks. For example, turning one file into 1,000 files. Then you can parallel load say 10 files at a time over 1,000 files. SSIS should be able to chunk up this data if Python is not an option. I would imagine the same is true for Powershell.

    I am not a Python guru, but I heard that Python has multiprocessing libraries which are very efficient when loading data into SQL?

    I am not sure if I can chunk up the files using SSIS. I will find out more on this.

    Are you aware of any resources I can read on splitting the file into multiple chunks?

  • Jeff Moden - Wednesday, July 18, 2018 7:04 AM

    xsevensinzx - Wednesday, July 18, 2018 5:54 AM

    Sounds like there may be some bottlenecks when the data lands such as indexes, primary keys and so forth if it's taking that long.

    Regardless, what's stopping you from splitting this data on disk before you load it? What I do is use Python to split up the data into small chunks then use SSIS to loop through those chunks. For example, turning one file into 1,000 files. Then you can parallel load say 10 files at a time over 1,000 files. SSIS should be able to chunk up this data if Python is not an option. I would imagine the same is true for Powershell.

    I'm not saying that what you're doing is right or wrong.  I'm just identifying a possible problem that I've run into in the past.

    Just like people who think that "going parallel" on backups increases the speed of the backups and will reduce their duration, but who have actually made them slower (and the more parallel they went, the slower it got) because they didn't understand the action of the Read/Write heads on the single disk they were writing to, I've found that "going parallel" for imports can result in similar problems.

    By all means, if you can control the imports to write to more than one disk, then go parallel.  If you can't, then going parallel may actually slow the imports down.  I've repaired many import performance problems (both in SSIS and other methods like BULK INSERT) by shifting from a parallel load to a serial load and then applying techniques like Minimal Logging and, if you need one, importing to a table with the Clustered Index in place.  With both SSIS and BULK INSERT (my favorite, in case you couldn't tell from some of my other posts), and since SQL Server 2008, minimal logging is very possible even with multiple batches (batch size controls this), especially if the data being imported happens to be in the same order as the Clustered Index (always a HUGE win!).

    That notwithstanding, and as with all else in SQL Server, "It Depends", and one good test for a specific situation is always worth a thousand expert opinions. 😀

    Aye, for sure. But that all depends too. Going parallel is not a given, but chunking up your data (breaking the larger problem up into smaller problems) on disk, or even reading it in chunks effectively, is always a good idea for extremely large files. Chunking should be the main consideration first, then parallel second if it makes sense given your variables (e.g., multiple disks, threads, etc.).

    In terms of chunking, there are plenty of options out there.

  • learningsqlkid - Wednesday, July 18, 2018 5:18 PM

    xsevensinzx - Wednesday, July 18, 2018 5:54 AM

    Sounds like there may be some bottlenecks when the data lands such as indexes, primary keys and so forth if it's taking that long.

    Regardless, what's stopping you from splitting this data on disk before you load it? What I do is use Python to split up the data into small chunks then use SSIS to loop through those chunks. For example, turning one file into 1,000 files. Then you can parallel load say 10 files at a time over 1,000 files. SSIS should be able to chunk up this data if Python is not an option. I would imagine the same is true for Powershell.

    I am not a Python guru, but I heard that Python has multiprocessing libraries which are very efficient when loading data into SQL?

    I am not sure if I can chunk up the files using SSIS. I will find out more on this.

    Are you aware of any resources I can read on splitting the file into multiple chunks?

    I had some script tasks in SSIS do it before. I know other ways could be with conditional splits and row count tasks to predefine N number of files. For example, having 10 CSV files with 100 million records in each.

    I've switched to Python mostly because it supports these types of mundane data tasks more easily, such as using the Pandas module to read the CSV file in chunks and then kick the chunks out into separate CSV files by part number. It's relatively easy.

    I see Phil is in the thread; he has probably done this with more native approaches in PowerShell or VB, I'm sure.

  • xsevensinzx - Wednesday, July 18, 2018 5:47 PM

    Jeff Moden - Wednesday, July 18, 2018 7:04 AM

    xsevensinzx - Wednesday, July 18, 2018 5:54 AM

    Sounds like there may be some bottlenecks when the data lands such as indexes, primary keys and so forth if it's taking that long.

    Regardless, what's stopping you from splitting this data on disk before you load it? What I do is use Python to split up the data into small chunks then use SSIS to loop through those chunks. For example, turning one file into 1,000 files. Then you can parallel load say 10 files at a time over 1,000 files. SSIS should be able to chunk up this data if Python is not an option. I would imagine the same is true for Powershell.

    I'm not saying that what you're doing is right or wrong.  I'm just identifying a possible problem that I've run into in the past.

    Just like people who think that "going parallel" on backups increases the speed of the backups and will reduce their duration, but who have actually made them slower (and the more parallel they went, the slower it got) because they didn't understand the action of the Read/Write heads on the single disk they were writing to, I've found that "going parallel" for imports can result in similar problems.

    By all means, if you can control the imports to write to more than one disk, then go parallel.  If you can't, then going parallel may actually slow the imports down.  I've repaired many import performance problems (both in SSIS and other methods like BULK INSERT) by shifting from a parallel load to a serial load and then applying techniques like Minimal Logging and, if you need one, importing to a table with the Clustered Index in place.  With both SSIS and BULK INSERT (my favorite, in case you couldn't tell from some of my other posts), and since SQL Server 2008, minimal logging is very possible even with multiple batches (batch size controls this), especially if the data being imported happens to be in the same order as the Clustered Index (always a HUGE win!).

    That notwithstanding, and as with all else in SQL Server, "It Depends", and one good test for a specific situation is always worth a thousand expert opinions. 😀

    Aye, for sure. But that all depends too. Going parallel is not a given, but chunking up your data (breaking the larger problem up into smaller problems) on disk, or even reading it in chunks effectively, is always a good idea for extremely large files. Chunking should be the main consideration first, then parallel second if it makes sense given your variables (e.g., multiple disks, threads, etc.).

    In terms of chunking, there are plenty of options out there.

    One of the easy ways to chunk data is to set the batch size in rows in BULK INSERT.  You can also tell BULK INSERT what the FIRSTROW is.  I've not tried it (because I've not needed to, yet... emphasis on "yet" :D), but I believe that means you could go parallel without actually splitting up the input file.  The fly in the ointment will be if BULK INSERT locks the file exclusively.  I'll have to set up an experiment to see what happens there.
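
    A rough sketch of that experiment, again driven from Python via pyodbc (the names, slice size and thread count are all illustrative, and whether several sessions can actually read the same file at once is exactly the part that needs testing):

    import pyodbc
    from concurrent.futures import ThreadPoolExecutor

    CONN_STR = ('DRIVER={ODBC Driver 17 for SQL Server};SERVER=MyServer;'
                'DATABASE=MyDb;Trusted_Connection=yes')
    ROWS_PER_SLICE = 100_000_000     # illustrative slice size

    def load_slice(slice_no):
        first = slice_no * ROWS_PER_SLICE + 2      # +2 skips the header row
        last = first + ROWS_PER_SLICE - 1
        conn = pyodbc.connect(CONN_STR, autocommit=True)
        try:
            # Each session loads its own FIRSTROW..LASTROW range of the same file.
            conn.cursor().execute(f"""
                BULK INSERT dbo.BigStaging
                FROM 'D:\\Import\\bigfile.csv'
                WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\\n',
                      FIRSTROW = {first}, LASTROW = {last},
                      BATCHSIZE = 1000000, TABLOCK);
            """)
        finally:
            conn.close()

    # e.g., 10 slices of 100 million rows each, loaded 4 at a time
    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(load_slice, range(10)))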

    I haven't tried loading a billion rows, that's for sure.  I don't actually know anyone who has.  What's the largest number of rows and widest row length you've tried?  The most I've ever done is 10 million rows and a crazy 800 columns wide (not in the same load, thankfully).  The 10 million rows were about 50 columns wide (I don't remember the exact number) and the 800-column-wide stuff was typically 30,000 rows (from Double-Click.net "spotlight" files).  Most of the stuff I work with now is 40-60 columns and between 100,000 and 500,000 rows.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

    Yep, BULK INSERT is a great option. I used it as the sole way to slam 50 million+ records into a staging table and then used stored procedures to transform the data. The method you describe with row numbers is technically the same approach I take with Python. The only difference would be the added overhead of reading up to that row number per process/thread/etc. In Python, the file is already open and read to the row number, so, as it gets to the row, it's partitioned out without actually closing and reopening the file.

    I only recommend chunking the data on disk because that's the recommended approach from the Kimball methods: do as much on disk as possible before bringing it up to the service. But there is no hard rule on that, because it doesn't account for what tools you have available.

    For example, I just decompressed 235 GB of files across 3,500 files to load 2.15 TB of data at 3.07 billion records in Azure. The files were already partitioned and were then loaded in parallel with up to 15 threads. This was a slow trickle load, as the data was being decompressed and re-encoded to UTF-8. It took about 10 hours to load overnight. So even in that case it took time for just 3 billion records, but there is a lot of overhead there.
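
    For reference, the decompress-and-re-encode step can be as simple as something like this (a sketch only; the gzip format, the file names and the latin-1 source encoding are placeholder assumptions):

    import gzip

    # Stream-decompress one file and rewrite it as UTF-8 (placeholder names/encodings).
    def decompress_and_reencode(src_path, dst_path, src_encoding='latin-1'):
        with gzip.open(src_path, 'rt', encoding=src_encoding, newline='') as src, \
             open(dst_path, 'w', encoding='utf-8', newline='') as dst:
            for line in src:               # line-by-line keeps memory usage flat
                dst.write(line)

    decompress_and_reencode('extract_0001.csv.gz', 'extract_0001_utf8.csv')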

  • xsevensinzx - Wednesday, July 18, 2018 5:47 PM

    In terms of chunking, there are plenty of options out there.

    Could you point out a source for chunking CSV files?
    The ones I used could not handle all types of CSV files. Specifically, a <CR><LF> or similar did not chunk correctly when it was part of a field.
    Or fields containing a (properly quoted) CSV file 🙂.

    Ben

  • ben.brugman - Thursday, July 19, 2018 3:36 AM

    xsevensinzx - Wednesday, July 18, 2018 5:47 PM

    In terms of chunking, there are plenty of options out there.

    Could you point out a source for chunking CSV files?
    The ones I used could not handle all types of CSV files. Specifically, a <CR><LF> or similar did not chunk correctly when it was part of a field.
    Or fields containing a (properly quoted) CSV file 🙂.

    Ben

    As mentioned, Python is one way, SSIS has options with script tasks or conditional splits, BULK INSERT has another option with starting at a row number, and I'm sure PowerShell has an option too.

    Python is pretty easy with:


    import pandas as pd

    filename = 'bigfile.csv'   # the large source file (illustrative name)
    chunksize = 10 ** 6        # number of rows per chunk

    for i, chunk in enumerate(pd.read_csv(filename, chunksize=chunksize)):
        # "process" each chunk, e.g. write it out as its own smaller CSV file
        chunk.to_csv(f'part_{i:04d}.csv', index=False)

    So, in the loop, each chunk is exported out to a separate CSV file. The chunk size is the number of rows you want in each of the target CSV files. When formatting becomes an issue, read_csv has additional options for handling it, including things like quoting, escaping, etc.

    Once loaded with pd.read_csv, you can also use a number of SQL-like functions to group by, filter, etc., just like you would in SQL. Thus, you can further aggregate, clean, or conform the data as you chunk it out to the final files you load into SQL Server using SSIS or BULK INSERT.
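
    For instance, a rough sketch of that kind of per-chunk handling (the quoting/escaping settings and the column names are made-up placeholders, so adjust them to your file):

    import csv
    import pandas as pd

    # Read the big file in 1,000,000-row chunks, respecting quoted fields.
    reader = pd.read_csv('bigfile.csv',
                         chunksize=10 ** 6,
                         quotechar='"',
                         quoting=csv.QUOTE_MINIMAL,
                         escapechar='\\')

    for i, chunk in enumerate(reader):
        # SQL-like work per chunk: filter rows, then aggregate before writing out.
        clean = chunk[chunk['Amount'] > 0]
        summary = clean.groupby('CustomerId', as_index=False)['Amount'].sum()
        summary.to_csv(f'clean_part{i:04d}.csv', index=False)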
