Good or slow? 1 min to load 600M text format into SQL

  • Hello,

    Just curious about SSIS performance: it took me 1 minute to load 600M of data into SQL, purely a Bulk Insert, nothing else.

    Is that normal, or slow? Doing the math, it comes out that 6T would take 416 hours (not accounting for any other factors), which I personally think is slow.

    Thanks.

  • halifaxdal - Thursday, February 16, 2017 2:15 PM


    600M - is that a number of rows or an amount of raw data?  Regardless of the answer, the question is whether it makes any sense to move 6 terabytes of data across a network.  600 MEG in 1 minute is pretty good, as is 600 million rows.  The only reason I asked the initial question is that when I divide 6 terabytes by 600 megabytes, I end up with 10,485.76 minutes.  That divided by 60 gives the number of hours, which my calculator says is about 174.76.  As that differs from 416 hours by at least a factor of two, I figured it was worth asking about.  The question is what form the data is currently in, as that will certainly have an impact on the choice of method.

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • sgmunson - Thursday, February 16, 2017 2:52 PM


    Thank you Steve. The current data is txt, and yes, the raw file size is 600 MB; it's 3 million rows, in fact.

  • halifaxdal - Thursday, February 16, 2017 7:32 PM


    Chances are you're not going to want to move that much data across the network.  Back up that file or set of files to a backup device, attach it to the destination server, and have a drive with at least 6 TB of free space to hold that data, on a separate physical disk from the drive that holds the database.  You'll have to see how many concurrent bulk inserts you can have going at once and drive the disk I/O through the roof while nothing else is going on.  You might well have to create a database box solely for this purpose, and at least then, if you need it to be on another instance, you can just back up that database and restore it where you need it, avoiding downtime to the extent possible.  Six terabytes is one HECK of a lot of stuff....

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • If the 6 TB lives in a database on another machine and you have the ability to do a compressed backup, then do it using compressed backup/restore.  My main server contains nearly 2 TB of data, and my compressed backups take 59 minutes, and that's across a network.  Of course, it's a fast network.
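
    As a rough sketch of what that looks like (database, share, and file paths here are just placeholders):

    -- Illustrative only: compressed backup on the source, restore on the destination.
    BACKUP DATABASE SourceDB
        TO DISK = N'\\FileShare\Backups\SourceDB.bak'
        WITH COMPRESSION, INIT, STATS = 10;

    -- Then, on the destination server:
    RESTORE DATABASE SourceDB
        FROM DISK = N'\\FileShare\Backups\SourceDB.bak'
        WITH MOVE N'SourceDB'     TO N'D:\Data\SourceDB.mdf',
             MOVE N'SourceDB_log' TO N'L:\Log\SourceDB_log.ldf',
             STATS = 10;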

    600MB per minute is pretty slow.  That's only 36GB per hour.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Thank you all for the comments/advice.

    If 600 MB/min is slow, what would be the best practice to improve it?

  • I just gave you one: compressed backup/restore, if the data is in a table on another server.

    And it wouldn't be A best practice... it would be MANY best practices that all have to come into play.  Servers need to be properly configured.  The network needs to be configured correctly, and that includes everything: NIC cards, switches, routers, protocols, and even cable bend radii.  Minimal logging.  Pre-sizing MDF/LDF file space.  Etc., etc.  And, to be honest, I probably wouldn't use SSIS for any of this even if it's "just" a text file that you're importing.  I'd likely just use BULK INSERT with a correctly formatted non-XML BCP format file to make it a bit faster still.
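
    Purely as an illustration of that last idea (the table, file, and format file names are hypothetical), you'd generate a non-XML format file with bcp and then point BULK INSERT at it:

    -- From a command prompt, generate a non-XML (character mode) format file:
    --   bcp dbo.ImportStage format nul -c -t"," -f D:\Import\ImportStage.fmt -S MyServer -T

    -- Then use it for the load:
    BULK INSERT dbo.ImportStage
    FROM 'D:\Import\data.txt'
    WITH (
        FORMATFILE = 'D:\Import\ImportStage.fmt',
        TABLOCK   -- also one of the prerequisites for minimal logging
    );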

    --Jeff Moden



  • halifaxdal - Thursday, February 16, 2017 2:15 PM


    You can move 6 TB between Atlanta and Chicago in 2 hours using a combination of Delta airlines and a Sneakernet. If what you're doing is a one-off mass data load, then you may want to consider doing something like that, and then letting your normal ETL process handle the smaller daily loads going forward.

    But regarding the performance of SSIS, one thing to consider is whether it has plenty of memory for buffers. If SSIS is installed on a server alongside MSSQL, it is limited to whatever memory isn't reserved by the MSSQL min/max memory settings, and even that remaining portion is shared with Windows. If that's the case, I'd have MSSQL leave at least 8 GB free.
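
    For example, capping the engine so the box keeps that headroom might look something like this (the numbers are purely illustrative):

    -- Leave roughly 8 GB for SSIS and the OS on, say, a 64 GB server.
    EXEC sys.sp_configure N'show advanced options', 1;
    RECONFIGURE;
    EXEC sys.sp_configure N'max server memory (MB)', 57344;   -- 56 GB for the engine
    RECONFIGURE;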

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Eric M Russell - Friday, February 17, 2017 8:07 AM


    Are there any options in SSIS that can be tweaked to tune the performance? For example, data chunk size.

    BTW: I am not moving 6 TB of data; I was using it as an example to show the unexpected performance. Sorry if that was unclear; I should have made it clearer.

  • halifaxdal - Friday, February 17, 2017 9:49 AM


    If you're using the Bulk Insert task, then I'm guessing we can rule out network latency, since it only works with locally attached data storage. What you're inserting into has more to do with performance than what you're selecting from. Bulk insert performance is at its peak when the destination table has indexes, triggers, and constraints disabled and you can acquire a table lock. If that's not practical, then you can insert into an empty staging table and then partition switch into the destination table as a final step. This article provides some other bulk insert optimizations.
    https://technet.microsoft.com/en-us/library/ms190421(v=sql.105).aspx
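
    A rough sketch of that stage-and-switch pattern (all object names and the partition number are hypothetical):

    -- Load into an empty staging table that matches the target's schema and indexes.
    BULK INSERT dbo.Sales_Stage
    FROM 'D:\Import\sales_201702.txt'
    WITH (TABLOCK, BATCHSIZE = 500000);

    -- Switch it into the destination as a metadata-only operation; the staging table
    -- needs a check constraint covering the target partition's boundary values.
    ALTER TABLE dbo.Sales_Stage SWITCH TO dbo.Sales PARTITION 42;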

    Also, David Peter Hansen does a good talk on SSIS internals and performance, including things like optimization of buffers and parallel data loads.
    https://www.youtube.com/watch?v=j3yYD320os8

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • IIRC, BULK INSERT doesn't require you to disable the clustered index to get minimal logging if it's set up correctly and the data is in the same order as the CI, or the CI is an IDENTITY column, especially after the improvements made in 2008.  Trace flag 610 can also help a whole lot.
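
    Something along these lines, for example (table, file, and key column are hypothetical; TF 610 applies to the pre-2016 behavior):

    -- Relax the minimal-logging rules for indexed tables (SQL Server 2008 - 2014).
    DBCC TRACEON (610, -1);

    -- Tell BULK INSERT the file is already sorted in clustered index order.
    BULK INSERT dbo.BigTable
    FROM 'D:\Import\data.txt'
    WITH (
        TABLOCK,
        ORDER (BigTableID ASC)   -- must match the clustered index key
    );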

    --Jeff Moden



  • Jeff Moden - Friday, February 17, 2017 11:27 AM


    Have you ever tried inserting into a table after you've disabled the clustered index?   :pinch:

    Msg 8655, Level 16, State 1, Line 3
    The query processor is unable to produce a plan because the index ... on table or view ... is disabled.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • As others have said, REALLY efficient loading of very large amounts of data (and then presumably either processing it or using it as a fact table, meaning you need additional indexes, constraints, etc.) takes a WEALTH of knowledge, experience, and even benchmarking.

    Not sure I saw anyone mention doing I/O stall and wait stats analysis during the run. VERY important.

    I have helped companies move entire data centers using "airplane net" - with a car net at the same time in case one or the other didn't make it in a timely fashion.

    If you REALLY want this process to be as fast as it can be, just hire a professional for a few hours. I bet they can get it down to a reasonable number, assuming you have hardware up to the task.

    Oh, Jeff: you can get minimal logging on CI insert under specific conditions. I leave it to you to determine those. 🙂

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • It really depends on the volumes you are going to be loading, how many times you need to do it, and on the whole infrastructure you have to play with.
    One article that is always great to read is https://technet.microsoft.com/en-us/library/dd537533(v=sql.100).aspx

    In my own experience, when loading high volumes of data, bcp, SSIS with fast load, or .NET SqlBulkCopy are all valid options. Each has its own issues/benefits, so I'm not really saying one should always be used over the others.

    For volumes of over 50 million rows, I normally load smaller blocks into individual tables (heaps), then create all the required indexes, and then switch them into their final (or intermediary) table. If intermediary, that just means another final partition merge/switch.
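
    Roughly, each block goes through something like this (all names are hypothetical):

    -- Load one block into a heap with a table lock (minimally logged).
    BULK INSERT dbo.Orders_Block_017
    FROM 'D:\Blocks\orders_017.txt'
    WITH (TABLOCK);

    -- Build the indexes only after the data is in.
    CREATE CLUSTERED INDEX CIX_Orders_Block_017
        ON dbo.Orders_Block_017 (OrderDate, OrderID);

    -- Then switch the block into the big (partitioned) table.
    ALTER TABLE dbo.Orders_Block_017 SWITCH TO dbo.Orders PARTITION 17;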

    All my volume data loading is done in parallel, with between 6 and 12 concurrent processes working from a queue. I once had a process with nearly 300 tasks, 1 billion rows processed with heavy ETL in each one; 9 hours to load.

    A more recent load I did, 500 million+ rows and 600 GB of data (one big blob) into a single table, took me 20 minutes to load the data, gzipping the blob in the middle of the extract/load process, and 35 minutes to build the clustered index plus 2 nonclustered ones. Loading in parallel with 6 tasks, server to server.
    It used a small C# utility that does nothing other than issue a select against the source, gzip the column on the fly, and bulk load into the destination, with 50k rows as the bulk load batch size.

    Note that in that particular last case, using SSIS was dead slow due to the blob processing, so that is also something to take into consideration for this type of load.

  • TheSQLGuru - Friday, February 17, 2017 3:14 PM


    They have a whole section about all that in BOL.  My "IIRC" was based on the fact that I've not read it all in several months. Whenever I need to do something big, I go back and reread it all just to make sure I've not missed something that came out in an SP or CU.  They even made it so that the table doesn't actually need to be empty anymore for it all to work.  New pages will be minimally logged under, as you say, "specific conditions".  TF 610 can also work a treat there to get more relaxed "conditions". 😉

    --Jeff Moden


