Good or slow? 1 min to load 600M text format into SQL

  • frederico_fonseca - Friday, February 17, 2017 3:56 PM

    It really depends on the volumes you are going to be loading, how many times you need to do it, and the infrastructure you have to play with.
    One article that is always worth reading is https://technet.microsoft.com/en-us/library/dd537533(v=sql.100).aspx

    In my own experience, when loading high volumes of data, bcp, SSIS with fast load, and .NET SqlBulkCopy are all valid options. Each has its own issues and benefits, so I am not saying one should always be used over the others.

    For volumes of over 50 million rows I normally load smaller blocks into individual tables (heaps), then create all required indexes, and then switch them into their final (or intermediate) table. If intermediate, that just means one more partition merge/switch at the end.

    All my volume data loads are done in parallel, with between 6 and 12 concurrent processes working off a queue. I once had a process with nearly 300 tasks, 1 billion rows processed with heavy ETL in each one, and it took 9 hours to load.

    A more recent load I did was 500 million+ rows, 600 GB of data (one big blob) into a single table. It took 20 minutes to load the data, gzipping the blob in the middle of the extract/load process, and 35 minutes to build the clustered index plus 2 nonclustered indexes, loading in parallel with 6 tasks, server to server.
    I used a small C# utility that does nothing other than issue a select against the source, gzip the blob column on the fly, and bulk load onto the destination, with 50k rows as the bulk-load batch size (a sketch of this kind of loader appears below).

    Note that in that particular case SSIS was dead slow due to the blob processing, so that is also something to take into consideration for this type of load.
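    A minimal sketch of what such a loader can look like in C# with System.Data.SqlClient (an illustration only, not the actual utility): select from the source, gzip the blob column row by row, and push the rows to the destination with SqlBulkCopy in 50k-row batches. The connection strings, column and table names, and the date range are placeholders.

    // Illustrative C# loader sketch: source select -> gzip one column -> SqlBulkCopy to destination.
    // All names and connection strings are placeholders, not the author's actual utility.
    using System;
    using System.Data;
    using System.Data.SqlClient;
    using System.IO;
    using System.IO.Compression;

    static class BulkLoaderSketch
    {
        // Compress a byte[] with gzip (the "gzip the blob on the fly" step).
        static byte[] Gzip(byte[] input)
        {
            using (var ms = new MemoryStream())
            {
                using (var gz = new GZipStream(ms, CompressionMode.Compress))
                    gz.Write(input, 0, input.Length);
                return ms.ToArray();
            }
        }

        static void Main()
        {
            const int batchSize = 50000;                      // bulk-load batch size from the post

            // Buffer one batch of rows; the blob column holds the gzipped payload.
            var buffer = new DataTable();
            buffer.Columns.Add("a", typeof(int));
            buffer.Columns.Add("b", typeof(string));
            buffer.Columns.Add("blobfield", typeof(byte[]));

            using (var src = new SqlConnection("source connection string here"))
            using (var dst = new SqlConnection("destination connection string here"))
            {
                src.Open();
                dst.Open();

                var cmd = new SqlCommand(
                    "select a, b, convert(varbinary(max), blobfield) as blobfield " +
                    "from tbl where createdon between @x and @y", src);
                cmd.Parameters.AddWithValue("@x", new DateTime(2016, 1, 1));   // placeholder range
                cmd.Parameters.AddWithValue("@y", new DateTime(2016, 2, 1));

                using (var reader = cmd.ExecuteReader())
                using (var bulk = new SqlBulkCopy(dst)
                       { DestinationTableName = "destinationtable", BatchSize = batchSize })
                {
                    while (reader.Read())
                    {
                        buffer.Rows.Add(reader.GetInt32(0),
                                        reader.GetString(1),
                                        Gzip((byte[])reader["blobfield"]));    // compress on the fly
                        if (buffer.Rows.Count == batchSize)
                        {
                            bulk.WriteToServer(buffer);                        // push one batch
                            buffer.Clear();
                        }
                    }
                    if (buffer.Rows.Count > 0)
                        bulk.WriteToServer(buffer);                            // final partial batch
                }
            }
        }
    }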

  • Jeff Moden - Friday, February 17, 2017 4:29 PM

    Good info. I've not had to bring in quite that much in the past, only about half that. So let me ask, please: what are you using to split the original monolithic file into smaller chunks to support the parallel loads? Do you use things like the FirstRow/LastRow options in the native tools? Or are there multiple smaller files to begin with?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • frederico_fonseca - Friday, February 17, 2017 5:02 PM

    It depends.

    If I'm given files and have no control over how they are extracted, and they need splitting, a small C# utility does the trick, either an in-house one or just the plain Unix split utility (Unix or Windows version). Sometimes just copying the file is enough and then using a "skip x records" option; it is very easy to skip records in SSIS (or C#). A sketch of a simple splitter appears below.

    If I have full control over the extract, I also have an automated process to issue any SQL against any table and output it to a file, delimited or fixed width, or load it directly into another table. With this, the logical split is done as part of the SQL extract.

    When writing to files I use fixed width when the source data may contain types of data that are incompatible with line-sequential readers like SSIS; I work a lot with databases that contain binary data in their varchar fields 🙂

    As many of my clients prefer to use SSIS for this, I even have some C# code to create the packages based on metadata (and no, BIML won't do!).
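    For illustration only (not the in-house tool), a line-based splitter along those lines is straightforward in C#; the file paths and chunk size below are arbitrary placeholders.

    // Illustrative C# splitter sketch: break one large text file into fixed-size chunks of lines.
    // Paths and the chunk size are placeholders.
    using System.IO;

    static class FileSplitSketch
    {
        static void Main()
        {
            const string source = @"C:\load\bigfile.txt";   // hypothetical monolithic input file
            const int linesPerChunk = 5000000;              // arbitrary chunk size

            using (var reader = new StreamReader(source))
            {
                StreamWriter writer = null;
                string line;
                int lineInChunk = 0, chunkNo = 0;

                while ((line = reader.ReadLine()) != null)
                {
                    // Start a new part file at the beginning and whenever the current one is full.
                    if (writer == null || lineInChunk == linesPerChunk)
                    {
                        writer?.Dispose();
                        writer = new StreamWriter($@"C:\load\bigfile.part{chunkNo++}.txt");
                        lineInChunk = 0;
                    }
                    writer.WriteLine(line);
                    lineInChunk++;
                }
                writer?.Dispose();
            }
        }
    }

    The "skip x records" approach mentioned above amounts to the same idea without splitting: each parallel loader opens the same file and skips to its own range of rows before loading.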

  • Jeff Moden - Friday, February 17, 2017 5:45 PM

    Thanks for the feedback.  I really appreciate it.  My interest was mostly in what you used for the "recent" load of 500M rows/600 GB.

  • frederico_fonseca

    A PowerShell script does the parallel processing: it reads the command queue and spawns up to x processes in parallel, starting a new process as each execution finishes so there is always the same number of processes working until the queue is exhausted (a C# sketch of this queue-driven spawning appears below).

    A command queue holds the work items.

    A C# program issues the SQL against the source database and gzips/loads the result onto the destination table. It is supplied with the SQL to extract just one block of data at a time (in this case based on a createdon date field).

    Each command line executed was as follows, varying only x and y:

    migrate_data /srcconstr srcconnstring /destconnstr dstconnstring /query "select a,b,convert(varbinary(max), blobfield) as blobfield from tbl where createdon between x and y" /compresscolumn "blobfield" /desttable destinationtable /batchsize 50000
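    For illustration, here is the same queue-driven pattern as a C# sketch rather than the PowerShell script described above: commands are read from a plain-text queue file (one command line per row), up to maxParallel processes run at once, and a new one starts as soon as any finishes. The queue file path and the degree of parallelism are placeholders.

    // Illustrative C# sketch of the queue-driven parallel runner (the post uses PowerShell).
    // Each line of the queue file is a complete command, such as the migrate_data call above.
    using System;
    using System.Collections.Concurrent;
    using System.Diagnostics;
    using System.IO;
    using System.Threading.Tasks;

    static class ParallelQueueSketch
    {
        static void Main()
        {
            const int maxParallel = 6;                       // 6 to 12 concurrent processes per the post
            var queue = new ConcurrentQueue<string>(File.ReadAllLines(@"C:\load\command_queue.txt"));

            var workers = new Task[maxParallel];
            for (int i = 0; i < maxParallel; i++)
            {
                workers[i] = Task.Run(() =>
                {
                    // Each worker keeps pulling commands until the queue is empty,
                    // so the number of running processes stays constant.
                    while (queue.TryDequeue(out var commandLine))
                    {
                        var parts = commandLine.Split(new[] { ' ' }, 2);
                        var psi = new ProcessStartInfo(parts[0], parts.Length > 1 ? parts[1] : string.Empty)
                        {
                            UseShellExecute = false
                        };
                        using (var proc = Process.Start(psi))
                            proc.WaitForExit();              // wait, then pick up the next command
                    }
                });
            }
            Task.WaitAll(workers);
            Console.WriteLine("Queue finished.");
        }
    }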

  • Jeff Moden

    Sounds really interesting. You should do an article here on SSC on what you've done; I'm thinking it would be a smash hit.

