Sorting a large volume of data

  • This question is about how to sort a very large volume of data using SQL Server 2008 R2. I have ~3,500 pipe-delimited text files totaling 300+ GB of data and ~4.8 billion records. All of the files are in the same format; that is, they all have the same columns and would go into the same table. I would like to merge all of these files and sort by two of the columns: column 1 would be the primary sort column and column 15 the secondary sort column.

    I put together a simple SSIS package and a batch file that I can use to import all of the data into a SQL Server database table.

    @echo off
    setlocal ENABLEDELAYEDEXPANSION
    set /a fileCount=0

    rem Loop over every .txt file in d:\temp; "delims=" keeps file names with spaces intact.
    for /F "delims=" %%i in ('dir d:\temp\*.txt /b') do (
        rem Run the SSIS package once per file, passing the file path in and logging errors.
        dtexec /f Test.dtsx /set \Package.Variables[User::SourceFilePath].Value;"d:\temp\%%i" /Reporting E >> d:\temp\DataImport.log
        set /a fileCount=fileCount+1
        echo Processed %%i, count: !fileCount!
    )
    endlocal

    My goal is simply to produce a single text file that is sorted by two of the columns. What would be the fastest way to accomplish this?

    Should I just create a simple table with no indexes, load the data using my SSIS package, and then export it using SSIS and a simple SELECT...ORDER BY statement? The SQL Server has plenty of storage, two 8-core processors, and 128 GB of RAM. I don't know how SQL Server will perform the sort when it can't hold all of the data in memory, or whether there are things I can do to help the process along. A rough sketch of what I mean is below. Thanks.
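    For illustration, a minimal T-SQL sketch of that approach: a plain heap staging table with no indexes, then an ORDER BY export query. The table and column names are placeholders, not taken from the actual files.

    -- Hypothetical heap staging table; column names are placeholders.
    CREATE TABLE dbo.Staging
    (
        Col01 VARCHAR(50) NOT NULL,  -- primary sort column (column 1)
        -- ... columns 2 through 14 ...
        Col15 VARCHAR(50) NOT NULL   -- secondary sort column (column 15)
        -- ... remaining columns ...
    );

    -- After the SSIS load completes, export everything in sorted order.
    -- SQL Server will spill the sort to tempdb if it exceeds available memory.
    SELECT *
    FROM dbo.Staging
    ORDER BY Col01, Col15;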

  • Apologies for not offering help here, but I do have a couple of questions... Why do you need 4.8 billion rows sorted in a single file? What are you going to do with that file once it has been constructed?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
        Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • The data will ultimately be used for analysis in SAS. It's not so much about producing a single text file as it is a question of the optimal way of producing all of the data in sorted form, given my input, using SQL Server. I don't know at this point whether the data will remain in SQL Server and be read through a SQL Server connector for SAS, or whether it will be moved into a sorted SAS dataset. I asked if the data really had to be sorted and was told yes.

    Does anyone have any thoughts about how to do (or not do) this in SQL Server, whether SQL Server is in fact a good tool for this, or whether there are other free or inexpensive tools for producing my desired output?

  • Hi,

    Heavy load.

    My approach would be as follows:

    You can change your package to use a Foreach Loop container, put a Data Flow task inside the loop, and then, in the data flow, use an OLE DB Destination component pointing at your table. Use the "fast load" data access mode and specify "Rows per batch" (e.g. 10,000) for more efficient loading. These two options will make the load faster (see this video if you run into issues: http://www.youtube.com/watch?v=o1rhIC-rSVE).
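    For reference, a T-SQL BULK INSERT with a batch size behaves much like those fast-load settings; this is just a sketch with a hypothetical file path and table name:

    -- Hypothetical example: batched load of one pipe-delimited file.
    BULK INSERT dbo.Staging
    FROM 'd:\temp\file0001.txt'
    WITH
    (
        FIELDTERMINATOR = '|',    -- pipe-delimited input
        ROWTERMINATOR   = '\n',
        BATCHSIZE       = 10000,  -- commit every 10,000 rows, like "Rows per batch"
        TABLOCK                   -- allows minimal logging on a heap under bulk-logged/simple recovery
    );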

    When you specify ORDER BY..., SQL Server will use tempdb for any sort that does not fit in memory, so for data this big you should have tempdb prepared.
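    For example, pre-growing tempdb ahead of the load avoids repeated autogrow pauses during the sort; the sizes below are placeholders to adapt to your server:

    -- Hypothetical sizes; adjust to your disks. Pre-growing avoids autogrow stalls mid-sort.
    ALTER DATABASE tempdb
        MODIFY FILE (NAME = tempdev, SIZE = 100GB, FILEGROWTH = 10GB);
    ALTER DATABASE tempdb
        MODIFY FILE (NAME = templog, SIZE = 20GB, FILEGROWTH = 5GB);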

    You can create a clustered index with column 1 and column 15 as the keys, and then by running

    select * from yourTable with (index(index_name)) -- index hint

    you'll have your data listed sorted by column 1 and column 15.
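    A sketch of that index, created after the load and reusing the placeholder names from the earlier sketch (note that later replies recommend relying on an explicit ORDER BY rather than the hint to guarantee output order):

    -- Hypothetical names; the key order matches the required sort order.
    CREATE CLUSTERED INDEX CIX_Staging_Col01_Col15
        ON dbo.Staging (Col01, Col15);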

    Regards,

    Igor

    Igor Micev, My blog: www.igormicev.com

  • cjmorelock (1/31/2014)


    I have ~3,500 pipe-delimited text files totaling 300+ GB of data and ~4.8 billion records.

    Well, minimally logged operations just went out the window, unless you're going to merge prior to delivery (don't).

    I would like to merge all of these files and sort by two of the columns: column 1 would be the primary sort column and column 15 the secondary sort column.

    I'd let the Foreach component insert into a staging table that already has a clustered index on the necessary columns. Pad the fill factor to try to avoid page splits; a sketch follows below.
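    Concretely, the padding mentioned above maps to a fill factor on the clustered index, created before the load; this reuses the hypothetical dbo.Staging table sketched earlier:

    -- Create the clustered index BEFORE loading, leaving ~20% free space per page
    -- so rows arriving out of key order cause fewer page splits.
    CREATE CLUSTERED INDEX CIX_Staging_Col01_Col15
        ON dbo.Staging (Col01, Col15)
        WITH (FILLFACTOR = 80);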

    My goal is simply to produce a single text file that is sorted by two of the columns. What would be the fastest way to accomplish this?

    Once the data is sorted via the clustered index, call it back out the same way (SELECT * FROM tbl ORDER BY ColA, ColB); with the clustered index in place, that ORDER BY is a cheap ordered scan rather than a fresh sort. Also make sure your OLE DB source is set to deliver ordered results.

    Avoid doing sorts of any form in SSIS; the Sort transformation is fully blocking and breaks the data stream.

    Somewhere or another you're going to pay the price of sorting all that data. I'd personally let the engine do it in the staging table, in pieces.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions | Forum Netiquette
    For index/tuning help, follow these directions. | Tally Tables

    Twitter: @AnyWayDBA

  • Igor Micev (2/4/2014)


    You can change your package to use a Foreach Loop container [...] use the "fast load" data access mode and specify "Rows per batch" [...] You can create a clustered index with column 1 and column 15 as the keys [...] you'll have your data listed sorted by column 1 and column 15.

    If you have already preloaded the table in SQL Server, there is no need for the package to have that table as an OLE DB Destination.

    If the table is to be populated during this process, then yes, the OLE DB Destination will be needed in one data flow; otherwise it can be skipped altogether. It depends on when you actually populate the table that merges the files.

    If the table exists already, then inside a Data Flow, add an OLE DB Source. In that source you can put your SELECT statement against the existing table and place the ORDER BY (column 1, column 15) there. The index hint is really unnecessary and can lead to issues down the road if this package is to be used regularly; just think what happens if somebody decides the index needs to be renamed or removed.

    Then add your Flat File destination.

    You can add a Sort transformation between the source and the destination. It is sometimes necessary, but it should only be used if the sort in your query fails to maintain the order for some reason.

    Based on the size of the records and the number of files, you might need to consider adding a Script Task that forces a garbage collection after each file. I have seen this help performance immensely when SSIS gets bogged down.

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw
    Learn Extended Events

  • SQLRNNR (2/4/2014)


    If you have already preloaded the table in SQL Server, there is no need for the package to have that table as an OLE DB Destination. [...] In that source you can put your SELECT statement against the existing table and place the ORDER BY (column 1, column 15) there. The index hint is really unnecessary [...]

    Merges are going to impose performance issues. The idea is to insert the data from the flat files into the table quickly.

    Then the sort can be performed in various ways.

    Igor Micev, My blog: www.igormicev.com
