Sorting data files before merge join

  • Hi,

    I have extracted the data to a .txt file from the server using BCP. This data needs to be loaded into another server with the help of SSIS, and I have seen that the .txt file data is not sorted. The extracted file is the source for my SSIS job. I have to send the entire data set through a Merge Join, which requires the data to be sorted. But the data in the .txt file is not sorted, the file is around 10 GB, and the package gets stuck due to memory issues in SSIS. Please help me sort the data (either in SSIS or in the .txt file) before sending it to the Merge Join.

    Note:

    1. The column that needs to be sorted is of datatype UNIQUEIDENTIFIER.

    2. I don't have any staging area where I can load the extracted data into a table and continue the SSIS process.

  • ambajirao (5/8/2009)


    Hi,

    I have extracted the data to a .txt file from the server using BCP. This data needs to be loaded into another server with the help of SSIS, and I have seen that the .txt file data is not sorted. The extracted file is the source for my SSIS job. I have to send the entire data set through a Merge Join, which requires the data to be sorted. But the data in the .txt file is not sorted, the file is around 10 GB, and the package gets stuck due to memory issues in SSIS. Please help me sort the data (either in SSIS or in the .txt file) before sending it to the Merge Join.

    Note:

    1. The column that needs to be sorted is of datatype UNIQUEIDENTIFIER.

    2. I don't have any staging area where I can load the extracted data into a table and continue the SSIS process.

    You have to use the Data Flow task. In it you have to:

    1. Set up a source component pointing to your data.

    2. Connect the source to a Sort component.

    3. Connect the Sort to a Merge Join component.

    Give it a try and let us know how it goes.

    ---
    SSIS Tasks Components Scripts Services | http://www.cozyroc.com/

  • Sorting a 10 GB text file? I can imagine some complaints from the server ...

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • ouch... sorting large data sets in SSIS for a merge join is a nightmare

    Even if the data were already sorted in the text file, SSIS will still want to sort it again so that it can add the IsSorted attribute flag to the data stream before merge joining.

    From what I can understand from your post, the major problem is that you don't have a database staging area where you can temporarily load data for manipulation.

    Easiest solution would be to fix this staging-area shortfall! More of a company politics problem than a technical one 🙂

    The other option is to use a different tool to sort the text file before it reaches SSIS.

    Log Parser is a free utility from Microsoft that can be used to execute SQL-style commands against many different sources, including text files.

    Link to MS Log Parser

    I have not tested this against large amounts of data, but it might offer you a shortcut.
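    To make the Log Parser suggestion concrete, here is a hedged sketch. It assumes a headerless, tab-delimited BCP extract, in which case Log Parser auto-names the columns Field1, Field2, and so on; the file and field names are placeholders to adjust for the real file. The command is echoed, and executed only if the (Windows-only) Log Parser tool is actually on the PATH:

    ```shell
    # Hypothetical file names; Field1 is Log Parser's auto-generated name
    # for the first column of a headerless TSV input.
    LP_CMD='LogParser -i:TSV -headerRow:OFF -o:TSV "SELECT * INTO sorted.txt FROM extract.txt ORDER BY Field1"'
    echo "$LP_CMD"

    # Log Parser is a Windows tool; only run the command where it exists.
    if command -v LogParser >/dev/null 2>&1; then
      eval "$LP_CMD"
    fi
    ```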

  • A 10 GB file can be sorted, and you do not have to buy new hardware.

    Microsoft is promoting the idea that doing everything in memory is great. It definitely speeds up the process, plus it helps large companies sell hardware and memory. Sorting algorithms are very well documented, and it is difficult to invent something new.

    When working with large datasets, there are ways of sorting data without buying new hardware.

    More Information about sorting

    Have you ever wondered why Java and .NET are so slow => to make us buy more hardware...
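    For what it's worth, the command-line sort utility already implements exactly this kind of disk-backed external merge sort, so a 10 GB file can be sorted with bounded memory before SSIS ever sees it. A minimal sketch with GNU sort, using a tiny stand-in file (file names and buffer size are placeholders; Windows ships a more limited sort.exe of its own):

    ```shell
    # Tiny stand-in for the 10 GB extract: one key per line, out of order.
    printf 'c\na\nb\n' > unsorted.txt

    # -S caps the in-memory buffer; anything bigger spills to temp files in
    # the -T directory and is merged back -- a disk-backed external merge sort.
    sort -S 64M -T . unsorted.txt -o sorted.txt

    cat sorted.txt   # prints a, b, c on separate lines
    ```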

  • Mike-1076638 (10/10/2010)


    A 10 GB file can be sorted, and you do not have to buy new hardware.

    Microsoft is promoting the idea that doing everything in memory is great. It definitely speeds up the process, plus it helps large companies sell hardware and memory. Sorting algorithms are very well documented, and it is difficult to invent something new.

    When working with large datasets, there are ways of sorting data without buying new hardware.

    More Information about sorting

    Have you ever wondered why Java and .NET are so slow => to make us buy more hardware...

    You got that wrong, buddy. Hardware is cheap these days. What is expensive is reinventing the wheel. Java and .NET are not slow either; that is a myth.


  • Samuel Vella (5/11/2009)


    ouch... sorting large data sets in SSIS for a merge join is a nightmare

    Even if the data were already sorted in the text file, SSIS will still want to sort it again so that it can add the IsSorted attribute flag to the data stream before merge joining. ...

    That is incorrect. SSIS does not add the IsSorted flag. If you know the data is sorted, YOU set the flag to TRUE and SSIS will assume the data is sorted.



    Alvin Ramard
    Memphis PASS Chapter

    All my SSC forum answers come with a money back guarantee. If you didn't like the answer then I'll gladly refund what you paid for it.

    For best practices on asking questions, please read the following article: Forum Etiquette: How to post data/code on a forum to get the best help

  • There are tons of sophisticated sorting algorithms out there. Sort your text file before you load it into SSIS, using whatever programming language you feel comfortable in (or whatever language you need to use due to some conspiracy theories).

    After the file is sorted, use a Flat File source and set its IsSorted flag to TRUE. Then finally use the Merge Join.

    (Or, to avoid such issues: can't you get the data directly from the source server? That way you can use a regular join in T-SQL. Much easier...)

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • I'm not suggesting this as a way to sort big files. This is just an idea for a bit of fun.

    I wonder if there's a way to implement a bubble sort in the data flow. One thing for sure, it wouldn't be fast.



    Alvin Ramard

  • Alvin Ramard (10/12/2010)


    --

    One thing for sure, it wouldn't be fast.

    I'd like to nominate this comment for UNDERSTATEMENT OF THE DAY 😀

    Phil Parkin

  • Phil Parkin (10/12/2010)



    I'd like to nominate this comment for UNDERSTATEMENT OF THE DAY 😀

    No argument there. I'm trying to wrap my head around what it would take, and I keep thinking that you'd have to go through the data flow as many times as you would normally iterate through the loop.

    Q: How long did it take to sort 1 million records?

    A: Hang on, it's almost done. ... Just hang on, almost there. ... Be patient now, it's almost done.



    Alvin Ramard

  • Phil Parkin (10/12/2010)



    I'd like to nominate this comment for UNDERSTATEMENT OF THE DAY 😀

    Quicksort on a single column shouldn't be all that slow. You have to remember that only pointers are sorted, not the full 10 GB of data.

    Well, that's how I'd sort this anyway :w00t:.

  • Ninja's_RGR'us (10/12/2010)



    Quicksort on a single column shouldn't be all that slow. You have to remember that only pointers are sorted, not the full 10 GB of data.

    Well that's how I'd sort this anyways :w00t:.

    Pointers to rows in a text file? I still have a lot to learn if that is possible ...

    Phil Parkin

  • Phil Parkin (10/12/2010)



    Pointers to rows in a text file? I still have a lot to learn if that is possible ...

    ... once in RAM.

    I've coded both binary searches and quicksort back in school, and even at 1M rows it was super fast.

    Now I frankly have no clue how SQL and SSIS actually do this. I can barely spell SISS, let alone work with it :hehe:.

  • Since the source of the data is a .txt file created with BCP, why don't you just produce the .txt file in the order you want, by using a query with an ORDER BY as the source for the BCP command?

    This will almost certainly be faster than the other methods mentioned, and will be much simpler.
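    If that last suggestion is taken, the fix lands in the BCP step itself: switch from a table export to a queryout export whose query carries the ORDER BY. The server, database, table, and column names below are placeholders. One caveat: SQL Server orders uniqueidentifier values by its own byte ordering, which is not the same as a plain text sort, so both merge-join inputs must be sorted the same way. The command is echoed, and executed only if the bcp utility is actually installed:

    ```shell
    # queryout lets bcp export the result of an arbitrary query, so the
    # ORDER BY runs on the source server and the file arrives pre-sorted.
    # -c = character format, -T = trusted connection; all names are placeholders.
    BCP_CMD='bcp "SELECT * FROM MyDb.dbo.MyTable ORDER BY GuidCol" queryout sorted.txt -c -T -S SourceServer'
    echo "$BCP_CMD"

    # Only execute where the bcp utility is actually installed.
    if command -v bcp >/dev/null 2>&1; then
      eval "$BCP_CMD"
    fi
    ```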
