May 8, 2009 at 2:28 am
Hi,
I have extracted data to a .txt file from a server using BCP. This data needs to be loaded into another server with the help of SSIS, and I have seen that the data in the .txt file is not sorted. The extracted file is the source for my SSIS job. I have to send the full data set through a Merge Join, which requires the data to be sorted. But the data in the .txt file is not sorted, the file is around 10GB, and the package gets stuck due to memory issues in SSIS. Please help me figure out how to sort the data (either in SSIS or in the .txt file) before sending it to the Merge Join.
Note:
1. The column that needs to be sorted is a UNIQUEIDENTIFIER.
2. I don't have any staging area where I can load the extracted data into a table and continue the SSIS process.
May 11, 2009 at 6:49 am
ambajirao (5/8/2009)
Hi, I have extracted data to a .txt file from a server using BCP. This data needs to be loaded into another server with the help of SSIS, and I have seen that the data in the .txt file is not sorted. The extracted file is the source for my SSIS job. I have to send the full data set through a Merge Join, which requires the data to be sorted. But the data in the .txt file is not sorted, the file is around 10GB, and the package gets stuck due to memory issues in SSIS. Please help me figure out how to sort the data (either in SSIS or in the .txt file) before sending it to the Merge Join.
Note:
1. The column that needs to be sorted is a UNIQUEIDENTIFIER.
2. I don't have any staging area where I can load the extracted data into a table and continue the SSIS process.
You have to use the data flow task. In it you have to:
1. Set up a source component pointing to your data.
2. Connect the source to a Sort component.
3. Connect the Sort to a Merge Join component.
Give it a try and let us know how it goes.
May 11, 2009 at 6:57 am
Sorting a 10GB text file? I can imagine some complaints from the server ....
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
May 11, 2009 at 7:01 am
Ouch... sorting large data sets in SSIS for a merge join is a nightmare.
Even if the data were sorted in the text file, SSIS will still want to sort it again so that it can add the IsSorted attribute flag to the data stream before merge joining.
From what I can understand from your post, the major problem is that you don't have a database staging area you can temporarily load data into for manipulation.
The easiest solution would be to fix this staging-area shortfall! That's more of a company political problem than a technical problem 🙂
The other option is to use a different tool to sort the text file before SSIS reads it.
Log Parser is a free utility from Microsoft that can be used to execute SQL commands against many different sources, including text files.
I have not tested this against large amounts of data, but it might offer you a shortcut.
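As a rough, untested sketch (the input/output file names, TSV format, and column name are placeholders for whatever your file actually contains), a Log Parser 2.2 command could look something like this:

LogParser.exe -i:TSV -o:TSV "SELECT * INTO sorted_data.txt FROM unsorted_data.txt ORDER BY guid_column"

One caveat: as far as I know, Log Parser performs the ORDER BY in memory, so a 10GB file may hit the same memory wall as the SSIS Sort component.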
October 10, 2010 at 9:47 am
A 10GB file can be sorted, and you do not have to buy new hardware.
Microsoft is promoting the idea that doing everything in memory is great. It definitely speeds up the process, plus it helps large companies sell hardware and memory. Sorting algorithms are very well documented, and it is difficult to invent something new.
When working with large datasets, there are ways of sorting data without buying new hardware.
More information about sorting
Have you ever wondered why Java and .NET are so slow? => To make us buy more hardware...
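The textbook way to do this without extra RAM is an external merge sort: sort chunks that do fit in memory, spill them to disk, then stream a k-way merge. A minimal Python sketch (assuming a tab-delimited file with the sort key in the first column; the file names, delimiter, and chunk size are placeholders to tune):

import heapq
import os
import tempfile

CHUNK_LINES = 1_000_000  # tune to available memory

def sort_key(line):
    # assumption: tab-delimited, sort key (the GUID) in the first column
    return line.split('\t', 1)[0]

def external_sort(in_path, out_path):
    chunk_paths = []
    # pass 1: sort chunks that fit in memory and spill them to temp files
    with open(in_path) as src:
        while True:
            chunk = [line for _, line in zip(range(CHUNK_LINES), src)]
            if not chunk:
                break
            chunk.sort(key=sort_key)
            fd, path = tempfile.mkstemp(text=True)
            with os.fdopen(fd, 'w') as tmp:
                tmp.writelines(chunk)
            chunk_paths.append(path)
    # pass 2: stream a k-way merge of the sorted runs
    runs = [open(p) for p in chunk_paths]
    try:
        with open(out_path, 'w') as dst:
            dst.writelines(heapq.merge(*runs, key=sort_key))
    finally:
        for r in runs:
            r.close()
        for p in chunk_paths:
            os.remove(p)

One thing to watch: a plain string sort will not match SQL Server's uniqueidentifier ordering (SQL Server compares GUIDs by byte groups, starting from the rightmost group), so make sure both merge join inputs and any IsSorted flags agree on whichever ordering you pick.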
October 10, 2010 at 10:01 am
Mike-1076638 (10/10/2010)
A 10GB file can be sorted, and you do not have to buy new hardware. Microsoft is promoting the idea that doing everything in memory is great. It definitely speeds up the process, plus it helps large companies sell hardware and memory. Sorting algorithms are very well documented, and it is difficult to invent something new.
When working with large datasets, there are ways of sorting data without buying new hardware.
More information about sorting
Have you ever wondered why Java and .NET are so slow? => To make us buy more hardware...
You got that wrong, buddy. Hardware is cheap these days. What is expensive is reinventing the wheel. Java and .NET are not slow either; that is a myth.
October 10, 2010 at 8:29 pm
Samuel Vella (5/11/2009)
Ouch... sorting large data sets in SSIS for a merge join is a nightmare. Even if the data were sorted in the text file, SSIS will still want to sort it again so that it can add the IsSorted attribute flag to the data stream before merge joining. ...
That is incorrect. SSIS does not add the IsSorted flag. If you know the data is sorted, YOU set the flag to TRUE and SSIS will assume the data is sorted.
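For anyone looking for where that lives (from memory, so verify in your version of BIDS; the column name is just an example), the flag sits on the source's output rather than on the component itself:

Right-click the Flat File Source -> Show Advanced Editor
  -> Input and Output Properties -> Flat File Source Output -> IsSorted = True
  -> Output Columns -> YourGuidColumn -> SortKeyPosition = 1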
For best practices on asking questions, please read the following article: Forum Etiquette: How to post data/code on a forum to get the best help
October 12, 2010 at 6:29 am
There are tons of sophisticated sorting algorithms out there. Sort your text file before you load it into SSIS. Use whatever programming language you feel comfortable in (or whatever language you need to use due to some conspiracy theories).
After the file is sorted, use a Flat File Source and set its IsSorted flag to TRUE. Then, finally, use the Merge Join.
(or, to avoid such issues: can't you get the data directly from the source server? That way you can use a regular join in T-SQL. Much easier...)
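For what it's worth, a minimal sketch of that last idea, with entirely hypothetical linked-server/table/column names:

SELECT s.RowGuid, s.Payload, d.OtherColumn
FROM SourceServer.SourceDb.dbo.BigTable AS s
JOIN dbo.LocalTable AS d
    ON d.RowGuid = s.RowGuid;

With the join done in T-SQL, you don't need a Sort or a Merge Join in the package at all.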
Need an answer? No, you need a question
My blog at https://sqlkover.com.
MCSE Business Intelligence - Microsoft Data Platform MVP
October 12, 2010 at 7:19 am
I'm not suggesting this as a way to sort big files. This is just an idea for a bit of fun.
I wonder if there's a way to implement a bubble sort in the data flow. One thing for sure, it wouldn't be fast.
For best practices on asking questions, please read the following article: Forum Etiquette: How to post data/code on a forum to get the best help
October 12, 2010 at 7:22 am
Alvin Ramard (10/12/2010)
--One thing for sure, it wouldn't be fast.
I'd like to nominate this comment for UNDERSTATEMENT OF THE DAY 😀
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
October 12, 2010 at 7:38 am
Phil Parkin (10/12/2010)
Alvin Ramard (10/12/2010)
--One thing for sure, it wouldn't be fast.
I'd like to nominate this comment for UNDERSTATEMENT OF THE DAY 😀
No argument there. I'm trying to wrap my head around what it would take, and I keep thinking you'd have to go through the data flow as many times as you'd normally go through the loop.
Q: How long did it take to sort 1 million records?
A: Hang on, it's almost done. ... Just hang on, almost there. ... Be patient now, it's almost done.
For best practices on asking questions, please read the following article: Forum Etiquette: How to post data/code on a forum to get the best help
October 12, 2010 at 7:47 am
Phil Parkin (10/12/2010)
Alvin Ramard (10/12/2010)
--One thing for sure, it wouldn't be fast.
I'd like to nominate this comment for UNDERSTATEMENT OF THE DAY 😀
Quicksort on a single column shouldn't be all that slow. You have to remember that only pointers are sorted, not the full 10 GB of data.
Well that's how I'd sort this anyways :w00t:.
October 12, 2010 at 7:55 am
Ninja's_RGR'us (10/12/2010)
Phil Parkin (10/12/2010)
Alvin Ramard (10/12/2010)
--One thing for sure, it wouldn't be fast.
I'd like to nominate this comment for UNDERSTATEMENT OF THE DAY 😀
Quicksort on a single column shouldn't be all that slow. You have to remember that only pointers are sorted, not the full 10 GB of data.
Well that's how I'd sort this anyways :w00t:.
Pointers to rows in a text file? I still have a lot to learn if that is possible ...
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
October 12, 2010 at 8:08 am
Phil Parkin (10/12/2010)
Ninja's_RGR'us (10/12/2010)
Phil Parkin (10/12/2010)
Alvin Ramard (10/12/2010)
--One thing for sure, it wouldn't be fast.
I'd like to nominate this comment for UNDERSTATEMENT OF THE DAY 😀
Quicksort on a single column shouldn't be all that slow. You have to remember that only pointers are sorted, not the full 10 GB of data.
Well that's how I'd sort this anyways :w00t:.
Pointers to rows in a text file? I still have a lot to learn if that is possible ...
... once in RAM.
I've coded both binary searches and quicksort back in school, and even at 1M rows it was super fast.
Now I frankly have no clue how SQL and SSIS actually do this. I can barely spell SISS let alone work with it :hehe:.
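To make the "pointers, once in RAM" idea concrete, here's a toy Python sketch: it builds an in-memory index of (key, byte offset) pairs, sorts only that index, and writes the rows out in order by seeking. The layout (tab-delimited, key in the first column) and file names are assumptions, and the index itself still has to fit in RAM:

# toy sketch: sort "pointers" (byte offsets) to rows, not the rows themselves
def sort_by_offsets(in_path, out_path):
    index = []  # (key, offset) pairs -- the in-RAM "pointers"
    with open(in_path, 'rb') as f:
        offset = 0
        for line in f:
            index.append((line.split(b'\t', 1)[0], offset))
            offset += len(line)
    index.sort()  # only the small index is sorted, not the 10 GB of rows
    with open(in_path, 'rb') as src, open(out_path, 'wb') as dst:
        for _, off in index:
            src.seek(off)
            dst.write(src.readline())

The random seeks in the second pass are brutal on a spinning disk, though, which is why the external merge sort mentioned earlier is usually the practical choice.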
October 12, 2010 at 8:24 am
Since the source of the data is a TXT file created with BCP, why don't you just create the TXT file in the order you want, by using a query with an ORDER BY as the source for the BCP command?
This will almost certainly be faster than the other methods mentioned, and will be much simpler.
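Something along these lines (server, database, table, and column names are placeholders, and -T assumes a trusted/Windows-authenticated connection):

bcp "SELECT col1, col2, guid_col FROM MyDb.dbo.MyTable ORDER BY guid_col" queryout C:\export\sorted_data.txt -c -t"\t" -T -S SourceServer

Just remember that ORDER BY on a uniqueidentifier uses SQL Server's own GUID ordering, so set IsSorted and SortKeyPosition on the flat file source to describe that same ordering.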