Select and Process X number of records at a time

Question

Select and Process X number of records at a time

dmusson17

Ten Centuries

Points: 1235
More actions
June 18, 2008 at 9:16 am

#122817

Hello,
I am getting data from a remote source via an odbc driver in a DataReader Source. I am limited to selecting 2000 records at a time but need to select many more on a regular basis. The recommended way of getting around this is querying once to get the ID numbers that I need (I can do an unlimited query when only selecting ID and recmoddate) and then use that result set to build a subsequent select statement that selects where id# in id1, id2, id3, ... id2000.
Then I would need to create the next select statement to get ids 2001-4000, and so on.
I am having a hard time conceptualizing how to accomplish this in SSIS. Do I utilize a few for loop containers and some variables or is there a more desirable way to accomplish this?
Any guidance is much appreciated.
Dave

Viewing 13 posts - 1 through 12 (of 12 total)

You must be logged in to reply to this topic. Login to reply

Brandie Tarvin SSC Guru Points: 173004 More actions · Answer 1

Hmm. I was going to recommend using the Row_Number() function to build your original query, then realized your post sounds like you're not pulling your source records from SQL 2k5. Is that correct?

In that case, try this. Create 2 package wide variables (Minnie & Maxie as Int16 for example). Set Minnie at 0 and Maxie at 2000 for the initial variable setup. Create a 3rd variable that will (shortly) contain the value of the maximum ID you'll be pulling (We'll call it MaxID) but set it to 0 for starters.

Create an Execute SQL Task (or Script Task) to populate MaxID with the max id number. When you connect the success of this step to the next step, make sure to change the success to Expression & Constraint, using this to check MaxID for a value other than 0. This will allow you to do a check before you go further.

Pull in a FOR Loop Container. (Note: I haven't needed to use this container, but I'm pretty sure it'll work). In the InitExpression you can use your Maxie & Minnie variables to start out with, Set the EvalExpression equal to MaxID and then use AssignExpression to increment both Minnie & Maxie by 2000 each time.

In your database, have a stored procedure which accepts input values in the WHERE clause and pulls data from your source based on "WHERE MyID >= @Minnie and MyID < @Maxie". --Test this to make sure it works in SSMS first!!!

Now, I think you can populate the OLE DB Source of a Data Flow task with package variables, but I can't remember. So if the following info is wrong, let me know and I'll play with it....

Inside the FOR LOOP Container, add a Data Flow Task with an OLE DB Source using SQL Command that will call the proc mentioned above. ("Exec MyProc @Minnie, @Maxie"). Add OLE DB Destination (or whatever) plus any steps in between...Test thoroughly.

Not knowing your data or your source, I can't give more specific details, but I believe this is a good place for you to start. And if this doesn't work, putting it together should give you enough of an idea of what you're doing to find the solution you need.

Brandie Tarvin, MCITP Database AdministratorLiveJournal Blog: http://brandietarvin.livejournal.com/[/url]On LinkedIn!, Google+, and Twitter.Freelance Writer: ShadowrunLatchkeys: Nevermore, Latchkeys: The Bootleg War, and Latchkeys: Roscoes in the Night are now available on Nook and Kindle.

pduplessis-723389 SSChampion Points: 10599 More actions · Answer 2

Awesome post Brandie...

Make it a FOR EACH ADO enumerator, and read your mini and maxi variables into here

Rgs

~PD

dmusson17 Ten Centuries Points: 1235 More actions · Answer 3

Yes, Thanks Brandie. The post is very much appreciated. I am working through it now, and will let you know how it goes.

Couple of notes:

- Correct, the source is not SQL 2k5, and the IDs I am pulling on my initial query are not sequential. Thus once I pull in my initial list of IDs and recmoddates, I am using the Row_Number() function and adding a column of sequential record #'s to help simplify the processing.

- Since my IDs are not sequential, I need to structure my subsequent queries that pull 2000 records each like this:

select * from table where ID in id1,id2,id3...id2000

which I am thinking would require a script task that accepts the min and max records and creates the sql statement which is put in another variable and passed back out to the flow? Does this sound correct?

Thanks so much for the help.

Brandie Tarvin SSC Guru Points: 173004 More actions · Answer 4

When you say your IDs aren't sequentially, what exactly do you mean? Could you give us a RL example of several IDs so we can see?

You could use a script task, but it might be overkill, depending on what those IDs look like. And having to write an IN () clause for 2000 different IDs would be a serious pain. Not to mention having to write the variables for it.

Brandie Tarvin, MCITP Database AdministratorLiveJournal Blog: http://brandietarvin.livejournal.com/[/url]On LinkedIn!, Google+, and Twitter.Freelance Writer: ShadowrunLatchkeys: Nevermore, Latchkeys: The Bootleg War, and Latchkeys: Roscoes in the Night are now available on Nook and Kindle.

dmusson17 Ten Centuries Points: 1235 More actions · Answer 5

Here is a sample of what my initial query is bringing in:

031134426/13/2008 12:00:00 AM

032084386/13/2008 12:00:00 AM

032117046/16/2008 12:00:00 AM

032162016/16/2008 12:00:00 AM

I add a column to (hopefully) make processing easier:

1031134426/13/2008 12:00:00 AM

2032084386/13/2008 12:00:00 AM

3032117046/16/2008 12:00:00 AM

4032162016/16/2008 12:00:00 AM

So I would process the second table 2000 records at a time with an IN () containing the first 2000 IDs, and the next IN () containing IDs 2001-4000, etc. until I reach my max.

Let me know if there is a better way to do this, I am definitely open to any ideas!

Thanks.

Brandie Tarvin SSC Guru Points: 173004 More actions · Answer 6

David,

If you've added that extra column, there's no need to use the IN() clause. You can just do the variables as I stated previously and use a "WHERE MyNewCol >= @Minnie and MyNewCol <= @Maxie". This will keep you from having to type all those new column numbers in a query.

Brandie Tarvin, MCITP Database AdministratorLiveJournal Blog: http://brandietarvin.livejournal.com/[/url]On LinkedIn!, Google+, and Twitter.Freelance Writer: ShadowrunLatchkeys: Nevermore, Latchkeys: The Bootleg War, and Latchkeys: Roscoes in the Night are now available on Nook and Kindle.

dmusson17 Ten Centuries Points: 1235 More actions · Answer 7

It's not a traditional relational database, and there are some (many) limitations in the sql translation process (read only supporting only simple select statements, no joins, no order by, IS NULL or IS NOT NULL not supported, etc). Thus the only way to pull multiple records from the source (in my case anyway where I know the IDs I am looking for) is with an IN specifying the list of IDs that you are looking for. So I think that the way to do that would be to create the sql statement via script, plug the sql statement into a variable, and use that variable in my DataReader Source.

Let me know if I am missing the boat here....

Brandie Tarvin SSC Guru Points: 173004 More actions · Answer 8

David,

I'm confused. When you say you added an extra column, where did you add it? To the Source before you pulled the data or are you pulling the data then adding the column?

Brandie Tarvin, MCITP Database AdministratorLiveJournal Blog: http://brandietarvin.livejournal.com/[/url]On LinkedIn!, Google+, and Twitter.Freelance Writer: ShadowrunLatchkeys: Nevermore, Latchkeys: The Bootleg War, and Latchkeys: Roscoes in the Night are now available on Nook and Kindle.

dmusson17 Ten Centuries Points: 1235 More actions · Answer 9

I'm pulling the ID and the record mod date down onto my sql server and then adding the row number column. Now I need to issue the subsequent queries to go back and get the detail 2000 records at a time.

Sorry for the confusion!

Dave

Brandie Tarvin SSC Guru Points: 173004 More actions · Answer 10

Do you have the ability to do a T-SQL join back to your original record source?

If so, then do "Select * from OrigDs od join NewDs nd on od.ID = nd.ID where nd.MyNewCol >= @Minnie and nd.MyNewCol <= @Maxie".

OrigDs is your Original Data Source and NewDs is the table you've created on SQL Server. That's assuming, of course, that you've already pulled every single last ID you'll need from the original data source to SQL Server in this temp table of yours.

Brandie Tarvin, MCITP Database AdministratorLiveJournal Blog: http://brandietarvin.livejournal.com/[/url]On LinkedIn!, Google+, and Twitter.Freelance Writer: ShadowrunLatchkeys: Nevermore, Latchkeys: The Bootleg War, and Latchkeys: Roscoes in the Night are now available on Nook and Kindle.

dmusson17 Ten Centuries Points: 1235 More actions · Answer 11

Unfortunately no joins. Their explanation:

The db server software does not have the concept of a join, therefor, the query language and translator do not attempt to support one.

So I think I am stuck with building that sql statement with an IN clause containing 2000 IDs.

What do you think?

Brandie Tarvin SSC Guru Points: 173004 More actions · Answer 12

Subquery then.

Select *

from OrigDs

where ID in (Select Distinct ID from NewDs

where MyNewCol >= @Minnie and MyNewCol <= @Maxie)

Brandie Tarvin, MCITP Database AdministratorLiveJournal Blog: http://brandietarvin.livejournal.com/[/url]On LinkedIn!, Google+, and Twitter.Freelance Writer: ShadowrunLatchkeys: Nevermore, Latchkeys: The Bootleg War, and Latchkeys: Roscoes in the Night are now available on Nook and Kindle.