March 16, 2010 at 12:29 pm
I'm running an SSIS package that consumes 35+ million rows, and one of the tasks in the package is a SQL Task that uses a GROUP BY. That kind of sorting activity takes place in TEMPDB, occasionally causing the job to abend because TEMPDB cannot grow any larger. Are there alternate methods that I could consider or test that would take the sorting and grouping activity out of TEMPDB? Is that even a reasonable question, or should I just bite the bullet and increase the size of TEMPDB? TEMPDB currently consists of 4 files of 5 GB each, each restricted to 5 GB (20 GB total). The server is running SQL Server 2008 Enterprise 64-bit on Windows Server 2008 and has 16 GB of memory and 4 processors.
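For anyone diagnosing a similar situation, a minimal sketch of how to see tempdb's file layout and which consumer is actually eating the space (sorts and hash spills show up under internal objects); this uses only standard catalog views and DMVs:

USE tempdb;

-- File layout: sizes are stored in 8 KB pages, so convert to MB.
SELECT  name,
        type_desc,
        size * 8 / 1024   AS size_mb,
        max_size,                        -- in pages; -1 = unlimited growth
        growth * 8 / 1024 AS growth_mb   -- valid for page-based growth
                                         -- (is_percent_growth = 0)
FROM    sys.database_files;

-- Reserved pages by consumer: version store, internal objects
-- (sorts/hash spills), and user objects (temp tables).
SELECT  SUM(version_store_reserved_page_count)   * 8 / 1024 AS version_store_mb,
        SUM(internal_object_reserved_page_count) * 8 / 1024 AS internal_objects_mb,
        SUM(user_object_reserved_page_count)     * 8 / 1024 AS user_objects_mb,
        SUM(unallocated_extent_page_count)       * 8 / 1024 AS free_mb
FROM    sys.dm_db_file_space_usage;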
March 16, 2010 at 1:36 pm
I'd say there is a mismatch between the size of the table you're using and the size you allow tempdb to grow to. You'd completely fill one of those tempdb files even if the table you want to insert has as little as roughly 150 bytes per row (5 GB / 35 million rows ≈ 153 bytes).
I don't think there's any way to keep tempdb out of that task, other than redesigning the task to process the data in batches (if that's possible at all).
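For illustration, a rough batching sketch of that idea, assuming an integer key to slice on; the table and column names here are hypothetical, not the OP's actual schema:

DECLARE @BatchStart int = 1,
        @BatchSize  int = 1000000,
        @MaxID      int;

SELECT @MaxID = MAX(SourceID) FROM dbo.BigSource;

WHILE @BatchStart <= @MaxID
BEGIN
    -- Aggregate one slice at a time so each sort/hash stays small.
    -- dbo.AggResults is a permanent work table in the user database.
    INSERT INTO dbo.AggResults (GroupKey, RowCnt, Total)
    SELECT  GroupKey, COUNT_BIG(*), SUM(Amount)
    FROM    dbo.BigSource
    WHERE   SourceID BETWEEN @BatchStart AND @BatchStart + @BatchSize - 1
    GROUP BY GroupKey;

    SET @BatchStart += @BatchSize;
END;

-- Final rollup: collapse the per-batch partial aggregates. This works for
-- SUM/COUNT; an AVG would need SUM and COUNT carried separately.
SELECT GroupKey, SUM(RowCnt) AS RowCnt, SUM(Total) AS Total
FROM   dbo.AggResults
GROUP BY GroupKey;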
What is the reason for the restricted size of tempdb?
March 16, 2010 at 4:07 pm
Imu92,
Thanks for the response. In answer to your question, I think my reason for restricting the size of TEMPDB is based on at least two assumptions, neither of which is, now that I think about it, defensible. The first is that the sizes mentioned in articles about TEMPDB are often pretty small. I was just reading an article in which the TEMPDB size brackets mentioned (in connection with what the file growth increments should be) were < 10 MB, 10 MB to 200 MB, and > 200 MB. My TEMPDB, as I mentioned earlier, is at 20 GB, way, way bigger than any numbers I've seen in the literature. So I assumed that, by and large, best practices dictate that TEMPDB databases be pretty small. The other assumption was that queries that use a lot of TEMPDB space are inefficient and should be rewritten to use less of it, either by improving the query or, if a lot of rows are involved, by using work tables in the application database itself rather than temporary tables in TEMPDB.
It sounds like I may have to rethink my approach to TEMPDB, or at least update it. Do you have any suggestions or reading that would broaden my knowledge of how TEMPDB should be used, particularly in regard to what sizes correspond to what kinds of data processing? That seems to be a roadblock in my understanding of TEMPDB. I keep thinking that a TEMPDB of 100 GB, for instance, is way too big, but maybe, in some circumstances, a TEMPDB of that size is quite acceptable.
Again, thanks for responding to my question. You've started me thinking!!
Willem
March 16, 2010 at 5:36 pm
Something that worked for me in the past was an 8-file tempdb. Each file was sized to 5 GB initially, and I permitted autogrowth of 512 MB on each file for the just-in-case scenario. Our largest table was 22 GB, so the 40 GB tempdb was large enough for that table and then some.
Besides the number of files and the autogrowth setting, it is important that tempdb be larger than your largest table.
On occasion we did need that autogrowth headroom, because a program would get carried away and grow tempdb before we caught it.
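As a sketch, the layout described above would look something like this (file names and the path are hypothetical; tempdb is rebuilt from these settings at every service restart):

ALTER DATABASE tempdb
    MODIFY FILE (NAME = tempdev, SIZE = 5GB, FILEGROWTH = 512MB);

ALTER DATABASE tempdb
    ADD FILE (NAME = tempdev2,
              FILENAME = 'T:\TempDB\tempdev2.ndf',  -- hypothetical path
              SIZE = 5GB,
              FILEGROWTH = 512MB);

-- ...repeat the ADD FILE for tempdev3 through tempdev8, ideally on
-- storage separate from the user databases.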
Jason...AKA CirqueDeSQLeil
_______________________________________________
I have given a name to my pain...MCM SQL Server, MVP
SQL RNNR
Posting Performance Based Questions - Gail Shaw
Learn Extended Events
March 16, 2010 at 9:36 pm
In my eyes, 20 GB for such a database is probably OK, but a bit large. My suggestion would be to reexamine the underlying methods and, as already suggested, maybe deploy a little "Divide'n'Conquer". In truth, aggregating/grouping 35 million rows shouldn't cause you to bottom out a 20 GB TempDB. I've done aggregations on 120-million-row tables and aggregations across 93 four-million-row tables and have never shoved TempDB over the 8 GB mark in the process. Such a large TempDB footprint for only 35 million rows is usually not because of the volume of the table... it's usually because of accidental cross-joins that are covered up by the GROUP BY.
My recommendation is to relegate the task to a well-written stored procedure that uses a combination of "Divide'n'Conquer" and "pre-aggregation" methods if joins are involved in the process. That, and understanding where the accidental cross-join is occurring. Once you get the process into a stored procedure, the accidental cross-join will be easy to find... it'll show up in the execution plan as an arrow with a crazy high row count, many times more than the sum of the row counts of all the tables in the query combined.
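To illustrate the pre-aggregation idea with a sketch (table and column names are hypothetical, not the OP's schema): grouping after a one-to-many join forces the engine to materialize and then collapse the multiplied rows, whereas aggregating the child table first keeps the join one-to-one.

-- Pre-aggregate the child table first...
;WITH DetailTotals AS
(
    SELECT  OrderID,
            SUM(Quantity * UnitPrice) AS LineTotal
    FROM    dbo.OrderDetails
    GROUP BY OrderID
)
-- ...then join one row per order and roll up, instead of grouping
-- across the multiplied Orders x OrderDetails row set.
SELECT  o.CustomerID,
        SUM(dt.LineTotal) AS CustomerTotal
FROM    dbo.Orders AS o
JOIN    DetailTotals AS dt
        ON dt.OrderID = o.OrderID
GROUP BY o.CustomerID;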
--Jeff Moden
Change is inevitable... Change for the better is not.
March 16, 2010 at 10:39 pm
Jeff Moden (3/16/2010)
In my eyes, 20 GB for such a database is probably OK, but a bit large. My suggestion would be to reexamine the underlying methods and, as already suggested, maybe deploy a little "Divide'n'Conquer". In truth, aggregating/grouping 35 million rows shouldn't cause you to bottom out a 20 GB TempDB. I've done aggregations on 120-million-row tables and aggregations across 93 four-million-row tables and have never shoved TempDB over the 8 GB mark in the process. Such a large TempDB footprint for only 35 million rows is usually not because of the volume of the table... it's usually because of accidental cross-joins that are covered up by the GROUP BY.
My recommendation is to relegate the task to a well-written stored procedure that uses a combination of "Divide'n'Conquer" and "pre-aggregation" methods if joins are involved in the process. That, and understanding where the accidental cross-join is occurring. Once you get the process into a stored procedure, the accidental cross-join will be easy to find... it'll show up in the execution plan as an arrow with a crazy high row count, many times more than the sum of the row counts of all the tables in the query combined.
Thanks, Jeff, for pointing out what we neglected to bring up. Definitely change your SQL Task to call a stored procedure.
Jason...AKA CirqueDeSQLeil
_______________________________________________
I have given a name to my pain...MCM SQL Server, MVP
SQL RNNR
Posting Performance Based Questions - Gail Shaw
Learn Extended Events
March 17, 2010 at 2:12 am
Willem Tilstra-440027 (3/16/2010)
I'm running an SSIS package that consumes 35+ million rows and one of the tasks in the package is a SQL Task that uses a Group By.
Have you considered reading the source data and performing the GROUP BY with an SSIS Aggregate component? The Aggregate transformation does its grouping in the SSIS pipeline's memory rather than in the database engine, so it would take that sort/group load out of tempdb entirely (at the cost of memory on the package server).
March 17, 2010 at 2:33 am
Last week somebody at my shop ran a query with a mad cross join over a 27-million-row table.
Tempdb (set to unlimited growth) ballooned to 130 GB in ten minutes, filling the data disk.
The user said: "Well, I noticed that something was wrong, because I usually get my results from this query in 1 or 2 seconds, but I wanted to see how long it would have taken with a slight modification."
Slight modification?!?!?!??
-- Gianluca Sartori
March 17, 2010 at 2:38 am
Gianluca Sartori (3/17/2010)
Last week somebody at my shop ran a query with a mad cross join over a 27-million-row table. Tempdb (set to unlimited growth) ballooned to 130 GB in ten minutes, filling the data disk.
Resource Governor in 2008 is a marvellous thing: http://msdn.microsoft.com/en-us/library/ee151608.aspx
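For reference, a minimal Resource Governor sketch (the pool, group, and login names here are hypothetical). Note that Resource Governor caps CPU and memory grants per workload group rather than tempdb space directly, but it can keep an ad-hoc runaway from monopolizing the server:

USE master;  -- the classifier function must live in master
GO
CREATE RESOURCE POOL AdHocPool
    WITH (MAX_CPU_PERCENT = 25, MAX_MEMORY_PERCENT = 25);
GO
CREATE WORKLOAD GROUP AdHocGroup
    WITH (REQUEST_MAX_MEMORY_GRANT_PERCENT = 25)
    USING AdHocPool;
GO
-- The classifier runs at login time and routes each session to a group.
CREATE FUNCTION dbo.fnClassifier() RETURNS sysname
WITH SCHEMABINDING
AS
BEGIN
    -- Throttle everything that isn't the ETL service account
    -- (login name is hypothetical).
    IF SUSER_SNAME() <> N'etl_service'
        RETURN N'AdHocGroup';
    RETURN N'default';
END;
GO
ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION = dbo.fnClassifier);
ALTER RESOURCE GOVERNOR RECONFIGURE;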
March 17, 2010 at 10:15 am
Thank you all for your thoughtful and informative replies. You've given me plenty of material to digest and suggestions to try on my process.
Willem
March 17, 2010 at 10:19 am
Willem Tilstra-440027 (3/17/2010)
Thank you all for your thoughtful and informative replies. You've given me plenty of material to digest and suggestions to try on my process.
Willem
You're welcome
Jason...AKA CirqueDeSQLeil
_______________________________________________
I have given a name to my pain...MCM SQL Server, MVP
SQL RNNR
Posting Performance Based Questions - Gail Shaw
Learn Extended Events
March 17, 2010 at 10:36 am
CirquedeSQLeil (3/17/2010)
Willem Tilstra-440027 (3/17/2010)
Thank you all for your thoughtful and informative replies. You've given me plenty of material to digest and suggestions to try on my process.
Willem
You're welcome
Likewise 😉