Query for Top N for large amount of data

Question

Post reply

Query for Top N for large amount of data

eltontodd

SSC Journeyman

Points: 94
More actions
March 24, 2008 at 2:07 pm

#188384

I have a database that has 5M rows of data in it. The table is setup as such:
Invoice Table
Date
Salesperson
State
Cost
We are trying to get a query that shows the 5 sales people with the most total sales $ PER STATE. There are multiple entries per sales person per state.
Thanks for the help.

Viewing 7 posts - 1 through 7 (of 7 total)

You must be logged in to reply to this topic. Login to reply

Gail Shaw SSC Guru Points: 1004514 More actions · Answer 1

Could you post the table's create script, some sample data and expected output please.

See here - Forum Etiquette: How to post data/code on a forum to get the best help[/url]

Thanks

Edit: And the index definitions as well please, since it's a fairly large set of data

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

eltontodd SSC Journeyman Points: 94 More actions · Answer 2

--===== Create the test table with

CREATE TABLE #mytable

(

Date DATETIME,

SalesPerson char(100),

State char(2),

Cost float

)

--===== Insert the test data into the test table

INSERT INTO #mytable

(Date, SalesPerson, State, Cost)

SELECT 'Oct 17 2007 12:00AM','John','AL','235.43' UNION ALL

SELECT 'Oct 17 2007 12:00AM','Joe','OK','432.25' UNION ALL

SELECT 'Oct 17 2007 12:00AM','Joyce','AL','238.98' UNION ALL

SELECT 'Oct 17 2007 12:00AM','John','AL','111.12' UNION ALL

SELECT 'Oct 17 2007 12:00AM','Joe','OK','267.98' UNION ALL

SELECT 'Oct 17 2007 12:00AM','Joyce','OK','367.98' UNION ALL

SELECT 'Oct 17 2007 12:00AM','John','OK','67.98' UNION ALL

SELECT 'Oct 17 2007 12:00AM','Joe','AL','267.98' UNION ALL

SELECT 'Oct 17 2007 12:00AM','Fred','OK','79.98' UNION ALL

SELECT 'Oct 17 2007 12:00AM','Moe','OK','7.98' UNION ALL

SELECT 'Oct 17 2007 12:00AM','Joe','AL','167.98' UNION ALL

SELECT 'Oct 17 2007 12:00AM','Moe','AL','37.98' UNION ALL

SELECT 'Oct 17 2007 12:00AM','Joyce','OK','987.45'

--====================== Expected Result (Top 3 instead of Top 5)

SalesPerson TotalSales State

Joe 435.96 AL

John 346.55 AL

Joyce 238.98 AL

Joyce 1355.43 OK

Joe 700.23 OK

Fred 79.78 OK

At this point, there are no indexes on the table.

Matt Miller (4) SSC Guru Points: 124210 More actions · Answer 3

You're definitely going to need indexing. Recommend state, salesperson with an include for cost

;with firstCte as (

select salesperson,

[state],

sum(cost) totsales

from #mytable

group by salesperson,

[state] ),

SecondCTE as (

select *,

Row_number() over (partition by state order by totsales desc) RN

from firstCTE)

select *

from secondCTE

where rn<4

----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

Jeff Moden SSC Guru Points: 1004664 More actions · Answer 4

Matt's code is functionally perfect and identical in performance (< 15 seconds total duration) to the following code... the only difference is the following code uses a single CTE for a slightly easier read...

;WITH cteFirst AS

(

SELECT ROW_NUMBER() OVER (PARTITION BY State ORDER BY SUM(Cost) DESC) AS StateRank,

SalesPerson,

State,

SUM(Cost) AS TotalSales

FROM dbo.jbmTest

GROUP BY SalesPerson, State

)

SELECT StateRank,SalesPerson,STR(TotalSales,9,2) AS TotalSales,State

FROM cteFirst

WHERE StateRank <= 5

... and it's pointed at the following test table to show the kind of speed we both got out of the code using the index he recommended. The details are in the comments. Takes less than 3 minutes to build the 5 million row test table and the nonclustered index....

--drop table jbmtest

--===== Create and populate a 5,000,000 row test table.

-- Column "Date" has a range of >=01/01/2000 and <01/01/2010 non-unique date/times

-- Column "SalesPerson" contains 256 different sales person names

-- Column "State" has a range of "AA to YA" and "AB to YB" (simulates 50 states)

-- Column "Cost has a range of 0.0000 to 99.9999 non-unique numbers

-- Jeff Moden

SELECT TOP 5000000

Date = CAST(RAND(CHECKSUM(NEWID()))*3653.0+36524.0 AS DATETIME),

SalesPerson = CAST(RIGHT(NEWID(),2)+'SalesPersonName' AS VARCHAR(100)),

State = CHAR(ABS(CHECKSUM(NEWID()))%25+65)

+ CHAR(ABS(CHECKSUM(NEWID()))%2+65),

Cost = CAST(ABS(CHECKSUM(NEWID()))%10000 /100.0 AS FLOAT)

INTO dbo.JBMTest

FROM Master.dbo.SysColumns t1,

Master.dbo.SysColumns t2 --Lack of join criteria makes this a CROSS-JOIN

CREATE NONCLUSTERED INDEX IX_jbmTest_Composite1 ON dbo.jbmTest

(State ASC, SalesPerson ASC)

INCLUDE (Cost)

Couple of things I noticed... First, your table had CHAR(100) for the Sales Person. If you insist on using a character based name, then it should be VARCHAR(100).

Second, it would be a lot better if SalesPerson were an ID but I don't know how or where you got that data from.

Third, table has no primary key... again, I don't know if this is an interim table, just wasn't included in the example code you provided, or what.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

eltontodd SSC Journeyman Points: 94 More actions · Answer 5

Thanks for all your help. That query works great on SQL 2005. Do you happen to know what would be the fastest way to get the same results from the same data on SQL 2000 or SQL 7?

Jeff Moden SSC Guru Points: 1004664 More actions · Answer 6

Yep... you're gonna have to put the result of the CTE into a temp table as a separate query... include a RANK column. Then, see the part about "Grouped Running Count" in the following URL...

http://www.sqlservercentral.com/articles/Advanced+Querying/61716/

... and, once you've filled in the RANK column, do a final select where RANK <=5.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)