Query for Top N for large amount of data

  • I have a database that has 5M rows of data in it. The table is setup as such:

    Invoice Table

    Date

    Salesperson

    State

    Cost

    We are trying to get a query that shows the 5 sales people with the most total sales $ PER STATE. There are multiple entries per sales person per state.

    Thanks for the help.

  • Could you post the table's create script, some sample data and expected output please.

    See here - Forum Etiquette: How to post data/code on a forum to get the best help[/url]

    Thanks

    Edit: And the index definitions as well please, since it's a fairly large set of data

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • --===== Create the test table with

    CREATE TABLE #mytable

    (

    Date DATETIME,

    SalesPerson char(100),

    State char(2),

    Cost float

    )

    --===== Insert the test data into the test table

    INSERT INTO #mytable

    (Date, SalesPerson, State, Cost)

    SELECT 'Oct 17 2007 12:00AM','John','AL','235.43' UNION ALL

    SELECT 'Oct 17 2007 12:00AM','Joe','OK','432.25' UNION ALL

    SELECT 'Oct 17 2007 12:00AM','Joyce','AL','238.98' UNION ALL

    SELECT 'Oct 17 2007 12:00AM','John','AL','111.12' UNION ALL

    SELECT 'Oct 17 2007 12:00AM','Joe','OK','267.98' UNION ALL

    SELECT 'Oct 17 2007 12:00AM','Joyce','OK','367.98' UNION ALL

    SELECT 'Oct 17 2007 12:00AM','John','OK','67.98' UNION ALL

    SELECT 'Oct 17 2007 12:00AM','Joe','AL','267.98' UNION ALL

    SELECT 'Oct 17 2007 12:00AM','Fred','OK','79.98' UNION ALL

    SELECT 'Oct 17 2007 12:00AM','Moe','OK','7.98' UNION ALL

    SELECT 'Oct 17 2007 12:00AM','Joe','AL','167.98' UNION ALL

    SELECT 'Oct 17 2007 12:00AM','Moe','AL','37.98' UNION ALL

    SELECT 'Oct 17 2007 12:00AM','Joyce','OK','987.45'

    --====================== Expected Result (Top 3 instead of Top 5)

    SalesPerson TotalSales State

    Joe 435.96 AL

    John 346.55 AL

    Joyce 238.98 AL

    Joyce 1355.43 OK

    Joe 700.23 OK

    Fred 79.78 OK

    At this point, there are no indexes on the table.

  • You're definitely going to need indexing. Recommend state, salesperson with an include for cost

    ;with firstCte as (

    select salesperson,

    [state],

    sum(cost) totsales

    from #mytable

    group by salesperson,

    [state] ),

    SecondCTE as (

    select *,

    Row_number() over (partition by state order by totsales desc) RN

    from firstCTE)

    select *

    from secondCTE

    where rn<4

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Matt's code is functionally perfect and identical in performance (< 15 seconds total duration) to the following code... the only difference is the following code uses a single CTE for a slightly easier read...

    ;WITH cteFirst AS

    (

    SELECT ROW_NUMBER() OVER (PARTITION BY State ORDER BY SUM(Cost) DESC) AS StateRank,

    SalesPerson,

    State,

    SUM(Cost) AS TotalSales

    FROM dbo.jbmTest

    GROUP BY SalesPerson, State

    )

    SELECT StateRank,SalesPerson,STR(TotalSales,9,2) AS TotalSales,State

    FROM cteFirst

    WHERE StateRank <= 5

    ... and it's pointed at the following test table to show the kind of speed we both got out of the code using the index he recommended. The details are in the comments. Takes less than 3 minutes to build the 5 million row test table and the nonclustered index....

    --drop table jbmtest

    --===== Create and populate a 5,000,000 row test table.

    -- Column "Date" has a range of >=01/01/2000 and <01/01/2010 non-unique date/times

    -- Column "SalesPerson" contains 256 different sales person names

    -- Column "State" has a range of "AA to YA" and "AB to YB" (simulates 50 states)

    -- Column "Cost has a range of 0.0000 to 99.9999 non-unique numbers

    -- Jeff Moden

    SELECT TOP 5000000

    Date = CAST(RAND(CHECKSUM(NEWID()))*3653.0+36524.0 AS DATETIME),

    SalesPerson = CAST(RIGHT(NEWID(),2)+'SalesPersonName' AS VARCHAR(100)),

    State = CHAR(ABS(CHECKSUM(NEWID()))%25+65)

    + CHAR(ABS(CHECKSUM(NEWID()))%2+65),

    Cost = CAST(ABS(CHECKSUM(NEWID()))%10000 /100.0 AS FLOAT)

    INTO dbo.JBMTest

    FROM Master.dbo.SysColumns t1,

    Master.dbo.SysColumns t2 --Lack of join criteria makes this a CROSS-JOIN

    CREATE NONCLUSTERED INDEX IX_jbmTest_Composite1 ON dbo.jbmTest

    (State ASC, SalesPerson ASC)

    INCLUDE (Cost)

    Couple of things I noticed... First, your table had CHAR(100) for the Sales Person. If you insist on using a character based name, then it should be VARCHAR(100).

    Second, it would be a lot better if SalesPerson were an ID but I don't know how or where you got that data from.

    Third, table has no primary key... again, I don't know if this is an interim table, just wasn't included in the example code you provided, or what.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Thanks for all your help. That query works great on SQL 2005. Do you happen to know what would be the fastest way to get the same results from the same data on SQL 2000 or SQL 7?

  • Yep... you're gonna have to put the result of the CTE into a temp table as a separate query... include a RANK column. Then, see the part about "Grouped Running Count" in the following URL...

    http://www.sqlservercentral.com/articles/Advanced+Querying/61716/

    ... and, once you've filled in the RANK column, do a final select where RANK <=5.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 7 posts - 1 through 7 (of 7 total)

You must be logged in to reply to this topic. Login to reply