Rank a large table

  • Hi there,
    I am trying to figure out the best way to rank a table that has 600 million rows.

    This is data about Company financial info from different companies.
    Eg: ID=1 (there is a reference table for ID to CompanyName) ; iDate (date when data was received) ; FinancialInfoName can be sales, Loans or Deposits; Value is value of the FinancialInfoName;

    Create table dbo.tbl_needs_ranking(ID int not null, iDate Smalldatetime null,Value Decimal(22,2),FinancialInfoName varchar(100) null, Rnk int null)

    This table gets data from 2 different work tables.
    dbo.table1 and dbo.table2 each have about 300 million rows.
    dbo.table1 has data from 2016 and dbo.table2 from 2015.
    This is how I insert

    TRUNCATE TABLE dbo.tbl_needs_ranking
    INSERT INTO dbo.tbl_needs_ranking(ID,iDate,Value,FinancialInfoName,Rnk)
    SELECT ID,iDate,Value,FinancialInfoName,NULL FROM dbo.table1

    INSERT INTO dbo.tbl_needs_ranking(ID,iDate,Value,FinancialInfoName,Rnk)
    SELECT ID,iDate,Value,FinancialInfoName,NULL FROM dbo.table2

    Rnk is a column that is rank by ID and Date.

    ;WITH CTE AS
    (
    SELECT Rnk
            ,[ID]
            ,[iDate]
          ,DENSE_RANK() over (ORDER BY id, iDate) AS ranking
    FROM dbo.tbl_needs_ranking WITH (TABLOCK)
        )  UPDATE CTE SET Rnk=CTE.ranking

    It's been 3 hours and not yet finished running.

    What's the best way to rank this table?

    Thank you
    MR

  • 600M records is a LOT of records. Even if your disc speed (both read and write) is great, upi have loads of RAM, etc, it is going to take a long time.

    Doing an INSERT and then an UPDATE isn't going to help matter though. That means, firstly, all the data needs to be read and written (in the tables) for the initial INSERT(s), but then read again (in your new table) and then written to with the new Rank.

    I would firstly ensure that you do the whole process in one go:
    TRUNCATE TABLE dbo.tbl_needs_ranking;

    --UNION the data first
    WITH CoreData AS (
        SELECT ID, iDate, Value, FinancialInfoName,
        FROM dbo.table1
        UNION ALL
        SELECT ID, iDate, Value, FinancialInfoName
        FROM dbo.table2),
    --Now calculate the rankings
    RankedData AS (
        SELECT *,
        DENSE_RANK() OVER (ORDER BY id, idate) AS Ranking
        FROM CoreData)
    INSERT INTO dbo.tbl_needs_ranking (ID,iDate,Value,FinancialInfoName,Rnk)
    SELECT ID, iDate, Value, FinancialInfoName, Ranking
    FROM RankedData;

    This, won't, necessarily, be that quick still; depending on what indexes you have on your tables. You're going to need as appropriate INDEX that the server can use to be able to ORDER by id and idate columns in both tables.

    Thom~

    Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
    Larnu.uk

  • What indexes do you have on table1, table2 and tbl_needs_ranking?

    Have you considered something like this - doing the ranking and both inserts all in one operation?  Your transaction log and/or tempdb may not thank you, but it's worth seeing how it compares on a test database with the same data.
    WITH Combined AS (
        SELECT ID,iDate,Value,FinancialInfoName
        FROM dbo.table1
        UNION
        SELECT ID,iDate,Value,FinancialInfoName
        FROM dbo.table2
        )
    INSERT INTO dbo.tbl_needs_ranking(ID,iDate,Value,FinancialInfoName,Rnk)
    SELECT ID,iDate,Value,FinancialInfoName,
    DENSE_RANK() over (ORDER BY id, iDate)
    FROM Combined;

    John

  • The best way to do this is to cluster the table on ( id, iDate ); that is, the clustered index on the table should be on those columns.  And, as long as you look up by id and/or ( id, iDate ) at least half the time on that table, then that's what you should do.  Since I/we don't see other queries, I/we don't know whether you need nonclus index(es) or not.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • I only have non-clustered index on dbo.table1 and dbo.table2.
    And after the update (it never did finish) I have clustered index on dbo.tbl_needs_ranking

  • Aclustered index on the base tables is going to be a huge improvement. I'd suggest following Scott's advice.

    Thom~

    Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
    Larnu.uk

  • If you know that all rows in table1 come before rows in table2, you can go ahead and assign the ranking as part of the load of the table.  It's faster to sort 300M rows and 300M rows than to sort 600M rows.


    TRUNCATE TABLE dbo.tbl_needs_ranking;

    --drop any nonclus index on dbo.tbl_needs_ranking
    CREATE CLUSTERED INDEX tbl_needs_ranking__CL ON dbo.tbl_needs_ranking ( ID, iDate ) WITH ( FILLFACTOR = 99 ) ON [PRIMARY];

    INSERT INTO dbo.tbl_needs_ranking(ID,iDate,Value,FinancialInfoName,Rnk) WITH (TABLOCK)
    SELECT ID,iDate,Value,FinancialInfoName,NULL FROM dbo.table1

    INSERT INTO dbo.tbl_needs_ranking(ID,iDate,Value,FinancialInfoName,Rnk) WITH (TABLOCK)
    SELECT ID,iDate,Value,FinancialInfoName,NULL FROM dbo.table2

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply