Using data windows in 650M rows table

  • I have a 680M rows table. This table contains data for each customer. I need to have a rollup table with data for each customer, that means obtain next row for each row for each customer (last row for a customer will have next row value = null) and 3 more values that work as next or previous values, i.e. origin move for a row... all of these are calculated by datetime order.

    I am calculating this data and inserting in other table, but I think performance is not the best. I have tested by using LEAD and LAG functions and I found that OUTER APPLY is faster. My huge query is (I resume it in order to clarify)

    INSERT INTO destinationTable

    SELECT columns

    FROM 680MrowsTable currM

    OUTER APPLY (

    SELECT TOP 1 *

    FROM 680MrowsTable nextM

    WHERE currM.CustomerId=nextM.CustomerId AND currM.ACTdatetime< nextM.ACTdatetime

    ORDER BY ACTdatetime

    ) nextM

    OUTER APPLY (

    SELECT TOP 1

    *

    FROM 680MrowsTable nextMoveM

    WHERE currM.CustomerId=nextM.CustomerId AND currM.ACTdatetime< nextMoveM.ACTdatetime

    AND --other conditions

    ORDER BY ACTdatetime

    ) nextMoveM

    OUTER APPLY (

    SELECT TOP 1 *

    FROM 680MrowsTable origMoveM

    WHERE currM.CustomerId=nextM.CustomerId AND currM.ACTdatetime>= origMoveM.ACTdatetime

    AND --other conditions

    ORDER BY ACTdatetimeDESC

    ) origMoveM

    OUTER APPLY (

    SELECT TOP 1 *

    FROM 680MrowsTable PrevMoveM

    WHERE currM.CustomerId=nextM.CustomerId AND origMoveM.ACTdatetime> PrevMoveM.ACTdatetime

    AND --other conditions

    ORDER BY ACTdatetimeDESC

    ) PrevMoveM

    WHERE 1=1

    I have read this article: https://www.simple-talk.com/sql/t-sql-programming/calculating-values-within-a-rolling-window-in-transact-sql/

    I have tested and when I use a cursor obtaining data from each customer ordering by datetime and storing it in a temp table, and after that I do outer apply and its faster.

    Any idea?

  • Can you set up some sample data please Celia? It doesn't need to contain all of the columns, just the ones required to help identify the problem domain.

    Cheers.

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • The LAG function is aimed for those kind of problems like yours. Have you saved you query using LAG, if so share it here. Someone may find a better solution with it if you provide some data and objects.

    Igor Micev,My blog: www.igormicev.com

  • What's really critical here is how the 680MrowsTable is clustered. Best for what you are doing here would be ( CustomerId, ACTdatetime ). If this is typical of how you process the data in that table, you should consider changing the clustering of the table. If the table is currently clustered on identity, it's about a 90% chance that it should be changed to better match your actual processing.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • Igor Micev (6/20/2016)


    The LAG function is aimed for those kind of problems like yours. Have you saved you query using LAG, if so share it here. Someone may find a better solution with it if you provide some data and objects.

    In my tests LAG function is slower than ROW_NUMBER solution adopted.

    Now I am testing with Cluster index.

    My next test will be Quirky Update.

    See attached an example. I have added an Id column to simplify, but when I get nextRow,PreviousAction,NextAction and OriginAction I get other fields of the action (datetime,details,comments,etc), not Id.

    nextRow is the next row of this CustomerId, each Action is identified because has Action=Subaction. In this example Actions 2,8 and 12 are not real action, so when I get previous action I need to obtain previous row for this customer that has Action=Subaction. Same for NextAction; originAction is for rows with Action<>Subaction the row with same Action and with Action=Subaction.

    I hope you can understand my example.

    <style type="text/css">

    table.tableizer-table {

    font-size: 12px;

    border: 1px solid #CCC;

    font-family: Arial, Helvetica, sans-serif;

    }

    .tableizer-table td {

    padding: 4px;

    margin: 3px;

    border: 1px solid #CCC;

    }

    .tableizer-table th {

    background-color: #104E8B;

    color: #FFF;

    font-weight: bold;

    }

    .tableizer-table th2 {

    background-color: #ffffc7;

    color: #FFF;

    font-weight: bold;

    }

    </style>

    <table class="tableizer-table">

    <thead><tr class="tableizer-firstrow"><th>ActionId</th><th>CustomerId</th><th>Action</th><th>Subaction</th><th>ACTdatetime</th><th>NextRow</th><th>PreviousAction</th><th>NextAction</th><th>OriginAction</th></tr></thead><tbody>

    <tr><td>1</td><td>1</td><td>L</td><td>L</td><td>2013-01-22 12:38:00.000</td><td>2</td><td>NULL</td><td>3</td><td>1</td></tr>

    <tr><td>2</td><td>1</td><th>L</th><th>C</th><td>2013-01-22 12:39:00.000</td><td>3</td><td>1</td><td>3</td><td>1</td></tr>

    <tr><td>3</td><td>1</td><td>D</td><td>D</td><td>2013-03-11 18:17:00.000</td><td>4</td><td>1</td><td>4</td><td>3</td></tr>

    <tr><td>4</td><td>1</td><td>L</td><td>L</td><td>2013-03-16 10:25:00.000</td><td>5</td><td>3</td><td>5</td><td>4</td></tr>

    <tr><td>5</td><td>1</td><td>D</td><td>D</td><td>2013-03-20 18:05:00.000</td><td>6</td><td>4</td><td>NULL</td><td>5</td></tr>

    <tr><td>6</td><td>2</td><td>G</td><td>G</td><td>2013-04-01 22:15:00.000</td><td>7</td><td>NULL</td><td>7</td><td>6</td></tr>

    <tr><td>7</td><td>2</td><td>G</td><td>G</td><td>2013-04-02 11:33:00.000</td><td>8</td><td>6</td><td>9</td><td>7</td></tr>

    <tr><td>8</td><td>2</td><th>G</th><th>C</th><td>2013-04-02 18:29:00.000</td><td>9</td><td>6</td><td>9</td><td>7</td></tr>

    <tr><td>9</td><td>2</td><td>G</td><td>G</td><td>2013-04-10 09:24:00.000</td><td>10</td><td>7</td><td>10</td><td>9</td></tr>

    <tr><td>10</td><td>2</td><td>G</td><td>G</td><td>2013-04-12 20:15:00.000</td><td>11</td><td>9</td><td>11</td><td>10</td></tr>

    <tr><td>11</td><td>2</td><td>L</td><td>L</td><td>2013-04-17 17:45:00.000</td><td>12</td><td>10</td><td>NULL</td><td>11</td></tr>

    <tr><td>12</td><td>2</td><th>L</th><th>C</th><td>2013-04-17 17:46:00.000</td><td>13</td><td>10</td><td>NULL</td><td>11</td></tr>

    </tbody></table>

  • This format is preferable because folks can begin working with it straight away:

    CREATE TABLE #Temp (ActionId INT NULL, CustomerId INT NULL, Action CHAR(1) NULL, Subaction CHAR(1) NULL, ACTdatetime DATETIME NULL, NextRow INT NULL, PreviousAction INT NULL, NextAction INT NULL, OriginAction INT NULL)

    INSERT INTO #Temp (ActionId, CustomerId, Action, Subaction, ACTdatetime, NextRow, PreviousAction, NextAction, OriginAction) VALUES

    (1,1,'L','L','2013-01-22 12:38:00.000',2,NULL,3,1),

    (1,1,'L','L','2013-01-22 12:38:00.000',2,NULL,3,1)

    etc

    I recommend you post two tables like this - your source table, and the exact results you expect from it.

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

Viewing 6 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply