Adding batch numbers based on row count and column values

  • Hi,
    I have to insert sales order data from a staging table to a base table where further processing is done on this data.  The staging table has around half a million records for every insert.  The problem is that the processing can only be done on a maximum of 10,000 records at a time.  For this, I plan to divide the inserted data into batches of roughly 10,000 records each. An important point to be considered while allocating batch numbers is that Orders must be kept together.  Some orders have only one line while others (majority) have multiple lines.  So, each batch should have a maximum of 10,000 lines but should include all lines of any order. 

    For example, if in batch 11 an order starts on line number 9800 and has 201 records, then a new batch number (batch 12) must start with this order even though only 9800 lines are there for the previous batch (batch 11).

    I would really appreciate any help or suggestions on how I can achieve this.

    Sudarshan

  • pssudarshan - Friday, November 3, 2017 11:16 PM

    Hi,
    I have to insert sales order data from a staging table to a base table where further processing is done on this data.  The staging table has around half a million records for every insert.  The problem is that the processing can only be done on a maximum of 10,000 records at a time.  For this, I plan to divide the inserted data into batches of roughly 10,000 records each. An important point to be considered while allocating batch numbers is that Orders must be kept together.  Some orders have only one line while others (majority) have multiple lines.  So, each batch should have a maximum of 10,000 lines but should include all lines of any order. 

    For example, if in batch 11 an order starts on line number 9800 and has 201 records, then a new batch number (batch 12) must start with this order even though only 9800 lines are there for the previous batch (batch 11).

    I would really appreciate any help or suggestions on how I can achieve this.

    Sudarshan

    Here is a quick example
    😎

    USE [TEEST]
    GO

    DECLARE @SET_SIZE  BIGINT = 500;
    DECLARE @BATCH_SIZE BIGINT = 100;
    DECLARE @NUM_LINES BIGINT = 29;
    DECLARE @FIRST_ORDER BIGINT = 7984;

    WITH T(N) AS (SELECT X.N FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) X(N))
    , NUMS(N) AS (SELECT TOP(@SET_SIZE) ROW_NUMBER() OVER (ORDER BY @@VERSION) AS N
         FROM T T1,T T2,T T3,T T4,T T5,T T6)
    ,SAMPLE_DATA AS
    (
      SELECT
       FLOOR(NM.N / @NUM_LINES) + @FIRST_ORDER AS ORDER_NUMBER
      ,NM.N         AS ROWNUMBER
      FROM  NUMS  NM
    )
    ,SAMPLE_ORDERS AS
    (
      SELECT
       SD.ORDER_NUMBER
      ,ROW_NUMBER() OVER
        (
          PARTITION BY SD.ORDER_NUMBER
          ORDER BY  SD.ROWNUMBER
        ) AS LINE_NUMBER
      ,SD.ROWNUMBER
      FROM  SAMPLE_DATA  SD
    )
    ,SO_MIN_MAX AS
    (
      SELECT
       SO.ROWNUMBER
      ,SO.ORDER_NUMBER
      ,SO.LINE_NUMBER 
      ,MIN(SO.ROWNUMBER) OVER (PARTITION BY SO.ORDER_NUMBER) AS START_ROW
      ,MAX(SO.ROWNUMBER) OVER (PARTITION BY SO.ORDER_NUMBER) AS END_ROW
      FROM  SAMPLE_ORDERS SO
    )
    ,BASE_BATCH_MATRIX AS
    (
      SELECT
       SMM.ROWNUMBER
      ,SMM.ORDER_NUMBER
      ,SMM.LINE_NUMBER
      ,SMM.START_ROW
      ,SMM.END_ROW
      ,CASE WHEN (FLOOR(SMM.ROWNUMBER / @BATCH_SIZE) + 1) * @BATCH_SIZE < SMM.END_ROW THEN 0 ELSE 1 END AS BATCH_MATRIX
      FROM  SO_MIN_MAX  SMM
    )
    ,MARKED_SET AS
    (
      SELECT
       BBM.ORDER_NUMBER
      ,BBM.START_ROW
      ,BBM.END_ROW
      ,MAX(BBM.BATCH_MATRIX) - MIN(BBM.BATCH_MATRIX) AS BATCH_MARKER
      FROM  BASE_BATCH_MATRIX BBM
      GROUP BY BBM.ORDER_NUMBER
        ,BBM.START_ROW
        ,BBM.END_ROW
    )
    SELECT
      MS.ORDER_NUMBER
     ,MS.START_ROW
     ,MS.END_ROW
     ,SUM(MS.BATCH_MARKER) OVER
       (
        ORDER BY MS.START_ROW
        ROWS BETWEEN UNBOUNDED PRECEDING
          AND  CURRENT ROW
       ) + 1 AS BATCH_NUMBER
    FROM  MARKED_SET  MS
    ;

    Output
    ORDER_NUMBER         START_ROW            END_ROW              BATCH_NUMBER
    -------------------- -------------------- -------------------- ------------
    7984                 1                    28                   1
    7985                 29                   57                   1
    7986                 58                   86                   1
    7987                 87                   115                  2
    7988                 116                  144                  2
    7989                 145                  173                  2
    7990                 174                  202                  3
    7991                 203                  231                  3
    7992                 232                  260                  3
    7993                 261                  289                  3
    7994                 290                  318                  4
    7995                 319                  347                  4
    7996                 348                  376                  4
    7997                 377                  405                  5
    7998                 406                  434                  5
    7999                 435                  463                  5
    8000                 464                  492                  5
    8001                 493                  500                  5


    ='font-size:9.0pt;mso-bidi-font-size:11.0pt;line-height:107%;font-family:"courier>

  • It's probably easier to just process them in batches rather than run through them to add batch numbers and then run through them again to process them.  Here is an approach that will add the batch numbers.  I've tested with a table with 5 records and used batch sizes of 2 and 3.

    DECLARE @t TABLE (order_num tinyint, batch_num tinyint)
    ;
    DECLARE @batch_num tinyint = 0,
        @batch_size tinyint = 2 -- 3
    ;
    INSERT @t(order_num)
    VALUES(1),(2), (2), (3), (3), (4)
    ;
    WHILE @@ROWCOUNT > 0
    BEGIN
        SET @batch_num = @batch_num + 1  /* increment @batch_num here to prevent infinite loop  */
    ;
        WITH batch AS
        (
            SELECT TOP(@batch_size) WITH TIES *, RANK() OVER(ORDER BY order_num) + COUNT(*) OVER(PARTITION BY order_num) - 1 AS cur_batch_size
            FROM @t
            WHERE batch_num IS NULL
            ORDER BY order_num
        )
        UPDATE batch
        SET batch_num = @batch_num
        WHERE cur_batch_size <= @batch_size
    END

    SELECT *
    FROM @t

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • drew.allen - Saturday, November 4, 2017 9:22 AM

    It's probably easier to just process them in batches rather than run through them to add batch numbers and then run through them again to process them.  Here is an approach that will add the batch numbers.  I've tested with a table with 5 records and used batch sizes of 2 and 3.

    DECLARE @t TABLE (order_num tinyint, batch_num tinyint)
    ;
    DECLARE @batch_num tinyint = 0,
        @batch_size tinyint = 2 -- 3
    ;
    INSERT @t(order_num)
    VALUES(1),(2), (2), (3), (3), (4)
    ;
    WHILE @@ROWCOUNT > 0
    BEGIN
        SET @batch_num = @batch_num + 1  /* increment @batch_num here to prevent infinite loop  */
    ;
        WITH batch AS
        (
            SELECT TOP(@batch_size) WITH TIES *, RANK() OVER(ORDER BY order_num) + COUNT(*) OVER(PARTITION BY order_num) - 1 AS cur_batch_size
            FROM @t
            WHERE batch_num IS NULL
            ORDER BY order_num
        )
        UPDATE batch
        SET batch_num = @batch_num
        WHERE cur_batch_size <= @batch_size
    END

    SELECT *
    FROM @t

    Drew

    I don't think a RBAR method is the best option here, a while loop and a separate execution per batch is bound to add a significant overhead compared to a set based method. The code I posted uses a set based method to produce a result set that then can be used to control the batch, it doesn't do but a portion of the work your code dose.
    😎

  • Drew Allen & Eirikur Eiriksson,

    Thanks a lot for your replies.  I really appreciate your efforts. 

    I have been able to test the solution provided by Drew Allen and it works precisely as desired.  Thank you.  The reason we need to assign batch numbers to the records and then process them by the batch numbers instead of initially processing by batches is that there is a re-process procedure that may need to run if the main procedure returns data validation errors that can be corrected.  By that time, the original records in the staging table will be cleared to make way for the next days files, so we need to have all records in the process table with batch numbers against them.

    I will also test the solution provided by Eirikur Eiriksson and get back with my updates on that.  

    Once again, thanks a lot to both of you.  You are champions....

    Sudarshan.

  • Eirikur Eiriksson - Saturday, November 4, 2017 1:10 PM

    drew.allen - Saturday, November 4, 2017 9:22 AM

    It's probably easier to just process them in batches rather than run through them to add batch numbers and then run through them again to process them.  Here is an approach that will add the batch numbers.  I've tested with a table with 5 records and used batch sizes of 2 and 3.

    DECLARE @t TABLE (order_num tinyint, batch_num tinyint)
    ;
    DECLARE @batch_num tinyint = 0,
        @batch_size tinyint = 2 -- 3
    ;
    INSERT @t(order_num)
    VALUES(1),(2), (2), (3), (3), (4)
    ;
    WHILE @@ROWCOUNT > 0
    BEGIN
        SET @batch_num = @batch_num + 1  /* increment @batch_num here to prevent infinite loop  */
    ;
        WITH batch AS
        (
            SELECT TOP(@batch_size) WITH TIES *, RANK() OVER(ORDER BY order_num) + COUNT(*) OVER(PARTITION BY order_num) - 1 AS cur_batch_size
            FROM @t
            WHERE batch_num IS NULL
            ORDER BY order_num
        )
        UPDATE batch
        SET batch_num = @batch_num
        WHERE cur_batch_size <= @batch_size
    END

    SELECT *
    FROM @t

    Drew

    I don't think a RBAR method is the best option here, a while loop and a separate execution per batch is bound to add a significant overhead compared to a set based method. The code I posted uses a set based method to produce a result set that then can be used to control the batch, it doesn't do but a portion of the work your code dose.
    😎

    This is not RBAR.  It will process approximately 10000 records at a time, which is required by the original description. My approach was written with the idea of actually processing the records instead of just assigning a batch number, and a seemingly straightforward comparison of the two different approaches doesn't take that into account.

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • drew.allen - Monday, November 6, 2017 8:32 AM

    This is not RBAR.  It will process approximately 10000 records at a time, which is required by the original description. My approach was written with the idea of actually processing the records instead of just assigning a batch number, and a seemingly straightforward comparison of the two different approaches doesn't take that into account.

    Drew

    Accoriding to the original requirements, order lines from each order must be processed within a single batch, not across batches, your solution doesn't cater for any such thing as there is no provisioning for the grouping of order lines into an order. 
    😎

    An important point to be considered while allocating batch numbers is that Orders must be kept together.

  • Eirikur Eiriksson - Monday, November 6, 2017 9:33 AM

    drew.allen - Monday, November 6, 2017 8:32 AM

    This is not RBAR.  It will process approximately 10000 records at a time, which is required by the original description. My approach was written with the idea of actually processing the records instead of just assigning a batch number, and a seemingly straightforward comparison of the two different approaches doesn't take that into account.

    Drew

    Accoriding to the original requirements, order lines from each order must be processed within a single batch, not across batches, your solution doesn't cater for any such thing as there is no provisioning for the grouping of order lines into an order. 
    😎

    An important point to be considered while allocating batch numbers is that Orders must be kept together.

    Actually mine does account for that.  The RANK function orders by the order number and the COUNT counts the number of records within an order.  These are combined to calculate the current batch size.  This number will be the same for all records with the same order number, and those records will only be processed if that total is less than or equal to the desired batch size.

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • drew.allen - Monday, November 6, 2017 9:59 AM

    Eirikur Eiriksson - Monday, November 6, 2017 9:33 AM

    drew.allen - Monday, November 6, 2017 8:32 AM

    This is not RBAR.  It will process approximately 10000 records at a time, which is required by the original description. My approach was written with the idea of actually processing the records instead of just assigning a batch number, and a seemingly straightforward comparison of the two different approaches doesn't take that into account.

    Drew

    Accoriding to the original requirements, order lines from each order must be processed within a single batch, not across batches, your solution doesn't cater for any such thing as there is no provisioning for the grouping of order lines into an order. 
    😎

    An important point to be considered while allocating batch numbers is that Orders must be kept together.

    Actually mine does account for that.  The RANK function orders by the order number and the COUNT counts the number of records within an order.  These are combined to calculate the current batch size.  This number will be the same for all records with the same order number, and those records will only be processed if that total is less than or equal to the desired batch size.

    Drew

    Sorry, I must be slow today
    😎

  • Eirikur Eiriksson - Saturday, November 4, 2017 8:44 AM

    pssudarshan - Friday, November 3, 2017 11:16 PM

    Hi,
    I have to insert sales order data from a staging table to a base table where further processing is done on this data.  The staging table has around half a million records for every insert.  The problem is that the processing can only be done on a maximum of 10,000 records at a time.  For this, I plan to divide the inserted data into batches of roughly 10,000 records each. An important point to be considered while allocating batch numbers is that Orders must be kept together.  Some orders have only one line while others (majority) have multiple lines.  So, each batch should have a maximum of 10,000 lines but should include all lines of any order. 

    For example, if in batch 11 an order starts on line number 9800 and has 201 records, then a new batch number (batch 12) must start with this order even though only 9800 lines are there for the previous batch (batch 11).

    I would really appreciate any help or suggestions on how I can achieve this.

    Sudarshan

    Here is a quick example
    😎

    USE [TEEST]
    GO

    DECLARE @SET_SIZE  BIGINT = 500;
    DECLARE @BATCH_SIZE BIGINT = 100;
    DECLARE @NUM_LINES BIGINT = 29;
    DECLARE @FIRST_ORDER BIGINT = 7984;

    WITH T(N) AS (SELECT X.N FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) X(N))
    , NUMS(N) AS (SELECT TOP(@SET_SIZE) ROW_NUMBER() OVER (ORDER BY @@VERSION) AS N
         FROM T T1,T T2,T T3,T T4,T T5,T T6)
    ,SAMPLE_DATA AS
    (
      SELECT
       FLOOR(NM.N / @NUM_LINES) + @FIRST_ORDER AS ORDER_NUMBER
      ,NM.N         AS ROWNUMBER
      FROM  NUMS  NM
    )
    ,SAMPLE_ORDERS AS
    (
      SELECT
       SD.ORDER_NUMBER
      ,ROW_NUMBER() OVER
        (
          PARTITION BY SD.ORDER_NUMBER
          ORDER BY  SD.ROWNUMBER
        ) AS LINE_NUMBER
      ,SD.ROWNUMBER
      FROM  SAMPLE_DATA  SD
    )
    ,SO_MIN_MAX AS
    (
      SELECT
       SO.ROWNUMBER
      ,SO.ORDER_NUMBER
      ,SO.LINE_NUMBER 
      ,MIN(SO.ROWNUMBER) OVER (PARTITION BY SO.ORDER_NUMBER) AS START_ROW
      ,MAX(SO.ROWNUMBER) OVER (PARTITION BY SO.ORDER_NUMBER) AS END_ROW
      FROM  SAMPLE_ORDERS SO
    )
    ,BASE_BATCH_MATRIX AS
    (
      SELECT
       SMM.ROWNUMBER
      ,SMM.ORDER_NUMBER
      ,SMM.LINE_NUMBER
      ,SMM.START_ROW
      ,SMM.END_ROW
      ,CASE WHEN (FLOOR(SMM.ROWNUMBER / @BATCH_SIZE) + 1) * @BATCH_SIZE < SMM.END_ROW THEN 0 ELSE 1 END AS BATCH_MATRIX
      FROM  SO_MIN_MAX  SMM
    )
    ,MARKED_SET AS
    (
      SELECT
       BBM.ORDER_NUMBER
      ,BBM.START_ROW
      ,BBM.END_ROW
      ,MAX(BBM.BATCH_MATRIX) - MIN(BBM.BATCH_MATRIX) AS BATCH_MARKER
      FROM  BASE_BATCH_MATRIX BBM
      GROUP BY BBM.ORDER_NUMBER
        ,BBM.START_ROW
        ,BBM.END_ROW
    )
    SELECT
      MS.ORDER_NUMBER
     ,MS.START_ROW
     ,MS.END_ROW
     ,SUM(MS.BATCH_MARKER) OVER
       (
        ORDER BY MS.START_ROW
        ROWS BETWEEN UNBOUNDED PRECEDING
          AND  CURRENT ROW
       ) + 1 AS BATCH_NUMBER
    FROM  MARKED_SET  MS
    ;

    Output
    ORDER_NUMBER         START_ROW            END_ROW              BATCH_NUMBER
    -------------------- -------------------- -------------------- ------------
    7984                 1                    28                   1
    7985                 29                   57                   1
    7986                 58                   86                   1
    7987                 87                   115                  2
    7988                 116                  144                  2
    7989                 145                  173                  2
    7990                 174                  202                  3
    7991                 203                  231                  3
    7992                 232                  260                  3
    7993                 261                  289                  3
    7994                 290                  318                  4
    7995                 319                  347                  4
    7996                 348                  376                  4
    7997                 377                  405                  5
    7998                 406                  434                  5
    7999                 435                  463                  5
    8000                 464                  492                  5
    8001                 493                  500                  5


    ='font-size:9.0pt;mso-bidi-font-size:11.0pt;line-height:107%;font-family:"courier>

    I went back and looked at this.  There is an error in the logic.  It's not evident from the sample data used, but here is an example where a different set of sample data produces an incorrect result (all records are placed into the first batch).  The only change I made was in setting up the sample data (CTE "Sample Data").


    DECLARE @SET_SIZE BIGINT = 500;
    DECLARE @BATCH_SIZE BIGINT = 100;
    DECLARE @NUM_LINES BIGINT = 29;
    DECLARE @FIRST_ORDER BIGINT = 7984;

    WITH T(N) AS (SELECT X.N FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) X(N))
    , NUMS(N) AS (SELECT TOP(@SET_SIZE) ROW_NUMBER() OVER (ORDER BY @@VERSION) AS N
      FROM T T1,T T2,T T3,T T4,T T5,T T6)
    ,SAMPLE_DATA AS
    (
    SELECT
     @FIRST_ORDER + CASE
            WHEN NM.n < 41 THEN 0
            WHEN NM.n < 101 THEN 1
            WHEN NM.n < 161 THEN 2
            ELSE NM.n / 50 - 1
        END AS ORDER_NUMBER
    ,NM.N   AS ROWNUMBER
    FROM NUMS NM
    )
    ,SAMPLE_ORDERS AS
    (
    SELECT
     SD.ORDER_NUMBER
    ,ROW_NUMBER() OVER
      (
      PARTITION BY SD.ORDER_NUMBER
      ORDER BY SD.ROWNUMBER
      ) AS LINE_NUMBER
    ,SD.ROWNUMBER
    FROM SAMPLE_DATA SD
    )
    ,SO_MIN_MAX AS
    (
    SELECT
     SO.ROWNUMBER
    ,SO.ORDER_NUMBER
    ,SO.LINE_NUMBER
    ,MIN(SO.ROWNUMBER) OVER (PARTITION BY SO.ORDER_NUMBER) AS START_ROW
    ,MAX(SO.ROWNUMBER) OVER (PARTITION BY SO.ORDER_NUMBER) AS END_ROW
    FROM SAMPLE_ORDERS SO
    )
    ,BASE_BATCH_MATRIX AS
    (
    SELECT
     SMM.ROWNUMBER
    ,SMM.ORDER_NUMBER
    ,SMM.LINE_NUMBER
    ,SMM.START_ROW
    ,SMM.END_ROW
    ,CASE WHEN (FLOOR(SMM.ROWNUMBER / @BATCH_SIZE) + 1) * @BATCH_SIZE < SMM.END_ROW THEN 0 ELSE 1 END AS BATCH_MATRIX
    FROM SO_MIN_MAX SMM
    )
    ,MARKED_SET AS
    (
    SELECT
     BBM.ORDER_NUMBER
    ,BBM.START_ROW
    ,BBM.END_ROW
    ,MAX(BBM.BATCH_MATRIX) - MIN(BBM.BATCH_MATRIX) AS BATCH_MARKER
    FROM BASE_BATCH_MATRIX BBM
    GROUP BY BBM.ORDER_NUMBER
      ,BBM.START_ROW
      ,BBM.END_ROW
    )
    SELECT
    MS.ORDER_NUMBER
    ,MS.START_ROW
    ,MS.END_ROW
    ,SUM(MS.BATCH_MARKER) OVER
      (
      ORDER BY MS.START_ROW
      ROWS BETWEEN UNBOUNDED PRECEDING
       AND CURRENT ROW
      ) + 1 AS BATCH_NUMBER
    FROM MARKED_SET MS
    ;

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

Viewing 10 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply