Generate Repeating Set of Row_Number values

  • Hello -

    I have a table that contains a set of rows, where each row can represent a query to run. I have an ETL process that I can create multiple pipelines to execute the queries in parallel, and I am trying to figure out a way to add a Row_Number type value (using one of the functions like row_number(), rank, etc) to each row, but I need that number to be between 1 and 5. And then repeat.

    For example, the results might be like:

    queryID queryName RNum

    2500 Query_01 1

    2501 Query_02 2

    2510 Query_03 3

    2609 Query_04 4

    4300 Query_05 5

    4401 Query_06 1

    5433 Query_07 2

    1777 Query_08 3

    4322 Query_09 4

    7307 Query_10 5

    ...and so on, where the RNum value repeats from 1 to 5...over and over. And, it might happen that the last RNum might not be 5 ...could be any number between 1 and 5.

    I can generate a value for RNum, but I haven't found a way to just use a range.

    Thanks for any suggestions or links.

    - will

  • douglashwilliams (7/12/2016)


    Hello -

    I have a table that contains a set of rows, where each row can represent a query to run. I have an ETL process that I can create multiple pipelines to execute the queries in parallel, and I am trying to figure out a way to add a Row_Number type value (using one of the functions like row_number(), rank, etc) to each row, but I need that number to be between 1 and 5. And then repeat.

    For example, the results might be like:

    queryID queryName RNum

    2500 Query_01 1

    2501 Query_02 2

    2510 Query_03 3

    2609 Query_04 4

    4300 Query_05 5

    4401 Query_06 1

    5433 Query_07 2

    1777 Query_08 3

    4322 Query_09 4

    7307 Query_10 5

    ...and so on, where the RNum value repeats from 1 to 5...over and over. And, it might happen that the last RNum might not be 5 ...could be any number between 1 and 5.

    I can generate a value for RNum, but I haven't found a way to just use a range.

    Thanks for any suggestions or links.

    - will

    Hi Will,

    In the future, help us help you by providing readily consumable data so we can concentrate on a nice, tested, coded answer for you. Please see the first link under "Helpful Links" in my signature line below for one way to do that. Please see the following code for another way. Also, see the comments in the following code for two different answers.

    --===== Create a test table and populate it.

    -- This is NOT a part of the solution.

    IF OBJECT_ID('tempdb..#TestTable','U') IS NOT NULL

    DROP TABLE #TestTable

    ;

    SELECT d.*

    INTO #TestTable

    FROM (

    SELECT 2500,'Query_01' UNION ALL

    SELECT 2501,'Query_02' UNION ALL

    SELECT 2510,'Query_03' UNION ALL

    SELECT 2609,'Query_04' UNION ALL

    SELECT 4300,'Query_05' UNION ALL

    SELECT 4401,'Query_06' UNION ALL

    SELECT 5433,'Query_07' UNION ALL

    SELECT 1777,'Query_08' UNION ALL

    SELECT 4322,'Query_09' UNION ALL

    SELECT 7307,'Query_10'

    ) d (queryID,queryName)

    ;

    --===== Solution 1

    -- If the order of execution means so little to you,

    -- then this will split the "load" very easily.

    SELECT RNum = NTILE(5) OVER (ORDER BY queryName)

    ,*

    FROM #TestTable

    ;

    --===== Solution 2

    -- Demonstrate the creation of grouped row numbers

    -- along with the group (which is optional).

    -- Without the group number, this is exactly what

    -- you asked for. GNum might be used to indicate

    -- which leg of the process will run the code.

    WITH cteEnumerate AS

    (

    SELECT Seq = ROW_NUMBER() OVER (ORDER BY queryName)-1

    ,queryID

    ,queryName

    FROM #TestTable

    )

    SELECT queryID

    ,queryName

    ,GNum = Seq/5+1 --Comment this out if not desired

    ,RNum = Seq%5+1

    FROM cteEnumerate

    ORDER BY Seq

    ;

    Now... a couple of warnings...

    Distributing multiple ETL queries or whatever across multiple ETL Processes may do absolutely nothing for performance. In fact, doing such a thing may actually slow the queries down unless you're writing to SSDs because the "great equalizer" is the physical disk and you just can't change physics. Unless you can guarantee that each process will be using a separate set spindle and, therefor, a separate read/write arm (heads), multiple processes must share a single arm and the time it spends jumping from one file or file position to another for each process can really put the whammy on ETL performance. I strongly recommend that you test with serial queries and then parallel queries. You might be in for a surprise but you won't know until you do the test. Parallelism is NOT a panacea for performance when disk reads and writes are involved.

    The other thing is that if parallelism is effective in this case, you might not be taking full advantage because, in theory, your seeming random assignment of queries could lead to a much longer processing time because the queries take more time than the other processes. What you should probably do is accumulate a query history for duration and do a "bin fill" for processor assignment in an attempt at load balancing so that they all get done near the same time and one or two of the processes aren't unduly overloaded.

    I also have a hard time thinking that the process order of the queries will never make a difference.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Hi Jeff -

    Thanks very much for the reply and all the options. I will give them a try and see which works best for us. And, yes, we are in the process of running the queries sequentially/serially and gathering all the runtime stats and metrics first.

    I appreciate all the help

    - will

  • Note that Jeff's second solution will sort the results in the order you demonstrated in your original post. The NTILE solution would not. If you need the numbers in that order then use the second solution (where RNum = Seq%5+1) you could do so using NTILE like this.

    WITH cteEnumerate AS

    (

    SELECT Seq = ROW_NUMBER() OVER (ORDER BY queryName)-1

    ,queryID

    ,queryName

    FROM #TestTable

    )

    SELECT queryID

    ,queryName

    ,RNum = Seq%5+1 -- leaving here to compare to NTILE

    ,NTILE(5) OVER (ORDER BY Seq%5)

    FROM cteEnumerate

    ORDER BY Seq

    I'm not recommending this, I'm just showing how you'd get the result you demonstrated using NTILE. Either way, I would recommend the second solution as it will get the job done much faster and with a lot less reads.

    Lastly, I recommend never using NTILE unless you need:

    1. the results sorted in the in the order that Jeff's above NTILE solution sorts the numbers AND

    2. You need to partition your "tile" numbers (e.g. NTILE(N) OVER (PARTITION BY...))

    If #1 is true and #2 is not (no partitioning required) then I would recommend the NTally table alternative here.

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

Viewing 4 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply