To extract the Nth record out of grouped records (avoiding NTile)

  • We have a large volume of records (1.5 million) from which we want to extract the nth record out of.

    I have the following sample table for this (small volume but same principle should apply):

    DROP TABLE IF EXISTS [#DataGroup]

    CREATE TABLE [#DataGroup](

    [ReadingDateTime] [datetime2](7) NULL,

    [ReadingValue] [float] NULL

    ) ON [PRIMARY]

    GO

    INSERT [#DataGroup] ([ReadingDateTime], [ReadingValue]) VALUES (CAST(N'2021-05-16T05:06:39.0933333' AS DateTime2), 59.64)

    GO

    INSERT [#DataGroup] ([ReadingDateTime], [ReadingValue]) VALUES (CAST(N'2021-05-16T05:06:39.0937500' AS DateTime2), 59.64)

    GO

    INSERT [#DataGroup] ([ReadingDateTime], [ReadingValue]) VALUES (CAST(N'2021-05-16T05:08:39.0933333' AS DateTime2), 59.71)

    GO

    INSERT [#DataGroup] ([ReadingDateTime], [ReadingValue]) VALUES (CAST(N'2021-05-16T05:08:39.0937500' AS DateTime2), 59.71)

    GO

    INSERT [#DataGroup] ([ReadingDateTime], [ReadingValue]) VALUES (CAST(N'2021-05-16T05:10:39.0933333' AS DateTime2), 59.69)

    GO

    INSERT [#DataGroup] ([ReadingDateTime], [ReadingValue]) VALUES (CAST(N'2021-05-16T05:10:39.0937500' AS DateTime2), 59.69)

    GO

    INSERT [#DataGroup] ([ReadingDateTime], [ReadingValue]) VALUES (CAST(N'2021-05-16T05:12:39.0933333' AS DateTime2), 59.68)

    GO

    INSERT [#DataGroup] ([ReadingDateTime], [ReadingValue]) VALUES (CAST(N'2021-05-16T05:12:39.0937500' AS DateTime2), 59.68)

    GO

    INSERT [#DataGroup] ([ReadingDateTime], [ReadingValue]) VALUES (CAST(N'2021-05-16T05:14:39.0933333' AS DateTime2), 59.66)

    GO

    INSERT [#DataGroup] ([ReadingDateTime], [ReadingValue]) VALUES (CAST(N'2021-05-16T05:14:39.0937500' AS DateTime2), 59.66)

    GO

    INSERT [#DataGroup] ([ReadingDateTime], [ReadingValue]) VALUES (CAST(N'2021-05-16T05:16:39.0933333' AS DateTime2), 59.68)

    GO

    INSERT [#DataGroup] ([ReadingDateTime], [ReadingValue]) VALUES (CAST(N'2021-05-16T05:16:39.0937500' AS DateTime2), 59.68)

    GO

    INSERT [#DataGroup] ([ReadingDateTime], [ReadingValue]) VALUES (CAST(N'2021-05-16T05:18:39.0933333' AS DateTime2), 59.7)

    GO

    INSERT [#DataGroup] ([ReadingDateTime], [ReadingValue]) VALUES (CAST(N'2021-05-16T05:18:39.0937500' AS DateTime2), 59.7)

    GO

    INSERT [#DataGroup] ([ReadingDateTime], [ReadingValue]) VALUES (CAST(N'2021-05-16T05:20:39.0933333' AS DateTime2), 59.7)

    GO

    INSERT [#DataGroup] ([ReadingDateTime], [ReadingValue]) VALUES (CAST(N'2021-05-16T05:20:39.0937500' AS DateTime2), 59.7)

    GO

    INSERT [#DataGroup] ([ReadingDateTime], [ReadingValue]) VALUES (CAST(N'2021-05-16T05:22:39.0933333' AS DateTime2), 59.68)

    GO

    INSERT [#DataGroup] ([ReadingDateTime], [ReadingValue]) VALUES (CAST(N'2021-05-16T05:22:39.0937500' AS DateTime2), 59.68)

    GO

    INSERT [#DataGroup] ([ReadingDateTime], [ReadingValue]) VALUES (CAST(N'2021-05-16T05:24:39.2800000' AS DateTime2), 59.69)

    GO

    INSERT [#DataGroup] ([ReadingDateTime], [ReadingValue]) VALUES (CAST(N'2021-05-16T05:24:39.2812500' AS DateTime2), 59.69)

    GO

    INSERT [#DataGroup] ([ReadingDateTime], [ReadingValue]) VALUES (CAST(N'2021-05-16T05:26:39.0933333' AS DateTime2), 59.7)

    GO

    INSERT [#DataGroup] ([ReadingDateTime], [ReadingValue]) VALUES (CAST(N'2021-05-16T05:26:39.0937500' AS DateTime2), 59.7)

    GO

    INSERT [#DataGroup] ([ReadingDateTime], [ReadingValue]) VALUES (CAST(N'2021-05-16T05:28:39.0933333' AS DateTime2), 59.69)

    GO

    INSERT [#DataGroup] ([ReadingDateTime], [ReadingValue]) VALUES (CAST(N'2021-05-16T05:28:39.0937500' AS DateTime2), 59.69)

    GO

    INSERT [#DataGroup] ([ReadingDateTime], [ReadingValue]) VALUES (CAST(N'2021-05-16T05:30:39.0933333' AS DateTime2), 59.67)

    GO

    INSERT [#DataGroup] ([ReadingDateTime], [ReadingValue]) VALUES (CAST(N'2021-05-16T05:30:39.0937500' AS DateTime2), 59.67)

    GO

    INSERT [#DataGroup] ([ReadingDateTime], [ReadingValue]) VALUES (CAST(N'2021-05-16T05:32:39.0933333' AS DateTime2), 59.67)

    GO

    INSERT [#DataGroup] ([ReadingDateTime], [ReadingValue]) VALUES (CAST(N'2021-05-16T05:32:39.0937500' AS DateTime2), 59.67)

    GO

    INSERT [#DataGroup] ([ReadingDateTime], [ReadingValue]) VALUES (CAST(N'2021-05-16T05:34:39.0933333' AS DateTime2), 59.68)

    GO

    INSERT [#DataGroup] ([ReadingDateTime], [ReadingValue]) VALUES (CAST(N'2021-05-16T05:34:39.0937500' AS DateTime2), 59.68)

    GO

    INSERT [#DataGroup] ([ReadingDateTime], [ReadingValue]) VALUES (CAST(N'2021-05-16T05:36:39.0933333' AS DateTime2), 59.72)

    GO

    INSERT [#DataGroup] ([ReadingDateTime], [ReadingValue]) VALUES (CAST(N'2021-05-16T05:36:39.0937500' AS DateTime2), 59.72)

    GO

    INSERT [#DataGroup] ([ReadingDateTime], [ReadingValue]) VALUES (CAST(N'2021-05-16T05:38:39.0933333' AS DateTime2), 59.7)

    GO

    INSERT [#DataGroup] ([ReadingDateTime], [ReadingValue]) VALUES (CAST(N'2021-05-16T05:38:39.0937500' AS DateTime2), 59.7)

    GO

    INSERT [#DataGroup] ([ReadingDateTime], [ReadingValue]) VALUES (CAST(N'2021-05-16T05:40:39.0933333' AS DateTime2), 59.65)

    GO

    INSERT [#DataGroup] ([ReadingDateTime], [ReadingValue]) VALUES (CAST(N'2021-05-16T05:40:39.0937500' AS DateTime2), 59.65)

    GO

    INSERT [#DataGroup] ([ReadingDateTime], [ReadingValue]) VALUES (CAST(N'2021-05-16T05:42:39.0933333' AS DateTime2), 59.66)

    GO

    INSERT [#DataGroup] ([ReadingDateTime], [ReadingValue]) VALUES (CAST(N'2021-05-16T05:42:39.0937500' AS DateTime2), 59.66)

    GO

    INSERT [#DataGroup] ([ReadingDateTime], [ReadingValue]) VALUES (CAST(N'2021-05-16T05:44:39.0933333' AS DateTime2), 59.72)

    GO

    INSERT [#DataGroup] ([ReadingDateTime], [ReadingValue]) VALUES (CAST(N'2021-05-16T05:44:39.0937500' AS DateTime2), 59.72)

    GO

    In this table, I want to extract only the 10th record out of each groip

    So i would just see the 10, 20th, 30th, 40th

    Im aware of the NTILE function but this performs very poorly with high volumes

    I tried this but it didnt work

    SELECT

    ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS RowNumber,

    CEILING(CAST(ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS DECIMAL(10,2))/10) AS BatchNo, *

    FROM [#DataGroup]

     

  • Selecting every 10th record means selecting every row where the row_number divides evenly by 10, as in, when you divide the row_number by 10, the remainder will be 0.

    WHERE RowNumber % 10 = 0

    One way to do it:

    WITH DGBatches AS (
    SELECT ROW_NUMBER() OVER (ORDER BY (ReadingDateTime)) AS RowNumber, *
    FROM [#DataGroup]
    )
    SELECT RowNumber / 10 AS BatchNo, *
    FROM DGBatches
    WHERE RowNumber % 10 = 0
    ORDER BY RowNumber;

    I replaced the ORDER BY clause of the ROW_NUMBER() function because the OVER (ORDER BY SELECT 0) pattern is a speed boost, it also does not guarantee the same order. If you're just looking for 10% of the records, then there's no need to order the ROW_NUMBER() data.

    Eddie Wuerch
    MCM: SQL

  • An alternate, but similar, approach would be to add a sequential integer column to the table rather than the row_number approach.  This should have better SELECT performance than row_number() at the cost of additional disk space.  If you wanted even more SELECT performance, you could have a calculated column added to the end that is "1" when the row number is a multiple of 10, then toss an index on that column and your select is just grabbing all rows where that calculated column is 1.  Mind you that assumes you are ALWAYS wanting to view multiples of 10 for the row number on your data.  If this isn't the case and you sometimes do 100 or 50 or 3 or whatever, then this calculated column is wasted overhead.

    The advantage is that your selects will be faster.  Disadvantage is your INSERT, UPDATE, and DELETE will be slower.  So it depends on how this table is used.  If it is rarely IUD'ed, but frequently SELECTED, then the additional indexes and calculated columns MAY be a good approach.  If it is frequently IUD'ed, then I'd avoid the calculated columns and additional indexes.

    If you do go with Eddie's approach though, I'd still recommend adding a clustered index. Not sure on what columns as I don't see any way that a row will be guaranteed unique, so you may need to add a uniquifier such as an INT or a GUID.  You MAY be able to add it on ReadingDateTime as long as that is ALWAYS unique.  I am not sure on your processes so you may hit snags.  Failing adding the clustered index, I'd add a non-clustered index on the ReadingDateTime column at a minimum to help with the ordering of the data so you can have your data sorted  before you start ordering it for the ROW_NUMBER() function.

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there is negative side effects before making changes to live!

  • Thank you very much Eddie and Brian for your help on this

     

    I went with Eddies approach and then scaled it up to a table of 1.5 million records

    The performance using Eddies query was very fast!!

Viewing 4 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply