Tally Tables

  • Jeff Moden (9/7/2013)


    Good to "see" you again, ol' friend.

    Same here! Still waiting for an opportunity to actually meet face to face and buy you the beverage of your choice. 😉

    Let us indeed build both tables without indexes and, for a minute, go with the idea that a heap can be implicitly ordered, as you did with your code, although I didn't trust the implicit ordering of ROW_NUMBER() and added an ORDER BY to the code that builds the "sorted" table much like there's an ORDER BY that builds the randomized table.

    I didn't truly trust the implicit ordering. I added a SELECT after the INSERT to check the order, because I don't trust anything. After adding the MAXDOP hint to the INSERT statement, I did get the rows in the natural order - so far me, at least in that run, it worked.

    I played with the code you posted. The first thing I did was to add a similar SELECT after creating the table:

    SELECT * FROM #RandomizedTable OPTION(MAXDOP 1);

    SELECT * FROM #OrderedTable OPTION(MAXDOP 1);

    On my computer, neither result set was ordered, and neither was really random. Both had long chunks of ordered rows, with the chunks in semi-random order. (e.g. 19041 - 19042 - 19043 - (...) - 22847 - 22848 - 45697 - 45698 - (...) - 49503 - 49504 - 76161 - 76162 - (...) - etc)

    I tried a lot of variations on your code, but they all exposed the same result. I notice that you don't have a similar SELECT in your code - did you actually check how the rows were generated?

    The problem I had is dcescribed here: http://blogs.msdn.com/b/sqltips/archive/2005/07/20/441053.aspx. I hunted this down because I had a recollection that an ORDER BY in an INSERT query is only observed under specific circumstances. As you see, my memory is wrong - IDENTITY order is guaranteed; insertion order is not.

    But there is more than just this. I tried everything I could think of to create a heap with a million rows that are actually ordered. Nothing worked - not even using a loop to insert these rows one by one. I finally managed to find the cause - to avoid contention in my tempdb's GAM and SGAM pages, I have created multiple files for tempdb. The pages for the tables are of course scattered over the various files, and when reading from the files, all files are read concurrently. So I created a test database on a single file, and changed your code to use permanent tables instead of temp tables. That helped - the OrderedTally table now returns rows numbered consecutively from 1 to 100000. And the RandomizedTally table ... does exactly the same!!!!

    I then used a helper table with an IDENTITY column (to make use of the only ordering guarantee with insertion), and with a clustered index on the identity column. This ensures that the generated row numbers are in random order. From this helper table, I could fill the RandomizedTally table. After a lot of attempts (most not even described here), I finally got the tables to look as I wanted them to look using the code below:

    USE tempdb;

    go

    DROP DATABASE MyTest;

    go

    CREATE DATABASE MyTest

    ON PRIMARY

    ( NAME = N'MyTest_Data', FILENAME = N'D:\SQL 2012\Data\MyTest_Data.mdf' , SIZE = 200 MB , MAXSIZE = UNLIMITED, FILEGROWTH = 0 )

    LOG ON

    ( NAME = N'MyTest_Log', FILENAME = N'C:\SQL 2012\Log\MyTest_log.ldf' , SIZE = 400 MB , MAXSIZE = UNLIMITED, FILEGROWTH = 0);

    go

    USE MyTest;

    go

    CREATE TABLE BaseData

    (id int NOT NULL IDENTITY PRIMARY KEY, N int NOT NULL);

    -- The ORDER BY forces IDENTITY values in random order;

    -- the clustered index orders them, shuffling the N values,

    WITH

    cteGenerateSequence AS

    (

    SELECT TOP (1000000)

    N = ROW_NUMBER() OVER (ORDER BY (SELECT 1))

    FROM master.sys.all_columns ac1

    CROSS JOIN master.sys.all_columns ac2

    )

    INSERT INTO BaseData(N)

    SELECT N = ISNULL(N,0) --ISNULL used to make a NOT NULL column

    FROM cteGenerateSequence

    ORDER BY NEWID()

    ;

    --===== Create a "Tally" table as a randomized heap

    SELECT N

    INTO RandomizedTally

    FROM BaseData

    ORDER BY id

    ;

    --===== Create a "Tally" table as "ordered" Clustered Table

    SELECT TOP (1000000)

    N = ISNULL(ROW_NUMBER() OVER (ORDER BY (SELECT 1)),0) --ISNULL used to make a NOT NULL column

    INTO OrderedTally

    FROM master.sys.all_columns ac1

    CROSS JOIN master.sys.all_columns ac2

    ORDER BY N --<-----<<<< Added this

    ;

    SELECT * FROM RandomizedTally OPTION(MAXDOP 1);

    SELECT * FROM OrderedTally OPTION (MAXDOP 1);

    (Note the SELECT at the end. I cannot guarantee that this code will produce the desired results on other computers. Leave the SELECT in and do inspect the result sets!)

    I then modified your test code to use the permanent table and ran it a couple of times. I did get different execution times, but no consistency. One time the first query was faster; another time the second query was faster.

    Your testing does prove a huge difference on your system. I'd like to know the cause of this - based on my current understanding of SQL Server architecture, I don't see where it comes from. Can you please do the following for me:

    1. Execute the SELECT * FROM ... OPTION MAXDOP(1) statements to check the actual ordering of the two tables;

    2. Execute the test code with SET STATISTICS IO ON added;

    3. Run with "include actual execution plan" checked, then move to the execution plan, right-click and "Save as" an .xmlplan file. (This is a file format that SQL Server Central allows in attachments).

    If you can post the results here, I'll dig into it and try to find an explanation. There must be a reason for the performance difference you see, and I hate not being able to explain the actual internal process that causes this difference.

    EDIT: For anyone reading this, please let me clarify that the above is all a purely academic discussion; an actual tally table in a production environment should always have a primary key, based on a clustered index.


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • Good question. 🙂

  • I seem to remember getting an error without a partition statement

  • Don Cureton (9/19/2013)


    I seem to remember getting an error without a partition statement

    ROW_NUMBER() does not require a PARTITION clause.

    Some of the other window aggregates, e.g., MAX(xx), MIN(xx), AVG(xx), do.


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

Viewing 4 posts - 46 through 48 (of 48 total)

You must be logged in to reply to this topic. Login to reply