MAX function and indexes

  • Jeff Moden wrote:

    ScottPletcher wrote:

    Given that the table had no index at all, I didn't think in this case that making the clus index unique was necessary, particularly to the extent of adding the overhead of an identity column. SQL can very quickly seek to find the highest key value whether it's unique or not.

    But, to help cut down on dup key values, make the underlying column datetime2 rather than datetime.

    And, for single-row inserts, it will certainly "help" cut down on dup key values, but it will not eliminate them.  Also, even if you use something like SYSDATETIME(), ALL of the rows inserted in a single batch are guaranteed to have exactly the same date and time.

    If you need the clustered index to be unique, just having a single temporal column isn't going to do it... not even DATETIME2(7).

    No, it won't be unique; but, as I stated, datetime2(7) will have fewer dup keys than datetime.  But dup keys are irrelevant in this type of index anyway.  The only point is to get the max value, which works perfectly with or without dups.  No need for the overhead of an identity column.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.
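    A quick way to see the single-statement behavior described in the exchange above; a minimal sketch, with hypothetical table and column names:

        -- SYSDATETIME() is a runtime constant: it is evaluated once per
        -- statement, so every row of a multi-row INSERT gets the same value.
        CREATE TABLE dbo.ImportTable (LoadDate datetime2(7) NOT NULL);

        INSERT INTO dbo.ImportTable (LoadDate)
        SELECT SYSDATETIME()
        FROM (VALUES (1), (2), (3)) AS v(n);

        SELECT COUNT(DISTINCT LoadDate) FROM dbo.ImportTable;  -- returns 1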

  • ScottPletcher wrote:


    No, it won't be unique; but, as I stated, datetime2(7) will have fewer dup keys than datetime.  But dup keys are irrelevant in this type of index anyway.  The only point is to get the max value, which works perfectly with or without dups.  No need for the overhead of an identity column.

    If you put a clustered index on a datetime column and don't add another column to make the clustered index unique, SQL Server adds a uniquifier to the index.
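    A rough way to observe that hidden uniquifier (hypothetical table; the uniquifier is added only to rows whose key value already exists in the index):

        CREATE TABLE dbo.Readings (ReadingTime datetime2(7) NOT NULL);
        CREATE CLUSTERED INDEX CX_Readings ON dbo.Readings (ReadingTime);  -- not UNIQUE

        INSERT INTO dbo.Readings VALUES ('2021-01-01'), ('2021-01-01'), ('2021-01-02');

        -- Once duplicate key values exist, min and max record sizes differ,
        -- reflecting the 4-byte uniquifier carried by the duplicates.
        SELECT min_record_size_in_bytes, max_record_size_in_bytes
        FROM sys.dm_db_index_physical_stats
             (DB_ID(), OBJECT_ID('dbo.Readings'), 1, NULL, 'DETAILED');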

     

  • ScottPletcher wrote:


    No, it won't be unique; but, as I stated, datetime2(7) will have fewer dup keys than datetime.  But dup keys are irrelevant in this type of index anyway.  The only point is to get the max value, which works perfectly with or without dups.  No need for the overhead of an identity column.

    A unique clustered index is less of an overhead than a non-unique clustered index, because SQL Server has to add extra data to a non-unique clustered index to uniquely identify a row.  I'm not sure whether this extra data on the clustered index really is much more efficient than adding an identity column to the table, though.

  • Jonathan AC Roberts wrote:


    A unique clustered index is less of an overhead than a non-unique clustered index because sql server would have to add extra data to a non-unique clustered index to uniquely identify a row.  I'm not sure if this extra data on the clustered index is a lot more efficient than adding an identity column to the table?

    I don't see it as enough to worry about.  Yes, I do believe SQL's built-in dup-key mechanism is more efficient than an identity column, although the difference likely isn't that significant.

    That said, sure, normally you very much want a clus index key to be unique, and SQL performs much better when non-clus indexes are used alongside one.  But that's not the case here.  This is a very simple table that doesn't need non-clus indexes.  It needs the clus index only to facilitate the max lookup and, I suspect, to purge rows later (a sketch of that follows below).  I just wouldn't sweat the uniqueness thing here; to me it's not worth it.
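    For what it's worth, such a purge would be a simple range delete against the clustering key; a minimal sketch, with hypothetical table and column names:

        -- With the date as the clustering key, deleting rows older than 90
        -- days touches only the leading range of the clustered index.
        DELETE FROM dbo.ImportTable
        WHERE LoadDate < DATEADD(DAY, -90, SYSDATETIME());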

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • Thanks everyone for all the responses and good information.  I'm sorry I haven't been able to keep up with them.  Let me try to clarify a few things quickly.

    There are 24 columns in this table, and there is a unique key field that currently has a nonunique nonclustered index.  I am changing that to a clustered primary key, as the majority of table access is through that key field.  The date field under discussion is not a candidate for the primary key or a clustered index.  I was trying to determine whether it's worth having a nonclustered index on that date field, solely for the purpose of determining the maximum value.  A nonclustered index would be just as functional as a clustered index for this purpose.  But I was surprised to find that I could do a full table scan on 10 million rows in 1 second.
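    For reference, the pattern being weighed is just a narrow single-column index feeding the MAX lookup; a sketch with hypothetical names (the optimizer can answer MAX() with a backward seek to the tail of the index rather than a scan):

        CREATE NONCLUSTERED INDEX IX_ImportTable_LoadDate
            ON dbo.ImportTable (LoadDate);

        -- The plan shows TOP + a backward index seek reading only a few pages.
        SELECT MAX(LoadDate) AS LatestLoad
        FROM dbo.ImportTable;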

    Here's the context -- the data in that table comes from another system; we just need the current latest value to facilitate updates from the other system.  So we don't even need to run the query that often, currently once an hour.  In the source system (Oracle) it would be helpful to me if that field was the clustered index, but they need the primary key for their purposes and a nonclustered index I had them add is working fine.  A full table scan in that system was horribly slow.  Note that this field is a datetime2 in my database, but a date in Oracle, which only goes down to the second.

    There are several tables I am looking at.  Unfortunately the 300+ million row one doesn't have a unique key -- I was almost able to create one using a composite key of 3 fields, but 3 records violated it.  This one will be a little more problematic to update, but since all fields in those 3 records were duplicates it doesn't matter.  The max query takes about 30 seconds in that larger table without an index.  It would clearly be a lot faster with an index but it would use a lot of space, especially with the composite key.

    I'm still amazed these queries are that fast -- it would certainly take a lot longer to return all the data.  The plan says 15 bytes per row regardless of the actual size of the row or key, so I'm not exactly sure what's happening, but it appears to be very efficient at pulling just that column.

  • A question to be answered is this: How wide is each row of data?  This will impact the number of rows of data in each data page.  The fewer rows of data in a page, the more pages that need to be read into memory to scan the table for the max value.  A narrow index on that column would pack more data per page, meaning fewer pages need to be read and scanned.  So, the answer to your question regarding a specific index on the date column is "It depends."

    For the database I work with, yes, it would be worthwhile, as we can have as few as 3 rows of data per page because each row is quite wide.
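    As a rough back-of-envelope illustration using the row sizes mentioned elsewhere in this thread: at ~185 bytes per row (ignoring per-row overhead), an 8,060-byte data page holds about 43 rows, so scanning 10 million rows means reading roughly 230,000 pages, about 1.8 GB; a narrow index entry of ~15 bytes would pack several hundred rows per page, shrinking the same scan to a few tens of thousands of pages.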

     

  • I'd probably just add a trigger to the table to capture the max value(s) in a separate table so that it was always available.  All you have to check is whether any of the newly added datetime values are greater than the one you've already identified as the largest value.
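    A minimal sketch of that idea, assuming a hypothetical source table dbo.ImportTable with a LoadDate column and a one-row tracking table:

        CREATE TABLE dbo.MaxLoadDate (MaxValue datetime2(7) NOT NULL);
        INSERT INTO dbo.MaxLoadDate (MaxValue) VALUES ('1900-01-01');
        GO
        CREATE TRIGGER trg_ImportTable_MaxDate
        ON dbo.ImportTable
        AFTER INSERT
        AS
        BEGIN
            SET NOCOUNT ON;
            -- Only write when the inserted batch contains a new high-water mark.
            UPDATE m
            SET    m.MaxValue = i.NewMax
            FROM   dbo.MaxLoadDate AS m
            CROSS APPLY (SELECT MAX(LoadDate) AS NewMax FROM inserted) AS i
            WHERE  i.NewMax > m.MaxValue;
        END;

    The hourly job then reads dbo.MaxLoadDate instead of touching the big table at all.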

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • Lynn Pettis wrote:


    If you put a clustered index on a datetime column and don't add another column to make the clustered index unique, SQL Server adds a uniquifier to the index.

    I know that, but it's only done for those rows that actually do have a duplicate key value.  An identity column would appear on every row.

    Now that we know this will not be the clus index (the table already has one), I'd skip the indexing idea and just use a trigger to capture the max value.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • We're getting a bit lost here because I wasn't clear in my last post.  I'm thinking this isn't the only query that will be played against the column.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Hmm, maybe I misunderstand; I thought the OP said that it was just that one query for the max date, ~ every hour, in order to provide that value to some other process.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • Lynn Pettis wrote:

    A question to be answered is this: How wide is each row of data?  This will impact the number of rows of data in each data page.  The fewer rows of data in a page, the more pages that need to be read into memory to scan the table for the max value.  A narrow index on that column would pack more data per page, meaning fewer pages need to be read and scanned.  So, the answer to your question regarding a specific index on the date column is "It depends."

    For the database I work with, yes, it would be worthwhile, as we can have as few as 3 rows of data per page because each row is quite wide.

    Every table is different; as mentioned, this particular one has an average row size of 185 bytes, with 24 columns.  The one with 300M rows has only 5 columns with an average size of 48 bytes.  The execution plans show a row size of 15 bytes for each, however.  The performance seems related to the number of rows but not the row size; I'll do more testing.

  • ScottPletcher wrote:

    I'd probably just add a trigger to the table to capture the max value(s) in a separate table so that it was always available.  All you have to check is whether any of the newly added datetime values are greater than the one you've already identified as the largest value.

    Hadn't thought of that; I like it.  We do sometimes "manually" write the value to a table as part of the load, however.

  • ScottPletcher wrote:

    Hmm, maybe I misunderstand; I thought the OP said that it was just that one query for the max date, ~ every hour, in order to provide that value to some other process.

    Correct, that is the only usage for that field from my perspective.  Consumers of the data probably want to see that date but I don't think they query by it (although I will probably look into that more).

  • Craig Smith wrote:

    ScottPletcher wrote:

    Hmm, maybe I misunderstand; I thought the OP said that it was just that one query for the max date, ~ every hour, in order to provide that value to some other process.

    Correct, that is the only usage for that field from my perspective.  Consumers of the data probably want to see that date but I don't think they query by it (although I will probably look into that more).

    Understood.  I knew what your perspective was.  I was a bit concerned that you might not know what others are doing with their queries.  I also don't know what the table is actually being used for, so I can't anticipate those needs; my suggestions are just based on other possible needs.  And, to be sure, if this lookup really is the only place the column will be used as search criteria, I wouldn't waste the Clustered Index on the query.  A nice, single-column NonClustered Index reduces the number of reads to just 3 pages, even on the 300 million row monster.  Not having all the other pages dragged into memory by read-ahead will also save memory for other things that can use it better.

    If, however, others also hit the column with search criteria, and they do it a lot along with returns of related columns, then setting up a Clustered Index on it might be the way to go.  If you're going to do that, you might as well spend the extra 4 bytes to make it UNIQUE as well.

     

    --Jeff Moden

  • Craig Smith wrote:

    Thanks everyone.  I have added a clustered primary key in our test environment but can't compare the performance as the configuration is totally different.  When I do the 300 million row table I'll do a before and after test.

    So the actual row size is about 185 bytes.  I read that fixed-length columns (such as this date) are stored at the beginning of the row, however; does it only have to retrieve the first part of the row?

    Edit -- posts crossed in the mail -- That column is not in the primary key.  Sure, any index on that column would help, but it can't be the clustering key, so I'm trying to see if it's really necessary.  It doesn't appear to be on tables this size; we'll see about the larger one.

    Be advised that the process of building such a Clustered Index on that 300 million row table will cause the old heap (or whatever structure is there now) to remain until the new index is created, and only then will the old heap be dropped.  If you don't have any free space in the MDF, that will cause, according to my calculations, your MDF file to grow by ~54.5GB.

     

    --Jeff Moden
