Table Index

  • I have a potential table


    CREATE TABLE dbo.Preferences
     (
     EmailAddress VARCHAR(100) NOT NULL,
     ChannelID INTEGER NOT NULL,
     ContactTypeID INTEGER NOT NULL,
     AllowContact BIT NOT NULL,
     GDPR BIT NOT NULL,
     DateCreated DATETIME NOT NULL,
     CreatedBy VARCHAR(10) NOT NULL,
     CreatedSource INTEGER NOT NULL,
     DateUpdated DATETIME NULL,
     UpdatedBy VARCHAR(10) NULL,
     UpdatedSource INTEGER NULL,
     UserName VARCHAR(10) NULL
     )

    Unique primary key EmailAddress,ChannelID,ContactTypeID
    The table will initially be populated with 500,000+ rows
    The number and frequency of inserts is unknown
    Do not know if/when rows may get deleted in the future
    In the current system (from which the data will be ported) the email address is VARCHAR(100)

    Questions
    1. Should EmailAddress be increased, e.g. to a max of VARCHAR(254), to avoid possible future truncation errors?
    2. Should Primary Key be CLUSTERED?
    3. Any suggestion for FILLFACTOR?

    Additional info
    Access to the data will be either
    EmailAddress or ChannelID or ContactTypeID or ChannelID and ContactTypeID

    Far away is close at hand in the images of elsewhere.
    Anon.

  • Hi David,

    Having recently gone through some problems with several relatively large tables that were once the size you predict, I do have some suggestions for "bullet proofing".  I'll try to lay those out tonight after work.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • P.S.  Not wanting to rely on abbreviations, what does GDPR stand for in this case?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden - Thursday, October 19, 2017 9:25 AM

    P.S.  Not wanting to rely on abbreviations, what does GDPR stand for in this case?

    It is the EU General Data Protection Regulation, which comes into effect next year.
    The flag is there to indicate that the data is compliant. Existing data will not be compliant when migrated, but it will become compliant when the compliant system(s) update it.
    When the date the regulation comes into effect is reached, all non-compliant data (i.e. rows where the flag is not set) will be deleted.

    Far away is close at hand in the images of elsewhere.
    Anon.

  • GDPR compliance?  mm'kay.

    I would have thought it might be handy to have a few other GDPR-related flags in there, as there are some "features" of GDPR that can be applied at the user level - eg, the right to restrict processing ("you may have my data, but you're not allowed to do anything with it"), and maybe track the automated decision-making / profiling stuff...
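
    Something like this, maybe (column and constraint names entirely hypothetical):

    ALTER TABLE dbo.Preferences ADD
     RestrictProcessing BIT NOT NULL CONSTRAINT DF_Preferences_RestrictProcessing DEFAULT (0),
     AllowProfiling BIT NOT NULL CONSTRAINT DF_Preferences_AllowProfiling DEFAULT (0);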

    This has the potential to be a bit messy...  https://ico.org.uk/for-organisations/data-protection-reform/overview-of-the-gdpr/individuals-rights/rights-related-to-automated-decision-making-and-profiling/ - why am I not surprised?

    Thomas Rushton
    blog: https://thelonedba.wordpress.com

  • David Burrows - Thursday, October 12, 2017 9:32 AM

    I have a potential table


    CREATE TABLE dbo.Preferences
     (
     EmailAddress VARCHAR(100) NOT NULL,
     ChannelID INTEGER NOT NULL,
     ContactTypeID INTEGER NOT NULL,
     AllowContact BIT NOT NULL,
     GDPR BIT NOT NULL,
     DateCreated DATETIME NOT NULL,
     CreatedBy VARCHAR(10) NOT NULL,
     CreatedSource INTEGER NOT NULL,
     DateUpdated DATETIME NULL,
     UpdatedBy VARCHAR(10) NULL,
     UpdatedSource INTEGER NULL,
     UserName VARCHAR(10) NULL
     )

    Unique primary key EmailAddress,ChannelID,ContactTypeID
    The table will initially be populated with 500,000+ rows
    The number and frequency of inserts is unknown
    Do not know if/when rows may get deleted in the future
    In the current system (from which the data will be ported) the email address is VARCHAR(100)

    Questions
    1. Should EmailAddress be increased, e.g. to a max of VARCHAR(254), to avoid possible future truncation errors?
    2. Should Primary Key be CLUSTERED?
    3. Any suggestion for FILLFACTOR?

    Additional info
    Access to the data will be either
    EmailAddress or ChannelID or ContactTypeID or ChannelID and ContactTypeID

    Let's handle the questions first.

    1.  Yes, I believe the email address should be increased.  According to a link I quoted from and can no longer find because I accidentally closed it, the maximum length should be 256 "octets" (bytes), with the caveat that it must include the leading and trailing angle brackets.  So, for storage purposes, 254 seems like it should be the right number, but there's that extra 2-byte thing and you might not know ahead of time whether the angle brackets will ever be included.  I'd set the column to 256 and call it a day.  I can tell you it was a huge PITA when the folks at work started getting email data longer than the VARCHAR(100) they had set up for, especially since they had used the email address as the key that some other tables pointed to (a bad idea to begin with, because it's not immutable).
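
    If the decision comes after the table already exists, widening it is a one-liner, provided the PK hasn't been built on that column yet (a sketch using David's names):

    -- Growing a VARCHAR is a metadata-only change, but it will fail
    -- if EmailAddress is already part of the PK or another index.
    ALTER TABLE dbo.Preferences
     ALTER COLUMN EmailAddress VARCHAR(256) NOT NULL;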

    2.  Should the PK be Clustered?  In the case of the 3 column key you posted, I'm strongly leaning towards a hearty "NO" because that's a fairly wide clustered index and, as you know, the clustered index columns are auto-magically added to every non-clustered index both in the BTree and in the Leaf Level.  For such a narrow table, that's a huge duplication of data for virtually any NCI you may add.  Still, it is a fairly narrow table and that just might not matter so much. 

    On that same note, the initial load won't be so painful if it's done in the order of the clustered index (which will also allow you to do the load in a "Minimally Logged" fashion with the CI in place), but subsequent loads may be, which is also why you ask about a Fill Factor to avoid heavy page splits during those subsequent loads.
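
    Something like this for the initial load, just as a sketch (the staging table is hypothetical, and minimal logging into a clustered index also needs an empty target, a TABLOCK, and SQL Server 2016+ or trace flag 610 on earlier versions):

    INSERT INTO dbo.Preferences WITH (TABLOCK)
     (EmailAddress, ChannelID, ContactTypeID, AllowContact, GDPR,
      DateCreated, CreatedBy, CreatedSource)
    SELECT EmailAddress, ChannelID, ContactTypeID, AllowContact, GDPR,
           DateCreated, CreatedBy, CreatedSource
    FROM staging.Preferences
    ORDER BY EmailAddress, ChannelID, ContactTypeID; -- clustered key order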

    And with that in mind, people normally say that the CI must be unique, narrow, not null, ever increasing, and immutable.  During loads with such an index, you'll only heat up the logical end of the table and there will automatically be little in the form of the "bad" type of page splits.  The writes to disk will also supposedly be more sequential and cause less head thrashing.  I'll suggest that's usually not true on SANs unless the subsequent loads are really the only thing running or the table lives on a dedicated spindle with no read requests while the load is taking place.

    To make the proper choice on whether this particular PK should be CI or NCI, we'd have to know more about the load patterns, volume, and frequency so that we can do a good ol' fashioned "what if" test.  That's not quite as bad as an "It Depends" answer because there's a definitive method to find out by testing.
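
    To make that "what if" concrete, the NCI flavour might look like this (the PreferenceID IDENTITY column is my own assumption, not part of your posted DDL, and the audit columns are left out for brevity):

    CREATE TABLE dbo.Preferences
     (
     PreferenceID INT IDENTITY(1,1) NOT NULL,
     EmailAddress VARCHAR(256) NOT NULL,
     ChannelID INTEGER NOT NULL,
     ContactTypeID INTEGER NOT NULL,
     CONSTRAINT PK_Preferences
      PRIMARY KEY NONCLUSTERED (EmailAddress, ChannelID, ContactTypeID)
     );

    -- narrow, ever-increasing, immutable... the classic CI candidate
    CREATE UNIQUE CLUSTERED INDEX CIX_Preferences_PreferenceID
     ON dbo.Preferences (PreferenceID);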

    3.  Concerning the FILL FACTOR for the CI: unless you do, in fact, have an index where the key column values will be ever increasing and the data in the variable length columns will not suffer frequent updates that cause the data to expand, I'd probably set the FILL FACTOR so there's space for 3 or 4 additional "average size" rows per page during the initial load of this narrow table, and then watch what happens.  The maximum size of each row after the change to 256 for the email address would be 327 bytes.  If you consider that the average length of an email address is typically between 18 and 23 bytes per THIS link, the average row size will be between 89 and 94 bytes.  10% of 8060 (the generally accepted data capacity of a page) is 806 bytes, which is enough room for 8 or 9 "normal average rows" or 2 rows that take up all 327 bytes, so a FILL FACTOR of 90 comfortably covers that 3-or-4-row target.
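
    In DDL terms, that's just the following (index name assumed, and only if the three-column key does end up as the CI):

    -- FILLFACTOR = 90 leaves ~806 bytes (10% of 8060) free per leaf page at build time
    CREATE UNIQUE CLUSTERED INDEX CIX_Preferences
     ON dbo.Preferences (EmailAddress, ChannelID, ContactTypeID)
     WITH (FILLFACTOR = 90);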

    If you're into reorging or rebuilding indexes, then a FILL FACTOR of 90 isn't enough.  What happens for things like this is that a whole lot of pages reach the max all at the same time, you get a huge amount of blocking for the loads for a day, and then the "natural fill factor" caused by all those splits takes over and the blocking goes away.  So, if you're going to do index maintenance, use a FILL FACTOR of 80 and base your index maintenance on the % of page fullness (rather than % fragmented): rebuild at 85 or 90% full using a FILL FACTOR of 80 to avoid that day of heavy blocking (this is all based on some research and tests I've been doing over the last month).

    Or, do like I do... don't do any index maintenance unless the % of page fullness gets ridiculously low.  I've not rebuilt any indexes on my production box for nearly two years, performance actually got better, and I've had no significant blocking at all.  I do have one table that suffers because rows are inserted and then the varchars are updated from containing nothing to containing some wide stuff.  I will have to rebuild that one soon because it's wasting 50% in page fullness and it's the biggest table in that particular database.
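
    For the "% of page fullness" check, something like this does the trick (SAMPLED or DETAILED mode is required to populate the space-used column):

    SELECT i.name,
           ps.avg_page_space_used_in_percent,
           ps.page_count
    FROM sys.dm_db_index_physical_stats
          (DB_ID(), OBJECT_ID('dbo.Preferences'), NULL, NULL, 'SAMPLED') ps
    JOIN sys.indexes i
      ON i.object_id = ps.object_id
     AND i.index_id = ps.index_id;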

    Speaking of that, you need to plan on this table getting that way and you need to plan for the eventual deletes you were talking about.  With that in mind, do NOT build this table in the PRIMARY file group.  Give it its own file group so that, when you eventually need to get rid of a bunch of data, you can create yet another file group, rebuild the indexes to that other file group (compressing and moving the data at the same time), and then drop the original, now-empty file group.  Remember that the rebuild of any index over 128 extents (that's only 8MB) will keep the original index active until the new one is built and only then drop the old index, creating a huge amount of unnecessary unused space in whatever file group you do the rebuild in.  The new file group method allows you to avoid that.
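
    The pattern, roughly (database, file, and file group names are all hypothetical):

    ALTER DATABASE MyDatabase ADD FILEGROUP Preferences_FG1;
    ALTER DATABASE MyDatabase ADD FILE
     (NAME = 'Preferences_F1', FILENAME = 'D:\SQLData\Preferences_F1.ndf')
     TO FILEGROUP Preferences_FG1;

    -- build the table ON Preferences_FG1; after the big delete, create a
    -- Preferences_FG2 the same way, rebuild onto it, then drop FG1:
    CREATE UNIQUE CLUSTERED INDEX CIX_Preferences
     ON dbo.Preferences (EmailAddress, ChannelID, ContactTypeID)
     WITH (DROP_EXISTING = ON)
     ON Preferences_FG2;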

    Other Stuff
    Since this table has to do with the General Data Protection Regulation, I can only assume that it actually needs to be audited and that might be the reason for the "Created", "Modified", and "Source" six pack of columns.  Considering that those 6 columns make up 8+12+8+12+4+4 or 48 bytes, that's more than half of the expected average row size and it actually does little to appease auditors because only the creation and the latest rev to a row are kept.  If no modification to a row is ever made, then the row wastes 24 bytes of space.  With the number of rows you have and will have, you want to waste as little space per row as possible.

    With that, I'm going to suggest that you build a per-row (full row) audit table and an audit trigger that ONLY audits UPDATEs and DELETEs.  You never need to audit inserts because the rows will either still be in the original table as unmodified rows or the old original row will be copied to the audit table on the very first update of the row.  Auditing inserts is a bit unwieldy and certainly unnecessary (a total and instant duplication of data) because of that fact.  Except for DELETEs, the latest row will always be in the original table and only UPDATEs and DELETEs will be in the audit table.

    That also means that you can save 24 bytes per row because it's no longer necessary to have "Modified" or "Update Source" columns.  Instead of the six pack of current "audit" columns, you'll only need 3 and you might simply want to call them "Last Action" columns.  This will also tend to give range queries an automatic 25% improvement in performance because you'll be able to fit 25% more rows per page.

    If you do go with an audit table, you'll need to add 2 columns... one for the action type that put it in the audit table ("U" for update and "D" for delete) and an IDENTITY column, which should also be the CI, to allow for very high speed trigger action devoid of all splits.
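
    A minimal sketch of that pattern (all names assumed, and only a few of the columns shown):

    CREATE TABLE dbo.PreferencesAudit
     (
     AuditID BIGINT IDENTITY(1,1) NOT NULL,
     ActionType CHAR(1) NOT NULL, -- 'U' = update, 'D' = delete
     EmailAddress VARCHAR(256) NOT NULL,
     ChannelID INTEGER NOT NULL,
     ContactTypeID INTEGER NOT NULL,
     AllowContact BIT NOT NULL,
     GDPR BIT NOT NULL,
     CONSTRAINT PK_PreferencesAudit PRIMARY KEY CLUSTERED (AuditID)
     );
    GO
    CREATE TRIGGER dbo.trg_Preferences_Audit
    ON dbo.Preferences
    AFTER UPDATE, DELETE
    AS
    BEGIN
     SET NOCOUNT ON;
     -- "deleted" holds the pre-change rows for both UPDATEs and DELETEs
     INSERT INTO dbo.PreferencesAudit
      (ActionType, EmailAddress, ChannelID, ContactTypeID, AllowContact, GDPR)
     SELECT CASE WHEN EXISTS (SELECT 1 FROM inserted) THEN 'U' ELSE 'D' END,
            d.EmailAddress, d.ChannelID, d.ContactTypeID, d.AllowContact, d.GDPR
     FROM deleted d;
    END;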

    And that brings us all the way back to question number 2... what should be the CI for the main table?  Again, we need to know the volume and frequency of new data and whether or not they will be predominately INSERTs or UPDATEs.  If the answer is INSERTs and the typical SELECT will usually be one row at a time (in which case it doesn't really matter what the CI is), then consider adding an IDENTITY column and using that as the CI to help keep thrashing down and improve performance during the INSERTs.

    ... and now you know all of the things that go through my mind in the first 30 seconds or so when someone asks me about the design of such a table. Can you smell the smoke? 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden - Thursday, October 19, 2017 8:34 PM

    ... and now you know all of the things that go through my mind in the first 30 seconds or so when someone asks me about the design of such a table. Can you smell the smoke? 😉

    Awesome!  Glad I peeked at this thread just out of sheer curiosity 🙂
    This post is worthy of many more reads than it will probably get.

  • Awesome feedback Chris.  Thank you for the kudo.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • A huge thanks for your reply Jeff.

    Sorry for the delay in replying

    1.
    Approx 75% of current email addresses are between 20 and 30 chars, but I will increase the size as you suggest.
    The original estimate is incorrect; as with all projects in the early design phase, things change.
    The current conservative number is 150,000 for the first one-off load, plus anything up to 150 added per day by an automated process.
    Individual email addresses can also be added by a new Web app, but the number is unknown.
    I can only see two reasons why an email address would not be immutable: correction of an invalid address, or someone changing their address (I do not know if this will be an update or an insert).
    Deletes are unknown at this point as retention rules are yet to be specified.

    2.
    To begin with there will be 4 channels and 5 contact types (up to 20 combinations per address).
    Attrition rates for channel/contact types are unknown, except that, as I explained before, when GDPR compliance comes into effect all non-compliant channel/contact types will be deleted; this could be large or small.
    Based on your observations I am now leaning towards splitting the table thus

    CREATE TABLE dbo.EmailAddresses
     (
     EmailAddressID INT IDENTITY(1,1) NOT NULL,
     EmailAddress VARCHAR(256) NOT NULL,
     DateCreated DATETIME NOT NULL,
     CreatedBy VARCHAR(10) NOT NULL,
     CreatedSource INTEGER NOT NULL,
     UserName VARCHAR(10) NULL
     )

    CREATE TABLE dbo.Preferences
     (
     EmailAddressID INT NOT NULL,
     ChannelID INTEGER NOT NULL,
     ContactTypeID INTEGER NOT NULL,
     AllowContact BIT NOT NULL,
     GDPR BIT NOT NULL,
     DateCreated DATETIME NOT NULL,
     CreatedBy VARCHAR(10) NOT NULL,
     CreatedSource INTEGER NOT NULL,
     UserName VARCHAR(10) NULL
     )

    EmailAddress must be unique (the wiring I have in mind between the two tables is sketched after question 3)
    PK?
    Clustered?

    3.
    FILL FACTORs?
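
    For clarity, the wiring between the two tables would be something like this (constraint names off the top of my head; the clustered choice on the ID is only a placeholder pending your answer):

    ALTER TABLE dbo.EmailAddresses
     ADD CONSTRAINT PK_EmailAddresses PRIMARY KEY CLUSTERED (EmailAddressID);
    ALTER TABLE dbo.EmailAddresses
     ADD CONSTRAINT UQ_EmailAddresses_EmailAddress UNIQUE (EmailAddress);
    ALTER TABLE dbo.Preferences
     ADD CONSTRAINT FK_Preferences_EmailAddresses
      FOREIGN KEY (EmailAddressID) REFERENCES dbo.EmailAddresses (EmailAddressID);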

    Other Stuff
    Yes, it will be audited and the triggers will ONLY audit UPDATEs and DELETEs.
    Since the created date/by/source columns are immutable, would you exclude them from the audit tables?  Just considering saving space and best practice.
    Is it worth considering different filegroups for the tables, the indexes (where possible), and the audits?

    ...and whether or not they will be predominately INSERTs or UPDATES
    I foresee inserts for email addresses with minimal updates (but that is a guess).
    Channel/contact data will be inserts and updates; deletes (excluding the one-time deletes mentioned above) are unknown at this time.

    In addition to the above there will be other tables (audited as well), but the tables above are my main concern.
    One of the additional tables will contain actual PII data per person:
    approx 1 million rows on the first initial load and up to 5,000 per day,
    including the email address, which needs to be indexed to be searchable.

    If the above table change is a good solution then I will replace the email address on the PII table with EmailAddressID.

    The PII data will contain only one mailing address.  I would normally separate the mailing address into a separate table to avoid duplication, but I am worried about potential updates relocating all the people linked to an address to the new address, whether intended or not (see the sketch below).
    I suppose this will come down to tight control over mailing address changes.
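
    i.e. something like this (hypothetical names), where the risk is exactly that an UPDATE to a shared row silently moves everyone pointing at it:

    CREATE TABLE dbo.MailingAddresses
     (
     MailingAddressID INT IDENTITY(1,1) NOT NULL,
     AddressLine1 VARCHAR(100) NOT NULL,
     PostCode VARCHAR(10) NOT NULL,
     CONSTRAINT PK_MailingAddresses PRIMARY KEY (MailingAddressID)
     );
    -- each person row would carry MailingAddressID; updating one
    -- dbo.MailingAddresses row "relocates" every person referencing it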

    Far away is close at hand in the images of elsewhere.
    Anon.
