Let's handle the questions first.
1. Yes, I believe the email address column should be widened. According to a link I was quoting from (and can no longer find because I accidentally closed it), the maximum length should be 256 octets (bytes), with the caveat that this must include the leading and trailing angle brackets. So, for storage purposes, 254 seems like it should be the right number, but there's that extra 2-byte allowance and you might not know ahead of time whether the angle brackets will ever be included. I'd set the column to 256 and call it a day. I can tell you it was a huge PITA when the folks at work started getting email data longer than the VARCHAR(100) they had set up, especially since they had used the email address as the key that some other tables pointed to (a bad idea to begin with, because it's not immutable).
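As a minimal sketch (the table and column names here are assumptions, not from your post), widening an existing column in place is a one-liner:

```sql
-- Hypothetical names; widening the column in place.
-- 254 is the practical limit, +2 in case the angle brackets ever get stored too.
ALTER TABLE dbo.Subscriber
    ALTER COLUMN EmailAddress VARCHAR(256) NOT NULL;
```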
2. Should the PK be clustered? In the case of the 3-column key you posted, I'm strongly leaning towards a hearty "NO", because that's a fairly wide clustered index and, as you know, the clustered index key columns are auto-magically added to every nonclustered index, both in the B-Tree and at the leaf level. For such a narrow table, that's a huge duplication of data for virtually any NCI you may add. Still, it is a fairly narrow table, and that just might not matter so much.
On that same note, the initial load won't be so painful if it's done in the order of the clustered index (which will also allow you to do the load in a minimally logged fashion with the CI in place), but subsequent loads may be, which is also why you asked about a fill factor to avoid heavy page splits during those subsequent loads.
And with that in mind, people normally say that the CI must be unique, narrow, not null, ever increasing, and immutable. During loads with such an index, you'll only heat up the logical end of the table and there will automatically be little in the form of the "bad" type of page splits. The writes to disk will also supposedly be more sequential and cause less head thrashing. I'll suggest that's usually not true on SANs unless the subsequent loads are really the only thing running or the table lives on a dedicated spindle with no read requests while the load is taking place.
To make the proper choice on whether this particular PK should be the CI or an NCI, we'd have to know more about the load patterns, volume, and frequency so that we can do a good ol' fashioned "what if" test. That's not quite as bad as an "It Depends" answer, because there's a definitive method to find out by testing.

3. Concerning the FILL FACTOR for the CI: unless you do, in fact, have an index where the key column values will be ever increasing and the data in the variable-length columns won't suffer frequent updates that cause the data to expand, I'd probably set the FILL FACTOR so there's space for 3 or 4 additional "average size" rows during the initial load of this narrow table and then watch what happens. The maximum size of each row, after the change to 256 bytes for the email address, would be 327 bytes. If you consider that the average length of an email address is typically between 18 and 23 bytes per THIS link, the average row size will be between 89 and 94 bytes. 10% of 8,060 bytes (the generally accepted data capacity of a page) is 806 bytes, which is enough room for 8 or 9 "normal" average-size rows or 2 rows that take up all 327 bytes.
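If you want to sanity-check that arithmetic yourself, it only takes a constants-only SELECT (assuming, as implied by the 327-byte max with a 256-byte email, that the fixed portion of the row is 327 - 256 = 71 bytes):

```sql
-- Constants-only sanity check of the fill-factor arithmetic above; no table access.
SELECT  327 - 256   AS FixedRowBytes        -- 71 bytes of non-email row content
       ,71 + 18     AS MinAvgRowBytes       -- 89
       ,71 + 23     AS MaxAvgRowBytes       -- 94
       ,8060 / 10   AS TenPctOfPageBytes    -- 806 bytes of free space at FILL FACTOR 90
       ,806 / 94    AS WorstCaseAvgRowsFit  -- 8
       ,806 / 89    AS BestCaseAvgRowsFit   -- 9
       ,806 / 327   AS MaxSizeRowsFit;      -- 2
```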
If you're into reorging or rebuilding indexes, then a FILL FACTOR of 90 isn't enough. What happens for things like this is that a whole lot of pages reach the max all at the same time, you get a huge amount of blocking for the loads for a day, and then the "natural fill factor" caused by all those splits takes over and the blocking goes away. So, if you're going to do index maintenance, use a FILL FACTOR of 80 and do your index maintenance based on the % of page fullness (rather than % fragmented), rebuilding at 85 or 90% full using a FILL FACTOR of 80 to avoid that day of heavy blocking (this is all based on some research and tests I've been doing over the last month). Or, do like I do... don't do any index maintenance unless the % of page fullness gets ridiculously low. I've not rebuilt any indexes on my production box for nearly two years, performance actually got better, and I've had no significant blocking at all. I do have one table that suffers because rows are inserted and then the VARCHARs are updated from containing nothing to containing some wide stuff. I will have to rebuild that one soon, because it's wasting 50% in page fullness and it's the biggest table in that particular database.
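One way (an assumption about how you'd implement it, not the only way) to find rebuild candidates by page fullness instead of fragmentation is to query sys.dm_db_index_physical_stats. Note that you need SAMPLED or DETAILED mode; the default LIMITED mode returns NULL for the page-space column:

```sql
-- Find indexes approaching "full" so they can be rebuilt at FILLFACTOR = 80
-- before the mass page splits hit. SAMPLED (or DETAILED) mode is required.
SELECT  OBJECT_NAME(ips.object_id)          AS TableName
       ,i.name                              AS IndexName
       ,ips.avg_page_space_used_in_percent  AS PctPageFullness
  FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'SAMPLED') ips
  JOIN sys.indexes i
    ON  i.object_id = ips.object_id
    AND i.index_id  = ips.index_id
 WHERE ips.avg_page_space_used_in_percent >= 85  -- rebuild-candidate threshold
 ORDER BY ips.avg_page_space_used_in_percent DESC;
```

Anything this returns would then get an ALTER INDEX ... REBUILD WITH (FILLFACTOR = 80).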
Speaking of that, you need to plan on this table getting that way, and you need to plan for the eventual deletes you were talking about. With that in mind, do NOT build this table in the PRIMARY filegroup. Give it its own filegroup so that, when you eventually need to get rid of a bunch of data, you can easily create yet another filegroup and then rebuild the indexes to that other filegroup, compressing and moving the data at the same time, and then drop the original, now-empty filegroup. Remember that the rebuild of any index over 128 extents (that's only 8MB) will keep the original index active until the new one is built and only then drop the old index, creating a huge amount of unnecessary unused space in whatever filegroup you do the rebuild in. The new-filegroup method allows you to avoid that.

Other Stuff
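The filegroup shuffle looks roughly like this (database, filegroup, path, table, and index names are all hypothetical here; your PK/CI definition will differ):

```sql
-- Stand up the destination filegroup.
ALTER DATABASE MyDb ADD FILEGROUP Subscribers2;
ALTER DATABASE MyDb ADD FILE
      (NAME = 'Subscribers2_Data', FILENAME = 'D:\Data\Subscribers2.ndf')
      TO FILEGROUP Subscribers2;

-- Rebuilding the clustered index onto the new filegroup "moves" the whole table
-- and compacts it in one shot. DROP_EXISTING works even when the index backs a
-- PRIMARY KEY constraint, provided the key columns don't change.
CREATE UNIQUE CLUSTERED INDEX PK_Subscriber
    ON dbo.Subscriber (SubscriberID)
  WITH (DROP_EXISTING = ON, FILLFACTOR = 80)
    ON Subscribers2;

-- Once everything has moved, the old filegroup and its file can be dropped.
```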
Since this table has to do with the "General Data Protection Regulation" (GDPR), I can only assume that it actually needs to be audited, and that might be the reason for the "Created", "Modified", and "Source" six-pack of columns. Considering that those 6 columns make up 8+12+8+12+4+4 = 48 bytes, that's more than half of the expected average row size, and it actually does little to appease auditors because only the creation and the latest revision to a row are kept. If no modification to a row is ever made, then the row wastes 24 bytes of space. With the number of rows you have and will have, you want to waste as little space per row as possible.
With that, I'm going to suggest that you build a per-row (full row) audit table and an audit trigger that ONLY audits UPDATEs and DELETEs. You never need to audit INSERTs, because the rows will either still be in the original table as unmodified rows or the old original row will be copied to the audit table on the very first update. Auditing inserts is a bit unwieldy and totally unnecessary (it's instant duplication of data) because of that fact. Except for DELETEs, the latest version of a row will always be in the original table, and only UPDATEs and DELETEs will be in the audit table.
That also means that you can save 24 bytes per row because it's no longer necessary to have "Modified" or "Update Source" columns. Instead of the six pack of current "audit" columns, you'll only need 3 and you might simply want to call them "Last Action" columns. This will also tend to give range queries an automatic 25% improvement in performance because you'll be able to fit 25% more rows per page.
If you do go with an audit table, you'll need to add 2 columns... one for the action type that put it in the audit table ("U" for update and "D" for delete) and an IDENTITY column, which should also be the CI, to allow for very high speed trigger action devoid of all splits.
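A rough sketch of that audit table and trigger (all table and column names here are assumptions; the real audit table would carry a full copy of every column in your table). The key points are the IDENTITY clustered index for splitless inserts and the fact that the "deleted" pseudo-table holds the pre-change rows for both UPDATEs and DELETEs:

```sql
CREATE TABLE dbo.SubscriberAudit
(
    AuditID      INT IDENTITY(1,1) NOT NULL
   ,ActionType   CHAR(1)           NOT NULL  -- 'U' = update, 'D' = delete
   ,SubscriberID INT               NOT NULL
   ,EmailAddress VARCHAR(256)      NOT NULL
   ,CONSTRAINT PK_SubscriberAudit
        PRIMARY KEY CLUSTERED (AuditID)      -- ever-increasing CI: no "bad" splits
);
GO
CREATE TRIGGER trg_Subscriber_Audit
    ON dbo.Subscriber
 AFTER UPDATE, DELETE          -- INSERTs intentionally NOT audited, per the above
AS
BEGIN
    SET NOCOUNT ON;
    INSERT INTO dbo.SubscriberAudit (ActionType, SubscriberID, EmailAddress)
    SELECT CASE WHEN EXISTS (SELECT 1 FROM inserted) THEN 'U' ELSE 'D' END
          ,d.SubscriberID
          ,d.EmailAddress
      FROM deleted d;          -- the row image as it was BEFORE the change
END;
```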
And that brings us all the way back to question number 2... what should the CI for the main table be? Again, we need to know the volume and frequency of new data and whether the loads will be predominantly INSERTs or UPDATEs. If the answer is INSERTs and the typical SELECT will usually be one row at a time (in which case it doesn't much matter what the CI is), then consider adding an IDENTITY column and using that as the CI to help keep thrashing down and improve performance during the INSERTs.
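If the testing points that way, the shape would be something like this (a sketch with hypothetical table and key-column names, since I'm guessing at your 3-column key): the natural key stays enforced as a NONCLUSTERED PK, while the IDENTITY column takes the clustered index so inserts land at the logical end of the table.

```sql
CREATE TABLE dbo.EmailSubscription
(
    RowID        INT IDENTITY(1,1) NOT NULL  -- narrow, ever-increasing, immutable
   ,ListID       INT               NOT NULL
   ,SourceID     INT               NOT NULL
   ,EmailAddress VARCHAR(256)      NOT NULL
   ,CONSTRAINT PK_EmailSubscription
        PRIMARY KEY NONCLUSTERED (ListID, SourceID, EmailAddress)
);
-- The wide 3-column key no longer gets copied into every NCI; RowID does instead.
CREATE UNIQUE CLUSTERED INDEX CI_EmailSubscription
    ON dbo.EmailSubscription (RowID);
```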
... and now you know all of the things that go through my mind in the first 30 seconds or so when someone asks me about the design of such a table. Can you smell the smoke?