Metadata-Data-Ratio in a table

  • Hi all,

    My understanding about the structure of data within SQL Server is that table data and non-clustered index data are stored in 8K pages. These are filled up according to the fill factor. Rows are sorted according to the clustering key, which is usually the primary key. Pages that have been read are stored in the buffer cache until they are either dropped or replaced. The buffer cache is RAM and is the fastest way to read data from SQL Server (except, possibly from memory-optimised tables). These are logical reads. Ideally, all of the database would be stored in the buffer cache. However, it is rarely the case and many pages, usually less frequently accessed ones (aka cold data), reside only on disk. These are physical reads.

    A table row cannot be longer than 8,060 bytes, unless LOB datatypes are used, since LOB values can be stored off-page rather than within the row itself. With frequently-accessed data (aka hot data), it is obviously optimal to keep as many of the hot pages as possible in the buffer cache: the more pages that fit there, the more efficiently data can be accessed.

    My question concerns table design: should I be encouraging those who design tables to aim for lean, thin tables so that the figurative signal-to-noise ratio within the table is as high as possible? By this, I mean both using the smallest datatypes that will fit and either moving metadata columns to another table or removing them altogether. I define metadata in this context as all data that gives information about the data in the row or allows the application to do extra things, such as:

    • GUID columns as random external keys for use with webservices;

    • Rowversion columns so that the application knows if a row has been altered in its session;

    • Any column with DATETIME2(7) when we don't need 100-nanosecond precision;

    • Columns like CreateBy, CreateDate, UpdateBy, UpdateDate: well, 2 out of 4 would be useful, but not all four;

    • Columns like Comments so that those who amend a row can leave a little note.

    • VARCHAR instead of NVARCHAR if Unicode doesn't bring any benefit.

    Now, I'm not against metadata columns; I'm just curious what the consensus is on both the ratio of data:metadata and the notion of wide tables.

    As an example, we have a table which will contain hot data, will be written to often, and whose recent data especially will be read often. It has the following structure:

    create table SchemaName.TableToBeFilled
    (
        TableToBeFilledID int identity(1,1) not null, -- 04 » 004 bytes
        CreateBy varchar(50) not null,                -- 15 » 019 bytes -- estimated average value
        CreateDate datetime2(7) not null,             -- 08 » 027 bytes
        UpdateBy varchar(50) null,                    -- 15 » 043 bytes -- estimated average value
        UpdateDate datetime2(7) null,                 -- 08 » 051 bytes
        Comment varchar(100) null,                    -- 15 » 066 bytes -- estimated average value; probably should be sparse
        TableAFK int not null,                        -- 04 » 070 bytes
        TableBFK int not null,                        -- 04 » 074 bytes
        TableCFK int null,                            -- 04 » 078 bytes
        RegistrationDate datetime2(7) not null,       -- 08 » 086 bytes
        NumberOfInterest varchar(20) null,            -- 15 » 101 bytes -- estimated average value
        SOAPGUID uniqueidentifier not null,           -- 16 » 117 bytes
        constraint PK_LabelRegistration_LR primary key clustered ( TableToBeFilledID asc )
    );

    Now, this is an ordinary table and most people wouldn't look twice at it, but only 5 columns and the primary key are actually core to the table. These five columns (TableAFK, TableBFK, TableCFK, RegistrationDate, NumberOfInterest) are what is actually being sought after, yet space-wise, they and the primary key take up only 33% of the row. This means that barely a third of every row contains signal and the rest is noise.
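    The estimated byte counts above don't have to stay theoretical: SQL Server will report actual row size and page density. A sketch, assuming the example table above exists in the current database (SAMPLED or DETAILED mode is needed to populate avg_record_size_in_bytes):

    ```sql
    -- Sketch: measure the real page density and average row size of the
    -- clustered index of the example table.
    SELECT index_id,
           index_level,
           page_count,
           avg_record_size_in_bytes,
           avg_page_space_used_in_percent   -- how full the pages really are
    FROM sys.dm_db_index_physical_stats(
             DB_ID(),
             OBJECT_ID(N'SchemaName.TableToBeFilled'),
             1,      -- the clustered index
             NULL,
             'SAMPLED');
    ```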

    Does it make sense from a performance perspective to make tables lean (namely, remove these metadata columns or move them to auxiliary tables)? On the plus side, page density would greatly increase; that is, the same amount of data would fit into fewer hot pages, and the likelihood of keeping those dense, hot pages in the buffer cache would be higher. On the negative side, it would mean more joins if the metadata columns were put into auxiliary tables, and it would make data transfer by webservice that little bit slower. Of course, the company could invest in more RAM and simply have a much larger buffer cache. This, though, is a design question.

    Or am I overthinking the theory, and lean tables don't really play as big a role as I believe they may?

    Does anyone have a good rule-of-thumb for the ratio of data:metadata?

    Alternatively, has anyone a link to a good article on the optimal row size for a table that is both read from and written to quite often?

    I would test it but, as always, the simulation of a representative load is not so easy. I could replay a trace, I suppose.

    Thanks everyone for making it this far,

    Sean Redmond

  • Very good question. You've given me something to consider. Like many, I tend to tack on such metadata/tracking columns w/o question. But as they are all part of the clustered index, it probably is worth considering the impact.

    I see two key factors:

    * How often do you query the metadata columns? If they are rarely accessed, moving them into a separate table could very well make the common queries more efficient, and probably wouldn't impact the much less common "when was this updated & by whom?" queries terribly. More, but rarely used, joins are probably not a problem... but you would want to prove that with your data.

    * Do your queries tend to use the clustered index or instead use non-clustered indexes? If they tend not to hit the clustered index, the benefit of separating may not be significant.

    Theoretical optimal row size is as small as possible. If you look at benchmarks that show amazing throughput for a new/improved technology, you'll often see that they are returning only a couple of narrow columns.

    Unfortunately, real optimal row size depends -- it should be as small or as big as the data model dictates... with limits/caveats for data integrity and normalization. Use the right datatypes; normalize until proven to be a detriment. For varchar(max) columns, forcing the data out of row with sp_tableoption is probably a good idea.
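    For reference, the sp_tableoption setting mentioned above looks like this (a sketch against the table name from the original post; it only affects the MAX datatypes, and existing rows only move out of row when they are next updated):

    ```sql
    -- Force varchar(max)/nvarchar(max)/varbinary(max) values out of row;
    -- only a 16-byte pointer remains on the data page.
    EXEC sys.sp_tableoption
         @TableNamePattern = N'SchemaName.TableToBeFilled',
         @OptionName       = 'large value types out of row',
         @OptionValue      = 'ON';
    ```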

    The CreateBy, CreateDate, UpdateBy, UpdateDate & Comment columns are rarely accessed. They are only ever of use when a customer has a complaint or if one of the DBAs got an update query wrong in the past. We store the ticket/task-no. in the CreateBy column and work on from there. So the answer is: occasionally; and, even more rarely than that, they are absolutely indispensable.

    I've no idea how often rows are read by webservices. I must see if that is possible with Extended Events.
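    I haven't verified the exact event and predicate to use, but an Extended Events sketch for counting statements that touch the table might look something like this (the LIKE filter is a rough proxy: it also counts writes and any statement that merely mentions the name, and the session's overhead should be tested first):

    ```sql
    -- Sketch only: count completed statements that reference the table.
    CREATE EVENT SESSION TrackTableReads ON SERVER
    ADD EVENT sqlserver.sql_statement_completed
    (
        ACTION (sqlserver.client_app_name)
        WHERE (sqlserver.like_i_sql_unicode_string(statement, N'%TableToBeFilled%'))
    )
    ADD TARGET package0.event_counter;  -- cheap target: just counts matching events
    GO
    ALTER EVENT SESSION TrackTableReads ON SERVER STATE = START;
    ```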

    In the case of this table, the 3 foreign keys and the various non-clustered indexes on them will do most of the work. Your second point about the indexes is good. I shall bear that in mind.

     

  • I think when you say "lean" you really are referring to normalized, and I do recommend normalizing tables that have a high number of write operations.

    For tables with a high number of read operations, indexes are going to be a better approach than normalizing.  Build an index on the 5 columns you care about and SQL won't even need to look back at the main table.

    Now, the above is only true if the less queried columns are actually queried at some point.  For example, if you ONLY ever look at the 5 columns you mentioned, I would stop capturing the remaining columns.  Why capture data you don't need?  I would much rather only capture data that is required.

    Your reference of "signal to noise" here is less relevant if you have good indexes.  Storing the data just uses disk space if you have good indexes.  If you have a covering Non-Clustered index, SQL won't even bother to look back to the main table as all of the information it needs is stored in the index.  Now, adding non-clustered indexes eats up disk as it is a copy of a subset of the database.  Plus it slows down write operations as each write now needs to write to your clustered index and each non-clustered index.  So if the table is write heavy, I would normalize it.  If it is read heavy, I would index it.  If it is mixed, you may want to look at replicating the data (ETL for example) and do the read heavy operations on a de-normalized copy of the data and the write heavy operations on a normalized version of the data.
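    A covering non-clustered index of the kind described above, sketched against the example table (the choice of RegistrationDate as the key column is a guess at the query shape; the key should match the actual predicates):

    ```sql
    -- Queries that filter on RegistrationDate and return only the core columns
    -- can be answered entirely from this index, without touching the wide
    -- clustered index.
    CREATE NONCLUSTERED INDEX IX_TableToBeFilled_Core
    ON SchemaName.TableToBeFilled (RegistrationDate)
    INCLUDE (TableAFK, TableBFK, TableCFK, NumberOfInterest);
    ```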

    The above is all just my opinion on what you should do.
    As with all advice you find on a random internet forum, you shouldn't blindly follow it.  Always test on a test server to see if there are negative side effects before making changes to live!

    I would definitely move the comments column to its own table - with keys to identify when the comment was made and who made it.  I would also look at how you set up/use CreateBy - normally that would be an identifier tied back to a lookup table for reference (e.g. a UserID).

    Those types of columns are normally utilized for basic auditing - letting you know the last time the row was edited and who edited the data.  But that doesn't track what actually changed - which could be a problem if you need to be able to reconstruct the previous data.  If that is the case then you really should look at a better method of auditing - which could be CDC, triggers or even temporal tables.
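    For reference, a minimal system-versioned temporal table (available from SQL Server 2016); the names here are illustrative, not from the original schema:

    ```sql
    -- SQL Server maintains the history table automatically on UPDATE/DELETE,
    -- so "what changed and when" can be reconstructed without hand-rolled
    -- audit columns.
    CREATE TABLE SchemaName.AuditedThing
    (
        AuditedThingID int IDENTITY(1,1) NOT NULL
            CONSTRAINT PK_AuditedThing PRIMARY KEY CLUSTERED,
        SomeValue varchar(50) NOT NULL,
        ValidFrom datetime2(2) GENERATED ALWAYS AS ROW START NOT NULL,
        ValidTo   datetime2(2) GENERATED ALWAYS AS ROW END   NOT NULL,
        PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
    )
    WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = SchemaName.AuditedThing_History));
    ```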

     

    Jeffrey Williams
    Problems are opportunities brilliantly disguised as insurmountable obstacles.

    How to post questions to get better answers faster
    Managing Transaction Logs

  • I would not move a comments column to its own table.  That requires two tables and the comments table is going to frag like crazy just as it would if you left the comments in the table.

    Instead, set the table to store LOBs out of row and make sure you specify a default comment of an empty string because the bloody 16 byte pointers are ExpAnsive in nature and that causes page splits and fragmentation.  I would also make sure that the out of row stuff (LOBs) is stored in a file group other than Primary.  Also, storing LOBs and even just long VARCHAR(s) out of row will seriously help queries that scan the table or do range scans on the Clustered Index.  It will also help prevent low page densities in the CI by eliminating what I call "Trapped Short Rows", which can have page densities as low as 1%.
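    The filegroup placement has to be decided when the table is created, via TEXTIMAGE_ON; a sketch with made-up names (the LOB_FG filegroup must already exist in the database, and the out-of-row behaviour for MAX types is set separately with sp_tableoption):

    ```sql
    -- Sketch: keep LOB data off the PRIMARY filegroup, and give the MAX column
    -- a default of '' so later updates are not ExpAnsive NULL-to-value changes.
    CREATE TABLE SchemaName.ThingWithComments
    (
        ThingID  int IDENTITY(1,1) NOT NULL
            CONSTRAINT PK_ThingWithComments PRIMARY KEY CLUSTERED,
        Comments varchar(max) NOT NULL
            CONSTRAINT DF_ThingWithComments_Comments DEFAULT ('')
    )
    ON [PRIMARY]
    TEXTIMAGE_ON [LOB_FG];
    ```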

    Another advantage of out of row storage (I wish MS had NOT created the defaults for the MAX datatypes to be in row) is the fact that the B-Tree on the pages is not order sensitive when it comes to any of that data.

    Last but not least, while a page has 8,096 bytes available internally, that also has to accommodate the row header and the slot array.  With that, MS made the maximum definable row length 8,060 bytes.

    Even if everything else were perfect, the UpdateBy column is going to be a huge source of fragmentation and reducing the fill factor will not help at all if newly inserted pages (which try to go in at 100% full on a CI with an ever-increasing key) are being updated prior to a rebuild to make extra space.

    One way to get around that would be to make the Comments and UpdatedBy columns a MAX datatype.  Another way is to make the Clustered Index on the GUID column.  And, no.... the massive fragmentation that supposedly occurs on GUIDs isn't the fault of GUIDs.  It's the improper method of maintaining them that perpetuates page splits and the resulting fragmentation all day every day.  I've done tests where I've demonstrated inserting 100,000 rows per day for 58 days (that's 5.8 MILLION rows, folks) with almost zero page splits, <1% fragmentation, and no index maintenance until it gets to 1.04% fragmentation on the eve of the 58th day.

    For proof of my claim, please see the following 'tube where I destroy the myth of Random GUID fragmentation and lay waste to the supposed "Best Practices" for Index Maintenance that people have been using for the last two decades that are NOT a "Best Practice", were never meant to be a "Best Practice", and are actually a VERY WORST PRACTICE for most indexes that are fragmenting where a Fill Factor actually could help.  I also cover the fragmentation explosions on ever-increasing indexes.

    https://www.youtube.com/watch?v=qfQtY17bPQ4

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Thanks for your comments everybody and thanks especially for the presentation on YouTube, Jeff. I have started watching it.

    @Mr. Brian Gale: The tables are normalised. By 'lean', I mean the ratio of data:metadata. From the perspective of the application, only the 5 columns and the primary key are necessary. The GUID is to provide a non-sequential key in webservices, and the other fields are to help DBAs solve problems when a customer has a complaint (or when one of us makes a mistake and we want to try to undo what was done). If I have understood my theory correctly, the fewer pages the query engine has to retrieve from disk, the better the performance. The leaner the table row (that is, the lower the percentage of metadata), the denser the pages in the buffer cache.
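    That density argument can be checked against the buffer pool directly; a sketch (sys.dm_os_buffer_descriptors needs VIEW SERVER STATE, can be slow on servers with a lot of RAM, and the hobt_id join covers in-row and row-overflow allocation units but not LOB units):

    ```sql
    -- How many pages of the example table are currently in the buffer pool.
    SELECT au.type_desc,
           COUNT(*) AS cached_pages
    FROM sys.dm_os_buffer_descriptors AS bd
    JOIN sys.allocation_units AS au ON bd.allocation_unit_id = au.allocation_unit_id
    JOIN sys.partitions        AS p  ON au.container_id      = p.hobt_id
    WHERE bd.database_id = DB_ID()
      AND p.object_id    = OBJECT_ID(N'SchemaName.TableToBeFilled')
    GROUP BY au.type_desc;
    ```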

    As for capturing data that is not needed, it is occasionally needed and it is standard company policy that all tables (except enumeration tables) have these columns. I want to either move or remove them but first I would like to make sure that I am on a solid footing.

    You make a good point about indexes. The table in question is written to many, many times daily and read from as much, if not more. However, indexes on the five columns might make the size of the table row, and its corresponding ratio of data:metadata, much less of a concern.

    As for replicating the data, I am considering scaling out, that is having one table for writes and many tables for reads. I am familiar with transactional replication and plan sometime in the coming years to experiment with an availability groups setup.

     

    @Jeffrey Williams: I have to test temporal tables. I tried them out when learning for the 70-461 exam. I am thinking about seeing how it would be to move all of the metadata columns to another table whose primary key is also a foreign key to the primary key of the main table. I read many years back that CDC puts quite a load on the system. I must revisit this to see if that is still true.
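    The split being considered here, metadata in a 1:1 auxiliary table keyed on the main table's primary key, might be sketched like this (names are illustrative; whether the GUID belongs with the metadata or stays in the hot table depends on the webservice access path):

    ```sql
    -- Sketch of the proposed vertical split: the audit/metadata columns move
    -- to a side table whose primary key doubles as the foreign key.
    CREATE TABLE SchemaName.TableToBeFilled_Meta
    (
        TableToBeFilledID int NOT NULL
            CONSTRAINT PK_TableToBeFilled_Meta PRIMARY KEY CLUSTERED
            CONSTRAINT FK_TableToBeFilled_Meta_Main
                FOREIGN KEY REFERENCES SchemaName.TableToBeFilled (TableToBeFilledID),
        CreateBy   varchar(50)  NOT NULL,
        CreateDate datetime2(7) NOT NULL,
        UpdateBy   varchar(50)  NULL,
        UpdateDate datetime2(7) NULL,
        Comment    varchar(100) NULL
    );
    ```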

     

    @Jeff Moden: Let me finish watching your presentation on YouTube first. It seems as if I am being asked to commit heresy, but you know your stuff and I have found your posts helpful & sensible in the past.

    I'm not sure that I follow you regarding the comments column. The Comments column is a varchar(100). That the UpdatedBy column is a huge source of fragmentation is not a surprise. I'm not against a second table that serves an auxiliary function. It wouldn't be read very often and written to more often than read. I don't mind a big database, I just try to keep the tables that read most often as small as possible.

  • sean redmond wrote:

    I'm not sure that I follow you regarding the comments column. The Comments column is a varchar(100). That the UpdatedBy column is a huge source of fragmentation is not a surprise. I'm not against a second table that serves an auxiliary function. It wouldn't be read very often and written to more often than read. I don't mind a big database, I just try to keep the tables that read most often as small as possible.

    The key is going to be on how the comments column is used.  If it is populated during the original INSERT of the row and never changed, no problem and it can actually stay in the table as currently designed.

    But that's not the normal case that I see for comments columns.  What I normally see is that they are usually not populated by the INSERT and are UPDATEd later to contain some value.  That causes explosive fragmentation even if as little as 3.3% (the level I've tested at in past tests) of the rows have updated comments.

    The GUID thing I spoke of is just one way to avoid such explosive fragmentation.  There are other ways, and one of them is to change the column to a LOB and force it out of row to get it out of your Clustered Index without having to juggle a separate table.  If only the modified-by column is going to be updated, there are a couple of methods (one is to move it out of row, so long as something bigger is out of row as well).  Another is to always make it the same as the created-by value during the initial insert, which isn't a 100% cure but can help.  I'd have to know a bit more about what it will be populated with before we can make a call on that.

    Or, depending on how it fragments, let it fragment and occasionally pick up the pieces.  Not my first choice but "It Depends".

     

    --Jeff Moden



  • The Comments column is a new development and I'm not sure how it will be used. My guess is that it will be used as an explanation only when the data in the main row is updated. However, free-text fields have funny effects on people.

    Forcing a GUID out of the row does sound like an interesting idea. I shall have to look into this.

    Oh no.  You've misunderstood a bit.  I wouldn't force a GUID out of row.

    --Jeff Moden



  • I do think large columns should be out of row, as Jeff noted.

    However, I also do like vertical partitions of tables, especially when developers struggle to avoid SELECT *.

    My main view here is that as a table widens physically, we get fewer rows per page. Pages are how we read and write data, even with indexes. While I can cover lots of normal queries with indexes, there are times when I find that adding more indexes to a table to cover more selective columns has an impact on insert/update activity. A 20-column table is more likely to have 4-5 indexes that work great than a 100-column table. However, a 20-column table that is often queried and two 40- (or 41-) column tables that are rarely queried could have different indexes.

    I don't have a hard and fast rule here, but once I'm in the 20-30 column range, I do tend to examine the table and see if there are rarely used columns whose removal could shrink the row size without causing lots of joins between the tables. That being said, I've certainly worked with 50-column tables that posed no real issue, often because most of those 50 were numeric values.

     
