My understanding of the structure of data within SQL Server is that table data and non-clustered index data are stored in 8 KB pages, which are filled according to the fill factor. Rows are sorted according to the clustering key, which is usually the primary key. Pages that have been read are kept in the buffer cache until they are dropped or replaced. The buffer cache is RAM and is the fastest way to read data from SQL Server (except, possibly, memory-optimised tables); reads served from it are logical reads. Ideally, the whole database would fit in the buffer cache, but that is rarely the case, and many pages, usually the less frequently accessed ones (aka cold data), reside only on disk; reading those requires physical reads.
A table row cannot be longer than 8,060 bytes, unless LOB data types are used, since their values are not stored within the row's page but elsewhere. With frequently accessed data (aka hot data), it is obviously optimal to keep as many of the hot pages as possible in the buffer cache; the more pages that fit in the buffer cache, the more efficiently data can be accessed.
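For what it's worth, the page density and average row size of a given table can be inspected directly; a minimal sketch, assuming SchemaName.SomeTable is just a placeholder for whichever table is of interest, using sys.dm_db_index_physical_stats in SAMPLED mode (LIMITED mode leaves these columns null):

select index_id, index_level, page_count,
avg_record_size_in_bytes, -- average row size on the sampled leaf pages
avg_page_space_used_in_percent -- how full those pages actually are
from sys.dm_db_index_physical_stats(
db_id(), object_id('SchemaName.SomeTable'), null, null, 'SAMPLED');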
My question concerns table design: should I be encouraging those who design tables to aim for lean, thin tables, so that the figurative signal-to-noise ratio within the table is as high as possible? By this I mean both using the smallest datatypes that will fit, and either moving metadata columns to another table or removing them altogether. I define metadata in this context as any data that gives information about the data in the row or allows the application to do extra things, such as:
• GUID columns as random external keys for use with webservices;
• Rowversion columns so that the application knows if a row has been altered in its session;
• Any column with DATETIME2(7): we don't need 100-nanosecond precision;
• Columns like CreateBy, CreateDate, UpdateBy, UpdateDate: 2 out of the 4 would be useful, but not all four;
• Columns like Comments, so that those who amend a row can leave a little note;
• VARCHAR instead of NVARCHAR if Unicode doesn't bring any benefits (a quick size check of these type choices follows this list).
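As a rough illustration of the storage cost behind a couple of those type choices, DATALENGTH can be used to check byte sizes directly (this is just a generic sanity check, not tied to any of our tables):

select
datalength(cast(sysdatetime() as datetime2(7))) as dt2_7_bytes, -- 8 bytes
datalength(cast(sysdatetime() as datetime2(2))) as dt2_2_bytes, -- 6 bytes
datalength('John Smith') as varchar_bytes, -- 10 bytes
datalength(N'John Smith') as nvarchar_bytes; -- 20 bytes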
Now, I'm not against metadata columns; I'm just curious what the consensus is on both the ratio of data to metadata and the notion of wide tables.
As an example, we have a table which will contain hot data, will be written to often, and whose recent data in particular will be read often. It has the following structure:
create table SchemaName.TableToBeFilled
(
TableToBeFilledID int identity(1,1) not null, -- 04 » 004 bytes
CreateBy varchar(50) not null, -- 15 » 019 bytes -- estimated average value
CreateDate datetime2(7) not null, -- 08 » 027 bytes
UpdateBy varchar(50) null, -- 15 » 043 bytes -- estimated average value
UpdateDate datetime2(7) null, -- 08 » 051 bytes
Comment varchar(100) null, -- 15 » 066 bytes -- estimated average value. probably should be sparse.
TableAFK int not null, -- 04 » 070 bytes
TableBFK int not null, -- 04 » 074 bytes
TableCFK int null, -- 04 » 078 bytes
RegistrationDate datetime2(7) not null, -- 08 » 086 bytes
NumberOfInterest varchar(20) null, -- 15 » 101 bytes -- estimated average value
SOAPGUID uniqueidentifier not null, -- 16 » 117 bytes
constraint PK_LabelRegistration_LR primary key clustered ( TableToBeFilledID asc )
);
Now, this is an ordinary table and most people wouldn't look twice at it, but only five columns plus the primary key are actually core to the table. Those five columns (TableAFK, TableBFK, TableCFK, RegistrationDate, NumberOfInterest) are what is actually being sought, yet space-wise they and the key take up only about a third of the row (4 + 4 + 4 + 4 + 8 + 15 = 39 of an estimated 117 bytes, roughly 33%). This means that barely a third of every row contains signal and the rest is noise.
Does it make sense from a performance perspective to make tables lean, namely to remove these metadata columns or move them to auxiliary tables? On the plus side, page density would greatly increase: the same amount of core data would fit on fewer pages, so the likelihood of those dense, hot pages staying in the buffer cache would be higher. On the negative side, it would mean more joins if the metadata columns were put into auxiliary tables, and it would make data transfer by web service that little bit slower. Of course, the company could invest in more RAM and simply have a much larger buffer cache, but this is a design question.
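For the sake of argument, a vertical split along those lines might look like the sketch below. This is only a sketch, and everything new in it is an assumption for illustration: the name of the side table, the constraint names, the reduced DATETIME2 precision, and the choice of which columns move (SOAPGUID, for instance, might well have to stay with the core row if the web service reads it on every call).

create table SchemaName.TableToBeFilled
(
TableToBeFilledID int identity(1,1) not null,
TableAFK int not null,
TableBFK int not null,
TableCFK int null,
RegistrationDate datetime2(2) not null,
NumberOfInterest varchar(20) null,
constraint PK_TableToBeFilled primary key clustered ( TableToBeFilledID asc )
);

create table SchemaName.TableToBeFilledMetadata -- hypothetical 1:1 side table
(
TableToBeFilledID int not null,
CreateBy varchar(50) not null,
CreateDate datetime2(2) not null,
UpdateBy varchar(50) null,
UpdateDate datetime2(2) null,
Comment varchar(100) null,
SOAPGUID uniqueidentifier not null,
constraint PK_TableToBeFilledMetadata primary key clustered ( TableToBeFilledID asc ),
constraint FK_TableToBeFilledMetadata_TableToBeFilled foreign key ( TableToBeFilledID )
references SchemaName.TableToBeFilled ( TableToBeFilledID )
);

Ignoring per-row overhead, the core row shrinks from roughly 117 bytes to roughly 39, so something like three times as many rows would fit on each 8 KB page.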
Or am I overthinking the theory, and lean tables don't really play as big a role as I believe they might?
Does anyone have a good rule-of-thumb for the ratio of data:metadata?
Alternatively, a link to a good article on the optimal row size of a table that is both read from and written to quite often would also be appreciated.
I would test it, but, as always, simulating a representative load is not so easy. I could replay a trace, I suppose.
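One cheap thing that can be observed while a replayed trace or test load runs is how many pages of each table are actually sitting in the buffer cache. A rough sketch using sys.dm_os_buffer_descriptors, run in the target database (the join via hobt_id covers in-row and row-overflow allocation units, which should be enough here since there are no LOB columns):

select o.name as table_name,
count(*) as pages_in_buffer_pool,
count(*) * 8 / 1024 as approx_mb_in_buffer_pool
from sys.dm_os_buffer_descriptors as bd
join sys.allocation_units as au on au.allocation_unit_id = bd.allocation_unit_id
join sys.partitions as p on p.hobt_id = au.container_id
join sys.objects as o on o.object_id = p.object_id
where bd.database_id = db_id()
and o.is_ms_shipped = 0
group by o.name
order by pages_in_buffer_pool desc;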
Thanks everyone for making it this far,