At various times in my database-oriented career I have had conversations with people regarding what datatype to use for a particular field (or some other data modeling question), resulting in them discounting my desire to use a smaller datatype and exclaiming: Disk Is Cheap! I'm sure you have heard this as well so I figured I would share why this phrase confounds me.
I find that people uttering such a phrase usually have not thought about the implications of what it really means. From what I gather, the unspoken meaning of the phrase is grounded in factual information that is out of context. By that I mean the cost of storage 30 years ago (70s and 80s) was so great that people conserved bytes whenever possible and the world ended up with the Y2K bug as a result. Today the price of a Gigabyte of disk space doesn't truly matter when we are measuring in Gigabytes to begin with. I remember buying my first 650 Megabyte HD and then the 1.26 Gig when it came out and even then thinking how much space it was even for that price given my first computer was an 8088 with a 10 Meg HD. Now 500 Gigs is nothing, so of course, disk is cheap; they are practically giving it away! This is what people usually have in mind when saying this great phrase but let's dig a little deeper into the real cost of disk space.
There are actually two meanings to "Disk is Cheap": financial cost and I/O cost. Since we have already been talking in terms of financial cost, let's start with that meaning.
Cost in Money
Trying to imply relative cost of disk for what we do as database professionals based on what we can walk into a store and buy is a bit disingenuous. The reason is that most of the applications that we, again as professionals, work on are NOT stored on the type of SATA storage that we can purchase at a local store. Enterprise applications are often stored on SAN which is certainly not cheap. If your project is not on a SAN it is likely still on a SCSI (or SAS: Serial Attached SCSI) RAID setup with many disks which individually still cost more than the price per GB for the SATA drives.
And what about total-cost of ownership (TCO)? If you have either an internal SCSI / SAS array or a separate storage device, they require electricity and that is an additional cost. They also produce heat which requires additional cooling and hence additional cost. If you collocate your servers then these costs are built into the colocation pricing. So when you need additional rack space to handle more storage you are paying for more than the hardware whether you host it or pay someone else to.
But wait, there's more! When you waste space in a data-model it is realized not only in the table itself and its indexes. There are two other areas that are typically never considered: logs and backups. While logs get overwritten and so the impact is not as great, wasted space does take a toll in the sense that you still need more room than you would have otherwise, especially for large transactions. Backups are a bigger problem because ideally you will have more than just the prior nights data backed up. Yes, some third-party vendors offer backup software that has compression and SQL Server 2008 even has built-in compression for backups (Enterprise Edition only). However, the more space each backup takes means the fewer backups you will be able to store, at least before spending more money on new storage (assuming you can get that purchase approved).
Of course, someone is going to argue that, relatively speaking, SAN and other configurations of truly reliable storage are both cheaper than they were years ago and cheap compared to the department or company budget. I always find it interesting to hear engineers say this but reality does not always seem to apply when asking upper-level management for said funds. At some point there is a bottom-line that a CFO is watching. While $50,000 (or $500,000) for new storage seems relatively cheap given the cost of employees or going out of business if you were to run out of storage, the conversation doesn't always go the way that seems logical to us engineers and we are often told: "Sorry, find another solution".
Cost in Performance
As previously mentioned the other cost of disk space is I/O. Disk I/O is very expensive. So expensive, in fact, that there are several technologies dedicated to reducing this cost. Write behind caches (at the OS level; not to be confused with Transaction Logs being write-ahead), buffer pools, memcached, etc. exist to reduce the amount of times you need to go to disk to get information. Storage vendors are now even offering hundreds of GBs of solid-state cache to reduce the disk I/O cost. Each time you have to go to the disk for data you have to wait for one or more spindles to spin around and the heads to move back and forth so it can take many revolutions before the system has read enough information to even start answering your query. And that is just a simplistic view.
When you factor in DML operations needing to lock rows, pages, or tables then you start to see how you need to wait yet again. The more bytes you are writing to disk produces larger rows hence fewer rows per page hence more pages. Locking is done in memory but that is a more finite resource than disk so if need be the locking might need to be done in the page file which is back on disk which means we not only took up the valuable resource of memory but are now waiting even longer for the disk to spin in order to read and write the page file.
As database professionals, part of our job is to do physical data modeling, not just logical modeling. A core aspect of physical modeling is making sure that model will hold up as the data volume increases. The last thing you want to have happen is for an increase in the disk I/O requirements of a system to make everything take longer. So you need to keep in mind things like (for SQL Server at least) NonClustered indexes typically contain the data from the Clustered index (exceptions are: Unique Indexes and NonClustered Indexes that contain the Cluster Key) . Meaning, if you pick a four column Clustered index that has an NVARCHAR(100) field and three INT fields, that is up to 212 bytes that will be in all Non Clustered indexes before you even add the fields you need for the Non Clustered Index. Following this example, even if you picked a single DATETIME field to be a Non Clustered index, the total Non Clustered index size would be up to (given that it is NVARCHAR and not NCHAR) 220 bytes, not just the 8 for the DATETIME field. This is important to keep in mind as a 10 GB index takes longer to traverse than a 2 GB index. Yes, disk is cheap enough for you to store the additional 8 GB but reading it is certainly not cheap in terms of time / available IOPS (I/Os per second). IOPS have not increased nearly to the scale that CPU power and disk capacity have. And while Solid-State Drives do offer a huge increase in IOPS, they are far from cheap!
To restate the point: more wasted space can make queries take longer. The fewer rows that fit into a data page (approx. 8060 bytes) means the more data pages that are needed to satisfy the query. It should be obvious that more data pages on disk require more revolutions of the disk to read and hence more time. In addition, consider memory usage. SQL Server will try to store / buffer data in memory to use for repetitive requests. The more space each row takes up the more buffer space is required for one query leaving less available to other queries. And this means ALL queries since DML statements, not just SELECT statements, operate in the buffer pool (i.e. memory) first. Hence wasted space reduces Page Life Expectency and leads to more frequent disk access. It is, however, possible to actually want fewer rows per data and/or index page to increase concurrency on an oft-updated table, but that should be by-design and can be handled via the Index FILLFACTOR and/or PAD_INDEX options (or maybe some other method).
The cost of disk I/O is one of the reasons why we don't create too many indexes on each table. If there was no drag on the system for each index then we might as well have 50 per table. However, each index is another write per each INSERT and DELETE and sometimes UPDATE and these certainly add up.
Wasted space also makes maintenance tasks take longer. These tasks can include: Rebuilding Indexes, Reorganizing Indexes, Consistency Checks, Updating Statistics, Backups, etc. If you have to restore a database that will certainly take longer the larger the file(s) is/are. And while Enterprise Edition allows for Online Rebuilding of indexes, that takes up system resources (e.g. memory, CPU, tempdb, etc.) while it is running so the sooner that is done the better (and not all Indexes are even eligible for being rebuilt online, such as Clustered Indexes on tables with LOB datatypes).
Lastly, larger-than-necessary datatypes can take up additional CPU and Network Bandwidth. CPU is required for all operations on the data (reading, writing, comparing, etc.). Network Bandwidth is used when communicating to the application as well as several other possible areas: offloading the data for backup, Log Shipping, Replication, Mirroring, ETL to a Data Warehouse, etc.
What Does All of This Mean?
All of this means that trying to save space is not academic; it is quite practical. It is practical in terms of the real cost of enterprise class storage. It is practical in terms of overall efficiency and responsiveness of your system. Needing an additional $250,000 for more SAN space is not cheap. A query that takes several minutes to return when it could return in a few seconds is not cheap.
With these consequences in mind, we need to be aware of our data-modeling decisions as they have a greater impact in the long-term than they do in the short-term. Nearly all data-models, no matter how poorly designed, will work in the development environment with up to a few thousand rows in the tables. Many QA environments don't even simulate the volume of data that exists in a production environment as it is common to test functionality and not performance.
Here are a few suggestions to keep in mind as you approach new projects that will hopefully also get you thinking about other areas of consideration:
- If you are going to store something like a status, do not store the text of the status but instead create a lookup table. Repeating string values of "Active" and "Inactive" takes a lot more space than storing lookup values of 1 and 2. Sure, a table that has a few hundred rows won't ever show a difference between the two but it can be quite impacting at a million rows. Also, it is far more efficient to search on a numeric datatype than a string, especially when the default collation is case-Insensitive so it has to match both upper and lower-case letters at each position in the word.
- If you have a lookup table with a text key, such as ISO codes for Countries (most people use the two character ISO code), and the number of characters is always the same, then use a CHAR or NCHAR datatype instead of VARCHAR or NVARCHAR. For example, if using the two-character ISO Country Code, just use a CHAR(2) instead of VARCHAR(2). Using VARCHAR() actually takes an additional byte in the page header that you won't be using so why waste it?
- If you have a table that will be limited in how many rows it will ever have, such as a lookup table of statuses, use a TINYINT instead of INT for the ID field. Again, those 3 bytes don't seem like much in the lookup table itself as there might only be 10 rows for a total of 30 bytes, but keep in mind that as a lookup table the purpose of the ID field is to be stored in at least one if not more other tables that can be quite large. The impact of those 3 bytes when that field is part of several multi-million row tables is much more noticeable.
- If you know that a field can never have a NULL value, make it NOT NULL. Not only is this better for data integrity, but a NULLable field takes up an additional byte in the page header since the NULL value cannot be stored in the field itself (since it is the absence of a value).
- Consider setting the IDENTITY seed at the lower bound (e.g. -32768 for SMALLINT or -2147483648 for INT) if you really don't care about the number anyway. This gets you twice as many rows as starting at 1 at the same physical cost. And surrogate keys are by their very nature arbitrary so if they are not exposed to the end user (and they shouldn't be) then it doesn't matter if they are negative (although some people will complain that they are "ugly").
- If you are tracking a date and don't need the time component and you are using SQL Server 2005 or even SQL Server 2000 (yikes!), use a SMALLDATETIME field as it is 4 bytes instead of 8. Some people even use an INT field and represent the date as YYYYMMDD which is still properly ordered for doing BETWEEN and ORDER BY operations and you don't need to do CONVERT to chop off the time component, but doing DATEDIFF operations isn't as easy. Do keep in mind that SMALLDATETIME has a max-year of 2079 so if there is any reason you might need dates extending many years into the future then this might not be a good choice. Of course, if you are using SQL Server 2008 or beyond then just use the DATE datatype.
As far as indexes are concerned, if you are not going to have the Primary Key be CLUSTERED then it is best to make the CLUSTERED index UNIQUE (Primary Keys are automatically UNIQUE). The reason is that SQL Server must ensure that each Key value in the CLUSTERED index is a unique pointer to a row. So, if you do not specify the UNIQUE keyword when creating the CLUSTERED index then SQL Server will insert a hidden "uniqueidentifier" field that is 4 bytes. This is essentially wasted space since you cannot use it. This again adds up over millions of rows, especially since the entire Cluster Key (including the uniqueidentifier) is copied into each row of each Non-Clustered index. In some cases you might need to create a CLUSTERED index on a non-unique key, but it is certainly worth some time thinking about that decision and making sure it is the correct one.
Note: To be clear, the uniqueifier is only added to non-unique rows. This means that if all rows (in terms of their Key value(s)) really are unique, then no additional space is taken up by not having declared the CLUSTERED Index as UNIQUE. And the first row to have a particular value will NOT have the uniqueifier added if one or more rows are added later that duplicate its Key value(s); only the newly added rows, if not unique, will have the uniqifier added.
- Do not use GUIDs (i.e. UNIQUEIDENTIFIER) as CLUSTERED indexes and/or Primary Keys since they simply are not scalable. If generating them via the NEWID() function or being passed in from the application, the values are not sequential so produce a lot of fragmentation in the CLUSTERED index. Even if you are using the NEWSEQUENTIALID() which started in SQL Server 2005, the values are still 16 bytes wide and one GUID field by itself is not a problem but remember that CLUSTERED indexes are copied into all non-clustered indexes. Also, whatever field(s) you select for a PK has implications beyond the initial tables since PKs show up in other tables as Foreign Keys. Hence a system that has GUIDs as PKs will have many tables that have several GUID fields when accounting for both PK and one or more FKs. Having all of these 16-byte fields adds up when multiple tables have many millions of rows. And generating GUID values takes longer than generating the next IDENTITY value which certainly has a negative impact on tables with large volumes of INSERTs. Yes, you can certainly find some articles and forum posts stating that using GUID values are not as bad as most people claim, but you really need to have a good reason for using them and a good idea of how your system will grow over the years before making the decision to use them. In most cases, if you simply need transportability of the data between environments or systems, use an IDENTITY field or Natural Key as the PK and then add a separate GUID field with a UNIQUE NONCLUSTERED index on it to be an Alternate Key.
- Be mindful of the differences between the CHAR / VARCHAR datatypes and NCHAR / NVARCHAR. The NCHAR / NVARCHAR types are Unicode-compatible which is double-byte so any value takes up twice the space as it would with a CHAR / VARCHAR field. Sometimes you can know for a fair-degree of certainty that you will not need any Unicode support so in that case use the regular CHAR / VARCHAR types. This decision can be tricky, though, as sometimes things change. For example, you business could start out as US-only but then expand to other markets. Or, most people store URLs as VARCHAR since they do not accept Unicode characers. Well, at least that's how it was for many years. But in 2003 that changed so it is possible now to see Domain Names (QueryString should still be URL encoded) with Unicode characters. For more info, see: http://unicode.org/faq/idn.html
- Do not prematurely de-normalize OLTP systems. While there are valid reasons for de-normalization of certain fields, it sometimes seems as if people do not trust that the "R" in RDBMS really stands for "Relational" and that these systems (not just SQL Server) are tuned to handle JOINs rather efficiently. While an increasing number of JOINs decreases the optimizer's ability to find a good plan, solving the problem with a better query or with part of the query stored in a temp table that is later JOINed in should be attempted before de-normalizing. Having multiple copies of a field across many tables not only takes up valuable space but it can introduce data-integrity issues as it is now up to the programmer to make sure that the data across all instances of a particular field are kept in sync. This is an unnecessary risk on both of those levels unless you really know that you will benefit from the performance gain of the de-normalization. The desire to de-normalize often points to problems with the initial data-model and/or indexing strategy so you should look at those two areas before deciding to de-normalize.
Other datatypes to consider based on actual need:
BIT : Takes up 1 physical byte per each 8 BIT fields. Meaning, the first 8 BIT fields on a particular table will fit into the first byte and the 9th – 16th BIT fields will require a second byte. A single BIT field is the same physical space on the data pages as a TINYINT field but with no ability to store values 2 - 255. Space savings only comes with multiple BIT fields. ( http://msdn.microsoft.com/en-us/library/ms177603.aspx ).
SMALLMONEY : If you won’t be storing anything below -214,748.3648 or above 214,748.3647, consider SMALLMONEY which is 4 bytes instead of 8 for MONEY. ( http://msdn.microsoft.com/en-us/library/ms179882.aspx )
FLOAT(24) / REAL : 4 bytes instead of 8 for default FLOAT. See the linked documention to understand how this affects precision. ( http://msdn.microsoft.com/en-us/library/ms173773.aspx )
DECIMAL / NUMERIC (synonyms) : These take up 5, 9, 13, or 17 bytes depending on the precision. Please see the linked documention to understand the relationship between precision and space required. ( http://msdn.microsoft.com/en-us/library/ms187746.aspx )
TIME : similar to DATE, it can save space over DATETIME / SMALLDATETIME if you don’t need the DATE component as it takes up 3 – 5 bytes depending on precision. The default (i.e. no specified precision) is to use all 5 bytes. Please see the linked documention to understand the relationship between precision and space required. ( http://msdn.microsoft.com/en-us/library/bb677243.aspx )
The various datatypes exist for very valid reasons (and it is not simply backwards compatibility with old mainframe systems): there is a real cost to using more space. If there was no cost then we could get away with BIGINT and never need INT, SMALLINT, or TINYINT. We wouldn't need SMALLDATETIME or SMALLMONEY either. Why not just make all text fields MAX? And why would Microsoft add support for VARDECIMAL starting with SQL Server 2005 SP2 (for Enterprise Edition)?
This is not to say that you should pick a smaller datatype than is necessary for the use-case of the field. If you suspect that the range of data might exceed a particular limit, then use the datatype that will fullfill that need. But if you have, for example, a LookUp table that has 5 possible values and might grow to use a total of 10 values then it is unnecessary to use anything larger than TINYINT. The application should continue to function as data is added to the system. This works in both directions: You should no more use INT for a field that will only ever store values 1 - 10 (so that natural growth doesn't slow things down [more than necessary]) than you should use SMALLINT for UserIDs if you ever expect to have more than 65,536 users (so that natural growth doesn't shut the system down).
Here are a few other features to help reduce disk I/O, but they should be used in addition to, not instead of, good modeling practices. These features have drawbacks so please research before using them as they will have a negative impact on certain situations.
- Row and Page Compression - This feature started in SQL Server 2008 and allows for fitting more rows on a datapage. ( http://msdn.microsoft.com/en-us/library/cc280449(v=sql.110).aspx ). This feature cannot be enabled if there are any SPARSE columns on the table, but does a lot more than the SPARSE option so you would need to first remove the SPARSE option from any columns using it.
SPARSE columns - This feature started in SQL Server 2008 and greatly reduces space taken up by NULL fields (remember that fixed-length fields take up their space even if NULL), BUT non-null fields take slightly longer to read and take more memory on updates so pay close attention to what % of NULL values you need for this to be a benefit. ( http://msdn.microsoft.com/en-us/library/cc280604.aspx )
Something to consider, while certainly not applicable to all situations, is possibly treating a particular value -- usually 0 -- that appears in a high percentage of rows in a currently NOT NULL field as being NULL with the SPARSE option. For example, if you have a Sales table with a DiscountAmount SMALLMONEY NOT NULL field and 80% (or more) of the time the DiscountAmount = 0, you could change the field to be SPARSE NULL (remember to rebuild the clustered index if doing ALTER TABLE) and UPDATE all 0 values to be NULL. This will change the space taken by the field from 4 bytes per each row to 8 bytes for only those rows with an actual > 0.00 value. For a 10 MM row table with 80% of rows set to 0, this would mean taking up 8 bytes on 2 MM rows (roughly 16 MB -- 15,625 KB) as opposed to 4 bytes on all 10 MM rows (roughly 40 MB -- 39,062.5 KB), for a savings of 24 MB !! The main down-side is that you need to remember to COALESCE(DiscountAmount, 0.0) everywhere that it is used, or maybe provide a non-persisted Computed Column to do that and have everyone SELECT from the Computed Column field. You only need to INSERT or UPDATE the real column using NULLIF(@IncomingValue, 0). However, yes, this does increase the complexity of the table and increase the chances that somebody who doesn't know about this setup will either SELECT the real column and get NULLs or even INSERT / UPDATE the real column to be 0. Again, this is just an idea to consider; use your best judgement regarding applicability. You can (and should) always document the code and even the columns using Extended Properties, but that requires people to actually read the comments ;-).
- VARDECIMAL - This was a very short-lived feature: introduced in SP2 for SQL Server 2005 and deprecated in SQL Server 2008 (in favor of Row and Page Compression). But, if you are stuck on SQL Server 2005 for some reason and cannot upgrade, then maybe check this out ( http://msdn.microsoft.com/en-us/library/bb326755(v=sql.105).aspx )
With these points in mind let's take a quick look at two real world examples.
At the company I presently work for we have several tables that have well over 100 million rows. Forgoing the specifics of the exact table structure I will talk about, it is enough to say that we had a composite Clustered PK made up of an INT IDENTITY field, an INT CustomerID, and an INT GroupID. The GroupID, however, will never have a value over 100 and it will never be negative. The INT datatype can hold from -2.14 billion to +2.14 billion. We won't be needing that. The SMALLINT datatype can hold from -32k to +32k. We won't be needing that either. The TINYINT datatype can hold from 0 to 255. That sounds perfect.
But is it worth the effort to change the datatype of a field in the Clustered index? Well, this table has 150 million rows in it. The initial savings of going from INT to TINYINT and thus gaining 3 bytes per row is 450 MB of space right there. And this field is part of the CLUSTERED PK so it gets copied into each of the Non Clustered indexes. There are four Non Clustered indexes on this table so we can repeat that 450 MB savings another 4 times for Non Clustered index savings of 1.8 GB and a total savings of 2.25 GB. And we have a child table related by FK to this large table that also gets a slight reduction although it is not nearly as impacting given it is a 1-to-zero-to-1 relationship (but it certainly doesn't hurt).
As you can see from this one example, the decision to make GroupID an INT seemed to have no impact when looking at the Group table alone but when you look at the down-stream effect of everywhere that lookup value resides it is quite another story. This point is made even stronger when considering that it was not just this one 150 million row table that has the GroupID lookup value in it.
I have now pointed out the system effects of wasted space: more space on disk, longer time to read from disk, and more space used in memory. But what about the financial ramifications of wasting space? I would be remiss to ignore the real cost given that one of the two meanings of "Disk is Cheap" is indeed financial.
We have another table that has 450 million rows across four servers. This table is on the "many" side of a "one-to-many" relationship. It also has an NVARCHAR(100) field that also happens to be in the parent table. While there are valid situations to denormalize some data there is no good reason for it being done in this table. The average size of the data in this field is 10 characters and that equals 20 bytes per row given that it is an NVARCHAR instead of VARCHAR field (i.e. it is double-byte / Unicode). For now we will ignore the additional few bytes of space taken up in the page header. So the total amount of wasted space for this one field is about 9 GB. By itself it does not seem that bad but do consider that SAN space costs roughly $30 per GB (yes, you can find some that is cheaper or more expensive based on a wide variety of factors). While this might not be a lot of money, keep in mind that you don't purchase a few or even a few hundred GB at a time. If we look at all of the wasted space we have it could easily be hundreds of GB across many servers and many tables. And that is not even considering the Log Files and Backups. I personally feel that if this money is just going to be thrown away then it might as well go to my paycheck as opposed to a storage vendor!
Be aware that your data-modeling decisions can have ripple-effects. While these effects will be hard to calculate or even notice when the system is in development or brand-new, you hope that as time goes on your project will have lots of customers and survive for a long time which means the data volume will inevitably grow. It will only be in the long-run that your decisions at the beginning will show their impact in terms of both cost and efficiency. Hence, some additional time spent planning can save time and money trying to fix problems down the road (especially when the labor cost to fix a problem doesn't seem, or simply isn't, worth it).
Be aware, also, of your situation. There are trade-offs for most of the decisions we make and this advice should be applied pragmatically, not idealistically. There are situations where a slightly larger datatype might make more sense if the cost of using it when not truly needed by the system is less than the cost of developer time in terms of training, development, and maintenance. The cost is usually greater on the system side when tables have one million or more rows. However, systems that are designed very poorly can have higher costs on the system side well before hitting those levels. And so if you find your system costs outweighing your developer costs, then consider using the recommendations above to refactor key areas as opposed to just applying them to new tables and columns.
Disk is far from cheap. In fact, it might be one of the most expensive parts of your system. So be sensible, not wasteful. And remember: when someone tells you that "disk is cheap", that is most likely because the reality of the situation down the road will never be their problem.
Thanks for reading.
Copyright © 2010, 2012, 2013 Solomon Rutzky - All Rights Reserved