Click here to monitor SSC
SQLServerCentral is supported by Redgate
Log in  ::  Register  ::  Not logged in
Home       Members    Calendar    Who's On

Add to briefcase «««45678»»»

Disk Is Cheap! ORLY? Expand / Collapse
Posted Saturday, January 1, 2011 3:41 PM
SSC Journeyman

SSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC Journeyman

Group: General Forum Members
Last Login: Friday, October 10, 2014 2:06 PM
Points: 89, Visits: 482
1) Please keep in mind that this issue is a compounded one since a field can show up in other tables as well as using memory, CPU, network, log space, etc. We are not talking about one field in one table here. If we were, then maybe I would agree even if we had 500 million rows (although I would still try to model it properly if it is a new project but I wouldn't worry about it for an existing system). When people don't pay attention to datatypes they generally carry that idea onto many fields in many tables so even if you don't have many large tables you still have many fields across many tables that could be taking up resources unnecessarily.

I could make the same argument against any data type. "Think of the savings you are wasting by using a 4-byte integer instead of a 2 byte integer for a primary key!" In the grander scheme of things, what matters is whether affects actual real world performance given actual data sizes. The vast majority of databases do not even contain a million rows must less 10 million in any one table and thus tiny differences in datatypes will make no perceptible difference. What does make a significant difference are data integrity rules.

2) Regarding the idea that most database do not get to the hundreds of millions of rows, I think that is an issue of looking back 20 years as opposed to ahead 20 years. I certainly respect your experience in the field but I think it is becoming clear that more and more applications are becoming highly data intensive and given that storing large volumes of data today is much more feasible and economical than 10 - 20 years ago, and that BI and data analysis is becoming more popular, I think we are already seeing a growing trend in data retention. And regardless of table size, if operations on a table take longer due to additional resource consumption then that does affect the user experience and more applications these days are web-based with users expecting instantaneous response times and they don’t care if there are 1000 other concurrent users on the system who might be hitting the same table(s).

This is more an argument about archive tables than it is in-use tables. I've seen a few systems that had a one or two tables in the 50 million row+ range and those were primarily archive, auditing tables or data warehouses (where you can play tricks to cut down the pointer size). However, let's not forget that archive tables serve a very different purpose than in-use tables. Again, I see medium size company databases regularly and rarely do I see tables even remotely close to a million rows and almost never if you exclude archive tables.

3) Regardless of how many rows a table might have over a 5 or 10 year period, if I have the chance to model something properly in the beginning then I will certainly do so because there is a very minimal extra cost in terms of me thinking about the problem and maybe asking a few more questions of the person requesting the feature.

When modeling, the choice of implementation data type (vs the data type in the model) is secondary. You do not care whether you use an int, smallint, tinyint or bit when modeling a database. What matters is that the value is an integer with some given range. That the DBA decides they can use a smallint instead of a 4-byte integer is done at the implementation stage. Still important, but not a part of modeling.

A few extra minutes of work now can save many hours of work in the future so why not do it?

Because it can also cost you a significant amount in development time. If you chose a tinyint for example and after the system goes into production it is determined that this is too small, it is likely that quite a bit will have to be changed.

And who can predict that their project will never grow to such sizes?

There is a cost to such an assumption. Why not assume your database could grow to 1 trillion exabytes? The obvious answer is that to build a system to scale to those dimensions would require an enormous amount of effort even though the probability that the system could grow to those levels is remote. Yes, we should make *reasonable* assumptions about growth, accounting for archival processes, and in most systems I have seem that is far less than billions or even hundreds of millions of rows.

4) I don't think it is entirely relevant to state that we now have 100 or even 1000 times more efficient hardware when the financial resources to get such hardware are not infinite. Yes, there will always be faster CPUs and more memory to add but that doesn't mean those budget items will be approved. That is a very short-sighted way of looking at this issue: a vendor selling the proper hardware in no way implies my ability to purchase it to solve my problems.

You are also discounting cost. In addition to more powerful computers, cost of that power generally drops. Thus, it is likely that there is a system that could be built today that would scale your database to 100 million rows but ten years from now that system might cost the same as workstation today.

I'm not saying that developers should entirely discount performance related to data types but I am saying that I would prefer they focus more on data integrity and good relation design than on minor performance boosts from say using a smalldatetime vs a datetime.

5) DATEIME2 datatype does not replace GUIDs since GUIDs are used as record-locators and I cannot see that happening with datatime data even if it can point to a unique record

A significant reason against using DateTime in the past is that its resolution was far too low (1/3 of a millisecond) to work in heavy transaction systems. However, at nanosecond resolution, you would be hard pressed to intentionally create dups. A big reason that guids are used over integers is the ability to generate them on the client instead of requiring a round trip that Identity values require. DateTime2 could work well enough for that purpose in half the size. I haven't personally built a system using a DateTime2 as a PK but I can see the potential.

6) Using SMALLDATETIME when the time itself is not wanted is nearly always a good idea and the limit of year 2079 is not much of an issue as a "non-intuitive max value". The reason being is that date values are nearly always, if looking into the future, looking into the more immediate future. Most dates start as being current which always work and some future dates for scheduled events or appointments that don't typically happen 50 years or more out. The current max value of a SMALLDATETIME is 68 years in the future which is perfectly workable in 99% of situations (again, where only the date portion is relevant to begin with). If you consider that computer science as a career is not much more than 40 years looking back, the max value for a SMALLDATETIME is more than that amount of time in the future. For any project that I am working on now, it is almost guaranteed that IF my company is still around in 68 years, by then they will have made changes to this data model and application many times over, especially given how many changes we have done in the past 5 years alone. And I cannot predict what will be available to use or if SQL Server will even still be around in 68 years so that amount of planning is wasted effort. And given that I have been thinking along these lines for 8+ years now, the industry has proven me to be correct since SQL Server, starting with 2008, offers a DATE only datatype that is perfect for this use-case and everything I designed in the past 6 years that uses SMALLDATETIME can easily be converted over (IF that is ever even necessary to begin with) and this entire time my projects have benefited from not being wasteful of the extra 4 bytes for the full DATETIME.

Completely disagree. I've run into many issues with that ridiculous max value. Example 1: you have a table that tracks when an entity is "valid". You have a start and end date. Null end date = currently available. The client side code wants to use a max datetime value to represent that some entity is valid in perpetuity. In most programming languages, that date is 9999-12-31. Can't do it with smalldatetime. You have to store a mystery constant somewhere in the middle tier code that is used to represent that value. Example 2: copyrights. 10 years ago, you might have thought it safe that no copyright could extend beyond 2079 for something today or in the past. Surprise, surprise thanks to Disney, now it is a 100 years. Example 3: contracts. Yes, I've run into contracts that specify certain rights 20 and 50 years into the future. Saving those four bytes, in the long run, simply do not provide nearly enough benefit to justify them. If you are worried about storing time values, then add a check constraint which prevents time values from being stored in that datetime field.

As for conversion, you better hope that your ORM does not account for the difference between smalldatetime and datetime. If so, that means additional unit testing to ensure that the change of the datatype will not break the existing code or that the code accommodates the expanded values.
Post #1041578
Posted Saturday, January 1, 2011 4:24 PM
Forum Newbie

Forum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum Newbie

Group: General Forum Members
Last Login: Sunday, January 2, 2011 5:56 PM
Points: 4, Visits: 9
What no one has mentioned is that you also have to consider the mechanics of the I/O.
SQL Server is going to fetch a block (or several) at a time, not a arbitrary number of bytes.

What was discussed is valid, but if the cumulative byte savings, per row, don't get an extra row (or more) into a block. Your savings are Zero.
A complete block will be read, whether it's full or not.

The advice in indexes is still valid, you can almost certainly get extra index entries into an index block.
Post #1041583
Posted Sunday, January 2, 2011 1:33 PM

SSC Rookie

SSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC Rookie

Group: General Forum Members
Last Login: Monday, September 19, 2016 3:00 PM
Points: 46, Visits: 170
Someone mentioned in an earlier post that all tables must have a cluster definition. This is not true. It's called a hash table.

Jay Quincy Allen, Managing Partner, ADAMA Systems
Post #1041659
Posted Sunday, January 2, 2011 3:22 PM
SSCarpal Tunnel

SSCarpal TunnelSSCarpal TunnelSSCarpal TunnelSSCarpal TunnelSSCarpal TunnelSSCarpal TunnelSSCarpal TunnelSSCarpal TunnelSSCarpal Tunnel

Group: General Forum Members
Last Login: Yesterday @ 8:06 AM
Points: 4,377, Visits: 9,710
pro-1019688 (1/2/2011)
Someone mentioned in an earlier post that all tables must have a cluster definition. This is not true. It's called a hash table.

Actually - it is called a heap, but that is just words. ;)

Jeffrey Williams
Problems are opportunities brilliantly disguised as insurmountable obstacles.

How to post questions to get better answers faster
Managing Transaction Logs
Post #1041670
Posted Sunday, January 2, 2011 7:14 PM

SSC Rookie

SSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC Rookie

Group: General Forum Members
Last Login: Monday, September 19, 2016 3:00 PM
Points: 46, Visits: 170
Yes, sorry Jeffrey, a heap index, not a hash.

Jay Quincy Allen, Managing Partner, ADAMA Systems
Post #1041686
Posted Monday, January 3, 2011 6:18 AM
SSC Eights!

SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!

Group: General Forum Members
Last Login: Monday, May 4, 2015 2:33 PM
Points: 997, Visits: 355
Post #1041795
Posted Monday, January 3, 2011 8:42 AM
SSC Eights!

SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!

Group: General Forum Members
Last Login: Wednesday, October 26, 2016 2:18 PM
Points: 917, Visits: 2,673
Essentially; know your system, know your data, know your bottlenecks, know the plateaus, know your growth patterns, and keep an eye on future technology.

You have limited CPU, limited disk space, limited disk IO, limited memory (note: taking more space on disk also takes more space in RAM), limited time, limited concurrency, and limited humans. All these factors must be balanced.

Know your bottlenecks, but don't waste anything. If you want an 8k row tally table, use smallint, not int. If you have a date dimension table that covers SMALLDATETIME, every single day can be represented by a unique SMALLINT value, starting at -32768 for Jan 1, 1900.

Note: It's not uncommon for a 146GB 15k FC "upgrade" disk to cost right around $1000... each. Not counting the tray cost. Not counting the cabinet cost. Not counting license costs for the SAN to be allowed to see it. Not counting parity/mirror disk costs. Not counting hot spare costs. Not counting cold spare costs. Not counting electricity use. Not counting heat generation.

Plateaus are critical: Adding another couple drives is "easy" and "cheap"... until you're full, at which time you may need a new tray, which needs a new cabinet, which needs more floor space in the server room, and which needs a new electrical circuit, which needs a new enterprise UPS, which needs a new generator... and the additional heat load needs a new air conditioning unit. Did we mention the fire suppression system, too?

Now, if SQL Server gave us a comprehensive range of integer datatypes, we'd be a lot better off; we should have both signed and unsigned 8, 16, 32, and 64 bit integers. We don't.
Post #1041871
Posted Monday, January 3, 2011 8:56 AM
Say Hey Kid

Say Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey Kid

Group: General Forum Members
Last Login: Wednesday, February 2, 2011 4:11 AM
Points: 671, Visits: 356
Amen! I hear that "disk is cheap" all the time from developers. Laziness keeps prevailing where I work, and I'm doing my best to try and make developers aware of the reasons behind watching datatypes as well as nullable fields. I believe I will forward this article to all developers.
Post #1041885
Posted Monday, January 3, 2011 10:53 AM

SSC Eights!

SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!

Group: General Forum Members
Last Login: Thursday, December 1, 2016 8:16 PM
Points: 945, Visits: 2,364
Points made by Thomas-282729 and others are valid for consideration. You have to apply all of these through a filter of sense. The other point that was good is "does this get me an extra row per page".

The the other thing that has not been mentioned is those of us stuck on the low end of the spectrum. Mobile. We have less space than a postage stamp and processors with less power than a one horse sleigh. If we are not engineered to the max the user will throw the application (and the device) in the trash. We will not have a billion rows in any table, that is true. We don't have a billion of anything. We do have compatibility constraints as often we are at the mercy of "how much code do you write for Sync Services".

Scalability is a big consideration at both ends of the spectrum and in the middle too.


Charles Kincaid

Post #1041954
Posted Monday, January 3, 2011 11:38 AM

Right there with Babe

Right there with BabeRight there with BabeRight there with BabeRight there with BabeRight there with BabeRight there with BabeRight there with BabeRight there with Babe

Group: General Forum Members
Last Login: Monday, November 24, 2014 4:29 PM
Points: 752, Visits: 920
I have to agree with you. Taking steps to ensure that disk space is not wasted is definitely beneficial.

But there is a counterbalancing point: Disk space is normally cheaper than problems that can be solved by sacraficing disk space. I hate wasting disk space and the increases in IO time that it brings, but I will trade vast swaths of disk space for even very small performance gains and normally say it is a good trade.

When I am thinking about the trade offs for creating an index, I will consider carefully how often the table is written to and the trade offs in write performance, but I won't give a second thought to the disk space because it is cheap relative to making my users wait. I will create complicated and large indexed views in the same way.

Similarly, I tend to log procedures extensively. This costs disk space, but that disk space is cheap compared to having a problem in production and not being able to track down in detail when, why, how, and who caused it along with enough information to know what the answer should have been.

So, I agree, it was a good article with a good point and one that developers and DBAs should remember. I just want to caution against taking it to an extreme. Disk space should never be wasted, but at least in the situations I regularly see trading disk space for detailed logging or improved perofrmance is almost always a good trade. Disk space is cheap enough to justify that.

Timothy A Wiseman
SQL Blog:
Post #1041977
« Prev Topic | Next Topic »

Add to briefcase «««45678»»»

Permissions Expand / Collapse