Disk Is Cheap! ORLY?
Thomas-282729
SSC-Enthusiastic

Group: General Forum Members
Points: 113 Visits: 482
1) Please keep in mind that this issue is a compounded one, since a field can show up in other tables as well as using memory, CPU, network, log space, etc. We are not talking about one field in one table here. If we were, then maybe I would agree even if we had 500 million rows (although I would still try to model it properly on a new project, I wouldn't worry about it for an existing system). When people don't pay attention to datatypes, they generally carry that habit across many fields in many tables, so even if you don't have many large tables, you still have many fields across many tables that could be consuming resources unnecessarily.


I could make the same argument against any data type: "Think of the savings you are wasting by using a 4-byte integer instead of a 2-byte integer for a primary key!" In the grander scheme of things, what matters is whether it affects actual real-world performance given actual data sizes. The vast majority of databases do not even contain a million rows, much less 10 million, in any one table, and thus tiny differences in datatypes make no perceptible difference. What does make a significant difference are data integrity rules.

2) Regarding the idea that most databases do not get to hundreds of millions of rows, I think that is an issue of looking back 20 years as opposed to ahead 20 years. I certainly respect your experience in the field, but I think it is becoming clear that more and more applications are highly data intensive. Given that storing large volumes of data today is much more feasible and economical than 10-20 years ago, and that BI and data analysis are becoming more popular, I think we are already seeing a growing trend in data retention. And regardless of table size, if operations on a table take longer due to additional resource consumption, that affects the user experience. More applications these days are web-based, with users expecting instantaneous response times, and they don't care if there are 1,000 other concurrent users on the system who might be hitting the same table(s).


This is more an argument about archive tables than in-use tables. I've seen a few systems that had one or two tables in the 50-million-plus row range, and those were primarily archive tables, auditing tables, or data warehouses (where you can play tricks to cut down the pointer size). However, let's not forget that archive tables serve a very different purpose than in-use tables. Again, I see medium-size company databases regularly, and rarely do I see tables even remotely close to a million rows, almost never if you exclude archive tables.

3) Regardless of how many rows a table might have over a 5 or 10 year period, if I have the chance to model something properly in the beginning then I will certainly do so because there is a very minimal extra cost in terms of me thinking about the problem and maybe asking a few more questions of the person requesting the feature.


When modeling, the choice of implementation data type (vs. the data type in the model) is secondary. You do not care whether you use an int, smallint, tinyint, or bit when modeling a database. What matters is that the value is an integer with some given range. The DBA's decision to use a smallint instead of a 4-byte integer happens at the implementation stage. Still important, but not a part of modeling.

A few extra minutes of work now can save many hours of work in the future so why not do it?


Because it can also cost you a significant amount of development time. If you choose a tinyint, for example, and after the system goes into production it is determined that this is too small, quite a bit will likely have to be changed.
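As a hedged illustration of that cost, with hypothetical table and constraint names: once a tinyint column participates in a primary key and foreign keys, widening it means dropping and recreating every dependent constraint.

    -- Hypothetical example: StatusId was declared tinyint and has outgrown 255 values.
    -- The column cannot be altered while it participates in key constraints.
    ALTER TABLE dbo.OrderHeader DROP CONSTRAINT FK_OrderHeader_Status; -- every referencing FK first
    ALTER TABLE dbo.Status      DROP CONSTRAINT PK_Status;

    ALTER TABLE dbo.Status      ALTER COLUMN StatusId int NOT NULL;
    ALTER TABLE dbo.OrderHeader ALTER COLUMN StatusId int NOT NULL;    -- and every referencing column

    ALTER TABLE dbo.Status      ADD CONSTRAINT PK_Status PRIMARY KEY (StatusId);
    ALTER TABLE dbo.OrderHeader ADD CONSTRAINT FK_OrderHeader_Status
        FOREIGN KEY (StatusId) REFERENCES dbo.Status (StatusId);

And that is before touching any application code, ORM mappings, or stored procedure parameters typed as tinyint.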

And who can predict that their project will never grow to such sizes?


There is a cost to such an assumption. Why not assume your database could grow to 1 trillion exabytes? The obvious answer is that building a system to scale to those dimensions would require an enormous amount of effort even though the probability of the system growing to those levels is remote. Yes, we should make *reasonable* assumptions about growth, accounting for archival processes, and in most systems I have seen that is far less than billions or even hundreds of millions of rows.

4) I don't think it is entirely relevant to state that we now have 100 or even 1000 times more efficient hardware when the financial resources to get such hardware are not infinite. Yes, there will always be faster CPUs and more memory to add but that doesn't mean those budget items will be approved. That is a very short-sighted way of looking at this issue: a vendor selling the proper hardware in no way implies my ability to purchase it to solve my problems.


You are also discounting cost. In addition to computers becoming more powerful, the cost of that power generally drops. Thus, it is likely that a system could be built today that would scale your database to 100 million rows, and ten years from now that system might cost the same as a workstation does today.

I'm not saying that developers should entirely discount performance related to data types, but I am saying that I would prefer they focus more on data integrity and good relational design than on minor performance boosts from, say, using a smalldatetime vs. a datetime.

5) The DATETIME2 datatype does not replace GUIDs, since GUIDs are used as record locators and I cannot see that happening with datetime data even if it can point to a unique record.


A significant reason against using DateTime in the past is that its resolution was far too low (1/300th of a second) to work in heavy transaction systems. However, at 100-nanosecond resolution, you would be hard pressed to intentionally create duplicates. A big reason that GUIDs are used over integers is the ability to generate them on the client instead of requiring the round trip that Identity values do. DateTime2 could work well enough for that purpose at half the size. I haven't personally built a system using a DateTime2 as a PK, but I can see the potential.
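A minimal sketch of what that could look like, assuming a hypothetical dbo.EventLog table; the value is generated on the server here, but a client could just as well supply it, which is the round-trip advantage mentioned above:

    -- Sketch only: a datetime2(7) primary key generated at insert time.
    -- datetime2(7) stores 8 bytes vs. 16 for a uniqueidentifier; the PK still
    -- enforces uniqueness in case two inserts land on the same 100 ns tick.
    CREATE TABLE dbo.EventLog
    (
        EventAt datetime2(7)  NOT NULL
            CONSTRAINT DF_EventLog_EventAt DEFAULT SYSUTCDATETIME()
            CONSTRAINT PK_EventLog PRIMARY KEY,
        Detail  nvarchar(400) NOT NULL
    );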

6) Using SMALLDATETIME when the time itself is not wanted is nearly always a good idea, and the limit of year 2079 is not much of an issue as a "non-intuitive max value". The reason is that future-looking date values nearly always look into the more immediate future. Most dates start as current, which always works, with some future dates for scheduled events or appointments that don't typically fall 50 or more years out. The current max value of a SMALLDATETIME is 68 years in the future, which is perfectly workable in 99% of situations (again, where only the date portion is relevant to begin with). If you consider that computer science as a career is not much more than 40 years old looking back, the max value for a SMALLDATETIME is more than that amount of time in the future. For any project that I am working on now, it is almost guaranteed that IF my company is still around in 68 years, it will have changed this data model and application many times over by then, especially given how many changes we have made in the past 5 years alone. And I cannot predict what will be available to use, or whether SQL Server will even still be around, in 68 years, so that amount of planning is wasted effort. And given that I have been thinking along these lines for 8+ years now, the industry has proven me correct: SQL Server, starting with 2008, offers a DATE-only datatype that is perfect for this use case, everything I designed in the past 6 years that uses SMALLDATETIME can easily be converted over (IF that is ever even necessary to begin with), and this entire time my projects have benefited from not wasting the extra 4 bytes of a full DATETIME.


Completely disagree. I've run into many issues with that ridiculous max value. Example 1: you have a table that tracks when an entity is "valid", with a start and end date, where a null end date means currently available. The client-side code wants to use a max datetime value to represent that some entity is valid in perpetuity. In most programming languages, that date is 9999-12-31. You can't do it with smalldatetime; you have to store a mystery constant somewhere in the middle-tier code to represent that value. Example 2: copyrights. 10 years ago, you might have thought it safe that no copyright could extend beyond 2079 for something created today or in the past. Surprise, surprise, thanks to Disney, now it is 100 years. Example 3: contracts. Yes, I've run into contracts that specify certain rights 20 and 50 years into the future. Saving those four bytes, in the long run, simply does not provide nearly enough benefit to justify it. If you are worried about storing time values, then add a check constraint which prevents time values from being stored in that datetime field, as sketched below.
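A minimal sketch of that constraint, assuming a hypothetical dbo.Contract table with an ExpiresOn datetime column (the date type requires SQL Server 2008 or later):

    -- Rejects any value with a time-of-day component: casting to date and back
    -- strips the time, so only midnight values compare equal.
    ALTER TABLE dbo.Contract
    ADD CONSTRAINT CK_Contract_ExpiresOn_DateOnly
    CHECK (ExpiresOn = CAST(CAST(ExpiresOn AS date) AS datetime));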

As for conversion, you had better hope that your ORM does not account for the difference between smalldatetime and datetime. If it does, that means additional unit testing to ensure that the change of datatype will not break the existing code, or that the code accommodates the expanded range of values.
rlobbe
Forum Newbie

Group: General Forum Members
Points: 4 Visits: 9
What no one has mentioned is that you also have to consider the mechanics of the I/O.
SQL Server is going to fetch a page (or several) at a time, not an arbitrary number of bytes.

What was discussed is valid, but if the cumulative per-row byte savings don't fit an extra row (or more) into a page, your savings are zero.
A complete page will be read whether it's full or not.

The advice on indexes is still valid: you can almost certainly fit extra index entries into an index page.
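A back-of-envelope version of that check, assuming roughly 8,096 bytes of usable space per 8 KB page and ignoring per-row overhead and the slot array:

    -- Does shaving 4 bytes from a 100-byte row buy extra rows per page?
    -- Integer division rounds down, matching whole rows per page.
    DECLARE @RowBytes int = 100, @SavedBytes int = 4;
    SELECT 8096 / @RowBytes                 AS RowsPerPageBefore, -- 80
           8096 / (@RowBytes - @SavedBytes) AS RowsPerPageAfter;  -- 84

Here the 4-byte saving crosses a threshold and buys 4 extra rows per page; shaving the same 4 bytes from a 250-byte row (32 rows either way) would buy nothing.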
JQAllen
Valued Member

Group: General Forum Members
Points: 68 Visits: 170
Someone mentioned in an earlier post that all tables must have a cluster definition. This is not true. It's called a hash table.

Jay Quincy Allen, Managing Partner, ADAMA Systems
Jeffrey Williams 3188
SSCertifiable

Group: General Forum Members
Points: 7623 Visits: 9962
pro-1019688 (1/2/2011)
Someone mentioned in an earlier post that all tables must have a cluster definition. This is not true. It's called a hash table.


Actually, it is called a heap, but that is just words. ;-)
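For anyone following along: a heap is simply a table created without a clustered index, and a nonclustered primary key keeps the integrity without ordering the base rows. A quick sketch with a hypothetical staging table:

    -- No clustered index anywhere on this table, so it is stored as a heap.
    CREATE TABLE dbo.StagingImport
    (
        ImportId int          NOT NULL
            CONSTRAINT PK_StagingImport PRIMARY KEY NONCLUSTERED,
        Payload  varchar(200) NOT NULL
    );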

Jeffrey Williams
Problems are opportunities brilliantly disguised as insurmountable obstacles.

How to post questions to get better answers faster
Managing Transaction Logs

JQAllen
Valued Member

Group: General Forum Members
Points: 68 Visits: 170
Yes, sorry Jeffrey, a heap, not a hash.

Jay Quincy Allen, Managing Partner, ADAMA Systems
anders-731262
Ten Centuries

Group: General Forum Members
Points: 1029 Visits: 359
Excellent!
Nadrek
SSCommitted

Group: General Forum Members
Points: 1842 Visits: 2726
Essentially: know your system, know your data, know your bottlenecks, know the plateaus, know your growth patterns, and keep an eye on future technology.

You have limited CPU, limited disk space, limited disk IO, limited memory (note: taking more space on disk also takes more space in RAM), limited time, limited concurrency, and limited humans. All these factors must be balanced.

Know your bottlenecks, but don't waste anything. If you want an 8k row tally table, use smallint, not int. If you have a date dimension table that covers SMALLDATETIME, every single day can be represented by a unique SMALLINT value, starting at -32768 for Jan 1, 1900.
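That mapping works out because the SMALLDATETIME date range, 1900-01-01 through 2079-06-06, spans exactly 65,536 days, the same as the number of smallint values. A hedged sketch of the conversion:

    -- Day number relative to 1900-01-01, shifted so the range fills smallint
    -- exactly: 1900-01-01 -> -32768, 2079-06-06 -> 32767.
    DECLARE @d smalldatetime = '2024-05-01';
    SELECT CAST(DATEDIFF(DAY, '19000101', @d) - 32768 AS smallint) AS DayKey;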

Note: It's not uncommon for a 146GB 15k FC "upgrade" disk to cost right around $1000... each. Not counting the tray cost. Not counting the cabinet cost. Not counting license costs for the SAN to be allowed to see it. Not counting parity/mirror disk costs. Not counting hot spare costs. Not counting cold spare costs. Not counting electricity use. Not counting heat generation.

Plateaus are critical: Adding another couple drives is "easy" and "cheap"... until you're full, at which time you may need a new tray, which needs a new cabinet, which needs more floor space in the server room, and which needs a new electrical circuit, which needs a new enterprise UPS, which needs a new generator... and the additional heat load needs a new air conditioning unit. Did we mention the fire suppression system, too?

Now, if SQL Server gave us a comprehensive range of integer datatypes, we'd be a lot better off; we should have both signed and unsigned 8, 16, 32, and 64 bit integers. We don't.
amarshall-568002
Say Hey Kid

Group: General Forum Members
Points: 679 Visits: 356
Amen! I hear "disk is cheap" all the time from developers. Laziness keeps prevailing where I work, and I'm doing my best to make developers aware of the reasons for watching datatypes as well as nullable fields. I believe I will forward this article to all developers.
Charles Kincaid
SSCommitted

Group: General Forum Members
Points: 1615 Visits: 2384
Points made by Thomas-282729 and others are valid for consideration; you have to apply all of them through a filter of sense. The other good point was "does this get me an extra row per page?"

The other thing that has not been mentioned is those of us stuck at the low end of the spectrum: mobile. We have less space than a postage stamp and processors with less power than a one-horse sleigh. If we are not engineered to the max, the user will throw the application (and the device) in the trash. We will not have a billion rows in any table, that is true; we don't have a billion of anything. We do have compatibility constraints, as often we are at the mercy of "how much code do you write for Sync Services".

Scalability is a big consideration at both ends of the spectrum and in the middle too.

ATB, Charles Kincaid
timothyawiseman
Ten Centuries

Group: General Forum Members
Points: 1190 Visits: 920
I have to agree with you. Taking steps to ensure that disk space is not wasted is definitely beneficial.

But there is a counterbalancing point: disk space is normally cheaper than the problems that can be solved by sacrificing disk space. I hate wasting disk space and the increases in IO time that it brings, but I will trade vast swaths of disk space for even very small performance gains and normally call it a good trade.

When I am thinking about the trade-offs of creating an index, I will consider carefully how often the table is written to and the trade-offs in write performance, but I won't give a second thought to the disk space, because it is cheap relative to making my users wait. I will create complicated and large indexed views in the same way.

Similarly, I tend to log procedures extensively. This costs disk space, but that disk space is cheap compared to having a problem in production and not being able to track down in detail when, why, and how it happened and who caused it, along with enough information to know what the answer should have been.
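A minimal sketch of that kind of logging, with hypothetical table and procedure names, trading a narrow insert per step for a detailed audit trail:

    -- One row per noteworthy step; cheap to write, invaluable during a post-mortem.
    CREATE TABLE dbo.ProcLog
    (
        LogId    int IDENTITY   NOT NULL CONSTRAINT PK_ProcLog PRIMARY KEY,
        ProcName sysname        NOT NULL,
        LoggedAt datetime2(3)   NOT NULL CONSTRAINT DF_ProcLog_LoggedAt DEFAULT SYSUTCDATETIME(),
        Detail   nvarchar(4000) NULL
    );
    GO
    CREATE PROCEDURE dbo.DoWork
    AS
    BEGIN
        INSERT dbo.ProcLog (ProcName, Detail) VALUES (OBJECT_NAME(@@PROCID), N'started');
        -- ... the actual work ...
        INSERT dbo.ProcLog (ProcName, Detail) VALUES (OBJECT_NAME(@@PROCID), N'completed');
    END;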

So, I agree: it was a good article with a good point, one that developers and DBAs should remember. I just want to caution against taking it to an extreme. Disk space should never be wasted, but at least in the situations I regularly see, trading disk space for detailed logging or improved performance is almost always a good trade. Disk space is cheap enough to justify that.

---
Timothy A Wiseman
SQL Blog: http://timothyawiseman.wordpress.com/