RE: Disk Is Cheap! ORLY? – SQLServerCentral

Right there with Babe

Points: 715

January 1, 2011 at 3:41 pm

1) Please keep in mind that this issue is a compounded one since a field can show up in other tables as well as using memory, CPU, network, log space, etc. We are not talking about one field in one table here. If we were, then maybe I would agree even if we had 500 million rows (although I would still try to model it properly if it is a new project but I wouldn't worry about it for an existing system). When people don't pay attention to datatypes they generally carry that idea onto many fields in many tables so even if you don't have many large tables you still have many fields across many tables that could be taking up resources unnecessarily.

I could make the same argument against any data type. "Think of the savings you are wasting by using a 4-byte integer instead of a 2 byte integer for a primary key!" In the grander scheme of things, what matters is whether affects actual real world performance given actual data sizes. The vast majority of databases do not even contain a million rows must less 10 million in any one table and thus tiny differences in datatypes will make no perceptible difference. What does make a significant difference are data integrity rules.

2) Regarding the idea that most database do not get to the hundreds of millions of rows, I think that is an issue of looking back 20 years as opposed to ahead 20 years. I certainly respect your experience in the field but I think it is becoming clear that more and more applications are becoming highly data intensive and given that storing large volumes of data today is much more feasible and economical than 10 - 20 years ago, and that BI and data analysis is becoming more popular, I think we are already seeing a growing trend in data retention. And regardless of table size, if operations on a table take longer due to additional resource consumption then that does affect the user experience and more applications these days are web-based with users expecting instantaneous response times and they don’t care if there are 1000 other concurrent users on the system who might be hitting the same table(s).

This is more an argument about archive tables than it is in-use tables. I've seen a few systems that had a one or two tables in the 50 million row+ range and those were primarily archive, auditing tables or data warehouses (where you can play tricks to cut down the pointer size). However, let's not forget that archive tables serve a very different purpose than in-use tables. Again, I see medium size company databases regularly and rarely do I see tables even remotely close to a million rows and almost never if you exclude archive tables.

3) Regardless of how many rows a table might have over a 5 or 10 year period, if I have the chance to model something properly in the beginning then I will certainly do so because there is a very minimal extra cost in terms of me thinking about the problem and maybe asking a few more questions of the person requesting the feature.

When modeling, the choice of implementation data type (vs the data type in the model) is secondary. You do not care whether you use an int, smallint, tinyint or bit when modeling a database. What matters is that the value is an integer with some given range. That the DBA decides they can use a smallint instead of a 4-byte integer is done at the implementation stage. Still important, but not a part of modeling.

A few extra minutes of work now can save many hours of work in the future so why not do it?

Because it can also cost you a significant amount in development time. If you chose a tinyint for example and after the system goes into production it is determined that this is too small, it is likely that quite a bit will have to be changed.

And who can predict that their project will never grow to such sizes?

There is a cost to such an assumption. Why not assume your database could grow to 1 trillion exabytes? The obvious answer is that to build a system to scale to those dimensions would require an enormous amount of effort even though the probability that the system could grow to those levels is remote. Yes, we should make *reasonable* assumptions about growth, accounting for archival processes, and in most systems I have seem that is far less than billions or even hundreds of millions of rows.

4) I don't think it is entirely relevant to state that we now have 100 or even 1000 times more efficient hardware when the financial resources to get such hardware are not infinite. Yes, there will always be faster CPUs and more memory to add but that doesn't mean those budget items will be approved. That is a very short-sighted way of looking at this issue: a vendor selling the proper hardware in no way implies my ability to purchase it to solve my problems.

You are also discounting cost. In addition to more powerful computers, cost of that power generally drops. Thus, it is likely that there is a system that could be built today that would scale your database to 100 million rows but ten years from now that system might cost the same as workstation today.

I'm not saying that developers should entirely discount performance related to data types but I am saying that I would prefer they focus more on data integrity and good relation design than on minor performance boosts from say using a smalldatetime vs a datetime.

5) DATEIME2 datatype does not replace GUIDs since GUIDs are used as record-locators and I cannot see that happening with datatime data even if it can point to a unique record

A significant reason against using DateTime in the past is that its resolution was far too low (1/3 of a millisecond) to work in heavy transaction systems. However, at nanosecond resolution, you would be hard pressed to intentionally create dups. A big reason that guids are used over integers is the ability to generate them on the client instead of requiring a round trip that Identity values require. DateTime2 could work well enough for that purpose in half the size. I haven't personally built a system using a DateTime2 as a PK but I can see the potential.

6) Using SMALLDATETIME when the time itself is not wanted is nearly always a good idea and the limit of year 2079 is not much of an issue as a "non-intuitive max value". The reason being is that date values are nearly always, if looking into the future, looking into the more immediate future. Most dates start as being current which always work and some future dates for scheduled events or appointments that don't typically happen 50 years or more out. The current max value of a SMALLDATETIME is 68 years in the future which is perfectly workable in 99% of situations (again, where only the date portion is relevant to begin with). If you consider that computer science as a career is not much more than 40 years looking back, the max value for a SMALLDATETIME is more than that amount of time in the future. For any project that I am working on now, it is almost guaranteed that IF my company is still around in 68 years, by then they will have made changes to this data model and application many times over, especially given how many changes we have done in the past 5 years alone. And I cannot predict what will be available to use or if SQL Server will even still be around in 68 years so that amount of planning is wasted effort. And given that I have been thinking along these lines for 8+ years now, the industry has proven me to be correct since SQL Server, starting with 2008, offers a DATE only datatype that is perfect for this use-case and everything I designed in the past 6 years that uses SMALLDATETIME can easily be converted over (IF that is ever even necessary to begin with) and this entire time my projects have benefited from not being wasteful of the extra 4 bytes for the full DATETIME.

Completely disagree. I've run into many issues with that ridiculous max value. Example 1: you have a table that tracks when an entity is "valid". You have a start and end date. Null end date = currently available. The client side code wants to use a max datetime value to represent that some entity is valid in perpetuity. In most programming languages, that date is 9999-12-31. Can't do it with smalldatetime. You have to store a mystery constant somewhere in the middle tier code that is used to represent that value. Example 2: copyrights. 10 years ago, you might have thought it safe that no copyright could extend beyond 2079 for something today or in the past. Surprise, surprise thanks to Disney, now it is a 100 years. Example 3: contracts. Yes, I've run into contracts that specify certain rights 20 and 50 years into the future. Saving those four bytes, in the long run, simply do not provide nearly enough benefit to justify them. If you are worried about storing time values, then add a check constraint which prevents time values from being stored in that datetime field.

As for conversion, you better hope that your ORM does not account for the difference between smalldatetime and datetime. If so, that means additional unit testing to ensure that the change of the datatype will not break the existing code or that the code accommodates the expanded values.