
Disk Is Cheap! ORLY?
Posted Monday, January 3, 2011 12:10 PM

Old Hand
Group: General Forum Members
Last Login: 2 days ago @ 12:58 AM
Points: 354, Visits: 1,869
Thomas-282729 (1/1/2011)
I could make the same argument against any data type. "Think of the savings you are wasting by using a 4-byte integer instead of a 2-byte integer for a primary key!" In the grander scheme of things, what matters is whether it affects actual real-world performance given actual data sizes. The vast majority of databases do not even contain a million rows, much less 10 million in any one table, and thus tiny differences in datatypes will make no perceptible difference. What does make a significant difference are data integrity rules.


All things being equal, there is no reason not to take advantage of methods to be as efficient as you can. We are not talking about data integrity; we are talking about data validity. If your system has good reason to make use of an INT instead of TINYINT or DATETIME instead of SMALLDATETIME then by all means do that as you would be negligent not to. I am talking about not being wasteful. And I do not buy the argument that the "vast majority of databases do not even contain a million rows". To whatever extent that is true at the moment, it is a decreasing trend.
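As a quick illustration of the per-value difference (a sketch; `DATALENGTH` reports the storage size, in bytes, of a value):

```sql
-- Storage size, in bytes, of values in each datatype:
SELECT DATALENGTH(CAST(3 AS INT))                   AS IntBytes,           -- 4
       DATALENGTH(CAST(3 AS TINYINT))               AS TinyIntBytes,       -- 1
       DATALENGTH(CAST(GETDATE() AS DATETIME))      AS DateTimeBytes,      -- 8
       DATALENGTH(CAST(GETDATE() AS SMALLDATETIME)) AS SmallDateTimeBytes; -- 4
```

Per value the savings look trivial; they matter once multiplied across millions of rows and every index that carries the column.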

This is more an argument about archive tables than it is in-use tables. I've seen a few systems that had one or two tables in the 50-million-row-plus range, and those were primarily archive or auditing tables, or data warehouses (where you can play tricks to cut down the pointer size). However, let's not forget that archive tables serve a very different purpose than in-use tables.


Yes, they have different purposes, but there are benefits to be gained by some foresight for both archive and transactional tables, as has already been discussed.

When modeling, the choice of implementation data type (vs the data type in the model) is secondary. You do not care whether you use an int, smallint, tinyint or bit when modeling a database. What matters is that the value is an integer with some given range. That the DBA decides they can use a smallint instead of a 4-byte integer is done at the implementation stage. Still important, but not a part of modeling.


Data modeling has two parts: logical, then physical. An implementation that is far removed from the reasoning behind the logical design is much more prone to error. And again, nobody is talking about constraining the logical model to fit into the physical model. We are talking about making good implementation choices based on the logical model and the reasoning behind it.

A few extra minutes of work now can save many hours of work in the future, so why not do it?


Because it can also cost you a significant amount in development time. If you chose a TINYINT, for example, and after the system goes into production it is determined that it is too small, quite a bit will likely have to be changed.


This is why we are talking about making reasonable choices as opposed to arbitrarily using the smallest datatypes available. In my experience, I have spent more time reducing datatypes than increasing them because the vast majority of the time people choose INT over TINYINT to store values from 1 to 5.
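A minimal sketch of that sensible-smaller-type choice (table and constraint names are hypothetical): a TINYINT holds the 1 - 5 range in a single byte, and a CHECK constraint documents and enforces the valid range.

```sql
CREATE TABLE dbo.SurveyResponse
(
    SurveyResponseID INT     NOT NULL IDENTITY(1, 1) PRIMARY KEY,
    Rating           TINYINT NOT NULL  -- 1 byte instead of INT's 4
        CONSTRAINT CK_SurveyResponse_Rating CHECK (Rating BETWEEN 1 AND 5)
);
```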

Completely disagree. I've run into many issues with that ridiculous max value. Example 1: you have a table that tracks when an entity is "valid". You have a start and end date. Null end date = currently available. The client-side code wants to use a max datetime value to represent that some entity is valid in perpetuity. In most programming languages, that date is 9999-12-31. Can't do it with smalldatetime. You have to store a mystery constant somewhere in the middle-tier code that is used to represent that value. Example 2: copyrights. 10 years ago, you might have thought it safe that no copyright could extend beyond 2079 for something today or in the past. Surprise, surprise: thanks to Disney, it is now 100 years. Example 3: contracts. Yes, I've run into contracts that specify certain rights 20 and 50 years into the future. Saving those four bytes, in the long run, simply does not provide nearly enough benefit to justify it. If you are worried about storing time values, then add a check constraint which prevents time values from being stored in that datetime field.


You only disagree because you missed the point I was making. The point is: be sensible. That means using the best datatype for the purpose. Using an INT to store values 1 - 5 is not sensible. Using a DATETIME to store future dates that might be 20 - 40 years out is entirely sensible and nobody would begrudge you that choice. So your examples are all instances where the business-case dictates that DATETIME is the correct choice. Great. This takes nothing away from what I (and others) have been saying. And that you found an edge-case where copyright law was changed is just that: an edge-case. And if there is some distinction between "currently available" and "always available", then again that is either a good case for DATETIME or maybe a different modeling choice to begin with.
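The date-only CHECK constraint mentioned in the quoted post could look like this (table and column names are hypothetical; the DATEADD/DATEDIFF pair is the classic idiom for flooring a DATETIME to midnight):

```sql
-- Reject any value that carries a time portion:
ALTER TABLE dbo.Copyright
    ADD CONSTRAINT CK_Copyright_ExpiresOn_NoTime
    CHECK (ExpiresOn = DATEADD(DAY, DATEDIFF(DAY, 0, ExpiresOn), 0));
```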

As for conversion, you better hope that your ORM does not account for the difference between smalldatetime and datetime. If so, that means additional unit testing to ensure that the change of the datatype will not break the existing code or that the code accommodates the expanded values.


1) People use ORMs because they provide an easy data-layer. It takes little to no effort to update the ORM model.

2) Why would you even mention testing as if it might not happen? Don't all changes get tested? The article was about future planning, which implies up-front work that is already being tested; and even when we go back to make changes, the assumption is that regression testing is still being done. Regression (not Unit) testing is assumed to be part of a project whether it is a new feature or an updated feature (which includes changes to datatypes).





SQL# - http://www.SQLsharp.com/
Post #1042004
Posted Monday, January 3, 2011 12:15 PM

Old Hand
rlobbe (1/1/2011)
What no one has mentioned is that you also have to consider the mechanics of the I/O.
SQL Server is going to fetch a block (or several) at a time, not an arbitrary number of bytes.


Actually, this specifically was stated in the article. I mentioned that smaller rows give a greater probability of more rows fitting onto a page. However, I mentioned it from the forward-looking perspective: using INT PKs for LookUp tables whose values never even get close to 255 means that a table with 5 or so FKs to various LookUp tables has larger-than-necessary rows, and hence fewer fit on a page.
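Back-of-envelope numbers (the row sizes are hypothetical): roughly 8,060 bytes of row data fit on an 8 KB page, so shaving five FK columns from INT to TINYINT saves 15 bytes per row and lets noticeably more rows share each page.

```sql
-- Hypothetical 100-byte row with five INT FKs vs an 85-byte row
-- where those five FKs are TINYINT (integer division = whole rows per page):
SELECT 8060 / 100 AS RowsPerPage_IntFKs,      -- 80 rows per page
       8060 / 85  AS RowsPerPage_TinyIntFKs;  -- 94 rows per page
```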





Post #1042009
Posted Monday, January 3, 2011 12:19 PM

Old Hand
Nadrek (1/3/2011)
Essentially; know your system, know your data, know your bottlenecks, know the plateaus, know your growth patterns, and keep an eye on future technology.

You have limited CPU, limited disk space, limited disk IO, limited memory (note: taking more space on disk also takes more space in RAM), limited time, limited concurrency, and limited humans. All these factors must be balanced.


Yes, this is exactly the point I was making.

Now, if SQL Server gave us a comprehensive range of integer datatypes, we'd be a lot better off; we should have both signed and unsigned 8, 16, 32, and 64 bit integers. We don't.


I agree: UN-signed INTs (of the various sizes) would be great!





Post #1042013
Posted Monday, January 3, 2011 12:23 PM

Old Hand
Charles Kincaid (1/3/2011)
You have to apply all of these through a filter of sense.


Of course. That is what the article was about: making sensible/reasonable choices.





Post #1042016
Posted Monday, January 3, 2011 12:34 PM

Old Hand
timothyawiseman (1/3/2011)
When I am thinking about the trade offs for creating an index, I will consider carefully how often the table is written to and the trade offs in write performance, but I won't give a second thought to the disk space because it is cheap relative to making my users wait. I will create complicated and large indexed views in the same way.

Similarly, I tend to log procedures extensively. This costs disk space, but that disk space is cheap compared to having a problem in production and not being able to track down in detail when, why, how, and who caused it along with enough information to know what the answer should have been.


I completely agree. And that is the point of the article: since the table is the foundation for where that data will keep showing up (indexes, log tables, etc.), making good choices at the beginning makes it that much easier to create whatever indexes and logging are needed. Those are valid business reasons to use disk space, and their impact on performance will be "appropriate" for the benefit of having the feature, provided the columns making them up were not arbitrarily wasting space to begin with.


So, I agree, it was a good article with a good point and one that developers and DBAs should remember. I just want to caution against taking it to an extreme. Disk space should never be wasted, but at least in the situations I regularly see, trading disk space for detailed logging or improved performance is almost always a good trade. Disk space is cheap enough to justify that.


Thanks. And the intention of the article was not that anyone would go to the extreme but instead they would start thinking about things that they previously had not and in doing so make better choices. It is not a matter of sacrificing something that is necessary; it is a matter of not doing things that are unnecessary. I am not saying: do not use the disk. I am saying: try to use disk wisely because it is far too easy to use it foolishly.





Post #1042026
Posted Tuesday, January 4, 2011 12:00 PM
SSCrazy
Group: General Forum Members
Last Login: Today @ 12:02 PM
Points: 2,892, Visits: 1,785
Wish I had a few more databases of sub-billion row tables

The thing that has made MPP so attractive in the BI world is the explosion in data volumes. Not only do people want to analyse vast data sets, they want to do it in near real time.

Clickstream data for a fair-sized e-commerce site will easily blast through the billion-record mark. Anything that collects mechanical data is likely to have to deal with huge data volumes. Jodrell Bank generates 6TB/minute, or it did 3 years ago.


LinkedIn Profile
Newbie on www.simple-talk.com
Post #1042627
Posted Wednesday, January 5, 2011 6:52 PM
SSC Journeyman
Group: General Forum Members
Last Login: Monday, July 28, 2014 12:15 AM
Points: 94, Visits: 125
Fantastic article with some really neat real-world examples. Thanks for taking the time to write it!
Post #1043441
Posted Tuesday, April 26, 2011 2:08 PM

Valued Member
Group: General Forum Members
Last Login: Tuesday, July 1, 2014 10:50 AM
Points: 54, Visits: 423
I know I am a few months late to this game, but I wanted to give you a compliment and issue a small caveat.

Compliment: great article. It is reassuring, for me anyway as an RDBMS-agnostic designer, that I wholeheartedly agree with your physical modeling approaches. Sizing indexes by understanding column data types is critically important. Nulls matter, as does premature denormalization. I already apply 9 of the 10 suggestions in practice and all 10 in spirit, which leads me to the caveat.

Caveat: If you are in a Microsoft technology-only shop with .NET, C# and other such technologies then no worries. Otherwise be careful with using non-ANSI standard data types like "tinyint" and "bit". The issue isn't one of ANSI compliance for its own sake, but for the sake of compatibility with non-Microsoft external drivers, add-on reporting systems and integration tools.

Once again thanks for the confirmation on my physical data modeling approaches. - Todd
Post #1099000
Posted Tuesday, April 26, 2011 8:04 PM

Old Hand
Todd M. Owens (4/26/2011)
Compliment: great article.

Caveat: If you are in a Microsoft technology-only shop with .NET, C# and other such technologies then no worries. Otherwise be careful with using non-ANSI standard data types like "tinyint" and "bit".


Hello Todd. Thanks for the compliment. And thanks for mentioning the potential issue with non-standard datatypes. I did not specifically deal with that caveat for two reasons. First, I had not thought of it, since I have never run into any problems. Second, I feel that by properly testing, unintended consequences from datatype decisions will be found. So if one starts out in their development environment with these changes, then any problems will not even make it to the QA environment.





Post #1099168
Posted Wednesday, April 27, 2011 11:31 AM

Valued Member
Oh, another thing that chews up space: thoughtlessly adding or misusing a set of audit columns on every single table regardless of how the table is used. For example, if a table only allows inserts in a version-type approach, there is no reason to add columns that capture UpdateDate and/or UpdateUserID. If a table's use case(s) calls for a row to be updated, only then are the Update fields necessary.

And please think twice before setting the CreateDate and the UpdateDate to the same exact value on insert: this is a clear violation of the limited-redundancy principle. If 90% or more of the rows are never updated, then there is an extraneous 8 bytes on each such row from the UpdateDate; multiply that by the number of rows where it should be NULL and by every table where it is misused, and the numbers get quite large quite quickly.
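A sketch of the pattern being described (table and column names are hypothetical): CreateDate is populated by a default, while UpdateDate stays NULL until a row is actually modified.

```sql
CREATE TABLE dbo.Customer
(
    CustomerID   INT          NOT NULL IDENTITY(1, 1) PRIMARY KEY,
    CustomerName VARCHAR(100) NOT NULL,
    CreateDate   DATETIME     NOT NULL
        CONSTRAINT DF_Customer_CreateDate DEFAULT (GETDATE()),
    UpdateDate   DATETIME     NULL  -- set only by UPDATE statements, never on INSERT
);
```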
Post #1099676