Disk Is Cheap! ORLY?

  • GPO (1/1/2011)


    What my argument doesn't address here is the opportunity cost of parsimony. Does it cost more to hire competent DBAs than it does to get the same outcome through better hardware? You don't have an unlimited budget. How do you decide where the money gets spent?

    I would typically err on the side of the human over the machine. Yes, the machine's cost can be depreciated on the company's tax returns and it won't take sick/vacation days, but:

    1) an employee can usually be replaced at the same cost whereas inadequate hardware requires additional spending and trying to sell old hardware doesn't recoup much of the cost

    2) hardware has a very limited scope of what it can help. Making one machine faster only affects that one machine, whereas an employee can help fix an entire system and even help in areas that are not system related

    3) hardware will never creatively contribute to solving problems: the only fix it offers is more hardware. An employee can re-architect a project to work more efficiently on the same hardware and then go on to add new features.

    Obviously there are limits. At some point a problem will simply require more hardware and no amount of re-engineering will help. But of course each situation needs to be evaluated on its own merits as opposed to applying a hard-and-fast rule. Meaning: "it depends" ;-).

    SQL#: https://SQLsharp.com/ ( SQLCLR library of over 340 Functions and Procedures )
    Sql Quantum Lift: https://SqlQuantumLift.com/ ( company )
    Sql Quantum Leap: https://SqlQuantumLeap.com/ ( blog )
    Info sites: Collations     •     Module Signing     •     SQLCLR

  • CPU power has increased, but a great talk from Dr. David DeWitt at the 2009 PASS Summit showed that disk transfer rates have not. Capacity has grown, but IOPS are not keeping pace with everything else.

    Therefore it's not that a GUID vs. an int makes a difference everywhere, but the microsecond differences across your whole system can add up.

    Does this mean you should never use GUIDs? No, it means that you shouldn't blindly assume a GUID is needed and that your hardware will keep up. In most systems hardware can cover for poor design and coding, but in quite a few, it doesn't. So you consider that disk matters as a piece of your design. Not necessarily the most important part, but a piece, which is what I think Solomon was getting at.
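A rough back-of-the-envelope sketch of how those small differences compound. The row count and index count below are hypothetical, not from the discussion; the mechanism is that SQL Server carries the clustered key into every nonclustered index row, so a 16-byte GUID key vs. a 4-byte INT key is paid for many times over:

```python
# Hypothetical sizing sketch: the clustered key is stored once per data row
# and repeated in every nonclustered index row, so key width compounds.
ROWS = 10_000_000            # assumed table size
NONCLUSTERED_INDEXES = 5     # assumed index count

def key_bytes_total(key_width):
    # one copy in the clustered index plus one per nonclustered index
    return ROWS * key_width * (1 + NONCLUSTERED_INDEXES)

guid_cost = key_bytes_total(16)   # uniqueidentifier
int_cost = key_bytes_total(4)     # int
extra_mb = (guid_cost - int_cost) / 1024**2
print(f"extra space for GUID keys: {extra_mb:.0f} MB")  # ~687 MB for keys alone
```

And that is before counting the wider key's effect on fan-out, cache hit rates and log volume, which is the "adds up across the whole system" point.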

  • If you ask your manager whether she'd rather wait 10 minutes for her answer, or 10 seconds, she's going to go for the 10 second option.

    Of course this is a misrepresentation. Here, I can do that too: "Would you rather wait only 10 seconds some time in the future for a query to complete, at a cost of 1,000 hours of additional development time at X rate per hour now, or would you rather take a small risk that your query might take 10 minutes in the future because the hardware has not caught up or the hardware in use at that time is insufficient, but save that money now?" I bet you most managers would rather save money now than maybe save money in the future.

    Designed correctly, you can get performance comparable to an integer using GUIDs, and it will scale well in most cases. However, most opponents of GUIDs discount COMB GUIDs in their analysis of performance.

    There are bigger questions to answer. It doesn't matter what the hardware status quo is at any given point in time, if you're inefficient with your data, retrieval, maintenance and analysis will take longer than they should. The organization down the road that is competing with you, and who DOES take parsimony seriously, will have a competitive advantage (all else being equal).

    Inefficiency in your data types is a relative problem. Is storing the year out to its full four digits inefficient? It was 40 years ago. You are discounting development time, maintenance costs and, most importantly, hardware scale.
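For reference, the COMB GUID mentioned above works by overwriting the bytes SQL Server compares first (the last six bytes of a uniqueidentifier) with a timestamp, so new values sort roughly in insertion order and land at the end of the index instead of splitting random pages. A minimal sketch, assuming a millisecond timestamp and this byte layout:

```python
import time
import uuid

def comb_guid():
    """Sketch of a COMB GUID: a random GUID whose last 6 bytes are replaced
    by a big-endian millisecond timestamp. SQL Server orders uniqueidentifier
    values starting from those trailing bytes, so successive COMBs sort
    (roughly) in generation order."""
    b = bytearray(uuid.uuid4().bytes)
    millis = int(time.time() * 1000) & 0xFFFFFFFFFFFF  # fit into 6 bytes
    b[10:16] = millis.to_bytes(6, "big")
    return uuid.UUID(bytes=bytes(b))
```

Two COMBs generated in different milliseconds compare in time order on their trailing bytes, which is what keeps page splits and fragmentation down versus purely random GUIDs, while retaining client-side generation.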

  • Steve Jones - SSC Editor (1/1/2011)


    CPU power has increased, but a great talk from Dr. David DeWitt at the 2009 PASS Summit showed that disk transfer rates have not. Capacity has grown, but IOPS are not keeping pace with everything else.

    So you consider that disk matters as a piece of your design. Not necessarily the most important part, but a piece, which is what I think Solomon was getting at.

    Yes, this is it exactly. I am saying that we cannot ignore disk space usage as a relevant factor in system design, as so many people do because of this misperception that "disk is cheap". And I even said in the article that GUIDs are sometimes perfectly valid, just 99% of the time NOT for clustered indexes and/or PKs. Good design takes into account as much of the environment in which the system exists as possible.

    Thanks, Steve, for clarifying this. I think I should add this to any revision of this article.

  • Thomas-282729 (1/1/2011)


    Inefficiency in your data types is a relative problem. Is storing the year out to its full four digits inefficient? It was 40 years ago. You are discounting development time, maintenance costs and, most importantly, hardware scale.

    Inefficiency is a relative problem only if we are talking about radically different system models. A desktop application that holds data for one person is far different than the growing trend of web-based applications holding data for everyone in a shared setting. You are missing the point in that I am advocating people take a few moments here and there to think more about what they are doing going forward, not looking for stuff to fix. By spending a few extra minutes here and there a company can greatly reduce maintenance and hardware costs with a minimum of development time. Using larger than necessary datatypes for everything is just plain lazy and expecting hardware to catch up to hasty decisions only works if you have the money to spend on new hardware.

    And your example of the Y2K issue: that could have been handled a lot better and was much less of an issue than originally thought. Also, not storing the full 4 digits created the possibility of incorrect computations on seemingly correct data. Using a SMALLDATETIME and not being able to go beyond 2079, by contrast, simply means potentially not being able to enter a valid piece of data, as opposed to entering something that looks correct but might someday produce an erroneous result when doing date arithmetic on it. So this is a moot point.

  • 1) Please keep in mind that this issue is a compounded one since a field can show up in other tables as well as using memory, CPU, network, log space, etc. We are not talking about one field in one table here. If we were, then maybe I would agree even if we had 500 million rows (although I would still try to model it properly if it is a new project but I wouldn't worry about it for an existing system). When people don't pay attention to datatypes they generally carry that idea onto many fields in many tables so even if you don't have many large tables you still have many fields across many tables that could be taking up resources unnecessarily.

    I could make the same argument against any data type. "Think of the savings you are wasting by using a 4-byte integer instead of a 2-byte integer for a primary key!" In the grander scheme of things, what matters is whether it affects actual real-world performance given actual data sizes. The vast majority of databases do not even contain a million rows, much less 10 million in any one table, and thus tiny differences in datatypes will make no perceptible difference. What does make a significant difference are data integrity rules.

    2) Regarding the idea that most databases do not get to hundreds of millions of rows, I think that is an issue of looking back 20 years as opposed to ahead 20 years. I certainly respect your experience in the field, but I think it is becoming clear that more and more applications are highly data intensive, and given that storing large volumes of data today is much more feasible and economical than 10-20 years ago, and that BI and data analysis are becoming more popular, I think we are already seeing a growing trend in data retention. And regardless of table size, if operations on a table take longer due to additional resource consumption, that affects the user experience. More applications these days are web-based, with users expecting instantaneous response times, and they don't care if there are 1000 other concurrent users on the system who might be hitting the same table(s).

    This is more an argument about archive tables than it is about in-use tables. I've seen a few systems that had one or two tables in the 50-million-row-plus range, and those were primarily archive tables, auditing tables or data warehouses (where you can play tricks to cut down the pointer size). However, let's not forget that archive tables serve a very different purpose than in-use tables. Again, I see medium-size company databases regularly, and rarely do I see tables even remotely close to a million rows, almost never if you exclude archive tables.

    3) Regardless of how many rows a table might have over a 5 or 10 year period, if I have the chance to model something properly in the beginning then I will certainly do so because there is a very minimal extra cost in terms of me thinking about the problem and maybe asking a few more questions of the person requesting the feature.

    When modeling, the choice of implementation data type (vs the data type in the model) is secondary. You do not care whether you use an int, smallint, tinyint or bit when modeling a database. What matters is that the value is an integer with some given range. That the DBA decides they can use a smallint instead of a 4-byte integer is done at the implementation stage. Still important, but not a part of modeling.

    A few extra minutes of work now can save many hours of work in the future so why not do it?

    Because it can also cost you a significant amount in development time. If you chose a tinyint for example and after the system goes into production it is determined that this is too small, it is likely that quite a bit will have to be changed.

    And who can predict that their project will never grow to such sizes?

    There is a cost to such an assumption. Why not assume your database could grow to 1 trillion exabytes? The obvious answer is that building a system to scale to those dimensions would require an enormous amount of effort even though the probability that the system could grow to those levels is remote. Yes, we should make *reasonable* assumptions about growth, accounting for archival processes, and in most systems I have seen that is far less than billions or even hundreds of millions of rows.

    4) I don't think it is entirely relevant to state that we now have 100 or even 1000 times more efficient hardware when the financial resources to get such hardware are not infinite. Yes, there will always be faster CPUs and more memory to add but that doesn't mean those budget items will be approved. That is a very short-sighted way of looking at this issue: a vendor selling the proper hardware in no way implies my ability to purchase it to solve my problems.

    You are also discounting cost. In addition to computers becoming more powerful, the cost of that power generally drops. Thus, it is likely that a system could be built today that would scale your database to 100 million rows, but ten years from now that system might cost the same as a workstation today.

    I'm not saying that developers should entirely discount performance related to data types, but I am saying that I would prefer they focus more on data integrity and good relational design than on minor performance boosts from, say, using a smalldatetime vs. a datetime.

    5) The DATETIME2 datatype does not replace GUIDs, since GUIDs are used as record locators and I cannot see that happening with datetime data even if it can point to a unique record.

    A significant reason against using DateTime in the past is that its resolution was far too low (1/3 of a millisecond) to work in heavy transaction systems. However, at 100-nanosecond resolution, you would be hard pressed to intentionally create duplicates. A big reason that GUIDs are used over integers is the ability to generate them on the client instead of requiring the round trip that identity values require. DateTime2 could work well enough for that purpose in half the size. I haven't personally built a system using a DateTime2 as a PK, but I can see the potential.

    6) Using SMALLDATETIME when the time itself is not wanted is nearly always a good idea, and the limit of year 2079 is not much of an issue as a "non-intuitive max value". The reason is that future-looking date values nearly always look into the more immediate future. Most dates start as current, which always works, plus some future dates for scheduled events or appointments that don't typically happen 50 or more years out. The current max value of a SMALLDATETIME is 68 years in the future, which is perfectly workable in 99% of situations (again, where only the date portion is relevant to begin with). If you consider that computer science as a career is not much more than 40 years old, the max value for a SMALLDATETIME is more than that amount of time in the future.

    For any project that I am working on now, it is almost guaranteed that IF my company is still around in 68 years, by then they will have changed this data model and application many times over, especially given how many changes we have made in the past 5 years alone. And I cannot predict what will be available to use, or whether SQL Server will even still be around in 68 years, so that amount of planning is wasted effort. And given that I have been thinking along these lines for 8+ years now, the industry has proven me correct: SQL Server, starting with 2008, offers a DATE-only datatype that is perfect for this use case, and everything I designed in the past 6 years that uses SMALLDATETIME can easily be converted over (IF that is ever even necessary), and this entire time my projects have benefited from not wasting the extra 4 bytes of the full DATETIME.

    Completely disagree. I've run into many issues with that ridiculous max value. Example 1: you have a table that tracks when an entity is "valid". You have a start and end date; a null end date = currently available. The client-side code wants to use a max datetime value to represent that an entity is valid in perpetuity. In most programming languages, that date is 9999-12-31. You can't do it with smalldatetime. You have to store a mystery constant somewhere in the middle-tier code to represent that value. Example 2: copyrights. 10 years ago, you might have thought it safe that no copyright could extend beyond 2079 for something created today or in the past. Surprise, surprise: thanks to Disney, now it is 100 years. Example 3: contracts. Yes, I've run into contracts that specify certain rights 20 and 50 years into the future. Saving those four bytes, in the long run, simply does not provide nearly enough benefit to justify it. If you are worried about storing time values, then add a check constraint which prevents time values from being stored in that datetime field.

    As for conversion, you had better hope that your ORM does not account for the difference between smalldatetime and datetime. If it does, that means additional unit testing to ensure that the change of datatype will not break the existing code, or that the code accommodates the expanded values.
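The 2079 limit argued over above follows directly from SMALLDATETIME's layout: 2 bytes holding the day count since 1900-01-01 and 2 bytes holding minutes since midnight. A quick check of where the unsigned 2-byte day counter runs out:

```python
from datetime import date, timedelta

BASE = date(1900, 1, 1)   # SMALLDATETIME's epoch
MAX_DAYS = 2**16 - 1      # days field is an unsigned 2-byte integer

max_date = BASE + timedelta(days=MAX_DAYS)
print(max_date)  # 2079-06-06, SQL Server's documented SMALLDATETIME maximum date
```

So the "68 years from now" and "can't represent 9999-12-31" claims on both sides of the argument are two readings of the same 65,536-day window.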

  • What no one has mentioned is that you also have to consider the mechanics of the I/O.

    SQL Server is going to fetch a block (or several) at a time, not an arbitrary number of bytes.

    What was discussed is valid, but if the cumulative per-row byte savings don't fit an extra row (or more) into a block, your savings are zero.

    A complete block will be read, whether it's full or not.

    The advice on indexes is still valid: you can almost certainly fit extra index entries into an index block.
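This threshold effect is easy to demonstrate. Using roughly 8,060 bytes of usable row space per 8 KB SQL Server page and ignoring per-row overhead for simplicity (both simplifying assumptions), shaving bytes off a row only pays once an additional whole row fits:

```python
PAGE_BYTES = 8060  # approx usable row space on an 8 KB page (overhead ignored)

def rows_per_page(row_bytes):
    # whole rows only: a page never holds a fraction of a row
    return PAGE_BYTES // row_bytes

# Shaving 2 bytes off a 200-byte row changes nothing; shaving 4 gains a row.
print(rows_per_page(200))  # 40
print(rows_per_page(198))  # 40 -- savings that don't fit an extra row are wasted
print(rows_per_page(196))  # 41
```

The same step function applies to index pages, which is why narrow keys pay off there almost every time: index entries are small, so a few saved bytes cross the "one more entry" threshold far more often.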

  • Someone mentioned in an earlier post that all tables must have a cluster definition. This is not true. It's called a hash table.

    Jay Quincy Allen, Managing Partner, Relational Models LLC

  • pro-1019688 (1/2/2011)


    Someone mentioned in an earlier post that all tables must have a cluster definition. This is not true. It's called a hash table.

    Actually - it is called a heap, but that is just words. 😉

    Jeffrey Williams
    “We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”

    ― Charles R. Swindoll

    How to post questions to get better answers faster
    Managing Transaction Logs

  • Yes, sorry Jeffrey, a heap, not a hash table.

    Jay Quincy Allen, Managing Partner, Relational Models LLC

  • Excellent!

  • Essentially; know your system, know your data, know your bottlenecks, know the plateaus, know your growth patterns, and keep an eye on future technology.

    You have limited CPU, limited disk space, limited disk IO, limited memory (note: taking more space on disk also takes more space in RAM), limited time, limited concurrency, and limited humans. All these factors must be balanced.

    Know your bottlenecks, but don't waste anything. If you want an 8k row tally table, use smallint, not int. If you have a date dimension table that covers SMALLDATETIME, every single day can be represented by a unique SMALLINT value, starting at -32768 for Jan 1, 1900.

    Note: It's not uncommon for a 146GB 15k FC "upgrade" disk to cost right around $1000... each. Not counting the tray cost. Not counting the cabinet cost. Not counting license costs for the SAN to be allowed to see it. Not counting parity/mirror disk costs. Not counting hot spare costs. Not counting cold spare costs. Not counting electricity use. Not counting heat generation.

    Plateaus are critical: Adding another couple drives is "easy" and "cheap"... until you're full, at which time you may need a new tray, which needs a new cabinet, which needs more floor space in the server room, and which needs a new electrical circuit, which needs a new enterprise UPS, which needs a new generator... and the additional heat load needs a new air conditioning unit. Did we mention the fire suppression system, too?

    Now, if SQL Server gave us a comprehensive range of integer datatypes, we'd be a lot better off; we should have both signed and unsigned 8, 16, 32, and 64 bit integers. We don't.
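The SMALLINT date-dimension trick above works because SMALLINT's 65,536 values exactly cover SMALLDATETIME's 65,536 days. A sketch of the mapping, with -32768 anchored at Jan 1, 1900 as the post describes:

```python
from datetime import date, timedelta

BASE = date(1900, 1, 1)  # smallint key -32768 maps to this date

def date_to_key(d):
    """2-byte surrogate key: day offset from 1900-01-01, shifted into smallint range."""
    return (d - BASE).days - 32768

def key_to_date(k):
    return BASE + timedelta(days=k + 32768)

print(date_to_key(BASE))    # -32768, the smallint minimum
print(key_to_date(32767))   # the smallint maximum lands exactly on SMALLDATETIME's last day
```

Halving the key from INT to SMALLINT halves the key's footprint in every fact-table row and index that references the date dimension, which is exactly the "don't waste anything" point.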

  • Amen! I hear that "disk is cheap" all the time from developers. Laziness keeps prevailing where I work, and I'm doing my best to try and make developers aware of the reasons behind watching datatypes as well as nullable fields. I believe I will forward this article to all developers.

  • Points made by Thomas-282729 and others are valid for consideration. You have to apply all of these through a filter of sense. The other point that was good is "does this get me an extra row per page".

    The other thing that has not been mentioned is those of us stuck on the low end of the spectrum: mobile. We have less space than a postage stamp and processors with less power than a one-horse sleigh. If we are not engineered to the max, the user will throw the application (and the device) in the trash. We will not have a billion rows in any table, that is true; we don't have a billion of anything. We do have compatibility constraints, as often we are at the mercy of "how much code do you write for Sync Services".

    Scalability is a big consideration at both ends of the spectrum and in the middle too.

    ATB, Charles Kincaid

  • I have to agree with you. Taking steps to ensure that disk space is not wasted is definitely beneficial.

    But there is a counterbalancing point: disk space is normally cheaper than the problems that can be solved by sacrificing disk space. I hate wasting disk space and the increases in IO time that it brings, but I will trade vast swaths of disk space for even very small performance gains and normally say it is a good trade.

    When I am thinking about the trade offs for creating an index, I will consider carefully how often the table is written to and the trade offs in write performance, but I won't give a second thought to the disk space because it is cheap relative to making my users wait. I will create complicated and large indexed views in the same way.

    Similarly, I tend to log procedures extensively. This costs disk space, but that disk space is cheap compared to having a problem in production and not being able to track down in detail when, why, how, and who caused it along with enough information to know what the answer should have been.

    So, I agree, it was a good article with a good point, and one that developers and DBAs should remember. I just want to caution against taking it to an extreme. Disk space should never be wasted, but at least in the situations I regularly see, trading disk space for detailed logging or improved performance is almost always a good trade. Disk space is cheap enough to justify that.

    ---
    Timothy A Wiseman
    SQL Blog: http://timothyawiseman.wordpress.com/

Viewing 15 posts - 46 through 60 (of 101 total)
