Disk Is Cheap! ORLY?

  • rlobbe (1/1/2011)


    What no one has mentioned is that you also have to consider the mechanics of the I/O.

    SQL Server is going to fetch a block (or several) at a time, not a arbitrary number of bytes.

    Actually, this specifically was stated in the article. I mentioned that smaller rows give a greater probability of more rows fitting onto a page. However, I mentioned it from the forward-looking perspective in that using INT PKs for LookUp tables that have values that never even get close to 255 causes tables that have 5 or so FKs to various LookUp tables to have larger than necessary rows and hence fewer fit on a page.

    SQL#https://SQLsharp.com/ ( SQLCLR library ofover 340 Functions and Procedures)
    Sql Quantum Lifthttps://SqlQuantumLift.com/ ( company )
    Sql Quantum Leaphttps://SqlQuantumLeap.com/ ( blog )
    Info sitesCollations     •     Module Signing     •     SQLCLR

  • Nadrek (1/3/2011)


    Essentially; know your system, know your data, know your bottlenecks, know the plateaus, know your growth patterns, and keep an eye on future technology.

    You have limited CPU, limited disk space, limited disk IO, limited memory (note: taking more space on disk also takes more space in RAM), limited time, limited concurrency, and limited humans. All these factors must be balanced.

    Yes, this is exactly the point I was making.

    Now, if SQL Server gave us a comprehensive range of integer datatypes, we'd be a lot better off; we should have both signed and unsigned 8, 16, 32, and 64 bit integers. We don't.

    I agree: UN-signed INTs (of the various sizes) would be great!

    SQL#https://SQLsharp.com/ ( SQLCLR library ofover 340 Functions and Procedures)
    Sql Quantum Lifthttps://SqlQuantumLift.com/ ( company )
    Sql Quantum Leaphttps://SqlQuantumLeap.com/ ( blog )
    Info sitesCollations     •     Module Signing     •     SQLCLR

  • Charles Kincaid (1/3/2011)


    You have to apply all of these through a filter of sense.

    Of course. That is what the article was about: making sensible/reasonable choices.

    SQL#https://SQLsharp.com/ ( SQLCLR library ofover 340 Functions and Procedures)
    Sql Quantum Lifthttps://SqlQuantumLift.com/ ( company )
    Sql Quantum Leaphttps://SqlQuantumLeap.com/ ( blog )
    Info sitesCollations     •     Module Signing     •     SQLCLR

  • timothyawiseman (1/3/2011)


    When I am thinking about the trade offs for creating an index, I will consider carefully how often the table is written to and the trade offs in write performance, but I won't give a second thought to the disk space because it is cheap relative to making my users wait. I will create complicated and large indexed views in the same way.

    Similarly, I tend to log procedures extensively. This costs disk space, but that disk space is cheap compared to having a problem in production and not being able to track down in detail when, why, how, and who caused it along with enough information to know what the answer should have been.

    I completely agree. And that is the point of the article: since the table is the foundation for where that data will keep showing up (indexes, log tables, etc.) then making good choices at the beginning makes it that much easier to create whatever indexes and logging are needed. Those are valid business reasons to use disk space and their impact to performance will be "appropriate" for the benefit of having the feature if the columns making them up were not arbitrarily wasting space to begin with.

    So, I agree, it was a good article with a good point and one that developers and DBAs should remember. I just want to caution against taking it to an extreme. Disk space should never be wasted, but at least in the situations I regularly see trading disk space for detailed logging or improved perofrmance is almost always a good trade. Disk space is cheap enough to justify that.

    Thanks. And the intention of the article was not that anyone would go to the extreme but instead they would start thinking about things that they previously had not and in doing so make better choices. It is not a matter of sacrificing something that is necessary; it is a matter of not doing things that are unnecessary. I am not saying: do not use the disk. I am saying: try to use disk wisely because it is far too easy to use it foolishly.

    SQL#https://SQLsharp.com/ ( SQLCLR library ofover 340 Functions and Procedures)
    Sql Quantum Lifthttps://SqlQuantumLift.com/ ( company )
    Sql Quantum Leaphttps://SqlQuantumLeap.com/ ( blog )
    Info sitesCollations     •     Module Signing     •     SQLCLR

  • Wish I had a few more databases of sub-billion row tables:-P

    The thing that has made MPP so attractive in the BI world is the explosion in data volumes. Not only are people wanting to analyse vast data sets but they are wanting to do it in near real time.

    Clickstream data for a fair sized e-commerce site will easily blast through the billion record table limit. Anything that collects mechanical data is likely to have to deal with huge data volumes. Joderell Bank generates 6TB/minute, or it did 3 years ago.

  • Fantastic article with some really neat real-world examples. Thanks for taking the time to write it! 😀

  • I know I am a few months late to this game, but I wanted to give you a compliment and issue a small caveat.

    Compliment: great article. It is reassuring, for me anyway as an RDBMS-agnostic designer, that I wholeheartedly agree with your physical modeling approaches. Sizing indexes by understanding column data types is critically important. Nulls matter, as does premature denormalization. I already apply 9 of the 10 suggestions in practice and all 10 in spirit, which leads me to the caveat.

    Caveat: If you are in a Microsoft technology-only shop with .NET, C# and other such technologies then no worries. Otherwise be careful with using non-ANSI standard data types like "tinyint" and "bit". The issue isn't one of ANSI compliance for its own sake, but for the sake of compatibility with non-Microsoft external drivers, add-on reporting systems and integration tools.

    Once again thanks for the confirmation on my physical data modeling approaches. - Todd

  • Todd M. Owens (4/26/2011)


    Compliment: great article.

    Caveat: If you are in a Microsoft technology-only shop with .NET, C# and other such technologies then no worries. Otherwise be careful with using non-ANSI standard data types like "tinyint" and "bit".

    Hello Todd. Thanks for the compliment :-). And thanks for mentioning the potential issue with non-standard datatypes. I do not specifically deal with that caveat for two reasons. First, I had not thought of it since I have never run into any problems ;-). Second, I feel that by properly testing, unintended consequences from datatype decisions will be found. So if one starts out in their development environment with these changes, then any problems will not even make it to the QA environment.

    SQL#https://SQLsharp.com/ ( SQLCLR library ofover 340 Functions and Procedures)
    Sql Quantum Lifthttps://SqlQuantumLift.com/ ( company )
    Sql Quantum Leaphttps://SqlQuantumLeap.com/ ( blog )
    Info sitesCollations     •     Module Signing     •     SQLCLR

  • Oh another thing that chews up space is thoughtlessly adding or misusing a set of audit columns to every single table regardless of how the table is used. For example, if a table only allows inserts in a version type approach, there is no reason to add columns that capture UpdateDate and/or UpdateUserID. If a table's use case(s) calls for a row to be updated, only then are the Update fields necessary.

    And please think twice before setting the CreateDate and the UpdateDate to the same exact value on insert: this a clear violation of the limited redundancy principle. If 90% or more of the columns are never updated, then there will be an extraneous 8 bytes on each row from the UpdateDate; multiply that by the number of rows where it should be null and in every table where it is misused, then the numbers get quite large quite quickly.

  • Todd M. Owens (4/27/2011)


    ...if a table only allows inserts in a version type approach, there is no reason to add columns that capture UpdateDate and/or UpdateUserID. If a table's use case(s) calls for a row to be updated, only then are the Update fields necessary.

    And please think twice before setting the CreateDate and the UpdateDate to the same exact value on insert...If 90% or more of the columns are never updated, then there will be an extraneous 8 bytes on each row from the UpdateDate...

    Hey there. I definitely agree with both of these and practice them myself.

    There might be a circumstance, however, to have both CreateDate and UpdateDate be set upon initial INSERT (and hence UpdateDate will be the same value as CreateDate if the row is never updated). I have seen cases where an ETL process looks at the UpdateDate field to know if it should grab that row and will have the UpdateDate field indexed for that purpose. In this case UpdateDate serves dual-duty by indicating both new and updated rows without having to scan both CreateDate and UpdateDate fields. And to only have a single field (assuming someone might suggest not having a CreateDate field in this case), it is still helpful to see the CreateDate since that information will be gone once the row is updated if you only have the single UpdateDate field. But outside of this specific case I tend to agree that the UpdateDate field should be NULL.

    Take care,

    Solomon...

    SQL#https://SQLsharp.com/ ( SQLCLR library ofover 340 Functions and Procedures)
    Sql Quantum Lifthttps://SqlQuantumLift.com/ ( company )
    Sql Quantum Leaphttps://SqlQuantumLeap.com/ ( blog )
    Info sitesCollations     •     Module Signing     •     SQLCLR

  • Todd M. Owens (4/27/2011)


    And please think twice before setting the CreateDate and the UpdateDate to the same exact value on insert: this a clear violation of the limited redundancy principle. If 90% or more of the columns are never updated, then there will be an extraneous 8 bytes on each row from the UpdateDate; multiply that by the number of rows where it should be null and in every table where it is misused, then the numbers get quite large quite quickly.

    Hey Todd. I did not notice until just a moment ago when I was working on something else that this statement is not entirely true. While I still do agree that using NULL for UpdateDate when a record has not yet been updated is a better practice, that is merely just a logical consideration (outside of the potential indexing issue I mentioned before). But a fixed-width column (numeric types, date/time types, and char/nchar/binary types) will always take up the same amount of space regardless if it is set to NULL or a value. Meaning, if UpdateDate is a DATETIME field, it will always take up 8 bytes even if it is NULL. And to make sure I did just test this out.

    Take care,

    Solomon...

    SQL#https://SQLsharp.com/ ( SQLCLR library ofover 340 Functions and Procedures)
    Sql Quantum Lifthttps://SqlQuantumLift.com/ ( company )
    Sql Quantum Leaphttps://SqlQuantumLeap.com/ ( blog )
    Info sitesCollations     •     Module Signing     •     SQLCLR

  • But a fixed-width column (numeric types, date/time types, and char/nchar/binary types) will always take up the same amount of space regardless if it is set to NULL or a value. Meaning, if UpdateDate is a DATETIME field, it will always take up 8 bytes even if it is NULL. And to make sure I did just test this out.

    Hmm I wonder if that is true for Oracle, DB2, Teradata, et.al....

  • Solomon,

    I've recently had the need to refer to this article again and I thank you from the bottom of my heart. This "right sizing on steroids" article should be required reading for anyone and everyone who has or will ever write the words CREATE TABLE. My hat is off to you, good Sir. Well done!

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (10/5/2012)


    Solomon,

    I've recently had the need to refer to this article again and I thank you from the bottom of my heart. This "right sizing on steroids" article should be required reading for anyone and everyone who has or will ever write the words CREATE TABLE. My hat is off to you, good Sir. Well done!

    Hi Jeff and thank you very much for such positive feedback. I apologize for the delayed response but new baby came along less than 2 weeks before your comment and things have been, well, C-R-A-Z-Y!! :hehe: Also, I had thought of a few minor additions to make and wanted to get them in and published before replying.

    I have added:

    • A reference to decreased Page Life Expectancy in the paragraph about memory usage
    • Info about several other datatypes at the end of the main recommendations list, things like: SMALLMONEY, REAL, TIME, etc.
    • Info about some newer features that can help save space in addition to (not in place of) a good model. Namely: SPARSE columns and Row / Page compression

    Hopefully the updates make for an even stronger article :-).

    Take care,

    Solomon...

    SQL#https://SQLsharp.com/ ( SQLCLR library ofover 340 Functions and Procedures)
    Sql Quantum Lifthttps://SqlQuantumLift.com/ ( company )
    Sql Quantum Leaphttps://SqlQuantumLeap.com/ ( blog )
    Info sitesCollations     •     Module Signing     •     SQLCLR

  • Excellent article. I think the network bandwidth issue is really important and often overlooked. It is mentioned in the article, but I'd add that a WAN circuit of even just moderate bandwidth can cost as much per month to operate as some disk drives cost to buy. And can easily cost significantly more. So if you are syncing to a DR site, getting the backup image over there, and as mentioned, log shipping or replication, can take a big byte out of it. Bandwidth is also not something you can quickly address if you run out of it! Disks are usually a two week turnaround to buy more and rack them up. WAN circuits are running 120 days or more!

Viewing 15 posts - 61 through 75 (of 101 total)

You must be logged in to reply to this topic. Login to reply