Disk Is Cheap! ORLY?

  • Very well-written article with great practical examples of the downstream effects of "lazy" design decisions.

  • While I agree in principle with the gist of the article, I don't think everyone needs to run into work this weekend, take their servers down, and rebuild all their tables with the 'correct' data types. It is a great article, and the points are things that data modelers should take to heart.

    But for existing systems, I'm going to put out a slightly contrarian viewpoint here for argument's sake.

    Everything is relative. Large organizations that have huge databases generally have larger budgets, and the reality is that even at enterprise disk-space cost, a savings of a few bytes per row--even if this savings could be achieved in every table in the system--is not going to make a CFO jump for joy and run out and buy that yacht.

    And let's look at the real savings (not talking about I/O yet--just disk cost). Let's say that in every table in the system you find a couple of datetime columns that could be date instead (5 bytes saved per column, going from an 8-byte datetime to a 3-byte date) and a few varchar(2) columns that could be char(2) (a couple of bytes saved once you count the variable-length overhead). Let's get really generous and say you found a lot of little changes and saved 30 bytes per row in every table and index in the entire system.

    Now let's say there are a few detail tables with 150 million rows, maybe some audit tables with another few hundred million, and hundreds of other smaller tables... let's say you have a total of a billion rows--no--how about 10 billion rows. I know that many organizations have even more than that, but again--everything is relative. So, 10 billion rows, and things were so bad that you saved 30 bytes for every row in every table in the entire system. What have you saved? 300 gigs, if my math is right. How much is that in enterprise disk cost? And remember--this is an organization whose in-house system has 10 billion rows (and we're not talking about 3rd-party systems like SAP with over 100,000 tables, or things like MS ISA, which can generate many billions of rows in even an average company).
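    Just to make that math concrete, here is a trivial back-of-the-envelope check (a sketch only; the row count and per-row savings are the hypothetical figures above):

    -- rough estimate: 10 billion rows, 30 bytes saved per row
    DECLARE @rows bigint = 10000000000,
            @bytes_saved_per_row int = 30;
    SELECT (@rows * @bytes_saved_per_row) / 1000000000 AS approx_gb_saved;  -- ~300 GB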

    Are you going to run to the powers that be and tout how much you're going to save after you... what?... take down the servers for how long?... test the applications with the new data types--and what about those apps that now need to be rebuilt because, say, smalldatetime will now error when a VB app passes its version of null... go through all of that to rebuild your applications and tables for how much savings?

    And yes, I get the part about I/O cost. Again, relativity, and pick the low-hanging fruit. If there are butt-stupid models out there in which the modeler was so lazy that he had a default type of char(50) and left it in for all of his "status" fields, then of course I'd make the time to fix that. But as many WTFs as there might be out there in data models, the reality is that most places won't have that much low-hanging fruit (if you beg to differ on that one, then your company probably has more pressing IT issues than just this). And if you are experiencing massive I/O delays on the server and you do see low-hanging fruit like that, then you've probably found the cause.

    But a more 'expensive' thing in terms of user productivity is going to be network bandwidth. No matter how efficient you are in the data model, when you return the rows over the pipe the application is going to get the same amount of data--whether the db has to chug through a few more I/O cycles to put it together or not. That's where application-level WTFs can save bandwidth that's much more costly than some extra spindle time, but that's another article.

    Really though, excellent info, and by all means use it if you're in the modeling stage. But if you've got an existing system, you need to do the math, see how much you're actually going to save, and decide whether you have the resources to do this.

    --Jim

  • mwenner (12/31/2010)


    ...

    Can someone show me how a NonClustered index size (disk cost) is affected by the Clustered Index Size?

    Thanks,

    Mark

    If you choose an identity for your clustering key, that identity is in every non-clustered index. Regardless of what columns are in the nonclustered index, the clustering key is there. So let's assume you have this:

    CustomerID int

    FirstName varchar(50)

    LastName varchar(50)

    Status char(5)

    City varchar(100)

    Now, if my CI is on CustomerID, all the data is ordered by CustomerID.

    If I create a nonclustered index on Status, I will also have CustomerID in the nonclustered index. That is the way that I "look up" (row lookup/bookmark lookup) the data. So I have 5 bytes (char(5)) + 4 bytes (int) for every row in the nonclustered index.

    If, however, I created my CI on LastName, then my same nonclustered index on Status now has 5 bytes for the Status, but it also has a varchar(50) for the LastName key. On top of that, you have 2 bytes in each row for the variable-column overhead (just like a table), and if LastName is not unique, then each "duplicate" of LastName will have a unique value added (2 bytes, I believe) to each duplicate row.

    Any other nonclustered indexes also carry the larger key. So a CI key with variable-length columns can really affect your nonclustered index sizes.

    Now, is 4 bytes vs. 25 (average) a big deal? Maybe, maybe not. It depends on the size of the data, the volume of queries, etc., but it is something to be aware of so that you understand what impact you could have. You would have to test whether this is significant in any way for your application.
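    To see this for yourself, here is a minimal sketch (the table and index names are made up for illustration); load some rows and compare the index sizes with sp_spaceused or sys.dm_db_partition_stats:

    -- Option 1: narrow clustering key (4-byte int identity)
    CREATE TABLE dbo.Customer
    (
        CustomerID int          IDENTITY(1,1) NOT NULL,
        FirstName  varchar(50)  NOT NULL,
        LastName   varchar(50)  NOT NULL,
        Status     char(5)      NOT NULL,
        City       varchar(100) NOT NULL,
        CONSTRAINT PK_Customer PRIMARY KEY CLUSTERED (CustomerID)
    );

    -- each row of this index stores Status (5 bytes) plus the clustering key CustomerID (4 bytes)
    CREATE NONCLUSTERED INDEX IX_Customer_Status ON dbo.Customer (Status);

    -- Option 2: a wide, non-unique clustering key. The same nonclustered index would then carry
    -- Status (5 bytes) + LastName (up to 50 bytes + 2 bytes variable-length overhead)
    -- + a 4-byte uniqueifier on duplicate LastName values.
    -- CREATE CLUSTERED INDEX CIX_Customer_LastName ON dbo.Customer (LastName);

    EXEC sp_spaceused N'dbo.Customer';  -- reports data and index sizes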

  • Good description there, Steve. One minor point: the 'uniqueifier' for non-unique clustered indexes is 4 bytes, not 2.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Thanks, couldn't remember that. I was thinking it was 2 + 2 for the variable portion. I was wrong.

    http://msdn.microsoft.com/en-us/library/ms177484.aspx

  • dforck (12/31/2010)


    I find it amusing that he bothered to mention using char instead of varchar, but never mentioned that using nvarchar, or nchar, effectively doubles how much space is being used for that column.

    But then sometimes you are stuck, like when you are dealing with international and multicultural data. Also, if you are doing mobile development you have no choice: SQL Server Compact Edition only supports nchar, nvarchar, and ntext.
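    If anyone wants a quick demonstration of that doubling, DATALENGTH shows it directly (a trivial sketch):

    -- 'status' is 6 characters: 6 bytes as varchar, 12 bytes as nvarchar
    SELECT DATALENGTH('status')  AS varchar_bytes,
           DATALENGTH(N'status') AS nvarchar_bytes;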

    ATB
    Charles Kincaid

  • Wonderful article. Good job. A couple of points though.

    For the online kind of storage you are right, and right on the mark. Still, the "disk is cheap" argument has one place where it must be stressed: backup. Some of my customers complain about the database size and don't want to back up because of cost. I point out that here in the U.S. we have electronics retailers with terabyte or larger USB drives on the shelf, and they are cheap--well, less expensive than a day of business downtime, for certain.

    I had initially thought that you were going to talk about partitioning and what to put on what kind of storage. Then you launched into choosing correct data types to provide efficiency in both space utilization and speed. There are some marvelous observations there. Had this been a busy day for me I might have skipped your article based on the title. Your observations should almost be required reading as prep for any design meeting.

    ATB
    Charles Kincaid

  • Thanks. Well-explained article for all those DBAs and engineers who do not have an eye for detail.

    In computing, even with advances in hardware, it is good to be conservative.

  • Good article. Your comments about I/O are dead on. Solid-state devices are the next new hard drive. You can get 500 GB and put your entire database on a solid-state device. Backups that took 4 hours now run in 20 minutes. ETL loads that took all night? Now done in 1 hour.

    Patrick.

  • @GrassHopper, yeah, making your tables a little leaner today doesn't save you that much disk space in the short term. However, there's a great advantage in the long term, the main thing being archiving. Yeah, 200 bytes of data isn't much up front, but tack on 5 years of data growth and archiving and you've saved yourself a decent amount of space, and money.

    That 300 GB: if you're archiving your database, that's an extra 300 GB you don't have to pay for.

  • Charles Kincaid (12/31/2010)


    dforck (12/31/2010)


    I find it amusing that he bothered to mention using char instead of varchar, but never mentioned that using nvarchar, or nchar, effectively doubles how much space is being used for that column.

    But then sometimes you are stuck, like when you are dealing with international and multicultural data. Also, if you are doing mobile development you have no choice: SQL Server Compact Edition only supports nchar, nvarchar, and ntext.

    That's true. I just wanted to point that out, though, because needlessly using Unicode types is a good way to inflate the size of a database if you don't need them.

  • Keep in mind this is one piece of the puzzle. Some languages, like some of the Java clients, will only work with nvarchar formats, and if your db is in varchar, you end up converting back and forth between varchar and nvarchar for parameters and result sets, which can be expensive as well.

    The decision about which data types to use should be made intelligently, based on the specific situation.
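    For example (a hypothetical table and parameter, just to illustrate the conversion cost), comparing a varchar column to an nvarchar parameter forces SQL Server to convert the column, which can prevent an efficient index seek:

    CREATE TABLE dbo.Person
    (
        PersonID int IDENTITY(1,1) PRIMARY KEY,
        LastName varchar(50) NOT NULL
    );
    CREATE NONCLUSTERED INDEX IX_Person_LastName ON dbo.Person (LastName);

    DECLARE @n nvarchar(50) = N'Smith';
    SELECT PersonID FROM dbo.Person WHERE LastName = @n;  -- LastName is implicitly converted to nvarchar

    DECLARE @v varchar(50) = 'Smith';
    SELECT PersonID FROM dbo.Person WHERE LastName = @v;  -- types match; a straightforward seek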

  • Excellent reading. Now to push management to let me get back to work on some of my "inherited" DBs. 🙂

  • At a place I worked that shall remain nameless, all the integers HAD to be bigint. This was required by the architects. Stupidest thing ever.

    Signature is NULL

  • Sorry it has taken a while for me to respond today but work has been busier than expected.

    First off, thanks to everyone for the compliments. They are truly appreciated. 🙂

    Next, I will try to address individuals within a few posts, as opposed to one post per item, so that I do not inundate people with 20 new posts.

    pro-1019688 (12/30/2010)


    Great article. I used to get very frustrated with programmers AND DBA's that thought the normalization aspects of relational models would cause a degradation in performance. I know that's a long and not so simple topic, but you have alluded to it here.

    Thanks. And I agree that some people tend to not trust that the “R” in RDBMS really stands for “Relational” and that these systems are specifically tuned to work this way.

    Michael Valentine Jones (12/31/2010)


    Columns that are too large also use more CPU time even if they are cached in memory.

    I was just looking at the database for a vendor supplied application, and was shocked to see they were using GUIDs in the form of strings as the primary keys for all tables.

    Thanks for mentioning that point about CPU. I will try to include that if/when I update this article. And as far as VARCHAR(36) GUIDs go, I can guarantee you that it is not the only product with those, as I have seen it elsewhere as well.
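    A quick way to see the cost of storing GUIDs as strings (just an illustration):

    DECLARE @g uniqueidentifier = NEWID();
    SELECT DATALENGTH(@g)                       AS uniqueidentifier_bytes,  -- 16
           DATALENGTH(CONVERT(varchar(36), @g)) AS varchar_bytes;           -- 36, plus variable-length overhead in a table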

    SSH (12/31/2010)


    Just like to add, another potential problem of wasting space when incorrectly defining data types is the issue of clogging up precious network bandwidth.

    As with the CPU time consequence noted by Michael Valentine Jones, I will try to add this as well to any revision I make to this article. Thanks for pointing this out.

    SQL# - https://SQLsharp.com/ (SQLCLR library of over 340 functions and procedures)
    Sql Quantum Lift - https://SqlQuantumLift.com/ (company)
    Sql Quantum Leap - https://SqlQuantumLeap.com/ (blog)
    Info sites: Collations • Module Signing • SQLCLR

