billy-yons and billy-yons of rows

  • Perfect question to end the week.

    Thanks Steve.

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw
    Learn Extended Events

  • Easy QotD.

    -----------------
    Gobikannan

  • tks for the question and tks for the contributions today everyone!

  • Really easy - thanks, Steve!

  • Hugo Kornelis (1/20/2012)


    Adding all these numbers together gives me a total requirement of 15,990,985,574,732,742 8K pages to store this table.

    So I guess that before I get to 15,990,985,574,732,742 pages I should add a second table to my database. 🙂

    Very interesting, thanks for doing the math.

  • Hugo Kornelis (1/20/2012)


    Hugo Kornelis (1/20/2012)


    Richard Warr (1/20/2012)


    Or from http://msdn.microsoft.com/en-us/library/ms175997.aspx we see that COUNT() returns an Integer. What happens to SELECT COUNT(1) FROM myTable when myTable contains more rows than an integer can store?

    You get an error message.

    So if you ever need to get a rowcount on a table that might exceed 2,147,483,647 rows, use COUNT_BIG() instead - this function returns bigint, with an upper limit of 9,223,372,036,854,775,807 rows. If you exceed that, I don't know what happens - but in that case, I would definitely like 0.01% of your hardware budget! 😉

    Just for kicks, I decided to do the math. (See "Estimating the size of a clustered index" in Books Online)

    A relational table should always have a primary key to be useful. To store more than 9,223,372,036,854,775,807 rows with primary key violations, the primary key must be at least 8 bytes, so the smallest possible table to exceed the max of COUNT_BIG() has a single 8-byte (e.g. bigint) column.

    The row size is the 8 bytes for the data, plus three for the NULL bitmap, 4 for row header overhead, and 2 for the row's entry in the slot array; that makes 14 total. This means that SQL Server can store at most 8,096 / 14 = 578 (rounded down) rows on a single data page. For 9,223,372,036,854,775,807 + 1 rows, that makes 15,957,391,067,222,796 (rounded up) pages for the leaf level of the clustered index.

    For the non-leaf pages, no NULL bitmap is used if the column is not nullable, but the overhead increases a bit - here, the size for each row is 8 (data) + 1 (overhead) + 6 (child page pointer) + 2 (slot array) = 17 bytes; this means SQL Server can fit 8,096 / 17 = 476 (rounded down) index rows on a page.

    For the first level above the leaf pages, we would need 15,957,391,067,222,796 / 476 = 33,523,930,813,494 (rounded up) pages.

    For the second level above the leaf pages, we would need 15,957,391,067,222,796 / (476 ^ 2) = 70,428,426,079 (rounded up) pages.

    For the third level above the leaf pages, we would need 15,957,391,067,222,796 / (476 ^ 3) = 147,958,879 (rounded up) pages.

    For the fourth level above the leaf pages, we would need 15,957,391,067,222,796 / (476 ^ 4) = 310,838 (rounded up) pages.

    For the fifth level above the leaf pages, we would need 15,957,391,067,222,796 / (476 ^ 5) = 653 (rounded up) pages.

    For the sixth level above the leaf pages, we would need 15,957,391,067,222,796 / (476 ^ 6) = 2 (rounded up) pages.

    And for the root level, seventh level above the leaf pages, we would need one more page.

    Adding all these numbers together gives me a total requirement of 15,990,985,574,732,742 8K pages to store this table. That is approximately 122,000,000,000 TB. If you want to build that using HP 6400/8400 Enterprise Virtual Array machines (see http://h10010.www1.hp.com/wwpc/us/en/sm/WF25a/12169-304616-304648-304648-304648-3900918.html?dnr=1 for specifications) at their maximum capacity of 628 TB a piece, you would need to have almost 200 million of those machines.

    Unfortunately, you would have to wait until Microsoft releases an improved version of SQL Server - with its current supported maximum database size of "only" 524,272 terabytes, it seriously falls short for this kind of challenge.

    With a hardware budget to support this bad bad boy, I would settle for 0.0000001% of that hardware budget.

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw
    Learn Extended Events
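
    For anyone who wants to replay the arithmetic, here is a minimal T-SQL sketch of the same b-tree page count, taking the 578 leaf rows and 476 index rows per page from the post as given. It rounds up at every level of the tree, so the totals can come out a page or two higher than the figures quoted above.

      DECLARE @max_rows   decimal(38,0) = 9223372036854775808;  -- one more row than COUNT_BIG can report (2^63)
      DECLARE @leaf_rows  decimal(38,0) = 578;                  -- rows per leaf page, per the post
      DECLARE @index_rows decimal(38,0) = 476;                  -- rows per non-leaf index page, per the post

      DECLARE @level decimal(38,0) = CEILING(@max_rows / @leaf_rows);  -- leaf level of the clustered index
      DECLARE @total decimal(38,0) = @level;

      -- Each level above the leaf needs one row per page of the level below,
      -- until everything fits on a single root page.
      WHILE @level > 1
      BEGIN
          SET @level = CEILING(@level / @index_rows);
          SET @total = @total + @level;
      END;

      SELECT @total                          AS total_8k_pages,
             @total * 8 / 1024 / 1024 / 1024 AS approx_terabytes;  -- 8 KB pages expressed in TB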

  • Richard Warr (1/20/2012)


    Hugo Kornelis (1/20/2012)


    Just for kicks, I decided to do the math. ....................

    Thanks, best post I've seen all week. I love stuff like that. 🙂

    +1

    I hope you had as much fun doing it as I had reading it! :laugh:

    Thanks!

  • Hugo Kornelis (1/20/2012)


    Just for kicks, I decided to do the math. (See "Estimating the size of a clustered index" in Books Online)

    But why a clustered index? That table might be a heap! So you might need only 99.7% of the amount of space you've calculated. :hehe:

    Tom

  • Britt Cluff (1/20/2012)


    Love my easy point on a Friday. Thanks Steve.

    Me too

    Tom

  • Hugo Kornelis (1/20/2012)


    Richard Warr (1/20/2012)


    Or from http://msdn.microsoft.com/en-us/library/ms175997.aspx we see that COUNT() returns an Integer. What happens to SELECT COUNT(1) FROM myTable when myTable contains more rows than an integer can store?

    You get an error message.

    So if you ever need to get a rowcount on a table that might exceed 2,147,483,647 rows, use COUNT_BIG() instead - this function returns bigint, with an upper limit of 9,223,372,036,854,775,807 rows. If you exceed that, I don't know what happens - but in that case, I would definitely like 0.01% of your hardware budget! 😉

    I thought about this for a while, and I think it would make sense to redefine COUNT() as a function returning BIGINT. After all, there is an implicit conversion from BIGINT to INT, so if a variable in your script is declared as INT and the returned value overflows it, you are no worse off than when COUNT overflows inside the function itself.

    Your thoughts, ladies and gentlemen?
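
    A quick way to see the return types in question (sys.objects is used here only as a convenient stand-in; dbo.myTable is the hypothetical table from the thread):

      -- COUNT() returns int, so it raises an arithmetic overflow error once the
      -- count no longer fits; COUNT_BIG() returns bigint.
      SELECT
          SQL_VARIANT_PROPERTY(COUNT(*),     'BaseType') AS count_returns,      -- int
          SQL_VARIANT_PROPERTY(COUNT_BIG(*), 'BaseType') AS count_big_returns   -- bigint
      FROM sys.objects;

      -- For a table that might ever pass 2,147,483,647 rows:
      -- SELECT COUNT_BIG(*) FROM dbo.myTable;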

  • L' Eomot Inversé (1/20/2012)


    Hugo Kornelis (1/20/2012)


    Just for kicks, I decided to do the math. (See "Estimating the size of a clustered index" in Books Online)

    But why a clustered index? That table might be a heap! So you might need only 99.7% of the amount of space you've calculated. :hehe:

    The answer to that question is right on the next line in the post you quoted:

    "A relational table should always have a primary key to be useful."

    (And with hindsight, I guess I could have left out the last three words of that sentence).

    You could of course create the primary key as a nonclustered index. In that case, you would save 0.3% space for the table itself, but you'd need over 100% extra space to store the nonclustered index. (If anyone feels inclined to do the math - be my guest. I've got too much of a headache at the moment).


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/
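
    Not the full nonclustered-index calculation, but the first half of that estimate - the saving from dropping the non-leaf levels - is easy to sanity-check from the page counts already posted above (a rough sketch, taking those figures as given):

      -- A heap needs only the leaf-level pages; dropping the non-leaf levels of the
      -- clustered index saves roughly 0.2% - the same ballpark as the 0.3% above.
      DECLARE @leaf_pages  decimal(38,0) = 15957391067222796;  -- leaf level, from the calculation above
      DECLARE @total_pages decimal(38,0) = 15990985574732742;  -- all levels, from the calculation above

      SELECT CAST(100.0 * (@total_pages - @leaf_pages) / @total_pages AS decimal(6,3))
             AS pct_saved_without_nonleaf_levels;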

  • Revenant (1/20/2012)


    I thought about this for a while, and I think it would make sense to redefine COUNT() as a function returning BIGINT. After all, there is an implicit conversion from BIGINT to INT, so if a variable in your script is declared as INT and the returned value overflows it, you are no worse off than when COUNT overflows inside the function itself.

    Your thoughts, ladies and gentlemen?

    You could consider making a suggestion on Connect. But my prediction is that it will be closed with an explanation telling you to use COUNT_BIG if you want a BIGINT result, and that a change like you suggest could introduce backwards-compatibility issues for people who might have some weird code that actually relies on COUNT overflowing once it goes past the integer domain.


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • Revenant (1/20/2012)


    Hugo Kornelis (1/20/2012)


    Richard Warr (1/20/2012)


    Or from http://msdn.microsoft.com/en-us/library/ms175997.aspx we see that COUNT() returns an Integer. What happens to SELECT COUNT(1) FROM myTable when myTable contains more rows than an integer can store?

    You get an error message.

    So if you ever need to get a rowcount on a table that might exceed 2,147,483,647 rows, use COUNT_BIG() instead - this function returns bigint, with an upper limit of 9,223,372,036,854,775,807 rows. If you exceed that, I don't know what happens - but in that case, I would definitely like 0.01% of your hardware budget! 😉

    I thought about this for a while, and I think it would make sense to redefine COUNT() as a function returning BIGINT. After all, there is an implicit conversion from BIGINT to INT, so if a variable in your script is declared as INT and the returned value overflows it, you are no worse off than when COUNT overflows inside the function itself.

    Your thoughts, ladies and gentlemen?

    I think it's an excellent idea.

    Tom

  • Hugo Kornelis (1/21/2012)


    L' Eomot Inversé (1/20/2012)


    Hugo Kornelis (1/20/2012)


    Just for kicks, I decided to do the math. (See "Estimating the size of a clustered index" in Books Online)

    But why a clustered index? That table might be a heap! So you might need only 99.7% of the amount of space you've calculated. :hehe:

    The answer to that question is right on the next line in the post you quoted:

    "A relational table should always have a primary key to be useful."

    (And with hindsight, I guess I could have left out the last three words of that sentence).

    You could of course create the primary key as a nonclustered index. In that case, you would save 0.3% space for the table itself, but you'd need over 100% extra space to store the nonclustered index. (If anyone feels inclined to do the math - be my guest. I've got too much of a headache at the moment).

    Ah, but in SQL, tables are often not relational; you can make tables without primary keys. I won't claim they are useful, but people do use them. 😉

    Besides, your post specified a table "with primary key violations" :w00t:; I don't actually think that it can be done in SQL Server, but I decided to pretend it could be since you had (typing error, I'm sure) suggested it. So my primary key index was going to be very small, several orders of magnitude less than that 0.3% reduction in table size, because no two rows in the heap would contain distinct values (if we're going to have primary key violations, let's do it thoroughly :cool:).

    Tom

  • Thank you for the question

    Iulian
