When is too many columns too many columns?

  • Again, you'll have to go with what works for you.  I don't know enough about the data.  In my system, an order is going to eventually be closed, whether via completion or cancellation.  If I were to pivot my table to change the granularity, there would be some loss of data for me, such as completion %, or items that relate directly to the work order, like QA result.  So be careful.  Just because you saved 40% on space doesn't make it a good idea.  But if it allows you to better report on the things on which you need to report, then maybe it is.  And maybe you need one fact table with the original granularity and another with the new granularity.
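
    Purely as a sketch of that last point (the table and column names below are invented, not the OP's actual schema), keeping both grains side by side might look something like this:

        -- One row per work order: keeps order-level facts such as completion % and QA result
        CREATE TABLE dbo.FactWorkOrder
        (
            WorkOrderKey  int          NOT NULL PRIMARY KEY,
            DateKeyClosed int          NULL,
            CompletionPct decimal(5,2) NULL,
            QAResultKey   int          NULL
        );

        -- One row per work order per status change: the finer grain
        CREATE TABLE dbo.FactWorkOrderStatus
        (
            WorkOrderKey  int NOT NULL,
            StatusKey     int NOT NULL,
            StatusDateKey int NOT NULL,
            CONSTRAINT PK_FactWorkOrderStatus PRIMARY KEY (WorkOrderKey, StatusKey, StatusDateKey)
        );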

  • This is not an uncommon thing.
    End-users are typically comfortable using Excel, so giving them the ability to "pull in everything they need" from one place and letting them do whatever they want saves lots of developer and DBA time and money!

    What may make sense is to create MORE of these big ol' flat tables that break things into more narrowly defined sets.

    Michael L John
    If you assassinate a DBA, would you pull a trigger?
    To properly post on a forum:
    http://www.sqlservercentral.com/articles/61537/

  • What may make sense is to create MORE of these big ol' flat tables that break things into more narrowly defined sets.

    As it's a fact table, breaking it out into multiple fact tables is probably not a good idea unless there's some operational value to the division.  As a good rule of thumb, fact tables with the same dimensionality should be combined for both performance and ease of analysis.  I've separated financial accounts into balance accounts and operational accounts even when they had the same dimensionality, because the metrics behave differently and there's operational value in separating the two.  But there's also nothing wrong with keeping them combined, and I've seen it both ways.

  • I came across this link and I was curious what this group had to say about it, especially when it comes to large tables...

    http://michaeljswart.com/2013/05/swarts-ten-percent-rule/

    Number of Columns in a Table

    Max Size: 1024 Columns
    10% of restriction: 102 columns
    Risk of doing it wrong: High

    Maybe you need a lot of columns because you’re treating a table like a spreadsheet. The worst offender I’ve ever seen had a whopping 96 columns. It was designed by someone who didn’t know any better. By the way, the most notable modeling smell coming from this table was the columns named [Int01] to [Int30]. In this case, the database designer was definitely doing it wrong.

  • 900 columns is not per se an issue.

    But this particular table could have been and should have been further normalized.

    You've made the mistake of assuming that the table structure must match what the user sees.  The internal tables could be much more normalized, but the user could still see a single row with masses of columns instead.
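
    As a rough illustration of that idea (all names here are hypothetical, not the OP's schema): the internals can be normalized and a view can present the wide row the users are used to.

        -- Normalized internal tables
        CREATE TABLE dbo.WorkOrder
        (
            WorkOrderID int  NOT NULL PRIMARY KEY,
            OrderDate   date NOT NULL
        );

        CREATE TABLE dbo.WorkOrderStatus
        (
            WorkOrderID int         NOT NULL REFERENCES dbo.WorkOrder (WorkOrderID),
            StatusCode  varchar(20) NOT NULL,
            StatusDate  datetime2   NOT NULL,
            PRIMARY KEY (WorkOrderID, StatusCode)
        );
        GO

        -- The wide shape the user sees is just a presentation layer
        CREATE VIEW dbo.vWorkOrderWide
        AS
        SELECT  wo.WorkOrderID,
                wo.OrderDate,
                MAX(CASE WHEN s.StatusCode = 'RECEIVED' THEN s.StatusDate END) AS ReceivedDate,
                MAX(CASE WHEN s.StatusCode = 'SHIPPED'  THEN s.StatusDate END) AS ShippedDate
                -- ...one expression per status column the users expect
        FROM    dbo.WorkOrder wo
        LEFT JOIN dbo.WorkOrderStatus s
                ON s.WorkOrderID = wo.WorkOrderID
        GROUP BY wo.WorkOrderID, wo.OrderDate;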

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • Kevlarmpowered - Tuesday, September 4, 2018 6:37 AM

    I came across this link and I was curious what this group had to say about it, especially when it comes to large tables...

    http://michaeljswart.com/2013/05/swarts-ten-percent-rule/

    Number of Columns in a Table

    Max Size: 1024 Columns
    10% of restriction: 102 columns
    Risk of doing it wrong: High

    Maybe you need a lot of columns because you’re treating a table like a spreadsheet. The worst offender I’ve ever seen had a whopping 96 columns. It was designed by someone who didn’t know any better. By the way, the most notable modeling smell coming from this table was the columns named [Int01] to [Int30]. In this case, the database designer was definitely doing it wrong.

    It's a facile rule.  The theory behind normalization is somewhat complex, and this website tries to boil it down to a superficial rule.  The link itself even indicates that it's oversimplified when it says that it doesn't really apply to the length of identifier names.

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • One of the big problems that people tend to either be totally unaware of or just forget is the issue of wasted disk space and wasted memory.  For example, if your rows are consistently 7000 bytes wide, only one row fits on each 8K page, so you're consistently wasting about 12.5% of your memory, disk space, and restore time, and you're making updates more costly in the form of "bad" page splits and the increased index maintenance those splits cause, if it turns out that read-aheads do actually cause a performance problem for you.

    The other thing that people forget is that some of the columns are never going to be used in queries and some may only be used occasionally.  That means that anything that actually does end up using the Clustered Index is going to be inherently much slower, because it'll need to read a shedload more pages from disk into memory (if not already cached) and use a shedload more memory than something more targeted.

    The concept of "Sister" tables to handle such rows works very well in the right hands.
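
    A minimal sketch of that "sister table" idea (hypothetical names): keep the hot, frequently queried columns in a narrow table and push the wide, rarely used columns into a 1-to-1 sister keyed the same way.

        -- Hot columns: narrow rows, many rows per page, cheap to scan and update
        CREATE TABLE dbo.[Order]
        (
            OrderID    int     NOT NULL PRIMARY KEY CLUSTERED,
            CustomerID int     NOT NULL,
            OrderDate  date    NOT NULL,
            StatusCode tinyint NOT NULL
        );

        -- Cold/wide columns: same key, 1:1, only read when actually needed
        CREATE TABLE dbo.OrderWideDetail
        (
            OrderID      int            NOT NULL PRIMARY KEY CLUSTERED
                         REFERENCES dbo.[Order] (OrderID),
            Notes        nvarchar(4000) NULL,
            Measurements varchar(2000)  NULL
            -- ...the rest of the rarely used columns
        );

    Queries that only need the hot columns never touch the wide sister; the occasional report joins the two on OrderID.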

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I had a Dimension table that had over 520 columns.
    The updates to the table almost ground the system to a halt due to the high number of page splits.
    The fix was to split the table into two, putting the most commonly used columns in the first table.

  • Jonathan AC Roberts - Tuesday, September 4, 2018 12:20 PM

    I had a Dimension table that had over 520 columns.
    The updates to the table almost ground the system to a halt due to the high number of page splits.
    The fix was to split the table into two, putting the most commonly used columns in the first table.

    Glad I'm not the only one that has run into that.  Thanks for the feedback, Jonathan.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • You may wish to create a view emulating the fat table and reading data from properly (well, let's say better) normalised tables.

    An INSTEAD OF UPDATE trigger on that view can play the role of a procedure that saves the data.
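
    Something along these lines (sketch only; the view and base table names are hypothetical, assuming a wide view over normalized tables like the one described above):

        -- Writes against the wide view get redirected to the normalized base tables
        CREATE TRIGGER dbo.trg_vWorkOrderWide_Update
        ON dbo.vWorkOrderWide
        INSTEAD OF UPDATE
        AS
        BEGIN
            SET NOCOUNT ON;

            -- Example: the ShippedDate "column" changed; persist it as a status row
            UPDATE s
            SET    s.StatusDate = i.ShippedDate
            FROM   dbo.WorkOrderStatus s
            JOIN   inserted i
                   ON i.WorkOrderID = s.WorkOrderID
            WHERE  s.StatusCode = 'SHIPPED'
                   AND i.ShippedDate IS NOT NULL;
        END;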

    But if they say "joins are bad" then views are effectively outlawed.

    And I agree with what's been said earlier - without joins an RDBMS has no advantage compared to flat-file storage; it only adds overhead.

    Well, maybe transactional support and data integrity control could be named as advantages, but I bet you can find (NOLOCK) in every query, which eliminates those advantages.

    _____________
    Code for TallyGenerator

  • Well, maybe transactional support and data integrity control could be named as advantages, but I bet you can find (NOLOCK) in every query, which eliminates those advantages.

    If you're talking about my designs, you just lost your bet.  I do not allow NOLOCK except in very controlled ETL situations involving staging tables.

  • RonKyle - Tuesday, September 4, 2018 8:22 PM

    Well, maybe transactional support and data integrity control could be named as advantages, but I bet you can find (NOLOCK) in every query, which eliminates those advantages.

    If you're talking about my designs, you just lost your bet.  I do not allow NOLOCK except in very controlled ETL situations involving staging tables.

    I have no idea about your designs.
    I can only discuss the design posted by the OP.

    When they update one of 100 statuses in a record they rewrite the whole row.
    Which means the row is exclusively locked.
    900 columns are not likely to fit into 1 page.
    So, the lock goes beyond a single page lock.
    With so many columns they need quite a number of indexes to have their queries completed before the end of the day.
    Pages occupied by those indexes will also be locked.
    Updating a record of 900 columns takes time.
    Considering the overhead caused by the rows exceeding a single page, it's rather a long time.
    Every record has 100 statuses, which means it has to be updated at least 100 times.
    Concurrent updates will increase the locking time exponentially.
    And if SELECT queries have applied shared locks, the updates have to stop and wait until the shared locks are removed, holding other pages under X locks in the meantime.
    Which blocks other SELECT queries, as they cannot apply their S locks to the already X-locked pages.

    It all leads to an avalanche of locking and deadlocking, with the only way out being (NOLOCK) all over the place.

    Well, there is another way.
    Accumulate all updates into scheduled batches and run them all in a single transaction with TABLOCKX, whatever time it takes to complete.
    Then for the rest of the day the table is effectively read-only, so (NOLOCK) is not required.
    But I doubt it's the case here.
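
    Roughly like this (hypothetical table and column names, just to show the shape of the batch):

        -- Scheduled batch: lock the whole table exclusively once, apply everything in one pass
        BEGIN TRANSACTION;

            UPDATE f
            SET    f.Status047 = u.NewStatus
            FROM   dbo.FactWideOrders f WITH (TABLOCKX)
            JOIN   dbo.PendingStatusUpdates u
                   ON u.OrderID = f.OrderID;

        COMMIT TRANSACTION;
        -- For the rest of the day the table is effectively read-only, so readers don't need (NOLOCK)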

    _____________
    Code for TallyGenerator

  • Jeff Moden - Tuesday, September 4, 2018 11:38 AM

    One of the big problems that people tend to either be totally unaware of or just forget is the issue of wasted disk space and wasted memory.  For example, if your rows are consistently 7000 bytes wide, only one row fits on each 8K page, so you're consistently wasting about 12.5% of your memory, disk space, and restore time, and you're making updates more costly in the form of "bad" page splits and the increased index maintenance those splits cause, if it turns out that read-aheads do actually cause a performance problem for you.

    The other thing that people forget is that some of the columns are never going to be used in queries and some may only be used occasionally.  That means that anything that actually does end up using the Clustered Index is going to be inherently much slower, because it'll need to read a shedload more pages from disk into memory (if not already cached) and use a shedload more memory than something more targeted.

    The concept of "Sister" tables to handle such rows works very well in the right hands.

    I've had tables of up to 220 fields because someone was building a massive pivot and outputting the results of said pivot to a table. Every field was used, basically by doing a SELECT * FROM the entire table. The table existed for consumer path analysis: each of the first 200 fields was a consumer step in their journey, with an obvious cap of 200 steps, and the remaining 20 fields were details about the final step (the conversion) the consumer took.

    In use, the user was exporting the entire table to do his/her own aggregation in Excel. In practice, the data could be unpivoted and the table thinned out to maybe 10 fields with more records (remember, it's up to 200 records per path plus 1 record for the conversion). Then whatever aggregation the user is doing can be done on the table in its vertical state rather than its horizontal (pivoted) state, and the output can be just the final results in a much smaller crosstab (see the sketch below).

    The main reason it was 220 fields is that the end users lacked any SQL knowledge and didn't really comprehend how you could aggregate the data in a vertical table without pivoting it first. But that's why I was there: to comprehend it and show them.
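
    For what it's worth, a rough sketch of that unpivot (the column and table names are invented; only a few of the 200 step columns are shown):

        -- Turn Step001..Step200 columns into rows: one row per consumer per step taken
        SELECT  p.ConsumerID,
                s.StepNumber,
                s.StepValue
        FROM    dbo.ConsumerPathWide p
        CROSS APPLY (VALUES (1,   p.Step001),
                            (2,   p.Step002),
                            (200, p.Step200)) AS s (StepNumber, StepValue)
        WHERE   s.StepValue IS NOT NULL;   -- drop the unused trailing steps

    The aggregation then runs against the thin vertical result instead of a 220-column export.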

    On another note, my main fact table is 100+ fields of metrics and dimensions. But it's in a columnstore and I rarely update the table. ETL inserts new records and potentially drops old ones for bad publishes. Data marts have much thinner versions for reporting purposes.

  • Sergiy - Tuesday, September 4, 2018 4:46 PM

    You may wish to create a view emulating the fat table and reading data from properly (well, let's say better) normalised tables. INSTEAD OF UPDATE trigger on that view can play a role of procedure saving the data. But if they say "joins are bad" then views are effectively outlawed. And I agree with what's been said earlier - without joins RDMS has no advantage comparing to flat file storage, it only adds overheads. Well, may be transactional support and data integrity control might be named advantages, but I bet you can find (NOLOCK) in every query, which eliminates those advantages.

    NOLOCK is constantly being pushed on analysts and developers as a necessity ... until the data consumers fully understand the problems NOLOCK causes.  Then they fight back, saying the system needs to be able to handle the load of their queries without the fear of dirty reads.

    The locking here is such a constant problem that they are forever killing connections that are locking tables, just so the ETL processes can run to completion and other systems can get timely data.  I don't think we have a significant amount of data.  I'm used to dealing with tables with billions of records without an issue... here I have a few million records, and the locking and concurrency for a handful of users is driving me nuts.

  • Re: NOLOCK
    As background, here are some of the bad things that can happen under the default READ COMMITTED (RC) isolation that seem to be forgotten when discussing NOLOCK:

    Missing previously committed rows
    Committed rows encountered multiple times
    Different committed versions of the same row encountered in a single statement/query plan
    Committed data from different points in time in the same row (but different columns)
    Committed data reads that appear to contradict enabled and checked constraints

    Thus, if you want truly clean data results, you must use a higher level of isolation than RC.

    NOLOCK, i.e. READ UNCOMMITTED (RU) is, yes, even worse, in that it also allows "dirty" reads, reads of uncommitted data.  Thus, the issues are almost exclusively with data that is being modified at the time (and it adds a new way to get duplicate rows).  But for static rows, such as historical tables, NOLOCK is not an issue.

    Overall, you must carefully weigh whether it's right or not for each specific read situation.  Simply banning NOLOCK is a rather sledgehammer approach.

    I've heard horror stories of NOLOCK queries that "took down the company".   Geez, use some common sense, and in that case use SERIALIZABLE.  It's almost as much malpractice to use RC in that kind of situation as it is to use RU.

    Yet another step is snapshot isolation, which effectively solves the issues with RC, but also has a relatively high cost.  There will definitely be tables and/or databases where you'll want to use it.  In those cases, NOLOCK should be extraordinarily rare (like a quick inquiry just to verify that a row exists), because RC is guaranteed to find consistent rows available to read.
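
    For reference, enabling it is straightforward (the database name below is just a placeholder).  READ_COMMITTED_SNAPSHOT changes the default RC behavior to read row versions instead of taking shared locks, and ALLOW_SNAPSHOT_ISOLATION makes full SNAPSHOT isolation available per session; both rely on the tempdb version store, which is the cost mentioned above.

        -- Readers stop blocking writers (and vice versa) under the default READ COMMITTED
        ALTER DATABASE MyDatabase SET READ_COMMITTED_SNAPSHOT ON;

        -- Allow sessions to opt in to full transaction-level snapshot consistency
        ALTER DATABASE MyDatabase SET ALLOW_SNAPSHOT_ISOLATION ON;

        -- In a session that needs a transaction-consistent view of the data:
        SET TRANSACTION ISOLATION LEVEL SNAPSHOT;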

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.
