When is too many columns too many columns?

  • I know you can have some insane number of columns in SQL server... but just because you could, doesn't mean you should.  Right now I am battling with a table that has almost 900 columns.  34 of those columns are dedicated to integer values which define what user is defined to what part of the process.  280 columns are defined to statuses of the particular thing.  It is actually 140 datetime values, but when it was built, it was built to split datetime into two columns, one for date and one for time.

    Locking is the first thing that comes to mind as a high traffic table is going to lock the entire thing up if you are updating one or many of the columns...

  • Kevlarmpowered - Friday, January 12, 2018 3:25 PM

    I know you can have some insane number of columns in SQL server... but just because you could, doesn't mean you should.  Right now I am battling with a table that has almost 900 columns.  34 of those columns are dedicated to integer values which define what user is defined to what part of the process.  280 columns are defined to statuses of the particular thing.  It is actually 140 datetime values, but when it was built, it was built to split datetime into two columns, one for date and one for time.

    Locking is the first thing that comes to mind as a high traffic table is going to lock the entire thing up if you are updating one or many of the columns...

    There's actually nothing wrong with a 900 column table IF the rules of normalization were applied... which also usually means that you won't have a table with 900 columns. 😉

    I've worked with tables with similar insane numbers of columns.  Most of those had to deal with CDRs (Call Detail Records).  For those, it made no sense to split the columns off into separate tables.

    If you status/datetime columns have a lot of NULLs in them, it may be worthwhile to build a "sister" table in the form of an EAV to store those statuses.  A whole lot of people will tell you that EAVs are the wrong way to go but this is one of those places where they can really shine if you put some thought into it.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • I'm with Jeff. The right number of columns is the number of columns needed to support the object stored inside the database, whether normalized or in a star schema, whatever. There isn't a number that's too many.

    That said, it sounds like, based on your description, that you're not dealing with a properly structured table. 180 columns to define a user and 280 columns to define a thing... That can't possibly be normalized or a fact table. It sounds like multiple functions have been jammed into a single table. In that case, it's not that there are too many columns, but that the structure is wrong. Bad structures will absolutely cause performance headaches, maintenance problems and even development and testing problems. Just overloading a column causes problems, let alone overloading a table.

    ----------------------------------------------------
    The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood... Theodore Roosevelt
    The Scary DBA
    Author of: SQL Server 2017 Query Performance Tuning, 5th Edition and SQL Server Execution Plans, 3rd Edition
    Product Evangelist for Red Gate Software

  • Grant Fritchey - Monday, January 15, 2018 7:33 AM

    I'm with Jeff. The right number of columns is the number of columns needed to support the object stored inside the database, whether normalized or in a star schema, whatever. There isn't a number that's too many.

    That said, it sounds like, based on your description, that you're not dealing with a properly structured table. 180 columns to define a user and 280 columns to define a thing... That can't possibly be normalized or a fact table. It sounds like multiple functions have been jammed into a single table. In that case, it's not that there are too many columns, but that the structure is wrong. Bad structures will absolutely cause performance headaches, maintenance problems and even development and testing problems. Just overloading a column causes problems, let alone overloading a table.

    +100
    😎

    There are exceptions to this though, i.e. CDRs, Surveys etc. In those cases I'd recommend looking into the sparse column features and limitations.

  • 900 column table must be having a lot of indexes and a complex primary key. What will happen to Update and Delete performance on such tables?

  • T.Ashish - Monday, January 15, 2018 11:31 PM

    900 column table must be having a lot of indexes and a complex primary key. What will happen to Update and Delete performance on such tables?

    Lots of activity.

    Seriously though, pretty much what you would expect. You have to write the data once to the store (heap or clustered) and then you have to do the appropriate writes to each applicable index. Any of those writes could lead to the addition of new pages or page splits, with a whole bunch more writes. You do that every single time.

    I've said it before, I'll say it again, a properly normalized database enhances performance within an RDBMS. If you really only need a flat file, use a flat file storage system or a document database, not an RDBMS. They will work better with relational data than they work with non-relational data. They will work better with a star schema than with a flat file system. The whole "flat tables are faster" myth makes me insane. It generally comes from "flat tables are faster than my horrifically bad normalization/poor choice in fact table definition." Yes, writing mediocre code runs faster than writing bad code. Well done. What if we actually write good code?

    ----------------------------------------------------
    The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood... Theodore Roosevelt
    The Scary DBA
    Author of: SQL Server 2017 Query Performance Tuning, 5th Edition and SQL Server Execution Plans, 3rd Edition
    Product Evangelist for Red Gate Software

  • Grant Fritchey - Tuesday, January 16, 2018 6:13 AM

    If you really only need a flat file, use a flat file storage system or a document database, not an RDBMS.

    I'll say, "It Depends".  Such a thing is certainly better than storing large XML or JSON junk and may be entirely necessary because an RDBMS will still handle files (once imported into a table) better than a whole lot of other things especially if it is data such as CDRs or similar information.

    Yes, writing mediocre code runs faster than writing bad code. Well done. What if we actually write good code?                       

    +1000 to that!

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Sorry... posted to wrong thread on this post and moved it.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Thanks all for your input... unfortunately, I didn't build these tables.  They were built as data warehouse tables because they didn't want users writing SQL, so to make life easier, they made tables that had lots of columns so end users could basically select * from factTable where createDate >= dateadd(year,-2,getdate()).  To avoid having to do "joins" they ETL everything into single tables for users too.  So there is lots of repetition.  This table likely has 900 columns of which a bunch are duplicated in other tables... but because they say "joins are bad" they ETL everything into one giant flat table.  Sometimes it's values... sometimes it is just a flag 0 or 1.

    My thought at doing this was to create a sister table with a 1:1 relationship with the original fact table so that I could offload the user assignment and date and time columns to a separate table, but that's still pushing 314 columns into a table.  My next thought was to pivot those columns and have a table dedicated to statuses.  So for every "thing" I would have a 1:many relationship.  Each "thing" could have multiple users assigned (bonus it would allow multiple people with the same role to be assigned to a thing which can't be done now -- if multiple people are assigned, to a thing, the most recent assignment overwrites the existing assignment).  As it stands, eventually all of those 34 user assignment columns will be populated and each of those 280 date and time columns will have values in them as well.

    I was just trying to wrap my head around this because yes you can have 1000+ columns per table... but my brain says it's a bad idea because locking.  As the activity on the table grows, each update statement is going to lock the entire row.  If we are rapidly reassigning and tagging dates/times, that's going to be a problem with growth.  Going verticle would increase the number of rows, but would lessen the row locking as things are inserted/updated.

    For explanation purposes ... a "thing" is assigned to a user.  A user can do many different steps which is why there are 34 user assignment columns and 240 date and time columns.  One user could do 10-12 steps to a particular "thing".

  • Grant Fritchey - Tuesday, January 16, 2018 6:13 AM

    The whole "flat tables are faster" myth makes me insane. It generally comes from "flat tables are faster than my horrifically bad normalization/poor choice in fact table definition." Yes, writing mediocre code runs faster than writing bad code. Well done. What if we actually write good code?

    I'm 100% going to use this as I came across a table today with 1060 columns... 

    I think if I recommend abandoning our RDBMS for a flat file system my mugshot will end up on a wall with the title of "wanted dead or alive".

  • You might want to explore putting this into a columnstore index, see if that helps. It really depends on the types of queries being run against it, but pivoted storage, which you're looking at anyway, plus compression... it might make a big difference (might not).

    ----------------------------------------------------
    The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood... Theodore Roosevelt
    The Scary DBA
    Author of: SQL Server 2017 Query Performance Tuning, 5th Edition and SQL Server Execution Plans, 3rd Edition
    Product Evangelist for Red Gate Software

  • That can't possibly be normalized or a fact table.

    Without a review of the data, this cannot be definitively established.  That said, that is a lot of columns.  I've never had anything close to that. 

    If it is a fact table, which isn't clear to me that it is, and it is otherwise correctly modelled, the number of columns could be reduced with the creation of one or more junk dimensions.  It it's an OLTP table and normal forms have been followed, if there are columns that don't apply to many or most records, these could be broken out into a smaller table.  [The theoretical example of a company car comes to mind: if most people don't have one, but no one has more than one, it's probably better that this is it's own table even though it could be in the employee table].

    As you say, you didn't build the tables, so you may be stuck.  Hopefully the table has a proper key and you can walk through the normal forms if it's an OLTP table or establish that it's all at the same level of granularity if an OLAP table.

  • It is a fact table... 

    Before I came across this, I had never seen a table as wide as this before so I just wanted to consult with the online community to see what the experts say if I was crazy or there was something I was missing.  I personally would have broken it down to be more normalized.  For example, there are 200 columns dedicated to status... each "thing" could have 100 statuses defined by a datetime (but since they split the datetime into two columns one date and one time... it's 200 columns).  If the "thing" does not reach that status, they populate the columns with a fake null 19000101 date and 000000 time.  If the "thing" reaches the status it gets populated with a date and time column.  If the thing hits a status twice, the newer status date overwrites the older status date and time columns.

    Imagine this over 200 columns

    IDStatus1DateStatus1TimeStatus2DateStatus2TimeStatus3DateStatus3TimeStatus4DateStatus4Time
    1234520180202 27900 190001010190001010201707150
  • Breaking out the date and time is the correct approach if both items are dimensionalized.  Otheriwse the row count gets out of hand quickly. 

    As for the dates, if it's an accumulating snapshot fact table (and it sounds as if it is) lots of dates are not uncommon.  In fact, it's a typical characteristic.  I've designed ones that captured the major milestones: create date/create time, customer contact date/customer contact time, scheduled date/scheduled time, closed date/closed time.  Not all the dates had times dimensions associated with them.  Not all statuses have independent columns however.  There are two dates that map to 'N/A' and 'Unknown.'  An open order does not have a closed date (NULL in the OLTP system, but 'N/A' in the warehouse).  If it's in one of the closed statuses and does not have a closed date, a NULL value is then mapped to Unknown.  These are the only two dates with a string that is not a date.  All the other dates have string values that displays the date. 

    Based on what you're saying, on it's face it doesn't sound wrong.  Without reviewing data that can't be known for certain.  Hopefully this helps you decided.  You could always try a fact table that has a granularity of item status change.  If that meets your reporting needs it would save a lot of space.  But that would make some things harder to report on than your current model.

  • Oh... I already reverse engineered it and unpivoted it back to a vertical for my data model.  There are many more rows, but the import and rendering in PowerBI works significantly faster/better.  This added overhead though worries me should they decide to change something in the base layer.  I'll have to chase it down in my reverse engineering again.

    I also took out the fake nulls and the data size dropped significantly.  I handle the NULLs on the application side converting NULL to either "not applicable" or "not available".  Moving less data helped the import process... there was a 40% data savings but removing the fake nulls.  It was tricky to determine when n/a was actually a NULL versus when the user did not answer the question.  "not available" versus "not answered".

Viewing 15 posts - 1 through 15 (of 48 total)

You must be logged in to reply to this topic. Login to reply