Improving intentionally denormalized table

  • I had a job interview about two weeks ago for a position providing support to teams of public health researchers; it called for basic familiarity with database design and performance.

    One of the questions was about a database table created by some researchers that consisted of over 100 columns and several thousand rows. The rows were names of geographical regions and the columns were counts of various kinds related to those regions.

    An interviewer said that the data in the table was used to fill in colors and other details of a map on a web page by fetching all of the data for each region and passing it to the client. The data had formerly been stored in over 100 separate tables, but it was moved into one large table because the large number of joins required by the older implementation made updating the web page slow. He showed me a display of the table and asked, "Do you notice any possible problems with this data?" "Yes," I said, "a huge number of the values are zero, so there is a lot of wasted space in the table."

    That was apparently the right answer, as far as it goes, but the next part of the question, of course, was what to do to improve it, and I didn't know what to say. I guess that in the previous implementation each table was much smaller, containing only the rows with a non-zero value for the particular count stored in that table. In the large table we were looking at, any region with a non-zero value for any of the counts had to appear as a row with zeros in most of the columns. You could probably save a lot of space by going back to the implementation with 100 one-column tables, if you could figure out some way to avoid the huge number of joins needed to obtain the counts for the map display.

    Is going back to the one-table-per-count-type approach, with some way to cache the data for use by the web page, the right direction to proceed? If so, what would be a good way to cache the data? If not, what would be the best way to store the data efficiently while also providing it to the web client efficiently?

    Thanks,

    Mike

  • I would think a better design for something like that is to have two tables.

    Table 1. Region table with an integer ID column.

    Table 2. ResearchType table with a reference to the region integer ID, a ResearchType column, and a numeric column for the count.

    With this structure you have one record for each ResearchType for each Region where the count is not 0. This might require some development overhead on the application side, but it's a good solution when you have many sparsely populated columns. This design would also let you add new research types without the overhead of changing your table schema to add a column. (Look into EAV.)
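    To make the idea concrete, here is a minimal sketch of that two-table design, using SQLite from Python purely for illustration; the table and column names (Region, RegionCount, ResearchType, CountValue) are assumptions, not anything from the thread:

    ```python
    import sqlite3

    # Sketch of the two-table design described above. Only non-zero
    # counts are stored; a zero is implied by the absence of a row.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Region (
            RegionID   INTEGER PRIMARY KEY,
            RegionName TEXT NOT NULL
        );
        CREATE TABLE RegionCount (
            RegionID     INTEGER NOT NULL REFERENCES Region(RegionID),
            ResearchType TEXT    NOT NULL,
            CountValue   INTEGER NOT NULL,
            PRIMARY KEY (RegionID, ResearchType)
        );
    """)

    conn.execute("INSERT INTO Region VALUES (1, 'North District')")
    conn.execute("INSERT INTO RegionCount VALUES (1, 'flu_cases', 42)")
    conn.execute("INSERT INTO RegionCount VALUES (1, 'measles_cases', 3)")

    # All counts for one region come back in a single query -- no wide
    # row full of zeros, and no 100-way join.
    rows = conn.execute(
        "SELECT ResearchType, CountValue FROM RegionCount WHERE RegionID = ?",
        (1,),
    ).fetchall()
    print(dict(rows))  # {'flu_cases': 42, 'measles_cases': 3}
    ```

    The composite primary key on (RegionID, ResearchType) both prevents duplicate counts and gives the lookup for a region an index to use.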

  • richykong (11/12/2012)


    I would think a better design for something like that is to have two tables.

    Table 1. Region table with an integer ID column.

    Table 2. ResearchType table with a reference to the region integer ID, a ResearchType column, and a numeric column for the count.

    With this structure you have one record for each ResearchType for each Region where the count is not 0. This might require some development overhead on the application side, but it's a good solution when you have many sparsely populated columns. This design would also let you add new research types without the overhead of changing your table schema to add a column. (Look into EAV.)

    That's what I'd probably do, although I'd also have a reference table for the ResearchType just for the sake of DRI.

    Another option would be to change all the zeros to NULLs and use sparse columns but I like that option a whole lot less.
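    As a rough sketch of that reference-table variant, again in SQLite for illustration only (the names ResearchType, RegionCount, and so on are hypothetical), the lookup table lets declarative referential integrity reject counts for unknown research types:

    ```python
    import sqlite3

    # Reference-table variant: a ResearchType lookup table plus a foreign
    # key gives DRI, so only known count types can appear in RegionCount.
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
    conn.executescript("""
        CREATE TABLE ResearchType (
            ResearchTypeID INTEGER PRIMARY KEY,
            TypeName       TEXT NOT NULL UNIQUE
        );
        CREATE TABLE RegionCount (
            RegionID       INTEGER NOT NULL,
            ResearchTypeID INTEGER NOT NULL
                           REFERENCES ResearchType(ResearchTypeID),
            CountValue     INTEGER NOT NULL,
            PRIMARY KEY (RegionID, ResearchTypeID)
        );
    """)
    conn.execute("INSERT INTO ResearchType VALUES (10, 'flu_cases')")
    conn.execute("INSERT INTO RegionCount VALUES (1, 10, 42)")  # OK: type 10 exists

    # A row referencing an unknown research type is rejected by the FK.
    try:
        conn.execute("INSERT INTO RegionCount VALUES (1, 99, 7)")
        rejected = False
    except sqlite3.IntegrityError:
        rejected = True
    print(rejected)  # True
    ```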

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.



  • Both of your answers make all kinds of sense (and even more so after I looked up what EAV and DRI stand for), and I wish I had thought of that solution during my interview. Thank you very much.

  • Just guessing a bit here since I don't know what data is in those 100 columns...

    But wouldn't it make more sense to have a set of normalized tables with properly-sized data types and then create a schema-bound view with indexes that could be called by the web page?

    Surely the data has to be maintained, and in my experience a denormalized flat-file approach usually turns into a nightmare for some poor developer who gets handed the job to "fix this" after something breaks or performance begins to crawl.
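    The gist of that suggestion can be sketched as follows: keep the counts normalized, and expose a view shaped like the wide table the web page expects. SQL Server can materialize such a view with SCHEMABINDING and an index; the SQLite view below is not materialized and is only an illustration, with all names assumed:

    ```python
    import sqlite3

    # Keep storage normalized, and let a pivoting view rebuild the wide,
    # one-row-per-region shape for the map display, filling absent counts
    # with 0. (In SQL Server an indexed, schema-bound view could
    # precompute this; a plain SQLite view recomputes it on each read.)
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE RegionCount (
            RegionID     INTEGER NOT NULL,
            ResearchType TEXT    NOT NULL,
            CountValue   INTEGER NOT NULL
        );
        INSERT INTO RegionCount VALUES (1, 'flu_cases', 42);
        INSERT INTO RegionCount VALUES (1, 'measles_cases', 3);
        INSERT INTO RegionCount VALUES (2, 'flu_cases', 5);

        CREATE VIEW RegionMap AS
        SELECT RegionID,
               SUM(CASE WHEN ResearchType = 'flu_cases'
                        THEN CountValue ELSE 0 END) AS flu_cases,
               SUM(CASE WHEN ResearchType = 'measles_cases'
                        THEN CountValue ELSE 0 END) AS measles_cases
        FROM RegionCount
        GROUP BY RegionID;
    """)
    rows = conn.execute("SELECT * FROM RegionMap ORDER BY RegionID").fetchall()
    print(rows)  # [(1, 42, 3), (2, 5, 0)]
    ```

    Note that region 2 gets a 0 for measles_cases even though no such row is stored, which is exactly the sparse-storage trade-off discussed above.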


  • That is a great suggestion too, Steven. By the way, is schema binding of views a feature specific to SQL Server, or do other SQL database systems provide something like it too? At the place I interviewed they use more than one of them.
