Storing multiple variations of the same data?

  • Can anyone weigh in on why you would want to store multiple variations of the same data?  In all of my time, I have always seen the following columns.

    First Name
    Last Name
    Middle Name

    I came across a DWH that had the following columns, which perplexed me...

    First_Name
    Last_Name
    Middle_Name
    FullName
    FullName_Caps
    FullName_Caps_LastFirst
    FullName_LastFirst
    FullName_TitleCase
    FullName_LastFirst_TitleCase

    I can't think of any reason why you would want/have to store all the different permutations of this data.  Possible indexing/searching came to mind, but I wasn't sure and the people who built it are long gone.

  • It looks like they're all formatted differently; the reason is so that they can be pulled in that format without having to do the formatting at run time.  As for why, who knows... maybe whatever tool pulls the data isn't as good at formatting as whatever is loading the DW?

    They are all formatted differently... I have always just done the formatting at runtime because I thought the overhead of storing all of those permutations of the data would be huge: 50+ million rows of stored data versus upper(concat(lastname, ' ', firstname)).  I first thought they might be computed columns, but nope.  They are being pumped in via the ETL process...

    I thought there might be some hidden golden rule in the DWH toolkit book that I overlooked...
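
    For what it's worth, the computed-column route I first suspected would have looked something like this.  A minimal sketch with made-up table and column names; since CONCAT and UPPER are deterministic, the columns could be PERSISTED (and even indexed) instead of being loaded by the ETL:

    -- Hypothetical sketch: persisted computed columns store each formatted
    -- value once and keep it in sync with the base columns automatically.
    CREATE TABLE dbo.Person
    (
        PersonID    INT IDENTITY(1,1) PRIMARY KEY,
        First_Name  VARCHAR(50) NOT NULL,
        Last_Name   VARCHAR(50) NOT NULL,
        Middle_Name VARCHAR(50) NULL,
        FullName                AS CONCAT(First_Name, ' ', Last_Name) PERSISTED,
        FullName_Caps           AS UPPER(CONCAT(First_Name, ' ', Last_Name)) PERSISTED,
        FullName_Caps_LastFirst AS UPPER(CONCAT(Last_Name, ', ', First_Name)) PERSISTED
    );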

  • Kevlarmpowered - Thursday, May 10, 2018 11:55 AM

    They are all formatted differently... I have always just done the formatting at runtime because I thought the overhead of storing all of those permutations of the data would be huge: 50+ million rows of stored data versus upper(concat(lastname, ' ', firstname)).  I first thought they might be computed columns, but nope.  They are being pumped in via the ETL process...

    I thought there might be some hidden golden rule in the DWH toolkit book that I overlooked...

    It's probably to avoid doing it at runtime, in the interests of speedy retrieval. I can think of no other reason.

    If you haven't even tried to resolve your issue, please don't expect the hard-working volunteers here to waste their time providing links to answers which you could easily have found yourself.

  • Kevlarmpowered - Thursday, May 10, 2018 11:26 AM

    Can anyone weigh in on why you would want to store multiple variations of the same data?  In all of my time, I have always seen the following columns.

    First Name
    Last Name
    Middle Name

    I came across a DWH that had the following columns, which perplexed me...

    First_Name
    Last_Name
    Middle_Name
    FullName
    FullName_Caps
    FullName_Caps_LastFirst
    FullName_LastFirst
    FullName_TitleCase
    FullName_LastFirst_TitleCase

    I can't think of any reason why you would want/have to store all the different permutations of this data.  Possible indexing/searching came to mind, but I wasn't sure and the people who built it are long gone.

    Rarely does a decent data warehouse NOT have a certain amount of "de-normalized" data that provides useful ways to create dimensions without having to pre-process 50 million rows first.  Having all the variations at hand means NOT having to do a rather sizable bunch of work at run-time.  De-normalization is pretty much "normal" for a data warehouse, precisely because the volumes are usually large enough to justify not wasting processing time on getting a variation on your theme, so to speak.  Cube processing can benefit significantly from that kind of thing.

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • Data warehouses are often denormalized.  However, this data is so unlikely to be needed that all columns other than, say, FullName, should be put into a separate table, which can be joined only when needed.  1-1 table relationships are also a lot more frequent in DW and other BI-type (Business Intelligence) solutions.
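
    A rough sketch of that 1-1 split (table and column names invented for illustration): the hot column stays on the main table, the rarely needed variants sit in a side table on the same key, and the join happens only on demand.

    -- Hypothetical vertical split: rarely used formatted variants live
    -- in a separate 1-1 table keyed on the same PersonID.
    CREATE TABLE dbo.Person
    (
        PersonID INT PRIMARY KEY,
        FullName VARCHAR(101) NOT NULL
    );

    CREATE TABLE dbo.PersonNameFormats
    (
        PersonID           INT PRIMARY KEY REFERENCES dbo.Person (PersonID),
        FullName_Caps      VARCHAR(101) NOT NULL,
        FullName_LastFirst VARCHAR(102) NOT NULL,
        FullName_TitleCase VARCHAR(101) NOT NULL
    );

    -- Joined only when a formatted variant is actually requested:
    SELECT p.FullName, f.FullName_Caps
    FROM dbo.Person AS p
    JOIN dbo.PersonNameFormats AS f ON f.PersonID = p.PersonID;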

    SQL DBA, SQL Server MVP (07, 08, 09)
    A socialist is someone who will give you the shirt off *someone else's* back.

  • Really depends on the server too. In the MPP world, data is essentially stored in columnstores. You're paying for the extra compressed space to reduce the overhead of having to format the data when it's called up. You run into something similar if this is split across tables: in most instances you get one hashing key per table, which means you can essentially only index on one dimension, so data is duplicated to create more than one hashed table.
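
    On SQL Server, the closest analogue would be a clustered columnstore index.  A minimal sketch (table name invented); the repetitive character data in the formatted variants compresses well by column, which is the trade being described:

    -- Sketch: columnstore compression shrinks the redundant formatted
    -- columns, trading storage for not formatting at read time.
    CREATE TABLE dbo.PersonCS
    (
        PersonID      INT NOT NULL,
        FullName      VARCHAR(101) NOT NULL,
        FullName_Caps VARCHAR(101) NOT NULL
    );

    CREATE CLUSTERED COLUMNSTORE INDEX CCI_PersonCS ON dbo.PersonCS;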

  • I can understand the FullName equivalent, because indexing and/or searching on that is likely better than putting concat(a, ' ', b) on the left side of the search predicate... I'm wondering how much actual CPU time is saved versus the cost of the storage involved.  I can't imagine concat being an expensive function.
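
    To illustrate the seek-versus-scan point, a sketch (hypothetical table and index, assuming the stored column holds the same value the expression would produce):

    -- Non-sargable: functions wrapped around the columns on the left side
    -- of the predicate generally prevent an index seek, forcing a scan.
    SELECT PersonID
    FROM dbo.Person
    WHERE UPPER(CONCAT(First_Name, ' ', Last_Name)) = 'JOHN SMITH';

    -- Sargable: a plain stored column with its own index can be seeked.
    CREATE INDEX IX_Person_FullName_Caps ON dbo.Person (FullName_Caps);

    SELECT PersonID
    FROM dbo.Person
    WHERE FullName_Caps = 'JOHN SMITH';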

  • Personally, I think it's a totally unnecessary duplication of data, and it doesn't stop at just what is visible in the table.  It's also likely that the same people who rendered this expensive design also added an index to many, if not all, of the columns.  Don't forget that SQL Server can't actually use what is stored on disk.  It has to load it into memory before it can be used.

    What's really ironic about a lot of designers (certainly, not all of them) of DWs that I've seen is that they denormalize data for supposed purposes of performance and then they pitch a fit when it doesn't perform well because of the Catch-22 they've designed into their DWs.  Then they exacerbate the problems by adding a shedload of indexes and wonder why it takes a month of Sundays to update the data in their DWs, then wonder why they have huge page-split/fragmentation problems (because those same designers also don't know how to design the indexes correctly), and then also wonder why maintenance periods and backups take so long and why the backups are so large.
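
    If you want to see how bad it's gotten on such a table, the standard physical-stats DMV will report the fragmentation on every index (the object name here is a placeholder):

    -- Sketch: report fragmentation and size for each index on the table.
    SELECT i.name, ps.avg_fragmentation_in_percent, ps.page_count
    FROM sys.dm_db_index_physical_stats
             (DB_ID(), OBJECT_ID('dbo.Person'), NULL, NULL, 'LIMITED') AS ps
    JOIN sys.indexes AS i
      ON i.object_id = ps.object_id
     AND i.index_id  = ps.index_id;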

    On the other hand, it IS a good way to justify buying additional hardware even though that won't usually help performance, either. 😀

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden - Tuesday, May 15, 2018 11:59 AM

    Personally, I think it's a totally unnecessary duplication of data, and it doesn't stop at just what is visible in the table.  It's also likely that the same people who rendered this expensive design also added an index to many, if not all, of the columns.  Don't forget that SQL Server can't actually use what is stored on disk.  It has to load it into memory before it can be used.

    What's really ironic about a lot of designers (certainly, not all of them) of DWs that I've seen is that they denormalize data for supposed purposes of performance and then they pitch a fit when it doesn't perform well because of the Catch-22 they've designed into their DWs.  Then they exacerbate the problems by adding a shedload of indexes and wonder why it takes a month of Sundays to update the data in their DWs, then wonder why they have huge page-split/fragmentation problems (because those same designers also don't know how to design the indexes correctly), and then also wonder why maintenance periods and backups take so long and why the backups are so large.

    On the other hand, it IS a good way to justify buying additional hardware even though that won't usually help performance, either. 😀

    You guys are UPDATING your DW? What? 😀

  • xsevensinzx - Tuesday, May 15, 2018 1:43 PM

    Jeff Moden - Tuesday, May 15, 2018 11:59 AM

    Personally, I think it's a totally unnecessary duplication of data, and it doesn't stop at just what is visible in the table.  It's also likely that the same people who rendered this expensive design also added an index to many, if not all, of the columns.  Don't forget that SQL Server can't actually use what is stored on disk.  It has to load it into memory before it can be used.

    What's really ironic about a lot of designers (certainly, not all of them) of DWs that I've seen is that they denormalize data for supposed purposes of performance and then they pitch a fit when it doesn't perform well because of the Catch-22 they've designed into their DWs.  Then they exacerbate the problems by adding a shedload of indexes and wonder why it takes a month of Sundays to update the data in their DWs, then wonder why they have huge page-split/fragmentation problems (because those same designers also don't know how to design the indexes correctly), and then also wonder why maintenance periods and backups take so long and why the backups are so large.

    On the other hand, it IS a good way to justify buying additional hardware even though that won't usually help performance, either. 😀

    You guys are UPDATING your DW? What? 😀

    😀

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • The design pattern is to achieve one or more of the following, some of which have already been mentioned:
    Take the calculation hit once at load time rather than burn CPU cycles on conversion at read time
    Indexing

    And one I haven't seen mentioned - consistency.  If the data is conformed on the way in, then you know that power users will use one of the conformed fields rather than roll their own.
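
    A sketch of what "conformed on the way in" might look like in the ETL (schema and table names invented): every variant is computed exactly once, so every consumer reads the same values.

    -- Sketch: compute each variant once at load time so power users
    -- all pull identical, conformed values.
    INSERT INTO dwh.DimPerson
        (PersonID, First_Name, Last_Name,
         FullName, FullName_Caps, FullName_Caps_LastFirst)
    SELECT
        s.PersonID,
        s.First_Name,
        s.Last_Name,
        CONCAT(s.First_Name, ' ', s.Last_Name),
        UPPER(CONCAT(s.First_Name, ' ', s.Last_Name)),
        UPPER(CONCAT(s.Last_Name, ', ', s.First_Name))
    FROM staging.Person AS s;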

    As with most things databasey, whether this is a good design pattern or not depends.

  • aaron.reese - Friday, May 18, 2018 9:49 AM

    The design pattern is to achieve one or more of the following, some of which have already been mentioned:
    Take the calculation hit once at load time rather than burn CPU cycles on conversion at read time
    Indexing

    And one I haven't seen mentioned - consistency.  If the data is conformed on the way in, then you know that power users will use one of the conformed fields rather than roll their own.

    As with most things databasey, whether this is a good design pattern or not depends.

    Agreed... that's why I'm suggesting that people look a little bit more closely before they make this table 6 times larger than it needs to be.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Kevlarmpowered - Thursday, May 10, 2018 11:26 AM

    Can anyone weigh in on why you would want to store multiple variations of the same data?

    One of the rules of RDBMS is that we don't do display formatting in the database layer of a tiered architecture. It goes out to a presentation layer. In the case of names and addresses, you probably want to get a tool to clean up your data to at least the CASS standards, to keep the post office happy.

    Now, having said that, many decades ago I had a client who had five different ways of valuing his inventory. Each one involved accounting rules, the legal department, and other stuff that I don't even pretend to understand. Have you ever heard the joke about the Spanish bookkeeping system? You keep multiple sets of books: the creditors' books, the government's books, the investors' books, etc.

    Because of the complexity involved in each of the methods, it was not worth computing them in a query, so we allocated a column for each method. Then we picked the one we liked for whatever purpose we were using. Hey, I don't teach business ethics; I teach database.

    Please post DDL and follow ANSI/ISO standards when asking for help. 

  • This is really a question of semantics. If it is necessary to have a properly conformed name or address for legal purposes, I would argue that that is not a presentation issue: the conformed data must be provided by the system.  Whether you compute it once and serve it up from a computed column, or do it via a view or function at run-time, is a design decision, and there are valid reasons for doing both.  The lines become more blurred when trying to calculate subjective values (e.g. account delinquency).  Should the database return a scalar value for the application to assign to a banding and color code, or should the database apply the banding rules and just leave the display parameters up to the application?  E.g. should delinquency be 82%, 'Severe' or 'Red'?  I would argue (almost) never 'Red' - that is a presentation decision.  The score or banding is more subtle and really depends on the business requirements.
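
    A sketch of the middle option (table name and thresholds invented): the database owns the banding rule and returns 'Severe'; whether 'Severe' renders as red stays with the presentation layer.

    -- Sketch: banding applied in the database, color left to the app.
    SELECT
        AccountID,
        DelinquencyPct,
        CASE
            WHEN DelinquencyPct >= 80 THEN 'Severe'
            WHEN DelinquencyPct >= 50 THEN 'Moderate'
            ELSE 'Low'
        END AS DelinquencyBand
    FROM dbo.AccountDelinquency;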

    @Jeff, in your last statement you referred to making the database bigger than it needs to be - but who defines "need"?  Storage is cheap and the compute probably happens outside of business hours.  If speed of delivery is the most important thing, then I say calculate the values and store them in derived columns; after all, fast delivery of pre-calculated values is one reason you built a data warehouse in the first place.
