Data Warehouse question

  • This is more of a 'what is a general best practice' question than a specific code or syntax question.

    Recently the data users here have started to define names for specific combinations of data, or to ask for data broken out by sub-parts of a field. I am considering different ways of making sure the various report writers and ad hoc users arrive at the same datasets when querying for these.

    The simplest example is: We have a source sales table with around 7 million rows. There is a new need to report on how the sales order was given to the company: by electronic submission, on paper, or scanned. The way to determine this is from a single character position in the order number. There are a dozen values, which break down to the three categories.

    I am considering these options to make sure all users of the data get these categories correct.

    A user-defined function, which takes in the order ID and returns the category. The function really is just a CASE statement, so it is not doing any other selects or queries.

    A calculated field in the table which, again, uses a CASE statement to set the category.

    A new field in the table, which is populated as the records are added.

    Which of these makes the most sense in your opinion or experience? There may be perhaps a dozen other groupings or flags which would work like this.
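
    Whichever of the three options is chosen, the rule itself is small enough to define once and reuse. A minimal sketch in Python, assuming the code sits in the third character of the order number and using made-up code letters (the post gives neither the position nor the actual values):

```python
# Hypothetical mapping: the real character position and the twelve code
# values are not given in the post, so these are placeholders.
CHANNEL_POSITION = 2  # zero-based index of the code character (assumed)

CHANNEL_BY_CODE = {
    "E": "electronic", "W": "electronic", "X": "electronic", "A": "electronic",
    "P": "paper", "M": "paper", "F": "paper", "H": "paper",
    "S": "scanned", "I": "scanned", "C": "scanned", "D": "scanned",
}


def order_channel(order_id: str) -> str:
    """Return the submission category for an order number, or 'unknown'."""
    if len(order_id) <= CHANNEL_POSITION:
        return "unknown"
    code = order_id[CHANNEL_POSITION].upper()
    return CHANNEL_BY_CODE.get(code, "unknown")


print(order_channel("10E4471"))  # electronic
print(order_channel("10m0032"))  # paper
print(order_channel("10"))       # unknown
```

    Returning an explicit 'unknown' for unmapped codes, rather than failing, is a design choice made here for illustration; a load process might prefer to reject such rows.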

    The second situation is a grouping which will require self-joins or lookup queries to arrive at the correct flag. In these I could either create a function, or add fields to the table with the categories. In this type of situation, there could be twenty or more flags.

    My inclination is, in all cases, either to add fields to the table, which are marked during the data loads, or to create a new table with the key from the sales data and fields for the flags being used. However, I do understand there are differences in how one sets up a data warehouse vs. a transactional system, and many times when browsing through this forum I have found amazing ideas which seem slightly counter-intuitive to me. So I think it is worth asking others for their thoughts on this; perhaps there is another option I have not thought of which is considerably better.
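
    The separate-table idea can be sketched as follows, using Python's built-in sqlite3 module as a stand-in for the real warehouse. The table names, columns, and the `is_repeat_order` flag are all hypothetical; the point is only that a flag needing a lookup against other rows is computed once at load time instead of in every report query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (
        order_id   TEXT PRIMARY KEY,
        customer   TEXT NOT NULL,
        order_date TEXT NOT NULL
    );
    -- One row per sales key, one column per flag (hypothetical flag shown).
    CREATE TABLE sales_flags (
        order_id        TEXT PRIMARY KEY REFERENCES sales (order_id),
        is_repeat_order INTEGER NOT NULL
    );
""")

conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("A1", "acme", "2010-01-05"),
    ("A2", "acme", "2010-02-11"),
    ("B1", "bolt", "2010-02-12"),
])

# Run after each load: the self-lookup is paid here, once, and only for
# orders that do not have a flag row yet.
conn.execute("""
    INSERT INTO sales_flags (order_id, is_repeat_order)
    SELECT s.order_id,
           EXISTS (SELECT 1 FROM sales AS p
                   WHERE p.customer = s.customer
                     AND p.order_date < s.order_date)
    FROM sales AS s
    WHERE s.order_id NOT IN (SELECT order_id FROM sales_flags)
""")

flags = dict(conn.execute("SELECT order_id, is_repeat_order FROM sales_flags"))
print(flags)
```

    Report writers would then join `sales_flags` on the key and read the stored flag, so everyone gets the same answer without repeating the lookup logic.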

    Any thoughts? I am sure the answer is 'It depends', so what sorts of things should be considered in such a case?

    Thanks!

  • David Lester (5/7/2010)


    There are a dozen values, which break down to the three categories.

    Using the above statement as reference, I would agree that adding a new column to the target FACT table is the right thing to do. After initial population of the column, the ETL process should take care of setting the right value during load.

    A nice touch would be to create a DIM table describing the possible values of the new column.
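
    As a rough illustration of that layout, here is a minimal sketch using Python's built-in sqlite3 module in place of the real warehouse. The table and column names, the three sample codes, their descriptions, and the character position are all assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    -- DIM table: one row per coded value, carrying its meaning.
    CREATE TABLE dim_order_channel (
        channel_code TEXT PRIMARY KEY,
        category     TEXT NOT NULL,
        description  TEXT NOT NULL
    );
    -- FACT table: the new coded column is stored with each sale.
    CREATE TABLE fact_sales (
        order_id     TEXT PRIMARY KEY,
        amount       REAL NOT NULL,
        channel_code TEXT NOT NULL REFERENCES dim_order_channel (channel_code)
    );
""")

conn.executemany(
    "INSERT INTO dim_order_channel VALUES (?, ?, ?)",
    [("E", "electronic", "Electronic submission"),
     ("P", "paper", "Paper order form"),
     ("S", "scanned", "Scanned order image")],
)


def load_sales(rows):
    """ETL step: derive channel_code from the order number while loading."""
    conn.executemany(
        "INSERT INTO fact_sales VALUES (?, ?, ?)",
        # Assumed: the code is the third character of the order number.
        [(order_id, amount, order_id[2]) for order_id, amount in rows],
    )


load_sales([("10E4471", 120.0), ("10P0032", 80.0), ("10E9001", 30.0)])

totals = conn.execute("""
    SELECT d.category, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_order_channel AS d ON d.channel_code = f.channel_code
    GROUP BY d.category
    ORDER BY d.category
""").fetchall()
print(totals)  # [('electronic', 150.0), ('paper', 80.0)]
```

    With the foreign key enforced, a load containing a code that is missing from the dimension fails outright, which is one way to notice new codes early.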

    _____________________________________
    Pablo (Paul) Berzukov

    Author of Understanding Database Administration available at Amazon and other bookstores.

    Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions on a test non-production environment, the advisor should not be held liable or responsible for any actions taken based on the given advice.
  • Thanks Paul,

    So, with the dimension table being a nice touch, are you thinking it is nice for being able to look up the possible values and add new ones sometime in the distant future, or is it a performance-related idea?

  • David Lester (5/7/2010)


    ... are you thinking nice for being able to lookup the possible values, and adding new ones sometime in the distant future, or is it a performance related idea?

    More as a place to store the meaning and description of each value. In general I like to have a dimension describing each coded column sitting in a fact table.

  • Creating a new table with the key from the sales data and fields for the flags is a better option than adding fields to the sales table.

  • cochran1010 (1/10/2011)


    Creating a new table with the key from the sales data, and fields for the flags is the best option rather than addition.

    Do you realize we are talking dimensional modeling here? 🙂

  • Thanks Paul, and yes I do.

    Thankfully, after years of attempts, IT has ceased to be a wall. While they are still not ready to listen to me, they did hire a consultant. We are in process of redesigns, using these very things.

    (It is mildly amusing that every suggestion the consultant is giving them independently matches what I have been asking for these last years.)

  • David Lester (12/9/2011)


    Thanks Paul, and yes I do.

    Thankfully, after years of attempts, IT has ceased to be a wall. While they are still not ready to listen to me, they did hire a consultant. We are in process of redesigns, using these very things.

    (It is mildly amusing that every suggestion the consultant is giving them independently matches what I have been asking for these last years.)

    I know what you mean, David. Sometimes organizations lack the ability to listen to their internal resources and waste sh*&^tloads of money on consultants who, in the best-case scenario, come up with the same (brilliant) ideas internal resources have been trying to convey for years with no luck.

  • Thanks for ideas above.
