Fact-and-Dimension vs Storing Character Values

  • the data warehouse won't grow much

    This part doesn't sound correct.  I'm a big fan of views in a DW environment.  But as the fact table will add a row per invoice status change, it would seem to me to grow a lot.  Obviously I don't know the optempo at your company, but a new row will be required for each change.  You may consider that not a lot. 

  • RonKyle - Tuesday, July 31, 2018 8:52 AM

    the data warehouse won't grow much

    This part doesn't sound correct.  I'm a big fan of views in a DW environment.  But as the fact table will add a row per invoice status change, it would seem to me to grow a lot.  Obviously I don't know the optempo at your company, but a new row will be required for each change.  You may consider that not a lot. 

    Sorry, I meant the data warehouse won't grow much initially with this new table. The total size will be well under 1GB, putting the two tables together this way.

    Over time it certainly will grow, and that's something we'll add into the maintenance schedule. I was initially concerned that, by putting this in place, I would add unnecessary weight to the data warehouse for one to three ancillary reports.

  • jklenk86 - Tuesday, July 31, 2018 8:46 AM

    RonKyle - Tuesday, July 31, 2018 7:31 AM

    One more thing:
    If the report intention is to report on individual invoices, as your example indicates, then that might be better done through an OLTP system as another poster said earlier.  Data warehouses are better at aggregating large volumes of information to detect patterns (how long on average is an invoice at a particular stage) and identify outliers (this invoice has been open for a year--which one is it and what's wrong?).  If you need to identify the history of an individual invoice, that is better suited to OLTP reporting.  With apologies to the previous poster who said something similar.

    Yeah, so I think that may have been me, actually. We do have an OLAP Cube built that feeds from the data warehouse.

    I think ultimately, given everyone's answers and just talking about the different data sizes, I may stick to a fact table / dimension table scenario. Since the OLAP Cube stores data by highest cardinality, if I keep my fact table to Invoice Number and Phase ID, the data warehouse won't grow much. Using a view to populate the Cube with Invoice Number and Phase Name, the Cube storage won't take a significant hit, since we're already storing Invoice Number for other tables.

    If any of this sounds incorrect, feel free to call it out, but I think this is the best bet. I can add covering indexes to my table to try to keep the view performing well (see the sketch below).

    Sounds good to me...  post back and let us know how you make out with that in place...

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)
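
    A minimal sketch of the fact/dimension plan quoted above (every object and column name here is an assumption for illustration, not the poster's actual schema):

        -- Hypothetical phase dimension and invoice-phase fact.
        CREATE TABLE dbo.DimInvoicePhase (
            PhaseID    tinyint     NOT NULL PRIMARY KEY,
            PhaseName  varchar(50) NOT NULL
        );

        CREATE TABLE dbo.FactInvoicePhase (
            InvoiceNumber int     NOT NULL,
            PhaseID       tinyint NOT NULL
                REFERENCES dbo.DimInvoicePhase (PhaseID),
            PhaseDateKey  int     NOT NULL  -- key into the date dimension
        );
        GO

        -- View the cube reads from: invoice number plus the phase name.
        CREATE VIEW dbo.vwInvoicePhase
        AS
        SELECT f.InvoiceNumber, d.PhaseName, f.PhaseDateKey
        FROM   dbo.FactInvoicePhase AS f
        JOIN   dbo.DimInvoicePhase  AS d ON d.PhaseID = f.PhaseID;
        GO

        -- Covering index so the view can resolve without key lookups.
        CREATE INDEX IX_FactInvoicePhase_Phase
            ON dbo.FactInvoicePhase (PhaseID, InvoiceNumber)
            INCLUDE (PhaseDateKey);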

  • If you're storing this data in an existing data warehouse, does your existing warehouse have a date dimension? If so, I would reference that rather than store the date in your own fact table. Depending on the design of that dimension, it might allow you to do rollups by different periods if that's something folks might be interested in, though it's not really clear what kind of questions this thing is trying to answer, so I'm not sure what use that might be. Referring to an existing date dimension might also save a few bytes per row in the fact table, if the date dimension key is smaller than a datetime.

    Also, in a data warehouse scenario, I wouldn't worry about the fact table becoming very large; that's generally pretty normal. 27 million rows really isn't that big -- wait until you get into the billions. 🙂
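
    As a hedged illustration of the date-dimension point (DimDate and its column names are assumptions about a typical calendar table, not a known schema), a rollup by month might look like:

        -- Roll invoice-phase activity up by month via the shared date dimension.
        SELECT d.CalendarYear,
               d.MonthNumber,
               COUNT(*) AS PhaseChanges
        FROM   dbo.FactInvoicePhase AS f
        JOIN   dbo.DimDate          AS d ON d.DateKey = f.PhaseDateKey
        GROUP BY d.CalendarYear, d.MonthNumber
        ORDER BY d.CalendarYear, d.MonthNumber;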

  • Other questions aside, I think storing the Phases as a dimension is the more correct design. The performance considerations would be more about the initial DW load, and then cube processing time on an ongoing basis. Since invoices will be updated with subsequent phases, possibly across month or year boundaries, some form of partitioning scheme may be useful to reduce cube processing time.

    MattF
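
    One possible shape for the partitioning MattF mentions, assuming the relational fact is partitioned by calendar year on an int YYYYMMDD date key (boundary values and filegroup placement are illustrative only):

        -- Yearly partition function and scheme on the int date key.
        CREATE PARTITION FUNCTION pfInvoiceYear (int)
            AS RANGE RIGHT FOR VALUES (20170101, 20180101, 20190101);

        CREATE PARTITION SCHEME psInvoiceYear
            AS PARTITION pfInvoiceYear ALL TO ([PRIMARY]);

        -- The fact table's clustered index can then be built ON psInvoiceYear(PhaseDateKey),
        -- and SSAS measure-group partitions can be aligned one per year so only the
        -- current year needs reprocessing.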

  • dmbaker - Wednesday, August 1, 2018 6:57 AM

    If you're storing this data in an existing data warehouse, does your existing warehouse have a date dimension? If so, I would reference that rather than store the date in your own fact table. Depending on the design of that dimension, it might allow you to do rollups by different periods if that's something folks might be interested in, though it's not really clear what kind of questions this thing is trying to answer, so I'm not sure what use that might be. Referring to an existing date dimension might also save a few bytes per row in the fact table, if the date dimension key is smaller than a datetime.

    Also, in a data warehouse scenario, I wouldn't worry about the fact table becoming very large; that's generally pretty normal. 27 million rows really isn't that big -- wait until you get into the billions. 🙂

    Great point, and yes I will be converting the datetime to an Int and storing a YYYYMMDD integer representation of the date. I'm glad to hear 27M rows isn't excessive; I'm still new to data warehousing. 🙂

    MattF - Wednesday, August 1, 2018 7:46 AM

    Other questions aside, I think storing the Phases as a dimension is the more correct design. The performance considerations would be more about the initial DW load, and then cube processing time on an ongoing basis. Since invoices will be updated with subsequent phases, possibly across month or year boundaries, some form of partitioning scheme may be useful to reduce cube processing time.

    This is what I thought as well. I imagine we'll be partitioning Invoices by year, since our Accounting dept reports annually. 
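
    For reference, one common way to derive that YYYYMMDD integer from a datetime (style 112 is the ISO yyyymmdd format; the source table and column names below are assumptions):

        -- e.g. 2018-08-01 becomes 20180801
        SELECT CONVERT(int, CONVERT(char(8), s.InvoiceStatusDate, 112)) AS PhaseDateKey
        FROM   dbo.SourceInvoiceStatus AS s;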

  • Great point, and yes I will be converting the datetime to an Int and storing a YYYYMMDD integer representation of the date

    A smallint date key is normally sufficient (a sequential surrogate rather than the YYYYMMDD value itself, which won't fit in a smallint).

  • If your version of SQL Server supports columnstore indexes, I would consider that as an option. Due to the repetition of data, it would condense the volume of data you are storing. The "INT" value of a date is a must in most cases (least work to join and search on).
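
    A hedged sketch of that suggestion, assuming SQL Server 2014 or later and the hypothetical fact table sketched earlier in the thread:

        -- A clustered columnstore index compresses the highly repetitive fact rows
        -- (phase IDs and date keys repeat constantly) and speeds up large scans.
        -- On 2014 it would replace the rowstore indexes; 2016+ allows nonclustered
        -- b-tree indexes alongside it.
        CREATE CLUSTERED COLUMNSTORE INDEX CCI_FactInvoicePhase
            ON dbo.FactInvoicePhase;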

  • jklenk86 - Tuesday, July 31, 2018 8:57 AM

    RonKyle - Tuesday, July 31, 2018 8:52 AM

    the data warehouse won't grow much

    This part doesn't sound correct.  I'm a big fan of views in a DW environment.  But as the fact table will add a row per invoice status change, it would seem to me to grow a lot.  Obviously I don't know the optempo at your company, but a new row will be required for each change.  You may consider that not a lot. 

    Sorry, I meant the data warehouse won't grow much initially with this new table. The total size will be well under 1GB, putting the two tables together this way.

    Over time it certainly will grow, and that's something we'll add into the maintenance schedule. I was initially concerned that, by putting this in place, I would add unnecessary weight to the data warehouse for one to three ancillary reports.

    More food for thought and alternative solutions.

    Keep the transactional log-level data, but slam it into a document store outside of the data warehouse. Then have that document store aggregate the changes of the invoices into one or more documents, and parallel load those documents into the data warehouse via PolyBase with SQL Server, or use SSIS. That way you still retain the transactional data and remove most of the overhead from the data warehouse, while also reducing the size of the table in said warehouse, all in one bout. The data warehouse in this instance is still creating the facts/dimensions/keys/etc. from the data published from the store, but it's fewer records, with the option of ingesting the entire history if needed outside of the warehouse.

    I say this only because updating a fact table in a data warehouse is bad news.
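
    A very rough sketch of the load path described above, assuming SQL Server 2016+ PolyBase over an HDFS-backed store (the data source, file format, locations, and table names are all assumptions):

        CREATE EXTERNAL DATA SOURCE InvoiceDocStore
        WITH (TYPE = HADOOP, LOCATION = 'hdfs://docstore:8020');

        CREATE EXTERNAL FILE FORMAT CsvFormat
        WITH (FORMAT_TYPE = DELIMITEDTEXT,
              FORMAT_OPTIONS (FIELD_TERMINATOR = ','));

        -- Exposes the aggregated documents exported by the document store.
        CREATE EXTERNAL TABLE dbo.ExtInvoicePhaseHistory (
            InvoiceNumber int,
            PhaseID       tinyint,
            PhaseDateKey  int
        )
        WITH (LOCATION = '/invoices/phase_history/',
              DATA_SOURCE = InvoiceDocStore,
              FILE_FORMAT = CsvFormat);

        -- Only the condensed rows land in the warehouse fact table.
        INSERT INTO dbo.FactInvoicePhase (InvoiceNumber, PhaseID, PhaseDateKey)
        SELECT InvoiceNumber, PhaseID, PhaseDateKey
        FROM   dbo.ExtInvoicePhaseHistory;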

  • I say this only because updating a fact table in a data warehouse is bad news.

    Why would you say this?  If it's an accumulating snapshot table, the records will be updated routinely until the item is completed.
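
    For context, an accumulating snapshot fact keeps one row per invoice and fills in a milestone column as each phase completes, so routine updates are part of the design. A minimal sketch (all names and phases are assumptions):

        CREATE TABLE dbo.FactInvoiceSnapshot (
            InvoiceNumber   int NOT NULL PRIMARY KEY,
            CreatedDateKey  int NOT NULL,
            ApprovedDateKey int NULL,   -- filled in when the invoice is approved
            PaidDateKey     int NULL    -- filled in when the invoice is paid
        );

        -- Routine update as an invoice reaches its next phase.
        UPDATE dbo.FactInvoiceSnapshot
        SET    ApprovedDateKey = 20180817
        WHERE  InvoiceNumber   = 12345
          AND  ApprovedDateKey IS NULL;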

  • RonKyle - Friday, August 17, 2018 7:23 AM

    I say this only because updating a fact table in a data warehouse is bad news.

    Why would you say this?  If it's an accumulating snapshot table, the records will be updated routinely until the item is completed.

    Because if you're using SSAS to make cubes out of your data, updating the fact table requires reprocessing? That's one major reason I can think of, and it's a pain in the butt unless you're doing some sort of ROLAP, but that's a whole other can of worms.

  • dmbaker - Friday, August 17, 2018 7:47 AM

    RonKyle - Friday, August 17, 2018 7:23 AM

    I say this only because updating a fact table in a data warehouse is bad news.

    Why would you say this?  If it's an accumulating snapshot table, the records will be updated routinely until the item is completed.

    Because if you're using SSAS to make cubes out of your data, updating the fact table requires reprocessing? That's one major reason I can think of, and it's a pain in the butt unless you're doing some sort of ROLAP, but that's a whole other can of worms.

    If it's an accumulating snapshot table, that can't be avoided.

  • RonKyle - Friday, August 17, 2018 7:52 AM

    dmbaker - Friday, August 17, 2018 7:47 AM

    RonKyle - Friday, August 17, 2018 7:23 AM

    I say this only because updating a fact table in a data warehouse is bad news.

    Why would you say this?  If it's an accumulating snapshot table, the records will be updated routinely until the item is completed.

    Because if you're using SSAS to make cubes out of your data, updating the fact table requires reprocessing? That's one major reason I can think of, and it's a pain in the butt unless you're doing some sort of ROLAP, but that's a whole other can of worms.

    If it's an accumulating snapshot table, that can't be avoided.

    Yes, that's true, but you asked why and I gave you my reason. 😛

  • dmbaker - Friday, August 17, 2018 7:47 AM

    RonKyle - Friday, August 17, 2018 7:23 AM

    I say this only because updating a fact table in a data warehouse is bad news.

    Why would you say this?  If it's an accumulating snapshot table, the records will be updated routinely until the item is completed.

    Because if you're using SSAS to make cubes out of your data, updating the fact table requires reprocessing? That's one major reason I can think of, and it's a pain in the butt unless you're doing some sort of ROLAP, but that's a whole other can of worms.

    I was also curious about this point. Would that not defeat the purpose of an SSAS Cube? Any data refresh would require reprocessing, and reports are unusable if the data is stale.

    I am always learning, so let me know if I'm missing something.

  • RonKyle - Friday, August 17, 2018 7:23 AM

    I say this only because updating a fact table in a data warehouse is bad news.

    Why would you say this?  If it's an accumulating snapshot table, the records will be updated routinely until the item is completed.

    Well, yes, if you decide to go with a slowly changing fact then that's what you have to do. I'm saying that you should avoid slowly changing facts, as updates to an extremely large fact table are bad. It will grow with uniques if you constantly update and aggregate the fact. That can lead to a downward spiral in performance over time as those uniques grow, versus avoiding this altogether and keeping the reporting consistent the first time you process it, so you never have to reprocess it again.

    I say this because accumulation the way you and others explain it is a farce; it's not accumulating the records...
