What kind of fact table is this?

  • New to DW and still having trouble switching my brain over to it. I'm wanting to report on the journey a user has from registering on our site, creating an order, paying for an order and the order being dispatched. The things I'm wanting to measure are the lead times between each of these dates, and I guess I'll be asked which retailers have the longest lead times, what their average lead time is, etc., so it is really all about the user's journey through the site.

    Now I get how to design my dim tables, but I always have trouble visualising my fact tables. Would this fact table be a row for each user with columns for the dates I'm interested in? Would I also have columns with lead times between each date? I'm thinking if I have it like this I could then roll up the lead times and divide for things like averages etc.

    Am I way off the mark?

  • ps_vbdev (9/5/2016)


    New to DW and still having trouble switching my brain over to it. I'm wanting to report on the journey a user has from registering on our site, creating an order, paying for an order and the order being dispatched. The things I'm wanting to measure are the lead times between each of these dates, and I guess I'll be asked which retailers have the longest lead times, what their average lead time is, etc., so it is really all about the user's journey through the site.

    Now I get how to design my dim tables, but I always have trouble visualising my fact tables. Would this fact table be a row for each user with columns for the dates I'm interested in? Would I also have columns with lead times between each date? I'm thinking if I have it like this I could then roll up the lead times and divide for things like averages etc.

    Am I way off the mark?

    Generally speaking, in the world of data warehousing, a row in a transactional fact table (this is the most common type and what you are talking about) contains data for one event (one sale, one registration, etc.). I certainly would never model a fact table that could change over time; that is, having registration date, order date, payment date, etc. all on one row and updating it as new data arrives. What if that customer orders more than once?

    This is how I would model this:

    Have separate fact tables for each of your events - Registrations, Orders, Payments, Dispatches - and then use your dimensions to link them together in a meaningful way when you come to do a query (for example, customer ID and order ID).
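    Roughly what I have in mind, as a minimal sketch (table and column names are made up for illustration, not a prescription):

        -- One transactional fact table per event; each row is one immutable event.
        CREATE TABLE FactRegistration (
            CustomerKey      INT      NOT NULL,  -- FK to DimCustomer
            RegistrationDate DATETIME NOT NULL
        );

        CREATE TABLE FactOrder (
            CustomerKey INT      NOT NULL,  -- FK to DimCustomer
            OrderKey    INT      NOT NULL,  -- degenerate dimension from the source system
            RetailerKey INT      NOT NULL,  -- FK to DimRetailer
            OrderDate   DATETIME NOT NULL
        );

        CREATE TABLE FactPayment (
            CustomerKey INT      NOT NULL,
            OrderKey    INT      NOT NULL,
            PaymentDate DATETIME NOT NULL
        );

        CREATE TABLE FactDispatch (
            CustomerKey  INT      NOT NULL,
            OrderKey     INT      NOT NULL,
            DispatchDate DATETIME NOT NULL
        );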

    Would I also have columns with lead times between each date?

    Definitely don't do this! Encapsulate that logic in a view, a report, or a cube instead - see the sketch below.
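    For example, something along these lines, assuming the hypothetical tables sketched above and one payment and one dispatch per order:

        -- Lead times are derived at query time, not stored in the fact tables.
        CREATE VIEW vwOrderLeadTimes AS
        SELECT o.CustomerKey,
               o.OrderKey,
               o.RetailerKey,
               DATEDIFF(DAY, o.OrderDate,   p.PaymentDate)  AS DaysOrderToPayment,
               DATEDIFF(DAY, p.PaymentDate, d.DispatchDate) AS DaysPaymentToDispatch
        FROM FactOrder    AS o
        JOIN FactPayment  AS p ON p.OrderKey = o.OrderKey
        JOIN FactDispatch AS d ON d.OrderKey = o.OrderKey;

    Questions like "which retailer has the longest average lead time" then fall out of a simple GROUP BY over the view.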


    I'm on LinkedIn

  • Hey, thanks.

    I'm not doing an incremental load, so it's a full load every time, and there wouldn't be a need to update the fact table: rows either have a date (or dates) or don't at the time of the snapshot. Does this make a difference to your reply?

    Thanks

  • ps_vbdev (9/6/2016)


    Hey, thanks.

    I'm not doing an incremental load, so it's a full load every time, and there wouldn't be a need to update the fact table: rows either have a date (or dates) or don't at the time of the snapshot. Does this make a difference to your reply?

    Thanks

    Yes and no. Yes, in that it's not a transactional fact table but a snapshot one. Since you don't mention whether you'll be truncating the table prior to loading or just adding to what is already there, I can't tell whether it is a periodic or an accumulating snapshot (in the latter you update the fact table). No, in that you still shouldn't encapsulate logic in the table itself, for a variety of reasons (performance being a big one).

    The problem I see is that if you treat this as an accumulating snapshot and you are re-reading all of the data each time, then performance will degrade over time. If you went with the method I described before, then you could use the dates you mention (if, at the time of querying, they all occur in the past) to build the fact tables incrementally and make them transactional - something like the sketch below.
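    A rough illustration of that incremental build, assuming a hypothetical StagingOrders table and using the latest loaded date as the watermark:

        -- Only pull events that arrived since the last load, instead of truncating.
        -- ISNULL seeds the watermark on the very first (empty-table) load.
        DECLARE @LastLoaded DATETIME =
            (SELECT ISNULL(MAX(OrderDate), '19000101') FROM FactOrder);

        INSERT INTO FactOrder (CustomerKey, OrderKey, RetailerKey, OrderDate)
        SELECT s.CustomerKey, s.OrderKey, s.RetailerKey, s.OrderDate
        FROM StagingOrders AS s
        WHERE s.OrderDate > @LastLoaded;  -- completed (past) events only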

    It's up to you though, it's your data 😀


    I'm on LinkedIn

  • Yes, I'd be truncating the fact table before every load. I have such difficulty understanding fact tables :angry: I would never have thought about splitting the dates into their own fact tables - I'd have expected it to be overkill. There wouldn't be a performance issue just now, but yes, you are right, it would undoubtedly come.

  • ps_vbdev (9/6/2016)


    Yes, I'd be truncating the fact table before every load. I have such difficulty understanding fact tables :angry: I would never have thought about splitting the dates into their own fact tables - I'd have expected it to be overkill. There wouldn't be a performance issue just now, but yes, you are right, it would undoubtedly come.

    Don't worry about it! I understand your frustration - I've been speaking in definitive terms (transactional, snapshot, etc.), but to every business a fact might mean something that isn't easy to categorise. Believe me, I've seen some truly odd things regarded as "facts" over the years. I just happen to have worked with data similar to what you are talking about, and that is how I ended up modelling it, and it worked well.

    Maybe give this book a read: https://www.amazon.co.uk/Data-Warehouse-Toolkit-Complete-Dimensional/dp/0471200247

    You should be able to pick up a used copy pretty cheaply, or try to find an online PDF version you can download for free. However, I would never condone the latter 😀


    I'm on LinkedIn

  • ps_vbdev (9/5/2016)


    New to DW and still having trouble switching my brain over to it. I'm wanting to report on the journey a user has from registering on our site, creating an order, paying for an order and the order being dispatched. The things I'm wanting to measure are the lead times between each of these dates, and I guess I'll be asked which retailers have the longest lead times, what their average lead time is, etc., so it is really all about the user's journey through the site.

    Now I get how to design my dim tables, but I always have trouble visualising my fact tables. Would this fact table be a row for each user with columns for the dates I'm interested in? Would I also have columns with lead times between each date? I'm thinking if I have it like this I could then roll up the lead times and divide for things like averages etc.

    Am I way off the mark?

    Ideally, the fact table is the collection of numeric facts observed during a measurement event, so focus on what a measurement event might be in your case. That's where the "transactional" comment in the previous post comes from: every event that happens is a transaction.

    For example, let's say we want to log events that happen on your website to a single table, starting with two dimensions: datetime and event type. The datetime is the date and timestamp of the event. The event type covers the events you want to log on your website, such as registration, order placement, payment and dispatch. Every user that comes to your website and performs any of these events, regardless of order, is logged to that table. You now have a history of transactions that is the beginnings of a fact table. The only thing missing is your metrics. What are you measuring?

    In your case, the time between events in seconds/minutes/hours/days/etc. Now imagine those two fields of datetime and event type with a third field for your measurement: lead time. This is a numeric field holding the seconds since the last event, based on datetime.

    So, how do you accomplish that? Well, you start by logging each event, then calculate the difference in seconds between each event and the one before it. The third and last step is to dump the result of that calculation into another table, which becomes your fact table for lead times that you can then report on to the business. The only thing to figure out next is what grain (dimensions) you want to add to that fact table to further describe the metrics, such as user, origin of user, user demographics and whatever else you feel is important.

    Obviously, if you do something like my example with three fields (datetime, event type and lead time), then you run into the issue where the lead time factors in a previous event that belongs to another user. That's because you don't have the grain of the user in your fact. When you add a fourth field, such as the user dimension, you can ensure that lead times are only calculated between events OF THE SAME user 😉 - see the sketch below.
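    For illustration, the per-user lead time could be derived with a window function over that event log (table and column names are invented for the example):

        -- Lead time in seconds from each user's previous event; PARTITION BY
        -- is what stops one user's lead time factoring in another user's event.
        -- The first event per user gets NULL (no previous event).
        SELECT UserKey,
               EventType,
               EventDateTime,
               DATEDIFF(SECOND,
                        LAG(EventDateTime) OVER (PARTITION BY UserKey
                                                 ORDER BY EventDateTime),
                        EventDateTime) AS LeadTimeSeconds
        FROM EventLog;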

    Make sense in that example? Definitely pick up the book recommended in the previous post. It's a really good one.

  • That's great info, but can I ask what the benefit is of having a row for every event I want to track, as opposed to a row for every user and a column for every event I want to track? Then it's a simpler calculation getting lead times using DATEDIFF, is it not?

    Oh, I have just got that book, but I'm reading "Star Schema: The Complete Reference" just now.

  • ps_vbdev (9/12/2016)


    That's great info, but can I ask what the benefit is of having a row for every event I want to track, as opposed to a row for every user and a column for every event I want to track? Then it's a simpler calculation getting lead times using DATEDIFF, is it not?

    Oh, I have just got that book, but I'm reading "Star Schema: The Complete Reference" just now.

    Let me rewrite my answer because I'm way too wordy!

    It really just depends.

    Adding new records per event is pretty easy and flexible in terms of the order in which you want to calculate the lead times. This is the way I went with marketing attribution reporting, which is the tracking of consumer journeys across advertising channels. The order of the events leading up to a conversion in my example always changes based on how the end user wants to calculate that journey. Storing the events vertically in the table allows me to select and compare that journey easily, but at the cost of having more records in the table. It also allows me to filter each individual event as a record rather than as a field, which I find easier at the record level. For you, that means you can construct the event path with advanced filters far more easily before you calculate the lead time.

    Horizontally, this could be easier for you in terms of the time difference, but at the cost of a wider table unless you normalise. You can either define the order and create one field per event, or create a set of non-specific fields based on order (i.e. Event1 - Event5) on the same record as the user. You won't have to search for records to do your lead time calculation; you just do DATEDIFF(SECOND, EventTimestamp1, EventTimestamp2) to output the difference between event 1 and event 2. But what happens if the event path stretches to 200 events? Are you going to have 200 fields in one table, or multiple normalised tables? What happens if you need to reorganise the event path? You'll have to write complex filters into the functions that calculate the lead time to ensure they are picking the right non-specific fields for the use case.

    Both have pros and cons. In my case, I have unlimited event paths - a user can take 1,000 events. That means I would need 1,000 fields per record to do the lead time calculation, and I would need to split those 1,000 fields across multiple tables when, really, I want to refine my results into a denormalised star schema in my warehouse. Wide tables are no good for SQL. So, I instead track each event as a record at the grain that makes sense for the business, calculate the lead times between paths, and output the results as records in a denormalised structure within my star schema - see the sketch below.
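    To make that last step concrete, a sketch of the "calculate and output as records" stage, reusing the hypothetical event log from earlier in the thread:

        -- Vertical events in, lead-time fact rows out (illustrative names only).
        INSERT INTO FactLeadTime (UserKey, EventType, EventDateTime, LeadTimeSeconds)
        SELECT UserKey,
               EventType,
               EventDateTime,
               DATEDIFF(SECOND,
                        LAG(EventDateTime) OVER (PARTITION BY UserKey
                                                 ORDER BY EventDateTime),
                        EventDateTime)
        FROM EventLog;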

  • Without a doubt, at least part of the solution will be a fact table that has both the order date and the dispatch date in the same row. This is an accumulating snapshot fact table. Kimball says it's the least common of the fact tables he describes; it's been a common one for me, as I use it as well. Its most notable feature is the number of date fields it contains. One of mine has, as a partial example, creation date, ship date, survey date and close date. These rows are updated throughout the order process, and the table has columns that measure the spans between those dates that have business significance, e.g. Days to Close = Close Date - Creation Date.
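    A rough sketch of that shape (names are my own, purely illustrative; the nullable dates fill in as the order progresses):

        CREATE TABLE FactOrderPipeline (
            OrderKey     INT  NOT NULL,
            CustomerKey  INT  NOT NULL,
            CreationDate DATE NOT NULL,
            ShipDate     DATE NULL,   -- set when the order ships
            SurveyDate   DATE NULL,   -- set when the survey comes back
            CloseDate    DATE NULL,   -- set when the order closes
            DaysToClose  INT  NULL    -- CloseDate - CreationDate, set on update
        );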

    If the customer has to register every time an order is placed, that date should also be included. If registration is a one-time event, however, it will need to live somewhere else, maybe even in the customer dimension table.

    Regarding the dates, it's not uncommon for developers to advocate using an integer with the date in a human-friendly format, e.g. 20160912 for September 12, 2016. I would advise against this. By setting a date as day 0 (usually the last day of the year before you have useful data), each date gets its own sequential integer. This makes date interval calculations easy: instead of trying to do 20160912 - 20160830, you simply take the difference between the date IDs, e.g. 3456 - 3443 = 13 days.
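    In other words, the key for each date is its offset in days from day 0 - a sketch, assuming a hypothetical DimDate table and 2015-12-31 as day 0:

        -- Each date's key is a sequential day number, so intervals on the fact
        -- table reduce to plain integer subtraction of the two date keys.
        SELECT FullDate,
               DATEDIFF(DAY, '20151231', FullDate) AS DateKey
        FROM DimDate;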

    I've been doing this kind of warehouse for more than ten years now and it has worked well.

  • RonKyle (9/12/2016)


    Regarding the dates, it's not uncommon for developers to advocate using an integer with the date in a human-friendly format, e.g. 20160912 for September 12, 2016. I would advise against this. By setting a date as day 0 (usually the last day of the year before you have useful data), each date gets its own sequential integer. This makes date interval calculations easy: instead of trying to do 20160912 - 20160830, you simply take the difference between the date IDs, e.g. 3456 - 3443 = 13 days.

    I've been doing this kind of warehouse for more than ten years now and it has worked well.

    What's the benefit of using integers in that context versus just using dates? For example, are there performance gains? Aren't date data types fewer bytes than int, for one thing?

    What did you mean by integers in that context of "day 0"? Did you mean a date dimension where there is an attribute for a day integer you could join to the fact, or just manually embedding the day integer on the fact itself, where, if you wanted to switch from days to minutes, you would have to recalculate that field for every record in the fact table? Bit confused there.

    I would think that taking advantage of the date data type with the date functions would be a good win for flexibility. At least, that's what I always do, because it's normally date and time. So I would surely have lead times of 0 days if the events were within hours of each other. With the date functions, I can calculate days when > 0 days and hours when <= 0 days. The same applies to weeks and months, so I'm not locked into one time slice.

  • Using human-readable formats (e.g. 20160913) for dates and times is a much better approach. These dates can very easily be converted and are much easier to read when one is debugging an issue or performing maintenance on the DW.

    I would highly advise against using integers that are not human-readable.

  • The type of fact table that you want is called an "accumulating snapshot".

    It contains one row for each lifecycle (e.g. a customer creating a specific order, in your case). Each time an event happens, the column containing the date for that event is updated.

    @PB_BI: it's certainly normal to model a fact table that can be updated. Here it offers significant advantages, because it makes it very easy to calculate the duration between two defined events. If you modelled the fact table as a normal transaction table, the queries would become too complex - see the sketch below.
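    For instance, applying a payment event to the lifecycle row might look something like this (illustrative names only, loosely following the snapshot shape sketched earlier in the thread):

        -- Each new event updates the existing lifecycle row instead of
        -- inserting a new one; the duration is then a simple column.
        UPDATE f
        SET    f.PaymentDate        = s.PaymentDate,
               f.DaysOrderToPayment = DATEDIFF(DAY, f.OrderDate, s.PaymentDate)
        FROM   FactOrderLifecycle AS f
        JOIN   StagingPayments    AS s ON s.OrderKey = f.OrderKey;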

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • xsevensinzx (9/12/2016)


    RonKyle (9/12/2016)


    Regarding the dates, it's not uncommon for developers to advocate using an integer with the date in a human-friendly format, e.g. 20160912 for September 12, 2016. I would advise against this. By setting a date as day 0 (usually the last day of the year before you have useful data), each date gets its own sequential integer. This makes date interval calculations easy: instead of trying to do 20160912 - 20160830, you simply take the difference between the date IDs, e.g. 3456 - 3443 = 13 days.

    I've been doing this kind of warehouse for more than ten years now and it has worked well.

    What's the benefit of using integers in that context versus just using dates? For example, are there performance gains? Aren't date data types fewer bytes than int, for one thing?

    What did you mean by integers in that context of "day 0"? Did you mean a date dimension where there is an attribute for a day integer you could join to the fact, or just manually embedding the day integer on the fact itself, where, if you wanted to switch from days to minutes, you would have to recalculate that field for every record in the fact table? Bit confused there.

    I would think that taking advantage of the date data type with the date functions would be a good win for flexibility. At least, that's what I always do, because it's normally date and time. So I would surely have lead times of 0 days if the events were within hours of each other. With the date functions, I can calculate days when > 0 days and hours when <= 0 days. The same applies to weeks and months, so I'm not locked into one time slice.

    +1. I always use dates now instead of integers. Same size in bytes, and you can use date logic directly on the fact table without joining to the date dimension first. It also prevents the use of dates that do not exist.
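    A quick illustration of the "date logic directly on the fact table" point (hypothetical fact and columns):

        -- Filter and aggregate on real DATE columns; no DimDate join required.
        SELECT RetailerKey,
               AVG(DATEDIFF(DAY, OrderDate, DispatchDate)) AS AvgLeadDays
        FROM   FactOrderLifecycle
        WHERE  OrderDate >= DATEADD(MONTH, -3, CAST(GETDATE() AS DATE))
        GROUP BY RetailerKey;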

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • Koen Verbeeck (9/13/2016)


    The type of fact table that you want is called an "accumulating snapshot".

    It contains one row for each lifecycle (e.g. a customer creating a specific order, in your case). Each time an event happens, the column containing the date for that event is updated.

    @PB_BI: it's certainly normal to model a fact table that can be updated. Here it offers significant advantages, because it makes it very easy to calculate the duration between two defined events. If you modelled the fact table as a normal transaction table, the queries would become too complex.

    I would have to disagree slightly. In my defence, when I made the statement above I was assuming that this data could be treated as a transactional fact table (actually more than one). In an accumulating snapshot then, yes, updates are necessary.

    When I used the model I gave as an example, it was with registration and order data where the registration happened once and then multiple transactions could be made over time by the same customer. These transactions could number in the millions per hour at peak business times. In this scenario it made sense to use transactional fact tables and then apply logic in the presentation layer to measure things like time from registration to order, lapsed users, abandoned baskets, etc.

    I made an assumption about the OP's data and I was wrong about that. I don't agree that I'm wrong about not updating transactional fact tables - perhaps I could have been less emphatic, so I'll revise my statement to say that updating transactional fact tables is best avoided if at all possible, particularly with high-volume, high-velocity data. As for making queries more complex, I would argue that what you lose in query simplicity with more tables, you gain in scalability and manageability (assuming good index design, some partitioning, etc.).

    Too many times I've seen places go down the path of "keeping queries simple" and end up with huge "God" tables which they rely on for all of their BI. With the OP's data, I'm still not sure an accumulating snapshot would be advisable if a customer could place more than one order over time. In that case, does the OP create a new column when this happens? A new row which repeats the same registration data? A self-join?

    Not trying to start an argument here by the way, just felt I needed to clarify my position 😀

    Totally agree on the "smart date" approach though (I've seen it called that somewhere before, hence the quotes).


    I'm on LinkedIn
