A Stumbling Count

  • Comments posted to this topic are about the item A Stumbling Count

  • Interesting approach although I'm not sure I would have done it in the same way.

    It looks like you're describing fact and dimension tables so you're looking at large fact tables. 12 seconds for a 300,000 row table (albeit on a test system) isn't great. Imagine a 300 million row table (or billion rows) and then try and run that CTE (joined twice) against the fact table.

    Firstly, the number of production days between two dates can be calculated without the use of a CTE:

    select count(*) from dbo.TstCalendar where Date between @delivery_day and @ship_date and ProdDay ='Y'

    Secondly, in this instance it would make more sense to store the delivery_time as a fact within the fact table. It's additive so it fits in there quite nicely and can allow the business to ask questions like (what was our average delivery time for orders in Jan, what was our lowest, highest, etc...). Also, by storing it in the table you remove the need to perform an unnecessary join between a very large fact table and the calendar table.

    This does of course mean that you need to perform this calculation during the ETL load process but it's more efficient to do so with a day's (assuming you run the load once a day worth of data than against a large fact table.

    On another note, you should consider having a seperate order and shipment fact tables. A fact table should never really contain any nulls - the presence of nulls is often a clue that something is not right with the design. In this instance, it's fair to say that a ship_date is not really a fact related to orders.

  • Karl,

    Thanks for your reply. Well, actually, I think I should have mentioned that I am not describing a full blown data warehouse implementation. Building and maintaining a data warehouse is too expensive for the small and medium sized companies I work for. There are numerous companies like these and they also have their reporting needs and their problems to be solved.

    I think you are right when you say that it makes a lot of sense to have delivery times stored in a data warehouse. That is, if you have one... The same is true for your remarks about null values for ship_date and a separate fact table for shipment facts. But discussing those issues would lead us to talking about details of an order management data mart and, although potentially very interesting, that is not what I had in mind at this point.

    I have a question:

    Your code

    select count(*) from dbo.TstCalendar where Date between @delivery_day and @ship_date and ProdDay ='Y'

    gives the correct number of production days between two dates, but how would you implement this code, when you want to report on, let's say, 50,000 records? (given a poor man's data mart consisting of a denormalized invoice-orders table with order_date and ship_date on the same row)

    Gijs

  • Gijs,

    I assumed that it was a data warehouse because of your description of the role-play dimension but I can understand how sometimes a full-blown data warehouse is too much to consider.

    With respect to calculating the delivery time using my code I would probably do something like amend your function dbo.WorkDayDiff to perform the select count(*) from the Calendar table. The only problem with this is that you'll end up with a nested loop join, which might not perform efficiently. That said, even with 10 years worth of data in this table you're still only talking of a table with 3600+ rows so it shouldn't be too bad.

    However, I would still give consideration to including this field in the actual table so that you wouldn't need to perform the calculation when trying to query many thousands of rows. Instead you'd only need to perform the calculation whenever you insert a record into the table.

  • SQLZ (5/31/2011)


    However, I would still give consideration to including this field in the actual table so that you wouldn't need to perform the calculation when trying to query many thousands of rows. Instead you'd only need to perform the calculation whenever you insert a record into the table.

    Horses for courses. This is a good principle, but some environments simply won't allow it -- whether it's because the database is owned by a third-party app that won't allow changes to tables, or because it's a hybrid transactional/analytical database and a delay of 12 seconds in running a report is preferable to a delay of .1 second when adding/updating a transaction.

    If it weren't for non-standard requirements, we'd all be out of a job. 😛

  • SQLZ (5/31/2011)


    If it weren't for non-standard requirements, we'd all be out of a job. 😛

    I'll drink to that! Gotta love job security.

  • You know I have done this myself but later came to wonder why people want to store all information for dual logic scenarios when they just need the smaller set. In the case of a production work date versus off dates typically the off dates are the smaller set. I would suggest just have a date table for non-production dates only which would hold the values for weekends and other non-production dates with the date column clustered indexed. These means storing around 110 dates a year for dates you don't work (could be saturdays, sundays, holidays or other days that are non-production in your environment) which is insignificant. So your query can be simply something like this

    select

    TOD.Ordernr

    ,TOD.Orderline

    ,datediff(dd, TOD.OrderDate, TOD.ShipDate) - count(NPD.DateColumn) as diff

    from

    dbo.TestOrderDetails TOD

    inner join

    dbo.NonProductionDates NPD

    ON

    NPD.DateColumn BETWEEN TOD.OrderDate AND TOD.ShipDate

    GROUP BY

    TOD.Ordernr

    ,TOD.Orderline

    ,TOD.OrderDate

    ,TOD.ShipDate

    Of course you need to manage logic of nulls and if day is inclusive (20 to 27 is 8 days) or exclusive (20 to 27 is 7 days). But it keeps logic very simple.

  • Interesting use of the date dimension table to calculate analytics. Expanding the model a little to add accounting weeks (in a 4,4,5 or similar rotation), quarters, years, etc. and following a similarly flagged (y,n or 0,1) field design provides a dynamic transition into data warehousing for the small to medium businesses that do not have the budget for a full blown data warehouse application. Then, you simply pair the flattened warehouse structures with SSRS and voila....basic management reporting 101. Nice post!

  • Gijs,

    in my company, we had to solve a similar problem - given the order date and number of production days, calculate the ship date. We did come up with a Calendar table, with the addition of a couple of (admittedly, denormalized) columns "prodDayNumber" and "nextProdDayNumber" . For production days, their values are equal, fro non-production days, "prodDayNumber" is NULL, "nextProdDayNumber" equals "prodDayNumber" of the next production day. Whenever we add more rows to the Calendar table (which happens once a year at least, or when we add more facilities), we populate those columns as well. After that, the query, similar to your last one, runs really fast.

  • To Antares686:

    The Calendar table is not very big, so you're not saving much space by storing exceptions. Besides, having a full-blown Calendar table gives you much more than just counting date difference. For example, try to solve a reverse problem - given the order date and number of production days, calculate shipping date. Or, given an employee's time sheet, calculate compensation, keeping in mind that different days can have different pay rates. Possibilities are endless.

  • Perhaps this is one of the most interesting blogs that I have ever seen. Interesting article, Funny comment. Keep it up!

    <a href="http://www.coachbagshoppes.com/">coach bags</a>

    coach bags[/url]

    <a href="http://www.guccibagsshoppes.com/">gucci outlet</a>

    gucci outle[/url]

  • izblank (5/31/2011)


    To Antares686:

    The Calendar table is not very big, so you're not saving much space by storing exceptions. Besides, having a full-blown Calendar table gives you much more than just counting date difference. For example, try to solve a reverse problem - given the order date and number of production days, calculate shipping date. Or, given an employee's time sheet, calculate compensation, keeping in mind that different days can have different pay rates. Possibilities are endless.

    I can agree and disagree with that. I agree that it is not super significant in most cases. However I disagree based on your choice of scoping. I am a firm believer in KISS, so keep it simple. For the information provided a full blown calendar table is not needed and thus I will simplify it to what I need versus what I could build otherwise I am storing things I may never use which is a waste of resources.

    Now that said you should never calculate a single employees time sheet in a system. Now when you add in multiple employees you will having varying pay and may have different rulesets for the type of employee, the local of the employee, bonuses based on performance and other types of rules you will need to apply. And generally speaking the type of employee you would be referring to would be an hourly paid employee which means you would have them log their time to ensure you are paying them for what they work. In that scenario you would not calculate anything after the fact because you have a legal requirement on being able to validate the correctness of your information if you are ever audited. You would instead calculate upon entry and store with the employees entry the information about the calculation. Yes I know a lot of companies don't do this, but I know a few who have been fined or taken to court over incorrect pay amounts and they lost because they could not prove they did it right. So I would advise anyone who calculates the way you describe as setting themselves up for potential trouble later. This is however my opinion but it is based on historical comparison.

    As for calculating the shipping date, that can be done without the table having all the dates in them. Sure you could have the dates in the table which for a single calculated field on a screen is fine or you can put in your business logic of your app to figure on the fly as you won't see a significant difference between the two methods. Not sure what you would get out of it otherwise as reports will need some recursion to find the correct estimated shipping date per line which could be done still by looking only at the non-production dates and honestly the only people who care about estimated shipping dates are customers, shipping department and audit reporting. My best answer there since over time things can with regards to when something might ship as well as different types of items is to calculate at the time of entry the estimated date so everyone will get the idea as to a deadline and audit will want to do efficiency reports historically which means they will want to see if items are consistently shipped in under or over the estimated time which impacts customer satisfaction.

    So I go back and say I agree and disagree with you. You have to weigh it based on your requirements and what impacts you really need to address.

  • Antares686 (6/1/2011)

    Now that said you should never calculate a single employees time sheet in a system. In that scenario you would not calculate anything after the fact because you have a legal requirement on being able to validate the correctness of your information if you are ever audited. You would instead calculate upon entry and store with the employees entry the information about the calculation.

    But you could still use the Calendar table upon entry, right?

    Antares686 (6/1/2011)

    As for calculating the shipping date, that can be done without the table having all the dates in them.

    And, as you admit yourself, it would require a recursion or a cursor-like processing, which is order of magnitudes slower. We perform this calculation millions times a day, so it does make a difference. And I can't keep the logic in the front-end app for reasons I will give below.

    Antares686 (6/1/2011)

    the only people who care about estimated shipping dates are customers, shipping department and audit reporting.

    There are some legal requirements as well

    Antares686 (6/1/2011)

    My best answer there since over time things can with regards to when something might ship as well as different types of items is to calculate at the time of entry the estimated date so everyone will get the idea as to a deadline and audit will want to do efficiency reports historically which means they will want to see if items are consistently shipped in under or over the estimated time which impacts customer satisfaction.

    I am in eCommerce business, and there is a need to calculate shipping date not only upon order placement, but at some other points of the order life cycle. So the logic needs to be in the database, and it needs to be lightning fast, even if we do it row by row,

    Antares686 (6/1/2011)

    So I go back and say I agree and disagree with you. You have to weigh it based on your requirements and what impacts you really need to address.

    Maybe I should have said it differently. The main advantage of tables like Calendar is not just that they have everything in it, but also that you can pre-calculate a bunch of things there, so that you do not repeat the calculations again, or reduce you own calculations to a simple math. It is the same idea as materialized views. Or, suppose you need to calculate sum of squares of consecutive integers. You could do it in a loop, or you could expand your Tally table to hold running sum of squares, thus reducing your calculation to mere subtraction of two numbers found in the table.

  • izblank (6/1/2011)


    Or, suppose you need to calculate sum of squares of consecutive integers. You could do it in a loop, or you could expand your Tally table to hold running sum of squares, thus reducing your calculation to mere subtraction of two numbers found in the table.

    It would be much simpler to do the arithmetic in the obvious way, using n*(n+1)*(2n+1)/6 for the sum of the squares of the first n natural numbers and getting the sum of sqares of integers between m+1 and n inclusive by using that twice and subtracting.

    Doing a three way join onto a Tally table with a lot of rows (and containg some extra stuff in addition to the single number of the traditional Tally table so that it occupies more space) may not be sufficiently efficient to justify the additional code complexity. Doing it in a loop would be just plain crazy.

    Tom

  • Tom.Thomson (6/1/2011)


    izblank (6/1/2011)


    Or, suppose you need to calculate sum of squares of consecutive integers. You could do it in a loop, or you could expand your Tally table to hold running sum of squares, thus reducing your calculation to mere subtraction of two numbers found in the table.

    It would be much simpler to do the arithmetic in the obvious way, using n*(n+1)*(2n+1)/6 for the sum of the squares of the first n natural numbers and getting the sum of sqares of integers between m+1 and n inclusive by using that twice and subtracting.

    Doing a three way join onto a Tally table with a lot of rows (and containg all extra stuff in addition to the single number of the traditional Tally table so that it occupies more space) may not be sufficiently efficient to justify the additional code complexity. Doing it in a loop would be just plain crazy.

    Right, and now instead of sum of squares, I will ask you to get a count of prime numbers within a range. Can you come up with a formula for that?:-)

Viewing 15 posts - 1 through 15 (of 15 total)

You must be logged in to reply to this topic. Login to reply