Refreshing the data warehouse without hurting anybody

  • Our management decided, for reasons best known to themselves, to outsource our data warehouse development. It became apparent that while the company we chose had probably built some simple DWs in the past, they had not devised ways to deal with some of the subtleties that crop up. Some quotes to set the scene:

    :: "We don't generally like doing type 2 dimensions - it confuses the customers"

    :: "We'd rather not do incremental loads. Truncate and reload is better."

    :: "The sk CAN be a compound key"

    :: "We've never integrated data from multiple source systems before."

    Don't get me wrong. I'm not making light of those issues. They're complex, for sure. But they're all issues that one could reasonably expect a data warehouse to deal with. The latest issue to arise is that if people query a datamart (the fact and dim tables, not the cube) while a refresh is in progress they'll get the wrong results because the SKs in the dimensions are recreated on each refresh and there's a period of time (an hour and growing) between the refreshing of the dimensions and the refreshing of the facts. We're a 24/7 organisation so this must not happen. The developers have acknowledged this is a problem. They've suggested a solution but said they won't implement it unless they can requote on the work. My question is this: Is it a reasonable implicit expectation that a data warehouse have a strategy for preserving the integrity of query results during a refresh? If it's reasonable to expect this (i.e. it's an expected feature of any data warehouse) then we shouldn't need to requote. They should just deliver what's reasonably expected. If it's unreasonable to expect this, then we should have specified more carefully, and we should requote.
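    To make the failure mode concrete, here's a rough sketch of what's going on (the table and column names below are made up for illustration; ours differ):

    ```sql
    -- Hypothetical dim/fact pair; names are illustrative only.
    -- The refresh rebuilds the dimension and reassigns every surrogate key:
    TRUNCATE TABLE dim.customer;

    INSERT INTO dim.customer (customer_sk, customer_code, customer_name)
    SELECT ROW_NUMBER() OVER (ORDER BY customer_code),  -- SKs regenerated from scratch
           customer_code,
           customer_name
    FROM   staging.customer;

    -- For the next hour-plus, fact.sales still holds YESTERDAY'S SKs, so a
    -- routine query joins old fact keys to new dimension keys and quietly
    -- returns wrong (or missing) rows:
    SELECT d.customer_name, SUM(f.amount) AS total_amount
    FROM   fact.sales   AS f
    JOIN   dim.customer AS d ON d.customer_sk = f.customer_sk
    GROUP  BY d.customer_name;
    ```

    Between those two steps the join key no longer means the same thing, which is why the window of wrong answers keeps growing with the data volume.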

    I'm asking the good people of SSC to be the jury in a hypothetical civil case. What say you?

    ...One of the symptoms of an approaching nervous breakdown is the belief that one's work is terribly important... Bertrand Russell

  • I'll offer my opinion, but by no means should it be seen as a verdict (as I don't have all the facts).

    It is a reasonable expectation (indeed, the very premise of a data warehouse) that it will track history (i.e. type 2 dimensions), load data from multiple sources, and load data incrementally. Anybody who makes statements like the ones you mentioned, without a really good explanation, shows their inexperience and ignorance of the subject. In short... if these statements were made before the engagement even began, you should have turned and run (quickly).

    With that being said, it also depends on the requirements communicated at the outset. Not every situation calls for history to be tracked, and depending on the size and nature of the data it may make sense to truncate and reload. I personally prefer to take clients through a process whereby these concepts are discussed within the realms of their data and requirements. Decisions are then made through this process in conjunction with the business users, and nobody feels like they've been short-changed afterwards.

    My guess is that such a process may not have been taken in your situation, and you are now reaping the results of that. It's a tough situation to be in, and instead of trying to force your current delivery partner's hand it may be a better option to get another objective opinion from a team with more experience.

  • Thanks Martin. For the specific question "Is it a reasonable implicit expectation that a data warehouse have a strategy for preserving the integrity of query results during a refresh?" What's your view?


  • if these statements were made before the engagement even began

    They mostly came to light after the contract had been signed. The undertakings they gave prior to that were all of the vapid "Yeah we can do that." variety. The contract was silent on all of those things and therefore it comes down to "What would one reasonably expect from a DW?" As I say the decision to engage the developers was made by senior management for reasons known only to themselves. It is what it is.


  • GPO (5/12/2015)


    Thanks Martin. For the specific question "Is it a reasonable implicit expectation that a data warehouse have a strategy for preserving the integrity of query results during a refresh?" What's your view?

    I wish it were as simple as "yes" or "no", but there's a lot of gray area between "reasonable" and "implicit expectation".

    My view as a professional would be the following: if it was a known requirement that data be refreshed during business hours, or that end users may want to query the data during the ETL process, then they should have thought about it and devised a strategy to ensure that the integrity of the data remains intact.

    That could mean different things depending on the scenario, but it is something that has to be accounted for in the load design. You would usually try to run ETL processes at night, when nobody wants to access the data, but if your company is 24/7 then it is something that should have been thought about from their end.

    In my opinion, it isn't the best approach to truncate and reload an entire data warehouse every time. If this approach is to be followed, you should at minimum either truncate all the tables as a first step or ensure that the system is inaccessible to the end users.
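    As a hedged sketch of that second option (the database name here is just a placeholder), SQL Server can be closed to ordinary users for the duration of the reload:

    ```sql
    -- Keep end users out for the duration of a full truncate-and-reload.
    -- "SalesDW" is a placeholder database name.
    ALTER DATABASE SalesDW SET RESTRICTED_USER WITH ROLLBACK IMMEDIATE;

    -- ... run the truncate-and-reload ETL here ...

    ALTER DATABASE SalesDW SET MULTI_USER;
    ```

    RESTRICTED_USER still admits db_owner and sysadmin members, so the ETL account can keep working while report users are locked out. Of course, for a 24/7 shop this just trades wrong answers for no answers.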

  • GPO (5/12/2015)


    if these statements were made before the engagement even began

    They mostly came to light after the contract had been signed. The undertakings they gave prior to that were all of the vapid "Yeah we can do that." variety. The contract was silent on all of those things and therefore it comes down to "What would one reasonably expect from a DW?" As I say the decision to engage the developers was made by senior management for reasons known only to themselves. It is what it is.

    It's an unfortunate situation, and one I sadly see too often. When it comes to Business Intelligence, it is sometimes too easy for service providers to mislead customers into thinking that everything is possible, doable and easy. And it's easy because in most situations, the client knows less about BI than the services company they are trying to engage with.

    If I was in your situation, I would urge management to pay for an independent assessment of the environment. Use a company with a good track record and reputation. The assessment will quickly highlight everything that should have been done, or things that can be implemented to make the environment better. An experienced vendor would be able to look at both the requirements and what was delivered, and assess whether you were short-changed.

    Forcing your current service provider to fix things based on implicit requirements may not yield the expected result...especially if they are not experienced.

    Hope this helps.

  • GPO (5/12/2015)


    Our management decided, for reasons best known to themselves, to outsource our data warehouse development. It became apparent that while the company we chose had probably built some simple DWs in the past, they had not devised ways to deal with some of the subtleties that crop up. Some quotes to set the scene:

    :: "We don't generally like doing type 2 dimensions - it confuses the customers"

    :: "We'd rather not do incremental loads. Truncate and reload is better."

    :: "The sk CAN be a compound key"

    :: "We've never integrated data from multiple source systems before."

    Don't get me wrong. I'm not making light of those issues. They're complex, for sure. But they're all issues that one could reasonably expect a data warehouse to deal with. The latest issue to arise is that if people query a datamart (the fact and dim tables, not the cube) while a refresh is in progress they'll get the wrong results because the SKs in the dimensions are recreated on each refresh and there's a period of time (an hour and growing) between the refreshing of the dimensions and the refreshing of the facts. We're a 24/7 organisation so this must not happen. The developers have acknowledged this is a problem. They've suggested a solution but said they won't implement it unless they can requote on the work. My question is this: Is it a reasonable implicit expectation that a data warehouse have a strategy for preserving the integrity of query results during a refresh? If it's reasonable to expect this (i.e. it's an expected feature of any data warehouse) then we shouldn't need to requote. They should just deliver what's reasonably expected. If it's unreasonable to expect this, then we should have specified more carefully, and we should requote.

    I'm asking the good people of SSC to be the jury in a hypothetical civil case. What say you?

    If it's a civil case, then the management team that 1) made the decision to outsource the work and 2) gave the work to that particular company should probably be locked up for life and subjected to mandatory sterilization so as to end that branch of the human gene pool. 😛

    I don't know what it is with most companies. They interview individual people like crazy to get what they hope will be the best, but they don't actually ask 3rd party companies more than "can you do this and how much will it cost". They don't do background checks, interview the company's other customers, or draw up a decent contract that stipulates performance or other expectations, so that you could actually sue such a bunch of dummies for falsely representing available services and products.

    Let's review the answers you posted.

    :: "We don't generally like doing type 2 dimensions - it confuses the customers"

    Real meaning: We've heard of it but every time we try it, our customers complain because we don't actually have a clue what we're doing and we mess it up so badly, there's no hope of our customers understanding it. We offer to help the customers understand the mess we made if they pay more money.

    :: "We'd rather not do incremental loads. Truncate and reload is better."

    Real meaning: We've tried this in the past but we don't actually have anyone that knows how to do it correctly or with performance. We do have one guy that's an absolute expert at deleting stuff though. We know he's good at it because he uses it a lot in doing type 2 dimensions. If you really want this done, you'll need to pay more because he's REALLY good at deleting stuff and we're sure that's what will be needed.

    :: "The sk CAN be a compound key"

    Real meaning: At least that's what we read somewhere. Since we know that and you don't, we don't have to answer any more of your questions about it unless you want to pay us more money.

    :: "We've never integrated data from multiple source systems before."

    Real meaning: We've never actually tried it because we have enough problems trying to integrate data from just one source. We'll get it down soon but we need more money so we can learn how to do it on your dime.

    😛

    My assessment is that their answers are incredibly stupid and, contrary to what you've said, none of those things is actually difficult. If I were the DBA at your company, I would document a case against them so that I had enough info to at least convince management to cut their losses and fire the company, provided they weren't foolish enough to have entered a "guaranteed minimum" contract.

    Then, find someone that knows not only how to interview companies for such tasks but also how to make the companies cough up solid proof that they've done such things in the past, and what the performance and resource usage of the code was, not to mention interview customers OTHER than the ones the company provides you with.

    If I sound a little twisted about this subject, it's because I'm 3 days older than dirt and I've seen this dozens of times. There is absolutely no excuse for this type of junk to happen. And management needs to learn the most powerful lesson of all... "If you want it real bad, that's the way you'll normally get it". 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Oh Jeff,

    I can't begin to tell you how hard you've made me laugh. It's like you work here. Keep up the good work. It has cheered me up no end.


  • Heh... thanks for the feedback, Bertrand.

    Now, if you haven't seen the following, make sure that you don't have your mouth full of your favorite beverage and imagine the negotiations between that company and the geniuses that hired the company. I can only imagine that it went something like the following... https://www.youtube.com/watch?v=BKorP55Aqvg

    --Jeff Moden



  • The latest issue to arise is that if people query a datamart (the fact and dim tables, not the cube) while a refresh is in progress they'll get the wrong results because the SKs in the dimensions are recreated on each refresh and there's a period of time (an hour and growing) between the refreshing of the dimensions and the refreshing of the facts

    I'm not certain that your exact question has been addressed. I don't allow the direct querying of base tables except through drill-through. If the base tables are being redone every time, however, this is still problematic. On all the other points you mention, the company is a major fail, but on this point there is some room to cut them some slack. I've seen others use the full-reload method, and as you've found, the reload time will only increase. Incremental is more difficult, but it's not that difficult.

  • I'm not certain that your exact question has been addressed.

    You're right Ron, and thanks for tackling it. You seem to be leaning towards the view that we should have explicitly specified that these errors (where the facts and dims get out of sync AND are queried during a refresh) must not occur. I think what we'll do in the short term is get the facts truncated before anything else and loaded after everything else. The cubes themselves are immune to the problem as far as I can tell. A solution I've heard about, which results in practically zero downtime for the facts and dims, involves having two copies of each table (e.g. dim.staff_a and dim.staff_b) and a synonym called dim.staff. The tables alternate between fresh and stale: the *_a tables are fresh today and stale tomorrow, and the *_b tables vice versa. Once your tables are refreshed, you recreate the synonyms to point at the fresh tables, which takes almost no time at all. Your reporting queries point at the synonyms, so they never need to change. This would, of course, require our vendor to make quite a number of changes to the work they've already done, and they'd want to requote to deliver us a product that was actually of merchantable quality 🙂
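    In T-SQL, the flip itself might look something like this (the staff columns are made up; this is only a sketch, not our actual schema):

    ```sql
    -- Load today's data into the stale copy (say dim.staff_b today),
    -- while dim.staff still points at yesterday's fresh copy (dim.staff_a).
    TRUNCATE TABLE dim.staff_b;

    INSERT INTO dim.staff_b (staff_sk, staff_code, staff_name)
    SELECT staff_sk, staff_code, staff_name
    FROM   staging.staff;

    -- Once the load completes, repoint the synonym. SQL Server has no
    -- ALTER SYNONYM, so it's a drop-and-recreate inside a transaction:
    BEGIN TRANSACTION;
    DROP SYNONYM dim.staff;
    CREATE SYNONYM dim.staff FOR dim.staff_b;
    COMMIT TRANSACTION;
    ```

    Because a synonym is pure metadata, the switchover is near-instant, which is where the "practically zero downtime" comes from.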


  • that we should have explicitly specified that these errors (where the facts and dims get out of sync AND queried, during refresh) not occur

    Yes, that's what I'm saying. Even an incremental load would experience this. While I agree that all the other requests you made were not only reasonable but should be considered normal, if someone brought me this requirement there would likely be some pushback from me on its necessity, as a solution would require a great deal more work.

  • My apologies. I got caught up in my outrage over yet another vendor that can't do what they said they could do.

    You and Ron are spot on with the synonym-flopping. It's a tried and true method if you have the disk space for it.

    The problem is that it doesn't solve any of the problems indicated by the four original company comments in the original post. My recommendation would be that the synonym-flops could be written fairly easily by in-house personnel and, done correctly, it shouldn't void any warranty (BWAAAA-HAAAAA-HAAA!!!! Like someone thought of that in the contract!) that you might have with that company. Of course, you should check with them first and get at least THAT in bullet-proof writing.

    Of course and if it were me, I'd still cut that company loose and find someone that actually knows what they're doing.

    --Jeff Moden



  • Very funny Jeff, and happens more than people would like to admit 🙂

  • RonKyle (5/13/2015)


    ...Even an incremental load would experience this.

    Not sure that I agree 100% with that statement, Ron. If you incrementally load dimensions and then facts, this scenario (facts pointing to incorrect or non-existent dimension members) will not happen.

    The only things that could happen during an incremental load are that users may see an older version of the data, and of course some locking/blocking.
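    A minimal sketch of what I mean (names are illustrative; it assumes the surrogate key is an IDENTITY column and customer_code is the natural key):

    ```sql
    -- Dimensions first: existing members keep their SKs, new members are appended.
    INSERT INTO dim.customer (customer_code, customer_name)  -- customer_sk is IDENTITY
    SELECT s.customer_code, s.customer_name
    FROM   staging.customer AS s
    WHERE  NOT EXISTS (SELECT 1
                       FROM   dim.customer AS d
                       WHERE  d.customer_code = s.customer_code);

    -- Facts second: SKs are resolved by natural key, so every new fact row
    -- references a dimension member that already exists.
    INSERT INTO fact.sales (customer_sk, sale_date, amount)
    SELECT d.customer_sk, s.sale_date, s.amount
    FROM   staging.sales  AS s
    JOIN   dim.customer   AS d ON d.customer_code = s.customer_code;
    ```

    Since no existing SK is ever regenerated, a query that runs between the two steps can only miss the not-yet-loaded facts; it cannot join facts to the wrong dimension members.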
