Worst Practices - Making a "Live" Change

  • Comments posted to this topic are about the content posted at http://www.sqlservercentral.com/columnists/sjones/worstpracticesmakingalivechange.asp

  • Hi Steve

    As an apps DBA, you wouldn't believe the excuses and comments from developers regarding getting so-called ad hoc changes through test and then prod with a time span of one minute. The development manager really needs to own and manage such practices and be thoroughly supportive of proper change practices and impact analysis. People tend to forget how much it can cost companies when those "minor patches" go wrong.

    Cheers

    Chris K


    Chris Kempster
    www.chriskempster.com
    Author of "SQL Server Backup, Recovery & Troubleshooting"
    Author of "SQL Server 2k for the Oracle DBA"

  • In the groups I have worked with, it is generally the person who barks loudest about these things who is the one who fouls things up. However, 90% of the time they make changes right before going on vacation, so everyone else has to track down the mistake in their code. And the worst part of all is that they tested the code in the test environment, but instead of scripting the object or code changes they typed them in by hand.

  • Gods do I hear you on this, and I'm deaf!!

    This is why I always script everything I do. Yes, EM is quick and easy, but the wretched thing should be locked down in the live environment!

    We follow this procedure:

    • Unit test our code
    • Script and document the installation procedure (a rough sketch of one such scripted step is at the end of this post)
    • Get someone else to work through the script; a curmudgeonly BOFH is best for this.
    • Get someone else to test the results; the user from hell is best for this.

    If neither the BOFH nor the user smiles at you, then your change and installation procedure work and can be applied to the live system.

    You would think that this would be fool-proof, but we often deal with 3rd party suppliers. Unless all suppliers buy into this test mechanism, you may find that even the most carefully planned change can still go wrong.
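    As a concrete illustration of the scripted step, here is roughly what one such step can look like. This is only a sketch: dbo.Orders and OrderNotes are made-up names, and a real script would carry more checks.

    -- Sketch only: object names are placeholders.
    IF NOT EXISTS (SELECT * FROM INFORMATION_SCHEMA.COLUMNS
                   WHERE TABLE_NAME = 'Orders' AND COLUMN_NAME = 'OrderNotes')
    BEGIN
        ALTER TABLE dbo.Orders ADD OrderNotes varchar(500) NULL
        PRINT 'Step 1 complete: OrderNotes added to dbo.Orders'
    END
    ELSE
        PRINT 'Step 1 skipped: OrderNotes already exists'
    GO

    -- Documented back-out, run only if the change has to be reversed:
    -- ALTER TABLE dbo.Orders DROP COLUMN OrderNotes

    The same script runs unchanged against test and live, which is the point: whatever the BOFH worked through is exactly what production gets.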

  • Our product has been implemented across several sites. We use SQL Server.

    What has come to our notice is that too much time is being spent fixing the same problems over and over again. This is because someone at a particular site notices a problem, fixes it, and doesn't bother to inform anyone else about it. When the same problem crops up at another site, the person over there fixes it as well.

    The problem: we now have two different versions of the code, since PERSON A failed to inform the central team about the problem encountered because he or she thought it was not 'worth informing'. We have now started a code consolidation exercise which is going on forever.

    This is another worst practice: not bothering to report 'trivial changes'.
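    One thing that can speed up a consolidation exercise like ours (a sketch only - it assumes SQL Server 2000-style metadata views) is a quick drift check: dump a checksum of every procedure and function definition at each site and diff the lists. ROUTINE_DEFINITION is truncated at 4000 characters, so a matching checksum does not prove the code is identical, but a mismatch is a reliable flag that someone patched locally without telling anyone.

    SELECT ROUTINE_SCHEMA, ROUTINE_NAME,
           CHECKSUM(ROUTINE_DEFINITION) AS definition_checksum
    FROM   INFORMATION_SCHEMA.ROUTINES
    ORDER  BY ROUTINE_SCHEMA, ROUTINE_NAME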

    Cheers!

    Abhijit

  • I totally agree with the article, and all subsequent comments.

    I'd like some debate, however, so I'd like to state two extremes and ask where people's tradeoffs are.

    1. Critical system, 24 hour access, any downtime at all is damaging.

    2. Non-critical system, used every Friday for 5 minutes, doesn't matter if it doesn't work for a month or so.

    I'd be surprised if there is anyone who would make changes to a live system 1, and I'd also be surprised if there is anyone who would NOT make changes to a live system 2.

    There must therefore be a middle-ground somewhere. Ultimately this middle-ground should be decided based on cost and business risk, but I would be interested to know where YOUR middle-ground is...

    Ryan Randall

    Solutions are easy. Understanding the problem, now, that's the hard part.

  • My experience with systems that are used periodically, i.e. at set intervals but not continuously, is that when they are used it is "all hands to the pump" and they absolutely have to be up and running when they are needed.

    There is a danger that an important repair is forgotten and you are left desperately trying to bring yourself back up to speed when it becomes an urgent issue.

    The time to be pulling your hair out with these things is not when their failure is highly visible and everyone is looking at you.

    I think the danger of having a "it doesn't matter if this doesn't work for a while" application is that it encourages sloppy practices and poor discipline.

    It is very easy to fall into bad habits and to be lulled into complacency. It is only when you are badly bitten in the bum that you realise just how badly you were lulled.

  • No middle ground for me, sorry. Even though item 2 is non-critical and can stand to be down for a month, you will still have to fend off customer complaints concerning it being down. And many will rip you a new one if it goes down more than once (I know - I have had this happen to a developer here who learned his lesson). Good coding and implementation practices should always be used and tested no matter the scenario. Besides, no matter how non-critical a system is, any amount of downtime and troubleshooting is a waste of time, energy and resources. Ultimately, even though it may seem insignificant, it will cost you in dollar amounts, and the more mistakes made, the higher the amount.

  • I'm a bit confused by those replies.

    David - If the system has to be up and running when it's needed, then is that a non-critical system?

    Antares:

    a) Scenario 2 isn't necessarily a customer system?

    b) I don't follow your point about downtime and troubleshooting being a waste of time, energy and resources in the context of making some changes on either the live or the development version.

    By the way, how do you guys copy your changes from the development to the live system? Do you have a system in place to do it? If so, how do you manage changes to that system?

    I should say that I'm not advocating any approach, I'm just asking a few questions - hopefully to try to get a discussion going.

    Ryan Randall

    Solutions are easy. Understanding the problem, now, that's the hard part.

  • quote:


    I'm a bit confused by those replies.

    David - If the system has to be up and running when it's needed, then is that a non-critical system?

    Antares:

    a) Scenario 2 isn't necessarily a customer system?

    b) I don't follow your point about downtime and troubleshooting being a waste of time, energy and resources in the context of making some changes on either the live or the development version.


    Sorry, I should have stated specifically that it is a waste in production. Everything should be tested prior to production. If tested properly, no issue should arise (except the occasional unknown factor). The key is that production should only be declared when all sanity checks have been done; otherwise it is better to call it a BETA, meaning bugs may exist, not that they necessarily do.

    And as for it not being a customer system: that doesn't exist. If you build a system and store data to be used by anyone, it is a customer system, even if that is just one person. Otherwise you are talking about a dev system or a test system. I am referring to prod, which means it is in use.

    Edited by - antares686 on 01/21/2003 11:55:08 AM

  • quote:


    I'd like some debate, however, so I'd like to state two extremes and ask where people's tradeoffs are.

    1. Critical system, 24 hour access, any downtime at all is damaging.

    2. Non-critical system, used every Friday for 5 minutes, doesn't matter if it doesn't work for a month or so.

    I'd be surprised if there is anyone who would make changes to a live system 1, and I'd also be surprised if there is anyone who would NOT make changes to a live system 2.


    I would make changes to option 1. Why? Well, it's all I have. My client won't put out the money to buy a test-bed system, so the production one is all we have. Recently (and it will happen again soon) my client demanded that we run a pilot on our live, production system. This meant loading a vendor's software on our system strictly for the purpose of finding out how well it works. When we didn't go with the software (it didn't work well) we had a terrible time uninstalling every last bit of it. Explained this to the client AGAIN and guess what....that's right...it went in one ear and out the other. Bottom line, the client won't buy a test-bed system; it's the client's system and they will do whatever they want with it; and if the system crashes, it's my fault.

    -SQLBill

  • I don't disagree, but I still don't quite practice all of Steve's wisdom yet. I try to avoid live changes, but in the environment/pace I have, sometimes I take the risk. Note that when I say risk, it's fairly minimal - adding a column, tweaking a stored proc. Not huge sweeping changes.

    Sr developers typically don't like having more controls imposed; jr developers usually love it - they live in fear of doing something wrong! Reviewing changes with jr developers is a good chance for some 5 min mentoring. Apply Steve's "is it stupid" test.

    Practically, my philosophy is to never work without a net. The transaction log is one. A tool like Log Explorer might be another. Or just making a copy first, just in case.
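    The copy-first net can be as crude as this (just a sketch - the table and column names are made up):

    -- Keep something to fall back on before the "minor" tweak.
    SELECT * INTO dbo.Customers_pre_change FROM dbo.Customers  -- ad hoc snapshot; drop it once the change has settled
    GO

    BEGIN TRAN
        UPDATE dbo.Customers
        SET    Region = 'EMEA'
        WHERE  Region = 'Europe'
        -- check @@ROWCOUNT and spot-check a few rows right here, before letting go
    COMMIT TRAN   -- or ROLLBACK TRAN if the numbers look wrong

    Between the snapshot, the open transaction and the log, even a live tweak has more than one way back.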

    Andy

    http://www.sqlservercentral.com/columnists/awarren/

  • Ryan,

    What sort of system is it that runs for 5 minutes on a Friday and doesn't matter if it doesn't work for a month or so?

    OK, I know that this is just an example, but I would be looking to retire any system fitting this description.

    My experience is that in a live environment, every system is critical. Then again, if you work in accounts, everyone else's system is regarded as non-critical.

    The way my company manages change is as follows:

    • Developer develops and does basic tests
    • Someone else performs exhaustive testing. This is normally done in test cycles, i.e. testing the entire thing many times
    • Develop an installation script, SQL scripts, backup strategy, etc. (an example step is sketched below)
    • Rehearse and test said script
    • On success, apply the script in the live environment

    The installation script is very much an idiot's guide to installation and acts as a checklist.

    The script has space for the implementors (old term from the Infocom ZORK series) to write comments and to check off a step on the script.

    At the end the person responsible for the implementation has to sign off the implementation as being complete.

    You now have a signed off document to say that everything has been completed and installed to plan. Any mistakes and their solutions are clearly documented and can be reviewed for future reference.
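    To make that concrete, a single step on one of our run sheets looks roughly like this (purely illustrative - the view name and step numbers are invented for the example):

    -- Step 3 of 12: create the reporting view.
    -- Implementor: ________   Date/time: ________   Result: [ ] OK   [ ] Issue (note below)
    PRINT 'Step 3: creating view dbo.vw_OpenOrders'
    GO
    IF EXISTS (SELECT * FROM INFORMATION_SCHEMA.VIEWS WHERE TABLE_NAME = 'vw_OpenOrders')
        DROP VIEW dbo.vw_OpenOrders
    GO
    CREATE VIEW dbo.vw_OpenOrders
    AS
        SELECT OrderID, CustomerID, OrderDate
        FROM   dbo.Orders
        WHERE  ShippedDate IS NULL
    GO
    PRINT 'Step 3 complete - tick the box on the run sheet before moving on to step 4'

    The comment header is what the implementor initials and dates; that is what turns a script into a record.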

    Actually, I could probably write an article about this.

  • Oh, and after a change, the one key step, no matter whether it was a success, a failure, or a success after issues, is:

    Do a group follow up and make a list of lessons learned.

    For example, we went through a Call Center switch conversion; my group collects data from that switch through various means. After about a week the switch dropped like a rock. Upon bringing it back up, they took note of all the processes and what was eating the most processor time. Unfortunately they pointed at one of our apps (88%). Fortunately for me, three other switches did not show this issue even though the same app was hitting them.

    Ultimately they decided it must be the CPU or motherboard and thus set out to change the CPU and see what happened (at 2 AM, and I have never heard of a bad CPU unless someone burnt it out with stupidity). So my manager and I came in at 2 AM and let them have their fun. They changed out the CPU, brought the switch back up, and BAM, 88%. OK, not the CPU (or maybe it is, they suggested again; OK, who is the nut case?).

    Anyway, 4 AM rolls around and I am thinking: the old UNIX boxes are fine, NT is the new switch, so what is missing from this picture (and no, it ain't UNIX)? So I decide to jump in and ask whether anyone has bothered to check the NT Event Viewer logs. Answer: NOPE. So they check them and, lo and behold, they find a process screaming about issues every few seconds. Note: if a process is writing to the Event logs that often, it does have an impact on server health. We traced the message back to a controller card which, by the way, was related to the process I was querying every 3 seconds. The card change was rescheduled for another day (and afterwards it dropped my process's CPU usage to 25%).

    So after change we schedule 8AM meeting for lessons learned. I bet you can guess the lesson.

  • When one eliminates the impossible, whatever remains, no matter how improbable, must be the solution.

    Sherlock Holmes.
