A Patch Disaster

  • Comments posted to this topic are about the item A Patch Disaster

  • An interesting challenge as corporate infrastructure becomes more centralized and server virtualization expands. Fewer people support more platforms with automated tools. The days of knowing, or even being able to identify, the system support and admin teams seem like a distant memory.

    An automated patch process in this environment is a challenge, all in the interest of cyber security and minimizing cost.

  • Automation is actually easy, and I have no idea how they managed this mistake unless they were using a third-party (i.e., non-MS) tool to deploy the update. We use WSUS and don't EVER have this problem, because WSUS works with Windows Update specifically to install only what's applicable to that OS (System Center does the same thing but is more complex and has more features). It also includes updates for all MS products, so I can apply SQL updates along with Windows updates, minimize downtime for everyone, and end up with only one reboot, all without having to pay it any attention. When done properly, automated update maintenance works great; I know it saves me a lot of time and manual labor.

  • I'd have to say that we do see security patches for SQL Server these days, and they certainly require more work for me than service packs.

    Service packs are straightforward: if SP# > current SP, install it.

    MS12-070 - well, what's your current build number? And if Microsoft Update installed the one for the lesser build number, you can't upgrade to the greater build number (or at least I haven't figured out how to go from build 10.00.5512 to 10.00.5826 without getting stopped by a "you've already installed this update" error). A rough sketch of this build-number check follows this post.

    Last year, SQL Server 2005 had two security updates (MS12-070, MS11-049) and no service packs.

    2008 had one security update (MS12-070) and no service packs.

    2008 R2 had one security update (MS12-070) and one service pack.

    2012 had one security update (MS12-070) and one service pack.

    Sum total: two service packs, five security updates.
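
To illustrate the build-number check described in the post above, here is a minimal Python sketch, not any official Microsoft tooling: it simply parses and compares build strings, using the 10.00.5512 / 10.00.5826 numbers already quoted in the thread, and flags the "already installed this update" trap where the lower-numbered package blocks the higher one.

```python
# Minimal sketch, assuming the scenario above: a security bulletin ships a
# lower-numbered and a higher-numbered build, and having the lower one
# installed blocks the higher one. Not official tooling; illustration only.

def parse_build(build: str) -> tuple:
    """Turn a build string like '10.00.5512' into (10, 0, 5512)."""
    return tuple(int(part) for part in build.split("."))

def update_action(current: str, lower_build: str, higher_build: str) -> str:
    cur = parse_build(current)
    if cur >= parse_build(higher_build):
        return "already at or above the higher build - nothing to do"
    if cur >= parse_build(lower_build):
        # The trap described above: the lower-build package is installed,
        # so the higher-build package reports "you've already installed this update".
        return "lower-build package installed; higher build won't apply on top"
    return "apply the security update"

if __name__ == "__main__":
    print(update_action("10.00.5512", "10.00.5512", "10.00.5826"))
    # Second example uses a hypothetical pre-patch build number.
    print(update_action("10.00.5426", "10.00.5512", "10.00.5826"))
```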

  • Nadrek (1/10/2013)


    ...

    Sum total: two service packs, five security updates.

    That's across 3 platforms, which still isn't bad. If you look at the Oracle or DB2 lists, there are many more.

  • If you consider how hard hackers and malware developers are trying to break in and corrupt or take data or software illegally, this is rather surprising. And further, if you consider the complexity and diversity of SQL Server functionality, having this few "fixes" is really amazing. I know they save them up and release a number of updates at one time, but still, to not have one emergency security patch after another is great.

    M.

    Not all gray hairs are Dinosaurs!

  • From many rounds of patch management over the years, I would highly recommend the following:

    1) Have a Configuration Management Database (CMDB) in place, and make sure that the patch information is updated regularly through an automated process.

    2) If database servers have their own CMDB, and there are good reasons to do so, make sure that the Windows support and patch management team(s) are aware of it. Provide them with the appropriate interface so they can see which servers are DB servers. Not every junior Windows admin will realize that DB and web/app servers need to be treated differently.

    3) SLAs for each server should be documented. It should be easy to group servers by application to see how the entire patching schedule should be set up. One of the more annoying things to deal with is spending Friday afternoon and evening re-scheduling exceptions to the company-wide patching window.

    4) The patch coordinator for each application should be identified and easy to determine. Application inventory systems may only list the high-level owner, who won't always recognize 'ServerX'.

    5) Make sure that you are actually getting high availability from your clusters. In an active/passive setup, the passive node needs to be identified before every patching cycle. This should be automated and fed into the process, as discussed in #1 and #2. Of course we would like every instance to be on the preferred node, but that doesn't always happen. Since cluster nodes are likely to have consecutive IP addresses and/or server names, both nodes could receive the patch at the same time, something that should obviously never happen (a rough sketch of this kind of check follows this post).

    The bottom line is that it is critical to have good processes in place. To quote the Yogi Berra Aflac commercial, when you don't have it, that's when you gotta have it.
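
To make points #3 and #5 above concrete, here is a minimal Python sketch of a pre-patching check against a CMDB export. The CSV file name and column names (server, cluster, patch_window) are invented for illustration and won't match any particular CMDB product; the idea is just to flag any cluster whose nodes are scheduled in the same patching window.

```python
# Hypothetical sketch only: the file name and column names (server, cluster,
# patch_window) are invented. It flags clusters whose nodes share a patch
# window, which is the situation point #5 warns should never happen.
import csv
from collections import defaultdict

def check_patch_windows(cmdb_csv_path: str) -> list:
    windows_by_cluster = defaultdict(lambda: defaultdict(list))
    with open(cmdb_csv_path, newline="") as f:
        for row in csv.DictReader(f):
            cluster = row["cluster"].strip()
            if cluster:  # standalone servers have no cluster value
                windows_by_cluster[cluster][row["patch_window"]].append(row["server"])

    problems = []
    for cluster, windows in windows_by_cluster.items():
        for window, nodes in windows.items():
            if len(nodes) > 1:
                problems.append(
                    f"cluster {cluster}: nodes {', '.join(nodes)} are all in window '{window}'"
                )
    return problems

if __name__ == "__main__":
    for problem in check_patch_windows("cmdb_export.csv"):
        print("WARNING:", problem)
```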

  • Very good point Steve! Our security team just stopped the MS13-007 patch (http://support.microsoft.com/kb/2769327) here from going onto all of our servers when it was discovered through a website scan that applying that patch would bring down the websites. They scanned our websites and found hundreds of occurrences of the REPLACE function in the .aspx code. This just goes to show you that not every patch Mickeysoft puts out is in your best interest. You must examine each and every one of them on a case-by-case basis for your shop's particular situation. :-D

    "Technology is a weird thing. It brings you great gifts with one hand, and it stabs you in the back with the other. ...:-D"

  • TravisDBA (1/10/2013)


    ...

    Good catch.

    There are definitely issues with some patches. I know I'm hesitant to apply any the first month. I'd rather let someone else test things.

  • Reading deeper into this story, it probably wasn't a patch that was to blame:

    "this was the result of Task Sequence distributed to a custom SCCM Collection. The Collection had been created/modified by an HP Engineer (adding a wildcard) and the engineer had inadvertently altered the Collection so that it was very similar in form and function to the “All Systems” Collection. The Task Sequence contained automation to format the disks"

    That explains the "no OS found" error messages reported, and also answers the question of how MS OS patches could possibly have been installed on the wrong systems.
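
To make the wildcard mistake concrete, here is a tiny Python sketch with invented hostnames. SCCM collections are actually driven by WQL query rules rather than Python pattern matching, so treat this purely as an analogy for how one stray wildcard turns a narrow membership rule into something that behaves like the "All Systems" collection.

```python
# Analogy only: hostnames and patterns are invented, and SCCM evaluates WQL
# query rules, not fnmatch patterns. The point is how a single stray wildcard
# makes a targeted rule match every system.
from fnmatch import fnmatch

servers = ["HPDEPLOY01", "SQLPROD01", "SQLPROD02", "WEBAPP01", "FILESRV01"]

narrow_rule = "HPDEPLOY*"   # intended: only the deployment test boxes
broad_rule = "*"            # after the accidental edit: matches everything

print([s for s in servers if fnmatch(s, narrow_rule)])  # ['HPDEPLOY01']
print([s for s in servers if fnmatch(s, broad_rule)])   # every server in the list
```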

  • It wasn't the patch, it was the delivery. However, that's exactly what I might be concerned about over time. Someone makes a mistake in delivery, which ends up causing issues with systems.

  • Perhaps most importantly:

    The first, easy part:

    If we assume that somehow a server is rendered unstable, excessively slow, incorrect, invalid, unusable or inoperable, what's the plan... and do you ever test it?

    The second, hard part:

    Same as the above... but on many or all servers.

    At many companies, even very large ones, the response goes much like this between a Hardware/OS/low level team and an application/database team:

    Hardware: "We've got backups."

    App: "So it'll be just like it was before X happened?"

    Hardware: "Of course not - we only back up the data/your SQL Server .bak files!"

    App: "Oh. So what now?"

    Hardware: "We've installed the operating system on [the old | some new] hardware."

    App: "So now we have to reinstall our application? From scratch? We haven't done that in Y years! And those people aren't in our team anymore!"

    Hardware: "If the OS or hardware has issues, call us."

    App: "What were all the settings we had?"

    ...

    App: "It's up!"

    User: "Z feature is broken!"

    App: "Oh... there was an exception we had to do Q for."

    GOTO User

    And this takes a while, but it's the normal response to one server failing. If a mass update causes multiple servers to fail at once, this becomes a real nightmare... especially if the backup servers are also affected.

    Bare-metal-restore-capable drive images are a very good solution to this... but almost no one actually takes them of servers, and they do take a lot of space, more if you have encrypted or otherwise incompressible data.

    Whatever your "it caught on fire" plan is, if you don't try it, and plan for doing it on many machines at once, there's a lot of room for a nasty surprise.
