A special thanks to Dwayne J. Baldwin of Winhurst Technologies Inc. for sending this to me. Apparently APC has issued an alert for PowerChute users, in a nice, little place on their website. Click the image below to see a shot of the APC homepage and see if you can find the notice.
To be fair, they sent out notices to their customers, or at least those that probably registered. SQLServerCentral.com community member Dwayne noted that he'd gotten their email just as I write this, on Wednesday, August 17th, 2005. Here's the link to the notice on APC's site. However, I do have a few problems with this, and some experience of my own. I'd let a certificate expire before (more on that below), so I can appreciate the issue.
The notice says that you could experience issues after July 27, 2005. Anyone see an issue here? It's now August 17th, nearly a month since the issue cropped up and notifications are going out now? And it's not a huge news item? The server could stop responding? You have to boot using Safe Mode to fix things? At least it's mentioned on the support page at the top so people would notice it.
I'm sure APC doesn't want to publicize this. And it might not be a huge problem (yet) since the Google Sci/Tech news page has more important stories like XBOX 360 prices, Blu-Ray v HD-DVD, and of course the Windows Zotob worm and variants. Searching for APC is useless, but American Power Conversion gives me information about their racks, earnings, and various product announcements. Even "APC Powerchute" doesn't bring up the story.
Having this issue is pretty dumb for a software company. But not putting it as the top story on your site, and not making a huge effort to warn and reach customers, is negligent. I've got lots of people patching this week and next for the August MS updates and having their servers stop responding on the reboot will not be fun. How many of them will even connect the dots that it's the PowerChute software?
The APC situation is probably a big "whoopsi", but I had a similar one. Names changed to protect the guilty and upset.
doodl-loo-doodle-loo-doodle-loo - Wayne's World
It's Wednesday morning, circa early-2000, and we've survived the Y2K crisis with a huge whimper that had me working way too late on New Year's Eve and back in on New Year's Day. Still things are at their normal hectic pace as I come in to check the status reports from my system admin and junior DBA. Life at a startup financial services firm, managing money and facilitating trading for many educational institutions is interesting and hopefully it will end up being profitable for me.
I look up as Billy, my system admin, runs in. "Trading isn't working!" he says breathlessly.
I'm relatively used to this. Firefighting this home-brewed, quick drip software has resulted in many issues, not the least of which is nearly every change to an application breaks something else. Calmly I lean back in my chair.
"Which part of trading isn't working?" I inquire. Since we have web based, fat client, and batch automated trading, I need to narrow things down.
"Edu-Fund can't connect to their website."
Aha, a critical piece of information that makes some sense to me. One of our big clients has most of their financial advisors connecting to us via the web to trade positions. I'm worried and we need to get this working, but it's not $100,000 a minute worried. We deal in mutual funds and as long as we send the trades out to the big firms by 2:00 MST, we're good. However, it's 9:04 am and we need to get moving.
My first action is to hold up a hand to Billy and grab the phone. I need to get the COO involved and have him start working the client, ensuring that trades are being taken over the phone and logged so that we can help the client catch up when things are working. Next I get up and walk through the halls with Billy in tow. A quick stop at Nathan, the junior DBA's office before hitting the server room.
"Everything ok with the dbs?" I ask.
Nathan's having a decent morning, not too stressed out, but he's working against that. "DBs are fine, but since no one can trade..."
I cut him off, his voice rising and the anxiety growing with each word.
"I'm on that, just keep an eye on all the SQL Servers and jobs. Make sure nothing else is wrong and we're ready for when the trades start coming."
We continue on to the server room. Mainly because I'm more comfortable in there with all the servers, I can quickly hit the console we have for the firewall and routers, and it's quiet. Plus the COO doesn't know the entry code, so I'll have warning before he starts babbling in my ear. Always a good thing in a stressful situation.
The first step when we sit down is to grab the notepad that's sitting on the console desk at all times. I notice that there are no entries for today, but decide not to say anything. I write the date, time, and a short "EduFund website not working" down to begin my log of events. Keeping a short log when things are happening is crucial. I've found it essential for reconstructing events later and learning from what happened. Plus since we log all changes, it's easier and more reliable than someone's memory if something goes more wrong.
I switch the KVM setup over to the administrator workstation and as I pull up the EduFund website, I ask Billy exactly what happened.
"EduFund called in and one of the analysts passed them to me. They said that they cannot log into the website and there's some error about an invalid certificate. They say nothing's changed on their side, but I know nothing's changed on the web servers here so they must be doing something." Billy spills all this out in what seems like one breath. He's excited and when he is, he talks quickly. Fortunately one of us maintains some calm.
I'm silent as the website loads and immediately gives an SSL error. HTTPS isn't allowed as the certificate is expired.
OK, this is simple, but somehow from a financial analyst to a client analyst on our side to my sys admin no one has interpreted this simple "SSL Certificate expired" to mean that the certificate on our web server is expired. Now I'm new to the whole e-commerce side of things. Most of my work has been on local LANs or secured WANs. No SSL involved and so this is a new one to me. But even for someone who's never installed a certificate on an IIS 4.0 / NT 4.0 web server, I've read enough about the web to figure out what's wrong.
Now I'm new to this job, relatively. I didn't set up the systems, but I manage them and so this is my fault. Plain and simple, and the thing to do now is fix it ASAP. As if on cue, there's a knock on the door.
"Get that please," I ask Billy as I start writing a few notes down. Certificate expired, https not working, and I cut and paste the error and save it in a text file on the admin desktop, named for the date and time.
Our COO walks in, slightly upset, but he's maintaining his cool pretty well. "What's going on?" he asks as he sits next to me.
I explain the situation as I'm surfing to TechNet and Verisign simultaneously on two browsers. I tell him the certificate is expired and we can fall back to HTTP immediately and begin processing trades. As I give this option, the client analyst comes in and immediately starts yelling that we cannot go off SSL communications because of security. Man, we need to change the code on that door. I think to myself that the COO is the only person that doesn't know it.
I explain that since the clients at EduFund are mostly on the same LAN, we can set a firewall rule and limit their access from their outgoing proxy server to this webserver. It doesn't help a few other clients, but they're not working anyway, so there's nothing I can do to help them without knowing specific IPs. There's more complaining, but I ask that they inquire if the client finds this acceptable. They grumble, but start making calls.
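For anyone who hasn't set one up, the rule I was proposing boils down to a source-address allowlist: traffic from the client's outgoing proxy gets through, everything else is refused. Here's a minimal sketch of that logic in Python. It isn't the firewall configuration we actually had, the proxy address is a made-up documentation address, and `is_allowed` is a hypothetical helper used only for illustration.

```python
import ipaddress

# Hypothetical address of the client's outgoing proxy server.
# 203.0.113.0/24 is a range reserved for documentation examples.
ALLOWED_SOURCES = [ipaddress.ip_network("203.0.113.10/32")]


def is_allowed(source_ip: str) -> bool:
    """Return True if the source address falls inside an allowed network."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in network for network in ALLOWED_SOURCES)


print(is_allowed("203.0.113.10"))  # the client's proxy
print(is_allowed("198.51.100.7"))  # anyone else
```

The limitation is the same one I ran into: it only helps clients whose source addresses you know ahead of time.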
As they get on the phone, I confirm with TechNet that my certificate is expired, and that I need to get a new one from the issuing authority, Verisign or Thawte. I pull up the Verisign website and start looking for ways to renew my certificate. It appears that I need to log into my account and they'll get me a new one ASAP for a small charge. Money's no object, so I get to the login screen and pull Billy over.
"Can you log in here?" I ask while tapping the COO on the shoulder to secure his credit card.
Billy looks right at me, "I don't have the login."
"Well, who set this up?" Pretty much everyone has been working here longer than I, so I assume that someone knows which login was used.
"I don't know," is the reply I'm dreading, but somehow expecting.
"Well, find out," I tell him. This is the last thing I need. As he leaves to go ask around the development staff, or at least that's my assumption, the COO hangs up the phone.
"We can't not use HTTP," he tells me as if he understood what HTTPS was and his double negative made sense.
I tell him it's ok, and that once we log in, we'll quickly get a new certificate and be back online. He tells me to hurry and leaves with the client analyst, giving me a few minutes of peace. I jump over to OWA and enter my notes to date, including the error message into our public folder log and then print out the instructions for updating the certificate from TechNet.
About 10 minutes later as I'm reading through the instructions and making notes about the various items specific to our servers, Billy comes back in. From looking at his face, I know what he's going to say.
"No one knows anything about a login," he starts. "Apparently Dan set this up last year for EduFund. They were the first client that wanted SSL setup."
I'm leaning over, head between my knees, hands clasped behind my head.
That name just torques me off and I need a few deep breaths before I can respond. He was one of the original developers of many of our software products and a half-assed, no-holds-barred developer. He'd quit a couple months earlier, a few days ahead of me getting him fired. He'd caused who knows how many outages by making changes live on production and from what I'd seen, he barely understood the web. Even if he'd been available and willing to help us, it's unlikely he even knew the login or password.
This was a classic example of why there should be some controls and process around managing an IT department, if you can call 12 people an IT department. Since I'd been there, we'd built a spreadsheet with all admin logins and passwords that was stored on a floppy in the CFO's safe. It gave us protection for events just like this. I didn't have to ask if Billy had already checked it, as I'd spent months driving the need for this into his head. Just like logging all changes, protection from users hoarding critical information is essential for a company to protect itself.
I knew there was nothing to do, so I immediately surfed over to the Verisign contact information page and started dialing. I got someone on the phone and explained the situation.
The long and short of this situation was that Verisign needed to ensure we were the correct owners of the certificate and would only issue a new one on receipt of a letter from one of the corporate officers, on company stationery, sent to them via certified mail. I didn't bother to point out the security problems with that approach, but thanked them, contacted the COO, and had him draft a letter to go out via courier to Verisign.
We ended up getting a certificate 3 days later and installing it in IIS without issue. I'd purchased a 3 year certificate and made sure the CFO had reminders set in the financial system to revisit this in a few years. Three days of phone trades did not make anyone happy and I got called into the CEO's office after things were working to explain.
There isn't a good excuse for letting some critical service expire and I didn't make any. I took the blame myself since it was my department that let it happen. It was the perfect time, however, to drive home the importance of us having procedure and process and working in an orderly manner. Not moving slowly, but moving in a consistent fashion that allows us to undo or backtrack if needed. Simply storing names and passwords in a secure location, as I'd implemented, would have solved this. And while we wouldn't have looked good by having a client see this, we would have solved it in an hour or two instead of three days.
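We didn't have anything like this at the time, but a scheduled check of the certificate's expiry date is the kind of small automation that backs up a calendar reminder. Below is a rough sketch in Python using only the standard library. The function names (`days_until_expiry`, `fetch_not_after`, `needs_renewal`), the 60-day warning window, and the dates in the usage lines are all my own illustrative assumptions, not anything from our systems or from Verisign.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> int:
    """Whole days (rounded down) from `now` until the certificate expires.

    `not_after` is the string format returned by getpeercert(),
    e.g. "Mar 1 00:00:00 2006 GMT". A negative result means expired.
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).days


def needs_renewal(not_after: str, now: datetime, warn_days: int = 60) -> bool:
    """True once the certificate is inside the (assumed) warning window."""
    return days_until_expiry(not_after, now) <= warn_days


def fetch_not_after(host: str, port: int = 443) -> str:
    """Read the notAfter field from a live server's certificate.

    With the default context, an already expired certificate fails the
    handshake with an ssl.SSLError, which is itself a signal worth alerting on.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


# Illustrative dates only.
today = datetime(2006, 1, 30, tzinfo=timezone.utc)
print(days_until_expiry("Mar 1 00:00:00 2006 GMT", today))
print(needs_renewal("Mar 1 00:00:00 2006 GMT", today))
```

Run from a daily job against every public hostname, with `fetch_not_after` supplying the date, a check like this means the warning doesn't depend on anyone remembering who bought the certificate.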
The takeaway from this was a lot more power to standardize procedure and process for IT. The reining in of the cowboy developers, which had started with the changing and hoarding of the "sa" password, continued on with more lockdowns to ensure stability of our systems.
I left six or eight months after that and I'm not sure if they had another incident down the road. I talked to Billy almost a year later and he was still logging things every day as I'd taught him and he had fewer issues, but my successor wasn't as tight with process and had allowed developers back on production systems. Billy admitted to me that it was nice not to be bothered with implementing changes anymore, but he'd much rather do that than be called to fight the fires caused by untested and uncontrolled changes.
It's possible that APC got caught in a similar situation: building something that would run for a long time, and not considering or expecting that a certificate for something like the Java environment would ever expire.
But it seems to me their reaction has been a little too calm and quiet. This is something that could cause tremendous pain and problems for many environments, not just Windows. But because of patch day, Windows and Microsoft will get the majority of the press and blame. Another reason for people to not use "Windows".
If I were Steve Ballmer, this is one I'd head off early and put the word out quickly and widely. Be careful of patching your Windows servers because of 3rd party software: PowerChute 6.x.