Downtime

  • Comments posted to this topic are about the item Downtime

  • A broken URL has no answers today 🙁

    However here's one I got in email:

    For the record, the main reason to remove SPs that are surplus to requirements is to ensure that nobody uses them. Imagine an old and unmaintained stored procedure is used by someone and it has a bug or causes a performance issue... think of the amount of time and effort that would be required in troubleshooting and correcting the mess this might cause!

    Far better to force a user to specify their requirements or think through their own stored procedure, I think.

    Code maintenance and software development are hard, and extraneous, unmanaged code is awful. I should know: I work for a vendor (EMC Infra), and code that our clients customize but that isn't being used, or is badly spec'd, is often the code that causes us the most problems.

  • Whew! I thought I was going crazy or the network was! Thank goodness it was just you and these crazy links!

    Downtime: No. Didn't have any last year. Had a power outage but the back-ups and generators all worked. A few people lost/locked their terminal sessions because they didn't plug into the proper side of the UPS strip, but no data loss or downtime.

    Pruning: Not a regular exercise. Once a year, or so. And I let things linger for 2 1/2 years, not just 6 months. I find when I get assigned a 'new' project, I'm usually rehashing through an older project that I did, or someone here did, about 2 years ago.

  • The company I currently work for had a heat-related massive downtime issue a month or so before I started here. From what I understand, the AC failed, and alerts that should have gone out, didn't.

    I caused some downtime once by making a mistake with the default login used by a linked server, but that's not the kind of downtime you're asking about here.

    I think most of the downtime I've seen has been because of poorly set up and maintained hardware. But that was years ago with a sysop who really shouldn't have been in that business. Downtime was hours per month at that place.

    The best "the whole network is down" I've been through was when a salesperson ended up with Slammer on a copy of SQL Express he had on his laptop, and plugged that into the LAN and brought the whole place to its knees. He wasn't even sure why he had SQL on the laptop, much less how he got Slammer in there. Took all morning for the admin to figure out what was going on.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • We have very little downtime.

    Pruning is something that occurs once a year. On our System i, I can use a command to tell the last time an object was used. I've written queries to tell me whether the last usage was more than two years ago. We can then analyze the results to determine whether anybody really needs the object.

    I'm still pretty new to SQL Server and haven't looked into how this would be done. It would be handy to do it though.
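
    On the SQL Server side, a minimal sketch of one way this might be approached, assuming a version where sys.dm_exec_procedure_stats is available (the DMV only reflects plans cached since the last instance restart, so it understates real usage and is a hint, not proof of disuse):

    ```sql
    -- Sketch: list procedures in the current database with no recorded execution
    -- in roughly the last two years. sys.dm_exec_procedure_stats only covers plans
    -- cached since the last restart, so treat missing/old dates with caution.
    DECLARE @cutoff datetime2 = DATEADD(YEAR, -2, SYSDATETIME());

    SELECT
        s.name AS schema_name,
        p.name AS procedure_name,
        ps.last_execution_time,
        ps.execution_count
    FROM sys.procedures AS p
    JOIN sys.schemas AS s
        ON s.schema_id = p.schema_id
    LEFT JOIN sys.dm_exec_procedure_stats AS ps
        ON ps.database_id = DB_ID()
       AND ps.object_id = p.object_id
    WHERE ps.object_id IS NULL                 -- no execution recorded since last restart
       OR ps.last_execution_time < @cutoff     -- or not executed recently
    ORDER BY ps.last_execution_time;
    ```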

  • Purging of stored procedures is usually done because there are so many it is difficult to see the wood for the trees! If they could be divided into sub-directories in the same way as Integration Services packages can in msdb, then life would be a lot easier.
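
    There is no folder concept for procedures, but schemas can serve as a rough substitute for grouping them. A minimal sketch, with made-up schema and procedure names:

    ```sql
    -- Sketch: use schemas as pseudo-folders for stored procedures.
    -- "Reporting" and dbo.usp_MonthlySales are hypothetical names.
    CREATE SCHEMA Reporting AUTHORIZATION dbo;
    GO
    -- Move an existing procedure into the new schema.
    ALTER SCHEMA Reporting TRANSFER dbo.usp_MonthlySales;
    GO
    ```

    Callers then have to reference Reporting.usp_MonthlySales, so this is easier to introduce for new code than to retrofit onto an existing application.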

  • knock on wood

    the unplanned downtimes we had were:

    - during fire detection system tests, the backup power unit was shut down because of a little short circuit caused by one of the electricians.

    All non-critical servers were still working :w00t:

    - SAN servers lost their SAN connection during online activation of SAN-switch duplexing, which shouldn't have been a problem according to the SAN manufacturer.

    - One of the non-sysadmins thought he should reboot a server, wasn't able to do it using RDP, so he entered the server room and pulled the power cables ... out of the wrong server.

    - extreme temperature exposure caused some servers to stop working (we had some very hot days last year). That server cabinet has since been fitted with an air-conditioning unit.

    - a data upgrade for a new application rollout destroyed data because of some last-minute changes by the dev team.

    The restore operation took longer than the planned downtime window.

    - on one occasion a non-DBA added a non-sysadmin SQL account to the sysadmin role of SQL Server, and applications started getting "object does not exist" messages.

    That was finally the issue that got us approval to restrict sysadmin membership.

    - we also had a downtime caused by yours truly: even though the logon trigger had been tested for some time, not every real-world situation had occurred, and an instance became unresponsive.

    The DAC saved my butt (a sketch of that escape route is below).

    Yes, we did have some hard disk failures, but because they were RAID volumes, all the sysadmins had to do was replace the disk and monitor the rebuild process. No downtime, only a slowdown.

    Redundancy and protection should be considered like insurance; they cannot protect you against human inventiveness 😉.
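
    For anyone who hits the same logon-trigger lockout: a rough sketch of the recovery route through the Dedicated Administrator Connection. Per the documentation, a sysadmin can still get in through the DAC when a logon trigger is blocking normal connections; the instance and trigger names below are placeholders.

    ```sql
    -- Sketch: recovering from a faulty logon trigger through the DAC.
    -- Connect with the ADMIN: prefix, e.g. from sqlcmd:
    --     sqlcmd -S ADMIN:MyServer -d master -E
    -- ("MyServer" is a placeholder instance name.)

    -- See which server-level triggers exist and whether they are disabled:
    SELECT name, is_disabled
    FROM sys.server_triggers;

    -- Disable the offending trigger so normal logins work again
    -- ("trg_restrict_logons" is a placeholder name):
    DISABLE TRIGGER trg_restrict_logons ON ALL SERVER;
    ```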

    Johan

    Learn to play, play to learn !

    Don't drive faster than your guardian angel can fly ...
    but keeping both feet on the ground won't get you anywhere :w00t:

    - How to post Performance Problems
    - How to post data/code to get the best help

    - How to prevent a sore throat after hours of presenting ppt

    press F1 for solution, press shift+F1 for urgent solution 😀

    Need a bit of Powershell? How about this

    Who am I ? Sometimes this is me but most of the time this is me

  • If you're interested, there's a discussion on this in our LinkedIn group as well

  • I know it makes the DBA look bad, but yeah, we have a lot of unplanned downtime on production systems. Bear in mind that we have over 1600 databases on ~140 different servers and 2 DBAs, so statistically we do OK with uptime.

    Number one cause: too many people with admin rights and not enough communication. It's a cultural thing here that I inherited and can't change. As a result, we do a lot of firefighting.

    Number two cause: budget constraints. Customers want high availability for everything but can't pay for it. A good example here is SAN failures; I won't name the specific brand, but hey, they were cheap for a reason.

    Other unplanned downtimes included:

    - Network switch failures

    - Misconfigured antivirus software killed clusters

    - Autopatch turned on accidentally

    - AD management failures (helpdesk had power to reset service account passwords and used said power)

    - Rarely, the occasional CPU, Mainboard, memory, or other physical hardware failures.

  • When I came in Monday morning my main production SQL server was down due to a hardware error. After talking to tech support it was decided the box required a new motherboard. We're a small non-profit so we don't have fail-over, replication or any such thing. Fortunately I have good backups! I had just started building a replacement server so I used the new server, installed SQL, restored databases from backup and was up in two hours. Everybody was happy after that.

    One thing I've learned as a DBA: back up databases, and practice restoring databases. It may be boring, but it can save your job and your company's data!
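
    A minimal sketch of that routine, with made-up database, file, and path names; restoring the backup as a copy, ideally on a test instance, is the only real proof the backup is usable:

    ```sql
    -- Sketch: full backup plus a practice restore (all names/paths are made up).
    BACKUP DATABASE SalesDB
    TO DISK = N'D:\Backups\SalesDB_full.bak'
    WITH CHECKSUM, INIT;

    -- Quick sanity check of the backup file:
    RESTORE VERIFYONLY
    FROM DISK = N'D:\Backups\SalesDB_full.bak'
    WITH CHECKSUM;

    -- The real test: restore it under another name, ideally on a test server.
    -- The logical file names (SalesDB, SalesDB_log) are assumptions; check them
    -- with RESTORE FILELISTONLY first.
    RESTORE DATABASE SalesDB_RestoreTest
    FROM DISK = N'D:\Backups\SalesDB_full.bak'
    WITH MOVE N'SalesDB' TO N'D:\RestoreTest\SalesDB_test.mdf',
         MOVE N'SalesDB_log' TO N'D:\RestoreTest\SalesDB_test_log.ldf',
         RECOVERY;
    ```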

  • Blade enclosure caught fire. A few power supplies melted at the ends. The fire alarms went off in the data center and the whole building was evacuated. That type of failure was unheard of by both us and the manufacturer.

    Luckily for VMware and our second blade enclosure in a nearby rack, most systems failed over properly and immediately. The rest were back up and we were fully functional in under 90 minutes. We had to 'migrate' a few of the least singed blades from the smoking enclosure to the working one.

    Finding a replacement blade enclosure (still under warranty from manufacturer) took about a week. Apparently they don't keep any extras lying around and had to divert someone else's shipment.

  • I had downtime just this week due to a hardware issue. The issue was something I had never encountered before, and I have been in IT since 1993! Unlike many of you, I am not only the DBA but also everything else; we run very lean (me and a part-time assistant for our entire office).

    One of my main SQL servers had blue-screened with an NMI parity error when I came in Tuesday. Dell solved the problem before my restore to my backup SQL server finished. It turns out the processor had come loose, and reseating it simply fixed the problem. This is the bottom server in a Dell rack and it has been in place for 5 years. The very helpful Dell tech told me that occasionally the vibrations in the room can cause this. That was a new one for me, but I guess he was right because the server is still up and running. The total amount of downtime was approximately 45 minutes. In my office, a few hours of downtime per year is not earth-shattering, though certainly not desirable.

  • We've had a few hours of downtime over the past year due to power failure. Our battery backups only last a few hours. Also, one of our databases became corrupt, but we back up the data faithfully, so that only cost another hour or two.

    Also, though I'd hesitate to call it downtime since the servers were up and running, our T1 provider (Tier 1) has had their cable cut two different times. That means that even though everything was functioning as far as the databases were concerned, our clients were not able to access them. Furthermore, we were not able to send or receive many of the files necessary for DB updates. This cost us about 20 hours!

  • Some bad memory issues and motherboard updates took one server down a couple of times this year, for as long as a reboot took, plus some scheduled downtime to replace/upgrade the memory.

    Most of the downtime I have seen in the last three places I've worked has been MySQL related... or rather related to the administration of it... but I won't bother explaining those here, other than to say that being a *nix admin does not automatically equate to having DBA skills... much against common practice.

    Some third-party vendor outages delayed some sync updates from time to time - nothing serious though.

  • Just got done messing with an older server. The fan on the northbridge stopped running and caused it to overheat. I put on a new fan, but it was too late; the chip was cooked. Luckily we had a second motherboard handy, so I was able to swap it and get the server back up.

    Most of the downtime I see is similar: older hardware that should never have been turned into a server in the first place. A previous vendor was hot on the idea that there was no sense in paying for full-blown servers.
