The Troubleshooting List

  • I'll add an extra point 🙂

    Get to know the general IT support structure that you don't control

    Over the years I've been databasing, I've got to know how our IT support works here (and, more importantly, the common problems that seem to recur): knowing the really good guys (and building the working relationships that let you bypass red tape), and knowing the `holes` in support.

    For me it helps prioritise my first troubleshooting steps, especially for the trickier things. For example, we had an Access GUI that connected over ODBC to a SQL Server. All was fine for months - then some new machines were installed. Some processes in the front end that were normally instant suddenly took minutes. I spent all sorts of time profiling this and checking that. It turned out that missing DNS entries for the SQL Server were causing the problem! Now when I see slow connections to a newly set up SQL Server, it's the first thing I check. It has happened again!

    martin 🙂

  • Worst performance problem:

    Developers not understanding transaction isolation levels.

    My specific worst performance problem: developers not understanding what was going on behind the scenes of a 5GL that was used to write several applications. An upgrade of one of these applications caused the first read (or query, but they were always reads before updates later) to start an implicit transaction in serializable isolation mode. This was a little detrimental to performance. A user would run a report, or just a view to look at something, and lock everything in its path, then would go on to do other things and continue locking until they eventually did an update, which released the locks -- until it all started over again. The system was brought to its knees -- actually, it was laid out flat. The software change was tested in depth, with one user doing all the testing sequentially: no load testing at all, and no verifying what the software was doing under the covers with transactions.
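    A quick way to spot that pattern today (a minimal sketch, assuming SQL Server 2005 or later -- the incident above predates these views) is to list sessions holding an open transaction along with their isolation level:

    ```sql
    -- Sessions with an open user transaction, plus their isolation level.
    -- transaction_isolation_level 4 = SERIALIZABLE.
    SELECT s.session_id,
           s.login_name,
           s.host_name,
           s.program_name,
           CASE s.transaction_isolation_level
                WHEN 0 THEN 'Unspecified'
                WHEN 1 THEN 'Read Uncommitted'
                WHEN 2 THEN 'Read Committed'
                WHEN 3 THEN 'Repeatable Read'
                WHEN 4 THEN 'Serializable'
                WHEN 5 THEN 'Snapshot'
           END AS isolation_level,
           t.transaction_id
    FROM sys.dm_exec_sessions AS s
    JOIN sys.dm_tran_session_transactions AS t
         ON t.session_id = s.session_id
    WHERE t.is_user_transaction = 1
    ORDER BY s.session_id;
    ```

    A long-lived row here under Serializable, from a session that is otherwise idle, is exactly the "read starts a transaction and holds locks until the next update" behaviour described above.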

  • Blocking is consistently the biggest problem for us, too. I've written a small application that, every two minutes, checks for blocking on our main SQL Server. When blocking is detected, it waits 15 seconds and then checks again. If the same SPID(s) are still blocking, the app emails me the details about every process that is blocking or being blocked, including duration, the command being issued, hostname, program, login, etc. (the core check looks roughly like the sketch below).
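    Not the actual tool, but a minimal sketch of that core check (assuming SQL Server 2005 or later):

    ```sql
    -- Requests that are currently blocked, who is blocking them, and enough
    -- context (host, program, login, statement, wait time) to follow up.
    SELECT r.session_id          AS blocked_spid,
           r.blocking_session_id AS blocking_spid,
           r.wait_time           AS wait_time_ms,
           r.wait_type,
           s.host_name,
           s.program_name,
           s.login_name,
           st.text               AS blocked_statement
    FROM sys.dm_exec_requests AS r
    JOIN sys.dm_exec_sessions AS s
         ON s.session_id = r.session_id
    CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS st
    WHERE r.blocking_session_id <> 0
    ORDER BY r.wait_time DESC;
    ```

    Running it twice, about 15 seconds apart, and alerting only when the same blocking SPID shows up both times gives the "wait and re-check" behaviour described above.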

    This has allowed us to be extremely proactive in troubleshooting blocking issues. It's funny when we do determine who is running the process that's blocking, call them, and suggest something is wrong. "How'd you know?!?!"

  • I did the same. I wrote a SQL stored procedure and added it as a SQL Agent job to log into a table all locking/blocking that held on longer than some threshold (logging the blocked resource, type of lock, commands, duration, application, user, etc.). If it continued for longer than normal, the script sent an alert by email. This helped a lot in troubleshooting applications, especially new ones: you could kill the process and then analyse the log for the cause of the problem.
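    A hypothetical shape for that log (the table name and the 15-second threshold are illustrative, not the poster's actual objects; the email step, e.g. via sp_send_dbmail, is left out):

    ```sql
    -- Log table for blocking snapshots, populated by a SQL Agent job.
    CREATE TABLE dbo.BlockingLog
    (
        logged_at         datetime      NOT NULL DEFAULT GETDATE(),
        blocked_spid      int           NOT NULL,
        blocking_spid     int           NOT NULL,
        wait_type         nvarchar(60)  NULL,
        wait_time_ms      int           NULL,
        host_name         nvarchar(128) NULL,
        program_name      nvarchar(128) NULL,
        login_name        nvarchar(128) NULL,
        blocked_statement nvarchar(max) NULL
    );

    -- Capture only blocking that has already lasted longer than 15 seconds.
    INSERT INTO dbo.BlockingLog
           (blocked_spid, blocking_spid, wait_type, wait_time_ms,
            host_name, program_name, login_name, blocked_statement)
    SELECT r.session_id, r.blocking_session_id, r.wait_type, r.wait_time,
           s.host_name, s.program_name, s.login_name, st.text
    FROM sys.dm_exec_requests AS r
    JOIN sys.dm_exec_sessions AS s
         ON s.session_id = r.session_id
    CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS st
    WHERE r.blocking_session_id <> 0
      AND r.wait_time > 15000;
    ```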

  • I agree with checking for blocking first, but I'm surprised that two of the most fundamental things aren't even listed: check the SQL log for unreported errors and check the server performance counters for any current bottlenecks.
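    Both of those checks can be scripted, too. A minimal sketch (the error-log procedure is undocumented but widely used; the DMV needs SQL Server 2005 or later, and the counters listed are just examples):

    ```sql
    -- 1. Recent entries in the current SQL Server error log
    --    (first parameter: 0 = current log; second: 1 = SQL Server log, 2 = Agent log).
    EXEC master.dbo.xp_readerrorlog 0, 1;

    -- 2. A few headline counters exposed through a DMV, e.g. page life
    --    expectancy as a rough memory-pressure indicator.
    SELECT [object_name], counter_name, cntr_value
    FROM sys.dm_os_performance_counters
    WHERE counter_name IN (N'Page life expectancy',
                           N'Batch Requests/sec',
                           N'Processes blocked');
    ```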


    My blog: SQL Soldier
    SQL Server Best Practices
    Twitter: @SQLSoldier
    My book: Pro SQL Server 2008 Mirroring
    Microsoft Certified Master: SQL Server, Data Platform MVP
    Database Engineer at BlueMountain Capital Management

  • I am a very green production DBA - I've always been a SQL developer and have just been thrown into the thick of server maintenance. To make matters hotter, we just launched a new app to beta this week, and I have already had to troubleshoot an incident or two. I have to ask: who is PSS (the SQL Server consultant firm mentioned in the article)? I really like the idea of having a pre-approved call in my back pocket, but I haven't known where to start looking for the right outfit.

    Thanks in advance, NJ

  • Very Nice!

    I will forward this to our lead DBA, who was (translation: upset) on Friday when the vendor blamed DB performance and it turned out to be a front-end issue, so he sent out an email with the subject "It's always the DB".

    To add to your list:

    - Check if it is really the DB, and not the application or web pages (see above).

    - Check if another job or process is running on the server, especially if performance is bad only at certain times of the day. We once found a Windows backup job running at 10 AM, of all times. The counter I was monitoring was Disk Queue Length, which was consistently elevated starting at 10 AM. Rescheduling the backup job helped.

    - Check the Processor Utilization.

    - Tell the developers to check on indexes. I won't give examples of full table scans happening multiple times for one query - it would take too much time (see the first sketch after this list).

    - Tell the developers to check how connections are opened and closed during one person's session. We had a case where, by a typo or something, the code had a Connection.Open statement inside the loop that saved records, so a new connection was created for each processed record. That was a long time ago. What was not so long ago was SharePoint 2 opening about 400 connections with each user session; that issue was resolved by our SharePoint admin after calling Microsoft (see the second sketch after this list).
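    First sketch, for the index point: a minimal illustration (assuming SQL Server 2005 or later) of using the index-usage DMV to see which tables are being scanned rather than seeked. A NULL index name means the heap itself, and the counters reset when the instance restarts.

    ```sql
    -- Seeks vs. scans per index in the current database.
    SELECT OBJECT_NAME(ius.object_id) AS table_name,
           i.name                     AS index_name,   -- NULL = heap
           ius.user_seeks,
           ius.user_scans,
           ius.user_lookups
    FROM sys.dm_db_index_usage_stats AS ius
    JOIN sys.indexes AS i
         ON i.object_id = ius.object_id
        AND i.index_id  = ius.index_id
    WHERE ius.database_id = DB_ID()
    ORDER BY ius.user_scans DESC;
    ```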
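    Second sketch, for the connection point: counting current sessions per program, host, and login quickly shows an application that is piling up connections (hundreds from a single program is usually the giveaway). Again assuming SQL Server 2005 or later; this is an illustration, not the script used at the time.

    ```sql
    -- Current user sessions grouped by program, host, and login.
    SELECT s.program_name,
           s.host_name,
           s.login_name,
           COUNT(*) AS session_count
    FROM sys.dm_exec_sessions AS s
    WHERE s.is_user_process = 1
    GROUP BY s.program_name, s.host_name, s.login_name
    ORDER BY session_count DESC;
    ```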

    I have much more, but the page may time out before I post it.

    Yelena

    Regards, Yelena Varsha

  • PSS = Product Support Services. Basically, it's Microsoft's own SQL Server support organization (not a third-party consulting firm).


    My blog: SQL Soldier
    SQL Server Best Practices
    Twitter: @SQLSoldier
    My book: Pro SQL Server 2008 Mirroring
    Microsoft Certified Master: SQL Server, Data Platform MVP
    Database Engineer at BlueMountain Capital Management

  • Number 10 reminded me of a recently learned lesson. I came in one morning and one of our major applications wasn't working... users could not log in. Of course, when an application isn't working, everyone likes to blame the database. A quick sp_who2 and a check of CPU utilization and a few other indicators all said the database was purring like a kitten. And with pride I responded that this time, it wasn't the database. The application developers rolled up their sleeves and began poring through code. The network engineers were checking switches and the deployment team was checking recent changes. And I kept watching the database. Turning on Profiler, I saw RPC calls and SQL statements executing at a normal pace.

    But then I looked closer... they were repeating the same queries over and over again. Hmmmm, wonder what that means. Finally, I tried executing one of the RPC calls directly from Query Analyzer. What the @#$%@! An error message: "Time out occurred while waiting for buffer latch type 2". After 3 hours of declaring the database innocent, I was now ready to call it guilty. Fifteen minutes later the outage was over.
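    That incident was on a Query Analyzer-era server, but for what it's worth, on SQL Server 2005 and later a stuck buffer latch tends to show up in the waiting-tasks DMV. A minimal sketch:

    ```sql
    -- Tasks currently waiting on buffer latches; resource_description points
    -- at the database/file/page involved.
    SELECT wt.session_id,
           wt.wait_type,
           wt.wait_duration_ms,
           wt.resource_description,
           wt.blocking_session_id
    FROM sys.dm_os_waiting_tasks AS wt
    WHERE wt.wait_type LIKE N'PAGELATCH%'
       OR wt.wait_type LIKE N'PAGEIOLATCH%'
    ORDER BY wt.wait_duration_ms DESC;
    ```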

    The lesson: Never stop looking until the problem is found. If all indicators say the database is fine....check more indicators.

    Fortunately, rather than being chastised for my hubris, I was praised for my perseverance.

    Gordon

    Gordon Pollokoff

    "Wile E. is my reality, Bugs Bunny is my goal" - Chuck Jones

  • At my current position, one of the first things I do when things start acting up is to check available hard drive space. We have a lot of processes that perform very large aggregations, and (with fairly frequent code-pushes) it's not unknown for a t-log file, or sometimes tempdb, to suddenly bloat up to fill all available space. These days I always check the alpha-characters first; I wasted 10 minutes or so of Production downtime because I only saw "23", when what it actually said was "23MB". :pinch:
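    For reference, two quick ways to check free space from inside SQL Server (xp_fixeddrives is undocumented but widely used and reports MB free per drive; sys.dm_os_volume_stats needs SQL Server 2008 R2 SP1 or later):

    ```sql
    -- Free space per drive letter, in MB.
    EXEC master.dbo.xp_fixeddrives;

    -- Free space per volume that holds a database file, in GB.
    SELECT DISTINCT vs.volume_mount_point,
           vs.total_bytes / 1073741824.0     AS total_gb,
           vs.available_bytes / 1073741824.0 AS free_gb
    FROM sys.master_files AS mf
    CROSS APPLY sys.dm_os_volume_stats(mf.database_id, mf.file_id) AS vs;
    ```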

    Mileage will of course vary depending on your code, environment, and so forth.

    Philip

  • 23 MB left instead of 23 GB? That will make a small difference :w00t:.

  • So what was the root of that problem??

  • Oh, the usual--code gone wild. There have been a number of similar situations, and they generally come back to the fact that SQL Server isn't all that good at running queries involving dozens of tables, views, functions, and (thanks, 2k5) synonyms... and that's what we've got.

    Ya got some tables; then ya build some views on those tables; then a table-valued function on the views; then a view that hits the function; then a query that joins a whole bunch of them. Note that all this was built up over time, and later developers aren't necessarily aware of what the "common tools" (views and functions) that the developers of yore wrote are actually based on. Iterate over that loop several times, and--once you decipher everything--you end up with some pretty monstrous queries.
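    One way to start untangling that kind of layering (on SQL Server 2008 or later) is to ask the engine what a given view or function actually references, one level at a time. The object name below is just a placeholder:

    ```sql
    -- Everything the named view references directly (tables, views, functions,
    -- synonyms), one level down; repeat on each referenced object to walk deeper.
    SELECT referenced_schema_name,
           referenced_entity_name,
           referenced_minor_name   -- column name, when applicable
    FROM sys.dm_sql_referenced_entities(N'dbo.SomeDeeplyNestedView', N'OBJECT');
    ```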

    So, naturally, we buy more hardware and the problem goes away and everyone's happy. And two-three months later...

    What's a production DBA to do? At least, as per the original thread, it's fairly easy to pinpoint why the system is acting up...

    Philip

  • I'd like to see the execution plan of that thing :w00t:.
