Over my career as a performance-focused database engineer, I've optimized countless queries. Most problems fall into familiar categories: indexing issues, non-set-based logic, Cartesian explosions, or improperly normalized schemas.
But some of my biggest wins had nothing to do with technical query optimization.
The solutions couldn't be found in the execution plan, runtime statistics, or DMVs. Finding them required taking a step back and asking some basic questions that nobody else was asking:
- What is the intent of this query?
- When should it run?
- Does the data distribution really look like what the business owner expected?
I've found that other database engineers, DBAs, and even AI tools struggle with these problems – not because they can't understand them, but because they're not looking for them. Resolving these issues probably isn't part of the typical Database Engineer/DBA job description, but you'll likely get at least part of the blame when these queries perform poorly, and you can claim all the credit when you resolve them. Here are four examples where the real problem was hiding in plain sight.
Example 1: Bad Data Distribution in Lower Environments
Our application had dynamic, multi-filter queries that ran very frequently. Most of the queries filtered for items that were Active (meaning not delivered and not cancelled). These queries started to perform so poorly in lower environments that test and dev were almost unusable. Testers and developers assumed that some recently released logic was making them slower.
I investigated and noticed that the data distribution of the status column in lower environments was different from that in production. In production, real users almost always either set the status column to delivered or cancelled. The only active items were those in the active work queue.
In contrast, testers in lower environments created processes that focused on item creation and manipulation but never completed the workflow – items were not delivered or cancelled. The testers had created hundreds of thousands of items that were still active, and this was bogging down the queries.
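A quick distribution check is usually enough to surface this kind of skew. The sketch below is hypothetical – the table and column names (`dbo.Items`, `Status`) are assumptions, not from the original system – but running something like it in both production and a lower environment makes the mismatch obvious:

```sql
-- Hypothetical sketch: profile the status distribution in one environment,
-- then run the same query in the other environment and compare.
SELECT
    Status,
    COUNT(*) AS ItemCount,
    CAST(100.0 * COUNT(*) / SUM(COUNT(*)) OVER () AS DECIMAL(5, 2)) AS PctOfTotal
FROM dbo.Items
GROUP BY Status
ORDER BY ItemCount DESC;
```

In production this would have shown Active as a small slice; in test, it was the overwhelming majority.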
The solution was to create a script to auto-cancel all items in lower environments after a certain amount of time. Eventually, the data distribution and performance metrics in lower environments became more production-like and query performance in test came back into alignment with production.
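The cleanup script could be as simple as the following sketch, run on a schedule (for example, a SQL Agent job). The names and the 30-day threshold are assumptions for illustration; a real version would also batch the update to avoid one huge transaction:

```sql
-- Hypothetical lower-environment cleanup: auto-cancel stale active items.
-- Threshold (30 days) and column names are illustrative assumptions.
UPDATE dbo.Items
SET    Status      = 'Cancelled',
       ModifiedUtc = SYSUTCDATETIME()
WHERE  Status = 'Active'
  AND  CreatedUtc < DATEADD(DAY, -30, SYSUTCDATETIME());
```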
Example 2: Missing Client Feature Flag Creates Inverted Data Distribution
We received a new requirement to add a "find invoice-able items" feature to our TMS. The specs called for many optional filters plus constant filters that identified "invoice-able" items: delivered, accounting follow-ups completed, and not already invoiced.
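The constant filters are the part worth studying, because they determine the baseline result-set size. A hedged sketch of what they might look like (column names are assumptions, not the actual schema):

```sql
-- Hypothetical sketch of the constant "invoice-able" filters.
SELECT i.ItemId, i.CustomerId, i.DeliveredUtc
FROM dbo.Items AS i
WHERE i.Status = 'Delivered'        -- constant: delivered
  AND i.FollowUpsComplete = 1       -- constant: accounting follow-ups done
  AND i.InvoiceId IS NULL           -- constant: not already invoiced
-- optional user-selected filters appended dynamically here
```

In production, `InvoiceId IS NULL` eliminates almost everything; in an environment where nobody invoices, it eliminates almost nothing.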
I developed and tested a dynamic stored procedure on sandbox environments that contained production data. The stored procedure was fast and correct. However, as soon as the change was deployed, we got a ticket that it was slow in the development environment.
The ticket went to an on-call teammate who struggled to figure out why. After spending several hours, he reached out to me. The problem was immediately clear: the dev environment wasn't being used for invoicing. Unlike production, where most items had been invoiced, most items in dev were still eligible for invoicing.
The immediate fix was simple: don't run the query if you're not using invoicing. As the feature headed for production, QA teams created several more tickets for the same issue. The real solution was a feature flag indicating whether the client was using our application for invoicing to prevent these calls entirely.
Example 3: Filtering in Application Code Instead of SQL
I was asked to advise on a problem that some of my colleagues were stuck on. The query requested the top 100 items based on a few simple filters, with a user-controlled sort order. The query always matched far more rows than expected, and once the TOP N and the sort were added, it was unacceptably slow.
My first question for the developer was "can you explain the intent of this query?" It was clear that the request being sent to SQL only contained part of the query intent. The developer believed that SQL couldn't handle complex filter combinations efficiently and decided to apply many of the filters in application code.
The result was a slow query with high I/O and high network latency. We convinced the developer to submit all the filters with the query request and helped him write a dynamic stored procedure that returned only the records that would be displayed to the user. Performance went from totally unacceptable to blink-of-an-eye responsive.
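A common shape for that kind of dynamic procedure is sketched below. This is a hedged illustration, not the actual procedure we wrote – the table, columns, and parameters (`dbo.Items`, `@Status`, `@CustomerId`, `@SortColumn`) are assumptions. The key ideas are real, though: append only the filters the caller supplied, keep values parameterized via `sp_executesql`, and whitelist the sort column rather than concatenating user input:

```sql
CREATE OR ALTER PROCEDURE dbo.GetTopItems  -- hypothetical name
    @Status     VARCHAR(20) = NULL,
    @CustomerId INT         = NULL,
    @SortColumn VARCHAR(20) = 'CreatedUtc'
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE @sql NVARCHAR(MAX) = N'
        SELECT TOP (100) ItemId, Status, CustomerId, CreatedUtc
        FROM dbo.Items
        WHERE 1 = 1';

    -- Append only the filters the caller actually supplied.
    IF @Status IS NOT NULL
        SET @sql += N' AND Status = @Status';
    IF @CustomerId IS NOT NULL
        SET @sql += N' AND CustomerId = @CustomerId';

    -- Whitelist the sort column; never concatenate raw user input.
    SET @sql += N' ORDER BY ' +
        CASE @SortColumn
            WHEN 'CreatedUtc' THEN N'CreatedUtc DESC'
            WHEN 'Status'     THEN N'Status'
            ELSE N'ItemId'
        END;

    EXEC sys.sp_executesql
        @sql,
        N'@Status VARCHAR(20), @CustomerId INT',
        @Status = @Status,
        @CustomerId = @CustomerId;
END;
```

Because every filter reaches SQL Server, the TOP (100) can stop early instead of sorting a huge intermediate result and shipping it to the application.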
Example 4: Expensive Query Called but Data Was Never Used
We were alerted to high resource usage in a database and identified two new queries causing the spike. One fetched the item details and the other the summary statistics. A known issue was affecting query performance, but it wouldn't be fixed anytime soon. There wasn't much we could do to make the queries faster.
I often go into the dev UI as a user to see if there are any clues that can help with optimization. In this case, I didn't see anywhere in the UI where the summary query data was shown to the user.
We brought this up with the dev team and they brought it up with Product. Eventually, everybody agreed that the summary query results weren't being used and they could simply stop running the query. We still had one bad query but that's a lot better than having two.
How to Spot These Issues
When typical metrics don't reveal the answer, here's my diagnostic process:
- Understand the business intent – Formulate a one-sentence description of what the query does from the user's perspective.
- Analyze constant filters – Identify any filters that always apply and roughly estimate the expected result set size.
- Compare expectations to reality – Use Query Store to examine statistics like reads per execution and average rows returned. Do these match your expectations?
- Talk to stakeholders – Reach out to developers, product designers, and testers. Ask them to explain the feature and share your understanding. Look for inconsistencies and follow up on them.
- Look in the UI – If you have access to the UI, pretend you are a user and try to run the query. Look for anything unexpected here and follow up.
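For the expectations-versus-reality step, a starting point is a Query Store rollup like the one below (SQL Server 2016+). The catalog views and columns are standard; the TOP (20) cutoff and the ordering by logical reads are my own arbitrary choices for a "worst offenders" pass:

```sql
-- Sketch: per-query averages from Query Store, worst logical-I/O first.
SELECT TOP (20)
    qt.query_sql_text,
    SUM(rs.count_executions)     AS total_executions,
    AVG(rs.avg_logical_io_reads) AS avg_logical_reads,
    AVG(rs.avg_rowcount)         AS avg_rows_returned
FROM sys.query_store_query_text    AS qt
JOIN sys.query_store_query         AS q  ON q.query_text_id = qt.query_text_id
JOIN sys.query_store_plan          AS p  ON p.query_id      = q.query_id
JOIN sys.query_store_runtime_stats AS rs ON rs.plan_id      = p.plan_id
GROUP BY qt.query_sql_text
ORDER BY AVG(rs.avg_logical_io_reads) DESC;
```

If a query you expected to return a handful of rows is averaging hundreds of thousands of reads, the mismatch itself is the clue.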
This type of analysis is time-consuming, so it should only be applied to queries on your "worst offenders" list. But when you have a query with a critical performance issue and the typical metrics don't give you an answer, this is the next place to look.
Sometimes the best optimization is a conversation.
What non-technical performance problems have you uncovered? I'd love to hear your stories in the comments.