Parameter Sniffing in Production: How a 200ms Query Became a 90-Second Outage

The query had been in production for just over two years. It powered the customer search screen used by our internal support team, the one that hits the orders history table to pull recent activity for whoever the agent happens to be looking at. On a normal day it returned in under 200 milliseconds, and nobody thought about it.

Then one Monday morning at 8:47 AM, the support queue stopped moving. Tickets stacked up. The application logs showed timeouts on a single stored procedure — the one nobody had touched in years. Average runtime had jumped from 200ms to 92 seconds. CPU on the SQL Server was pegged at 94%. Nothing about the data had changed. Nothing about the code had changed. The schema had not changed. The indexes were the same indexes that had served this query for two years.

What had changed was the query plan. And the reason the plan had changed was something I had read about a dozen times and never taken seriously enough: parameter sniffing.

This is the story of how we diagnosed it, what we tried first that didn't work, what eventually fixed it, and the monitoring we put in place so the next occurrence would be a 10-minute investigation instead of a 4-hour fire drill.

What parameter sniffing actually is

Briefly, for anyone who hasn't encountered it: when SQL Server compiles a parameterized query or stored procedure for the first time, it inspects the actual parameter values being passed and uses them, along with the column statistics, to estimate how many rows each operator will produce. From those estimates, it picks a plan — index seek vs scan, nested loops vs hash join, the works. That plan gets cached and reused for every subsequent execution of the same query, regardless of what parameter values come in next.

This is usually fine. For most queries, the plan that's good for one set of parameters is good for all of them. The problem arises when the data distribution is heavily skewed and different parameter values would benefit from radically different plans.

The classic example: a stored procedure that filters orders by @customer_id. Most customers in your system have 50 to 500 orders. But three or four enterprise customers have hundreds of thousands of orders each. If the plan gets compiled for a typical customer, the optimizer picks an index seek + nested loops, which is fast for small result sets but catastrophic when an enterprise customer queries 400,000 rows. If the plan gets compiled for an enterprise customer, the optimizer picks a hash join with a full scan, which handles the big customer well but adds noticeable overhead for the 99% of customers who only need 50 rows.

The plan in cache is whichever one happened to be compiled first.

The first wrong guesses

The initial diagnosis was almost guaranteed to be wrong, because every Monday-morning performance problem looks the same to a tired engineer. We worked through the usual suspects in roughly this order:

"The indexes must be fragmented." They weren't. Fragmentation on the relevant indexes was under 8%, well within normal range. We rebuilt them anyway, because rebuilding indexes is the database equivalent of restarting the router — it makes you feel like you're doing something. Performance did not improve.

"Statistics must be stale." They weren't really stale either, but we updated them with FULLSCAN just to be sure. This actually did "fix" the problem for about 20 minutes, which sent us down a completely wrong path. (More on why this happened in a moment.)

"Maybe it's blocking." It wasn't. sp_whoisactive showed the slow query running alone, eating CPU, not waiting on locks.

"Did anything deploy over the weekend?" Nothing. The last release was 11 days ago.

The breakthrough came when one observation finally clicked: when we manually ran the stored procedure from SSMS with a typical parameter value, it returned instantly. When the application ran the exact same procedure with the exact same parameter, it took 90 seconds. Same code. Same data. Different runtime by a factor of 400.

That gap between SSMS performance and application performance is the fingerprint of parameter sniffing. SSMS was compiling a fresh plan for our test parameter (because the connection settings differ enough — ARITHABORT in particular — that it generates a separate cached plan). The application was reusing the cached plan that had been compiled for someone else's parameter.

Finding the smoking gun

Once we suspected parameter sniffing, confirming it took one query. If you have Query Store enabled (and if you're on SQL Server 2016 or later, you should), this is where it earns its keep. The query below joins the Query Store catalog views to surface every cached plan for a given stored procedure, along with the execution count, average duration, and average logical reads for each plan. The goal is simple: if the same query text has more than one plan, and those plans show wildly different runtime characteristics, you are almost certainly looking at parameter sniffing.

SELECT 
    qsq.query_id,
    qsqt.query_sql_text,
    qsp.plan_id,
    qsp.is_forced_plan,
    qsrs.count_executions,
    qsrs.avg_duration / 1000.0 AS avg_duration_ms,
    qsrs.avg_logical_io_reads,
    qsrs.last_execution_time
FROM sys.query_store_query qsq
JOIN sys.query_store_query_text qsqt 
    ON qsq.query_text_id = qsqt.query_text_id
JOIN sys.query_store_plan qsp 
    ON qsq.query_id = qsp.query_id
JOIN sys.query_store_runtime_stats qsrs 
    ON qsp.plan_id = qsrs.plan_id
WHERE qsqt.query_sql_text LIKE '%usp_SearchCustomerOrders%'
  AND qsrs.last_execution_time >= DATEADD(DAY, -7, SYSUTCDATETIME())
ORDER BY qsrs.last_execution_time DESC;

The output told the whole story. Two plans for the same query. One had been running for 18 months with an average duration of 187 ms across millions of executions. The other had been compiled the previous evening at 11:14 PM and had an average duration of 84,000 ms across the few thousand executions since.

Why did a new plan get compiled? In our case, an automatic statistics update on the orders table had crossed the modification threshold late Sunday evening and triggered a recompile. When the recompile happened, the first parameter value to come through was an enterprise customer ID with 380,000 matching rows — an overnight batch process that touches our three largest accounts had just started. The optimizer compiled a plan optimized for that case — a hash join with a full index scan — and cached it. Every subsequent execution, including the ones for typical small customers, was now stuck with that plan.

This also explained why updating statistics had "fixed" the problem for 20 minutes: the stats update forced another recompile, and by chance, the next parameter to come through was a more typical value, producing a more typical plan. Until the next bad-luck recompile.

The other diagnostic that confirmed it: looking at the actual execution plan and comparing estimated vs actual row counts. The cached plan was estimating 400,000 rows from the customer filter, then actually returning 73 rows. That kind of estimate-vs-actual skew, by three or four orders of magnitude, is parameter sniffing's signature.

The fix, and the fixes we rejected

There are several ways to handle parameter sniffing in SQL Server, and the right choice depends on the workload. Here's how I think about each one:

OPTION (RECOMPILE) forces SQL Server to generate a fresh plan on every execution. This eliminates sniffing entirely — the plan is always optimal for the current parameters. The cost is the compilation itself, which adds CPU and a small amount of latency to every execution. For a query that runs thousands of times per minute, this CPU cost is significant. For a query that runs a few times per minute, it's negligible.

OPTION (OPTIMIZE FOR @param = value) pins the plan to a specific parameter value you choose. This works well if you know which parameter pattern is most common and want to optimize for that, accepting that outlier values will get a suboptimal plan. The risk: data distributions shift over time, and a value that was representative two years ago might not be representative today.

OPTION (OPTIMIZE FOR UNKNOWN) uses the column's density information rather than any specific parameter value. You get a "middle of the road" plan that isn't great for any specific value but isn't catastrophic for any either. In practice, "middle of the road" plans are often mediocre, but mediocre and consistent is sometimes better than great-then-terrible.

Query Store forced plans let you pin the known-good plan without changing the query code. This is incredibly useful for situations where you can't modify the SQL — vendor applications, ORM-generated queries, queries embedded in code you don't own. The downside is that a forced plan is forced forever (until you unforce it); if the data shifts enough that the forced plan becomes wrong, you've now got a different kind of stuck-plan problem.

Rewriting the procedure to handle the skew explicitly — for example, branching with IF to call different code paths for "small customer" vs "large customer" parameters. This is the most work but often the most robust solution for known-skew situations.

In our case, we forced the good plan via Query Store as an immediate fix to stop the bleeding. The support queue started moving within a minute of forcing. Over the next sprint we rewrote the procedure to branch on customer size, using a lookup against a small cached table of "large account" customer IDs to route requests down one of two query paths. That removed the sniffing risk entirely, and once we'd verified the new procedure performed well for both populations, we unforced the plan in Query Store.

Forcing a plan through Query Store is two statements: identify the good plan_id from the diagnostic query above, then call sys.sp_query_store_force_plan with the query_id and plan_id. The command below is what we ran at 9:02 AM to pin the 187 ms plan and stop the bleeding:

-- query_id and plan_id come from the diagnostic query in the previous section
-- query_id 4827 is our stored procedure; plan_id 19384 is the known-good 187 ms plan
EXEC sys.sp_query_store_force_plan @query_id = 4827, @plan_id = 19384;

-- Verify the plan is now forced
SELECT query_id, plan_id, is_forced_plan, force_failure_count, last_force_failure_reason_desc
FROM sys.query_store_plan
WHERE query_id = 4827;

The verification query confirms is_forced_plan = 1 and force_failure_count = 0. From this point on, the optimizer is required to use that specific plan for the query, regardless of which parameter value triggers the next compile. If forcing ever fails (for example, because an index referenced by the plan has been dropped), force_failure_count increments and last_force_failure_reason_desc tells you why, so you are not flying blind.

What we explicitly did not do: slap OPTION (RECOMPILE) on the procedure and walk away. That would have fixed the symptom and added persistent CPU overhead to a query that ran 120,000 times per hour. Easy fixes that defer cost are the kind of decision that comes back two years later as someone else's incident.

Preventing the next one

The fix was one thing. The bigger lesson was that we had no way of detecting plan regressions before users complained. The query had been silently running on a bad plan for almost 10 hours overnight before anyone noticed Monday morning. That was the real failure.

Query Store has plan regression detection built in if you write the query for it. The script below compares each query's recent performance (last hour) against its historical baseline (the prior seven days), then flags any query whose recent average duration is at least five times its historical average. I run this on a 15-minute schedule and alert on any non-empty result set:

WITH recent_stats AS (
    SELECT 
        qsp.query_id,
        qsp.plan_id,
        AVG(qsrs.avg_duration) AS recent_avg_duration,
        SUM(qsrs.count_executions) AS recent_executions
    FROM sys.query_store_plan qsp
    JOIN sys.query_store_runtime_stats qsrs 
        ON qsp.plan_id = qsrs.plan_id
    JOIN sys.query_store_runtime_stats_interval qsrsi
        ON qsrs.runtime_stats_interval_id = qsrsi.runtime_stats_interval_id
    WHERE qsrsi.start_time >= DATEADD(HOUR, -1, SYSUTCDATETIME())
    GROUP BY qsp.query_id, qsp.plan_id
),
baseline_stats AS (
    SELECT 
        qsp.query_id,
        AVG(qsrs.avg_duration) AS baseline_avg_duration
    FROM sys.query_store_plan qsp
    JOIN sys.query_store_runtime_stats qsrs 
        ON qsp.plan_id = qsrs.plan_id
    JOIN sys.query_store_runtime_stats_interval qsrsi
        ON qsrs.runtime_stats_interval_id = qsrsi.runtime_stats_interval_id
    WHERE qsrsi.start_time >= DATEADD(DAY, -7, SYSUTCDATETIME())
      AND qsrsi.start_time <  DATEADD(HOUR, -1, SYSUTCDATETIME())
    GROUP BY qsp.query_id
)
SELECT 
    r.query_id,
    qsqt.query_sql_text,
    r.recent_avg_duration / 1000.0 AS recent_ms,
    b.baseline_avg_duration / 1000.0 AS baseline_ms,
    CAST(r.recent_avg_duration * 1.0 / NULLIF(b.baseline_avg_duration, 0) 
         AS DECIMAL(10,2)) AS regression_factor,
    r.recent_executions
FROM recent_stats r
JOIN baseline_stats b ON r.query_id = b.query_id
JOIN sys.query_store_query qsq ON r.query_id = qsq.query_id
JOIN sys.query_store_query_text qsqt 
    ON qsq.query_text_id = qsqt.query_text_id
WHERE r.recent_avg_duration > b.baseline_avg_duration * 5
  AND r.recent_executions >= 100
ORDER BY regression_factor DESC;

Each row in the result identifies a query_id, its plan_id, the historical average duration in milliseconds, the recent average duration in milliseconds, and the ratio between the two. A clean system returns zero rows. The morning of our incident, this query would have returned a single row showing query_id 4827 with historical_avg_ms around 187, recent_avg_ms around 84000, and a regression_ratio of roughly 449 — impossible to miss.

The thresholds are tunable: I use 5x regression and at least 100 recent executions to filter out noise. This query, run on a schedule and alerted on non-empty results, would have caught our incident within 15 minutes of the bad plan caching, instead of 10 hours later when the support team showed up to work.

A note for anyone on SQL Server 2022 or later: Microsoft introduced Parameter Sensitive Plan Optimization (PSPO) in that version, which can automatically generate and cache multiple plans for the same query based on parameter value patterns. It's enabled by default at database compatibility level 160 and works for a meaningful subset of cases — queries with equality predicates on columns with significant data skew. It's not a complete solution to parameter sniffing, but for some workloads it removes the problem without code changes. Worth testing before adding OPTION hints everywhere.

What I would do differently

A few things stand out, looking back:

Query Store should be on for every production database, with sensible retention. It was on for the database in this story, which is the only reason the diagnosis took an hour instead of a day. Every production database I've stood up since has had it enabled in the initial setup script, with a 30-day retention and the default capture policy set to AUTO. The storage cost is minor; the diagnostic value is enormous.

Plan regression alerting belongs in the standard monitoring stack, not as a one-off script. The query above (or its many published variants) should run continuously, with alerts wired to wherever the on-call rotation watches. The cost of building this once is a few hours. The cost of not having it is one outage like the one I just described.

Stored procedures with known skew should be documented as such. Not every procedure needs special handling, but the ones that filter on parameters where data is heavily skewed should have a comment in the procedure saying so, along with which mitigation is in use (forced plan, recompile, optimize for, code branch). Future engineers — including future versions of yourself — will thank you. We had no documentation for this procedure's known skew, even though the same customer size pattern had bitten us in a different proc 18 months earlier. Two incidents, no institutional memory between them. That's a failure mode you fix with comments and a wiki page, not with cleverer SQL.

Test parameter coverage in performance testing. Most performance tests run the same parameter values every time, which means they exercise the same plan every time. Tests that randomize parameters across the realistic distribution — including the outliers — will catch parameter sniffing risk before production does. This is harder than it sounds because you need realistic data and realistic parameter distributions in the test environment, but the payoff is finding these problems on a Wednesday afternoon instead of a Monday morning.

Closing

Parameter sniffing is one of those problems that's easy to explain, easy to recognize once you've seen it, and devastating the first time it happens to you in production. The query plan that's been quietly serving traffic for two years is not actually stable — it's stable until the next recompile, which can be triggered by a server restart, a statistics update, memory pressure, or an index rebuild. When that recompile happens with an unrepresentative parameter, you get a new plan that may or may not work for the rest of your traffic.

The defenses are not exotic. Query Store, regression monitoring, deliberate plan management on known-skew queries, and a healthy skepticism of "but it's been working fine for years" cover most of the risk. Build them in before the incident, and the incident becomes a 15-minute investigation. Build them after, and you spend a long Monday morning rebuilding indexes that didn't need rebuilding.

The query that's fast today is fast on this plan. Make sure you know which plan, and make sure you'll notice when it changes.

Parameter Sniffing in Production: How a 200ms Query Became a 90-Second Outage

What parameter sniffing actually is

The first wrong guesses

Finding the smoking gun

The fix, and the fixes we rejected

Preventing the next one

What I would do differently

Closing

Rate

Share

Categories

Share

Rate

Parameter Sniffing in Production: How a 200ms Query Became a 90-Second Outage

What parameter sniffing actually is

The first wrong guesses

Finding the smoking gun

The fix, and the fixes we rejected

Preventing the next one

What I would do differently

Closing

Rate

Share

Categories

Share

Rate

Related content

Creating and Using Inline Table-Valued Functions

Creating and Using Inline Table-Valued Functions

Your Data After Life

Changing Computer Retail

Index Column Order – Be Happy!