Should You Use SQL Server 2022’s STRING_SPLIT?

T-SQL
2 Comments

SQL Server 2022 improved the STRING_SPLIT function so that it can now return lists that are guaranteed to be in order. However, that’s the only thing they improved – there’s still a critical performance problem with it.

Let’s take the Stack Overflow database, Users table, put in an index on Location, and then test a couple of queries that use STRING_SPLIT to parse a parameter that’s an incoming list of locations:

The two queries produce slightly different actual execution plans, but the way STRING_SPLIT behaves is the same in both, so I’m just going to take the first query to use as an illustration:

That red-highlighted part has two problems:

  1. SQL Server has no idea how many rows are going to come out of the string, so it hard-codes a guesstimate of 50 items
  2. SQL Server has no idea what the contents of those rows will be, either – it doesn’t know if the locations are India, China, or Hafnarfjörður

As a result, everything else in the query plan is doomed. The estimates are all garbage. SQL Server will choose the wrong indexes, process the wrong tables first, make the wrong parallelism decisions, be completely wrong about memory grants, you name it.

Like I wrote in this week’s post about DATETRUNC, that doesn’t make STRING_SPLIT a bad tool. It’s a perfectly fine tool if you need to parse a string into a list of values – but don’t use it in a WHERE clause, so to speak. Don’t rely on it to perform well as part of a larger query that involves joins to other tables.

Working around STRING_SPLIT’s problems

One potential fix is to dump the contents of the string into a temp table first:

And the actual execution plan is way better than the prior examples. You can see the full plan by clicking that link, but I’m just going to focus on the relevant STRING_SPLIT section and the index seek:

This plan is better because:

  • SQL Server knows how many rows are in #LocationList
  • Even better, it knows what those rows are, and that influences its estimate on the number of users who live in those locations, which means
  • SQL Server makes better parallelism and memory grant decisions through the rest of the plan

Woohoo! Just remember that temp tables are like OPTION (RANDOM RECOMPILE), like I teach you in this Fundamentals of TempDB lecture.

Previous Post
Should You Use SQL Server 2022’s DATETRUNC?
Next Post
Should You Use SQL Server 2022’s GREATEST and LEAST?

2 Comments. Leave new

  • Heh… my boss says the same thing about DelimitedSplit8K. I keep asking him why he thinks that joining to a iTVF is a good idea.

    It’s not just the STRING_SPLIT() function that has this issue. Joining to an iTVF is a “code smell” in my book. It reminds me of people that do WHERE clauses on aggregated columns on views and then they wonder why the code “runs slow”. 😀

    Reply
    • And a “p.s.”. This is a great article in more ways than one. One of the most important things is this is proof, yet again, that “Set Based” does NOT mean “all in one query”. 😀 “Divide’n’Conquer” is a VERY effective performance tool and not just when it comes to functions and the like.

      Reply

Leave a Reply

Your email address will not be published. Required fields are marked *

Fill out this field
Fill out this field
Please enter a valid email address.