RE: String tokenizing / splitting

SSC-Addicted

Points: 411

February 25, 2011 at 7:32 pm

I ran the XML, CTE, and the SUBSTRING functions through and checked the estimated to actual execution plans. The estimated shows the XML_Reader is, by far, the most costly (so costly it can't compare). The Substring shows to be twice as expensive as the CTE, but CTE only works (naturally) up to 100 tokens. The actual came out as equal, so we'd have to design something to check the execution at scale... which eliminates the CTE version unless the test is designed using parallel effort rather sheer token volume.

You could design a SUBSTRING/CTE hybrid where CTE is executed with a try/catch, then cut to the SUBSTRING if it fails by volume... but it might be simply smarter to use the version that fits your use case the best.

The SUBSTRING version has a bug where if the delimiter is a space or any phrase with a trailing space, it trims off that space (or ignores it) and comes up one character short. For VARCHAR, you can fix this by using the DATALENGTH() function instead of LEN() and for NVARCHAR, use DATALENGTH(), then divide by two.

I also found a bug in the XML version that used a clustered index on the returned table. That will change the token order and require unique tokens. You can remove the index to avoid natural reordering by the indexer and errors raised by duplicate tokens. Avoiding a tablescan here may or may not make sense for your use case.

I also forgot to mention the TALLY version. It uses CTE, too, and it was slightly more expensive than the SUBSTRING version so I didn't bother comparisons past that point because it had the worst of both CTE and SUBSTRING versions.

One more bug: the XML version does not handle special characters such as the ampersand(&amp). The use of a CDATA declaration may help but I did not try it.