Hey Flo!

    It's true that the following features of the script won't help overall performance:

    1. Using table variables (no statistics are available and parallelism cannot be used; see the sketch after this list)

    2. Using large object types (MAX)

    3. Using Unicode

    4. @crlf is defined as CHAR rather than NCHAR (implicit conversion)
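
    To make points 1 and 4 concrete, here's a minimal sketch of the fixes; the names @tally, #tally and @crlf are just illustrative, assuming the script builds a numbers table up front:

    -- Table variable: no statistics, and queries that populate it
    -- cannot use a parallel plan
    DECLARE @tally TABLE (N INT NOT NULL PRIMARY KEY)

    -- Temporary table: statistics are created and parallelism is available
    CREATE TABLE #tally (N INT NOT NULL PRIMARY KEY)

    -- Point 4: declaring the delimiter as NCHAR avoids the implicit
    -- conversion when it is compared against Unicode data
    DECLARE @crlf NCHAR(2)
    SELECT @crlf = NCHAR(13) + NCHAR(10)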

    Nevertheless, all that is rather beside the point here.

    The tally or numbers table solution is, to an extent, a brute-force approach. Where a table holds many long strings and the searched-for string occurs infrequently, a solution based on a loop/CHARINDEX approach (sketched below) should out-perform the tally approach.

    For example, searching a 1000-row table of strings of 500K characters each:

    The tally method will evaluate SUBSTRING 1000 × 500K = 500 million times, testing every character position for the search string.

    A method based on CHARINDEX will call CHARINDEX Σ(Ni + 1) times, where Ni is the number of occurrences of the search string in row i: one call per occurrence, plus the final call that finds nothing. If each row held, say, 100 occurrences, that would be 1000 × 101 = 101,000 calls against 500 million.

    Of course, CHARINDEX itself scans character by character from its starting position, but even without doing the math rigorously, it is apparent that there will always be a crossover point beyond which the tally method is slower.
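
    To make the loop/CHARINDEX idea concrete, here's a minimal sketch against a single string; the variable names are mine, and a set-based version would apply the same logic row by row:

    DECLARE @string NVARCHAR(MAX)
    DECLARE @crlf NCHAR(2)
    DECLARE @start INT
    DECLARE @pos INT

    SELECT @crlf = NCHAR(13) + NCHAR(10)
    SELECT @string = N'first line' + @crlf + N'second line' + @crlf + N'third line'

    SELECT @start = 1
    SELECT @pos = CHARINDEX(@crlf, @string, @start)

    -- One CHARINDEX call per occurrence, plus a final call that finds nothing
    WHILE @pos > 0
    BEGIN
        PRINT SUBSTRING(@string, @start, @pos - @start)
        SELECT @start = @pos + 2
        SELECT @pos = CHARINDEX(@crlf, @string, @start)
    END

    -- Anything left after the last delimiter
    IF @start <= LEN(@string)
        PRINT SUBSTRING(@string, @start, LEN(@string) - @start + 1)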

    In your particular example, it helps if the source table includes an extra column holding the length of each string, but the CHARINDEX method is still faster.

    This code shows how including an extra length column helps the optimizer:

    DECLARE @crlf NCHAR(2)

    SELECT @crlf = NCHAR(13) + NCHAR(10)

    -- Find each CRLF with the tally table, then use CHARINDEX to locate
    -- the following CRLF and extract the text between the two
    SELECT SUBSTRING(S.definition, N + 2, CHARINDEX(@crlf, S.definition, N + 1) - N - 2)
    FROM #source AS S
    JOIN #tally AS T
        ON (T.N BETWEEN 1 AND S.Length - 2)
    WHERE SUBSTRING(S.definition, N, 2) = @crlf
    ORDER BY N
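
    For anyone who wants to run that, here's a minimal sketch of the supporting objects; the names (#tally, #source, definition, Length) match the query above, but the tally size and sample data are purely illustrative:

    -- Numbers table: 1 up to the longest string we expect to search
    CREATE TABLE #tally (N INT NOT NULL PRIMARY KEY)

    INSERT INTO #tally (N)
    SELECT TOP (500000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM sys.all_columns AS AC1
    CROSS JOIN sys.all_columns AS AC2

    -- Source table with the precomputed length column discussed above
    CREATE TABLE #source
    (
        definition NVARCHAR(MAX) NOT NULL,
        Length INT NOT NULL
    )

    -- One small sample row; a real test would load many long strings
    INSERT INTO #source (definition, Length)
    SELECT D.definition, LEN(D.definition)
    FROM (
        SELECT N'line one' + NCHAR(13) + NCHAR(10)
             + N'line two' + NCHAR(13) + NCHAR(10) AS definition
    ) AS D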