RE: Which is more efficient? WHERE SUBSTRING = ...or... WHERE fldField LIKE 'x%'

SSC Guru

Points: 186127

October 19, 2015 at 9:13 am

mister.magoo (10/16/2015)
Isn't it also the case that using LIKE allows the QO to use statistics to estimate the selectivity of the query.
According to this White Paper: https://msdn.microsoft.com/en-us/library/dd535534(v=sql.100).aspx
String Summary Statistics
SQL Server 2008 includes patented technology for estimating the selectivity of LIKE conditions. It builds a statistical summary of substring frequency distribution for character columns (a string summary). This includes columns of type text, ntext, char, varchar, and nvarchar. Using the string summary, SQL Server can accurately estimate the selectivity of LIKE conditions where the pattern may have any number of wildcards in any combination. For example, SQL Server can estimate the selectivity of predicates of the following form.
Column LIKE 'string%'
Column LIKE '%string'
Column LIKE '%string%'
Column LIKE 'string'
Column LIKE 'str_ing'
Column LIKE 'str[abc]ing'
Column LIKE '%abc%xy'
If there is a user-specified escape character in a LIKE pattern (that is, the pattern is of the form LIKE pattern ESCAPE escape_character), SQL Server 2008 guesses selectivity.
This is an improvement over SQL Server 2000, which uses a guess for selectivity when any wildcard other than a trailing wildcard % is used in the LIKE pattern, and it has limited accuracy in its estimates in that case.
The String Index field in the first row set returned by DBCC SHOW_STATISTICS includes the value YES if the statistics object also includes a string summary. The contents of the string summary are not shown. The string summary includes additional information beyond what is shown in the histogram.
For strings longer than 80 characters, the first and last 40 characters are extracted from the string and concatenated prior to considering the string in the creation of the string summary. Hence, accurate frequency estimates for substrings that appear only in the ignored portion of a string are not available.
Which means that using LIKE not only makes the predicate SARGable, but also allows for the use of statistics when deciding on an execution plan.
This can be seen in the test script provided by Chris, although it is a simple one table select.
The attached images show row estimates for the LIKE (which actually becomes a greater than / less than query) , but not for the LEFT (which becomes a substring query).
I haven't come across any similar papers for other versions of SQL, but the images are taken from SQL 2016.
It specifically references SQL 2008 compared to SQL 2000, so I don't know if the same holds true for SQL 2005, and don't have it handy to test.

Thanks MM. On first reading this I figured it would be trivial to set up a test or two employing a two-table join to amplify the efficiency of LIKE over CHARINDEX. Although the outcome is generally as you would expect, the route taken is far from predictable – at least for me. One test uses a string comparison to a unique column as before, the other uses a string comparison to a column containing 26 distinct strings. There are three uses cases within each test: LIKE, CHARINDEX and “BETWEEN & LIKE” - since LIKE uses a range seek and a residual predicate (as you pointed out) I figured it would be fun to explicitly code for this, and in practice it’s generally faster than LIKE alone.

Have fun 🙂

----------------------------------------------------------------------------------------------

-- Data setup

----------------------------------------------------------------------------------------------

DROP TABLE #TestData

;WITH

E1 (x) AS (SELECT 0 FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) d (x)),

E2 (x) AS (SELECT 0 FROM E1 a, E1 b),

E4 (x) AS (SELECT 0 FROM E2 a, E2 b),

iTally AS (SELECT n = CHECKSUM(ROW_NUMBER() OVER(ORDER BY (SELECT NULL))-1) FROM E4 a, E2 b)

SELECT

i.n,

aDate = DATEADD(HOUR,i.n,'19100101'),

IntValue = CHECKSUM(NEWID()),

x.Stringy,

CharValue = CAST(REVERSE(NEWID()) AS VARCHAR(36))

INTO #TestData

FROM iTally i

CROSS APPLY (

SELECT TOP 1 Stringy

FROM (VALUES

('Onestringy'),('twostringy'),('threestringy'),('fourstringy'),('fivestringy') ,('sixstringy') ,('sevenstringy') ,('eightstringy'),('ninestringy'),('tenstringy'),

('elevenstringy'),('twelvestringy'),('thirteenstringy'),('fourteenstringy'),('fifteenstringy'),('sixteenstringy'),('seventeenstringy'),('eighteenstringy'),('nineteenstringy'),('twentystringy'),

('twentyonestringy'),('twentytwostringy'),('twentythreestringy'),('twentyfourstringy'),('twentyfivestringy'),('twentysixstringy')

) d (Stringy)

ORDER BY n % 2, NEWID()

) x

OPTION (MAXDOP 1)

CREATE UNIQUE CLUSTERED INDEX ucx_n ON #TestData (n)

CREATE UNIQUE INDEX ix_CharValue ON #TestData (CharValue)

CREATE UNIQUE INDEX ix_n_CharValue ON #TestData (n, CharValue)

CREATE UNIQUE INDEX ix_n ON #TestData (n)

CREATE INDEX ix_Stringy ON #TestData (Stringy)

CREATE UNIQUE INDEX ix_n_Stringy ON #TestData (n, Stringy)

----------------------------------------------------------------------------------------------

-- Test 1: Unique values in character column

----------------------------------------------------------------------------------------------

DECLARE @StringToFind VARCHAR(10), @Dummy VARCHAR(36)

SELECT TOP 1

@StringToFind = LEFT(CharValue,10)

FROM #TestData ORDER BY NEWID()

SELECT @StringToFind

PRINT ''

PRINT 'LIKE =================================================='

SET STATISTICS IO, TIME ON

SELECT @Dummy = t2.CharValue

FROM #TestData t1

INNER JOIN #TestData t2 ON t2.n = t1.n

WHERE t1.CharValue LIKE @StringToFind+'%'

OPTION(RECOMPILE)

SET STATISTICS IO, TIME OFF

PRINT ''

PRINT 'RANGE & LIKE =================================================='

SET STATISTICS IO, TIME ON

SELECT @Dummy = t2.CharValue

FROM #TestData t1

INNER JOIN #TestData t2 ON t2.n = t1.n

WHERE t1.CharValue >= @StringToFind AND t1.CharValue < @StringToFind+CHAR(255)

AND t1.CharValue LIKE @StringToFind+'%'

OPTION(RECOMPILE)

SET STATISTICS IO, TIME OFF

PRINT ''

PRINT 'CHARINDEX =================================================='

SET STATISTICS IO, TIME ON

SELECT @Dummy = t2.CharValue

FROM #TestData t1

INNER JOIN #TestData t2 ON t2.n = t1.n

WHERE CHARINDEX(@StringToFind,t1.CharValue) > 0

OPTION(RECOMPILE)

SET STATISTICS IO, TIME OFF

----------------------------------------------------------------------------------------------

-- Test 2: Non-unique values in character column

----------------------------------------------------------------------------------------------

DECLARE @StringToFind VARCHAR(10), @Dummy VARCHAR(36)

SELECT TOP 1

@StringToFind = LEFT(Stringy,10)

FROM #TestData ORDER BY NEWID()

SELECT @StringToFind, COUNT(*) FROM #TestData WHERE Stringy LIKE @StringToFind+'%'

PRINT ''

PRINT 'LIKE =================================================='

SET STATISTICS IO, TIME ON

SELECT @Dummy = t2.CharValue

FROM #TestData t1

INNER JOIN #TestData t2 ON t2.n = t1.n

WHERE t1.Stringy LIKE @StringToFind+'%'

OPTION(RECOMPILE)

SET STATISTICS IO, TIME OFF

PRINT ''

PRINT 'RANGE & LIKE =================================================='

SET STATISTICS IO, TIME ON

SELECT @Dummy = t2.CharValue

FROM #TestData t1

INNER JOIN #TestData t2 ON t2.n = t1.n

WHERE t1.Stringy >= @StringToFind AND t1.Stringy < @StringToFind+CHAR(255)

AND t1.Stringy LIKE @StringToFind+'%'

OPTION(RECOMPILE)

SET STATISTICS IO, TIME OFF

PRINT ''

PRINT 'CHARINDEX =================================================='

SET STATISTICS IO, TIME ON

SELECT @Dummy = t2.CharValue

FROM #TestData t1

INNER JOIN #TestData t2 ON t2.n = t1.n

WHERE CHARINDEX(@StringToFind,t1.CharValue) > 0

OPTION(RECOMPILE)

SET STATISTICS IO, TIME OFF

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden