• Again, I apologize for this taking so long. Real life had me hopping and it took a little thought on how to generate the test data and still have it finish sometime in the same week :-D. I hope you'll find it worth the wait.

    Just to keep everything all in one place, here's the code I used to set up the test. The CSVs in Table 2 all have 4 to 10 elements, and ALL of the Product Names in the CSVs are available in Table 1. The Product Names in Table 1 are unique; the Product Names in Table 2 are not.

    I used random-length chunks of random GUIDs, with the dashes replaced by spaces, to simulate multi-word Product Names. The code will certainly work with single-word Product Names, as well. I also used a random GUID as a unique value in the "Details" column of Table 2.

    As cited in the original post, there are 1,000 rows in Table 1 (#Table1) and 25,000 rows in Table 2 (#Table2).

    Here's the code to generate the test tables according to the requirements above. On a decent machine, it takes approximately 22 seconds to execute. It's a bit long-winded because of the sort by NEWID() used to randomize the Product Names in the CSVs.

    --===== Conditionally drop the test tables to make reruns in SSMS easier
         IF OBJECT_ID('tempdb..#Table1','U') IS NOT NULL DROP TABLE #Table1;
         IF OBJECT_ID('tempdb..#Table2','U') IS NOT NULL DROP TABLE #Table2;
    GO
    --===== Create Table1
       WITH
    cteGenProduct AS
    (
     SELECT DISTINCT TOP 1000
            ProductName = LTRIM(REPLACE(SUBSTRING(LTRIM(NEWID()),ABS(CHECKSUM(NEWID()))%10+1,ABS(CHECKSUM(NEWID()))%10+5),'-',' '))
       FROM master.sys.all_columns ac1
      CROSS JOIN master.sys.all_columns ac2
    )
     SELECT ID  = IDENTITY(BIGINT,1,1)
           ,ProductName
           ,Qty = ISNULL(ABS(CHECKSUM(NEWID()))%1000,0)
       INTO #Table1
       FROM cteGenProduct
    ;
     SELECT * FROM #Table1
    ;
    --===== Create Table 2. The ProductNames from Table 1 will randomly appear
         -- as CSV data with 4 to 10 "elements".
       WITH
    cteRandomData AS
    (
     SELECT TOP 25000
            ID          = ROW_NUMBER() OVER (ORDER BY (SELECT 1))
           ,Details     = CAST(NEWID() AS NVARCHAR(1000))
           ,Description = CAST(NULL AS NVARCHAR(1000))
       FROM master.sys.all_columns ac1
      CROSS JOIN master.sys.all_columns ac2
    )
     SELECT ID
           ,ProductName =
            (SELECT STUFF(CAST(
                ( --=== Concatenate N number of random t1.ProductName's
                 SELECT TOP (ABS(CHECKSUM(NEWID()))%7+4)
                        N','+t1.ProductName
                   FROM #Table1 t1
                  WHERE t2.ID > 0 --This nonsense is to randomize the CSV
                  ORDER BY NEWID()
                    FOR XML PATH('')
                )
             AS NVARCHAR(MAX)),1,1,N'')
            )
           ,Details
           ,Description
       INTO #Table2
       FROM cteRandomData t2
    ;
     SELECT * FROM #Table2
    ;
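
    As a quick sanity check (this bit isn't part of the original test harness, it's just to convince yourself the data meets the requirements), you can confirm that every CSV really does carry 4 to 10 elements by counting commas. That works here because the Product Names are built from GUID fragments and can never contain a comma.

    --===== Sanity check: element counts should come back as 4 and 10.
     SELECT MinElements = MIN(LEN(ProductName) - LEN(REPLACE(ProductName,',','')) + 1)
           ,MaxElements = MAX(LEN(ProductName) - LEN(REPLACE(ProductName,',','')) + 1)
       FROM #Table2
    ;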

    Next, we need the quintessential function to normalize Table 2. This is the "next" version of a very well tested and proven function. Please read the header of the code for more information. To be safe during this proof of concept, all of this code is being executed in TempDB.

    As a bit of a sidebar, you're using NVARCHAR(MAX) for your Product Number CSVs. The splitter below only handles up to 8K bytes, including the delimiters, and its implicit conversion to VARCHAR could make some unwanted changes to the data if you're really using NVARCHAR for its intended purpose. If your data exceeds 8K bytes or you truly need NVARCHAR, then we'll have to whittle on all the code a bit.
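
    If you want to find out whether your real data is safe for the 8K splitter, a rough probe like the following should tell you. The table and column names here are made up for the example; swap in your own. Any rows it returns are either too long for VARCHAR(8000) or won't survive the implicit conversion to VARCHAR.

    --===== Hypothetical fitness check. Replace dbo.YourTable and YourCSVColumn with your real names.
     SELECT ID, CharsUsed = LEN(YourCSVColumn)
       FROM dbo.YourTable
      WHERE LEN(YourCSVColumn) > 8000                             --Too long for VARCHAR(8000)
         OR YourCSVColumn <> CAST(YourCSVColumn AS VARCHAR(8000)) --Not faithfully convertible to VARCHAR
    ;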

    USE tempdb;
    GO
    CREATE FUNCTION [dbo].[DelimitedSplit8K]
    /**********************************************************************************************************************
     Purpose:
     Given a string containing multiple elements separated by a single character delimiter and that single character
     delimiter, this function will split the string and return a table of the single elements (Item) and the element
     position within the string (ItemNumber).

     Notes:
     1. Performance of this function approaches that of a CLR.
     2. Note that this code implicitly converts NVARCHAR to VARCHAR and that conversion may NOT be faithful.

     Revision History:
     Note that this code is a modification of a well proven function created as a community effort and initially
     documented at the following URL (http://www.sqlservercentral.com/articles/Tally+Table/72993/). This code is still
     undergoing tests. Although every care has certainly been taken to ensure its accuracy, you are reminded to do your
     own tests to ensure that this function is suitable for whatever application you might use it for.

     --Jeff Moden, 01 Sep 2013
    **********************************************************************************************************************/
    --===== Define I/O parameters
            (@pString VARCHAR(8000), @pDelimiter CHAR(1)) --DO NOT USE MAX DATA-TYPES HERE! IT WILL KILL PERFORMANCE!
    RETURNS TABLE WITH SCHEMABINDING AS
     RETURN
    --===== "Inline" CTE Driven "Tally Table" produces values from 1 up to 10,000...
         -- enough to cover VARCHAR(8000).
       WITH E1(N) AS (
                      SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
                      SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
                      SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
                     ),                                      --10E+1 or 10 rows
            E4(N) AS (SELECT 1 FROM E1 a, E1 b, E1 c, E1 d), --10E+4 or 10,000 rows max
     cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front
                         -- for both a performance gain and prevention of accidental "overruns".
                     SELECT TOP (ISNULL(DATALENGTH(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
                    ),
    cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter).
                     SELECT 1 UNION ALL
                     SELECT t.N+1 FROM cteTally t WHERE SUBSTRING(@pString,t.N,1) = @pDelimiter COLLATE Latin1_General_BIN
                    ),
    cteLen(N1,L1) AS (--==== Return the start and length of each element (for use in SUBSTRING).
                          -- The ISNULL/NULLIF combo handles the length of the final (or only) element.
                      SELECT s.N1,
                             ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString COLLATE Latin1_General_BIN,s.N1),0)-s.N1,8000)
                        FROM cteStart s
                     )
    --===== Do the actual split.
     SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),
            Item       = SUBSTRING(@pString, l.N1, l.L1)
       FROM cteLen l
    ;
    GO
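
    A quick smoke test never hurts. This should return exactly three rows:

    --===== Smoke test: expect Item = 'a','b','c' with ItemNumber = 1,2,3.
     SELECT ItemNumber, Item
       FROM dbo.DelimitedSplit8K('a,b,c',',')
    ;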

    I did try the LIKE and CHARINDEX methods. They each took over 3 minutes to run, and they both have a hidden problem: if the value in Table 1 is "Shot", both methods will also match "Shot Glass" and "Shotgun", which doesn't seem correct to me.
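
    To see that hidden problem for yourself, here's a minimal repro (the values are made up just for this demo). Wrapping both sides in delimiters cures the false positives, but it does nothing for the 3-minute scan time, which is why the splitter approach below still wins.

    --===== Demo of the substring "false positive" problem.
    DECLARE @CSV VARCHAR(8000) = 'Shot Glass,Shotgun,Mug';
     SELECT NaiveMatch     = CASE WHEN CHARINDEX('Shot',@CSV) > 0   THEN 1 ELSE 0 END --Returns 1 (wrong)
           ,DelimitedMatch = CASE WHEN ','+@CSV+',' LIKE '%,Shot,%' THEN 1 ELSE 0 END --Returns 0 (right)
    ;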

    Since the overwhelming recommendation is to "normalize" the data in Table 2, and that apparently can't be done at this point in the project, the following code creates a normalized result set for Table 2 as a CTE and then runs against that. It's proof positive that data should never be stored in a CSV column: even with the overhead of having to normalize Table 2 as a first step, this method blows the others away, coming in at only 2.1 to 2.5 seconds on my laptop.

    The code below produces the desired result with some additions. The result is stored in a temporary table (#MyHead) as requested. I took the liberty of combining the output of Table1 and Table2 so that you know what uses what. I also included an "ItemNumber" column so you know which element in the CSV matched the Product Name from Table1, and included the IDs of both tables just in case your "additional processing" might need them.

    Since the code is all in one query, it could also be converted to either a view or a high-performance inline Table Valued Function (iTVF); there's a quick sketch of the iTVF idea after the query.
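
    If you want to reproduce the timings on your own box, wrap the query below in the usual measurement switches (nothing exotic, just the standard SSMS approach):

    SET STATISTICS TIME ON;
    --...run the solution query here...
    SET STATISTICS TIME OFF;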

       WITH
    cteNormalize AS
    (--==== This normalizes Table 2.
     SELECT t2.ID, split.ItemNumber, split.Item
       FROM #Table2 t2
      CROSS APPLY dbo.DelimitedSplit8K(t2.ProductName,',') split
    )
     SELECT Table1_ID          = t1.ID
           ,Table1_ProductName = t1.ProductName
           ,norm.ItemNumber
           ,Table2_ID          = t2.ID
           ,ProductNameCSV     = t2.ProductName
           ,t2.Details
           ,t2.Description
       INTO #MyHead
       FROM #Table1 t1
       JOIN cteNormalize norm ON norm.Item = t1.ProductName
       JOIN #Table2 t2        ON t2.ID     = norm.ID
    ;
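
    And, just to show what I meant about the iTVF, here's a minimal sketch. The function name and the @pProductName parameter are made up for the example, and note that a schema-bound iTVF can't reference temp tables, so this assumes Table1 and Table2 have been promoted to permanent tables by then.

    --===== Hypothetical iTVF sketch. Names are illustrative only.
    CREATE FUNCTION dbo.GetProductMatches
            (@pProductName VARCHAR(8000))
    RETURNS TABLE WITH SCHEMABINDING AS
     RETURN
       WITH cteNormalize AS
    (
     SELECT t2.ID, split.ItemNumber, split.Item
       FROM dbo.Table2 t2
      CROSS APPLY dbo.DelimitedSplit8K(t2.ProductName,',') split
    )
     SELECT Table1_ID = t1.ID
           ,norm.ItemNumber
           ,Table2_ID = t2.ID
           ,t2.Details
       FROM dbo.Table1 t1
       JOIN cteNormalize norm ON norm.Item = t1.ProductName
       JOIN dbo.Table2 t2     ON t2.ID     = norm.ID
      WHERE t1.ProductName = @pProductName
    ;
    GO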

    Please send beer... I already have enough pretzels. 😛

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
        Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.

