Need Help on Fastest Search Logic

  • born2achieve (10/1/2013)


    Hi Livingston,

    Thanks for your time on this. I have tried everything you showed a couple of days ago; the problem is that it is time consuming, and the question is how to speed up the search process.

    hello...the SQL code was provided as a setup for others who might like to get engaged in this thread...it is something that you have not provided so far.

    Can you please confirm that, as in your original post, you want to match "corn" with "cornil"?

    good luck

    ________________________________________________________________
    you can lead a user to data....but you cannot make them think
    and remember....every day is a school day

  • Hi Livingston,

    My bad, I didn't look closely at your setup data. The data is perfect, except it shouldn't be "cornil"; it should be "corn" in the last insert row.

    Sorry about that... I was going to create a large amount of test data to test the performance of various methods, but I got caught up in another discussion.

    --Jeff Moden



    I haven't abandoned this and am still working on a test data generator that everyone could use on this problem. Is the Product_Name in Table1 UNIQUE in Table1? Or can it be duplicated with multiple different Qty numbers?

    Also, this is part of the reason why we ask for "the DDL and some readily consumable test data". It answers so many questions. Please see the first "Helpful Link" in my signature line below.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Hi Jeff,

    Thanks a lot for your time on this; product_name in table1 is unique.

    This makes use of Moden's DelimitedSplit8K, which you can download in the Scripts section.

    SET ANSI_NULLS ON
    GO
    SET QUOTED_IDENTIFIER ON
    GO
    CREATE FUNCTION tvf_ChildRowNames
            (@rowName VARCHAR(250))
    RETURNS TABLE
    AS
     RETURN
    (
     SELECT name
       FROM table2
      WHERE EXISTS (SELECT 1
                      FROM [DelimitedSplit8K](name, ',')
                     WHERE Item = @rowName)
    )
    GO

    You use it like this:

    SELECT name FROM tvf_ChildRowNames('canola')

    Generating random CSVs with a random number of elements from the list in Table1 was a bit of a chore. The items wouldn't come out random, so I had to play with it a bit to trick it. It takes a while to build the test tables (02:17 mm:ss) on my old machine, but it would appear that I've been able to generate test data not unlike your real data for everyone to play with. It's a bit late here, so I'll play with it tomorrow night. In the meantime, here's the code to generate the data for folks to play with and test for performance.

    --===== Conditionally drop the test tables to make reruns in SSMS easier
    IF OBJECT_ID('tempdb..#Table1','U') IS NOT NULL DROP TABLE #Table1;
    IF OBJECT_ID('tempdb..#Table2','U') IS NOT NULL DROP TABLE #Table2;
    GO
    --===== Create Table1
    WITH cteGenProduct AS
    (
     SELECT DISTINCT TOP 1000
            ProductName = REPLACE(SUBSTRING(LTRIM(NEWID()),ABS(CHECKSUM(NEWID()))%10+1,ABS(CHECKSUM(NEWID()))%10+5),'-','')
       FROM master.sys.all_columns ac1
      CROSS JOIN master.sys.all_columns ac2
    )
    SELECT ID  = IDENTITY(BIGINT,1,1)
          ,ProductName
          ,Qty = ISNULL(ABS(CHECKSUM(NEWID()))%1000,0)
      INTO #Table1
      FROM cteGenProduct
    ;
    SELECT * FROM #Table1
    ;
    --===== Create Table 2.
         -- The ProductNames from Table 1 will randomly appear as CSV data
         -- with 4 to 10 "elements"
    WITH cteRandomData AS
    (
     SELECT TOP 25000
            ID          = ROW_NUMBER() OVER (ORDER BY (SELECT 1))
           ,Details     = CAST(NEWID() AS NVARCHAR(1000))
           ,Description = CAST(NULL AS NVARCHAR(1000))
       FROM master.sys.all_columns ac1
      CROSS JOIN master.sys.all_columns ac2
    )
    SELECT ID
          ,ProductName =
           (SELECT STUFF(CAST(
            ( --=== Concatenate N number of random t1.ProductName's
             SELECT TOP (ABS(CHECKSUM(NEWID()))%7+4)
                    N','+t1.ProductName
               FROM #Table1 t1
              WHERE t2.ID > 0 --This nonsense is to randomize the CSV
              ORDER BY NEWID()
                FOR XML PATH('')
            ) AS NVARCHAR(MAX)),1,1,N'')
           )
          ,Details
          ,Description
      INTO #Table2
      FROM cteRandomData t2
    ;
    SELECT * FROM #Table2
    ;

    --Jeff Moden



    I have the desired process down to 11 seconds using my two favorite tools of "Divide'n'Conquer" and the DelimitedSplit8K function on my old 32-bit, single 1.8 GHz CPU, 11-year-old machine. I need to clean the code up, test it on a slightly more modern machine, and then I'll post my solution. Going to bed... finally.

    --Jeff Moden



  • Again, I apologize for this taking so long. Real life had me hopping and it took a little thought on how to generate the test data and still have it finish sometime in the same week :-D. I hope you'll find it worth the wait.

    Just to keep everything all in one place, here's the code I used to set up the test. The CSVs in Table 2 all have 4 to 10 elements, and ALL Product Names in the CSVs are available in Table 1. The Product Names in Table 1 are unique. The Product Names in Table 2 are not.

    I used random lengths of random GUIDs and replaced the dashes with spaces to simulate multi-word Product Names. The code will certainly work with single word Product Names, as well. I also used a random GUID as a unique value in the "Details" column of Table 2.

    As cited in the original post, there are 1,000 rows in Table 1 (#Table1) and 25,000 rows in Table 2 (#Table2).

    Here's the code to generate the test tables according to the requirements above. On a decent machine, this code takes approximately 22 seconds to execute. It's a bit long-winded because of the sort by NEWID() to randomize the Product Names in the CSVs.

    --===== Conditionally drop the test tables to make reruns in SSMS easier
    IF OBJECT_ID('tempdb..#Table1','U') IS NOT NULL DROP TABLE #Table1;
    IF OBJECT_ID('tempdb..#Table2','U') IS NOT NULL DROP TABLE #Table2;
    GO
    --===== Create Table1
    WITH cteGenProduct AS
    (
     SELECT DISTINCT TOP 1000
            ProductName = LTRIM(REPLACE(SUBSTRING(LTRIM(NEWID()),ABS(CHECKSUM(NEWID()))%10+1,ABS(CHECKSUM(NEWID()))%10+5),'-',' '))
       FROM master.sys.all_columns ac1
      CROSS JOIN master.sys.all_columns ac2
    )
    SELECT ID  = IDENTITY(BIGINT,1,1)
          ,ProductName
          ,Qty = ISNULL(ABS(CHECKSUM(NEWID()))%1000,0)
      INTO #Table1
      FROM cteGenProduct
    ;
    SELECT * FROM #Table1
    ;
    --===== Create Table 2.
         -- The ProductNames from Table 1 will randomly appear as CSV data
         -- with 4 to 10 "elements"
    WITH cteRandomData AS
    (
     SELECT TOP 25000
            ID          = ROW_NUMBER() OVER (ORDER BY (SELECT 1))
           ,Details     = CAST(NEWID() AS NVARCHAR(1000))
           ,Description = CAST(NULL AS NVARCHAR(1000))
       FROM master.sys.all_columns ac1
      CROSS JOIN master.sys.all_columns ac2
    )
    SELECT ID
          ,ProductName =
           (SELECT STUFF(CAST(
            ( --=== Concatenate N number of random t1.ProductName's
             SELECT TOP (ABS(CHECKSUM(NEWID()))%7+4)
                    N','+t1.ProductName
               FROM #Table1 t1
              WHERE t2.ID > 0 --This nonsense is to randomize the CSV
              ORDER BY NEWID()
                FOR XML PATH('')
            ) AS NVARCHAR(MAX)),1,1,N'')
           )
          ,Details
          ,Description
      INTO #Table2
      FROM cteRandomData t2
    ;
    SELECT * FROM #Table2
    ;

    Next, we need the quintessential function to normalize Table 2. This is the "next" version of a very well tested and proven function. Please read the header of the code for more information. To be safe during this proof of concept, all of this code is being executed in TempDB.

    As a bit of a sidebar, you're using NVARCHAR(MAX) for your Product Name CSVs. The function below will only handle up to 8K bytes, including the delimiters, and you could get some unwanted changes in the data if you're really using NVARCHAR for its intended purpose. If your data exceeds 8K bytes or you truly need NVARCHAR, then we'll have to whittle on all the code a bit.
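    Before committing to the 8K splitter, a quick sanity check against the real data is worthwhile. Here's a minimal sketch (written against the test table above; swap in the real table name) that reports the longest CSV and counts any rows that an 8000-byte splitter would truncate:

    --===== Sanity check: will any CSV rows overflow an 8000-byte splitter?
         -- DATALENGTH returns bytes, which is what matters for NVARCHAR data.
    SELECT MaxBytes   = MAX(DATALENGTH(ProductName))
          ,RowsOver8K = SUM(CASE WHEN DATALENGTH(ProductName) > 8000 THEN 1 ELSE 0 END)
      FROM #Table2
    ;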

    USE tempdb;
    GO
    CREATE FUNCTION [dbo].[DelimitedSplit8K]
    /**********************************************************************************************************************
     Purpose:
     Given a string containing multiple elements separated by a single character delimiter and that single character
     delimiter, this function will split the string and return a table of the single elements (Item) and the element
     position within the string (ItemNumber).

     Notes:
     1. Performance of this function approaches that of a CLR.
     2. Note that this code implicitly converts NVARCHAR to VARCHAR and that conversion may NOT be faithful.

     Revision History:
     Note that this code is a modification of a well proven function created as a community effort and initially
     documented at the following URL (http://www.sqlservercentral.com/articles/Tally+Table/72993/). This code is still
     undergoing tests. Although every care has certainly been taken to ensure its accuracy, you are reminded to do your
     own tests to ensure that this function is suitable for whatever application you might use it for.
     --Jeff Moden, 01 Sep 2013
    **********************************************************************************************************************/
    --===== Define I/O parameters
            (@pString VARCHAR(8000), @pDelimiter CHAR(1)) --DO NOT USE MAX DATA-TYPES HERE! IT WILL KILL PERFORMANCE!
    RETURNS TABLE WITH SCHEMABINDING AS
     RETURN
    --===== "Inline" CTE Driven "Tally Table" produces values from 1 up to 10,000...
         -- enough to cover VARCHAR(8000).
      WITH E1(N) AS (
                     SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
                     SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
                     SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
                    ), --10E+1 or 10 rows
           E4(N) AS (SELECT 1 FROM E1 a, E1 b, E1 c, E1 d), --10E+4 or 10,000 rows max
     cteTally(N) AS ( --=== This provides the "base" CTE and limits the number of rows right up front
                      -- for both a performance gain and prevention of accidental "overruns"
                     SELECT TOP (ISNULL(DATALENGTH(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
                    ),
    cteStart(N1) AS ( --=== This returns N+1 (starting position of each "element" just once for each delimiter)
                     SELECT 1 UNION ALL
                     SELECT t.N+1 FROM cteTally t WHERE SUBSTRING(@pString,t.N,1) = @pDelimiter COLLATE Latin1_General_BIN
                    ),
    cteLen(N1,L1) AS ( --=== Return start and length (for use in SUBSTRING).
                       -- The ISNULL/NULLIF combo handles the length for the final (or only) element.
                      SELECT s.N1,
                             ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString COLLATE Latin1_General_BIN,s.N1),0)-s.N1,8000)
                        FROM cteStart s
                     )
    --===== Do the actual split.
    SELECT ItemNumber = ROW_NUMBER() OVER (ORDER BY l.N1),
           Item       = SUBSTRING(@pString, l.N1, l.L1)
      FROM cteLen l
    ;
    GO

    I did try the LIKE and the CHARINDEX methods. They each took over 3 minutes to run, and they both have a hidden problem. For example, if the value in Table 1 is "Shot", both methods will also find "Shot Glass" and "Shotgun", which doesn't seem correct to me.
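    To see the problem concretely, here's a minimal sketch using the test tables above and a hypothetical search value of 'Shot'. The plain double-ended wildcard matches partial words, while wrapping both the CSV and the search value in delimiters restricts matches to whole CSV elements; note that neither version is SARGable, which is part of why both are slow:

    --===== Naive wildcard: would also match "Shot Glass" and "Shotgun"
    SELECT t2.ID, t2.ProductName
      FROM #Table2 t2
     WHERE t2.ProductName LIKE '%Shot%'
    ;
    --===== Delimiter-anchored: only matches the whole CSV element "Shot"
    SELECT t2.ID, t2.ProductName
      FROM #Table2 t2
     WHERE ','+t2.ProductName+',' LIKE '%,Shot,%'
    ;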

    Since the overwhelming recommendation is to "normalize" the data in Table 2, and that apparently can't be done at this point in the project, the following code creates a normalized result set for Table 2 as a CTE and then runs against that. This is proof positive that data should never be stored in a CSV column because, even with the overhead of having to normalize Table 2 as a first step, it blows the other methods away, coming in at only 2.1 to 2.5 seconds on my laptop.

    The code below produces the desired result with some additions. The result is stored in a temporary table (#MyHead) as requested. I also took the liberty of combining the output of Table1 and Table2 so that you know what uses what. I also included an "ItemNumber" column so you know which element in the CSV matched the Product Name from Table1, and I included the IDs of both just in case your "additional processing" might need them.

    Since the code is all in one query, it is possible to convert it to either a view or a high performance Inline Table Valued Function (iTVF); a rough sketch of the latter follows the query below.

    WITH cteNormalize AS
    ( --=== This normalizes T2
     SELECT t2.ID, split.ItemNumber, split.Item
       FROM #Table2 t2
      CROSS APPLY dbo.DelimitedSplit8K(t2.ProductName,',') split
    )
    SELECT Table1_ID          = t1.ID
          ,Table1_ProductName = t1.ProductName
          ,norm.ItemNumber
          ,Table2_ID          = t2.ID
          ,ProductNameCSV     = t2.ProductName
          ,t2.Details
          ,t2.Description
      INTO #MyHead
      FROM #Table1 t1
      JOIN cteNormalize norm ON norm.Item = t1.ProductName
      JOIN #Table2 t2        ON t2.ID     = norm.ID
    ;
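    As a rough sketch of the iTVF idea (the function name is hypothetical, and since a T-SQL function can't reference temp tables, this assumes permanent dbo.Table1/dbo.Table2 tables with the same columns), the normalize-and-join could be parameterized by product name like this:

    CREATE FUNCTION dbo.tvf_ProductUsage --hypothetical name
            (@pProductName VARCHAR(250))
    RETURNS TABLE AS
     RETURN
    SELECT Table2_ID      = t2.ID
          ,split.ItemNumber
          ,ProductNameCSV = t2.ProductName
          ,t2.Details
          ,t2.Description
      FROM dbo.Table2 t2
     CROSS APPLY dbo.DelimitedSplit8K(t2.ProductName,',') split
     WHERE split.Item = @pProductName --whole-element match only
    ;
    GO
    --===== Usage: all Table 2 rows whose CSV contains the given product
    SELECT * FROM dbo.tvf_ProductUsage('canola');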

    Please send beer... I already have enough pretzels. 😛

    --Jeff Moden



  • Hi Jeff,

    WOW!!!!

    Thanks a lot for your precious time on this post; this is really going to help many users. I just got back to work from my sick days, so I will have to try your implementation in my actual scenario. I will post once I do. Please stay tuned....

    I'm thinking of saying more than thanks, but I'm still finding the words!!!

  • Hi Jeff,

    I'm back.

    I got one more tricky situation from my client, and I was frightened to hear about this requirement from them. The idea is that I will have to take the product name from Table 1 and search with a %product name% condition, not whole-word matching.

    In your example, after we split the comma-separated values into the temp table, we fetch each item from Table 1 and have to match it as %table1.productname% against the temp table. I am worried about this ugly approach because it will kill the time. Do you have any suggestions on this? Sample below:

    If the product name in Table 1 is "milk" and in the temp table we have "milk with fat", "milk without fat", and "milk with less fat", then we have to fetch these three product names. For this, I expect we would have to use %table1.productname%.

    could you please

  • At this point, you're better off just creating a full text index on table2 and searching it. You can join it to table2 and perform LIKE '%a%' on the result set returned.
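    For reference, here's a minimal sketch of that suggestion. The catalog and key-index names are hypothetical, and it assumes a permanent dbo.Table2 with a unique index on ID (full-text indexes can't be created on temp tables). Note that full-text handles trailing wildcards ("milk*") but not leading ones, which the next reply picks up on:

    CREATE FULLTEXT CATALOG ftcProducts; --hypothetical catalog name
    CREATE FULLTEXT INDEX ON dbo.Table2 (ProductName)
        KEY INDEX PK_Table2 --assumes a unique index named PK_Table2 on ID
        ON ftcProducts;
    GO
    --===== Word-prefix search via the full-text index, then a residual LIKE
    SELECT t2.ID, t2.ProductName
      FROM CONTAINSTABLE(dbo.Table2, ProductName, '"milk*"') ft
      JOIN dbo.Table2 t2
        ON t2.ID = ft.[KEY]
     WHERE t2.ProductName LIKE '%milk%'
    ;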

  • lnardozi 61862 (10/13/2013)


    At this point, you're better off just creating a full text index on table2 and searching it. You can join it to table2 and perform LIKE '%a%' on the result set returned.

    You could be right, but the FTI wouldn't be on Table 2 according to the OP's latest post; it would only be needed on Table 1. Fortunately, I don't see any requirement to join Table 1 to Table 2 under such conditions.

    --Jeff Moden



  • born2achieve (10/13/2013)


    Hi Jeff,

    I'm back.

    I got one more tricky situation from my client, and I was frightened to hear about this requirement from them. The idea is that I will have to take the product name from Table 1 and search with a %product name% condition, not whole-word matching.

    In your example, after we split the comma-separated values into the temp table, we fetch each item from Table 1 and have to match it as %table1.productname% against the temp table. I am worried about this ugly approach because it will kill the time. Do you have any suggestions on this? Sample below:

    If the product name in Table 1 is "milk" and in the temp table we have "milk with fat", "milk without fat", and "milk with less fat", then we have to fetch these three product names. For this, I expect we would have to use %table1.productname%.

    could you please

    Would you also want to pick up "Milk Glass", "Milk Paint", "Milky Way Candy Bar", "Dried Milk", "Milky Ammonia", "Milk Stain Remover", "Chocolate Milk Syrup", etc.? Even FTS (Full Text Search) is going to have a problem with leading wildcards, because leading wildcards either aren't SARGable (can't use a seek) or you have to generate huge amounts of word parts to facilitate SARGable leading-wildcard searches. And are they going to want to be able to search for a multi-word entry regardless of the order of the search words entered?

    Fortunately, you only have a thousand or so products. It's not like you're trying to build a catalog for Grainger (which sells thousands and thousands of different parts and sizes of parts) or a document search like Google. Just do the normal double-ended wildcard search on Table 1 and see how it pans out. Even though it won't be SARGable, a nice narrow unique index that includes the product name and the product ID may help a bit here simply because the index data is narrower than the original Table 1 (assuming there are more than just the 3 columns you posted... I could be wrong). Maybe not.
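    As a rough sketch of that suggestion (the index name and search value are illustrative only): the leading wildcard still forces a scan, but scanning a narrow index on the test table above costs less than scanning the whole table:

    --===== Narrow covering index on the product name
    CREATE UNIQUE INDEX IX_Table1_ProductName --hypothetical name
        ON #Table1 (ProductName) INCLUDE (ID);

    --===== The double-ended wildcard search against Table 1
    DECLARE @Search VARCHAR(250) = 'milk'; --hypothetical search term
    SELECT t1.ID, t1.ProductName
      FROM #Table1 t1
     WHERE t1.ProductName LIKE '%'+@Search+'%'
    ;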

    If they want something faster, then it's going to take a basic redesign of tables and data. For example, there should be a table of categories that should be searched for "Milk" related products and they should search for categories instead of individual products. That's likely the original intent of Table 2 but they fell way short on that. Such further design is also way beyond what should be solved in a forum like this one.

    --Jeff Moden



  • Yes Jeff,

    I totally agree with your statement, and thanks a lot for all your time on this thread. It really helped me in many ways, and I learned about different concepts.

    Once again, many thanks, gentleman!!!
