Need Help on Fastest Search Logic
Posted Thursday, October 3, 2013 6:08 PM
SSC-Enthusiastic

This makes use of Jeff Moden's DelimitedSplit8K, which you can download from the Scripts section of this site.


SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO

CREATE FUNCTION dbo.tvf_ChildRowNames
(
@rowName VARCHAR(250)
)
RETURNS TABLE
AS
RETURN
(
--===== CROSS APPLY is used so the splitter can take table2's column
     -- as a correlated argument.
 SELECT t2.name
   FROM dbo.table2 t2
  CROSS APPLY dbo.DelimitedSplit8K(t2.name, ',') split
  WHERE split.Item = @rowName
)
GO

You use it like this:

SELECT name FROM dbo.tvf_ChildRowNames('canola');

Post #1501388
Posted Friday, October 4, 2013 11:42 PM


SSC-Dedicated

Generating random CSVs with a random number of elements from the list in Table1 was a bit of a chore. The items wouldn't come out random, so I had to play with it a bit to trick it. It takes a while to build the test tables (2:17 mm:ss) on my old machine, but it would appear that I've been able to generate test data not unlike your real data for everyone to play with. It's a bit late here, so I'll play with it tomorrow night. In the meantime, here's the code to generate the data for folks to play with and test for performance.

--===== Conditionally drop the test tables to make reruns in SSMS easier
IF OBJECT_ID('tempdb..#Table1','U') IS NOT NULL DROP TABLE #Table1;
IF OBJECT_ID('tempdb..#Table2','U') IS NOT NULL DROP TABLE #Table2;
GO
--===== Create Table1
WITH
cteGenProduct AS
(
SELECT DISTINCT TOP 1000
ProductName = REPLACE(SUBSTRING(LTRIM(NEWID()),ABS(CHECKSUM(NEWID()))%10+1,ABS(CHECKSUM(NEWID()))%10+5),'-','')
FROM master.sys.all_columns ac1
CROSS JOIN master.sys.all_columns ac2
)
SELECT ID = IDENTITY(BIGINT,1,1)
,ProductName
,Qty = ISNULL(ABS(CHECKSUM(NEWID()))%1000,0)
INTO #Table1
FROM cteGenProduct
;
SELECT * FROM #Table1
;
--===== Create Table 2.
-- The ProductNames from Table 1 will randomly appear as CSV data
-- with 4 to 10 "elements"
WITH
cteRandomData AS
(
SELECT TOP 25000
ID = ROW_NUMBER()OVER(ORDER BY (SELECT 1))
,Details = CAST(NEWID() AS NVARCHAR(1000))
,Description = CAST(NULL AS NVARCHAR(1000))
FROM master.sys.all_columns ac1
CROSS JOIN master.sys.all_columns ac2
)
SELECT ID
,ProductName =
(SELECT STUFF(CAST(
( --=== Concatenate N number of random t1.ProductName's
SELECT TOP (ABS(CHECKSUM(NEWID()))%7+4)
N','+t1.ProductName
FROM #Table1 t1
WHERE t2.ID > 0 --This nonsense is to randomize the CSV
ORDER BY NEWID()
FOR XML PATH('')
)
AS NVARCHAR(MAX)),1,1,N'')
)
,Details
,Description
INTO #Table2
FROM cteRandomData t2
;
SELECT * FROM #Table2
;
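A quick sanity check on the generated data (this query is an addition, not part of the original generator): counting delimiters should show that every CSV in #Table2 landed in the expected 4-to-10 element range.

```sql
--===== Sanity check: element count = comma count + 1 for each CSV.
 SELECT MinElements = MIN(v.ElementCount)
       ,MaxElements = MAX(v.ElementCount)
   FROM (
         SELECT ElementCount = LEN(ProductName) - LEN(REPLACE(ProductName,N',',N'')) + 1
           FROM #Table2
        ) v
;
```

MinElements should come back as 4 and MaxElements as 10 if the TOP (ABS(CHECKSUM(NEWID()))%7+4) trick worked as intended.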




--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #1501816
Posted Saturday, October 5, 2013 12:25 AM


SSC-Dedicated

I have the desired process down to 11 seconds, using my two favorite tools of "Divide'n'Conquer" and the DelimitedSplit8K function, on my 11-year-old machine with a single 32-bit 1.8 GHz CPU. I need to clean the code up and test it on a slightly more modern machine, and then I'll post my solution. Going to bed... finally.

Post #1501817
Posted Saturday, October 5, 2013 10:41 AM


SSC-Dedicated

Again, I apologize for this taking so long. Real life had me hopping, and it took a little thought to figure out how to generate the test data and still have it finish sometime in the same week. I hope you'll find it worth the wait.

Just to keep everything in one place, here's the code I used to set up the test. The CSVs in Table 2 all have 4 to 10 elements, and ALL Product Names in the CSVs are available in Table 1. The Product Names in Table 1 are unique; the Product Names in Table 2 are not.

I used random lengths of random GUIDs and replaced the dashes with spaces to simulate multi-word Product Names. The code will certainly work with single word Product Names, as well. I also used a random GUID as a unique value in the "Details" column of Table 2.

As cited in the original post, there are 1,000 rows in Table 1 (#Table1) and 25,000 rows in Table 2 (#Table2).

Here's the code to generate the test tables according to the requirements above. On a decent machine, it takes approximately 22 seconds to execute. It's a bit long-winded because of the ORDER BY NEWID() used to randomize the Product Names in the CSVs.


--===== Conditionally drop the test tables to make reruns in SSMS easier
IF OBJECT_ID('tempdb..#Table1','U') IS NOT NULL DROP TABLE #Table1;
IF OBJECT_ID('tempdb..#Table2','U') IS NOT NULL DROP TABLE #Table2;
GO
--===== Create Table1
WITH
cteGenProduct AS
(
SELECT DISTINCT TOP 1000
ProductName = LTRIM(REPLACE(SUBSTRING(LTRIM(NEWID()),ABS(CHECKSUM(NEWID()))%10+1,ABS(CHECKSUM(NEWID()))%10+5),'-',' '))
FROM master.sys.all_columns ac1
CROSS JOIN master.sys.all_columns ac2
)
SELECT ID = IDENTITY(BIGINT,1,1)
,ProductName
,Qty = ISNULL(ABS(CHECKSUM(NEWID()))%1000,0)
INTO #Table1
FROM cteGenProduct
;
SELECT * FROM #Table1
;
--===== Create Table 2.
-- The ProductNames from Table 1 will randomly appear as CSV data
-- with 4 to 10 "elements"
WITH
cteRandomData AS
(
SELECT TOP 25000
ID = ROW_NUMBER()OVER(ORDER BY (SELECT 1))
,Details = CAST(NEWID() AS NVARCHAR(1000))
,Description = CAST(NULL AS NVARCHAR(1000))
FROM master.sys.all_columns ac1
CROSS JOIN master.sys.all_columns ac2
)
SELECT ID
,ProductName =
(SELECT STUFF(CAST(
( --=== Concatenate N number of random t1.ProductName's
SELECT TOP (ABS(CHECKSUM(NEWID()))%7+4)
N','+t1.ProductName
FROM #Table1 t1
WHERE t2.ID > 0 --This nonsense is to randomize the CSV
ORDER BY NEWID()
FOR XML PATH('')
)
AS NVARCHAR(MAX)),1,1,N'')
)
,Details
,Description
INTO #Table2
FROM cteRandomData t2
;
SELECT * FROM #Table2
;


Next, we need the quintessential function to normalize Table 2. This is the "next" version of a very well-tested and proven function. Please read the header of the code for more information. To be safe during this proof of concept, all of this code is being executed in tempdb.

As a bit of a sidebar, you're using NVARCHAR(MAX) for your Product Name CSVs. The splitter below will only handle up to 8K bytes, including the delimiters, and you could get some unwanted changes in data if you're really using NVARCHAR for its intended purpose. If your data exceeds 8K bytes, or you truly need NVARCHAR, then we'll have to whittle on all the code a bit.
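One quick way to check whether the current data is at risk (a sketch against the test tables above; real table and column names may differ):

```sql
--===== Check the widest CSV against the splitter's VARCHAR(8000) limit.
 SELECT MaxChars = MAX(LEN(ProductName))        -- characters in the CSV
       ,MaxBytes = MAX(DATALENGTH(ProductName)) -- bytes (2 per character for NVARCHAR)
   FROM #Table2
;
-- If MaxChars approaches 8000, or the data genuinely needs Unicode,
-- the 8K splitter below is the wrong tool as-is.
```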

USE tempdb;
GO
CREATE FUNCTION [dbo].[DelimitedSplit8K]
/**********************************************************************************************************************
Purpose:
Given a string containing multiple elements separated by a single character delimiter and that single character
delimiter, this function will split the string and return a table of the single elements (Item) and the element
position within the string (ItemNumber).

Notes:
1. Performance of this function approaches that of a CLR.
2. Note that this code implicitly converts NVARCHAR to VARCHAR and that conversion may NOT be faithful.

Revision History:
Note that this code is a modification of a well proven function created as a community effort and initially documented
at the following URL (http://www.sqlservercentral.com/articles/Tally+Table/72993/). This code is still undergoing
tests. Although every care has certainly been taken to ensure its accuracy, you are reminded to do your own tests to
ensure that this function is suitable for whatever application you might use it for.
--Jeff Moden, 01 Sep 2013
**********************************************************************************************************************/
--===== Define I/O parameters
(@pString VARCHAR(8000) , @pDelimiter CHAR(1)) --DO NOT USE MAX DATA-TYPES HERE! IT WILL KILL PERFORMANCE!
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
--===== "Inline" CTE Driven "Tally Table" produces values from 1 up to 10,000...
-- enough to cover VARCHAR(8000).
WITH E1(N) AS (
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
), --10E+1 or 10 rows
E4(N) AS (SELECT 1 FROM E1 a, E1 b, E1 c, E1 d), --10E+4 or 10,000 rows max
cteTally(N) AS ( --=== This provides the "base" CTE and limits the number of rows right up front
-- for both a performance gain and prevention of accidental "overruns"
SELECT TOP (ISNULL(DATALENGTH(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
),
cteStart(N1) AS ( --=== This returns N+1 (starting position of each "element" just once for each delimiter)
SELECT 1 UNION ALL
SELECT t.N+1 FROM cteTally t WHERE SUBSTRING(@pString,t.N,1) = @pDelimiter COLLATE Latin1_General_BIN
),
cteLen(N1,L1)AS ( --=== Return start and length (for use in substring).
-- The ISNULL/NULLIF combo handles the length for the final (or only) element.
SELECT s.N1,
ISNULL(NULLIF(CHARINDEX(@pDelimiter ,@pString COLLATE Latin1_General_BIN,s.N1) ,0)-s.N1,8000)
FROM cteStart s
)
--===== Do the actual split.
SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),
Item = SUBSTRING(@pString, l.N1, l.L1)
FROM cteLen l
;
GO


I did try the LIKE and CHARINDEX methods. They each took over 3 minutes to run, and both have a hidden problem: if the value in Table 1 is "Shot", both methods will also find "Shot Glass" and "Shotgun", which doesn't seem correct to me.

Since the overwhelming recommendation is to "normalize" the data in Table 2, and that, apparently, can't be done at this point in the project, the following code creates a normalized result set for Table 2 as a CTE and then runs against that. This is proof positive that data should never be stored in a CSV column: even with the overhead of having to normalize Table 2 as a first step, this method blows the other methods away, coming in at only 2.1 to 2.5 seconds on my laptop.

The code below produces the desired result with some additions. The result is stored in a temporary table (#MyHead), as requested. I took the liberty of combining the output of Table1 and Table2 so that you know what uses what. I also included an "ItemNumber" column so you know which element in the CTE matched the Product Name from Table1, and included the IDs of both just in case your "additional processing" might need them.

Since the code is all in one query, it is possible to convert the code to either a view or a high performance Inline Table Valued Function (iTVF).

WITH
cteNormalize AS
( --=== This normalizes T2
SELECT t2.ID, split.ItemNumber, Split.Item
FROM #Table2 t2
CROSS APPLY dbo.DelimitedSplit8k(t2.ProductName,',') split
)
SELECT Table1_ID = t1.ID
,Table1_ProductName = t1.ProductName
,norm.ItemNumber
,Table2_ID = t2.ID
,ProductNameCSV = t2.ProductName
,t2.Details
,t2.Description
INTO #MyHead
FROM #Table1 t1
JOIN cteNormalize norm
ON norm.Item = t1.ProductName
JOIN #Table2 t2
ON t2.ID = norm.ID
;
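As noted above, the query can be converted to a view or an iTVF. A minimal sketch of the iTVF form follows; the function name tvf_MatchedProducts is hypothetical, and since SCHEMABINDING can't reference temp tables, it assumes permanent dbo.Table1/dbo.Table2 tables with the same columns.

```sql
--===== Sketch only: the same query wrapped as an inline TVF.
     -- tvf_MatchedProducts is an assumed name; SCHEMABINDING requires
     -- permanent, two-part-named tables, so temp tables won't work here.
CREATE FUNCTION dbo.tvf_MatchedProducts ()
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
 WITH cteNormalize AS
( --=== This normalizes Table2
 SELECT t2.ID, split.ItemNumber, split.Item
   FROM dbo.Table2 t2
  CROSS APPLY dbo.DelimitedSplit8K(t2.ProductName,',') split
)
 SELECT Table1_ID          = t1.ID
       ,Table1_ProductName = t1.ProductName
       ,norm.ItemNumber
       ,Table2_ID          = t2.ID
       ,ProductNameCSV     = t2.ProductName
       ,t2.Details
       ,t2.Description
   FROM dbo.Table1 t1
   JOIN cteNormalize norm ON norm.Item = t1.ProductName
   JOIN dbo.Table2 t2     ON t2.ID = norm.ID
;
GO
```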


Please send beer... I already have enough pretzels.


Post #1501842
Posted Monday, October 7, 2013 10:15 AM
SSC-Enthusiastic

Hi Jeff,

WOW!!!!

Thanks a lot for your time on this post; it is really going to help many users. I just got back to work after some sick days, so I will have to try your implementation in my actual scenario. I will post back once I do. Please stay tuned.

I'm trying to think of a way to say more than thanks, but I can't find the words!
Post #1502247
Posted Sunday, October 13, 2013 7:11 AM
SSC-Enthusiastic

Hi Jeff,

I'm back.

I have one more tricky situation from my client, and I was alarmed when I heard about it. I will have to take the product name from Table 1 and search with a %product name% condition, not whole-word matching.

In your example, after we split the comma-separated values into the temp table, we would fetch each item from Table 1 and match it against the temp table with %table1.productname%. I'm worried about this ugly approach because it will kill performance. Do you have any suggestions? A sample is below.

If the product name in Table 1 is "milk" and the temp table has "milk with fat", "milk without fat", and "milk with less fat", then we have to fetch all three product names. For this, I assume we would have to use %table1.productname%.

Could you please advise?
Post #1504260
Posted Sunday, October 13, 2013 7:45 AM
SSC-Enthusiastic

At this point, you're better off just creating a full text index on table2 and searching it. You can join it to table2 and perform LIKE '%a%' on the result set returned.
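A minimal sketch of that idea (the catalog name, key-index name, and table names are assumptions; a full-text index requires a unique, non-nullable, single-column key index on the table):

```sql
--===== Sketch only: full-text index on the CSV column of Table2.
     -- ftProducts and PK_Table2 are assumed names.
CREATE FULLTEXT CATALOG ftProducts AS DEFAULT;
CREATE FULLTEXT INDEX ON dbo.Table2 (ProductName)
    KEY INDEX PK_Table2;  -- must be a unique, non-nullable, single-column index
GO
--===== Word and prefix searches are fast with CONTAINS...
 SELECT t2.*
   FROM dbo.Table2 t2
  WHERE CONTAINS(t2.ProductName, N'"milk*"')
;
```

Note that full-text indexes match on word boundaries, so CONTAINS can't do a true leading wildcard; the LIKE '%a%' pass would still run, but only against the (smaller) result set returned, as described above.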
Post #1504264
Posted Sunday, October 13, 2013 8:31 AM


SSC-Dedicated

lnardozi 61862 (10/13/2013)
At this point, you're better off just creating a full text index on table2 and searching it. You can join it to table2 and perform LIKE '%a%' on the result set returned.


You could be right, but the FTI wouldn't be on Table 2 according to the OP's latest post. It would only be needed on Table 1. Fortunately, I don't see any requirement to join Table 1 to Table 2 under such conditions.


Post #1504268
Posted Sunday, October 13, 2013 9:37 AM


SSC-Dedicated

born2achieve (10/13/2013)
Hi Jeff,

I'm back.

I have one more tricky situation from my client, and I was alarmed when I heard about it. I will have to take the product name from Table 1 and search with a %product name% condition, not whole-word matching.

In your example, after we split the comma-separated values into the temp table, we would fetch each item from Table 1 and match it against the temp table with %table1.productname%. I'm worried about this ugly approach because it will kill performance. Do you have any suggestions? A sample is below.

If the product name in Table 1 is "milk" and the temp table has "milk with fat", "milk without fat", and "milk with less fat", then we have to fetch all three product names. For this, I assume we would have to use %table1.productname%.

Could you please advise?


Would you also want to pick up "Milk Glass", "Milk Paint", "Milky Way Candy Bar", "Dried Milk", "Milky Ammonia", "Milk Stain Remover", "Chocolate Milk Syrup", etc.? Even FTS (Full Text Search) is going to have a problem with leading wildcards, because leading wildcards either aren't SARGable (can't use a seek) or you have to generate huge amounts of word parts to facilitate SARGable leading-wildcard searches. And are they going to want to be able to search for a multi-word entry regardless of the order of the search words entered?

Fortunately, you only have a thousand or so products. It's not like you're trying to build a catalog for Grainger (which sells thousands and thousands of different parts and sizes of parts) or a document search like Google. Just do the normal double-ended wildcard search on Table 1 and see how it pans out. Even though it won't be SARGable, a nice narrow unique index that includes the product name and the product ID may help a bit here, simply because the index is narrower than the original Table 1 (assuming there are more than just the 3 columns you posted... I could be wrong). Maybe not.

If they want something faster, then it's going to take a basic redesign of tables and data. For example, there should be a table of categories that should be searched for "Milk" related products and they should search for categories instead of individual products. That's likely the original intent of Table 2 but they fell way short on that. Such further design is also way beyond what should be solved in a forum like this one.
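A sketch of the double-ended wildcard search described above, using the test tables and the same Divide'n'Conquer split as before (non-SARGable by design, so expect a scan):

```sql
--===== Sketch only: match Table1 names anywhere inside the split Table2 items.
 WITH cteNormalize AS
( --=== Normalize Table 2 first, as before
 SELECT t2.ID, split.Item
   FROM #Table2 t2
  CROSS APPLY dbo.DelimitedSplit8K(t2.ProductName,',') split
)
 SELECT t1.ProductName
       ,MatchedItem = norm.Item
       ,Table2_ID   = norm.ID
   FROM #Table1 t1
   JOIN cteNormalize norm
     ON norm.Item LIKE '%' + t1.ProductName + '%' --double-ended wildcard; no seek possible
;
```

With "milk" in Table 1, this would return the "milk with fat" style items the OP described, one row per matching element.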


Post #1504271
Posted Sunday, October 13, 2013 10:58 AM
SSC-Enthusiastic

Yes, Jeff,

I totally agree with your statement, and thanks a lot for all your time on this thread. It has really helped me in many ways, and I've learned about several different concepts.

Once again, many thanks, gentleman!
Post #1504280