Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 
        
Home       Members    Calendar    Who's On


Add to briefcase 12»»

Double Delimited String Parsing and Re-Concatenation Help Expand / Collapse
Author
Message
Posted Friday, July 9, 2010 9:53 AM
Valued Member

Valued MemberValued MemberValued MemberValued MemberValued MemberValued MemberValued MemberValued Member

Group: General Forum Members
Last Login: Tuesday, July 15, 2014 8:11 AM
Points: 55, Visits: 406
I have a table with a column of double delimited strings. I need to break it out into all possible combinations without repeats and without repeats from one outer group to another. For example:
CREATE TABLE #TmpTbl (TblID INT PRIMARY KEY, MultiDelimStr VARCHAR(255))
INSERT #TmpTbl ( TblID, MultiDelimStr )
SELECT 101, 'Orange,Apple;Orange,Grape,Banana;Apple,Kiwi,Grape'
UNION
SELECT 202, 'Grape,Apple;Orange,Grape,Banana;Peach,Kiwi,Grape'
UNION
SELECT 303, 'Orange;Peach,Grape,Banana;Apple,Kiwi;Tangerine,Grapefruit'

From this I need the following result set:
101	Orange,Grape,Apple
101 Orange,Grape,Kwi
101 Orange,Banana,Apple
101 Orange,Banana,Kiwi
101 Orange,Banana,Grape
101 Apple,Orange,Kiwi
101 Apple,Orange,Grape
101 Apple,Grape,Kiwi
101 Apple,Banana,Kiwi
101 Apple,Banana,Grape
202 Grape,Orange,Peach
202 Grape,Orange,Kiwi
202 Grape,Banana,Peach
202 Grape,Banana,Kiwi
202 Apple,Orange,Peach
202 Apple,Orange,Kiwi
202 Apple,Orange,Grape
202 Apple,Grape,Peach
202 Apple,Grape,Kiwi
202 Apple,Banana,Grape
202 Apple,Banana,Peach
202 Apple,Banana,Kiwi
303 Orange,Peach,Apple,Tangerine
303 Orange,Peach,Apple,Grapefruit
303 Orange,Peach,Kiwi,Tangerine
303 Orange,Peach,Kiwi,Grapefruit
303 Orange,Grape,Apple,Tangerine
303 Orange,Grape,Apple,Grapefruit
303 Orange,Grape,Kiwi,Tangerine
303 Orange,Grape,Kiwi,Grapefruit
303 Orange,Banana,Apple,Tangerine
303 Orange,Banana,Apple,Grapefruit
303 Orange,Banana,Kiwi,Tangerine
303 Orange,Banana,Kiwi,Grapefruit

But the following row would be incorrect:
101	Apple,Orange,Apple


Please, also note that I can count on at least 1 semicolon or up to 4, but no more than 4 per record. I don’t want to separate the rows with 1 semicolon and process them before moving on to the rows with 2 semicolons, etc.

I thought a double parse routine would get me close to where I want to be, but I can’t figure out how to put it back together again in the order my requirements desire. To break it out I have used:
SELECT FruitID, MultiDelimStr
,RANK() OVER(PARTITION BY FruitID ORDER BY N) AS GroupID
,SUBSTRING(';'+MultiDelimStr+';',N+1,CHARINDEX(';',';'+MultiDelimStr+';',N+1)-N-1) AS GroupFruit
INTO #FirstParse
FROM #FruitTbl F
CROSS JOIN dbo.Tally T
Where T.N < LEN(';'+MultiDelimStr+';')
AND SUBSTRING(';'+MultiDelimStr+';',N,1) = ';'
--SELECT * FROM #FirstParse

SELECT FruitID, GroupID
,SUBSTRING(','+GroupFruit+',',N+1,CHARINDEX(',',','+GroupFruit+',',N+1)-N-1) AS SoloFruit
INTO #SecondParse
FROM #FirstParse F
CROSS JOIN dbo.Tally T
Where T.N < LEN(','+GroupFruit+',')
AND SUBSTRING(','+GroupFruit+',',N,1) = ','
SELECT * FROM #SecondParse

To get the result set (just showing 101 because this is already long enough):
FruitID	GroupID	SoloFruit
101 1 Orange
101 1 Apple
101 2 Orange
101 2 Grape
101 2 Banana
101 3 Apple
101 3 Kiwi
101 3 Grape

Now how do I put it back together and achieve the desired result set without multiple IF statements and multiple sets of subqueries or a loop (cursor/while)? Or is there a better way than splitting it twice? I know I am missing something that I fear I have read before, but I just can't remember it XML?

Thank you very much for your time and advice in advance.
Post #950057
Posted Friday, July 9, 2010 11:50 AM


SSCertifiable

SSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiable

Group: General Forum Members
Last Login: Wednesday, July 23, 2014 4:31 AM
Points: 5,014, Visits: 10,514
I'm not sure I understand all your requirements.
Why ID = 101 has 3 combinations? There are 4 fruits there...

Anyway this is what I could come up with. It could be a starter.


IF OBJECT_ID('Tempdb..#TmpTbl') IS NOT NULL DROP TABLE #TmpTbl

CREATE TABLE #TmpTbl (TblID INT PRIMARY KEY, MultiDelimStr VARCHAR(255))
INSERT #TmpTbl ( TblID, MultiDelimStr )
SELECT 101, 'Orange,Apple;Orange,Grape,Banana;Apple,Kiwi,Grape'
UNION
SELECT 202, 'Grape,Apple;Orange,Grape,Banana;Peach,Kiwi,Grape'
UNION
SELECT 303, 'Orange;Peach,Grape,Banana;Apple,Kiwi;Tangerine,Grapefruit'


;WITH Pass1 AS (
SELECT TblId, C.Value
FROM #TmpTbl AS A
CROSS APPLY dbo.fSplit(MultiDelimStr, ';') AS B
CROSS APPLY dbo.fSplit(B.Value, ',') AS C
),
Pass2 AS (
SELECT *
FROM (
SELECT DISTINCT *
FROM Pass1
) AS A
),
Ids AS (
SELECT DISTINCT TblId
FROM #TmpTbl
)
SELECT A.TblId, B.*
FROM Pass2 AS A
CROSS APPLY (
SELECT CStr = A.Value + (
SELECT ',' + Value AS [text()]
FROM Pass2 AS P1
WHERE P1.TblId = A.TblId
AND P1.Value <> A.Value
ORDER BY Value
FOR XML PATH('')
)
)AS B

This code uses a split function by Jeff Moden that can be found here: http://www.sqlservercentral.com/articles/T-SQL/62867/

I'm including the function here:

CREATE FUNCTION [dbo].[fSplit]
(
@Parameter VARCHAR(8000),
@SplitOn char(1)
)
RETURNS @Elements TABLE
(

ID INT identity(1,1),
Value VARCHAR(8000)
)
AS
BEGIN
--===== Add start and end commas to the Parameter so we can handle
-- single elements
SET @Parameter = @splitOn + @Parameter + @SplitOn

--===== Join the Tally table to the string at the character level and
-- when we find a comma, insert what's between that command and
-- the next comma into the Elements table
INSERT INTO @Elements (Value)
SELECT SUBSTRING(@Parameter,N+1,CHARINDEX(@splitOn,@Parameter,N+1)-N-1)
FROM dbo.Tally
WHERE N < LEN(@Parameter)
AND SUBSTRING(@Parameter,N,1) = @splitOn --Notice how we find the comma
RETURN
END

I'm sure this is not what you're after...
Good luck anyway


--
Gianluca Sartori

Get your two-cent-answer quickly
spaghettidba.com
@spaghettidba
Post #950145
Posted Friday, July 9, 2010 11:51 AM


SSCertifiable

SSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiable

Group: General Forum Members
Last Login: Wednesday, July 23, 2014 4:31 AM
Points: 5,014, Visits: 10,514
Don't forget to create the Tally table from the article code if you're using the split function!

Good luck with your query!


--
Gianluca Sartori

Get your two-cent-answer quickly
spaghettidba.com
@spaghettidba
Post #950148
Posted Friday, July 9, 2010 1:13 PM
Valued Member

Valued MemberValued MemberValued MemberValued MemberValued MemberValued MemberValued MemberValued Member

Group: General Forum Members
Last Login: Tuesday, July 15, 2014 8:11 AM
Points: 55, Visits: 406

I'm not sure I understand all your requirements.
Why ID = 101 has 3 combinations? There are 4 fruits there...

I am not sure I understand your question. ID 101 should have 10 combinations.
Group1 has 2fruits. Group2 has 3 fruits. Group3 has 3 fruits. Group1 shares 1 fruit with Group2 and 1 different fruit with Group3. Group2 shares 1 fruit with Group3. Since Apple, Orange, Apple is not allowed, you can't count the number of items per group and the number of groups and get your total number of combinations using simple math. You have to know the repeats. This leaves the 10 combinations I list above in the original post.

Thank you very much for the function and CTEs, I knew that I had seen something like that before. I will test against what I have and let you know if it does indeed answer my question.
Post #950173
Posted Friday, July 9, 2010 1:43 PM
Valued Member

Valued MemberValued MemberValued MemberValued MemberValued MemberValued MemberValued MemberValued Member

Group: General Forum Members
Last Login: Tuesday, July 15, 2014 8:11 AM
Points: 55, Visits: 406
Gianluca,
Your solution, while helpful, does not return the desired result set.
I think I see the confusion in my explanation of the problem.
The problem is to come up with all of the unique GROUP combinations.
The result set must have one fruit from Group1 and one fruit from Group2 ... 1 fruit from GroupN.

So if the string field has 1 ';' then there are two groups; Group1 and Group2 (left and right of the ';'). The output should be one fruit from Group1 plus a ',' plus one fruit from Group2. More specifically, all the possible single-fruit-per-group combinations that keep Group1 fruits on the left and Group2 fruits on the right. With no fruit simultaneously in Group1 and Group2 spots during output.

So if the string field has 2 ';' then there are three groups; Group1, Group2, Group3. Group1's fruits are to the left of the first ';' and Group3's are to the right of the last ';' and Group2's are in the the middle. Then re-concatenate them such that all the possible single-fruit-per-group combinations that keep Group1 fruits on the left and Group2 fruits in the middle and Group3 fruits on the right. With no fruit simultaneously in Group1, Group2 and/or Group3 spots during output.

Etc.

In my original post I list out all the valid combinations for output. That is the format that is desired. I hope this helps clear things up and expedites aid.

Thank you all again for taking a look at this.
Post #950183
Posted Monday, July 12, 2010 10:46 AM


SSCertifiable

SSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiable

Group: General Forum Members
Last Login: Wednesday, July 23, 2014 4:31 AM
Points: 5,014, Visits: 10,514
I must say I'm quite disappointed I could not solve this problem in a simple way.
Maybe It's something beyond my skills.

Anyway this is the best I could put together:

IF OBJECT_ID('Tempdb..#TmpTbl') IS NOT NULL DROP TABLE #TmpTbl 

CREATE TABLE #TmpTbl (TblID INT PRIMARY KEY, MultiDelimStr VARCHAR(255))
INSERT #TmpTbl ( TblID, MultiDelimStr )
SELECT 101, 'Orange,Apple;Orange,Grape,Banana;Apple,Kiwi,Grape'
UNION
SELECT 202, 'Grape,Apple;Orange,Grape,Banana;Peach,Kiwi,Grape'
UNION
SELECT 303, 'Orange;Peach,Grape,Banana;Apple,Kiwi;Tangerine,Grapefruit'


;WITH Splits AS (
SELECT TblId, GRP_ID = B.Id, C.Value AS Fruit, C.Id AS Fruit_id, NumGroups = MAX(B.Id) OVER(PARTITION BY TblId)
FROM #TmpTbl AS A
CROSS APPLY dbo.fSplit(MultiDelimStr, ';') AS B
CROSS APPLY dbo.fSplit(B.Value, ',') AS C
),
ThreeRows AS (
SELECT N
FROM Tally
WHERE N <= 3
),
FourRows AS (
SELECT N
FROM Tally
WHERE N <= 4
),
CrossThree AS (
SELECT A.N AS [1], B.N AS [2], C.N AS [3]
FROM ThreeRows AS A
CROSS JOIN ThreeRows AS B
CROSS JOIN ThreeRows AS C
),
CrossFour AS (
SELECT A.N AS [1], B.N AS [2], C.N AS [3], D.N AS [4]
FROM FourRows AS A
CROSS JOIN ThreeRows AS B
CROSS JOIN ThreeRows AS C
CROSS JOIN ThreeRows AS D
)
SELECT DISTINCT B.TblId, B.Fruit + ',' + C.Fruit + ',' + D.Fruit
FROM CrossThree AS A
INNER JOIN Splits AS B
ON B.GRP_ID = 1
AND B.Fruit_ID = [1]
INNER JOIN Splits AS C
ON C.GRP_ID = 2
AND C.TblId = B.TblId
AND C.Fruit_ID = [2]
AND C.Fruit <> B.Fruit
INNER JOIN Splits AS D
ON D.GRP_ID = 3
AND D.TblId = C.TblId
AND D.Fruit_ID = [3]
AND D.Fruit <> B.Fruit
AND D.Fruit <> C.Fruit
WHERE B.NumGroups = 3
UNION ALL
SELECT DISTINCT B.TblId, B.Fruit + ',' + C.Fruit + ',' + D.Fruit + ',' + E.Fruit
FROM CrossFour AS A
INNER JOIN Splits AS B
ON B.GRP_ID = 1
AND B.Fruit_ID = [1]
INNER JOIN Splits AS C
ON C.GRP_ID = 2
AND C.TblId = B.TblId
AND C.Fruit_ID = [2]
AND C.Fruit <> B.Fruit
INNER JOIN Splits AS D
ON D.GRP_ID = 3
AND D.TblId = C.TblId
AND D.Fruit_ID = [3]
AND D.Fruit <> B.Fruit
AND D.Fruit <> C.Fruit
INNER JOIN Splits AS E
ON E.GRP_ID = 4
AND E.TblId = D.TblId
AND E.Fruit_ID = [4]
AND E.Fruit <> B.Fruit
AND E.Fruit <> C.Fruit
AND E.Fruit <> D.Fruit
WHERE B.NumGroups = 4

Basically the code builds two work tables (CrossThree and CrossFour) with "N" columns, holding all N^N possible combinations of values between 1 and "N".
Once the work tables are in place, I join them back to the splitted strings, choosing the right worktable looking at the number of groups.
The main thing I don't like about this code is that it relies on fixed numbers of groups, so that it should be generated dynamically after a first pass on the input data to gather the minimum/maximum number of groups.

I think this problem could be solved much better with a CLR aggregate, but it's a technique I don't master.

I hope someone can help with something better than this.
Gianluca


--
Gianluca Sartori

Get your two-cent-answer quickly
spaghettidba.com
@spaghettidba
Post #950903
Posted Monday, July 12, 2010 11:12 AM
Valued Member

Valued MemberValued MemberValued MemberValued MemberValued MemberValued MemberValued MemberValued Member

Group: General Forum Members
Last Login: Tuesday, July 15, 2014 8:11 AM
Points: 55, Visits: 406
Your solution is similar to mine with more joins than I believe to be efficient, but still effective.

I am also hoping someone else can provide a cleaner, simpler, more efficient way to solve this challenge.

I do appreciate the time you put into this; thanks!
Post #950919
Posted Monday, July 12, 2010 11:36 AM


SSCrazy

SSCrazySSCrazySSCrazySSCrazySSCrazySSCrazySSCrazySSCrazy

Group: General Forum Members
Last Login: 2 days ago @ 2:51 PM
Points: 2,607, Visits: 17,910
I was wondering if you had control over the schema and could change it to have all the data broken out into different tables? Gianluca did that with his CTE, and with that something like this works:

;WITH Splits AS (
SELECT TblId, GRP_ID = B.Id, C.Value AS Fruit
FROM #TmpTbl AS A
CROSS APPLY dbo.fSplit(MultiDelimStr, ';') AS B
CROSS APPLY dbo.fSplit(B.Value, ',') AS C
)
SELECT one.tblID, one.Fruit + ',' + two.Fruit + ISNULL(',' + three.Fruit, '') + ISNULL(',' + four.Fruit, '') + ISNULL(',' + five.Fruit, '')
FROM Splits one
LEFT JOIN Splits two on (one.tblID = two.tblID AND two.GRP_ID = 2)
LEFT JOIN Splits three on (one.tblID = three.tblID AND three.GRP_ID = 3)
LEFT JOIN Splits four on (one.tblID = four.tblID AND four.GRP_ID = 4)
LEFT JOIN Splits five on (one.tblID = five.tblID AND five.GRP_ID = 5)
WHERE ISNULL(one.Fruit,'') != ISNULL(two.Fruit,'*')
and ISNULL(one.Fruit,'') != ISNULL(three.Fruit,'*')
and ISNULL(one.Fruit,'') != ISNULL(four.Fruit,'*')
and ISNULL(one.Fruit,'') != ISNULL(five.Fruit,'*')
and ISNULL(two.Fruit,'') != ISNULL(three.Fruit,'*')
and ISNULL(two.Fruit,'') != ISNULL(four.Fruit,'*')
and ISNULL(two.Fruit,'') != ISNULL(five.Fruit,'*')
and ISNULL(three.Fruit,'') != ISNULL(four.Fruit,'*')
and ISNULL(three.Fruit,'') != ISNULL(five.Fruit,'*')
and ISNULL(four.Fruit,'') != ISNULL(five.Fruit,'*')
AND one.GRP_ID = 1

It's basically just a table join making sure you don't get duplicates back. Ugly, but it worked with the sample data.

Chad
Post #950939
Posted Monday, July 12, 2010 12:58 PM
Valued Member

Valued MemberValued MemberValued MemberValued MemberValued MemberValued MemberValued MemberValued Member

Group: General Forum Members
Last Login: Tuesday, July 15, 2014 8:11 AM
Points: 55, Visits: 406
I was wondering if you had control over the schema and could change it to have all the data broken out into different tables?

Doesn't every DBA/D want control over that? LOL And no, I do not.
Ugly, but it worked with the sample data.

Yeah, but do you think it is possible to do this more "Swan" and less "Duckling?" Is there a better way to start? (other than changing the schema, of course)
What is the performance hit of joining the same table 5 times going to be when I try to do 100,000 of these at once?
Thank you for your time and help.
Post #950988
Posted Monday, July 12, 2010 2:18 PM


SSCrazy

SSCrazySSCrazySSCrazySSCrazySSCrazySSCrazySSCrazySSCrazy

Group: General Forum Members
Last Login: 2 days ago @ 2:51 PM
Points: 2,607, Visits: 17,910
Well... not any answers, but some more questions and thoughts that might spark a new idea:

When you query out for these, you are doing it based on the TBLID, right? You aren't looking for everything from everywhere each time? With that in mind, you could target those rows with a where clause in the CTE and that would make a huge difference with that column indexed. I wouldn't think that would be too bad at all.

What do the insert/updates against this table look like - are they frequent and constant? I hate recommending a trigger because of the overhead they represent, but if inserts, updates and deletes are infrequent, you could use a trigger to mirror the structure in 5 seperate tables and with TBLID indexed the performance shouldn't be too bad. Or, you could just prebuild all the information on the fly from the trigger in the first place.

I know you don't have control over the schema, but if you could take some metrics to someone (I'm just making this up, but "If we change the schema we'll go from 5 minute queries to sub-second queries" or whatever it turns out to be), would they consider making some changes? If you use some sample data and an environment similar to what you have in production, can you see what difference a new schema might make?

Do you have enough control over the schema to change the content of the string? I'm not the best at XML, but if the string was in an XML format, there might (I don't know) be some new opportunities that don't involve all the joins and loops and just leverage the XML.

What is the consuming target for the data? Is it a report, or is this an input to another algorithm? SQL doesn't have a full complement of string tools, would it be faster to offload some of the work to the next layer?

Do you need real-time access to the data? If you prebuilt a report table on a semi-regular basis so that all the crunching was done once and all the reporting/consuming was done off a slightly out-of-date copy, would it be satisfactory?

And I guess the other question is do you have an environment to test a real-sized load to see what it would do?

Thanks,
Chad

Post #951025
« Prev Topic | Next Topic »

Add to briefcase 12»»

Permissions Expand / Collapse