RE: Split list of string into all possible combinations

SSC Guru

Points: 61136

November 30, 2016 at 9:54 am

#1915226

N_Muller (11/30/2016)
Alan.B (11/29/2016)
My solution can be totally optimized and simplified but here's a solution that uses DelimitedSplit8K (see the link in my signature):
You'll notice three UNION ALL's. The first query after the CTE gets all combinations when there's 3 values. The second gets me the un-split three-value values. The third one gets me the un-split two-value values and the final query is the single values split out from all the delimited and un-delimited strings.
edit: added some explanation.
I actually ended up finding a different solution. Not sure it's faster, but it works nicely. First, I added an identity column to my table variable, but that's not required. The query is as follows:
declare @table table ( IndexID int identity(1,1), Col varchar(100) )
insert @table ( Col )
values ( '173' ), ( '97' ), ( '54,115' ), ( '1,32,173' ), ( '21,32' ), ( '5,32' ), ( '32,173' )
-- solution
;with Z as
(
selectRowID = Y.IndexID,
ColID = V.ID,
Col = V.ListValues
from@table Y cross apply
FN_ListToTable_8K ( Y.Col, ',' ) V
), W as
(
selectRowID,
NoCols = max(ColID)
fromZ
group by RowID
), R as
(
selectID = 1,
RowID,
Col,
LastCol = convert(int,Col)
fromZ
union all
selectID = R.ID + 1,
RowID = R.RowID,
Col = R.Col + ',' + Z.Col,
LastCol = convert(int,Z.Col)
fromW inner join
R on
R.RowID = W.RowID and
R.ID < W.NoCols inner join
Z on
Z.RowID = R.RowID and
convert(int,Z.Col) > R.LastCol
)
select distinct Col from R order by Col
-- The first CTE transforms the strings into rows while keeping the reference to the original RowID (IndexID in this case).
-- The second CTE gets the number of elements in each original RowID.
-- The third is a recursive CTE where I keep track of the last element added to the comma delimited list (largest value) and loops for the largest number of elements in each original RowID.

First, I noticed an issue with my solution where I was getting duplicates, here's an updated version:

declare @table table (IndexID int identity(1,1), Col varchar(100));

insert @table (Col)

values ('173'), ('97'), ('54,115' ), ('1,32,173'), ('21,32'), ('5,32'), ('32,173');

WITH trips AS

(

SELECT col, item=item+0

FROM @table

CROSS APPLY dbo.DelimitedSplit8K(col,',') s1

WHERE LEN(col)-2 = LEN(REPLACE(col,',',''))

)

-- All combinations when there's 3

SELECT item = CAST(a.item AS varchar(10))+ ',' +CAST(b.item AS varchar(10))

FROM trips a, trips b

WHERE a.col = b.col AND a.item < b.item

UNION ALL

SELECT col

FROM @table

WHERE LEN(col)-2 = LEN(REPLACE(col,',',''))

UNION ALL

SELECT col

FROM @table

WHERE LEN(col)-1 = LEN(REPLACE(col,',',''))

UNION

SELECT item

FROM @table

CROSS APPLY dbo.DelimitedSplit8K(col,',')

ORDER BY item; -- not required.

Next, if FN_ListToTable_8K is not an Inline Table Valued function then you need to change that. A multi-line Table Valued splitter function will KILL your performance. The solution you are using includes a recursive CTE - those can be costly too. Here's a performance test to compare techniques:

SET NOCOUNT ON;

-- Sample data:

declare @table table (IndexID int identity(1,1), Col varchar(100));

WITH parts(n,p) AS

(

SELECT TOP (10000)

'1'+REPLICATE('0',(abs(checksum(newid())%3)+1)), abs(checksum(newid())%3)+1

FROM sys.all_columns

)

INSERT @table

SELECT

CASE p

WHEN 1 THEN cast((abs(checksum(newid())%n)+1) AS varchar(4))

WHEN 2 THEN cast((abs(checksum(newid())%n)+1) AS varchar(4))+','+

cast((abs(checksum(newid())%n)+1) AS varchar(4))

ELSE cast((abs(checksum(newid())%n)+1) AS varchar(4))+','+

cast((abs(checksum(newid())%n)+1) AS varchar(4))+','+

cast((abs(checksum(newid())%n)+1) AS varchar(4))

END

FROM parts;

--SELECT * FROM @table

DBCC TRACEON(2453); -- for better cardinality estimates on temp variables

SET STATISTICS IO ON;

SET STATISTICS TIME ON;

PRINT 'MY SOLUTION: '+char(13)+char(10)+REPLICATE('-',50);

WITH trips AS

(

SELECT col, item=item+0

FROM @table

CROSS APPLY dbo.DelimitedSplit8K(col,',') s1

WHERE LEN(col)-2 = LEN(REPLACE(col,',',''))

)

-- All combinations when there's 3

SELECT item = CAST(a.item AS varchar(10))+ ',' +CAST(b.item AS varchar(10))

FROM trips a, trips b

WHERE a.col = b.col AND a.item < b.item

UNION ALL

SELECT col

FROM @table

WHERE LEN(col)-2 = LEN(REPLACE(col,',',''))

UNION ALL

SELECT col

FROM @table

WHERE LEN(col)-1 = LEN(REPLACE(col,',',''))

UNION

SELECT item

FROM @table

CROSS APPLY dbo.DelimitedSplit8K(col,',')

--ORDER BY item

PRINT 'RCTE:'+char(13)+char(10)+REPLICATE('-',50);

;with Z as

(

selectRowID = Y.IndexID,

ColID = V.ItemNumber,

Col = V.Item

from@table Y cross APPLY dbo.DelimitedSplit8K ( Y.Col, ',' ) V

), W as

(

selectRowID,

NoCols = max(ColID)

fromZ

group by RowID

), R as

(

selectID = 1,

RowID,

Col,

LastCol = convert(int,Col)

fromZ

union all

selectID = R.ID + 1,

RowID = R.RowID,

Col = R.Col + ',' + Z.Col,

LastCol = convert(int,Z.Col)

fromW inner join

R on

R.RowID = W.RowID and

R.ID < W.NoCols inner join

Z on

Z.RowID = R.RowID and

convert(int,Z.Col) > R.LastCol

)

SELECT DISTINCT Col

FROM R

--ORDER BY col;

SET STATISTICS TIME OFF;

SET STATISTICS IO OFF;

DBCC TRACEOFF(2453);

Perf Test results:

MY SOLUTION:

--------------------------------------------------

SQL Server Execution Times:

CPU time = 0 ms, elapsed time = 0 ms.

SQL Server parse and compile time:

CPU time = 31 ms, elapsed time = 32 ms.

Table '#A73A2F66'. Scan count 25, logical reads 120, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'Worktable'. Scan count 3458, logical reads 98032, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:

CPU time = 547 ms, elapsed time = 427 ms.

RCTE:

--------------------------------------------------

SQL Server Execution Times:

CPU time = 0 ms, elapsed time = 0 ms.

SQL Server parse and compile time:

CPU time = 16 ms, elapsed time = 21 ms.

Table 'Worktable'. Scan count 58811, logical reads 392249, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table '#A73A2F66'. Scan count 3, logical reads 72, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:

CPU time = 1029 ms, elapsed time = 1193 ms.

DBCC execution completed. If DBCC printed error messages, contact your system administrator.

You can see - even with a good splitter the recursive CTE solution is less than half as fast and produces 4X the number of reads.

"I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

-- Itzik Ben-Gan 2001