Split column value to rows.

Question

Split column value to rows.

Viewing 12 posts - 16 through 26 (of 26 total)

You must be logged in to reply to this topic. Login to reply

Jacob Wilkins One Orange Chip Points: 27976 More actions · Answer 1

Eirikur Eiriksson - Friday, June 2, 2017 8:38 AM
Jacob Wilkins - Friday, June 2, 2017 8:18 AM
Eirikur Eiriksson - Friday, June 2, 2017 2:22 AM
Be careful with that order by clause, it will significantly slow down the query!
😎
Indeed.
I kept it included because very strictly speaking there is no guarantee of correctness of the query without it.
To guarantee the query is correct, you have to know that the N returned for a string of length 3 will be 1,2, and 3.
Anything else would yield incorrect results.
As you did, I noticed that I seemed to get correct results every time without it, but without the ORDER BY, there's just no guarantee you'll get the right N from the APPLY.
Perhaps the engine will actually return them in the desired order under every possible circumstance, but I'm a paranoid DBA,so I couldn't bring myself to make that assumption 🙂
From a set based perspective, the order of appearance is irrelevant, the result set will be correct with or without the order by clause as the N will always correspond correctly to the position of the character returned. In fact, when using MAXDOP 1, the output is always appearing in the correct order although as you said, there is no guarantee that will always be so.
😎
A greater culprit is not using the TOP operator, especially when handling greater variety of input length, the constant scan operator will feed the join operator maximum number of rows for every iteration.

It's not the presentation order I'm talking about; it's actually the correctness of the query.

We're returning some values for N from the APPLY, and using them in the SUBSTRING function.

Let's say we're dealing with a category with LEN of 3.

It matters very much whether the three N returned are {1,2,3} or {2,4,5} or something else.

If the three N returned are anything but {1,2,3}, the results returned for that ID and category will not be correct.

To actually guarantee that you return only N from 1 to LEN(category), and subsequently get each character of the category from the SUBSTRING, you need the ORDER BY.

Cheers!

Ed Wagner SSC Guru Points: 287026 More actions · Answer 2

Eirikur Eiriksson - Friday, June 2, 2017 3:23 AM
Using the LEN function in the WHERE clause is a killer here, slows the query down several times, much more efficient to use the TOP operator within the numbers CTE. The reason for this is that when using the former, the server will generate the full output of the CTE and then filter is applied afterwards whilst the TOP operator will limit the number generated.
😎

Eirikur, you sure hit the nail on the head there. After tearing into it a little more, the LEN in the WHERE clause just kills it.

It looks like using the ITVF instead of the table-value constructor in the CROSS APPLY makes a real difference. Here's another variation to run through your mill. My timing is meaningless because we have completely different hardware.

The plan looks horrible because of the function (it always looks that way), but it performs well.

INSERT INTO @times(task,d) VALUES ('TallyN #3', GETDATE());

SELECT @id = ID, @n = t.N, @char = SUBSTRING(d.Category, t.N, 1) FROM #test d CROSS APPLY (SELECT TOP (LEN(d.Category)) N FROM util.dbo.TallyN(5)) t (N); INSERT INTO @times(task,d) VALUES ('TallyN #3', GETDATE());

Tally Tables - Performance Personified
String Splitting with True Performance
Best practices on how to ask questions

Jacob Wilkins One Orange Chip Points: 27976 More actions · Answer 3

Ok, so I finally got a chance to look at this again.

There are a couple ways of modifying the strictly-speaking-unsafe queries to make them safe that don't result in too much of a slowdown.

One is by using ROW_NUMBER() to create the values in the APPLY, and the other is just to add a WHERE N<=LEN(Category) to query in the APPLY.

The latter is slightly but consistently the faster of the two on my machine.

One interesting thing of note is that using OUTER APPLY instead of CROSS APPLY is consistently good for an improvement of about 300ms or so across these similar solutions.

Here are those modified queries, along with a modified TallyN #3 (modified to get the modest speed bump from OUTER APPLY, and a MAXDOP 1 added, since it also likes going parallel), in a form to be used with Eirikur's test harness:

INSERT INTO @timer(T_TXT) VALUES ('Jacob #3');

SELECT @INT_BUCKET = SSC.SSC_ID ,@INT_BUCKET = NM.N ,@CHAR_BUCKET = SUBSTRING(SSC.CATEGORY,NM.N,1) FROM dbo.TBL_SAMPLE_SPLIT_CHARS SSC OUTER APPLY (SELECT TOP(LEN(SSC.CATEGORY)) N=ROW_NUMBER() OVER (ORDER BY (SELECT 1)) FROM (VALUES (0),(0),(0),(0),(0)) X(N)) AS NM OPTION (MAXDOP 1);

INSERT INTO @timer(T_TXT) VALUES ('Jacob #3');

INSERT INTO @timer(T_TXT) VALUES ('Jacob #4');

SELECT @INT_BUCKET = SSC.SSC_ID, @INT_BUCKET = t.n, @CHAR_BUCKET = SUBSTRING(SSC.CATEGORY, t.N, 1) FROM dbo.TBL_SAMPLE_SPLIT_CHARS SSC OUTER APPLY ( SELECT TOP (LEN(category)) n FROM (VALUES(1),(2),(3),(4),(5))n(n) WHERE n<=LEN(category) )t(n) OPTION (MAXDOP 1);

INSERT INTO @timer(T_TXT) VALUES ('Jacob #4');

INSERT INTO @timer(T_TXT) VALUES ('TallyN #3');

SELECT @INT_BUCKET = SSC.SSC_ID, @INT_BUCKET = t.n, @CHAR_BUCKET = SUBSTRING(SSC.CATEGORY, t.N, 1) FROM dbo.TBL_SAMPLE_SPLIT_CHARS SSC OUTER APPLY (SELECT TOP (LEN(SSC.Category)) N FROM TallyN(5)) t (N) OPTION (MAXDOP 1);

INSERT INTO @timer(T_TXT) VALUES ('TallyN #3');

And some representative results on my system, with Jacob #1 and Jacob #2 removed (#1 was much slower than these, and the #2 Eirikur provided wasn't all that different from Eirikur's, so removing them helped get in more test runs):

T_TXT DURATION -------------------- ----------- DRY RUN 526000 EE OUTER APPLY 2550000 EE CROSS APPLY 2853000 Jacob #4 3273000 Jacob #3 3650000 TallyN #3 8630000:

Ed Wagner SSC Guru Points: 287026 More actions · Answer 4

I don't know I'm doing differently than you guys, but my times are a whole lot different.


Run              Time
---------------  -----
EE Outer Apply   1026
Jacob #4         1383
TallyN #3        2356
Jacob #1         4756

And yes, I've already checked my row counts to make sure.

Tally Tables - Performance Personified
String Splitting with True Performance
Best practices on how to ask questions

Eirikur Eiriksson SSC Guru Points: 182882 More actions · Answer 5

Ed Wagner - Friday, June 2, 2017 7:00 PM
I don't know I'm doing differently than you guys, but my times are a whole lot different.
Run Time --------------- ----- EE Outer Apply 1026 Jacob #4 1383 TallyN #3 2356 Jacob #1 4756
And yes, I've already checked my row counts to make sure.

Ed, the difference is that you are using milliseconds in the final timer result query whereas we are using microseconds😉
😎

Eirikur Eiriksson SSC Guru Points: 182882 More actions · Answer 6

Ed Wagner - Thursday, June 1, 2017 7:28 PM
I see I forgot to define dbo.TallyN. It's Itzik's zero-read tally ITVF. I use it so much I sometimes think it's a part of SQL Server. It isn't, but it should be.
ALTER FUNCTION dbo.TallyN(@N Bigint) RETURNS TABLE WITH SCHEMABINDING AS --Credit: This function was written by Itzik Ben-Gan at http://sqlmag.com/sql-server/virtual-auxiliary-table-numbers RETURN WITH level0 AS ( SELECT 0 AS g UNION ALL SELECT 0), --2 level1 AS (SELECT 0 AS g FROM level0 AS a CROSS JOIN level0 AS b), --2^2 = 4 level2 AS (SELECT 0 AS g FROM level1 AS a CROSS JOIN level1 AS b), --4^2 = 16 level3 AS (SELECT 0 AS g FROM level2 AS a CROSS JOIN level2 AS b), --16^2 = 256 level4 AS (SELECT 0 AS g FROM level3 AS a CROSS JOIN level3 AS b), --256^2 = 65536 level5 AS (SELECT 0 AS g FROM level4 AS a CROSS JOIN level4 AS b), --65536^2 = 4294967296 Tally AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n FROM level5) SELECT TOP (@N) N FROM Tally ORDER BY N;

If I'm working with varchar(8000) or less then I normally use a simpler version which will output the maximum of 10K rows. It tends to run quite a lot faster than the full blown version.
😎

CREATE OR ALTER FUNCTION dbo.NUMGEN10K ( @MAX_N INT ) RETURNS TABLE WITH SCHEMABINDING AS RETURN WITH T(N) AS (SELECT N FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) AS X(N)) SELECT TOP(@MAX_N) ROW_NUMBER() OVER (ORDER BY @@VERSION) AS N FROM T T1,T T2,T T3,T T4 GO

Ed Wagner SSC Guru Points: 287026 More actions · Answer 7

Eirikur Eiriksson - Saturday, June 3, 2017 1:57 AM
Ed Wagner - Friday, June 2, 2017 7:00 PM
I don't know I'm doing differently than you guys, but my times are a whole lot different.
Run Time --------------- ----- EE Outer Apply 1026 Jacob #4 1383 TallyN #3 2356 Jacob #1 4756
And yes, I've already checked my row counts to make sure.
Ed, the difference is that you are using milliseconds in the final timer result query whereas we are using microseconds😉
😎

Yeah, but it's more than that. Even after dividing your times, mine are at least half of yours. Sometimes they're really different and the rank is different. Your EE OUTER APPLY wins in either case, but the times are proportionally different from each other. I've learned something from the whole discussion, so I think that having fun has been worth it - the differences are what bugs me. You know, I'm content to just let it go, but I will play with it a bit more on my machine. 😉

Tally Tables - Performance Personified
String Splitting with True Performance
Best practices on how to ask questions

Jacob Wilkins One Orange Chip Points: 27976 More actions · Answer 8

You have me curious, Ed. 🙂

Having said that, it looks like the main difference is the ratio of Jacob #1 to the others that is different in your results. Everything else looks approximately right (full disclaimer, I'm eyeballing that).

Mind sharing the full test script you're using?

Cheers!

Ed Wagner SSC Guru Points: 287026 More actions · Answer 9

Jacob Wilkins - Tuesday, June 6, 2017 9:18 AM
You have me curious, Ed. 🙂
Having said that, it looks like the main difference is the ratio of Jacob #1 to the others that is different in your results. Everything else looks approximately right (full disclaimer, I'm eyeballing that).
Mind sharing the full test script you're using?
Cheers!

Not at all. I don't have it here at work, but it should still be in an SSMS window at home. And yes, it's saved for future playing.

Tally Tables - Performance Personified
String Splitting with True Performance
Best practices on how to ask questions

sql_learner29 SSCrazy Eights Points: 9107 More actions · Answer 10

@Joe Torre
Thanks a lot Joe..

Your code works like a charm..
You are a genius..

Thanks again..

[font="Comic Sans MS"]
---------------------------------------------------

Thanks [/font]

Joe Torre SSChampion Points: 10248 More actions · Answer 11

Joe Torre

SSChampion

Points: 10248

June 8, 2017 at 12:19 pm

#1946046

Glad it helped

sgmunson SSC Guru Points: 110639 More actions · Answer 12

Just for gits and shiggles, I figured I'd try this on for size, and try another option based on Jeff Moden's DelimitedSplit8K function.

Here's the function with the delimiter removed and everything stripped out that didn't need to be there:
USE [LOCAL_DB] GO

IF OBJECT_ID(N'dbo.NonDelimitedSplit8K', N'IF') IS NOT NULL BEGIN DROP FUNCTION dbo.NonDelimitedSplit8K; END;

GO

/****** Object: UserDefinedFunction [dbo].[DelimitedSplit8K] Script Date: 6/21/2017 9:50:34 AM ******/ SET ANSI_NULLS ON GO

SET QUOTED_IDENTIFIER ON GO

CREATE FUNCTION dbo.NonDelimitedSplit8K ( @pString varchar(5) ) RETURNS TABLE WITH SCHEMABINDING AS RETURN WITH E1(N) AS ( --10E+1 or 10 rows SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 --UNION ALL SELECT 1 UNION ALL --SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 ), cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front -- for both a performance gain and prevention of accidental "overruns" SELECT TOP (ISNULL(DATALENGTH(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E1 ) --===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found. SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N), Item = SUBSTRING(@pString, l.N, 1) FROM cteTally l; GO

Now here's the test rig:
USE LOCAL_DB; GO

SET NOCOUNT ON; DBCC FREEPROCCACHE; /* IF OBJECT_ID(N'dbo.TEST', N'U') IS NOT NULL BEGIN DROP TABLE dbo.TEST; END; GO

CREATE TABLE dbo.TEST ( ID int NOT NULL CONSTRAINT PK_TEST_ID PRIMARY KEY CLUSTERED, CATEGORY varchar(10), CAT_LEN AS LEN(CATEGORY) ); */ DECLARE @SAMPLE_SIZE AS int = 2000000; DECLARE @MAX_CHARS AS int = 5; DECLARE @INT_BUCKET AS int = 0; DECLARE @CHAR_BUCKET AS varchar(10) = ''; DECLARE @BATCH_ID AS int = 1; DECLARE @timer AS TABLE ( T_TXT varchar(30) NOT NULL, T_TS datetime2(7) NOT NULL DEFAULT (SYSDATETIME()), T_BATCH int NOT NULL ); /* INSERT INTO TEST (ID, CATEGORY) VALUES (1,'5HW6'), (2,'256'), (3,'57'), (4,'3'), (5,'4578'), (6,'256HW');

SELECT * FROM dbo.TEST; */ /* WITH T(N) AS (

SELECT X.N FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) X(N) ), NUMS(N) AS (

SELECT TOP(@SAMPLE_SIZE) ROW_NUMBER() OVER (ORDER BY @@VERSION) AS N FROM T T1,T T2,T T3,T T4,T T5,T T6,T T7,T T8,T T9 ) INSERT INTO dbo.TEST (ID, CATEGORY) SELECT NM.N, SUBSTRING(REPLACE(CONVERT(varchar(36), NEWID(), 0), CHAR(45), CHAR(65)), 1, (ABS(CHECKSUM(NEWID())) % @MAX_CHARS) + 1) FROM NUMS AS NM; */

WHILE @BATCH_ID <= 10 BEGIN INSERT INTO @timer(T_TXT, T_BATCH) VALUES('NON-DELIMITED-SPLIT8K', @BATCH_ID); SELECT @INT_BUCKET = T.ID, @CHAR_BUCKET = T.CATEGORY, @CHAR_BUCKET = S.Item FROM dbo.TEST AS T OUTER APPLY dbo.NonDelimitedSplit8K(T.CATEGORY) AS S INSERT INTO @timer(T_TXT, T_BATCH) VALUES('NON-DELIMITED-SPLIT8K', @BATCH_ID);

INSERT INTO @timer(T_TXT, T_BATCH) VALUES('EE OUTER APPLY', @BATCH_ID); SELECT @INT_BUCKET = T.ID, @CHAR_BUCKET = T.CATEGORY, @CHAR_BUCKET = SUBSTRING(T.CATEGORY, S.N, 1) FROM dbo.TEST AS T OUTER APPLY (SELECT TOP (LEN(T.CATEGORY)) X.N FROM (VALUES (1),(2),(3),(4),(5)) AS X(N)) AS S INSERT INTO @timer(T_TXT, T_BATCH) VALUES('EE OUTER APPLY', @BATCH_ID);

INSERT INTO @timer(T_TXT, T_BATCH) VALUES('EE OUTER APPLY SGM', @BATCH_ID); SELECT @INT_BUCKET = T.ID, @CHAR_BUCKET = T.CATEGORY, @CHAR_BUCKET = SUBSTRING(T.CATEGORY, S.N, 1) FROM dbo.TEST AS T OUTER APPLY (SELECT TOP (T.CAT_LEN) X.N FROM (VALUES (1),(2),(3),(4),(5)) AS X(N)) AS S INSERT INTO @timer(T_TXT, T_BATCH) VALUES('EE OUTER APPLY SGM', @BATCH_ID);

SELECT @BATCH_ID = @BATCH_ID + 1;

END;

SELECT T.T_BATCH, T.T_TXT, DATEDIFF(MICROSECOND, MIN(T.T_TS), MAX(T.T_TS)) AS DURATION FROM @timer AS T GROUP BY T_BATCH, T.T_TXT ORDER BY T_BATCH, DURATION ASC;

--DROP TABLE TEST; --TRUNCATE TABLE dbo.TEST;

And finally, the results:

T_BATCH T_TXT DURATION 1 NON-DELIMITED-SPLIT8K 15600 1 EE OUTER APPLY 46800 1 EE OUTER APPLY SGM 46800 2 EE OUTER APPLY 46800 2 NON-DELIMITED-SPLIT8K 46800 2 EE OUTER APPLY SGM 62400 3 NON-DELIMITED-SPLIT8K 62400 3 EE OUTER APPLY 63200 3 EE OUTER APPLY SGM 100600 4 EE OUTER APPLY SGM 93600 4 NON-DELIMITED-SPLIT8K 93600 4 EE OUTER APPLY 109200 5 EE OUTER APPLY 109200 5 NON-DELIMITED-SPLIT8K 109200 5 EE OUTER APPLY SGM 124800 6 EE OUTER APPLY 124800 6 NON-DELIMITED-SPLIT8K 124800 6 EE OUTER APPLY SGM 140400 7 NON-DELIMITED-SPLIT8K 140400 7 EE OUTER APPLY SGM 156000 7 EE OUTER APPLY 156000 8 EE OUTER APPLY 171600 8 NON-DELIMITED-SPLIT8K 171600 8 EE OUTER APPLY SGM 187200 9 NON-DELIMITED-SPLIT8K 187200 9 EE OUTER APPLY 202800 9 EE OUTER APPLY SGM 249600 10 EE OUTER APPLY SGM 218400 10 EE OUTER APPLY 218400 10 NON-DELIMITED-SPLIT8K 218400

Just not sure why I get such large differences in the timings.

Steve (aka sgmunson) 🙂 🙂 🙂
Rent Servers for Income (picks and shovels strategy)