SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 
        
Home       Members    Calendar    Who's On


«««23456»»»

Performance issue with tally solution Expand / Collapse
Author
Message
Posted Monday, April 13, 2009 3:23 PM


SSC Eights!

SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!

Group: General Forum Members
Last Login: Thursday, March 11, 2010 3:29 PM
Points: 859, Visits: 845
I remember doing a bunch of performance testing a couple of years ago when we were converting code from SQL Server 6.5 to SQL Server 7. We found then that in some cases, cursors (or while loops) were faster. So it's interesting to see that can still be the case.

Like everything, there's never one simple answer.
Post #696162
Posted Monday, April 13, 2009 3:26 PM


SSCrazy Eights

SSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy Eights

Group: General Forum Members
Last Login: Wednesday, March 17, 2010 11:21 AM
Points: 9,104, Visits: 8,528
Here is the tally-function with the same bug as the other routines:
create function [dbo].[fnSplit4]( 
@parameter varchar(Max) -- the string to split
, @Separator Varchar(64) -- the string to use as a separator
)
RETURNS @Items TABLE(
ID INT -- the element number
, item VARCHAR(8000) -- the split-out string element
, OffSet int -- the original offest
--( not entirley accurate if LEN(@Seperator) > 1 because of the Replace() )
)
AS
BEGIN
/*
"Monster" Split in SQL Server 2005

From Jeff Moden, 2008/05/22

BYoung, 2008/06/18: Modified to be a Table-Valued Function
And to handle CL/LF or LF-only line breaks

Test: (scripts all triggers in your database)
Select Lines.Item
From sys.sql_modules M
Join sys.objects O on O.object_id = M.object_id
cross apply dbo.fnSplit1(M.definition, char(13)+char(10)) Lines
Where O.Type = 'TR'
Order by O.create_date, Lines.ID
*/
Declare @Sep char(1)
--our seperator character (convenient, doesn't affect performance)
Set @Sep = char(10)
--NOTE: we make the @Sep character LF so that we will automatically
-- parse out rogue LF-only line breaks.

;WITH cteTally AS
(--==== Create a Tally CTE from 1 to whatever the length
-- of the parameter is
SELECT TOP (LEN(@Parameter))
ROW_NUMBER() OVER (ORDER BY t1.ID) AS N
FROM Master.sys.sysColumns t1
CROSS JOIN Master.sys.sysColumns t2
)
INSERT into @Items
SELECT ROW_NUMBER() OVER (ORDER BY N) AS Number,
SUBSTRING(@Parameter, N+1, CHARINDEX(@Sep, @Parameter, N+1)-N-1) AS Value
, N+1
FROM cteTally
WHERE N < LEN(@Parameter)
AND SUBSTRING(@Parameter, N, 1) = @Sep
--Notice how we find the separator

Return
END

And here is the correct version:
create function [dbo].[fnSplit3]( 
@parameter varchar(Max) -- the string to split
, @Separator Varchar(64) -- the string to use as a separator
)
RETURNS @Items TABLE(
ID INT -- the element number
, item VARCHAR(8000) -- the split-out string element
, OffSet int -- the original offest
--( not entirley accurate if LEN(@Seperator) > 1 because of the Replace() )
)
AS
BEGIN
/*
"Monster" Split in SQL Server 2005

From Jeff Moden, 2008/05/22

BYoung, 2008/06/18: Modified to be a Table-Valued Function
And to handle CL/LF or LF-only line breaks

Test: (scripts all triggers in your database)
Select Lines.Item
From sys.sql_modules M
Join sys.objects O on O.object_id = M.object_id
cross apply dbo.fnSplit1(M.definition, char(13)+char(10)) Lines
Where O.Type = 'TR'
Order by O.create_date, Lines.ID
*/
Declare @Sep char(1)
--our seperator character (convenient, doesn't affect performance)
Set @Sep = char(10)
--NOTE: we make the @Sep character LF so that we will automatically
-- parse out rogue LF-only line breaks.

--===== Add start and end seprators to the Parameter so we can handle
-- all the elements the same way
-- Also change the seperator expressions to our seperator
-- character to keep all offsets = 1
SET @Parameter = @Sep+ Replace(@Parameter,@Seperator,@Sep) +@Sep
-- This reduces run-time about 10%

;WITH cteTally AS
(--==== Create a Tally CTE from 1 to whatever the length
-- of the parameter is
SELECT TOP (LEN(@Parameter))
ROW_NUMBER() OVER (ORDER BY t1.ID) AS N
FROM Master.sys.sysColumns t1
CROSS JOIN Master.sys.sysColumns t2
--Note: using the sysColumns trick is faster than a permanent Tally table
-- for this case anyway
)
INSERT into @Items
SELECT ROW_NUMBER() OVER (ORDER BY N) AS Number,
SUBSTRING(@Parameter, N+1, CHARINDEX(@Sep, @Parameter, N+1)-N-1) AS Value
, N+1
FROM cteTally
WHERE N < LEN(@Parameter)
AND SUBSTRING(@Parameter, N, 1) = @Sep
--Notice how we find the separator

Return
END



-- RBarryYoung, (302)375-0451 blog: MovingSQL.com, Twitter: @RBarryYoung
Proactive Performance Solutions, Inc.
"Performance is our middle name."
Post #696166
Posted Monday, April 13, 2009 3:28 PM


SSCrazy Eights

SSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy Eights

Group: General Forum Members
Last Login: Wednesday, March 17, 2010 11:21 AM
Points: 9,104, Visits: 8,528
And finally, here is the inclusion code to inlcude it with the other tests:
--- ==========================================================
--- -> Tally/fUNCTION solution
PRINT 'Start Tally-function 3 solution (correct line-splitting)'
SELECT @now = GETDATE()

--- Split text into lines
INSERT INTO @result
SELECT l.item
FROM @source s
CROSS APPLY dbo.fnSplit3(s.definition, char(13)+char(10)) l

--- Results
SELECT @duration = DATEDIFF(MILLISECOND, @now, GETDATE())
SELECT @count = COUNT(*) FROM @result
PRINT 'Milliseconds: ' + CONVERT(VARCHAR(10), @duration) + ' | Lines: ' + CONVERT(VARCHAR(10), @count)
--- <- Tally solution
--- ==========================================================

DELETE FROM @result --- Clean up

--- ==========================================================
--- -> Tally/fUNCTION solution
PRINT 'Start Tally-function 4 solution (has same bug as the rest)'
SELECT @now = GETDATE()

--- Split text into lines
INSERT INTO @result
SELECT l.item
FROM @source s
CROSS APPLY dbo.fnSplit4(s.definition, char(13)+char(10)) l

--- Results
SELECT @duration = DATEDIFF(MILLISECOND, @now, GETDATE())
SELECT @count = COUNT(*) FROM @result
PRINT 'Milliseconds: ' + CONVERT(VARCHAR(10), @duration) + ' | Lines: ' + CONVERT(VARCHAR(10), @count)
--- <- Tally solution
--- ==========================================================

DELETE FROM @result --- Clean up



-- RBarryYoung, (302)375-0451 blog: MovingSQL.com, Twitter: @RBarryYoung
Proactive Performance Solutions, Inc.
"Performance is our middle name."
Post #696167
Posted Monday, April 13, 2009 3:32 PM


SSCrazy

SSCrazySSCrazySSCrazySSCrazySSCrazySSCrazySSCrazySSCrazy

Group: General Forum Members
Last Login: Thursday, March 18, 2010 2:19 AM
Points: 2,022, Visits: 3,594
Good morning Paul! (every time I see your first posts I know it's time for bed )

Paul White (4/13/2009)
I'm still wondering why the CLR TVF is so much slower than the XML.


Florian Reischl
I don't really understand why the way over XML is faster than a direct TVF. The only idea I have is that SQL Server calls the FillRowMethod over reflection, what would be really ugly implementation! I usually handle those kinds of dynamic method calls over dynamic classes and delegates. It's a little more source code but the performance improvement is up to factor 1000.


I'm quiet sure that they use reflection because you have to define the name of the row method as string instead of a delegate. This is often a staring point for reflection solutions (MS also is not perfect). Some month ago I tested some reflection solutions against dynamic property descriptors and the performance difference was dramatic.


Note that the line counts are different simply because the TVF omits empty results:

string[] items = toSplit.Split(new string[] { delimeter }, StringSplitOptions.RemoveEmptyEntries);)


I noticed this . I used the StringSplitOptions.None because I want all rows, just splited.

Greets
Flo



The more I learn, the more I know what I do not know

How to Post Data/Code to get the best Help"Numbers" or "Tally" Table
How to Post Performance Problems
Post #696173
Posted Monday, April 13, 2009 3:43 PM


SSCrazy

SSCrazySSCrazySSCrazySSCrazySSCrazySSCrazySSCrazySSCrazy

Group: General Forum Members
Last Login: Thursday, March 18, 2010 2:19 AM
Points: 2,022, Visits: 3,594
RBarryYoung (4/13/2009)
And finally, here is the inclusion code to inlcude it with the other tests:
--- ==========================================================
--- -> Tally/fUNCTION solution
PRINT 'Start Tally-function 3 solution (correct line-splitting)'
SELECT @now = GETDATE()

--- Split text into lines
INSERT INTO @result
SELECT l.item
FROM @source s
CROSS APPLY dbo.fnSplit3(s.definition, char(13)+char(10)) l

--- Results
SELECT @duration = DATEDIFF(MILLISECOND, @now, GETDATE())
SELECT @count = COUNT(*) FROM @result
PRINT 'Milliseconds: ' + CONVERT(VARCHAR(10), @duration) + ' | Lines: ' + CONVERT(VARCHAR(10), @count)
--- <- Tally solution
--- ==========================================================

DELETE FROM @result --- Clean up

--- ==========================================================
--- -> Tally/fUNCTION solution
PRINT 'Start Tally-function 4 solution (has same bug as the rest)'
SELECT @now = GETDATE()

--- Split text into lines
INSERT INTO @result
SELECT l.item
FROM @source s
CROSS APPLY dbo.fnSplit4(s.definition, char(13)+char(10)) l

--- Results
SELECT @duration = DATEDIFF(MILLISECOND, @now, GETDATE())
SELECT @count = COUNT(*) FROM @result
PRINT 'Milliseconds: ' + CONVERT(VARCHAR(10), @duration) + ' | Lines: ' + CONVERT(VARCHAR(10), @count)
--- <- Tally solution
--- ==========================================================

DELETE FROM @result --- Clean up



Hi Barry!

Thanks also for your multi-line tally functions! They are much faster than my tally solution.

I included all to my test setup and here the results:
Start tally solution
Milliseconds: 5453 | Lines: 28145
Start Tally-function 3 solution (correct line-splitting)
Milliseconds: 3376 | Lines: 28545
Start Tally-function 4 solution (has same bug as the rest)
Milliseconds: 3316 | Lines: 28145
Start cursor solution
Milliseconds: 1950 | Lines: 28145
Start clr xml solution
Milliseconds: 773 | Lines: 28545
Start clr tvf (Paul White) solution
Milliseconds: 1186 | Lines: 28545


Greets
Flo



The more I learn, the more I know what I do not know

How to Post Data/Code to get the best Help"Numbers" or "Tally" Table
How to Post Performance Problems
Post #696180
Posted Monday, April 13, 2009 3:47 PM


SSChampion

SSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampion

Group: General Forum Members
Last Login: Today @ 10:34 AM
Points: 11,937, Visits: 12,050
I think the reason the Tally version is slower has to do with the blob data type, nvarchar(max).

I posted some information in The Thread.



Lynn Pettis

For better assistance in answering your questions, click here
For tips to get better help with Performance Problems, click here
For Running Totals and its variations, click here or when working with partitioned tables
For more about Tally Tables, click here
For more about Cross Tabs and Pivots, click here and here
Managing Transaction Logs

SQL Musings from the Desert Fountain Valley SQL (My Mirror Blog)
Post #696182
Posted Monday, April 13, 2009 3:51 PM


SSCrazy

SSCrazySSCrazySSCrazySSCrazySSCrazySSCrazySSCrazySSCrazy

Group: General Forum Members
Last Login: Thursday, March 18, 2010 2:19 AM
Points: 2,022, Visits: 3,594
Lynn Pettis (4/13/2009)
I think the reason the Tally version is slower has to do with the blob data type, nvarchar(max).

I posted some information in The Thread.


Hi Lynn

Thanks for your feedback! I currently have to handle the NVARCHAR(MAX)...

By the way: Which "The Thread" do you mean? I thought there have been two.

Greets
Flo



The more I learn, the more I know what I do not know

How to Post Data/Code to get the best Help"Numbers" or "Tally" Table
How to Post Performance Problems
Post #696184
Posted Monday, April 13, 2009 4:00 PM


SSChampion

SSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampion

Group: General Forum Members
Last Login: Today @ 10:34 AM
Points: 11,937, Visits: 12,050
Flo,

I understand needing to work with NVARCHAR(MAX). I was trying to find out why the Tally solution may have been slower, and I think the reason is due to the blob data type. I don't think there is anything inherently wrong with the Tally method when used with the "fixed length" data type, including the VARCHAR(8000) and NVARCHAR(4000).

This is where testing a variety of solutions is always a good thing to do.




Lynn Pettis

For better assistance in answering your questions, click here
For tips to get better help with Performance Problems, click here
For Running Totals and its variations, click here or when working with partitioned tables
For more about Tally Tables, click here
For more about Cross Tabs and Pivots, click here and here
Managing Transaction Logs

SQL Musings from the Desert Fountain Valley SQL (My Mirror Blog)
Post #696189
Posted Monday, April 13, 2009 4:02 PM


SSCrazy

SSCrazySSCrazySSCrazySSCrazySSCrazySSCrazySSCrazySSCrazy

Group: General Forum Members
Last Login: Thursday, March 18, 2010 2:19 AM
Points: 2,022, Visits: 3,594
Lynn Pettis (4/13/2009)
I think the reason the Tally version is slower has to do with the blob data type, nvarchar(max).

I posted some information in The Thread.


Hi again

I just tested with VARCHAR(8000) now I get a exception in my first tally solution but this doesn't matter for the moment. The other tally solutions provided by Barry (which have been much faster than mine) are still slower than the cursor.

The changed DDL and the changed test data:
DECLARE @tally TABLE (N INT NOT NULL, PRIMARY KEY CLUSTERED (N))
DECLARE @source TABLE (name NVARCHAR(128), definition VARCHAR(8000))
DECLARE @result TABLE (line nvarchar(max))

SELECT @crlf = CHAR(13) + CHAR(10)

-- //////////////////////////////////////////////////////////
-- -> Test data and tally table

-- Get some system procedures to split into lines
INSERT INTO @source
-- SELECT TOP(200) o.name, @crlf + m.definition + @crlf
SELECT TOP(200) o.name, @crlf + LEFT(m.definition, 7996) + @crlf
FROM master.sys.all_objects o
JOIN master.sys.all_sql_modules m ON o.object_id = m.object_id
WHERE type = 'P'

The results (sure everything is faster because of the less data):
Start tally solution
Msg 537, Level 16, State 2, Line 37
Invalid length parameter passed to the LEFT or SUBSTRING function.
The statement has been terminated.
Milliseconds: 13 | Lines: 0
Start Tally-function 3 solution (correct line-splitting)
Milliseconds: 2453 | Lines: 21115
Start Tally-function 4 solution (has same bug as the rest)
Milliseconds: 2380 | Lines: 20715
Start cursor solution
Milliseconds: 1466 | Lines: 20715
Start clr xml solution
Milliseconds: 590 | Lines: 21115
Start clr tvf (Paul White) solution
Milliseconds: 880 | Lines: 21115


Greets
Flo



The more I learn, the more I know what I do not know

How to Post Data/Code to get the best Help"Numbers" or "Tally" Table
How to Post Performance Problems
Post #696192
Posted Monday, April 13, 2009 4:02 PM


SSCrazy Eights

SSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy Eights

Group: General Forum Members
Last Login: Wednesday, March 17, 2010 11:21 AM
Points: 9,104, Visits: 8,528
Paul White (4/13/2009)
RBarryYoung (4/13/2009)
1) For splitting the lines in stored procedure definitions, there is a subtle bug that exists in almost every line-splitting routine, including the ones so far in this thread: there is actually two different kinds of line-breaks incorporated in the system stored procedures: CR-LF, and LF alone. To correctly split them, you must take both into account (which is not trivial).

It's pretty trivial in the CLR case

Code-wise, yes. But I suspect that it will add 10-20% to the run-time. If you could test it for me I'd appreciate it as I probably will not get to loading the CLR code for a while yet.

RBarryYoung (4/13/2009)

3) The final observation that I had at that time was that the loop-based routines start to catch up with the tally-based methods if the average distance between each separator gets large enough, and for line-splitting system procedures, they tend to be quite large of course (as opposed to comma-separated strings, for instance). This is because the loop-based methods can skip the ahead of "dead" characters in-between with CHARINDEX(). Because Florian figured out how to call CHARINDEX only once per loop, that lowered where that threshold is.

*cough*

Paul White

The tally or numbers table solution is, to an extent, a brute-force approach. In the case of many long strings where the frequency of the searched-for string is low, a solution based on a loop/CHARINDEX approach should out-perform the tally approach.

Yes, I was pointing out that our previous analysis had indicated the same thing.

RBarryYoung (4/13/2009)

So anyway, I wanted to test my line splitting functions against these also, so here are the results:


Flo

Start clr xml solution
Milliseconds: 790 | Lines: 28545
Start clr tvf (PW) solution
Milliseconds: 1083 | Lines: 25661

Still slower than both CLR implementations

I wasn't disagreeing with that either, I just don't have the CLR stuff up yet.

And these CLR results are not a big surprise either because about 4-8 weeks ago I wrote a CLR function for a Search & Replace problem that was blowing the loop-based and set-based code out by about 4-5x. Since then I've been speculating on a hypothesis that anytime that loop-based SQL would come close to set-based performance, that CLR will blow them both out anyway. (That was on the Forums here, so I will see if I can find the thread...)

Now, I personally have no problem with CLR except that up until about two months ago it has been really hard to find things that it didn't stink at. My assertion has always been that anytime the CPU/record was above a certain point, it should be really good. but I'm kicking myself for not realizing that this was one of those cases.

Note that the line counts are different simply because the TVF omits empty results:

string[] items = toSplit.Split(new string[] { delimeter }, StringSplitOptions.RemoveEmptyEntries);)

It would be really nice if we could get a version with this fixed so that we can be sure of comparing apples to apples.


-- RBarryYoung, (302)375-0451 blog: MovingSQL.com, Twitter: @RBarryYoung
Proactive Performance Solutions, Inc.
"Performance is our middle name."
Post #696193
« Prev Topic | Next Topic »

«««23456»»»

Permissions Expand / Collapse