|
|
|
SSCrazy
      
Group: General Forum Members
Last Login: Sunday, November 04, 2012 12:23 PM
Points: 2,087,
Visits: 3,932
|
|
Hello everybody!
I have a performance issue with a set based solution. Please could you advice me what I'm doing wrong?
I try to split text into lines by CRLF and it seems that my tally solution is at least twice slower than a cursor based solution if the text length is more than some characters.
Here my test environment:
SET NOCOUNT ON
-- ////////////////////////////////////////////////////////// -- -> Temporary procedure to spilit into lines IF (OBJECT_ID('tempdb..#usp_print_lines') IS NOT NULL) DROP PROCEDURE #usp_print_lines GO CREATE PROCEDURE #usp_print_lines @text NVARCHAR(MAX) AS DECLARE @ret TABLE (id INT IDENTITY, line NVARCHAR(MAX))
DECLARE @pos INT, @next INT, @crlf CHAR(2) SELECT @pos = 1, @crlf = CHAR(13) + CHAR(10)
WHILE (1 = 1) BEGIN SELECT @next = CHARINDEX(@crlf, @text, @pos) IF (@next = 0) BREAK -- Nothing more to do IF (@pos != @next) INSERT INTO @ret SELECT SUBSTRING(@text, @pos, @next - @pos)
SELECT @pos = @next + 1 END
SELECT line FROM @ret ORDER BY id -- Return lines GO -- <- Temporary procedure to spilit into lines -- //////////////////////////////////////////////////////////
DECLARE @now DATETIME, @duration INT, @crlf CHAR(2), @count INT
DECLARE @tally TABLE (N INT NOT NULL, PRIMARY KEY CLUSTERED (N)) DECLARE @source TABLE (name NVARCHAR(128), definition NVARCHAR(MAX)) DECLARE @result TABLE (line nvarchar(max))
SELECT @crlf = CHAR(13) + CHAR(10)
-- ////////////////////////////////////////////////////////// -- -> Test data and tally table
-- Get some system procedures to split into lines INSERT INTO @source SELECT TOP(200) o.name, @crlf + m.definition + @crlf FROM master.sys.all_objects o JOIN master.sys.all_sql_modules m ON o.object_id = m.object_id WHERE type = 'P'
INSERT INTO @tally SELECT TOP(100000) ROW_NUMBER() OVER (ORDER BY c1.column_id) FROM master.sys.all_columns c1 CROSS JOIN master.sys.all_columns c2
-- <- Test data and tally table -- //////////////////////////////////////////////////////////
-- ////////////////////////////////////////////////////////// -- -> Tally solution PRINT 'Start tally solution' SELECT @now = GETDATE()
-- Split text into lines INSERT INTO @result SELECT l.line FROM @source s CROSS APPLY (SELECT TOP(1000) SUBSTRING(s.definition, N + 2, CHARINDEX(@crlf, s.definition, N + 1) - N - 2) line FROM @tally WHERE N < LEN(s.definition) - 1 AND SUBSTRING(s.definition, N, 2) = @crlf ORDER BY N ) l
-- Results SELECT @duration = DATEDIFF(MILLISECOND, @now, GETDATE()) SELECT @count = COUNT(*) FROM @result PRINT 'Milliseconds: ' + CONVERT(VARCHAR(10), @duration) + ' | Lines: ' + CONVERT(VARCHAR(10), @count) -- <- Tally solution -- //////////////////////////////////////////////////////////
DELETE FROM @result -- Clean up
-- ////////////////////////////////////////////////////////// -- -> Cursor solution PRINT 'Start cursor solution' SELECT @now = GETDATE();
-- Create another table and copy all data... DECLARE @procs table (name nvarchar(128), definition NVARCHAR(MAX))
INSERT INTO @procs SELECT name, definition FROM @source
DECLARE @name NVARCHAR(128) DECLARE @definition NVARCHAR(MAX)
-- Loop through all data and split into lines WHILE EXISTS (SELECT TOP(1) 1 FROM @procs) BEGIN SELECT TOP(1) @name = name, @definition = definition FROM @procs -- Split current text into lines INSERT INTO @result EXECUTE #usp_print_lines @definition DELETE FROM @procs WHERE name = @name END
-- Results SELECT @duration = DATEDIFF(MILLISECOND, @now, GETDATE()) SELECT @count = COUNT(*) FROM @result PRINT 'Milliseconds: ' + CONVERT(VARCHAR(10), @duration) + ' | Lines: ' + CONVERT(VARCHAR(10), @count) -- <- Cursor solution -- //////////////////////////////////////////////////////////
My results on two different environments
1st system
System: Microsoft SQL Server 2008 (SP1) - 10.0.2531.0 (X64) Mar 29 2009 10:11:52 Copyright (c) 1988-2008 Microsoft Corporation Developer Edition (64-bit) on Windows NT 6.0 (Build 6001: Service Pack 1)
Result: Start tally solution Milliseconds: 5140 | Lines: 27132 Start cursor solution Milliseconds: 1946 | Lines: 28145
2nd system
System: Microsoft SQL Server 2005 - 9.00.3042.00 (Intel X86) Feb 9 2007 22:47:07 Copyright (c) 1988-2005 Microsoft Corporation Developer Edition on Windows NT 5.2 (Build 3790: Service Pack 2)
Result: Start tally solution Milliseconds: 3483 | Lines: 25728 Start cursor solution Milliseconds: 1626 | Lines: 25728
I'm quiet sure that I'm doing something wrong, but I don't see what 
Thank you for all suggestions! Flo
PS: I try to answer as quick as possible but I currently have huge problems with my internet connection...
The more I learn, the more I know what I do not know Blog: Things about Software Architecture, .NET development and T-SQL
How to Post Data/Code to get the best Help How to Post Performance Problems
|
|
|
|
|
SSChampion
        
Group: General Forum Members
Last Login: Today @ 8:47 AM
Points: 11,617,
Visits: 27,678
|
|
All I did was confirm your code is essentially the same on my SQL2005/1Gig ram 2.8 machine.
My first guess was because you were using a @table variable for your Tally table, I've always heard that @table vars were good only when they had fewer rows. I ran your code, then substituted my permenant tally table...identical results. second pass saved a few milliseconds due to caching I assume.
/* Start tally solution Milliseconds: 3733 | Lines: 25987 Start cursor solution Milliseconds: 2140 | Lines: 26922 second pass: Start tally solution Milliseconds: 3703 | Lines: 25987 Start cursor solution Milliseconds: 1983 | Lines: 26922 */
I've just got to assume that although a Tally table is awesome, there will be situations where it does not perform as well as other methods.
Lowell
--There is no spoon, and there's no default ORDER BY in sql server either. Actually, Common Sense is so rare, it should be considered a Superpower. --my son
|
|
|
|
|
Hall of Fame
       
Group: General Forum Members
Last Login: Yesterday @ 2:02 PM
Points: 3,788,
Visits: 5,543
|
|
Flo: I'm not at my laptop right now, so I can't test your code. I'll do so at the earliest opportunity. Couple of questions though:
Why are you selecting the top (1000)? WHERE N < len(s.definition)-1 should handle that.
ORDER BY N? Does it make any difference to the output?
Why table variable @tally?
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller Stop, children, what's that sound? -- Stephen Stills
|
|
|
|
|
SSChampion
        
Group: General Forum Members
Last Login: Wednesday, May 08, 2013 10:33 AM
Points: 10,989,
Visits: 10,529
|
|
Hey Flo!
It is true to say that the following features of the script won't help overall performance:
1. Using table variables (no statistics are available and parallelism cannot be used) 2. Using large object types (MAX) 3. Using Unicode 4. @crlf is defined as CHAR rather than NCHAR (implicit conversion)
Nevertheless, all that is rather beside the point here.
The tally or numbers table solution is, to an extent, a brute-force approach. In the case of many long strings where the frequency of the searched-for string is low, a solution based on a loop/CHARINDEX approach should out-perform the tally approach.
Searching a 1000-row table containing strings of 500K characters for example:
The tally method will execute a SUBSTRING 1000 * 500K times, trying to find the search string. A method based on CHARINDEX will execute Sigma[i=1, 1000] (Ni+1) times, where Ni is the number of occurrences of the search string in row i.
Of course CHARINDEX is a character-by-character search from the starting position, but even without doing the math rigorously, it is apparent that there will always be a point where the tally method will be slower.
In your particular example, it helps if the source table contains an extra column containing the length of the string, but the CHARINDEX method is still faster.
This code shows how including an extra length column helps the optimizer:
DECLARE @crlf NCHAR(2) SELECT @crlf = CHAR(13) + CHAR(10)
SELECT SUBSTRING(s.definition, N + 2, CHARINDEX(@crlf, s.definition, N + 1) - N - 2) FROM #source AS S JOIN #tally AS T ON (T.N BETWEEN 1 AND S.Length - 2) WHERE SUBSTRING(s.definition, N, 2) = @crlf ORDER BY N
Paul White SQL Server MVP SQLblog.com @SQL_Kiwi
|
|
|
|
|
SSC-Dedicated
           
Group: General Forum Members
Last Login: Today @ 6:23 AM
Points: 32,903,
Visits: 26,784
|
|
Florian Reischl (4/12/2009) Hello everybody!
I have a performance issue with a set based solution. Please could you advice me what I'm doing wrong?
ORDER BY on unindexed Tally table, for starters. All the other comments about large table variables may apply, as well, although I haven't tested the code, yet.
--Jeff Moden "RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".
First step towards the paradigm shift of writing Set Based code: Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."
For better, quicker answers on T-SQL questions, click on the following... http://www.sqlservercentral.com/articles/Best+Practices/61537/
For better answers on performance questions, click on the following... http://www.sqlservercentral.com/articles/SQLServerCentral/66909/
|
|
|
|
|
SSC-Dedicated
           
Group: General Forum Members
Last Login: Today @ 6:23 AM
Points: 32,903,
Visits: 26,784
|
|
|
|
|
|
SSChampion
        
Group: General Forum Members
Last Login: Wednesday, May 08, 2013 10:33 AM
Points: 10,989,
Visits: 10,529
|
|
Jeff Moden (4/12/2009) ORDER BY on unindexed Tally table, for starters. All the other comments about large table variables may apply, as well, although I haven't tested the code, yet. Unindexed? 
Flo's code:
DECLARE @tally TABLE (N INT NOT NULL, PRIMARY KEY CLUSTERED (N))
edit: overlapped with Jeff's latest
Paul White SQL Server MVP SQLblog.com @SQL_Kiwi
|
|
|
|
|
SSC-Dedicated
           
Group: General Forum Members
Last Login: Today @ 6:23 AM
Points: 32,903,
Visits: 26,784
|
|
Jeff Moden (4/12/2009)
Florian Reischl (4/12/2009) Hello everybody!
I have a performance issue with a set based solution. Please could you advice me what I'm doing wrong?
ORDER BY on unindexed Tally table, for starters. All the other comments about large table variables may apply, as well, although I haven't tested the code, yet.
Wow! Nicely done, Flo... I haven't figured it out, yet. Looks like you may have figured out a way to beat the Tally table.
--Jeff Moden "RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".
First step towards the paradigm shift of writing Set Based code: Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."
For better, quicker answers on T-SQL questions, click on the following... http://www.sqlservercentral.com/articles/Best+Practices/61537/
For better answers on performance questions, click on the following... http://www.sqlservercentral.com/articles/SQLServerCentral/66909/
|
|
|
|
|
SSC-Dedicated
           
Group: General Forum Members
Last Login: Today @ 6:23 AM
Points: 32,903,
Visits: 26,784
|
|
Flo... there's a small error in your temporary stored procedure. It was leaving a leading space in the results. I've repaired it in the following but it still beats the tar out of the Tally table solution. Guess the Tally table pretty much sucks on the big stuff.
--===== Procedure to split line IF (OBJECT_ID('tempdb..#usp_print_lines') IS NOT NULL) DROP PROCEDURE #usp_print_lines GO CREATE PROCEDURE #usp_print_lines @text NVARCHAR(MAX) AS DECLARE @ret TABLE (id INT IDENTITY, line NVARCHAR(MAX)) DECLARE @pos INT, @next INT, @crlf NCHAR(2)
SELECT @pos = 1, @crlf = CHAR(13) + CHAR(10) WHILE (1 = 1) BEGIN SELECT @next = CHARINDEX(@crlf, @text, @pos)
IF (@next = 0) BREAK --- Nothing more to do
IF (@pos <> @next) INSERT INTO @ret SELECT SUBSTRING(@text, @pos + 1, @next - @pos -1)
SELECT @pos = @next + 1 END
SELECT line FROM @ret ORDER BY id --- Return lines GO
--Jeff Moden "RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".
First step towards the paradigm shift of writing Set Based code: Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."
For better, quicker answers on T-SQL questions, click on the following... http://www.sqlservercentral.com/articles/Best+Practices/61537/
For better answers on performance questions, click on the following... http://www.sqlservercentral.com/articles/SQLServerCentral/66909/
|
|
|
|
|
SSCrazy
      
Group: General Forum Members
Last Login: Sunday, November 04, 2012 12:23 PM
Points: 2,087,
Visits: 3,932
|
|
|
|
|