Never say 'never' to the WHILE loop.

/* download the test text file Moby-Dyck.zip here */

/* You might notice that I refer to Moby **** in this blog. This is because the busybodyish nanny software will not allow me to refer to my test data, the 1851 novel by Herman Melville. This reminds me of when the Royal Society for the Protection of Birds used Community Server for their website. They had to refer to 'Male Bird' rather than a Cock.  It is like a scene from a novel by Philip K Dick.*/

DECLARE @LotsOfText VARCHAR(MAX)
SELECT @LotsOfText='
It surprises me how often I need to chop up text into its individual words, paragraphs or lines. I suppose I do a lot of parsing of difficult input to get it into a database. I also like to show off by doing really good searches of data, which requires an inversion table.

Everyone must know how to chop up delimited strings into individual fields, but chopping text into its individual words requires a rather different technique. There are no consistent delimiters, just whitespace. Actually, you can think of slicing text into words as an extreme form of parsing delimited strings.

Here is a routine that has the advantage that it does not need to be encapsulated into a function, although it doesn''t mind being used in a function. It is also very rapid.

The derived table (g) is simply the index into the string of all the places where there is a transition from whitespace to word-characters; in other words, the start of each word. After that, all you have to do is find the end of the word and you are home and dry!

Characters such as '' or - are allowed within words such as double-decker or shan''t, but are treated as whitespace if outside a word. We start with what I guess is a pretty simple routine using a ''helper'' table called ''numbers'' which you''ll just have to set up using one of Anith''s methods described in his ''Faking Arrays in Transact SQL'': http://www.simple-talk.com/sql/t-sql-programming/faking-arrays-in-transact-sql/ '
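
/* If you haven't already got a 'numbers' helper table, here is one quick way
   of knocking one up - a sketch only, and not necessarily one of Anith's
   methods. It assumes you are happy with a single-column table called
   'numbers' holding the integers 1 upwards; a couple of million is enough
   to cover the whole novel. */
IF OBJECT_ID(N'numbers') IS NULL
  BEGIN
  SELECT TOP 2000000
         number = ISNULL(CONVERT(INT, ROW_NUMBER() OVER (ORDER BY (SELECT NULL))), 0)
  INTO   numbers
  FROM   sys.all_columns a CROSS JOIN sys.all_columns b
  ALTER TABLE numbers ADD CONSTRAINT PK_numbers PRIMARY KEY CLUSTERED (number)
  END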

SELECT SUBSTRING(@LotsOfText, start,
         PATINDEX('%[^A-Z''a-z0-9-]%', SUBSTRING(@LotsOfText, start, 50) + ' ') - 1)
FROM  (SELECT [start] = number
       FROM   numbers
       WHERE  number < LEN(@LotsOfText) + 1
         AND (PATINDEX('[^A-Za-z0-9''-][A-Za-z0-9]', SUBSTRING(' ' + @LotsOfText, number, 2)) = 1
          OR  PATINDEX('[^A-Za-z0-9"''-][''"][A-Za-z0-9]', SUBSTRING(' ' + @LotsOfText, number, 3)) = 1)
      ) g
ORDER BY start
/* Of course, you'd usually use this sort of routine as a function. */
GO
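/* To see what the derived table (g) is actually doing, here is the core
   word-start test run over a small scrap of text. It is purely an
   illustrative sketch: it lists the character position at which each word
   starts, which is all that (g) provides to the outer query. */
DECLARE @Scrap VARCHAR(100)
SELECT @Scrap = 'Call me Ishmael. Some years ago - never mind how long'
SELECT [start] = number,
       FirstTwoChars = SUBSTRING(@Scrap, number, 2) -- the first couple of characters of each word
FROM   numbers
WHERE  number < LEN(@Scrap) + 1
  AND  PATINDEX('[^A-Za-z0-9''-][A-Za-z0-9]', SUBSTRING(' ' + @Scrap, number, 2)) = 1
ORDER  BY number
GO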
/*------------------------------------------------------------*/
IF OBJECT_ID(N'WordChop') IS NOT NULL DROP FUNCTION WordChop
GO
CREATE FUNCTION [dbo].[WordChop]
(
@string VARCHAR(MAX)
)
RETURNS @Results TABLE
(
Item VARCHAR(255)
)
AS
BEGIN
INSERT INTO @Results (Item)
  SELECT SUBSTRING(@string, start,
           PATINDEX('%[^A-Z''a-z0-9-]%', SUBSTRING(@string, start, 50) + ' ') - 1)
  FROM  (SELECT [start] = number
         FROM   numbers
         WHERE  number < LEN(@string) + 1
           AND (PATINDEX('[^A-Za-z0-9''-][A-Za-z0-9]', SUBSTRING(' ' + @string, number, 2)) = 1
            OR  PATINDEX('[^A-Za-z0-9"''-][''"][A-Za-z0-9]', SUBSTRING(' ' + @string, number, 3)) = 1)
        ) g
  ORDER BY start
RETURN
END
GO
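/* A quick sanity-check of the function before we time it - just a usage
   sketch (it needs the 'numbers' table to exist, and the exact slicing
   depends on your collation): */
SELECT Item FROM dbo.WordChop('She sells sea-shells; surely she shan''t stop?')
GO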

/* ... but is it really faster than a simple iterative solution? */
/*------------------------------------------------------------*/
IF OBJECT_ID(N'IterativeWordChop') IS NOT NULL DROP FUNCTION IterativeWordChop
GO
CREATE FUNCTION [dbo].[IterativeWordChop]
(
@string VARCHAR(MAX)
)
RETURNS @Results TABLE
(
Item VARCHAR(255)
)
AS
BEGIN
DECLARE @Len INT, @Start INT, @end INT, @Cursor INT, @length INT
SELECT @Cursor = 1, @Len = LEN(@string)
WHILE @Cursor < @Len
  BEGIN
  -- find the offset, from the cursor, of the next whitespace-to-word transition
  SELECT @Start = PATINDEX('%[^A-Za-z0-9][A-Za-z0-9]%',
                           ' ' + SUBSTRING(@string, @Cursor, 50)) - 1
  -- measure the word by finding the first character that cannot belong to it
  SELECT @length = PATINDEX('%[^A-Z''a-z0-9-]%',
                            SUBSTRING(@string, @Cursor + @Start + 1, 50) + ' ')
  INSERT INTO @Results (Item)
    SELECT SUBSTRING(@string, @Cursor + @Start, @length)
  -- move the cursor past the word just collected and the character after it
  SELECT @Cursor = @Cursor + @Start + @length + 1
  END
RETURN
END
GO
/* Now we have two very similar functions, one of which is iterative, while the other uses a number 'helper' table. Which is the more efficient? Which is the faster? From the RBAR school of thought the answer would seem obvious: the number table is always going to win. We can easily put this to the test.*/

/* Let's put the lovely first chapter of Moby Dick in.*/
DECLARE @LotsOfText VARCHAR(MAX)
SELECT  @LotsOfText = BulkColumn
-- (SQL Server 2005/2008 only for this code)
FROM    OPENROWSET(BULK 'C:\workbench\moby-dick-C1.txt', SINGLE_BLOB) AS x  
SET STATISTICS time ON
SELECT 'Do it the RBAR way'
SELECT COUNT(*) FROM Iterativewordchop(@LotsOfText)-- (2220 words)
/* SQL Server Execution Times:
   CPU time = 468 ms,  elapsed time = 791 ms.*/

SELECT 'and now the number-table way'
SELECT COUNT(*) FROM wordchop(@LotsOfText)----------- (2219 words)
SET STATISTICS time OFF
/* SQL Server Execution Times:
   CPU time = 94 ms,  elapsed time = 112 ms.*/


GO

/* Let's put the entire text of Moby Dick in. Prepare for a surprise. (Remember, SQL Server 2005/2008 only for this code.)*/
DECLARE @LotsOfText VARCHAR(MAX)
SELECT  @LotsOfText = BulkColumn
FROM    OPENROWSET(BULK 'C:\workbench\moby-dick.txt', SINGLE_BLOB) AS x  

SELECT COUNT(*) FROM Iterativewordchop(@LotsOfText)--26 seconds (214730 words)

SELECT COUNT(*) FROM wordchop(@LotsOfText)-----------48 seconds (214243 words)

/*
Word thinks it is 216,381 words, and it took 6 seconds!
Yes, one algorithm scales more linearly than the other. RBAR wins in this case.*/

/*
Conclusion?
Believe nobody. Test everything. Try your algorithm under the sort of loads you are going to expect in your application, and anyone who says that WHILE loops or cursors are always less efficient is simplifying the grey truth. However, for your piddly databases you should never bet on RBAR.  */

Comments

Posted by K. Brian Kelley on 2 October 2008

Watching for a Jeff Moden sighting... ;)

Posted by Phil Factor on 2 October 2008

Am I scared of Jeff? Me?  Hmmm. Perhaps I am. Eeek!

p.s. Thanks to Robyn for the original idea for this Blog post

Posted by Jeff Moden on 5 October 2008

Heh... no one should be afraid of me.  I'm just a friendly guy... with a bucket of frozen pork chops and a slingshot.  ;-)

Hey, Phil... how can I get a copy of chapter 1 to play with?  Your blog is, indeed, very interesting and I'd, of course, like to take a crack at it.  Thanks, ol' friend...

Posted by Jeff Moden on 5 October 2008

Oh yeah... as for the "Jeff Moden sighting", I should have yelled something appropriate like "Thar she blows!" ;-)  Now, where'd I put that darned harpoon?  ;-)

Posted by Jeff Moden on 5 October 2008

By the way and whilst we wait for the copy of Chapter 1, Phil is absolutely correct when he says "Believe nobody. Test everything. Try your algorithm under the sort of loads you are going to expect in your application..."

The rest I'm not so sure of... so guess what?  I'm going to test.  ;-)

Posted by Phil Factor on 5 October 2008

When I put this blog on, I tried to put the chapter 1, and the whole of Moby ***, into a zip file on this blog, but it doesn't seem to have appeared. I'm struggling a bit with the technology. If you get bored waiting for me to work out how to make it appear, it is readily available on the internet, which is why I used it in the first place; that and the fact it is one of my favourite books.

Posted by Jeff Moden on 5 October 2008

Heh... no problem... do you have a URL where I can find it?

Posted by Jeff Moden on 5 October 2008

Heh... never mind... there's about a bazillion hits on chapter 1...

Posted by Jeff Moden on 5 October 2008

Ok... I ran the entire novel (380 pages according to MS Word) through the very same wringer that you did, Phil.  I used a 5 million row Tally table just to be on the safe side, which turns out to be a bit of overkill.  Other than changing the name of the table to "Tally" and its column name to "N", all of the code is the exact same as yours.  On my poor ol' tired 6 year old P5 1.8GHz with a gig of RAM and an IDE harddrive, here's the results...

DECLARE @LotsOfText VARCHAR(MAX)
SELECT  @LotsOfText = BulkColumn
FROM    OPENROWSET(BULK 'C:\Documents and Settings\~Jeff Moden\My Documents\xmoby.txt', SINGLE_BLOB) AS x
--SELECT @LotsOfText = REPLICATE(@LotsOfText,850)
SELECT LEN(@LotsOfText)

SET STATISTICS time ON

SELECT 'Do it the RBAR way'
SELECT COUNT(*) FROM Iterativewordchop(@LotsOfText)
/* SQL Server Execution Times:
   CPU time = 109625 ms,  elapsed time = 170506 ms.*/

SELECT 'and now the number-table way'
SELECT COUNT(*) FROM wordchop(@LotsOfText)
/* SQL Server Execution Times:
   CPU time = 73109 ms,  elapsed time = 74593 ms.*/

SET STATISTICS time OFF

The Tally table method wins handily.  I don't have a multi-processor machine to test on so I can't test for parallelism problems, etc.  By the way, I'm running SQL Server 2k5 with no cumulative updates past Service Pack 2.

I'll zip the file up and send it to work and see what happens there.

Posted by Jeff Moden on 5 October 2008

P.S.  Heh... Always say "Never" to a RBAR While loop... if the While loop beats the set based method, there's something else wrong.  ;-)

Posted by Jeff Moden on 5 October 2008

Hmmmm... on the other hand, it IS very interesting that the WHILE loop beat up the Tally table on Phil's machine... it would be very interesting to find out why.

Thanks for starting this blog, Phil.

Posted by Phil Factor on 6 October 2008

I re-ran the tests. The results were almost identical. The numbers table has the clustered index and is generally kosher.

Like you, I always run my development stuff on the worst possible configuration so that performance issues show up clearly. My processor is better than yours, and I have one Gig, but there is other stuff running on the machine, and I suspect that the differences are all due to the fact that I have less memory available for caching than you. However, in a running system with lots of other processes going on, the same thing could happen. Something that runs well in test will slow down when the cache resources aren't available in a production setting.

I've heard from other database developers that in some circumstances, iterative solutions can be more effective than set-based solutions, but this is the first time I've been able to detect it, and measure it, myself.

Posted by peter on 6 October 2008

Well, obviously inspecting an in-memory character string (the input in this case) is faster in the procedural method. This is what CPUs are good at: no I/O, very localised in-memory data access... examining every character just once if you write the code properly.

Storing the found result in a SQL set-compatible form (a table) is where SQL Server seems to go RBAR (as Jeff calls it) and tries to handle things as a set where in fact the result is not immediately one. The result is built up row by row, and SQL Server doesn't understand that and handles things inefficiently all of a sudden.

For MS this is easily fixable if they wanted to do so. Think of an extra datatype that represents an in-memory two-dimensional array (data-compatible with a table) that is accessed like an array instead of a table and resides in memory. Then only one bulk conversion (again, good localisation) from the memory type to the persistent type (a table) needs to be done.

The result would be the best of both worlds. Another option would be to not expose this new type to the developer, but do it automatically in the background when the table type is local (can't interact with other processes) and has an identity column as its primary key. SQL Server could keep it in memory until a true table type is forced by some operation that needs it.

There are other things to exploit, such as the usual case where rows are added one by one: do the conversion to a real table in page-size chunks while keeping the latest additions in an in-memory page.

As for the different results on different machines, I can't say I am that surprised. Using an older machine to find bottlenecks is IMHO not a good way, as the balance between I/O, memory and processing capabilities is different with each generation of machines. On top of that, the nature of CPU processing power is changing as well, towards more cores instead of faster threads. This means more CPU can be utilized only if you can parallelize the execution of your code. This is, BTW, where a materialized tally table will have an edge on large datasets: operations with it can in theory be streamed and parallelized at the expense of CPU and I/O (which should be available in surplus on a DB machine).

Efficiency on one level doesn't always win the race. Your solution is executed at many levels of both hardware and software, and each level has its own performance characteristics. Think of harddisks (latency/throughput), memory (latency/throughput), several caching layers in both software and the CPU and disks/controllers, and of course SQL Server itself, which can either connect well or fail miserably at tapping into these resources.

There will be no final word on this; it's all evolutionary and will remain so as different aspects of the whole undergo changes over time.

Posted by peter on 19 February 2009

Small addition,

It seems you benchmarked a table-valued function that was NOT inline, meaning it has a BEGIN and END around its body and executes more than just the return of a select statement. This body is hidden from the optimizer, and execution thus becomes a multi-step process instead of being completely set-based.

You should re-run your test with a proper inline table-valued version instead.
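
An inline version along these lines would do (a sketch only, keeping the original PATINDEX patterns and the 'numbers' helper table; the name WordChopInline is illustrative, not from the blog):

IF OBJECT_ID(N'WordChopInline') IS NOT NULL DROP FUNCTION WordChopInline
GO
CREATE FUNCTION dbo.WordChopInline (@string VARCHAR(MAX))
RETURNS TABLE
AS
RETURN
 (SELECT Item = SUBSTRING(@string, start,
           PATINDEX('%[^A-Z''a-z0-9-]%', SUBSTRING(@string, start, 50) + ' ') - 1)
  FROM  (SELECT [start] = number
         FROM   numbers
         WHERE  number < LEN(@string) + 1
           AND (PATINDEX('[^A-Za-z0-9''-][A-Za-z0-9]', SUBSTRING(' ' + @string, number, 2)) = 1
            OR  PATINDEX('[^A-Za-z0-9"''-][''"][A-Za-z0-9]', SUBSTRING(' ' + @string, number, 3)) = 1)
        ) g)
GO
SELECT COUNT(*) FROM dbo.WordChopInline('Call me Ishmael.')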

Posted by Tommy on 9 August 2009

Without search criteria, I got it to display the results with the total record count in 160 milliseconds for 4 million rows.

With search criteria, it took 3 seconds to display the results with the total record count for 5,000 rows.

It looks like counting rows with criteria takes longer for SQL Server to process.

I have tried many methods and still cannot get "COUNT" with criteria below 1 second.

Posted by Dave F on 13 August 2009

Looks like, if I don't have time to test, I should use the set-based approach first and then consider testing with WHILE loops later when I can afford the time. In either case, it does not look as though one would be a showstopper over the other.
