• Bill Talada (8/22/2013)


    I'm completely sick of anti-rbar preachers. They're just getting their ego kicks through trying to one-up everybody.

    Since I invented the term “RBAR” and I’m most definitely an “Anti-RBAR Preacher”, I should take personal offense to your very near slanderous choice of words, but I won’t because I know you’re just frustrated.

    Here is what is happening: code is being run on a Von Neumann architecture. Everything is procedurally processed in loops. Yes, even CTEs are loops - they're just running looping code in a compiled format instead of an interpreted format.

    For someone who claims such knowledge of how the Von Neumann architecture works, I’m absolutely floored that you’ve completely missed the point where RBAR is concerned. WHILE loops and other forms of RBAR defeat the performance of the very loops that you speak of because T-SQL is, in fact, an interpreted language, not a compiled one. Every single iteration of a WHILE loop is handled separately and must either produce a new Execution Plan or figure out that it can reuse an existing Execution Plan. That means that WHILE loops loop at interpretive speeds rather than the very efficient machine-language-level compiled speeds you speak of.
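    To make that concrete, here's a minimal sketch of the two approaches side by side. The #Demo table and its SomeID/SomeValue columns are hypothetical, and the loop assumes SomeID values are contiguous starting at 1:

```sql
-- Hypothetical contrast: both return the same total, but the WHILE loop is
-- interpreted one iteration at a time while SUM() runs as a single
-- compiled, set-based operation over the whole column.
DECLARE @Total INT = 0, @ID INT = 1, @Max INT;
SELECT @Max = MAX(SomeID) FROM #Demo;

WHILE @ID <= @Max          -- RBAR: one interpreted pass per row
BEGIN
    SELECT @Total = @Total + SomeValue FROM #Demo WHERE SomeID = @ID;
    SET @ID += 1;
END;
SELECT Total = @Total;

SELECT Total = SUM(SomeValue) FROM #Demo;  -- set-based equivalent
```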

    Advantages of CTEs:

    1) Run faster

    2) Programmers get to display their ability to write one

    3) Works similar to nature in its fractal and recursive style

    You've totally missed the purpose of the CTE in Dwain’s code. First, CTEs don’t usually run any faster than normal code or even a derived table or View. And the “fractal and recursive style” of a Recursive CTE is actually hidden RBAR that’s frequently worse and usually more resource intensive than even a well written WHILE loop. Second, you really need to slow down and read Dwain’s code. The CTE in his code has absolutely nothing to do with any attempt at a solution. Dwain was simply being a bit lazy and used the CTE to present the test data rather than creating a test table.
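    For anyone who doubts that a Recursive CTE is hidden RBAR, here's a minimal sketch. Each recursion of the CTE below processes just one row at a time, much like a loop would:

```sql
-- A recursive CTE counting from 1 to 10.  Despite the set-based syntax,
-- the recursive member executes once per row produced - hidden RBAR.
WITH Counter (N) AS
(
    SELECT 1                            -- anchor member
    UNION ALL
    SELECT N + 1 FROM Counter WHERE N < 10  -- recursive member, one row per pass
)
 SELECT N FROM Counter
;
```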

    Disadvantages of CTEs:

    1) Not reusable compared to functions

    2) Doesn't simplify the code one is actually working on

    3) Must be rewritten as a UDF when more complex logic is added that requires procedural processing

    While I agree that CTEs don’t provide the reusable encapsulation of a function, they actually can and do greatly simplify code very much in the manner of using an interim Temp Table for “Divide’n’Conquer” methods and by allowing common calculations to be solved once and brought forward through cascading CTEs. But, I digress because, like I said, the CTE in Dwain’s code had absolutely nothing to do with the solution.
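    Here's a minimal sketch of what I mean by "cascading" CTEs. The table and column names (dbo.OrderDetail, Quantity, UnitPrice) are hypothetical; the point is that the common calculation is solved once in the first CTE and simply referenced in the next:

```sql
-- "Cascading" CTEs: TotalPrice is calculated once in BasePrice and then
-- reused by name in Classified instead of being recalculated.
WITH BasePrice AS
(
    SELECT OrderID, TotalPrice = Quantity * UnitPrice
      FROM dbo.OrderDetail
),
Classified AS
(
    SELECT OrderID, TotalPrice,
           PriceBand = CASE WHEN TotalPrice >= 1000 THEN 'High' ELSE 'Low' END
      FROM BasePrice
)
 SELECT OrderID, TotalPrice, PriceBand
   FROM Classified
;
```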

    The post originator specified this was a one-off job. Therefore, extra work in optimization is a complete waste.

    Why do so many people think that 1) one-off jobs can be sloppy and/or slow and 2) that there would be any “extra work in optimization” for such code? If people actually practiced writing good, high-performance code all the time, it would become second nature and would take no extra effort at all. In fact, highly performant, non-RBAR code is frequently shorter and easier to write than RBAR, but it’s like any other skill… ya gotta know and ya gotta practice.

    I gave him a general purpose function. I have hundreds of these that I can quickly put to use in importing weirdly formatted customer data. UDFs are very useful from many points-of-view; CTEs are advantageous from one point of view - production code where performance is top priority.

    I absolutely agree that having a library of such general purpose functions is critical to RAD for applications and RSD for script development. The problem with that is, if you use RBAR in all those functions, then with very few exceptions, virtually everything you and everyone else use them for will be quite a bit slower than it should be. People who justify such slowness by saying they’re only going to “use them on a small number of rows” need to have their heads examined. There is no justification for any code that runs even as little as 2 times slower, even if the difference per row is measured in microseconds, especially since the faster code is usually easier and quicker to write and implement.

    I’m currently going through such throes at work with such code. In fact, the problem the OP posted is very similar to functionality of some of the code I’ve had to repair for performance, so let’s stick with that problem for this example.

    First, let’s build some test data. I have well over a million rows that I have to worry about at work but a million rows will do nicely here. I’m using a GUID as an address line because it has a mix of characters much like an address line would and some of those GUIDs have leading digits. Here’s the code to build the test data as well as the RBAR function you posted and a possible alternative that I threw together (I’m sure that someone could probably optimize it a bit but I didn’t want to spend much time on it. Sound familiar?).

    --=====================================================================================================================
    -- Create and populate a test table to measure performance and resource usage with.
    -- We'll start off without an index.
    -- The test table contains a single column of GUIDs converted to strings to simulate a street address with
    -- possible leading digits.
    --=====================================================================================================================
    --===== Conditionally drop the test table to make reruns in SSMS easier
         IF OBJECT_ID('tempdb..#Addresses','U') IS NOT NULL
       DROP TABLE #Addresses
    ;
    GO
    --===== Create a million row test table of simulated address lines
     SELECT TOP 1000000
            AddressLine = CAST(NEWID() AS VARCHAR(36))
       INTO #Addresses
       FROM master.sys.all_columns ac1
      CROSS JOIN master.sys.all_columns ac2
    ;
    GO
    --===== Since we're going to be dropping objects in this test script, do this in a nice safe place that everyone has.
        USE tempdb
    ;
    GO

    --=====================================================================================================================
    -- Conditionally drop and recreate Talada's function just as he posted it.
    --=====================================================================================================================
         IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'dbo.LeadingNumbers') AND type IN (N'FN', N'IF', N'TF', N'FS', N'FT'))
       DROP FUNCTION dbo.LeadingNumbers
    GO
    create function dbo.LeadingNumbers
    (
        @in varchar(100)
    )
    RETURNS int
    AS
    BEGIN
        declare
            @out varchar(10),
            @ptr int,
            @len int
        set @len = LEN(@in)
        set @ptr = 1
        while @ptr <= @len
        begin
            if charindex(substring(@in, @ptr, 1),'0123456789') = 0
                goto done
            set @ptr += 1
        end
    done:
        if @ptr = 0
            set @out = ''
        else
            set @out = left(@in, @ptr - 1)
        RETURN cast(@out as int)
    END
    GO

    --=====================================================================================================================
    -- Conditionally drop and build a function that does the same thing without any RBAR.
    --=====================================================================================================================
    --===== Conditionally drop the new function to make reruns easier in SSMS
         IF OBJECT_ID('tempdb.dbo.LeadingBigInt','IF') IS NOT NULL
       DROP FUNCTION dbo.LeadingBigInt
    ;
    GO
     CREATE FUNCTION dbo.LeadingBigInt
    /**********************************************************************************************************************
     Purpose:
     Given a string of characters, returns the leading digits as a BIGINT or returns a NULL if there are no leading
     digits. Note that this will return the entire string as a BIGINT if the string is all digits and the leading digits
     will fit in the BIGINT datatype. Will cause an error if the BIGINT overruns.
     Note that this function will NOT detect or return a negative BIGINT.
     Usage:
    --=====
     SELECT ca.LeadingBigInt
       FROM dbo.SomeTable st
      CROSS APPLY dbo.LeadingBigInt(st.SomeStringColumn) ca
    ;
     Revision History:
     Rev 00 - 25 Oct 2013 - Jeff Moden
            - Initial Creation
    **********************************************************************************************************************/
    --===== Declare the I/O for this function
            (@pString VARCHAR(8000))
    RETURNS TABLE WITH SCHEMABINDING AS
     RETURN
     SELECT LeadingBigInt =
            CASE
            WHEN @pString LIKE '[0-9]%'
            THEN CAST(SUBSTRING(@pString,1,ISNULL(NULLIF(PATINDEX('%[^0-9]%',@pString),0)-1,8000)) AS BIGINT)
            ELSE NULL
            END
    ;
    GO

    I couldn’t test your function for performance simply by using SET STATISTICS, because that would have made your function look a whole lot slower than it actually is. There’s a link in the code below if you’d like to learn more about that. So, I turned on a batch-oriented SQL Profiler trace for the following runs. Each piece of code runs three consecutive times to see what might happen due to caching or not.

    --=====================================================================================================================
    -- Run the tests. Note that we CANNOT use SET STATISTICS for this test because Talada's function is a scalar
    -- function and SET STATISTICS doesn't measure CPU or Duration correctly for Scalar UDFs.
    -- REF: http://www.sqlservercentral.com/articles/T-SQL/91724/
    --=====================================================================================================================
    GO
    --===== Talada's RBAR Scalar Function =================================================================================
     SELECT AddressLine
       FROM #Addresses addr
      WHERE dbo.LeadingNumbers(AddressLine) >= 4000
        AND dbo.LeadingNumbers(AddressLine) <  5000
    ;
    GO 3
    --===== The iTVF Function =============================================================================================
     SELECT AddressLine, ca.LeadingBigInt
       FROM #Addresses addr
      CROSS APPLY dbo.LeadingBigInt(AddressLine) ca
      WHERE ca.LeadingBigInt >= 4000
        AND ca.LeadingBigInt <  5000
    ;
    GO 3
    --
    GO
    --===== Talada's RBAR Scalar Function with extra isolation ============================================================
     SELECT AddressLine
       FROM #Addresses addr
      WHERE AddressLine LIKE '[0-9]%'
        AND dbo.LeadingNumbers(AddressLine) >= 4000
        AND dbo.LeadingNumbers(AddressLine) <  5000
    ;
    GO 3
    --===== The iTVF Function with extra isolation ========================================================================
     SELECT AddressLine, ca.LeadingBigInt
       FROM #Addresses addr
      CROSS APPLY dbo.LeadingBigInt(AddressLine) ca
      WHERE AddressLine LIKE '[0-9]%'
        AND ca.LeadingBigInt >= 4000
        AND ca.LeadingBigInt <  5000
    ;
    GO 3

    Here are the results from the SQL Profiler run…

    Also notice that my code returns the leading integer for possible inclusion in a Temp Table or a join, but still only calls the function once. Yours would have to make 3 calls.

    To summarize, the RBAR code you wrote used roughly 4 to 7 times more CPU and took 10 to 15 times longer to run. Yeah, I know what’s next. Someone is going to say that a difference of 5.2 to 7.8 seconds on a million rows doesn’t amount to a hill of beans. Nothing could be further from the truth. The code at work similar to the RBAR function on this thread runs 40,000 times in 8 hours. Do the math… the non-RBAR, non-optimized function in this thread takes a total of about 13.2 CPU hours to execute. The RBAR function takes a total of almost 94.5 CPU hours.
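    For reference, those totals can be back-solved into approximate per-run timings. A minimal sketch (the 1.19 and 8.50 second per-run figures are my estimates inferred from the stated totals, not the actual Profiler numbers):

```sql
-- 40,000 runs at the estimated per-run duration, converted to CPU hours.
 SELECT iTVF_CpuHours = 40000 * 1.19 / 3600.0,  -- ≈ 13.2
        RBAR_CpuHours = 40000 * 8.50 / 3600.0   -- ≈ 94.4
;
```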

    Now, just imagine making everything on your machine 6-7 times faster like that. And it’s so easy if you just say “NO” to RBAR and allow the Von Neumann architecture in your computer to actually do its job. 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)