SQLServerCentral is supported by Red Gate Software Ltd.

Searching Address by Ranges
Author
Message
Posted Thursday, August 22, 2013 7:31 AM


SSC-Enthusiastic


Group: General Forum Members
Last Login: Thursday, October 16, 2014 11:42 AM
Points: 142, Visits: 921
I still strongly disagree. It is much, much easier to incrementally add customization to a UDF than to throw away a CTE because additional requirements mean it can no longer be used. CTEs are often dead ends. Performance is almost never a problem for once-and-done importing. I only spend time on optimization when something is taking a few seconds to run in production.

CTEs are good from one point of view: optimization.
UDFs are good from hundreds of points of view: maintainability, upgradability, simplicity, usability, diversity.

No thanks! I will continue to use UDFs almost always.
Post #1487253
Posted Thursday, August 22, 2013 7:39 AM


SSCertifiable


Group: General Forum Members
Last Login: Friday, October 17, 2014 9:40 AM
Points: 6,778, Visits: 13,971
Bill Talada (8/22/2013)
I still strongly disagree. It is much, much easier to incrementally add customization to a UDF than to throw away a CTE because additional requirements mean it can no longer be used. CTEs are often dead ends. Performance is almost never a problem for once-and-done importing. I only spend time on optimization when something is taking a few seconds to run in production.

CTEs are good from one point of view: optimization.
UDFs are good from hundreds of points of view: maintainability, upgradability, simplicity, usability, diversity.

No thanks! I will continue to use UDFs almost always.


CTEs are irrelevant to optimisation. Except for recursive CTEs, they're all about code simplification.

I'll agree with you that UDFs can be very useful indeed - providing they are properly written. There's a simple test: if a UDF contains the keywords BEGIN and END, then it is not properly written.
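To make that test concrete, here is a hypothetical sketch (not code from this thread; the names are made up): the same trivial logic written as a scalar UDF, which needs a BEGIN...END body and is invoked row by row, and as an inline table-valued function, which is a single RETURN of a SELECT that the optimizer can expand into the calling query.

```sql
-- Hypothetical example: the same "double it" logic two ways.
-- Scalar UDF: contains BEGIN/END, so it fails the simple test.
CREATE FUNCTION dbo.DoubleIt_Scalar (@n INT)
RETURNS INT
AS
BEGIN
    RETURN @n * 2;
END
GO
-- Inline TVF: no BEGIN/END around the body, just RETURN SELECT,
-- which the optimizer can inline into the calling query's plan.
CREATE FUNCTION dbo.DoubleIt_Inline (@n INT)
RETURNS TABLE WITH SCHEMABINDING AS
RETURN SELECT Doubled = @n * 2;
GO
```

The inline form would be consumed with CROSS APPLY, exactly as in the code later in this thread.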


“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
Exploring Recursive CTEs by Example Dwain Camps
Post #1487257
Posted Friday, October 25, 2013 1:35 AM


SSC-Dedicated


Group: General Forum Members
Last Login: Yesterday @ 8:45 PM
Points: 35,342, Visits: 31,875
Bill Talada (8/22/2013)

I'm completely sick of anti-rbar preachers. They're just getting their ego kicks through trying to one-up everybody.


Since I invented the term “RBAR” and I’m most definitely an “Anti-RBAR Preacher”, I should take personal offense to your very near slanderous choice of words, but I won’t because I know you’re just frustrated.

Here is what is happening: code is being run on a Von Neumann architecture. Everything is procedurally processed in loops. Yes, even CTEs are loops - they're just running looping code in a compiled format instead of an interpreted format.

For someone who claims such knowledge of how the Von Neumann architecture works, I’m absolutely floored that you’ve completely missed the point where RBAR is concerned. WHILE loops and other forms of RBAR defeat the performance of the very loops you speak of because T-SQL is, in fact, an interpreted language, not a compiled one. Every single iteration of a WHILE loop is handled separately and must either produce a new execution plan or figure out that it can reuse an existing one. That means WHILE loops run at interpreted speeds rather than the very efficient, machine-language-level compiled speeds you speak of.
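As a sketch of that per-statement overhead (dbo.SomeTable and its columns are hypothetical, not from this thread): the loop below executes one UPDATE statement per iteration, each handled separately by the interpreter, while the set-based version does the same work in a single statement with a single plan.

```sql
-- RBAR: one statement execution per row, each iteration handled separately.
DECLARE @i INT = 1;
WHILE @i <= 1000000
BEGIN
    UPDATE dbo.SomeTable SET Val = Val + 1 WHERE Id = @i;  -- hypothetical table
    SET @i += 1;
END

-- Set-based: one statement, one plan, one pass over the data.
UPDATE dbo.SomeTable SET Val = Val + 1;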

Advantages of CTEs:
1) Run faster
2) Programmers get to display their ability to write one
3) Works similar to nature in its fractal and recursive style


You've totally missed the purpose of the CTE in Dwain’s code. First, CTEs don’t usually run any faster than normal code or even a derived table or View. And the “fractal and recursive style” of a Recursive CTE is actually hidden RBAR that’s frequently worse and usually more resource intensive than even a well written WHILE loop. Second, you really need to slow down and read Dwain’s code. The CTE in his code has absolutely nothing to do with any attempt at a solution. Dwain was simply being a bit lazy and used the CTE to present the test data rather than creating a test table.
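For readers who haven't seen the technique, presenting test data through a CTE rather than a test table looks something like this hypothetical sketch (the addresses are made up):

```sql
-- Using a CTE purely to present inline test data, in place of
-- creating and populating a test table.
WITH TestData (AddressLine) AS
(
    SELECT v.AddressLine
    FROM (VALUES ('123 Main St'), ('45B Elm Ave'), ('Oak Blvd')) v (AddressLine)
)
SELECT AddressLine
FROM TestData;
```

The CTE here is scaffolding for the example, not part of any solution.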

Disadvantages of CTEs:
1) Not reusable compared to functions
2) Doesn't simplify the code one is actually working on
3) Must be rewritten as a UDF when more complex logic is added that requires procedural processing


While I agree that CTEs don’t provide the reusable encapsulation of a function, they actually can and do greatly simplify code very much in the manner of using an interim Temp Table for “Divide’n’Conquer” methods and by allowing common calculations to be solved once and brought forward through cascading CTEs. But, I digress because, like I said, the CTE in Dwain’s code had absolutely nothing to do with the solution.
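A minimal sketch of what "cascading CTEs" means (dbo.Orders and the column names are hypothetical): the common calculation is written once in the first CTE and simply carried forward by the next.

```sql
-- Cascading CTEs: TotalPrice is calculated exactly once and then
-- reused by name in every later step.
WITH BasePrice AS
(
    SELECT OrderId,
           Qty * UnitPrice AS TotalPrice   -- the common calculation
    FROM dbo.Orders                        -- hypothetical table
),
Flagged AS
(
    SELECT OrderId, TotalPrice,
           CASE WHEN TotalPrice > 1000 THEN 1 ELSE 0 END AS IsLarge
    FROM BasePrice                         -- carries TotalPrice forward
)
SELECT OrderId, TotalPrice, IsLarge
FROM Flagged;
```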

The post originator specified this was a one-off job. Therefore, extra work in optimization is a complete waste.


Why do so many people think that 1) one-off jobs can be sloppy and/or slow and 2) that there would be any “extra work in optimization” for such code? If people actually practiced writing good, high-performance code all the time, it would become second nature and would be no extra effort at all. In fact, highly performant, non-RBAR code is frequently shorter and easier to write than RBAR but it’s like any other skill… ya gotta know and ya gotta practice.

I gave him a general purpose function. I have hundreds of these that I can quickly put to use in importing weirdly formatted customer data. UDFs are very useful from many points-of-view; CTEs are advantageous from one point of view - production code where performance is top priority.


I absolutely agree that having a library of such general purpose functions is critical to RAD for applications and RSD for script development. The problem is that if you use RBAR in all of those functions then, with very few exceptions, virtually everything you and everyone else use them for will be unnecessarily slower than it should be. People who justify such slowness by saying they’re only going to “use them on a small number of rows” need to have their heads examined. There is no justification for code that runs even as little as 2 times slower, even if the difference per row is measured in microseconds, especially since the faster code is usually easier and quicker to write and implement.

I’m currently going through just such throes at work. In fact, the problem the OP posted is very similar to the functionality of some of the code I’ve had to repair for performance, so let’s stick with that problem for this example.

First, let’s build some test data. I have well over a million rows that I have to worry about at work but a million rows will do nicely here. I’m using a GUID as an address line because it has a mix of characters much like an address line would and some of those GUIDs have leading digits. Here’s the code to build the test data as well as the RBAR function you posted and a possible alternative that I threw together (I’m sure that someone could probably optimize it a bit but I didn’t want to spend much time on it. Sound familiar?).
--=====================================================================================================================
-- Create and populate a test table to measure performance and resource usage with.
-- We'll start off without an index.
-- The test table contains a single column of GUIDs converted to strings to simulate a street address with
-- possible leading digits.
--=====================================================================================================================
--===== Conditionally drop the test table to make reruns in SSMS easier
IF OBJECT_ID('tempdb..#Addresses','U') IS NOT NULL
DROP TABLE #Addresses
;
GO
--===== Create a million row test table of simulated address lines
SELECT TOP 1000000
AddressLine = CAST(NEWID() AS VARCHAR(36))
INTO #Addresses
FROM master.sys.all_columns ac1
CROSS JOIN master.sys.all_columns ac2
;
GO
--===== Since we're going to be dropping objects in this test script, do this in a nice safe place that everyone has.
USE tempdb
;
GO
--=====================================================================================================================
-- Conditionally drop and recreate Talada's function just as he posted it.
--=====================================================================================================================
IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'dbo.LeadingNumbers') AND type IN (N'FN', N'IF', N'TF', N'FS', N'FT'))
DROP FUNCTION dbo.LeadingNumbers
GO

create function dbo.LeadingNumbers
(
    @in varchar(100)
)
RETURNS int
AS
BEGIN
    declare
        @out varchar(10),
        @ptr int,
        @len int

    set @len = LEN(@in)
    set @ptr = 1

    while @ptr <= @len
    begin
        if charindex(substring(@in, @ptr, 1),'0123456789') = 0
            goto done
        set @ptr += 1
    end

done:
    if @ptr = 0
        set @out = ''
    else
        set @out = left(@in, @ptr - 1)

    RETURN cast(@out as int)

END
GO
--=====================================================================================================================
-- Conditionally drop and build a function that does the same thing without any RBAR.
--=====================================================================================================================
--====== Conditionally drop the new function to make reruns easier in SSMS
IF OBJECT_ID('tempdb.dbo.LeadingBigInt','IF') IS NOT NULL
DROP FUNCTION dbo.LeadingBigInt
;
GO
CREATE FUNCTION dbo.LeadingBigInt
/**********************************************************************************************************************
Purpose:
Given a string of characters, returns the leading digits as an Integer or returns a null if there are no leading
digits. Note that this will return the entire string as a BIGINT if the string is all digits and the leading digits
will fit in the BIGINT datatype. Will cause an error if the BIGINT overruns.

Note that this function will NOT detect or return a negative BIGINT.

Usage:
--=====
 SELECT ca.LeadingBigInt
   FROM dbo.SomeTable st
  CROSS APPLY dbo.LeadingBigInt(st.SomeStringColumn) ca
WHERE

Revision History:
Rev 00 - 25 Oct 2013 - Jeff Moden
- Initial Creation
**********************************************************************************************************************/
--===== Declare the I/O for this function
(@pString VARCHAR(8000))
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
 SELECT LeadingBigInt =
        CASE --==== Only do the conversion work if there's at least one leading digit.
             WHEN @pString LIKE '[0-9]%'
             --==== PATINDEX finds the first non-digit; NULLIF/ISNULL convert "not found" (0)
             --     into "take all 8000", and the -1 trims back to the last leading digit.
             THEN CAST(SUBSTRING(@pString,1,ISNULL(NULLIF(PATINDEX('%[^0-9]%',@pString),0)-1,8000)) AS BIGINT)
             ELSE NULL
        END
;
GO


I couldn’t test your function for performance simply by using SET STATISTICS, because that would have made your function look a whole lot slower than it actually is. There’s a link in the code below if you’d like to learn more about that. So I turned on a batch-oriented SQL Profiler trace for the following runs. Each piece of code runs three consecutive times to see what might happen due to caching.

--=====================================================================================================================
-- Run the tests. Note that we CANNOT use SET STATISTICS for this test because TALADA's function is a scalar
-- function and SET STATISTICS doesn't measure CPU or Duration correctly for Scalar UDFs.
-- REF: http://www.sqlservercentral.com/articles/T-SQL/91724/
--=====================================================================================================================
GO
--===== Talada's RBAR Scalar Function =================================================================================
SELECT AddressLine
FROM #Addresses addr
WHERE dbo.LeadingNumbers(AddressLine) >= 4000
AND dbo.LeadingNumbers(AddressLine) < 5000
;
GO 3
--===== The iTVF Function =============================================================================================
SELECT AddressLine,ca.LeadingBigInt
FROM #Addresses addr
CROSS APPLY dbo.LeadingBigINT(AddressLine) ca
WHERE ca.LeadingBigInt >= 4000
AND ca.LeadingBigInt < 5000
;
GO 3
--
GO
--===== Talada's RBAR Scalar Function with extra isolation ============================================================
SELECT AddressLine
FROM #Addresses addr
WHERE AddressLine LIKE '[0-9]%'
AND dbo.LeadingNumbers(AddressLine) >= 4000
AND dbo.LeadingNumbers(AddressLine) < 5000
;
GO 3
--===== The iTVF Function with extra isolation ========================================================================
SELECT AddressLine,ca.LeadingBigInt
FROM #Addresses addr
CROSS APPLY dbo.LeadingBigINT(AddressLine) ca
WHERE AddressLine LIKE '[0-9]%'
AND ca.LeadingBigInt >= 4000
AND ca.LeadingBigInt < 5000
;
GO 3

Here are the results from the SQL Profiler run (see the attached Leading Digits Performance.PNG)…

Also notice that I return the LeadingBigInt for possible inclusion in a Temp Table or a join while still calling the function only once. Yours would have to make 3 calls.

To summarize, the RBAR code you wrote used roughly 4 to 7 times more CPU and took 10 to 15 times longer to run. Yeah, I know what’s next: someone is going to say that a difference of 5.2 to 7.8 seconds on a million rows doesn’t amount to a hill of beans. Nothing could be further from the truth. The code at work similar to the RBAR function on this thread runs 40,000 times in 8 hours. Do the math… the non-RBAR, non-optimized function in this thread takes a total of about 13.2 CPU hours to execute; the RBAR function takes a total of almost 94.5 CPU hours.
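Doing that math explicitly (the per-run seconds below are back-calculated from the stated totals, not measured values):

```sql
-- Back-of-the-envelope check on the CPU-hour totals above.
-- 13.2 CPU hours / 40,000 runs ≈ 1.188 s per run (non-RBAR);
-- 94.5 CPU hours / 40,000 runs ≈ 8.505 s per run (RBAR).
SELECT CpuHoursNonRBAR = 40000 * 1.188 / 3600.0,  -- = 13.2 hours
       CpuHoursRBAR    = 40000 * 8.505 / 3600.0;  -- = 94.5 hours
```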

Now, just imagine making everything on your machine 6-7 times faster like that. And it’s so easy if you just say “NO” to RBAR and allow the Von Neumann architecture in your computer to actually do its job.


--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems


  Post Attachments 
Leading Digits Performance.PNG (24 views, 20.81 KB)
Post #1508355