Searching Addresses by Ranges

  • I have a scenario where I have been given a table full of address ranges that I need to compare against an existing table containing addresses. The address range table is structured with the following columns: StreetName, RangeFrom, RangeTo.

A particular example is that I need to find an address that looks like 1516 Adams St by using the address range table with the following: StreetName = 'Adams', RangeFrom = 1200, RangeTo = 1599. In this case I can use a LIKE predicate with the pattern '1[2-5][0-9][0-9] ' + StreetName + '%' to capture the desired records. However, it becomes a bit more tricky when the ranges are something like 1505-1629 or 12-158.
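
    To make the pattern approach concrete, here is roughly the sort of query that works when the range happens to line up on digit boundaries (the Addresses table and AddressLine column below are just illustrative):

    -- Illustrative sketch only: 1200-1599 maps cleanly onto digit classes.
    SELECT a.AddressLine
      FROM Addresses a
     WHERE a.AddressLine LIKE '1[2-5][0-9][0-9] Adams%';
    -- A range such as 1505-1629 has no single digit-class pattern like this,
    -- hence the question below.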

    Does anyone know of a good way to perform this function?

    Thanks for any assistance you can provide.

    Tony Willms

  • A "good" way to perform your query would be to design the table with address parts instead of address lines.

    primary_number: 3127

    street_predirection: E

    street_name: Warm Springs

    street_suffix: Rd

    secondary_number: 200

    secondary_designator: Ste

    city_name: Las Vegas

    state_abbreviation: NV

    zipcode: 89120

    plus4_code: 3134

    delivery_point: 50

    delivery_point_check_digit: 4

    first_line: 3127 E Warm Springs Rd Ste 200

    last_line: Las Vegas NV 89120-3134
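
    With the parts stored separately, a range lookup becomes a plain numeric comparison. A minimal sketch (the table and column definitions below are made up for illustration; only StreetName, RangeFrom and RangeTo come from the original post):

    CREATE TABLE dbo.AddressRange
    (
        StreetName varchar(100) NOT NULL,
        RangeFrom  int          NOT NULL,
        RangeTo    int          NOT NULL
    );

    CREATE TABLE dbo.AddressPart
    (
        primary_number int          NOT NULL,
        street_name    varchar(100) NOT NULL,
        first_line     varchar(200) NOT NULL
    );

    -- Any range, including awkward ones like 1505-1629, is just a BETWEEN.
    SELECT ap.first_line
      FROM dbo.AddressPart  ap
      JOIN dbo.AddressRange ar
        ON ar.StreetName = ap.street_name
       AND ap.primary_number BETWEEN ar.RangeFrom AND ar.RangeTo;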

Yes, of course that would be ideal. However, I am dealing with data provided by external sources where the data is not split up as you indicate above. Therefore, I am hoping to find a way to achieve the desired goal without having to parse everything out first. This may be a one-off query to begin with, so I am hoping not to have to parse out thousands of addresses...

  • Bill,

I just noticed the URL you posted at the bottom of your post. Let me look into that further...

  • This function will return an int from the beginning of a string.

SELECT dbo.LeadingNumbers('4568 Lancaster Avenue');

SELECT *
  FROM Addresses
 WHERE AddressLine like '%Lancaster%'
   AND dbo.LeadingNumbers(AddressLine) >= 4000
   AND dbo.LeadingNumbers(AddressLine) < 5000
;

IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'dbo.LeadingNumbers') AND type IN (N'FN', N'IF', N'TF', N'FS', N'FT'))
    DROP FUNCTION dbo.LeadingNumbers
GO
create function dbo.LeadingNumbers
(
    @in varchar(100)
)
RETURNS int
AS
BEGIN
    declare
        @out varchar(10),
        @ptr int,
        @len int

    set @len = LEN(@in)
    set @ptr = 1

    while @ptr <= @len
    begin
        if charindex(substring(@in, @ptr, 1),'0123456789') = 0
            goto done
        set @ptr += 1
    end

    done:
    if @ptr = 0
        set @out = ''
    else
        set @out = left(@in, @ptr - 1)

    RETURN cast(@out as int)
END
GO

  • Bill,

    Thanks! This did exactly what I needed. I don't know why I didn't think about using a separate function... Sometimes I try to do too much within a stored proc...

    Tony

  • I have to ask why you'd resort to a UDF (not to mention calling it twice!) when it is quite straightforward to simply strip off the leading number as follows:

WITH Addresses (AddressLine) AS (
    SELECT '1505-1629 Adams St'
    UNION ALL SELECT '12-158 Adams St'
    UNION ALL SELECT '450 Adams St'
    UNION ALL SELECT 'Adams St'
    UNION ALL SELECT '')
SELECT AddressLine
      ,HouseNo=CASE PATINDEX('%[^0123456789]%', AddressLine) WHEN 0 THEN ''
               ELSE LEFT(AddressLine, PATINDEX('%[^0123456789]%', AddressLine)-1) END
  FROM Addresses;
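
    Once the house number has been stripped off like that, matching against the range table from the original post is just a numeric join. A rough sketch along those lines (the Addresses and AddressRanges tables and their columns are assumed here, apart from StreetName, RangeFrom and RangeTo from the first post):

    WITH Parsed AS (
        SELECT a.AddressLine,
               HouseNo = CASE PATINDEX('%[^0123456789]%', a.AddressLine)
                              WHEN 0 THEN NULL
                              ELSE NULLIF(LEFT(a.AddressLine, PATINDEX('%[^0123456789]%', a.AddressLine) - 1), '')
                         END
          FROM Addresses a
    )
    SELECT p.AddressLine, r.StreetName, r.RangeFrom, r.RangeTo
      FROM Parsed p
      JOIN AddressRanges r
        ON p.AddressLine LIKE '%' + r.StreetName + '%'
       AND CAST(p.HouseNo AS int) BETWEEN r.RangeFrom AND r.RangeTo;
       -- Rows with no leading number return a NULL HouseNo and simply fail the BETWEEN test.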


My mantra: No loops! No CURSORs! No RBAR! Hoo-uh!

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?
    Since random numbers are too important to be left to chance, let's generate some!
    Learn to understand recursive CTEs by example.

I'm completely sick of anti-RBAR preachers. They're just getting their ego kicks through trying to one-up everybody. Here is what is happening: code is being run on a Von Neumann architecture. Everything is procedurally processed in loops. Yes, even CTEs are loops - they're just running looping code in a compiled format instead of an interpreted format.

    Advantages of CTEs:

    1) Run faster

    2) Programmers get to display their ability to write one

    3) Works similar to nature in its fractal and recursive style

    Disadvantages of CTEs:

    1) Not reusable compared to functions

    2) Doesn't simplify the code one is actually working on

    3) Must be rewritten as a UDF when more complex logic is added that requires procedural processing

    The post originator specified this was a one-off job. Therefore, extra work in optimization is a complete waste. I gave him a general purpose function. I have hundreds of these that I can quickly put to use in importing weirdly formatted customer data. UDFs are very useful from many points-of-view; CTEs are advantageous from one point of view - production code where performance is top priority.

  • Actually, I've rarely found CTEs to improve performance over similar set-based constructs, e.g., a derived table. All they do is add readability.

    In this case, I wasn't even trying for that. The CTE was simply a convenient place to put some sample data.


My mantra: No loops! No CURSORs! No RBAR! Hoo-uh!

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?
    Since random numbers are too important to be left to chance, let's generate some!
    Learn to understand recursive CTEs by example.

Bill, it's not a personal thing; we all make mistakes. The point of folks posting faster and more efficient code samples is that any code posted here on SSC could be randomly picked up and used. Folks often don't read from the beginning of a thread, nor indeed to the end, hence bad code must be flagged as such as soon as possible, preferably in the very next post, offering a properly written alternative and an explanation. The same applies to advice and statements.

    Folks like Dwain go to considerable lengths to set a high standard for forum code and, whilst there may occasionally be an element of friendly competition between ourselves, it's very rarely directed at the OP. Why bother? How important can it be to encourage coders to think about the cost of their queries? You only have to lurk around SSC for a day or two to find out that it's absolutely vital.

    Your function is awful, isn't it? I can't believe you saved it in your toolkit 😛

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

I still strongly disagree. It is much, much easier to incrementally add customization to a UDF than to throw away a CTE that can no longer be used because of additional requirements. CTEs are often dead ends. Performance is almost never a problem for once-and-done importing. I only waste time on optimization when something is taking a few seconds to run in production.

    CTEs are good for one point of view...optimization.

UDFs are good for hundreds of points of view...maintainability, upgradability, simplicity, usability, diversity.

No thanks! I will continue to use UDFs almost always.

  • Bill Talada (8/22/2013)


I still strongly disagree. It is much, much easier to incrementally add customization to a UDF than to throw away a CTE that can no longer be used because of additional requirements. CTEs are often dead ends. Performance is almost never a problem for once-and-done importing. I only waste time on optimization when something is taking a few seconds to run in production.

    CTEs are good for one point of view...optimization.

UDFs are good for hundreds of points of view...maintainability, upgradability, simplicity, usability, diversity.

No thanks! I will continue to use UDFs almost always.

CTEs are irrelevant to optimisation. Except for recursive CTEs, they're all about code simplification.

    I'll agree with you that UDFs can be very useful indeed - provided they are properly written. There's a simple test: if a UDF contains the keywords BEGIN and END, then it is not.
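
    To illustrate the distinction with invented names: a scalar UDF carries a BEGIN...END body and is invoked once per row, whereas an inline table-valued function is a single RETURN SELECT that the optimizer expands into the calling query, much like a parameterized view.

    -- Scalar UDF: BEGIN/END body, evaluated row by row.
    CREATE FUNCTION dbo.AddOne_Scalar (@i int)
    RETURNS int
    AS
    BEGIN
        RETURN @i + 1;
    END
    GO
    -- Inline TVF: no BEGIN/END, just RETURN SELECT; inlined into the outer query.
    CREATE FUNCTION dbo.AddOne_Inline (@i int)
    RETURNS TABLE
    AS
    RETURN SELECT Result = @i + 1;
    GO
    SELECT dbo.AddOne_Scalar(41);              -- scalar call
    SELECT Result FROM dbo.AddOne_Inline(41);  -- inline call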

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Bill Talada (8/22/2013)


I'm completely sick of anti-RBAR preachers. They're just getting their ego kicks through trying to one-up everybody.

    Since I invented the term “RBAR” and I’m most definitely an “Anti-RBAR Preacher”, I should take personal offense to your very near slanderous choice of words, but I won’t because I know you’re just frustrated.

Here is what is happening: code is being run on a Von Neumann architecture. Everything is procedurally processed in loops. Yes, even CTEs are loops - they're just running looping code in a compiled format instead of an interpreted format.

For someone who claims such knowledge of how the Von Neumann architecture works, I'm absolutely floored that you've completely missed the point where RBAR is concerned. WHILE loops and other forms of RBAR defeat the performance of the very loops that you speak of because SQL Server is, in fact, an interpreted language, not a compiled one. Every single iteration of a WHILE loop is handled separately and must either produce a new Execution Plan or figure out that it can reuse an existing Execution Plan. That means that WHILE loops loop at interpretive speeds rather than the very efficient machine-language-level compiled speeds you speak of.

    Advantages of CTEs:

    1) Run faster

    2) Programmers get to display their ability to write one

    3) Works similar to nature in its fractal and recursive style

    You've totally missed the purpose of the CTE in Dwain’s code. First, CTEs don’t usually run any faster than normal code or even a derived table or View. And the “fractal and recursive style” of a Recursive CTE is actually hidden RBAR that’s frequently worse and usually more resource intensive than even a well written WHILE loop. Second, you really need to slow down and read Dwain’s code. The CTE in his code has absolutely nothing to do with any attempt at a solution. Dwain was simply being a bit lazy and used the CTE to present the test data rather than creating a test table.

    Disadvantages of CTEs:

    1) Not reusable compared to functions

    2) Doesn't simplify the code one is actually working on

    3) Must be rewritten as a UDF when more complex logic is added that requires procedural processing

    While I agree that CTEs don’t provide the reusable encapsulation of a function, they actually can and do greatly simplify code very much in the manner of using an interim Temp Table for “Divide’n’Conquer” methods and by allowing common calculations to be solved once and brought forward through cascading CTEs. But, I digress because, like I said, the CTE in Dwain’s code had absolutely nothing to do with the solution.
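
    For what it's worth, the cascading-CTE style mentioned above might look something like this on the current problem (a sketch only; the Addresses table and its AddressLine column are illustrative):

    WITH Parsed AS (
        -- Find the position of the first non-digit once...
        SELECT AddressLine,
               DigitEnd = PATINDEX('%[^0-9]%', AddressLine)
          FROM Addresses
    ),
    Numbered AS (
        -- ...and reuse it here instead of recalculating it.
        SELECT AddressLine,
               HouseNo = CASE WHEN DigitEnd > 1
                              THEN CAST(LEFT(AddressLine, DigitEnd - 1) AS int)
                         END
          FROM Parsed
    )
    SELECT AddressLine, HouseNo
      FROM Numbered
     WHERE HouseNo BETWEEN 4000 AND 4999;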

    The post originator specified this was a one-off job. Therefore, extra work in optimization is a complete waste.

Why do so many people think that 1) one-off jobs can be sloppy and/or slow and 2) that there would be any "extra work in optimization" for such code? If people actually practiced writing good, high-performance code all the time, it would become second nature and would be no extra effort at all. In fact, highly performant, non-RBAR code is frequently shorter and easier to write than RBAR but it's like any other skill… ya gotta know and ya gotta practice.

    I gave him a general purpose function. I have hundreds of these that I can quickly put to use in importing weirdly formatted customer data. UDFs are very useful from many points-of-view; CTEs are advantageous from one point of view - production code where performance is top priority.

I absolutely agree that having a library of such general purpose functions is critical to RAD for applications and RSD for script development. The problem is that if you use RBAR in all those functions then, with very few exceptions, virtually everything you and everyone else use them for will be quite a bit slower than it should be. People who justify such slowness by saying they're only going to "use them on a small number of rows" need to have their heads examined. There is no justification for any code that runs even as little as 2 times slower, even if the difference per row is measured in microseconds, especially since the faster code is usually easier and quicker to write and implement.

I'm currently going through such throes at work with such code. In fact, the problem the OP posted is very similar in functionality to some of the code I've had to repair for performance, so let's stick with that problem for this example.

    First, let’s build some test data. I have well over a million rows that I have to worry about at work but a million rows will do nicely here. I’m using a GUID as an address line because it has a mix of characters much like an address line would and some of those GUIDs have leading digits. Here’s the code to build the test data as well as the RBAR function you posted and a possible alternative that I threw together (I’m sure that someone could probably optimize it a bit but I didn’t want to spend much time on it. Sound familiar?).

--=====================================================================================================================
    -- Create and populate a test table to measure performance and resource usage with.
    -- We'll start off without an index.
    -- The test table contains a single column of GUIDs converted to strings to simulate a street address with
    -- possible leading digits.
    --=====================================================================================================================
    --===== Conditionally drop the test table to make reruns in SSMS easier
    IF OBJECT_ID('tempdb..#Addresses','U') IS NOT NULL
        DROP TABLE #Addresses
    ;
    GO
    --===== Create a million row test table of simulated address lines
    SELECT TOP 1000000
           AddressLine = CAST(NEWID() AS VARCHAR(36))
      INTO #Addresses
      FROM master.sys.all_columns ac1
     CROSS JOIN master.sys.all_columns ac2
    ;
    GO
    --===== Since we're going to be dropping objects in this test script, do this in a nice safe place that everyone has.
    USE tempdb
    ;
    GO
    --=====================================================================================================================
    -- Conditionally drop and recreate Talada's function just as he posted it.
    --=====================================================================================================================
    IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'dbo.LeadingNumbers') AND type IN (N'FN', N'IF', N'TF', N'FS', N'FT'))
        DROP FUNCTION dbo.LeadingNumbers
    GO
    create function dbo.LeadingNumbers
    (
        @in varchar(100)
    )
    RETURNS int
    AS
    BEGIN
        declare
            @out varchar(10),
            @ptr int,
            @len int

        set @len = LEN(@in)
        set @ptr = 1

        while @ptr <= @len
        begin
            if charindex(substring(@in, @ptr, 1),'0123456789') = 0
                goto done
            set @ptr += 1
        end

        done:
        if @ptr = 0
            set @out = ''
        else
            set @out = left(@in, @ptr - 1)

        RETURN cast(@out as int)
    END
    GO
    --=====================================================================================================================
    -- Conditionally drop and build a function that does the same thing without any RBAR.
    --=====================================================================================================================
    --====== Conditionally drop the new function to make reruns easier in SSMS
    IF OBJECT_ID('tempdb.dbo.LeadingBigInt','IF') IS NOT NULL
        DROP FUNCTION dbo.LeadingBigInt
    ;
    GO
    CREATE FUNCTION dbo.LeadingBigInt
    /**********************************************************************************************************************
     Purpose:
     Given a string of characters, returns the leading digits as an Integer or returns a null if there are no leading
     digits. Note that this will return the entire string as a BIGINT if the string is all digits and the leading digits
     will fit in the BIGINT datatype. Will cause an error if the BIGINT overruns.
     Note that this function will NOT detect or return a negative BIGINT.
     Usage:
    --=====
     SELECT ca.LeadingBigInt
       FROM dbo.SomeTable st
      CROSS APPLY dbo.LeadingBigInt(st.SomeStringColumn) ca
      WHERE
     Revision History:
     Rev 00 - 25 Oct 2013 - Jeff Moden
            - Initial Creation
    **********************************************************************************************************************/
    --===== Declare the I/O for this function
            (@pString VARCHAR(8000))
    RETURNS TABLE WITH SCHEMABINDING AS
    RETURN
    SELECT LeadingBigInt =
           CASE
           WHEN @pString LIKE '[0-9]%'
           THEN CAST(SUBSTRING(@pString,1,ISNULL(NULLIF(PATINDEX('%[^0-9]%',@pString),0)-1,8000)) AS BIGINT)
           ELSE NULL
           END
    ;
    GO

I couldn't test your function for performance simply by doing a SET STATISTICS, because that would have made your function look a whole lot slower than it actually is. There's a link in the code below if you'd like to learn more about that. So, I turned on a batch-oriented SQL Profiler trace for the following runs. Each piece of code runs three consecutive times to see what might happen due to caching or not.

--=====================================================================================================================
    -- Run the tests. Note that we CANNOT use SET STATISTICS for this test because TALADA's function is a scalar
    -- function and SET STATISTICS doesn't measure CPU or Duration correctly for Scalar UDFs.
    -- REF: http://www.sqlservercentral.com/articles/T-SQL/91724/
    --=====================================================================================================================
    GO
    --===== Talada's RBAR Scalar Function =================================================================================
    SELECT AddressLine
      FROM #Addresses addr
     WHERE dbo.LeadingNumbers(AddressLine) >= 4000
       AND dbo.LeadingNumbers(AddressLine) < 5000
    ;
    GO 3
    --===== The iTVF Function =============================================================================================
    SELECT AddressLine, ca.LeadingBigInt
      FROM #Addresses addr
     CROSS APPLY dbo.LeadingBigInt(AddressLine) ca
     WHERE ca.LeadingBigInt >= 4000
       AND ca.LeadingBigInt < 5000
    ;
    GO 3
    --
    GO
    --===== Talada's RBAR Scalar Function with extra isolation ============================================================
    SELECT AddressLine
      FROM #Addresses addr
     WHERE AddressLine LIKE '[0-9]%'
       AND dbo.LeadingNumbers(AddressLine) >= 4000
       AND dbo.LeadingNumbers(AddressLine) < 5000
    ;
    GO 3
    --===== The iTVF Function with extra isolation ========================================================================
    SELECT AddressLine, ca.LeadingBigInt
      FROM #Addresses addr
     CROSS APPLY dbo.LeadingBigInt(AddressLine) ca
     WHERE AddressLine LIKE '[0-9]%'
       AND ca.LeadingBigInt >= 4000
       AND ca.LeadingBigInt < 5000
    ;
    GO 3

    Here are the results from the SQL Profiler run…

Also notice that I return the leading integer for possible inclusion in a Temp Table or a join while still only calling the function once. Yours would have to make 3 calls.

To summarize, the RBAR code you wrote used roughly between 4 and 7 times more CPU and took 10 to 15 times longer to run. Yeah, I know what's next. Someone is going to say that a difference of 5.2 to 7.8 seconds on a million rows doesn't amount to a hill of beans. Nothing could be further from the truth. The code at work similar to the RBAR function on this thread runs 40,000 times in 8 hours. Do the math… the non-RBAR, non-optimized function in this thread takes a total of about 13.2 CPU hours to execute. The RBAR function takes a total of almost 94.5 CPU hours.

    Now, just imagine making everything on your machine 6-7 times faster like that. And it’s so easy if you just say “NO” to RBAR and allow the Von Neumann architecture in your computer to actually do its job. 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
