Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 
        
Home       Members    Calendar    Who's On


Add to briefcase ««123»»

substring comparison for last 2 characters Expand / Collapse
Author
Message
Posted Tuesday, July 2, 2013 6:39 PM


Hall of Fame

Hall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of Fame

Group: General Forum Members
Last Login: Today @ 4:42 AM
Points: 3,440, Visits: 5,398
erikd (7/2/2013)
I use the following SVF to extract integers only from phone number columns. So, you would run the following and it would return only the numbers from your patid column.

select dbo.ExtractInteger(patid) as  [patid]
from table


USE [Sample]
GO
/****** Object: UserDefinedFunction [dbo].[ExtractInteger] Script Date: 07/02/2013 11:17:35 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE FUNCTION [dbo].[ExtractInteger](@String VARCHAR(2000))
RETURNS VARCHAR(1000)
AS
BEGIN
DECLARE @Count INT
DECLARE @IntNumbers VARCHAR(1000)
SET @Count = 0
SET @IntNumbers = ''

WHILE @Count <= LEN(@String)
BEGIN
IF SUBSTRING(@String,@Count,1) >= '0'
AND SUBSTRING(@String,@Count,1) <= '9'
BEGIN
SET @IntNumbers = @IntNumbers + SUBSTRING(@String,@Count,1)
END
SET @Count = @Count + 1
END

RETURN @IntNumbers
END



SVFs are notoriouly poor for performance, even when measured correctly.

Initially I'd do it like this. Until that is I had the time to go back and determine the fastest method.

WITH Phones (phone) AS (
SELECT '66814382803' UNION ALL SELECT '123A123456789'
UNION ALL SELECT '0814382803'
)
SELECT Item
FROM Phones
CROSS APPLY dbo.PatternSplitCM(phone, '[0-9]')
WHERE [Matched] = 1 AND LEN(Item) BETWEEN 7 AND 11


I am curious what your function would return for that second phone number in my list.

PatternSplitCM is a iTVF that can be found in the 4th article in my signature links. Since it is a general purpose function, much faster ways could probably be devised that would get you to the same result.

Note that PatternSplitCM would also work for the OP's case.



My mantra: No loops! No CURSORs! No RBAR! Hoo-uh!

My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.


Need to UNPIVOT? Why not CROSS APPLY VALUES instead?
Since random numbers are too important to be left to chance, let's generate some!
Learn to understand recursive CTEs by example.
Splitting strings based on patterns can be fast!
Post #1469809
Posted Tuesday, July 2, 2013 6:51 PM


SSC-Addicted

SSC-AddictedSSC-AddictedSSC-AddictedSSC-AddictedSSC-AddictedSSC-AddictedSSC-AddictedSSC-Addicted

Group: General Forum Members
Last Login: Yesterday @ 2:24 PM
Points: 492, Visits: 2,041
dwain.c (7/2/2013)
erikd (7/2/2013)
I use the following SVF to extract integers only from phone number columns. So, you would run the following and it would return only the numbers from your patid column.

select dbo.ExtractInteger(patid) as  [patid]
from table


USE [Sample]
GO
/****** Object: UserDefinedFunction [dbo].[ExtractInteger] Script Date: 07/02/2013 11:17:35 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE FUNCTION [dbo].[ExtractInteger](@String VARCHAR(2000))
RETURNS VARCHAR(1000)
AS
BEGIN
DECLARE @Count INT
DECLARE @IntNumbers VARCHAR(1000)
SET @Count = 0
SET @IntNumbers = ''

WHILE @Count <= LEN(@String)
BEGIN
IF SUBSTRING(@String,@Count,1) >= '0'
AND SUBSTRING(@String,@Count,1) <= '9'
BEGIN
SET @IntNumbers = @IntNumbers + SUBSTRING(@String,@Count,1)
END
SET @Count = @Count + 1
END

RETURN @IntNumbers
END



SVFs are notoriouly poor for performance, even when measured correctly.

Initially I'd do it like this. Until that is I had the time to go back and determine the fastest method.

WITH Phones (phone) AS (
SELECT '66814382803' UNION ALL SELECT '123A123456789'
UNION ALL SELECT '0814382803'
)
SELECT Item
FROM Phones
CROSS APPLY dbo.PatternSplitCM(phone, '[0-9]')
WHERE [Matched] = 1 AND LEN(Item) BETWEEN 7 AND 11


I am curious what your function would return for that second phone number in my list.

PatternSplitCM is a iTVF that can be found in the 4th article in my signature links. Since it is a general purpose function, much faster ways could probably be devised that would get you to the same result.

Note that PatternSplitCM would also work for the OP's case.


Hi Dwain,

I work with a lot of tables, but none of them are very large. The largest one I currently deal with is for a monthly tracker that is recently up to around 300k rows, so I don't notice a lot (if any) performance increases that I see people talk about. So, for my purposes, the SVF works well.

Here's what it returns from the second number you posted:

select dbo.extractinteger('123A123456789') as [For Dwain]


For Dwain
123123456789


I have your string splitter link open in a new tab. Will read and get back to you when I've tried it out a bit.
Post #1469811
Posted Tuesday, July 2, 2013 7:18 PM


Hall of Fame

Hall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of Fame

Group: General Forum Members
Last Login: Today @ 4:42 AM
Points: 3,440, Visits: 5,398
erikd (7/2/2013)

I work with a lot of tables, but none of them are very large.


Please don't get me wrong when I say this. That is today. Once your application is successful beyond your wildest dreams and your tables start having many millions of rows, you'll probably be forced to rethink that statement. Performance should rarely be measured as "good enough," mainly because by the time it becomes "not good enough" it will be a serious challenge to make it good enough once again.

erikd (7/2/2013)

Here's what it returns from the second number you posted:

select dbo.extractinteger('123A123456789') as [For Dwain]


For Dwain
123123456789


I have your string splitter link open in a new tab. Will read and get back to you when I've tried it out a bit.


I knew the answer but what I was wondering is does that seem like the right thing to do? For example suppose the string was 123456789x1231 where x denotes an extension.

I hope you find value in the article. Like I said, I consider PatternSplitCM to be sort of a stopgap even though for the task it does it performs reasonably well. Most of the time, you'll find you can construct a more direct approach to a particular problem that will run faster.

PatternSplitCM will get you to working code faster but with a little more effort you can get your code working faster.

Did I just say that?



My mantra: No loops! No CURSORs! No RBAR! Hoo-uh!

My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.


Need to UNPIVOT? Why not CROSS APPLY VALUES instead?
Since random numbers are too important to be left to chance, let's generate some!
Learn to understand recursive CTEs by example.
Splitting strings based on patterns can be fast!
Post #1469817
Posted Tuesday, July 2, 2013 7:48 PM


SSC-Addicted

SSC-AddictedSSC-AddictedSSC-AddictedSSC-AddictedSSC-AddictedSSC-AddictedSSC-AddictedSSC-Addicted

Group: General Forum Members
Last Login: Yesterday @ 2:24 PM
Points: 492, Visits: 2,041
dwain.c (7/2/2013)
erikd (7/2/2013)

I work with a lot of tables, but none of them are very large.


Please don't get me wrong when I say this. That is today. Once your application is successful beyond your wildest dreams and your tables start having many millions of rows, you'll probably be forced to rethink that statement. Performance should rarely be measured as "good enough," mainly because by the time it becomes "not good enough" it will be a serious challenge to make it good enough once again.

erikd (7/2/2013)

Here's what it returns from the second number you posted:

select dbo.extractinteger('123A123456789') as [For Dwain]


For Dwain
123123456789


I have your string splitter link open in a new tab. Will read and get back to you when I've tried it out a bit.


I knew the answer but what I was wondering is does that seem like the right thing to do? For example suppose the string was 123456789x1231 where x denotes an extension.

I hope you find value in the article. Like I said, I consider PatternSplitCM to be sort of a stopgap even though for the task it does it performs reasonably well. Most of the time, you'll find you can construct a more direct approach to a particular problem that will run faster.

PatternSplitCM will get you to working code faster but with a little more effort you can get your code working faster.

Did I just say that?


No, you're right. Good enough isn't good enough. What I was getting at is that I don't have a table large enough to test good enough against better or best. I don't even have a particularly good testing environment (a SQL 2005 instance on a rather ancient and rickety server that half a dozen people RDP into at a time). But, a guy can dream, when he's done dreaming about Johnnie Walker Blue and rare steaks.

Phone numbers with extensions are screened out, because they can't be auto dialed, so it's GENERALLY not an issue, though it's caused a few headaches when clients send over their contact lists with everything plopped into one column.

I am going to try to thumb wrestle your pattern splitter to do what mine does and just return the cleaned result.

Post #1469818
Posted Tuesday, July 2, 2013 8:07 PM


Hall of Fame

Hall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of Fame

Group: General Forum Members
Last Login: Today @ 4:42 AM
Points: 3,440, Visits: 5,398
Erik - Just because your Prod system doesn't have a million rows to test against, doesn't mean you can't test against a million rows!

The test harness in the article demonstrates the generic approach for this (which I do not take credit for by the way).

This would directly compare PatternSplitCM against your SVF.

CREATE TABLE #Phones (phone VARCHAR(12));

WITH Tally (n) AS (
SELECT TOP 1000000 ROW_NUMBER() OVER (ORDER BY (SELECT 1))
FROM sys.all_columns a CROSS JOIN sys.all_columns b)
INSERT INTO #Phones
SELECT 1000000+ABS(CHECKSUM(NEWID()))%1000000
FROM Tally;

DECLARE @Phone VARCHAR(12), @StartDT DATETIME;

PRINT 'PatternSplitCM';
SELECT @StartDT = GETDATE();
SET STATISTICS TIME ON;
SELECT @Phone=Item
FROM #Phones
CROSS APPLY dbo.PatternSplitCM(phone, '[0-9]')
WHERE [Matched]=1;
SET STATISTICS TIME OFF;
SELECT ElapsedMS=DATEDIFF(millisecond, @StartDT, GETDATE());

PRINT 'extractinteger';
SELECT @StartDT = GETDATE();
SET STATISTICS TIME ON;
SELECT @Phone=dbo.extractinteger(phone)
FROM #Phones;
SET STATISTICS TIME OFF;
SELECT ElapsedMS=DATEDIFF(millisecond, @StartDT, GETDATE());

GO
DROP TABLE #Phones;

Edit: I modified it to SELECT the ElapsedMS of each run, which is needed due to a bias that is introduced by applying STATISTICS TIME to an SVF as demonstrated here: http://www.sqlservercentral.com/articles/T-SQL/91724/



My mantra: No loops! No CURSORs! No RBAR! Hoo-uh!

My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.


Need to UNPIVOT? Why not CROSS APPLY VALUES instead?
Since random numbers are too important to be left to chance, let's generate some!
Learn to understand recursive CTEs by example.
Splitting strings based on patterns can be fast!
Post #1469821
Posted Tuesday, July 2, 2013 8:41 PM


SSC-Addicted

SSC-AddictedSSC-AddictedSSC-AddictedSSC-AddictedSSC-AddictedSSC-AddictedSSC-AddictedSSC-Addicted

Group: General Forum Members
Last Login: Yesterday @ 2:24 PM
Points: 492, Visits: 2,041
dwain.c (7/2/2013)
Erik - Just because your Prod system doesn't have a million rows to test against, doesn't mean you can't test against a million rows!

The test harness in the article demonstrates the generic approach for this (which I do not take credit for by the way).

This would directly compare PatternSplitCM against your SVF.

CREATE TABLE #Phones (phone VARCHAR(12));

WITH Tally (n) AS (
SELECT TOP 1000000 ROW_NUMBER() OVER (ORDER BY (SELECT 1))
FROM sys.all_columns a CROSS JOIN sys.all_columns b)
INSERT INTO #Phones
SELECT 1000000+ABS(CHECKSUM(NEWID()))%1000000
FROM Tally;

DECLARE @Phone VARCHAR(12), @StartDT DATETIME;

PRINT 'PatternSplitCM';
SELECT @StartDT = GETDATE();
SET STATISTICS TIME ON;
SELECT @Phone=Item
FROM #Phones
CROSS APPLY dbo.PatternSplitCM(phone, '[0-9]')
WHERE [Matched]=1;
SET STATISTICS TIME OFF;
SELECT ElapsedMS=DATEDIFF(millisecond, @StartDT, GETDATE());

PRINT 'extractinteger';
SELECT @StartDT = GETDATE();
SET STATISTICS TIME ON;
SELECT @Phone=dbo.extractinteger(phone)
FROM #Phones;
SET STATISTICS TIME OFF;
SELECT ElapsedMS=DATEDIFF(millisecond, @StartDT, GETDATE());

GO
DROP TABLE #Phones;

Edit: I modified it to SELECT the ElapsedMS of each run, which is needed due to a bias that is introduced by applying STATISTICS TIME to an SVF as demonstrated here: http://www.sqlservercentral.com/articles/T-SQL/91724/


Hi Dwain,

I'm having a little trouble getting your splitter to parse. After changing the way it sets local variables to work with 2k5, I get these errors:


Msg 156, Level 15, State 1, Procedure PatternSplitLoop, Line 42
Incorrect syntax near the keyword 'WHEN'.
Msg 102, Level 15, State 1, Procedure PatternSplitLoop, Line 54
Incorrect syntax near 'a'.


In the mean time, I did run your performance test on my SVF alone:

ElapsedMS
401603

But I have no idea if that's good or bad.
Post #1469823
Posted Tuesday, July 2, 2013 9:11 PM


Hall of Fame

Hall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of Fame

Group: General Forum Members
Last Login: Today @ 4:42 AM
Points: 3,440, Visits: 5,398
You're going to have a bit of a problem if you're using SQL 2005.

You don't need to worry about PatternSplitLoop as it is not the fastest of the methods. Use PatternSplitCM instead.

PatternSplitCM has one construct that is not compatible with SQL 2005. That is the method of generating the Tally table.

    WITH numbers AS (
SELECT TOP(ISNULL(DATALENGTH(@List), 0))
n = ROW_NUMBER() OVER(ORDER BY (SELECT NULL))
FROM
(VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) d (n),
(VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) e (n),
(VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) f (n),
(VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) g (n))

You'll need to change this to a Ben-gan style in-line tally table (using cascading CTEs). Like this:

WITH Nbrs_3(n) AS (SELECT 1 UNION SELECT 0)
,Nbrs_2 (n) AS (SELECT 1 FROM Nbrs_3 n1 CROSS JOIN Nbrs_3 n2)
,Nbrs_1 (n) AS (SELECT 1 FROM Nbrs_2 n1 CROSS JOIN Nbrs_2 n2)
,Nbrs_0 (n) AS (SELECT 1 FROM Nbrs_1 n1 CROSS JOIN Nbrs_1 n2)
,Nbrs (n) AS (SELECT 1 FROM Nbrs_0 n1 CROSS JOIN Nbrs_0 n2)
,numbers (n) AS (SELECT TOP(ISNULL(DATALENGTH(@List), 0))
ROW_NUMBER() OVER (ORDER BY n) As n FROM Nbrs)


I think a direct replacement of what I've given you into the body of PatternSplitCM should make it compilable on SQL 2005 (I hope).



My mantra: No loops! No CURSORs! No RBAR! Hoo-uh!

My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.


Need to UNPIVOT? Why not CROSS APPLY VALUES instead?
Since random numbers are too important to be left to chance, let's generate some!
Learn to understand recursive CTEs by example.
Splitting strings based on patterns can be fast!
Post #1469827
Posted Tuesday, July 2, 2013 9:13 PM


SSC-Dedicated

SSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-Dedicated

Group: General Forum Members
Last Login: Today @ 6:23 AM
Points: 35,772, Visits: 32,445
@timscronin,

From your initial and follow-up descriptions, it looks like Scott's solution works. Are you all set now or do you still need some help?


--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #1469828
Posted Tuesday, July 2, 2013 9:14 PM


Hall of Fame

Hall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of Fame

Group: General Forum Members
Last Login: Today @ 4:42 AM
Points: 3,440, Visits: 5,398
Oh and BTW, I was getting times of around 65000 ms (for PatternSplitCM) on my clocked down laptop against the 1,000,000 rows of random phone numbers I was generating in the test harness I gave you.

While not directly comparable due to different machines, once you get PatternSplitCM working, I think you should see quite a difference.



My mantra: No loops! No CURSORs! No RBAR! Hoo-uh!

My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.


Need to UNPIVOT? Why not CROSS APPLY VALUES instead?
Since random numbers are too important to be left to chance, let's generate some!
Learn to understand recursive CTEs by example.
Splitting strings based on patterns can be fast!
Post #1469829
Posted Tuesday, July 2, 2013 9:16 PM


Hall of Fame

Hall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of Fame

Group: General Forum Members
Last Login: Today @ 4:42 AM
Points: 3,440, Visits: 5,398
Jeff Moden (7/2/2013)
@timscronin,

From your initial and follow-up descriptions, it looks like Scott's solution works. Are you all set now or do you still need some help?


Good idea there to check back with the OP. Erik and I have hijacked this thread on a bit of a tangent.



My mantra: No loops! No CURSORs! No RBAR! Hoo-uh!

My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.


Need to UNPIVOT? Why not CROSS APPLY VALUES instead?
Since random numbers are too important to be left to chance, let's generate some!
Learn to understand recursive CTEs by example.
Splitting strings based on patterns can be fast!
Post #1469830
« Prev Topic | Next Topic »

Add to briefcase ««123»»

Permissions Expand / Collapse