substring comparison for last 2 characters

Question

substring comparison for last 2 characters

timscronin

SSCertifiable

Points: 6851
More actions
July 1, 2013 at 8:58 am

#395067

I have a field with data like below I need to join on
patid
123453
124344ME
323390
3233MS
I need to remove the last 2 charcters if they are not numeric. So my results for the join should be
patid
123453
124344
323390
3233

Viewing 15 posts - 1 through 15 (of 21 total)

You must be logged in to reply to this topic. Login to reply

Cadavre SSC-Forever Points: 41690 More actions · Answer 1

timscronin (7/1/2013)
I have a field with data like below I need to join on
patid
123453
124344ME
323390
3233MS
I need to remove the last 2 charcters if they are not numeric. So my results for the join should be
patid
123453
124344
323390
3233

What would you do with "1234C5" and "12345C" ??

Forever trying to learn
My blog - http://www.cadavre.co.uk/
For better, quicker answers on T-SQL questions, click on the following...http://www.sqlservercentral.com/articles/Best+Practices/61537/
For better, quicker answers on SQL Server performance related questions, click on the following...http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

timscronin SSCertifiable Points: 6851 More actions · Answer 2

wont happen, the records with the 2 characters at the end are state abbreviations added on

Luis Cazares SSC Guru Points: 183698 More actions · Answer 3

A possible solution would be like this. I'm worried about performance though. Couldn't you correct the data in the first place?

CREATE TABLE #Table1( patid varchar(10))

CREATE TABLE #Table2( patid varchar(10))

INSERT INTO #Table1 VALUES('123453'),('124344ME'),('323390'),('3233MS')

INSERT INTO #Table2 VALUES('123453'),('124344'),('323390'),('3233')

SELECT *

FROM #Table1 a

JOIN #Table2 b ON CASE WHEN a.patid NOT LIKE '%[^0-9]%' THEN a.patid ELSE LEFT(a.patid, LEN(a.patid) - 2) END = b.patid

DROP TABLE #Table1

DROP TABLE #Table2

Luis C.
General Disclaimer:
Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

How to post data/code on a forum to get the best help: Option 1 / Option 2

Adi Cohn-120898 SSC-Dedicated Points: 33944 More actions · Answer 4

Here is one more way:

create table Demo (Col1 varchar(10))

go

insert into Demo(Col1) values ('123453'),('124344ME'),('123453'),('124344ME'),('323390'),('3233MS')

select Col1, substring(Col1,1, CASE WHEN Col1 like '%[aA-zZ]' THEN len(Col1)-2 else len(Col1) end)

from Demo

--------------------------------------------------------------
To know how to ask questions and increase the chances of getting asnwers:
http://www.sqlservercentral.com/articles/Best+Practices/61537/

For better answers on performance questions, click on the following...
http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

timscronin SSCertifiable Points: 6851 More actions · Answer 5

problem is with data like this (my bad should have included it)

1233IL

22M33

Would exclude 2nd record

Steven Willis SSCrazy Eights Points: 9893 More actions · Answer 6

Create some sample data:

IF OBJECT_ID('tempdb..#SampleData') IS NOT NULL

DROP TABLE #SampleData

CREATE TABLE #SampleData (

[ID] INT IDENTITY(1,1) NOT NULL,

[strVal] NVARCHAR(50) NULL,

PRIMARY KEY (ID))

INSERT INTO #SampleData

VALUES

('123453ME'),('124344'),('323390'),('32339826455MS'),

('345678'),('39WA'),('986545'),('1234AK'),

('345XX8'),('398644GA'),('F986545'),('98234AZ')

Now query the data and strip off the unwanted suffixes

as well as eliminating any entries with an embedded alpha char.

SELECT

r.strVal

FROM

(

SELECT

(CASE

WHEN PATINDEX('%[A-Za-z]%',RIGHT(strVal,2)) > 0

THEN REPLACE(strVal,RIGHT(strVal,2),'')

ELSE strVal

END) AS strVal

FROM

#SampleData AS sd

) r

WHERE

PATINDEX('%[A-Za-z]%',strVal) = 0

ScottPletcher SSC Guru Points: 100949 More actions · Answer 7

SELECT

mt.##col1##, ..., jt.##co11##, ...

FROM dbo.maintable mt

LEFT OUTER JOIN jointable jt ON

--ignore last two bytes of patid if they are alpha, because it's a state abbrev, not part of key value

jt.key = LEFT(mt.patid, LEN(mt.patid) - CASE WHEN RIGHT(mt.patid, 2) LIKE '[a-z][a-z]' THEN 2 ELSE 0 END)

Edit: Added sql code tags to colorize code.

SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

Jeff Moden SSC Guru Points: 1003867 More actions · Answer 8

timscronin (7/1/2013)
wont happen, the records with the 2 characters at the end are state abbreviations added on

Then it's easy... Build a "State" table (you should have one anyway) and the your WHERE clause would look like the following...

WHERE RIGHT(PatID,2) IN (SELECT StateAbbv FROM dbo.State)

Be advised that no matter what you do, performance is going to stink because of the PatID column being wrapped in a function. What I'd recommend doing if you need to do this a lot is to make the RIGHT(Pat,2) a PERSISTED calculated column in the table and index it.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

sqldriver SSChampion Points: 10440 More actions · Answer 9

I use the following SVF to extract integers only from phone number columns. So, you would run the following and it would return only the numbers from your patid column.

select dbo.ExtractInteger(patid) as [patid]

from table

USE [Sample]

GO

/****** Object: UserDefinedFunction [dbo].[ExtractInteger] Script Date: 07/02/2013 11:17:35 ******/

SET ANSI_NULLS ON

GO

SET QUOTED_IDENTIFIER ON

GO

CREATE FUNCTION [dbo].[ExtractInteger](@String VARCHAR(2000))

RETURNS VARCHAR(1000)

AS

BEGIN

DECLARE @Count INT

DECLARE @IntNumbers VARCHAR(1000)

SET @Count = 0

SET @IntNumbers = ''

WHILE @Count <= LEN(@String)

BEGIN

IF SUBSTRING(@String,@Count,1) >= '0'

AND SUBSTRING(@String,@Count,1) <= '9'

BEGIN

SET @IntNumbers = @IntNumbers + SUBSTRING(@String,@Count,1)

END

SET @Count = @Count + 1

END

RETURN @IntNumbers

END

Dwain Camps SSC Guru Points: 86908 More actions · Answer 10

erikd (7/2/2013)
I use the following SVF to extract integers only from phone number columns. So, you would run the following and it would return only the numbers from your patid column.
select dbo.ExtractInteger(patid) as [patid]
from table
USE [Sample]
GO
/****** Object: UserDefinedFunction [dbo].[ExtractInteger] Script Date: 07/02/2013 11:17:35 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE FUNCTION [dbo].[ExtractInteger](@String VARCHAR(2000))
RETURNS VARCHAR(1000)
AS
BEGIN
DECLARE @Count INT
DECLARE @IntNumbers VARCHAR(1000)
SET @Count = 0
SET @IntNumbers = ''
WHILE @Count <= LEN(@String)
BEGIN
IF SUBSTRING(@String,@Count,1) >= '0'
AND SUBSTRING(@String,@Count,1) <= '9'
BEGIN
SET @IntNumbers = @IntNumbers + SUBSTRING(@String,@Count,1)
END
SET @Count = @Count + 1
END
RETURN @IntNumbers
END

SVFs are notoriouly poor for performance, even when measured correctly.

Initially I'd do it like this. Until that is I had the time to go back and determine the fastest method.

WITH Phones (phone) AS (

SELECT '66814382803' UNION ALL SELECT '123A123456789'

UNION ALL SELECT '0814382803'

)

SELECT Item

FROM Phones

CROSS APPLY dbo.PatternSplitCM(phone, '[0-9]')

WHERE [Matched] = 1 AND LEN(Item) BETWEEN 7 AND 11

I am curious what your function would return for that second phone number in my list.

PatternSplitCM is a iTVF that can be found in the 4th article in my signature links. Since it is a general purpose function, much faster ways could probably be devised that would get you to the same result.

Note that PatternSplitCM would also work for the OP's case.

My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]
My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St

sqldriver SSChampion Points: 10440 More actions · Answer 11

dwain.c (7/2/2013)
erikd (7/2/2013)
I use the following SVF to extract integers only from phone number columns. So, you would run the following and it would return only the numbers from your patid column.
select dbo.ExtractInteger(patid) as [patid]
from table
USE [Sample]
GO
/****** Object: UserDefinedFunction [dbo].[ExtractInteger] Script Date: 07/02/2013 11:17:35 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE FUNCTION [dbo].[ExtractInteger](@String VARCHAR(2000))
RETURNS VARCHAR(1000)
AS
BEGIN
DECLARE @Count INT
DECLARE @IntNumbers VARCHAR(1000)
SET @Count = 0
SET @IntNumbers = ''
WHILE @Count <= LEN(@String)
BEGIN
IF SUBSTRING(@String,@Count,1) >= '0'
AND SUBSTRING(@String,@Count,1) <= '9'
BEGIN
SET @IntNumbers = @IntNumbers + SUBSTRING(@String,@Count,1)
END
SET @Count = @Count + 1
END
RETURN @IntNumbers
END
SVFs are notoriouly poor for performance, even when measured correctly.
Initially I'd do it like this. Until that is I had the time to go back and determine the fastest method.
WITH Phones (phone) AS (
SELECT '66814382803' UNION ALL SELECT '123A123456789'
UNION ALL SELECT '0814382803'
)
SELECT Item
FROM Phones
CROSS APPLY dbo.PatternSplitCM(phone, '[0-9]')
WHERE [Matched] = 1 AND LEN(Item) BETWEEN 7 AND 11
I am curious what your function would return for that second phone number in my list.
PatternSplitCM is a iTVF that can be found in the 4th article in my signature links. Since it is a general purpose function, much faster ways could probably be devised that would get you to the same result.
Note that PatternSplitCM would also work for the OP's case.

Hi Dwain,

I work with a lot of tables, but none of them are very large. The largest one I currently deal with is for a monthly tracker that is recently up to around 300k rows, so I don't notice a lot (if any) performance increases that I see people talk about. So, for my purposes, the SVF works well.

Here's what it returns from the second number you posted:

select dbo.extractinteger('123A123456789') as [For Dwain]

For Dwain
123123456789

I have your string splitter link open in a new tab. Will read and get back to you when I've tried it out a bit.

Dwain Camps SSC Guru Points: 86908 More actions · Answer 12

erikd (7/2/2013)
I work with a lot of tables, but none of them are very large.

Please don't get me wrong when I say this. That is today. Once your application is successful beyond your wildest dreams and your tables start having many millions of rows, you'll probably be forced to rethink that statement. Performance should rarely be measured as "good enough," mainly because by the time it becomes "not good enough" it will be a serious challenge to make it good enough once again.

erikd (7/2/2013)
Here's what it returns from the second number you posted:
select dbo.extractinteger('123A123456789') as [For Dwain]
For Dwain
123123456789
I have your string splitter link open in a new tab. Will read and get back to you when I've tried it out a bit.

I knew the answer but what I was wondering is does that seem like the right thing to do? For example suppose the string was 123456789x1231 where x denotes an extension.

I hope you find value in the article. Like I said, I consider PatternSplitCM to be sort of a stopgap even though for the task it does it performs reasonably well. Most of the time, you'll find you can construct a more direct approach to a particular problem that will run faster.

PatternSplitCM will get you to working code faster but with a little more effort you can get your code working faster.

Did I just say that? 😉

My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]
My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St

sqldriver SSChampion Points: 10440 More actions · Answer 13

dwain.c (7/2/2013)
erikd (7/2/2013)
I work with a lot of tables, but none of them are very large.
Please don't get me wrong when I say this. That is today. Once your application is successful beyond your wildest dreams and your tables start having many millions of rows, you'll probably be forced to rethink that statement. Performance should rarely be measured as "good enough," mainly because by the time it becomes "not good enough" it will be a serious challenge to make it good enough once again.
erikd (7/2/2013)
Here's what it returns from the second number you posted:
select dbo.extractinteger('123A123456789') as [For Dwain]
For Dwain
123123456789
I have your string splitter link open in a new tab. Will read and get back to you when I've tried it out a bit.
I knew the answer but what I was wondering is does that seem like the right thing to do? For example suppose the string was 123456789x1231 where x denotes an extension.
I hope you find value in the article. Like I said, I consider PatternSplitCM to be sort of a stopgap even though for the task it does it performs reasonably well. Most of the time, you'll find you can construct a more direct approach to a particular problem that will run faster.
PatternSplitCM will get you to working code faster but with a little more effort you can get your code working faster.
Did I just say that? 😉

No, you're right. Good enough isn't good enough. What I was getting at is that I don't have a table large enough to test good enough against better or best. I don't even have a particularly good testing environment (a SQL 2005 instance on a rather ancient and rickety server that half a dozen people RDP into at a time). But, a guy can dream, when he's done dreaming about Johnnie Walker Blue and rare steaks.

Phone numbers with extensions are screened out, because they can't be auto dialed, so it's GENERALLY not an issue, though it's caused a few headaches when clients send over their contact lists with everything plopped into one column.

I am going to try to thumb wrestle your pattern splitter to do what mine does and just return the cleaned result.

Dwain Camps SSC Guru Points: 86908 More actions · Answer 14

Erik - Just because your Prod system doesn't have a million rows to test against, doesn't mean you can't test against a million rows!

The test harness in the article demonstrates the generic approach for this (which I do not take credit for by the way).

This would directly compare PatternSplitCM against your SVF.

CREATE TABLE #Phones (phone VARCHAR(12));

WITH Tally (n) AS (

SELECT TOP 1000000 ROW_NUMBER() OVER (ORDER BY (SELECT 1))

FROM sys.all_columns a CROSS JOIN sys.all_columns b)

INSERT INTO #Phones

SELECT 1000000+ABS(CHECKSUM(NEWID()))%1000000

FROM Tally;

DECLARE @Phone VARCHAR(12), @StartDT DATETIME;

PRINT 'PatternSplitCM';

SELECT @StartDT = GETDATE();

SET STATISTICS TIME ON;

SELECT @Phone=Item

FROM #Phones

CROSS APPLY dbo.PatternSplitCM(phone, '[0-9]')

WHERE [Matched]=1;

SET STATISTICS TIME OFF;

SELECT ElapsedMS=DATEDIFF(millisecond, @StartDT, GETDATE());

PRINT 'extractinteger';

SELECT @StartDT = GETDATE();

SET STATISTICS TIME ON;

SELECT @Phone=dbo.extractinteger(phone)

FROM #Phones;

SET STATISTICS TIME OFF;

SELECT ElapsedMS=DATEDIFF(millisecond, @StartDT, GETDATE());

GO

DROP TABLE #Phones;

Edit: I modified it to SELECT the ElapsedMS of each run, which is needed due to a bias that is introduced by applying STATISTICS TIME to an SVF as demonstrated here: http://www.sqlservercentral.com/articles/T-SQL/91724/

My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]
My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St