Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 
        
Home       Members    Calendar    Who's On


Add to briefcase 12»»

how to strip out all non alpha numeric characters? Expand / Collapse
Author
Message
Posted Tuesday, March 23, 2010 8:59 AM


SSC Veteran

SSC VeteranSSC VeteranSSC VeteranSSC VeteranSSC VeteranSSC VeteranSSC VeteranSSC Veteran

Group: General Forum Members
Last Login: Thursday, September 22, 2011 1:51 PM
Points: 276, Visits: 400
--drop table #addr
create table #addr
( id int identity (1,1)
, address1 varchar(200)
)

insert into #addr
select '532 40th ave. n.e. apt)104'
union all
select '1000 sector way st. apt)111-ac))'
union all
select '3333 summer-point place'
union all
select 'p.o.box 22000a-330042'
union all
select '5566 first stett n. lot.204-a apt)2-c'


--should return only alpha numeric chars.
'53240thaveneapt104'
'1000sectorwaystapt111ac'
'3333summerpointplace'
'pobox22000a330042'
'5566firststettnlot204aapt2c'
Post #888194
Posted Tuesday, March 23, 2010 10:03 AM


SSCrazy Eights

SSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy Eights

Group: General Forum Members
Last Login: Today @ 5:47 PM
Points: 8,150, Visits: 20,892
i think this will do what you are after; this leaves 0-9,A-Z,a-z and strips out everything else...commas,punctuation, hi ascii, etc.


select dbo.StripNonAlphaNumeric(address1) from #addr
--results
53240thaveneapt104
1000sectorwaystapt111ac
3333summerpointplace
pobox22000a330042
5566firststettnlot204aapt2c

select ascii('a'),ascii('z'),ascii('A'),ascii('Z')
GO
CREATE FUNCTION StripNonAlphaNumeric(@OriginalText VARCHAR(8000))
RETURNS VARCHAR(8000)
BEGIN
DECLARE @CleanedText VARCHAR(8000)
;WITH tally (N) as
(SELECT TOP 10000 row_number() OVER (ORDER BY sc1.id)
FROM Master.dbo.SysColumns sc1
CROSS JOIN Master.dbo.SysColumns sc2)
SELECT @CleanedText = ISNULL(@CleanedText,'') +
CASE
--ascii numbers are 48(for '0') thru 57 (for '9')
WHEN ASCII(SUBSTRING(@OriginalText,Tally.N,1)) BETWEEN 48 AND 57
THEN SUBSTRING(@OriginalText,Tally.N,1)
--ascii upper case letters A-Z is 65 thru 90
WHEN ASCII(SUBSTRING(@OriginalText,Tally.N,1)) BETWEEN 65 AND 90
THEN SUBSTRING(@OriginalText,Tally.N,1)
--ascii lower case letters a-z is 97 thru 122
WHEN ASCII(SUBSTRING(@OriginalText,Tally.N,1)) BETWEEN 97 AND 122
THEN SUBSTRING(@OriginalText,Tally.N,1)
ELSE '' END

FROM tally WHERE Tally.N <= LEN(@OriginalText)

RETURN @CleanedText
END




Lowell

--There is no spoon, and there's no default ORDER BY in sql server either.

SQL Forums are unique among other technical forums for getting exact solutions.. If you provided a CREATE TABLE... and INSERT INTO... scripts, volunteers here can give you a tested, working solution to your questions. Help us help you and provide that and a description of what you are trying to accomplish!
Post #888285
Posted Tuesday, March 23, 2010 11:02 AM


SSC Veteran

SSC VeteranSSC VeteranSSC VeteranSSC VeteranSSC VeteranSSC VeteranSSC VeteranSSC Veteran

Group: General Forum Members
Last Login: Thursday, September 22, 2011 1:51 PM
Points: 276, Visits: 400
That's exactly what i was looking for. Thanks Lowel. I knew there was a way via ascii but hadn't done it before. This ones going in the tool box.

Again thank you!
Post #888349
Posted Tuesday, March 23, 2010 12:52 PM


SSCertifiable

SSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiable

Group: General Forum Members
Last Login: Today @ 3:59 PM
Points: 6,522, Visits: 11,526
Maybe you could make the CASE statement of Lowells nice function a little shorter:
CASE 
WHEN (SUBSTRING(@OriginalText,Tally.N,1)) like '[0-Z]'
THEN SUBSTRING(@OriginalText,Tally.N,1)
ELSE '' END

It seems like it'll return the same value.




Lutz
A pessimist is an optimist with experience.

How to get fast answers to your question
How to post performance related questions
Links for Tally Table , Cross Tabs and Dynamic Cross Tabs , Delimited Split Function
Post #888443
Posted Tuesday, March 23, 2010 1:37 PM
Ten Centuries

Ten CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen Centuries

Group: General Forum Members
Last Login: Yesterday @ 9:26 AM
Points: 1,295, Visits: 4,815
lmu92 (3/23/2010)
Maybe you could make the CASE statement of Lowells nice function a little shorter:
CASE 
WHEN (SUBSTRING(@OriginalText,Tally.N,1)) like '[0-Z]'
THEN SUBSTRING(@OriginalText,Tally.N,1)
ELSE '' END

It seems like it'll return the same value.


It seems, but it really depends on the collation being used. Check out the following script:

DECLARE @OriginalText VARCHAR(MAX) = '0;:ëêà^AaZz'
SELECT
CASE
WHEN (SUBSTRING(@OriginalText COLLATE Latin1_General_BIN2,Tally.N,1)) like '[0-Z]'
THEN SUBSTRING(@OriginalText,Tally.N,1)
ELSE '' END,
CASE
WHEN (SUBSTRING(@OriginalText COLLATE Latin1_General_CI_AS,Tally.N,1)) like '[0-Z]'
THEN SUBSTRING(@OriginalText,Tally.N,1)
ELSE '' END,
CASE
WHEN (SUBSTRING(@OriginalText COLLATE Latin1_General_CI_AI,Tally.N,1)) like '[0-Z]'
THEN SUBSTRING(@OriginalText,Tally.N,1)
ELSE '' END
FROM
(
SELECT Number N FROM master..spt_values WHERE Type = 'P'
) Tally
where
Tally.N <= LEN(@OriginalText)

If you really want to remove any character but 0-9, a-z, A-Z compare the Ascii-codes. Also note that in a binary collation ':' and ';' fall between [0-Z].

Peter
Post #888481
Posted Tuesday, March 23, 2010 1:46 PM


SSCertifiable

SSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiable

Group: General Forum Members
Last Login: Today @ 3:59 PM
Points: 6,522, Visits: 11,526
Thanx for clarification, Peter!
I knew there must have been a reason Lowell provided a method that seemed to leave room for improvement...




Lutz
A pessimist is an optimist with experience.

How to get fast answers to your question
How to post performance related questions
Links for Tally Table , Cross Tabs and Dynamic Cross Tabs , Delimited Split Function
Post #888488
Posted Tuesday, March 23, 2010 2:18 PM
Ten Centuries

Ten CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen Centuries

Group: General Forum Members
Last Login: Yesterday @ 9:26 AM
Points: 1,295, Visits: 4,815
lmu92 (3/23/2010)
Thanx for clarification, Peter!
I knew there must have been a reason Lowell provided a method that seemed to leave room for improvement...

Well, you were close. I think this one will do the job too
  CASE 
WHEN (SUBSTRING(@OriginalText COLLATE Latin1_General_BIN2,Tally.N,1)) like '[0-9a-zA-Z]'
THEN SUBSTRING(@OriginalText,Tally.N,1)
ELSE '' END,

BTW, where did you get [0-Z] from. I was surprised to see that ':' or ';' do not fall within that range when using a collation like Latin1_General_CI_Ax. I scanned BOL, but I can not find anything about this pattern.

Peter
Post #888516
Posted Tuesday, March 23, 2010 2:59 PM


Hall of Fame

Hall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of Fame

Group: General Forum Members
Last Login: Monday, January 23, 2012 11:01 AM
Points: 3,539, Visits: 5,118
You may also want to file away this version, which uses an inline table valued function. It performs significantly faster if you have to clean up a large number of rows in a table with one pass. I modified the function above to get rid of anything except 0-9 and tested it against an existing function that does the same thing.


CREATE FUNCTION dbo.itvf_NumbersOnly
(
@Input Varchar(max)
)
RETURNS TABLE
AS
RETURN
(
WITH data as
(select N,substring(@input,N,1) as element
from dbo.tally
where substring(@input,N,1) between '0' and '9' -- alter to include only characters needed
and N <= len(@input)
)

Select(Select ''+element
FROM data
ORDER BY N
FOR XML PATH('')
) as [NumberString]
)
GO


Also, I should note that CLR is said to be significantly faster at string manipulation, but we don't do CLR here yet, so I can't test. There was a huge thread on the subject here. There may be a variation in performance depending on the size of the strings to be cleaned and the number of characters to be replaced.

---------------
-- code to test
---------------
create table #test (rowID int identity primary key, data varchar(max))
insert into #test
select top 100000 newID()
from dbo.tally

select top 10 * from #test


set statistics io on;
set statistics time on;

select rowid,Data,NumberString
into #test2
from #test
cross apply dbo.itvf_NumbersOnly(data)

set statistics time off;
set statistics io off;


print '----------------------------------------------------------------'
set statistics io on;
set statistics time on;

select rowid,data,dbo.StripNonNumeric(data) as NumberString
into #test3
from #test

set statistics time off;
set statistics io off;

select top 10 * from #test2 order by rowid
select top 10 * from #test3 order by rowid

drop table #test
drop table #test2
drop table #test3




__________________________________________________

That guy's about as useless as a nvarchar(1). -- Original
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Whenever possible, I prefer to eat the rude. -- Hannibal Lecter
Post #888543
Posted Tuesday, March 23, 2010 3:22 PM
Ten Centuries

Ten CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen Centuries

Group: General Forum Members
Last Login: Yesterday @ 9:26 AM
Points: 1,295, Visits: 4,815
...substring(@input,N,1) between '0' and '9' ...


Still the same mistake. This will match characters like '³' and '²'.
Post #888549
Posted Tuesday, March 23, 2010 4:53 PM


Hall of Fame

Hall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of Fame

Group: General Forum Members
Last Login: Monday, January 23, 2012 11:01 AM
Points: 3,539, Visits: 5,118
I see what you mean, but my point wasn't the selection/omission criteria in the where clause. In fact, I learned something from the earlier posts and made a mental note to be aware of potential differences in collation.

The point is the slow performance of a user-defined function compared to an inline table-valued function. At 100,000 rows I was seeing a difference of seconds versus minutes. Take that for what its worth.


__________________________________________________

That guy's about as useless as a nvarchar(1). -- Original
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Whenever possible, I prefer to eat the rude. -- Hannibal Lecter
Post #888582
« Prev Topic | Next Topic »

Add to briefcase 12»»

Permissions Expand / Collapse