Tame Those Strings! Part 3 - Using REPLACE

,

This is a continuation of string manipulation techniques. If you are interested, you can read the other articles,

though you do not need to read it before this one. These are mostly beginning

programming articles, but advanced T-SQL programmers may still find something useful here.

  • Part 1 deals with SUBSTRING and how it can be used to extract some information

    from a field of data

  • Part 2 deals with CHARINDEX and how it can be used to extract some information

    from a field of data when the data is delimited rather than stored in a particular format.

Introduction

Continuing on with taming strings.

In Part 1, I worked with phone numbers to find the area codes in phone

number data. This data was in a variety of formats and frankly,

quite a mess. Since we allowed users to enter data with very little

validation, we got all sorts of invalid data as well as typos.

Well, the time finally came for us to clean and validate as much phone

data as possible, and so I decided to write about my little adventure.

The Problem

Dealing with phone number data is more than a trivial exercise,

both from the technical as well a business viewpoints. If you

limit yourself to US phone numbers, then the problem is simplified,

but still requires some work. In today's Internet, 24x7 uptime requirements,

and international applications, I have seen this become less and less of

an option, so I plan on designing the db to hold all kinds of

information. Another article will tackle the database design side of storing

phone data; this article looks at getting all the data in a standard format.

We had initially designed a single phone number field on forms as a way

to handle all types of data in an ergonomic fashion. There was no

validation of the data being entered so as to make the data entry as

smooth as possible. The result of this was phone number data that looked

like this:

1-(520)-555-5821
(904)555-1877
2711 5555458
610-555-3723
000
6195557693
11111111111111111
(111) 555-0986
181-5554761
31 471 55505
31 22555599
1-800-555-9495
415 555 3851
+11455561022
1603555775
++492055599971
407.555.4770
555-6609
202/555-2228
248-555-8300*8263
812-555-8444x312
9999999
1800555SOFT ext. 132
512/555-2000x2563
(800) 555-2588  XT 7
none
800-555-0963 X90454

As you can see, there are a variety of formats in which data was

entered (BTW, these are not real phone numbers, I scrambled quite

a few of the numbers and made the exchanges 555. If any of them actually

work, I apologize in advance). Users have entered parenthesis, dashes, periods, +, etc.

as delimiters for various sections of the phone numbers. The spacing,

abbreviations, and formats vary from company to company (much to the

chagrin of developers) and cry out for some standardization.

However, forcing standardization during the data entry annoys and angers (

and loses) users.

My company

finally came up with a standard data entry solution with some validation

of data, but I needed to then standardize the existing data.

The technical solution to storing the data was to standardize the

formatting and then include parsing routines in the application

logic to handle the display and data entry. We decided a single

field will contain the user entered data as numerics only with all

other characters stripped out with one exception. For users that

have a phone extension, we would include an "x" between the phone

number and the extension. By standardizing on this format, all developers

can expect a certain format and apply simpler parsing rules to

format the data for display.

What String Manipulation?

Here is a problem that needs to be solved, what does this have to do

with string manipulation? After all, shouldn't the validation be occuring

in the presentation or business layers? Couldn't I use a rule or user defined

datatype to standardize the system?

Patience, patience, all will be clear soon.

In order to clean this data, I will be using string manipulation to

reformat all the existing phone data in the standard format. Regardless of future validation

and schema changes, I would still have to get the existing data into a standard

format to ensure the application works.

Now it would be nice to write one SQL statement that would rip

through the data and reformat all the data as expected. A couple

problems with this; one, it would be difficult to explain in an

article. Two, it would probably take more time to decipher than

most of you would want to spend. Three, I can't come up with one.

Since reason three will tend to override reasons one and two in anything

I write, here is what I came up with.

Since I try to avoid row processing which is slow and the kind of thing I

usually chastise junior DBAs for implementing, I decided to

look for patterns which would enable me to develop an algorithm to

clean up this data. I made notes as I conducted the analysis so that I could

present readers with some insight into how I go about solving a problem or

developing a solution.

Analysis

The first thing I do when solving most problems is look for a pattern.

After all, computers are very good at performing repetitive actions

and SQL is especially well suited to working with a batch of data.

In my examination of the data, I found that most data was basically in one

of a few formats:

	999 999-9999
	(999) 999-9999
	999-999-9999

and the extensions tended to look like:

	x999
	x.999
	ext 999
	ext.999

The rest of the data is pretty close to these formats with some

variations (more or less spaces, +xx for international numbers, some

people entered slashes ("/") or periods (".") or some other

character, but since I may end up handling these individually or

as "one-offs" (things that do not fit a pattern or are an exception),

I will ignore these for now.

The format that I am trying to standardize all data into is the following:

9999999999x9999

where all non-numeric characters and spaces are removed. Actually

spaces are not important for the parsing routines we employ, so

I will ignore embedded spaces, but I think you will be able

to adapt my solution easily to remove them.

Back to the analysis...

There are multiple places in the field where each non-numeric

character can occur. I could write a series of substring

updates (like in Part 1), but this

would be quite a bit of work, lots and lots of statements, and probably not of

interest to any of you. As I examine the data, what I really want

to do is remove all the dashes ( or other characters) from the field.

So how can we remove the parenthesis, dashes, and periods from the

phone numbers? One way that I think should work is a simple CONVERT or

CAST of the field to an integer value. If I run the following,

	select cast( '(555) 555-5555' as int)

I would expect to get:

	5555555555

but instead, I get:

Server: Msg 245, Level 16, State 1, Line 1

Syntax error converting the varchar value '(555) 555-55555' to a column

of data type int.

Why this occurs, I am not sure (if you are on the SQL Server team,

let me know), but it does not. SUBSTRING is not a great option, so I decided

to search BOL a little and see what my options were.

The Solution

The first thing I saw in BOL which looked promising was PATINDEX

which lets you find the location of a particular string within another

string. This along with CHARINDEX (see Part 2)

is useful for finding the location

of a pattern within a string. I suppose I could have used this and then written a

bunch of substrings that would concatenate the before and after portions of the string,

but this did not seem very efficient either.

It was around this time that one of our senior developers was doing some data cleaning

and sent me a script to execute on the production database. This script contained a series

of SQL statements in a batch that ran REPLACE to change the wording in a number of rows in

a particular table. The idea then came to me that I could use this function to quickly remove all

the non-numeric characters in my data!

The Code

The REPLACE function is very powerful and is used to do exactly what you would expect it to do.

It searches a string for some other string that you supply (much like CHARINDEX and PATINDEX), but

replaces the second string with a third string that you supply. So if you ran the following code:

declare @a char( 30)
select @a = 'Bob and Jim played catch.'
select @a, replace( @a, 'Jim', 'Jeff')

You get

---------------------------  ---------------------------
Bob and Jim played catch.    Bob and Jeff played catch.

As you can see, the first string, @a, is searched for any occurence of

the second string, "Jim". If an occurence is found, then it is replaced with

the third string, "Jeff".

To apply this to my phone number problem, I decided to write a series of statements

that would replace the characters I wanted removed. The three obvious characters I

wanted to remove were the dashes, paranthesis, and periods. Actually these are four characters

and I ended up with the following script:

update UserInfo
  set phone = replace( phone, '-', '')
update UserInfo
  set phone = replace( phone, '(', '')
update UserInfo
  set phone = replace( phone, ')', '')
update UserInfo
  set phone = replace( phone, '.', '')

One thing that many people do not know about many functions (at least I am surprised

how many people do not know) is that you can often use an empty string wherever a character

parameter is called for. In this case, substituting an empty string as the replacement character

allows me to remove offending characters.

Conclusions

There are more possible solutions for solving this problem, but some are better

left to another article. I hope this has helped some of you understand how to manipulate

and clean data a little better as well as sparked some ideas for even better

solutions. My solution worked well and was set based so it was somewhat efficient as well.

For those of you interested in further information about REPLACE, here is the URL

for the

online reference for REPLACE

As always, I welcome feedback and please rate this article below (and any you read on Swynk).

It helps to motivate and assist us authors in writing better columns.

Steve Jones

December 2000


Return to Steve Jones Home

 

Rate

4.5 (2)

Share

Share

Rate

4.5 (2)