Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 

REPLACE Multiple Spaces with One

By Jeff Moden, (first published: 2009/11/16)

Prologue

This article was originally published on November 16th, 2009. As with anything else, improvements can be made to code and the subject of this article is no exception. Although this article is still an interesting read (I left everything after the "Introduction" heading as it originally was), I no longer recommend the method in this article because it has been soundly beaten (more than 6 times faster) for performance by another T-SQL method as demonstrated by Michael Meierruth. If you prefer to "cut to the chase", that method can be found in a post in the discussion for this article at the following URL:

http://www.sqlservercentral.com/Forums/FindPost821209.aspx

The discussion that followed this article is also fascinating and a large number of people took part in some rather wonderful testing. Some folks even took the time to create and post some CLRs to solve the same problem. All in all, the discussion makes for an incredible learning experience which is typical of the amazing community of professional people we've all grown to know and love here at SSC. I learn something new here everyday.

One of the things that came out of the discussion is the fact that the default collation can make a huge impact on performance. The the following link to the post in the discussion where Paul White demonstrates that not-so-little nuance:

http://www.sqlservercentral.com/Forums/FindPost821565.aspx

Be you Neophyte or "Ninja", thanks for being a part of this community, folks.

--Jeff Moden

Introduction

Replacing multiple spaces with a single space is an old problem. If you Google the problem, you find that most folks still resort to While Loops in functions or maybe even a Tally table or (ugh!) XML in a function to solve this seemingly complex problem. The truth is that you don't need the RBAR of a User Defined Function at all.

This article explains the "puzzle solving" behind a common set based method to replace unknown numbers of adjacent spaces with a single space by establishing and replacing simple patterns in a set based fashion.

The Problem

You have a column of data that looks something like the following...

--===== Create and populate a test table.
     -- This is NOT a part of the solution.
DECLARE @Demo TABLE(OriginalString VARCHAR(8000))
 INSERT INTO @Demo (OriginalString)
 SELECT '  This      has multiple   unknown                 spaces in        it.   ' UNION ALL
 SELECT 'So                     does                      this!' UNION ALL
 SELECT 'As                                does                        this' UNION ALL
 SELECT 'This, that, and the other  thing.' UNION ALL
 SELECT 'This needs no repair.'

The goal is to convert the sections of multiple spaces of unknown length to a single space each as quickly as possible. There are some additional requirements. You can't make the data any bigger during the process because it might get too big for VARCHAR(8000) (or whatever size the column is) or you can't use VARCHAR(MAX) because you're using SQL Server 2000.

The Method Explained

I'll use the letter "O" to act as a "visible" space so you can see what I'm talking about... consider the following sets of spaces, please....

O
OO
OOO
OOOO
OOOOO
OOOOOO
OOOOOOO
OOOOOOOO

Remember the goal is to convert all of those sets of spaces to just a single space without the use of RBAR even if the RBAR is through the simple use of a UDF. I'll also state that the goal is to remove the extra spaces without making the original string any larger in the process because it may already be as large as it can be for the given datatype.

So... STEP 1 is to identify pairs of spaces. This is done by modifying the second space in each pair of spaces to be an "unlikely" character. In this case, I'll use the visible character of "X" (which isn't unlikely but serves this visual example) to represent a space that has been changed in a pair of spaces. When we replace all pairs of space "OO" with "OX", we get the following

O
OX
OXO
OXOX
OXOXO
OXOXOX
OXOXOXO
OXOXOXOX

STEP 2 is to replace all occurrences of "XO" with NOTHING...

O
OX
OXO
OXOX
OXOXO
OXOXOX
OXOXOXO
OXOXOXOX

... and that leaves us with ...

O
OX
O
OX
O
OX
O
OX

STEP 3 is to replace "X" with NOTHING...

O
OX
O
OX
O
OX
O
OX

... and that leaves us with just singles spaces everywhere...

O
O
O
O
O
O
O
O

Again... the "O"s represent unmodified spaces and the "X"s represent spaces changed to some "unlikely character" like a special non printable, almost non type-able character like ASCII 7 (the "Bell" character). It's just a matter of 3 nested REPLACE functions to handle ANY number of spaces to accomplish the puzzle we solved above. It can all be done in a single set-based query without loops or even UDF's.

Be careful which "unlikely character" you pick, though. We'll talk more about that when we get to the "Unlikely Characters and Collation" section of this article further below.

The Code

Ok... now that you know how it works, here's the code that accomplishes the 3 steps as 3 nested REPLACE's. I've included the test table I previously covered in "The Problem" section of this article just to make things easy to run. Notice that I've also added an LTRIM/RTRIM to take out any leading or trailing spaces, as well...

--===== Create and populate a test table.
     -- This is NOT a part of the solution.
DECLARE @Demo TABLE(OriginalString VARCHAR(8000))
 INSERT INTO @Demo (OriginalString)
 SELECT '  This      has multiple   unknown                 spaces in        it.   ' UNION ALL
 SELECT 'So                     does                      this!' UNION ALL
 SELECT 'As                                does                        this' UNION ALL
 SELECT 'This, that, and the other  thing.' UNION ALL
 SELECT 'This needs no repair.'

--===== Reduce each group of multiple spaces to a single space
     -- for a whole table without functions, loops, or other
     -- forms of slow RBAR.  In the following example, CHAR(7)
     -- is the "unlikely" character that "X" was used for in 
     -- the explanation.
 SELECT REPLACE(
            REPLACE(
                REPLACE(
                    LTRIM(RTRIM(OriginalString))
                ,'  ',' '+CHAR(7))  --Changes 2 spaces to the OX model
            ,CHAR(7)+' ','')        --Changes the XO model to nothing
        ,CHAR(7),'') AS CleanString --Changes the remaining X's to nothing
   FROM @Demo
  WHERE CHARINDEX('  ',OriginalString) > 0
Results from Code Above
CleanString
This has multiple unknown spaces in it.
So does this!
As does this
This, that, and the other thing.

Unlikely Characters and Collation

Just a quick note on "unlikely characters". You do have to be really careful about what you select as an "unlikely character" for the "X" of the "OX" model we previously discussed. For example, if you have the not-so-uncommon collation of Latin1_General_CI_AI on a column and you've chosen the "unlikely character" of Thorn (þ), you could end up deleting a whole lot more than you bargained for. According to Wikipedia, "þ" (the Thorn character) still survives as 30th character of the Icelandic alphabet and appears in other alphabets, as well. Further, in many languages, it has been replaced by the "th" digraph and, in certain collations like the Latin1_General_CI_AI collation, the "þ" character and "th" are treated as equals. For example...

--===== Create and populate a test table.
-- *** NOTICE THE COLLATION SETTING ON THE STRING COLUMN. ***
DECLARE @Demo TABLE(OriginalString VARCHAR(8000) COLLATE Latin1_General_CI_AI)
 INSERT INTO @Demo (OriginalString)
 SELECT '  This      has multiple   unknown                 spaces in        it.  ' UNION ALL
 SELECT ' So                     does                      this!   ' UNION ALL
 SELECT '   As                                does                                this' UNION ALL
 SELECT 'This, that, and the other thing.'

--===== This uses a "thorn" character as the "X" of the "OX" model 
 SELECT LTRIM(RTRIM(
            REPLACE(REPLACE(REPLACE(OriginalString,'  ',' þ'),'þ ',''),'þ','')
        ))
  FROM @Demo

Because of the equivalent treatment of the Thorn (þ) character and the "th" digraph, you end up with a most undesirable result...

Results from Code Above
CleanString
is has multiple unknown spaces in it.
So does is!
As does is
is, at, and e oer ing.

... which is nothing like what we wanted.

There's also a danger in selecting the wrong "control character" (ASCII characters 0 through 31) as the "unlikely character". I chose CHAR(7) which is a very benign character in today's world of electronics. It was designed to literally ring the bell on old Tele-Type machines and is just passed through by today's electronics.

If you chose CHAR(0), you've just chosen the "NULL" character and anything that follows it's appearance will simply disappear. Using the current test table in this article, you'd get an output that looks like the following...

Results from Code Above
CleanString
This
So
As
This,that,andtheotherthing.

If you chose CHAR(1), you've just chosen the ASCII "Start of Header" character which is still in use today. When certain electronics see this character, it can cause some very strange behavior usually resulting in failure of your code.

Another good "unlikely character" is CHAR(8) which is the ASCII "Backspace" character. It's normally never included in any type of assembled text now adays. CHAR(127) also works well because it's a left over for "Delete" from the paper tape world. It actually punches all the holes in a paper tape (7 of them) to quite literally delete a character.

For more information on the ASCII "Control Characters", please see the following URL:

http://www.lammertbies.nl/comm/info/ascii-characters.html#cont

Conclusion

To borrow a phrase from R. Barry Young's series of articles on the subject, "There must be 15 ways to lose your cursors". Even the seemingly complex task of condensing multiple adjacent spaces to a single space can be done without a loop.

When you run into a problem where you're absolutely sure there's no way to do it without a loop, have another look. There's usually a high performance way to avoid the loop.

Thanks for listening, folks.

--Jeff Moden


"RBAR" is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

Total article views: 43257 | Views in the last 30 days: 46
 
Related Articles
FORUM

Detecting Specific Characters in a Varchar Field

spaces or invalid characters

FORUM

derived column --replace " with space expression error

derived column --replace " with space expression error

FORUM

In select query values with space appended takes higher precedence than normal characters

In select query values with space appended takes higher precedence than normal characters . How to o...

FORUM

replace characters on condition in table and cell

replace characters on condition in table and cells

SCRIPT

Function to Expand Tab Characters

A function to replace tab characters with the correct number of spaces to align the text as original...

Tags
t-sql    
 
Contribute

Join the most active online SQL Server Community

SQL knowledge, delivered daily, free:

Email address:  

You make SSC a better place

As a member of SQLServerCentral, you get free access to loads of fresh content: thousands of articles and SQL scripts, a library of free eBooks, a weekly database news roundup, a great Q & A platform… And it’s our huge, buzzing community of SQL Server Professionals that makes it such a success.

Join us!

Steve Jones
Editor, SQLServerCentral.com

Already a member? Jump in:

Email address:   Password:   Remember me: Forgotten your password?
Steve Jones