Removing multiple patterns from a string

Removing multiple patterns from a string

Recently one of my clients had a request that was a challenge to do effectively. I don’t think it’s a very common requirement but I still decided to blog about it. Who knows, maybe one of you will actually benefit.

Pattern removal

Within a existing query, one of the columns in the result set holds string data that includes sequences and patterns that need to be removed in the output. Most of those patterns can occur more than once, and then all of them need to be removed. Some of these patterns overlap with each other and then the longest one needs to be removed. One of the patterns can only occur at the end of the string, the rest can occur anywhere.

The actual use case and the list of patterns that I had to remove are considered a confidential competitive advantage by my client, so I will just make up a list here to illustrate the problem. In this fake requirement, the following patterns must all be removed if anywhere in the string: “-@@%”, “@@%”, “-@%”, “@%”, “No.X#X”, and “^^^”. Also, “@X” needs to be removed when at the end of the string. In these patterns, @ is a placeholder for a numeric symbol, X for an alphabetic symbol, and # for an alphabetic or numeric symbol. All other symbols are to be taken literally. (So “%” is the percent mark, not the LIKE pattern for any character!).

Challenges

When looking at the requirements, there were a few solutions that I briefly considered and then rejected.

  • I had previously done substring removal with REPLACE, which has the benefit that it will automatically replace all occurrences it finds – but it is limited to finding fixed strings; it does not support pattern recognition.
  • I though about doing a query with a PATINDEX expression and a join to a numbers table to find all matching patterns. This worked fine, until I started testing with a string that included two copies of the same pattern. With PATINDEX I was only able to find the first matching pattern, because it does not support a parameter to specify the start position for the search.
  • Conversely, CHARINDEX does support a starting position – but is once more limited to finding fixed stings, and does not support pattern matching.
  • A fairly straightforward and obvious choice would have been to create a user-defined function (either a T-SQL user-defined function or even a CLR user-defined function). The code for such a function would be quite simple. However, the pattern removal had to be done on the result of a query, and adding a user-defined function in a query wrecks performance in more ways than you can count. So this was not an option for me.

I realized that I had to think “out of the box” for this one. The final solution I crafted effectively breaks each input string into its individual characters, then uses a WHERE clause to filter out each character that is part of at least one of the given patterns, and then concatenates together all the characters that remain.

Here is the query I used (along with a simple test frame). In the paragraphs below I will walk through and explain the key elements.

DECLARE @test table
    (PartNo varchar(50) NOT NULL PRIMARY KEY);
INSERT  @test
VALUES ('ABC-12%D^^^E3T'),
       ('AB345%6%-44%^^^^No.XXXNo.W8RNo.8R8D');


WITH Patterns
  AS (SELECT    Pattern,
                Length
      FROM      (VALUES ('%-[0-9][0-9][%]%', 4),        -- -@@%
                        ('%[0-9][0-9][%]%', 3),         -- @@%
                        ('%-[0-9][%]%', 3),             -- -@%
                        ('%[0-9][%]%', 2),              -- @%
                        ('%No.[A-Z][A-Z0-9][A-Z]%', 6), -- No.X#X
                        ('%^^^%', 3),                   -- ^^^
                        ('%[0-9][A-Z]', 2)              -- @X (end of string)
      ) AS Patterns (Pattern, Length) )
SELECT  t.PartNo,
        (SELECT SUBSTRING(t.PartNo, n.N, 1)
         FROM   dbo.Numbers AS n
         WHERE  n.N BETWEEN 1
                    AND     LEN(t.PartNo)
         AND    NOT EXISTS (SELECT  *
                            FROM    Patterns AS p
                            WHERE   PATINDEX(
                                        p.Pattern,
                                        STUFF(t.PartNo,
                                              1,
                                              CASE WHEN n.N > p.Length
                                                   THEN n.N - p.Length
                                                   ELSE 0
                                              END,
                                              '')) BETWEEN 1
                                                       AND CASE WHEN n.N > p.Length
                                                                THEN p.Length
                                                                ELSE n.N
                                                           END)
        ORDER BY n.N
        FOR XML PATH(''))
FROM    @test AS t;

Specifying the patterns

WITH Patterns
  AS (SELECT    Pattern,
                Length
      FROM      (VALUES ('%-[0-9][0-9][%]%', 4),        -- -@@%
                        ('%[0-9][0-9][%]%', 3),         -- @@%
                        ('%-[0-9][%]%', 3),             -- -@%
                        ('%[0-9][%]%', 2),              -- @%
                        ('%No.[A-Z][A-Z0-9][A-Z]%', 6), -- No.X#X
                        ('%^^^%', 3),                   -- ^^^
                        ('%[0-9][A-Z]', 2)              -- @X (end of string)
      ) AS Patterns (Pattern, Length) )

I assume that the CTE where the patterns are defined needs no explanation. Note that for matching only the character % I use [%] instead of relying on an escape character. Both options are valid.

For one of my earlier attempts I had to ensure that for potentially overlapping patterns (e.g. @% and @@%, or @% and -@%) the longest pattern comes first. However, my final solution no longer relied on this order so patterns can be specified in any order.

You will notice that I specify a pattern length as well. This is required in the main query. It is probably possible to derive this by parsing the pattern, but that would just add complexity for no gain. It was easy enough for me to count the length of each pattern.

Splitting the string

SELECT SUBSTRING(t.PartNo, n.N, 1)
FROM   dbo.Numbers AS n
WHERE  n.N BETWEEN 1
           AND     LEN(t.PartNo)

The main query starts by joining the input data to a table of numbers. Each row in the result set represents a single character from the original string. You can easily verify this by replacing t.PartNo in the above fragment with a string constant.

Finding the patterns

NOT EXISTS (SELECT  *
            FROM    Patterns AS p
            WHERE   PATINDEX(
                        p.Pattern,
                        STUFF(t.PartNo,
                              1,
                              CASE WHEN n.N > p.Length
                                   THEN n.N - p.Length
                                   ELSE 0
                              END,
                              ''))

The fragment above is where I test whether the current character of the string is included in any of the specified patterns. I use PATINDEX for that test, which returns either 0 if no match is found, or the starting position otherwise. However, as mentioned before, PATINDEX does not accept a starting position so there is no simple way to find a second occurrence of the same pattern.

To work around that, I use STUFF to remove all characters that precede the first character that we care about for this specific pattern. So for example, to test for pattern “-@@%” at position 9 in the string ‘AB345%6%-44%^^^^No.XXXNo.W8RNo.8R8D’, I remove the first 5 positions so that I search the pattern in the remainder of the string: ‘%6%-44%^^^^No.XXXNo.W8RNo.8R8D’. This will then match the location of “-44%” which was the second occurrence of this pattern in the original string, but is the first occurrence of it after stripping the first 5 characters.

How much to strip?

The STUFF function used above takes four arguments: input string, starting and ending position of substring to replace, and replacement substring. The latter is empty because I just want to remove part of the string. The starting position is 1 because I want to remove a substring starting at the first character. But what about the ending position? How much to remove?

When I am evaluating position 9 in the input string and I want to know if that position is part of the pattern “-@@%”, then I need to take into account that the pattern has 4 characters. So character 9 can be the first, second, third, or fourth character, if the pattern starts at position 9, 8, 7, and 6 respectively. The left-most starting position is character 6. This means I can remove characters 1 through 5. Or rather, 1 through (current position minus pattern length).

However, at the start of the string that expression results in negative numbers, and STUFF throws an error if the end position is negative. So I use a CASE expression to ensure that all these negative numbers are replaced by zero. STUFF does not mind starting position 1 and ending position 0, that simply does nothing – which is exactly what I want when testing patterns at the start of the string. When looking for pattern “-@@%” at position 3, I simply want to start searching at the left-most position.

Recognize a match

BETWEEN 1
    AND CASE WHEN n.N > p.Length
             THEN p.Length
             ELSE n.N
        END)

If PATINDEX finds no match at all, then it returns zero. If it does find a match, then it returns the starting position of that match (which is the first in case of multiple matches). So how do we use this to determine whether the position we are now looking at is a match?

Let’s return to the previous example where I look for pattern “-@@%” in position 9 of the input string. We have already removed the first 5 characters, so the original position 9 is now position 4 in the remainder of the string.

This means that is a match is found that starts at positions 1, 2, 3, or 4, then this character is part of it. If a match is found at position 5 or later, then there is a pattern match somewhere to the right of the original position 9, but position 9 itself is not part of it. And if no match is found, zero is returned.

Bottom line: the current position is a match if, after removing the irrelevant left-hand part of the string, a match is found starting at a position between 1 and the length of the pattern.

Exception

The above method results in false positives at the start of the string. For a pattern of length 4, no left-hand part is stripped when looking at positions 1 through 4; for each of these the PATINDEX is applied to the original string. Now let’s say that a matching pattern starts in position 3. That means PATINDEX returns 3 for each of these 4 positions, but only positions 3 and 4 are actually part of the pattern.

This is the reason why we need the CASE expression above. For positions 5 and beyond, it returns 4 (the pattern length). But for positions 1 to 4, it returns the position itself. Which is correct in this case. Position 2 should be excluded if the pattern is found to start at position 1 or 2, but not at 3 or 4.

Putting it back together

ORDER BY n.N
FOR XML PATH(''))

The final step in the logic is to concatenate all the characters that were not removed due to being included in a pattern together, in their original order. That is what the above fragment does. I am not going to say a whole lot about it – this method of string concatenation is fairly well-known.

Do note that, because this “abuses” XML, you can get some weird results if special XML characters are used. For example, spaces, ampersand characters, or less-than signs, etc. If these are in your data, then you can ensure that they are not changed by using the TYPE attribute on the FOR XML clause, which returns the data as XML; you will then also need to CAST the final result of this subquery back to an appropriate string data type.

CAST((subquery)
     ORDER BY n.N
     FOR XML PATH(''), TYPE) AS varchar(50))

(I use varchar(50) above because that is the data type of the input, and removing patterns can never increase the length).

Conclusion

There’s a lot of complex stuff going on in this query, and if I would encounter this query in code I inherit without any comments or explanation I would not be happy. But it does get the task done, in a set-based way and with very acceptable performance.

As uncommon as the requirement was to me, there is always a chance that someone else one day has to do something similar. If they do, then I hope that they’ll find this blog post so that they don’t need to spend any further time to find a solution!

T-SQL Tuesday #111: Why, tell me why
No I can!

Related Posts

No results found.

4 Comments. Leave new

  • This is exactly what I am looking for but I am hoping to output the patterns, not remove them. How could we adjust this query to do that? I’ve been playing around with it but no luck yet!

    Reply
  • Hugo Kornelis
    April 29, 2022 12:26

    Thanks, Rachel!

    It’s a long time ago since I made this post and the details are not fresh in my mind anymore. But upon re-reading the query and the explanation in the post, here’s my thought.
    The solution presented is based on a test to see if a character is included in any of the patterns. If it is, we skip it; if the character is NOT in any pattern we keep it, and the remaining characters are then concatenated back together.
    So my first thought is that your requirement would be fulfilled by simply reversing the logic in the query. Simply put: change “AND NOT EXISTS to AND EXISTS.

    (Note that the above is 100% UNtested, so please make sure to test it extensively with all the common cases and with all the most nasty weird edge cases you can come up with)

    Reply
  • Hello,
    Thank you so much for sharing this. I’m trying to modify this CTE to the requirement but it constantly fails. I am trying to extract result columns from this bad data column but I have failed to do so. how can I get clean this data?

    Thank you.
    Sample data:

    Id Bad Data Result1 Result 2
    1 LTY – REEL – Mark Hudson – 999.265.6233 LTY Mark Hudson
    23 IOO Craig David 123-456-7891 IOO Craig David
    28 LTK – MEKL – MIKAL MEKL – 123.558.1234 LTK MIKAL MEKL
    36 RAW- RANDY JHON- 123-456-7891 RAW RANDY JHON
    2 MNQR- JERRY RAM 123-456-7891 MNQR JERRY RAM
    I12 LTK – MEKL – TERRY LA- 123.456.7891 LTK TERRY LA
    L-3 LTK – MEKL – Jos Rick – 1234567891 LTK Jos Rick
    JK MDKL JERRY RAM 123 456 7891 MDKL JERRY RAM
    9 MUINK- OSCAR PISTARI- 123-456-7891 MUINK OSCAR PISTARI
    10 MUINK-Leo David MUINK Leo David
    11 ALT – Stephen Smith 123-456-7891 ALT Stephen Smith

    Reply
  • Reply

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

By continuing to use the site, you agree to the use of cookies. more information

The cookie settings on this website are set to "allow cookies" to give you the best browsing experience possible. If you continue to use this website without changing your cookie settings or you click "Accept" below then you are consenting to this.

Close