How to find multiple repeated occurrences with regular expression using like?

  • Seems SQL Server doesn't have a regular expression function.

    I am trying to query for records in which a varchar field contains values of the form 123.x.5, where x is an integer of 1-3 digits. So I want to retrieve the values 123.1.5, 123.12.5 and 123.123.5, but not 123.12.3.5. I can get the single-digit case with "x like '123.[0-9].5'", but '123.[0-9]+.5' returns nothing, and likewise with '*'.

    I could OR them together as in "x like '123.[0-9].5' OR x like '123.[0-9][0-9].5' OR x like '123.[0-9][0-9][0-9].5'", but that seems excessively cumbersome. Trying just "x like '123.%.5'" will get these records, but also those with 123.12.3.5, which I don't want.

    Anyone know of a better method of doing this?

  • It's possible to use the PARSENAME function to split these values into octets. If you're only trying to identify the values with exactly three octets, you could possibly use something like this:

    WITH CTE (Something) AS
    (
        SELECT '123.1.5'
        UNION ALL
        SELECT '123.12.5'
        UNION ALL
        SELECT '123.123.5'
        UNION ALL
        SELECT '123.12.3.5'
    )
    SELECT Something
    FROM CTE
    WHERE PARSENAME(CTE.Something, 4) IS NULL;

    You may still need to do some additional validation in the WHERE clause, but hopefully this helps get you started!
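
    For example, a minimal sketch of that extra validation might look like the following (the table and column names here are placeholders, not from the original question):

    SELECT yt.YourColumn
    FROM dbo.YourTable yt
    WHERE PARSENAME(yt.YourColumn, 4) IS NULL              -- no fourth part
      AND PARSENAME(yt.YourColumn, 3) = '123'              -- first part is 123
      AND PARSENAME(yt.YourColumn, 1) = '5'                -- last part is 5
      AND PARSENAME(yt.YourColumn, 2) NOT LIKE '%[^0-9]%'; -- middle part is all digits

    Keep in mind that PARSENAME only handles up to four parts; strings with more dots return NULL for every position, which happens to exclude them here as well.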

  • jagus wrote:

    Seems SQL Server doesn't have a regular expression function.

    I am trying to query for records in which a varchar field contains values of the form 123.x.5, where x is an integer of 1-3 digits. So I want to retrieve the values 123.1.5, 123.12.5 and 123.123.5, but not 123.12.3.5. I can get the single-digit case with "x like '123.[0-9].5'", but '123.[0-9]+.5' returns nothing, and likewise with '*'.

    I could OR them together as in "x like '123.[0-9].5' OR x like '123.[0-9][0-9].5' OR x like '123.[0-9][0-9][0-9].5'", but that seems excessively cumbersome. Trying just "x like '123.%.5'" will get these records, but also those with 123.12.3.5, which I don't want.

    Anyone know of a better method of doing this?

    What else is in the column?  Is it just numbers separated by two or three decimal points?  The reason I'm asking is that using something like PARSENAME, or just about any other method save one, is going to guarantee a scan of the entire table or index you have.  In order to do the one thing that might not force a scan, I need to know exactly what can and cannot be in this column.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • All the values in this field are numbers of up to three digits separated by decimal points, but there can be up to nine (maybe slightly more) such octets.   Some values are just a single number with no decimal points.   So a value can be something like 3, or 123.123.123.123.123.123, or longer.

  • Here is one way - but I am sure there is a better/faster method:

    Declare @testTable Table (testString varchar(30));

    Insert Into @testTable
    Values ('123.1.4'), ('123.12.5'), ('123.12.3.5');

    Select *
    From @testTable tt
    Where len(tt.testString) - len(replace(tt.testString, '.', '')) = 2;
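
    If the goal is still the original 123.x.5 pattern, the dot count could presumably be combined with a prefix/suffix filter in the same WHERE clause (just a sketch, building on the table variable above):

    Select *
    From @testTable tt
    Where len(tt.testString) - len(replace(tt.testString, '.', '')) = 2  -- exactly three parts
      And tt.testString Like '123.%.5';                                  -- starts with 123., ends with .5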

     

    Jeffrey Williams
    “We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”

    ― Charles R. Swindoll

    How to post questions to get better answers faster
    Managing Transaction Logs

  • jagus wrote:

    All the values in this field are numbers of up to three digits separated by decimal points, but there can be up to nine (maybe slightly more) such octets.   Some values are just a single number with no decimal points.   So a value can be something like 3, or 123.123.123.123.123.123, or longer.

    Ok... so let's do a little test.  The following code generates 1 million random rows with 1 to 9 octets on each row.  You'll need the fnTally function to do this and you can get that at the like-named link in my signature line below.  Don't let that scare you... it only takes 4 seconds on my laptop and the table only occupies 31.6MB in TempDB (a safe place to test).

    DROP TABLE IF EXISTS #MyHead;
    SET STATISTICS TIME ON
    ;
    WITH
    cteGenBaseMultiplier AS
    (
     SELECT  t.N
            ,Multiplier = ABS(CHECKSUM(NEWID())%9)+1
       FROM dbo.fnTally(1,1300000) t --There's usually over 200,000 dupes in a million rows for this
    )
    ,cteGenStrings AS
    (
     SELECT SomeString = CONVERT(VARCHAR(40),STRING_AGG(ABS(CHECKSUM(NEWID())%256),'.'))
       FROM cteGenBaseMultiplier gs
      CROSS APPLY dbo.fnTally(1,gs.Multiplier) t
      GROUP BY gs.N
    )
     SELECT TOP 1000000
            SomeString = ISNULL(SomeString,'') --ISNULL makes the resulting column NOT NULL
       INTO #MyHead
       FROM cteGenStrings
      GROUP BY SomeString
    ;
    ALTER TABLE dbo.#MyHead ADD PRIMARY KEY CLUSTERED (SomeString);
    SET STATISTICS TIME OFF
    ;

    Once you've built that test table, let's find out how fast it is to find things using your method...

    SET STATISTICS TIME,IO ON;
    SELECT * FROM #MyHead
    WHERE SomeString LIKE '123.[0-9].5'
    OR SomeString LIKE '123.[0-9][0-9].5'
    OR SomeString LIKE '123.[0-9][0-9][0-9].5'
    SET STATISTICS TIME,IO OFF;

    That bad boy didn't take long to run (less than a millisecond) and only consumed 21 reads.  Not bad.  But that only returns things of the form 123.???.5.  What if we want to return anything with 123 in the first octet and 5 in the third?  Well, then we need a trailing wildcard like the following, basically doubling the criteria if we still want it to be SARGable...

    SET STATISTICS TIME,IO ON;
    SELECT * FROM #MyHead
    WHERE SomeString LIKE '123.[0-9].5'
    OR SomeString LIKE '123.[0-9][0-9].5'
    OR SomeString LIKE '123.[0-9][0-9][0-9].5'
    OR SomeString LIKE '123.[0-9].5.%'
    OR SomeString LIKE '123.[0-9][0-9].5.%'
    OR SomeString LIKE '123.[0-9][0-9][0-9].5.%'
    SET STATISTICS TIME,IO OFF;

    Still, that only took 3ms duration (still "0" CPU) and had the same number of reads... 21.

    Just about any other trick will cause a non-SARGable query that will do a full table/index scan.
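
    For example, here's a quick sketch against the same test table of what I mean... a predicate that wraps the column in functions, like the dot-counting approach, can't use the index:

    SET STATISTICS TIME,IO ON;
    SELECT * FROM #MyHead
    WHERE LEN(SomeString) - LEN(REPLACE(SomeString,'.','')) = 2; --functions on the column force a scan
    SET STATISTICS TIME,IO OFF;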

    Now, shifting gears a bit, the number system we're working with smacks of a WBS (Work Breakdown Structure) or some other type of organizational hierarchy.  That probably means it's not going to change very often.  I could waste your time by showing you how to split out the first 4 octets into a nasty-fast lookup table that also contained the original string to solve the particular problem you posted, but then I got to thinking...

    It seems to me that you're going to have many more and bigger issues with all of this later.   If the strings of octets you have are UNIQUE, we could convert this to both an Adjacency List Hierarchy (for super easy maintenance) and Nested Sets (for nasty-fast performance and a whole lot of flexibility for other things).  Each conversion on a million rows only takes about 19 seconds on my new laptop.  It takes a lot less time for sets smaller than that.
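
    Just to give a rough idea of where that would start (a sketch only, not the full conversion), each node's parent in an Adjacency List is simply the string with its last octet stripped off:

    SELECT  Node   = SomeString
           ,Parent = CASE WHEN CHARINDEX('.', SomeString) > 0
                          THEN LEFT(SomeString, LEN(SomeString) - CHARINDEX('.', REVERSE(SomeString)))
                     END --top-level octets have no parent
      FROM #MyHead;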

    Would that be something you're interested in or are you only in need of solving your current problem, which you already know how to do with the triple criteria you used?

     

     

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • First, you need to learn the basic terms. Rows are not records: rows can have constraints, can be virtual, etc., while a record is a physical unit of storage. Fields are not columns: if you read the SQL Standards, you'll find that a field is a meaningful sub-unit of a column, such as a year within a date.

    Yes, the LIKE predicate in SQL Server is weak. If Microsoft had full ANSI/ISO Standard SQL, you would be able to use the SIMILAR TO predicate, which is based on POSIX regular expressions.
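
    For what it's worth, in a dialect that does implement it (PostgreSQL, for example), the pattern would look something like the following. This is illustration only, will not run in SQL Server, and the table and column names are placeholders:

    SELECT x
    FROM some_table
    WHERE x SIMILAR TO '123.[0-9]{1,3}.5';  -- the dot is a literal here; {1,3} bounds the repetition; the whole string must match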

    Since I use this style of numbering for sections in my books, I found it was easier to pad each of the subsections with zeros in the source. That means I would write "123.01.05." where you are writing "123.1.5", and sometimes I even put "123.01.05.00." to get a fixed-length identifier.
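
    As a rough sketch of that padding in T-SQL (my own example, assuming a three-part value):

    DECLARE @s varchar(30) = '123.1.5';
    SELECT RIGHT('000' + PARSENAME(@s, 3), 3) + '.'
         + RIGHT('00'  + PARSENAME(@s, 2), 2) + '.'
         + RIGHT('00'  + PARSENAME(@s, 1), 2) AS PaddedKey; --yields 123.01.05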

    Over the years I have found that some of my word processors were able to do this padding for me.

    Please post DDL and follow ANSI/ISO standards when asking for help. 
