How to find multiple repeated occurrences with regular expression using like?

  • Seems SQL Server doesn't have a regular expression function.

    I am trying to query for records in which a varchar field contains values of the form 123.x.5, where x is an integer of 1-3 digits. So I want to retrieve the values 123.1.5, 123.12.5 and 123.123.5, but not 123.12.3.5. I can get the single-digit case with "x like '123.[0-9].5'", but '123.[0-9]+.5' returns nothing, and likewise with '*'.

    I could OR them together as in "x like '123.[0-9].5' OR x like '123.[0-9][0-9].5' OR x like '123.[0-9][0-9][0-9].5'", but that seems excessively cumbersome. Trying just "x like '123.%.5'" will get these records, but also those with 123.12.3.5, which I don't want.

    Anyone know of a better method of doing this?

  • It's possible to use the PARSENAME function to split these values into octets. If you're only trying to identify the values with exactly three octets, you could possibly use something like this:

    WITH CTE (Something) AS
    (
        SELECT '123.1.5'
        UNION ALL
        SELECT '123.12.5'
        UNION ALL
        SELECT '123.123.5'
        UNION ALL
        SELECT '123.12.3.5'
    )
    SELECT Something
    FROM CTE
    WHERE PARSENAME(CTE.Something, 4) IS NULL;

    You may still need to do some additional validation in the WHERE clause, but hopefully this helps get you started!
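
    For example, a minimal sketch of that extra validation might look like the following (the table and column names here are placeholders, not from the original question):

    SELECT yt.YourColumn
    FROM dbo.YourTable yt
    WHERE PARSENAME(yt.YourColumn, 4) IS NULL              -- no fourth part
      AND PARSENAME(yt.YourColumn, 3) = '123'              -- first part is 123
      AND PARSENAME(yt.YourColumn, 1) = '5'                -- last part is 5
      AND PARSENAME(yt.YourColumn, 2) NOT LIKE '%[^0-9]%'; -- middle part is all digits

    Keep in mind that PARSENAME only handles up to four parts; strings with more dots return NULL for every position, which happens to exclude them here as well.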

  • jagus wrote:

    Seems SQL Server doesn't have a regular expression function.

    I am trying to query for records in which a varchar field contains values of the form 123.x.5, where x is an integer of 1-3 digits. So I want to retrieve the values 123.1.5, 123.12.5 and 123.123.5, but not 123.12.3.5. I can get the single-digit case with "x like '123.[0-9].5'", but '123.[0-9]+.5' returns nothing, and likewise with '*'.

    I could OR them together as in "x like '123.[0-9].5' OR x like '123.[0-9][0-9].5' OR x like '123.[0-9][0-9][0-9].5'", but that seems excessively cumbersome. Trying just "x like '123.%.5'" will get these records, but also those with 123.12.3.5, which I don't want.

    Anyone know of a better method of doing this?

    What else is in the column?  Is it just numbers separated by two or three decimal points?  The reason I'm asking is that using something like PARSENAME, or just about any other method save one, is going to guarantee a scan of the entire table or index you have.  In order to do the one thing that might not force a scan, I need to know exactly what can and cannot be in this column.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • All the values in this field are numbers of up to three digits separated by decimal points, but there can be up to nine (maybe slightly more) such octets.   Some values are just a single number with no decimal points.   So a value can be something like 3, or 123.123.123.123.123.123, or longer.

  • Here is one way - but I am sure there is a better/faster method:

    Declare @testTable Table (testString varchar(30));

    Insert Into @testTable
    Values ('123.1.4'), ('123.12.5'), ('123.12.3.5');

    Select *
    From @testTable tt
    Where len(tt.testString) - len(replace(tt.testString, '.', '')) = 2;
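
    If the goal is still the original 123.x.5 pattern, the dot count could presumably be combined with a prefix/suffix filter in the same WHERE clause (just a sketch, building on the table variable above):

    Select *
    From @testTable tt
    Where len(tt.testString) - len(replace(tt.testString, '.', '')) = 2  -- exactly three parts
      And tt.testString Like '123.%.5';                                  -- starts with 123., ends with .5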

     

    Jeffrey Williams
    “We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”

    ― Charles R. Swindoll

    How to post questions to get better answers faster
    Managing Transaction Logs

  • jagus wrote:

    All the values in this field are numbers of up to three digits separated by decimal points, but there can be up to nine (maybe slightly more) such octets.   Some values are just a single number with no decimal points.   So a value can be something like 3, or 123.123.123.123.123.123, or longer.

    Ok... so let's do a little test.  The following code generates 1 million random rows with 1 to 9 octets on each row.  You'll need the fnTally function to do this and you can get that at the like-named link in my signature line below.  Don't let that scare you... it only takes 4 seconds on my laptop and the table only occupies 31.6MB in TempDB (a safe place to test).

    DROP TABLE IF EXISTS #MyHead;
    SET STATISTICS TIME ON
    ;
    WITH
    cteGenBaseMultiplier AS
    (
     SELECT  t.N
            ,Multiplier = ABS(CHECKSUM(NEWID())%9)+1
       FROM dbo.fnTally(1,1300000) t --There's usually over 200,000 dupes in a million rows for this
    )
    ,cteGenStrings AS
    (
     SELECT SomeString = CONVERT(VARCHAR(40),STRING_AGG(ABS(CHECKSUM(NEWID())%256),'.'))
       FROM cteGenBaseMultiplier gs
      CROSS APPLY dbo.fnTally(1,gs.Multiplier) t
      GROUP BY gs.N
    )
     SELECT TOP 1000000
            SomeString = ISNULL(SomeString,'') --ISNULL makes the resulting column NOT NULL
       INTO #MyHead
       FROM cteGenStrings
      GROUP BY SomeString
    ;
    ALTER TABLE dbo.#MyHead ADD PRIMARY KEY CLUSTERED (SomeString);
    SET STATISTICS TIME OFF
    ;

    Once you've built that test table, let's find out how fast it is to find things using your method...

    SET STATISTICS TIME,IO ON;
    SELECT * FROM #MyHead
    WHERE SomeString LIKE '123.[0-9].5'
    OR SomeString LIKE '123.[0-9][0-9].5'
    OR SomeString LIKE '123.[0-9][0-9][0-9].5'
    SET STATISTICS TIME,IO OFF;

    That bad boy didn't take long to run (less than a millisecond) and only consumed 21 reads.  Not bad.  But that only returns things of the form 123.???.5.  What if we want to return anything with 123 in the first octet and 5 in the third?  Well, then we need a trailing wildcard like the following, basically doubling the criteria if we still want it to be SARGable...

    SET STATISTICS TIME,IO ON;
    SELECT * FROM #MyHead
    WHERE SomeString LIKE '123.[0-9].5'
    OR SomeString LIKE '123.[0-9][0-9].5'
    OR SomeString LIKE '123.[0-9][0-9][0-9].5'
    OR SomeString LIKE '123.[0-9].5.%'
    OR SomeString LIKE '123.[0-9][0-9].5.%'
    OR SomeString LIKE '123.[0-9][0-9][0-9].5.%'
    SET STATISTICS TIME,IO OFF;

    Still, that only took 3ms duration (still "0" CPU) and had the same number of reads... 21.

    Just about any other trick will cause a non-SARGable query that will do a full table/index scan.
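
    For example, here's a quick sketch against the same test table of what I mean... a predicate that wraps the column in functions, like the dot-counting approach, can't use the index:

    SET STATISTICS TIME,IO ON;
    SELECT * FROM #MyHead
    WHERE LEN(SomeString) - LEN(REPLACE(SomeString,'.','')) = 2; --functions on the column force a scan
    SET STATISTICS TIME,IO OFF;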

    Now, shifting gears a bit, the number system we're working with smacks of a WBS (Work Breakdown Structure) or some other type of organizational hierarchy.  That probably means it's not going to change very often.  I could waste your time by showing you how to split out the first 4 octets into a nasty-fast lookup table that also contained the original string to solve the particular problem you posted, but then I got to thinking...

    It seems to me that you're going to have many more and bigger issues with all of this later.   If the strings of octets you have are UNIQUE, we could convert this to both an Adjacency List Hierarchy (for super easy maintenance) and Nested Sets (for nasty-fast performance and a whole lot of flexibility for other things).  Each conversion on a million rows only takes about 19 seconds on my new laptop.  It takes a lot less time for sets smaller than that.
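
    Just to give a rough idea of where that would start (a sketch only, not the full conversion), each node's parent in an Adjacency List is simply the string with its last octet stripped off:

    SELECT  Node   = SomeString
           ,Parent = CASE WHEN CHARINDEX('.', SomeString) > 0
                          THEN LEFT(SomeString, LEN(SomeString) - CHARINDEX('.', REVERSE(SomeString)))
                     END --top-level octets have no parent
      FROM #MyHead;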

    Would that be something you're interested in or are you only in need of solving your current problem, which you already know how to do with the triple criteria you used?

     

     

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • First, you need to learn the basic terms. Rows are not records: rows can have constraints, can be virtual, etc., while a record is a physical unit of storage. Fields are not columns: if you read the SQL Standards, you'll find that a field is a meaningful sub-unit of a column, such as a year within a date.

    Yes, the LIKE predicate in SQL Server is weak. If Microsoft had full ANSI/ISO Standard SQL, you would be able to use the SIMILAR TO predicate, which is based on POSIX regular expressions.
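
    For what it's worth, in a dialect that does implement it (PostgreSQL, for example), the pattern would look something like the following. This is illustration only, will not run in SQL Server, and the table and column names are placeholders:

    SELECT x
    FROM some_table
    WHERE x SIMILAR TO '123.[0-9]{1,3}.5';  -- the dot is a literal here; {1,3} bounds the repetition; the whole string must match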

    Since I use this style of numbering for sections in my books, I found it was easier to pad each of the subsections with zeros in the source. That means I would write "123.01.05." where you are writing "123.1.5", and sometimes I even put "123.01.05.00." to get a fixed-length identifier.
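
    As a rough sketch of that padding in T-SQL (my own example, assuming a three-part value):

    DECLARE @s varchar(30) = '123.1.5';
    SELECT RIGHT('000' + PARSENAME(@s, 3), 3) + '.'
         + RIGHT('00'  + PARSENAME(@s, 2), 2) + '.'
         + RIGHT('00'  + PARSENAME(@s, 1), 2) AS PaddedKey; --yields 123.01.05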

    Over the years I have found that some of my word processors were able to do this padding for me.

    Please post DDL and follow ANSI/ISO standards when asking for help. 
