Regex Help

  • I've just added CLR Regex Functions to our SQL Servers.

    I'm trying to replace some of our less efficient string parsing TSQL functions with the new .Net CLR Regex Functions.

    This is technically a regex, not a SQL question but I'm not sure where else to ask and hopefully some on here have some regex experience.

    I have a single "Notes" field that contains horrible non normalised data whereby the single fields contains multiple note entries.  Each note is datetime stamped by our application so there's always a consistent datetime stamp preceding each note.

    I need a regex to pass to the RegExMatches Table Valued Function so the function spits out each note as a row.

    The closest I've got is this

    - [0-9]{2}\/[0-9]{2}\/[0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}\s*.*\s*.*\s*.*

    but it assumes that there's only 2 paragraphs after then 2 header rows.  I need the regex to handle unlimited rows/paragraphs in each note and to keep capturing until it reaches the next note date/time header.

    Any help appreciated.

    Test input string below

    - 05/02/2020 11:14:15
    - Logged by: Ms K test
    note 1 paragraph 1

    note 1 paragraph 3

    note 1 paragraph 4

    note 1 paragraph 5

    - 02/01/2020 13:12:39
    - Logged by: Mr F test
    note 2 paragraph 1
    note 2 paragraph 2

    - 10/10/2020 15:11:00
    - Logged by: Mr S tester
    note 3 paragraph 1
    note 3 paragraph 2

    note 3 paragraph 2
    Attachments:
    You must be logged in to view attached files.
  • My regex is extremely rusty, but I think this should do what you are wanting as long as your paragraph and note don't contain a "-" character:

    -[\s\S]*?-[\s\S]*?(?=- [0-3][0-9]\/[0-3][0-9]\/[0-9]{4})

    Basically, we look for "- " followed by anything until it finds a second "- " then looks until it finds a "- " followed by a date.  I wasn't sure if you were in dd/MM/YYYY format or MM/dd/YYYY so that is why I did the numbers ranging from 0 to 3 for the first digit on the first and second part of the date.

    The only case where it will fail is if someone puts in "- " followed by a date in the note or paragraph section.

    I used the following site:

    http://refiddle.com/

    to help test it and it seems to work from what I can tell as long as your text block delimiter is a - character.

     

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there is negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • I've just added CLR Regex Functions to our SQL Servers.

    I'm trying to replace some of our less efficient string parsing TSQL functions with the new .Net CLR Regex Functions.

    For things like the problem you posted I often find myself replacing less efficient string parsing .Net CLR Regex Functions with ngrams8k.

    DECLARE @string VARCHAR(8000) = 
    '- 05/02/2020 11:14:15
    - Logged by: Ms K test
    note 1 paragraph 1

    note 1 paragraph 3

    note 1 paragraph 4

    note 1 paragraph 5

    - 02/01/2020 13:12:39
    - Logged by: Mr F test
    note 2 paragraph 1
    note 2 paragraph 2

    - 10/10/2020 15:11:00
    - Logged by: Mr S tester
    note 3 paragraph 1
    note 3 paragraph 2

    note 3 paragraph 2';

    SELECT
    f.NoteNumber,
    f.MatchedPattern,
    f.DateDelimiter,
    f.NotePosition,
    NoteLength = note.Ln,
    Note = SUBSTRING(@string,f.NotePosition,note.Ln)
    FROM
    (
    SELECT
    NoteNumber = ROW_NUMBER() OVER (ORDER BY (ng.Position)),
    DateDelimiter = ng.Token,
    NotePosition = ng.Position,
    NextPosition = LEAD(ng.Position,1,a.LN) OVER (ORDER BY (ng.Position)),
    MatchedPattern = b.X
    FROM (VALUES(LEN(@string),
    '[0-9][0-9]','/',':',' ','-')) AS a(Ln,D,F,C,S,H)
    CROSS APPLY samd.ngrams8k(@string,21) AS ng -- Split into 21-Grams
    CROSS APPLY (VALUES(H+S+D+F+D+F+D+D+S+D+C+D+C+D)) AS b(X) -- Bernie Expression
    CROSS APPLY (VALUES(PATINDEX(b.X,ng.Token))) AS i(Idx) -- ItemIndex
    WHERE i.Idx = 1
    ) AS f
    CROSS APPLY (VALUES(f.NextPosition-f.NotePosition)) AS note(Ln);

    This Returns:

    x1

    I know you only needed the Notes but, because it's so easy to do so I numbered them for you (without a sort), showed you the matched pattern, matched patterntext (I named it "Date Delimiter"), The position in the string where the note begins and it's length. This without any loops, I/O, temp/hash tables, mutable variables or other bad coding practices.

    I also have a bunch of regex CLRs loaded but stopped comparing performance to ngrams8k a long time ago.

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Thanks Brian,

    That regex is close.  It matches all the notes except the last one because it stops matching when it gets to the next date so the last note will never have another datetime stamp at the end so is excluded.

    Is there any way to fix that?

    Attachments:
    You must be logged in to view attached files.
  • After a little more noodling around, I think I refined (by trial and error more than skill) Brian's regex and came up with this which works

     

    -[\s\S]*?-[\s\S]*?(?=- [0-3][0-9]\/[0-3][0-9]\/[0-9]{4})|[\s\S]*?(?<=\n)[\s\S]*

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply