Finding a substring and extracting a number associated with it

  • The Dixie Flatline (7/6/2016)


    Hey JLS, I think I have a solution, but for some strange reason the website won't let me post or upload the code if I spell out TRY_CONVERT more than once.

    Do a search_replace on the code below and let me know.

    Key points:

    1. Expanded rules for what can be considered a date string.

    2. Used PATINDEX to determine if admissions came before discharges in CTE ([AdmissionFirst])

    3. Used ROW_NUMBER() in CTE2 to make sure that DateX was always the last row for a given RowID and that the order of integers in [textstring] was preserved.

    4. Used CASE statements in the final query to decide which integers were used for admissions and which for discharges, based on the value of the AdmissionFirst flag

    ---------------------------------------------------------------------------------------------------------------------------------------------------------

    -- Search/replace "try_C(" to the correct function

    ---------------------------------------------------------------------------------------------------------------------------------------------------------

    with cte as (select rowID, itemnumber,convert(varchar(50),item) as item

    , case when PATINDEX('%Admission%Discharge%',TextString) > 0 then 'Y' else '' end as AdmissionFirst

    , case when item like '%/%/%' or (try_C(date,Item) is not null and len(item) > 5) then item end as DateX

    from #input i

    cross apply DelimitedSplit8K(TextString, ' ')

    )

    ,cte2 as (select *, ROW_NUMBER() over(Partition by RowID order by rowID,datex, ItemNumber) as RowNum

    from cte

    where item > ''

    and (try_C(int,Item) is not null

    or Item like '%/%/%'

    or try_C(date,Item) is not null ))

    select RowID

    ,max(case when AdmissionFirst = 'Y' and RowNum = 1 then Item

    when AdmissionFirst <> 'Y' and RowNum = 2 then Item

    end) as Admissions

    ,max(case when AdmissionFirst = 'Y' and RowNum = 2 then Item

    when AdmissionFirst <> 'Y' and RowNum = 1 then Item

    end) as Discharges

    ,max(DateX) as DateX

    from cte2

    group by RowID

    Hi Dixie....

    is this how you wanted to post your code?

    with cte as (select rowID, itemnumber,convert(varchar(50),item) as item

    , case when PATINDEX('%Admission%Discharge%',TextString) > 0 then 'Y' else '' end as AdmissionFirst

    , case when item like '%/%/%' or (TRY_CONVERT(date,Item) is not null and len(item) > 5) then item end as DateX

    from #input i

    cross apply DelimitedSplit8K(TextString, ' ')

    )

    ,cte2 as (select *, ROW_NUMBER() over(Partition by RowID order by rowID,datex, ItemNumber) as RowNum

    from cte

    where item > ''

    and (TRY_CONVERT(int,Item) is not null

    or Item like '%/%/%'

    or TRY_CONVERT(date,Item) is not null ))

    select RowID

    ,max(case when AdmissionFirst = 'Y' and RowNum = 1 then Item

    when AdmissionFirst <> 'Y' and RowNum = 2 then Item

    end) as Admissions

    ,max(case when AdmissionFirst = 'Y' and RowNum = 2 then Item

    when AdmissionFirst <> 'Y' and RowNum = 1 then Item

    end) as Discharges

    ,max(DateX) as DateX

    from cte2

    group by RowID

    ________________________________________________________________
    you can lead a user to data....but you cannot make them think
    and remember....every day is a school day

  • Yes. I'm not sure why it wasn't letting me post that.

    More importantly: Does it work for you?

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • The Dixie Flatline (7/6/2016)


    Yes. I'm not sure why it wasn't letting me post that.

    More importantly: Does it work for you?

    yeah.. I think it works...sorry but havent had time to fully investigate.

    hopefully will hav esome more time tomorrow.

    fancy buidling a test harness of 1 miilions rows to see perfomeance?

    <grin>

    best JLS

    ________________________________________________________________
    you can lead a user to data....but you cannot make them think
    and remember....every day is a school day

  • Already did that. I created a table with an identity column and put your text strings into it, then just kept ramping it up with the following loop.

    while (select count(*) from #input) < 1000000

    begin

    insert into #input

    select top (10000) TextString

    from #input;

    end

    The extraction query itself takes about 15 seconds for just over 1 million rows on my box. Execution plan says 30% of that is just writing the data out to a #results table.

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • The Dixie Flatline (7/6/2016)


    Already did that. The extraction query itself takes about 15 seconds on my box. Execution plan says 30% of that is just writing the data out to a #results table.

    while (select count(*) from #input) < 1000000

    begin

    insert into #input

    select top (10000) TextString

    from #input;

    end

    edit ...doh

    ________________________________________________________________
    you can lead a user to data....but you cannot make them think
    and remember....every day is a school day

  • Couldn't help myself but have a little play with this myself.

    Basically this determines the order of the words admission and discharge, grabs and remove the date (if possible) from the string, grabs the first and second number from the string and returns them in the the appropriate column determined by the word order.

    This is all done in a bunch of nested queries.

    Performance should be really good. Will fail if there isn't two numbers. If it can't determine a date it will return an empty string ... I'm looking at you row 14 🙂

    WITH TestValues AS(

    SELECT *

    FROM (values

    ( 1, 'There were 5 admissions and 3 discharge since reporting 22/04/2016'),

    ( 2, 'today here were 6 discharges and 3 admission since reporting 21/04/2016'),

    ( 3, 'since reporting 30/04/2016 the discharges are 8 and the admissions are 10' ),

    ( 4, ' the discharges are 20 and the admissions are 30 on 10/05/2016 '),

    ( 5, ' the admission was 100 on 15/05/2016 and the discharge was 30 '),

    ( 6, ' on 15/08/2016 5 admissions and discharge 10' ),

    ( 7, ' 100 admission 200 discharges 25/12/2018' ),

    ( 8, ' discharge 500 on 01/01/2020 and 100 admission '),

    ( 9, ' discharge 1 2 admission 01/01/2020' ),

    ( 10, ' 20180101 100 admission 200 discharges' ),

    ( 11, '20160101 admission 1 2 discharges '),

    ( 12, 'here are the results up to 12/12/12 Admission 300 400 Discharge ' ) ,

    ( 13, ' Discharges 9000 20/May/2015 8000 Admissions ' ) ,

    ( 14, ' Discharges 99999 88888 Admissions 161216' ),

    ( 15, ' Discharges I think were 99 and maybe 88 possibly less Admissions on 2016-05-15' )

    )

    dt (rowID,TextString)

    )

    SELECT ReportDate

    ,CASE NumOrder WHEN -1 THEN FirstNum ELSE SecondNum END Admissions

    ,CASE NumOrder WHEN -1 THEN SecondNum ELSE FirstNum END Discharges

    FROM (

    SELECT NumOrder, ReportDate, FirstNum

    ,CAST(LEFT(STUFF(LeftOver,1,PATINDEX('%[0-9]%',LeftOver)-1,''),CHARINDEX(' ',STUFF(LeftOver,1,PATINDEX('%[0-9]%',LeftOver)-1,''))-1) AS INT) SecondNum

    FROM (

    SELECT NumOrder, ReportDate

    ,CAST(LEFT(STUFF(DateGone,1,PATINDEX('%[0-9]%',DateGone)-1,''),CHARINDEX(' ',STUFF(DateGone,1,PATINDEX('%[0-9]%',DateGone)-1,''))-1) AS INT) FirstNum

    ,STUFF(STUFF(DateGone,1,PATINDEX('%[0-9]%',DateGone)-1,''),1,CHARINDEX(' ',STUFF(DateGone,1,PATINDEX('%[0-9]%',DateGone)-1,''))-1,'') LeftOver

    FROM (

    SELECT SIGN( PATINDEX( '%admission%',TextString ) - PATINDEX( '%discharge%',TextString )) NumOrder

    ,CASE

    WHEN PATINDEX('%[0-9][0-9][/-]%[/-]%[0-9][0-9]%',TextString) > 0 THEN

    SUBSTRING(TextString,PATINDEX('%[0-9][0-9][/-]%[/-]%[0-9][0-9]%',TextString),10)

    ELSE

    SUBSTRING(TextString,PATINDEX('%[1-2][8,9,0,1][0-9][0-9][0-1][0-9][0-3][0-9]%',TextString),8 * SIGN(PATINDEX('%[1-2][8,9,0,1][0-9][0-9][0-1][0-9][0-3][0-9]%',TextString)))

    END ReportDate

    ,CASE

    WHEN PATINDEX('%[0-9][0-9][/-]%[/-]%[0-9][0-9]%',TextString) > 0 THEN

    REPLACE(TextString,SUBSTRING(TextString,PATINDEX('%[0-9][0-9][/-]%[/-]%[0-9][0-9]%',TextString),10 ),'')+' '

    ELSE

    REPLACE(TextString,SUBSTRING(TextString,PATINDEX('%[1-2][8,9,0,1][0-9][0-9][0-1][0-9][0-3][0-9]%',TextString),8 * SIGN(PATINDEX('%[1-2][8,9,0,1][0-9][0-9][0-1][0-9][0-3][0-9]%',TextString))),'')+' '

    END

    DateGone

    FROM TestValues

    ) A

    ) B

    ) C

  • My customized solution would be to hunt down the idiot(s) that are creating the mess in the files and take them to a nice, sit down, pork chop dinner. 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • You always cut to the heart of the pork chop... errrrr matter..... Jeff 😀

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

Viewing 8 posts - 31 through 37 (of 37 total)

You must be logged in to reply to this topic. Login to reply