Determine which pattern comes first in a string<!-- 864 --><!-- 864 -->

  • Offhand, what's the best way to do this?

    DelimitedSplit8k and PatternSplitLoop seem to have potential, but I'm just plain stuck on some things:

    1. DelimitedSplit8k: the delimiter helps split the folder paths, but the pattern can be within the strings that result.

    2. PatternSplitLoop: I would have to cross apply 16 times and have an awful WHERE clause to determine which of the four strings matched first.

    Unless I'm missing something. Short example is below.

    WITH testctes (string, pattern) AS (

    SELECT 'oh_look_at_this.thing.his\hers_stuff\mine.craft\yours_protein', 'his first' UNION ALL

    SELECT 'i.am.a._thing.hers\his_things\mine.refrigerator\yours_potato', 'hers first' UNION ALL

    SELECT 'path_like.things_mine\hers.some_else\his_garbage\yours_sneakers', 'mine first' UNION ALL

    SELECT 'more_stuff.yours\mine\hers\his_falafel', 'yours first'

    )

    SELECT string, pattern, ca.item, ca.itemnumber

    FROM testctes

    CROSS APPLY [dbo].[PatternSplitLoop] (string, '%his%') ca

  • Okay, now that's just a really cool puzzle. The following uses the "Swiss Army Knife" that is Jeff's DelimitedSplit8K function.

    WITH cteText(string) AS (

    SELECT 'oh_look_at_this.thing.his\hers_stuff\mine.craft\yours_protein' UNION ALL

    SELECT 'i.am.a._thing.hers\his_things\mine.refrigerator\yours_potato' UNION ALL

    SELECT 'path_like.things_mine\hers.some_else\his_garbage\yours_sneakers' UNION ALL

    SELECT 'more_stuff.yours\mine\hers\his_falafel'

    ),

    cteSplits AS (

    SELECT cteText.string, s3.ItemNumber, s3.Item,

    ROW_NUMBER() OVER(PARTITION BY cteText.string ORDER BY cteText.string, s1.ItemNumber, s2.ItemNumber, s3.ItemNumber) position

    FROM cteText

    CROSS APPLY DelimitedSplit8K(string, '.') s1

    CROSS APPLY DelimitedSplit8K(s1.item, '\') s2

    CROSS APPLY DelimitedSplit8K(s2.item, '_') s3

    ),

    cteFirsts(string, position) as (

    SELECT string, MIN(position)

    FROM cteSplits

    WHERE Item IN ('his', 'hers', 'mine')

    GROUP BY string

    )

    SELECT s.string, s.Item, s.position

    FROM cteSplits s

    INNER JOIN cteFirsts f ON f.string = s.string

    and f.position = s.position

    ORDER BY s.Item;

    cteText defines the values you want to parse.

    cteSplits uses Jeff's DelimitedSplit8K to split them out by your delimiters. You'll have to CROSS APPLY any other delimiters you need.

    cteFirsts determines the first occurrence of any string you're looking for in each split set.

    The query then joins cteSplits and cteFirsts together to get the string, matched item and position number.

    I hope this is what you're looking for, but I have to admit that I had fun with it.

  • Hi Ed,

    Thanks, that looks like it will get me most of the way there. I made a dumb, though, and just realized that the column the strings are stored in is a VARCHAR(MAX), which is noted to slow down the function wonder that is DelimitedSplit8k.

    I've got it running now, but it's going on the 10 minute mark. That snarky dev with his IF...CHARINDEX that goes on for a page might have this beat :crazy:

    Thanks

  • I'm not sure if this can work for you. Seems simple but I can't guarantee performance.

    WITH cteText(string) AS (

    SELECT 'oh_look_at_this.thing.his\hers_stuff\mine.craft\yours_protein' UNION ALL

    SELECT 'i.am.a._thing.hers\his_things\mine.refrigerator\yours_potato' UNION ALL

    SELECT 'path_like.things_mine\hers.some_else\his_garbage\yours_sneakers' UNION ALL

    SELECT 'more_stuff.yours\mine\hers\his_falafel'

    )

    SELECT string, pattern

    FROM cteText

    CROSS APPLY (SELECT TOP 1 pattern

    FROM (VALUES( 'his', CHARINDEX('his', string)),

    ( 'hers', CHARINDEX('hers', string)),

    ( 'mine', CHARINDEX('mine', string)),

    ( 'yours', CHARINDEX('yours', string)))x( pattern, position)

    ORDER BY position)y

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Luis, nice work.

    sqldriver, I would give Luis's solution a try instead. It operates on the base strings themselves and doesn't split them out into their component parts. As you're using MAX data types (which don't like to be joined to, hence the slowdown of DelimitedSplit8K) you'll likely get better performance from Luis's solution. Using CHARINDEX on the base string instead of splitting them out is a better solution if you're not looking for isolated strings between delimiters.

  • I just wanted to note that the performance of DelimitedSplit8K function won't be affected if it receives a varchar(max) as input. It will, however, truncate the string to a 8000 characters if the string is longer than that. The performance problem would occur if you change the definition of the function.

    Truncation shouldn't be an issue if you're looking for the first appearance of a pattern.

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Good point, Luis, but your solution still performs better. I've tried to get the DelimitedSplit8K function to handle MAX data types, and they simply don't like being joined to at all. I've tested it extensively using MAX data types that are less than 8K in size with identical data and the real difference is the data type itself.

    I did a performance comparison of the two approaches and you should definitely uses Luis's instead of mine. I created a physical table using a single Varchar(MAX) column with no indexing and no keys.

    On 10,000 rows, mine took 2750 ms and Luis's took 139 ms.

    Over 100,000 rows, mint took 33250 ms and Luis's took 538 ms.

    The bottom line is that if you're not looking to match isolated strings (based on delimiter), Luis's solution is the clear winner.

  • That runs fast, Luis, but I'm getting strange results. I would imagine it's something to do with the strings I'm actually looking for:

    CROSS APPLY (SELECT TOP 1 pattern

    FROM (VALUES( '_pim', CHARINDEX('[_]pim', PATH)),

    ( '_him', CHARINDEX('[_]him', PATH)),

    ( '.b.', CHARINDEX('.b.', PATH)),

    ( '.ib', CHARINDEX('.ib', PATH)),

    ( 'IM Conversation', CHARINDEX('IM Conversation', path))

    )x( pattern, position)

    I seem to get _him as the matched pattern, regardless of if it's first, or even in the string I'm searching. Any ideas?

    Thanks

  • Here's the whole query, if that helps. I would need some time to mask sample data well. These paths have a lot of client and employee identifying information in them.

    WITH cteString ( ArtifactID, PATH )

    AS ( SELECT ArtifactID ,

    PATH

    FROM [Document]

    WHERE PATH IS NOT NULL

    )

    SELECT ArtifactID ,

    PATH ,

    y.pattern ,

    CASE y.pattern

    WHEN '_pim' THEN 'PublicIM'

    WHEN '_him' THEN 'HubIMs'

    WHEN '.b.' THEN 'B Email'

    WHEN '.ib' THEN 'B Chats'

    WHEN 'IB Conversation' THEN 'B Chats'

    ELSE 'Check Subject and Record type etc.'

    END AS [Datatype]

    FROM cteString

    CROSS APPLY ( SELECT TOP 1

    pattern

    FROM ( VALUES

    ( '_pim', CHARINDEX('[_]pim', PATH)),

    ( '_him', CHARINDEX('[_]him', PATH)),

    ( '.b.', CHARINDEX('.b.', PATH)),

    ( '.ib', CHARINDEX('.ib', PATH)),

    ( 'IM Conversation', CHARINDEX('IM Conversation',

    path)) ) x ( pattern, position ) ORDER BY position

    ) y

  • It might be because you don't need the brackets to escape the underscore.

    The other problem is because if the pattern is not found, then you'll get a zero as the position. To prevent this, you can change the ORDER BY to something like this:

    ORDER BY CASE WHEN position = 0 THEN 999999 ELSE position END

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • I'd also remove the CASE (which doesn't seems to match the pattern search) and include it in the APPLY results.

    WITH cteString ( ArtifactID, PATH )

    AS ( SELECT ArtifactID ,

    PATH

    FROM [Document]

    WHERE PATH IS NOT NULL

    )

    SELECT ArtifactID ,

    PATH ,

    y.pattern ,

    y.Datatype

    FROM cteString

    CROSS APPLY ( SELECT TOP 1

    pattern, Datatype

    FROM ( VALUES

    ( 'PublicIM', '_pim', CHARINDEX('[_]pim', PATH)),

    ( 'HubIMs', '_him', CHARINDEX('[_]him', PATH)),

    ( 'B Email', '.b.', CHARINDEX('.b.', PATH)),

    ( 'B Chats', '.ib', CHARINDEX('.ib', PATH)),

    ( 'Bloomberg Chats', 'IB Conversation', CHARINDEX('IB Conversation',

    path)) ) x (Datatype, pattern, position )

    ORDER BY CASE WHEN position = 0 THEN 999999 ELSE position END

    ) y

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Ah, look at that. Even better is filtering out where position > 0. That rules and runs super fast.

    Hope we never lose you to any Oracle forums :hehe:

    Thanks

  • I was just going to mention that. Here's a revised version to include strings with no matches at all using OUTER APPLY.

    WITH cteString ( ArtifactID, PATH )

    AS ( SELECT ArtifactID ,

    PATH

    FROM [Document]

    WHERE PATH IS NOT NULL

    )

    SELECT ArtifactID ,

    PATH ,

    y.pattern ,

    ISNULL( y.Datatype, 'Check Subject and Record type etc.') AS Datatype

    FROM cteString

    OUTER APPLY ( SELECT TOP 1

    pattern, Datatype

    FROM ( VALUES

    ( 'PublicIM', '_pim', CHARINDEX('_pim', PATH)),

    ( 'HubIMs', '_him', CHARINDEX('_him', PATH)),

    ( 'B Email', '.b.', CHARINDEX('.b.', PATH)),

    ( 'B Chats', '.ib', CHARINDEX('.ib', PATH)),

    ( 'Bloomberg Chats', 'IB Conversation', CHARINDEX('IB Conversation',

    path)) ) x (Datatype, pattern, position )

    WHERE position > 0

    ORDER BY position

    ) y

    EDIT: Oracle? :crazy::sick:

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • I confess to a bit of lack of understanding here (meaning in what you expect your output to be), but since you mentioned PatternSplitLoop (presumably the one from my article) I felt obliged to reply in some fashion.

    Are you looking for a solution something like this? This lists either his, hers or mine, whichever appears first (as a word) in the string. A word being any string bounded by non-letters.

    WITH cteText(string) AS (

    SELECT 'oh_look_at_this.thing.his\hers_stuff\mine.craft\yours_protein' UNION ALL

    SELECT 'i.am.a._thing.hers\his_things\mine.refrigerator\yours_potato' UNION ALL

    SELECT 'path_like.things_mine\hers.some_else\his_garbage\yours_sneakers' UNION ALL

    SELECT 'more_stuff.yours\mine\hers\his_falafel'

    )

    SELECT string, item

    FROM

    (

    SELECT string, item, rn=ROW_NUMBER() OVER (PARTITION BY string ORDER BY ItemNumber)

    FROM cteText a

    CROSS APPLY dbo.PatternSplitCM(string, '[a-z]') b

    WHERE [Matched]=1 AND item IN ('his', 'hers', 'mine')

    ) a

    WHERE rn=1;

    Two points here though:

    - If you're referring to the PatternSplitLoop function I think you are, use PatternSplitCM as it is much faster.

    - PatternSplitCM will probably suffer the same performance degradation using VARCHAR(MAX) that DelimitedSplit8K has.


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • To get around the performance hit on the VARCHAR(MAX), you may be able to create a MAX version of the splitter you end up using and process your data something like this:

    WITH cteText(string) AS (

    SELECT 'oh_look_at_this.thing.his\hers_stuff\mine.craft\yours_protein' UNION ALL

    SELECT 'i.am.a._thing.hers\his_things\mine.refrigerator\yours_potato' UNION ALL

    SELECT 'path_like.things_mine\hers.some_else\his_garbage\yours_sneakers' UNION ALL

    SELECT 'more_stuff.yours\mine\hers\his_falafel'

    )

    SELECT string, item

    FROM

    (

    SELECT string, item, rn=ROW_NUMBER() OVER (PARTITION BY string ORDER BY ItemNumber)

    FROM cteText a

    CROSS APPLY dbo.PatternSplitCM(string, '[a-z]') b

    WHERE LEN(string) <= 8000 AND [Matched]=1 AND item IN ('his', 'hers', 'mine')

    ) a

    WHERE rn=1

    UNION ALL

    SELECT string, item

    FROM

    (

    SELECT string, item, rn=ROW_NUMBER() OVER (PARTITION BY string ORDER BY ItemNumber)

    FROM cteText a

    CROSS APPLY dbo.PatternSplitMAX(string, '[a-z]') b

    WHERE LEN(string) > 8000 AND [Matched]=1 AND item IN ('his', 'hers', 'mine')

    ) a

    WHERE rn=1;

    Emphasis on the "maybe."


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

Viewing 15 posts - 1 through 15 (of 19 total)

You must be logged in to reply to this topic. Login to reply