Remove duplicate substrings from string

  • Hello experts, I have a problem I'd appreciate advice on please.

    I'm asked to determine distinct combinations of sequences of values from a field which contains delimited values. My issue is that I need to remove instances of duplicated strings, while preserving sort order.

    Examples of the input and output are given in the table:

    create table #stringsplit (str_raw varchar(30), str_output varchar(30))

    insert into #stringsplit (str_raw, str_output)

    values ('A > A > A', 'A')
    , ('A > A > A > A > C', 'A > C')
    , ('A > A > A > A > C > A', 'A > C > A')
    , ('A > A > A > A > C > B', 'A > C > B')
    , ('D > B > B > D > C', 'D > B > D > C')

    I've seen examples of user defined functions where users parse and reconstruct strings, but not where the order of occurrence of the substring is preserved.

    Thanks
    Mickey

  • One idea:

    WITH Grps AS (
      SELECT SS.str_raw,
        DS.Item,
        DS.ItemNumber,
        ROW_NUMBER() OVER (PARTITION BY SS.str_raw ORDER BY DS.ItemNumber) -
        ROW_NUMBER() OVER (PARTITION BY SS.str_raw, DS.Item ORDER BY DS.ItemNumber) AS Grp
      FROM #stringsplit SS
       CROSS APPLY DelimitedSplit8K (REPLACE(SS.str_raw,' ',''), '>') DS),
    Distinctions AS (
      SELECT DISTINCT str_raw, Item, Grp
      FROM Grps)
    SELECT STUFF(REPLACE((SELECT ' > ' + Item
            FROM Distinctions sq
            WHERE sq.str_raw = D.str_raw
            ORDER BY Grp
            FOR XML PATH ('')),'>','>'),1,3,'')
    FROM Distinctions D
    GROUP BY D.str_raw;

    Note this uses jeff Mode's Delimited Split 8K; as STRING_SPLIT is only available on SQL Server 2016+.

    Thom~

    Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
    Larnu.uk

  • Thanks Thom, I used the DelimitedSplit8K described by Jeff Moden - http://www.sqlservercentral.com/articles/Tally+Table/72993/

    I get the results below:

    A
    A > C
    A > A > C
    A > C > B
    D > B > D > C

    Note the third row, which I'd want to be:

    A > C > A

    All other rows look good, but this one is out of sequence, can anyone describe why?
    Thanks
    Mickey

  • mickey w - Tuesday, April 24, 2018 2:58 AM

    Thanks Thom, I used the DelimitedSplit8K described by Jeff Moden - http://www.sqlservercentral.com/articles/Tally+Table/72993/

    I get the results below:

    A
    A > C
    A > A > C
    A > C > B
    D > B > D > C

    Note the third row, which I'd want to be:

    A > C > A

    All other rows look good, but this one is out of sequence, can anyone describe why?
    Thanks
    Mickey

    Give this a go:

    WITH Grps AS (
      SELECT SS.str_raw,
        DS.Item,
        DS.ItemNumber,
        ROW_NUMBER() OVER (PARTITION BY SS.str_raw ORDER BY DS.ItemNumber) -
        ROW_NUMBER() OVER (PARTITION BY SS.str_raw, DS.Item ORDER BY DS.ItemNumber) AS Grp
      FROM #stringsplit SS
       CROSS APPLY DelimitedSplit8K (REPLACE(SS.str_raw,' ',''), '>') DS),
    Distinctions AS (
      SELECT str_raw, Item, GRP,
        MIN(ItemNumber) AS ItemNumber
      FROM Grps
      GROUP BY str_raw, Item, Grp)
    SELECT D.str_raw,
       STUFF(REPLACE((SELECT ' > ' + Item
            FROM Distinctions sq
            WHERE sq.str_raw = D.str_raw
            ORDER BY ItemNumber
            FOR XML PATH ('')),'>','>'),1,3,'')
    FROM Distinctions D
    GROUP BY D.str_raw;

    Thom~

    Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
    Larnu.uk

  • fantastic, thanks Thom this looks to have solved it.

    much appreciated.

  • For fun, here is a more efficient way (I think 😉 not tested
    😎

    USE TEEST;
    GO
    SET NOCOUNT ON;
    create table #stringsplit (str_raw varchar(30), str_output varchar(30))
    insert into #stringsplit (str_raw, str_output)
    values ('A > A > A', 'A')
    , ('A > A > A > A > C', 'A > C')
    , ('A > A > A > A > C > A', 'A > C > A')
    , ('A > A > A > A > C > B', 'A > C > B')
    , ('D > B > B > D > C', 'D > B > D > C')
    ;WITH T(N) AS (SELECT X.N FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) X(N))
    ,SAMPLE_DATA AS
    (
      SELECT
       ROW_NUMBER() OVER (ORDER BY @@VERSION) AS STR_RID
       ,S.str_raw
      FROM #stringsplit S
    )
    ,GROUPED_SET AS
    (
    SELECT
      SS.STR_RID
     ,SS.str_raw + ' >' AS INP_STR
     ,NUMS.N
     ,SUBSTRING(SS.str_raw + ' > ',(NUMS.N * 4) -3, 4) AS GRP_STR
    FROM SAMPLE_DATA SS
    CROSS APPLY
    (
      SELECT TOP((LEN(SS.str_raw) + 3) / 4 )
       ROW_NUMBER() OVER (ORDER BY @@VERSION) AS N
      FROM T T1,T T2,T T3,T T4
    )
    NUMS(N)
    )
    ,COMP_SET AS
    (
    SELECT
     GS.STR_RID
    ,GS.GRP_STR
    ,LEAD(GS.GRP_STR,1,NULL) OVER
       (
        PARTITION BY GS.STR_RID
        ORDER BY  GS.N
       ) AS CMP_STR
    ,GS.N
    FROM GROUPED_SET  GS
    )
    ,OUTPUT_SET AS
    (
      SELECT
      CS.STR_RID
      ,(
       SELECT
        CONCAT
          (
           ''
           ,CASE
            WHEN CSS.GRP_STR = CSS.CMP_STR THEN ''
            ELSE CSS.GRP_STR
           END
          )
       FROM  COMP_SET   CSS
       WHERE CSS.STR_RID = CS.STR_RID
       FOR XML PATH (''), TYPE
      ).value('(./text())[1]','VARCHAR(MAX)') AS OUT_STR_RAW
      FROM  COMP_SET  CS
      GROUP BY CS.STR_RID
    )
    SELECT
      OS.STR_RID
     ,LTRIM(STUFF(OS.OUT_STR_RAW,LEN(OS.OUT_STR_RAW) - 1,2,'')) AS FINAL_STR
    FROM OUTPUT_SET OS;
    DROP TABLE #stringsplit;

    Output

    1    A
    2    A > C
    3    A > C > A
    4    A > C > B
    5    D > B > D > C

  • Interesting idea Eirikur. If I read your code correctly, however, this does assume that the value between the the Greater Than symbols is only 1 character.

    Thom~

    Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
    Larnu.uk

  • Thom A - Tuesday, April 24, 2018 9:03 AM

    Interesting idea Eirikur. If I read your code correctly, however, this does assume that the value between the the Greater Than symbols is only 1 character.

    You are correct Tom, the sample data only indicates one space character between the GT's and the values.
    😎

    If the sample data does not accurately reflect the characteristics of the real data, then this can easily be adjusted, I'm not in the business of assuming or guessing😉

    In this case, the two major factors, when it comes to performance, are the iterations in the delimiter detection part and the iterations in the reconstruction of the output string. Since the sample data indicates that this is a fixed width entry, we can reduce the number of iterations in the former by the length of each entry, here that is 1/4.

    Another factors are the plan simplification, the reduction of blocking (sort) operators from 7 to 1, on a large sets, this makes a lot of difference, the number of joins from 11 to 9 and the reduction of the total number of operators from approx. 67 to 48.

    Although the batch execution in the execution plan is not always too accurate, the difference here is 9:1 on my old laptop.

  • Another really fast way to tackle this (using NGrams8k - link is in my signature):

    SELECT *
    FROM #stringsplit s
    CROSS APPLY
    (
    SELECT STUFF(REPLACE((
      SELECT ' > ' + token
      FROM (VALUES (s.str_raw)) s(str_raw)
      CROSS APPLY (VALUES (REPLACE(s.str_raw, ' > ', ''))) clean(string)
      CROSS APPLY dbo.ngrams8k(clean.string,1)
      WHERE patindex('%[^'+token+']%', substring(clean.string, position+1, 8000)) = 1
            OR position = len(clean.string)
      ORDER BY position
      FOR XML PATH('')),'>','>'),1,3,'')
    ) new(string);

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Alan.B - Tuesday, April 24, 2018 1:03 PM

    Another really fast way to tackle this (using NGrams8k - link is in my signature):

    SELECT *
    FROM #stringsplit s
    CROSS APPLY
    (
    SELECT STUFF(REPLACE((
      SELECT ' > ' + token
      FROM (VALUES (s.str_raw)) s(str_raw)
      CROSS APPLY (VALUES (REPLACE(s.str_raw, ' > ', ''))) clean(string)
      CROSS APPLY dbo.ngrams8k(clean.string,1)
      WHERE patindex('%[^'+token+']%', substring(clean.string, position+1, 8000)) = 1
            OR position = len(clean.string)
      ORDER BY position
      FOR XML PATH('')),'>','>'),1,3,'')
    ) new(string);

    Very nice Alan, had totally forgotten about your nasty fast NGrams functions, this is far faster than my code.
    😎

  • Eirikur Eiriksson - Tuesday, April 24, 2018 11:54 PM

    Alan.B - Tuesday, April 24, 2018 1:03 PM

    Another really fast way to tackle this (using NGrams8k - link is in my signature):

    SELECT *
    FROM #stringsplit s
    CROSS APPLY
    (
    SELECT STUFF(REPLACE((
      SELECT ' > ' + token
      FROM (VALUES (s.str_raw)) s(str_raw)
      CROSS APPLY (VALUES (REPLACE(s.str_raw, ' > ', ''))) clean(string)
      CROSS APPLY dbo.ngrams8k(clean.string,1)
      WHERE patindex('%[^'+token+']%', substring(clean.string, position+1, 8000)) = 1
            OR position = len(clean.string)
      ORDER BY position
      FOR XML PATH('')),'>','>'),1,3,'')
    ) new(string);

    Very nice Alan, had totally forgotten about your nasty fast NGrams functions, this is far faster than my code.
    😎


    Agree, very nice indeed.
    But I think, based on my reading, that the WHERE can be simplified to get rid of the patindex() call:
    ...
    WHERE substring(clean.string, position+1, 1) <> token ...
    ...

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

Viewing 11 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply