How to find the first recurring character of a string.

  • Alan.B - Thursday, November 2, 2017 6:32 PM

    I fixed Chris' Test harness code and my own solution which uses NGrams8K. It was faster. If you understand how NGrams8K works then you know that I essentially re-wrote Chris's query with one notable modification: The CHECKSUM around my Position removes two implicit conversions to int that you'll see in the execution plan for Chris' solution. Execution plans attached. Note that I also removed the rowid.  


    SELECT XCHar
    FROM #ListOfStrings los
    CROSS APPLY
    (
    SELECT TOP (1) token --, position
    FROM dbo.ngrams8k(los.name,1)
    WHERE CHARINDEX(token,SUBSTRING(los.name,1,CHECKSUM(position-1))) > 0
    ORDER BY position
    ) ngEng(XCHar);


    IF 1 = 1
    BEGIN
    SET NOCOUNT ON;
    IF OBJECT_ID('tempdb..#ListOfStrings') IS NOT NULL DROP TABLE #ListOfStrings;

    SELECT TOP(100000)
    [RowID] = ROW_NUMBER() OVER(ORDER BY (SELECT NULL)),
    [name] = CAST(NEWID() AS VARCHAR(36))
    INTO #ListOfStrings
    FROM SYS.COLUMNS a, SYS.COLUMNS b
    END;

    --PRINT char(10)+'== John '+REPLICATE('=',50);

    SET STATISTICS IO, TIME ON;
    PRINT char(10)+'== EE '+REPLICATE('========================================',2);
    SET STATISTICS IO, TIME ON;
    WITH T(N) AS (SELECT X.N FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) X(N)),
    POS_DIST AS
    (
    SELECT
      LOS.RowID
     ,NUMS.N
     ,SUBSTRING(LOS.[name],NUMS.N,1) AS XCHAR
     ,CHARINDEX(SUBSTRING(LOS.[name],NUMS.N,1),LOS.[name],NUMS.N + 1) - NUMS.N AS DISTANCE
    FROM #ListOfStrings LOS
    CROSS APPLY
    (
      SELECT TOP(LEN(LOS.[name])) ROW_NUMBER() OVER (ORDER BY @@VERSION) AS N
      FROM T T1,T T22,T T3,T T4
    ) NUMS(N)
    WHERE CHARINDEX(SUBSTRING(LOS.[name],NUMS.N,1),LOS.[name],NUMS.N + 1) - NUMS.N > 0
    ),
    FIND_MATCH AS
    (
    SELECT
      PD.RowID
      ,ROW_NUMBER() OVER (PARTITION BY PD.RowID ORDER BY PD.N + PD.DISTANCE) XRID
      ,PD.XCHAR
    FROM POS_DIST PD
    )
    SELECT --FM.RowID,
    FM.XCHAR
    FROM FIND_MATCH FM
    WHERE FM.XRID = 1
    ORDER BY FM.RowID;
    SET STATISTICS IO, TIME OFF;

    PRINT char(10)+'== Chris3 '+REPLICATE('========================================',2);
    SET STATISTICS IO, TIME ON;

    WITH Tally -- Chris3
    AS (SELECT n = 0 FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) d (n))
    SELECT -- c.RowID,
    XCHAR = SUBSTRING(c.name, x.n, 1)
    FROM #ListOfStrings c
    CROSS APPLY
    (
    SELECT TOP(1) n
    FROM (
    SELECT TOP(LEN(c.name)) n = ROW_NUMBER() OVER (ORDER BY @@VERSION)
    FROM Tally a, Tally b, Tally c
    ) t
    WHERE CHARINDEX(SUBSTRING(c.name, t.n, 1), LEFT(c.name, n-1)) > 0
    ORDER BY n
    ) x
    ORDER BY c.RowID;
    SET STATISTICS IO, TIME OFF;

    PRINT char(10)+'== Jason '+REPLICATE('========================================',2);
    SET STATISTICS IO, TIME ON;
    WITH
    cte_psd AS
    (
    SELECT
      los.RowID,
      bin_val = CAST(p.pos AS BINARY(4)) + CAST(sv.sub_val AS BINARY(2)),
      drnk = DENSE_RANK() OVER (PARTITION BY los.RowID, sv.sub_val ORDER BY p.pos)
    FROM #ListOfStrings los
    CROSS APPLY
    (
      SELECT TOP (LEN(los.name))
      rn =ROW_NUMBER() OVER (ORDER BY n2.n)
      FROM ( VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1) ) n1(n)
      CROSS APPLY ( VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) n2 (n) ) p (pos)
      CROSS APPLY ( VALUES (SUBSTRING(los.name, p.pos, 1)) ) sv (sub_val)
    )
    SELECT --psd.RowID,
    CAST(SUBSTRING(MIN(psd.bin_val), 5, 2) AS VARCHAR(2))
    FROM cte_psd psd
    WHERE psd.drnk > 1
    GROUP BY psd.RowID
    ORDER BY psd.RowID;
    SET STATISTICS IO, TIME OFF;

    PRINT char(10)+'== Alan '+REPLICATE('========================================',2);
    SET STATISTICS IO, TIME ON;
    SELECT XCHar
    FROM #ListOfStrings los
    CROSS APPLY
    (
    SELECT TOP (1) token --, position
    FROM dbo.ngrams8k(los.name,1)
    WHERE CHARINDEX(token,SUBSTRING(los.name,1,CHECKSUM(position-1))) > 0
    ORDER BY position
    ) ngEng(XCHar);

    SET STATISTICS IO, TIME OFF;

    Results

    == EE ================================================================================
    SQL Server Execution Times:
     CPU time = 0 ms, elapsed time = 0 ms.

    SQL Server Execution Times:
     CPU time = 0 ms, elapsed time = 0 ms.
    SQL Server parse and compile time:
     CPU time = 393 ms, elapsed time = 393 ms.
    Table '#ListOfStrings______________________________________________________________________________________________________000000000013'. Scan count 9, logical reads 705, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 7283, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:
     CPU time = 7735 ms, elapsed time = 9586 ms.

    == Chris3 ================================================================================
    SQL Server parse and compile time:
     CPU time = 0 ms, elapsed time = 3 ms.
    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 135, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table '#ListOfStrings______________________________________________________________________________________________________000000000013'. Scan count 9, logical reads 705, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:
     CPU time = 407 ms, elapsed time = 579 ms.

    == Jason ================================================================================
    SQL Server parse and compile time:
     CPU time = 74 ms, elapsed time = 74 ms.
    Table '#ListOfStrings______________________________________________________________________________________________________000000000013'. Scan count 9, logical reads 705, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:
     CPU time = 8703 ms, elapsed time = 1534 ms.

    == Alan ================================================================================
    SQL Server parse and compile time:
     CPU time = 0 ms, elapsed time = 5 ms.
    <NO WORKTABLE HERE>
    Table '#ListOfStrings______________________________________________________________________________________________________000000000013'. Scan count 9, logical reads 705, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:
     CPU time = 347 ms, elapsed time = 462 ms.

    As you can see, my solution is worktable free. While typing this up, however, I discovered a problem with Chris' solution - he includes ORDER BY c.RowID which adds overhead. Removing that removed the worktable on his. Here's the updated performance tests comparing Chris' and my solution. I ran them with serial and parallel execution plans:

    PRINT char(10)+'== Chris3 (serial)'+REPLICATE('========================================',2);
    GO
    DECLARE @st datetime = getdate(), @x char(1);
    WITH Tally -- Chris3
    AS (SELECT n = 0 FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) d (n))
    SELECT -- c.RowID,
      @x = SUBSTRING(c.name, x.n, 1)
    FROM #ListOfStrings c
    CROSS APPLY
    (
      SELECT TOP(1) n
      FROM (
      SELECT TOP(LEN(c.name)) n = ROW_NUMBER() OVER (ORDER BY @@VERSION)
      FROM Tally a, Tally b, Tally c
    ) t
    WHERE CHARINDEX(SUBSTRING(c.name, t.n, 1), LEFT(c.name, n-1)) > 0
    ORDER BY n
    ) x
    OPTION (maxdop 1);
    PRINT datediff(ms,@st,getdate());
    GO 3

    PRINT char(10)+'== Alan (serial)'+REPLICATE('========================================',2);
    GO
    DECLARE @st datetime = getdate(), @x char(1);
    SELECT @x = XCHar
    FROM #ListOfStrings los
    CROSS APPLY
    (
     SELECT TOP (1) token --, position
      FROM dbo.ngrams8k(los.name,1)
      WHERE CHARINDEX(token,SUBSTRING(los.name,1,CHECKSUM(position-1))) > 0
      ORDER BY position
    ) ngEng(XCHar)
    OPTION (maxdop 1);
    PRINT datediff(ms,@st,getdate());
    GO 3

    PRINT char(10)+'== Chris3 (parallel)'+REPLICATE('========================================',2);
    GO
    DECLARE @st datetime = getdate(), @x char(1);
    WITH Tally -- Chris3
    AS (SELECT n = 0 FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) d (n))
    SELECT -- c.RowID,
      @x = SUBSTRING(c.name, x.n, 1)
    FROM #ListOfStrings c
    CROSS APPLY
    (
      SELECT TOP(1) n
      FROM (
      SELECT TOP(LEN(c.name)) n = ROW_NUMBER() OVER (ORDER BY @@VERSION)
      FROM Tally a, Tally b, Tally c
    ) t
    WHERE CHARINDEX(SUBSTRING(c.name, t.n, 1), LEFT(c.name, n-1)) > 0
    ORDER BY n
    ) x
    OPTION (querytraceon 8649);
    PRINT datediff(ms,@st,getdate());
    GO 3

    PRINT char(10)+'== Alan (parallel)'+REPLICATE('========================================',2);
    GO
    DECLARE @st datetime = getdate(), @x char(1);
    SELECT @x = XCHar
    FROM #ListOfStrings los
    CROSS APPLY
    (
     SELECT TOP (1) token --, position
      FROM dbo.ngrams8k(los.name,1)
      WHERE CHARINDEX(token,SUBSTRING(los.name,1,CHECKSUM(position-1))) > 0
      ORDER BY position
    ) ngEng(XCHar)
    OPTION (querytraceon 8649);
    PRINT datediff(ms,@st,getdate());
    GO 3

    Results:
    == Chris3 (serial)================================================================================
    Beginning execution loop
    174
    170
    163
    Batch execution completed 3 times.

    == Alan (serial)================================================================================
    Beginning execution loop
    174
    170
    170
    Batch execution completed 3 times.

    == Chris3 (parallel)================================================================================
    Beginning execution loop
    66
    56
    63
    Batch execution completed 3 times.

    == Alan (parallel)================================================================================
    Beginning execution loop
    53
    56
    60
    Batch execution completed 3 times.


    Both identical. Ngrams just makes things simpler 😉

    Certainly does 🙂 Thanks for this Alan, some improvements in here for my test harness.

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • By the logic of the request you should be looking not for the next occurence, but for the previous one.

    Using John's harness I quickly baked this solution:
    WITH Tally AS (
    SELECT ROW_NUMBER() OVER (ORDER BY object_id) AS n
    FROM sys.objects
    )
    SELECT
      c.name,
        c.object_id,
        y.[FIRST RECURRING CHARACTER],
        y.[FIRST RECURRING POSITION]
    FROM sys.all_columns c
    CROSS APPLY (
      SELECT TOP 1
            [FIRST RECURRING CHARACTER] = SUBSTRING(c.name,n,1),
            [FIRST RECURRING POSITION] = n
      FROM Tally t
      WHERE n > 1 and t.n <= LEN(c.name)
            and charindex(SUBSTRING(c.name,n,1), SUBSTRING(c.name,1, n-1), 1) > 0
      ORDER BY n
      ) y;

    Then I turned the page and found that Chris has already posted pretty much the same solution.
    Appropriate logic made the queries easy winners.

    _____________
    Code for TallyGenerator

Viewing 2 posts - 46 through 47 (of 47 total)

You must be logged in to reply to this topic. Login to reply