How to find the first recurring character of a string.

  • Alan.B - Thursday, November 2, 2017 6:32 PM

    I fixed Chris' Test harness code and my own solution which uses NGrams8K. It was faster. If you understand how NGrams8K works then you know that I essentially re-wrote Chris's query with one notable modification: The CHECKSUM around my Position removes two implicit conversions to int that you'll see in the execution plan for Chris' solution. Execution plans attached. Note that I also removed the rowid.  


    SELECT XCHar
    FROM #ListOfStrings los
    CROSS APPLY
    (
    SELECT TOP (1) token --, position
    FROM dbo.ngrams8k(los.name,1)
    WHERE CHARINDEX(token,SUBSTRING(los.name,1,CHECKSUM(position-1))) > 0
    ORDER BY position
    ) ngEng(XCHar);


    IF 1 = 1
    BEGIN
    SET NOCOUNT ON;
    IF OBJECT_ID('tempdb..#ListOfStrings') IS NOT NULL DROP TABLE #ListOfStrings;

    SELECT TOP(100000)
    [RowID] = ROW_NUMBER() OVER(ORDER BY (SELECT NULL)),
    [name] = CAST(NEWID() AS VARCHAR(36))
    INTO #ListOfStrings
    FROM SYS.COLUMNS a, SYS.COLUMNS b
    END;

    --PRINT char(10)+'== John '+REPLICATE('=',50);

    SET STATISTICS IO, TIME ON;
    PRINT char(10)+'== EE '+REPLICATE('========================================',2);
    SET STATISTICS IO, TIME ON;
    WITH T(N) AS (SELECT X.N FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) X(N)),
    POS_DIST AS
    (
    SELECT
      LOS.RowID
     ,NUMS.N
     ,SUBSTRING(LOS.[name],NUMS.N,1) AS XCHAR
     ,CHARINDEX(SUBSTRING(LOS.[name],NUMS.N,1),LOS.[name],NUMS.N + 1) - NUMS.N AS DISTANCE
    FROM #ListOfStrings LOS
    CROSS APPLY
    (
      SELECT TOP(LEN(LOS.[name])) ROW_NUMBER() OVER (ORDER BY @@VERSION) AS N
      FROM T T1,T T22,T T3,T T4
    ) NUMS(N)
    WHERE CHARINDEX(SUBSTRING(LOS.[name],NUMS.N,1),LOS.[name],NUMS.N + 1) - NUMS.N > 0
    ),
    FIND_MATCH AS
    (
    SELECT
      PD.RowID
      ,ROW_NUMBER() OVER (PARTITION BY PD.RowID ORDER BY PD.N + PD.DISTANCE) XRID
      ,PD.XCHAR
    FROM POS_DIST PD
    )
    SELECT --FM.RowID,
    FM.XCHAR
    FROM FIND_MATCH FM
    WHERE FM.XRID = 1
    ORDER BY FM.RowID;
    SET STATISTICS IO, TIME OFF;

    PRINT char(10)+'== Chris3 '+REPLICATE('========================================',2);
    SET STATISTICS IO, TIME ON;

    WITH Tally -- Chris3
    AS (SELECT n = 0 FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) d (n))
    SELECT -- c.RowID,
    XCHAR = SUBSTRING(c.name, x.n, 1)
    FROM #ListOfStrings c
    CROSS APPLY
    (
    SELECT TOP(1) n
    FROM (
    SELECT TOP(LEN(c.name)) n = ROW_NUMBER() OVER (ORDER BY @@VERSION)
    FROM Tally a, Tally b, Tally c
    ) t
    WHERE CHARINDEX(SUBSTRING(c.name, t.n, 1), LEFT(c.name, n-1)) > 0
    ORDER BY n
    ) x
    ORDER BY c.RowID;
    SET STATISTICS IO, TIME OFF;

    PRINT char(10)+'== Jason '+REPLICATE('========================================',2);
    SET STATISTICS IO, TIME ON;
    WITH
    cte_psd AS
    (
    SELECT
      los.RowID,
      bin_val = CAST(p.pos AS BINARY(4)) + CAST(sv.sub_val AS BINARY(2)),
      drnk = DENSE_RANK() OVER (PARTITION BY los.RowID, sv.sub_val ORDER BY p.pos)
    FROM #ListOfStrings los
    CROSS APPLY
    (
      SELECT TOP (LEN(los.name))
      rn =ROW_NUMBER() OVER (ORDER BY n2.n)
      FROM ( VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1) ) n1(n)
      CROSS APPLY ( VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) n2 (n) ) p (pos)
      CROSS APPLY ( VALUES (SUBSTRING(los.name, p.pos, 1)) ) sv (sub_val)
    )
    SELECT --psd.RowID,
    CAST(SUBSTRING(MIN(psd.bin_val), 5, 2) AS VARCHAR(2))
    FROM cte_psd psd
    WHERE psd.drnk > 1
    GROUP BY psd.RowID
    ORDER BY psd.RowID;
    SET STATISTICS IO, TIME OFF;

    PRINT char(10)+'== Alan '+REPLICATE('========================================',2);
    SET STATISTICS IO, TIME ON;
    SELECT XCHar
    FROM #ListOfStrings los
    CROSS APPLY
    (
    SELECT TOP (1) token --, position
    FROM dbo.ngrams8k(los.name,1)
    WHERE CHARINDEX(token,SUBSTRING(los.name,1,CHECKSUM(position-1))) > 0
    ORDER BY position
    ) ngEng(XCHar);

    SET STATISTICS IO, TIME OFF;

    Results

    == EE ================================================================================
    SQL Server Execution Times:
     CPU time = 0 ms, elapsed time = 0 ms.

    SQL Server Execution Times:
     CPU time = 0 ms, elapsed time = 0 ms.
    SQL Server parse and compile time:
     CPU time = 393 ms, elapsed time = 393 ms.
    Table '#ListOfStrings______________________________________________________________________________________________________000000000013'. Scan count 9, logical reads 705, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 7283, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:
     CPU time = 7735 ms, elapsed time = 9586 ms.

    == Chris3 ================================================================================
    SQL Server parse and compile time:
     CPU time = 0 ms, elapsed time = 3 ms.
    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 135, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table '#ListOfStrings______________________________________________________________________________________________________000000000013'. Scan count 9, logical reads 705, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:
     CPU time = 407 ms, elapsed time = 579 ms.

    == Jason ================================================================================
    SQL Server parse and compile time:
     CPU time = 74 ms, elapsed time = 74 ms.
    Table '#ListOfStrings______________________________________________________________________________________________________000000000013'. Scan count 9, logical reads 705, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:
     CPU time = 8703 ms, elapsed time = 1534 ms.

    == Alan ================================================================================
    SQL Server parse and compile time:
     CPU time = 0 ms, elapsed time = 5 ms.
    <NO WORKTABLE HERE>
    Table '#ListOfStrings______________________________________________________________________________________________________000000000013'. Scan count 9, logical reads 705, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:
     CPU time = 347 ms, elapsed time = 462 ms.

    As you can see, my solution is worktable free. While typing this up, however, I discovered a problem with Chris' solution - he includes ORDER BY c.RowID which adds overhead. Removing that removed the worktable on his. Here's the updated performance tests comparing Chris' and my solution. I ran them with serial and parallel execution plans:

    PRINT char(10)+'== Chris3 (serial)'+REPLICATE('========================================',2);
    GO
    DECLARE @st datetime = getdate(), @x char(1);
    WITH Tally -- Chris3
    AS (SELECT n = 0 FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) d (n))
    SELECT -- c.RowID,
      @x = SUBSTRING(c.name, x.n, 1)
    FROM #ListOfStrings c
    CROSS APPLY
    (
      SELECT TOP(1) n
      FROM (
      SELECT TOP(LEN(c.name)) n = ROW_NUMBER() OVER (ORDER BY @@VERSION)
      FROM Tally a, Tally b, Tally c
    ) t
    WHERE CHARINDEX(SUBSTRING(c.name, t.n, 1), LEFT(c.name, n-1)) > 0
    ORDER BY n
    ) x
    OPTION (maxdop 1);
    PRINT datediff(ms,@st,getdate());
    GO 3

    PRINT char(10)+'== Alan (serial)'+REPLICATE('========================================',2);
    GO
    DECLARE @st datetime = getdate(), @x char(1);
    SELECT @x = XCHar
    FROM #ListOfStrings los
    CROSS APPLY
    (
     SELECT TOP (1) token --, position
      FROM dbo.ngrams8k(los.name,1)
      WHERE CHARINDEX(token,SUBSTRING(los.name,1,CHECKSUM(position-1))) > 0
      ORDER BY position
    ) ngEng(XCHar)
    OPTION (maxdop 1);
    PRINT datediff(ms,@st,getdate());
    GO 3

    PRINT char(10)+'== Chris3 (parallel)'+REPLICATE('========================================',2);
    GO
    DECLARE @st datetime = getdate(), @x char(1);
    WITH Tally -- Chris3
    AS (SELECT n = 0 FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) d (n))
    SELECT -- c.RowID,
      @x = SUBSTRING(c.name, x.n, 1)
    FROM #ListOfStrings c
    CROSS APPLY
    (
      SELECT TOP(1) n
      FROM (
      SELECT TOP(LEN(c.name)) n = ROW_NUMBER() OVER (ORDER BY @@VERSION)
      FROM Tally a, Tally b, Tally c
    ) t
    WHERE CHARINDEX(SUBSTRING(c.name, t.n, 1), LEFT(c.name, n-1)) > 0
    ORDER BY n
    ) x
    OPTION (querytraceon 8649);
    PRINT datediff(ms,@st,getdate());
    GO 3

    PRINT char(10)+'== Alan (parallel)'+REPLICATE('========================================',2);
    GO
    DECLARE @st datetime = getdate(), @x char(1);
    SELECT @x = XCHar
    FROM #ListOfStrings los
    CROSS APPLY
    (
     SELECT TOP (1) token --, position
      FROM dbo.ngrams8k(los.name,1)
      WHERE CHARINDEX(token,SUBSTRING(los.name,1,CHECKSUM(position-1))) > 0
      ORDER BY position
    ) ngEng(XCHar)
    OPTION (querytraceon 8649);
    PRINT datediff(ms,@st,getdate());
    GO 3

    Results:
    == Chris3 (serial)================================================================================
    Beginning execution loop
    174
    170
    163
    Batch execution completed 3 times.

    == Alan (serial)================================================================================
    Beginning execution loop
    174
    170
    170
    Batch execution completed 3 times.

    == Chris3 (parallel)================================================================================
    Beginning execution loop
    66
    56
    63
    Batch execution completed 3 times.

    == Alan (parallel)================================================================================
    Beginning execution loop
    53
    56
    60
    Batch execution completed 3 times.


    Both identical. Ngrams just makes things simpler πŸ˜‰

    Certainly does πŸ™‚ Thanks for this Alan, some improvements in here for my test harness.

    β€œWrite the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • By the logic of the request you should be looking not for the next occurence, but for the previous one.

    Using John's harness I quickly baked this solution:
    WITH Tally AS (
    SELECT ROW_NUMBER() OVER (ORDER BY object_id) AS n
    FROM sys.objects
    )
    SELECT
      c.name,
        c.object_id,
        y.[FIRST RECURRING CHARACTER],
        y.[FIRST RECURRING POSITION]
    FROM sys.all_columns c
    CROSS APPLY (
      SELECT TOP 1
            [FIRST RECURRING CHARACTER] = SUBSTRING(c.name,n,1),
            [FIRST RECURRING POSITION] = n
      FROM Tally t
      WHERE n > 1 and t.n <= LEN(c.name)
            and charindex(SUBSTRING(c.name,n,1), SUBSTRING(c.name,1, n-1), 1) > 0
      ORDER BY n
      ) y;

    Then I turned the page and found that Chris has already posted pretty much the same solution.
    Appropriate logic made the queries easy winners.

    _____________
    Code for TallyGenerator

Viewing 2 posts - 46 through 46 (of 46 total)

You must be logged in to reply to this topic. Login to reply