Gents
First: Dang it Sean - you beet me to it! I had a very similar solution but you posted yours first. That said, I did do a performance test and ran into something I cannot figure out.
on my laptop (Windows 8, 2 cores/4 logical cpu/16GB run SQL Server 2014 Ent) I ran your solution (with results on) and it takes 31 seconds complete (my test harness data below). I tried to improve the time in your query but had no success.
I have an ngrams function that I use for this kind of thing. Below is the function with my notes on how to use it:
IF OBJECT_ID('tempdb.dbo.nGrams8K') IS NOT NULL DROP FUNCTION dbo.nGrams8K
GO
CREATE FUNCTION dbo.nGrams8K (@string varchar(8000), @k int)
/********************************************************************
Created by: Alan Burstein
Created on: 3/10/2014
Last Updated on: 5/22/2015
n-gram defined:
In the fields of computational linguistics and probability,
an n-gram is a contiguous sequence of n items from a given
sequence of text or speech. The items can be phonemes, syllables,
letters, words or base pairs according to the application.
For more information see: http://en.wikipedia.org/wiki/N-gram
Use:
Outputs a stream of tokens based on an input string.
Similar to mdq.nGrams:
http://msdn.microsoft.com/en-us/library/ff487027(v=sql.105).aspx.
Except it only returns characters as long as K.
nGrams8K also includes the position of the "Gram" in the string.
Examples of how to use included as comments after the code.
********************************************************************/
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
WITH
E1(N) AS (SELECT 1 FROM (VALUES (null),(null),(null),(null),(null)) x(n)),
E3(N) AS (SELECT 1 FROM E1 a CROSS JOIN E1 b CROSS JOIN E1 c),
iTally(N) AS
(
SELECT TOP (LEN(@string)-(@k-1)) CHECKSUM(ROW_NUMBER() OVER (ORDER BY (SELECT NULL)))
FROM E3 a CROSS JOIN E3 b
)
SELECT
position = N,
token = SUBSTRING(@string,N,@k)
FROM iTally;
GO
/********************************************************************
(1) Basic Use
-----------------------------------------------------------
-- (A) With @k as 1,2 & 3:
SELECT position, token FROM dbo.nGrams8K('abcdefg',1);
SELECT position, token FROM dbo.nGrams8K('abcdefg',2);
SELECT position, token FROM dbo.nGrams8K('abcdefg',3);
-- (B) Using variables
DECLARE
@string varchar(20) = 'abcdefg12345', @tokenLen tinyint = 3;
SELECT position, token FROM dbo.nGrams8K(@string,@tokenLen)
GO
(2) Character Count (including 0 Counts)
-----------------------------------------------------------
-- (A) Basic character count
SELECT token = ISNULL(token,'Total:'), count(*)
FROM dbo.nGrams8K(newid(),1)
GROUP BY token
WITH ROLLUP;-- Added rollup for display
-- (B) Character Account including gaps
DECLARE
@alphabet VARCHAR(26)='ABCDEFGHIJKLMNOPQRSTUVWXYZ',
@string VARCHAR(100)='The quick green fox jumps over the lazy dog and the lazy dogs.';
SELECT a.token, COUNT(b.token) ttl
FROM dbo.nGrams8K(@alphabet,1) a
LEFT JOIN dbo.nGrams8K(@string,1) b ON a.token=b.token
GROUP BY a.token
ORDER BY a.token; -- not required, for display only
GO
-- (C) Let's try French ordered by most frequent
DECLARE
@alphabet VARCHAR(36)='abcdefghijklmnopqrstuvwxyzéèçëòôöùàâ',
@string VARCHAR(100)='Le renard vert rapide saute par dessus le chien paresseux et le chien paresseux juste posé là.';
WITH charcount AS
(
SELECT a.token, COUNT(b.token) ttl
FROM dbo.nGrams8K(@alphabet,1) a
LEFT JOIN dbo.nGrams8K(@string,1) b ON a.token=b.token
GROUP BY a.token
)
SELECT * FROM charcount ORDER BY ttl DESC;
GO
(3) a *SIMPLE* word count
-----------------------------------------------------------
DECLARE @string varchar(100)='THE QUICK Green FOX JUMPED OVER THE LAZY DOGS BACK';
SELECT @string AS string, count(*)+1 AS words
FROM dbo.nGrams8K(@string,1)
WHERE [token]=' '
GO
(4) search for occurances and location of a substring
-----------------------------------------------------------
DECLARE
@string VARCHAR(100)='The green fox jumps over the dog and the dog just laid there.',
@searchString VARCHAR(100)='The';
SELECT position, token AS searchString
FROM dbo.nGrams8K(@string,LEN(@searchstring)) b
WHERE token=@searchString;
GO
(5) Strip non-numeric characters from a string
-----------------------------------------------------------
DECLARE @string varchar(100) = 'abc123xyz555xxx999!';
WITH stripNonNumeric(cleanstring) AS
(
SELECT token+''
FROM dbo.nGrams8K(@string,1)
WHERE token LIKE '[0-9]'
FOR XML PATH('')
)
SELECT cleanstring
FROM stripNonNumeric;
(6) Find all occurances of a pattern in a string
------------------------------------------------------------
DECLARE
@string varchar(100) = 'zz12x345xxx555abc1234zz5xxx',
@pat varchar(100) = '[a-z][a-z][0-9]',
@len int = 3;
SELECT start_pos = position, token
FROM dbo.nGrams8K(@string,@len)
WHERE token LIKE @pat
SELECT * FROM dbo.findpat8k(@string,@pat,3);
GO
(7) Find the longest common substring between 2 strings
------------------------------------------------------------
-- (A) The Function
IF EXISTS
(
SELECT * FROM INFORMATION_SCHEMA.ROUTINES
WHERE ROUTINE_SCHEMA='dbo' AND ROUTINE_NAME = 'lcss'
)
DROP FUNCTION dbo.lcss;
GO
CREATE FUNCTION dbo.lcss(@string1 varchar(100), @string2 varchar(100))
RETURNS TABLE WITH SCHEMABINDING AS RETURN
WITH iTally(N) AS
(
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM dbo.nGrams8K
(
CASE WHEN LEN(@string1)<=LEN(@string2) THEN @string1
ELSE @string2 END,1
)
)
SELECT TOP (1) with ties token
FROM iTally
CROSS APPLY
dbo.nGrams8K
(
CASE WHEN LEN(@string1)<=LEN(@string2) THEN @string1
ELSE @string2 END, n
)
WHERE N <=
CASE WHEN LEN(@string1)<=LEN(@string2) THEN LEN(@string1)
ELSE LEN(@string2) END
AND charindex
(
token,
CASE WHEN LEN(@string1)<=LEN(@string2) THEN @string2
ELSE @string1 END
) > 0
ORDER BY len(token) DESC;
GO
-- (B) example of how to use lcss
DECLARE
@string1 varchar(100) = '999xxyyyzaa99xxyyyzz',
@string2 varchar(100) = '000xxxyyyzzz';
SELECT string1 = @string1, string2 = @string2, token
FROM dbo.lcss(@string1, @string2);
********************************************************************/
GO
I ran the following 10K Row Test:
SET NOCOUNT ON;
DBCC FREEPROCCACHE WITH NO_INFOMSGS
DBCC DROPCLEANBUFFERS WITH NO_INFOMSGS
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
PRINT 'Sean'+char(13);
SELECT CleanedText
FROM #strings
CROSS APPLY dbo.StripNonNumeric_itvf(string);
PRINT 'Alan'+char(13);
SELECT CleanedText
FROM #strings
CROSS APPLY dbo.StripNonNumeric_itvf_ajb(string);
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
... and here's the results:
Sean
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
Table '#strings____________________________________________________________________________________________________________00000000004C'.
Scan count 1, logical reads 63, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 31172 ms, elapsed time = 31213 ms.
Alan
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
Table '#strings____________________________________________________________________________________________________________00000000004C'.
Scan count 1, logical reads 63, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 328 ms, elapsed time = 384 ms.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
I have no idea why my NGrams solution is so much faster considering that we are essentially doing the same thing. I included the query plans for both.
Lowell: I hope this helps.
-- Itzik Ben-Gan 2001