March 10, 2018 at 4:21 pm
@Jeffrey Williams 3188
Ditto what Lynn said about the requirements. For fun, I put together a couple set-based solutions that I would try. Each solution uses a function that uses ngrams8k at it's core. Note that the link has the code available for download and also describes what a word-level n-gram is.
The first solution uses a world-level N-gram tokenizer: WNGrams2012_8K. The second uses SubstringBetweenChar8K. Note the comments in each function for details about how they work. I don't have time to test but, of these two solutions the latter will be faster.
Updated Sample DataIF OBJECT_ID('tempdb..#testTable') IS NOT NULL DROP TABLE #testTable;
CREATE TABLE #testTable(InputString varchar(100));
INSERT #testTable VALUES ('HELLO5-E-100'), ('Hotel5-E-101'), ('BYE5-F-001');
INSERT #testTable VALUES ('XXX-HELLO5-E-100'), ('X-X-Hotel5-E-101'), ('BYE5-001');
Using WNGrams2012_8KSELECT
t.inputString, -- included for testing
Part1 = token,
Part2 = RIGHT(s.s,CHARINDEX('-',REVERSE(s.s)) - 1)
FROM #testTable t
CROSS APPLY (VALUES (t.InputString)) s(s)
CROSS APPLY dbo.WNGrams2012_8K(s.s,'-', (LEN(s.s)-LEN(REPLACE(s.s,'-',''))))
WHERE tokenNumber = 1;
SubstringBetweenChar8KSELECT
t.inputString,
Part1 = Token,
Part2 = RIGHT(s.s,CHARINDEX('-',REVERSE(s.s)) - 1)
FROM #testTable t
CROSS APPLY (VALUES (t.InputString)) s(s)
CROSS APPLY (VALUES (LEN(s.s)-LEN(REPLACE(s.s,'-','')))) d(c)
CROSS APPLY dbo.SubstringBetweenChar8K(s.s,0,d.c,'-');
Again, what Lynn posted will be the fastest and accommodate any number of "delimiters" but it's always fun to explore set-based alternatives.
-- Itzik Ben-Gan 2001
Viewing post 16 (of 16 total)
You must be logged in to reply to this topic. Login to reply