Full Text Indexing - Text Parsing Routine

,

The Challenge

When you make the decision to use full text indexing (FTI) in SQL Server, you

need to be very careful with the design of the user interface and the impact on

the end users.  Why do you ask? well is actually due to FTI itself and its

range of clauses that enhance the search. 

For example, if you type in "Electrical Training" in your application

search text box, and passed this directly to FTI using the CONTAINS clause like

this:

select orgname_name

from  organisation_name

where contains(*, '"Electrical Training"')

we happen to get 8 rows returned

Well that's fine and is probably no better than what you were doing before

FTI with LIKE '%Electrical Training%'; but do you really want to get your users

to type in more FTI friendly/smart clauses to expand the result set?  here

is a classic example related to word proximity:

select orgname_name

from organisation_name

where contains(*, '"Electrical" NEAR "Training"')

now we have 16 rows return, but required the user to wrapper double

quotes (perhaps) and enter the NEAR operator (yeh right!)

So do you smarten up your interface and give you users the endless

combinations of criteria to select from? or is it just business as usual and,

through a smart parsing solution, we "re-work" the users clause to expand on and

enrich the search experience without the need to train users or alter

front-ends.

This short paper discusses one possible solution for string parsing for FTI

searching.

One of Many Solutions

The following stored procedure is basically a string parser.  It takes a

single string as input, the returns a FTI friendly string back primarily

designed for CONTAINS or CONTAINSTABLE only.  The examples below are calls

to the FTI routine, showing the input and output strings.  Be aware that

INFLECTIONAL statements are ignored if you use a neutral language on your FTI

indexed columns.  Also, see documentation regarding support when used with

other clauses, namely contains, freetext etc.

Example 1 - Single Word

declare @parsed_string varchar(200) 
set @parsed_string  = "mcdonalds"  -- what the user entered
exec FTI_FixString 'drive safe', 'and', @parsed_string OUTPUT
print @parsed_string
        ("mcdonalds") OR ("mcdonalds*") OR (FORMSOF(INFLECTIONAL,"mcdonalds"))

Example 2 - Multiple Words

declare @parsed_string varchar(200) 
set @parsed_string  = "drive safe"  -- what the user entered
exec FTI_FixString 'drive safe', 'near', @parsed_string OUTPUT
print @parsed_string

"drive safe" OR ("drive" near "safe")

OR ("drive*" near "safe*") OR (FORMSOF(INFLECTIONAL,"drive") AND 

FORMSOF(INFLECTIONAL,"safe"))

This particular solution is a good one from a majority of searches (in our

application).  Please note that I stripped out the logging calls to an

audit table to track the incoming and outgoing parsed text for ongoing analysis

of user search criteria.

In the examples above, we look for an exact

match "drive safe", then a proximity match ("drive" near "safe") OR

("drive*" near "safe*")  and finally, an exact match using the

inflectional clause to really get FTI working for us.  The routine is very

each to customise to add/remove clauses as required.

The Stored Procedure

The routine is relatively simple, we take the incoming string, strip out

selected FTI syntax and some of the noise words.  From here we strip the

sting into into individual words and wrap around our FTI search clauses to

enhance the users search criteria as shown above.

NOTE - With your testing, if a table has multiple indexed columns make

sure FTI is searching appropriately over them as values in one column, but not

in another can cause much heartache.

CREATE PROCEDURE FTI_FixString
(
    @keywords varchar (500) = null , 
    @proximity varchar (10) = 'NEAR', -- and, or, near
    @parsedstring varchar(500) OUTPUT 
)
AS
SET CONCAT_NULL_YIELDS_NULL OFF 
SET NOCOUNT ON
BEGIN
DECLARE @sql varchar(2000)
DECLARE @where varchar(1000)
DECLARE @i int -- old posn in string
DECLARE @j int -- match posn in string
DECLARE @k int -- new start posn
DECLARE @len int -- string length
DECLARE @search varchar(500)
DECLARE @search2 varchar(500)
DECLARE @fuzzy varchar(1000)
DECLARE @keyword varchar(500)
DECLARE @synonyms varchar(255)
DECLARE @new_keywords varchar(1000)
declare @found int
-- Clean up the keywords string first, removing typical FTI join conditions, a majority of users dont type this in from long term analysis -- of their search clauses
set @keywords = LTRIM(RTRIM(@keywords))
set @keywords = REPLACE(@keywords,'*',' ')
set @keywords = REPLACE(@keywords,' AND ',' ')
set @keywords = REPLACE(@keywords,' OR ',' ')
set @keywords = REPLACE(@keywords,'"',' ')
set @keywords = REPLACE(@keywords,'(',' ')
set @keywords = REPLACE(@keywords,')',' ')
set @keywords = replace(@keywords, 'o''', '')
-- mcdonald's ==> mcdonald''s, escape out the ' which is a valid piece of punctuation
set @keywords = replace(@keywords, '''', '''''')
-- fix é type characters
set @found = charindex('é', @keywords)
if @found > 0
if @found = len(@keywords)
set @keywords = replace(@keywords, 'é', '')
else
set @keywords = replace(@keywords, 'é ', '* ')
-- set proximity type
if @proximity is null or @proximity = ''
set @proximity = 'NEAR'
-- Replace keywords with their synonyms if their exist
-- THIS IS COMMENTED OUT FOR A GOOD REASON, WE DON'T USE A KEYWORD TABLE LOOKUP, UNCOMMENT AND 
-- HOOK IN YOUR OWN AS REQUIRED.
/*
    SELECT @i = 1 -- start of the first string
    SELECT @j = 0 -- end of the first string unknown
    SELECT @len = LEN(@keywords)
    SELECT @new_keywords = ''
    WHILE (@j <= @len)
    BEGIN
        SELECT @k = @j+1 -- New start posn
        -- Detect the end of each substring
        SELECT @j = CHARINDEX(' ', @keywords, @k)
        -- Finish if no more found
        IF (@j <= 0)
        BEGIN
            SELECT @keyword = LTRIM(RTRIM(SUBSTRING(@keywords,@k,@len-@k+1)))
            SELECT @synonyms = synonyms FROM tbl_keyword_synonyms WHERE keyword = @keyword
            IF @@ROWCOUNT = 1
                SELECT @new_keywords = @new_keywords + ' ' + @synonyms
            ELSE
                SELECT @new_keywords = @new_keywords + ' ' + @keyword
            BREAK
        END
        IF (@j < @len)
        BEGIN
            IF (SUBSTRING(@keywords,@j-1,1) <> ' ') 
            BEGIN
                SELECT @keyword = LTRIM(RTRIM(SUBSTRING(@keywords,@i,@j-@i)))
                SELECT @synonyms = synonyms FROM tbl_keyword_synonyms WHERE keyword = @keyword
                IF @@ROWCOUNT = 1
                    SELECT @new_keywords = @new_keywords + ' ' + @synonyms
                ELSE
                    SELECT @new_keywords = @new_keywords + ' ' + @keyword
    
                SELECT @i = @j + 1
            END
        END
    END
    SELECT @keywords = @new_keywords
*/
SELECT @i = 1 -- start of the first string
SELECT @j = 0 -- end of the first string unknown
SELECT @len = LEN(@keywords)
SELECT @search = ''
SELECT @fuzzy = ''
WHILE (@j <= @len)
BEGIN
SELECT @k = @j+1 -- New start posn
-- cut up the string into its individual words, split this into its inflectional part and standard FTI search portion
SELECT @j = CHARINDEX(' ', @keywords, @k)
-- Finish if no more found
IF (@j <= 0)
BEGIN
    -- Add the last part of the string
    SELECT @search = @search + '"' + LTRIM(RTRIM(SUBSTRING(@keywords,@i,@len-@i+1))) + '*"'
    SELECT @search2 = @search2 + '"' + LTRIM(RTRIM(SUBSTRING(@keywords,@i,@len-@i+1))) + '"'
    if not SUBSTRING(@keywords,@k,@len-@k+1) in ('of', 'and', 'at', 'a', 'the', '&', 'it', 'for')
        SELECT @fuzzy = @fuzzy + 'FORMSOF(INFLECTIONAL,"' + LTRIM(RTRIM(SUBSTRING(@keywords,@k,@len-@k+1))) + '")'
        BREAK
    END 
    IF (@j < @len)
    BEGIN
        IF (SUBSTRING(@keywords,@j-1,1) <> ' ') 
        BEGIN
            SELECT @search = @search + '"' + LTRIM(RTRIM(SUBSTRING(@keywords,@i,@j-@i))) + '*" ' + @proximity + ' '
            SELECT @search2 = @search2 + '"' + LTRIM(RTRIM(SUBSTRING(@keywords,@i,@j-@i))) + '" ' + @proximity + ' '
            if not SUBSTRING(@keywords,@i,@j-@i) in ('of', 'and', 'at', 'a', 'the', '&', 'it', 'for')
                SELECT @fuzzy = @fuzzy + 'FORMSOF(INFLECTIONAL,"' + LTRIM(RTRIM(SUBSTRING(@keywords,@i,@j-@i))) + '") AND '
            SELECT @i = @j + 1
        END
    END
END
if substring(@fuzzy, len(@fuzzy) - 2, 3) = ' OR'
    set @fuzzy = substring(@fuzzy, 1, len(@fuzzy) - 2)
if substring(@fuzzy, len(@fuzzy) - 3, 4) = ' AND'
    set @fuzzy = substring(@fuzzy, 1, len(@fuzzy) - 3)
-- Example: ("mcdonalds") OR ("mcdonalds*") OR (FORMSOF(INFLECTIONAL,"mcdonalds"))
SELECT @search = '"' + @keywords + '" OR '+ '(' + @search2 + ') OR (' + @search + ') OR (' + @fuzzy + ')'
set @parsedstring = @search
SET CONCAT_NULL_YIELDS_NULL ON
END
GO

Example Call

<<some stored procedure>>
SET NOCOUNT ON
set @ftisenabled = 1     -- 1 is YES, we are using Full Text Indexing, essential if catalogs are down/unavailable
if @ftisenabled = 1
begin
    set @ftop = 'CONTAINSTABLE'    -- so we can quickly move between different FTI operations for testing
    if ltrim(rtrim(@p_searchstring)) <> '' 
    begin
        set @p_searchstring= replace(@p_searchstring, '%', '') 
        exec FTI_FixString @p_searchstring, 'near', @p_searchstring OUTPUT
    end
else 
    set @ftisenabled = 0
end

<< do your search here using the altered @p_searchstring

in your CONTAINS clause >>

Conclusion

What does all this mean? well it highlights the importance of testing,

testing not only the result sets from FTI using its variety of clauses and

options, but understanding the impact on your end-users and what you, as a

DBA/Developer, need to take on board to make the road to FTI implementation a

simple and effective one. Have fun!

Credits

A big thanks to Lindsay Cocks for his early work in this area and

tireless testing.

Rate

4.67 (3)

Share

Share

Rate

4.67 (3)