Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 

Full Text Indexing - Text Parsing Routine

By Chris Kempster,

The Challenge

When you make the decision to use full text indexing (FTI) in SQL Server, you need to be very careful with the design of the user interface and the impact on the end users.  Why do you ask? well is actually due to FTI itself and its range of clauses that enhance the search. 

For example, if you type in "Electrical Training" in your application search text box, and passed this directly to FTI using the CONTAINS clause like this:

select orgname_name
from  organisation_name
where contains(*, '"Electrical Training"')

we happen to get 8 rows returned

Well that's fine and is probably no better than what you were doing before FTI with LIKE '%Electrical Training%'; but do you really want to get your users to type in more FTI friendly/smart clauses to expand the result set?  here is a classic example related to word proximity:

select orgname_name
from organisation_name
where contains(*, '"Electrical" NEAR "Training"')

now we have 16 rows return, but required the user to wrapper double quotes (perhaps) and enter the NEAR operator (yeh right!)

So do you smarten up your interface and give you users the endless combinations of criteria to select from? or is it just business as usual and, through a smart parsing solution, we "re-work" the users clause to expand on and enrich the search experience without the need to train users or alter front-ends.

This short paper discusses one possible solution for string parsing for FTI searching.

One of Many Solutions

The following stored procedure is basically a string parser.  It takes a single string as input, the returns a FTI friendly string back primarily designed for CONTAINS or CONTAINSTABLE only.  The examples below are calls to the FTI routine, showing the input and output strings.  Be aware that INFLECTIONAL statements are ignored if you use a neutral language on your FTI indexed columns.  Also, see documentation regarding support when used with other clauses, namely contains, freetext etc.

Example 1 - Single Word

declare @parsed_string varchar(200) 
set @parsed_string  = "mcdonalds"  -- what the user entered
exec FTI_FixString 'drive safe', 'and', @parsed_string OUTPUT
print @parsed_string
        ("mcdonalds") OR ("mcdonalds*") OR (FORMSOF(INFLECTIONAL,"mcdonalds"))

Example 2 - Multiple Words

declare @parsed_string varchar(200) 
set @parsed_string  = "drive safe"  -- what the user entered
exec FTI_FixString 'drive safe', 'near', @parsed_string OUTPUT
print @parsed_string

"drive safe" OR ("drive" near "safe") OR ("drive*" near "safe*") OR (FORMSOF(INFLECTIONAL,"drive") AND  FORMSOF(INFLECTIONAL,"safe"))

This particular solution is a good one from a majority of searches (in our application).  Please note that I stripped out the logging calls to an audit table to track the incoming and outgoing parsed text for ongoing analysis of user search criteria.

In the examples above, we look for an exact match "drive safe", then a proximity match ("drive" near "safe") OR ("drive*" near "safe*")  and finally, an exact match using the inflectional clause to really get FTI working for us.  The routine is very each to customise to add/remove clauses as required.

The Stored Procedure

The routine is relatively simple, we take the incoming string, strip out selected FTI syntax and some of the noise words.  From here we strip the sting into into individual words and wrap around our FTI search clauses to enhance the users search criteria as shown above.

NOTE - With your testing, if a table has multiple indexed columns make sure FTI is searching appropriately over them as values in one column, but not in another can cause much heartache.

CREATE PROCEDURE FTI_FixString
(
    @keywords varchar (500) = null , 
    @proximity varchar (10) = 'NEAR', -- and, or, near
    @parsedstring varchar(500) OUTPUT 
)
AS

SET CONCAT_NULL_YIELDS_NULL OFF 
SET NOCOUNT ON

BEGIN

DECLARE @sql varchar(2000)
DECLARE @where varchar(1000)
DECLARE @i int -- old posn in string
DECLARE @j int -- match posn in string
DECLARE @k int -- new start posn
DECLARE @len int -- string length
DECLARE @search varchar(500)
DECLARE @search2 varchar(500)
DECLARE @fuzzy varchar(1000)
DECLARE @keyword varchar(500)
DECLARE @synonyms varchar(255)
DECLARE @new_keywords varchar(1000)
declare @found int

-- Clean up the keywords string first, removing typical FTI join conditions, a majority of users dont type this in from long term analysis -- of their search clauses
set @keywords = LTRIM(RTRIM(@keywords))

set @keywords = REPLACE(@keywords,'*',' ')
set @keywords = REPLACE(@keywords,' AND ',' ')
set @keywords = REPLACE(@keywords,' OR ',' ')
set @keywords = REPLACE(@keywords,'"',' ')
set @keywords = REPLACE(@keywords,'(',' ')
set @keywords = REPLACE(@keywords,')',' ')

set @keywords = replace(@keywords, 'o''', '')

-- mcdonald's ==> mcdonald''s, escape out the ' which is a valid piece of punctuation
set @keywords = replace(@keywords, '''', '''''')

-- fix é type characters
set @found = charindex('é', @keywords)
if @found > 0
if @found = len(@keywords)
set @keywords = replace(@keywords, 'é', '')
else
set @keywords = replace(@keywords, 'é ', '* ')

-- set proximity type
if @proximity is null or @proximity = ''
set @proximity = 'NEAR'

-- Replace keywords with their synonyms if their exist
-- THIS IS COMMENTED OUT FOR A GOOD REASON, WE DON'T USE A KEYWORD TABLE LOOKUP, UNCOMMENT AND 
-- HOOK IN YOUR OWN AS REQUIRED.
/*
    SELECT @i = 1 -- start of the first string
    SELECT @j = 0 -- end of the first string unknown
    SELECT @len = LEN(@keywords)
    SELECT @new_keywords = ''

    WHILE (@j <= @len)
    BEGIN

        SELECT @k = @j+1 -- New start posn

        -- Detect the end of each substring
        SELECT @j = CHARINDEX(' ', @keywords, @k)

        -- Finish if no more found
        IF (@j <= 0)
        BEGIN

            SELECT @keyword = LTRIM(RTRIM(SUBSTRING(@keywords,@k,@len-@k+1)))
            SELECT @synonyms = synonyms FROM tbl_keyword_synonyms WHERE keyword = @keyword

            IF @@ROWCOUNT = 1
                SELECT @new_keywords = @new_keywords + ' ' + @synonyms
            ELSE
                SELECT @new_keywords = @new_keywords + ' ' + @keyword

            BREAK
        END

        IF (@j < @len)
        BEGIN
            IF (SUBSTRING(@keywords,@j-1,1) <> ' ') 
            BEGIN
                SELECT @keyword = LTRIM(RTRIM(SUBSTRING(@keywords,@i,@j-@i)))
                SELECT @synonyms = synonyms FROM tbl_keyword_synonyms WHERE keyword = @keyword

                IF @@ROWCOUNT = 1
                    SELECT @new_keywords = @new_keywords + ' ' + @synonyms
                ELSE
                    SELECT @new_keywords = @new_keywords + ' ' + @keyword
    
                SELECT @i = @j + 1
            END
        END
    END

    SELECT @keywords = @new_keywords
*/

SELECT @i = 1 -- start of the first string
SELECT @j = 0 -- end of the first string unknown
SELECT @len = LEN(@keywords)
SELECT @search = ''
SELECT @fuzzy = ''

WHILE (@j <= @len)
BEGIN
SELECT @k = @j+1 -- New start posn

-- cut up the string into its individual words, split this into its inflectional part and standard FTI search portion
SELECT @j = CHARINDEX(' ', @keywords, @k)

-- Finish if no more found
IF (@j <= 0)
BEGIN
    -- Add the last part of the string
    SELECT @search = @search + '"' + LTRIM(RTRIM(SUBSTRING(@keywords,@i,@len-@i+1))) + '*"'
    SELECT @search2 = @search2 + '"' + LTRIM(RTRIM(SUBSTRING(@keywords,@i,@len-@i+1))) + '"'

    if not SUBSTRING(@keywords,@k,@len-@k+1) in ('of', 'and', 'at', 'a', 'the', '&', 'it', 'for')
        SELECT @fuzzy = @fuzzy + 'FORMSOF(INFLECTIONAL,"' + LTRIM(RTRIM(SUBSTRING(@keywords,@k,@len-@k+1))) + '")'
        BREAK
    END 

    IF (@j < @len)
    BEGIN
        IF (SUBSTRING(@keywords,@j-1,1) <> ' ') 
        BEGIN
            SELECT @search = @search + '"' + LTRIM(RTRIM(SUBSTRING(@keywords,@i,@j-@i))) + '*" ' + @proximity + ' '
            SELECT @search2 = @search2 + '"' + LTRIM(RTRIM(SUBSTRING(@keywords,@i,@j-@i))) + '" ' + @proximity + ' '

            if not SUBSTRING(@keywords,@i,@j-@i) in ('of', 'and', 'at', 'a', 'the', '&', 'it', 'for')
                SELECT @fuzzy = @fuzzy + 'FORMSOF(INFLECTIONAL,"' + LTRIM(RTRIM(SUBSTRING(@keywords,@i,@j-@i))) + '") AND '

            SELECT @i = @j + 1
        END
    END
END

if substring(@fuzzy, len(@fuzzy) - 2, 3) = ' OR'
    set @fuzzy = substring(@fuzzy, 1, len(@fuzzy) - 2)
if substring(@fuzzy, len(@fuzzy) - 3, 4) = ' AND'
    set @fuzzy = substring(@fuzzy, 1, len(@fuzzy) - 3)

-- Example: ("mcdonalds") OR ("mcdonalds*") OR (FORMSOF(INFLECTIONAL,"mcdonalds"))

SELECT @search = '"' + @keywords + '" OR '+ '(' + @search2 + ') OR (' + @search + ') OR (' + @fuzzy + ')'

set @parsedstring = @search

SET CONCAT_NULL_YIELDS_NULL ON

END
GO

Example Call

<<some stored procedure>>

SET NOCOUNT ON
set @ftisenabled = 1     -- 1 is YES, we are using Full Text Indexing, essential if catalogs are down/unavailable

if @ftisenabled = 1
begin
    set @ftop = 'CONTAINSTABLE'    -- so we can quickly move between different FTI operations for testing

    if ltrim(rtrim(@p_searchstring)) <> '' 
    begin
        set @p_searchstring= replace(@p_searchstring, '%', '') 
        exec FTI_FixString @p_searchstring, 'near', @p_searchstring OUTPUT
    end
else 
    set @ftisenabled = 0
end

<< do your search here using the altered @p_searchstring in your CONTAINS clause >>

Conclusion

What does all this mean? well it highlights the importance of testing, testing not only the result sets from FTI using its variety of clauses and options, but understanding the impact on your end-users and what you, as a DBA/Developer, need to take on board to make the road to FTI implementation a simple and effective one. Have fun!

Credits

A big thanks to Lindsay Cocks for his early work in this area and tireless testing.

Total article views: 12024 | Views in the last 30 days: 17
 
Related Articles
ARTICLE

Fuzzy-String Search: Find misspelled information with T-SQL

An optimized Damerau-Levenshtein Distance (DLD) algorithm for "fuzzy" string matching in Transact-SQ...

FORUM
ARTICLE

Keyword Searching in SQL Server

Have you ever wanted to ensure that keywords in your data are easily searchable? Have you struggled ...

FORUM

search string help needed...

search string help needed...

ARTICLE

TSQL Challenge 34-Search for two keywords within the maximum distance

This challenge is all about searching for two keywords in a string with a maximum distance of 'one w...

Tags
miscellaneous    
programming    
t-sql    
 
Contribute

Join the most active online SQL Server Community

SQL knowledge, delivered daily, free:

Email address:  

You make SSC a better place

As a member of SQLServerCentral, you get free access to loads of fresh content: thousands of articles and SQL scripts, a library of free eBooks, a weekly database news roundup, a great Q & A platform… And it’s our huge, buzzing community of SQL Server Professionals that makes it such a success.

Join us!

Steve Jones
Editor, SQLServerCentral.com

Already a member? Jump in:

Email address:   Password:   Remember me: Forgotten your password?
Steve Jones