When to use full-text

  • Are there any guidelines on when it is advisable to add full-text capabilities to a database? Number of records, length of fields, types of expected searches? I know the standard answers, about what a full-text index is and how it works, but haven't found any information on how to decide whether it makes sense to use it all.

    Here's the scenario: I have one primary table, with about twenty lookup tables attached to it. Each of these lookup tables has three fields: autonumber primary key, ActualValue varchar(<100), AlternateValues varchar(cca. 1000). The basic functionality, besides joins on the foreign key from the main table, is that the user inputs some text for a particular field. A stored procedure will then examine the (indexed, no dups) ActualValue field for an exact match. If one is found, done, else examine the AlternateValues field for a partial match, that is, Like '%input%'.

    Some records will have several possible alternates, enclosed in braces or some such as separators; some will have nothing.

    An example record might be:

    12345; Smith William; {Will Smith}{Bill Smith}{Willy Smitty}{William Smith Esquire}.

    If the use enters 'Will Smith', the procedure will find the correct record, but if the entry is 'Willy Smith', it will not, even though both names are present in the alternate form. They are not both present within the bounds of one set of braces, so it's not a match.

    This is simple with wildcards - Like '{%' + inputtext + '%}', but potentially slow. So:

    1. Does a scenario like this make sense for full-text indexing? Does it depend on the actual statistics of the table (record count, actual field contents)?

    2. Can full-text even address this requirement, of "all words to be within the same set of braces"?

  • I think your scenario highlights a very real problem with Full-Text.

    The searches are based on words, and variations of words, that SQL "knows." Because you are using proper names, instead of words that are in the dictionary, you are limited with full-text to the variations on proper names that Microsoft may have entered into their "dictionary." When you get to exotic names, or names spelled differently from the norm, you won't be able to find them with full-text searching.

    I've never had a reason to use full-text. If it weren't for the proper name issue, I would think your situation would be a decent reason to use it, but I'm unsure of what performance issues you may run into. Do you have a sandbox to test this out on?

    Brandie Tarvin, MCITP Database AdministratorLiveJournal Blog: http://brandietarvin.livejournal.com/[/url]On LinkedIn!, Google+, and Twitter.Freelance Writer: ShadowrunLatchkeys: Nevermore, Latchkeys: The Bootleg War, and Latchkeys: Roscoes in the Night are now available on Nook and Kindle.

  • It would make sense to use FTS for that, but instead of defining all the combinations in the AlternateText column, build/buy/find a custom thesaurus for names.

    If you look at the data on FTS thesauri, here http://msdn.microsoft.com/en-us/library/ms142491.aspx, you'll pretty quickly see how that can work. It can be pretty slick with custom "this string = this other string" functionality.

    There's a discussion on the subject here: http://social.msdn.microsoft.com/Forums/en-US/sqldatabaseengine/thread/6e77ba90-aa63-4087-8b0d-5ac7bda183cd/

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Brandie Tarvin (7/25/2011)


    I think your scenario highlights a very real problem with Full-Text.

    The searches are based on words, and variations of words, that SQL "knows." Because you are using proper names, instead of words that are in the dictionary, you are limited with full-text to the variations on proper names that Microsoft may have entered into their "dictionary." When you get to exotic names, or names spelled differently from the norm, you won't be able to find them with full-text searching.

    I've never had a reason to use full-text. If it weren't for the proper name issue, I would think your situation would be a decent reason to use it, but I'm unsure of what performance issues you may run into. Do you have a sandbox to test this out on?

    I have a complete copy of the real database, under another name, running in the same instance of Sql Server Express. I use ODBC to access it, so I simply switch the ODBC pointer between real and test versions as necessary. I even have code in the app (only one at present) that talks to the database, which checks the ODBC target and computer user name. If my computer is linked to the test version or any other user's to the real version, all is well and the app starts normally. But if it sees that I'm talking to the real version or someone else is talking to the test version, it pops up a warning dialog (mostly to keep me from accidentally trashing real data, but also to warn users if I screwed up something on their machine and the ODBC link isn't right).

    The custom dictionary isn't a bad idea, some of the contents are proper names, and some tables contain hairy-chested scientific terms, which Microsoft is unlikely to have in their standard dictionary. Also multiple languages (English, Latin, Czech, German, Russian) figure in some of this.

    However, I thought that full-text indexed all words. Do you mean that if it runs into a word that isn't in the MS-supplied dictionary (or an additional custom one), it simply drops the word? And then will claim that it doesn't exist in the database? If so, we've gone far enough - that is absolutely unacceptable for this app.

  • GSquared (7/25/2011)


    It would make sense to use FTS for that, but instead of defining all the combinations in the AlternateText column, build/buy/find a custom thesaurus for names.

    I could do that, but I have 20 such tables, with wildly varying content. My Bill Smith record was just an example, to illustrate what I'm trying to do. Also, Brandie's post seems to indicate that such a thesaurus must be completely covering, for all possible terms in any table that will use this. If I understood that correctly, there is no point in going any further with this topic.

    If you look at the data on FTS thesauri, here http://msdn.microsoft.com/en-us/library/ms142491.aspx, you'll pretty quickly see how that can work. It can be pretty slick with custom "this string = this other string" functionality.

    Thanks, that will need some study, but it looks useful. However, I see no way to confine acceptable matches to a specific range in the AlternateValues column. Quite the opposite, it seems that my proposed braces would simply get discarded and leave me with no way to specify that all matching words must be in the same substring: that is, any substring delimited by braces (or some other such character(s) that will not occur in the texts).

    That would help with the name issue, if there was something there, but it seems that the discussion fizzled out with no solution. And I have much more than just proper names, 'Bill Smith' was only an illustrative example.

  • In an effort to not scare you off of full-text, I've called in reinforcements to offer alternatives. I have, unfortunately, reached the limit of my knowledge in this subject.

    Brandie Tarvin, MCITP Database AdministratorLiveJournal Blog: http://brandietarvin.livejournal.com/[/url]On LinkedIn!, Google+, and Twitter.Freelance Writer: ShadowrunLatchkeys: Nevermore, Latchkeys: The Bootleg War, and Latchkeys: Roscoes in the Night are now available on Nook and Kindle.

  • The problem you will have with FTS is that it will ignore the braces and end up with a word list of

    Bill

    Esquire

    Smith

    Smitty

    Will

    William

    Willy

    so you could end up with incorrect output if looking for Will or Smith in each delimited set.

    To achive a better search result split the row into several to achieve

    Will Smith

    Bill Smith

    Willy Smitty

    William Smith Esquire

    which would make

    CONTAINS(col,'"Will" OR "Smith"')

    and

    CONTAINS(col,'"Willy" OR "Smith"')

    more accurate

    Far away is close at hand in the images of elsewhere.
    Anon.

  • FTS will index all words. The issue with the thesaurus is that if you haven't told it that "AU = Gold = Aurum = Oro", it won't know that. It will index the words separately. So long as you can build the right thesaurus, you'll get the equivalencies. If you don't, you'll still get the individual words.

    So, if you haven't told it that "Bill" = "Willy", you can still search for "Bill", but it won't find "Willy" results, unless you search for both in your query. Multiple ways to skin the same cat.

    I'm not sure if you can include multiple "words" in the same thesaurus entry. Haven't tried that. You'll need to test it to see if you can include "Billy Smith = Willy Smith". But I'm not sure why you'd want to do that. Why not single-word equivalences, which would get the same result. Then you search for "Bill NEAR Smith", and you get "Willy Smith", "Bill Smitty", "Billy Smith", et al. Much easier that way, and less error prone in most circumstances (I can't think of any where it would be more error prone, but that doesn't mean they don't exist, just that I can't think of them).

    On the lists other than proper names, you can probably find thesauri already built, or can find data sources that a little XML cleverness can turn into thesauri with very little effort compared to other possible solutions. I found that discussion about proper names in under a minute with Bing, so use the search engine you prefer and take a little time, and you'll probably find what you need.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • OK, thank you everyone for the advice. Based on all that, it doesn't seem like full-text is the correct tool for my needs in this particular instance. I'll stick with the wildcard approach for now.

    Most of the time, the text I need will be in the first, indexed column, so speed shouldn't be a problem there. And the cases when searches are likely to examine the Alternate column are generally batch imports, which are not response-critical. If it takes five minutes to load a batch instead of four, it's just not that big a deal - the (possible) improvement just doesn't seem to be worth the effort required to implement full-text.

    Appreciate all the thoughts on the matter.

  • Okay. Glad we could help at least that much.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

Viewing 10 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply