When to use full-text

Question

When to use full-text

pdanes2

SSCommitted

Points: 1930
More actions
July 25, 2011 at 6:25 am

#241995

Are there any guidelines on when it is advisable to add full-text capabilities to a database? Number of records, length of fields, types of expected searches? I know the standard answers, about what a full-text index is and how it works, but haven't found any information on how to decide whether it makes sense to use it all.
Here's the scenario: I have one primary table, with about twenty lookup tables attached to it. Each of these lookup tables has three fields: autonumber primary key, ActualValue varchar(<100), AlternateValues varchar(cca. 1000). The basic functionality, besides joins on the foreign key from the main table, is that the user inputs some text for a particular field. A stored procedure will then examine the (indexed, no dups) ActualValue field for an exact match. If one is found, done, else examine the AlternateValues field for a partial match, that is, Like '%input%'.
Some records will have several possible alternates, enclosed in braces or some such as separators; some will have nothing.
An example record might be:
12345; Smith William; {Will Smith}{Bill Smith}{Willy Smitty}{William Smith Esquire}.
If the use enters 'Will Smith', the procedure will find the correct record, but if the entry is 'Willy Smith', it will not, even though both names are present in the alternate form. They are not both present within the bounds of one set of braces, so it's not a match.
This is simple with wildcards - Like '{%' + inputtext + '%}', but potentially slow. So:
1. Does a scenario like this make sense for full-text indexing? Does it depend on the actual statistics of the table (record count, actual field contents)?
2. Can full-text even address this requirement, of "all words to be within the same set of braces"?

Viewing 10 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply

Brandie Tarvin SSC Guru Points: 173009 More actions · Answer 1

I think your scenario highlights a very real problem with Full-Text.

The searches are based on words, and variations of words, that SQL "knows." Because you are using proper names, instead of words that are in the dictionary, you are limited with full-text to the variations on proper names that Microsoft may have entered into their "dictionary." When you get to exotic names, or names spelled differently from the norm, you won't be able to find them with full-text searching.

I've never had a reason to use full-text. If it weren't for the proper name issue, I would think your situation would be a decent reason to use it, but I'm unsure of what performance issues you may run into. Do you have a sandbox to test this out on?

Brandie Tarvin, MCITP Database AdministratorLiveJournal Blog: http://brandietarvin.livejournal.com/[/url]On LinkedIn!, Google+, and Twitter.Freelance Writer: ShadowrunLatchkeys: Nevermore, Latchkeys: The Bootleg War, and Latchkeys: Roscoes in the Night are now available on Nook and Kindle.

GSquared SSC Guru Points: 260824 More actions · Answer 2

It would make sense to use FTS for that, but instead of defining all the combinations in the AlternateText column, build/buy/find a custom thesaurus for names.

If you look at the data on FTS thesauri, here http://msdn.microsoft.com/en-us/library/ms142491.aspx, you'll pretty quickly see how that can work. It can be pretty slick with custom "this string = this other string" functionality.

There's a discussion on the subject here: http://social.msdn.microsoft.com/Forums/en-US/sqldatabaseengine/thread/6e77ba90-aa63-4087-8b0d-5ac7bda183cd/

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

pdanes2 SSCommitted Points: 1930 More actions · Answer 3

Brandie Tarvin (7/25/2011)
I think your scenario highlights a very real problem with Full-Text.
The searches are based on words, and variations of words, that SQL "knows." Because you are using proper names, instead of words that are in the dictionary, you are limited with full-text to the variations on proper names that Microsoft may have entered into their "dictionary." When you get to exotic names, or names spelled differently from the norm, you won't be able to find them with full-text searching.
I've never had a reason to use full-text. If it weren't for the proper name issue, I would think your situation would be a decent reason to use it, but I'm unsure of what performance issues you may run into. Do you have a sandbox to test this out on?

I have a complete copy of the real database, under another name, running in the same instance of Sql Server Express. I use ODBC to access it, so I simply switch the ODBC pointer between real and test versions as necessary. I even have code in the app (only one at present) that talks to the database, which checks the ODBC target and computer user name. If my computer is linked to the test version or any other user's to the real version, all is well and the app starts normally. But if it sees that I'm talking to the real version or someone else is talking to the test version, it pops up a warning dialog (mostly to keep me from accidentally trashing real data, but also to warn users if I screwed up something on their machine and the ODBC link isn't right).

The custom dictionary isn't a bad idea, some of the contents are proper names, and some tables contain hairy-chested scientific terms, which Microsoft is unlikely to have in their standard dictionary. Also multiple languages (English, Latin, Czech, German, Russian) figure in some of this.

However, I thought that full-text indexed all words. Do you mean that if it runs into a word that isn't in the MS-supplied dictionary (or an additional custom one), it simply drops the word? And then will claim that it doesn't exist in the database? If so, we've gone far enough - that is absolutely unacceptable for this app.

pdanes2 SSCommitted Points: 1930 More actions · Answer 4

GSquared (7/25/2011)
It would make sense to use FTS for that, but instead of defining all the combinations in the AlternateText column, build/buy/find a custom thesaurus for names.

I could do that, but I have 20 such tables, with wildly varying content. My Bill Smith record was just an example, to illustrate what I'm trying to do. Also, Brandie's post seems to indicate that such a thesaurus must be completely covering, for all possible terms in any table that will use this. If I understood that correctly, there is no point in going any further with this topic.

If you look at the data on FTS thesauri, here http://msdn.microsoft.com/en-us/library/ms142491.aspx, you'll pretty quickly see how that can work. It can be pretty slick with custom "this string = this other string" functionality.

Thanks, that will need some study, but it looks useful. However, I see no way to confine acceptable matches to a specific range in the AlternateValues column. Quite the opposite, it seems that my proposed braces would simply get discarded and leave me with no way to specify that all matching words must be in the same substring: that is, any substring delimited by braces (or some other such character(s) that will not occur in the texts).

There's a discussion on the subject here: http://social.msdn.microsoft.com/Forums/en-US/sqldatabaseengine/thread/6e77ba90-aa63-4087-8b0d-5ac7bda183cd/

That would help with the name issue, if there was something there, but it seems that the discussion fizzled out with no solution. And I have much more than just proper names, 'Bill Smith' was only an illustrative example.

Brandie Tarvin SSC Guru Points: 173009 More actions · Answer 5

In an effort to not scare you off of full-text, I've called in reinforcements to offer alternatives. I have, unfortunately, reached the limit of my knowledge in this subject.

Brandie Tarvin, MCITP Database AdministratorLiveJournal Blog: http://brandietarvin.livejournal.com/[/url]On LinkedIn!, Google+, and Twitter.Freelance Writer: ShadowrunLatchkeys: Nevermore, Latchkeys: The Bootleg War, and Latchkeys: Roscoes in the Night are now available on Nook and Kindle.

David Burrows SSC Guru Points: 65138 More actions · Answer 6

The problem you will have with FTS is that it will ignore the braces and end up with a word list of

Bill

Esquire

Smith

Smitty

Will

William

Willy

so you could end up with incorrect output if looking for Will or Smith in each delimited set.

To achive a better search result split the row into several to achieve

Will Smith

Bill Smith

Willy Smitty

William Smith Esquire

which would make

CONTAINS(col,'"Will" OR "Smith"')

and

CONTAINS(col,'"Willy" OR "Smith"')

more accurate

Far away is close at hand in the images of elsewhere.
Anon.

GSquared SSC Guru Points: 260824 More actions · Answer 7

FTS will index all words. The issue with the thesaurus is that if you haven't told it that "AU = Gold = Aurum = Oro", it won't know that. It will index the words separately. So long as you can build the right thesaurus, you'll get the equivalencies. If you don't, you'll still get the individual words.

So, if you haven't told it that "Bill" = "Willy", you can still search for "Bill", but it won't find "Willy" results, unless you search for both in your query. Multiple ways to skin the same cat.

I'm not sure if you can include multiple "words" in the same thesaurus entry. Haven't tried that. You'll need to test it to see if you can include "Billy Smith = Willy Smith". But I'm not sure why you'd want to do that. Why not single-word equivalences, which would get the same result. Then you search for "Bill NEAR Smith", and you get "Willy Smith", "Bill Smitty", "Billy Smith", et al. Much easier that way, and less error prone in most circumstances (I can't think of any where it would be more error prone, but that doesn't mean they don't exist, just that I can't think of them).

On the lists other than proper names, you can probably find thesauri already built, or can find data sources that a little XML cleverness can turn into thesauri with very little effort compared to other possible solutions. I found that discussion about proper names in under a minute with Bing, so use the search engine you prefer and take a little time, and you'll probably find what you need.

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

pdanes2 SSCommitted Points: 1930 More actions · Answer 8

OK, thank you everyone for the advice. Based on all that, it doesn't seem like full-text is the correct tool for my needs in this particular instance. I'll stick with the wildcard approach for now.

Most of the time, the text I need will be in the first, indexed column, so speed shouldn't be a problem there. And the cases when searches are likely to examine the Alternate column are generally batch imports, which are not response-critical. If it takes five minutes to load a batch instead of four, it's just not that big a deal - the (possible) improvement just doesn't seem to be worth the effort required to implement full-text.

Appreciate all the thoughts on the matter.

GSquared SSC Guru Points: 260824 More actions · Answer 9

Okay. Glad we could help at least that much.

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon