I would hope that most of us have used a thesaurus at some point in our careers. These allow us to substitute words for one another, providing for richer and more interesting communication.
Full-text search in SQL Server includes a thesaurus that you can customize for your searches. As with the thesaurus some of us use when writing, this features allows the search engine to substitute one word for another in searches.
You actually have to customize it. Here’s the default thesaurus for SQL Server 2008, which is stored in this location:
If you look in this folder, you see a lot of XML files. These are the thesaurus files and they are named as tsxxx.xml, where xxx is the three letter language code. For English, the thesaurus is tseng.xml.
If I open up the English file, you can see there’s not much there in terms of entries.
This looks like the file that came with SQL Server 2005, if not SQL Server 2000. Note also that everything is commented out, and you need to remove these comment lines if you want to edit this file.
The configuration isn’t that complex, but let’s look at a simple example. I’ll set up a small table and create a full text index on it.
CREATE TABLE FTSTemp ( id INT , Notes varchar(8000) CONSTRAINT pk_ftstemp PRIMARY KEY (id) ) ; GO INSERT ftstemp SELECT 1, 'The quick brown fox jumped over the lazy dog' INSERT ftstemp SELECT 2, 'I run WinXP.'
I’m going to edit my thesaurus file to include these entries:
I am only using expansion sets here. There are also replacement sets, but that’s for another post. In this case, when I search for any of the terms above, elements matching any other terms will be returned.
NOTE: The US English file is tsenu.xml. The UK English file is tseng.xml.
I’ll now create a full text index on this table, on the Notes column.
I can issue this search, which I expect to work:
SELECT id , notes FROM ftstemp WHERE CONTAINS(notes, 'quick')
That returns the data I expect.
Now to check the expansion set. To do that, I’ll need to use a FREETEXT query. Once I do this, I get results from both of my entries.
Note that if you edit the thesaurus, in order for your changes to show up in queries, you need to reload the Thesaurus file with this:
EXEC sys.sp_fulltext_load_thesaurus_file 1033;
The 1033 is for English. This is the LCID, which varies for each language.
If you want to broaden your searches, include acronyms, etc, then the thesaurus is a good way to do this. Beware, however, that your entries will apply to all searches on the instance, so if you have disparate applications on the same instance, you might encounter some strange results.
Filed under: Blog Tagged: full text search, sql server, syndicated