Blog Post

Full-Text Search – Stoplists in SQL Server

,

Full-text search is an interesting subsystem in SQL Server. It allows you to implement searches through a variety of text formats stored in SQL Server. This is a one of a series of posts that looks at different facets of full-text search.

What is a Stoplist?

A stoplist is a list of stopwords that SQL Server should not include in a full-text index.  These are words that are seen as not adding any value to the full-text index. We typically see these words as important in language for structure, but not for content. Examples of stopwords are:

  • the
  • a
  • an
  • is
  • are

In previous versions of SQL Server, these were also known as noise words and a noise word list. You can read about the topic in Books Online.

How are they used?

Stoplists are used when building the index. The words that are contained in the text, and also in the stoplist are ignored and not populated inside the index. This makes for a smaller index, and it also means that the stopwords are not

The position of these words in the text being indexed, however, still do matter. This is to be sure that searches using proximity are still correctly carried out.

Each index can have a specific stoplist associated with it. You can associate a stoplist at index creation time, or alter the index later to add or change the stoplist.

Creating a Stoplist

For each language supported in the full-text system, there is a stoplist installed with SQL Server. These are the commonly used words that should be ignored for each language.

You can, however, create your own stoplist of word with the CREATE FULLTEXT STOPLIST command. The creation can be for a new stoplist, or you can copy an existing stoplist.This includes system stoplists, which you can use as a basis for your custom stoplist. The commands are simple, and they are well documented in BOL.

To add or remove words from a stoplist, the ALTER FULLTEXT STOPLIST command is used with the ADD or DROP parameters. Alterations to a stoplist must be for a specific language, which is specified with the LCID or name of the language.

Practical Points

The stoplists are important for limiting the size of the stoplist and making a more efficient index. Full-text indexes are very efficient and scalable in SQL Server, but the less data that needs to be indexed and searched, the most efficient the system will operate.

System stoplists works well for many natural language searches, but are not necessarily adequate for domain specific searches. For example, if I were indexing all white papers on SQL Server, I might want to ignore extremely common words or phrases that are in all documents. For example, I might consider “SQL” to be so common as to be useless in searches. Rather than bloat the size of the index with this word, I may add this to a stoplist for the full-text index and assume it’s a word like “the”, which I would not use for searches of these documents.

I haven’t necessarily found a reason to use custom stoplists in the past, but if my full-text index were extremely large or I had a large volume of searches, I might consider using stoplists to prune down my indexes.

If you have used these in your system, I’d be interested in knowing the reasons and effects.

Filed under: Blog Tagged: full text search, sql server, syndicated

Rate

You rated this post out of 5. Change rating

Share

Share

Rate

You rated this post out of 5. Change rating