At what point do stoplists actually start improving performance ?

Question

At what point do stoplists actually start improving performance ?

isuckatsql

SSChampion

Points: 10381
More actions
September 16, 2013 at 12:33 pm

#279864

I have two exactly the same tables with 4.5 Million resumes indexed with FTI, and a clustered unique ID.
One table has the standard Microsoft 154 English stopwords.
The other table has 10000 stopwords.
Prior to running each query, i ran an
UPDATE STATISTICS Tablename WITH FULLSCAN
Then i ran a DBCC Freeproccache.
Then i ran the following query:
set statistics time on
--select count(*) from Profiles91313bb
select count(*) from Profiles
where contains(doccontent, 'resume')
and the results are basically identical, around 1.05 seconds.
Is this because 4.5 Million records, with up to 10000 stoplist words per resume, is just not enough data to cause index bloat ?
Thanks

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply

Erland Sommarskog SSC-Insane Points: 24038 More actions · Answer 1

I am not sure why you would add 10000 stopwords. Stopwords are common words that are useless to search on. For instance words like "then", "for", "and". If you add a word like "chickenpox" to the stoplist, then users cannot search for "chickenpox".

Putting a word like "résumé" into a stoplist may make sense if you have a recruiting database and about every document is a résumé anyway. But 10000 such words?

[font="Times New Roman"]Erland Sommarskog, SQL Server MVP, www.sommarskog.se[/font]

isuckatsql SSChampion Points: 10381 More actions · Answer 2

The words that would typically be searched are IT related, such as Java, J2ee, JQuery, Livelink etc...

Since i don't know every new technology being developed, it is easier to remove the words that are unlikely to be searched.

What i find interesting is that even with 10k stopwords over 4.5 Million multiple page resumes, i did not get any improvement in performance !

Thanks for your feedback !

Erland Sommarskog SSC-Insane Points: 24038 More actions · Answer 3

isuckatsql (9/18/2013)
Since i don't know every new technology being developed, it is easier to remove the words that are unlikely to be searched.

I don't think that this is a very good strategy. It can only lead to reduced benefit of the index. The single user that searches for "caramel" will not find that single résumé that includes that word. But you have only made a miniscule reduction of the index size.

The performance should not really matter whether the stopwords are there or not. First, I don't think many of these 10000 words are frequent enough to reduce the size. But more importantly, an index is an index, which means that the size of the index does not matter much when you seek it, since you follow the B-tree (or whatever organisation a fulltext index has.)

[font="Times New Roman"]Erland Sommarskog, SQL Server MVP, www.sommarskog.se[/font]

isuckatsql SSChampion Points: 10381 More actions · Answer 4

isuckatsql

SSChampion

Points: 10381

September 18, 2013 at 10:48 am

#1651197

Thanks !