At what point do stoplists actually start improving performance ?

  • I have two exactly the same tables with 4.5 Million resumes indexed with FTI, and a clustered unique ID.

    One table has the standard Microsoft 154 English stopwords.

    The other table has 10000 stopwords.

    Prior to running each query, i ran an

    UPDATE STATISTICS Tablename WITH FULLSCAN

    Then i ran a DBCC Freeproccache.

    Then i ran the following query:

    set statistics time on

    --select count(*) from Profiles91313bb

    select count(*) from Profiles

    where contains(doccontent, 'resume')

    and the results are basically identical, around 1.05 seconds.

    Is this because 4.5 Million records, with up to 10000 stoplist words per resume, is just not enough data to cause index bloat ?

    Thanks

  • I am not sure why you would add 10000 stopwords. Stopwords are common words that are useless to search on. For instance words like "then", "for", "and". If you add a word like "chickenpox" to the stoplist, then users cannot search for "chickenpox".

    Putting a word like "résumé" into a stoplist may make sense if you have a recruiting database and about every document is a résumé anyway. But 10000 such words?

    [font="Times New Roman"]Erland Sommarskog, SQL Server MVP, www.sommarskog.se[/font]

  • The words that would typically be searched are IT related, such as Java, J2ee, JQuery, Livelink etc...

    Since i don't know every new technology being developed, it is easier to remove the words that are unlikely to be searched.

    What i find interesting is that even with 10k stopwords over 4.5 Million multiple page resumes, i did not get any improvement in performance !

    Thanks for your feedback !

  • isuckatsql (9/18/2013)


    Since i don't know every new technology being developed, it is easier to remove the words that are unlikely to be searched.

    I don't think that this is a very good strategy. It can only lead to reduced benefit of the index. The single user that searches for "caramel" will not find that single résumé that includes that word. But you have only made a miniscule reduction of the index size.

    The performance should not really matter whether the stopwords are there or not. First, I don't think many of these 10000 words are frequent enough to reduce the size. But more importantly, an index is an index, which means that the size of the index does not matter much when you seek it, since you follow the B-tree (or whatever organisation a fulltext index has.)

    [font="Times New Roman"]Erland Sommarskog, SQL Server MVP, www.sommarskog.se[/font]

  • Thanks !

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply