Implementing FULL-TEXT Search in a Google fashion!

  • OK, my first post on a subject I startd learning a couple days ago! SQL2K5 Full Text Search!

    Right....creating the FTS indexes and catalogs was a no-brainer! I even have a SP that I use to query the indexed table. So my question...

    What is the best way to search, using FTS, for multiple words?

    For example:

    In my table I have the columns Title, Description and Detail. If I run...

    SELECT Title, Description, Detail

    FROM MyTable

    WHERE FREETEXT (*, 'computer programming')

    GO

    it returns (what I think) are all of the records I need. Is this the best way?

    I'm not really concerned about server resources for the moment. The SQL machine is seriously under-tasked. I'm more concerned about retreiving everything which contains the search text, in all it's combinations.

    TIA

    ...the man in black fled across the desert...and the gunslinger followed...

  • Depends. Google does a lot more than phrase matching and ranking.

    See also CONTAINS predicate and ISABOUT term.

  • Getting the most out of Full Text can be like walking in a wilderness without a map, compass or satnav. There are some details in BOL, and Google can find some good articles, but the more you know about Full Text the more you realise ow little good advice exists.

    A few tips...

    1) Work out what queries you need in a query window before you put anything into a SP or CLR code. Your first, second, third, etc attempts are unlikely to get the best out of Full Text.

    2) Using CONTAINSTABLE allows you to fine tune your results more than the other access methods. If a simple FREETEXT or CONTAINS does not give you what you need then you need CONTAINSTABLE and probably the ISABOUT clause with weightings.

    3) The documentation does not make it clear that ISABOUT('value1', 'value2') really means ISABOUT('value1') OR ISABOUT('value2'). If a match is found with 'value1' then do not expect FTS to also search for a match on 'value2'.

    4) If you apply weightings in ISABOUT, the best results are given if the order of the weighting is descending. e.g. if you have a series: ISABOUT('value1' WEIGHT (0.001)) AND ISABOUT('value2' WEIGHT (0.9)) AND ISABOUT('value3' WEIGHT (0.5)) will give a different result to the final rank than if the weights were given in descending order 0.9, 0.5, 0.001. Giving the weights in descending order helps give a higher rank to the clause with the heighest weight.

    5) To get good performance, you need to minimise the number of Full Text accesses you make in a single T-SQL query. Aim to get everything you need in a single FREETEXT or CONTAINSTABLE clause.

    6) To get good performance, you need to avoid filtering the results from a FTS search by joins to other tables. e.g. SELECT model FROM CONTAINSTABLE(trucks, colour, FORMSOF('red')) as ft JOIN trucks on ft. = trucks.key WHERE trucks.wheels = 4 will first select all types of truck where the colour is a shade of red, then filter the list to just give you 4-wheeled trucks that are a shade of red. This filtering can take a long time to complete. You need to look at how you can search for both shades of red and 4 wheels in a single CONTAINSTABLE so FTS only gives back the rows you really need.

    Finally, FTS can give fantastic results, but it can take a while to work out how to use it most effectively.

    Original author: https://github.com/SQL-FineBuild/Common/wiki/ 1-click install and best practice configuration of SQL Server 2019, 2017 2016, 2014, 2012, 2008 R2, 2008 and 2005.

    When I give food to the poor they call me a saint. When I ask why they are poor they call me a communist - Archbishop Hélder Câmara

  • Read Hilary Cotter's articles first before just "jumping in".

    http://www.simple-talk.com/sql/learn-sql-server/sql-server-full-text-search-language-features/

    Be very careful as to what features you're looking for and how you implement them. You will never be able to get to 100% of what Google does.

    I've also worked extensively with Oracle's full-text feature, which has more features and functionality (along with better performance) than SQL Server. Although SQL Server 2008 is reportedly better (per Microsoft) -- we'll see.


    [font="Arial Narrow"](PHB) I think we should build an SQL database. (Dilbert) What color do you want that database? (PHB) I think mauve has the most RAM.[/font]

  • I agree, Hilary Cotter is one of the few sources of good FTS advice on the web.

    Original author: https://github.com/SQL-FineBuild/Common/wiki/ 1-click install and best practice configuration of SQL Server 2019, 2017 2016, 2014, 2012, 2008 R2, 2008 and 2005.

    When I give food to the poor they call me a saint. When I ask why they are poor they call me a communist - Archbishop Hélder Câmara

  • Also be aware of the noise word list and how this affects the search results when searching for text containing common words.

    I also agree that producing the sort of results that Google search has will be very difficult to achieve, especially using SQL exclusively.

  • to make it kind of like google...

    distinct lowercase words in one field

    I put all "distinct" lowercase words I'm interested in 1 field and build a fulltext index on that. I do this because I don't like inflated ranks based on repeated words or caps. Note: I also modify the noise words because a lot of noise words are important for my searches.

    add "filters" into the fulltext field with all the rest of the words

    say you only want to get things that are in-stock, add "instock" to the words you're fulltext indexing and then add "instock and..." to your containstable search clause; this performs a lot better than post-results joining and filtering on another table's instock bit. note: i add it first in the field, so I can easily search for it with a like statement as well; "like" performs decent when you're searching from the start, or the end, but not the middle (e.g. TheText LIKE 'instock%' is good).

    containstable - and containstable with formsof(inflectional,"word") syntax

    I use containstable for more precise matches; if containstable doesn't work straight up, or I need more results, I use contains table with inflectional formsof syntax around appropriate words. freetexttable just performs like crap and returns weird results if you ask me (it uses OR between every word and formsof() variations for everything; it's just too slow).

    if variations of containstable yield nothing, use "grams" and "sounds like" strategy.

    add another field to your fulltext indexed table that is all n-grams of your words. let's say they are sizes of 3 characters each; here's an example:

    "michael crichton"

    becomes

    "_mi mic ich cha hto ton on_ n_c _cr cri ric ich cht..."

    if you want, you can also add the SOUNDEX codes, or even the double-metaphone codes into your grams and call it GramsAndCodes. then take grams and sounds like codes and search against the fulltext grams and sounds like codes with freetexttable; it doesn't perform well, but will get you fuzzy results that are pretty decent. with some text similarity code analysis of the results, often you can predict what the user meant to type and do a "did you mean ???"

    use top n rank in clause appropriately

    using the top n rank setting can speed things up, but test the results, sometimes I've noticed less than predictable results with using top in ranks with freetexttable. There's also some pre-compute rank setting that you can try to speed up freetext searches, but it returns crappy results shortly after a master-merge ages.

    other performance things

    if you want to get crazy, you can put the fulltext index on it's own drive, and even limit sql server's ram consumption to ensure fulltext has room to operate.

    disclaimer

    this works for me, and I combine it with some client side caching to make things faster, but I have to mention that I'm not fulltext indexing a massive amount of text here, so what I do might not work that great for bigger text loads. also, so you know, I use the SQL CLR to make function like GetDistinctWords() and GetGrams() and GetSoundExCodes() and GetDoubleMetaphoneCodes()

    Lucene.NET

    another option is Lucene.NET, and it's free. if your data you want to fulltext search is in a database, I'd stick with SQL FullText though, you just can beat the load speed and automatic updating for sql text data.

  • Fantastic advise folks! 😀

    Thanks a million. Definately have enough to REALLY do dome damage! :w00t:

    Noise words are causing some minor issues. Seems some people are searching for IT (Information Technology) and it's getting ignored as "it"!!! Nifty! :hehe:

    Also getting some weird results on fields containing XML data. it's correctly returning the searched values, but the XML markup (which should remain invisible to users) sometimes appears after a FTS! Super-Nifty!! :hehe::w00t::hehe:

    I've already streamlined the SP to remove all joins. I'm now loading "related" data as an array and attaching what's needed in the code instead of in SQL

    Thanks again for the resource links everyone. Great help!!

    ...the man in black fled across the desert...and the gunslinger followed...

  • EdVassie - Tuesday, April 1, 2008 3:11 AM

    Getting the most out of Full Text can be like walking in a wilderness without a map, compass or satnav. There are some details in BOL, and Google can find some good articles, but the more you know about Full Text the more you realise ow little good advice exists.A few tips...1) Work out what queries you need in a query window before you put anything into a SP or CLR code. Your first, second, third, etc attempts are unlikely to get the best out of Full Text.2) Using CONTAINSTABLE allows you to fine tune your results more than the other access methods. If a simple FREETEXT or CONTAINS does not give you what you need then you need CONTAINSTABLE and probably the ISABOUT clause with weightings.3) The documentation does not make it clear that ISABOUT('value1', 'value2') really means ISABOUT('value1') OR ISABOUT('value2'). If a match is found with 'value1' then do not expect FTS to also search for a match on 'value2'.4) If you apply weightings in ISABOUT, the best results are given if the order of the weighting is descending. e.g. if you have a series: ISABOUT('value1' WEIGHT (0.001)) AND ISABOUT('value2' WEIGHT (0.9)) AND ISABOUT('value3' WEIGHT (0.5)) will give a different result to the final rank than if the weights were given in descending order 0.9, 0.5, 0.001. Giving the weights in descending order helps give a higher rank to the clause with the heighest weight.5) To get good performance, you need to minimise the number of Full Text accesses you make in a single T-SQL query. Aim to get everything you need in a single FREETEXT or CONTAINSTABLE clause.6) To get good performance, you need to avoid filtering the results from a FTS search by joins to other tables. e.g. SELECT model FROM CONTAINSTABLE(trucks, colour, FORMSOF('red')) as ft JOIN trucks on ft. = trucks.key WHERE trucks.wheels = 4 will first select all types of truck where the colour is a shade of red, then filter the list to just give you 4-wheeled trucks that are a shade of red. This filtering can take a long time to complete. You need to look at how you can search for both shades of red and 4 wheels in a single CONTAINSTABLE so FTS only gives back the rows you really need.Finally, FTS can give fantastic results, but it can take a while to work out how to use it most effectively.

    Amen I am implementing for the very first time started a couple weeks ago. We have some pretty complex rules and variables and I am struggling with it. I just got here and about to dig in to what you have.

    <hr noshade size=1 width=250 color=#BBC8E5> Regards,Jeffery Williams http://www.linkedin.com/in/jwilliamsoh

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply