Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 
        
Home       Members    Calendar    Who's On


Add to briefcase ««12345»»»

A Google-like Full Text Search Expand / Collapse
Author
Message
Posted Tuesday, October 7, 2008 2:28 PM


Ten Centuries

Ten CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen Centuries

Group: General Forum Members
Last Login: Wednesday, September 24, 2014 1:20 PM
Points: 1,276, Visits: 1,135
Oleg (10/7/2008)
Hi Mike,
very useful article, especially code - thank you.

I tested code and found one problem.

If I entered phrase:

president -aluminium

it wokrs fine.

If phrase is:
-aluminium president

it generates syntax error, phrase cannot start from negation. I checked goole it understand correctly this kind of phrases.

How can I edit grammar to avoid this kind of errors?

Thank you again.
Regards,
Oleg.





Hi Oleg

I fought with this issue myself and decided to "keep it simple" by defining a rule that the first token can't be preceded by a minus sign. The issue is that, owing to the simple nature of the recursive AST walker, the grammar maps fairly closely to the iFTS grammar. In IFTS grammar you have to use the AND NOT operator (there's no NOT without AND). Starting an iFTS query string with AND NOT ("AND NOT aluminum AND president" is not allowed in iFTS). You could use a more complex algorithm to rearrange the nodes of the AST to deal with this issue in a more user-friendly way, but I really wanted to keep the sample code simple for the article. I also considered using a more advanced algorithm like applying the visitor pattern to the AST, but again to keep it simple I decided to stick with the simple recursive tree walker.

Thanks
Mike C
Post #582181
Posted Tuesday, October 7, 2008 2:53 PM


Ten Centuries

Ten CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen Centuries

Group: General Forum Members
Last Login: Wednesday, September 24, 2014 1:20 PM
Points: 1,276, Visits: 1,135
Confucius247 (10/7/2008)
More time talking about a c# compiler than the SQL. No talk of the limitations of the FTS.


The article is not about the limitations of FTS, of which information abounds. A lot of these limitations are overcome in SQL 2008 iFTS, however. I didn't feel a strong need to rehash FTS limitations to demonstrate how to provide your users with a simpler interface to FTS.

Bit of a waste of time really.


I'm sorry you feel that way.

From title of piece I was hoping for something about combining site useage stats stored in SQL with the FTS results to actually make something Google like. Google is not defined by its syntax, which few users use, but by the relevancy of its results.


Combining site usage states stored in SQL with FTS results? I'm not sure what led you to believe this particular article addressed that, and I'm sorry you're disappointed. However, if relevance of results is important (and of course it always is) there are several opportunities to increase relevance in FTS. For instance you can use the proximity search to increase relevance (implemented in the grammar) or weighted seraches (not implemented in the grammar). FTS also provides relevance ranking via CONTAINSTABLE and FREETEXTTABLE. Unfortunately in a corporate intranet setting there's little chance you can duplicate Google's ability to count links back to a specific document or page to calculate a Google-style relevance score.

As for "few users" using Google syntax, I strongly disagree. Google is the world's most popular search engine, and I would expect that nearly everyone who has used a Web-based search engine has at some point used Google. I would further venture that most users are familiar with the basics of Google syntax, which is why I recommend using it rather than forcing users to learn iFTS syntax to manually enter queries like the following:

"aluminum" AND (FORMSOF(INFLECTIONAL, fish) OR FORMSOF(THESAURUS, money))

The reasons for providing users with a simple syntax are discussed in the article.

Thanks
Mike C
Post #582199
Posted Tuesday, October 7, 2008 3:03 PM


Ten Centuries

Ten CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen Centuries

Group: General Forum Members
Last Login: Wednesday, September 24, 2014 1:20 PM
Points: 1,276, Visits: 1,135
marcosc (10/7/2008)
Good article.
Actually, with Nautilus you can find a table just writing part of its name and once located you can find a record writing any word which is compared with like if it's a string and with "=" if its an integer.
The query is built for all the fields:

select top 10 *
from dbo.Customers (nolock)
where CustomerID like '%ana%' or CompanyName like '%ana%' or ContactName like '%ana%'...


Once the record is located, you can see it's related records (following the FK links).

There is also a "google like" search where you write a word or number and looks for it in the whole database.
Not much like google, but it works for developers.

Thanks, Marcos (http://sourceforge.net/projects/nautilus/)


FTS actually builds an inverted index structure to search text fields, and allows some pretty complex searches that are more efficient than LIKE with leading wildcards, which can't take advantage of any indexes really. For tables with a large number of columns and a large number of rows, most of which aren't indexed at all, I would imagine the LIKE with leading wildcards could lead to serious performance issues.

This actually sounds like it might be closer to the Google Base Data API Query Language (http://code.google.com/apis/base/query-lang-spec.html). The Base Data API Query Language actually gives you a lot of flexibility in performing searches, similar to what you're suggesting above, but would also have to be translated to SQL code at some point.

Thanks
Mike C
Post #582209
Posted Tuesday, October 7, 2008 3:07 PM


Ten Centuries

Ten CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen Centuries

Group: General Forum Members
Last Login: Wednesday, September 24, 2014 1:20 PM
Points: 1,276, Visits: 1,135
Paul DB (10/7/2008)
While I am still trying to digest this excellent article, I think I found a slight error.

In The Grammar section, the article lists 7 rules. One of them is...
	PrimaryExpression ::= Term
| '-' Term
| Phrase
| '(' OrExpression ')'
| '<' ( Term | Phrase )+ '>'

To more closely match the code in the Conversion Engine section, I think it should be more like the following.
Note the addition of the ExactExpression option.
	PrimaryExpression ::= Term
| '-' Term
| ExactExpression
| '(' OrExpression ')'
| Phrase
| '<' ( Term | Phrase )+ '>'

A minor point, I know. I just want to make sure that I am not missing something here. :D


I believe you're right. I made some slight changes to the code after writing the article and I probably wiped that line out by mistake. Good catch, and apologies!

Mike C
Post #582211
Posted Tuesday, October 7, 2008 3:40 PM
Forum Newbie

Forum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum Newbie

Group: General Forum Members
Last Login: Wednesday, November 2, 2011 5:26 AM
Points: 5, Visits: 35
Thanks for the great article. As soon as I read it. I started to implement full text search on our most used table.

I have altered the code to use CONTAINSTABLE(tblsite,*,@ftsQuery) because I want to be able to search in many fields.
But it is not working.

I have this field list with full text index on (Site (PK), SiteName, Address, HouseNumber, Zip, City)
I can search for "Vejle" (danish city) and I get result from columns SiteName and Address, so that works.

Now I want to search for "Vejle" and "17". I want to find the city "vejle" and all streets with housenumber "17" but the result set is empty. Can any one help me out on this? I think I have tried all search combination's



Best Regards Martin Nyborg - Perfection is a process, not an end-point.
Post #582235
Posted Tuesday, October 7, 2008 4:41 PM
SSC Rookie

SSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC Rookie

Group: General Forum Members
Last Login: Tuesday, December 16, 2014 7:41 PM
Points: 26, Visits: 54
Thanks for this, an informative and useful article no doubt.

The title however is misleading - " A Google-like Full Text Search". Google-like IMO implies some sort of PageRank IR algorithm. But this article is really just about boolean search.
Post #582262
Posted Tuesday, October 7, 2008 5:15 PM


Ten Centuries

Ten CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen Centuries

Group: General Forum Members
Last Login: Wednesday, September 24, 2014 1:20 PM
Points: 1,276, Visits: 1,135
Martin Nyborg (10/7/2008)
Thanks for the great article. As soon as I read it. I started to implement full text search on our most used table.

I have altered the code to use CONTAINSTABLE(tblsite,*,@ftsQuery) because I want to be able to search in many fields.
But it is not working.

I have this field list with full text index on (Site (PK), SiteName, Address, HouseNumber, Zip, City)
I can search for "Vejle" (danish city) and I get result from columns SiteName and Address, so that works.

Now I want to search for "Vejle" and "17". I want to find the city "vejle" and all streets with housenumber "17" but the result set is empty. Can any one help me out on this? I think I have tried all search combination's



The problem you're encountering is that you're searching for ("17" AND "Vejle"), and iFTS stipulates that they must exist together in the same column. In order to search for "17" in one column and "Vejle" in another column, you must create two FTS predicates ANDed together like this:

CONTAINS (tblsite, HouseNumber, "17")
AND CONTAINS (tblsite, City, "Vejle")

This can be done, but adds considerable complexity to the query creation.

Thanks
Mike C
Post #582271
Posted Tuesday, October 7, 2008 5:26 PM


Ten Centuries

Ten CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen Centuries

Group: General Forum Members
Last Login: Wednesday, September 24, 2014 1:20 PM
Points: 1,276, Visits: 1,135
innomatics (10/7/2008)
Thanks for this, an informative and useful article no doubt.

The title however is misleading - " A Google-like Full Text Search". Google-like IMO implies some sort of PageRank IR algorithm. But this article is really just about boolean search.


Glad you found it useful, and I'm sorry that the title implies design and creation of custom web crawlers and custom page ranking techniques.

The article is primarily aimed at intranets, although the concept of a UI that implements Google-like syntax could easily be applied elsewhere. Unfortunately in most intranet settings the network is not a "democracy" where user pages can be counted as "votes" toward a specific piece of content stored in a database. Those that are, however, can take advantage of custom web crawlers to rank their database content, no doubt. Unfortunately these types of custom applications are beyond the scope of this particular article.

Not sure what you mean by "boolean search"? This article is about creating an application layer that converts Google-style query syntax to SQL Server Full-Text Search predicates which can be executed against an FTS-enabled database.

I probably should have gone with my original, more explicit title, "A Tool For Converting Google-Style Search Query Syntax To SQL Server Full-Text Search Predicate Syntax". A little longer to be sure, but definitely eliminates any expectation that custom page ranking algorithms are discussed within.

Thanks
Mike C
Post #582276
Posted Tuesday, October 7, 2008 6:48 PM


SSC-Enthusiastic

SSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-Enthusiastic

Group: General Forum Members
Last Login: Monday, March 8, 2010 5:04 AM
Points: 135, Visits: 90
Great article! It's good to see some ideas on how to implement the front end a FTS deployment. Understanding the benefits of the feature from a database perspective is easy enough but unless the user interface is intuative then it's use becomes limited.

-- JP
Post #582300
Posted Tuesday, October 7, 2008 8:24 PM
SSC Rookie

SSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC Rookie

Group: General Forum Members
Last Login: Tuesday, December 16, 2014 7:41 PM
Points: 26, Visits: 54
Mike C (10/7/2008)
[quote]innomatics (10/7/2008)
Not sure what you mean by "boolean search"?


Generally a search that doesn't compute likeness statistics and therefore can't return and ordered set of results based on best match. Boolean search results are either in or out, there's no similarity measured.

It would be interesting to see how more complex algebraic (e.g. vector space, cosine similarity) and probabilistic search models could be implemented effciently in SQL Server.
Post #582314
« Prev Topic | Next Topic »

Add to briefcase ««12345»»»

Permissions Expand / Collapse