A Google-like Full Text Search

  • Thanks for this, an informative and useful article no doubt.

    The title however is misleading - " A Google-like Full Text Search". Google-like IMO implies some sort of PageRank IR algorithm. But this article is really just about boolean search.

  • Martin Nyborg (10/7/2008)


    Thanks for the great article. As soon as I read it. I started to implement full text search on our most used table.

    I have altered the code to use CONTAINSTABLE(tblsite,*,@ftsQuery) because I want to be able to search in many fields.

    But it is not working.

    I have this field list with full text index on (Site (PK), SiteName, Address, HouseNumber, Zip, City)

    I can search for "Vejle" (danish city) and I get result from columns SiteName and Address, so that works.

    Now I want to search for "Vejle" and "17". I want to find the city "vejle" and all streets with housenumber "17" but the result set is empty. Can any one help me out on this? I think I have tried all search combination's

    The problem you're encountering is that you're searching for ("17" AND "Vejle"), and iFTS stipulates that they must exist together in the same column. In order to search for "17" in one column and "Vejle" in another column, you must create two FTS predicates ANDed together like this:

    CONTAINS (tblsite, HouseNumber, "17")

    AND CONTAINS (tblsite, City, "Vejle")

    This can be done, but adds considerable complexity to the query creation.

    Thanks

    Mike C

  • innomatics (10/7/2008)


    Thanks for this, an informative and useful article no doubt.

    The title however is misleading - " A Google-like Full Text Search". Google-like IMO implies some sort of PageRank IR algorithm. But this article is really just about boolean search.

    Glad you found it useful, and I'm sorry that the title implies design and creation of custom web crawlers and custom page ranking techniques.

    The article is primarily aimed at intranets, although the concept of a UI that implements Google-like syntax could easily be applied elsewhere. Unfortunately in most intranet settings the network is not a "democracy" where user pages can be counted as "votes" toward a specific piece of content stored in a database. Those that are, however, can take advantage of custom web crawlers to rank their database content, no doubt. Unfortunately these types of custom applications are beyond the scope of this particular article.

    Not sure what you mean by "boolean search"? This article is about creating an application layer that converts Google-style query syntax to SQL Server Full-Text Search predicates which can be executed against an FTS-enabled database.

    I probably should have gone with my original, more explicit title, "A Tool For Converting Google-Style Search Query Syntax To SQL Server Full-Text Search Predicate Syntax". A little longer to be sure, but definitely eliminates any expectation that custom page ranking algorithms are discussed within.

    Thanks

    Mike C

  • Great article! It's good to see some ideas on how to implement the front end a FTS deployment. Understanding the benefits of the feature from a database perspective is easy enough but unless the user interface is intuative then it's use becomes limited.

    -- JP

  • Mike C (10/7/2008)


    innomatics (10/7/2008)


    Not sure what you mean by "boolean search"?

    Generally a search that doesn't compute likeness statistics and therefore can't return and ordered set of results based on best match. Boolean search results are either in or out, there's no similarity measured.

    It would be interesting to see how more complex algebraic (e.g. vector space, cosine similarity) and probabilistic search models could be implemented effciently in SQL Server.

  • innomatics (10/7/2008)


    Mike C (10/7/2008)


    innomatics (10/7/2008)


    Not sure what you mean by "boolean search"?

    Generally a search that doesn't compute likeness statistics and therefore can't return and ordered set of results based on best match. Boolean search results are either in or out, there's no similarity measured.

    It would be interesting to see how more complex algebraic (e.g. vector space, cosine similarity) and probabilistic search models could be implemented effciently in SQL Server.

    The CONTAINSTABLE and FREETEXTTABLE rowset functions return two columns, one of which is a RANK value based on one of a few different formulae, dependent on options used (such as Jaccard, OKAPI BM25, etc.) The RANK value gives you a relevance ranking based on several different criteria (again dependent on options used). You can order the results based on RANK. An interesting note is that RANK is always calculated by FTS, but is only exposed via the rowset functions.

    The weighted search option in full-text search (ISABOUT) is in fact a classic vector-space search.

    Either of these options is fairly simple to implement in SQL Server. Hilary Cotter expands on some of these options in articles and blogs. He expands his discussions with some examples of multicolumn weighted search algorithms, text mining with custom taxonomies, and even custom inverted indexes.

  • Thanks for the answer

    I had a suspicion that i needed to OR or AND a lot of contains expression. Was just hoping that I could get it all for free:P

    Best Regards Martin Nyborg - Perfection is a process, not an end-point.

  • Hi Mike,

    thank you for your answer.

    Regards,

    Oleg.

  • Martin Nyborg (10/7/2008)


    Thanks for the answer

    I had a suspicion that i needed to OR or AND a lot of contains expression. Was just hoping that I could get it all for free:P

    It would be nice, right? 🙂

    I think of your requirements as sort of a special case as far as full-text search is concerned. It's pretty common to need to query separate individual columns for specific values, like where one column contains a house number, a second column contains a street name, another column contains city name, and so on... Full-text search usually isn't the best option for these types of searches though since you don't normally want to perform linguistic searches of city names and other atomic data like that. Phonetic or other fuzzy search techniques work well for those type of searches.

    FTS is really designed for linguistic searches against documents and textual data stored in the database. When you think of it like that, it's not too often (though I'm sure it does occur) that you have documents stored in multiple columns of the same table and you want to mix and match search criteria among them ("17" must occur in one of the Word documents and "Main St." must occur in another document stored in a different document stored in the same database table row).

    What I think might be more common is a requirement to search multiple columns or even multiple tables using the same criteria (e.g., search all the documents in 3 different tables for documents containing both "17" and "Main St."). You can actually do this by unioning the results of CONTAINSTABLE or FREETEXTTABLE together, using the same FTS search strings but specifying different tables and columns in each query. Basically you would duplicate your basic query multiple times but change the tables/columns specified, and UNION the results together. You would probably also need to add one more column to your results to indicate the source table for each result, so you could join back to the correct table and retrieve your results.

    Thanks

    Mike C

  • You need to change the 28th line in fmConverter.cs

    from:

    Irony.StringSet errors = _compiler.Parser.GetErrors();

    to:

    Irony.StringSet errors = _compiler.Grammar.Errors;

    to get it compiled with the latest version of Irony.

  • bseker (10/13/2008)


    You need to change the 28th line in fmConverter.cs

    from:

    Irony.StringSet errors = _compiler.Parser.GetErrors();

    to:

    Irony.StringSet errors = _compiler.Grammar.Errors;

    to get it compiled with the latest version of Irony.

    Thanks for the info. Irony is constantly under development, so some of the method names, etc., may be different if you compile with a newer version. The download file includes the version of the Irony DLL that I used to compile the example.

    Thanks again

    Mike C

  • Very cool. It's exactly what I needed. I'm implementing a search and didn't want users to have to learn FTS syntax. Thanks for your work.

    I'm running into a problem with the OR operator. If I search for the term "orange" (no quotes in search, just for clarity here) I get the following error:

    Syntax error, expected: OrExpression AndExpression PrimaryExpression ThesaurusExpression ThesaurusOperator ExactExpression ExactOperator ParenthesizedExpression ProximityExpression OrExpression'

    If I search for "andes" (as in the mountains) I get the same error.

    I tried changing the value of OrOperator.Rule by adding a space after "or" like this:

    OrOperator.Rule = Symbol("or ") | "|";

    And it seems to work. So I did the same with "and". I'd be interesting in having this confirmed that I'm not misunderstanding how this works and thus causing problems.

  • developmentalmadness (10/15/2008)


    Very cool. It's exactly what I needed. I'm implementing a search and didn't want users to have to learn FTS syntax. Thanks for your work.

    I'm running into a problem with the OR operator. If I search for the term "orange" (no quotes in search, just for clarity here) I get the following error:

    Syntax error, expected: OrExpression AndExpression PrimaryExpression ThesaurusExpression ThesaurusOperator ExactExpression ExactOperator ParenthesizedExpression ProximityExpression OrExpression'

    If I search for "andes" (as in the mountains) I get the same error.

    I tried changing the value of OrOperator.Rule by adding a space after "or" like this:

    OrOperator.Rule = Symbol("or ") | "|";

    And it seems to work. So I did the same with "and". I'd be interesting in having this confirmed that I'm not misunderstanding how this works and thus causing problems.

    I see your point here. I'll have to check with Roman, it actually sounds like a parsing issue. I would suspect that defining or with a space after it might cause issues with "or" at the end of a word like "emptor" (haven't tried it out though). If this is a parsing issue it may have been resolved in the most recent release of Irony.NET (available here: http://www.codeplex.com/irony).

    Thanks

    Mike C

  • Yes, I meant to say the problem was with the parser. I tried words ending in OR like "Victor" and it seems to work fine.

  • OK I sent Roman an email asking if this was a known issue and if it's been resolved with the latest release, or if the blame is with me (I didn't set a particular setting correctly or something along those lines). If the "or" and "and" problems only occur at the beginning of words, then defining a space behind them like you did will be a usable workaround. The only issue then, I think, is if you use "or" or "and" and don't put a space behind them, as in the following:

    orange and(andes)

    I'll let you know what I hear back on this.

    Thanks

    Mike C

Viewing 15 posts - 16 through 30 (of 166 total)

You must be logged in to reply to this topic. Login to reply