Key Word Searches

  • quickdraw (3/1/2011)


    @ Antonio,

    Just so I can file this away in my brain for later:

    What is your max sustained throughput (total users and max queries per minute)? How many CPUs?

    Thanks.

    Trying to track just full-text queries is tough for me since our app performs a lot of caching. I'll see if a DBA can expose something via tracing. However, we typically have 150+ users on at any given time, and searching is a common task.

  • SharePoint's Search service uses SQL Server FTS. Based on the results listed at

    http://technet.microsoft.com/en-us/library/cc262574(office.12).aspx

    you should be able to do 20 requests per second on a 15 million item library with one server.

    20 RPS is quite a bit, even for a public website. If you're getting that many, you've probably got a datacenter or two and a team of DBAs.

  • While the discussion has focused on applying this technique to words and text, it is also a way to handle leading-wildcard searches on part numbers. I have done something like this by building an index of all reasonable suffix substrings of a part number, clipping off one leading character at a time until there's nothing "significant" left. Then if a user wants to search for, say, '%74%138%', I clip off the leading wild characters and do a LIKE on the suffix index. (I could further optimize by guessing that '138%' would be a shorter scan than '74%', and applying the original '%74%138%' pattern to the original part number.) I quickly get all the matches like 'MC74AC138N', 'SN74LS138D', etc. I don't imagine that FTS would be applicable to this problem.

    While it sounds like I'd be multiplying the size of my index by the number of characters in a part number, it's really not so bad. Short part numbers have fewer suffix substrings, and you can add some restrictions if you disallow putting a wildcard character in the middle of certain sequences, like a series of digits. For about 125,000 part numbers I have fewer than 600,000 suffix-to-part-number mappings. A rough sketch of the approach is below.
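
    Here's a minimal T-SQL sketch of that suffix-index idea. The dbo.Parts and dbo.Tally tables and all other names here are hypothetical, and the pruning of "insignificant" suffixes described above is omitted:

    -- Hypothetical suffix-index table: one row per suffix of each part number.
    CREATE TABLE dbo.PartNumberSuffix
    (
        Suffix     VARCHAR(50) NOT NULL,
        PartNumber VARCHAR(50) NOT NULL
    );
    CREATE CLUSTERED INDEX IX_PartNumberSuffix_Suffix ON dbo.PartNumberSuffix (Suffix);

    -- Populate by clipping one leading character at a time, via a standard tally table.
    INSERT INTO dbo.PartNumberSuffix (Suffix, PartNumber)
    SELECT SUBSTRING(p.PartNumber, t.N, 50), p.PartNumber
    FROM dbo.Parts p
    JOIN dbo.Tally t
      ON t.N BETWEEN 1 AND LEN(p.PartNumber);

    -- A search for '%74%138%' clips the leading wildcard and becomes a seekable LIKE:
    SELECT DISTINCT s.PartNumber
    FROM dbo.PartNumberSuffix s
    WHERE s.Suffix LIKE '74%138%';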

  • We loaded something like 150 million records from the Library of Congress and had a co-located two-post rack running a quad-processor something-or-other with a RAID 5 setup on Oracle 8i. Users ranged into the thousands online at once doing queries. RPS peaked above 50, if memory serves (this was 1999).

    We filtered stop words and plurals, but not misspellings or synonyms.

    Using materialized views is very smart. Be aware of disk space with lots of records. I'm on a project now that has over a billion records in one table. We have to plan very carefully for index growth and maintenance (rebuilds). Adding a materialized view that copies and concatenates text columns (and then builds FTS indexes on that, which might be in addition to FTS indexes on individual columns) could take up considerable space.

    For 15 million records with an average row length of 65 characters, you're talking close to a gigabyte of space for the materialized view (15,000,000 rows × 65 bytes ≈ 975 MB). Not a problem for an in-house server, but on a hosted server or a cloud database that could cost you actual coin per month.

    My point there isn't to talk you out of materialized views. I love that idea. Just look at the whole picture when designing things, and plan for growth/scale.

  • I liked the article. A new technique to add to my toolset. I played around with it to remove the temp tables, and at least on my dev server there was quite a performance increase:

    DECLARE @Keywords VARCHAR(8000);
    SET @Keywords = 'The Red Rider Carbon Action 200 shot range model air rifle';

    SELECT DISTINCT
           P.ProductID,
           P.Name,
           P.Color,
           P.Size,
           P.ListPrice
    FROM   -- convert the keywords to a table; replaces #KW
           dbo.DelimitedSplit8KNEW(REPLACE(REPLACE(@Keywords, '/', ' '), ',', ''), ' ') searchterms
    LEFT JOIN dbo.KeyWordsToExclude xcl
           ON xcl.KeyWord = searchterms.ItemValue
    INNER JOIN dbo.ProductKeyWordItems prodkeys
           ON prodkeys.KeyWord = searchterms.ItemValue
    INNER JOIN Production.Product P
           ON P.ProductID = prodkeys.ProductID
    WHERE  xcl.KeyWord IS NULL; -- include only rows where the keyword hits a non-excluded word

    Here are the stats:

    Table 'KeyWordsToExclude'. Scan count 1, logical reads 79, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'ProductKeyWordItems'. Scan count 11, logical reads 24, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Product'. Scan count 1, logical reads 15, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times: CPU time = 0 ms, elapsed time = 9 ms.
    SQL Server Execution Times: CPU time = 0 ms, elapsed time = 0 ms.

  • There is a lot of reinvention of the wheel here. I've worked with full-text in both Oracle (9i, 10g) and SQL Server (2000, 2005, 2008), as well as external (i.e., not integrated) full-text systems such as AltaVista, etc.

    Now that SQL Server 2008 has an integrated full-text engine similar to what Oracle had in 9i (circa 2000), full-text in SQL Server is pretty robust. However, it still lacks a lot of Oracle's functionality.

    We use it for plain text, HTML, and binary (Microsoft Office files, PDFs, etc.).

    Eliminate stop (noise) word filtering; it's so '90s. That is, index everything.

    As for performance and capacity, just look at what Iron Mountain is doing.


    [font="Arial Narrow"](PHB) I think we should build an SQL database. (Dilbert) What color do you want that database? (PHB) I think mauve has the most RAM.[/font]

  • Another technique to try ...

    On tables with columns on which you'd otherwise build an FTS index, instead create a trigger that populates a single table where you string together exactly what you want. I.e., if searching has a set of filter criteria as well as the text, add in 'keywords' for the filters.

    E.g., you want all people whose names are like 'Fred Smith' who live in California and were born in 1969. You'd create an FTS string like 'Fred Smith zClfrnz zDOB19690501', where zClfrnz and zDOB19690501 are keywords created by the trigger from attributes that are probably already in memory.

    When you search, your FTS CONTAINS clause looks like this: 'Fred AND Smith AND zClfrnz AND zDOB1969*'. This is blisteringly fast. (A sketch of the setup is below.)
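
    A minimal T-SQL sketch of the idea; every table, column, and token name here is made up for illustration, and the full-text catalog setup is assumed to exist already:

    -- Hypothetical source table (StateToken precomputed, e.g. 'zClfrnz'):
    CREATE TABLE dbo.Person
    (
        PersonID   INT          NOT NULL PRIMARY KEY,
        FullName   VARCHAR(200) NOT NULL,
        StateToken VARCHAR(20)  NOT NULL,
        DOB        DATE         NOT NULL
    );

    -- The single search table all triggers write into:
    CREATE TABLE dbo.SearchText
    (
        PersonID   INT          NOT NULL CONSTRAINT PK_SearchText PRIMARY KEY,
        SearchText VARCHAR(900) NOT NULL
    );
    -- Assumes a full-text catalog exists:
    -- CREATE FULLTEXT INDEX ON dbo.SearchText (SearchText) KEY INDEX PK_SearchText;
    GO
    CREATE TRIGGER trPerson_SearchText ON dbo.Person
    AFTER INSERT, UPDATE
    AS
    BEGIN
        SET NOCOUNT ON;
        -- Rebuild the search string from attributes already in the inserted rows.
        MERGE dbo.SearchText AS tgt
        USING (SELECT i.PersonID,
                      i.FullName + ' ' + i.StateToken
                        + ' zDOB' + CONVERT(CHAR(8), i.DOB, 112) AS SearchText
               FROM inserted i) AS src
            ON tgt.PersonID = src.PersonID
        WHEN MATCHED THEN UPDATE SET tgt.SearchText = src.SearchText
        WHEN NOT MATCHED THEN INSERT (PersonID, SearchText)
                              VALUES (src.PersonID, src.SearchText);
    END;
    GO
    -- The search itself (prefix terms need double quotes inside CONTAINS):
    SELECT PersonID
    FROM dbo.SearchText
    WHERE CONTAINS(SearchText, '"Fred" AND "Smith" AND "zClfrnz" AND "zDOB1969*"');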

    Obviously, this technique falls down when a range search is also required, but not anything like as badly as you'd think. Say you want to search on 'Fred Smith's in California whose age is between 45 and 50. You'd add a column to your new FTS table that included the DOB, possibly with a specific index, or a covering index for the most common range-value searches if they are commonly used in conjunction. Another approach, where the number of items in a range is small, is simply to string together the range of interest, e.g. 'Fred AND Smith AND zClfrnz AND (zDOB1967* OR zDOB1968* OR zDOB1969* OR zDOB1970* OR zDOB1971*)'.

    Another thing that works well is 'joining' FTS searches. In this situation you still have just one FTS table, but triggers on multiple entity tables populate it. I.e., you want to find all salespeople named 'Joe Patel' who work in California and who are servicing a 'Fred Smith'. The salesPerson table creates 'Joe Patel zslsprsnz zClfrnz', and the Fred Smith row is extended to include the relationship, i.e. 'Fred Smith zClfrnz zDOB19690501 zslsprsnzJoePatel'.

    Any time you want to add new search keywords or searchable attributes, you run a dummy update on the source table and the trigger regenerates the search-string value. Note you need to get a bit funky finding the existing search value if you're doing updates rather than deletes and inserts.

    A 2-vCPU, 8 GB SQL Server was easily handling 100 AJAX transactions per second over a DB with 750k rows in the search table, serving two different search application types too: one for building products and one for real estate.

    The above won't work in all situations, but if your searchable data is in database columns it generally works a treat, as you can combine the best of a relational DB with the best of FTS.

  • One reason for doing this instead of built-in Full-Text Search is that once you have isolated your keywords, you can suggest alternatives when a user has misspelled one. See my article on fuzzy-string matching. (I linked this article from the discussion on mine.)

  • It doesn't accommodate wildcards, though. It is only parsing and matching FULL words.

    He% and Nu% would not work.

    If you put a LIKE between KWI.KW and #KW.Keyword, you may think it works, but it is not matching BOTH keywords, only one of the above. (A sketch of the problem and a workaround follows.)
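
    To illustrate, here's a hedged T-SQL sketch using the table names from the earlier sample (dbo.ProductKeyWordItems and a #KW temp table of patterns, both assumed). The plain LIKE join returns products hitting ANY pattern; requiring ALL patterns takes a distinct count per product:

    -- Matches products hitting ANY pattern ('He%' OR 'Nu%'):
    SELECT DISTINCT KWI.ProductID
    FROM dbo.ProductKeyWordItems KWI
    JOIN #KW ON KWI.KeyWord LIKE #KW.Keyword;

    -- One way to require BOTH patterns: every pattern must hit at least once.
    SELECT KWI.ProductID
    FROM dbo.ProductKeyWordItems KWI
    JOIN #KW ON KWI.KeyWord LIKE #KW.Keyword
    GROUP BY KWI.ProductID
    HAVING COUNT(DISTINCT #KW.Keyword) = (SELECT COUNT(*) FROM #KW);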
