full-text search containstable 'near' broken in SQL Server 2008?

  • I’ve been testing our search implementation which we’ve been running very successfully on SQL Server 2005 against the latest version of SQL Server 2008 (10.0.2531.0 SP1 Developer Edition) on Windows XP SP3 and I’m seeing strange behaviour from full text search (containstable) queries involving near.

    Having spent several hours investigating I’ve isolated the problem to a fairly simple example:

    I have a table of postal addresses called PropertyTokenIndex2 with a full text index on one of its columns TokenText and a bigint primary key field (which complies with Microsoft’s performance recommendation to use an integer type field).

    The full text catalog and index are created as follows:

    create fulltext catalog PropertyTokenIndex_FT_Catalog with accent_sensitivity=off

    create fulltext index on PropertyTokenIndex2(TokenText) key index PK_PropertyTokenIndex2 on PropertyTokenIndex_FT_Catalog with change_tracking=auto, stoplist=off

    (Note there is *no stop list* and that accent sensititivity on the catalog has no effect on the issue I will demonstrate.)

    There are 6758 rows (postal addresses) in the table, one of which has the following in its TokenText field:

    ‘FIRE PROOF. MATRIX HOUSE. UNIT 4. CONSTITUTION HILL’

    I expect all of the following containstable queries to match this row, however the last two queries unexpectedly fail to match:

    select [Key], [Rank] from containstable(PropertyTokenIndex2, TokenText, 'FIRE and PROOF and MATRIX and HOUSE')

    Key Rank

    -------------------- -----------

    179617 24

    (1 row(s) affected)

    select [Key], [Rank] from containstable(PropertyTokenIndex2, TokenText, '(FIRE near PROOF) and MATRIX')

    Key Rank

    -------------------- -----------

    179617 80

    (1 row(s) affected)

    select [Key], [Rank] from containstable(PropertyTokenIndex2, TokenText, '(FIRE near PROOF) and HOUSE')

    Key Rank

    -------------------- -----------

    179617 24

    (1 row(s) affected)

    select [Key], [Rank] from containstable(PropertyTokenIndex2, TokenText, '(MATRIX near HOUSE)')

    Key Rank

    -------------------- -----------

    179332 69

    179382 69

    179521 69

    179617 104

    179648 69

    179650 69

    179657 69

    179866 69

    185644 69

    (9 row(s) affected)

    select [Key], [Rank] from containstable(PropertyTokenIndex2, TokenText, '(FIRE near PROOF) and MATRIX and HOUSE')

    Key Rank

    -------------------- -----------

    179617 24

    (1 row(s) affected)

    select [Key], [Rank] from containstable(PropertyTokenIndex2, TokenText, '(FIRE near PROOF) and (MATRIX near HOUSE)')

    Key Rank

    -------------------- -----------

    (0 row(s) affected)

    :w00t:

    select [Key], [Rank] from containstable(PropertyTokenIndex2, TokenText, 'FIRE and (MATRIX near HOUSE)')

    Key Rank

    -------------------- -----------

    (0 row(s) affected)

    :w00t:

    Interestingly if I change the data in PropertyTokenIndex2 for this row from HOUSE to COURT as follows:

    update PropertyTokenIndex2 set TokenText = 'FIRE PROOF. MATRIX COURT. UNIT 4. CONSTITUTION HILL' where TargetCk = 179617

    … then I get a match:

    select [Key], [Rank] from containstable(PropertyTokenIndex2, TokenText, '(FIRE near PROOF) and (MATRIX near COURT)')

    Key Rank

    -------------------- -----------

    179617 104

    (1 row(s) affected)

    (On changing it back to HOUSE again, I no longer get a match.)

    By querying sys.dm_fts_index_keywords I see that the document count for HOUSE is 1289 and that for COURT is 289.

    select * from sys.dm_fts_index_keywords(db_id('compass8'), object_id('PropertyTokenIndex2')) where display_term in ('FIRE', 'PROOF', 'MATRIX', 'HOUSE', 'COURT')

    keyword display_term column_id document_count

    --------------------------- --------------- ----------- --------------

    0x0066006900720065 fire 2 6

    0x00700072006F006F0066 proof 2 4

    0x006D00610074007200690078 matrix 2 12

    0x0068006F007500730065 house 2 1290

    0x0063006F007500720074 court 2 289

    (5 row(s) affected)

    I’ve seen multiple instances of this problem searching against the small test dataset I'm using (of which the one above is just one example) and the issue appears to be related to the frequency of tokens in documents – I have not had problems in cases where all tokens have a low 'document count' (but it could be that I’ve just not come across one yet).

    Note that dropping and recreating the full text index and catalog has no effect – this is *not* an issue with automatic change tracking.

    The issue is also not related to stop words since I have configured the index to use a null stoplist and querying sys.dm_fts_index_keywords_by_document for the record we're interested in confirms that all the expected tokens are present (none have been discarded).

    select * from sys.dm_fts_index_keywords_by_document(db_id('compass8'), object_id('PropertyTokenIndex2')) where document_id = 179617

    keyword display_term column_id document_id occurrence_count

    ----------------------------------------------------- ------------ --------- ----------- ----------------

    0x0034 4 2 179617 1

    0x0063006F006E0073007400690074007500740069006F006E constitution 2 179617 1

    0x0066006900720065 fire 2 179617 1

    0x00680069006C006C hill 2 179617 1

    0x0068006F007500730065 house 2 179617 1

    0x006D00610074007200690078 matrix 2 179617 1

    0x006E006E0034 nn4 2 179617 1

    0x00700072006F006F0066 proof 2 179617 1

    0x0075006E00690074 unit 2 179617 1

    0xFF END OF FILE 2 179617 1

    Please tell me 'near' has not been broken as a result of Microsoft's re-writing full-text searching in SQL Server 2008 (it worked perfectly on SQL Server 2005) and that I'm doing something wrong or have something amiss with my database configuration?

    This is basically a showstopper for our using SQL Server 2008 if it cannot be resolved as we are dependant on ‘near’ to match groups of words in the same phrase, not in different phrases (SQL Server 2005 still matched the later if it couldn't find words in the same phrase but assigned a much lower score which was ideal behaviour - and it was very fast). I don’t want to have to tell customers that going forwards they can only run our software on Oracle.

  • I passed this to Microsoft tech support having got no replies on the forums. They have successfully reproduced the problem and confirmed this is a bug due to a performance optimisation (which is indeed frequency related) in SQL Server 2008.

    I'm astonished that this has not been picked up before now - is noone else using full text search on SQL Server 2008?! This bug has been in since MSSQL2008 was released 18 or so months ago and is still present in SP1. Anyone using 'near' in their full text predicates could be affected where their search arguments are of low cardinality (high frequency).

    Microsoft have told me they hope to have a fix for this as part of a cumulative update by mid July 2009.

  • This is not the only problem that has been in SQL 2008 iFTS. We are heavy users of FTS in SQL 2005, but found that SQL 2008 iFTS performance was not up to the job in the CTPs and RTM version.

    We therefore put on hold just about all our testing of iFTS until the performance issue was fixed, as we felt there was little point tuning our query design to a performance model that would have to change before we could go live. The problem was fixed in both RTM CU3 and SP1, so now we are working on designing and tuning our SQL 2008 iFTS queries.

    It may be that other installations came across the same issue and also deferred investigation of iFTS performance. It would not surprise me that more problems are reported about iFTS in the next 3 months than have been reported between RTM and CU3.

    Original author: https://github.com/SQL-FineBuild/Common/wiki/ 1-click install and best practice configuration of SQL Server 2019, 2017 2016, 2014, 2012, 2008 R2, 2008 and 2005.

    When I give food to the poor they call me a saint. When I ask why they are poor they call me a communist - Archbishop Hélder Câmara

Viewing 3 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply