Full text issues after string with "\.\" in it

  • Hi, all -

    Weird issue on SQL Server 2014 that I'm hoping someone can help me with. We work with a lot of medical notes and reports that come in via a wide variety of formats. These notes and reports get ETL'd into a document table where they get full text indexed for searching. Since many medical terms contain a hyphen, we rely on the fact that the word breaker will break these terms up into separate words and index them separately. As an example, "Klippel-Feil syndrome" will have the words "Klippel", "Feil", and "syndrome" indexed separately. It works the same way on search, so a user can type in either "Klippel-Feil" or "Klippel Feil". All that works great.

    We noticed some hyphenated terms weren't bringing back the correct document counts. After much trial and error, we narrowed it down to a particular format (HL7). As an artifact of the type, these notes contain a lot of backslashes - for the most part, these designate line breaks (e.g. "\br\"). These backslashes regularly bump up against full words - the indexer seemed to handle this fine, correctly separating the backslash from the word next to it. We noticed, though, that one particular string of characters before a hyphenated word seemed to change the behavior of the indexer:

    If the hyphenated word was preceded by "\.\" (backslash, period, backslash), the indexer would no longer break apart the hyphenated word. What's more, the backslashes and period could be separated by various strings of letters - as long as they were part of the same contiguous word and in the same order, they would prevent the hyphenated word from being broken apart. A couple of examples:

    "Klippel-Feil" indexed as "Klippel" and "Feil"
    "\.\Klippel-Feil" indexed as "Klippel-Feil"
    "\.br\Klippel-Feil" indexed as "br" and "Klippel-Feil"
    ".\Klippel-Feil" indexed as "Klippel" and "Feil"

    We've tried various language types to see if it helps. We usually default to English (1033), but we also tried several others. Neutral (0) and 13 (not sure what that one is) seem to parse the phrase correctly (using sys.dm_fts_parser), but for whatever reason, the full text index looks the same as it did under English after converting over to either.

    We're considering reparsing the notes to look for this specific series of characters, but that means we'd have to reprocess hundreds of millions of notes.
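
    For reference, a rough sketch of what that reparse step might look like, in Python. This is only an illustration under assumptions: the function name strip_dot_escapes and the regex are hypothetical, and the pattern only covers the "\.\" and "\.br\" shapes from the examples above (backslash, period, optional run of characters, backslash). It replaces each such sequence with a space before the text would be loaded into the document table; it has not been run against real HL7 data.

```python
import re

# Assumed pattern: a backslash, a period, an optional run of characters that
# are neither backslashes nor whitespace, then a closing backslash.
# Matches sequences shaped like \.\ or \.br\ and nothing else.
DOT_ESCAPE = re.compile(r"\\\.[^\\\s]*\\")


def strip_dot_escapes(note: str) -> str:
    # Replace each matched sequence with one space so the following word
    # is no longer glued to the backslash-period-backslash run.
    return DOT_ESCAPE.sub(" ", note)


if __name__ == "__main__":
    for sample in (r"\.\Klippel-Feil", r"\.br\Klippel-Feil", r".\Klippel-Feil"):
        print(repr(strip_dot_escapes(sample)))
```

    Replacing with a space rather than deleting keeps neighboring words from being merged. Whether the word breaker then splits the hyphenated term as expected would still need to be checked with sys.dm_fts_parser.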

    Anyone run into this before? Any suggestions?
