Freetext query to get surrounding portion of text

  • Hello all,

    I want to build a query which returns some portion of the text around the matching criteria (word/phrase). That is, if i search for say, "valuable asset", the result should be something like "....positive attitude is the most valuable asset of a person, which....".

    The column contains files of MS office 2007 (docx, pptx, xlsx,etc), html, txt and pdf formats.

    Any suggestions?

    George

  • This is possible but the only way I know of would be hideously slow. I assume you have column varbinary(max) and in this column you have stored the byte [] which represents the file. then you have another column with the mimetype or filename/extension? The only way I can think of looking through is to write each file to disc and then use file I/O to read the file as text and see if it contains the string you are searching for. Like I said it can be done but anything more than 4-5 documents is going to be hideously slow.

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • I have a large number of documents in the db. So I don't think that is an option.

    Anyone know any other options?

  • Given that the files are converted to byte array they are no longer human readable. There is no way you will be able to search the contents of these files for a given search string without first converting them back to a file. This is pretty much like trying to search an encrypted value where the clear text contains a string. The only way to do that is to first convert your encrypted value to the original and then search it. Sounds like maybe getting your files out of the database might be the place to start. Store them in the file system instead of the db. That might not be possible with the rest of your application however.

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • You can use full text indexing on BLOBs stored in SQL Server. MSDN has instructions on setting it up. You have to include the file extension as a column in the table.

    That will do what you need.

    It's not hard to set up. It can end up using a fairly large amount of disk space, depending on what you're indexing.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • +1 for somebody with more knowledge than me jumping in. Thanks Gus.

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • Ok, Gus... I really get to show my ignorance of FTS here. :blush: I know that SQL Server FTS will create the indexes from a gaggle of different file types but I didn't know they'd be materialized in a searchable format for what the OP wants to do and I certainly didn't know that SQL Server FTS would do a "context window" like what the OP is asking for.

    So, I have to ask, as well... How can this be done using SQL Server FTS ?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • First part, on FTS on a binary file, there's an example of it here: http://www.codeproject.com/KB/architecture/sqlfulltextindexing.aspx

    The materialization is more complex, but is mostly just string manipulation. For example, CharIndex will return the start point of a search string in a varbinary column.

    Personally, I'd really do this outside of SQL Server. Filestream would allow you to use existing search functions on these things, without having to roll-your-own in T-SQL.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Guh! Thanks, Gus. That's what I thought.

    The big problem with doing what the OP wants to do is that FTS is just what it says... it's a search. It only tells you if the item you searched for is somewhere in the doc and what the search rank is. It doesn't return the word number, character number, or anything else useful for this particular problem.

    I've also look at the actual index files that FTS makes... they're not very useful to outside processes unless you can find a piece of code designed to read them.

    A while back, someone posted an article on a homemade FTS system built totally within T-SQL tables. A lot of folks gave the author a whole lot of guff on seemingly redeveloping the wheel but I can see why someone might want to build such a thing considering the very limited info that FTS actually provides.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Yeah, I've always thought FTS was pretty crippled because you can't materialize the matching string from it.

    It would be MUCH more useful if you could query a function from the ContainsTable method that would return what the match was.

    This would especially be true in binary (BLOB) indexed objects, because charindex and such into varbinary is case sensitive.

    Just another half-product from Microsoft. Does something almost useful, but not quite. Like the tuning advisor. Like Profiler. Gets you halfway to what you need, and no way to get the rest of the way without moving into manual data analysis.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Hi,

    From the above discussions, I understand that I will not be able to extract the text portion directly from the varbinary field. So, as a walk around, will it be possible to locate the previous and next occurrences of full stop (in terms of varbinary format) with respect to the CHARINDEX of the search criteria, then extract this portion between the full stops and cast it to varchar?

    On a second thought, I guess that is going to be a really circuitous and tricky route! and of course, there is more probability for poor accuracy, too.

  • You can use Substring in varbinary just like in varchar. Get the start and stop point, pull the substring, find what you need inside of that.

    Again, you're probably better off with a desktop search tool and using that to index the files.

    If you need SQL style data as well as the files, look into Filestream. It's a SQL Server 2008 option that gives you SQL and file tools all in one. Has some limitations, but might be what you need here.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Thanks Gus...

    But I only have SQL Server 2005, so I will have to work with it..... It has to be a web application, so I am trying to find a way out.....

Viewing 13 posts - 1 through 13 (of 13 total)

You must be logged in to reply to this topic. Login to reply