Blob data handling - best option?

  • Hi

    I am a programmer and need to do the following.

    Currently we are using SQL Server 2000 and we are handling lots of resume/cv in .DOC and PDF formats. we use some 3rd party tools to extracts the contents of the file and store it in NTEXT column which is full-text indexed.

    Now we are planning to go with SQL Server 2012 Standard and i would like to get some advice on this scenario.

    1. I dont want to use the text extracter (3rd party tools). Should i store the files as single BLOB and full-text index on it. This is main point as we search these very often to find the best matched candidate for a job requirement.

    Are is there any other best method to handle this situation?

    Thanks.

  • kavidha (3/20/2013)


    Hi

    I am a programmer and need to do the following.

    Currently we are using SQL Server 2000 and we are handling lots of resume/cv in .DOC and PDF formats. we use some 3rd party tools to extracts the contents of the file and store it in NTEXT column which is full-text indexed.

    Now we are planning to go with SQL Server 2012 Standard and i would like to get some advice on this scenario.

    1. I dont want to use the text extracter (3rd party tools). Should i store the files as single BLOB and full-text index on it. This is main point as we search these very often to find the best matched candidate for a job requirement.

    Are is there any other best method to handle this situation?

    Thanks.

    1) are you sure you need Ntext? Almost every time I see this at clients there is no need for it. Unless you are or plan to store languages that require 2-byte characters, save 50% of the storage cost by choosing an non-Unicode datatype.

    2) I would look into a purpose-built 3rd party device for text indexing/searching. They can provide MUCH more performance and importantly for you flexibility/power in your search capabilities.

    3) Having said that, SQL 2012 FTS is really very good. There is also the new Semantic Search capabilities you probably should investigate. There are also various iFilters you can use to enable indexing various document types such as pdf, doc, etc.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Hi Kevin,

    Thanks a lot for your reply.

    1. No i dont want to use NText...

    2. I am not sure about the Single BLOB...so for my scenario, can i use it? is it better solution?

    Thanks.

  • Filestream is the most efficient way to store unstructured data, and is available since 2008

    You need to enable filestream for the database and add a filestream filegroup.

    Alter the table column to varbinary.

    See the below links

    http://msdn.microsoft.com/en-us/library/gg471497.aspx

    http://msdn.microsoft.com/en-us/library/cc645923.aspx

    http://msdn.microsoft.com/en-us/library/cc645585.aspx

    ----------------------------------------------------------------------------------------------------------------------------------------------------
    Roshan Joe

    Jeff Moden -Forum Etiquette: How to post data/code on a forum to get the best help[/url]

  • kavidha (3/20/2013)


    Hi Kevin,

    Thanks a lot for your reply.

    1. No i dont want to use NText...

    2. I am not sure about the Single BLOB...so for my scenario, can i use it? is it better solution?

    Thanks.

    Yes, you can store SOME types of files as blobs (either filestream or not) and then have Full Text Indexing shred them. Some file types will not be FTS-aware. You may wish to consult with an FTS expert to see what is best for your needs. Personally I would set up tests of several different mechanisms and see which works/works best for YOUR needs.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Thanks for you suggestion Roshan Joe.

    But I heard that filestream is at its best if the size is over 1gb...:unsure:

  • Thanks Kevin. Will try to execute with different mechanism. Thanks once again.

  • kavidha (3/22/2013)


    Thanks for you suggestion Roshan Joe.

    But I heard that filestream is at its best if the size is over 1gb...:unsure:

    http://www.sqlskills.com/blogs/paul/sql-server-2008-filestream-performance/

    http://www.sqlskills.com/blogs/paul/high-performance-filestream-tips-and-tricks/

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • If you are going to 2012, think about FileTable. In some ways the development is easier than Filestream.

    The filters for blobs are based on the iFilter standard. You can purchase third party filters for some types, or that work better. The Office filters work pretty well, but if you are Office 2010 or later, you'll need to download and install them separately. The native iFilters for FTS are 2007 based.

    For file sizes, in general, if you are 256kb or smaller, use varbinary and stick them in there. If your files are 1MB or larger, go Filestream/Filetable. In between, test, but if you don't have lots of updates (mostly inserts), I'd just do Filestream/Filetable and not worry about it.

    I wouldn't look at Semantic Search in 2012. It is a neat idea, but it only supports uni-grams (single terms) and isn't terribly useful, IMHO. FTS searches work better. If they get to n-grams in 20xx, then I'd revisit this. With that in mind, separate and contain your search logic so you can replace it if things improve or you want to change things. The NEAR and proximity searches are improved in 2012. You also get property searches, so the "tags" and metadata you see in the file properties or explorer windows can be searched in 2012. That's pretty cool stuff if your sources use it.

  • Thank you Steve.

Viewing 10 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply