FullText Search on PDF files

  • Hi

    We have an existing solution where we store documents in the database in FileStream tables.

    Users can search the content of these documents: by combining FullText with IFilters, we can run CONTAINS searches on document content.

    This has worked for years, and still does. However, Adobe has removed the download links for its IFilter, and I can see they have set it to end-of-life. Foxit still seems to offer a paid IFilter.

    I have been searching around the net to figure out what best practice is today.

    We want to keep the files in the database. That is a lot easier in our configuration, where data is replicated to vessels sailing all over the globe, and this way we don't need to configure and maintain file servers on board each and every ship.

    We would prefer to keep FullText search, as it is used in a lot of places in our system.

    We use IFilters for Office and PDF files.

    Are there better alternatives to FullText search and the Adobe IFilter today?

    Best regards

    /Anders
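
    For readers unfamiliar with this kind of setup, a minimal sketch of a CONTAINS search over a FileStream column might look like the following. The table, column, index, and catalog names are all hypothetical and do not come from the poster's system; the search only returns PDF hits if a PDF IFilter is registered on the server.

    ```sql
    -- Hypothetical table: dbo.Documents(DocumentId INT PK, FileName NVARCHAR(260),
    -- FileExtension NVARCHAR(10), Content VARBINARY(MAX) FILESTREAM).
    -- The TYPE COLUMN tells full-text which IFilter to use per row ('.pdf', '.doc', ...).
    CREATE FULLTEXT INDEX ON dbo.Documents
    (
        Content TYPE COLUMN FileExtension LANGUAGE 1033
    )
    KEY INDEX PK_Documents      -- assumed name of the unique key index
    ON DocumentCatalog          -- assumed, already existing full-text catalog
    WITH CHANGE_TRACKING AUTO;
    GO

    -- Find documents whose extracted text contains both words.
    SELECT DocumentId, FileName
    FROM dbo.Documents
    WHERE CONTAINS(Content, N'"engine" AND "maintenance"');
    ```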

  • Probably Elasticsearch using Ingest Attachment Processor or Apache Tika.

    Otherwise, if you're happy with the IFilter approach, pay Foxit. Elasticsearch is very powerful, but the cost of developing and implementing the new approach will probably dwarf the license fee for an IFilter.

  • We have been discussing Elasticsearch - our challenge is currently our deployment environment.

    A single customer typically has one land-based installation and potentially hundreds of vessel installations, which their IT operates. Our system distributes data to and from these vessel islands.

    So having a single SQL Server installation as the only requirement on board keeps things quite simple. Adding an Elasticsearch configuration to each vessel would mean quite some work for us to get it running and configured so that the vessel IT at our customers can support it.

    That's primarily why we stick with FullText and keep the files inside the database as FileStreams. It makes the setup quite portable.

    And of course there is the entire job of replacing every use of CONTAINS (and other searches) in our legacy code base with Elasticsearch calls. That would take a lot of time. We have fully flexible search on columns and FullText on many screens.

    The Adobe PDF filter works as it is at our customers - I was just curious whether there was any new technology other than IFilter in SQL Server that I had missed - or whether PDF IFilters were supported by default in some way. (The last part I can easily test in a virtual machine at some point.)

    /Anders

  • I thought SQL Server Full-Text Search already included filters to handle PDF and Office formats 'out of the box'. But I may be completely missing the point, in which case I apologize.

  • Long time no see..

    Revisited this to close it for now.

    Installed a fresh SQL Server 2019, and .pdf does not seem to be included in the result when executing:

    SELECT * FROM sys.fulltext_document_types

    Neither does it include the "new" Office formats, docx, xlsx, etc.

    It does, though, include the old binary Office file formats, doc, xls, etc.

    My conclusion for now is that we need to keep a good grip on our legacy Adobe PDF IFilter so customers can keep using it.

    And for Office files we just continue to use the Office 2010 Filter Pack.

    Best regards

    /Anders
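
    As a hedged sketch of the check described above, the catalog view can be filtered to just the extensions of interest. The second part assumes an IFilter has already been installed at the operating-system level and that the instance should be allowed to load it; whether a service restart is also needed depends on the environment.

    ```sql
    -- List any registered filters for the extensions discussed in this thread.
    -- No rows for an extension means full-text cannot index that file type.
    SELECT document_type, manufacturer, version, path
    FROM sys.fulltext_document_types
    WHERE document_type IN (N'.pdf', N'.doc', N'.xls', N'.docx', N'.xlsx');

    -- Allow the instance to load filters and word breakers registered with the OS,
    -- then restart the filter daemon hosts so newly installed IFilters are picked up.
    EXEC sp_fulltext_service @action = 'load_os_resources', @value = 1;
    EXEC sp_fulltext_service @action = 'restart_all_fdhosts';
    ```

    Running the first query again afterwards shows whether the newly installed filter was registered.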
