Return text from a PDF stored in the database (Adobe iFilter)

  • We are storing PDF files inside a SQL Server 2008 R2 database. We have installed the Adobe iFilter so that we could create a full-text catalog to search these files. Everything was working great until we tried to get the text out of a PDF for display on a website. I am at a loss. We want to be able to return the text of a PDF file as varchar(max) using just straight-up T-SQL. I assume we would need to create a function and *somehow* use the iFilter to pull out the text, but I cannot find any documentation on how to do such a thing. I have searched the web for hours and found nothing.

    Has anyone done this? Or does anyone have a link to some documentation that shows how to do it?

    Thanks,

    Murphy

  • I would guess this is possible with the appropriate .NET routine. There are probably assemblies out there for manipulating PDF documents. You may be able to add these assemblies to SQL Server and create a CLR function for the task.

    However, I am a little skeptical that this is such a bright idea. Maybe it's better to return the PDF to the client and extract the text in the business layer.

    Erland Sommarskog, SQL Server MVP, www.sommarskog.se

  • here's the first actual code example i found:

    http://www.codeproject.com/Articles/13391/Using-IFilter-in-C

    along the lines of what Erland was suggesting, i think i'd simply add a column that will hold the extracted text from the pdf, and create a method that will process the documents one by one.
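
    A minimal sketch of that one-by-one processing, under loud assumptions: it uses Python with an in-memory SQLite table as a stand-in for the SQL Server table, and the table name `documents`, the columns `content` and `extracted_text`, and the extractor `extract_text_stub` are all hypothetical. The stub only decodes bytes; a real extractor would call an IFilter or a PDF library.

```python
import sqlite3
from typing import Callable


def extract_text_stub(pdf_bytes: bytes) -> str:
    """Hypothetical placeholder extractor.

    It only decodes the raw bytes; a real version would hand the bytes
    to an IFilter or a PDF library and return the document text.
    """
    return pdf_bytes.decode("latin-1", errors="replace")


def backfill_extracted_text(
    conn: sqlite3.Connection, extract: Callable[[bytes], str]
) -> int:
    """Fill extracted_text for every row that does not have it yet.

    Returns the number of documents processed in this pass.
    """
    rows = conn.execute(
        "SELECT id, content FROM documents WHERE extracted_text IS NULL"
    ).fetchall()
    for doc_id, content in rows:
        # Process one document at a time and store the result in the new column.
        conn.execute(
            "UPDATE documents SET extracted_text = ? WHERE id = ?",
            (extract(content), doc_id),
        )
    conn.commit()
    return len(rows)


def make_demo_db() -> sqlite3.Connection:
    """Build a throwaway in-memory table standing in for the real one."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE documents ("
        "id INTEGER PRIMARY KEY, content BLOB NOT NULL, extracted_text TEXT)"
    )
    conn.executemany(
        "INSERT INTO documents (content) VALUES (?)",
        [(b"first document",), (b"second document",)],
    )
    conn.commit()
    return conn
```

    Because only rows with a NULL `extracted_text` are selected, the pass can be re-run safely after new documents are added.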

    Lowell



  • Hello

    I often solve problems like yours using PDF processing software. Such tools support rotating a PDF page, inserting or deleting a PDF page, reordering PDF pages, and adding images to a PDF page. You can take a look.

  • I appreciate the comments.

    We already have the data that is contained in the PDF documents in a text field and, until now, the full-text catalog has searched that field. However, these are HUGE files, and the data redundancy is what we want to get rid of. We need to keep the PDFs so that our customers can download them, so the only option for eliminating the redundancy is to get rid of the text. I thought I had found a good solution by indexing the PDFs and changing our search to use that index. My issue now is that I cannot retrieve the text from the PDF for simple display on the webpage.

    As for the web code: we only allow website code to execute stored procedures; in other words, it does not have direct access to the DB. So I need a way, using T-SQL, to pull the text out of the PDF and return it to the site via procedure calls.

    I was just hoping somebody had done something like that.

    Thanks anyway.
