How to read data in a pdf file in SSIS

  • I was wondering if anybody has had a situation where data needed to be extracted from PDF files and exported to SQL Server.

    I'd really appreciate any help. Thanks in advance.

  • can you go back to the source and get the same data in another format?

    AFAIK, there's no way to import from a PDF; you might be able to grab the text programmatically, but it won't be in a format that you can use;

    you could try converting to Text, or to Word Online, but that's going to be just as flaky as far as formatting goes.

    i tried five different PDFs just now...some do not have the To Text option, even though they're filled with copy/pasteable text, and the ones that did have the option were mostly unreadable as far as text anyway...

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • Thanks for your reply, Lowell. I have Adobe Acrobat X V 10.1.0 and don't see the TEXT option when I try to save the file. Also, our PDF file is laid out so that it has a location block in the upper right-hand corner, from which I need to get the address, city, state, country, etc., and some other details in the lower right-hand corner, which I need to separate out while loading them into SQL Server. I have 7000 files of this kind. Is there a way to do this without having to use C# programming?

    Thanks in advance!

  • without a programming language? i more than just doubt it; i think that will be the only way to do it.

    check out this thread over on stackoverflow:

    read pdf files programmatically

    lots of libraries mentioned, i haven't tried any yet.

    Lowell

  • A question. Are you trying to store the file, or the information IN the file for full-text indexing?

    If you're just trying to store the file, you want VARBINARY(MAX). Under 99% of circumstances you'd be better off storing the file in a filestore somewhere and databasing the link to it.

    If you're trying to extract the information as text directly (without images, by the way, those can't come out to play in VARCHAR(MAX)) then what Lowell offered is basically it. You might contact Adobe and see if they've got an automator for the process, but there's not really a lot of things you'll do with Reader from T-SQL.

    EDIT: Eeps, helps if I fully read your second response. Yeah, you're screwed, you'll have to go through a programming language and parse the files directly, there's nothing really in T-SQL or SSIS that's going to help you here.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions | Forum Netiquette
    For index/tuning help, follow these directions. | Tally Tables

    Twitter: @AnyWayDBA
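
For anyone taking the "parse the files directly" route, here is a minimal sketch of what that could look like. It is not from the thread and makes several assumptions: it uses Python rather than C#, it relies on the third-party pypdf package for text extraction, and it pretends the location block comes out as one "Label: value" pair per line (the labels and the function names are hypothetical, so the pattern will need adjusting to the real files). The output is a CSV file that an SSIS Flat File Source could then load into SQL Server.

```python
import csv
import re
from pathlib import Path

# Hypothetical layout: one "Label: value" pair per line of extracted text.
FIELD_PATTERN = re.compile(
    r"^[ \t]*(Address|City|State|Country)[ \t]*:[ \t]*(.+?)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)
COLUMNS = ["file", "address", "city", "state", "country"]


def extract_pdf_text(pdf_path):
    """Return the text of all pages as one string.

    Assumes pypdf is installed (pip install pypdf). It is imported here
    so that the parsing code can be used and tested without it.
    """
    from pypdf import PdfReader

    reader = PdfReader(str(pdf_path))
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def parse_location(text):
    """Pick the location fields out of extracted text (first match wins)."""
    fields = {"address": None, "city": None, "state": None, "country": None}
    for label, value in FIELD_PATTERN.findall(text):
        key = label.lower()
        if fields[key] is None:
            fields[key] = value
    return fields


def write_csv(pdf_dir, csv_path):
    """Write one CSV row per PDF found in pdf_dir."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS)
        writer.writeheader()
        for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
            row = parse_location(extract_pdf_text(pdf_path))
            row["file"] = pdf_path.name
            writer.writerow(row)
```

Text extracted from a real PDF rarely arrives in tidy reading order, so it is worth printing the raw output of a few files before settling on a pattern.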

  • Another alternative, if you don't want to roll your own solution, is to look at the many document management systems out on the market. Most of them have a "scan and extract" feature that will OCR the document and extract the data elements that you're interested in. Some are pretty sophisticated and can interpret, with varying degrees of success, handwriting as well.

  • Thanks, everyone, for your time and suggestions. I guess there is no way except to use some programming to extract the data. I will have to look for third-party tools that can read and extract information from PDFs.

    If anyone knows of tools that have been used successfully for this, please let me know.

    Thanks again

  • Hi There

    PC based: have a look at ABBYY FineReader as an OCR reader. It will export to CSV but not much else.

    Server based: have a look at ABBYY's big brother, FlexiCapture. You can either tell it where to look on a PDF and it then OCRs the data, or you can tell it to OCR first, look for the word Address in the top right-hand corner, and use that as the anchor for OCR'ing the address.

    You can set up a hot folder where you just drop the PDFs; the server processes them in the background and drops the required field(s) straight into the database.

    If you want I can get you the detail of a VAR that we use (UK based) but there are bound to be others around.
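
The hot-folder idea itself is simple enough to sketch. The following Python is a generic illustration only, not how FlexiCapture works internally: the folder names, the `handler` callback and the 30-second polling interval are all assumptions. Each PDF found in the inbox is passed to the handler and then moved to a "done" folder so it is not processed twice.

```python
import shutil
import time
from pathlib import Path


def process_pending(inbox, done, handler):
    """Run handler on every PDF in inbox, then move the file to done.

    Returns the names of the files that were processed, in order.
    """
    inbox, done = Path(inbox), Path(done)
    done.mkdir(parents=True, exist_ok=True)
    processed = []
    for pdf_path in sorted(inbox.glob("*.pdf")):
        handler(pdf_path)  # e.g. extract fields and insert them into a table
        shutil.move(str(pdf_path), str(done / pdf_path.name))
        processed.append(pdf_path.name)
    return processed


def watch(inbox, done, handler, interval_seconds=30):
    """Poll the inbox forever; stop with Ctrl+C."""
    while True:
        process_pending(inbox, done, handler)
        time.sleep(interval_seconds)
```

If the handler raises an exception, the file stays in the inbox and is retried on the next pass, which may or may not be what you want for files that can never be parsed.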

  • sql server developer (8/6/2012)


    Thanks for your reply, Lowell. I have Adobe Acrobat X V 10.1.0 and don't see the TEXT option when I try to save the file. Also, our PDF file is laid out so that it has a location block in the upper right-hand corner, from which I need to get the address, city, state, country, etc., and some other details in the lower right-hand corner, which I need to separate out while loading them into SQL Server. I have 7000 files of this kind. Is there a way to do this without having to use C# programming?

    Thanks in advance!

    Is the data stored in the PDF as text? Can it be selected and copied?

    Or is it there as an image?

    If you haven't even tried to resolve your issue, please don't expect the hard-working volunteers here to waste their time providing links to answers which you could easily have found yourself.

  • Ephesoft is one that I'm aware of. A team here is working on implementing it to capture information from incoming invoices. There are plenty of others out there at all price points.

  • The PDFs have both text and images. We need to capture the text from them. It can be copied, but as you know, doing that for 7000 PDFs would take a lot of time.

  • sql server developer (8/8/2012)


    The PDFs have both text and images. We need to capture the text from them. It can be copied, but as you know, doing that for 7000 PDFs would take a lot of time.

    You have not answered my question. I asked you whether the text could be selected using copy/paste. If the answer is no, OCR is your only option.
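
With 7000 files, the copy/paste test can be automated rather than done by hand. The sketch below is a rough heuristic, not an established method: it assumes some PDF library has already returned one string per page (an empty string or None when a page has no text layer), and the 20-character threshold and the function names are arbitrary choices for illustration.

```python
def has_text_layer(page_texts, min_chars=20):
    """Return True if the extracted pages contain enough real characters.

    page_texts is a list with one entry per page, as produced by a PDF
    text extractor. Whitespace is ignored when counting.
    """
    total = sum(len("".join(text.split())) for text in page_texts if text)
    return total >= min_chars


def choose_route(page_texts):
    """Decide whether a file can be parsed as text or has to go to OCR."""
    return "text-extraction" if has_text_layer(page_texts) else "ocr"
```

A file that is a scanned image wrapped in a PDF typically yields little or no text, so it would be routed to OCR, while a file whose text can be selected and copied would go straight to text extraction.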


  • I'm sorry, Phil. The answer to your question is yes: the text can be copied.

  • Also, if you don't mind can you help me understand what OCR is?

  • sql server developer (8/8/2012)


    Also, if you don't mind can you help me understand what OCR is?

    Optical Character Recognition. Basically: Your scanner takes the blobs on the page and attempts to make characters/words out of them. Particularly useful when trying to get hand printed checks to have names, nearly illegible faxes to become computer documents, etc.

    google: OCR and hit the wiki page that comes up as the first choice there.


    - Craig Farrell

