• Phil Parkin (5/8/2013)


    dawnbrown243 (5/7/2013)


    OCR means optical character recognition: the mechanical or electronic conversion of scanned images of handwritten, typewritten or printed text into machine-encoded text. Some PDFs are scans, so OCR would be required. The PDF format is well documented, but PDFs can have multiple columns, so the extraction of PDF text needs to use a mature and structure pdf reading app.

    Even though this thread is getting old, it was never fully resolved and remains interesting.

    Are you able to suggest how to "use a mature and structure (sic.) pdf reading app" in SSIS to solve this problem?

    The first thing I would try is invoking the text-extraction application from a command line (I would expect a "mature" PDF processing application to have a command-line interface), probably with an Execute Process task. It would likely be easiest to have the app write the text to a flat file, then use the appropriate connection managers, etc. to ETL those files.
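
As a rough sketch of the command-line step, here it is in Python rather than SSIS, just to show the shape of the call. It assumes the extractor is the `pdftotext` utility that ships with Poppler/Xpdf; nobody in this thread has named a specific tool, so treat that choice, the helper names `build_command` and `extract_text`, and the paths as placeholders.

```python
import subprocess
from pathlib import Path


def build_command(pdf_path, out_dir, exe="pdftotext"):
    """Return (argument list, output text path) for one PDF.

    `exe` is an assumption: any extractor with a CLI of the form
    `tool [options] input.pdf output.txt` could be swapped in.
    """
    pdf = Path(pdf_path)
    txt = Path(out_dir) / (pdf.stem + ".txt")
    # -layout asks pdftotext to keep the original physical layout of the page.
    return [exe, "-layout", str(pdf), str(txt)], txt


def extract_text(pdf_path, out_dir, exe="pdftotext"):
    """Run the extractor and return the path of the flat file it wrote."""
    cmd, txt = build_command(pdf_path, out_dir, exe)
    # check=True raises CalledProcessError on a non-zero exit code, so a
    # failed extraction stops the run instead of leaving a missing file.
    subprocess.run(cmd, check=True)
    return txt
```

In an SSIS package, the same executable and argument string would go into the Execute Process task's Executable and Arguments properties, most likely inside a Foreach Loop over the PDF files, with a Flat File connection manager pointed at the resulting .txt files.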

    Jason Wolfkill