How to read data in a pdf file in SSIS
Posted Wednesday, August 8, 2012 10:50 AM
SSC-Enthusiastic

The PDFs contain both text and images. We need to capture the text from them. It can be copied, but as you know, doing that by hand for 7,000 PDFs would take a lot of time.
Post #1342027
Posted Wednesday, August 8, 2012 11:20 AM


SSCertifiable

sql server developer (8/8/2012)
The PDFs contain both text and images. We need to capture the text from them. It can be copied, but as you know, doing that by hand for 7,000 PDFs would take a lot of time.


You have not answered my question. I asked you whether the text could be selected using copy/paste. If the answer is no, OCR is your only option.



Help us to help you. For better, quicker and more-focused answers to your questions, consider following the advice in this link.

When you ask a question (and please do ask a question: "My T-SQL does not work" just doesn't cut it), please provide enough information for us to understand its context.

It is better to keep your mouth shut and appear stupid than to open it and remove all doubt. (Mark Twain)
Post #1342056
Posted Wednesday, August 8, 2012 12:56 PM
SSC-Enthusiastic

I'm sorry, Phil. The answer to your question is yes: the text can be copied.
Post #1342142
Posted Wednesday, August 8, 2012 12:57 PM
SSC-Enthusiastic

Also, if you don't mind, can you help me understand what OCR is?
Post #1342144
Posted Wednesday, August 8, 2012 1:11 PM


SSCertifiable

sql server developer (8/8/2012)
Also, if you don't mind, can you help me understand what OCR is?


Optical Character Recognition. Basically, OCR software takes the blobs in a scanned image of the page and attempts to make characters and words out of them. It is particularly useful for reading the names off hand-printed checks, turning nearly illegible faxes into computer documents, etc.

Google "OCR" and read the Wikipedia page that comes up as the first result.



- Craig Farrell

Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.


Twitter: @AnyWayDBA
Post #1342156
Posted Tuesday, May 7, 2013 10:29 PM
Forum Newbie

OCR means optical character recognition: the mechanical or electronic conversion of scanned images of handwritten, typewritten, or printed text into machine-encoded text. Some PDFs are scans, so OCR would be required for those. The PDF format is well documented, but PDFs can have multiple columns, so the extraction of pdf text needs to use a mature and structure pdf reading app.
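
Because some of the PDFs may be scans with no text layer, one way to decide which files need OCR is to check how much text an ordinary extraction returned. Below is a minimal Python sketch, assuming the text of each PDF has already been extracted into a string by some tool. The function names and the 20-character threshold are hypothetical choices for illustration, not something from this thread.

```python
from typing import Dict, List, Tuple


def needs_ocr(extracted_text: str, min_chars: int = 20) -> bool:
    """Return True when the text layer looks empty, suggesting a scanned PDF."""
    # Count only non-whitespace characters: a scan run through a text
    # extractor often yields nothing but blank lines or form feeds.
    visible = sum(1 for ch in extracted_text if not ch.isspace())
    return visible < min_chars


def triage(texts: Dict[str, str], min_chars: int = 20) -> Tuple[List[str], List[str]]:
    """Split file names into (has_text, ocr_candidates), each sorted by name."""
    has_text: List[str] = []
    ocr_candidates: List[str] = []
    for name in sorted(texts):
        if needs_ocr(texts[name], min_chars):
            ocr_candidates.append(name)
        else:
            has_text.append(name)
    return has_text, ocr_candidates
```

This is only a heuristic: a PDF that mixes a short text layer with scanned pages could land in either group, so the threshold would need tuning against a sample of the real files.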
Post #1450398
Posted Wednesday, May 8, 2013 1:11 AM


SSCertifiable

dawnbrown243 (5/7/2013)
OCR means optical character recognition: the mechanical or electronic conversion of scanned images of handwritten, typewritten, or printed text into machine-encoded text. Some PDFs are scans, so OCR would be required for those. The PDF format is well documented, but PDFs can have multiple columns, so the extraction of pdf text needs to use a mature and structure pdf reading app.


Even though this thread is getting old, it was never fully resolved and remains interesting.

Are you able to suggest how to "use a mature and structure (sic.) pdf reading app" in SSIS to solve this problem?



Post #1450428
Posted Thursday, May 9, 2013 9:51 AM


Ten Centuries

Phil Parkin (5/8/2013)
dawnbrown243 (5/7/2013)
OCR means optical character recognition: the mechanical or electronic conversion of scanned images of handwritten, typewritten, or printed text into machine-encoded text. Some PDFs are scans, so OCR would be required for those. The PDF format is well documented, but PDFs can have multiple columns, so the extraction of pdf text needs to use a mature and structure pdf reading app.


Even though this thread is getting old, it was never fully resolved and remains interesting.

Are you able to suggest how to "use a mature and structure (sic.) pdf reading app" in SSIS to solve this problem?


The first thing I would try is invoking the application that extracts the text from the PDF file from a command line (I would expect a "mature" PDF processing application to have a command line interface), probably with an Execute Process task. It would likely be easiest to have the app write the text to a flat file, then use the appropriate connection managers, etc. to ETL those files.
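
As an illustration of the command-line approach described above, here is a minimal Python sketch that converts every PDF in a folder to a flat text file. It assumes the open-source `pdftotext` utility (part of Poppler/Xpdf) is installed and on the PATH; the thread does not name a specific tool, so that choice is an assumption, and the function names are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import Callable, List, Sequence


def build_command(pdf_path: Path, txt_path: Path) -> List[str]:
    """Build the pdftotext call; -layout keeps the physical page layout."""
    return ["pdftotext", "-layout", str(pdf_path), str(txt_path)]


def run_tool(command: Sequence[str]) -> int:
    """Run the external tool and return its exit code (0 means success)."""
    return subprocess.run(list(command), check=False).returncode


def extract_folder(
    pdf_dir: Path,
    out_dir: Path,
    runner: Callable[[Sequence[str]], int] = run_tool,
) -> List[Path]:
    """Convert every *.pdf in pdf_dir to a .txt file in out_dir.

    Returns the PDFs for which the tool reported a non-zero exit code,
    so they can be retried or routed to OCR.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    failed: List[Path] = []
    for pdf_path in sorted(pdf_dir.glob("*.pdf")):
        txt_path = out_dir / (pdf_path.stem + ".txt")
        if runner(build_command(pdf_path, txt_path)) != 0:
            failed.append(pdf_path)
    return failed
```

Calling `extract_folder(Path("pdfs"), Path("text_out"))` with hypothetical folder names would produce one text file per PDF. In SSIS, the same command could instead be issued per file by an Execute Process task inside a Foreach Loop container, with the resulting text files read through flat file connection managers. The `runner` parameter exists only so the loop can be exercised without the tool installed.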


Jason Wolfkill
Blog: SQLSouth
Twitter: @SQLSouth
Post #1451218