How to read data in a PDF file in SSIS
Posted Wednesday, August 8, 2012 10:50 AM
SSC-Enthusiastic

The PDFs have both text and images. We need to capture the text from them. It can be copied, but as you know, doing that for 7000 PDFs would take a lot of time.
Post #1342027
Posted Wednesday, August 8, 2012 11:20 AM


SSCertifiable

sql server developer (8/8/2012)
The PDFs have both text and images. We need to capture the text from them. It can be copied, but as you know, doing that for 7000 PDFs would take a lot of time.


You have not answered my question. I asked you whether the text could be selected using copy/paste. If the answer is no, OCR is your only option.



Help us to help you. For better, quicker and more-focused answers to your questions, consider following the advice in this link.

When you ask a question (and please do ask a question: "My T-SQL does not work" just doesn't cut it), please provide enough information for us to understand its context.
Post #1342056
Posted Wednesday, August 8, 2012 12:56 PM
SSC-Enthusiastic

I'm sorry, Phil. The answer to your question is yes: the text can be copied.
Post #1342142
Posted Wednesday, August 8, 2012 12:57 PM
SSC-Enthusiastic

Also, if you don't mind, can you help me understand what OCR is?
Post #1342144
Posted Wednesday, August 8, 2012 1:11 PM


SSCertifiable

sql server developer (8/8/2012)
Also, if you don't mind, can you help me understand what OCR is?


Optical Character Recognition. Basically, OCR software takes the blobs on a scanned page and attempts to make characters and words out of them. It is particularly useful for reading names off hand-printed checks, turning nearly illegible faxes into computer documents, etc.

Google "OCR" and read the Wikipedia page that comes up as the first result.



- Craig Farrell

Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.


Twitter: @AnyWayDBA
Post #1342156
Posted Tuesday, May 7, 2013 10:29 PM
Forum Newbie

OCR means optical character recognition: the mechanical or electronic conversion of scanned images of handwritten, typewritten or printed text into machine-encoded text. Some PDFs are scans, so OCR would be required for those. The PDF format is well documented, but PDFs can have multiple columns, and the extraction of pdf text needs to use a mature and structure pdf reading app.
Post #1450398
Posted Wednesday, May 8, 2013 1:11 AM


SSCertifiable

dawnbrown243 (5/7/2013)
OCR means optical character recognition: the mechanical or electronic conversion of scanned images of handwritten, typewritten or printed text into machine-encoded text. Some PDFs are scans, so OCR would be required for those. The PDF format is well documented, but PDFs can have multiple columns, and the extraction of pdf text needs to use a mature and structure pdf reading app.


Even though this thread is getting old, it was never fully resolved and remains interesting.

Are you able to suggest how to "use a mature and structure (sic.) pdf reading app" in SSIS to solve this problem?



Post #1450428
Posted Thursday, May 9, 2013 9:51 AM


Ten Centuries

Phil Parkin (5/8/2013)
dawnbrown243 (5/7/2013)
OCR means optical character recognition: the mechanical or electronic conversion of scanned images of handwritten, typewritten or printed text into machine-encoded text. Some PDFs are scans, so OCR would be required for those. The PDF format is well documented, but PDFs can have multiple columns, and the extraction of pdf text needs to use a mature and structure pdf reading app.


Even though this thread is getting old, it was never fully resolved and remains interesting.

Are you able to suggest how to "use a mature and structure (sic.) pdf reading app" in SSIS to solve this problem?


The first thing I would try is invoking the application that extracts the text from the PDF file from a command line (I would expect a "mature" PDF processing application to have a command line interface), probably with an Execute Process task. It would likely be easiest to have the app write the text to a flat file, then use the appropriate connection managers, etc. to ETL those files.
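
As a rough illustration of that approach (not something tested against the original poster's files), the batch step could be sketched as a small Python script that calls a command-line extractor once per PDF and writes one text file per document. The choice of Poppler's pdftotext as the extractor is an assumption, as are the helper names build_command and extract_all; any tool with a comparable command line would slot in the same way. It only helps for PDFs whose text can be selected; scanned pages would still need OCR.

```python
import subprocess
from pathlib import Path
from typing import List


def build_command(pdf_path: Path, txt_path: Path) -> List[str]:
    """Return the argument list for one extraction run."""
    # Assumption: Poppler's pdftotext is installed and on the PATH.
    # -layout asks it to keep the physical layout, which can help with columns.
    return ["pdftotext", "-layout", str(pdf_path), str(txt_path)]


def extract_all(pdf_dir: Path, out_dir: Path, run=subprocess.run) -> List[Path]:
    """Convert every *.pdf in pdf_dir to a .txt file in out_dir.

    Returns the text files that were written successfully; failures are
    reported and skipped so one bad PDF does not stop the whole batch.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf_path in sorted(pdf_dir.glob("*.pdf")):
        txt_path = out_dir / (pdf_path.stem + ".txt")
        result = run(build_command(pdf_path, txt_path),
                     capture_output=True, text=True)
        if result.returncode == 0:
            written.append(txt_path)
        else:
            print(f"FAILED {pdf_path.name}: {result.stderr.strip()}")
    return written
```

Calling extract_all(Path("pdfs"), Path("text_out")) (both folder names are placeholders) would leave one flat file per PDF in the output folder. In a package, the script itself could be launched from an Execute Process task, or the same pdftotext command line could be issued directly from an Execute Process task inside a Foreach Loop container over the PDF folder, with the resulting text files loaded afterwards.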


Jason Wolfkill
Blog: SQLSouth
Twitter: @SQLSouth
Post #1451218