Click here to monitor SSC
SQLServerCentral is supported by Redgate
 
Log in  ::  Register  ::  Not logged in
 
 
 


How to read data in a pdf file in SSIS


How to read data in a pdf file in SSIS

Author
Message
sql server developer
sql server developer
SSC-Enthusiastic
SSC-Enthusiastic (162 reputation)SSC-Enthusiastic (162 reputation)SSC-Enthusiastic (162 reputation)SSC-Enthusiastic (162 reputation)SSC-Enthusiastic (162 reputation)SSC-Enthusiastic (162 reputation)SSC-Enthusiastic (162 reputation)SSC-Enthusiastic (162 reputation)

Group: General Forum Members
Points: 162 Visits: 472
Pdf's have both text and images. We need to capture text out of it. It can be copied but as you know doing it for 7000 pdf's would take a lot of time..
Phil Parkin
Phil Parkin
SSCrazy Eights
SSCrazy Eights (8.4K reputation)SSCrazy Eights (8.4K reputation)SSCrazy Eights (8.4K reputation)SSCrazy Eights (8.4K reputation)SSCrazy Eights (8.4K reputation)SSCrazy Eights (8.4K reputation)SSCrazy Eights (8.4K reputation)SSCrazy Eights (8.4K reputation)

Group: General Forum Members
Points: 8407 Visits: 19510
sql server developer (8/8/2012)
Pdf's have both text and images. We need to capture text out of it. It can be copied but as you know doing it for 7000 pdf's would take a lot of time..


You have not answered my question. I asked you whether the text could be selected using copy/paste. If the answer is no, OCR is your only option.


Help us to help you. For better, quicker and more-focused answers to your questions, consider following the advice in this link.

If the answer to your question can be found with a brief Google search, please perform the search yourself, rather than expecting one of the SSC members to do it for you.

Please surround any code or links you post with the appropriate IFCode formatting tags. It helps readability a lot.
sql server developer
sql server developer
SSC-Enthusiastic
SSC-Enthusiastic (162 reputation)SSC-Enthusiastic (162 reputation)SSC-Enthusiastic (162 reputation)SSC-Enthusiastic (162 reputation)SSC-Enthusiastic (162 reputation)SSC-Enthusiastic (162 reputation)SSC-Enthusiastic (162 reputation)SSC-Enthusiastic (162 reputation)

Group: General Forum Members
Points: 162 Visits: 472
I'm sorry Phil, the answer for your questions is Yes. The text can be copied
sql server developer
sql server developer
SSC-Enthusiastic
SSC-Enthusiastic (162 reputation)SSC-Enthusiastic (162 reputation)SSC-Enthusiastic (162 reputation)SSC-Enthusiastic (162 reputation)SSC-Enthusiastic (162 reputation)SSC-Enthusiastic (162 reputation)SSC-Enthusiastic (162 reputation)SSC-Enthusiastic (162 reputation)

Group: General Forum Members
Points: 162 Visits: 472
Also, if you don't mind can you help me understand what OCR is?
Evil Kraig F
Evil Kraig F
SSCertifiable
SSCertifiable (5.7K reputation)SSCertifiable (5.7K reputation)SSCertifiable (5.7K reputation)SSCertifiable (5.7K reputation)SSCertifiable (5.7K reputation)SSCertifiable (5.7K reputation)SSCertifiable (5.7K reputation)SSCertifiable (5.7K reputation)

Group: General Forum Members
Points: 5715 Visits: 7660
sql server developer (8/8/2012)
Also, if you don't mind can you help me understand what OCR is?


Optical Character Recognition. Basically: Your scanner takes the blobs on the page and attempts to make characters/words out of them. Particularly useful when trying to get hand printed checks to have names, nearly illegible faxes to become computer documents, etc.

google: OCR and hit the wiki page that comes up as the first choice there.


- Craig Farrell

Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

For better assistance in answering your questions | Forum Netiquette
For index/tuning help, follow these directions. |Tally Tables

Twitter: @AnyWayDBA
dawnbrown243
dawnbrown243
Forum Newbie
Forum Newbie (2 reputation)Forum Newbie (2 reputation)Forum Newbie (2 reputation)Forum Newbie (2 reputation)Forum Newbie (2 reputation)Forum Newbie (2 reputation)Forum Newbie (2 reputation)Forum Newbie (2 reputation)

Group: General Forum Members
Points: 2 Visits: 1
OCR means Optical character recognition, it is the mechanical or electronic conversion of scanned images of handwritten, typewritten or printed text into machine-encoded text. Some PDFs are scans, so OCR recongnition would be required, PDF format is well-documented, PDF have multiple columns and the extraction of pdf text needs to use a mature and structure pdf reading app.
Phil Parkin
Phil Parkin
SSCrazy Eights
SSCrazy Eights (8.4K reputation)SSCrazy Eights (8.4K reputation)SSCrazy Eights (8.4K reputation)SSCrazy Eights (8.4K reputation)SSCrazy Eights (8.4K reputation)SSCrazy Eights (8.4K reputation)SSCrazy Eights (8.4K reputation)SSCrazy Eights (8.4K reputation)

Group: General Forum Members
Points: 8407 Visits: 19510
dawnbrown243 (5/7/2013)
OCR means Optical character recognition, it is the mechanical or electronic conversion of scanned images of handwritten, typewritten or printed text into machine-encoded text. Some PDFs are scans, so OCR recongnition would be required, PDF format is well-documented, PDF have multiple columns and the extraction of pdf text needs to use a mature and structure pdf reading app.


Even though this thread is getting old, it was never fully resolved and remains interesting.

Are you able to suggest how to "use a mature and structure (sic.) pdf reading app" in SSIS to solve this problem?


Help us to help you. For better, quicker and more-focused answers to your questions, consider following the advice in this link.

If the answer to your question can be found with a brief Google search, please perform the search yourself, rather than expecting one of the SSC members to do it for you.

Please surround any code or links you post with the appropriate IFCode formatting tags. It helps readability a lot.
wolfkillj
wolfkillj
Ten Centuries
Ten Centuries (1.1K reputation)Ten Centuries (1.1K reputation)Ten Centuries (1.1K reputation)Ten Centuries (1.1K reputation)Ten Centuries (1.1K reputation)Ten Centuries (1.1K reputation)Ten Centuries (1.1K reputation)Ten Centuries (1.1K reputation)

Group: General Forum Members
Points: 1108 Visits: 2582
Phil Parkin (5/8/2013)
dawnbrown243 (5/7/2013)
OCR means Optical character recognition, it is the mechanical or electronic conversion of scanned images of handwritten, typewritten or printed text into machine-encoded text. Some PDFs are scans, so OCR recongnition would be required, PDF format is well-documented, PDF have multiple columns and the extraction of pdf text needs to use a mature and structure pdf reading app.


Even though this thread is getting old, it was never fully resolved and remains interesting.

Are you able to suggest how to "use a mature and structure (sic.) pdf reading app" in SSIS to solve this problem?


The first thing I would try is invoking the application that extracts the text from the PDF file from a command line (I would expect a "mature" PDF processing application to have a command line interface), probably with an Execute Process task. It would likely be easiest to have the app write the text to a flat file, then use the appropriate connection managers, etc. to ETL those files.

Jason Wolfkill
Blog: SQLSouth
Twitter: @SQLSouth
Go


Permissions

You can't post new topics.
You can't post topic replies.
You can't post new polls.
You can't post replies to polls.
You can't edit your own topics.
You can't delete your own topics.
You can't edit other topics.
You can't delete other topics.
You can't edit your own posts.
You can't edit other posts.
You can't delete your own posts.
You can't delete other posts.
You can't post events.
You can't edit your own events.
You can't edit other events.
You can't delete your own events.
You can't delete other events.
You can't send private messages.
You can't send emails.
You can read topics.
You can't vote in polls.
You can't upload attachments.
You can download attachments.
You can't post HTML code.
You can't edit HTML code.
You can't post IFCode.
You can't post JavaScript.
You can post emoticons.
You can't post or upload images.

Select a forum

































































































































































SQLServerCentral


Search