How to read data in a pdf file in SSIS


sql server developer
I was wondering if anybody has had a situation where data needs to be extracted from PDF files and loaded into SQL Server.

I'd really appreciate any help. Thanks in advance.
Lowell
Can you go back to the source and get the same data in another format?
AFAIK, there's no way to import from a PDF; you might be able to grab the text programmatically, but it won't be in a format that you can use.
You could try converting to text, or to Word online, but that's going to be just as flaky as far as formatting goes.

I tried five different PDFs just now... some do not have the To Text option, even though they're filled with copy/pasteable text, and the ones that did have the option produced mostly unreadable text anyway...


Lowell

--
help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

sql server developer
Thanks for your reply, Lowell. I have Adobe Acrobat X v10.1.0 and don't see the TEXT option when I try to save the file. Also, our PDF file is laid out so that it has a location block in the upper right-hand corner, from which I need to get address, city, state, country, etc., and some other details in the lower right-hand corner, which I need to separate out while loading them into SQL Server. I have 7000 files of this kind. Is there a way to do this without having to use C# programming?

Thanks in advance!
Lowell
Without a programming language? I more than just doubt it; I think that will be the only way to do it.
Check out this thread over on Stack Overflow:
read pdf files programmatically

Lots of libraries are mentioned; I haven't tried any yet.


Evil Kraig F
A question. Are you trying to store the file, or the information IN the file for full-text indexing?

If you're just trying to store the file, you want VARBINARY(MAX). Under 99% of circumstances you'd be better off storing the file in a filestore somewhere and databasing the link to it.

If you're trying to extract the information as text directly (without images, by the way, those can't come out to play in VARCHAR(MAX)) then what Lowell offered is basically it. You might contact Adobe and see if they've got an automator for the process, but there's not really a lot of things you'll do with Reader from T-SQL.

EDIT: Eeps, helps if I fully read your second response. Yeah, you're screwed, you'll have to go through a programming language and parse the files directly, there's nothing really in T-SQL or SSIS that's going to help you here.
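As a rough illustration of what "parse the files directly" could involve once a PDF library has already pulled the text out, here is a minimal Python sketch that splits a location block into fields. The block layout (a street line, then a "City, ST postal" line, then a country line), the function name `parse_location_block`, and the output field names are all assumptions made for the example; the real files in this thread may be laid out quite differently, and the text extraction step itself is not shown.

```python
import re
from typing import Optional

# Assumed layout (hypothetical): street line, then "City, ST postal", then country.
CITY_LINE = re.compile(
    r"^(?P<city>[^,]+),\s*(?P<state>[A-Za-z]{2})\s+(?P<postal>\S+)$"
)


def parse_location_block(block: str) -> Optional[dict]:
    """Split an already-extracted location block into address fields."""
    # Drop blank lines and surrounding whitespace left over from extraction.
    lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
    for i, line in enumerate(lines):
        match = CITY_LINE.match(line)
        if match and i >= 1:
            return {
                "address": lines[i - 1],  # line just above the city line
                "city": match.group("city").strip(),
                "state": match.group("state").upper(),
                "postal_code": match.group("postal"),
                "country": lines[i + 1] if i + 1 < len(lines) else None,
            }
    return None  # layout did not match; flag the file for manual review
```

With 7000 files, returning `None` for anything that does not match the expected layout makes it easier to set the odd ones aside rather than load bad rows.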


- Craig Farrell

Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

For better assistance in answering your questions | Forum Netiquette
For index/tuning help, follow these directions. |Tally Tables

Twitter: @AnyWayDBA
ACinAZ
Another alternative, if you don't want to roll your own solution, is to look at the many document management systems out on the market. Most of them have a "scan and extract" feature that will OCR the document and extract the data elements that you're interested in. Some are pretty sophisticated and can interpret, with varying degrees of success, handwriting as well.
sql server developer
Thanks, everyone, for your time and suggestions. I guess there is no way except to use some programming to extract the data. I will have to look for third-party tools that can read and extract information from PDFs.

If anyone knows of tools that have been successfully used for this, please let me know.

Thanks again
Paul Smith-221741
Hi there,

PC based - have a look at ABBYY FineReader as an OCR reader. It will export to CSV but not much else.

Server based - have a look at ABBYY's big brother, FlexiCapture. You can either tell it where to look on a PDF, and it then OCRs the data there, or you can tell it to OCR first, look for the word "Address" in the top right-hand corner, and use that as the anchor for OCR'ing the address.

You can set up a hot folder where you just drop the PDFs; the server processes them in the background and drops the required field(s) straight into the database.

If you want, I can get you the details of a VAR that we use (UK based), but there are bound to be others around.
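For anyone rolling their own instead, the hot-folder idea can be sketched in a few lines of Python. This is only a generic illustration of the pattern, not how any commercial product works internally: the function name `process_hot_folder`, the `inbox`/`done` folder arguments, and the `handler` callback (which would do the actual extraction and database insert) are all hypothetical.

```python
import shutil
from pathlib import Path
from typing import Callable, List


def process_hot_folder(
    inbox: Path, done: Path, handler: Callable[[Path], None]
) -> List[str]:
    """Run handler on each PDF found in inbox, then move the file to done."""
    done.mkdir(parents=True, exist_ok=True)
    processed = []
    # Note: "*.pdf" may miss ".PDF" on case-sensitive file systems.
    for pdf in sorted(inbox.glob("*.pdf")):
        handler(pdf)  # hypothetical: extract fields and insert a row
        # Move only after the handler succeeds, so failures stay in the inbox.
        shutil.move(str(pdf), str(done / pdf.name))
        processed.append(pdf.name)
    return processed
```

A scheduled job could call this every few minutes. Because a file is moved only after the handler returns without raising, anything that fails stays in the inbox for another attempt.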
Phil Parkin
sql server developer (8/6/2012)
Thanks for your reply, Lowell. I have Adobe Acrobat X v10.1.0 and don't see the TEXT option when I try to save the file. Also, our PDF file is laid out so that it has a location block in the upper right-hand corner, from which I need to get address, city, state, country, etc., and some other details in the lower right-hand corner, which I need to separate out while loading them into SQL Server. I have 7000 files of this kind. Is there a way to do this without having to use C# programming?

Thanks in advance!


Is the data stored in the PDF as text? Can it be selected and copied?

Or is it there as an image?


Help us to help you. For better, quicker and more-focused answers to your questions, consider following the advice in this link.

If the answer to your question can be found with a brief Google search, please perform the search yourself, rather than expecting one of the SSC members to do it for you.

Please surround any code or links you post with the appropriate IFCode formatting tags. It helps readability a lot.
ACinAZ
Ephesoft is one that I'm aware of. A team here is working on implementing it to capture information from incoming invoices. There are plenty of others out there at all price points.