How to read data in a pdf file in SSIS
Posted Thursday, August 2, 2012 12:48 PM
I was wondering if anybody has had a situation where data needed to be extracted from PDF files and loaded into SQL Server.

I'd really appreciate any help. Thanks in advance.
Post #1339455
Posted Thursday, August 2, 2012 1:59 PM


can you go back to the source and get the same data in another format?
AFAIK, there's no way to import from a PDF; you might be able to grab the text programmatically, but it won't be in a format that you can use.
you could try converting to Text, or to Word Online, but that's going to be just as flaky as far as formatting goes.

i tried five different PDFs just now...some do not have the To Text option, even though they're filled with copy/pasteable text, and the ones that did have the option were mostly unreadable as far as text anyway...


Lowell

--There is no spoon, and there's no default ORDER BY in sql server either.
Actually, Common Sense is so rare, it should be considered a Superpower. --my son
Post #1339504
Posted Monday, August 6, 2012 2:37 PM
Thanks for your reply, Lowell. I have Adobe Acrobat X v10.1.0 and don't see the TEXT option when I try to save. Also, our PDF files are laid out with a location block in the upper right-hand corner, from which I need to get the address, city, state, country, etc., and some other details in the lower right-hand corner, which I need to separate out while loading them into SQL Server. I have 7,000 files of this kind. Is there a way to do this without having to use C# programming?

Thanks in advance!
Post #1340893
Posted Monday, August 6, 2012 2:55 PM


without a programming language? i more than just doubt it; i think that will be the only way to do it.
check out this thread over on stackoverflow:
read pdf files programmatically

lots of libraries mentioned, i haven't tried any yet.


Lowell

Post #1340904
Posted Monday, August 6, 2012 3:00 PM


A question. Are you trying to store the file, or the information IN the file for full-text indexing?

If you're just trying to store the file, you want VARBINARY(MAX). Under 99% of circumstances you'd be better off storing the file in a filestore somewhere and databasing the link to it.
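As a rough illustration of the "store the file somewhere and database the link" option, here is a minimal Python sketch. It uses the standard-library `sqlite3` module purely as a stand-in for SQL Server, and the table name `pdf_documents`, its columns, and the function `register_documents` are all made up for the example.

```python
import sqlite3
from pathlib import Path


def register_documents(conn: sqlite3.Connection, folder: Path) -> int:
    """Record the name and full path of every PDF in `folder`.

    Returns how many PDFs were found. The files themselves stay on disk;
    only the link to each one goes into the (hypothetical) table.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pdf_documents ("
        " id INTEGER PRIMARY KEY,"
        " file_name TEXT NOT NULL,"
        " file_path TEXT NOT NULL UNIQUE)"
    )
    rows = [(p.name, str(p.resolve())) for p in sorted(folder.glob("*.pdf"))]
    # INSERT OR IGNORE keeps a rerun from duplicating paths already recorded.
    conn.executemany(
        "INSERT OR IGNORE INTO pdf_documents (file_name, file_path) VALUES (?, ?)",
        rows,
    )
    conn.commit()
    return len(rows)
```

Against a real SQL Server you would swap the connection and the upsert syntax, but the shape of the table (a key, a name, a path) would likely stay much the same.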

If you're trying to extract the information as text directly (without images, by the way, those can't come out to play in VARCHAR(MAX)) then what Lowell offered is basically it. You might contact Adobe and see if they've got an automator for the process, but there's not really a lot of things you'll do with Reader from T-SQL.

EDIT: Eeps, helps if I fully read your second response. Yeah, you're screwed, you'll have to go through a programming language and parse the files directly, there's nothing really in T-SQL or SSIS that's going to help you here.
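To give a feel for what "parse the files directly" might involve for a fixed layout like the one described (a block in the upper right, another in the lower right), here is a minimal Python sketch. It assumes some PDF library has already given you each word with its position on the page; the `Word` class, the top-left coordinate origin, and the 612 x 792 point page size are assumptions for the example, not something from any particular library.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float  # left edge, in points from the left of the page (assumed)
    y: float  # top edge, in points from the top of the page (assumed)


def text_in_region(words, left, top, right, bottom, line_tolerance=2.0):
    """Return the text of the words whose corner falls inside the box.

    Words are grouped into lines by similar y, then ordered left to right.
    """
    inside = [w for w in words if left <= w.x < right and top <= w.y < bottom]
    inside.sort(key=lambda w: w.y)
    lines = []
    for w in inside:
        # Same line if the y position is close to the line's first word.
        if lines and abs(w.y - lines[-1][0].y) <= line_tolerance:
            lines[-1].append(w)
        else:
            lines.append([w])
    return "\n".join(
        " ".join(w.text for w in sorted(line, key=lambda w: w.x))
        for line in lines
    )


# Made-up sample words on an assumed 612 x 792 point page.
sample = [
    Word("Invoice", 40, 50),
    Word("Main", 450, 62),
    Word("123", 420, 62),
    Word("St", 490, 62),
    Word("Springfield", 420, 76),
    Word("Total", 420, 700),
]
upper_right = text_in_region(sample, left=306, top=0, right=612, bottom=396)
print(upper_right)  # prints "123 Main St" then "Springfield" on the next line
```

Splitting that block into address, city, state and country would still depend on how the real files are laid out, so that part is left out here.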



- Craig Farrell

Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

For better assistance in answering your questions | Forum Netiquette
For index/tuning help, follow these directions. |Tally Tables

Twitter: @AnyWayDBA
Post #1340909
Posted Tuesday, August 7, 2012 8:34 AM
Another alternative, if you don't want to roll your own solution, is to look at the many document management systems out on the market. Most of them have a "scan and extract" feature that will OCR the document and extract the data elements that you're interested in. Some are pretty sophisticated and can interpret, with varying degrees of success, handwriting as well.
Post #1341312
Posted Tuesday, August 7, 2012 10:02 AM
Thanks, everyone, for your time and suggestions. I guess there is no way except to use some programming to extract the data. I will have to look for third-party tools that can read and extract information from PDFs.

If anyone knows of tools that have been successfully used for this, please let me know.

Thanks again
Post #1341381
Posted Tuesday, August 7, 2012 2:11 PM
Hi There

PC based: have a look at ABBYY FineReader as an OCR reader. It will export to CSV but not much else.

Server based: have a look at ABBYY's big brother, FlexiCapture. You can either tell it where to look on a PDF and it then OCRs the data, or you can tell it to OCR first, look for the word Address in the top right-hand corner and use that as the anchor for OCR'ing the address.

You can set up a hot folder where you just drop the PDFs; the server processes them in the background and drops the required field(s) straight into the database.

If you want, I can get you the details of a VAR that we use (UK based), but there are bound to be others around.
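For anyone curious what a hot folder amounts to, it can be sketched in a few lines of Python. This is not how FlexiCapture works internally, just the general pattern: `sqlite3` stands in for the real database, and the table `extracted_fields`, the function `process_hot_folder`, and the `extract` callback (which would wrap whatever OCR or PDF tool you end up using) are all hypothetical.

```python
import shutil
import sqlite3
from pathlib import Path


def process_hot_folder(inbox: Path, done: Path, conn: sqlite3.Connection, extract) -> int:
    """Load every PDF waiting in `inbox`; return how many were processed.

    `extract` is a hypothetical callback that takes a file path and
    returns a dict of {field name: value} for that document.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS extracted_fields ("
        " file_name TEXT, field TEXT, value TEXT)"
    )
    done.mkdir(parents=True, exist_ok=True)
    processed = 0
    for pdf in sorted(inbox.glob("*.pdf")):
        fields = extract(pdf)
        conn.executemany(
            "INSERT INTO extracted_fields (file_name, field, value) VALUES (?, ?, ?)",
            [(pdf.name, name, value) for name, value in fields.items()],
        )
        conn.commit()
        # Move the file out of the inbox so the next pass does not reload it.
        shutil.move(str(pdf), str(done / pdf.name))
        processed += 1
    return processed
```

Running this from a scheduled task, or in a loop with a short sleep between passes, would give the "drop the files and walk away" behaviour.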
Post #1341538
Posted Wednesday, August 8, 2012 2:55 AM


sql server developer (8/6/2012)
Thanks for your reply, Lowell. I have Adobe Acrobat X v10.1.0 and don't see the TEXT option when I try to save. Also, our PDF files are laid out with a location block in the upper right-hand corner, from which I need to get the address, city, state, country, etc., and some other details in the lower right-hand corner, which I need to separate out while loading them into SQL Server. I have 7,000 files of this kind. Is there a way to do this without having to use C# programming?

Thanks in advance!


Is the data stored in the PDF as text? Can it be selected and copied?

Or is it there as an image?



Help us to help you. For better, quicker and more-focused answers to your questions, consider following the advice in this link.

When you ask a question (and please do ask a question: "My T-SQL does not work" just doesn't cut it), please provide enough information for us to understand its context.
Post #1341736
Posted Wednesday, August 8, 2012 8:32 AM
Ephesoft is one that I'm aware of. A team here is working on implementing it to capture information from incoming invoices. There are plenty of others out there at all price points.
Post #1341911