RE: Extended Properties Introduction

March 17, 2011 at 2:47 pm

YSLGuru,

Anyone know of a good way to get metadata (sometimes called a data dictionary) out of a PDF and into something easier to insert into T-SQL code? The PDF currently uses an Excel/spreadsheet/table-like structure; at least that's the way it's presented. I have no idea how PDF works internally, so the fact that it looks like a table may not make it any easier to export than if it were presented in free-form style.

Internally, a PDF is essentially a set of compressed, PostScript-like page descriptions (with proprietary add-ons), and as far as I can tell there isn't any good way to extract the data out.

However, in theory, one could print the PDF to a PostScript file (e.g. set up a PostScript printer and change the destination to be a file) and then use the PostScript language to extract the data. There may be converters to turn the PostScript into something more friendly, but I've never had occasion to look for any. It might even be possible to skip the PostScript and set up some sort of line printer to create the file, then parse that out after stripping out the control characters. If you decide to try this, use the oldest printer driver you can make work (the Apple LaserWriter has historically been a good choice for a PostScript printer driver). Tables might not be too bad, but the more formatting and objects that exist in the PDF, the harder it will be to parse out the file.
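To give a rough idea of the "strip the control characters and parse" step, here is a minimal Python sketch. I haven't run it against a real printer dump, so treat everything in it as an assumption: that each row comes out as one line of Table / Column / Description, that columns are separated by a tab or by two or more spaces, and that the goal is an MS_Description extended property per column. The function names `parse_rows` and `to_tsql` are made up for illustration.

```python
import re

# Control characters other than tab and newline (bell, form feed, escape, ...).
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b-\x1f\x7f]")
# Assumption: columns are separated by a tab or by two or more spaces.
COLUMN_GAP = re.compile(r"\t+| {2,}")


def parse_rows(raw_text, expected_columns=3):
    """Return a tuple of fields for every line that has the expected column count."""
    rows = []
    for line in raw_text.splitlines():
        cleaned = CONTROL_CHARS.sub("", line).strip()
        if not cleaned:
            continue
        fields = COLUMN_GAP.split(cleaned)
        # Page headers, footers and wrapped text rarely match the column
        # count, so they are dropped here rather than guessed at.
        if len(fields) == expected_columns:
            rows.append(tuple(fields))
    return rows


def quote(value):
    """Double single quotes so the value is safe inside a T-SQL string literal."""
    return value.replace("'", "''")


def to_tsql(rows, schema="dbo"):
    """Build one sp_addextendedproperty call per (table, column, description) row."""
    statements = []
    for table, column, description in rows:
        statements.append(
            "EXEC sys.sp_addextendedproperty "
            f"@name = N'MS_Description', @value = N'{quote(description)}', "
            f"@level0type = N'SCHEMA', @level0name = N'{quote(schema)}', "
            f"@level1type = N'TABLE', @level1name = N'{quote(table)}', "
            f"@level2type = N'COLUMN', @level2name = N'{quote(column)}';"
        )
    return statements
```

Feed `parse_rows` the contents of the dumped file, eyeball the rows it returns, and only then paste the output of `to_tsql` into a query window. A column-heading line with the right number of fields would be picked up as a data row, and multi-byte printer escape sequences would leave fragments behind, so expect to adjust it for the real file.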

After writing all of that, I thought of a potentially much easier way:

OCR.

Good Luck,