How to push PowerQuery results into SQL Server

  • Sorry, me again.

    In my PowerQuery, I read a table out of a PDF file, do a bunch of transformations and then split the resulting table in two (one is Purchases, and the other is Discounts), and then I use those to create a new dataset by doing a Purchases LEFT JOIN Discounts. All that works. I got the writing to work by using Access and a linked table to an Excel sheet with the PowerQuery result, and then another to my SQL Server destination table and then I just use Access to do the "coordination". (Ghastly, I know).

    What's the most sensible way of doing this without all the Excel/Access/SQL Server hoops? It seems like I could (maybe) turn the PowerQuery into a function that returns the final table, and then call that from SSIS to handle the For Each file Loop / moving /logging etc part of the problem. I'm doing everything on my desktop (no Fabric)...

    just a reference to an article is plenty, thanks!

  • If I am not mistaken, PowerQuery is part of Excel, so it'll be hard to remove excel from your mix.

    But is there any reason you aren't just connecting SSIS to Excel? Is there a need for Access?

    But since you are familiar with SSIS, you could take PowerQuery and Excel out of the equation and just use SSIS:

    https://stackoverflow.com/questions/58683425/how-to-use-pdf-file-as-data-source-in-ssis

    SSIS should be able to handle everything PowerQuery was doing with a little bit of initial setup on the SSIS side of things.

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there is negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • I don't think I can remove Excel, because for some reason, PowerQuery *in Excel* can read/parse tables in a PDF, but the PowerQuery source in SSIS can't. (???!!!!!) The weird thing is that mapping in that stepchild, Access, can read both Excel and SQL Server just fine. but Seems kinda over the top to use all of that to just import an Excel(ish) file.

    The big question is how do I deal with the multiple PDFs? And what do I do if one PDF fails? because PowerQuery will just lump all that together (with a filename). If nothing fails, I can jusMt import all the data at once, dump it into a temporary table in SQL Server, split the Purchases and the Discounts to different tables and then outer join them.

    Might be time to learn Py2PDF

    • This reply was modified 3 weeks, 2 days ago by pietlinden.
  • Mr. Brian Gale wrote:

    If I am not mistaken, PowerQuery is part of Excel, so it'll be hard to remove excel from your mix.

    PQ can be used in many other places, including Power BI desktop, SSAS, ADF, Power Apps ... But I think Excel is where it all began.


  • Phil Parkin wrote:

    Mr. Brian Gale wrote:

    If I am not mistaken, PowerQuery is part of Excel, so it'll be hard to remove excel from your mix.

    PQ can be used in many other places, including Power BI desktop, SSAS, ADF, Power Apps ... But I think Excel is where it all began.

    Shows what I know about PowerQuery LOL. Quick google though told me that you can use a C# step to pull data out of PDF in SSIS which is the approach I'd take. Source (PDF) to ETL (SSIS) to Destination (SQL Server). Reduce the number of tools in the middle (Excel, Access). Problems may come in if/when the PDF format changes and your C# stuff breaks, but that'll also break PowerQuery. So as long as you have someone with some C# skills, that approach should work.

    Now why PowerQuery can read the PDF in Excel but not in SSIS, that would be a red flag for me to not use PowerQuery for this. If the language is used in other places BUT behaves differently across platforms, it means an update to anything in my workflow (Excel in your case) COULD update PowerQuery and break the PDF import. C# is a lot less likely to change with an SSIS upgrade, so less risk. Still risk mind you because anything you update COULD break the parsing. A new version of the PDF editing tools you are using could render the PDF unreadable in C# or PowerQuery.

    Don't suppose there is any way you can request a better format for the data, eh?

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there is negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • I asked them at Costco if they could do something crazy and send me the file as a spreadsheet and not a PDF, but nope. I guess they're super proud of their 2005 technology. Then I wouldn't have to do any of these stupid things. =(

  • Probably due to its use in Data Engineering and Data Science, Python has a number of libraries to extract text from PDFs. eg PDFPlumber.

     

  • I've been looking at a bunch of them. There are several libraries that read/parse PDFs. Guess I'll dig into that.

Viewing 8 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply