How to push PowerQuery results into SQL Server

  • Sorry, me again.

    In my PowerQuery, I read a table out of a PDF file, do a bunch of transformations and then split the resulting table in two (one is Purchases, and the other is Discounts), and then I use those to create a new dataset by doing a Purchases LEFT JOIN Discounts. All that works. I got the writing to work by using Access and a linked table to an Excel sheet with the PowerQuery result, and then another to my SQL Server destination table and then I just use Access to do the "coordination". (Ghastly, I know).

    What's the most sensible way of doing this without all the Excel/Access/SQL Server hoops? It seems like I could (maybe) turn the PowerQuery into a function that returns the final table, and then call that from SSIS to handle the For Each file Loop / moving /logging etc part of the problem. I'm doing everything on my desktop (no Fabric)...

    just a reference to an article is plenty, thanks!

  • If I am not mistaken, PowerQuery is part of Excel, so it'll be hard to remove excel from your mix.

    But is there any reason you aren't just connecting SSIS to Excel? Is there a need for Access?

    But since you are familiar with SSIS, you could take PowerQuery and Excel out of the equation and just use SSIS:

    https://stackoverflow.com/questions/58683425/how-to-use-pdf-file-as-data-source-in-ssis

    SSIS should be able to handle everything PowerQuery was doing with a little bit of initial setup on the SSIS side of things.

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there is negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • I don't think I can remove Excel, because for some reason, PowerQuery *in Excel* can read/parse tables in a PDF, but the PowerQuery source in SSIS can't. (???!!!!!) The weird thing is that mapping in that stepchild, Access, can read both Excel and SQL Server just fine. but Seems kinda over the top to use all of that to just import an Excel(ish) file.

    The big question is how do I deal with the multiple PDFs? And what do I do if one PDF fails? because PowerQuery will just lump all that together (with a filename). If nothing fails, I can jusMt import all the data at once, dump it into a temporary table in SQL Server, split the Purchases and the Discounts to different tables and then outer join them.

    Might be time to learn Py2PDF

    • This reply was modified 1 month, 3 weeks ago by pietlinden.
  • Mr. Brian Gale wrote:

    If I am not mistaken, PowerQuery is part of Excel, so it'll be hard to remove excel from your mix.

    PQ can be used in many other places, including Power BI desktop, SSAS, ADF, Power Apps ... But I think Excel is where it all began.


  • Phil Parkin wrote:

    Mr. Brian Gale wrote:

    If I am not mistaken, PowerQuery is part of Excel, so it'll be hard to remove excel from your mix.

    PQ can be used in many other places, including Power BI desktop, SSAS, ADF, Power Apps ... But I think Excel is where it all began.

    Shows what I know about PowerQuery LOL. Quick google though told me that you can use a C# step to pull data out of PDF in SSIS which is the approach I'd take. Source (PDF) to ETL (SSIS) to Destination (SQL Server). Reduce the number of tools in the middle (Excel, Access). Problems may come in if/when the PDF format changes and your C# stuff breaks, but that'll also break PowerQuery. So as long as you have someone with some C# skills, that approach should work.

    Now why PowerQuery can read the PDF in Excel but not in SSIS, that would be a red flag for me to not use PowerQuery for this. If the language is used in other places BUT behaves differently across platforms, it means an update to anything in my workflow (Excel in your case) COULD update PowerQuery and break the PDF import. C# is a lot less likely to change with an SSIS upgrade, so less risk. Still risk mind you because anything you update COULD break the parsing. A new version of the PDF editing tools you are using could render the PDF unreadable in C# or PowerQuery.

    Don't suppose there is any way you can request a better format for the data, eh?

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there is negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • I asked them at Costco if they could do something crazy and send me the file as a spreadsheet and not a PDF, but nope. I guess they're super proud of their 2005 technology. Then I wouldn't have to do any of these stupid things. =(

  • Probably due to its use in Data Engineering and Data Science, Python has a number of libraries to extract text from PDFs. eg PDFPlumber.

     

  • I've been looking at a bunch of them. There are several libraries that read/parse PDFs. Guess I'll dig into that.

  • FWIW, I got one way to work... I used Abscess and Excel and a linked table to SQL Server. That way I can just call .RefreshAll() in Excel so that each PDF gets read and parsed into Excel, and then I can create a linked table that points to that... then all there is left to do is do the append from the linked Excel file to SQL Server and finally just move the processed files.

    Totally high-tech, I know. but it works. And no type casting nightmares.  And since the guy I'm doing it for is definitely low-tech, that'll work just fine.

  • also assumes that that process runs in the local pc (which has the license for Office).

    So not enterprise/production ready really, but that may be what is required for your client.

  • Why would I need enterprise grade for a one man show? For something he'd run every few weeks?  I'm not processing 100K files an hour.

    If you can point out where my code is wrong, that would be great. Or explain how to fix the problem I'm having with SSIS not returning any columns from a PDF source. I mean, since you seem to like to lecture me, why not make it a useful lecture?

  • then why not supply an example of a file that you fail to process along with the SSIS packaged you built. This has been asked for in your original threads on this subject.

    cause in some of the links that were supplied as part of your prior threads, as well as some googling, the process to add the columns was explained - and you at a later stage had stated you had managed to process a pdf within SSIS - and now it seems that working around multiple table types has you back to "columns do not show".

  • pietlinden wrote:

    Why would I need enterprise grade for a one man show? For something he'd run every few weeks?  I'm not processing 100K files an hour.

    If you can point out where my code is wrong, that would be great. Or explain how to fix the problem I'm having with SSIS not returning any columns from a PDF source. I mean, since you seem to like to lecture me, why not make it a useful lecture?

    Just want to add that a one man show needing an Access license and an Office license and a SQL Server license (for SSIS which is not included in express) seems like a bit of overkill to get a PDF into a database, no? That feels like a very expensive solution to the problem. If you can get the PDF into Excel, and you are using manual processes anyways, you could have a column in Excel that generates the SQL Insert commands and does some sanity checks on the data to make sure it read properly into Excel, then copy-paste those insert statements into SSMS. Removes Access and SSIS from the equation which saves a lot of cost.

    As for using a PDF as a data source in SSIS, did you check out my link from earlier - https://stackoverflow.com/questions/58683425/how-to-use-pdf-file-as-data-source-in-ssis

    Now, the above being said, if your solution works and you/they are OK eating the costs and handling the manual processes, I say go for it.

    For me, I dislike manual processes as there is always a chance for error. Excel's Refresh All SOMETIMES misbehaves on me, especially if I am using calculations on the data after it pulls it from an external source (such as power query) and I need to do 2 refreshes. Plus you have no control on the refresh order if you have multiple data sources. I know there are workarounds to both of these things, but anything manual runs the risk of being done wrong. The more I can automate, the less likely I am to make mistakes. Even with 1 man shows (I am not, but I have some scripts that I have written that are "just for me"), I tend to write up scripts to make some tasks easier for me. For example, I have a bat file that runs the winget commands for updating all software managed by winget. Yes I could do it without the bat file, but I don't do it often enough to remember the commands and I don't want to waste time googling it each time. Trivial example, but it shows how lazy I am. "sudo update.bat" vs "sudo winget upgrade -h --all --include-unknown" - Which would you prefer to run?

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there is negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • Oh, sorry. Silly me! So I just modify the step where it writes the one row ...

     ' Display last table's content.
    Dim table = tables.FirstOrDefault()
    If paragraph IsNot Nothing Then
    Console.WriteLine("Table content:")
    Console.WriteLine("Modify below to insert the parsed values into the database")
    For Each row In table.Rows
    For Each cell In row.Cells
    Console.Write($"{cell.Content.ToString().TrimEnd().PadRight(15)}|")
    Next
    Console.WriteLine()
    Next
    End If

    • This reply was modified 3 weeks, 3 days ago by pietlinden.

Viewing 14 posts - 1 through 14 (of 14 total)

You must be logged in to reply to this topic. Login to reply