Is there a matured solution to ETL excel data?

  • Merry Christmas and happy new year!

    From time to time my work needs to deal with lots of excel sheets from clients, there is no united format to those excel files and sheets, each client and each project will be different to each other.

    That means so far we have to manually import the data, here is what have been doing long before I joined the team:

    1. Create a database for each project

    2. Create a table for each sheet in each excel file

    All columns were initially created as varchar, and 255 characters unless max is needed. No integer, no datetime, sometimes we manually update the types but most of the times that's all.

    The import incurs lots of human cost. I believe this is indeed quite common and there should be some matured solution already?

    I am thinking:

    1. can document-based MongoDB handle this?

    2. is there ETL tool capable to handle this?

    3. can SSIS handle this in an acceptable manner? I mean, it sounds unrealistic if we are to create SSIS for each sheet.

    Thank you for your input.

  • One option might be to use the bastard child of SQL Server - Access... (bear with me for a minute)....

    Create a table in Access of (source column name, destination column name, destination data type), and then use it to match source columns to destination columns. The good thing is that you would only have to do it once for each variation of a name. Then you could write a query (probably using VBA to make it dynamic) and then append the changed data (proper types and sizes) to a SQL Server table.

    Other than that, I don't know...

  • Thanks, still very tedious, actually adding one step here because data is eventually to be imported into SQL (or NoSQL), right?

  • halifaxdal (12/26/2016)


    Thanks, still very tedious, actually adding one step here because data is eventually to be imported into SQL (or NoSQL), right?

    It'll be much less tedious to brow beat the clients into submission and have them follow a standard. More fun, as well. πŸ˜›

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Thank you Jeff,

    I completely agree with you, however you know often the time you will have to follow your client, not asking client to follow you.

    It might be possible to enforce some requirement to your client, I bet your boss would ask you not to do so.

    When you are talking to clients, situations could end up like a excel file with dozens of sheets and each sheet has arbitrary names with special characters and some sheet might have hundreds of columns with possible special characters in some of them.

    So, no, this is not a job for DBAs or for internal developer team to deal with, I am rather looking for some kind of commercial solution as I believe there should be a mature solution for this because this is a common scenario as far as I can see.

  • halifaxdal (12/26/2016)


    It might be possible to enforce some requirement to your client, I bet your boss would ask you not to do so.

    Fortunately, I have some good input with my bosses and both of them understand the value of consistent data.

    When you are talking to clients, situations could end up like a excel file with dozens of sheets and each sheet has arbitrary names with special characters and some sheet might have hundreds of columns with possible special characters in some of them.

    You've just mentioned the most valuable aspect of it all and that's talking with the clients. If you let them know how much more accurately and quickly you can do things for them if the data/spreadsheets are thoughtfully structured, they're usually eager to comply. To spring an old parable on this, the answer is always "No" unless you ask. Make it a point to help the customer see the value in it because it will be much better for both the customer and your company.

    Just remember... If you don't have a plan, don't expect your clients to have one. πŸ˜‰

    So, no, this is not a job for DBAs or for internal developer team to deal with, I am rather looking for some kind of commercial solution as I believe there should be a mature solution for this because this is a common scenario as far as I can see.

    Unfortunately, no one has come up with the AI solution necessary to handle every possible bit of dribble someone may drool onto a spreadsheet. You say that it's "not a job for DBAs or internal developer team" to deal with but who would know the data better than they do? They probably know it even better than the originators of the spreadsheets.

    Shifting gears, do what I did. Think about how to import the first several rows of data (using the "ACE" drivers, which is a bit of that "mature" solution you're looking for) from the spreadsheet with headers set to "NO" and IMEX set to "ON". Import those rows using the standard {fx} column names and read the column names so that you can use them to create the necessary target table. If you have (for example) monthly repeated column names, use that information to create an "unpivot" of the data once you've loaded the data into a staging table. Of course, it will require dynamic SQL but, if you're clever, you can make a couple of variables to identify when you have multi-row column headers and deal with them appropriately. What I end up with is a nice little EAV that's easy to query and index. Of course, horizontal tables as part of a report, you also need to learn how to do dynamic CROSS TABs.

    The reason why I don't post my solution is because it's a little too complex to post on a forum thread without an in-depth explanation and the explanation of the techniques used makes for a 30 or so page article. It also requires the use of xp_CmdShell and that alone requires a 15 page article to convince people that it can be used safely and another 10 pages to explain how to set it up to be used safely.

    My point though is, if this lowly, non-certified, non-degreed, mostly self-trained, doesn't-believe-in-portable code, SQL hack can do it, so can someone else. It just takes a bit of study, testing, and dedication. To paraphrase a famous quote:

    "The difficult we do immediately... the impossible takes slightly longer" and that's backed up with the most important principle for DBAs, Developers, and any walk of life, "Essayons!" πŸ™‚

    Step one would be for you to Google how to download and install the "ACE" drivers. I'll save you the time for this one though. The link you need is https://www.microsoft.com/en-us/download/details.aspx?id=39358 and it needs to be installed on each server that you wish to do such things with.

    Now... once you've downloaded the 64 bit version, be advised that the Website you downloaded from has a set of instructions that will work but ONLY if there are no 32 bit applications on your machine. DO NOT UNLOAD YOUR APPLICATIONS! IT’S NOT NECESSARY!!!

    Instead, find where you the downloaded the AccessDatabaseEngine_x64.exe file to, open a DOS window, change to that directory, type the following command, and press enter.AccessDatabaseEngine_x64.exe /quiet

    The command above will install the driver/provider whether or not any 32 bit programs exist on the server.

    Be sure to allow any changes when prompted. You should have administrator privs on the box to do this installation.

    You’ll need to restart SSMS (not the server) for the changes to take effect.

    "Every great journey begins with the first step." πŸ˜‰

    Once you've properly downloaded and installed the "ACE" drivers, then you need to configure them. You can do so with the following code once you're restarted SSMS after the installation.

    USE [master]

    GO

    EXEC master.dbo.sp_MSset_oledb_prop N'Microsoft.ACE.OLEDB.12.0', N'AllowInProcess', 1

    GO

    EXEC master.dbo.sp_MSset_oledb_prop N'Microsoft.ACE.OLEDB.12.0', N'DynamicParameters', 1

    GO

    EXEC master.dbo.sp_MSset_oledb_prop N'Microsoft.ACE.OLEDB.12.0', N'NestedQueries', 1

    GO

    EXEC master.dbo.sp_MSset_oledb_prop N'Microsoft.ACE.OLEDB.12.0', N'NonTransactedUpdates', 1

    GO

    EXEC master.dbo.sp_MSset_oledb_prop N'Microsoft.ACE.OLEDB.12.0', N'SqlServerLIKE', 1

    GO

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • With the understanding that some of the human dialog is missing and once you've taken the first step above and you don't mind it not being quite as detailed as an article will be, you can go to the following URL, search for "EXCEL HELL" on the page, and download my complete presentation on the subject, which includes the PPTX presentation, example spreadsheets to play with, and all the necessary code. Do keep in mind that its POP code (Proof-of-Principle) that may need a tweak or two.

    Here's the link...

    http://detroit.sqlpass.org/MeetingArchive/2015.aspx

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff,

    I am shocked when I woke up and saw you have such a long reply to my question, I think that's why I love this forum, thank you very much from bottom of my heart

  • halifaxdal (12/27/2016)


    Jeff,

    I am shocked when I woke up and saw you have such a long reply to my question, I think that's why I love this forum, thank you very much from bottom of my heart

    Thanks for the feedback. I aim to please. I sometimes miss but I'm always aiming. πŸ˜€

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (12/26/2016)


    halifaxdal (12/26/2016)


    Thanks, still very tedious, actually adding one step here because data is eventually to be imported into SQL (or NoSQL), right?

    It'll be much less tedious to brow beat the clients into submission and have them follow a standard. More fun, as well. πŸ˜›

    ROTFL! I would buy tickets to see that!

  • kevaburg (12/31/2016)


    Jeff Moden (12/26/2016)


    halifaxdal (12/26/2016)


    Thanks, still very tedious, actually adding one step here because data is eventually to be imported into SQL (or NoSQL), right?

    It'll be much less tedious to brow beat the clients into submission and have them follow a standard. More fun, as well. πŸ˜›

    ROTFL! I would buy tickets to see that!

    Me too. I've experienced lots of failure in that area due to "management directives" and the like. At least Jeff's Excel Hell solution works. If you do a lot of importing, spend the time to learn the techniques - they worked great for me.

  • Thanks for the endorsement, Ed. And Happy New Year to everyone!

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Excel is a terrible format for automated data exchange. Aside from all technical issues involved in installing and configuring and Excel driver there is the following:

    When the user hides rows or columns, they are ignored by Excel provider as if they were deleted.

    When the user strikes-out rows, they think that means the rows are deleted, but that's just formatting.

    Instead, insist that import files be submitted in TAB delimited format using a documented column layout.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Happy New Years too! πŸ™‚

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

Viewing 14 posts - 1 through 13 (of 13 total)

You must be logged in to reply to this topic. Login to reply