ETL in 2021?

  • Ahoy,

    after reading Steve's "Importing a CSV into SQL Server Shouldn't Be This Hard" and the comments on the state of SSIS, I remembered something I had been meaning to check for a while. Using Google I could not really find a satisfying answer that would give me a concrete direction on what to look into.

    At work I was thrown into a framework built by consultants (before my time). The framework uses metadata to run the different packages and log the executions. When a new table (from SAP or a different SQL Server) is required, I have to create a new task, place it within the framework, and add the corresponding metadata to make sure it is executed within the job. So I am not importing data from files, only from tables (SAP/SQL Server).

    I want to broaden my horizons a bit: what other common ways or tools are used in 2021 to build and automate this kind of schedulable ETL?

    Thanks a lot

  • My company uses SSIS entirely for our ETL.

    Our process is that a support request comes in for a report to be made. Next, a data person determines where that data currently resides and whether the report needs live data or point-in-time data.  If it needs point-in-time data or uses only historical data, then the data MAY be moved to a data mart.  If the data is being moved to a data mart and doesn't already exist there, we build an SSIS package to first clone the source tables from a specific point in time and then build up the reporting table.  We do the clone first so we can reduce blocking on the source tables as we run our data refreshes during company downtime.  Then the report is built off of that reporting table.

    Most of our data is stored in SQL Server with a small portion of it existing in Excel files.  The Excel files are pulled into the data mart.

    When doing the T part of the ETL (technically our process is more of an ELETL, as we extract, load, extract, transform, load) we try to pick the best tool for the job.  Sometimes SSIS is the right way to go; other times a stored procedure can handle it better.  We try to avoid using scripts in SSIS, but that is mostly because nobody on our team has investigated using scripts in SSIS, and so far our ETL process is performing "good enough" that we don't want to take it apart and try to improve it.
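
    For anyone who wants to picture the "clone first, then transform" pattern described above, here is a minimal T-SQL sketch of the idea.  All of the schema, table, and column names are made up for illustration - our real process does this through SSIS packages rather than one script.

    -- Step 1: snapshot the source table at a point in time, so later transforms never touch the live source.
    TRUNCATE TABLE Staging.Orders_Snapshot;

    INSERT INTO Staging.Orders_Snapshot (OrderID, CustomerID, OrderDate, Amount)
    SELECT OrderID, CustomerID, OrderDate, Amount
    FROM Sales.Orders;

    -- Step 2: build the reporting table from the snapshot, not from the live source.
    TRUNCATE TABLE Reporting.OrderSummary;

    INSERT INTO Reporting.OrderSummary (OrderMonth, CustomerID, TotalAmount, OrderCount)
    SELECT DATEFROMPARTS(YEAR(OrderDate), MONTH(OrderDate), 1),
           CustomerID,
           SUM(Amount),
           COUNT(*)
    FROM Staging.Orders_Snapshot
    GROUP BY DATEFROMPARTS(YEAR(OrderDate), MONTH(OrderDate), 1), CustomerID;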

    The above is all just my opinion on what you should do.
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there are negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • We use SAP Data Services. We have multiple platforms: SQL Server, IBM i, and a couple of Oracle links. It's similar in some ways to SSIS.

    One tiny feature I recently found that I like is that it's flexible when importing flat files that have a slightly different layout. I can tell the tool that I normally expect 25 fields of data, but if I only have 24, pass along a null instead. SAP DS seems to handle flat file imports more gracefully than SSIS, which I last used a few years ago.

  • I agree with you there about SSIS.  It can run and work happily without much help, but boy does it hate change or inconsistencies.  If the metadata changes, the package fails.  If the source data had 25 columns yesterday when it ran successfully and has 24 columns today, it will fail.

    You can tell it to resume on error, but chances are you don't want it to do that.  Or you can have a few errors pop up and still count the run as a success, but some errors are fatal errors (like metadata changes).

    It is not very flexible in terms of error handling, but it is nice that it is included with a SQL Server license.

     

    SAP Data Services sounds like a nicer tool to work with.

    The above is all just my opinion on what you should do.
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there are negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • If a file had 24 fields yesterday - and now has 25 fields - I want the process to fail.  I don't want to assume the new field added was actually added to the end and just ignore it.

    If a file had 25 fields yesterday - and now has 24 fields, I want it to fail.  I have no way of knowing that the missing field was the last field in the file - it could just as easily be any field in the file that was dropped.

    If the metadata has changed - such that a field in the file grew from 30 to 35 characters - again, I want that to fail so I can determine why the data has changed.  Was the change temporary, a mistake from the sender, or do we now need to account for the larger column size in our downstream tables and systems?

    This idea that the problem with importing files into SQL Server is somehow an SSIS issue is just wrong.  The problem is the files - and the nature of how those files are generated.  If you have a well-defined file and a contract with the sender to deliver that file in the specified format (CSV, TSV, fixed-width, ragged-right, etc.), then SSIS (or any other tool), once set up, will easily handle the loading of that file with no issues.

    I think the bigger problem is the expectation that we can take Excel files and import them without any rules on how the files are built.  Too many times I have had issues with Excel files (or CSV files generated from Excel) where the users entered invalid data - text in a numeric field, invalid dates in a date field, extra data in money columns, or even special characters in text fields (CR/LF/TAB).
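
    When we do have to land that sort of spreadsheet data anyway, one option is to stage every column as text and validate it in T-SQL before it touches the typed tables.  A rough sketch - the table and column names are invented purely for illustration:

    -- Quarantine rows that fail type conversion instead of letting them kill the load.
    INSERT INTO Staging.Invoice_Rejects (RawAmount, RawInvoiceDate, RejectReason)
    SELECT RawAmount, RawInvoiceDate, 'Failed type conversion'
    FROM Staging.Invoice_Raw
    WHERE TRY_CONVERT(decimal(18, 2), RawAmount) IS NULL
       OR TRY_CONVERT(date, RawInvoiceDate) IS NULL;

    -- Load only the rows that convert cleanly into the typed table.
    INSERT INTO dbo.Invoice (Amount, InvoiceDate)
    SELECT TRY_CONVERT(decimal(18, 2), RawAmount),
           TRY_CONVERT(date, RawInvoiceDate)
    FROM Staging.Invoice_Raw
    WHERE TRY_CONVERT(decimal(18, 2), RawAmount) IS NOT NULL
      AND TRY_CONVERT(date, RawInvoiceDate) IS NOT NULL;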

     

    Jeffrey Williams
    “We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”

    ― Charles R. Swindoll

    How to post questions to get better answers faster
    Managing Transaction Logs

  • I see your point Jeff, and that makes sense.  You can't just assume that the missing column is the last one or that new columns will always end up at the end of the list.  I will drop my gripes with those scenarios.

    A scenario that still frustrates me, though, is when I am using a table as a source and a table as a destination, and then someone changes some metadata on the source AND the destination so they match.  Something like changing a data type from VARCHAR(25) to VARCHAR(30) on both the source and the destination, so both sides have the correct data types and there should be no issues.  But SSIS complains that the metadata changed from what it had stored internally, and now it doesn't know what to do and the job fails.

    Now, inside SSIS, my guess as to why this is a problem is that it has optimized itself around that column being VARCHAR(25) for memory allocation and any data transforms that may occur.  So when it suddenly gets 5 more characters, those optimizations may no longer apply, or worse, they may not request enough resources to handle the operation, so it needs to re-optimize things.  I just wish I could do that from within SSMS.  Something like a "refresh metadata" option, rather than needing to load the package in Visual Studio, double-click on each entry with the yellow triangle, change literally nothing but click OK, then save it and deploy it out to the SSIS server again.  A lot of overhead just to change some metadata inside the SSIS package...

     

    The above is all just my opinion on what you should do.
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there are negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • In my previous experience with SSIS, the task was to find a way to make table-to-table transfers like this run much more quickly.  The way I did that was to stop using SSIS and write stored procedures to do it. 😀
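
    A generic sketch of that kind of table-to-table transfer done in a procedure, with the linked server and object names made up for illustration:

    -- Rough shape of a table-to-table transfer as a stored procedure instead of an SSIS package.
    CREATE OR ALTER PROCEDURE dbo.Load_CustomerSnapshot
    AS
    BEGIN
        SET NOCOUNT ON;

        TRUNCATE TABLE dbo.CustomerSnapshot;

        INSERT INTO dbo.CustomerSnapshot (CustomerID, CustomerName, ModifiedDate)
        SELECT CustomerID, CustomerName, ModifiedDate
        FROM SourceServer.SourceDB.dbo.Customer;  -- four-part name through a linked server
    END;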

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
        Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • We use SSIS, but it is picky about metadata (so not really suited for Excel files), and I found the error handling clunky.

    There are no user-friendly error messages ("metavalidation failed" 🙁), and you can't export the package flow for documentation.

    With BIML we can auto-generate SSIS packages for the boilerplate ones.

    Lately the transform part is done in T-SQL (views).
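
    A hypothetical example of that pattern (names invented): the view carries the transform logic, so the package only has to copy rows from the view into the destination table.

    -- Transform logic lives in the view; the load step is a plain copy from this view.
    CREATE OR ALTER VIEW Reporting.v_CustomerSales
    AS
    SELECT c.CustomerID,
           c.CustomerName,
           SUM(o.Amount) AS TotalAmount,
           COUNT(*)      AS OrderCount
    FROM dbo.Customer AS c
    JOIN dbo.Orders   AS o
        ON o.CustomerID = c.CustomerID
    GROUP BY c.CustomerID, c.CustomerName;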

    We are exploring PowerShell (ImportExcel + dbatools) for getting Excel data into SQL tables.

  • Hmm... SSIS is decent for a tool that's essentially included free with SQL Server (compared to, say, Oracle's nothing), but it's obviously not something that gets a particularly high priority.  It's good when someone wants to put together a quick job and deploy it quickly, but it doesn't really provide a robust ongoing development/deployment environment.  We use Informatica and Talend as our main ETL tools for heavy-duty jobs, with a decent amount of SSIS for various random things that aren't, in theory, super complicated.

  • Mr. Brian Gale wrote:

    I see your point Jeff, and that makes sense.  You can't just assume that the missing column is the last one or that new columns will always end up at the end of the list.  I will drop my gripes with those scenarios.

    A scenario that still frustrates me, though, is when I am using a table as a source and a table as a destination, and then someone changes some metadata on the source AND the destination so they match.  Something like changing a data type from VARCHAR(25) to VARCHAR(30) on both the source and the destination, so both sides have the correct data types and there should be no issues.  But SSIS complains that the metadata changed from what it had stored internally, and now it doesn't know what to do and the job fails.

    Now, inside SSIS, my guess as to why this is a problem is that it has optimized itself around that column being VARCHAR(25) for memory allocation and any data transforms that may occur.  So when it suddenly gets 5 more characters, those optimizations may no longer apply, or worse, they may not request enough resources to handle the operation, so it needs to re-optimize things.  I just wish I could do that from within SSMS.  Something like a "refresh metadata" option, rather than needing to load the package in Visual Studio, double-click on each entry with the yellow triangle, change literally nothing but click OK, then save it and deploy it out to the SSIS server again.  A lot of overhead just to change some metadata inside the SSIS package...

    I agree with this, and it would be nice to have the ability to refresh the package metadata without having to edit the package.  With that said, binding issues like this occur with many applications, and any change to metadata requires updating the code in that application.

    Jeffrey Williams
    “We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”

    ― Charles R. Swindoll

    How to post questions to get better answers faster
    Managing Transaction Logs


