ETL or ELT

  • I am very much on the ETL side of things, with a mainly structured data warehouse for use by the business, analysts and data scientists.
    There might be some data that would require an ELT approach, such as very fast-arriving data or massive quantities, but even then I only use those for specific cases, and then as a source for an ETL into a structured DW.

  • We use Apache Spark, which reduces the need for ELT.  We wrestle with the "How do you unit test an integration tool?" question.  An awful lot of what we do involves transforming a set of data, so historically ELT has suited our use case.
    In a more heterogeneous world we are looking to Apache Presto/AWS Athena as a unified SQL interface to many different types of data sources.

  • This is just me I suppose, but I do use data stores to store data first before transforming it. This is no different from downloading it to disk and then transforming it. Meaning, just because I loaded it into a data lake does not flip the TL to LT. I still need to transform the data at some point, just as I would if there were no data store and I had simply downloaded it to a hard drive for some SSIS package to absorb later.

    Thus, it's all ETL to me. I download data, I push it to a data lake, it's transformed in U-SQL, then loaded to the warehouse, and maybe transformed more in T-SQL.

    ELT is more a matter of downloading it, loading it, and then letting the front end transform it however it likes on read.

  • I have done one biggish ETL project, and that was a one-off migration rather than a dynamic, continual process.

    I used MS Access as a unified SQL interface for transforming data from one Oracle system to another. I can connect to MySQL, SQL Server and Postgres with MS Access (I haven't tried MariaDB).

    It worked well for me, but I have a decade of experience with VBA, so I initially took the mapping documents from the users, generated simple update queries, and then individually curated those. I had columns for source table and field and columns for target table and field, split the mapping into sections by both source and receiving table, and then wrote a function to generate the UPDATE queries. These generated queries were obviously quite wide of the mark, but they were a good start.

    I then had further functions that:

    Scripted complete SQL runs, indicating start time and end time.
    Pivoted the numbers being transformed into the different tables over a historical time period, to give me a quick indicator of whether things were going awry.
    Wrote the defined SQL into a separate searchable table, so I could search on field names and ensure fields were not being omitted.

    I also had a nice function that I used to compare a nasty jumbled address field with a 100% accurate and properly normalized address table. This wrote thousands of update queries to a table that I could then run consecutively against the dirty data to attach a primary UPRN index field to the new data. These queries were run using a VBA function which called the queries in the table consecutively. This got round the object limit in Access, which was 32k - I ended up with 66k queries to run, but it would have scaled up to 2GB (or effectively without limit if I had linked to an external Access file for the table holding the SQL queries). A rough sketch of this mapping-driven query generation, translated into T-SQL, is below.
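
    To illustrate the idea for SQL Server readers (the original was Access/VBA), here is a minimal sketch of the same mapping-driven generation of UPDATE statements in T-SQL. The mapping table, the shared key column (LegacyKey) and all other object names are hypothetical, and STRING_AGG assumes SQL Server 2017 or later; this is a sketch of the pattern, not the actual migration code.

        -- Hypothetical mapping table: one row per source field -> target field.
        CREATE TABLE dbo.FieldMapping (
            SourceTable sysname NOT NULL,
            SourceField sysname NOT NULL,
            TargetTable sysname NOT NULL,
            TargetField sysname NOT NULL
        );

        -- Generate one UPDATE statement per source/target table pair, joining on the
        -- assumed shared key column. The generated statements could then be written to
        -- a table and executed in sequence, much as the post describes doing in VBA.
        SELECT 'UPDATE t SET '
             + STRING_AGG(QUOTENAME(m.TargetField) + ' = s.' + QUOTENAME(m.SourceField), ', ')
             + ' FROM ' + QUOTENAME(m.TargetTable) + ' AS t'
             + ' JOIN ' + QUOTENAME(m.SourceTable) + ' AS s ON s.LegacyKey = t.LegacyKey;'
             AS GeneratedSql
        FROM dbo.FieldMapping AS m
        GROUP BY m.SourceTable, m.TargetTable;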

    It still took around 9 months to write the SQL, but once written it ran in under an hour (excluding those address matching queries, which I only ran once). I moved my desk next to the building control manager and asked him annoying questions for 3 months (like: can a building warrant have more than one design certificate?). In the end I had 500 core queries (excluding the address matching queries), and a colleague who did a separate department had in the region of 300 queries.

    For me it was also useful to have the form designer in Access. I recreated some of the more complicated forms from the new system because I didn't have the structure of the data right in places; trying to design the form cleared up the schema for me. Of course this wasn't helped by having no schema documentation for the source or target systems.

    This was a planning application system and building control system for a local authority: about 2 million records - not a large volume, but the structure was complicated. The nice thing was that we were converting everything into pretty much the exact format for import into the new system (the vendor would not let us run the SQL to import into the new system ourselves), so effectively it was like dumping the data into a data warehouse, except we transformed it into an almost perfect state on dump. The other nice thing was that the users kept working on the data for those 9 months; we just hit the button on the last day and the latest data could go straight into the new Oracle back end.

    I do think it is horses for courses - if you are good with a tool, you are probably best to go with that. The start of an important ETL project is probably not the time to start learning a new one; that would have added a lot of stress for me.
    I have never written so much SQL. I would do a system dump of the data every morning to check that the latest scripts ran without error. This, along with the aforementioned pivot of record counts going into the different tables, was nice because I could check each day that if the boys in the section had been inputting new cases, those cases were reflected in the dump for each table.

  • In my experience not all ETL tools are created equal.  Ab Initio still blows my mind.  Some of the open-source offerings, well, Ab Initio blows my mind.
    I've seen ETL tools used heavily for what is really job coordination: on success do this, on failure do that, and so on.

    When you have a component such as the slowly changing dimension component in SSIS, and the advice from experienced ETL practitioners is "please don't use this", it does tend to drive you towards the ELT pattern.  There are a few articles on best practices and patterns.  With some ETL tools it is easy to follow those patterns; with others the tool fights against them every step of the way.

  • I have to agree with Steve on this: “I'll admit that I often prefer to transform data inside of a SQL Server target, OLTP or DW database. The reason is that mass changes are often easier with T-SQL, working sets of data rather than row by row.”

    Whenever possible I also get data into the database in whatever structure is needed for the data, then perform my transformations using T-SQL. It’s much faster (and easier IMHO) to work large data volumes in sets than it is in RBAR fashion. Therefore, I will often use SSIS as a means to extract data from files then transform and load with T-SQL. Then again, your mileage may vary.
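
    As a minimal illustration of the set-based point (the staging table and column names below are invented, not from the editorial): the same cleanup written once as a single set-based UPDATE and once in RBAR fashion with a cursor.

        -- Set-based: one statement touches every qualifying row at once.
        UPDATE dbo.CustomerStaging
        SET    Email = LOWER(LTRIM(RTRIM(Email)))
        WHERE  Email IS NOT NULL;

        -- RBAR equivalent: a cursor visits one row at a time and is usually far slower.
        DECLARE @id INT, @email VARCHAR(255);
        DECLARE cur CURSOR LOCAL FAST_FORWARD FOR
            SELECT CustomerId, Email FROM dbo.CustomerStaging WHERE Email IS NOT NULL;
        OPEN cur;
        FETCH NEXT FROM cur INTO @id, @email;
        WHILE @@FETCH_STATUS = 0
        BEGIN
            UPDATE dbo.CustomerStaging
            SET    Email = LOWER(LTRIM(RTRIM(@email)))
            WHERE  CustomerId = @id;
            FETCH NEXT FROM cur INTO @id, @email;
        END;
        CLOSE cur;
        DEALLOCATE cur;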

  • Our version of ELT is dumping the data to a staging table as VARCHARs with SSIS, creating a destination table with appropriate data types for each field and then writing a stored procedure to cast and insert the data from the staging table into the destination table. Never thought of calling it ELT.
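
    A minimal sketch of that cast-and-insert pattern, with hypothetical table and column names (the actual staging tables and procedure will of course differ). TRY_CAST (SQL Server 2012+) turns values that will not convert into NULLs instead of failing the whole insert.

        -- Staging table loaded by SSIS: everything lands as VARCHAR.
        CREATE TABLE dbo.OrderStaging (
            OrderId   VARCHAR(50),
            OrderDate VARCHAR(50),
            Amount    VARCHAR(50)
        );

        -- Destination table with appropriate data types.
        CREATE TABLE dbo.OrderFinal (
            OrderId   INT           NOT NULL,
            OrderDate DATE          NOT NULL,
            Amount    DECIMAL(12,2) NOT NULL
        );
        GO

        -- Cast and insert from staging into the destination, skipping rows that
        -- will not convert (those could instead be routed to an error table).
        CREATE PROCEDURE dbo.LoadOrderFinal
        AS
        BEGIN
            SET NOCOUNT ON;

            INSERT INTO dbo.OrderFinal (OrderId, OrderDate, Amount)
            SELECT TRY_CAST(OrderId   AS INT),
                   TRY_CAST(OrderDate AS DATE),
                   TRY_CAST(Amount    AS DECIMAL(12,2))
            FROM   dbo.OrderStaging
            WHERE  TRY_CAST(OrderId   AS INT)           IS NOT NULL
              AND  TRY_CAST(OrderDate AS DATE)          IS NOT NULL
              AND  TRY_CAST(Amount    AS DECIMAL(12,2)) IS NOT NULL;
        END;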

  • T-SQL is pretty useless for any kind of complex string manipulation, which is the lion's share of the T in my experience. So I like a hybrid approach: transforming disparate customer data into a "common" format using a .NET front end, then throwing that into a staging table (that at least has the right data types), then using the staging table for the final load and validation.

    Horses for courses, and all that.

  • Aaron N. Cutshall - Tuesday, April 3, 2018 5:38 AM

    I have to agree with Steve on this: “I'll admit that I often prefer to transform data inside of a SQL Server target, OLTP or DW database. The reason is that mass changes are often easier with T-SQL, working sets of data rather than row by row.” Whenever possible I also get data into the database in whatever structure is needed for the data, then perform my transformations using T-SQL. It’s much faster (and easier IMHO) to work large data volumes in sets than it is in RBAR fashion. Therefore, I will often use SSIS as a means to extract data from files then transform and load with T-SQL. Then again, your mileage may vary.

    The issue I have with this is that when you work with large sets of data or large transformations, this practice essentially amounts to putting all your eggs in one basket. Your SQL Server target becomes the greatest point of failure in the pipeline. For example, if your data, transformations and even concurrency grow, the SQL Server target becomes the greatest bottleneck and requires constant upgrades to stay the relevant tool for the job.

    This is why I feel the best practice moving forward will always be to distribute the transformation and loading across your infrastructure, so that no single point becomes your point of failure and the SQL Server target is not doing all the heavy lifting, which may otherwise require you to constantly scale up as opposed to scaling out. For example, as Kimball suggests, doing transformation on disk before the data even lands in the SQL target, then maybe bringing it home with additional transformations in SQL before finalizing the ETL process.

    This is why I really like data stores. Similar to what David mentioned with Spark, using such tools (Data Lake Analytics/U-SQL in my case) helps distribute that work across other systems. The data store helps with the transformation on the way to the SQL target, where SQL can also help with other stages of the transformation process to create the final product.

  • This is a really interesting discussion.  I do all my work via ELT - load the data from files into SQL Server staging tables, then transform as it is loaded into DW tables.  However, I'm working with a relatively small amount of data.  It's interesting to hear others' perspectives when working with larger volumes of data and more complex transformations.  I've never been in a situation where the SQL Server target would become a bottleneck.


    [font="Tahoma"]Personal blog relating fishing to database administration:[/font]

    [font="Comic Sans MS"]https://davegugg.wordpress.com[/url]/[/font]

  • ELTL is the way I would describe how I prefer to handle this data.  Trying to figure out which record in a delimited file has the bad value (using ETL) when your 'T' blows up while transforming before you load to a SQL table is a huge pain in the ass, IMHO.  I would much rather load that data into a SQL table of, say, VARCHAR(500) columns and then transform the data on the way to the final destination.  To me it is simply easier to find out that a field is NULL when it shouldn't be, or has invalid data.  Looking at the SQL that transforms the data is a lot easier than digging through 10 to 20 different 'lookups' in an SSIS data flow.  Again, IMHO.
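
    For example (hypothetical staging table and columns), once the raw file is sitting in an all-VARCHAR staging table, a query like this points straight at the offending rows:

        -- Rows where a required field is missing or will not convert to its target type.
        SELECT s.*
        FROM   dbo.FileStaging AS s
        WHERE  s.CustomerName IS NULL
           OR  TRY_CONVERT(DATE, s.BirthDate)       IS NULL
           OR  TRY_CONVERT(DECIMAL(12,2), s.Amount) IS NULL;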

    -------------------------------------------------------------
    we travel not to escape life but for life not to escape us
    Don't fear failure, fear regret.

  • This is a really interesting discussion. I do all my work via ELT - load the data from files into SQL Server staging tables, then transform as it is loaded into DW tables. However, I'm working with a relatively small amount of data. It's interesting to hear others' perspectives when working with larger volumes of data and more complex transformations. I've never been in a situation where the SQL Server target would become a bottleneck.

    This is how I do it as well, and I'm unconvinced by the article that was attached to the editorial that it should be different.  That said, there is often some transformation that has to take place once the data is in the target tables.  You could think of it as ETLt.  I'm unsure why some seem to think the choice is essentially between set-based and RBAR.  In my view I'm working only with the extract data in the staging database.  The entire staging table is subject to the transforming queries in a set-based manner.  If I were to wait until the data was in the target tables, the updates would still be set based, but it would only be a small portion of the table.  This would take longer. 

    As for the delay, I usually create the target (AKA the data warehouse) first so that the users have something to look at. I develop a way to populate it manually while I work on the automated process. This does take time, but at least with the ETL I've had experience with, there is a lot of transformation that needs to take place.

  • Interesting article. Without realizing it, I was doing ELT as recently as last weekend. I had to pull a lot of data into a database, but was struggling to find a way to translate some of it on the way in. (It is geography data, which I've never worked with before.) I eventually gave up the approach I had been following, in favor of just pulling it all in as is, which meant everything was NVARCHAR. Then I created another table in the same database and wrote a SQL script to INSERT INTO ... SELECT. I didn't even think I was doing an ELT operation, but I was. Considering how many records there are (>500K), it ran relatively quickly, too.
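
    A rough sketch of that kind of INSERT INTO ... SELECT, assuming the raw NVARCHAR columns hold latitude/longitude pairs (the table and column names are invented, and the real data may well have been WKT or something else entirely):

        -- Typed destination with a geography column.
        CREATE TABLE dbo.PlaceFinal (
            PlaceId   INT IDENTITY(1,1) PRIMARY KEY,
            PlaceName NVARCHAR(200) NOT NULL,
            Location  GEOGRAPHY     NOT NULL
        );

        -- Load from the all-NVARCHAR staging table, building geography points
        -- from latitude/longitude (SRID 4326) and skipping rows that will not parse.
        INSERT INTO dbo.PlaceFinal (PlaceName, Location)
        SELECT s.PlaceName,
               geography::Point(TRY_CAST(s.Latitude  AS FLOAT),
                                TRY_CAST(s.Longitude AS FLOAT), 4326)
        FROM   dbo.PlaceStaging AS s
        WHERE  TRY_CAST(s.Latitude  AS FLOAT) IS NOT NULL
          AND  TRY_CAST(s.Longitude AS FLOAT) IS NOT NULL;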

    Kindest Regards, Rod. Connect with me on LinkedIn.

  • xsevensinzx - Tuesday, April 3, 2018 7:14 AM

    The issue I have with this is that when you work with large sets of data or large transformations, this practice essentially amounts to putting all your eggs in one basket. Your SQL Server target becomes the greatest point of failure in the pipeline. For example, if your data, transformations and even concurrency grow, the SQL Server target becomes the greatest bottleneck and requires constant upgrades to stay the relevant tool for the job.

    Good point, which is why I've also seen interim SQL Server instances perform this type of transformation and then push the results onto the target server when complete. It's certainly OK to spread the work, as you've pointed out. I didn't necessarily mean that the target server had to do all the work; I just like using SQL Server for what it's good at. I have even seen SQL Server used as a transformation tool before the data was placed onto a totally different target server such as Oracle.
