How to Get ETL Horribly Wrong

ETL (Extract, Transform, Load) doesn't have to be like a spell in hell. To make a success of ETL systems, you need the freedom and ability to make graceful U-turns when you detect a mistake in architecture or configuration: to fix the root problem rather than merely tackle the symptoms. Feodor lists the eight most common root causes of failure in ETL systems, and how to fix them.

There are many ways to get Extract, Transform, Load (ETL) processing wrong. We can fail at different levels, and with a whole range of subtlety. Some of the most spectacular failures have only the most tenuous trail of cause and effect back to the original root-cause error that doomed the project.

This article is based on Microsoft SQL Server technology, which means that the ETL process will typically be working with relatively small to medium amounts of data, compared to some other technologies such as Oracle.

A few words about SQL Server licensing costs:

The licenses for SQL Server are expensive enough to justify smart ETL solutions if, by being smart, you can save on license fees. The price per core and the edition limitations should certainly motivate ETL developers to deliver smart solutions. Let's take a quick look at the licensing limitations when it comes to hardware:

Feature Name | Enterprise | Business Intelligence | Standard
Maximum compute capacity used by a single instance (SQL Server Database Engine) | Operating system maximum | Limited to lesser of 4 sockets or 16 cores | Limited to lesser of 4 sockets or 16 cores
Maximum compute capacity used by a single instance (Analysis Services, Reporting Services) | Operating system maximum | Operating system maximum | Limited to lesser of 4 sockets or 16 cores
Maximum memory utilized (per instance of SQL Server Database Engine) | Operating system maximum | 128 GB | 128 GB
Maximum memory utilized (per instance of Analysis Services) | Operating system maximum | Operating system maximum | 64 GB
Maximum memory utilized (per instance of Reporting Services) | Operating system maximum | Operating system maximum | 64 GB
Maximum relational database size | 524 PB | 524 PB | 524 PB

As we can see, Enterprise edition is the only one which does not have significant CPU and memory limitations. Furthermore, if we look closely at the feature set, we will see that only Enterprise edition supports partitioning and compression, and these features are extremely handy for ETL processing. Despite this, later on we'll look into whether we can cut licensing and hardware costs by making smart architecture decisions.

Enough said about the industry and the licensing. Let's look at the most common ways that we can engineer failure into our ETL processing.

Failure #0: Not planning for volume

If we don't have a good estimate of the volumes of data that our existing ETL system will be loading a year hence, our ETL process is already in trouble. Sometimes the incoming data volumes do not change for years, in which case planning is easy. In other database applications the volume grows gradually, which is also easy to plan for. A few applications, however, are prone to sudden increases, usually due to new regulations. For example, a few years back there was a change in the EU regulations about how electricity consumption data is collected: the daily measurements had to be converted to hourly reporting, and the ETL volume increased twenty-four-fold. Even this can be planned for, because data volumes almost never increase drastically 'out of the blue'; there is almost always a 'grace period' (in the case of government regulations) or at least some signs indicating the upcoming increase. If it comes as a surprise to you, it is because warnings weren't passed on, and you weren't keeping up with the trade press.

Way out of Failure #0:

The only way out of this type of ETL Hell is by planning. If you can’t accurately predict volumes a year ahead, your ETL strategies can fail, no matter how smartly built they are.

Failure #1: Choosing the wrong hardware

I am occasionally awed by the ingenious ways in which organizations get their hardware choice entirely wrong. These errors come in a rich variety, from paying for an Enterprise license while installing only 32 GB of memory, to serious architectural misunderstandings such as the 'one-server-does-it-all' heresy (OLTP, ETL and OLAP on the same machine). The disk system is another rich opportunity for spectacular mistakes: a great way to fail in your ETL is to use a shared SAN, configured with a non-optimal RAID level.

Way out of Failure #1:

The role of a solution architect is crucial at this stage, because a sound understanding of, and follow-up on, the hardware is quite important. A solution architect will easily spot that having 32 GB of RAM on an Enterprise edition instance is an absurd and easily-solved problem. RAM is very cheap nowadays, and it is easy to check how much of it is actually available to SQL Server. Keeping as much data in memory as possible while processing is a great way to cut costs and save time.
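
As a quick, hedged sanity check (a sketch, nothing more), the following two queries show how much physical memory the machine has and how much of it SQL Server is actually allowed to use, which makes the '32 GB on an Enterprise license' situation easy to spot:

-- Physical memory visible to the operating system, and its current state
SELECT total_physical_memory_kb / 1024     AS total_physical_memory_mb,
       available_physical_memory_kb / 1024 AS available_physical_memory_mb,
       system_memory_state_desc
FROM sys.dm_os_sys_memory;

-- The ceiling SQL Server is configured to use
SELECT name, value_in_use
FROM sys.configurations
WHERE name = N'max server memory (MB)';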

A skilled solution architect will also stay away from the 'one-server-does-it-all' architecture. After all, in a distributed architecture, an ETL server is best thought of as an appliance, like a toaster, no more and no less: it does just one thing and does it well. The ETL server does not have to be clustered, and it does not have to be powerful; it just has to do its Extract, Transform and Load job properly and on time. As for disaster recovery: as long as there are daily backups and good accounting of the incoming data, then even in the case of failure any other appliance server can pick up the ETL processing temporarily while the server is restored. In reality, the ETL server is just a transitional space, and it holds no unique data which cannot easily be recovered from the original data sources.

Here is a very broad example of a distributed ETL architecture: a fairly cheap server running Standard edition, with 4 to 16 cores, 128 GB of RAM and SSDs for storage. The role of this server is to extract the data, or receive the flat files to be processed; the data is then imported, transformed, cleaned and prepared for the dimension and fact loads, and finally loaded to staging tables on the actual production data warehouse server, where it is merged or partition-switched into the data warehouse.
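
To make the final hop concrete, here is a minimal sketch of the partition-switch step, assuming the warehouse fact table is partitioned by a date key and that the staging table has already been bulk loaded on the warehouse server. All object names and boundary values are hypothetical, and partitioning itself presumes Enterprise edition on that server, as noted earlier:

-- The staging table must match the structure, indexes and filegroup of the target partition.
-- The check constraint proves to SQL Server that every staged row belongs to that partition.
ALTER TABLE dbo.FactSales_Staging
    WITH CHECK ADD CONSTRAINT CK_FactSales_Staging_DateKey
    CHECK (DateKey >= 20250101 AND DateKey < 20250201);

-- The switch itself is a metadata-only operation, however many rows have been staged.
ALTER TABLE dbo.FactSales_Staging
    SWITCH TO dbo.FactSales PARTITION $PARTITION.PF_FactSalesDate(20250101);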

Whether this works really depends on whether we can afford to be smart enough in our data loads to use just a small Standard edition SQL Server for the loading and cleansing (SQL Server's out-of-the-box tools for master data and data cleansing are not available in Standard edition).

Finally, the storage of the ETL server is a critical component, and it can easily become a bottleneck. SSDs nowadays are cheaper than SANs and are reliable enough. To some extent, SSDs win back some of the time we lose in Standard edition by not having partitioning; and as for the missing compression, SSDs are cheap enough, and large enough, to serve an ETL server well without it. Since the ETL server is not the primary storage for any data, but merely holds copies of data from other sources while it is being transformed and loaded to the destination, the logical choice is to invest in speed rather than in making the server over-redundant.

Failure #2: Not understanding the software

It is very easy to fall into this trap, and very hard to get out of it once the implementation is in place.

SQL Server works best with a small range of 'bulk copy' tabular data formats that it can ingest rapidly. It is, however, obliging enough to handle a wide variety of other formats, though far more slowly. Even though there is probably a way to import any piece of data into SQL Server, some ways work much better than others.

The problem begins with a poor transfer convention between the data source and the ETL server. The software can probably do it all, but if our goal is great performance, then we need to find a convention between the data source and the data import which will deliver it.

Let's suppose that our data comes from a third-party system that supports a variety of text file exports, with or without header and footer rows (used to verify the completeness of the file): CSV files, XML files, JSON, Excel and so on.

You might agree on using a CSV file which has a header and a footer row, and you might make the mistake of handling it like this (in meta code):

Import the entire flat file into a temp table, in a single column, row for row

Write a query which selects the data from the temp table, with a WHERE clause that uses NOT LIKE '%….%' against a specific string in order to exclude the first and last rows

Use the above query to shred the lines into columns and insert them into another temp table

Even though the above example might work fairly well for 1000 rows, it certainly won’t scale well, and it will be painfully slow with 100,000 rows.
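
To make the meta code above concrete, here is a minimal T-SQL sketch of the anti-pattern, with hypothetical file, table and marker names and a three-column file; every single line is dragged through pattern matching and string shredding, which is exactly what makes it crawl at 100,000 rows:

-- Step 1: the whole file lands in one wide column
CREATE TABLE #RawLines (RawLine NVARCHAR(4000));

BULK INSERT #RawLines
FROM '\\etl01\inbox\orders_extract.csv'
WITH (ROWTERMINATOR = '\n');

-- Steps 2 and 3: filter out the header and footer rows by pattern matching,
-- then shred every remaining line into columns with string functions
CREATE TABLE #Shredded (OrderId NVARCHAR(50), OrderDate NVARCHAR(50), Amount NVARCHAR(50));

INSERT INTO #Shredded (OrderId, OrderDate, Amount)
SELECT SUBSTRING(RawLine, 1, CHARINDEX(',', RawLine) - 1),
       SUBSTRING(RawLine, CHARINDEX(',', RawLine) + 1,
                 CHARINDEX(',', RawLine, CHARINDEX(',', RawLine) + 1) - CHARINDEX(',', RawLine) - 1),
       SUBSTRING(RawLine, CHARINDEX(',', RawLine, CHARINDEX(',', RawLine) + 1) + 1, 4000)
FROM #RawLines
WHERE RawLine NOT LIKE 'HDR%'
  AND RawLine NOT LIKE 'TRL%';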

Way out of Failure #2:

Don't be afraid to talk to your counterparts in charge of the third-party system. Agree on the best way to export the files (if there is really no way for the ETL server to connect to the source system and read the data directly), and, finally, select the best way to import them.

In the above fictitious example, a much better approach would be to agree with the vendor on a specific convention: the first and last rows contain a static string which is the same for every file. This way we can use a Script component in our SSIS package to ignore the first and last rows when they exactly match that static string, and we can process the data much faster than by inspecting every row to determine whether it is the first or the last.

Failure #3: Avoiding parallelism in data import

Another great way to waste time and resources in ETL processing is to misunderstand how SSIS imports work, and to use only one of your server's many CPUs for the job.

By design, a single DTExec.exe invocation runs one package in one process, which means that if we have only one large flat file to import and a single SSIS package processing it, we will be wasting time and leaving a lot of our resources underutilized.

Way out of Failure #3:

A good way to use parallelism during data imports is to agree with the data vendor to break the data file into X equal parts (X being the number of cores in our system), and then create a piece of code which monitors the folder where the flat files are delivered and runs as many DTExec.exe processes in parallel as there are files and cores. This way we instantiate the same SSIS package several times and run it in parallel against different flat files, utilizing all the cores of our ETL server; ideally, we get a data load that is close to X times faster. If you're avoiding parallelism because of complex interdependencies in the data, there are plenty of solutions that still allow parallel threads.
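
As a hedged sketch of one way to launch those parallel runs: if the package is deployed to the SSIS catalog (SQL Server 2012 or later), we can start one execution of the same package per file part directly from T-SQL; catalog.start_execution is asynchronous by default, so the executions run side by side rather than one after another. The folder, project, package and parameter names below are hypothetical.

DECLARE @files TABLE (FileName NVARCHAR(260));
INSERT INTO @files
VALUES (N'\\etl01\inbox\orders_part01.csv'),
       (N'\\etl01\inbox\orders_part02.csv'),
       (N'\\etl01\inbox\orders_part03.csv'),
       (N'\\etl01\inbox\orders_part04.csv');

DECLARE @file NVARCHAR(260), @execution_id BIGINT;

DECLARE file_cursor CURSOR LOCAL FAST_FORWARD FOR SELECT FileName FROM @files;
OPEN file_cursor;
FETCH NEXT FROM file_cursor INTO @file;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- One execution of the same package per file part
    EXEC SSISDB.catalog.create_execution
         @folder_name  = N'ETL',
         @project_name = N'Loader',
         @package_name = N'ImportOrdersFile.dtsx',
         @execution_id = @execution_id OUTPUT;

    EXEC SSISDB.catalog.set_execution_parameter_value
         @execution_id,
         @object_type     = 30,   -- package-level parameter
         @parameter_name  = N'InputFileName',
         @parameter_value = @file;

    -- Asynchronous by default, so all the executions run in parallel
    EXEC SSISDB.catalog.start_execution @execution_id;

    FETCH NEXT FROM file_cursor INTO @file;
END;

CLOSE file_cursor;
DEALLOCATE file_cursor;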

Failure #4: Using a single hop for data flow

A classic example of a failure is to use a single hop for the data flow; that is, we import the data into a table (with string datatypes for all columns), then create a view over this data that converts it to the correct datatypes, and use this view later on in the data flow pipeline to do joins.

Views are great, but certainly not for ETL processing of large amounts of data.

Way out of Failure #4:

Storage nowadays is cheap, even SSDs, and we can get much better performance from persisted data than from views for ETL.

In other words, after loading the data into the staging table for the first time, we could make another hop and persist the data as we convert it to the proper datatypes, run our data cleansing checks, and apply the proper indexes to the persisted table. This way, we can take advantage of the data structures themselves, heaps and B-trees, and use their performance advantages to the fullest.
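
Here is a hedged sketch of that second hop, with hypothetical table and column names: the raw, all-string staging table is persisted once as a properly typed table, and the index that the downstream joins need is built on it (TRY_CONVERT assumes SQL Server 2012 or later):

-- Second hop: persist the staged strings as real datatypes
IF OBJECT_ID(N'dbo.Orders_Typed') IS NOT NULL
    DROP TABLE dbo.Orders_Typed;

SELECT TRY_CONVERT(INT,            OrderId)   AS OrderId,
       TRY_CONVERT(DATE,           OrderDate) AS OrderDate,
       TRY_CONVERT(DECIMAL(18, 2), Amount)    AS Amount
INTO dbo.Orders_Typed
FROM dbo.Orders_RawStage;   -- the all-string first-hop table

-- Rows that failed conversion show up as NULLs and can be routed to a reject table here

-- Index for the joins that the dimension and fact loads will do next
CREATE CLUSTERED INDEX CIX_Orders_Typed_OrderDate
    ON dbo.Orders_Typed (OrderDate, OrderId);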

In a later article, we will explore the specific advantages of heaps and clustered indexes in the context of performance tuning, and we will see specific examples of when one is better than the other.

Failure #5: Misusing temporary tables for ETL loads

Yet another way to slow down our ETL processing would be to use temporary tables to store intermediary results of the hops.

Temporary tables have their fair share of usefulness, but when it comes to ETL loads they tend to have disadvantages. After all, they are temporary: they get destroyed sooner or later, they are hard to index, and it takes time and effort to re-index them properly every time they are created. They can be compressed and they can have statistics, but the idea that there is any benefit in building a solid ETL process on temporary objects seems somehow strange.

Way out of Failure #5:

Use persisted objects such as tables or indexed views. It is much easier to control compression, index creation, statistics maintenance and execution plans with them than with temporary objects.
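
For example, as a sketch with hypothetical names: instead of recreating a #temp table on every run, the intermediate result can live in a permanent work table that is created once, with its compression and indexing defined up front, and simply truncated and reloaded on each run (page compression presumes Enterprise edition, as noted earlier):

-- Created once, not on every run
CREATE TABLE dbo.Work_OrdersIntermediate
(
    OrderId   INT            NOT NULL,
    OrderDate DATE           NOT NULL,
    Amount    DECIMAL(18, 2) NOT NULL
)
WITH (DATA_COMPRESSION = PAGE);

CREATE CLUSTERED INDEX CIX_Work_OrdersIntermediate
    ON dbo.Work_OrdersIntermediate (OrderDate, OrderId);

-- Each load run reuses the same object: truncate, then reload
TRUNCATE TABLE dbo.Work_OrdersIntermediate;

INSERT INTO dbo.Work_OrdersIntermediate (OrderId, OrderDate, Amount)
SELECT OrderId, OrderDate, Amount
FROM dbo.Orders_Typed;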

Failure #6: Misunderstanding the significance of parallelism

Very often I have seen environments (both OLTP and OLAP) where parallelism is disabled at the instance level. This is most often caused by the panic induced in the people monitoring the servers by high CPU utilization stats. The panic dictates a query tuning session, which usually leads to the conclusion that the CPU does not spike as much if the SQL Server instance runs with MAXDOP set to 1.

In reality, parallelism is one of the best features of SQL Server, and it is one of the reasons why users pay for SQL Server licenses; after all, there are free database engines which have not yet figured out how to do parallel query plans.

I am not sure when or how this wave of 'turn off parallelism' started, but it is important to note that even if the entire system seems to be under less CPU pressure when parallelism is disabled, this does not mean that things are moving faster, especially in a data warehousing environment.

Way out of Failure #6:

Tune the system so that as many CPUs as possible are utilized, for the fastest data processing. Parallelism is your best friend in data warehousing, and many people have worked millions of man-hours to make it possible.

In very general terms, if disabling parallelism 'fixes' the CPU pressure issue, this usually means that there is serious contention in the disk subsystem or in the memory subsystem.

Most often, parallel query execution plans suffer from inaccurate statistics combined with a high-latency disk system, which makes the threads wait on each other.
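
A hedged way to test that hypothesis before reaching for MAXDOP 1 is to look at what the instance is actually waiting on: if PAGEIOLATCH_* or RESOURCE_SEMAPHORE waits dominate alongside CXPACKET, the parallel threads are mostly waiting on the disks or on memory grants rather than genuinely fighting each other for CPU.

-- Relevant waits since the last restart (or since the wait stats were cleared)
SELECT wait_type,
       waiting_tasks_count,
       wait_time_ms / 1000.0                         AS wait_time_s,
       (wait_time_ms - signal_wait_time_ms) / 1000.0 AS resource_wait_s
FROM sys.dm_os_wait_stats
WHERE wait_type IN (N'CXPACKET',
                    N'PAGEIOLATCH_SH', N'PAGEIOLATCH_EX',
                    N'RESOURCE_SEMAPHORE', N'SOS_SCHEDULER_YIELD')
ORDER BY wait_time_ms DESC;

-- The instance-level parallelism settings currently in effect
SELECT name, value_in_use
FROM sys.configurations
WHERE name IN (N'max degree of parallelism', N'cost threshold for parallelism');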

Failure #7: Misunderstanding the significance of hash joins

Monitoring the performance of queries that operate on large sets of data is often a neglected art: execution plans are either not monitored regularly, or not monitored at all. In reality, for large data sets hash joins are much better suited than nested loop or merge joins, which will carry out the task but do not scale anywhere near as well.

Although indexes are important, it is very easy to fall into the trap of having skewed statistics. Without accurate statistics, the query optimizer cannot produce a plan that performs well. Especially close attention is needed in an ETL system, where the data changes often and statistics become inaccurate more often than in any other kind of system.

Way out of Failure #7:

Hash joins scale quite well and they parallelize well, which gives us an opportunity for the best possible throughput.

Hash joins can perform poorly if the memory allocated for them is estimated inaccurately; the optimizer bases the memory grant on the available statistics, and if these are skewed and the allocated memory turns out to be insufficient (additional memory cannot be allocated at runtime), some of the operations spill to the disk subsystem. This is not so bad if tempdb is placed on SSDs, but in the shared SAN scenario mentioned above, the query performance will be poor and unpredictable.
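
A hedged mitigation sketch, with hypothetical table names: refresh the statistics on the large staging and fact tables with a full scan as soon as a load finishes, and check how stale they were, so that the optimizer sizes the hash join memory grants from numbers that reflect the data that was just loaded (sys.dm_db_stats_properties assumes SQL Server 2012 SP1 or later):

-- After the load completes, refresh the statistics the join queries depend on
UPDATE STATISTICS dbo.Orders_Typed WITH FULLSCAN;
UPDATE STATISTICS dbo.FactSales    WITH FULLSCAN;

-- How stale were they? Last update time, sampled rows and modifications per statistic
SELECT s.name AS stats_name,
       sp.last_updated,
       sp.rows,
       sp.rows_sampled,
       sp.modification_counter
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE s.object_id = OBJECT_ID(N'dbo.FactSales');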

Conclusion:

Some mistakes will make the ETL processes fail, and cost a great deal of time and money to put right. These mistakes are generally easy to avoid if active steps are taken at the right time.

The best way of dealing with a failing ETL is to remedy the architectural and design mistakes rather than to approach the problem by overspending on licenses and on hardware to compensate. This latter approach is merely disguising the symptoms rather than tackling the root cause.