Where should I clean my data?


As a follow-up to my blogs What product to use to transform my data? and Should I load structured data into my data lake?, I wanted to talk about where you should you clean your data when building a modern data warehouse in Azure.  As an example, let’s say I have an on-prem SQL Server database and I want to copy one million rows from a few tables to a data lake (ADLS Gen2) and then to Azure SQL DW, where the data will be used to generate Power BI reports (for background on a data lake, check out What is a data lake? and Why use a data lake? and Data lake details).  There are five places that you could clean the data:

  1. Clean the data and optionally aggregate it as it sits in source system.  The tool used for this would depend on the source system that stores the data (i.e. if SQL Server, you would use stored procedures).  The only benefit with this option is if you aggregate the data, you will move less data from the source system to Azure, which can be helpful if you have a small pipe to Azure and don’t need the row-level details.  The disadvantages are: the raw source data is not available in the data lake, so you would always need to go back to source system if you needed to get it again, and it may not even still exist in the source system; you would put extra stress on the source system when doing the cleaning which could affect end users using the system; it could take a long time to clean the data as the source system may not have fast performance; and you would not be able to use other tools (i.e. Hadoop, Databricks) to clean it.  Strongly advise against this option
  2. Clean data as it goes from source to the data lake.  The products that could be used for this include SSIS, Azure Data Factory (ADF) Data Flows (renamed to ADF Mapping Data Flows), Databricks, or Power BI Dataflow.  Note that ADF Mapping Data Flows can only be used against four data sources: Azure SQL DB, Azure SQL DW, Parquet (from ADLS & Blob), and Delimited Text (from ADLS & Blob).  Think of these four data sources as staging areas for the data to be used by ADF Mapping Data Flows.  The disadvantages of this option are: It will cause additional time pulling data from the source system which could affect performance of end users using the system; and you won’t have the raw data in the data lake.  If using SSIS or Power BI Dataflow it would cause nonparallel line-by-line transformations (see Difference between ETL and ELT) which could be very slow when cleaning many rows of data.  See below for how SSIS compares to ADF Mapping Data Flows and Databricks.  Strongly advise against this option
  3. Clean data after it has landed into data lake.  You land the data into a raw area in the data lake, clean it, then write it to a cleaned area in the data lake (so you have multiple data lake layers such as raw and cleaned), then copy it to SQL DW via Polybase, all of which can be orchestrated by ADF.  This is the preferred option as it results in having the raw data in the data lake and minimizes the time hitting the source system, as well as saving costs as opposed to cleaning the data in a data warehouse (see Reduce Costs By Adding A Data Lake To Your Cloud Data Warehouse).  This will also be the way to clean the data the fastest.  ADF makes it real easy to move data from a source system (supports 79 sources) to the data lake (ADLS Gen2) by creating an ADF pipeline that uses the Copy Activity with a Self-Hosted Integration Runtime connected to your on-prem SQL Server.  You could land the data in the data lake in Parquet format with a high Data Integration Units (DIU) setting to make it super-fast.  ADF also makes it real easy via the Copy Activity to copy the cleaned data in the data lake to 23 destinations (“sinks”) such as SQL DW.  In addition, you could copy the cleaned data in the data lake to an MDM solution to master it, then write it back to the data lake in another layer called the mastered layer.  The cleaned area is also where you could join the data from various files together or split them apart for security reasons (i.e. having a file with rows from different countries split into multiple files for each country).  Products that could be used to clean the data: ADF Mapping Data Flows, Databricks, or HDInsight.  See below for how ADF Mapping Data Flows compares to Databricks
  4. Clean data as it moves from raw area in data lake to SQL DW.  The disadvantages of using this option is you won’t have the clean data in the data lake, it would do row-by-row transformations which are slow, and some products don’t use Polybase so it would be slow to load the data into SQL DW.  Products that can be used for this include ADF Mapping Data Flows or Databricks using the SQL DW Connector (which uses Polybase).  Strongly advise against this option
  5. Copy data from raw area in data lake to SQL DW then clean it via stored procedures in SQL DW. The disadvantages of using this option is it is more costly as SQL DW costs more than Databricks, the data transformations can affect users running queries on SQL DW, and the cleaned data won’t be available in the data lake for others to use.  You would use ADF to copy the data from the raw data area into staging in SQL DW, which would use Polybase.  Then execute stored procedures that would clean the data and copy it to production tables.  Strongly advise against this option

Be aware this is not a “one size fits all” as you may be building a solution that is moving lots of data from many different data sources as there may be a handful of use cases within that solution where it may be faster and/or easier to clean the data as it moves to or from the data lake.

A bit about how ADF Mapping Data Flows works “under the covers”.  ADF Mapping Data Flows is a big deal as it brings GUI-based design together with scale.  It uses the Azure Databricks compute engine under the covers:  ADF JSON code is converted to the appropriate code in the Scala programming language and will be prepared, compiled and executed in Azure Databricks which will automatically scale-out as needed.  The distribution of data across worker/compute nodes in Databricks is where the optimizations come in: each of the transformations in the data flow has an “optimize” tab that lets you control how the data gets partitioned across the worker nodes.  If your source is Azure SQL DB, it’ll be using the out-of-the-box JDBC driver, which means the Databricks workers are accessing Azure SQL DB directly.  There are slight exceptions – if Azure SQL Data Warehouse is the source/destination, then it can land the data temporarily in blob storage (not the hierarchical namespace / ADLS Gen2) before Polybasing it in.  If the source/destination can Polybase, you set a staging linked service/container in the Data Flow activity settings.

How does ADF Mapping Data Flows compare to using SSIS?  In SSIS, you would create an SSIS Data Flow Task to do, for example, a Lookup and a Data Conversion on the one million rows.  This could be slow because it would have to copy the data from SQL Server in batches (usually 10,000 rows) to the SSIS server (so your dependent on the size of the SSIS server), then would go row-by-row updating the data, and landing the batch to a cleaned layer in ADLS Gen2 before doing the next one.  With ADF Mapping Data Flows, you create an ADF pipeline that uses the Copy Activity to copy the one million rows from SQL Server to a raw area in ADLS Gen2, then create a Data Flow activity in the ADF pipeline to do the transformations (see Azure Data Factory Data Flow), which behind-the-scenes fires up Databricks, puts the data in a Spark in-memory DataFrame across the workers, and runs Scala code on the DataFrame to update the rows.  This is very fast because no data moves once it is the the data lake, and because Databricks can be scaled for the transformations (it’s still doing row-by-row iterations, but it’s doing them in parallel according to the number of workers/cores and the way the data is distributed across those workers).  After the transformations are done, you would use the ADF Mapping Data Flow Sink Transformation to save this data in a cleaned layer in ADLS Gen2.  So the end result is ADF Mapping Data Flows is likely to be much faster than SSIS.

What is the difference between using Databricks vs ADF Mapping Data Flows?  ADF Mapping Data Flows is much easier to use because of its drag-and-drop interface, while Databricks is all code-based.  However, use Databricks if you like notebooks, want to use ML tools, and want more monitoring.  Performance when using ADF Mapping Data Flows would be the same as using Databricks separately or when using the Databricks Notebook activity in ADF unless you used a different partitioning scheme (controlled by the “optimize” tab for ADF Mapping Data Flows) or cluster size (default is 8 cores of general compute for ADF Mapping Data Flows – the number of worker nodes is dictated by the number of cores you use –  the driver node is assigned 4 cores with the rest assigned to workers), but also depends on the startup time for Databricks when using an ADF Mapping Data Flow (which can take five minutes) compared to using a separate Databricks cluster that is already running (which is instantaneous).  Also keep in mind that Databricks allows you to build a generic solution that could clean data in many files, while ADF Mapping Data Flows requires a package per object.

More info:

ADF Data Flow tutorial videos

ADF Mapping Data Flow Patterns

Original post (opens in new tab)
View comments in original post (opens in new tab)