Modern Data Warehouse & Reverse ETL


An extension to the Modern Data Warehouse (MDW) that I have heard a bit about lately is called “Reverse ETL”. Before I describe what that is, first I wanted to give a quick review of a typical MDW, which consists of five stages:

  1. Ingest: Data is ingested from multiple sources via ELT
  2. Store: The ingested data is stored in a data lake in a raw layer in the format that it came from the source
  3. Transform: The data is then cleaned and written to a cleaned layer in the data lake, and then joined and/or aggregated and copied into a presentation layer in the data lake
  4. Model: Some of the data is then copied into a relational database in third normal form (3NF) and/or a star schema
  5. Visualize: The data is then reported off of

There are many variations, additions, and exceptions to these stages and multiple products/tools can be used at each stage. In the Azure world, typically Azure Data Factory is used to ingest data, Azure Data Lake Storage Gen2 is used to store the data, mapping data flows in Azure Data Factory is used to transform the data, Azure Synapse Analytics is used to model the data, and Power BI is used to visualize the data. Of course, there are many variations on the tools you can use. I’ll post a video in the next few weeks that will discuss this in more detail.

“Reverse ETL” is the process of moving data from a modern data warehouse into third party systems to make the data operational. Traditionally data stored in a data warehouse is used for analytical workloads and business intelligence (i.e. identify long-term trends and influencing long-term strategy), but some companies are now recognizing that this data can be further utilized for operational analytics. Operational analytics helps with day-to-day decisions with the goal of improving the efficiency and effectiveness of an organization’s operations. In simpler terms, it’s putting a company’s data to work so everyone can make better and smarter decisions about the business. As examples, if your MDW ingested customer data which was then cleaned and mastered, that customer data can then by copied into multiple SaaS systems such as Salesforce to make sure there is a consistent view of the customer across all systems. Customer info can also be copied to a customer support system to provide better support to that customer by having more info about that person, or copied to a sales system to give the customer a better sales experience. As a last example, you can identify at-risk customers by surfacing customer usage data in a CRM.

Companies are building key definitions in SQL on top of the data warehouse such as Lifetime Value (LTV), Product Qualified Lead (PQL), propensity score, customer health, etc. Yes, you can easily create reports and visualizations using this data in BI tools or SQL, but these insights can be much more powerful if they drive the everyday operations of your teams across sales, marketing, finance, etc. in the tools they live in. Instead of training sales reps to use the BI reports you built from the MDW, the data analyst can operationalize their analysis by feeding lead scores from the data warehouse into, for example, a custom field in Salesforce.  As another example, say your data science team calculates a propensity score on top of the data warehouse or data lake describing the user’s likelihood of buying a product. Using Reverse ETL, you can move the propensity score to a operational production database to serve customers personalized in-app experiences in real time.

Instead of writing your own API connectors from the data warehouse to SaaS products to pipe the data into operational systems like Salesforce and dealing with all the mapping of fields, reverse ETL solutions have appeared which offer out of the box connectors to numerous systems. They provide the mapping to the SaaS products and allow you to continuously sync or define what triggers the syncing between the two systems. I see reverse ETL products with extensive and easy-to-use mapping capabilities to the many SaaS products such as Salesforce, Hubspot, Marketo, and Zendesk as a big advantage over trying to use Azure Data Factory and coding your own.

There are now a handful of startups building reverse ETL products including HightouchCensusGrouparoo (open source), HeadsupPolytomic, RudderStack, and Seekwell. It will be interesting to see how these product evolve and if reverse ETL becomes popular.

More info:

Reverse ETL — A Primer

Reverse ETL is Just Another Data Pipeline

What is Reverse ETL?

Reverse ETL: A new category emerges (video)

What is Reverse ETL and Why it is Taking Off

The post Modern Data Warehouse & Reverse ETL first appeared on James Serra's Blog.

Original post (opens in new tab)
View comments in original post (opens in new tab)