Data lake details

I have blogged before about data lakes (see What is a data lake? and Why use a data lake?), and wanted to provide more details on this popular technology, some of which I cover in my presentation “Big data architectures and the data lake”.  In short, the data lake is a storage repository, usually Hadoop, that holds a vast amount of raw data in its native format until it is needed.

The data lake should be the center of just about any big data solution for these major reasons:

  • Inexpensively store unlimited data
  • Collect all data “just in case”
  • Easy integration of differently-structured data
  • Store data with no modeling – “Schema on read”
  • Complements the enterprise data warehouse (EDW)
  • Frees up expensive EDW resources, especially for refining data
  • A Hadoop cluster offers faster ETL processing than SMP solutions
  • Quick user access to data
  • Data exploration to see if the data is valuable before writing ETL and a schema for a relational database
  • Allows use of Hadoop tools for ETL and extreme analytics
  • Place to land IoT streaming data
  • Online archive or backup for data warehouse data
  • Easily scalable
  • With Hadoop, high availability is built in
  • Allows data to be used many times for different analytic needs and use cases
  • Low-cost storage for raw data, saving space on the EDW

The data lake introduces a paradigm shift in data analysis:

OLD WAY: Structure -> Ingest -> Analyze

NEW WAY: Ingest -> Analyze -> Structure

This allows you to avoid a lot of up-front work before you are able to analyze data.  With the old way, you have to know the questions to ask.  The new way supports situations when you don’t know the questions to ask.
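
As a minimal sketch of “ingest first, structure later”, here is what that can look like in PySpark, assuming a Spark environment and raw JSON files that have already landed in the lake (the paths and column names are illustrative, not from any specific system):

    # Minimal schema-on-read sketch (PySpark); paths and columns are illustrative
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

    # Ingest: the raw files were landed as-is, with no up-front modeling
    raw = spark.read.json("/datalake/raw/clickstream/*.json")

    # Analyze: explore first to decide whether the data is worth modeling
    raw.printSchema()
    raw.groupBy("eventType").count().show()

    # Structure: only once the data proves valuable, project it into a curated shape
    curated = raw.select(
        F.col("userId").cast("long").alias("user_id"),
        F.to_timestamp("eventTime").alias("event_time"),
        F.col("eventType").alias("event_type"),
    )
    curated.write.mode("overwrite").parquet("/datalake/cleansed/clickstream/")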

This addresses the two biggest reasons why many EDW projects fail:

  • Too much time spent modeling when you don’t know all of the questions your data needs to answer
  • Wasted time spent on ETL where the net effect is a star schema that doesn’t actually show value

Data Lake layers (a small folder-layout sketch follows the list):

  • Raw data layer – Raw events are stored for historical reference. Also called the staging layer or landing area
  • Cleansed data layer – Raw events are transformed (cleaned and mastered) into directly consumable data sets. The aim is to standardize how files are stored in terms of encoding, format, data types, and content (e.g. strings). Also called the conformed layer
  • Application data layer – Business logic is applied to the cleansed data to produce data ready to be consumed by applications (e.g. a DW application, an advanced analysis process, etc.). This layer also goes by many other names: workspace, trusted, gold, secure, production-ready, governed
  • Sandbox data layer – Optional layer to “play” in. Also called the exploration layer or data science workspace
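
To make the layers concrete, here is one possible folder convention for the zones above (the zone, source, and dataset names are hypothetical, not prescribed by anything in this post):

    # Illustrative folder convention for the data lake zones; all names are hypothetical
    import datetime

    ZONES = {
        "raw": "/datalake/raw/{source}/{dataset}/{yyyy}/{mm}/{dd}/",
        "cleansed": "/datalake/cleansed/{dataset}/",
        "application": "/datalake/application/{dataset}/",
        "sandbox": "/datalake/sandbox/{user}/",
    }

    def raw_path(source: str, dataset: str, load_date: datetime.date) -> str:
        # Raw data is partitioned by load date so full history is kept for reference
        return ZONES["raw"].format(
            source=source,
            dataset=dataset,
            yyyy=f"{load_date.year:04d}",
            mm=f"{load_date.month:02d}",
            dd=f"{load_date.day:02d}",
        )

    # Example: where a sales extract landed on 2023-06-01 would be stored
    print(raw_path("erp", "sales_orders", datetime.date(2023, 6, 1)))
    # /datalake/raw/erp/sales_orders/2023/06/01/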

Be aware that you still need data governance so your data lake does not turn into a data swamp!  Having a data lake does not lessen the need for the data governance you would normally apply when building a relational data warehouse.  In fact, it usually requires more.  So look at tools such as the Azure Data Catalog to keep track of all your data.

Data typically lands in products such as the Hadoop Distributed File System (HDFS) or Azure Data Lake Store (ADLS).  Since these technologies are just storage, you need other technologies for the compute part.  For example, data in ADLS can be cleaned with compute engines such as Azure Databricks, HDInsight, or Azure Data Lake Analytics (U-SQL).
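
As a rough sketch of what that cleansing step might look like, here is a PySpark job (for example on Azure Databricks) that reads a raw CSV from ADLS Gen2, standardizes it, and writes it to the cleansed layer. The storage account, containers, and column names are placeholders, and authentication is assumed to already be configured:

    # Hedged sketch: cleaning a raw file in ADLS Gen2 with PySpark; names are placeholders
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("adls-cleansing").getOrCreate()

    raw_path = "abfss://raw@mydatalake.dfs.core.windows.net/sales/orders.csv"
    cleansed_path = "abfss://cleansed@mydatalake.dfs.core.windows.net/sales/orders/"

    orders = spark.read.option("header", True).csv(raw_path)

    cleansed = (
        orders
        .dropDuplicates(["order_id"])                       # remove duplicate rows
        .withColumn("order_date", F.to_date("order_date"))  # standardize data types
        .withColumn("country", F.upper(F.trim("country")))  # standardize string content
        .na.drop(subset=["order_id", "order_date"])         # drop unusable rows
    )

    # Store the conformed output in a uniform format (Parquet) in the cleansed layer
    cleansed.write.mode("overwrite").parquet(cleansed_path)

Writing the cleansed output in a single, uniform format such as Parquet is one way to meet the goal of the cleansed layer: consistent encoding, format, and data types for downstream consumers.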

Since you will typically have both a Data Lake and a relational data warehouse when building a big data solution, it’s important to understand the roles of each:

Data Lake/Hadoop (staging and processing environment)

  • Batch reporting
  • Data refinement/cleaning
  • ETL workloads
  • Store historical data
  • Sandbox for data exploration
  • One-time reports
  • Data scientist workloads
  • Quick results

Data Warehouse/RDBMS (serving and compliance environment)

  • Low latency
  • High number of users
  • Additional security
  • Broad support for tools
  • Easily create reports (self-service BI): A data lake is just a glorified file folder with data files in it – how many end users can accurately create reports from it?  Very few, hence a major reason to use an RDBMS

A question I get frequently is whether all relational data should be copied to the data lake, especially if some of that data is not needed for reporting or analytics in the data lake (just in the data warehouse).  There still could be reasons to copy the data to the data lake, such as for backup purposes, to use low-cost storage for raw data and save space on the data warehouse, to use Hadoop tools, or to offload the refining of the data from the data warehouse (especially if ETL jobs on the data warehouse are taking too long).  But in some cases you may want to skip the data lake, especially if you have many existing SSIS packages, since you can minimize the changes to them (e.g. just changing the destination).  Also keep in mind the extra time needed to export data out of a relational database into a flat file before copying it to the data lake.
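
To give a feel for that extra export step, here is a rough sketch (the connection string, table, and file names are hypothetical) that dumps a table to a flat file, which would then be copied into the raw layer of the lake with whatever copy tool you use (AzCopy, Azure Data Factory, the Azure CLI, etc.):

    # Hypothetical export of a relational table to a flat file before copying it to the lake
    import pandas as pd
    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=myserver;DATABASE=SalesDW;Trusted_Connection=yes;"
    )

    # Export the table to a local flat file first (this is the extra step and extra time)
    df = pd.read_sql("SELECT * FROM dbo.FactSales", conn)
    df.to_csv("FactSales.csv", index=False)
    conn.close()

    # The flat file is then copied to the raw layer of the data lake with a copy tool
    # (AzCopy, Azure Data Factory, etc.) before any further processing happens there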

One issue to be aware of is when you have to move a lot of data each day from a relational database to a data lake.  For high volumes of data, you may want to look at third-party Change Data Capture (CDC) products such as HVR and Attunity to get updates from your source system into the data lake.  Performing updates to rows of data in files sitting in a Hadoop/HDFS data lake can be very slow compared to appends.
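
One common way to work around the slow updates, sketched below under the assumption that daily CDC extracts have already landed in a staging folder (paths and column names are illustrative), is to append each day's changes as a new partition and resolve the current row per key at read time, rather than rewriting files in place:

    # Hedged sketch of an append-only CDC landing pattern (PySpark); paths are illustrative
    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.appName("cdc-append").getOrCreate()

    # Append today's changes as a new partition; nothing already in the lake is rewritten
    changes = spark.read.parquet("/staging/cdc/customers/2023-06-01/")
    (changes
        .withColumn("load_date", F.lit("2023-06-01"))
        .write.mode("append")
        .partitionBy("load_date")
        .parquet("/datalake/raw/customers/"))

    # Downstream, take the latest change per key instead of updating rows in place
    history = spark.read.parquet("/datalake/raw/customers/")
    latest = Window.partitionBy("customer_id").orderBy(F.col("load_date").desc())
    current = (history
        .withColumn("rn", F.row_number().over(latest))
        .filter("rn = 1")
        .drop("rn"))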

More info:

Data Lake Use Cases and Planning Considerations

Zones in a Data Lake
