Data lake details

I have blogged before about data lakes (see What is a data lake? and Why use a data lake?), and wanted to provide more details on this popular technology, some of which I cover in my presentation “Big data architectures and the data lake”.  In short, the data lake is a storage repository, usually Hadoop, that holds a vast amount of raw data in its native format until it is needed.

The data lake should be the center of just about any big data solution for these major reasons:

  • Inexpensively store unlimited data
  • Collect all data “just in case”
  • Easy integration of differently-structured data
  • Store data with no modeling – “Schema on read”
  • Complements the enterprise data warehouse (EDW)
  • Frees up expensive EDW resources, especially for refining data
  • A Hadoop cluster offers faster ETL processing than SMP solutions
  • Quick user access to data
  • Data exploration to see if the data is valuable before writing ETL and a schema for a relational database
  • Allows use of Hadoop tools for ETL and extreme analytics
  • Place to land IoT streaming data
  • Online archive or backup for data warehouse data
  • Easily scalable
  • With Hadoop, high availability is built in
  • Allows data to be used many times for different analytic needs and use cases
  • Low-cost storage for raw data, saving space on the EDW

The data lake introduces a paradigm shift in data analysis:

OLD WAY: Structure -> Ingest -> Analyze

NEW WAY: Ingest -> Analyze -> Structure

This allows you to avoid a lot of up-front work before you are able to analyze data.  With the old way, you have to know the questions to ask.  The new way supports situations when you don’t know the questions to ask.
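
As a minimal sketch of “ingest first, structure later”, here is what that can look like in PySpark, assuming a Spark environment and raw JSON files that have already landed in the lake (the paths and column names are illustrative, not from any specific system):

    # Minimal schema-on-read sketch (PySpark); paths and columns are illustrative
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

    # Ingest: the raw files were landed as-is, with no up-front modeling
    raw = spark.read.json("/datalake/raw/clickstream/*.json")

    # Analyze: explore first to decide whether the data is worth modeling
    raw.printSchema()
    raw.groupBy("eventType").count().show()

    # Structure: only once the data proves valuable, project it into a curated shape
    curated = raw.select(
        F.col("userId").cast("long").alias("user_id"),
        F.to_timestamp("eventTime").alias("event_time"),
        F.col("eventType").alias("event_type"),
    )
    curated.write.mode("overwrite").parquet("/datalake/cleansed/clickstream/")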

This addresses the two biggest reasons why many EDW projects fail:

  • Too much time spent modeling when you don’t know all of the questions your data needs to answer
  • Wasted time spent on ETL where the net effect is a star schema that doesn’t actually show value

Data Lake layers (a small folder-layout sketch follows the list):

  • Raw data layer – Raw events are stored for historical reference. Also called the staging layer or landing area
  • Cleansed data layer – Raw events are transformed (cleaned and mastered) into directly consumable data sets. The aim is to standardize how files are stored in terms of encoding, format, data types, and content (e.g. strings). Also called the conformed layer
  • Application data layer – Business logic is applied to the cleansed data to produce data ready to be consumed by applications (e.g. a DW application, an advanced analysis process, etc.). This layer also goes by many other names: workspace, trusted, gold, secure, production-ready, governed
  • Sandbox data layer – Optional layer to “play” in. Also called the exploration layer or data science workspace
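
To make the layers concrete, here is one possible folder convention for the zones above (the zone, source, and dataset names are hypothetical, not prescribed by anything in this post):

    # Illustrative folder convention for the data lake zones; all names are hypothetical
    import datetime

    ZONES = {
        "raw": "/datalake/raw/{source}/{dataset}/{yyyy}/{mm}/{dd}/",
        "cleansed": "/datalake/cleansed/{dataset}/",
        "application": "/datalake/application/{dataset}/",
        "sandbox": "/datalake/sandbox/{user}/",
    }

    def raw_path(source: str, dataset: str, load_date: datetime.date) -> str:
        # Raw data is partitioned by load date so full history is kept for reference
        return ZONES["raw"].format(
            source=source,
            dataset=dataset,
            yyyy=f"{load_date.year:04d}",
            mm=f"{load_date.month:02d}",
            dd=f"{load_date.day:02d}",
        )

    # Example: where a sales extract landed on 2023-06-01 would be stored
    print(raw_path("erp", "sales_orders", datetime.date(2023, 6, 1)))
    # /datalake/raw/erp/sales_orders/2023/06/01/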

Be aware that you still need data governance so your data lake does not turn into a data swamp!  Having a data lake does not lessen the need for the data governance you would normally apply when building a relational data warehouse.  In fact, it usually requires more.  So look at tools such as the Azure Data Catalog to keep track of all your data.

Data typically lands in products such as the Hadoop Distributed File System (HDFS) or Azure Data Lake Store (ADLS).  Since these technologies are just storage, you need other technologies for the compute part.  For example, data in ADLS can be cleaned with compute engines such as Azure Databricks, HDInsight, or Azure Data Lake Analytics (U-SQL).
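
As a rough sketch of what that cleansing step might look like, here is a PySpark job (for example on Azure Databricks) that reads a raw CSV from ADLS Gen2, standardizes it, and writes it to the cleansed layer. The storage account, containers, and column names are placeholders, and authentication is assumed to already be configured:

    # Hedged sketch: cleaning a raw file in ADLS Gen2 with PySpark; names are placeholders
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("adls-cleansing").getOrCreate()

    raw_path = "abfss://raw@mydatalake.dfs.core.windows.net/sales/orders.csv"
    cleansed_path = "abfss://cleansed@mydatalake.dfs.core.windows.net/sales/orders/"

    orders = spark.read.option("header", True).csv(raw_path)

    cleansed = (
        orders
        .dropDuplicates(["order_id"])                       # remove duplicate rows
        .withColumn("order_date", F.to_date("order_date"))  # standardize data types
        .withColumn("country", F.upper(F.trim("country")))  # standardize string content
        .na.drop(subset=["order_id", "order_date"])         # drop unusable rows
    )

    # Store the conformed output in a uniform format (Parquet) in the cleansed layer
    cleansed.write.mode("overwrite").parquet(cleansed_path)

Writing the cleansed output in a single, uniform format such as Parquet is one way to meet the goal of the cleansed layer: consistent encoding, format, and data types for downstream consumers.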

Since you will typically have both a Data Lake and a relational data warehouse when building a big data solution, it’s important to understand the roles of each:

Data Lake/Hadoop (staging and processing environment)

  • Batch reporting
  • Data refinement/cleaning
  • ETL workloads
  • Store historical data
  • Sandbox for data exploration
  • One-time reports
  • Data scientist workloads
  • Quick results

Data Warehouse/RDBMS (serving and compliance environment)

  • Low latency
  • High number of users
  • Additional security
  • Broad support for tools
  • Easily create reports (self-service BI): A data lake is just a glorified file folder with data files in it – how many end users can accurately create reports from it?  Very few, hence a major reason to use an RDBMS

A question I get frequently is whether all relational data should be copied to the data lake, especially if some of that data is not needed for reporting or analytics in the data lake (just in the data warehouse).  There still could be reasons to copy the data to the data lake, such as for backup purposes, to use low-cost storage for raw data and save space on the data warehouse, to use Hadoop tools, or to offload the refining of the data from the data warehouse (especially if ETL jobs on the data warehouse are taking too long).  But in some cases you may want to skip the data lake, especially if you have many existing SSIS packages, since you can minimize the changes to them (e.g. just changing the destination).  Also keep in mind the extra time needed to export data out of a relational database into a flat file before copying it to the data lake.
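
To give a feel for that extra export step, here is a rough sketch (the connection string, table, and file names are hypothetical) that dumps a table to a flat file, which would then be copied into the raw layer of the lake with whatever copy tool you use (AzCopy, Azure Data Factory, the Azure CLI, etc.):

    # Hypothetical export of a relational table to a flat file before copying it to the lake
    import pandas as pd
    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=myserver;DATABASE=SalesDW;Trusted_Connection=yes;"
    )

    # Export the table to a local flat file first (this is the extra step and extra time)
    df = pd.read_sql("SELECT * FROM dbo.FactSales", conn)
    df.to_csv("FactSales.csv", index=False)
    conn.close()

    # The flat file is then copied to the raw layer of the data lake with a copy tool
    # (AzCopy, Azure Data Factory, etc.) before any further processing happens there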

One issue to be aware of is when you have to move a lot of data each day from a relational database to a data lake.  For high volumes of data, you may want to look at third-party Change Data Capture (CDC) products such as HVR and Attunity to get updates from your source system into the data lake.  Performing updates to rows of data in files sitting in a Hadoop/HDFS data lake can be very slow compared to appends.
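
One common way to work around the slow updates, sketched below under the assumption that daily CDC extracts have already landed in a staging folder (paths and column names are illustrative), is to append each day's changes as a new partition and resolve the current row per key at read time, rather than rewriting files in place:

    # Hedged sketch of an append-only CDC landing pattern (PySpark); paths are illustrative
    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.appName("cdc-append").getOrCreate()

    # Append today's changes as a new partition; nothing already in the lake is rewritten
    changes = spark.read.parquet("/staging/cdc/customers/2023-06-01/")
    (changes
        .withColumn("load_date", F.lit("2023-06-01"))
        .write.mode("append")
        .partitionBy("load_date")
        .parquet("/datalake/raw/customers/"))

    # Downstream, take the latest change per key instead of updating rows in place
    history = spark.read.parquet("/datalake/raw/customers/")
    latest = Window.partitionBy("customer_id").orderBy(F.col("load_date").desc())
    current = (history
        .withColumn("rn", F.row_number().over(latest))
        .filter("rn = 1")
        .drop("rn"))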

More info:

Data Lake Use Cases and Planning Considerations

Zones in a Data Lake
