
What is a data lake?


A “data lake” is a storage repository, usually in Hadoop, that holds a vast amount of raw data in its native format until it is needed.  It’s a great place for investigating, exploring, experimenting with, and refining data, in addition to archiving it.  There are various products you can use to build a data lake, such as Microsoft’s Azure Data Lake.  Data lakes are becoming much more important now that there are so many data sources companies can use to make better business decisions, such as social networks, review web sites, online news, weather data, web logs, and sensor data.  All of these “big data” sources result in rapidly increasing data volumes and new data streams that need to be analyzed.  Some characteristics of a data lake include:

  • A place to store unlimited amounts of long-term data in any format inexpensively, since Hadoop storage typically costs much less than data warehouse storage
  • Allows collection of data that you may or may not use later: “just in case”
  • Allows for easy integration of structured data, semi-structured data (e.g. XML, HTML), unstructured data (e.g. text, audio, video), and machine-generated data (e.g. sensor data)
  • A way to describe any large data pool in which the schema and data requirements are not defined until the data is queried: “just in time” or “schema on read” (see the sketch after this list)
  • Complements an Enterprise Data Warehouse (EDW) and can be seen as a data source for the EDW – capturing all data but only passing relevant data to the EDW
  • Frees up expensive EDW resources (storage and processing), especially for data refinement
  • Exploits the full power of a Hadoop cluster to speed up ETL processing compared to SMP data warehouse solutions
  • Allows data exploration to be performed without waiting for the EDW team to model and load the data; exploration may also reveal that the data is not useful, saving the EDW team from wasting resources on it
  • An excellent area to run extreme analytics, such as running millions of scoring models concurrently on millions of accounts to detect fraud on credit cards, which is typically not a workload you would see running in a data warehouse
  • A place to land streaming data, for example from IoT devices or Twitter.  This data can also be analyzed during ETL processing (e.g. scoring Twitter sentiment)
  • An online archive for data warehouse data that is no longer analyzed on a frequent basis
  • Some processing is better done in Hadoop than in ETL tools like SSIS
  • Also called a bit bucket, staging area, landing zone, or enterprise data hub (Cloudera’s term)
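
To make “schema on read” concrete, here is a minimal sketch using PolyBase-style T-SQL (a technology this post comes back to later) to lay a schema over raw files already sitting in the lake.  The data source address, file format, and the dbo.WebLogs table with its columns are all hypothetical; the raw files are never copied or modified, and the schema is only applied when the table is queried:

```sql
-- Point SQL Server (with PolyBase) at the Hadoop cluster holding the lake.
-- The name and namenode address are hypothetical.
CREATE EXTERNAL DATA SOURCE HadoopLake
WITH (TYPE = HADOOP, LOCATION = 'hdfs://namenode:8020');

-- Describe how the raw files are delimited; the files themselves are untouched.
CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = ','));

-- "Schema on read": the schema is declared here, at query time,
-- not when the files were landed in /raw/weblogs/.
CREATE EXTERNAL TABLE dbo.WebLogs (
    LogDate    DATETIME2,
    CustomerId INT,
    Url        NVARCHAR(400)
)
WITH (LOCATION = '/raw/weblogs/',
      DATA_SOURCE = HadoopLake,
      FILE_FORMAT = CsvFormat);

-- Ordinary SQL now reads the raw files in place.
SELECT TOP 10 * FROM dbo.WebLogs;
```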

To show the benefit of a data lake with some pictures, here is the traditional approach for a data warehouse environment:

[Diagram: the traditional data warehouse environment]

But as we introduce larger data volumes into this environment along with non-relational data, we run into problems:

[Diagram: the same environment strained by larger data volumes and non-relational data]

The impact if we keep the current architecture:

  • Useful data is dropped by introducing ETL “bias”
  • Potentially insightful data is lost
  • Latency grows as volumes increase and sources change
  • Data is duplicated across staging environments to support ETL
  • Expensive “reactive” hardware is needed to keep up with processing scale requirements

So we take a new approach: the non-relational data is copied to a data lake, refined there, and then copied to the data warehouse.  Meanwhile, much of the relational data can keep feeding the data warehouse directly through the existing ETL, bypassing the data lake:

[Diagram: non-relational data flowing into the data lake for refinement and then to the data warehouse, with relational data fed directly via existing ETL]

Note the use of ELT instead of ETL: the data is loaded into the data lake and then processed.  This can speed up transformations, as the data lake usually sits on a Hadoop cluster that can transform data much faster than an ETL tool.  Many data warehouse solutions that use MPP technology have already switched to ELT: they load data into staging tables in the MPP appliance and then transform it there to take advantage of parallel SQL processing (see the sketch after the list below).  By changing the architecture to fit the analysts’ needs, we get the following benefits:

  • Entire “universe” of data is captured and maintained
  • Mining of data via transformation on read leaves all data in place
  • Refineries leverage the power of the cloud and traditional technologies
  • Integration with traditional data warehousing methodologies
  • Scale can be pushed to cloud for more horsepower
  • Orchestration of data is a reality (less rigid, more flexible, operational)
  • Democratization of predictive analytics, data sets, services and reports
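
Here is a minimal sketch of that ELT staging pattern in T-SQL.  The stg.Sales staging table, dw.FactSales fact table, and dw.DimStore dimension (and their schemas) are all hypothetical; the point is that the transformation runs as set-based SQL inside the warehouse, where the MPP engine can parallelize it:

```sql
-- E and L: land the extract as-is in a staging table (no transformation yet).
CREATE TABLE stg.Sales (
    SaleId    INT,
    SaleDate  VARCHAR(20),   -- kept as raw text, exactly as extracted
    Amount    VARCHAR(20),
    StoreCode VARCHAR(10)
);
-- ...bulk load into stg.Sales here (e.g. BULK INSERT or PolyBase)...

-- T: transform inside the warehouse, where parallel SQL does the heavy lifting.
INSERT INTO dw.FactSales (SaleId, SaleDateKey, AmountUSD, StoreKey)
SELECT s.SaleId,
       CONVERT(INT, CONVERT(CHAR(8), TRY_CONVERT(DATE, s.SaleDate), 112)),
       TRY_CONVERT(DECIMAL(12, 2), s.Amount),
       d.StoreKey
FROM stg.Sales AS s
JOIN dw.DimStore AS d
  ON d.StoreCode = s.StoreCode;
```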

A question I hear a lot is “Should we store relational data in the data lake?”.  In most cases it is not necessary to copy relational source data into the data lake and then into the data warehouse, especially given the effort of migrating existing ETL jobs that already copy source data into the data warehouse.  But there are some good use cases for doing just that:

  1. Wanting to offload the data refinement to Hadoop, so the processing and space on the EDW is reduced
  2. Wanting to use Hadoop technologies/tools to refine/filter data types that your EDW does not support (e.g. JSON, images, video)
  3. Landing zone for unstructured data, as it can ingest large files quickly and provide data redundancy
  4. ELT jobs on the EDW are taking too long because of increasing data volumes and an increasing rate of ingestion (velocity), so offload some of them to the Hadoop data lake
  5. There may be cases when you want to move EDW data to Hadoop, refine it, and move it back to the EDW (to offload processing, or to use Hadoop tools)
  6. The data lake is a good place for data that you “might” use down the road.  You can land it in the data lake and have users query it with SQL via PolyBase to determine whether it has value

Note there are technologies, such as PolyBase, that allow end users to query data in a data lake using regular SQL, so they are not required to learn any Hadoop-related technologies.  In fact, PolyBase allows the end user to use SQL, or any reporting tool that uses SQL, to join data in a relational database with data in a Hadoop cluster.
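
As a hedged example, the query below joins a regular relational table to the hypothetical dbo.WebLogs external table from the earlier schema-on-read sketch; dbo.Customers is likewise a made-up table name:

```sql
-- Join relational customer data to raw web logs living in Hadoop.
-- PolyBase resolves dbo.WebLogs against the lake at query time;
-- the report writer just sees ordinary SQL.
SELECT c.CustomerName,
       COUNT(*) AS PageViews
FROM dbo.Customers AS c
JOIN dbo.WebLogs   AS w
  ON w.CustomerId = c.CustomerId
WHERE w.LogDate >= DATEADD(DAY, -30, SYSUTCDATETIME())
GROUP BY c.CustomerName
ORDER BY PageViews DESC;
```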

As the diagram below shows, data from all types of sources can be brought into the data lake and transformed/filtered/cleaned/refined.  Within the data lake you will likely want to have various stages of the data, such as raw and cleaned.  There should also be governance on the data in the data lake, something I see many companies skipping (see Apache Atlas Project Proposed for Hadoop Governance).

The cleaned data is then copied to the appropriate analytic platform, as some analytical workloads go beyond the traditional ones seen in data warehouses and are better processed on platforms suited to those kinds of workloads, such as analysis of data in motion and graph analysis.  The spectrum of analytical workloads is now so broad that they cannot all be handled by a single enterprise data warehouse.  But also keep in mind that production reports can be generated right from the data lake, in addition to the data lake being an area for exploratory analysis.

[Diagram: data from all source types landing in the data lake, moving through raw and cleaned stages, and feeding downstream analytic platforms]

More info:

  • Hadoop and Data Warehouses
  • The Modern Data Warehouse
  • Analysts Warn of Data Lake ‘Fallacy’ in Big Data Analytics
  • Make Sure Your Data Lake is Both Just-in-Case and Just-in-Time
  • Top Five Differences between Data Lakes and Data Warehouses
  • Hadoop vs Data Warehouse: Apples & Oranges?
  • Martin Rennhackkamp: Data Lake, Data Lake vs Data Warehouse, Building a Data Lake, Revolt against the Data Lake as Staging Area
  • Design Tip #174 Does Your Organization Need an Analytic Sandbox?
  • The Data Lake: A Brief SWOT Analysis
  • The Hadoop Data Refinery and Enterprise Data Hub
  • Gartner Says Beware of the Data Lake Fallacy
  • Data Lake vs Data Warehouse: Key Differences
  • Why Do I Need A Data Lake?
  • Hadoop Data Lakes in the Modern Data Platform
