Data lakes take on big data

A data lake is a scalable data storage repository that can ingest large amounts of raw data and make it available on-demand. Robert Sheldon explains the benefits and challenges of data lakes.

The world is generating more data than ever. In 2020 alone, 64.2 zettabytes of data were created or replicated, according to the International Data Corporation (IDC). The IDC also projects that the amount of digital data created over the next five years will be greater than twice the amount of data created since the beginning of digital storage.

Data is being generated on every front—through cloud platforms, mobile computing, edge environments, internet of things (IoT) devices, on-premises data centers, and a variety of other sources—but this data will do little good if organizations don’t have adequate tools for deriving value from it. They need ways to store and access vast amounts of data as efficiently as possible to perform advanced analytics and learn as much as they can from the available information.

To address these challenges, many organizations are turning to data lakes, centralized storage repositories that can hold large amounts of structured, unstructured, and semistructured data in its native format. A data lake offers a flexible and scalable platform for accessing and working with data. It breaks down data silos and addresses some of the challenges of traditional data management approaches. Data lakes are not without their own challenges, however, and organizations should understand their limitations before heading down this path.

What is a data lake?

A data lake is a scalable data storage repository that can quickly ingest large amounts of raw data and make it available on-demand. Users accessing the data lake can explore, transform, and analyze subsets of data as they need it to meet their specific requirements.

One of the most important characteristics of a data lake is its ability to store all types of data from any source:

  • Structured data, which is clearly defined and formatted, such as the data found in relational databases.
  • Unstructured data, which adheres to no specific format, such as social media posts or data generated by IoT devices.
  • Semistructured data, which falls somewhere between the two, such as CSV and JSON files. (The sketch after this list illustrates all three.)
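
To make the distinction concrete, here is a minimal Python sketch that handles the same three shapes of data; the record fields are hypothetical:

```python
import csv
import io
import json

# Structured: a fixed schema, as in a relational table.
structured = io.StringIO("customer_id,name,signup_date\n1001,Ada Lopez,2021-03-14\n")
rows = list(csv.DictReader(structured))

# Semistructured: self-describing fields that can vary per record.
semistructured = json.loads(
    '{"customer_id": 1001, "tags": ["premium"], "device": {"type": "sensor"}}'
)

# Unstructured: free text with no declared schema at all.
unstructured = "Customer called to say the mobile app crashes on login."

print(rows[0]["name"], semistructured["tags"], len(unstructured.split()))
```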

Early data lakes were built on Hadoop technologies, particularly MapReduce and the Hadoop Distributed File System (HDFS). Unfortunately, the first attempts at data lakes often turned into dumping grounds of undocumented and disorganized data, making it challenging to work with the data in any meaningful way. Since then, data lake technologies and methodologies have steadily improved, and they’re now better suited to handle growing data volumes. At the same time, there’s been steady growth in cloud-based object storage, which provides a scalable and relatively inexpensive storage platform for handling all that data.

An effective data lake can ingest any type of data in its native format from any source, whether databases, applications, file servers, data streams, or systems such as customer relationship management (CRM) or enterprise resource planning (ERP). A data lake also provides the mechanisms necessary to securely access data and carry out data-related operations without needing to move data to a separate storage system (although this is still an option if required).
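
As a minimal sketch of native-format ingestion, the following Python snippet copies raw source files into cloud object storage with boto3, without parsing or transforming them first; the bucket name and file paths are hypothetical placeholders:

```python
import boto3

# Ingest in native format: the files land in object storage exactly
# as the source systems produced them. No cleansing happens here.
s3 = boto3.client("s3")

raw_files = [
    ("exports/crm_contacts.csv", "raw/crm/crm_contacts.csv"),
    ("logs/app-2021-06-01.log", "raw/logs/app-2021-06-01.log"),
    ("feeds/sensor_readings.json", "raw/iot/sensor_readings.json"),
]

for local_path, key in raw_files:
    # upload_file streams the file as-is to the target bucket and key.
    s3.upload_file(Filename=local_path, Bucket="example-data-lake", Key=key)
```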

Data lakes can help organizations perform advanced analytics that employ techniques such as predictive modeling, machine learning, or deep learning. They can also support business intelligence, making it possible to run queries and generate reports that contain rich visualizations.

A data lake might also be used for warm or cold data that needs to be accessible for compliance requirements or historical analytics. In some cases, an organization might use a data lake as a staging area for a data warehouse, although some organizations use data lakes instead of or alongside data warehouses.

Why use a data lake?

Data lakes offer several advantages. One of the biggest is their ability to support advanced analytics and other forms of data science. Such analytics can help organizations derive greater value from their raw data and open the way for new business opportunities, leading to greater productivity, improved operations, better products, enhanced service delivery, or more effective decision-making.

Data lakes can also help break down silos by providing a centralized repository for storing and working with data from different systems. Data scientists have a complete view of the information, reducing the time needed to identify relevant content, navigate multiple systems, or contend with the various security protections that accompany each system. A centralized repository also reduces the administrative overhead that comes with managing distributed storage.

Data lakes can incorporate any type of data from any source. They’re also highly scalable systems, especially when implemented in the cloud. In addition, data cleansing and transformation are typically performed when the data is needed, rather than when the data is first ingested into the data lake.
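
The following Python sketch illustrates that schema-on-read pattern: raw JSON lines were ingested untouched, and structure and cleansing are applied only when an analyst reads the data. The field names are invented for illustration:

```python
import json

import pandas as pd

# Raw JSON lines as they were ingested, nulls and all.
raw_lines = [
    '{"device": "t-07", "temp_c": 21.4, "ts": "2021-06-01T10:00:00"}',
    '{"device": "t-07", "temp_c": null, "ts": "2021-06-01T10:05:00"}',
]

# Structure is imposed only now, at read time.
records = [json.loads(line) for line in raw_lines]
df = pd.DataFrame.from_records(records)

# Cleansing and typing are scoped to this one analysis, not baked
# into the lake itself.
df["ts"] = pd.to_datetime(df["ts"])
clean = df.dropna(subset=["temp_c"])
print(clean)
```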

These characteristics make the data lake an extremely flexible system that can accommodate a wide range of business needs. Its centralized nature and quick data ingestion enable users to find and respond to information much faster than they can when searching multiple systems. A data lake also makes a wide range of information available to different types of users, providing access to data that was often unavailable in the past.

Challenges using data lakes

Although data lakes can offer numerous benefits, they also come with their own challenges. One of the biggest is the risk that a data lake will turn into a data swamp, serving as little more than a dumping ground for all the data that nobody knows what to do with. The worse the swamp becomes, the more difficult it is for users to find what they need and for administrators to track what data the lake contains.

Another challenge is that data lake technologies and processes are still relatively young and have yet to gain the level of maturity available to systems such as data warehouses. Because the data can come in so many formats and types, working with the data can require a wide range of different tools, adding to overhead, complexity, and costs.

This lack of maturity can also lead to data governance challenges, especially with the diversity of tooling and data. In fact, data governance remains one of the top concerns with implementing a data lake. A data lake’s very nature makes governance difficult to achieve, yet that nature also points to the importance of a comprehensive data governance strategy. Without it, the data lake is destined to be plagued by issues in data quality and reliability, making it difficult to carry out the type of analytics needed to make effective business decisions.

Organizations also have to be wary of the unexpected costs that can come with data lakes. Even if an organization uses free, open-source platforms to host the system, there can still be unplanned expenses when it comes to infrastructure and ongoing maintenance, especially if the data lake starts turning into a data swamp. In addition, organizations that rely on cloud storage for their data lakes might experience a degree of sticker shock as subscription fees start to accumulate. Expenses can also add up if IT personnel and other professionals are spending inordinate amounts of time managing and troubleshooting the data lake environment.

Data lake use cases

Data lakes can store large volumes of data, regardless of the type of data or its source. For this reason, a data lake can support a variety of use cases, enabling organizations to derive meaning from the data through the use of advanced analytics and artificial intelligence technologies such as machine learning and deep learning. Here are just a few of the potential data lake use cases:

  • Education. Educational institutions are continuously collecting data about their students. The data might include information about attendance, grades, test scores, activities, or other measures that indicate how well students are doing academically. Through the use of data lakes, these institutions might be able to improve the teaching and learning experience, while increasing student retention and satisfaction.
  • Healthcare. The healthcare industry generates enormous amounts of data in the form of doctors’ notes, patient records, lab test results, clinical trials, administrative documents, and other activities. Data lakes can help the industry derive valuable information from these sources, leading to better treatments and more efficient processes.
  • Financial services. Financial services firms rely on current and accurate data to make decisions as quickly and efficiently as possible. Data lakes can help these organizations derive valuable insights from their data so they can more effectively respond to market trends, predict future outcomes, and reduce overall risks.
  • Climatology. Climatologists can use data lakes to more accurately model and predict the impact of human behavior on the environment in both the short term and the long term. Data lakes can also help with research into the steps that can be taken to mitigate that impact and move toward a more sustainable future.
  • Transportation. Data lakes can lead to more intelligent transportation systems, helping to address a wide range of issues, such as planning traffic flow, scheduling road maintenance, controlling traffic signals, managing road incidents, reducing carbon emissions, deploying fare payment systems, improving crosswalk safety, and numerous other transportation concerns.
  • Telecommunications. Since the onset of COVID-19, telecommunication companies have been under greater pressure than ever to deliver fast and reliable services to support both work-related and personal activities. By implementing data lakes, these companies can better adapt to fluctuating demands and shifts in market trends, while reducing customer churn and dissatisfaction.
  • Cybersecurity. As data continues to be generated at unprecedented rates, cybersecurity continues to be a growing concern. Security specialists can use data lakes to better understand the threat landscape, helping to identify and respond to security threats more quickly and effectively and reduce the risks to sensitive data.

These are by no means the only use cases that might benefit from a data lake. And as the technology and tooling mature, the potential for additional use cases is only likely to grow.

How can you deploy a data lake?

How you go about deploying a data lake depends on the technologies and services you plan to use and where the data lake will be hosted. Regardless of the approach, however, you must take into account how you’ll move data into the platform, how you’ll securely store and manage the data and its infrastructure, and how users will access the data.

To get started, you’ll need to have the basic storage infrastructure in place and a defined mechanism for continuously integrating incoming data. You’ll also need a way to effectively handle the metadata that comes with the primary data. Most importantly, you must ensure that security and access controls are built into the system at every step of the way. You should also have a clear sense of how you’ll manage data governance throughout the data lifecycle.

The degree of effort that goes into deploying and maintaining a data lake corresponds directly to the approach you take to implementing the data lake. For example, a cloud service such as AWS Lake Formation helps you build, secure, and manage your data lake, incorporating such features as source crawlers, data transformation, cataloging, security mechanisms, and access controls. AWS Lake Formation also provides built-in integration with Amazon S3 storage, Amazon Athena, Amazon Redshift, and Amazon EMR.
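
As a rough illustration, a hedged boto3 sketch of those Lake Formation building blocks might look like the following. All names and ARNs are hypothetical placeholders, and the exact workflow should be checked against the AWS documentation:

```python
import boto3

glue = boto3.client("glue")
lakeformation = boto3.client("lakeformation")

# Catalog a database that will describe data sitting in Amazon S3.
glue.create_database(DatabaseInput={"Name": "sales_lake"})

# Register the S3 location so Lake Formation can govern access to it.
lakeformation.register_resource(
    ResourceArn="arn:aws:s3:::example-data-lake/raw/sales",
    UseServiceLinkedRole=True,
)

# Grant an analyst role read-level access through Lake Formation
# permissions rather than raw S3 bucket policies.
lakeformation.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/analyst"
    },
    Resource={"Database": {"Name": "sales_lake"}},
    Permissions=["DESCRIBE"],
)
```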

If you try to build a similar system on-premises, you’ll have to deploy a storage infrastructure, a data management platform, a system for ingesting and processing the data, security and access controls, and interfaces for integrating with other systems, tools, and data sources. Even if you use cloud services for all or part of this effort, the do-it-yourself approach can be time-consuming and bring with it additional complexities and overhead, along with mounting costs.

Data lake architecture

There is no fixed formula for a data lake’s architecture. Organizations can combine data lake technologies in different ways to create designs that meet their specific needs while accommodating the influx of heterogeneous data from multiple sources. Even so, data lakes typically share a similar architecture at the highest level, incorporating the following physical layers:

  • Data sources. Data can come from a wide range of sources, including databases, file shares, system logs, business applications, IoT sensors, cloud platforms, NoSQL systems, and many more.
  • Data ingestion. A data lake provides the connectors and other components necessary to rapidly ingest data from the various sources into the storage repository. Depending on the data lake, the ingestion process might also transform or filter the data in some way, but this should not prevent the data lake from ingesting data in its native format from any source.
  • Storage repository. The storage repository is at the heart of the data lake, providing the infrastructure necessary to store various types of data. The repository also stores any associated metadata, which is essential if users are to find the data they need easily. Cloud object storage has become one of the most common storage platforms for data lakes, but the repository can also be implemented on-premises, usually distributed across multiple hosts.
  • Search engine. A data lake requires a scalable, high-performing search engine that enables users to locate and retrieve specific data quickly and easily, no matter how large the underlying data set. For example, a data lake might use Elasticsearch or Cloudera Search to help users find data in the repository (see the sketch after this list).
  • Client interfaces. A data lake typically includes APIs or user interfaces for enabling client tools to connect to the search engine, conduct searches and retrieve the results.
  • Client tools. Users can work with a wide range of client tools—running on different devices—to access the data lake environment and its data. For example, they might use data visualization or machine learning tools to work with the data.
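
As an illustration of the search-engine layer described above, here is a hedged Python sketch that indexes catalog metadata about lake objects in Elasticsearch and searches it by keyword. The index and field names are invented, and the call shapes assume the 8.x Python client:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Index catalog metadata about a lake object, not the raw data itself.
es.index(
    index="lake-catalog",
    document={
        "path": "raw/iot/sensor_readings.json",
        "format": "json",
        "source": "iot-gateway",
        "description": "Hourly temperature readings from factory sensors",
    },
)

# An analyst searches the catalog by keyword to locate relevant data sets.
hits = es.search(index="lake-catalog", query={"match": {"description": "temperature"}})
for hit in hits["hits"]["hits"]:
    print(hit["_source"]["path"])
```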

Although data lakes commonly store data in its native format, they might also apply formatting to the data in some way, such as converting Excel files to tab-separated values (TSV) or comma-separated values (CSV) format. Another approach might be to apply JavaScript Object Notation (JSON) formatting to incoming data. For example, JSON might be used when ingesting data streamed in from IoT devices. A storage repository might also include sandboxed regions to support individual analytics efforts.
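
A minimal Python sketch of that kind of light formatting, with hypothetical file names (the Excel conversion assumes pandas with the openpyxl engine installed):

```python
import json

import pandas as pd

# Convert an Excel workbook to CSV so downstream tools can read it
# without a spreadsheet library.
sheet = pd.read_excel("exports/q2_sales.xlsx")
sheet.to_csv("raw/sales/q2_sales.csv", index=False)

# Wrap a streamed IoT reading as a JSON line before it lands
# in the repository.
reading = {"device": "t-07", "temp_c": 21.4, "ts": "2021-06-01T10:00:00"}
with open("raw/iot/t-07.jsonl", "a") as f:
    f.write(json.dumps(reading) + "\n")
```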

Organizations have a great deal of flexibility when building their data lakes, choosing from various tools for implementing the data lake environments and working with the data. They can deploy their data lakes on-premises, on cloud platforms, or in hybrid configurations that accommodate both. Initially, most data lakes were deployed on-premises, but the steady increase in data volumes and the growing availability of cloud object storage have led many organizations to move to the cloud.

Cloud storage services can reduce CapEx and administrative overhead while providing almost unlimited scalability. That’s not to say the cloud doesn’t come with its own challenges, such as mounting subscription fees and reduced control over the environment (a particular concern when it comes to security and compliance). Even so, the cloud makes it possible to deploy a comprehensive data lake platform more quickly and efficiently than building one on-premises, while avoiding many of the headaches that often accompany in-house projects.

Data lakes vs. data warehouses

Some organizations operate both data lakes and data warehouses, having clear use cases that justify both. They might use the data lakes as staging platforms for their data warehouses or run them as completely separate systems. Regardless of the scenario, it’s essential to understand the differences between the two approaches to maximize their value and ensure their ongoing success.

A data warehouse is a central repository of structured data that has already been cleansed, transformed, and carefully curated to adhere to the relational model. It resembles the databases that process transactional operations, but it focuses on business intelligence, reporting, and data visualization. In contrast to a data lake, a data warehouse has a rigid, fixed structure that requires careful planning to implement and change, and it offers more limited scalability.

The following table provides an overview of the differences between the data warehouse and the data lake.

| Feature | Data warehouse | Data lake |
| --- | --- | --- |
| Types of data | Structured | Structured, unstructured, and semistructured |
| Schema | Schema-on-write | Schema-on-read |
| Data quality | Data carefully curated | Data quality can vary significantly |
| Data volumes | Terabyte scale | Petabyte scale |
| Agility | Fixed configuration with limited flexibility | Maximum flexibility |
| Maturity | Very mature features, tools, and security mechanisms | Data lake technologies still maturing |
| Use cases | Reporting, visualization, and business intelligence | Machine learning, deep learning, predictive analytics, and big data analytics |
| Users | Business and data analysts | Data scientists, data engineers, and business analysts |
| Costs | Higher cost per gigabyte | Lower cost per gigabyte |

Much like data lakes, data warehouses offer both advantages and disadvantages, which is why organizations often implement both. In this way, they can realize the benefits of each system while avoiding some of their challenges. However, implementing two separate systems also requires additional expenses, complexity, and administrative overhead. For this reason, some organizations are now opting for data lakehouses.

A data lakehouse combines data warehouse and data lake features to create a single platform that offers benefits from both systems. For example, a data lakehouse might support transactions, enforce schema, and provide governance features while using cheaper object storage, handling diverse types of data, and supporting tools typically associated with data lakes. A data lakehouse also reduces data redundancy and the administrative overhead that comes with managing two systems, resulting in lower overall costs.
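
As one concrete (and hedged) illustration, Delta Lake is an open-source table format often used to build lakehouses. The following PySpark sketch, which assumes the delta-spark package and its Spark configuration are in place, writes a transactional, schema-enforced table to plain file storage; the paths and data are illustrative:

```python
from pyspark.sql import SparkSession

# Configure Spark with the Delta Lake extensions documented by the project.
spark = (
    SparkSession.builder.appName("lakehouse-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

df = spark.createDataFrame([(1001, "Ada Lopez")], ["customer_id", "name"])

# Writes are transactional, and later appends with a mismatched schema
# are rejected: warehouse-style guarantees on lake-style storage.
df.write.format("delta").mode("overwrite").save("/tmp/lakehouse/customers")

spark.read.format("delta").load("/tmp/lakehouse/customers").show()
```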

The growing swell of data lakes

As data volumes continue to grow, organizations will face greater challenges in managing and deriving value from their data. Data lakes can play an important role in bringing order to all that data, but they can provide value only if they help organizations solve business problems and meet their stated goals. For this reason, careful consideration must be given to how a data lake is deployed and maintained and, perhaps more importantly, what the organization hopes to achieve by implementing the data lake.

Data lakes represent a growing presence. Whether they’re used alongside data warehouses, provide staging support for data warehouses, or morph into lakehouses that span both worlds, data lakes could significantly impact how we work with the morass of data expected in the coming years. Data lakes are still relatively new and have a long way to go to reach the maturity level enjoyed by data warehouses. Still, they’ve already gained a strong following, and their popularity continues to grow, especially now that the cloud giants have taken an interest. Data lakes are not for every organization, but for some, they promise to help navigate the challenging waters of data management in the years to come.

If you like this article, you might also like Storage 101: Cloud Storage.