THE BIG DATA PROBLEM
On July 28th, 2016, Mark Zuckerberg announced that as many people use Facebook today as were alive 100 years ago. Over the past decade, our lives have been significantly transformed by increasingly affordable digital devices and services. This digitalization touches every aspect of our personal and work lives, and as we use these digital platforms we are constantly generating information, or data, at an alarmingly fast rate.
Today, we have more data than we’ve ever had before, and this data has enormous potential. But what do we do with it? It is clear that we can gain valuable insights, but scientists currently lack the resources to analyse the full spectrum of this growing volume of data. Thus, we have the Big Data Problem: we are still trying to find new ways to share, analyse, and interpret data.
WHAT IS A DATA LAKE?
A Data Lake acts as a facility for storing, analysing and querying data regardless of format, and thus provides an intuitive way to store data of any type. While the typical Data Warehouse pre-categorizes data at its point of entry (which can dictate how it will be analysed), Data Lakes categorize data only once a specific query is launched. This attempts to address a crucial issue of the Big Data Problem: we currently don’t recognize the potential value of much of the data that we are collecting. Storing that data in pre-categorized form in a Data Warehouse may therefore not make sense. The term Data Lake was coined by James Dixon, CTO of Pentaho, who describes it this way:
“If you think of a datamart (Data Warehouse) as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”
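The idea of categorizing data only when a query is launched can be illustrated with a small sketch. This is a toy under stated assumptions, not how any particular Data Lake product works: the `RAW_LAKE` records, their field names, and the `query` helper are all hypothetical. Records are kept exactly as they arrived, and structure is applied only at the moment someone asks a question.

```python
import json

# Hypothetical raw records, stored exactly as they arrived (no upfront schema).
RAW_LAKE = [
    '{"name": "John Davis", "city": "New York", "product": "Product 1"}',
    '{"user": "Sara Edwards", "state": "Minnesota", "channel": "Online"}',
    "free-text note: customer called about Product 2",
]


def query(raw_records, wanted_field):
    """Apply structure at query time: parse what can be parsed and
    return the values of `wanted_field` wherever that field exists."""
    results = []
    for record in raw_records:
        try:
            parsed = json.loads(record)
        except json.JSONDecodeError:
            # Unparseable records simply stay in the lake for other queries.
            continue
        if isinstance(parsed, dict) and wanted_field in parsed:
            results.append(parsed[wanted_field])
    return results


print(query(RAW_LAKE, "product"))  # -> ['Product 1']
```

Nothing is rejected or reshaped when it enters `RAW_LAKE`; a different query tomorrow could interpret the same records in a different way.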
According to The Boston Consulting Group, Data Lakes have three core functions:
- To ingest structured and unstructured data from multiple sources into a data repository
- To manage data by cleaning, describing, and improving it
- To model data to produce insights that can be visualized or integrated into operating systems
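As a rough illustration of these three functions, the sketch below chains an ingest, a manage, and a model step. It is a deliberately tiny stand-in, not The Boston Consulting Group's design: the function names, the two sample sources, and the choice of a simple count as the "insight" are all assumptions made for the example.

```python
from collections import Counter


def ingest(*sources):
    """Ingest: pull records from several sources into one repository (a list)."""
    repository = []
    for source in sources:
        repository.extend(source)
    return repository


def manage(repository):
    """Manage: clean records (trim whitespace, drop empties) and describe
    each one with a small piece of metadata (its original Python type)."""
    cleaned = []
    for record in repository:
        text = str(record).strip()
        if text:
            cleaned.append({"value": text, "kind": type(record).__name__})
    return cleaned


def model(managed):
    """Model: produce a simple insight, here a count of each value."""
    return Counter(item["value"] for item in managed)


crm_rows = ["Product 1 ", "Product 2"]  # hypothetical structured source
web_logs = ["  Product 1", "", 42]      # hypothetical unstructured source

insight = model(manage(ingest(crm_rows, web_logs)))
print(insight.most_common(1))  # -> [('Product 1', 2)]
```

In a real deployment each step would be a separate system; the point here is only the order of responsibilities.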
[Diagram from The Boston Consulting Group: an in-depth look at how a Data Lake works]
One of the key benefits of Data Lakes lies in their flexibility. Traditional data warehouses are resistant to change, costly to operate, and hard to scale. Data Lakes typically use low-cost servers with an architecture that allows servers to be added as needed to increase power and capacity. They are already in use, and have huge potential, in the following fields: insight generation, operational analytics, and transaction processing.
A SIMPLE EXAMPLE
To better understand Data Lakes, consider this simple example. Imagine that you are trying to draw insights from the relatively unstructured data below, which your company has collected:
| Customer A | Customer B | Customer C |
|---|---|---|
| John Davis | Minnesota | Product 2 |
| New York | Product 1 | Online |
| Product 1 | Online | Joe Brown |
| Male | Sara Edwards | New Jersey |
In order to analyse and draw insights from this data using a traditional data warehouse, you would have to structure it in the following way:
| | Customer A | Customer B | Customer C |
|---|---|---|---|
| Name | John Davis | Sara Edwards | Joe Brown |
| Location | New York | Minnesota | New Jersey |
| Product | Product 1 | Product 1 | Product 2 |
| Point of Purchase | In-store | Online | Online |
While structuring this data may seem easy in this particular case, what if you were attempting to analyse data from millions of customers, with hundreds of different variables each? Ideally, a Data Lake would be able to draw insights from this data in its raw, unstructured format. While this is an oversimplification of the Data Lake concept, it clearly demonstrates its value.
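To make the example a little more concrete, here is a minimal sketch that takes the raw values in the arbitrary order they were collected and works out which field each one belongs to only when it is asked. The lookup sets, the `classify` rules, and the fallback to "Name" are assumptions invented for this illustration; a real system would need far more robust recognition.

```python
# Raw values per customer, in the arbitrary order they were collected.
raw = {
    "Customer A": ["John Davis", "New York", "Product 1", "Male"],
    "Customer B": ["Minnesota", "Product 1", "Online", "Sara Edwards"],
    "Customer C": ["Product 2", "Online", "Joe Brown", "New Jersey"],
}

# Hypothetical reference sets used to recognise values.
LOCATIONS = {"New York", "Minnesota", "New Jersey"}
CHANNELS = {"Online", "In-store"}
GENDERS = {"Male", "Female"}


def classify(value):
    """Guess which field a raw value belongs to."""
    if value in LOCATIONS:
        return "Location"
    if value in CHANNELS:
        return "Point of Purchase"
    if value in GENDERS:
        return "Gender"
    if value.startswith("Product "):
        return "Product"
    return "Name"  # fallback assumption: anything unrecognised is a name


def structure(raw_data):
    """Build a field -> value mapping per customer at query time."""
    return {
        customer: {classify(value): value for value in values}
        for customer, values in raw_data.items()
    }


structured = structure(raw)
online_buyers = [
    fields["Name"]
    for fields in structured.values()
    if fields.get("Point of Purchase") == "Online"
]
print(online_buyers)  # -> ['Sara Edwards', 'Joe Brown']
```

The raw dictionary is never rewritten; the structure exists only for the duration of the question being asked.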
WHAT MAKES DATA LAKES POSSIBLE?
Data Lakes typically operate in the Hadoop environment. Hadoop is an open source, Java-based programming framework that supports the processing of large amounts of data in a distributed computing environment. It is an Apache project, developed under the Apache Software Foundation. Because its storage layer is designed as a file system rather than a database, Hadoop can handle many different data formats, which makes it a natural platform for Data Lakes and a good fit for data that doesn’t fit nicely into pre-defined tables. The framework can run on a large number of servers that don’t share common memory or disks. When you load your data into Hadoop, the software breaks the data into pieces and spreads them across the different servers while keeping track of where each piece is stored. This allows Data Lakes to be flexible and easily scalable.
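The "break the data into pieces, spread them out, and remember where each piece went" idea can be sketched in a few lines of plain Python. This is a toy simulation, not the Hadoop API: the `distribute` and `reassemble` functions, the node names, the round-robin placement, and the 8-byte block size are all assumptions chosen for readability. Real Hadoop storage uses much larger blocks and also keeps multiple copies of each one.

```python
def distribute(data, servers, block_size=8):
    """Split `data` into fixed-size blocks, place them round-robin on the
    servers, and return (storage, index). `index[i]` names the server that
    holds block number i."""
    storage = {server: {} for server in servers}
    index = []
    for block_no, start in enumerate(range(0, len(data), block_size)):
        server = servers[block_no % len(servers)]
        storage[server][block_no] = data[start:start + block_size]
        index.append(server)
    return storage, index


def reassemble(storage, index):
    """Use the index to fetch every block from its server, in order."""
    return b"".join(
        storage[server][block_no] for block_no, server in enumerate(index)
    )


data = b"raw events from many different sources"
storage, index = distribute(data, ["node-1", "node-2", "node-3"])

# The original data can be rebuilt from the scattered blocks.
assert reassemble(storage, index) == data
print(index)
```

Adding capacity in this sketch is just a matter of passing a longer list of servers, which mirrors the scaling argument above.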
THE FUTURE OF DATA LAKES
Some of the big names in enterprise software are already seeing the potential in Data Lakes. In 2015, Amazon Web Services released a video outlining how their platform can be harnessed to create Data Lakes. Microsoft has also developed a Data Lake solution as part of its Azure cloud suite. There are even some new players to watch out for. Zaloni, which proclaims itself “The Data Lake Company”, received $7.5 million in funding in February of 2016.
While many of the traditional enterprise software vendors are dabbling in the Hadoop environment and Data Lakes, the company that has been making strides in this environment is SAP. SAP’s HANA platform combines an in-memory database with application services, high-speed analytics, and flexible data acquisition tools in a single platform. Combining the processing power of SAP HANA with Hadoop’s ability to store and process huge amounts of data provides enormous potential. In particular, with the HANA Vora platform, which utilizes Hadoop, SAP is helping businesses to bridge the divide between corporate data and Big Data to improve decision-making.
Theoretically, Data Lakes do provide a potential answer to the Big Data Problem as a new way to share, analyse, and interpret data. But it seems that they have a long way to go. The Data Lake concept still faces two major challenges: a lack of tools and a lack of skills. The Hadoop environment lacks mature metadata and security tools. It also has a shortage of skilled workers who are familiar with the open-source software framework.
James Dixon began discussing Data Lakes way back in 2010, when the concept may have been a bit ahead of its time. Like any other technological innovation, it seems that the Data Lake follows the traditional Hype Cycle. In its 2016 Hype Cycle for Data Science, Gartner places the Data Lake concept at the peak of inflated expectations. While we can still expect the concept to hit the dreaded trough of disillusionment in the near future, we can also expect it to eventually reach its plateau of productivity.