According to Google, the interest in “Big Data” has been trending up for several years and has really gained steam in the last couple. The purpose of this post is to help highlight the differences between data lakes and data warehouses to help you make an informed decision on how to manage your data.
Those of us that are data and analytics practitioners have certainly heard the term and as we begin to discuss big data solutions with customers, the conversation naturally turns to a discussion of data lakes. However, I often find that customers either haven’t heard the term or don’t really have a good understanding of what it means.
First, let’s define our terms…
Wikipedia, defines Data Warehouses as:
“…central repositories of integrated data from one or more disparate sources. They store current and historical data and are used for creating trending reports for senior management reporting such as annual and quarterly comparisons.”
This is a very high level definition that describes the purpose of a data warehouse but doesn’t explain how the purpose is achieved.
I would go on to add that a data warehouse has the following properties:
It represents an abstracted picture of the business organized by subject area.
It is highly transformed and structured.
Data is not loaded to the data warehouse until the use for it has been defined.
Pentaho CTO James Dixon has generally been credited with coining the term “data lake”. He describes a data mart (a subset of a data warehouse) as akin to a bottle of water…”cleansed, packaged and structured for easy consumption” while a data lake is more like a body of water in its natural state. Data flows from the streams (the source systems) to the lake. Users have access to the lake to examine, take samples or dive in.
This is also a fairly imprecise definition. Let's add a few specific properties of a data lake:
All data is loaded from source systems. No data is turned away.
Data is stored at the leaf level in an untransformed or nearly untransformed state.
Data is transformed and schema is applied to fulfill the needs of analysis.
Next, let's highlight five key differentiators of a data lake and how they contrast with the data warehouse approach.
1. Data Lakes Retain All Data
During the development of a data warehouse, a considerable amount of time is spent analyzing data sources, understanding business processes and profiling data. The result is a highly structured data model designed for reporting. A large part of this process includes making decisions about what data to include and to not include in the warehouse. Generally, if data isn’t used to answer specific questions or in a defined report, it may be excluded from the warehouse. This is usually done to simplify the data model and also to conserve space on expensive disk storage that is used to make the data warehouse performant.
In contrast, the data lake retains ALL data. Not just data that is in use today but data that may be used and even data that may never be used just because it MIGHT be used someday. Data is also kept for all time so that we can go back in time to any point to do analysis.
This approach becomes possible because the hardware for a data lake usually differs greatly from that used for a data warehouse. Commodity, off-the-shelf servers combined with cheap storage makes scaling a data lake to terabytes and petabytes fairly economical.
2. Data Lakes Support All Data Types
Data warehouses generally consist of data extracted from transactional systems and consist of quantitative metrics and the attributes that describe them. Non-traditional data sources such as web server logs, sensor data, social network activity, text and images are largely ignored. New uses for these data types continue to be found but consuming and storing them can be expensive and difficult.
The data lake approach embraces these non-traditional data types. In the data lake, we keep all data regardless of source and structure. We keep it in its raw form and we only transform it when we’re ready to use it. This approach is known as “Schema on Read” vs. the “Schema on Write” approach used in the data warehouse.
3. Data Lakes Support All Users
In most organizations, 80% or more of users are “operational”. They want to get their reports, see their key performance metrics or slice the same set of data in a spreadsheet every day. The data warehouse is usually ideal for these users because it is well structured, easy to use and understand and it is purpose-built to answer their questions.
The next 10% or so, do more analysis on the data. They use the data warehouse as a source but often go back to source systems to get data that is not included in the warehouse and sometimes bring in data from outside the organization. Their favorite tool is the spreadsheet and they create new reports that are often distributed throughout the organization. The data warehouse is their go-to source for data but they often go beyond its bounds
Finally, the last few percent of users do deep analysis. They may create totally new data sources based on research. They mash up many different types of data and come up with entirely new questions to be answered. These users may use the data warehouse but often ignore it as they are usually charged with going beyond its capabilities. These users include the Data Scientists and they may use advanced analytic tools and capabilities like statistical analysis and predictive modeling.
The data lake approach supports all of these users equally well. The data scientists can go to the lake and work with the very large and varied data sets they need while other users make use of more structured views of the data provided for their use.
4. Data Lakes Adapt Easily to Changes
One of the chief complaints about data warehouses is how long it takes to change them. Considerable time is spent up front during development getting the warehouse’s structure right. A good warehouse design can adapt to change but because of the complexity of the data loading process and the work done to make analysis and reporting easy, these changes will necessarily consume some developer resources and take some time.
Many business questions can’t wait for the data warehouse team to adapt their system to answer them. The ever increasing need for faster answers is what has given rise to the concept of self-service business intelligence.
In the data lake on the other hand, since all data is stored in its raw form and is always accessible to someone who needs to use it, users are empowered to go beyond the structure of the warehouse to explore data in novel ways and answer their questions at their pace.
If the result of an exploration is shown to be useful and there is a desire to repeat it, then a more formal schema can be applied to it and automation and reusability can be developed to help extend the results to a broader audience. If it is determined that the result is not useful, it can be discarded and no changes to the data structures have been made and no development resources have been consumed.
5. Data Lakes Provide Faster Insights
This last difference is really the result of the other four. Because data lakes contain all data and data types, because it enables users to access data before it has been transformed, cleansed and structured it enables users to get to their results faster than the traditional data warehouse approach.
However, this early access to the data comes at a price. The work typically done by the data warehouse development team may not be done for some or all of the data sources required to do an analysis. This leaves users in the driver’s seat to explore and use the data as they see fit but the first tier of business users I described above may not want to do that work. They still just want their reports and KPI’s.
In the data lake, these operational report consumers will make use of more structured views of the data in the data lake that resemble what they have always had before in the data warehouse. The difference is that these views exist primarily as metadata that sits over the data in the lake rather than physically rigid tables that require a developer to change.
Which Approach Should I Choose?
That’s a difficult question. If you already have well established data warehouse, I certainly don’t advocate throwing all that work out the window and starting over from scratch. However, like many other data warehouses, yours may suffer from some of the issues I have described. If this is the case, you may choose to implement a data lake ALONGSIDE your warehouse. The warehouse can continue to operate as it always has and you can start filling your lake with new data sources. You can also use it for an archive repository for your warehouse data that you roll off and actually keep it available to provide your users with access to more data than they have ever had before. As your warehouse ages, you may consider moving it to the data lake or you may continue to offer a hybrid approach.
If you are just starting down the path of building a centralized data platform, I urge you to consider both approaches.
What about Technology?
I have purposely not mentioned any specific technology to this point. The term data lake has become synonymous with the big data technologies like Hadoop while data warehouses continue to be aligned with relational database platforms. My goal for this post was to highlight the difference in two data management approaches and not to highlight a specific technology. However, the fact remains that the alignment of the approaches to the technologies mentioned above is not coincidence. Relational database technologies are ideal for data warehouse applications because they excel at high-speed queries against very structure data.
The Hadoop ecosystem on the other hand works great for the data lake approach because it adapts and scales very easily for very large volumes and it can handle any data type or structure. However, in addition, Hadoop can also support data warehouse scenarios by applying structured views to the raw data. It is this flexibility that allows Hadoop to excel at providing data and insights to all tiers of business users.
What Does the Future Hold?
Technologies in both camps continue to evolve.
Relational database software continues to advance and developments in both software and hardware specifically aimed at making data warehouses faster, more scalable and more reliable.
The Hadoop ecosystem is seeing unprecedented adoption and the fact that it is a collection of open source projects supported by the community has meant that development and advancement happens at a much quicker pace than traditional software.
Hadoop’s reliance on open source software and commodity hardware make it very compelling from both the cost and features perspectives to consider if you are evaluating either a new data platform or are planning to replace or upgrade a legacy system.