Sep 28, 2015

Lambda Architecture: Low Latency Data in a Batch Processing World

Posted by Josh Fennessy

There's a real tension in today's data-rich environment between requests for up-to-the-second data and the time it takes to process, cleanse, and verify that data. For enterprise business intelligence and data warehousing systems, the Holy Grail is to produce a single version of the truth. Unfortunately, that single version of the truth comes at a steep price: time.

It's no secret that it takes considerable time to process and manage large amounts of data. In today's real-time society, key decision makers and data scientists feel the pressures of delivering decisions and analytics based closely on real-time trends.

How do we avoid making crucial decisions based on yesterday's trends? Enter the Lambda Architecture.

Introducing the Lambda Architecture

The Lambda Architecture offers a novel way to bridge the gap between the single version of the truth and the highly sought I-want-it-now, real-time solution. By combining traditional batch processing systems with stream consumption tools, a balance can be struck. The Lambda Architecture does this with three main components:

The Batch Layer

This layer is familiar to many students of data warehousing. It is the ETL layer; the master data layer; the data warehouse. What goes in here is correct. It is the truth. We trust this layer.

The Speed Layer

This layer is the new kid on the block. It's fast and loose. New data that is created is nearly instantly stored in this layer. It is not clean. It might not even be right; but it's now. It's current. We only trust this layer on the surface.

The Serving Layer

This is the mediator. The calm one. The Serving Layer accepts queries, and decides when to use the Batch Layer, and when to use the Speed Layer. It prefers the Batch Layer – the trusted layer. If you ask for the up-to-the-second data, it will return that too. It's the bridge between what we trust and what we want right now.

How Is this Achieved?

To implement a Lambda Architecture is to implement a number of parallel or consecutive projects.

The Batch Layer is typically implemented using traditional data warehousing tools and served via traditional data warehousing databases. In Apache Hadoop terms, this could mean Oozie workflows to process the data, and Apache Hive or Cloudera Impala to answer queries. This layer is built using a predefined schedule, giving enough time for processing – usually once or twice per day. Updates to the Batch Layer include ingesting those pieces of data currently stored in the Speed Layer.
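The defining trait of the Batch Layer is that it periodically recomputes its views from the *entire* master dataset rather than updating them in place. As a minimal sketch (in plain Python rather than Hive/Oozie, with a hypothetical page-view event schema standing in for real data):

```python
from collections import Counter

def build_batch_view(raw_events):
    """Recompute the batch view from the full master dataset.

    A stand-in for the scheduled Hive/Oozie job: it scans ALL
    historical events each run and produces a trusted aggregate.
    """
    view = Counter()
    for event in raw_events:
        view[event["page"]] += 1  # e.g. page-view counts per URL
    return dict(view)

# The nightly run covers the full master dataset, including events
# that had previously been visible only in the Speed Layer.
master = [
    {"page": "/home"}, {"page": "/home"}, {"page": "/pricing"},
]
batch_view = build_batch_view(master)
# batch_view == {"/home": 2, "/pricing": 1}
```

Recomputing from scratch is what makes this layer trustworthy: any bug fix or schema change is automatically reflected in the next run, at the cost of latency.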

The Speed Layer is typically implemented using stream ingestion and processing tools. Apache Storm, Kafka, Flume, and Spark Streaming are common examples of tools used to consume data at the rate it is created. This layer offers minimal data processing and generally is simply a "lift 'n shift" of the streaming data into a central repository. Apache HBase and Cassandra are common platforms for storing data in the Speed Layer.
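In contrast to the batch recompute, the Speed Layer updates its view incrementally, one event at a time, and discards those events once a batch run has absorbed them. A minimal sketch, again in plain Python with the same hypothetical page-view schema (a real deployment would use Storm, Spark Streaming, etc.):

```python
from collections import Counter

class SpeedLayer:
    """Sketch of a Speed Layer: updates a real-time view as each
    event arrives, instead of recomputing from scratch."""

    def __init__(self):
        self.realtime_view = Counter()

    def consume(self, event):
        # "Lift 'n shift": minimal processing, immediate visibility.
        self.realtime_view[event["page"]] += 1

    def flush(self):
        # Called once the Batch Layer has ingested these events,
        # so they are not counted twice.
        self.realtime_view.clear()

speed = SpeedLayer()
speed.consume({"page": "/home"})
speed.consume({"page": "/blog"})
# speed.realtime_view == {"/home": 1, "/blog": 1}
```

The `flush` step is what keeps the two layers complementary: the real-time view only ever covers the window of data the last batch run has not yet seen.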

The Serving Layer is a bit more complicated in that it needs to be able to answer a single query request against two or more databases, processing platforms, and data storage devices. Apache Druid is an example of a cluster-based tool that can marry the Batch and Speed layers into a single answerable request.
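Conceptually, answering a query means combining the trusted batch view with the small slice of not-yet-batched data from the Speed Layer. Continuing the hypothetical page-view example (a tool like Druid does this at scale; the merge logic reduces to something like):

```python
def serve_query(page, batch_view, realtime_view):
    """Answer a query by starting from the trusted batch view and
    topping it up with events the batch run has not yet absorbed."""
    return batch_view.get(page, 0) + realtime_view.get(page, 0)

batch_view = {"/home": 1000}   # as of last night's batch run
realtime_view = {"/home": 7}   # events arriving since that run
total = serve_query("/home", batch_view, realtime_view)
# total == 1007: trusted history plus up-to-the-second activity
```

The merge is only this simple when the two views never overlap, which is exactly why the Speed Layer must be flushed as the Batch Layer catches up.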

Why doesn't everyone do this?

Frankly, it's complicated. Setting up a true Lambda Architecture requires a large number of hardware and software resources. It requires a different codebase for each layer, often built on completely different technologies. A single logical change must be made in two or three places, so code changes can cost three or four times what they would in a traditional data warehouse architecture. Oftentimes, the benefit of working with data only slightly newer than a standard data warehouse provides does not outweigh the extra cost of implementing and maintaining an architecture this complex.

OK, I know I need this. What next?

BlueGranite specializes in best practices of data management. A good place to start is an Architecture Design session. These sessions are typically spread over two to three days, during which time our architects strive to understand your data, to understand your requirements, and help to build a high-level roadmap of implementation.

From there, our consulting services will help to implement a pilot project to prove the architecture's benefit in your organization. We work in a phased approach to help you realize return on investment during the implementation cycle, not just at the end.

Contact us today to learn more.


About The Author

Josh Fennessy

Josh is a Solution Architect at BlueGranite who is passionate about enabling information workers to become data leaders. His interests in the data space include modern data warehousing, unstructured analytics, distributed computing, and NoSQL database solutions.
