May 14, 2013

Real-world Big Data: Data Warehouse Archival to Hadoop

Rob Kerr Posted by Rob Kerr

Hadoop is a hot topic of conversation for 2013.  But is this purely owing to the power of the vendor hype machine, or is Hadoop and its underlying distributed storage technology, Hadoop Distributed File System (HDFS) really deserving of the attention it’s receiving this year? 

For Hadoop to deserve such a high level of interest, it should bring unique solutions to the marketplace.  With most deployments committed to a set of technologies already, new technologies need to bring something unique and truly revolutionary.

In this series of articles, I’m going to examine specific applications of Hadoop technology that deliver capabilities not often available to conventional data warehouse implementations.  So let’s get started!

The high cost of data archival

One of the fundamental questions we always ask when sizing and designing a new data warehouse implementation is, “How much history do we need to keep in the warehouse?”.  Of course this is a loaded question. Almost any business executive will answer such an open-ended question with one word: “Forever!”.

Once the costs of providing boundless storage are calculated, a more thoughtful process ensues that weighs the benefit of long-range historical data analysis against the cost of expensive, high-performance enterprise storage.

In some applications, such as criminal justice or human resources, “forever” online data retention really is a requirement.  But in a commercial data warehouse, it’s far more common to balance historical completeness against cost.

Current approaches for historical archiving

In analyzing operational metrics, it’s common to arrive at a solution that stores transaction-level detail for 2-3 years (to provide 2 years of year-over-year analysis).   Historical data is sometimes archived on tape or optical media (in case it’s ever needed), or it’s discarded altogether if not deemed relevant to any future analysis.  As detailed data is purged from on-line storage, it may be replaced with aggregated data.

Data Warehouse cold archive pattern

This cold-archiving approach is effective at managing cost. It also helps ensure the performance of a system over time by keeping it’s processing loads relatively constant.  But what happens when there truly is a business need to analyze old, historical data? Often data put into cold archives cannot be retrieved in a timely fashion, and business users may simply omit detailed history from an analysis altogether, or make estimates based on aggregations.

A Hadoop Alternative

In this scenario, the combination of low-cost storage and unlimited, linear-cost scalability offered by Hadoop’s HDFS storage subsystem can be an excellent fit.  By archiving historical data from high-cost, high-performance storage systems to low-cost HDFS storage, the primary reasons for removing data from the query-able data warehouse are achieved. Storage cost is reduced and query performance is maintained.  

Because Hadoop can scale linearly simply by adding more storage nodes (without needing to upgrade nodes already acquired), the business can continue to archive data by incrementally adding storage nodes as its archive needs grow.

Data warehouse Hadoop archive pattern

But unlike other alternatives (NAS, tape, optical libraries, etc.), Hadoop at its core is both a low cost storage platform and an engine to query the data it stores. Data stored in Hadoop can be queried in a variety of ways. 

At the lowest level, flexible Map/Reduce jobs can be used.  However when the data archived are data warehouse tables that have predictable structures, a simple and effective approach is to use Hive or PolyBase.

Hive is a technology that lets users use a SQL syntax to query data in the Hadoop cluster.  It can be accessed through familiar tools like Excel, PowerPivot and Reporting Services through the use of a standard ODBC driver. 




More enterprise deployments are beginning to leverage the power of Hadoop to find creative solutions to use cases that were either impossible or prohibitively expensive in the past. While Hadoop enables an amazing variety of scenarios, many of its uses are simple and have compelling and short-term ROI.  Low-cost online data warehouse archive is one of the simplest Hadoop patterns to implement, and has an easily quantifiable and often immediate return on investment.

Read more about Real-world Big Data with our article on Social Sentiment Analysis with Hadoop and SQL, found on our blog at

New Call-to-action
Rob Kerr

About The Author

Rob Kerr

Rob Kerr is the former CTO at BlueGranite. He is now the founder of Mobile Toolworks LLC, where he helps startup and established companies reinvent their businesses using cloud and mobile technologies. Rob received his degree in Computer Information Systems from Western Michigan University, hold patents in connected device technologies, was awarded Microsoft’s Most Valued Professional (MVP) for tech community contributions, and have held software engineering, consulting and executive-level management positions prior to founding Mobile Toolworks.

Latest Posts

New Call-to-action