
Sep 26, 2016

5 Critical Differences: Cloudera Impala vs. Hortonworks HDB

Posted by Scott Faculak

Historically, Cloudera has been able to reduce the big data learning curve and speed up adoption in traditional relational database management system (RDBMS) environments by leveraging its interactive query engine, Impala. The Hortonworks distribution had no answer to Impala’s massively parallel processing (MPP) capabilities until the recent release of HDB, powered by Apache HAWQ.

HDB provides Hortonworks with the missing link in its big data ecosystem and surpasses Impala by giving the open source community an easier-to-use and more robust MPP tool. As noted in the diagram below, Hortonworks is not positioning HDB to replace the enterprise data warehouse (EDW) or Apache Hive; however, there is some debate about the longevity of stand-alone MPP appliances because of their cost and difficulty scaling. As for Hive, Hortonworks still considers it the de facto engine for petabyte-scale queries and extract, load, transform (ELT) workloads.

hortonworks-hdb.png

Stable YARN Resource Manager

Hortonworks HDB has several advantages over Cloudera Impala, starting with its YARN integration, which makes it a manageable resource that can be balanced across the cluster with other workloads. Currently, there is no limit or throttling on the I/O for Impala queries (Cloudera Documentation), which may make queries seem less responsive on a heavily loaded cluster because execution is delayed.

Minimal Learning Curve

Two other key differences make HDB more approachable for those new to the big data ecosystem: ANSI SQL compliance and a built-in SQL optimizer. The learning curve involved in moving to a big data cluster from traditional RDBMSs or MPP appliances can be daunting unless the two share a common language. With HDB, the transition from SQL Server, Oracle or Netezza is now seamless thanks to HDB’s compliance with ANSI SQL-92, SQL:1999 and SQL:2003, which reduces the training required to get up and running.
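To make the idea of a shared language concrete, here is a minimal sketch of a query that uses a common table expression (WITH), a feature that entered the standard with SQL:1999. It runs against an in-memory SQLite database purely as a local stand-in; the sales table and its values are hypothetical, and nothing here is specific to HDB or Impala.

```python
import sqlite3

# SQLite is only a local stand-in; the point is that the SQL text itself
# is standard and not tied to one engine.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount INTEGER);  -- hypothetical table
    INSERT INTO sales VALUES
        ('east', 100), ('east', 250), ('west', 75), ('west', 300);
""")

# WITH (a common table expression) was introduced in SQL:1999.
query = """
    WITH region_totals AS (
        SELECT region, SUM(amount) AS total
        FROM sales
        GROUP BY region
    )
    SELECT region, total
    FROM region_totals
    ORDER BY total DESC
"""
rows = conn.execute(query).fetchall()
print(rows)  # [('west', 375), ('east', 350)]
```

An analyst who already writes queries like this on a relational database would expect to reuse them largely unchanged on any engine that supports the same level of the standard.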

Productive Query Execution

To complement the language transition, HDB also comes with a best-in-class query optimizer that again removes the burden from the analyst, speeding up results and reducing development time. Optimizing Cloudera Impala can be an art form, and transferring skills from other RDBMSs may be more difficult because Impala is limited to ANSI SQL-92. Cloudera does offer a significant amount of documentation on optimization for those willing to spend the time.

Cohesive Machine Learning

For those interested in analytical workloads, HDB offers native machine learning with MADlib. This again reduces the learning curve by providing integrated tools for machine learning and predictive modeling, so the analyst can avoid loading separate libraries.
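As a rough illustration of what in-database machine learning looks like, MADlib exposes its algorithms as SQL functions, so training a model is a single SELECT statement. The sketch below only builds the SQL text for MADlib's linregr_train function; the helper name and the houses table and columns are hypothetical, and the statement would still need to be run through whatever SQL client connects to the cluster.

```python
def linregr_train_sql(source_table, out_table, dependent, independents):
    """Build a call to MADlib's linregr_train, including an intercept term.

    Illustrative only: the arguments are interpolated without escaping, so
    they must come from trusted input.
    """
    # The leading 1 in the feature array gives the model an intercept.
    features = ", ".join(["1"] + list(independents))
    return (
        "SELECT madlib.linregr_train("
        f"'{source_table}', '{out_table}', '{dependent}', 'ARRAY[{features}]');"
    )


# Hypothetical table and column names.
sql = linregr_train_sql("houses", "houses_model", "price", ["tax", "bath", "size"])
print(sql)
```

Because the model is trained where the data already lives, the analyst stays in SQL instead of exporting data to a separate tool.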

Amplified Functioning

The final advantage comes from benchmarks reported by Hortonworks, which state a performance increase of 1.3x to 6x over Impala. Again, depending on the cluster's workload, the performance increase could be significantly greater because Impala cannot be managed by YARN, which can leave jobs delayed in the queue.

HDB offers some significant improvements over Impala and should be assessed based on your unique workloads and analyst capabilities. Although HDB appears impressive, it also has current limitations that need to be considered; for instance, it has no Ranger or Atlas integration, which leaves HDB outside the normal Hortonworks governance framework. Ranger and Atlas integration is on the roadmap for those willing to take advantage of the technology today.

If you are considering a big data platform and have specific analytical needs or questions related to how HDB integrates with your environment, BlueGranite can help. Please contact us today!

 

About The Author

Scott Faculak

Scott Faculak is a recognized technology leader engaging in next generation, big data Hadoop solutions to ensure best in class business intelligence, analytics and operational reporting. He is a strategic visionary, leading data architecture and solution development efforts. As an analytic solution provider with over 15 years of business intelligence practice, he is capable of maximizing financial, operational and marketing competencies in multiple industries. He effectively leads teams composed of business intelligence developers, analysts, project managers, data engineers and support staff, consistently exceeding corporate goals, initiatives and expectations.
