Historically, Cloudera has been able to reduce the big data learning curve and speed up adoption in traditional relational database management system (RDBMS) environments with its interactive query engine, Impala. Until recently, the Hortonworks distribution had no answer to Impala’s massively parallel processing (MPP) capabilities; that changed with its new release of HDB, powered by Apache HAWQ.
HDB provides Hortonworks with the missing link in its big data ecosystem and surpasses Impala by giving the open source community an easier-to-use and more robust MPP tool. As noted in the diagram below, Hortonworks is not positioning HDB to replace the enterprise data warehouse (EDW) or Apache Hive; however, there is some debate around the longevity of stand-alone MPP appliances because of their cost and the difficulty of scaling them. As for Hive, Hortonworks still considers it the de facto engine for petabyte-scale queries and extract, load, transform (ELT) workloads.
Stable YARN Resource Manager
Hortonworks HDB has several advantages over Cloudera Impala, starting with its YARN integration, which makes it a manageable resource that can be balanced across the cluster with other workloads. Currently, there is no limit or throttling on the I/O for Impala queries (Cloudera Documentation), which may make queries seem less responsive on a heavily loaded cluster because of delays in execution.
Minimal Learning Curve
Two other key differences make HDB more approachable for those new to the big data ecosystem: ANSI SQL compliance and a built-in SQL optimizer. The learning curve involved in moving to a big data cluster from traditional RDBMSs or MPP appliances can be daunting unless the old and new platforms share a common language. With HDB, the transition from SQL Server, Oracle or Netezza is far smoother thanks to HDB’s compliance with ANSI SQL 92, 99 and 2003, which reduces the training required to get up and running.
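To make the "common language" point concrete, here is a minimal sketch of a standard SQL query (a common table expression feeding a filtered aggregate) written without any engine-specific syntax. It runs against Python's built-in SQLite purely as a stand-in engine, not against HDB, and the `sales` table, its columns and its rows are hypothetical. The idea is only that a query written to the standard should need little or no rewriting when it moves between compliant engines.

```python
import sqlite3

# Stand-in engine: in-memory SQLite, NOT HDB. The table, columns and rows
# below are hypothetical and exist only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("east", 100), ("east", 300), ("west", 50), ("west", 150)],
)

# Standard SQL only: a common table expression computes per-region totals,
# and the outer query keeps regions whose total exceeds a threshold.
QUERY = """
WITH region_totals AS (
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region
)
SELECT region, total
FROM region_totals
WHERE total > 250
ORDER BY region
"""

rows = conn.execute(QUERY).fetchall()
print(rows)
```

Whether a given query ports unchanged still depends on the data types and functions it uses, so treat this as an illustration of the principle rather than a compatibility guarantee.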
Productive Query Execution
To complement the language transition, HDB also comes with a best-in-class query optimizer that again removes the burden from the analyst, speeding up results and reducing development time. Optimizing Cloudera Impala can be an art form, and transferring skills from other RDBMSs may be more difficult because Impala is limited to ANSI SQL 92. Cloudera does offer a significant amount of documentation on optimization for those willing to spend the time.
Cohesive Machine Learning
For those interested in analytical workloads, HDB offers native machine learning with MADlib. This again reduces the learning curve by providing integrated tools for machine learning and predictive modeling, so the analyst can avoid loading separate libraries.
The final advantage comes from benchmarks reported by Hortonworks, which state a performance increase of 1.3x to 6x over Impala. Depending on the workload of the cluster, the increase could be significantly greater, because Impala cannot be managed by YARN and its jobs may be left waiting in the queue.
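As a rough illustration of what that range means in practice, the sketch below converts a speedup factor into a projected runtime. The 60-second baseline is a hypothetical figure chosen for the arithmetic, not a measured Impala result, and `projected_runtime` is a helper named here for illustration; only the 1.3x and 6x factors come from the Hortonworks-reported range.

```python
def projected_runtime(baseline_seconds: float, speedup: float) -> float:
    """Return the runtime implied by a speedup factor over a baseline."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return baseline_seconds / speedup


# Hypothetical baseline: a query that takes 60 seconds today.
BASELINE_SECONDS = 60.0

# Low and high ends of the reported 1.3x to 6x range.
slow_case = projected_runtime(BASELINE_SECONDS, 1.3)
fast_case = projected_runtime(BASELINE_SECONDS, 6.0)

print(f"At 1.3x: {slow_case:.1f} s")
print(f"At 6x:   {fast_case:.1f} s")
```

On these assumptions the same query would land somewhere between roughly 46 seconds and 10 seconds, which is a wide enough spread that testing against your own workload matters more than the headline number.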
HDB offers some significant improvements over Impala and should be assessed based on your unique workloads and analyst capabilities. Although HDB appears impressive, it also has current limitations that need to be considered; for instance, it has no Ranger or Atlas integration, which leaves HDB outside the normal Hortonworks governance framework. Both integrations are on the roadmap, which is worth knowing for those willing to adopt the technology today.
If you are considering a big data platform and have specific analytical needs or questions related to how HDB integrates with your environment, BlueGranite can help. Please contact us today!