Hadoop Speeds Data Delivery at Bloomberg

Many organizations are adopting Hadoop as their data platform for two fundamental reasons:

  1. They have a lot of data, on the order of tens of terabytes or more, that they need to store, analyze, and make sense of.
  2. Doing this in Hadoop is significantly less expensive than the alternatives.

But those organizations are finding there are other good reasons for using Hadoop and other NoSQL data stores (HBase, Cassandra). Hadoop has rapidly become the dominant distributed data platform, much as Linux quickly dominated the Unix operating system market. With that platform comes a rich ecosystem of applications for building data products, whether for the growing SQL-on-Hadoop movement or real-time data access with HBase.

At the latest Hadoop SF meetup at Bloomberg’s office, two presenters discussed how Bloomberg is taking advantage of this converged platform to power its data products. Bloomberg is the leading provider of securities data to financial companies, but they describe their data problem as “medium data”: they don’t have as much data to deal with as typical web-scale companies, but they do have stringent requirements around how quickly they must deliver it to their users. They have thousands of developers working on all aspects of these data products, especially custom low-latency data delivery systems.

When Bloomberg explored the use of HBase as the backend of their portfolio pricing lookup tool, they faced quite a challenge: supporting an average query lookup of 10M+ cells in around 60 ms. Initial efforts to use HBase were promising, but not quite fast enough. Through several iterations of optimization, including parallel client queries (sketched below), scheduling garbage collection, and even enhancing high availability to minimize the impact of a failed server (HBASE-10070), they were able to hit their targets and move from custom data server products to HBase.
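To make the parallel-read pattern concrete, here is a minimal sketch in Java against the standard HBase client API. This is not Bloomberg’s actual code: the table name (`portfolio_prices`), row keys, batch size, and thread count are all hypothetical. The two ideas it illustrates are fanning a large multi-row lookup out across a thread pool, and using TIMELINE consistency (the region-replica read feature from the HBASE-10070 work) so a secondary replica can serve a read when the primary region server is slow or down.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Consistency;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelPriceLookup {

    // Hypothetical table holding one row of pricing cells per security.
    private static final TableName TABLE = TableName.valueOf("portfolio_prices");
    private static final int BATCH_SIZE = 500;

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        ExecutorService pool = Executors.newFixedThreadPool(8);

        try (Connection conn = ConnectionFactory.createConnection(conf)) {
            // Placeholder keys; a real portfolio would have thousands of rows.
            List<String> securityIds = Arrays.asList("IBM", "AAPL", "MSFT");
            List<Future<Result[]>> futures = new ArrayList<>();

            // Fan the portfolio out into fixed-size batches, one task per batch.
            for (int i = 0; i < securityIds.size(); i += BATCH_SIZE) {
                List<String> batch =
                        securityIds.subList(i, Math.min(i + BATCH_SIZE, securityIds.size()));
                futures.add(pool.submit(() -> fetchBatch(conn, batch)));
            }

            // Gather results; each Result holds the pricing cells for one row.
            for (Future<Result[]> f : futures) {
                for (Result row : f.get()) {
                    // process row.rawCells() here
                }
            }
        } finally {
            pool.shutdown();
        }
    }

    private static Result[] fetchBatch(Connection conn, List<String> keys) throws Exception {
        try (Table table = conn.getTable(TABLE)) {
            List<Get> gets = new ArrayList<>(keys.size());
            for (String key : keys) {
                Get get = new Get(Bytes.toBytes(key));
                // TIMELINE consistency (HBASE-10070) lets a secondary region
                // replica answer the read if the primary is slow or down,
                // at the cost of possibly serving slightly stale data.
                get.setConsistency(Consistency.TIMELINE);
                gets.add(get);
            }
            return table.get(gets); // one batched multi-get RPC per task
        }
    }
}
```

Note that TIMELINE reads only help if region replication is enabled on the table, and they trade strict consistency for availability, which is a reasonable trade for read-mostly pricing lookups but not for every workload.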

With the move to Hadoop, Bloomberg also needed better cluster management capabilities. Open-source tools are already dominant in this space, and while Bloomberg leverages a combination of Apache Bigtop for Hadoop, Chef for configuration management, and Zabbix for monitoring, many other good tools exist (personally, I’m most fond of Ansible, Monit, and the proprietary Cloudera Manager). Combining the Hadoop platform’s abilities for developing and running large-scale data products with more efficient provisioning and operational models gives Bloomberg exactly what they need. It’s a model that’s going to play out repeatedly in the coming years at many organizations as Hadoop proves its capabilities as a modern data platform.

Mark Kidwell is Principal Big Data Consultant at DesignMind. He specializes in Hadoop, Data Warehousing, and Technical and Project Leadership.