Languages of Big Data Apache Hadoop


Languages of Big Data

Big Data is here and rapidly growing.  If you are just starting to learn about Big Data, it is important to understand the various pieces that can be used in Big Data architectures.  The Apache Software Foundation is a major player in the Big Data space.  Each heading below links to that project's homepage.  Enjoy learning about all that Apache Hadoop and related technologies have to offer.

Apache Hadoop

Taken from the Hadoop homepage – “The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.”

  • Hadoop Distributed File System (HDFS) – A distributed file system that provides high-throughput access to application data
  • Hadoop YARN – A framework for job scheduling and cluster resource management
  • Hadoop MapReduce – A YARN-based system for parallel processing of large data sets
  • Apache Hive – A data warehouse infrastructure that provides data summarization and ad hoc querying
  • Apache Mahout – A machine learning and data mining library
  • Apache Pig – A high-level data-flow language and execution framework for parallel computation
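
To give a feel for the MapReduce programming model, here is a minimal word-count sketch in plain Python. It is not Hadoop code and does not use any Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration, and everything runs in one process rather than across a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, the job a framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: sum all counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data is big", "data is here"])
print(counts)
```

In a real cluster the map and reduce steps would run in parallel on many machines, with the framework handling the grouping and the movement of data between them.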

Apache Sqoop

Sqoop is a tool designed to transfer bulk data between relational databases and Hadoop.

Apache Flume

Flume collects, aggregates, and moves large amounts of log data.  It has a simple and flexible architecture based on streaming data flows, which makes it a good fit for online analytics.
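
Flume describes a data flow in terms of a source, a channel, and a sink. The sketch below imitates that flow in plain Python so the idea is easier to picture. It is only an analogy under simplifying assumptions: real Flume agents are set up through configuration rather than Python code, and the function names and sample log lines here are invented for the example.

```python
from collections import Counter
from queue import Queue


def source(log_lines, channel):
    """Source: push each raw log line into the channel as an event."""
    for line in log_lines:
        channel.put(line)


def sink(channel):
    """Sink: drain the channel and count events by log level (the first token)."""
    levels = Counter()
    while not channel.empty():
        event = channel.get()
        levels[event.split()[0]] += 1
    return levels


# The channel buffers events between the source and the sink.
channel = Queue()
source(["ERROR disk full", "INFO job started", "ERROR timeout"], channel)
summary = sink(channel)
print(summary)
```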

Apache Solr

Solr is an open source enterprise search platform from the Apache Lucene project.  It offers full text search, near real-time indexing, dynamic clustering, and database integration, and it can index rich documents such as Word or PDF files.
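
Full text search engines built on Lucene rely on an inverted index, which maps each term to the documents that contain it. The toy version below shows the idea in plain Python. It is a sketch only, with no Solr or Lucene API involved; the helper names (build_index, search) and the sample documents are made up, and it skips everything a real engine adds, such as text analysis, ranking, and persistent storage.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercased term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    1: "Hadoop stores big data",
    2: "Solr searches big documents",
    3: "Flume moves log data",
}
index = build_index(docs)
hits = search(index, "big data")
print(hits)
```

Only document 1 contains both "big" and "data", so it is the only hit for that query.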

MongoDB, Solr/Lucene/Elasticsearch, NoSQL, Hadoop

MapReduce, Hive, HBase, Pig, Mahout, Avro, Oozie


The post Languages of Big Data Apache Hadoop appeared first on Derek Wilson - Blog.