Introduction to Hadoop

James Serra, 2014-03-11 (first published: 2014-02-27)

Hadoop is an open-source software framework, developed under the Apache Software Foundation, for processing large, heterogeneous data sets in a distributed fashion (via MapReduce) across clusters of commodity hardware, on top of a distributed storage layer (HDFS). Hadoop uses a simplified programming model, and the result is a reliable shared storage and analysis system.

MapReduce is a software framework that allows developers to write programs that perform complex computations on massive amounts of unstructured data in parallel across a distributed cluster of processors or stand-alone computers. MapReduce libraries have been written in many programming languages, with different levels of optimization; Hadoop's implementation is written in Java. It works by breaking down a large, complex computation into multiple tasks, assigning those tasks to individual worker/slave nodes, and taking care of coordinating and consolidating the results. A MapReduce program is composed of a Map() procedure that performs filtering and sorting (such as sorting students by first name into queues, one queue for each name) and a Reduce() procedure that performs a summary operation (such as counting the number of students in each queue, yielding name frequencies).
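
The student example above can be sketched in plain Python. This is a single-process illustration of the map, shuffle, and reduce steps, not Hadoop's actual API; the function names (`map_phase`, `shuffle`, `reduce_phase`) and the sample data are invented for this sketch.

```python
from collections import defaultdict


def map_phase(students):
    """Map: emit a (first_name, 1) pair for every student record."""
    for full_name in students:
        first_name = full_name.split()[0]
        yield first_name, 1


def shuffle(pairs):
    """Shuffle: group emitted values by key, one 'queue' per first name."""
    queues = defaultdict(list)
    for key, value in pairs:
        queues[key].append(value)
    return queues


def reduce_phase(queues):
    """Reduce: summarize each queue by summing its values."""
    return {name: sum(values) for name, values in queues.items()}


# Hypothetical input data, made up for the example.
students = ["Ann Lee", "Bob Stone", "Ann Kim", "Cara Diaz", "Bob Yu", "Ann Wu"]

name_counts = reduce_phase(shuffle(map_phase(students)))
print(name_counts)  # {'Ann': 3, 'Bob': 2, 'Cara': 1}
```

In a real cluster the map and reduce calls would run on different nodes and the framework would handle the shuffle; here everything runs in one process so the data flow is easy to follow.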

Hadoop Distributed File System (HDFS) is a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster. When data is pushed to HDFS, it is automatically split into multiple blocks, and those blocks are stored and replicated across various DataNodes, ensuring high availability and fault tolerance.
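
A toy model can make the split-and-replicate idea concrete. The sketch below assumes a tiny block size, a replication factor of 3, and a simple round-robin placement; real HDFS uses much larger blocks and its own placement policy, so the helper names and the placement rule here are illustrative assumptions only.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Cut the data into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, datanodes, replication=3):
    """Assign each block to `replication` distinct nodes (round-robin, an assumption)."""
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    return {
        block_id: [datanodes[(block_id + r) % len(datanodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


def surviving_replicas(placement, failed_node):
    """Show which replicas remain readable after one node fails."""
    return {
        block_id: [node for node in nodes if node != failed_node]
        for block_id, nodes in placement.items()
    }


blocks = split_into_blocks(b"x" * 10, block_size=4)          # 3 blocks: 4, 4, 2 bytes
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
after_failure = surviving_replicas(placement, failed_node="dn2")

print(placement)
print(after_failure)
```

Because every block lives on three different nodes, losing any single node still leaves at least two copies of each block, which is the fault-tolerance property described above.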

The NameNode holds the information about all the other nodes in the Hadoop cluster, the files present in the cluster, the constituent blocks of those files and their locations in the cluster, and other information useful for the operation of the cluster. Each DataNode is responsible for holding the data. The JobTracker keeps track of the individual tasks/jobs assigned to each of the nodes and coordinates the exchange of information and results. Each TaskTracker is responsible for running the task/computation assigned to it.
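
To show how these roles fit together, here is a small sketch in which a dictionary stands in for the NameNode's metadata and a function plays the part of the JobTracker, handing each block's task to a node that already stores that block. The paths, block IDs, node names, and the least-loaded tie-breaking rule are all hypothetical and chosen only for illustration.

```python
# Toy NameNode metadata: which blocks make up a file, and where each block lives.
namenode = {
    "files": {"/logs/day1.txt": ["blk_1", "blk_2", "blk_3"]},
    "block_locations": {
        "blk_1": ["node-a", "node-b"],
        "blk_2": ["node-b", "node-c"],
        "blk_3": ["node-a", "node-c"],
    },
}


def assign_tasks(namenode, path):
    """Toy JobTracker: one task per block, run on a node that holds the block.

    Among the nodes holding a block, pick the one with the fewest tasks so far
    (a simplifying assumption, not Hadoop's real scheduler).
    """
    load = {}
    assignments = {}
    for block in namenode["files"][path]:
        candidates = namenode["block_locations"][block]
        chosen = min(candidates, key=lambda node: load.get(node, 0))
        assignments[block] = chosen
        load[chosen] = load.get(chosen, 0) + 1
    return assignments


assignments = assign_tasks(namenode, "/logs/day1.txt")
print(assignments)  # {'blk_1': 'node-a', 'blk_2': 'node-b', 'blk_3': 'node-c'}
```

Each node in `assignments` corresponds to a TaskTracker that would run the computation locally, next to the data held by the DataNode on the same machine.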

A small Hadoop cluster includes a single master and multiple worker nodes. The master node consists of a JobTracker, TaskTracker, NameNode and DataNode. A slave or worker node acts as both a DataNode and TaskTracker.

There are many other tools that work with Hadoop:

Hive is an open-source data warehouse system for querying and analyzing large datasets stored in Hadoop files. It is part of the Hadoop ecosystem and provides a SQL-like interface to Hadoop. Queries are expressed in a language called HiveQL, and Hive automatically translates these SQL-like queries into MapReduce jobs executed on Hadoop. Hive appeals to data analysts familiar with SQL.
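
As a rough illustration of what "translating a query into a MapReduce job" means, the sketch below takes a hypothetical SQL-like query (shown in a comment) and expresses it by hand as a map step and a reduce step. The table, columns, and data are invented, and this is a conceptual analogy rather than the plan Hive would actually generate.

```python
from collections import defaultdict

# Hypothetical query (table and column names are made up):
#   SELECT dept, SUM(salary) FROM employees WHERE salary > 1000 GROUP BY dept;

rows = [
    {"dept": "sales", "salary": 3000},
    {"dept": "sales", "salary": 900},
    {"dept": "eng", "salary": 5000},
    {"dept": "eng", "salary": 4000},
    {"dept": "ops", "salary": 800},
]


def mapper(row):
    """The WHERE clause becomes a filter; the GROUP BY column becomes the key."""
    if row["salary"] > 1000:
        yield row["dept"], row["salary"]


def reducer(key, values):
    """The aggregate function (SUM) is applied to all values that share a key."""
    return key, sum(values)


def run_job(rows):
    """Run map, group by key, then reduce, all in one process."""
    grouped = defaultdict(list)
    for row in rows:
        for key, value in mapper(row):
            grouped[key].append(value)
    return dict(reducer(key, values) for key, values in grouped.items())


totals = run_job(rows)
print(totals)  # {'sales': 3000, 'eng': 9000}
```

The analyst only writes the query; the appeal of Hive is that the equivalent of `mapper` and `reducer` is produced automatically.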

Pig enables you to write programs in a procedural language called Pig Latin; these programs are compiled to MapReduce programs that run on the cluster. It also provides fluent controls to manage data flow. Pig is more of a scripting language, while Hive is more SQL-like. With Pig you can write complex data transformations on a data set, such as aggregate, join and sort. It can be extended with User Defined Functions written in Java, which Pig scripts can call directly.
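
The step-by-step, data-flow style of Pig can be mimicked in Python. In the sketch below, each comment shows an illustrative Pig Latin statement and the line beneath it performs a comparable join, aggregate, or sort on small in-memory lists. The relation names and data are invented, and the Python only approximates what Pig would run as MapReduce jobs.

```python
from collections import Counter

# Hypothetical input relations.
students = [("Ann", "math"), ("Bob", "physics"), ("Cara", "math")]
depts = [("math", "Building A"), ("physics", "Building B")]

# joined = JOIN students BY dept, depts BY dept;
building_by_dept = dict(depts)
joined = [
    (name, dept, building_by_dept[dept])
    for name, dept in students
    if dept in building_by_dept
]

# grouped = GROUP joined BY building;
# counts  = FOREACH grouped GENERATE group, COUNT(joined);
counts = Counter(building for _, _, building in joined)

# ordered = ORDER counts BY $1 DESC;
ordered = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)

print(ordered)  # [('Building A', 2), ('Building B', 1)]
```

Each named intermediate result feeds the next statement, which is the procedural flavor that distinguishes Pig from a single declarative Hive query.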

While Hadoop is a natural choice for processing unstructured and semi-structured data like logs and files, there may be a need to process structured data stored in relational databases as well. Sqoop (SQL-to-Hadoop) is a tool that allows you to import structured data from SQL Server and SQL Azure to HDFS and then use it in MapReduce and Hive jobs. You can also use Sqoop to move data from HDFS to SQL Server.

How Hadoop fits in with the Parallel Data Warehouse (PDW) and Polybase:

More info:

Big Data Basics – Part 3 – Overview of Hadoop

Video Hadoop Tutorial: Core Apache Hadoop

Hadoop Tutorial: Intro to HDFS

Hadoop: What it is, how it works, and what it can do

Hive Data Warehouse: Lessons Learned
