Big Data - An Introduction to Pig

In my previous article, I explained Big Data and Hadoop in detail. In this article I would like to go a little deeper with Pig. Pig is a high-level platform for creating MapReduce programs that run on Hadoop. Pig is made up of two components: the first is the language itself, called Pig Latin, and the second is a runtime environment in which Pig Latin programs are executed. Pig Latin can be extended with UDFs (User Defined Functions), which the user can write in Java, Python or JavaScript and then call directly from the language.
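As a rough illustration of the UDF point, a Python UDF might look like the sketch below. It assumes Pig's Jython script engine, which supplies the outputSchema decorator; the fallback definition is only there so the file can also be run as ordinary Python. The function name normalize_language and the field it is meant for are hypothetical.

```python
# Pig's Jython engine makes an `outputSchema` decorator available to UDF scripts.
# The fallback below is an assumption so the file also runs outside Pig.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def decorator(func):
            return func
        return decorator


@outputSchema("language:chararray")
def normalize_language(code):
    """Return a trimmed, lower-case language code, or None for missing input."""
    if code is None:
        return None
    return code.strip().lower()
```

In a Pig script, such a file would typically be registered with a statement like REGISTER 'udfs.py' USING jython AS myudfs; and the function called as myudfs.normalize_language(iso_language_code). The file name and the alias are assumptions made for this example.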

Pig was initially developed at Yahoo! Research around 2006. The whole intention behind developing Pig was to let people using Hadoop focus more on analyzing large data sets and spend less time writing mapper and reducer programs. Just as pigs eat almost anything, the Pig programming language is designed to handle any kind of data, and that is the reason Yahoo! named it Pig.

The first step in a Pig program is to LOAD the data you want to manipulate from HDFS. Then you run the data through a set of transformations (which, under the covers, are translated into a set of mapper and reducer tasks). Finally, you DUMP the data to the screen or you STORE the results in a file somewhere.

Let us talk about LOAD, TRANSFORM, DUMP and STORE in detail.

LOAD

The objects that Hadoop works on are stored in HDFS. For a Pig program to access this data, the program must first tell Pig what file (or files) it will use. This is done through the LOAD 'data_file' command, where 'data_file' specifies either an HDFS file or a directory. If a directory is specified, all the files in that directory are loaded into the program. If the data is stored in a file format that is not natively accessible to Pig, you can optionally add the USING clause to the LOAD statement to specify a user-defined function that can read in and interpret the data.

TRANSFORM

The transformation logic is where all the data manipulation happens. Here we can FILTER out rows that are not of interest, JOIN two sets of data files, GROUP data to build aggregations, ORDER results, and much more. The following is an example of a Pig program that takes a file composed of Facebook comments, selects only those comments that are in English, then groups them by the user who is commenting, and computes the total number of re-comments on each user's comments.

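-- Assumed schema: the field names and types are illustrative; adjust them to match your data.
L = LOAD 'hdfs://node/facebook_comment' AS (from_user:chararray, iso_language_code:chararray, recomments:int);

FL = FILTER L BY iso_language_code == 'en';

G = GROUP FL BY from_user;

RT = FOREACH G GENERATE group, SUM(FL.recomments);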
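To make the "mapper and reducer tasks" idea more concrete, here is a minimal plain-Python sketch of what the Pig example above computes. It is only an illustration of the map, group and reduce steps, not the code Pig actually generates; the sample rows and the function names are made up for this example.

```python
from collections import defaultdict

# Hypothetical sample rows; the field names follow the Pig example above.
comments = [
    {"from_user": "alice", "iso_language_code": "en", "recomments": 3},
    {"from_user": "bob", "iso_language_code": "fr", "recomments": 5},
    {"from_user": "alice", "iso_language_code": "en", "recomments": 4},
    {"from_user": "carol", "iso_language_code": "en", "recomments": 1},
]


def map_phase(rows):
    """FILTER plus key extraction: emit (from_user, recomments) for English rows."""
    for row in rows:
        if row["iso_language_code"] == "en":
            yield row["from_user"], row["recomments"]


def shuffle(pairs):
    """GROUP: collect all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """FOREACH ... SUM: produce one total per user."""
    return {user: sum(values) for user, values in groups.items()}


totals = reduce_phase(shuffle(map_phase(comments)))
print(sorted(totals.items()))  # [('alice', 7), ('carol', 1)]
```

The filter runs on the map side, the grouping corresponds to the shuffle between the two phases, and the sum is the reducer's job. Pig works out this split for us from the four Pig Latin statements.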

DUMP and STORE

If we don't specify a DUMP or STORE command, the results of a Pig program are not generated. While debugging our Pig programs, we typically use the DUMP command to send the output to the screen. When we go into production, we simply change the DUMP call to a STORE call so that the results of running our programs are stored in a file for further processing or analysis. Please note that the DUMP command can be used anywhere in our program to dump intermediate result sets to the screen, which is very helpful for debugging.
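The fact that nothing is produced until a DUMP or STORE appears can be pictured with Python generators. This is only an analogy in plain Python, not Pig code, and every name in it is hypothetical: defining the pipeline does no work, and the rows are read only when something asks for the output.

```python
work_log = []  # records every row that is actually read


def load(rows):
    """Stand-in for LOAD: yields rows lazily and logs each read."""
    for row in rows:
        work_log.append(row["from_user"])
        yield row


def keep_english(rows):
    """Stand-in for FILTER: passes through English rows only."""
    for row in rows:
        if row["iso_language_code"] == "en":
            yield row


rows = [
    {"from_user": "alice", "iso_language_code": "en", "recomments": 3},
    {"from_user": "bob", "iso_language_code": "fr", "recomments": 5},
]

pipeline = keep_english(load(rows))  # like defining L and FL: nothing runs yet
reads_before = len(work_log)

result = list(pipeline)              # like DUMP or STORE: forces execution
reads_after = len(work_log)

print(reads_before, reads_after, len(result))  # 0 2 1
```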

How to run a Pig program

Once our Pig program is ready, we need to run it in the Hadoop environment. There are three ways to run a Pig program:

Embedded in a script
Embedded in a Java program
From Grunt (the Pig command line)

It doesn't matter which of the three ways we use to run the program. The Pig runtime environment translates the program into a set of map and reduce tasks and runs them on our behalf. I will talk about Python in my next blog; till then, happy reading.
