Hadoop Pig

Pig Join- A Beginners guide to Pig Joins & Examples

As we know Pig is a framework to analyze datasets using a high-level scripting language called Pig Latin and Pig Joins plays an important role in that. Here I will talk about Pig join with Pig Join Example.This will be a complete guide to Pig join and Pig join example and I will show the examples with different scenario considering in mind.

Pig Join Example

Pig Join

Pig joins are similar to the SQL joins we have read. It is used to find the relation between two tables based on certain common fields.

In pig also, we have basically below three kinds of joins-

  • Self-join
  • Inner-join
  • Outer-join − left join, right join, and full join

We will see the each pig join in details here with few pig join examples as well.

Self- Join

Self-join is joining a table with itself. Usually, we rename the other relation in self-join. So in Self-join we load the same data under different relations so that we can put the join. I will explain this in details while explaining Pig join example.

Inner Join

Inner join is one of the most frequently used join. Inner join returns the common rows between the two tables based on the condition implied. Inner join is also called as equi join.

Let’s say we have two tables A & B and we are joining based on the join-predicate “Order Id” then Inner join will check each row of A to B and whenever there will be any match of order id, it will return that row.

Outer Join

Outer join returns at least the complete rows of one of the relations. Based on the fact that from which relation it will return the complete row, it is further divided into three types-

  • Left Outer Join: Returns all the rows from left table even if there is no matched rows in the right table.
  • Right Outer Join: Returns all the rows from right table even if there is no matched rows in the left table.
  • Full Outer Join: It returns rows when there is a match in any of the participating relation.

Now let me share the examples on each of these Pig Joins.

Pig Join Example

Let’s say we have the following two files of a retail chain- Customer file and order file.

Here are the data of the customer and order files-

customer.txt

1,Ramesh,32,Ahmedabad,2000
2,Khilan,25,Delhi,150
3,kaushik,23,Kota,2000
4,Chaitali,25,Mumbai,6500
5,Hardik,27,Bhopal,8500
6,Komal,22,MP,4500
7,Muffy,24,Indore,10000

order.txt

102,2009-10-08 00:00:00,3,3000
100,2009-10-08 00:00:00,3,1500
101,2009-11-20 00:00:00,2,1560
103,2008-05-20 00:00:00,4,2060

Please note both the files and comma separated and so while loading the data, you will have to use the command PigStorage(‘,’).

Load both the text files in the relations-

grunt> customers = LOAD '/user/cloudera/customer.txt' USING PigStorage(',')   as (id:int, name:chararray, age:int, address:chararray, salary:int);
grunt> orders = LOAD '/user/cloudera/order.txt' USING PigStorage(',')   as (id:int, name:chararray, age:int, address:chararray, salary:int);

Now let’s start with Pig Join Example with each joins.

Self-Join Example

As I mentioned above, in self-join, we have to rename the same relation and then apply the join condition. Let’s do it.

Let’s say both the relations are: Relation1 and Relation2. Now insert the customer.txt file in both the relations.

grunt> Relation1 = LOAD '/user/cloudera/customer.txt' USING PigStorage(',')   as (id:int, name:chararray, age:int, address:chararray, salary:int);
grunt> Relation2 = LOAD '/user/cloudera/customer.txt' USING PigStorage(',')   as (id:int, name:chararray, age:int, address:chararray, salary:int);

Now apply the self join and let’s say the output should be stored in Relation3. Then the join wil be like-

grunt> Relation3= JOIN Relation1 BY id, Relation2 by id;

Check the output using the below dump command.

grunt> dump Relation3;

It will give the output like the below screenshot.

Pig Join Example

Inner Join Example

Here we will find the common data between the customer relation and order relation.

grunt> orders = LOAD '/user/cloudera/order.txt' USING PigStorage(',')   as (orderid:int, date: chararray, custID:int, amount: int);
grunt> customers = LOAD '/user/cloudera/customer.txt' USING PigStorage(',')   as (id:int, name:chararray, age:int, address:chararray, salary:int);
grunt> Cust_Order= JOIN customers by id, orders by custID;

Here Cust_Order will be the output relation of the inner join and will contain the common rows between the order table and the customer table.

To verify the output, just use the dump command.

grunt> dump Cust_Order;

Pig Join Example

Left Outer Join Example

Let’s take the same relations customers and orders.

grunt> OuterJoin_Left= JOIN customers by id LEFT OUTER, orders by custID;
grunt> dump OuterJoin_Left;

Pig Join Example

Right Outer Join Example

Considering the same example, let’s find the right outer join of both the relations.

grunt> OuterJoin_Right= JOIN customers by id RIGHT OUTER, orders by custID;
grunt> DUMP OuterJoin_Right;

Pig Join Example

Full Outer Join Example

Here is the example of Full outer join-

grunt> FullOuter= JOIN customers by id FULL OUTER, orders by custID;
grunt> dump FullOuter;

Pig Join Example

Wrapping it up!

These were all about Pig Join and Pig join example. I hope you understood the joins in Pig thoroughly and can implement as and when required.

Like I used one column in the join condition, you can use multiple columns as well. Do try these and share your experience.

Advertisement: Stay connected to your favorite Windows Applications on the go with a cloud desktop by CloudDesktopOnline.com- Provider of Desktop-as-a-Service. Get a free trial of Office 365 from O365CloudExperts .

Leave a Comment