As we know Pig is a framework to analyze datasets using a high-level scripting language called Pig Latin and Pig Joins plays an important role in that. Here I will talk about Pig join with Pig Join Example.This will be a complete guide to Pig join and Pig join example and I will show the examples with different scenario considering in mind.
Pig Join
Pig joins are similar to the SQL joins we have read. It is used to find the relation between two tables based on certain common fields.
In pig also, we have basically below three kinds of joins-
- Self-join
- Inner-join
- Outer-join − left join, right join, and full join
We will see the each pig join in details here with few pig join examples as well.
Self- Join
Self-join is joining a table with itself. Usually, we rename the other relation in self-join. So in Self-join we load the same data under different relations so that we can put the join. I will explain this in details while explaining Pig join example.
Inner Join
Inner join is one of the most frequently used join. Inner join returns the common rows between the two tables based on the condition implied. Inner join is also called as equi join.
Let’s say we have two tables A & B and we are joining based on the join-predicate “Order Id” then Inner join will check each row of A to B and whenever there will be any match of order id, it will return that row.
Outer Join
Outer join returns at least the complete rows of one of the relations. Based on the fact that from which relation it will return the complete row, it is further divided into three types-
- Left Outer Join: Returns all the rows from left table even if there is no matched rows in the right table.
- Right Outer Join: Returns all the rows from right table even if there is no matched rows in the left table.
- Full Outer Join: It returns rows when there is a match in any of the participating relation.
Now let me share the examples on each of these Pig Joins.
Pig Join Example
Let’s say we have the following two files of a retail chain- Customer file and order file.
Here are the data of the customer and order files-
customer.txt
1,Ramesh,32,Ahmedabad,2000 2,Khilan,25,Delhi,150 3,kaushik,23,Kota,2000 4,Chaitali,25,Mumbai,6500 5,Hardik,27,Bhopal,8500 6,Komal,22,MP,4500 7,Muffy,24,Indore,10000
order.txt
102,2009-10-08 00:00:00,3,3000 100,2009-10-08 00:00:00,3,1500 101,2009-11-20 00:00:00,2,1560 103,2008-05-20 00:00:00,4,2060
Please note both the files and comma separated and so while loading the data, you will have to use the command PigStorage(‘,’).
Load both the text files in the relations-
grunt> customers = LOAD '/user/cloudera/customer.txt' USING PigStorage(',') as (id:int, name:chararray, age:int, address:chararray, salary:int); grunt> orders = LOAD '/user/cloudera/order.txt' USING PigStorage(',') as (id:int, name:chararray, age:int, address:chararray, salary:int);
Now let’s start with Pig Join Example with each joins.
Self-Join Example
As I mentioned above, in self-join, we have to rename the same relation and then apply the join condition. Let’s do it.
Let’s say both the relations are: Relation1 and Relation2. Now insert the customer.txt file in both the relations.
grunt> Relation1 = LOAD '/user/cloudera/customer.txt' USING PigStorage(',') as (id:int, name:chararray, age:int, address:chararray, salary:int); grunt> Relation2 = LOAD '/user/cloudera/customer.txt' USING PigStorage(',') as (id:int, name:chararray, age:int, address:chararray, salary:int);
Now apply the self join and let’s say the output should be stored in Relation3. Then the join wil be like-
grunt> Relation3= JOIN Relation1 BY id, Relation2 by id;
Check the output using the below dump command.
grunt> dump Relation3;
It will give the output like the below screenshot.
Inner Join Example
Here we will find the common data between the customer relation and order relation.
grunt> orders = LOAD '/user/cloudera/order.txt' USING PigStorage(',') as (orderid:int, date: chararray, custID:int, amount: int); grunt> customers = LOAD '/user/cloudera/customer.txt' USING PigStorage(',') as (id:int, name:chararray, age:int, address:chararray, salary:int); grunt> Cust_Order= JOIN customers by id, orders by custID;
Here Cust_Order will be the output relation of the inner join and will contain the common rows between the order table and the customer table.
To verify the output, just use the dump command.
grunt> dump Cust_Order;
Left Outer Join Example
Let’s take the same relations customers and orders.
grunt> OuterJoin_Left= JOIN customers by id LEFT OUTER, orders by custID; grunt> dump OuterJoin_Left;
Right Outer Join Example
Considering the same example, let’s find the right outer join of both the relations.
grunt> OuterJoin_Right= JOIN customers by id RIGHT OUTER, orders by custID; grunt> DUMP OuterJoin_Right;
Full Outer Join Example
Here is the example of Full outer join-
grunt> FullOuter= JOIN customers by id FULL OUTER, orders by custID; grunt> dump FullOuter;
Wrapping it up!
These were all about Pig Join and Pig join example. I hope you understood the joins in Pig thoroughly and can implement as and when required.
Like I used one column in the join condition, you can use multiple columns as well. Do try these and share your experience.
Advertisement: Stay connected to your favorite Windows Applications on the go with a cloud desktop by CloudDesktopOnline.com- Provider of Desktop-as-a-Service. Get a free trial of Office 365 from O365CloudExperts .
Leave a Comment