Hadoop Scenario Based Interview Questions and Answers [Refreshed]

[vc_row][vc_column][vc_column_text]

If you have ever appeared for a Hadoop interview, you must have faced many Hadoop scenario based interview questions.

Here I have compiled a list of Hadoop scenario based interview questions and tried to answer these real-time Hadoop interview questions. You can use them to prepare for your next Hadoop interview.

Also, I would love to know about your experience and the questions asked in your interview. Do share those Hadoop interview questions in the comment box, and I will add them to this Hadoop scenario based interview questions post. Let’s make it the only destination for all Hadoop interview questions and answers.

Let’s start with some major Hadoop interview questions and answers. I have covered interview questions from almost every part of the ecosystem: Hive, Pig, Sqoop, HBase, etc.

[/vc_column_text][/vc_column][/vc_row][vc_row][vc_column][vc_cta h2=”1. What are the differences between -copyFromLocal and -put command” txt_align=”justify” css=”.vc_custom_1482389955046{border-radius: 2px !important;}”]Ans: Both -put and -copyFromLocal serve a similar purpose, but there is a small difference. First, see what each command does-

-put: It copies one or more files from the local file system to the destination file system. It can also read from standard input when the source is given as "-".

-copyFromLocal: It copies files from the local file system to the Hadoop file system, and the source must be a local file.

So -put can do everything -copyFromLocal does, but the reverse is not true. The main difference between the -copyFromLocal and -put commands is that in -copyFromLocal the source is restricted to a local file reference, while -put also accepts standard input as its source.

Usage of these commands-

hadoop fs -copyFromLocal <localsrc> URI

hadoop fs -put <localsrc> … <destination>

[/vc_cta][/vc_column][/vc_row][vc_row][vc_column][vc_cta h2=”2. What are the differences between -copyToLocal and -put command” txt_align=”justify” color=”mulled-wine”]Ans: The answer is similar to the previous one, except that -copyToLocal works in the opposite direction: -put copies files into the Hadoop file system, while -copyToLocal copies files out of it.

So in the -copyToLocal command, the destination has to be the local file system.[/vc_cta][/vc_column][/vc_row][vc_row][vc_column][vc_cta h2=”3. What is the default block size in Hadoop and can it be increased?” txt_align=”justify”]Ans: The default block size in Hadoop 1 is 64 MB, while in Hadoop 2 it is 128 MB.

It can be increased as per your requirements. You can check Hadoop Terminology for more details.
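To get a feel for what a bigger block size means, here is a small Python sketch (plain arithmetic, not Hadoop code) that counts how many blocks a file occupies at different block sizes. The 1 GB file and the function name `block_count` are just assumptions for illustration.

```python
import math

MB = 1024 * 1024

def block_count(file_size_bytes: int, block_size_bytes: int) -> int:
    """Number of HDFS blocks needed to hold a file of the given size."""
    if file_size_bytes == 0:
        return 0
    # The last block may be only partly filled, so round up.
    return math.ceil(file_size_bytes / block_size_bytes)

file_size = 1024 * MB  # hypothetical 1 GB file
for block_mb in (64, 128, 256):
    blocks = block_count(file_size, block_mb * MB)
    print(f"{block_mb} MB block size -> {blocks} blocks")
```

Fewer, larger blocks mean fewer entries for the NameNode to track and fewer map tasks for the same file.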

In fact, changing the block size is easy: set the dfs.blocksize property (dfs.block.size in Hadoop 1) in hdfs-site.xml, or pass it on the command line for a single file. Use the below command to write a file with a non-default block size.

hadoop fs -D dfs.blocksize=sizeinbytes -put local_name remote_location

Just put the block size you want, in bytes, in place of the “sizeinbytes” placeholder (for example, 268435456 for 256 MB).[/vc_cta][/vc_column][/vc_row][vc_row][vc_column][vc_cta h2=”4. How to import RDBMS table in Hadoop using Sqoop when the table doesn’t have a primary key column?” txt_align=”justify”]Ans: Usually, we import an RDBMS table into Hadoop using Sqoop import when it has a primary key column. If it doesn’t have a primary key column, Sqoop gives the below error-

ERROR tool.ImportTool: Error during import: No primary key could be found for table <table_name>. Please specify one with --split-by or perform a sequential import with '-m 1'

Here is what to do when your RDBMS table doesn’t have a primary key column and you want to import it using Sqoop.

If your table doesn’t have a primary key column, you need to either specify the -m 1 option so that the import runs sequentially with a single mapper, or provide the --split-by argument with a column name that Sqoop can use to divide the work.
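To see why Sqoop needs a split column whenever more than one mapper is used, here is a simplified Python sketch of the idea: take the minimum and maximum of the split column and hand each mapper a contiguous slice of that range. This is only an illustration under my own assumptions (an integer column named `id`, an even split), not Sqoop's actual splitter code.

```python
def split_ranges(min_val: int, max_val: int, num_mappers: int):
    """Divide the inclusive range [min_val, max_val] into one slice per mapper."""
    total = max_val - min_val + 1
    base, extra = divmod(total, num_mappers)
    ranges = []
    lo = min_val
    for i in range(num_mappers):
        # Spread any remainder over the first few mappers.
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue  # more mappers than values: nothing left to assign
        hi = lo + size - 1
        ranges.append((lo, hi))
        lo = hi + 1
    return ranges

# Hypothetical split column "id" with values from 1 to 1000, 4 mappers.
for lo, hi in split_ranges(1, 1000, 4):
    print(f"mapper reads rows WHERE id BETWEEN {lo} AND {hi}")
```

With -m 1 there is only one slice, so no split column is needed. That is why it works for tables without a primary key.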

Here are the scripts which you can use to import an RDBMS table in Hadoop using Sqoop when you don’t have a primary key column.

sqoop import \
--connect jdbc:mysql://localhost/dbname \
--username root \
--password root \
--table user \
--target-dir /user/root/user_data \
--columns "first_name,last_name,created_date" \
-m 1

sqoop import \
--connect jdbc:mysql://localhost/dbname \
--username root \
--password root \
--table user \
--target-dir /user/root/user_data \
--columns "first_name,last_name,created_date" \
--split-by created_date

[/vc_cta][/vc_column][/vc_row][vc_row][vc_column][vc_cta h2=”5. What is CBO in Hive?” txt_align=”justify”]Ans: CBO is cost-based optimization and applies to any database or any tool where optimization can be used.

So it is a part of what you call Hive query optimization. Here are the steps a cost-based optimizer goes through while planning a query.

Parse and validate query
Generate possible execution plans
For each logically equivalent plan, assign a cost
Choose the plan with the lowest estimated cost

You can also check the Hortonworks technical sheet on this for more details.[/vc_cta][/vc_column][/vc_row][vc_row][vc_column][vc_cta h2=”6. Can we use LIKE operator in Hive?”]Yes, Hive supports the LIKE operator, but it doesn’t support multi-value LIKE queries like the one below-

SELECT * FROM user_table WHERE first_name LIKE ANY ('root~%', 'user~%');

So you can use the LIKE operator in Hive whenever you need it. When you have to use a multi-value LIKE, break it into separate LIKE conditions combined with OR so that it works in Hive. The pattern itself can also be built from a column value.
E.g.:

WHERE table2.product LIKE concat('%', table1.brand, '%')

[/vc_cta][/vc_column][/vc_row][vc_row][vc_column][vc_cta h2=”7. Can you use IN/EXIST operator in Hive?”]UPDATE: Yes, it is supported now. Please check the comments for the link.

Yes, Hive now supports the IN and EXISTS operators. You can also use a left semi join here. LEFT SEMI JOIN performs the same operation that IN does in SQL.

So if you have the below query in SQL-

SELECT a.key, a.val
FROM a
WHERE a.key IN
(SELECT b.key
FROM b);

Then the suitable query for the same in Hive can be-

SELECT a.key, a.val
FROM a LEFT SEMI JOIN b on (a.key = b.key)
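To make the semantics concrete, here is a small Python sketch that mimics what a LEFT SEMI JOIN returns and, for contrast, what an ordinary inner join returns. The rows and the helper names `inner_join` and `left_semi_join` are hypothetical. It only models the behaviour and is not how Hive executes the join.

```python
# Hypothetical rows: table a has (key, val), table b has (key, other).
a = [(1, "a1"), (2, "a2"), (3, "a3")]
b = [(1, "b1"), (1, "b2"), (3, "b3")]

def inner_join(left, right):
    """One output row per matching (left, right) pair, with columns from both sides."""
    return [(lk, lv, rv) for lk, lv in left for rk, rv in right if lk == rk]

def left_semi_join(left, right):
    """Left rows that have at least one match; left columns only, no duplicates added."""
    right_keys = {rk for rk, _ in right}
    return [(lk, lv) for lk, lv in left if lk in right_keys]

print(inner_join(a, b))      # key 1 appears twice because b has two matching rows
print(left_semi_join(a, b))  # key 1 appears once, exactly like WHERE a.key IN (...)
```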

Both the SQL and the Hive query fulfill the same purpose.[/vc_cta][/vc_column][/vc_row][vc_row][vc_column][vc_cta h2=”8. What are the differences between INNER JOIN and LEFT SEMI JOIN?”]Ans: Left semi join in Hive was traditionally used in place of the IN operator, because older Hive versions did not support IN with a subquery. Now coming to the differences: an inner join returns the matching data from both tables depending on the join condition, and a left-hand row is repeated once for every matching right-hand row. A left semi join returns only the columns of the left-hand table, and each left-hand row appears at most once no matter how many matches it has on the right.[/vc_cta][/vc_column][/vc_row][vc_row][vc_column][vc_cta h2=”9. What are the differences between External and Internal Tables in Hive” txt_align=”justify”]Ans: As we know, there are two kinds of tables in Hive: Internal (Managed) and External. For an internal table (the default), data is stored at the default Hive warehouse location, while for an external table you can specify the location.

The major differences between internal and external tables are-

External Table	Internal Table
Data files stay at the HDFS location you specify when creating the table.	Data is stored in a directory based on the hive.metastore.warehouse.dir setting. By default, internal tables are stored under “/user/hive/warehouse”; you can change this by updating the location in the config file.
If you delete an external table, the file still remains on HDFS. For example, if you create an external table called “table_test” in Hive using HiveQL and link the table to file “file”, then deleting “table_test” from Hive will not delete “file” from HDFS.	Deleting the table deletes both the metadata from the metastore and the data from HDFS.
External table files are accessible to anyone who has access to the HDFS file structure, so security needs to be managed at the HDFS file/folder level.	Internal table file security is controlled solely via Hive. Security needs to be managed within Hive, probably at the schema level (this varies from organisation to organisation).
Metadata is maintained in the metastore, and deleting an external table from Hive deletes only the metadata, not the data/file.	It is the default table type in Hive.

Hive may have internal or external tables; this choice affects how data is loaded, controlled, and managed.[/vc_cta][/vc_column][/vc_row][vc_row][vc_column][vc_cta h2=”10. When to use external and internal tables in Hive?”]Use EXTERNAL tables when:

The data is also used outside of Hive. For example, the data files are read and processed by an existing program that doesn’t lock the files.
Data needs to remain in the underlying location even after a DROP TABLE. This can apply if you are pointing multiple schemas (tables or views) at a single data set or if you are iterating through various possible schemas.
Hive should not own the data or control settings, directories, etc., because another program or process will do those things.
You are not creating a table based on existing table (AS SELECT).

Use INTERNAL tables when:

The data is temporary
You want Hive to completely manage the lifecycle of the table and data

[/vc_cta][/vc_column][/vc_row][vc_row][vc_column][vc_cta h2=”11. We have a Hive partitioned table with partition column as country. We have 10 partition and data for now is jut for one country, If we will copy the data manually for other 9 partitions, whether those will be reflected if we will run a command.” txt_align=”justify”]Ans: This is really a good question. Because the data was copied manually into the other partition directories, Hive’s metastore does not know about those partitions, so the data won’t be available to queries directly.

The data becomes available for all partitions when you load it through a Hive command (such as LOAD DATA or INSERT) instead of copying it manually. If the files have already been copied, you can register the partitions afterwards with ALTER TABLE ... ADD PARTITION or MSCK REPAIR TABLE.[/vc_cta][/vc_column][/vc_row][vc_row][vc_column][vc_cta h2=”12. Where the Mapper’s Intermediate data will be stored?” txt_align=”justify”]

Ans: The mapper output (which is intermediate data) is stored on the local file system (not in HDFS) of each mapper node. This is a temporary directory whose location can be set up in the configuration file by the Hadoop administrator. The intermediate data is cleaned up after the Hadoop job completes.

[/vc_cta][/vc_column][/vc_row][vc_row][vc_column][vc_cta h2=”13. What is Partition and Combiner in MapReduce?” txt_align=”justify”]Partition and combiner are two phases of a MapReduce operation that are executed after the map phase and before the reduce phase. Here are the details of partition and combiner in MapReduce.

Combiner: A combiner works like a mini reducer in the map phase, taking its input from the mapper. It performs a local reduce on the mapper output before that output is distributed further. Once the combiner has run (if one is configured), its output is passed on to the reduce phase.

Partition: Partitioning comes into the picture when you are using more than one reducer. The partitioner decides which reducer is responsible for a particular key.

It takes the input from the mapper phase or the combiner phase (if used) and then sends it to the responsible reducer based on the key. The number of partitions is equal to the number of reducers.
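Here is a small Python sketch of the two steps for a word-count style job, following the flow described above: first a combiner sums counts locally, then a partitioner picks the reducer for each key. The partition rule mirrors Hadoop's default HashPartitioner, (hash & Integer.MAX_VALUE) % numReduceTasks. As a stand-in for the key's hashCode() I use Java's String hash, which is an assumption on my part, since Hadoop's own key types compute their hash differently. The sample words are made up.

```python
from collections import defaultdict

def java_string_hash(s: str) -> int:
    """Java's String.hashCode() as a signed 32-bit integer (stand-in for key.hashCode())."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h

def partition(key: str, num_reducers: int) -> int:
    """Same rule as the default HashPartitioner: clear the sign bit, then take the modulo."""
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reducers

def combine(pairs):
    """Local reduce on one mapper's output: sum the counts per key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return sorted(totals.items())

# Hypothetical output of a single mapper.
mapper_output = [("hive", 1), ("pig", 1), ("hive", 1), ("sqoop", 1), ("hive", 1)]
combined = combine(mapper_output)  # five records shrink to three

num_reducers = 2
by_reducer = defaultdict(list)
for key, count in combined:
    by_reducer[partition(key, num_reducers)].append((key, count))

print(combined)
print(dict(by_reducer))
```

Because the reducer is chosen from the key alone, every mapper sends the same key to the same reducer, which is what lets that reducer compute the final total.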

So in the flow described here, the combiner comes first and then the partitioner. The images below from Yahoo depict the operation beautifully.

Figure: Partition and combiner in MapReduce, when a combiner is being used

Figure: Partition and combiner in MapReduce, when a combiner is not being used

(Images are from Yahoo)

[/vc_cta][/vc_column][/vc_row][vc_row][vc_column][vc_cta h2=”14. What is the difference between Static and Dynamic Partition”]Partition in Hive is an important concept and is one of the best Hive performance tuning techniques as well.

As we know, there are two types of partition in Hive-

Static Partition
Dynamic Partition

Now coming to the difference between static and dynamic partition, static partitioning is the default in Hive.

Static Partition: Static partitioning is usually preferred while loading big files into Hive tables. It saves load time, because Hive does not have to read the data to work out which partition each row belongs to.

You add the partition manually and move the file into the partitioned table manually. You can get the partition column value from the file name without reading the whole file.

In static partition, you need to specify the partition column value in each load. For example, let’s say we have a table with the population of the USA and the files are organised by state. In this case, we can partition on the state, and so each time you load a file, you need to specify the state value as shown below.

hive> LOAD DATA INPATH '/hdfs path of the file' INTO TABLE tblname PARTITION(state="Illinois");
hive> LOAD DATA INPATH '/hdfs path of the file' INTO TABLE tblname PARTITION(state="DC");

As this is the default mode of Hive, you may find the below property set in hive-site.xml

set hive.mapred.mode = strict

In strict mode, a query on a partitioned table must filter on the partition column in its WHERE clause.

Dynamic Partition: Here every row of the data in the file is read, and the partitioning is done through a MapReduce job. Usually, we use dynamic partitions in ETL-style jobs.

For example, let’s say you load a table X through some copy command and then copy the data from table X to table Y after some calculation and further ETL processing. In such cases, dynamic partitions are used.

As this is not the default mode of Hive, you need to set the following two properties, either in the hive-site.xml file or in your session.

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

In dynamic partition, we do not specify the partition column values in each load statement. Here are the steps to create a dynamically partitioned table with data.

Create a non-partitioned table X and load the data
Now create a partitioned table Y and specify the partition column (say state)
Load data from X to Y like below

hive> INSERT INTO TABLE Y PARTITION(state) SELECT * from X;

Here you should ensure that the partition column is the last column of the non-partitioned table.
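The reason the partition column has to come last is easier to see in a small Python sketch of what the dynamic-partition insert does conceptually: for every row, the last value is taken as the partition value and decides which partition directory the rest of the row lands in. The table contents and the population figures are made-up sample data, and the function name `dynamic_partition` is only for illustration.

```python
from collections import defaultdict

# Hypothetical rows of table X: (city, population, state); state is the last column.
rows_x = [
    ("Chicago", 2700000, "Illinois"),
    ("Springfield", 114000, "Illinois"),
    ("Washington", 690000, "DC"),
]

def dynamic_partition(rows, partition_col="state"):
    """Group rows by their last column, one partition directory per distinct value."""
    partitions = defaultdict(list)
    for *data_cols, part_value in rows:
        # The partition value becomes part of the directory name,
        # so only the remaining columns are kept with the row.
        partitions[f"{partition_col}={part_value}"].append(tuple(data_cols))
    return dict(partitions)

result = dynamic_partition(rows_x)
for directory, part_rows in result.items():
    print(directory, part_rows)
```

If the state column were not last in X, the wrong value would be picked up as the partition, which is why a plain SELECT * only works when the column order matches.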

Hope it clarified the difference between the static partition and dynamic partition in Hive.[/vc_cta][/vc_column][/vc_row]

5 Comments

Heidi says:

February 28, 2017 at 9:52 am

Surprisingly well-written and informative for a free online article.

Harshil says:

May 8, 2019 at 5:46 pm

I am not sure when this article was written, but Hive supports IN and EXISTS at least since 2014. Although it does have some limits to it which can be checked here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SubQueries. Rest of the content is very good and helps revise the concepts.

- HDFS Tutorial Team says:
  
  May 20, 2019 at 6:04 pm
  
  Thanks for pointing the correction!
  
  We have added a note there.
  
kuldeep says:

March 26, 2020 at 7:03 pm

Q 11) Isn’t the usage of commands the manual way of doing things?

Sujith says:

July 4, 2020 at 9:35 am

For the first two questions. whether it really works??

Because I have checked that even PUT command has the restriction that the source file should present in the local file (same like copyFromLocal command). If I use ‘Put’ command to copy the file from non-local location to HDFS, then it showing the error like there is no such source file in the local file system. Because it is keep on searching in the local file system for the source file rather than HDFS.
