If you have ever appeared for a Hadoop interview, you must have faced many Hadoop scenario-based interview questions.
Here I have compiled a list of Hadoop scenario-based interview questions and tried to answer all of these Hadoop real-time interview questions. You can use these Hadoop interview questions to prepare for your next Hadoop interview.
Also, I would love to know your experience and the questions asked in your interview. Do share those Hadoop interview questions in the comment box, and I will add them to this Hadoop scenario-based interview questions post. Let's make it the one destination for all Hadoop interview questions and answers.
Let’s start with some major Hadoop interview questions and answers. I have covered the interview questions from almost every part of Hive, Pig, Sqoop, HBase, etc.
## 1. What is the difference between -copyFromLocal and -put commands in Hadoop?
Ans:
- put: copies files from the given source to the destination file system in HDFS; it can also read input from stdin
- copyFromLocal: copies files from the local file system to HDFS
As you can see, -put can do what -copyFromLocal does, but the reverse is not true. So the main difference between the -copyFromLocal and -put commands is that in -copyFromLocal, the source is explicitly restricted to the local file system, while -put also accepts input from stdin. (Note that neither command copies from one HDFS location to another; use -cp or DistCp for that.)
Uses of these commands:
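Here is a minimal sketch of typical usage (the file paths are illustrative):

```sh
# Copy a local file into HDFS.
hdfs dfs -put /local/data/sample.txt /user/hadoop/sample.txt

# The same copy, with the source explicitly restricted to the local file system.
hdfs dfs -copyFromLocal /local/data/sample.txt /user/hadoop/sample.txt

# Stream stdin into an HDFS file ("-" as the source), which -copyFromLocal
# does not accept on older releases.
echo "hello" | hdfs dfs -put - /user/hadoop/hello.txt
```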
## 2. What is the difference between -copyToLocal and -get commands?
Ans: The relationship is symmetric to the one above: -get copies files from HDFS to the given destination, while in the -copyToLocal command, the destination has to be the local file system.

## 3. What is the default block size in Hadoop and can it be increased?
Ans: The default block size in Hadoop 1 is 64 MB, while in Hadoop 2 it is 128 MB.
It can be increased as per your requirements. You can check Hadoop Terminology for more details.
In fact, changing the block size is very easy: you can set the dfs.blocksize property (dfs.block.size in Hadoop 1) in hdfs-site.xml, or override it per file from the command line. Use the below command to change the block size when writing a file.
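A minimal example of the per-file override (the paths and the 256 MB value are illustrative):

```sh
# Write a file to HDFS with a 256 MB block size (the value is in bytes).
hdfs dfs -D dfs.blocksize=268435456 -put /local/data/file.csv /user/data/file.csv
```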
## 4. How will you import an RDBMS table using Sqoop when the table has no primary key?
Ans: If you simply run the import, Sqoop fails with an error like the following:

ERROR tool.ImportTool: Error during import: No primary key could be found for table <table_name>. Please specify one with --split-by or perform a sequential import with '-m 1'
Here is the solution for what to do when you don't have a primary key column in the RDBMS table and you want to import it using Sqoop.
If your table doesn't have a primary key column, you need to specify the -m 1 option to import the data sequentially, or you have to provide the --split-by argument with some column name.
Here are example commands you can use to import an RDBMS table into Hadoop using Sqoop when you don't have a primary key column.
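Both variants are sketched below; the connection string, credentials, and table/column names are illustrative:

```sh
# Option 1: sequential import with a single mapper (-m 1).
sqoop import \
  --connect jdbc:mysql://db.example.com/testdb \
  --username dbuser --password dbpass \
  --table table_name \
  -m 1

# Option 2: parallel import, splitting on an explicitly chosen column.
sqoop import \
  --connect jdbc:mysql://db.example.com/testdb \
  --username dbuser --password dbpass \
  --table table_name \
  --split-by column_name
```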
## 5. What is CBO in Hive?
Ans: CBO is cost-based optimization, and it applies to any database or tool where query optimization can be used, so it is similar to what you would call Hive query optimization. Broadly, CBO in Hive performs the following steps:
- Parse and validate query
- Generate possible execution plans
- For each logically equivalent plan, assign a cost
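To actually enable CBO, these are the commonly cited properties (taken as assumptions from the Hortonworks guidance mentioned below; available from roughly Hive 0.14 onward):

```sql
SET hive.cbo.enable = true;
SET hive.compute.query.using.stats = true;
SET hive.stats.fetch.column.stats = true;
SET hive.stats.fetch.partition.stats = true;
```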
You can also check the Hortonworks technical sheet on this for more details.

## 6. Can we use the LIKE operator in Hive?
Ans: Yes, Hive supports the LIKE operator, but it doesn't support multi-value LIKE queries like the one below.
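A sketch of what this means, with illustrative table/column names and patterns (the first form is Teradata-style syntax that classic Hive rejects):

```sql
-- Not supported in classic Hive: a multi-value LIKE.
SELECT * FROM users WHERE user_name LIKE ANY ('root%', 'admin%');

-- Break it into separate LIKE predicates instead:
SELECT * FROM users WHERE user_name LIKE 'root%' OR user_name LIKE 'admin%';
```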
So you can easily use the LIKE operator in Hive as and when you require it. And when you have to express a multi-value LIKE, break it up so that it works in Hive, as in the example above.

## 7. Can you use IN/EXISTS operators in Hive?
Ans: UPDATE: Yes, it is supported now (since Hive 0.13, with some limitations; see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SubQueries, pointed out in the comments).
Yes, Hive now supports the IN and EXISTS operators. You can also use a left semi join here: LEFT SEMI JOIN performs the same operation that IN does in SQL.
So if you have a query like the one below in SQL, the Hive equivalent can be written with LEFT SEMI JOIN, as shown after it.
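Both forms are sketched below with illustrative table and column names:

```sql
-- SQL version using IN:
SELECT a.* FROM a WHERE a.key IN (SELECT b.key FROM b);

-- Equivalent Hive query using LEFT SEMI JOIN:
SELECT a.* FROM a LEFT SEMI JOIN b ON (a.key = b.key);
```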
Both will fulfill the same purpose.

## 8. What are the differences between INNER JOIN and LEFT SEMI JOIN?
Ans: A left semi join in Hive was traditionally used in place of the IN operator (back when IN was not supported in Hive). Coming to the differences: an inner join returns the common data from both tables depending on the condition applied, while a left semi join only returns the records (and columns) from the left-hand table.

## 9. What are the differences between external and internal tables in Hive?
Ans: As we know, there are two kinds of tables in Hive: internal (managed) and external. In an internal table (the default), data is stored at the default Hive warehouse location, while in an external table you can specify the location. The major differences between internal and external tables are:
| External Table | Internal Table |
|---|---|
| Stores files at whatever HDFS location you specify. | Stored in a directory based on the hive.metastore.warehouse.dir setting; by default internal tables are stored under /user/hive/warehouse, and you can change this by updating the location in the config file. |
| If you delete an external table, the files still remain on HDFS: for example, if you create an external table called "table_test" in Hive and link it to the file "file", then dropping "table_test" will not delete "file" from HDFS. Only the metadata, maintained in the metastore, is removed. | Deleting the table deletes both the metadata and the data, from the metastore and HDFS respectively. |
| Files are accessible to anyone who has access to the HDFS file structure, so security needs to be managed at the HDFS file/folder level. | File security is controlled solely via Hive; security needs to be managed within Hive, probably at the schema level (this varies from organisation to organisation). |
| Has to be created explicitly with CREATE EXTERNAL TABLE. | It is the default table type in Hive. |
Hive can have internal or external tables; this is a choice that affects how data is loaded, controlled, and managed.

## 10. When to use external and internal tables in Hive?
Ans: Use EXTERNAL tables when:
- The data is also used outside of Hive. For example, the data files are read and processed by an existing program that doesn’t lock the files.
- Data needs to remain in the underlying location even after a DROP TABLE. This can apply if you are pointing multiple schemas (tables or views) at a single data set or if you are iterating through various possible schemas.
- Hive should not own the data and control settings, directories, etc.; you may have another program or process that will do those things.
- You are not creating a table based on an existing table (AS SELECT).
Use INTERNAL tables when:
- The data is temporary
- You want Hive to completely manage the lifecycle of the table and data
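A minimal sketch of both table types (the schema, table names, and path are illustrative):

```sql
-- External table: Hive only tracks the metadata; dropping the table
-- leaves the files at /data/raw/logs untouched.
CREATE EXTERNAL TABLE logs_ext (line STRING)
LOCATION '/data/raw/logs';

-- Managed (internal) table: data lives under the Hive warehouse directory,
-- and DROP TABLE removes both the metadata and the data.
CREATE TABLE logs_managed (line STRING);
```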
Data will be available directly for all partitions when you load it through a command rather than placing the files manually.

## 12. Where will the mapper's intermediate data be stored?
Ans: The mapper's output (the intermediate data) is never written to HDFS; it is spilled to the local disk of the node running the map task and is cleaned up once the job completes.

## 13. What is the difference between Combiner and Partitioner in MapReduce?
Ans: Combiner: The combiner works like a mini-reducer in the map phase, taking its input from the map tasks. It performs a local reduce on the mapper results before they are distributed further; once the combiner (if configured) has run, the output is passed to the reducer phase.
Partitioner: The partitioner comes into the picture when you are using more than one reducer; it decides which reducer is responsible for a particular key.
It takes the input from the mapper phase or the combiner phase (if used) and then sends each record to the responsible reducer based on the key. The number of partitions is equal to the number of reducers.
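For reference, the default partitioning logic in Hadoop is a hash of the key modulo the number of reducers; this is essentially the stock HashPartitioner that ships with Hadoop:

```java
import org.apache.hadoop.mapreduce.Partitioner;

// Default partitioning: hash the key, mask off the sign bit so the result
// is non-negative, and bucket by the number of reduce tasks.
public class HashPartitioner<K, V> extends Partitioner<K, V> {
  @Override
  public int getPartition(K key, V value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
```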
So between the partitioner and the combiner, the combiner runs first and then the partitioner. The images from Yahoo's Hadoop tutorial depict the operation beautifully, showing the data flow both when a combiner is used and when it is not. [Images are from Yahoo]
## 14. What is the difference between static and dynamic partitions in Hive?
Ans: Partitioning in Hive is an important concept and is one of the best Hive performance tuning techniques as well.
As we know, there are two types of partitions in Hive:
- Static Partition
- Dynamic Partition
Now coming to the difference between static and dynamic partitions: static partitioning is the default behaviour of Hive.
Static partition: Static partitioning is usually preferred while loading big files into Hive tables, as it mainly saves the time required to load the data into the tables.
You add the partition column value manually and move the file into the partition yourself; you can get the partition column value from the file name without reading the whole file.
In static partitioning, you need to specify the partition column value in each load. For example, say we have a table with the population of the USA, with one file per state. In this case, we can partition by state, so each time you load a file, you need to specify the state value, as shown below.
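A sketch of such a load, with hypothetical table, file, and partition values:

```sql
-- Load one state's file into its partition; the partition value is given explicitly.
LOAD DATA LOCAL INPATH '/data/population_ca.csv'
INTO TABLE us_population
PARTITION (state = 'CA');
```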
As this is the default mode of Hive, you can find the below property set in hive-site.xml.
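The property this presumably refers to (an assumption here; newer Hive versions deprecate it in favour of the hive.strict.checks.* settings) is the strict-mode switch:

```sql
-- Assumed property; in hive-site.xml it appears as hive.mapred.mode.
SET hive.mapred.mode = strict;
```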
In strict mode, you should use a WHERE clause on the partition column to limit the data read when querying a partitioned table (and an ORDER BY requires a LIMIT).
Dynamic partition: Here, every row of the data available in the file is read, and the partitioning is done through a MapReduce job. We usually use dynamic partitioning when doing ETL-style job flows.
For example, say you load a table X with a copy command and then, after some calculations and further ETL processing, copy the data from table X into a partitioned table Y. In such cases, dynamic partitions are used.
As this is not the default mode of Hive, you need to set the following two properties in the hive-site.xml file (or via SET in your session).
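These are the standard dynamic-partition settings in Hive:

```sql
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
```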
In dynamic partitioning, you do not specify the partition column values in each load statement. Here are the steps to create a dynamic-partitioned table and load it with data:
- Create a non-partitioned table X and load the data
- Now create a partitioned table Y and specify the partition column (say state)
- Load the data from X into Y, as shown below
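A sketch of that final step, with hypothetical column names (X is the staging table, Y is partitioned by state):

```sql
-- The dynamic partition column (state) must come last in the SELECT list.
INSERT OVERWRITE TABLE Y PARTITION (state)
SELECT name, city, population, state
FROM X;
```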
Here you should ensure that the partition column comes last in the select list, i.e., it is the last column of the non-partitioned table X.
Hope this clarified the difference between static and dynamic partitions in Hive.
## Comments
I am not sure when this article was written, but Hive has supported IN and EXISTS at least since 2014, although with some limits, which can be checked here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SubQueries. The rest of the content is very good and helps revise the concepts.

Thanks for pointing out the correction! We have added a note there.
Q 11) Isn’t the usage of commands the manual way of doing things?
Regarding the first two questions: do they really work as described?
I have checked that even the -put command has the restriction that the source file should be present in the local file system (just like the -copyFromLocal command). If I use the -put command to copy a file from a non-local location to HDFS, it shows an error that there is no such source file in the local file system, because it keeps searching the local file system for the source file rather than HDFS.