Recently, one of our users interviewed for a Hadoop admin position at CenturyLink and shared the interview questions and answers with us. The HdfsTutorial team has edited them and presents them here.
Prepare these questions if you have a Hadoop admin interview coming up; they will give you a good idea of what to expect. If you have a better answer or want to suggest an edit, please let us know in the comment box. You can also check our other Hadoop interview questions and answers using the links below:
- Hadoop Scenario Based Interview Questions
- Top Hadoop Interview Questions
- Tableau Interview Questions and Answers
- Capgemini Hadoop Interview Questions and Answers
- PIG Interview Questions and Answers
9 CenturyLink Hadoop admin interview Questions and Answers
#1 Please share your self-rating out of 10 for the below Hadoop Admin skills.
- Security – Kerberos
- Monitoring tools – Nagios, Ganglia, Bright Computing, Cloudera Manager, Hortonworks Ambari, Hue, etc.
- Automation tools – Chef, Jenkins or any other DevOps tools
#2 I have one production Hadoop cluster and want to build a new Hadoop development cluster. Now, I want to move the data from the prod server to the dev server. Please explain how to copy the data from prod HDFS location to Dev HDFS location.
The best way to migrate data from one Hadoop cluster to another is DistCp (distributed copy). You can use the distcp command to copy data from an HDFS location in cluster A1 to an HDFS location in cluster A2.
DistCp is a general-purpose utility for copying large amounts of data between filesystems, within a cluster or across clusters. The distcp command submits a regular MapReduce job that copies the data file by file.
Other Hadoop shell commands such as put, cp, copyFromLocal, and get are not recommended for large data sets, because they copy through a single client and can run into I/O bottlenecks.
For example, to copy a file from NN1 to NN2, use the command below.
$ hadoop distcp hdfs://NN1:8020/file1 hdfs://NN2:8020/file2
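For whole directories it can help to wrap the command in a small function. The following is a minimal sketch, not a definitive script: the function name copy_prod_to_dev, the two variables, and the default of 20 map tasks are assumptions made for illustration. The -m option caps how many map tasks DistCp runs at once.

```shell
#!/bin/sh
# Hypothetical NameNode addresses; substitute your own clusters' endpoints.
PROD_NN="hdfs://NN1:8020"
DEV_NN="hdfs://NN2:8020"

# Copy one HDFS path from prod to the same path on dev.
#   $1 = absolute HDFS path
#   $2 = maximum number of map tasks (assumed default: 20)
copy_prod_to_dev() {
    path="$1"
    maps="${2:-20}"
    # -m caps the number of simultaneous copies (map tasks).
    hadoop distcp -m "$maps" "${PROD_NN}${path}" "${DEV_NN}${path}"
}

# Example call, commented out so that sourcing this file copies nothing:
# copy_prod_to_dev /data/sales 10
```

A lower map count puts less load on the production cluster while the copy runs, at the cost of a longer copy time.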
#3 In the above question, earlier I moved around 30% of data and now want to move the remaining 70% data without replacing the earlier data. So, write the command to copy the data from one cluster to another cluster without data overwriting.
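Here the -update option can be used with distcp. It copies only the files that are missing from the target or that differ from the source, so the data copied earlier is left in place. The syntax is as follows:
$ hadoop distcp -update hdfs://NN1:9820/source/first hdfs://NN1:9820/source/second hdfs://nn2:9820/target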
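After resuming a copy it is worth checking that nothing was missed. The sketch below is one possible way to do that, assuming a simple file-count comparison is enough for your case; the function names resume_copy, file_count, and same_file_count are hypothetical. It relies on the output of hdfs dfs -count, whose second column is the number of files.

```shell
#!/bin/sh
# Resume a partly finished copy.
#   $1 = target URI, $2... = one or more source URIs
resume_copy() {
    target="$1"
    shift
    # -update skips files already on the target that match the source.
    hadoop distcp -update "$@" "$target"
}

# Print the number of files under an HDFS path.
# "hdfs dfs -count" prints: DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
file_count() {
    hdfs dfs -count "$1" | awk '{ print $2 }'
}

# Succeed only if the two paths hold the same number of files.
same_file_count() {
    [ "$(file_count "$1")" = "$(file_count "$2")" ]
}

# Example calls, commented out so that sourcing this file copies nothing:
# resume_copy hdfs://nn2:9820/target hdfs://NN1:9820/source/first
# same_file_count hdfs://NN1:9820/source/first hdfs://nn2:9820/target
```

The count comparison is only meaningful when a single source directory is copied into the target; with several sources, -update merges their contents into one target directory.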
#4 In Hive, I have a query with around 4 joins. I ran the query and it reached 70% in 4-5 milliseconds, but after that it got stuck. I waited for around 5 minutes and, when nothing happened, I aborted it. Now tell me how you would triage the issue in the following cases-
- When you don’t have access to create or alter anything
- When you can do basic operations
This is mainly a Hive performance tuning problem. A Hive query will often run fine for a while and then stall, typically at a high completion percentage such as 99%.
Hive automatically optimizes joins and loads one side of the join into memory if it meets the size requirements. However, in some cases, these jobs get stuck at 99% and never finish.
To work around a Hive query that hangs at 99%, you can change the Hive configuration as below:
set hive.exec.parallel=true;
set mapred.compress.map.output=true;
set mapred.output.compress=true;
set hive.exec.compress.output=true;
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
For the first case, you can create a temp table or view and get your work done. For the second case, you can bucket the tables on the join keys, because joins on bucketed tables improve Hive query performance.
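Before changing anything, it usually helps to see where the query is stuck. The sketch below shows one possible triage sequence, not the only one. The function names are hypothetical, and disabling hive.auto.convert.join is an assumption about the likely cause (the in-memory join described above), not a guaranteed fix.

```shell
#!/bin/sh
# Case 1, read-only triage: nothing is created or altered.

# Print the execution plan of a query without running it.
#   $1 = query text, without the trailing semicolon
show_plan() {
    hive -e "EXPLAIN $1"
}

# List the YARN applications that are currently running, to find the stuck job.
list_running_jobs() {
    yarn application -list -appStates RUNNING
}

# Fetch the logs of one application (requires YARN log aggregation).
#   $1 = application ID as printed by list_running_jobs
job_logs() {
    yarn logs -applicationId "$1"
}

# Case 2, basic operations allowed: rerun the query with automatic map joins
# disabled for this one invocation only.
#   $1 = path to a file holding the query
rerun_without_mapjoin() {
    hive --hiveconf hive.auto.convert.join=false -f "$1"
}
```

If the rerun finishes, the stall was probably caused by the automatic join conversion, and the setting can then be applied per session rather than cluster-wide.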
#5 I have a file of name “file.txt” and the size is 1GB. Due to some reason, I don’t want to split it into more than one block. That means I want to keep the block size as 1GB. How to change the block size for a local file.
In this case, you have to set the HDFS block size for the specific file, file.txt, when you write it to HDFS. To do this, run the command below, passing the block size through the dfs.blocksize property. The value is given in bytes (1 GB = 1073741824 bytes); suffixes such as 512m or 1g are also accepted.
hadoop fs -D dfs.blocksize=1073741824 -put file.txt remote_location
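The arithmetic and the check afterwards can be sketched as follows. This is a minimal example that assumes the dfs.blocksize property, which takes a value in bytes; the function names are hypothetical. The %o format of hdfs dfs -stat prints the block size recorded for a file.

```shell
#!/bin/sh
# Convert whole gigabytes to bytes (1 GB = 1024 * 1024 * 1024 bytes).
gb_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

# Upload a local file with a per-file block size.
#   $1 = local file, $2 = HDFS destination, $3 = block size in GB
put_with_block_size() {
    hadoop fs -D dfs.blocksize="$(gb_to_bytes "$3")" -put "$1" "$2"
}

# Print the block size, in bytes, that HDFS recorded for a file.
block_size_of() {
    hdfs dfs -stat %o "$1"
}

# Example calls, commented out so that sourcing this file uploads nothing:
# put_with_block_size file.txt /user/hdfs/ 1
# block_size_of /user/hdfs/file.txt
```

The block size applies only to the file being written; files that already exist in HDFS keep the block size they were written with.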
#6 I have a sensitive file which is encrypted and want this file to be stored at more than 3 datanodes. How to change the replication factor for a specific file in HDFS?
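If you want to change the default replication factor for the entire cluster, you can set the dfs.replication property in hdfs-site.xml.
But as we have to change the HDFS replication factor for a specific file, and the question asks for more than 3 copies, you can run the command below (here with a replication factor of 5). The -w flag makes the command wait until replication completes.
hdfs dfs -setrep -w 5 /user/hdfs/file.txt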
You can also change the replication factor of a directory using the command:
hdfs dfs -setrep -R 2 /user/hdfs/test
Changing the replication factor of a directory changes it only for the files that already exist under it. New files written later still get the default replication factor.
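Because new files fall back to the default, the replication factor can also be set at write time. The sketch below shows both routes plus a check; the function names are hypothetical and the example values are illustrative only. The %r format of hdfs dfs -stat prints the replication factor of a file.

```shell
#!/bin/sh
# Change the replication factor of one existing file and wait until done.
#   $1 = HDFS file, $2 = desired replication factor
set_file_replication() {
    hdfs dfs -setrep -w "$2" "$1"
}

# Print the replication factor HDFS currently records for a file.
replication_of() {
    hdfs dfs -stat %r "$1"
}

# Write a new file with a non-default replication factor from the start.
#   $1 = local file, $2 = HDFS destination, $3 = replication factor
put_with_replication() {
    hdfs dfs -D dfs.replication="$3" -put "$1" "$2"
}

# Example calls, commented out so that sourcing this file changes nothing:
# set_file_replication /user/hdfs/file.txt 5
# replication_of /user/hdfs/file.txt
```

A replication factor higher than the number of live datanodes cannot be satisfied, so the -w wait may not return on a small cluster.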
#7 I am working on a Hive edge node and want to connect to HiveServer2. How can you implement Kerberos authentication here? Write the steps.
You can check this guide to connect and configure the Kerberos authentication- link
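On the client side, the usual pattern is to obtain a Kerberos ticket with kinit and then pass the HiveServer2 service principal in the JDBC URL given to Beeline. The following is a rough sketch under stated assumptions: the service principal is assumed to have the form hive/<host>@<REALM>, the port is assumed to be 10000, and the function names and example hosts are hypothetical.

```shell
#!/bin/sh
# Build a HiveServer2 JDBC URL for a Kerberos-secured cluster.
#   $1 = HiveServer2 host (FQDN), $2 = Kerberos realm, $3 = port (default 10000)
# Assumes the service principal has the form hive/<host>@<REALM>.
hs2_jdbc_url() {
    printf 'jdbc:hive2://%s:%s/default;principal=hive/%s@%s\n' \
        "$1" "${3:-10000}" "$1" "$2"
}

# Obtain a ticket for the user, then open a Beeline session.
#   $1 = user principal, $2 = HiveServer2 host, $3 = realm
connect_hs2() {
    kinit "$1" && beeline -u "$(hs2_jdbc_url "$2" "$3")"
}

# Example call, commented out so that sourcing this file connects nowhere:
# connect_hs2 alice@EXAMPLE.COM hs2.example.com EXAMPLE.COM
```

Note that the principal in the URL is the HiveServer2 service principal, not the principal of the user who ran kinit.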
#8 Which Hadoop testing tools have you used?
Here are some of the best Hadoop testing tools being used in the Big Data Hadoop environment-
- MRUnit – MRUnit is a Java framework that helps developers unit test Hadoop MapReduce jobs
- Mockito – Mockito is a general-purpose Java mocking framework that can be used alongside MRUnit to unit test Hadoop MapReduce jobs
- PigUnit – PigUnit is a Java framework that helps developers unit test Pig Scripts
- HiveRunner – HiveRunner is an Open Source unit test framework for Hadoop Hive queries based on JUnit4
- Beetest – Beetest is a Unit Testing Framework for Hive Queries
- Hive_Test – Hive_Test is another Open source unit testing framework for Hive
- HBaseTestingUtility – HBaseTestingUtility is a Java API for running an HBase mini-cluster; it can be used along with the JUnit, Mockito, and MRUnit frameworks to unit test HBase applications
- QuerySurge – QuerySurge is a test tool built to automate data warehouse testing and the ETL testing process. It supports JDBC-compliant databases, data warehouses, data marts, flat files, XML, and Hadoop.
As of now, there are no automation tools or frameworks available for unit testing Flume, Sqoop, and Oozie. Such tools may become available for Sqoop and Oozie, but for now these need to be tested manually.
#9 Explain the Hadoop Security Tools
Security in Hadoop is implemented in three basic areas: authentication, authorization, and encryption.
- Authentication: Ensures that only genuine users and services access the cluster. Hadoop security tools currently used for authentication include MIT Kerberos, AD, and OpenLDAP.
- Authorization: Controls what users and applications can do with the data. The Hadoop security tool used for authorization is Apache Sentry.
- Encryption: Protects data from unauthorized access. The Hadoop security tool used for encryption at rest is Navigator Encrypt; encryption in transit can be implemented by enabling TLS/SSL.
These Hadoop security tools can vary depending on the Hadoop distribution you use. But as you can see, most of them are open source, so you can use them with any of the top Hadoop distributions.
These were the CenturyLink Hadoop admin interview questions and answers. We hope they help you prepare for your next Hadoop admin interview.
If you have attended a Hadoop interview recently and want to share your experience with us, write to us using the following form or email ID.