Important Hadoop Terminology is the 4th chapter in the HDFS Tutorial Series. In this section, I will talk about some of the important terminology of HDFS and of Hadoop in general. These terms are the building blocks you will use throughout Hadoop, so please try to REALLY UNDERSTAND them.
If you are clear on these basics, learning Hadoop will be fun; otherwise you will never enjoy it and will always be left wondering how things are happening.
So let’s start with basic Hadoop terminology one by one-
Here comes the role of the Mapper. The Mapper runs on each DataNode that holds a block of the input and executes the supplied code/operation against that local data to get the desired work done. In other words, the code is shipped to where the data actually exists instead of moving the data to the code.
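As a minimal sketch of what such code can look like (the class name and the word-count logic are just an illustration, not part of this tutorial's running example), here is a typical Mapper in Java that reads the lines of its local block and emits (word, 1) pairs:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// A word-count style Mapper: it runs close to the DataNode that stores the
// input block, reads that block line by line, and emits (word, 1) pairs
// for a Reducer to aggregate later.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // emit (word, 1)
            }
        }
    }
}
```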
Every TaskTracker is configured with a set of slots, which indicate the number of tasks it can accept. When the JobTracker tries to find somewhere to schedule a task within the MapReduce operations, it first looks for an empty slot on the same server that hosts the DataNode containing the data; if it finds none, it looks for an empty slot on a machine in the same rack.
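For reference, here is a small sketch (assuming the classic Hadoop 1 / MRv1 property names) of how the slot counts a TaskTracker advertises can be read back from its configuration; in practice these values are set cluster-side in mapred-site.xml rather than in job code:

```java
import org.apache.hadoop.conf.Configuration;

// Reads the MRv1 slot settings; the fallbacks shown (2 map, 2 reduce slots)
// are the stock Hadoop 1 defaults. These properties are normally defined in
// mapred-site.xml on each TaskTracker.
public class SlotInfo {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        int mapSlots = conf.getInt("mapred.tasktracker.map.tasks.maximum", 2);
        int reduceSlots = conf.getInt("mapred.tasktracker.reduce.tasks.maximum", 2);
        System.out.println("Map slots: " + mapSlots + ", reduce slots: " + reduceSlots);
    }
}
```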
8. Block- The block is the smallest unit into which files are split. By default, the block size is 64 MB in Hadoop 1 and 128 MB in Hadoop 2, and it can be increased as required.
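As a hedged sketch, the block size can also be chosen per file through the HDFS Java API (the path and sizes below are only examples; the cluster default comes from dfs.blocksize):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Creates a file with an explicit 256 MB block size instead of the cluster
// default. Path, replication, and buffer size are illustrative values.
public class CustomBlockSize {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        long blockSize = 256L * 1024 * 1024;   // 256 MB block size for this file
        short replication = 3;                 // typical default replication factor
        int bufferSize = 4096;

        FSDataOutputStream out = fs.create(
                new Path("/tmp/example.dat"), true, bufferSize, replication, blockSize);
        out.writeUTF("hello hdfs");
        out.close();
    }
}
```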
Each block has a specified minimum number of replicas. A block is considered safely replicated when the minimum number of replicas of that data block has checked in with the NameNode. After a configurable percentage of safely replicated data blocks checks in with the NameNode (plus an additional 30 seconds), the NameNode exits the Safemode state. It then determines the list of data blocks (if any) that still have fewer than the specified number of replicas. The NameNode then replicates these blocks to other DataNodes.
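A worked example, assuming the default HDFS settings (safemode threshold dfs.namenode.safemode.threshold-pct = 0.999 and a minimum replica count of 1): if the cluster holds 10,000 blocks, the NameNode stays in Safemode until at least 0.999 × 10,000 = 9,990 blocks have reported the minimum number of replicas, then waits the extra 30 seconds before leaving Safemode and re-replicating any blocks that are still under-replicated.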
NO… HDFS is highly fault tolerant, so whenever any DataNode becomes faulty, the NameNode notices the missed heartbeats and re-replicates that node's blocks from the surviving replicas onto other DataNodes, so the data keeps being served.
There is also a helper node called the Secondary NameNode. Despite the name, it is not a hot standby; it periodically merges the NameNode's metadata files (a process called checkpointing) so that the NameNode can recover and restart quickly.
That metadata is kept in the below two locations-
• Edit Log- records every change made to the file system namespace (file creations, deletions, renames) since the last checkpoint
• FsImage- a snapshot of the complete file system namespace, including the mapping of file names to blocks
A few things to remember here-
• The NameNode and DataNodes are nothing but ordinary computers (servers) running the respective daemons
• The number of blocks depends on the file size. All blocks are of the same size except the last one, which holds whatever bytes remain (see the worked example after this list)
• The number of input splits, and therefore the number of Mappers, matches the number of blocks by default
• Hadoop can run a job without a Reducer (a map-only job; see the sketch after this list) but not without a Mapper
• The number of output files equals the number of Reducers, since each Reducer writes one output file
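To make the block/Mapper arithmetic concrete: a 200 MB file stored with a 64 MB block size is split into four blocks of 64 MB + 64 MB + 64 MB + 8 MB, so with the default of one input split per block the job gets four Mappers. And here is a minimal sketch of a map-only job (the paths are illustrative, and WordCountMapper is the sample Mapper sketched earlier in this chapter); setting the number of Reducers to 0 makes the Mapper output the final output:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// A map-only job: with zero Reducers, the Mapper output is written
// straight to HDFS as the job's final output.
public class MapOnlyJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "map-only example");
        job.setJarByClass(MapOnlyJob.class);
        job.setMapperClass(WordCountMapper.class);
        job.setNumReduceTasks(0);               // no Reducer: map output is final
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/tmp/input"));
        FileOutputFormat.setOutputPath(job, new Path("/tmp/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```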
Now let me take you through a couple of interesting and VERY IMPORTANT chapters.
Previous Chapter: Why HDFS Needed? | Next: CHAPTER 5: HDFS Architecture