HDFS File Processing is the 6th chapter of the HDFS Tutorial series and one of the most important topics to focus on. By now we know how blocks are replicated and kept on DataNodes.
In this chapter, I will walk you through how file processing is done and how HDFS works end to end.
So we have a client with a file of 200 MB (Hadoop is really meant for much larger files, but for easier understanding I am taking 200 MB…you can imagine it as a 2 TB file).
Let's call this 200 MB file abc.txt. Now I will explain the complete working of HDFS based on this file.
Step 1: Split the files into blocks
Considering the default block size of 64 MB, abc.txt will be divided into the following blocks-
200 MB / 64 MB = 3.125
So we will have 4 blocks: the first three of 64 MB each and the last one of 8 MB (200 - 3 × 64 = 8). This splitting work is done by the NameNode, not by the client.
The client just needs to tell the NameNode that it has a file of 200 MB.
For easy understanding, consider the four blocks as files named-
• a.txt – 64 MB
• b.txt – 64 MB
• c.txt – 64 MB
• d.txt – 8 MB: although this block holds only 8 MB, it still counts as the file's last block under the 64 MB block size; the unused space of that block can be used by other files, so nothing is wasted.
These are the 4 blocks we need to store.
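If you want to verify this block math yourself, here is a minimal sketch in plain Java (not Hadoop code) of the same calculation, assuming the 64 MB default block size used in this chapter.

// Plain-Java sketch of the block-split arithmetic above (illustrative only).
public class BlockSplitMath {
    public static void main(String[] args) {
        long fileSize  = 200L * 1024 * 1024;   // our 200 MB file, abc.txt
        long blockSize = 64L * 1024 * 1024;    // default HDFS block size assumed here

        long fullBlocks  = fileSize / blockSize;                   // 3 full blocks of 64 MB
        long lastBlockMb = (fileSize % blockSize) / (1024 * 1024); // 8 MB remainder
        long totalBlocks = fullBlocks + (lastBlockMb > 0 ? 1 : 0);

        System.out.println("Total blocks: " + totalBlocks);            // 4
        System.out.println("Last block size: " + lastBlockMb + " MB");  // 8 MB
    }
}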
Step 2: Consider Replication
So, as we have a file of 200 MB and a replication factor (RF) of 3, the total storage needed will be 200 MB × 3 = 600 MB. This is the size we have to plan for.
Step 3: Place the files
Now we have to place the blocks on DataNodes, but does the client know which DataNodes have free space?
No. So the client will contact the NN and ask-
Hey NN, I have a 200 MB file which I want to keep on the DataNodes. Please tell me the DataNodes where I should keep it.
The NN will take the metadata from the client (file name, size, etc.) and in response will provide the DataNode information to the client.
Now the client knows where to keep the files, so it will coordinate directly with those DataNodes and place the blocks.
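For reference, here is a hedged sketch of this step using the standard HDFS Java API. The NameNode URI and the paths are placeholders I am assuming for illustration; the FileSystem client asks the NN where to place the blocks and then streams the data to the chosen DataNodes for us.

// Sketch: copying the 200 MB file into HDFS through the FileSystem API.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PutFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Connect to the NameNode (URI is an assumption for this example).
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        // Copy the local abc.txt into HDFS; the data is streamed block by block
        // to the DataNodes that the NameNode picked for us.
        fs.copyFromLocalFile(new Path("/local/abc.txt"), new Path("/user/demo/abc.txt"));
        fs.close();
    }
}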
I have drawn a sample architecture of it like below-
Now the client has placed the files on DN 1, 3, 5 and 7.
Step 4: Consider Replication blocks
Now DN 1 has the block a.txt, and since the RF is 3, it will pass a copy on to another available DN. Similarly, the other DNs will do the same for their blocks, depending on the replication factor.
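If you want to see how the replication factor is controlled from the client side, here is a hedged sketch using the public FileSystem API; dfs.replication is the standard property, while the cluster URI and the file path are assumptions for illustration.

// Sketch: setting and checking the replication factor of a file.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3");          // default RF for files created by this client
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        Path file = new Path("/user/demo/abc.txt");
        fs.setReplication(file, (short) 3);        // change the RF of an existing file

        FileStatus status = fs.getFileStatus(file);
        System.out.println("Replication factor: " + status.getReplication());
    }
}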
Step 5: What about ACK
Now, how will the client come to know whether the data has been stored on the DNs, and if yes, which DNs hold the replicas? The DNs will send a proper acknowledgment (ACK) back to the client with this information.
Step 6: How will NN come to know?
Well, the communication between the client and the DNs is done. The files are stored and replicated, and the client knows where they are, but what about the NN?
The DNs have to report to the NN that they are alive and working. So every DN sends a heartbeat and a block report to the NN at short, regular intervals.
The heartbeat conveys that the particular DN is alive and working, and the block report describes the blocks that DN holds. The NN stores this information in its metadata.
The heartbeat is sent every few seconds, while the block report is sent less frequently. If some DN has not sent a heartbeat for some time, the NN will consider it dead or corrupted and will remove that DN from its metadata.
To make up for the faulty DN, the NN will assign that data to some other DN and add its info to the metadata.
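The intervals behind these heartbeats and block reports are configurable. The sketch below simply reads them from the cluster configuration; the property names (dfs.heartbeat.interval, dfs.blockreport.intervalMsec) and the defaults shown are typical but can vary a little between Hadoop versions, so treat this as illustrative.

// Sketch: reading the heartbeat and block report intervals from the configuration.
import org.apache.hadoop.conf.Configuration;

public class HeartbeatConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // How often each DataNode heartbeats to the NameNode (in seconds).
        long heartbeatSec = conf.getLong("dfs.heartbeat.interval", 3);
        // How often each DataNode sends a full block report (in milliseconds).
        long blockReportMs = conf.getLong("dfs.blockreport.intervalMsec", 21600000L);

        System.out.println("Heartbeat every " + heartbeatSec + " s");
        System.out.println("Block report every " + (blockReportMs / 1000 / 60) + " min");
    }
}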
Step 7: Process Data
Now that the storage is done, the data needs to be processed. Suppose we are writing the processing program in MapReduce.
Say we have written code that comes to a 10 KB file (eg. program.txt). This code should now run on the DataNodes so that we can get the desired output.
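For concreteness, here is a minimal sketch of what the driver part of such a program might look like. The class names, input path and output path are purely illustrative, and the DemoMapper and DemoReducer classes it references are sketched later in this chapter.

// Sketch: a minimal MapReduce driver (illustrative names and paths).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DemoJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "demo job");
        job.setJarByClass(DemoJob.class);
        job.setMapperClass(DemoMapper.class);     // sketched below in this step
        job.setReducerClass(DemoReducer.class);   // sketched in Step 10
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/user/demo/abc.txt"));
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}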
How will this happen?
Here the JobTracker (JT) comes into the picture. The JT will take that program.txt and send it to the DataNodes. But does the JT know which DNs hold the blocks?
No, so to get that information the JT will connect with the NN and ask for the details. The NN will provide the details of the DNs from its metadata.
Now the JT will connect with the nearby DNs and ask the TaskTracker (TT) available on each DN to run the code on its respective DN.
The TT will then run the code on the block available on its DN; this process is the Map phase. So the number of mappers will be the same as the number of input splits.
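As a concrete example, here is a minimal word-count style Mapper. It is only a sketch of the kind of code the JT ships to the TTs, not the actual program.txt from this example; one map task of this kind runs per input split.

// Sketch: a word-count style Mapper that each TT runs on its local block.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class DemoMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Each call gets one line from the block stored on this DataNode.
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // emit (word, 1) for the reducer
            }
        }
    }
}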
Step 8: What about the fault tolerance between JT & TT
Now suppose the TT on DN 1 is performing the task assigned by the JT for the a.txt block and the node suddenly becomes faulty. In this case, the JT comes to know that the task has failed, assigns the same task to a nearby DN that holds a replica of that block, and the TT there starts working on its local copy of the data.
Step 9: How will JT know that TT is working?
There can also be a case where the TT itself dies, apart from the DN. So to keep the JT updated about the TT, the TT sends a heartbeat to the JT every 3 seconds to say that it is alive and working properly.
If the JT does not receive the heartbeat for 10 consecutive heartbeats (30 seconds), it will assume that the TT is either dead or working very slowly.
In this case, the JT will assign the job to some other TT on a DN where the same data exists.
But what if the JT itself fails? In that case, all the TTs are affected, which is why the JT is known as a single point of failure and should run on high-end hardware rather than commodity hardware.
Step 10: What about Results?
By now almost everything has been done: we have stored the data, processed it and taken care of faults, but what about the output?
Suppose the DN holding a.txt generates 4 KB of output; similarly, the DNs holding b.txt and c.txt also generate 4 KB each, and the DN holding d.txt generates 1 KB, since its block is smaller.
But is it the output?
No, we have to combine those individual outputs, and here the Reducer comes into the picture. The Reducer will combine the map outputs and give the final result.
The number of output files will match the number of reducers, as each reducer writes one output file.
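Continuing the same illustrative word-count example from Step 7, here is a matching Reducer sketch; it adds up the values emitted by the mappers and writes the combined result to its output file.

// Sketch: a Reducer that combines the per-DataNode map outputs.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class DemoReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();                  // add up the 1s emitted by the mappers
        }
        context.write(key, new IntWritable(sum));   // one line per word in the final output
    }
}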
OK…we got the output, but do the client and the NN know about it? No.
Suppose the reducer ran on DN 5. The next time DN 5 sends its block report to the NN, the report will include the output, and once the NN sees the output it will record it in its metadata.
At the same time, the client keeps checking the progress of the job, and as soon as it reaches 100% the client will contact the NN, and the NN will tell it that the output is with DN 5.
Now the client will connect with DN 5 and fetch its output.
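To close the loop, here is a hedged sketch of the client reading the reducer's output directly from HDFS; the output path and the cluster URI are assumptions for illustration.

// Sketch: the client reading the final output file from HDFS.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadOutput {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), new Configuration());
        Path result = new Path("/user/demo/output/part-r-00000");  // one file per reducer
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(result)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);   // print the combined result
            }
        }
        fs.close();
    }
}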
This is all about MapReduce and the operation of HDFS. I hope you read this chapter on HDFS File Processing carefully, as it describes the complete working of HDFS: how files are actually stored and processed.
Previous Chapter: HDFS Architecture | Next Chapter (Chapter 7): Input Formats in Hadoop