Nitendra Gautam

Introduction to Hadoop HDFS

Hadoop Distributed File System (HDFS)

HDFS, the Hadoop Distributed File System, is a distributed file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.

Its design is based on the Google File System (GFS), which Google created and implemented first. HDFS is designed to handle large amounts of data while reducing the overall input/output load on the network. It also increases the scalability and availability of the cluster through data replication and fault tolerance.

When input files are ingested into the Hadoop framework, they are divided into blocks of 64 MB or 128 MB and distributed among the nodes of the Hadoop cluster. The block size can be pre-defined in the cluster configuration files or passed as a custom parameter when submitting a MapReduce job. This storage strategy lets the Hadoop framework store files larger than the disk capacity of any single node and enables HDFS to hold data at terabyte to petabyte scale.
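
A minimal sketch of how a block size can be chosen from client code, assuming Hadoop's Java client is on the classpath; the path, sizes, and class name below are illustrative only. The cluster-wide default normally comes from dfs.blocksize in hdfs-site.xml, and a job typically overrides it with a -D dfs.blocksize=... argument at submission time.

    // Illustrative sketch: overriding the HDFS block size from a client
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSizeExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Override the client-side default block size (normally set in hdfs-site.xml)
            conf.setLong("dfs.blocksize", 128L * 1024 * 1024); // 128 MB

            FileSystem fs = FileSystem.get(conf);
            short replication = 3;
            long blockSize = 128L * 1024 * 1024;
            // The block size can also be passed per file at create time
            try (FSDataOutputStream out = fs.create(
                    new Path("/data/input/sample.txt"), true, 4096, replication, blockSize)) {
                out.writeBytes("example record\n");
            }
            fs.close();
        }
    }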

Figure: HDFS Architecture

HDFS is based on a master/slave architecture and has two different daemons:

  • Namenode
  • DataNodes

Namenode

The Namenode is the node that stores the file system metadata, i.e., which file maps to which block locations and which blocks are stored on which datanode. The namenode maintains two in-memory tables: one that maps blocks to datanodes (one block maps to three datanodes for a replication factor of 3) and one that maps each datanode to the blocks it holds. Whenever a datanode reports a disk corruption of a particular block, the first table gets updated, and whenever a datanode is detected to be dead (because of a node or network failure), both tables get updated.
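
The two tables can be pictured as simple maps. The sketch below is purely illustrative (it is not NameNode source code) and just models the bookkeeping described above with plain Java collections.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;

    class NameNodeMetadataSketch {
        // Table 1: block id -> datanodes currently holding a replica
        private final Map<Long, Set<String>> blockToDataNodes = new HashMap<>();
        // Table 2: datanode -> block ids it has reported
        private final Map<String, Set<Long>> dataNodeToBlocks = new HashMap<>();

        // A corrupt-replica report updates only the first table
        void reportCorruptReplica(long blockId, String dataNode) {
            Set<String> holders = blockToDataNodes.get(blockId);
            if (holders != null) {
                holders.remove(dataNode);
            }
        }

        // A dead datanode updates both tables
        void removeDeadDataNode(String dataNode) {
            Set<Long> blocks = dataNodeToBlocks.remove(dataNode);
            if (blocks == null) {
                return;
            }
            for (long blockId : blocks) {
                Set<String> holders = blockToDataNodes.get(blockId);
                if (holders != null) {
                    holders.remove(dataNode);
                }
            }
        }
    }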

DataNode

The DataNode is the node where the actual data resides and is responsible for serving read and write requests from clients. When a file is stored in HDFS, it is broken into storage blocks on the datanodes, and these blocks are replicated across different nodes.

All datanodes send a heartbeat message to the namenode every 3 seconds to indicate that they are alive. If the namenode does not receive a heartbeat from a particular datanode for 10 minutes, it considers that datanode dead/out of service and initiates replication of the blocks that were hosted on that datanode onto other datanodes.
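
The timeout rule can be expressed as a small check. The sketch below is illustrative only and simply encodes the intervals mentioned above (3-second heartbeats, a roughly 10-minute dead-node timeout); it is not the actual namenode implementation.

    import java.util.HashMap;
    import java.util.Map;

    class HeartbeatMonitorSketch {
        static final long HEARTBEAT_INTERVAL_MS = 3_000L;        // datanodes report every 3 seconds
        static final long DEAD_NODE_TIMEOUT_MS  = 10L * 60_000L; // ~10 minutes without a heartbeat

        // datanode id -> timestamp of its last heartbeat
        private final Map<String, Long> lastHeartbeat = new HashMap<>();

        void onHeartbeat(String dataNodeId, long nowMs) {
            lastHeartbeat.put(dataNodeId, nowMs);
        }

        // A node silent for longer than the timeout is treated as dead, and the
        // namenode then schedules re-replication of its blocks on other nodes.
        boolean isDead(String dataNodeId, long nowMs) {
            Long last = lastHeartbeat.get(dataNodeId);
            return last == null || nowMs - last > DEAD_NODE_TIMEOUT_MS;
        }
    }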

The datanodes can talk to each other to rebalance data, move and copy data around, and keep the replication level high. When a datanode stores a block of information, it maintains a checksum for it as well. The datanodes update the namenode with the block information periodically and verify the checksums before doing so. If the checksum is incorrect for a particular block, i.e., there is disk-level corruption for that block, the datanode skips that block while reporting the block information to the namenode. In this way, the namenode becomes aware of the disk-level corruption on that datanode and takes steps accordingly.
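
The verify-before-report idea can be sketched as follows. Real datanodes keep per-chunk checksums in separate metadata files; the CRC32 helper below is only an illustration of checking a stored checksum and skipping a block whose data no longer matches it.

    import java.util.zip.CRC32;

    class BlockChecksumSketch {
        // Recompute the checksum over the block's bytes and compare to the stored value
        static boolean checksumMatches(byte[] blockData, long storedChecksum) {
            CRC32 crc = new CRC32();
            crc.update(blockData, 0, blockData.length);
            return crc.getValue() == storedChecksum;
        }

        // Only blocks whose checksum still matches are included in the block report,
        // so the namenode learns about disk-level corruption indirectly.
        static boolean includeInBlockReport(byte[] blockData, long storedChecksum) {
            return checksumMatches(blockData, storedChecksum);
        }
    }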

Rack Awareness in Hadoop

To maintain high availability and reduce data transfer across the network, HDFS is designed to be rack aware. HDFS is rack aware in the sense that the namenode and the job tracker obtain a list of rack ids corresponding to each of the slave nodes (datanodes) and create a mapping between IP addresses and rack ids. HDFS uses this knowledge to replicate data across different racks so that data is not lost in the event of a complete rack power outage or switch failure.
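
The rack mapping and the placement constraint can be sketched as below. This is illustrative only: node addresses and rack ids are hypothetical, and the real placement policy lives inside HDFS itself.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Objects;

    class RackAwarenessSketch {
        // Mapping built by the namenode: datanode IP/hostname -> rack id
        private final Map<String, String> nodeToRack = new HashMap<>();

        void addNode(String nodeAddress, String rackId) {
            nodeToRack.put(nodeAddress, rackId);
        }

        boolean onSameRack(String nodeA, String nodeB) {
            return Objects.equals(nodeToRack.get(nodeA), nodeToRack.get(nodeB));
        }

        // Placing a replica on a different rack than the writer ensures that a
        // whole-rack power or switch failure cannot destroy every copy of a block.
        boolean isValidOffRackReplica(String writerNode, String candidateNode) {
            return !onSameRack(writerNode, candidateNode);
        }
    }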

Handling failures by Name Node

The datanodes constantly communicate with the namenode: each datanode sends a heartbeat message to the namenode periodically.

If the signal is not received by the Namenode as expected, the Namenode considers that Datanode to have failed and does not send any new requests to the dead datanode. If the replication factor is greater than 1, the blocks lost from the dead datanode can be recovered from other datanodes where a replica is available, thus providing data availability and fault tolerance.

Fault Tolerance in Hadoop

Fault tolerance in HDFS refers to the working strength of a system under unfavourable conditions and how the system handles such situations. HDFS is highly fault tolerant because it handles faults through replica creation: replicas of user data are created on different machines in the HDFS cluster. So whenever any machine in the cluster goes down, the data can still be accessed from another machine that holds a copy of the same data.
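
As a concrete illustration, the replication factor can be set cluster-wide or per file through the standard Hadoop client API. The sketch below assumes an HDFS client on the classpath; the path and values are illustrative.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // The cluster default usually comes from dfs.replication in hdfs-site.xml
            conf.setInt("dfs.replication", 3);

            FileSystem fs = FileSystem.get(conf);
            // The replication factor of an existing file can also be changed later,
            // which tells HDFS to create (or remove) copies on other machines
            boolean requested = fs.setReplication(new Path("/data/input/sample.txt"), (short) 3);
            System.out.println("Replication change requested: " + requested);
            fs.close();
        }
    }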

Input Split in Hadoop

Input splits are a logical division of your records, whereas HDFS blocks are a physical division of the input data. It is most efficient when the two coincide, but in practice the alignment is never perfect. Records may cross block boundaries, yet Hadoop guarantees that all records will be processed. A machine processing a particular split may fetch a fragment of a record from a block other than its "main" block, and that block may reside remotely. The communication cost of fetching such a record fragment is inconsequential because it happens relatively rarely.
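
For intuition, split sizing in the standard FileInputFormat follows a simple rule: splitSize = max(minSplitSize, min(maxSplitSize, blockSize)). The sketch below just restates that rule; with the default bounds it yields one split per block, which is why splits and blocks usually coincide.

    class SplitSizeSketch {
        // The sizing rule used when splits are computed from file blocks
        static long computeSplitSize(long blockSize, long minSplitSize, long maxSplitSize) {
            return Math.max(minSplitSize, Math.min(maxSplitSize, blockSize));
        }

        public static void main(String[] args) {
            long blockSize = 128L * 1024 * 1024; // 128 MB block
            long minSize = 1L;                   // default lower bound
            long maxSize = Long.MAX_VALUE;       // default upper bound
            // With the defaults, one split per 128 MB block
            System.out.println(computeSplitSize(blockSize, minSize, maxSize));
        }
    }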

In HDFS, data is split into blocks and distributed across multiple nodes in the cluster. Each block is typically 64 MB or 128 MB in size, and each block is replicated multiple times; the default is to replicate each block three times, with the replicas stored on different nodes. HDFS uses the local file system to store each HDFS block as a separate file, so the HDFS block size cannot be compared with the traditional file system block size.
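
A quick worked example with illustrative numbers: a 500 MB file stored with a 128 MB block size and a replication factor of 3.

    class BlockCountSketch {
        public static void main(String[] args) {
            long fileSizeMb = 500, blockSizeMb = 128;
            int replication = 3;
            long fullBlocks = fileSizeMb / blockSizeMb;                 // 3 full 128 MB blocks
            long remainderMb = fileSizeMb % blockSizeMb;                // final partial block of 116 MB
            long totalBlocks = fullBlocks + (remainderMb > 0 ? 1 : 0);  // 4 blocks in total
            long storedReplicas = totalBlocks * replication;            // 12 block files across the cluster
            System.out.println(totalBlocks + " blocks, " + storedReplicas + " stored replicas");
        }
    }

Note that the final partial block occupies only 116 MB on the local file system, which is one reason the HDFS block size should not be confused with the underlying file system's block size.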

Most common Input Formats in Hadoop HDFS

There are three commonly used input formats in Hadoop HDFS (a sketch of selecting one follows the list):

  • Text Input Format: the default input format in Hadoop; each line of the file is a record, with the byte offset as the key and the line contents as the value.
  • Key Value Input Format: used for plain text files where each line is broken into a key and a value by a separator (a tab by default).
  • Sequence File Input Format: used for reading Hadoop sequence files (binary key/value files).
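
A minimal sketch of selecting one of these formats when configuring a MapReduce job; the job name is illustrative, and the commented lines show the alternatives.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class InputFormatExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "input-format-demo");

            // Default: key = byte offset of the line, value = the line's text
            job.setInputFormatClass(TextInputFormat.class);

            // For plain text where each line is split into key and value by a tab:
            // job.setInputFormatClass(KeyValueTextInputFormat.class);

            // For binary key/value sequence files:
            // job.setInputFormatClass(SequenceFileInputFormat.class);
        }
    }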
