Hadoop Distributed File System (HDFS) is the distributed file system of a Hadoop framework. HDFS is highly fault-tolerant. It provides high-throughput data access and can process terabytes or even petabytes of data. This minimizes data costs. You can use HDFS in scenarios in which large-scale data needs to be read and written in distributed mode. HDFS is especially suitable for scenarios in which the number of read operations is greater than the number of write operations.
Benefits
HDFS in an E-MapReduce (EMR) cluster has the following benefits:
Is highly fault-tolerant and scalable.
Provides a shell command interface.
Provides a web UI that allows you to manage components in a visualized manner.
Allows you to manage file permissions in a similar manner in which you manage file permissions in Linux.
Is locality-aware. Storage space is allocated based on the location of compute nodes.
Balances data among DataNodes.
Supports rolling restart and upgrade operations.
Architecture
HDFS is built based on a primary/secondary architecture. HDFS in an EMR cluster consists of one NameNode and multiple DataNodes.
The NameNode is deployed on the master node of the EMR cluster and is used to manage metadata information of files and interact with clients. A DataNode is deployed on each core node of the EMR cluster and is used to manage files stored on the node. Each file that is uploaded to HDFS is split into one or more data blocks. These data blocks are distributed to different DataNodes based on the data backup policy of the cluster. The location information of data blocks is managed by the NameNode in a centralized manner.
Terms
Item | Description |
NameNode | The NameNode manages the file system namespaces, maintains the directory tree and metadata of all files in the tree, and records the mapping between each data block and the file to which the data block belongs. All the information is persisted to the FsImage and EditLog files on local disks. |
DataNode | DataNodes store files. DataNodes store or provide data blocks based on the instructions of the NameNode or clients, and report information about the data blocks to the NameNode on a regular basis. |
client | You can use a client to access the file system and communicate with the NameNode and DataNodes. A client serves as a file system interface that is similar to POSIX. |
block | In HDFS, files are evenly split into data blocks. Each data block is 128 MB in size and the data blocks may be stored on different DataNodes. HDFS can store a single file whose size may even exceed the capacity of a disk. By default, three replicas are stored for each block. If the core nodes of an EMR cluster use cloud disks, two replicas are stored for each block. The replicas are stored on different DataNodes. This way, data security is improved. In addition, distributed jobs can use local data for computing, which reduces data transfers. |
Secondary NameNode | For a non-HA cluster, a Secondary NameNode process is automatically started. The Secondary NameNode process aims to consume EditLog files and merge FsImage and EditLog files on a regular basis to generate new FsImage files. This reduces the load on the NameNode. |
high availability | For a high availability (HA) cluster, two NameNodes are automatically started. The active NameNode and the standby NameNode assume different roles. The active NameNode is used to handle requests from DataNodes and clients. The standby NameNode serves as the backup of the active NameNode and stores the same data as the active NameNode. If an exception occurs in the active NameNode, the standby NameNode is used to handle requests from DataNodes and clients. |