By Pei Xiaohui
Introduction: Network File System (NFS) is an important concept in storage systems. It serves as a basis for distributed file systems that are compatible with POSIX semantics. With NFS, you can share file systems among multiple hosts and use data sharing to reduce the storage space required. This article explains how NFS ensures consistency by analyzing how the client and the server keep a consistent view of file lock state.
File locking is one of the basic features of file systems. Applications use it to control concurrent access to a file by other applications. NFS is a standard network file system for UNIX and similar systems. Since its inception in the 1980s, three major versions of NFS have been released: NFSv2, NFSv3, and NFSv4. Native support for file locks arrived with NFSv4.
The biggest difference between NFSv4 and earlier versions is that NFSv4 is "stateful": some operations, such as file locks, require the server to maintain the corresponding state. For example, if a client requests a file lock, the server must record the state of that lock. Otherwise, the server cannot detect conflicting access from other clients. In NFSv3, you must use the separate Network Lock Manager (NLM) protocol to implement file locking, and this is prone to errors if the server and NLM are not well coordinated. Because NFSv4 is stateful, it implements file locking on its own, without the NLM protocol.
Applications can call fcntl() or flock() to manage NFS file locks. The following flowchart explains the call process by which Apsara File Storage NAS obtains file locks when an NFSv4 file system is mounted.
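From the application's point of view, the entry point is an ordinary POSIX call. The following is a minimal sketch of taking and releasing a whole-file advisory lock with fcntl(). The helper name set_whole_file_lock is hypothetical and is not part of NFS or of Apsara File Storage NAS. The same call works on a local file system and on an NFSv4 mount.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: apply (F_RDLCK / F_WRLCK) or remove (F_UNLCK) an
 * advisory lock that covers the whole file.
 * Returns 0 on success and -1 on failure with errno set, for example
 * EAGAIN or EACCES when another process holds a conflicting lock.
 */
static int set_whole_file_lock(int fd, short type)
{
    struct flock fl;

    assert(type == F_RDLCK || type == F_WRLCK || type == F_UNLCK);

    memset(&fl, 0, sizeof(fl));
    fl.l_type = type;
    fl.l_whence = SEEK_SET;  /* l_start is relative to the start of the file */
    fl.l_start = 0;
    fl.l_len = 0;            /* a length of 0 means "through end of file" */

    /* F_SETLK returns immediately; F_SETLKW would wait for the lock instead. */
    return fcntl(fd, F_SETLK, &fl);
}
```

A caller would open the file, call set_whole_file_lock(fd, F_WRLCK) before the critical section, and call set_whole_file_lock(fd, F_UNLCK) afterwards.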
As the call stack in the preceding flowchart shows, the implementation logic of NFS file locks largely reuses the design and data structure of the virtual file system (VFS) layer. After a client obtains a file lock from the server over the Remote Procedure Call (RPC) protocol, the client calls the locks_lock_inode_wait() function to assign the file lock to the VFS layer for management. We will not discuss the design of file locks in the VFS layer in this article.
Acquiring or releasing a file lock is a typical non-idempotent operation. Retries and failover of file lock operations can make the client and the server disagree about the state of a lock. NFSv4 uses the seqid variable to ensure that each file lock operation is performed at most once, as detailed in the following paragraphs:
For each open or lock state, the client and the server maintain the seqid variable independently at the same time. When the client initiates an operation that can cause a state change, such as open, close, lock, unlock, or release_lockowner, the seqid on the client increases by 1, and the client sends it to the server as a parameter. If the seqid sent by the client is R, and the seqid maintained by the server is L, then:
The server can identify whether the operation is a normal request, a retry request, or an invalid request based on the preceding rule.
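The comparison between R and L can be sketched in a few lines of C. This is an illustration under stated assumptions, not the server's actual code. It assumes the rule as we understand it from the NFSv4.0 specification: R equal to L + 1 is a normal request, R equal to L is a retry that is answered from the cached reply, and anything else is rejected (NFSv4 reports this as NFS4ERR_BAD_SEQID). The type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum seqid_verdict {
    SEQID_NEW_REQUEST,  /* R == L + 1: execute the operation */
    SEQID_RETRY,        /* R == L: replay, return the cached reply */
    SEQID_INVALID       /* anything else: reject the request */
};

/* Hypothetical per-owner state kept by the server. */
struct owner_state {
    uint32_t seqid;       /* L: the last seqid the server accepted */
    int      last_reply;  /* reply cached for that seqid */
};

/*
 * Handle a request that carries seqid r. new_reply stands in for the result
 * the operation would produce if it were executed now. *reply_out is written
 * only for new requests and retries.
 */
static enum seqid_verdict handle_request(struct owner_state *st, uint32_t r,
                                         int new_reply, int *reply_out)
{
    assert(st != NULL && reply_out != NULL);

    if (r == (uint32_t)(st->seqid + 1u)) {
        st->seqid = r;               /* the operation runs exactly once */
        st->last_reply = new_reply;
        *reply_out = new_reply;
        return SEQID_NEW_REQUEST;
    }
    if (r == st->seqid) {
        *reply_out = st->last_reply; /* do not execute the operation again */
        return SEQID_RETRY;
    }
    return SEQID_INVALID;
}
```

Because a retry returns the cached reply instead of running the operation again, a lock request that is retransmitted after a lost reply is still applied only once.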
With this method, each file lock operation is performed no more than once on the server, which avoids repeated operations caused by RPC retries. However, this method alone is not enough. Suppose the client has sent a lock operation and the calling thread is then interrupted by a signal, while the server accepts and performs the operation. The server records the lock as held by the client, but the interrupted client never records it. The client and the server now have inconsistent views of the file lock state. Therefore, the client must also handle exceptions to keep the two views consistent.
As the preceding section shows, the client must also handle exceptions to ensure consistent file lock state views. On the client, this is addressed in two places: the SunRPC implementation and the NFS protocol implementation. The following sections describe how each contributes to consistent file lock state views.
Developed by Sun Microsystems, Inc., SunRPC is a network communication protocol designed for invoking procedures on remote servers. Next, let's look at the design philosophy behind the SunRPC implementation from the perspective of ensuring consistent file lock state views.
1. The client uses an xid of the int32_t type to identify each RPC request initiated by an upper-layer caller. All retries of the same RPC request carry the same xid. Therefore, when the server replies to any one of those retries, the upper-layer caller learns that the request has completed. This ensures that the client can obtain the request result even if the server takes a long time to process the request. This differs from many other frameworks, including Netty, MINA, and BRPC, which require each transmitted request to have its own xid or packetid.
2. The server uses the duplicate request cache (DRC) to cache the results of recent RPC requests. When the server receives an RPC request, it first looks up the xid in the DRC. A hit indicates that the RPC request is a retry, and the server returns the cached result directly. This avoids processing the same request repeatedly because of RPC retries. To avoid unexpected results returned by DRC due to xid reuse, developers have implemented the following mechanism to reduce the probability of errors caused by reuse:
3. The client is allowed to retry without limit until it receives a response from the server. This ensures that the caller eventually obtains a definite result from the server. However, this policy can also leave the caller hanging while the server does not respond.
4. When you mount an NFS file system, you can specify the retry policy of SunRPC by using the soft or hard parameter. With a soft mount, the client stops retrying after a request times out and returns an error to the application. With a hard mount, the client keeps retrying. When you soft mount an NFS file system, the NFS implementation does not ensure consistent file lock state views on the client and the server. When a timeout is returned for an RPC request, the application itself must clean up and recover the state, for example, by closing the files with access errors. In practice, few applications do this. Therefore, the users of Apsara File Storage NAS generally hard mount the NFS file systems.
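The interaction between retries that reuse an xid and the server-side DRC can be sketched as follows. This is a toy model under stated assumptions: a fixed-size table indexed by xid modulo the table size, with hypothetical names throughout. It does not attempt to model the xid-reuse safeguards mentioned in item 2, and it is not how any particular NFS server lays out its cache.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DRC_SLOTS 64u

struct drc_entry {
    bool     used;
    uint32_t xid;
    int      result;
};

struct drc {
    struct drc_entry slots[DRC_SLOTS];
};

static void drc_init(struct drc *c)
{
    assert(c != NULL);
    for (size_t i = 0; i < DRC_SLOTS; i++)
        c->slots[i].used = false;
}

/* Returns true on a hit, which means the request is a retry. */
static bool drc_lookup(const struct drc *c, uint32_t xid, int *result_out)
{
    assert(c != NULL && result_out != NULL);
    const struct drc_entry *e = &c->slots[xid % DRC_SLOTS];
    if (e->used && e->xid == xid) {
        *result_out = e->result;
        return true;
    }
    return false;
}

/* Remember the result; an older entry in the same slot is overwritten. */
static void drc_store(struct drc *c, uint32_t xid, int result)
{
    assert(c != NULL);
    struct drc_entry *e = &c->slots[xid % DRC_SLOTS];
    e->used = true;
    e->xid = xid;
    e->result = result;
}

static int executions;  /* counts how often the operation really ran */

/* Stand-in for a non-idempotent operation such as granting a lock. */
static int execute_lock_operation(void)
{
    executions++;
    return 0;  /* 0 stands for "lock granted" in this sketch */
}

/* Replay the cached result on a hit; execute and cache on a miss. */
static int server_handle_rpc(struct drc *c, uint32_t xid)
{
    int result;
    if (drc_lookup(c, xid, &result))
        return result;
    result = execute_lock_operation();
    drc_store(c, xid, result);
    return result;
}
```

If the client sends xid 1001 twice because the first reply was slow, the second call is answered from the cache and execute_lock_operation() runs only once.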
One of the key issues that SunRPC needs to address is that the time required to process an RPC request is unpredictable. The design choices described above are intended to minimize the side effects of retrying RPC requests for non-idempotent operations.
While an application is waiting for the result of an RPC request, the waiting thread can be interrupted by a signal. When this happens, the client does not obtain the RPC result, so the file lock state views on the client and the server may become inconsistent. For example, the lock operation has been performed on the server, but the client is not informed. The client must then take additional action to restore consistency. The following example explains the consistency mechanism in the NFS protocol implementation by briefly analyzing what happens when the process of obtaining a file lock is interrupted by a signal.
In the process of obtaining an NFSv4 file lock, the client calls the _nfs4_do_setlk() function to initiate an RPC operation and calls nfs4_wait_for_completion_rpc_task() to wait for it to complete, as shown in the following sample code:
static int _nfs4_do_setlk(struct nfs4_state *state, int cmd, struct file_lock *fl, int recovery_type)
{
	......
	task = rpc_run_task(&task_setup_data);
	if (IS_ERR(task))
		return PTR_ERR(task);
	/* Wait for the RPC to finish; a signal can interrupt this wait. */
	ret = nfs4_wait_for_completion_rpc_task(task);
	if (ret == 0) {
		ret = data->rpc_status;
		if (ret)
			nfs4_handle_setlk_error(data->server, data->lsp,
					data->arg.new_lock_owner, ret);
	} else
		/* Interrupted: record it so that the release callback can clean up. */
		data->cancelled = 1;
	......
}
As the implementation of nfs4_wait_for_completion_rpc_task() shows, a ret value below 0 indicates that the process of obtaining the lock was interrupted by a signal. In that case, the interruption is recorded in the cancelled member of struct nfs4_lockdata. After the rpc_task is completed, the nfs4_lock_release() callback function is called to release the lock.
The code in the red box indicates that when nfs4_lock_release() has detected a signal interruption, the client calls the nfs4_do_unlck() function to unlock the file lock that may have been obtained. The nfs_free_seqid() function is not called to free the nfs_seqid held by the client due to the following reasons:
With the preceding method, the client can ensure that the file lock state views on the client and the server eventually become consistent after a signal interruption. However, this comes at some cost in availability.
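The recovery path can be illustrated with a toy model. The sketch below is not kernel code: the toy_* names are hypothetical, the "server" and "client" are plain structs, and sending an unlock is reduced to clearing a flag. It only shows the order of events described above, in which the wait is interrupted, the server still grants the lock, and the release callback undoes it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_server { bool lock_held; };  /* the server's view of the lock */
struct toy_client { bool lock_held; };  /* the client's view of the lock */

struct toy_lock_call {
    int  rpc_status;  /* 0 once the server has granted the lock */
    bool cancelled;   /* set when the waiting thread was interrupted */
};

/* Server side: the lock request arrives and is granted. */
static void toy_server_lock(struct toy_server *s, struct toy_lock_call *call)
{
    assert(s != NULL && call != NULL);
    s->lock_held = true;
    call->rpc_status = 0;
}

/* Waiting side: follows the "ret == 0" and "else" branches of the excerpt. */
static void toy_wait_done(struct toy_client *c, struct toy_lock_call *call,
                          bool interrupted)
{
    assert(c != NULL && call != NULL);
    if (interrupted)
        call->cancelled = true;     /* the result is never seen by the caller */
    else if (call->rpc_status == 0)
        c->lock_held = true;        /* normal path: the client records the lock */
}

/* Release callback: runs after the RPC task has completed. */
static void toy_lock_release(struct toy_server *s,
                             const struct toy_lock_call *call)
{
    assert(s != NULL && call != NULL);
    if (call->cancelled)
        s->lock_held = false;       /* stands in for sending an unlock */
}
```

In the interrupted case, the views differ right after the server grants the lock (the server says held, the client says not held) and agree again once the release callback has run. The cost is that a lock the server had already granted is given up.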
File locking is one of the basic features that file systems support natively. As a shared file system, Apsara File Storage NAS must keep the file lock state views on the client and the server consistent, and NFSv4.0 helps solve this problem. NFS will continue to evolve as technology moves forward, so we can expect more from future versions.
We believe in the power of technology and the people that have it. We look forward to seeing the future of storage and collaborating with you to create it.