This article describes a TCP connection whose socket stays in the FIN_WAIT_1 state. In this scenario, the TCP connection does not close properly because of the conntrack kernel parameter settings and the iptables rules. The article also illustrates how the conntrack code processes a new packet after a conntrack entry has timed out.
A process on the ECS instance establishes a socket connection to another server. However, after the process is killed, tcpdump captures no FIN packet. As a result, the connection on the server is not properly closed. The following sections analyze the reasons behind this problem.
Normally, after a process is killed, its sockets are closed (the equivalent of calling close()), and a TCP FIN is sent to the peer end. Therefore, the preceding symptom is abnormal. The key information in this regard is as follows:
1) The process is killed in user mode.
2) No FIN packet is captured from the network card of the ECS instance.
The above description implies that the problem lies in kernel mode, somewhere between user space and the network card driver. However, it is not yet clear whether the problem is caused by the system call or occurs after the FIN is constructed. A relatively simple and effective way to find out is to check the status of the socket. If the socket is in the FIN_WAIT_1 state, the system call works normally.
According to the TCP state machine, the socket enters the FIN_WAIT_1 state after sending the FIN and enters the FIN_WAIT_2 state after receiving the ACK packet from the peer end. In this case, the socket stays in the FIN_WAIT_1 state for a long time, which is consistent with no FIN packet being captured on the network card: the FIN packet never leaves the network card of the virtual machine, so the peer end does not receive it and never returns an ACK packet.
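One way to check the socket state programmatically is to read /proc/net/tcp, where the kernel lists each IPv4 TCP socket with a hexadecimal state code (04 is FIN_WAIT1). The following Python sketch parses that layout. The function names and the sample text are illustrative and were not captured from the original case, and the address decoding assumes a little-endian host.

```python
import socket
import struct

# Hex state codes used in /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def decode_addr(hex_addr):
    """Convert 'AABBCCDD:PPPP' (little-endian IPv4, hex port) to 'a.b.c.d:port'."""
    ip_hex, port_hex = hex_addr.split(":")
    ip = socket.inet_ntoa(struct.pack("<I", int(ip_hex, 16)))
    return f"{ip}:{int(port_hex, 16)}"


def sockets_in_state(proc_net_tcp_text, state="FIN_WAIT1"):
    """Return (local, remote) address pairs for sockets in the given state."""
    result = []
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        local, remote, st = fields[1], fields[2], fields[3]
        if TCP_STATES.get(st.upper()) == state:
            result.append((decode_addr(local), decode_addr(remote)))
    return result


# Hypothetical sample in the /proc/net/tcp layout.
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 0100007F:0016 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 1 1
   1: 0A00000A:C350 1400000A:1F98 04 00000001:00000000 01:00000014 00000003     0        0 0 1
"""

if __name__ == "__main__":
    for local, remote in sockets_in_state(SAMPLE):
        print(f"FIN_WAIT1: {local} -> {remote}")
```

On a live system, the same function can be fed the real file, for example sockets_in_state(open("/proc/net/tcp").read()).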
Based on the preceding analysis, and assuming that no major kernel bug is involved, the next step is to examine how iptables (Netfilter) and TC mechanisms affect the packets. It turns out that many iptables rules are configured on the ECS instance. Use iptables -nvL to print the match count of each rule, or add logging rules. The following snippet shows an example.
# Log packets in the NEW state
iptables -A INPUT -p tcp -m state --state NEW -j LOG --log-prefix "[iptables] INPUT NEW: "
In this case, the counters and logs indicate that the DROP rule at the end of the OUTPUT chain is matched, as shown below.
# iptables -A OUTPUT -m state --state INVALID -j DROP
This reveals the root cause of the problem. The iptables rule drops the FIN packet sent after the process is killed. As a result, the peer end doesn't receive the FIN packet, and the connection is not properly closed.
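Reading the per-rule counters by eye becomes tedious when many rules are configured. The following Python sketch scans the text printed by iptables -nvL and reports rules with a given target whose packet counter is nonzero. It assumes the usual column layout (pkts, bytes, target, prot, opt, in, out, source, destination) and that every rule has a target. The sample listing is made up for illustration.

```python
def parse_count(token):
    """Parse a counter such as '12', '34K' or '5M' (large values are abbreviated)."""
    multipliers = {"K": 1000, "M": 1000**2, "G": 1000**3, "T": 1000**4}
    if token and token[-1] in multipliers:
        return int(token[:-1]) * multipliers[token[-1]]
    return int(token)


def matched_rules(listing, target="DROP"):
    """Return (chain, packets, match_details) for matched rules with the given target."""
    hits = []
    chain = None
    for line in listing.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Chain":
            chain = fields[1]
            continue
        if fields[0] == "pkts" or len(fields) < 9:
            continue  # column header or a line that is not a rule
        pkts = parse_count(fields[0])
        if pkts > 0 and fields[2] == target:
            hits.append((chain, pkts, " ".join(fields[9:])))
    return hits


# Hypothetical listing in the layout of "iptables -nvL".
SAMPLE_LISTING = """\
Chain INPUT (policy DROP 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination
  340   28K ACCEPT     all  --  lo     *       0.0.0.0/0            0.0.0.0/0

Chain OUTPUT (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination
   12   480 DROP       all  --  *      *       0.0.0.0/0            0.0.0.0/0            state INVALID
"""

if __name__ == "__main__":
    for chain, pkts, details in matched_rules(SAMPLE_LISTING):
        print(f"{chain}: {pkts} packets dropped ({details})")
```

Because abbreviated counters are rounded, the parsed values are approximate. Passing -x to iptables prints exact numbers.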
At this point, some questions remain unanswered.
The first question: does the problem always occur, and what are the triggering conditions?
For the process on the ECS instance that establishes a TCP connection with the server, the problem does not occur every time. NetCat can be used to check whether the problem is tied to that particular process. The test results are as follows:
1) When NetCat is used for similar operations, the same problem recurs. This indicates that the problem is general and has nothing to do with a specific process or connection.
2) When the connection has been open for a relatively long time, the problem may recur. When the connection is short-lived, the FIN packet is sent normally after the process is killed.
Further, checking the conntrack-related kernel parameters reveals that net.netfilter.nf_conntrack_tcp_timeout_established = 120 on the ECS instance, which is far from the default.
The default value is 5 days (432,000 seconds), while the value recommended for Alibaba Cloud is 1200 seconds. The current setting of 120 seconds on this ECS instance is very short.
The conclusion is that the connection tracking entry in conntrack is deleted once the connection has been idle for the 120 seconds specified by nf_conntrack_tcp_timeout_established. If a FIN is then initiated for this connection, Netfilter considers the FIN INVALID. On the OUTPUT chain of the iptables filter table, packets in the INVALID state are dropped. As a result, the FIN packet is dropped on the OUTPUT chain of the Netfilter filter table.
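The reasoning above can be condensed into a toy model: a FIN sent after the connection has been idle longer than the established timeout finds no conntrack entry, is classified INVALID, and is dropped when the OUTPUT chain drops INVALID packets. The Python sketch below encodes only that rule. The function names are illustrative, and it is not a simulation of the real conntrack timers.

```python
DEFAULT_ESTABLISHED_TIMEOUT = 5 * 24 * 3600  # 432000 s, the 5-day default


def fin_state_after_idle(idle_seconds, established_timeout):
    """Toy model: conntrack state seen by iptables for a FIN sent after an idle period."""
    entry_expired = idle_seconds >= established_timeout
    return "INVALID" if entry_expired else "ESTABLISHED"


def fin_leaves_host(idle_seconds, established_timeout, drop_invalid_on_output=True):
    """Return True if the FIN passes the OUTPUT chain in this simplified model."""
    state = fin_state_after_idle(idle_seconds, established_timeout)
    return not (state == "INVALID" and drop_invalid_on_output)


if __name__ == "__main__":
    for idle in (60, 300):
        print(idle, fin_state_after_idle(idle, 120), fin_leaves_host(idle, 120))
```

With the 120-second setting from this case, a connection idle for 300 seconds loses its FIN, while the same connection under the 5-day default does not.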
For a TCP connection, it seems logical to consider a FIN packet INVALID if one end initiates it when no connection tracking entry exists in conntrack. However, no document clearly describes how the conntrack module determines the state of a "new" packet when the user-space TCP socket still exists but the conntrack entry does not.
The descriptions about how the NEW, ESTABLISHED, RELATED, and INVALID states are defined in conntrack are similar in all documents.
The NEW state implies matching the first packet that the conntrack module sees, within a specific connection. For example, if you see a SYN packet and it is the first packet in a connection that you see, it will match. However, the packet may as well not be a SYN packet and still be considered NEW. This may lead to certain problems in some instances, but it may also be extremely helpful when you need to pick up lost connections from other firewalls, or when a connection has already timed out, but in reality is not closed.
According to the preceding description, the first packet (for example, a TCP SYN packet) seen by the conntrack module is in the NEW state. Sometimes a non-SYN packet is also considered NEW.
In the case described in this article, the conntrack entry has expired. Whatever packet user mode then sends, it is the first packet of that connection seen by the conntrack module. The question is whether conntrack will consider this packet NEW. SYN, SYN-ACK, FIN, and RST packets have different semantics, and intuitively it seems reasonable to classify some of them as INVALID directly. The following test reproduces the problem and checks the code logic.
Use the following script to set iptables rules.
#!/bin/sh
iptables -P INPUT ACCEPT
iptables -F
iptables -X
iptables -Z
# Log the state of every packet arriving on the INPUT chain
iptables -A INPUT -p tcp -m state --state NEW -j LOG --log-prefix "[iptables] INPUT NEW: "
iptables -A INPUT -p TCP -m state --state ESTABLISHED -j LOG --log-prefix "[iptables] INPUT ESTABLISHED: "
iptables -A INPUT -p TCP -m state --state RELATED -j LOG --log-prefix "[iptables] INPUT RELATED: "
iptables -A INPUT -p TCP -m state --state INVALID -j LOG --log-prefix "[iptables] INPUT INVALID: "
iptables -A INPUT -i lo -j ACCEPT
iptables -A INPUT -p tcp --dport 22 -j ACCEPT
iptables -A INPUT -p tcp --dport 21 -j ACCEPT
iptables -A INPUT -p tcp --dport 80 -j ACCEPT
iptables -A INPUT -p tcp --dport 443 -j ACCEPT
iptables -A INPUT -p tcp --dport 8088 -m state --state NEW -j ACCEPT
iptables -A INPUT -p icmp --icmp-type 8 -j ACCEPT
iptables -A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
# Log the state of every packet passing through the OUTPUT chain
iptables -A OUTPUT -p tcp -m state --state NEW -j LOG --log-prefix "[iptables] OUTPUT NEW: "
iptables -A OUTPUT -p TCP -m state --state ESTABLISHED -j LOG --log-prefix "[iptables] OUTPUT ESTABLISHED: "
iptables -A OUTPUT -p TCP -m state --state RELATED -j LOG --log-prefix "[iptables] OUTPUT RELATED: "
iptables -A OUTPUT -p TCP -m state --state INVALID -j LOG --log-prefix "[iptables] OUTPUT INVALID: "
# iptables -A OUTPUT -m state --state INVALID -j DROP
iptables -P INPUT DROP
iptables -P OUTPUT ACCEPT
iptables -P FORWARD DROP
service iptables save
systemctl restart iptables.service
Run the "iptables -nvL" command to view the rules.
Note: During the test, a similar problem is reproduced even when packets in the INVALID state are not dropped on the OUTPUT chain, because the FIN that the peer end returns in the INPUT direction is also in the INVALID state and is dropped by the default DROP policy of the INPUT chain.
Next, set nf_conntrack_tcp_timeout_established to a small value.
sysctl -w net.netfilter.nf_conntrack_tcp_timeout_established=20
Use NetCat to conduct a test. After the first connection is established and remains idle for 20 seconds, the ESTABLISHED entry in conntrack disappears (you can view the entry using the "iptstate" command or the conntrack tool).
If the process is then killed and a FIN is sent, conntrack considers the FIN INVALID. If data packets continue to be sent on the connection instead, conntrack considers them NEW.
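When NetCat is not available, a few lines of Python can play the same client role: connect, stay idle long enough for the conntrack entry to expire, then close so that a FIN is generated. This is a minimal sketch under that assumption. The helper name is illustrative, and the address in the usage note is a placeholder.

```python
import socket
import time


def idle_then_close(host, port, idle_seconds):
    """Connect, stay idle for idle_seconds, then close (which sends a FIN).

    Returns the local (ip, port) pair so that the connection can be looked up
    with tcpdump or the conntrack tool while the test runs.
    """
    with socket.create_connection((host, port)) as conn:
        local = conn.getsockname()
        time.sleep(idle_seconds)  # no traffic, so the conntrack entry can expire
    return local
```

For example, with the 20-second timeout set above, idle_then_close("192.0.2.10", 8088, 30) keeps the connection idle past the timeout before closing it. Here 192.0.2.10 stands for the test server, and port 8088 is the port opened in the rule script.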
Trace the packet processing of the nf_conntrack module starting from the nf_conntrack_in function. The logic for a new packet that has no existing conntrack entry is shown below.
nf_conntrack_in @net/netfilter/nf_conntrack_core.c
|--> resolve_normal_ct @net/netfilter/nf_conntrack_core.c // Look up the conntrack entry with __nf_conntrack_find_get; if none is found, initialize a new conntrack entry
|--> init_conntrack @net/netfilter/nf_conntrack_core.c // Initialize the conntrack entry
|--> tcp_new @net/netfilter/nf_conntrack_proto_tcp.c // TCP-specific handling, called when a new connection for this protocol is found. The state is decided here based on the tcp_conntracks array.
In resolve_normal_ct, the logic first looks for the corresponding conntrack entry using __nf_conntrack_find_get. In the scenario described in this article, the conntrack entry has timed out and therefore does not exist, so the code goes on to init_conntrack to initialize a new entry.
/* look for tuple match */
hash = hash_conntrack_raw(&tuple, zone);
h = __nf_conntrack_find_get(net, zone, &tuple, hash);
if (!h) {
    h = init_conntrack(net, tmpl, &tuple, l3proto, l4proto,
                       skb, dataoff, hash);
    if (!h)
        return NULL;
    if (IS_ERR(h))
        return (void *)h;
}
In the following logic of init_conntrack, the "new" callback of nf_conntrack_l4proto reads and verifies the content of a packet that is new to the conntrack module. If it returns "false", the subsequent if branch is taken and the initialization of the conntrack entry is aborted. In the scenario described in this article, the initialization of the conntrack entry ends here.
This "new" TCP packet, which belongs to a TCP connection whose conntrack entry no longer exists (it timed out), is verified in the new (tcp_new) logic.
if (!l4proto->new(ct, skb, dataoff, timeouts)) {
    nf_conntrack_free(ct);
    pr_debug("init conntrack: can't track with proto module\n");
    return NULL;
}
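The lookup-then-initialize flow in the two excerpts above can be modeled in a few lines of Python. This is a simplified sketch, not the kernel implementation: a dict stands in for the conntrack hash table, the entry is inserted immediately (the kernel confirms entries at a later stage), and reject_fin is a hypothetical stand-in for the l4proto "new" callback.

```python
def init_conntrack(table, tuple_key, packet, proto_new):
    """Create an entry unless the protocol callback rejects the packet."""
    entry = {"tuple": tuple_key, "first_packet": packet}
    if not proto_new(entry, packet):
        return None  # entry discarded; the packet stays untracked
    table[tuple_key] = entry  # simplification: the kernel confirms entries later
    return entry


def resolve_normal_ct(table, tuple_key, packet, proto_new):
    """Return the existing entry for the tuple, or try to create one."""
    entry = table.get(tuple_key)  # stands in for __nf_conntrack_find_get
    if entry is None:
        entry = init_conntrack(table, tuple_key, packet, proto_new)
    return entry


def reject_fin(entry, packet):
    """Hypothetical stand-in for l4proto->new(): refuse a FIN as first packet."""
    return packet != "FIN"


if __name__ == "__main__":
    table = {}
    key = ("10.0.0.10", 50000, "10.0.0.20", 8088, "tcp")
    print(resolve_normal_ct(table, key, "FIN", reject_fin))  # no entry is created
```

A FIN arriving for a tuple with no entry yields no conntrack entry at all, while the same FIN on a tuple that still has an entry simply returns that entry.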
In the following tcp_new logic, the key step is the assignment to new_state. If the value of new_state is equal to or greater than TCP_CONNTRACK_MAX, the function returns "false" and exits. For the FIN packet, the value assigned to new_state is TCP_CONNTRACK_MAX (sIV). The specific logic is analyzed as follows.
/* Called when a new connection for this protocol found. */
static bool tcp_new(struct nf_conn *ct, const struct sk_buff *skb,
                    unsigned int dataoff, unsigned int *timeouts)
{
    enum tcp_conntrack new_state;
    const struct tcphdr *th;
    struct tcphdr _tcph;
    struct net *net = nf_ct_net(ct);
    struct nf_tcp_net *tn = tcp_pernet(net);
    const struct ip_ct_tcp_state *sender = &ct->proto.tcp.seen[0];
    const struct ip_ct_tcp_state *receiver = &ct->proto.tcp.seen[1];
    th = skb_header_pointer(skb, dataoff, sizeof(_tcph), &_tcph);
    BUG_ON(th == NULL);
    /* Don't need lock here: this conntrack not in circulation yet */
    // Here get_conntrack_index returns TCP_FIN_SET, a value of enum tcp_bit_set
    new_state = tcp_conntracks[0][get_conntrack_index(th)][TCP_CONNTRACK_NONE];
    /* Invalid: delete conntrack */
    if (new_state >= TCP_CONNTRACK_MAX) {
        pr_debug("nf_ct_tcp: invalid new deleting.\n");
        return false;
    }
    ......
}
tcp_conntracks is a three-dimensional array that serves as the TCP state transition table. Its three subscripts are as follows:
1) The outermost subscript of the tcp_conntracks array is 0, indicating ORIGINAL, or the packet sending end.
2) The middle subscript is determined by get_conntrack_index. get_conntrack_index(th) obtains the value TCP_FIN_SET of enum tcp_bit_set (defined as follows) based on the FIN flag in the packet. tcp_bit_set is in one-to-one correspondence with the middle subscript of the tcp_conntracks array shown below.
/* What TCP flags are set from RST/SYN/FIN/ACK. */
enum tcp_bit_set {
    TCP_SYN_SET,
    TCP_SYNACK_SET,
    TCP_FIN_SET,
    TCP_ACK_SET,
    TCP_RST_SET,
    TCP_NONE_SET,
};
3) The innermost subscript is TCP_CONNTRACK_NONE, which corresponds to 0 in enum tcp_conntrack.
The following snippet shows the content of the array. The source code has many comments that describe the state transitions (which are omitted here). This article only focuses on the state assigned to the first packet that is seen after the conntrack entry times out.
static const u8 tcp_conntracks[2][6][TCP_CONNTRACK_MAX] = {
    {
        /* ORIGINAL */
        /*syn*/    { sSS, sSS, sIG, sIG, sIG, sIG, sIG, sSS, sSS, sS2 },
        /*synack*/ { sIV, sIV, sSR, sIV, sIV, sIV, sIV, sIV, sIV, sSR },
        /*fin*/    { sIV, sIV, sFW, sFW, sLA, sLA, sLA, sTW, sCL, sIV },
        /*ack*/    { sES, sIV, sES, sES, sCW, sCW, sTW, sTW, sCL, sIV },
        /*rst*/    { sIV, sCL, sCL, sCL, sCL, sCL, sCL, sCL, sCL, sCL },
        /*none*/   { sIV, sIV, sIV, sIV, sIV, sIV, sIV, sIV, sIV, sIV }
    },
    {
        /* REPLY */
        /*syn*/    { sIV, sS2, sIV, sIV, sIV, sIV, sIV, sIV, sIV, sS2 },
        /*synack*/ { sIV, sSR, sIG, sIG, sIG, sIG, sIG, sIG, sIG, sSR },
        /*fin*/    { sIV, sIV, sFW, sFW, sLA, sLA, sLA, sTW, sCL, sIV },
        /*ack*/    { sIV, sIG, sSR, sES, sCW, sCW, sTW, sTW, sCL, sIG },
        /*rst*/    { sIV, sCL, sCL, sCL, sCL, sCL, sCL, sCL, sCL, sCL },
        /*none*/   { sIV, sIV, sIV, sIV, sIV, sIV, sIV, sIV, sIV, sIV }
    }
};
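The row labels in the table (syn, synack, fin, ack, rst, none) correspond to the values that get_conntrack_index derives from the TCP flags. The Python sketch below models that mapping. The priority order (RST first, then SYN, then FIN, then ACK) reflects the kernel function as it is commonly published and should be treated as an assumption to verify against the kernel version in use. The keyword-argument interface is illustrative.

```python
from enum import IntEnum


class TcpBitSet(IntEnum):
    """Userspace mirror of enum tcp_bit_set (row index into tcp_conntracks)."""
    TCP_SYN_SET = 0
    TCP_SYNACK_SET = 1
    TCP_FIN_SET = 2
    TCP_ACK_SET = 3
    TCP_RST_SET = 4
    TCP_NONE_SET = 5


def get_conntrack_index(syn=False, ack=False, fin=False, rst=False):
    """Map TCP header flags to a row index (assumed priority: RST, SYN, FIN, ACK)."""
    if rst:
        return TcpBitSet.TCP_RST_SET
    if syn:
        return TcpBitSet.TCP_SYNACK_SET if ack else TcpBitSet.TCP_SYN_SET
    if fin:
        return TcpBitSet.TCP_FIN_SET
    if ack:
        return TcpBitSet.TCP_ACK_SET
    return TcpBitSet.TCP_NONE_SET


if __name__ == "__main__":
    # A FIN normally also carries ACK; it still selects the fin row.
    print(get_conntrack_index(fin=True, ack=True).name)
```

Under this ordering, the FIN sent when a socket is closed selects the fin row even though the ACK flag is also set.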
According to the preceding analysis, for a new packet of the conntrack module, the values are as follows:
tcp_conntracks[0][get_conntrack_index(th)][TCP_CONNTRACK_NONE] =>tcp_conntracks[0][get_conntrack_index(th)][0]
tcp_conntracks[0][TCP_FIN_SET][0] = tcp_conntracks[0][2][0] => sIV, the INVALID state // case described in this article
tcp_conntracks[0][TCP_SYNACK_SET][0] = tcp_conntracks[0][1][0] => sIV, the INVALID state
tcp_conntracks[0][TCP_RST_SET][0] = tcp_conntracks[0][4][0] => sIV, the INVALID state
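The same lookup can be checked mechanically. The Python sketch below copies the ORIGINAL half of the table shown earlier and reads column 0 (TCP_CONNTRACK_NONE) for each packet type. The helper names are illustrative, and the sketch covers only the first check in tcp_new, not the logic elided after it.

```python
TCP_CONNTRACK_NONE = 0  # column used when no conntrack entry exists yet

# ORIGINAL-direction rows, copied from the tcp_conntracks table in this article.
ORIGINAL = {
    "syn":    ["sSS", "sSS", "sIG", "sIG", "sIG", "sIG", "sIG", "sSS", "sSS", "sS2"],
    "synack": ["sIV", "sIV", "sSR", "sIV", "sIV", "sIV", "sIV", "sIV", "sIV", "sSR"],
    "fin":    ["sIV", "sIV", "sFW", "sFW", "sLA", "sLA", "sLA", "sTW", "sCL", "sIV"],
    "ack":    ["sES", "sIV", "sES", "sES", "sCW", "sCW", "sTW", "sTW", "sCL", "sIV"],
    "rst":    ["sIV", "sCL", "sCL", "sCL", "sCL", "sCL", "sCL", "sCL", "sCL", "sCL"],
    "none":   ["sIV", "sIV", "sIV", "sIV", "sIV", "sIV", "sIV", "sIV", "sIV", "sIV"],
}


def first_packet_state(kind):
    """State the table assigns to the first packet of an untracked connection."""
    return ORIGINAL[kind][TCP_CONNTRACK_NONE]


def passes_invalid_check(kind):
    """True if the packet survives the 'new_state >= TCP_CONNTRACK_MAX' check."""
    return first_packet_state(kind) != "sIV"


if __name__ == "__main__":
    for kind in ORIGINAL:
        print(f"{kind:7s} -> {first_packet_state(kind)}")
```

Only syn (sSS) and ack (sES) pass the check. This matches the test result: a FIN sent after the entry expires is INVALID, while continued data packets, which carry ACK, can be picked up as a new connection.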
When iptables is used in the operating system (or when the hooks provided by Netfilter are used in other scenarios), we recommend that you set nf_conntrack_tcp_timeout_established to a value smaller than the default (5 days). This is the recommended practice to prevent too many entries from accumulating in the conntrack table. The critical question is how to determine whether the chosen value is appropriate. Unless you clearly know the filtering action that the iptables rules perform on every packet, we do not recommend setting the value to a few hundred seconds or less.
If nf_conntrack_tcp_timeout_established is set to a very small value, the FIN or RST (with linger enabled) sent when the connection is closed is very likely to be dropped by iptables rules, because the conntrack entry has already timed out. In the case described in this article, the filter table rules of iptables drop packets in the INVALID state on each chain. Even if no explicit rule drops them, the default DROP policy of the INPUT chain still drops INVALID packets and does not allow them to enter. The final impact is that the user-mode socket stays in an uncommon state, such as FIN_WAIT_1 or LAST_ACK, and consequently the TCP connection doesn't close properly.
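As a quick sanity check of a host's configuration, the current timeout can be read from procfs and compared with the values mentioned in this article. The following Python sketch assumes that the nf_conntrack module is loaded, so that the procfs file exists. The category names and the 1200-second floor (the recommended value cited above) are illustrative choices, not an official threshold.

```python
SYSCTL_PATH = "/proc/sys/net/netfilter/nf_conntrack_tcp_timeout_established"
KERNEL_DEFAULT = 432000  # 5 days, in seconds
SUGGESTED_FLOOR = 1200   # the recommended value cited in this article


def read_timeout(path=SYSCTL_PATH):
    """Read the current established timeout in seconds from procfs."""
    with open(path) as f:
        return int(f.read().strip())


def assess_timeout(seconds, floor=SUGGESTED_FLOOR):
    """Classify a timeout value relative to the floor and the kernel default."""
    if seconds < floor:
        return "risky"  # idle connections may lose their FIN or RST
    if seconds < KERNEL_DEFAULT:
        return "reduced"
    return "default-or-higher"


if __name__ == "__main__":
    try:
        value = read_timeout()
        print(value, assess_timeout(value))
    except FileNotFoundError:
        print("nf_conntrack does not appear to be loaded")
```

With the values from this case, both 120 and 20 seconds fall into the "risky" category.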