Troubleshooting methods for NLB health check issues - Server Load Balancer

Network Load Balancer (NLB) performs health checks to probe the availability of backend servers. If the health check status of a listener changes to Unhealthy, either a backend server has encountered an error, or the health check or the backend server is incorrectly configured. This topic describes how to troubleshoot health check issues for NLB.

Problem

The health check status of an NLB listener becomes Unhealthy.

Causes

If this is the first time that you have configured health checks for the listener, first check whether the health checks are correctly configured. Check the following possible causes:

Invalid health check parameter settings
Invalid health check ports

If the issue persists after you confirm that health checks are correctly configured, check for backend server issues. Check the following possible causes:

Invalid security service configurations
Invalid route configurations
Heavy loads on backend servers

Solutions

Errors that occur after the first health check probe is sent

Cause 1: Invalid parameter settings

Log on to the NLB console.
In the top navigation bar, select the region in which the NLB instance is deployed.
In the left-side navigation pane, choose NLB > Server Groups.
On the Server Groups page, find the server group associated with your NLB instance and click Modify Health Check Settings.
In the Modify Health Check Settings dialog box, check the parameter settings. We recommend that you use the default settings.

Cause 2: Invalid health check ports

1. Check the health check ports

Log on to the NLB console.
In the top navigation bar, select the region in which the NLB instance is deployed.
In the left-side navigation pane, choose NLB > Server Groups.
On the Server Groups page, click the ID of the server group associated with your NLB instance.
On the server group details page, click the Backend Servers tab and record the backend server port.
On the server group details page, click the Details tab and then click Modify Health Check Settings in the Health Check section. In the Modify Health Check Settings dialog box, record the health check configurations.

TCP listeners

Log on to the unhealthy backend server and run the following command to probe the health check port.
For more information about how to log on to backend servers, see Methods for connecting to an ECS instance.
```
telnet [$IP] [$Port]
```
Note
- Replace [$IP] with the private IP address of the backend server.
- Replace [$Port] with the health check port on the backend server. If no health check port is specified, the backend server port is used as the health check port by default. If a health check port is specified, enter the specified health check port.
View the command output.
If the output contains the message "telnet: connect to address [$IP]: Connection refused", the connection request is rejected. This typically means that no service is listening on the health check port of the backend server, and the port is in an error state.
If the output contains the message "Connected to [$IP]", the health check port on the backend server is accessible.
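
If you want to script this check, the two cases above can be sketched as a small shell function. This is a minimal sketch, not part of any NLB tooling: the function name classify_telnet_output and the IP address and port in the usage comment are hypothetical, and the function only looks for the two messages quoted above.

```shell
# Classify telnet output read from stdin, based on the two messages described above.
# classify_telnet_output is a hypothetical helper name.
classify_telnet_output() {
    output=$(cat)
    case "$output" in
        *"Connection refused"*) echo "refused" ;;    # port rejected the connection
        *"Connected to"*)       echo "connected" ;;  # port accepted the connection
        *)                      echo "unknown" ;;    # for example, a timeout
    esac
}

# Example usage on a backend server (placeholder IP address and port).
# The refusal message is written to stderr, so stderr is merged into stdout:
#   telnet 192.168.0.10 80 2>&1 </dev/null | classify_telnet_output
```

An "unknown" result, such as a probe that hangs until it times out, is worth treating separately, because it may point to a filtering rule rather than a closed port.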

UDP listeners

Log on to the unhealthy backend server and run the following command to probe the health check port.
For more information about how to log on to backend servers, see Methods for connecting to an ECS instance.
```
netstat -anu | grep [$IP]:[$Port]
```
Note
- Replace [$IP] with the private IP address of the backend server.
- Replace [$Port] with the health check port on the backend server. If no health check port is specified, the backend server port is used as the health check port by default. If a health check port is specified, enter the specified health check port.
View the command output.
If the output does not contain the [$IP]:[$Port] entry, no service is listening on the health check port at that address, and the port is in an error state.
If the output contains the [$IP]:[$Port] entry, the health check port on the backend server is accessible.
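
One caveat with a literal grep for [$IP]:[$Port]: a service bound to all addresses appears in netstat output as 0.0.0.0:[$Port] (or :::[$Port] for IPv6), so the grep can miss it. The following is a minimal sketch of a check that accepts both forms. The function name has_udp_listener and the values in the usage comment are hypothetical.

```shell
# Return 0 if the netstat output on stdin contains a UDP socket bound to
# ip:port or to a wildcard address on that port; return 1 otherwise.
# has_udp_listener is a hypothetical helper name.
has_udp_listener() {
    awk -v ip="$1" -v port="$2" '
        $1 ~ /^udp/ && ($4 == ip ":" port ||
                        $4 == "0.0.0.0:" port ||
                        $4 == ":::" port) { found = 1 }
        END { exit !found }
    '
}

# Example usage on a backend server (placeholder IP address and port):
#   netstat -anu | has_udp_listener 192.168.0.10 53 && echo "listening"
```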

2. Check whether the health check port on the backend server is open and whether it matches the port specified in the health check settings. In this example, an NGINX application is used.

Log on to the unhealthy backend server and run the following command to query the status of the NGINX application:
```
systemctl status nginx
```
If the output shows that the service is not active, for example "Active: inactive (dead)", the NGINX application is not running.
Run the following command to start the NGINX application:
```
systemctl start nginx
```
Run the following command to query the status of the NGINX application:
```
systemctl status nginx
```
If the output contains "Active: active (running)", the NGINX application is running.
Log on to the NLB console and perform the following operations:
1. In the left-side navigation pane, choose NLB > Server Groups.
2. On the Server Groups page, find the server group that you want to manage, and click Modify Health Check Settings in the Actions column.
3. In the Modify Health Check Settings dialog box, view the health check port.
4. View the health check status of the listener. If the listener is unhealthy, run the following command to query the health check port of the NGINX application:
```
netstat -tanp | grep nginx
```
In this example, the output shows that the port on which the NGINX application listens is different from the health check port specified in the health check settings.
Open the /etc/nginx/nginx.conf file, change the value of the listen directive to the port specified in the health check settings, and then save and close the file.
Note
If you cannot change the listen value due to business restrictions, change the health check port in the health check settings based on the use scenario. For more information, see Overview of NLB listeners.
Run the following command to restart the NGINX application. Wait until the health check status changes to Healthy.
```
systemctl restart nginx
```
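
The port comparison in this procedure can also be scripted. The following is a minimal sketch that reads netstat -tanp output and reports whether an NGINX process is listening on a given port. The function name nginx_listens_on is hypothetical, and 8080 in the usage comment stands in for whatever health check port is configured in the console.

```shell
# Return 0 if the `netstat -tanp` output on stdin shows an nginx process
# listening on the given TCP port; return 1 otherwise.
# nginx_listens_on is a hypothetical helper name.
nginx_listens_on() {
    awk -v port="$1" '
        /nginx/ && $6 == "LISTEN" {
            n = split($4, parts, ":")   # the port is the last ":"-separated field
            if (parts[n] == port) found = 1
        }
        END { exit !found }
    '
}

# Example usage (8080 is a placeholder for the configured health check port):
#   netstat -tanp | nginx_listens_on 8080 || echo "NGINX is not listening on 8080"
```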

Errors that occur during subsequent health check probes

Cause 1: Invalid security service configurations

NLB uses the CIDR block of the virtual private cloud (VPC) to communicate with backend servers. Make sure that the VPC CIDR block is not blocked by an access control policy, such as an iptables rule or a third-party security service. If the VPC CIDR block is blocked, the health check status changes to Unhealthy and NLB service errors occur. In the following example, iptables blocks the IP address that the NLB instance uses within the VPC CIDR block.

Log on to the NLB console and view the IP address used by the NLB instance to communicate with backend servers.
Log on to the unhealthy backend server and run the following command to query all rules in the filter table:
```
iptables -nL
```
In this example, the output contains a DROP rule in the INPUT chain whose source is the IP address used by the NLB instance. This means that the backend server blocks requests sent from that IP address.
Run the following command to delete the rule that blocks requests from the IP address:
```
iptables -t filter -D INPUT -s 192.168.20.75 -j DROP  # The IP address is for reference only.
```
Note
Replace the IP address in the command with the actual private IP address used by your NLB instance.
Run the following command again to check whether other rules still block requests from the IP address. Make sure that no remaining rule blocks the IP address.
```
iptables -nL
```
Make sure that the health check status of the listener changes to Healthy.
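
On a server with many rules, scanning the listing by eye is error-prone. The following is a minimal sketch that filters the output of iptables -S INPUT, which prints rules in command form, for DROP or REJECT rules whose source is a given IP address. The function name find_blocking_rules is hypothetical, and the sketch only matches an exact host address. It does not detect rules that block a wider CIDR block containing the address.

```shell
# Print INPUT-chain rules from `iptables -S INPUT` output (stdin) that DROP or
# REJECT traffic whose source is exactly the given IP address.
# find_blocking_rules is a hypothetical helper name.
find_blocking_rules() {
    awk -v ip="$1" '
        $1 == "-A" && $2 == "INPUT" {
            src = ""; target = ""
            for (i = 3; i < NF; i++) {
                if ($i == "-s") src = $(i + 1)      # source address of the rule
                if ($i == "-j") target = $(i + 1)   # action taken by the rule
            }
            if ((src == ip || src == ip "/32") &&
                (target == "DROP" || target == "REJECT"))
                print
        }
    '
}

# Example usage (the IP address is for reference only):
#   iptables -S INPUT | find_blocking_rules 192.168.20.75
```

Empty output suggests that no host-specific blocking rule remains for that address in the INPUT chain.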

Cause 2: Invalid route configurations

If an incorrect route for the VPC CIDR block used by the NLB instance is added to the route table of a backend server, responses to health check probes cannot be returned to the NLB instance. As a result, the health check status changes to Unhealthy. In the following example, the Linux route command is used to query route configurations.

Log on to the NLB console and view the IP address used by the NLB instance to communicate with backend servers.
Log on to the unhealthy backend server and run the following command to query route configurations:
```
route -n
```
If the output contains a route entry whose destination is the IP address used by the NLB instance, whose genmask is 255.255.255.255, and whose gateway is not the default gateway of the network interface controller (NIC), the route entry is incorrectly configured. The default gateway is the value of Gateway in the route entry whose destination is 0.0.0.0.
Run the following command to delete the incorrect route entry. In this example, the incorrect entry is a blackhole route:
```
ip route del blackhole 192.168.20.75 # The IP address is for reference only.
```
Note
Replace the IP address in the command with the actual private IP address used by your NLB instance.
Make sure that the health check status of the listener changes to Healthy.
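
The route check described in this section can be sketched as a filter over route -n output. The function name find_host_routes is hypothetical. It prints the gateway and interface of every host route (genmask 255.255.255.255) to the given IP address, so that you can compare each gateway with the default gateway yourself.

```shell
# Print "gateway iface" for each host route in `route -n` output (stdin)
# whose destination is the given IP address.
# find_host_routes is a hypothetical helper name.
find_host_routes() {
    awk -v ip="$1" '$1 == ip && $3 == "255.255.255.255" { print $2, $NF }'
}

# Example usage (the IP address is for reference only):
#   route -n | find_host_routes 192.168.20.75
```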

Cause 3: Heavy loads on backend servers

Check whether the backend servers are overloaded. For more information, see Troubleshoot and resolve high load issues on Linux instances.
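
As a quick first check before you follow the linked topic, you can compare the load average with the number of CPU cores. The following is a minimal sketch. The function name is_overloaded is hypothetical, and the threshold (1-minute load average greater than the core count) is a common rule of thumb assumed here, not a threshold defined by NLB.

```shell
# Return 0 if the given load average exceeds the given number of CPU cores.
# is_overloaded is a hypothetical helper; the "load > cores" threshold is an
# assumption (a rule of thumb), not an NLB requirement.
is_overloaded() {
    awk -v load="$1" -v cores="$2" 'BEGIN { exit !(load + 0 > cores + 0) }'
}

# Example usage on a Linux backend server:
#   is_overloaded "$(cut -d ' ' -f 1 /proc/loadavg)" "$(nproc)" && echo "high load"
```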

References

You can use the instance diagnostics feature of NLB to troubleshoot health check issues. For more information, see Diagnose an NLB instance.