Issue
After a Server Load Balancer (SLB) instance is configured, clients may receive errors such as 500 Internal Server Error, 502 Bad Gateway, and 504 Gateway Timeout. These errors may be caused by blocking by Internet service providers (ISPs), blocking by Alibaba Cloud Security due to abnormal client activities, configuration errors of the SLB instance, health check failures, or failures in accessing web applications on the backend Elastic Compute Service (ECS) instances. This topic describes solutions to these errors. If the errors persist, troubleshoot them by following the steps in the Troubleshooting procedure section.
Causes
The following section describes the possible causes of HTTP 5xx errors.
- The domain name of the origin server does not have an Internet Content Provider (ICP) number, or no Layer 7 forwarding rules are configured for the domain name in Anti-DDoS
- The client IP address is blocked by Alibaba Cloud Security
- The client IP address is blocked by the ISP
- Requests are blocked by the security software of the backend ECS instance
- The parameters of the Linux kernel of the backend ECS instance are not properly configured
- The backend ECS instance has performance bottlenecks
- SLB reports 502 errors due to health check failures
- Health checks are successful but 502 errors are reported for web applications
- The service access logic is invalid
Solutions
The domain name of the origin server does not have an ICP number, or no Layer 7 forwarding rules are configured for the domain name in Anti-DDoS
If the domain name of the origin server does not have an ICP number, apply for an ICP number for the domain name. For more information, see ICP filing. If Anti-DDoS is configured for the domain name, configure Layer 7 forwarding rules for the domain name. For more information, see Rules.
The client IP address is blocked by Alibaba Cloud Security
Add the client IP address to the IP address whitelist. For more information, see IP address whitelist.
The client IP address is blocked by the ISP
Check whether clients that use other ISPs encounter the same issue. If they do not, the issue is likely caused by blocking from a specific ISP. You can check whether packets are blocked by that ISP, or submit a ticket to contact Alibaba Cloud technical support. If packets are blocked by the ISP, contact the ISP to resolve the issue.
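To compare clients on different ISPs, you can run the same TCP connection test from each of them against the IP address and listener port of the SLB instance. The following Python sketch is one way to do this; the function name tcp_probe and its outcome labels are illustrative and are not part of any Alibaba Cloud tool. A timeout, as opposed to an immediate refusal, suggests that packets are dropped somewhere on the path, which is consistent with ISP blocking but does not prove it.

```python
import socket

def tcp_probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Attempt one TCP handshake and classify the outcome (hypothetical helper).

    "open"    - the handshake completed
    "refused" - the peer actively rejected the connection
    "timeout" - no answer within `timeout`; packets may be silently dropped
    "error"   - any other socket error (DNS failure, unreachable network, ...)
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except (socket.timeout, TimeoutError):
        return "timeout"
    except OSError:
        return "error"
```

For example, run tcp_probe("203.0.113.10", 80) from a client on each ISP, where 203.0.113.10 is a placeholder for the public IP address of your SLB instance. If only clients on one ISP report "timeout", that ISP is a likely suspect.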
Requests are blocked by the security software of the backend ECS instance
100.64.0.0/10 is a CIDR block reserved by Alibaba Cloud. SLB uses addresses in this block to perform health checks and to forward requests to backend servers. The CIDR block is not assigned to users and does not pose security risks. If security software or a firewall is enabled on an ECS instance, add this CIDR block to its whitelist, or uninstall the security software, to prevent 500 and 502 errors. The following steps use iptables on a Linux instance as an example.
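When you review the logs or blocklists of firewall or security software, you can quickly tell whether a blocked source address belongs to SLB by testing it against 100.64.0.0/10. The sketch below uses Python's standard ipaddress module; the helper name is_slb_internal is hypothetical.

```python
import ipaddress

# The CIDR block that SLB uses for health checks and request forwarding.
SLB_INTERNAL_BLOCK = ipaddress.ip_network("100.64.0.0/10")

def is_slb_internal(ip: str) -> bool:
    """Return True if `ip` lies inside 100.64.0.0/10 (hypothetical helper).

    IPv6 addresses are never members of an IPv4 network, so they return False.
    """
    return ipaddress.ip_address(ip) in SLB_INTERNAL_BLOCK

if __name__ == "__main__":
    for source in ("100.64.0.1", "100.127.255.255", "100.128.0.1", "10.0.0.1"):
        print(source, is_slb_internal(source))
```

Any source address for which is_slb_internal returns True should not be blocked on a backend server.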
- Log on to the backend server on which issues are detected and run the following command to view all rules of the filter table:
sudo iptables -nL
If the output contains a DROP rule in the INPUT chain whose source is 100.64.0.0/10, the backend server blocks requests from the private CIDR block of SLB.
- Run the following command to delete this rule:
sudo iptables -t filter -D INPUT -s 100.64.0.0/10 -j DROP
Note: If iptables can be disabled, you can instead run the /etc/init.d/iptables stop command to disable iptables.
- Run the following command to verify that the backend server does not block requests from the private CIDR block of SLB.
sudo iptables -nL
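If a server has many rules, scanning the output of sudo iptables -nL programmatically can help. The following Python sketch parses that text and reports DROP or REJECT rules whose source overlaps 100.64.0.0/10. It is a simplification: it assumes the default column layout (target, prot, opt, source, destination), ignores rule order, ports, and other match conditions, and also flags broad sources such as 0.0.0.0/0. Treat its output as a list of candidates to inspect, not a verdict. The function name find_blocking_rules is hypothetical.

```python
import ipaddress

SLB_INTERNAL_BLOCK = ipaddress.ip_network("100.64.0.0/10")
BLOCKING_TARGETS = {"DROP", "REJECT"}

def find_blocking_rules(iptables_output: str):
    """Return (chain, target, source) for DROP/REJECT rules overlapping 100.64.0.0/10."""
    findings = []
    chain = None
    for line in iptables_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Chain":          # e.g. "Chain INPUT (policy ACCEPT)"
            chain = fields[1]
            continue
        if fields[0] == "target" or len(fields) < 5:
            continue                      # column header or unexpected line
        target, source = fields[0], fields[3]
        if target not in BLOCKING_TARGETS:
            continue
        try:
            network = ipaddress.ip_network(source, strict=False)
        except ValueError:
            continue                      # not an address, skip it
        if network.overlaps(SLB_INTERNAL_BLOCK):
            findings.append((chain, target, source))
    return findings

if __name__ == "__main__":
    sample = """Chain INPUT (policy ACCEPT)
target     prot opt source               destination
DROP       all  --  100.64.0.0/10        0.0.0.0/0
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:80
"""
    print(find_blocking_rules(sample))  # [('INPUT', 'DROP', '100.64.0.0/10')]
```

An empty list after you delete the rule is consistent with the verification step above, although rules in other chains or tables may still need review.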
The parameters of the Linux kernel of the backend ECS instance are not properly configured
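In the /etc/sysctl.conf file, set the following reverse path filtering (rp_filter) parameters to 0, and then run the sysctl -p command to apply the changes:
net.ipv4.conf.default.rp_filter = 0
net.ipv4.conf.all.rp_filter = 0
net.ipv4.conf.eth0.rp_filter = 0
If the network interface of the instance is not named eth0, replace eth0 with the actual interface name. The kernel applies the larger of the all value and the per-interface value, so both must be set to 0.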
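Values written to /etc/sysctl.conf take effect only after sysctl -p is run, so it can help to check both the file and the running kernel. The following Python sketch parses key = value lines, lists the rp_filter entries that are not 0, and reads the live values from /proc/sys/net/ipv4/conf on Linux. The helper names are hypothetical.

```python
from pathlib import Path

def parse_sysctl(text: str) -> dict:
    """Parse 'key = value' lines in /etc/sysctl.conf or `sysctl -a` format."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";")) or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings

def rp_filter_violations(settings: dict) -> dict:
    """Return the net.ipv4.conf.*.rp_filter entries whose value is not 0."""
    return {
        key: value
        for key, value in settings.items()
        if key.startswith("net.ipv4.conf.") and key.endswith(".rp_filter") and value != "0"
    }

def read_live_rp_filter(root: str = "/proc/sys/net/ipv4/conf") -> dict:
    """Read the running kernel's rp_filter values; empty if `root` does not exist."""
    settings = {}
    for path in sorted(Path(root).glob("*/rp_filter")):
        settings[f"net.ipv4.conf.{path.parent.name}.rp_filter"] = path.read_text().strip()
    return settings

if __name__ == "__main__":
    print("sysctl.conf:", rp_filter_violations(parse_sysctl(Path("/etc/sysctl.conf").read_text())))
    print("running kernel:", rp_filter_violations(read_live_rp_filter()))
```

An empty result for the running kernel indicates that reverse path filtering is disabled on every interface listed under /proc.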
The backend ECS instance has performance bottlenecks
Check the metrics of the backend ECS instance, such as CPU utilization and Internet bandwidth. Performance bottlenecks may cause access issues. If the ECS instance has performance bottlenecks, you can upgrade the ECS instance.
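As a rough first check on the instance itself, the following Python sketch computes the 1-minute load average per CPU and the share of a bandwidth limit used over a sampling interval. The bytes_sent value could come from the difference between two readings of the interface counters. The 0.8 threshold and all function names are illustrative assumptions; compare the results with the monitoring metrics of the ECS instance before you decide to upgrade.

```python
import os

def load_per_cpu() -> float:
    """Return the 1-minute load average divided by the number of CPUs (Unix only)."""
    one_minute, _, _ = os.getloadavg()
    return one_minute / (os.cpu_count() or 1)

def bandwidth_utilization(bytes_sent: int, seconds: float, limit_mbps: float) -> float:
    """Return the share of `limit_mbps` used by sending `bytes_sent` in `seconds`."""
    megabits_per_second = bytes_sent * 8 / seconds / 1_000_000
    return megabits_per_second / limit_mbps

def looks_saturated(ratio: float, threshold: float = 0.8) -> bool:
    """Illustrative rule of thumb; 0.8 is an assumption, not an Alibaba Cloud limit."""
    return ratio >= threshold

if __name__ == "__main__":
    print(f"load per CPU: {load_per_cpu():.2f}")
    # 12,500,000 bytes in one second is 100 Mbit/s, all of a 100 Mbit/s limit.
    ratio = bandwidth_utilization(12_500_000, 1, 100)
    print(ratio, looks_saturated(ratio))  # 1.0 True
```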
SLB reports 502 errors due to health check failures
If health checks fail, see Troubleshoot health check exceptions. If the health check feature of SLB is disabled, SLB cannot detect that a backend server is unhealthy and continues to forward requests to it. In this case, 502 errors occur if the web service on the backend server cannot process HTTP requests, for example, because the web service is not running.
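To reproduce what an HTTP health check sees, you can send a similar request to a backend server yourself. The following Python sketch treats 2xx and 3xx responses as healthy. That choice is an assumption, so match it to the status codes configured for the health check of your listener. urllib follows redirects by default, so a 3xx response is usually reported as the status of the final page. Function names are hypothetical.

```python
import urllib.error
import urllib.request

# Assumption: 2xx and 3xx count as healthy; adjust to your listener settings.
HEALTHY_CLASSES = ("2xx", "3xx")

def status_is_healthy(status: int, healthy_classes=HEALTHY_CLASSES) -> bool:
    """Map a status code such as 502 to its class ('5xx') and test membership."""
    return f"{status // 100}xx" in healthy_classes

def probe(url: str, timeout: float = 5.0, healthy_classes=HEALTHY_CLASSES) -> bool:
    """Send one GET request and report whether the backend looks healthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return status_is_healthy(resp.status, healthy_classes)
    except urllib.error.HTTPError as err:
        # 4xx and 5xx responses arrive as HTTPError but still carry a status code.
        return status_is_healthy(err.code, healthy_classes)
    except OSError:
        # Connection refused, timeout, DNS failure: the web service is unreachable.
        return False
```

For example, probe("http://192.0.2.20/") checks a backend server at the placeholder address 192.0.2.20. A False result while the web service should be running points to the case described above.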
Health checks are successful but 502 errors are reported for web applications
The 502 error indicates that SLB can forward requests from clients to the backend server, but the web application on the backend server cannot process the requests. Check the configurations and running status of the web application on the backend server. For example, the web application may take longer to process an HTTP request than the timeout period of SLB allows.
For a Layer 7 (HTTP or HTTPS) listener, the default connection request timeout period is 60 seconds. If the backend ECS instance needs more than 60 seconds to process a PHP request, SLB returns the 504 status code. For a Layer 4 (TCP or UDP) listener, the default connection request timeout period is 900 seconds, and a backend ECS instance rarely needs that long to process a PHP request.
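To find out whether a request is likely to exceed the listener timeout, you can time it against a backend server directly, bypassing SLB. The following Python sketch does this with urllib. The timeout table repeats the default values stated above, and the function names are hypothetical. Set the client-side timeout well above the listener timeout so that the measurement itself is not cut short.

```python
import time
import urllib.error
import urllib.request

# Default listener timeouts (seconds) stated in this topic; your listener may differ.
DEFAULT_LISTENER_TIMEOUT = {"HTTP": 60, "HTTPS": 60, "TCP": 900, "UDP": 900}

def measure_seconds(url: str, timeout: float = 120.0):
    """Time one GET request sent straight to a backend server.

    Returns the elapsed seconds, or None if no complete response arrived
    (connection failure or client-side timeout).
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()  # include the time needed to receive the full body
    except urllib.error.HTTPError:
        pass  # an error status still means the backend answered
    except OSError:
        return None
    return time.monotonic() - start

def exceeds_listener_timeout(elapsed: float, protocol: str) -> bool:
    """True if `elapsed` is longer than the default timeout for `protocol`."""
    return elapsed > DEFAULT_LISTENER_TIMEOUT[protocol.upper()]
```

For example, if measure_seconds("http://192.0.2.20/report.php") returns 75.0 for a placeholder backend address and path, exceeds_listener_timeout(75.0, "HTTP") returns True, which matches the 504 case described above.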
Make sure that the web service and related services run as expected. Check whether PHP requests are properly processed, and optimize the processing of PHP requests by the backend server. NGINX and PHP-FPM are used in the following example.
- The number of PHP requests being processed has reached the upper limit, that is, the value of the pm.max_children parameter in PHP-FPM. When a new PHP request reaches the server, the result depends on how the existing requests are processed:
- If existing and new PHP requests are both processed in a timely manner, no error occurs.
- If existing PHP requests are processed slowly and new PHP requests wait until the value of fastcgi_read_timeout on NGINX is exceeded, the 504 status code is returned.
- If existing PHP requests are processed slowly and their processing time exceeds the value of request_terminate_timeout in PHP-FPM, the 502 status code is returned.
- If the execution time of a PHP script exceeds the upper limit, or the time that PHP-FPM takes to process a PHP script exceeds the value of request_terminate_timeout in PHP-FPM, a 502 error occurs and an error entry similar to the following is written to the NGINX error log:
[error] 1760#0: *251777 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: XXX.XXX.XXX.XXX, server: localhost, request: "GET /timeoutmore.php HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000"
- If health checks are performed on static pages, they can succeed even when dynamic requests fail. For example, if PHP-FPM is not running, dynamic requests cannot be processed and an error is reported even though the backend server passes health checks.
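The timeout cases in the list above can be summarized as "whichever timer expires first decides the status code". The following Python sketch encodes that rule and also extracts NGINX error log entries like the one shown above. It is a simplified model, not the exact behavior of NGINX or PHP-FPM: fastcgi_read_timeout actually limits the time between two successive reads from the FastCGI server rather than the total processing time, and the model ignores time spent waiting for a free PHP-FPM worker. Function names are hypothetical.

```python
import re

def predicted_status(processing_seconds, slb_timeout=60,
                     fastcgi_read_timeout=60, request_terminate_timeout=0):
    """Return the status code a slow PHP request is expected to produce, or None.

    Simplified model: every timer starts when the request arrives, and the
    shortest timer that the processing time exceeds decides the outcome.
    request_terminate_timeout = 0 means "disabled", as in PHP-FPM.
    """
    timers = [
        (fastcgi_read_timeout, 504),  # NGINX stops waiting for PHP-FPM
        (slb_timeout, 504),           # the SLB Layer 7 listener gives up
    ]
    if request_terminate_timeout > 0:
        timers.append((request_terminate_timeout, 502))  # PHP-FPM kills the worker
    expired = [timer for timer in timers if processing_seconds > timer[0]]
    if not expired:
        return None  # no timer fires; any error has another cause
    return min(expired)[1]

# Matches NGINX error entries such as the one shown above.
UPSTREAM_RESET = re.compile(
    r"recv\(\) failed \(104: Connection reset by peer\).*?"
    r'request: "(?P<request>[^"]*)".*?upstream: "(?P<upstream>[^"]*)"'
)

def upstream_resets(log_lines):
    """Return (request, upstream) pairs for 'Connection reset by peer' entries."""
    found = []
    for line in log_lines:
        match = UPSTREAM_RESET.search(line)
        if match:
            found.append((match.group("request"), match.group("upstream")))
    return found

if __name__ == "__main__":
    print(predicted_status(45, request_terminate_timeout=30))  # 502
    print(predicted_status(90))                                # 504
```

Applied to the sample log entry above, upstream_resets returns the request "GET /timeoutmore.php HTTP/1.1" and the upstream "fastcgi://127.0.0.1:9000".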
The service access logic is invalid
If the SLB instance uses a Layer 4 listener, make sure that its backend ECS instances do not access the service through the IP address of the SLB instance. If a backend ECS instance accesses its own port through the public IP address of the SLB instance to which it is added as a backend server, SLB may forward the request back to the same ECS instance based on the scheduling algorithm. This creates a loop, which results in 500 or 502 errors.
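One way to check the access logic is to search the configuration of the backend application for the IP address of the SLB instance. The following Python sketch scans text for IPv4 addresses that match a given list of SLB addresses. It does not resolve domain names, so a domain name that points to the SLB instance must be checked separately, for example with socket.gethostbyname. The function name and the address 203.0.113.10 are placeholders.

```python
import ipaddress
import re

IPV4_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def references_to_slb(config_text: str, slb_ips) -> list:
    """Return (line_number, ip) pairs where the text mentions an SLB address."""
    targets = {ipaddress.ip_address(ip) for ip in slb_ips}
    hits = []
    for number, line in enumerate(config_text.splitlines(), start=1):
        for candidate in IPV4_PATTERN.findall(line):
            try:
                if ipaddress.ip_address(candidate) in targets:
                    hits.append((number, candidate))
            except ValueError:
                continue  # e.g. 999.1.1.1 looks like an address but is not one
    return hits

if __name__ == "__main__":
    config = "api_url = http://203.0.113.10:80/api\ncache = 127.0.0.1:6379\n"
    print(references_to_slb(config, ["203.0.113.10"]))  # [(1, '203.0.113.10')]
```

Each hit marks a place where a backend server may send requests to the SLB instance it sits behind.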
Troubleshooting procedure
- Check whether access issues occur on all clients. If not, check whether the client that reports an error is blocked by Alibaba Cloud Security. In addition, check whether the domain name or IP address of SLB is blocked by the ISP.
- Check the page on which the 500, 502, or 504 error is reported to determine whether the error is reported due to issues on SLB, Anti-DDoS, or backend ECS instances.
- Anti-DDoS issues: If Anti-DDoS is configured, make sure that Layer 7 forwarding rules are properly configured.
- SLB issues:
- Check whether the SLB instance fails health checks. If so, see Health check failures.
- Use a Layer 4 listener instead of a Layer 7 listener to check whether issues persist.
- ECS instance configuration issues: If 5xx status codes occur intermittently, a backend ECS instance likely has configuration issues. Edit the hosts file on the client to map the domain name to the IP address of a backend ECS instance, and then check whether that backend ECS instance has issues.
- If the error is caused by the backend ECS instance, check the web application logs of the backend ECS instance. Check whether the web service is running as expected and whether the web access logic is valid. Uninstall the antivirus software on the ECS instance and restart the instance before you troubleshoot the error.
- Check whether the CPU, memory, disk, or bandwidth of the backend ECS instance has performance bottlenecks.
- If the ECS instance runs Linux, check whether the TCP kernel parameters of the backend ECS instance are properly configured.
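Instead of editing the hosts file, you can send a request to each backend server directly with the domain name in the Host header, which has a similar effect for plain HTTP. Comparing the status codes returned by each backend server can show which instance causes intermittent 5xx errors. The following Python sketch assumes HTTP on port 80; HTTPS would also require the correct SNI and certificate handling, which the sketch does not cover. Function names are hypothetical.

```python
import http.client

def fetch_status(backend_ip, domain, path="/", port=80, timeout=5.0):
    """GET `path` from one backend server, sending `domain` as the Host header.

    Returns the HTTP status code, or None if no valid response was received.
    """
    conn = http.client.HTTPConnection(backend_ip, port, timeout=timeout)
    try:
        # An explicit Host header replaces the one http.client would add.
        conn.request("GET", path, headers={"Host": domain})
        return conn.getresponse().status
    except (OSError, http.client.HTTPException):
        return None
    finally:
        conn.close()

def compare_backends(backend_ips, domain, path="/", port=80):
    """Map each backend IP address to the status code it returns."""
    return {ip: fetch_status(ip, domain, path, port) for ip in backend_ips}
```

For example, compare_backends(["192.0.2.11", "192.0.2.12"], "www.example.com") uses placeholder addresses and a placeholder domain. If one backend returns 502 while the others return 200, investigate that instance first.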
Applicable scope
SLB