Check and fix missing IP addresses on CentOS 7 and Windows instances

Description

After an ECS instance has run continuously for an extended period without being restarted, the instance loses network connectivity: the network becomes unavailable, and neither the public IP address nor the private IP address can be pinged.

Cause

When you start an ECS instance for the first time, the system uses the Dynamic Host Configuration Protocol (DHCP) to automatically assign IP addresses to the elastic network interface and obtains the expiration time of the IP address lease. Normally, the dhclient process on Linux and the DHCP Client service on Windows periodically renew the lease with the DHCP server to keep the instance IP address available. On some instances created from CentOS 7 images (see Applicable scope), the dhclient process is occasionally terminated, and the DHCP Client service of the Windows Server operating system has known issues. As a result, the instance cannot automatically renew the IP address lease. When the initially obtained lease expires, the private IP address of the instance is released and the instance loses network connectivity.
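To illustrate the timing, the following Python sketch computes when a DHCP client would normally try to renew a lease and when the address is lost if no renewal happens. The T1 and T2 fractions (0.5 and 0.875 of the lease duration) are the RFC 2131 defaults. The function name and the one-year sample lease are hypothetical and are not taken from the affected images.

```python
from datetime import datetime, timedelta


def lease_timeline(obtained, lease_seconds):
    """Return the renew (T1), rebind (T2), and expiry times of a DHCP lease.

    T1 = 0.5 * lease and T2 = 0.875 * lease are the RFC 2131 defaults;
    a DHCP server may override them.
    """
    lease = timedelta(seconds=lease_seconds)
    return {
        "renew_at": obtained + lease * 0.5,
        "rebind_at": obtained + lease * 0.875,
        "expires_at": obtained + lease,
    }


# Hypothetical example: a lease obtained at first boot with a one-year duration.
timeline = lease_timeline(datetime(2018, 1, 1), 365 * 24 * 3600)
for name, moment in timeline.items():
    print(name, moment.isoformat())
```

If no DHCP client is running at `renew_at`, nothing extends the lease, and the address is released at `expires_at`.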

Applicable scope

This topic applies to ECS instances that meet the following conditions and use DHCP to automatically assign IP addresses to their elastic network interfaces. If the ECS instance is configured with a static IP address, no action is required.

Instances of any type created from the following CentOS 7 public images (ECS instances created before May 31, 2018 and not restarted after November 15, 2018).
- centos_7_04_64_20G_alibase_20180419.vhd
- centos_7_04_64_20G_alibase_20180326.vhd
- centos_7_04_64_20G_alibase_201701015.vhd
- centos_7_03_64_20G_alibase_20170818.vhd
- centos_7_02_64_20G_alibase_20170818.vhd
- centos_7_03_64_40G_alibase_20170710.vhd
- centos_7_03_64_40G_alibase_20170625.vhd
- centos_7_03_64_40G_alibase_20170523.vhd
- centos_7_03_64_40G_alibase_20170503.vhd
Instances that run the following Windows Server operating systems (ECS instances that were created before November 15, 2018 and have not been restarted later).
- Windows Server 2008 R2
- Windows Server 2012 R2
- Windows Server 2016
- Windows Server Version 1709

Solutions

Take note of the following items:

Before you perform high-risk operations such as modifying the specifications or data of an Alibaba Cloud instance, we recommend that you check the disaster recovery and fault tolerance capabilities of the instance to ensure data security.

Before you modify the specifications or data of an Alibaba Cloud instance, such as an Elastic Compute Service (ECS) instance or an ApsaraDB RDS instance, we recommend that you create snapshots or enable backups for the instance. For example, you can enable log backups for an ApsaraDB RDS instance.

If you have granted specific users the permissions on sensitive information, such as usernames and passwords, or submitted sensitive information in the Alibaba Cloud Management Console, we recommend that you modify the sensitive information at the earliest opportunity.

You can choose one of the following four methods.

Method 1: Cloud Assistant batch repair. Suitable when multiple instances can be handled in the ECS console. This is the easiest method.
Method 2: A Python SDK script written based on the Cloud Assistant API. It checks the status of your instances in batches, one region at a time, and repairs them automatically. Suitable for users who are familiar with scripted O&M.
Method 3: Shell and PowerShell scripts. You must log on to each ECS instance and run the script manually. Suitable for checking or testing a small number of instances. The script content is the same as in Method 1.
Method 4: Check the network interface controllers one by one.

Method 1: Cloud Assistant batch repair

This example uses the Cloud Assistant to perform a check and automatic repair on the ECS instance. Make sure that the Cloud Assistant client is installed on your instance. ECS instances created after December 01, 2017 are by default pre-installed with the Cloud Assistant client. For more information, see Cloud Assistant client.

Create a Cloud Assistant command. For detailed procedure, see Create a Cloud Assistant command.
Download the Shell or PowerShell script that matches your operating system and paste its content into the command content of the Cloud Assistant command.
- CentOS instances: linux_fix_dhclient.sh
- Windows instances: win_fix_dhclient.ps1
Select the corresponding ECS instance and run the created Cloud Assistant command. For more information, see Run commands.
Confirm that the execution is successful. For more information, see Query execution results and status. The following figure shows the command results returned for CentOS and Windows instances.

Method 2: Batch repair with a Python SDK script

In this example, a Python script is written based on the Cloud Assistant API, which can check and automatically repair all affected instances in an Alibaba Cloud region. For more information about how to install the ECS SDK, see Alibaba Cloud GitHub repository installation documentation.

Before you begin

Run the following commands to install the required Python SDK dependencies on your local computer or ECS instance.

pip install aliyun-python-sdk-core
pip install aliyun-python-sdk-ecs

Procedure

Download the autofix_dhclient.py file to the ECS instance.
Run the following command to view the instructions of the script:
Note: This step is optional.
```
python autofix_dhclient.py
```
The following command output is returned.
```
Usage: autofix_dhclient.py <AccessKeyID> <AccessKeySecret> <region-id>
```
The parameters are described as follows:
- AccessKeyID: your AccessKey ID. For more information, see Create an AccessKey.
- AccessKeySecret: your AccessKey secret.
- region-id: the ID of the region where the instance resides. For more information, see Regions and zones.
Run the script as root or as an administrator and pass the AccessKeyID, AccessKeySecret, and region-id parameters, as in the following example.
```
python autofix_dhclient.py LTAIn*******Py6J kXXIOEoPXXvsYRUd**********TRyU cn-hangzhou
```

Result

The following figure shows an example of the script output.

The status check items for an instance are described as follows.

Cloud Assistant: checks whether the Cloud Assistant client is installed on the instance.
- Installed: the Cloud Assistant client is installed on the instance.
- Not Installed: the Cloud Assistant client is missing. Install the Cloud Assistant client, and then continue the repair.
NeedFix: checks whether the dhclient process or the DHCP Client service of the instance needs to be repaired.
- Yes: a repair is required. The script automatically completes the subsequent work.
- No: no repair is required.
- Unknown: the script cannot determine the status. You must run the repair script manually.
FixResult: reports the result of the repair.
- Success: the dhclient process or the DHCP Client service was repaired.
- Failed: the repair failed.
- NoChange: no repair was required.
- Unknown: the script cannot determine the result. You must run the repair script manually.
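When many instances are checked at once, the three status columns can be mapped to a follow-up action. The following Python sketch is one possible way to do this. The function name and the wording of the returned actions are hypothetical, and the suggestion for the Failed case is an assumption, because this topic states only that the fix failed.

```python
def next_action(cloud_assistant, need_fix, fix_result):
    """Map the Cloud Assistant, NeedFix, and FixResult values to a follow-up action."""
    if cloud_assistant == "Not Installed":
        return "install the Cloud Assistant client, then run the script again"
    if need_fix == "No" or fix_result == "NoChange":
        return "no action required"
    if need_fix == "Unknown" or fix_result == "Unknown":
        return "run the repair script manually on the instance"
    if fix_result == "Success":
        return "repaired"
    # Assumption: a failed repair is handled by hand, for example with Method 3 or Method 4.
    return "repair failed, check the instance manually"


print(next_action("Installed", "Yes", "Success"))
```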

Method 3: Shell/PowerShell script repair

This method requires you to log on to the affected instances and troubleshoot the problem one by one. Therefore, it is suitable for scenarios with a small number of instances.

CentOS Instance Procedure

Log on to the ECS instance. For more information, see Overview.
Download the linux_fix_dhclient.sh script to any directory.
Switch to the working directory where the script is located and run the script as root.
```
sudo bash linux_fix_dhclient.sh
```
Description:
- If the return result is "0", the script has completed the check and repair.
- Any other return result indicates that the repair failed.
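If you prefer to drive the check from Python, the following sketch runs a command and reports whether it succeeded. It assumes that the "return result" above refers to the exit status of the script, which this topic does not state explicitly. The function name is hypothetical.

```python
import subprocess


def run_repair(command):
    """Run a repair command and return (succeeded, exit_status).

    Assumption: the "return result" of the script is its exit status,
    so 0 means that the check and repair completed.
    """
    completed = subprocess.run(command)
    return completed.returncode == 0, completed.returncode


# Hypothetical usage on an affected instance, run as root from the script directory:
# succeeded, status = run_repair(["bash", "linux_fix_dhclient.sh"])
```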

Procedure for Windows instances

Log on to the ECS instance. For more information, see Overview.
Download the win_fix_dhclient.ps1 repair script to any directory.
Open PowerShell with administrator privileges and run the following command.
```
powershell -executionpolicy bypass -file C:\win_fix_dhclient.ps1
```
Description:
- Replace C:\win_fix_dhclient.ps1 with the actual file path.
- If "No ip will expire in recent 500 days. Then no need fix." is returned, the DHCP Client service of the instance is working properly and no repair is required.
- If "Found one ip will expire in 500 days. We need fixing it!!! Fix it now... Fix success." is returned, the DHCP Client service of the instance was abnormal and the script has completed the repair.
- Any other return result indicates that the repair failed.
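The three outcomes above can be told apart by matching the returned text. The following Python sketch shows one way to classify captured script output. The message fragments are copied from this topic, while the function name and the returned labels are hypothetical.

```python
def classify_output(output):
    """Classify the text returned by win_fix_dhclient.ps1 into one of three outcomes."""
    text = output.strip()
    if "No ip will expire in recent 500 days" in text:
        return "no_fix_needed"
    if "Found one ip will expire in 500 days" in text and "Fix success" in text:
        return "fixed"
    # Anything else is treated as a failed repair, as described above.
    return "failed"


print(classify_output("No ip will expire in recent 500 days. Then no need fix."))
```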

Method 4: Check network interface controllers one by one

This method requires you to check and fix the dhclient process (CentOS instances) or the IP address lease expiration time (Windows instances) for each network interface controller.

CentOS Instance Procedure

Log on to the ECS instance. For more information, see Overview.
Run the following command to list all network interface controllers of the instance.
```
ls -al /sys/class/net/
```
Run the following command to check whether the eth0 network interface controller uses DHCP to assign an IP address:
```
cat /etc/sysconfig/network-scripts/ifcfg-eth0
```
If the output contains BOOTPROTO=dhcp, the network interface controller uses DHCP to assign an IP address. If it does not use DHCP, go to step 7.
Run the following command to check the running status of the dhclient process corresponding to the eth0 network interface controller:
```
ps aux | grep dhclient | grep eth0
```
- An empty result indicates that the dhclient process is abnormal.
- The following result indicates that the dhclient process is running properly. Go to Step 7.
```
root 15340 0.0 0.3 113372 12788 ? Ss 14:16 0:00 /sbin/dhclient -1 -q -lf /var/lib/dhclient/dhclient--eth0.lease -pf /var/run/dhclient-eth0.pid -H izuf695ygwh32u2i******z eth0
```
Run the following command to restart the dhclient process:
```
ifup eth0
```
Note: This example uses the eth0 network interface controller. Replace eth0 in the command with the actual name of your network interface controller.
Check the running status of the dhclient process corresponding to the network interface controller again.
Repeat steps 3 to 6 to check and fix the running status of the dhclient process for every network interface controller.
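The two checks in steps 3 and 4 can also be applied to captured command output. The following Python sketch parses the content of an ifcfg file and the output of `ps aux`. Both function names are hypothetical, and the sketch assumes that the interface name appears as a separate argument of the dhclient command, as in the sample output above.

```python
def uses_dhcp(ifcfg_text):
    """Return True if the content of an ifcfg-<nic> file sets BOOTPROTO=dhcp."""
    for line in ifcfg_text.splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == "BOOTPROTO":
            return value.strip().strip("\"'").lower() == "dhcp"
    return False


def dhclient_running(ps_output, nic):
    """Return True if `ps aux` output lists a dhclient process for the given interface."""
    for line in ps_output.splitlines():
        # In `ps aux` output, the command line starts at the 11th column.
        command = line.split()[10:]
        if command and command[0].endswith("dhclient") and nic in command[1:]:
            return True
    return False


print(uses_dhcp("DEVICE=eth0\nBOOTPROTO=dhcp\nONBOOT=yes"))
```

An interface for which `uses_dhcp` is true and `dhclient_running` is false is the case that step 5 repairs with ifup.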

Procedure for Windows instances

Log on to the ECS instance. For more information, see Overview.
Open Command Prompt (CMD) as an administrator.
Run the following command and, for the network interface controller whose description is Red Hat VirtIO Ethernet Adaptor, check whether DHCP Enabled is Yes and when the lease expires.
```
ipconfig /all
```
Note: The Red Hat VirtIO Ethernet Adaptor is the primary network interface controller or a secondary elastic network interface of the ECS instance. VPN or loopback network interface controllers that you have configured are not affected. Network interface controllers that do not have DHCP enabled are also not affected.
If the lease expires within one year, run the following command to update the lease expiration time:
```
ipconfig /renew
```
Run the ipconfig /all command again and confirm that the returned lease expiration time falls within ten years, which indicates that the repair is complete.
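The one-year check can be expressed as a small calculation. The following Python sketch parses a "Lease Expires" line and decides whether the lease should be renewed. It assumes the English-locale date format of ipconfig output and treats one year as 365 days. Both function names and the sample line are hypothetical.

```python
from datetime import datetime, timedelta


def parse_lease_expiry(line):
    """Parse a 'Lease Expires' line from `ipconfig /all` output.

    Assumption: English-locale format, for example
    'Lease Expires . . . : Tuesday, January 2, 2029 3:04:05 PM'.
    """
    _, _, value = line.partition(":")
    return datetime.strptime(value.strip(), "%A, %B %d, %Y %I:%M:%S %p")


def needs_renewal(expires, now):
    """Return True if the lease expires within one year (365 days) of `now`."""
    return expires - now < timedelta(days=365)


sample_line = "   Lease Expires . . . . . . . . . . : Tuesday, January 2, 2029 3:04:05 PM"
expires = parse_lease_expiry(sample_line)
print(needs_renewal(expires, datetime.now()))
```

On systems that use another display language or date format, the format string passed to `strptime` must be adjusted.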