It is often difficult to troubleshoot problems involving sending and receiving packets. This series of articles focuses on network troubleshooting. It summarizes troubleshooting procedures and analysis results, gradually building up a complete, in-depth analysis of the NDIS (Network Driver Interface Specification) framework and QEMU Virtio netkvm.
When files are uploaded over FTP to a Windows Server machine, network exceptions occur during transmission and the file transfers fail. Although the server network recovers on its own after a while, the problem often recurs during subsequent upload operations.
Although the problem seems very clear, we need to understand more details:
1. The condition that triggers the network exceptions: On another client, use FTP to upload files.
2. Network exception diagnosis:
3. The previous network exception diagnosis suggests that this problem is related to ARP. However, after adding a static ARP entry (arp.exe -s gw.ipv4.address ee:ff:ff:ff:ff:ff), we still cannot ping the gateway, which means that ARP is not the only cause.
To locate the problem more accurately, we capture and analyze packets on both the Windows machine and its host. Unfortunately, the captured packets were only printed to the screen, so no specific packet content was retained. In similar scenarios in the future, screenshots of the packet capture will be kept.
Here is the result of the packet capture analysis. After the static ARP entry is added, the capture on the VIF interface shows that the ICMP Echo Request generated by ping is sent out through the VIF interface and that the ICMP Echo Reply is received on it.
Based on the preceding tests and diagnosis, we conclude that the problem lies in the underlying driver. We can capture the Windows dump file directly from the NC and analyze it.
Windows has had a built-in packet capture capability since Windows Server 2008 R2. It is implemented in ndis.sys and works with the Windows ETW mechanism, which makes troubleshooting more convenient. To enable it, run the following command:
netsh trace start capture=yes
After the problem recurs, run the following command:
netsh trace stop
Captured log files are written into the temp directory of the current user. After we run the stop command, Windows will print the path to the log file in cmd.exe.
The file that this command captures can be opened with Microsoft Network Monitor 3.4 or Microsoft Message Analyzer. Currently Wireshark cannot recognize this file.
ETW (Event Tracing for Windows) is a good tool to troubleshoot OS component behaviors. Windows provides some ready-to-use scenarios and providers, which we can view by using netsh trace show scenarios and netsh trace show providers. Perhaps I will write a separate article to describe these available scenarios and providers later. Regarding networking, I list some providers related to NDIS, TCP/IP, Afd, and Winsock here. These providers are applicable for deep analysis of system network behaviors in general scenarios. The command is as follows.
netsh trace start provider={2F07E2EE-15DB-40F1-90EF-9D7BA282188A} keywords=0xffffffffffffffff level=0xff provider={E53C6823-7BB8-44BB-90DC-3F86090D48A6} keywords=0xffffffffffffffff level=0xff provider={7D44233D-3055-4B9C-BA64-0D47CA40A232} keywords=0xffffffffffffffff level=0xff provider={50B3E73C-9370-461D-BB9F-26F32D68887D} keywords=0xffffffffffffffff level=0xff provider={43D1A55C-76D6-4F7E-995C-64C711E5CAFE} keywords=0xffffffffffffffff level=0xff maxSize=500MB fileMode=circular persistent=no overwrite=yes report=yes correlation=yes traceFile=c:\NetworkTrace.etl capture=yes packettruncatebytes=128 IPv4.Address=<ipv4.address.for.filtering>
To learn more about the meaning of the command, see the netsh trace capture help.
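Because the command above is long and easy to mistype, it can help to assemble it from a list of providers. The following Python snippet is only a sketch: the helper name build_trace_command and the filter address 192.0.2.10 are hypothetical, while the provider GUIDs and options are copied from the command above. It only builds the command string and does not run netsh.

```python
# Provider GUIDs copied verbatim from the netsh command in the article.
PROVIDERS = [
    "{2F07E2EE-15DB-40F1-90EF-9D7BA282188A}",
    "{E53C6823-7BB8-44BB-90DC-3F86090D48A6}",
    "{7D44233D-3055-4B9C-BA64-0D47CA40A232}",
    "{50B3E73C-9370-461D-BB9F-26F32D68887D}",
    "{43D1A55C-76D6-4F7E-995C-64C711E5CAFE}",
]


def build_trace_command(providers, trace_file, ipv4_filter):
    """Return the 'netsh trace start' command line as a single string.

    Hypothetical helper: it only formats text and never invokes netsh.
    """
    parts = ["netsh", "trace", "start"]
    for guid in providers:
        # Each provider is followed by its own keywords/level pair.
        parts += [f"provider={guid}", "keywords=0xffffffffffffffff", "level=0xff"]
    parts += [
        "maxSize=500MB", "fileMode=circular", "persistent=no",
        "overwrite=yes", "report=yes", "correlation=yes",
        f"traceFile={trace_file}", "capture=yes",
        "packettruncatebytes=128", f"IPv4.Address={ipv4_filter}",
    ]
    return " ".join(parts)


# 192.0.2.10 is a placeholder address used only for illustration.
command = build_trace_command(PROVIDERS, r"c:\NetworkTrace.etl", "192.0.2.10")
print(command)
```

The printed string can be pasted into an elevated cmd.exe window. Keeping the providers in a list makes it easier to add or remove one without breaking the keywords and level pairs.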
First, check the status of the NIC miniport. It shows no exceptions. If a NIC is abnormal, we would typically see signs such as a pending OID or a pending reset. In that case, we could try upgrading the NIC driver.
Next, check the send path, that is, the information about send requests. In older versions of Windows, we can look at the reference count on mopen. Each time a request is sent, the tcpip.sys driver increases the reference count of the mopen between TCP/IP and the miniport. After a message is sent, the NIC driver invokes the callback function in the tcpip.sys driver to release the reference.
In versions later than Windows Server 2008 R2, the reference count on mopen is no longer used for this purpose. Instead, the count of send requests is recorded in the Provider_Rundown_Protection of tcpip.sys so that send requests can be processed on different CPUs. The rundown keeps a separate count for each CPU, and the per-CPU counts are added together to determine whether all pending NBLs have been sent.
Find the status of send and receive requests by using the counts of other statuses. We find the statistics of Virtio netkvm and confirm that both sending and receiving are normal.
After further checking the data structure, we find that the NetReceiveBuffers list is empty and that NetNofReceiveBuffers is 0. This may happen because the NIC driver has no buffers available, which interrupts packet reception. The netkvm driver can handle this situation in different ways. For example, we can disable the NIC to prevent packet reception when the buffer is full.
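The per-CPU counting described above can be illustrated with a toy model. This is only a sketch of the idea, not the actual tcpip.sys data structure: the class and method names are hypothetical. A send may be counted on one CPU and completed on another, so a single per-CPU value can be negative and only the sum is meaningful.

```python
class PerCpuPendingCounter:
    """Toy model of a per-CPU pending-send counter (hypothetical names)."""

    def __init__(self, cpu_count):
        # One independent counter per CPU avoids contention on a shared value.
        self.counts = [0] * cpu_count

    def on_send(self, cpu):
        # A send request is issued on this CPU.
        self.counts[cpu] += 1

    def on_send_complete(self, cpu):
        # The completion may run on a different CPU than the send did.
        self.counts[cpu] -= 1

    def pending(self):
        # Only the sum across all CPUs says how many sends are outstanding.
        return sum(self.counts)


counter = PerCpuPendingCounter(cpu_count=4)
counter.on_send(cpu=0)
counter.on_send(cpu=0)
counter.on_send_complete(cpu=2)   # completed on another CPU
print(counter.counts)             # [2, 0, -1, 0]
print(counter.pending())          # 1
```

When reading such counters in a dump, a nonzero sum suggests that some NBLs are still pending, while individual negative entries are expected.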
Next, check the ParaNdis_ProcessRxPath function and the virtqueue_get_buf function to confirm that the ring buffer is full.
The Virtio netkvm code shows that the buffers in the NetReceiveBuffersWaiting LIST_ENTRY data structure are held by the Windows NDIS framework driver. When NDIS is done with a packet, it invokes the ReturnPacketHandler that the netkvm driver has registered (that is, ParaNdis5_ReturnPacket), which releases the buffer and returns it to NetReceiveBuffers.
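The movement of buffers between the two lists can be modeled in a few lines. The following is a simplified sketch, assuming one free list and one waiting set. The attribute comments borrow the netkvm list names, but the class and its methods are hypothetical and do not reproduce the driver code.

```python
from collections import deque


class RxBufferPool:
    """Toy model of a receive buffer pool with a free list and a waiting set."""

    def __init__(self, size):
        self.free = deque(range(size))   # models NetReceiveBuffers
        self.waiting = set()             # models NetReceiveBuffersWaiting

    def indicate_packet(self):
        """Hand a buffer to the upper layer, or return None if none is free."""
        if not self.free:
            return None                  # no buffer available: reception stalls
        buf = self.free.popleft()
        self.waiting.add(buf)
        return buf

    def return_packet(self, buf):
        """Models the return callback: the buffer becomes free again."""
        self.waiting.remove(buf)
        self.free.append(buf)


pool = RxBufferPool(size=3)
held = [pool.indicate_packet() for _ in range(3)]   # upper layer keeps all buffers
dropped = pool.indicate_packet()
print(len(pool.free), dropped)                      # 0 None
pool.return_packet(held[0])                         # one buffer is finally returned
print(pool.indicate_packet())                       # 0
```

In this model, reception stops as soon as every buffer sits in the waiting set, which matches the empty NetReceiveBuffers list seen in the dump. It resumes only after a buffer is returned.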
At this point, the central issue is why NDIS does not invoke the callback function. Windows mainly depends on the reference count of the NET_BUFFER_LIST data structure to recycle the buffers related to the NIC. Each time a driver uses the buffer, its reference count is increased by 1. When that driver has finished with the buffer, it invokes Dereference to release its reference. The callback function is invoked only when the reference count of the buffer drops to 0.
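The effect of a single unreleased reference can be shown with a minimal reference-counting sketch. It is an illustration under assumptions, not the NDIS implementation: the class RefCountedBuffer, its methods, and the buffer name are hypothetical.

```python
class RefCountedBuffer:
    """Toy reference-counted buffer that fires a callback when the count hits 0."""

    def __init__(self, name, on_zero):
        self.name = name
        self.refs = 0
        self._on_zero = on_zero

    def reference(self):
        # A driver starts using the buffer.
        self.refs += 1

    def dereference(self):
        # A driver has finished with the buffer.
        if self.refs <= 0:
            raise RuntimeError("dereference without a matching reference")
        self.refs -= 1
        if self.refs == 0:
            # Only now is the buffer handed back to its owner.
            self._on_zero(self)


returned = []
buf = RefCountedBuffer("nbl-0", on_zero=lambda b: returned.append(b.name))
buf.reference()      # first consumer
buf.reference()      # second consumer
buf.dereference()    # the first consumer releases its reference
print(returned)      # [] because one reference is still outstanding
buf.dereference()    # the second consumer releases its reference
print(returned)      # ['nbl-0']
```

If the second consumer never calls dereference, the callback never runs and the buffer never goes back to the NIC driver, which is the failure pattern under investigation here.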
We can troubleshoot this problem by enumerating all unreleased buffers and printing the network packet structure. For example:
!list "-t _LIST_ENTRY.Flink -e -x \"dt netkvm!IONetDescriptor @$extret; dt ndis!_NDIS_PACKET poi(@$extret+0x40) Private.; dt _MDL poi(poi(@$extret+0x40)+8); db poi(poi(poi(@$extret+0x40)+8)+0x18) L0x50\" 0xfffffadf`37fa26d0"
Note: 0x0bda = 3034 is the data port used in the FTP Pasv mode.
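The port arithmetic in the note can be checked quickly. The sketch below decodes the two port bytes as they appear in a TCP header (network byte order) and also derives the port from an FTP 227 reply, where the port is p1 * 256 + p2. The function names and the sample reply with address 192.0.2.10 are made up for illustration.

```python
import re


def port_from_bytes(raw):
    """Decode a 2-byte TCP port in network (big-endian) byte order."""
    return int.from_bytes(raw, "big")


def pasv_port(reply):
    """Extract the data port from an FTP '227 Entering Passive Mode' reply."""
    match = re.search(r"\((\d+),(\d+),(\d+),(\d+),(\d+),(\d+)\)", reply)
    if match is None:
        raise ValueError("not a PASV reply: " + reply)
    p1, p2 = int(match.group(5)), int(match.group(6))
    return p1 * 256 + p2


print(port_from_bytes(bytes([0x0B, 0xDA])))                         # 3034
# Hypothetical reply; the address is a documentation placeholder.
print(pasv_port("227 Entering Passive Mode (192,0,2,10,11,218)"))   # 3034
```

Both paths give 3034, so the bytes 0b da in the dumped packet correspond to the passive-mode data port.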
In this case, the Serv-U FTP server does not process the received packets, which causes the buffer to fill up. This problem can be solved by using the built-in IIS FTP service in Windows.
Based on the previous analysis, we can generally consider the following possible causes and remedies:
1. The NDIS driver on Windows itself is not properly releasing the buffer. We recommend that you install the latest ndis.sys patch.
2. Other third-party drivers hold incorrect references to the buffer, so the reference count can never reach 0. We recommend that you uninstall third-party drivers and keep the operating system clean.
3. The application does not process the received messages.
It is generally recommended to upgrade the tcpip.sys, afd.sys, and winsock components, and replace the current application with other software.