This is the second article in the Windows Networking troubleshooting series. The previous article described the Windows NDIS architecture and how to troubleshoot network problems under this architecture. In this article, we use a real case to explain how Windows TCP/IP is implemented on top of NDIS.
A service was deployed on Windows Server 2008 R2 SP1 by using Apache. When we tried to stop this Apache service, it remained stuck in the pending or stopping status. Nothing we tried cleared this state; only restarting the machine did.
We were not aware of any known issue that prevents the Apache service from being stopped, so the problem appeared to be caused by the application itself. Nevertheless, we offered the following experience-based suggestions just in case:
1. Uninstall unnecessary third-party software, especially security software that installs a filter driver or a WFP callout driver.
2. Disable the advanced features of the network interface controller (NIC), especially TCP Chimney and RSS. Reference: https://blogs.technet.microsoft.com/onthewire/2014/01/21/tcp-offloadingchimney-rsswhat-is-it-and-should-i-disable-it/
3. Check the Windows patch version and install the latest patch.
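Step 2 can be scripted. The following is a minimal sketch, assuming the `netsh int tcp set global` syntax that ships with Windows Server 2008 R2; the function names are hypothetical and are not part of any tool mentioned in this article. By default it only prints the commands; it runs them only when `apply=True` is passed on a Windows host with administrator rights.

```python
import platform
import subprocess

# Global TCP offload settings to turn off (step 2 in the list above).
OFFLOAD_SETTINGS = ("chimney", "rss")


def build_disable_commands(settings=OFFLOAD_SETTINGS):
    """Return one netsh command (as an argv list) per setting to disable."""
    return [
        ["netsh", "int", "tcp", "set", "global", f"{name}=disabled"]
        for name in settings
    ]


def disable_offload(apply=False):
    """Print each command; execute it only when apply=True on Windows."""
    commands = build_disable_commands()
    for argv in commands:
        print(" ".join(argv))
        if apply and platform.system() == "Windows":
            # check=True raises CalledProcessError if netsh reports a failure.
            subprocess.run(argv, check=True)
    return commands


if __name__ == "__main__":
    disable_offload(apply=False)  # dry run: only prints the commands
```

Running `netsh int tcp show global` before and after shows whether the settings actually changed.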
The problem persisted after we tried these steps, and we confirmed that the latest patch was already installed. With all of our general methods exhausted, we had to reproduce the problem and capture a memory dump for further analysis.
From the dump file, we can clearly see that the httpd.exe process of the Apache service does not exit because the Afd.sys driver is still waiting for a completion signal.
Because the AFD resource cannot be released, the application continues to wait. Even if we kill the application, the process lingers as a zombie, so the only way out is to restart the machine.
We know that the Windows AFD resource is strongly associated with the TCP resource. tcpip.sys will only invoke the afd.sys callback routine to release resources and trigger the signal after the corresponding TCP resource is released.
Therefore, the more important issue is why the TCP resource hasn't been released. To figure this out, we directly check the TCP resource reference.
Windows manages resources as reference-counted objects, and a TCP port is one such object. Before an operation is performed on an object, the system calls AddReference so that the object cannot be released while it is still in use, which would cause a memory access violation. After the operation, DeReference is called to decrement the reference count. Once the reference count of an object reaches 0, the corresponding routine releases that object. For a TCP listening port, the reference is dropped in tcpip!TcpDereferenceListener, for example:
# Child-SP RetAddr Call Site
00 fffff880`05b727b0 fffff880`0160a01e tcpip!TcpDereferenceListener+0xe
01 fffff880`05b727e0 fffff880`0160a039 tcpip!TcpCloseListener+0x6e
02 fffff880`05b72830 fffff880`02b2fa30 tcpip!TcpTlListenerCloseEndpoint+0x9
03 fffff880`05b72860 fffff880`02b2fef2 afd!AfdCleanupCore+0x410
04 fffff880`05b729e0 fffff800`01937aaf afd!AfdDispatch+0x42
05 fffff880`05b72a30 fffff800`01935a2e nt!IopCloseFile+0x11f
06 fffff880`05b72ac0 fffff800`0193565f nt!ObpDecrementHandleCount+0x8e
07 fffff880`05b72b40 fffff800`01935964 nt!ObpCloseHandleTableEntry+0xaf
08 fffff880`05b72bd0 fffff800`016fd9d3 nt!ObpCloseHandle+0x94
09 fffff880`05b72c20 00000000`7774999a nt!KiSystemServiceCopyEnd+0x13
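The reference-counting rule described above can be illustrated with a toy model. This is only a sketch: the class and method names are hypothetical and do not reflect the actual tcpip.sys data structures. It shows the one property that matters for this case, namely that closing the listener releases it only if no other reference is outstanding.

```python
class RefCountedListener:
    """Toy model of a reference-counted listener (not the real tcpip.sys object)."""

    def __init__(self, port):
        self.port = port
        self.ref_count = 1      # creation reference, dropped by close()
        self.released = False

    def add_reference(self):
        """Pin the object before using it so it cannot be released mid-use."""
        if self.released:
            raise RuntimeError("object already released")
        self.ref_count += 1

    def dereference(self):
        """Drop one reference; the release routine runs when the count hits 0."""
        if self.ref_count <= 0:
            raise RuntimeError("reference count underflow")
        self.ref_count -= 1
        if self.ref_count == 0:
            self.released = True

    def close(self):
        """Drop the creation reference and report whether the object was released."""
        self.dereference()
        return self.released


if __name__ == "__main__":
    listener = RefCountedListener(port=80)
    listener.add_reference()        # some component starts using the listener
    print(listener.close())         # False: one reference is still outstanding
    listener.dereference()          # the component finishes and drops its reference
    print(listener.released)        # True: the count reached 0
```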
In this case, the TCP resource corresponding to port 80 still holds as many as 0x36 references. Besides the one taken by TcpCreateListener, the other 0x35 references may be reference leaks or references that other drivers or software added when they performed operations on this structure. For example, netstat increases the reference count of a port while it enumerates port information.
Since we had confirmed that no network-related third-party software was installed on the system, we could reasonably conclude that the problem was caused by a TCP resource leak in the operating system itself. At this point, we would usually open a case with Microsoft to analyze the operating system problem further.
Just as we were about to open that case, we happened to find that the latest patch (released on July 10) may cause w3svc to hang. Although w3svc is not related to Apache, the underlying problem is essentially the same.
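To make the effect of a leaked reference concrete, here is a small simulation, assuming a simplified model in which closing the listener waits on a completion signal that is set only when the reference count drops to 0. All names are hypothetical; the `enumerate_ports` function merely stands in for a netstat-like caller that takes a temporary reference.

```python
import threading


class Listener:
    """Toy listener whose release is signalled through an event."""

    def __init__(self, port):
        self.port = port
        self._lock = threading.Lock()
        self.ref_count = 1                   # creation reference
        self.released = threading.Event()    # stands in for the completion signal

    def add_reference(self):
        with self._lock:
            self.ref_count += 1

    def dereference(self):
        with self._lock:
            self.ref_count -= 1
            if self.ref_count == 0:
                self.released.set()


def enumerate_ports(listener, leak=False):
    """Take a temporary reference while reading port info; leak it if asked."""
    listener.add_reference()
    info = {"port": listener.port}
    if not leak:
        listener.dereference()
    return info


def close_listener(listener, timeout=0.1):
    """Drop the creation reference, then wait for the completion signal."""
    listener.dereference()
    return listener.released.wait(timeout)


if __name__ == "__main__":
    healthy = Listener(80)
    enumerate_ports(healthy)
    print(close_listener(healthy))   # True: the count reached 0, close completes

    leaky = Listener(80)
    enumerate_ports(leaky, leak=True)
    print(close_listener(leaky))     # False: one leaked reference blocks the close
```

In the real system there is no timeout, which is why the service in our case stayed in the stopping state until the machine was restarted.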
https://support.microsoft.com/en-us/help/4338818/windows-7-update-kb4338818
Microsoft later released update KB4345459 to fix that problem.
In our case, the problem was solved after that update was applied.
Windows Networking Troubleshooting 1: Not Receiving Data Packets at NIC
Windows Networking Troubleshooting 3: Network Bugs triggered
Tim Chen - May 22, 2019
Tim Chen - June 26, 2019