How Intel PAUSE instruction changes affect MySQL performance
Guided Reading
There are many x86 and ARM instructions. Most of the time, neither application programmers nor database kernel developers need a deep understanding of them, but the PAUSE instruction is closely tied to how databases work. This article explains the PAUSE instruction through an interesting performance optimization case, in the hope that it helps you see how the CPU instruction set can affect your programs.
The article has two parts. The first part walks through the performance optimization of a query that scans all the sharded tables of a MySQL cluster. The second part analyzes the underlying principle once the problem is solved, along with the background, pros and cons, and application scenarios of the PAUSE instruction.
Business structure
Some simplifications are made for the convenience of understanding:
Client -> Tomcat -> LVS -> MySQL (a cluster of 32 MySQL instances, 8 cores per instance)
Scenario Description
The client load-tests Tomcat and the MySQL cluster (the data is sharded across databases and tables). The MySQL cluster has 32 instances. Tomcat splits each business SQL statement into 256 SQL statements and sends them to the 32 MySQL instances (each instance holds 8 sharded databases). The 256 statements are neither fully serial nor fully parallel; they are issued with a certain degree of parallelism.
The business SQL is a simple SELECT SUM that runs fast on every MySQL instance (it uses an index):
SELECT SUM(emp_arr_amt) FROM table_c WHERE INSUTYPE='310' AND Revs_Flag='Z' AND accrym='201910' AND emp_no='1050457';
Description of monitoring indicators
● Logical RT/QPS, in the text and screenshots below, refers to the RT and QPS of Tomcat as seen by the client;
● RT: response time, the only indicator used to locate the performance bottleneck;
● Physical RT/QPS refers to the RT and QPS of MySQL as seen by Tomcat (this RT is measured at the network card of the Tomcat node, so it also includes network time).
Problem description:
With one Tomcat node and 32 MySQL nodes under load, QPS is about 430, the Tomcat CPU is saturated, and MySQL RT is 0.5 ms. Adding a second Tomcat node raises QPS to about 700; the Tomcat CPU is nearly saturated, and MySQL RT is 0.6 ms. Up to this point, performance scales roughly linearly as nodes are added, which is in line with expectations.
Continue scaling out Tomcat: with three Tomcat nodes and 32 MySQL nodes, QPS is still 700, the Tomcat CPUs can no longer be saturated, and MySQL RT is 0.8 ms. This is far from what was expected.
Performance pressure test principle:
When QPS stops rising as concurrency increases, there is a bottleneck, and the bottleneck is wherever RT increases the most.
Troubleshooting MySQL
Through monitoring, the on-site DBA found that MySQL CPU usage was below 20% and that there were no slow queries. The DBA then bypassed all the intermediate links and load-tested a single MySQL instance directly from the client, which drove its CPU to full utilization at about 38,000 QPS (in the scenario above, with client QPS at 700, a single MySQL instance reached only about 6,000 QPS). MySQL was therefore ruled out as a suspect (this reasoning was not rigorous enough and left a big pitfall for the later investigation).
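The per-instance figure follows from the fan-out described earlier. A quick arithmetic check, assuming the 256 statements of each business SQL are spread evenly over the 32 instances (the even spread is an assumption):

```python
# Per-instance QPS implied by the fan-out.
# Assumption: statements are spread evenly across the instances.
LOGICAL_QPS = 700              # QPS seen by the client
STATEMENTS_PER_REQUEST = 256   # physical statements per business SQL
MYSQL_INSTANCES = 32

physical_qps_total = LOGICAL_QPS * STATEMENTS_PER_REQUEST
physical_qps_per_instance = physical_qps_total / MYSQL_INSTANCES

print(physical_qps_total)         # 179200
print(physical_qps_per_instance)  # 5600.0, close to the roughly 6000 observed
```

That is about 15% of the 38,000 QPS a single instance sustained when load-tested directly, which is why MySQL looked lightly loaded at this stage.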
The next suspects were the network, LVS, and the other intermediate links.
Suspicion of LVS and network
First, bandwidth was ruled out with a large query, since the traffic here consists of small packets. With PPS reaching 720,000, it was natural to suspect rate limiting at the gateway or LVS.
PPS monitoring: this physical machine hosts four MySQL instances and its PPS is about 90,000, so the cluster total is 90,000 * 32 / 4 = 720,000.
In the end, all network factors were ruled out. The core evidence: during the load test, Tomcat pinged MySQL repeatedly, and the RT was the same as with no load, which shows that the network was not the problem (think about what this ping actually proves).
Confirmation of problems
Enable logging on Tomcat and set the slow SQL threshold to 100 ms. The log now does show a large number of slow queries against MySQL. Because each business SQL is split into 256 statements on Tomcat and issued together, a single slow statement drags down the whole request (the weakest-link effect). The average RT is only 0.8 ms, but if individual statements take more than 100 ms, the overall impact is still large.
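To see why a low average RT can coexist with many slow requests, here is a minimal sketch. It assumes each of the 256 statements is independently slow with probability p; the independence and the sample values of p are assumptions for illustration, not measurements from this test.

```python
def slow_request_probability(p_slow: float, fan_out: int = 256) -> float:
    """Probability that at least one of fan_out independent statements is slow."""
    return 1.0 - (1.0 - p_slow) ** fan_out


# Hypothetical per-statement slow rates: 0.01%, 0.1% and 1%.
for p in (0.0001, 0.001, 0.01):
    print(f"p={p}: {slow_request_probability(p):.3f}")
```

Under these assumptions, a per-statement slow rate of only 0.1% already makes roughly 23% of business requests slow, and a 1% rate makes about 92% of them slow.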
Look up the slow queries recorded by Tomcat in the MySQL log (Tomcat attaches a unique ID to each statement it sends to MySQL). MySQL was indeed slow for these statements, so it is basically confirmed that the problem lies in MySQL, and we no longer have to wonder whether it is a network problem.
At the same time, packets were captured on Tomcat and the RT at the network card was analyzed statistically:
The capture shows the average physical RT per SQL statement as seen by Tomcat. The first result is at QPS 430, where the RT is 0.6 ms. The second is with three Tomcat servers at QPS 700, where the RT has risen to 0.9 ms. Both are basically consistent with the physical RT recorded by Tomcat monitoring. If a similar packet-capture RT were available on the MySQL side, network problems could be ruled out quickly.
RT data obtained from packet captures is the most readily accepted by everyone. I tried to capture packets on MySQL as well, but the LVS module rewrites the ports and IP addresses, so I could not analyze the response time of a stream.
Focus on MySQL again
Now that the location of the problem is basically confirmed, the way to check MySQL changes. Looking only at CPU and slow queries is no longer enough; the problem is clearly more complex.
Lesson: CPU is only one factor affecting performance; RT is the result. Follow RT, not just CPU.
Monitoring showed that MySQL CPU was never high, but the number of running threads often spiked above 100 and then quickly dropped back. It looks like bursts of concurrent queries were causing queuing. Each MySQL instance has 8 cores. Expanding an instance to 16 cores (only to test this hypothesis) did raise QPS to 1000 (not the ideal 1400).
This is the MySQL status monitored on Tomcat:
At the same time, you can also see this surge through vmstat on MySQL:
From the above, we can see that although the overall load on MySQL is not high, there are occasional surges in running threads and running tasks.
Such transient bursts of concurrency are hard to see in monitoring (almost all of them are averaged out). Only some real-time monitoring occasionally captures these spikes, which is why MySQL was overlooked at the beginning.
So the next core question is: why does MySQL show these surges, and what is their impact?
Perf top
Use perf directly on the mysqld process. The share of CPU spent in ut_delay is abnormally high:
Expanding the call stack shows that most of it comes from the optimizer estimating the number of rows matched by the index:
This is basically consistent with what SHOW PROCESSLIST shows in the MySQL command line:
This is the MySQL optimizer collecting statistics on the indexes, which requires taking a lock. When the running-thread count jitters, SHOW PROCESSLIST shows many threads in the statistics state. In other words, lock contention under high concurrency keeps the CPU from being fully used and makes RT rise sharply.
The 28% of CPU consumed by ut_delay here is clearly abnormal. After innodb_spin_wait_delay was changed from 30 to 6, performance improved immediately, and QPS again increased linearly as Tomcat nodes were added.
The call stack with the highest CPU consumption is ... mutex_spin_wait -> ut_delay, which is lock-waiting logic. InnoDB uses a spin lock here: while waiting for the lock, it calls ut_delay so that the CPU spins in an empty loop without being released. This avoids a context switch, but it consumes a relatively large amount of CPU.
Final performance
With innodb_spin_wait_delay=6, four Tomcat nodes, and a concurrency of 40, QPS reaches 1700, physical RT is 0.7 ms, logical RT is 19.6 ms, and CPU is 90%. From here, QPS can be increased simply by adding more Tomcat nodes.
For comparison, before the change (innodb_spin_wait_delay=30, concurrency 40): QPS 500+, physical RT 2.6 ms, logical RT 72.1 ms, CPU 37%.
Look at vmstat and tsar --cpu during the load test before the change. The number of running processes jitters noticeably:
After the delay was changed, the number of running processes is stable, even though QPS is about three times higher:
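As a sanity check on these figures, Little's law (requests in flight = throughput × response time) can be applied to both runs. This check is added here for illustration and is not part of the original analysis; it treats "QPS 500+" as exactly 500.

```python
def in_flight(qps: float, rt_ms: float) -> float:
    """Little's law: average number of requests in flight."""
    return qps * rt_ms / 1000.0


before = in_flight(qps=500, rt_ms=72.1)   # innodb_spin_wait_delay=30
after = in_flight(qps=1700, rt_ms=19.6)   # innodb_spin_wait_delay=6
print(f"before: {before:.0f}, after: {after:.0f}")
```

Both values come out in the mid-30s, close to the configured concurrency of 40. At a fixed concurrency, the gain in QPS corresponds directly to the drop in logical RT, which is another reason to follow RT.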
Afterthought and analysis
The problem is solved, but I can't help asking why. How does ut_delay work? And what is the relationship between innodb_spin_wait_delay and spin locks?
Principle analysis
Since adjusting innodb_spin_wait_delay solved the problem, let's first analyze what innodb_spin_wait_delay does.
About innodb_spin_wait_delay
InnoDB makes heavy use of spin locks (for example in InnoDB mutexes and rw-locks) to avoid context switches, which are expensive in CPU terms. This is the proper use of spin locks. On multi-core systems, however, many cores spinning for the same lock easily cause cache ping-pong, with the cores repeatedly invalidating parts of each other's caches. InnoDB therefore combines innodb_spin_wait_delay with the PAUSE instruction to relieve cache ping-pong. Instead of spinning at full speed, a thread that fails to take the lock delays (using PAUSE) without releasing the CPU, and tries again once the delay is over. Continuous spinning is turned into sparser, point-in-time attempts (the length of each delay is a random number). This avoids context switches and also greatly reduces cache ping-pong.
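The spin-then-delay pattern described above can be sketched as follows. This is an illustrative model, not InnoDB's code: Python cannot issue PAUSE, so cpu_relax is a stand-in, and the names spin_acquire and max_rounds are hypothetical. The multiplier of 50 reflects my understanding of ut_delay in MySQL 5.7, and max_rounds=30 mirrors the default of innodb_sync_spin_loops; treat both as assumptions.

```python
import random
import threading

PAUSE_LOOPS_PER_UNIT = 50  # assumption: multiplier used by ut_delay in MySQL 5.7


def cpu_relax() -> None:
    """Stand-in for the PAUSE instruction; Python cannot issue PAUSE directly."""
    pass


def ut_delay(delay: int) -> int:
    """Busy-wait for delay * PAUSE_LOOPS_PER_UNIT relax calls; return the count."""
    loops = delay * PAUSE_LOOPS_PER_UNIT
    for _ in range(loops):
        cpu_relax()
    return loops


def spin_acquire(lock: threading.Lock, spin_wait_delay: int = 6,
                 max_rounds: int = 30) -> bool:
    """Try the lock; after each failure, delay for a random interval and retry."""
    for _ in range(max_rounds):
        if lock.acquire(blocking=False):
            return True
        # Random interval in [0, spin_wait_delay] spreads out the retries.
        ut_delay(random.randint(0, spin_wait_delay))
    # Spinning did not help: fall back to a blocking wait (a real context switch).
    return lock.acquire()


lock = threading.Lock()
print(spin_acquire(lock))  # True: an uncontended lock is taken on the first try
lock.release()
```

The point of the sketch is that the time spent per failed attempt scales with both spin_wait_delay and the cost of one relax call, so a slower PAUSE lengthens every delay.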
How spin locks reduce cache ping pong
When multiple threads compete for a lock, a thread that fails to acquire it "busy-waits" until it gets the lock. What is busy-waiting? It does not mean executing the CAS operation nonstop. Instead, the thread cooperates with the CPU: it uses the PAUSE instruction provided by the CPU to reduce cache ping-pong and power consumption while looping. On a single-core CPU, busy-waiting is pointless, so the thread is put to sleep instead.
X86 PAUSE instruction
x86 provides the PAUSE instruction: the code that calls it does not give up the CPU, but the CPU takes a short nap of, for example, 10 clock cycles, whereas a context switch costs several thousand clock cycles.
So when a spin-lock attempt fails, the thread can pause first. A single PAUSE is too short for MySQL, however, so the parameter innodb_spin_wait_delay is used to lengthen the rest.
Our scenario is very sensitive to RT jitter on each SQL statement (the effect is amplified by the 256-way fan-out), so an excessive delay makes some SQL RTs noticeably higher.
Why the CPU needs PAUSE
First, look at what the PAUSE instruction does:
● It avoids context switches. An application that wants to rest might call yield or sleep, but both are too heavy for the CPU (they involve a context switch).
● It frees computing capacity for the sibling hyper-thread (hyper-threads share a core but have their own registers and other storage units; while one executes PAUSE, the other can use the core's computing resources). For example, if one hyper-thread of a core runs a stream of PAUSE instructions while the other runs NOP instructions, the IPC of the NOP stream is basically unaffected by the PAUSEs.
● It saves energy (the CPU rests without being released). During PAUSE, top still shows 100% CPU, but the core draws less power.
Therefore, the PAUSE instruction improves hyper-thread utilization, saves energy, reduces context switches, and makes spin locks more efficient.
Summary analysis
The PAUSE instruction takes a different number of CPU cycles on different Intel CPU architectures. When a MySQL spin-lock attempt fails, InnoDB executes PAUSE a number of times proportional to innodb_spin_wait_delay (N * innodb_spin_wait_delay), so on a CPU with a slower PAUSE the delay lasts longer. Callers then see a higher MySQL RT, concurrency on the Tomcat servers cannot be raised, and the load ultimately cannot go up.
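A rough estimate of how much the PAUSE latency changes the worst-case delay per failed spin attempt. The sketch assumes 50 PAUSE instructions per unit of innodb_spin_wait_delay (my understanding of MySQL 5.7), the figures given in Intel's optimization manual of about 10 cycles per PAUSE before Skylake and up to about 140 cycles on Skylake, and a hypothetical 2.5 GHz clock. None of these numbers were measured in this test.

```python
PAUSES_PER_DELAY_UNIT = 50  # assumption: multiplier in MySQL 5.7's ut_delay
CLOCK_HZ = 2.5e9            # hypothetical clock frequency


def max_delay_us(spin_wait_delay: int, pause_cycles: int) -> float:
    """Worst-case busy-wait per failed lock attempt, in microseconds."""
    cycles = spin_wait_delay * PAUSES_PER_DELAY_UNIT * pause_cycles
    return cycles / CLOCK_HZ * 1e6


# (innodb_spin_wait_delay, assumed cycles per PAUSE)
for delay, cycles in ((30, 10), (30, 140), (6, 140)):
    print(f"delay={delay}, pause={cycles} cycles: {max_delay_us(delay, cycles):.1f} us")
```

Under these assumptions the three cases come to about 6 µs, 84 µs, and 16.8 µs. The same setting of 30 costs roughly 14 times more wall-clock time with the slower PAUSE, and lowering it to 6 brings the delay back to the same order of magnitude as before.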
When troubleshooting a long request chain, the hardest part is pinpointing which node has the problem. Focus on RT rather than CPU.
Haste makes waste. When load testing, patiently start from a concurrency of one, observe QPS and RT, and then increase the load step by step until it can go no higher. Then look at how QPS and RT changed to confirm the bottleneck.