By Jiachun
The client serializes the metadata <group, providerName, version>, the methodName, and the args[] into a byte array and sends it to the provider's address over the network. The server looks up the providerObject in its local service dictionary by <group, providerName, version>, invokes the specified method through reflection based on <methodName, args[]>, and serializes the return value into a byte array that is sent back to the client. The whole process is transparent to the method caller; everything looks like a local call.
Important concept: the RPC triplet <ID, Request, Response>.
Note: In Netty 4.x, thread contention can be further reduced by replacing the single global Map with a per-I/O-thread (worker) Map<InvokeId, Future>.
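As a minimal sketch of this bookkeeping (class and method names below are illustrative, not part of any particular framework), the client records every in-flight invocation by ID and completes the matching future when the response arrives:

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper: maps invokeId -> pending future so that a response
// (which carries the same ID) can be matched back to its request.
public final class PendingInvocations {
    private static final AtomicLong ID = new AtomicLong();
    // A single global map works; per-I/O-thread maps (as noted above) avoid contention.
    private static final Map<Long, CompletableFuture<Object>> PENDING = new ConcurrentHashMap<>();

    public static long register(CompletableFuture<Object> future) {
        long invokeId = ID.incrementAndGet();
        PENDING.put(invokeId, future);
        return invokeId;
    }

    // Called by the I/O thread when a response frame with this ID is decoded.
    public static void complete(long invokeId, Object result) {
        CompletableFuture<Object> f = PENDING.remove(invokeId);
        if (f != null) {
            f.complete(result);
        }
    }
}
```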
What should the Request contain? (A sketch of such a request object follows this list.)
1) Metadata: <group, providerName, version>
2) methodName
3) Is parameterTypes[] necessary?
a) What is the problem? Potential lock contention on ClassLoader.loadClass() during deserialization.
b) Can it be solved?
4) args[]
5) Other: traceId, appName...
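A minimal sketch of a request message carrying the fields listed above (the class and field names are illustrative only):

```java
import java.io.Serializable;
import java.util.Map;

// Hypothetical request DTO; mirrors the fields listed above.
public class RpcRequest implements Serializable {
    private long invokeId;                    // matches the ID of the <ID, Request, Response> triplet
    private String group;                     // metadata
    private String providerName;              // metadata
    private String version;                   // metadata
    private String methodName;
    private Object[] args;                    // parameterTypes[] can often be inferred from args on the server
    private Map<String, String> attachments;  // traceId, appName, ...

    // getters/setters omitted for brevity
}
```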
1) What does the Proxy do?
2) What methods can be used to create a Proxy?
3) What are the most important points to watch? Remember to handle toString, equals, hashCode, and other such methods.
4) Recommendation: Byte Buddy (see the sketch after this list).
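To make points 1) and 3) concrete, here is a minimal sketch using the JDK's built-in dynamic proxy (the article recommends Byte Buddy; the JDK Proxy is used here only to keep the example short, and the RpcClient type is hypothetical). Note how toString, equals, and hashCode are answered locally instead of being sent over the wire:

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;

public final class ProxyFactory {
    // Hypothetical transport: sends <metadata, methodName, args[]> and waits for the response.
    public interface RpcClient {
        Object invoke(String methodName, Object[] args) throws Exception;
    }

    @SuppressWarnings("unchecked")
    public static <T> T newProxy(Class<T> serviceInterface, RpcClient client) {
        InvocationHandler handler = (Object proxy, Method method, Object[] args) -> {
            // Local Object methods must not become remote calls.
            if (method.getDeclaringClass() == Object.class) {
                switch (method.getName()) {
                    case "toString": return serviceInterface.getName() + "@proxy";
                    case "hashCode": return System.identityHashCode(proxy);
                    case "equals":   return proxy == args[0];
                    default:         throw new UnsupportedOperationException(method.getName());
                }
            }
            return client.invoke(method.getName(), args);
        };
        return (T) Proxy.newProxyInstance(
                serviceInterface.getClassLoader(), new Class<?>[]{serviceInterface}, handler);
    }
}
```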
The protocol header is marked with the serializer type. Multiple types are supported.
Java SPI:
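Java SPI is the simplest built-in way to make serializers, load balancers, filters, and so on pluggable. A minimal sketch, assuming a hypothetical Serializer extension point (this also ties in with the serializer type carried in the protocol header):

```java
import java.util.ServiceLoader;

// Hypothetical extension point.
interface Serializer {
    byte code();                               // value written into the protocol header
    byte[] writeObject(Object obj);
    <T> T readObject(byte[] bytes, Class<T> clazz);
}

// Provider implementations are declared in
// META-INF/services/<fully-qualified Serializer interface name>, one class per line,
// and discovered at runtime:
class SerializerRegistry {
    static Serializer find(byte code) {
        for (Serializer s : ServiceLoader.load(Serializer.class)) {
            if (s.code() == code) {
                return s;
            }
        }
        throw new IllegalArgumentException("unknown serializer code: " + code);
    }
}
```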
The failure of one thread pool does not affect other thread pools.
Many extensions need to start from here.
OpenTracing
The framework needs enough extension capability that third-party throttling (rate-limiting) middleware can be plugged in easily.
1) Weighted Random (binary search over prefix sums instead of a linear traversal; see the sketch after this list)
2) Weighted Round Robin (greatest common divisor)
3) Least Load
4) Consistent Hashing (for stateful service scenarios)
5) Others
Note: warm-up (preheating) logic is required.
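A minimal sketch of item 1), weighted random selection: precompute the prefix sums of the weights, draw a uniform random number, and locate the chosen node with a binary search instead of walking the whole list (the node and weight types here are illustrative):

```java
import java.util.concurrent.ThreadLocalRandom;

public final class WeightedRandomSelector<T> {
    private final T[] nodes;
    private final long[] prefixSums; // prefixSums[i] = weight[0] + ... + weight[i]

    public WeightedRandomSelector(T[] nodes, int[] weights) {
        this.nodes = nodes;
        this.prefixSums = new long[weights.length];
        long sum = 0;
        for (int i = 0; i < weights.length; i++) {
            sum += weights[i];
            prefixSums[i] = sum;
        }
    }

    public T select() {
        long total = prefixSums[prefixSums.length - 1];
        long r = ThreadLocalRandom.current().nextLong(total); // 0 <= r < total
        // Binary search for the first prefix sum greater than r.
        int low = 0, high = prefixSums.length - 1;
        while (low < high) {
            int mid = (low + high) >>> 1;
            if (prefixSums[mid] <= r) {
                low = mid + 1;
            } else {
                high = mid;
            }
        }
        return nodes[low];
    }
}
```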
1) Fail-Fast
2) Failover: how do we handle asynchronous calls? (See the sketch after this list.)
3) Fail-Safe
4) Fail-Back
5) Forking
6) Others
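A minimal synchronous failover sketch, assuming hypothetical Provider and invoke abstractions; the same idea applies to asynchronous calls by chaining the retry onto the returned future instead of catching the exception inline:

```java
import java.util.ArrayList;
import java.util.List;

public final class FailoverInvoker {
    // Hypothetical abstraction for illustration.
    public interface Provider {
        Object invoke(String method, Object[] args) throws Exception;
    }

    public static Object invokeWithFailover(List<Provider> candidates,
                                            String method, Object[] args,
                                            int maxRetries) throws Exception {
        List<Provider> remaining = new ArrayList<>(candidates);
        Exception last = null;
        for (int attempt = 0; attempt <= maxRetries && !remaining.isEmpty(); attempt++) {
            Provider p = remaining.remove(0); // in practice, picked by the load balancer
            try {
                return p.invoke(method, args);
            } catch (Exception e) {
                last = e; // failover: try the next provider
            }
        }
        throw last != null ? last : new IllegalStateException("no provider available");
    }
}
```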
1) Write a FastMethodAccessor using ASM to replace the reflection call on the server (see the sketch after this list).
2) Serialization/Deserialization
Serialize and deserialize in the business threads to avoid occupying the I/O threads: loadClass has a serious lock-contention problem during deserialization, which can be observed through JMC.
Select an efficient serialization/deserialization framework.
Framework selection is only the first step. If the serialization framework cannot work with off-heap memory directly, extend and optimize it: by default the write path is java object -> byte[] -> off-heap memory and the read path is off-heap memory -> byte[] -> java object. Skip the intermediate byte[] step and read from/write to the off-heap memory directly; this requires extending the corresponding serialization framework (see the sketch after this list). For example, small writeBytes calls are merged into writeShort/writeInt/writeLong, UnsafeNioBufInput reads from the off-heap memory directly, and UnsafeNioBufOutput writes to the off-heap memory directly.
3) The I/O thread is bound to a CPU core.
4) Client coroutines: a client that makes synchronous blocking calls easily becomes a bottleneck. Candidate coroutine frameworks:
| Name | Description |
| --- | --- |
| Kilim | Bytecode enhancement at compile time |
| Quasar | Dynamic bytecode enhancement via a Java agent |
| ali_wisp | Implemented at the JVM level inside ali_jvm |
5) Netty Native Transport and PooledByteBufAllocator:
6) Release the I/O thread as soon as possible so it can do what it is supposed to do, and minimize thread context switching.
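For item 1), the ASM-generated accessor is conceptually equivalent to the hand-written Java below (a sketch only: a real FastMethodAccessor emits this kind of class as bytecode at runtime, and the target interface here is hypothetical). The reflective Method.invoke is replaced by a switch over a method index and direct calls:

```java
// Sketch of the class an ASM-based FastMethodAccessor would generate.
public class FastAccessorSketch {
    // Hypothetical target service, included only so the example compiles on its own.
    public interface UserService {
        String findNameById(long id);
        void save(String name);
    }

    // "Generated" accessor: no reflection, just a switch over a method index.
    public static class UserServiceAccessor {
        public Object invoke(Object target, int methodIndex, Object... args) {
            UserService service = (UserService) target;
            switch (methodIndex) {
                case 0:  return service.findNameById((Long) args[0]);
                case 1:  service.save((String) args[0]); return null;
                default: throw new IllegalArgumentException("bad method index: " + methodIndex);
            }
        }
    }
}
```

For item 2), one way to skip the intermediate byte[] is to let the serializer write straight into a (possibly direct, off-heap) ByteBuf. The sketch below uses Netty's ByteBufOutputStream with plain JDK serialization purely for brevity; a real framework (Kryo, Protostuff, etc.) would need its own adapter, analogous to the UnsafeNioBufOutput mentioned above:

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.ByteBufOutputStream;
import io.netty.buffer.PooledByteBufAllocator;

import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public final class DirectSerialization {
    // Serializes obj straight into a pooled direct buffer: no intermediate byte[].
    public static ByteBuf encode(Serializable obj) throws IOException {
        ByteBuf buf = PooledByteBufAllocator.DEFAULT.directBuffer();
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteBufOutputStream(buf))) {
            out.writeObject(obj);
        }
        return buf; // the caller is responsible for release() after writing it out
    }
}
```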
Poor Stability with Multiple Problems
The notorious bug where EPollArrayWrapper.epollWait returns immediately from an empty poll, driving CPU usage to 100%; Netty helps you work around it by rebuilding the selector.
Some Disadvantages of the NIO Code Implementation
1) Selector.selectedKeys() produces too much garbage.
Netty modified the implementation of sun.nio.ch.SelectorImpl to store the selectedKeys in two arrays instead of a HashSet.
The NIO code is synchronized everywhere, for example around direct-buffer allocation and Selector.wakeup():
For direct-buffer allocation, Netty's pooledBytebuf has a front-end TLAB (thread-local allocation buffer) that reduces lock contention effectively.
(On Windows, because a pipe cannot be put into fd_set, wakeup can only be simulated, as a compromise, with a pair of TCP connections.) Calling wakeup too often causes heavy lock contention, while calling it too rarely causes unnecessary blocking in the select operation. (If this is confusing, just use Netty directly; it contains the corresponding optimization logic.)
2) fdToKey mapping
EPollSelectorImpl#fdToKey maintains the mapping from every connected fd (file descriptor) to its SelectionKey, and it is a HashMap. Each worker thread has its own selector, and therefore its own fdToKey; these fdToKey maps roughly split all the connections among them.
3) On Linux, the Selector is an implementation of epoll LT (level-triggered).
4) Direct buffer reclamation depends on GC.
DirectByteBuffer.cleaner: a phantom reference is responsible for freeing the direct memory, and the DirectByteBuffer object itself is just a shell. If this shell survives past the age limit of the young generation and is promoted to the old generation, that is a sad thing: the off-heap memory it references is only freed after an old-generation collection.
When direct memory runs short, Bits.reserveMemory() falls back to { System.gc() }. First the entire process is stalled by the GC, then the code sleeps for 100 milliseconds; if direct memory is still insufficient when it wakes up, oops, an OutOfMemoryError is thrown. And if the -XX:+DisableExplicitGC parameter is set, there will be an extra unexpected misfortune, because that System.gc() call becomes a no-op.
Netty's UnpooledUnsafeNoCleanerDirectByteBuf avoids this: the framework frees the memory in real time by maintaining a reference count.
EventLoop
Boss is the mainReactor and Worker is the subReactor.
The BossEventLoopGroup usually only needs to contain a single EventLoop.
The WorkerEventLoopGroup generally contains multiple EventLoops, typically twice the number of CPU cores; most importantly, find the best value for your own scenario by testing.
ServerChannel and Channel: a ServerChannel corresponds to a ServerSocketChannel, and a Channel corresponds to a network connection.
ChannelPipeline
Pooling & Reuse
PooledByteBufAllocator: the case where a buffer is returned from a thread other than the one that allocated it is handled through an mpsc_queue, while sacrificing a little bit of performance.
Recycler: when the releasing thread is not the allocating thread, the object is returned to a WeakOrderQueue associated with the owning stack. If the stack is empty on the next pop, all WeakOrderQueues associated with the current stack are scanned first. A WeakOrderQueue is a linked list of arrays, and the default size of each array is 16. (A usage sketch follows.)
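A minimal sketch of the Netty 4.1 Recycler usage pattern described above (the PooledRequest class is illustrative):

```java
import io.netty.util.Recycler;

// Illustrative pooled object: obtained from the Recycler instead of being allocated,
// and returned to it when no longer needed.
public final class PooledRequest {
    private static final Recycler<PooledRequest> RECYCLER = new Recycler<PooledRequest>() {
        @Override
        protected PooledRequest newObject(Handle<PooledRequest> handle) {
            return new PooledRequest(handle);
        }
    };

    private final Recycler.Handle<PooledRequest> handle;
    private Object payload;

    private PooledRequest(Recycler.Handle<PooledRequest> handle) {
        this.handle = handle;
    }

    public static PooledRequest newInstance(Object payload) {
        PooledRequest req = RECYCLER.get();
        req.payload = payload;
        return req;
    }

    public void recycle() {
        payload = null;          // drop references before returning to the pool
        handle.recycle(this);    // may cross threads: this is where the WeakOrderQueue comes in
    }
}
```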
Netty Native Transport: compared with JDK NIO, it creates fewer objects and puts less pressure on the GC.
The following describes some specific features it offers for optimization on Linux:
SO_REUSEPORT: port reuse. Multiple sockets are allowed to listen on the same IP address and port, and combining this with RPS/RFS improves performance further. (RPS and RFS simulate multi-queue NICs at the software layer and provide load-balancing capability, preventing all packet-reception interrupts and deliveries from landing on a single CPU core and hurting performance.)
TCP_FASTOPEN: data can already be exchanged during the three-way handshake.
EDGE_TRIGGERED: epoll ET is supported.
select/poll
On every call, the fd_set has to be copied back and forth between user space and kernel space, and readiness is detected by scanning all descriptors, which is O(n).
Epoll
When epoll_wait is called, only the ready file descriptors are returned.
Concepts:
Readable: the socket receive buffer has data available to read (at least the SO_RCVLOWAT low-water mark).
Writable: the socket send buffer has free space available to write (at least the SO_SNDLOWAT low-water mark).
Three Epoll Methods
1) Main code: linux-2.6.11.12/fs/eventpoll.c
2) int epoll_create(int size)
Creates an rb-tree (red-black tree) and a ready-list (a linked list of ready descriptors).
3) int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event)
Puts an epitem into the rb-tree and registers ep_poll_callback with the kernel interrupt handler; when the callback fires, the epitem is put onto the ready-list.
4) int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout)
Copies the ready-list into events[].
Data Structure of Epoll
epoll_wait Workflow Overview
Code for reference: linux-2.6.11.12/fs/eventpoll.c:
1) epoll_wait calls ep_poll.
When the rdlist (ready-list) is empty (there is no ready fd), the current thread is suspended; it is only awakened once the rdlist becomes non-empty.
2) The event status of a file descriptor fd changes.
The ep_poll_callback registered on that fd is triggered.
3) ep_poll_callback is triggered.
The epitem of the corresponding fd is added to the rdlist. The rdlist is therefore no longer empty, the thread is awakened, and epoll_wait can continue.
4) The ep_events_transfer function runs.
It moves the epitems on the rdlist to the txlist and clears the rdlist (for level-triggered descriptors, the epitem is later put back onto the rdlist so it will be reported again while it stays ready).
5) The ep_send_events function runs.
It scans each epitem on the txlist and calls the poll method of the epitem's associated fd to obtain the up-to-date events.

1) The necessity of the business thread pool (see the configuration sketch after this list)
2) WriteBufferWaterMark
3) Rewrite the MessageSizeEstimator to reflect the real high and low watermarks.
The outboundHandler is traversed when the object is written; at that point the object has not yet been encoded into a Bytebuf, so the default size estimate is inaccurate (too small).
4) Pay attention to the EventLoop#ioRatio setting (50 by default), which controls the proportion of time the EventLoop spends executing I/O tasks versus non-I/O tasks.
5) Who schedules the idle-connection detection?
Using the delayQueue of the EventLoop, a priority queue implemented with a binary heap, costs O(log n). Each worker monitors its own connections, which helps reduce context switching, but network I/O operations and idle detection affect each other.
Alternatively, implement the IdleStateHandler with a HashedWheelTimer when the number of connections is large: its complexity is O(1), and network I/O operations and idle detection no longer affect each other, but it incurs context-switching overhead.
6) ctx.writeAndFlush or channel.writeAndFlush?
ctx.write goes directly to the next outbound handler; be careful not to let it bypass the idle-detection handler, which is probably not what you want. channel.write starts from the tail of the pipeline and passes through every outbound handler one by one.
7) Use Bytebuf.forEachByte() to replace loops over ByteBuf.readByte() and avoid the rangeCheck() on every read.
8) Use CompositeByteBuf to avoid unnecessary memory copying.
9) To read an int, use Bytebuf.readInt() instead of Bytebuf.readBytes(buf, 0, 4).
10) Configure UnpooledUnsafeNoCleanerDirectByteBuf to replace the JDK's DirectByteBuf so that the Netty framework releases off-heap memory based on the reference count.
io.netty.maxDirectMemory:
< 0: no cleaner is used, and Netty inherits the maximum direct memory size set for the JDK. The JDK's direct memory is accounted for independently, so the total direct memory can be twice the JDK configuration.
== 0: a cleaner is used, and Netty does not enforce a maximum direct memory size of its own.
> 0: no cleaner is used, and this parameter limits Netty's maximum direct memory. (The JDK's direct memory is independent and is not limited by this parameter.)
11) Optimal Number of Connections
12) When using PooledBytebuf, make good use of the -Dio.netty.leakDetection.level parameter.
Use the grep command to check the logs from time to time; once "LEAK:" appears, switch the level to ADVANCED immediately and run again, so you can see where the leaked object was last accessed.
13) Channel.attr(): attach your own objects to the channel.
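A consolidated sketch of several of the points above: item 1 (a separate business thread pool so slow handlers do not block the I/O threads), item 2 (WriteBufferWaterMark plus a Channel.isWritable() check), and item 13 (Channel.attr()). All handler and key names are illustrative:

```java
import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.ChannelOption;
import io.netty.channel.WriteBufferWaterMark;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioServerSocketChannel;
import io.netty.util.AttributeKey;
import io.netty.util.ReferenceCountUtil;
import io.netty.util.concurrent.DefaultEventExecutorGroup;

public final class BestPracticesSketch {
    // Item 13: attach your own per-connection state to the Channel.
    static final AttributeKey<String> TRACE_ID = AttributeKey.valueOf("traceId");

    // Item 1: business handlers run on this executor group, not on the I/O EventLoop.
    static final DefaultEventExecutorGroup BUSINESS = new DefaultEventExecutorGroup(32);

    public static ServerBootstrap configure() {
        return new ServerBootstrap()
                .group(new NioEventLoopGroup(1), new NioEventLoopGroup())
                .channel(NioServerSocketChannel.class)
                // Item 2: low/high write-buffer watermarks, in bytes.
                .childOption(ChannelOption.WRITE_BUFFER_WATER_MARK,
                        new WriteBufferWaterMark(32 * 1024, 64 * 1024))
                .childHandler(new ChannelInitializer<SocketChannel>() {
                    @Override
                    protected void initChannel(SocketChannel ch) {
                        ch.pipeline().addLast(BUSINESS, "business", new ChannelInboundHandlerAdapter() {
                            @Override
                            public void channelRead(ChannelHandlerContext ctx, Object msg) {
                                ctx.channel().attr(TRACE_ID).set("trace-" + ctx.channel().id());
                                // Respect the watermark: only write while the channel is writable.
                                if (ctx.channel().isWritable()) {
                                    ctx.writeAndFlush(msg);
                                } else {
                                    ReferenceCountUtil.release(msg); // don't leak the buffer if we drop it
                                }
                            }
                        });
                    }
                });
    }
}
```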
1) Prefer AtomicIntegerFieldUpdater over AtomicInteger in scenarios with a large number of objects.
An AtomicInteger is 16 bytes in size and an AtomicLong is 24 bytes, whereas an AtomicIntegerFieldUpdater is a single static field that operates on a plain volatile int.
2) FastThreadLocal: faster than the JDK implementation.
3) IntObjectHashMap / LongObjectHashMap
4) RecyclableArrayList: useful in scenarios where ArrayList instances are created over and over again.
5) JCTools: provides data structures not available in the JDK, such as NonBlockingHashMap (comparable to ConcurrentHashMap in JDK 6/8).

We are the Ant Intelligent Monitoring Technology Middle Platform Storage Team. We are using Rust, Go, and Java to build a new-generation, low-cost time-series database with high performance and real-time analysis capability. You are welcome to transfer to our team or recommend candidates to us. Please contact Feng Jiachun via email (jiachun.fjc@antgroup.com) for more information.