The Practice of Apache ShenYu Integrating with RocketMQ to Collect Massive Logs in Real-Time

By Hu Taishi (Kuaishou Java Development Engineer)

Meet Apache ShenYu

The most important thing about the gateway is traffic governance, which has many similarities with King Yu's flood taming. Therefore, the gateway's traffic governance project is named ShenYu.

ShenYu is a high-performance, multi-protocol, extensible, and responsive (reactive) API gateway. Its main features include rich protocol support, a plugin architecture, traffic governance, and high performance.

ShenYu supports multiple protocols, such as HTTP, Spring Cloud, gRPC, Dubbo, Sofa, Tars, and Motan. It uses a plugin design that supports hot swapping and ships with rich built-in plugins. The biggest benefit of the plugin design is high extensibility: most of ShenYu's traffic governance functions correspond to individual plugins. ShenYu supports rich and flexible traffic management, such as authentication, throttling, circuit breaking, security, load balancing, grayscale release, dynamic upstream, and observability. Because gateways require high performance, ShenYu adopts reactive, fully asynchronous end-to-end processing and supports cluster deployment and blue-green release.

The client initiates a request, the request passes through the gateway, and the gateway forwards the request to the service provider and returns the service provider's response to the client.

From the perspective of initiating requests, the architecture supports multiple languages and is not tied to a specific platform. The HTTP request first passes through a proxy layer, which can be Nginx, ShenYuNginx, ShenYu proxy, or Kubernetes, and the proxy layer routes the request to nodes in the ShenYu cluster through load balancing. After receiving the request, the ShenYu gateway filters it. To ensure performance, the ShenYu gateway stores various metadata in local memory and uses efficient algorithms.

Metadata is required to process requests. The ShenYu gateway obtains it through the synchronization mechanism between the gateway and the ShenYu Admin backend. After data changes are made in the ShenYu Admin backend (such as plugin changes, selector changes, rule changes, and data changes), the changes are synchronized to the ShenYu gateway in a specific way.

You can select a pull or push synchronization method based on project features. When a request arrives, the gateway first filters the traffic by matching the request URL query parameters, request headers, request parameters, host, or request body. The matching conditions can be like, match, regular expression, include, or exclude. The corresponding matching implementation is then loaded through SPI.

If the traffic matches, it is forwarded to the corresponding plugins. The plugins form a plugin chain, and each plugin in the chain performs its function on the matched traffic. Finally, the request is forwarded to the service provider through the export plugin, and the response from the service provider is returned to the client.

Monitoring the running status of gateways is an important module. For the logging part of observability, the number of requests may be large, which results in a large number of logs. As such, an integrated message queue is generally required.

The preceding figure shows the traffic filtering process of ShenYu. Each request includes certain metadata, which can be matched to sort out the required requests. ShenYu's traffic filtering involves two important components: selectors and rules.

After a request arrives, the system first checks whether the plugin is enabled. If the plugin is not enabled, it does not process the request. If the plugin is enabled, the selector corresponding to the plugin determines whether the request matches. If the selector matches, the request is handed over to the rule for a second match. If the rule also matches, the plugin is executed. If either match fails, the request is handed over to the next plugin for processing.

Selectors are equivalent to first-level matching, which is coarse-grained matching, while rules are equivalent to second-level matching, which is fine-grained matching. This design can ensure higher flexibility.

Plugins form a plugin chain. After the previous plugin is executed, you can decide whether to hand over the request to the next plugin for processing.
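The enabled check, selector match, and rule match described above can be sketched in a few lines of Java. This is a simplified model for illustration only: SketchPlugin, SketchPluginChain, and the use of a plain path string as the request are assumptions, not ShenYu's actual classes or API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical, simplified plugin model; not ShenYu's real plugin interface. */
final class SketchPlugin {
    final String name;
    final boolean enabled;
    final Predicate<String> selector; // first-level, coarse-grained match
    final Predicate<String> rule;     // second-level, fine-grained match

    SketchPlugin(String name, boolean enabled,
                 Predicate<String> selector, Predicate<String> rule) {
        this.name = name;
        this.enabled = enabled;
        this.selector = selector;
        this.rule = rule;
    }
}

/** Walks the plugins in order and records which ones would execute. */
final class SketchPluginChain {
    private final List<SketchPlugin> plugins;

    SketchPluginChain(List<SketchPlugin> plugins) {
        this.plugins = plugins;
    }

    /** Returns the names of the plugins executed for the given request path. */
    List<String> execute(String path) {
        List<String> executed = new ArrayList<>();
        for (SketchPlugin plugin : plugins) {
            if (!plugin.enabled) {
                continue; // plugin disabled: hand over to the next plugin
            }
            if (!plugin.selector.test(path)) {
                continue; // selector does not match: next plugin
            }
            if (!plugin.rule.test(path)) {
                continue; // rule does not match: next plugin
            }
            executed.add(plugin.name); // both levels matched: execute the plugin
        }
        return executed;
    }
}
```

In this sketch, a request to /order/create runs every enabled plugin whose selector and rule both accept the path, while disabled plugins are skipped without any matching work.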

After traffic is filtered, it must be processed, which requires connecting service providers to the ShenYu gateway. ShenYu provides a series of clients for this purpose, and the service provider depends on the ShenYu client. When the service provider starts, the ShenYu client collects the metadata and publishes it to Disruptor. The registry client takes the data from Disruptor and sends it to the registry. The registry supports many methods (such as HTTP, ZooKeeper, and Nacos).

ShenYu Admin acts as the server side of the registry and monitors data changes in it. You can configure the registry and select the synchronization method. After detecting metadata changes, ShenYu Admin persists them (for example, in MySQL) and synchronizes them to the ShenYu gateway (for example, through ZooKeeper), which then updates its local cache.

ShenYu's traffic governance is dynamic, flexible, and convenient to use.

During the synchronization between the ShenYu gateway and the Admin backend, when a user changes metadata in the console or through the API, ShenYu Admin detects the change and loads the synchronization method configured by the user through SPI. Both the pull method (such as HTTP) and the push method (such as WebSocket) are supported. After detecting metadata changes, the ShenYu gateway immediately updates its cache, and the changes take effect for subsequent requests.
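The idea of updating a local cache on each change so that later requests see the new data can be sketched as follows. SketchMetadataCache, onChange, and lookup are hypothetical names, and the sketch assumes that the sync layer (pull or push) delivers each change as a key and a value, with null meaning deletion. ShenYu's real data subscribers are more elaborate.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical in-memory metadata cache on the gateway side. */
final class SketchMetadataCache {
    private final Map<String, String> entries = new ConcurrentHashMap<>();

    /** Called by the sync layer when the admin publishes a change; null removes the entry. */
    void onChange(String key, String value) {
        if (value == null) {
            entries.remove(key);
        } else {
            entries.put(key, value);
        }
    }

    /** Called on the request path: reads only local memory, no remote call. */
    Optional<String> lookup(String key) {
        return Optional.ofNullable(entries.get(key));
    }
}
```

Because lookups never leave the process, the request path stays fast, and a change applied through onChange is visible to the next request that calls lookup.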

ShenYu supports hot plugin swapping. For example, you can directly exclude plugins you do not need. For the plugins you depend on, you can enable or disable them and configure their metadata in the Admin backend. In addition, the official built-in plugins cannot cover all scenarios, so users can develop custom plugins, and ShenYu provides a custom class loader for them. The community is also developing a feature that lets users upload their plugins through the management backend, which is more convenient to operate. ShenYu supports almost all mainstream registries.

The figure above shows the growth in the number of ShenYu contributors. The number of contributors has increased steadily since 2021, the year when ShenYu was donated to Apache. This shows that since ShenYu joined the Apache incubator, its popularity has increased, with more contributors and users.

The ShenYu community follows the principle that community is greater than code, and its ecosystem is rich. It supports almost all mainstream RPC frameworks and is growing rapidly.

ShenYu Observability Log

The preceding figure shows the observability log architecture of ShenYu.

A user request passes through Apache ShenYu and is logged by the Logging-RocketMQ plugin. The Logging-RocketMQ plugin puts Access Log into the buffer queue. A log consumer is created in the backend to asynchronously consume the log and then send the log to the RocketMQ cluster. In order to collect logs, a RocketMQ Consumer is required to persist logs in batches. You can write logs to Elasticsearch, ClickHouse, or other databases.

You can select Kibana, Grafana, Loki, or connect to the internal log system of the company for visualization. You can aggregate and analyze logs to generate alerts for logs. For example, you can generate alerts for the number of requests, request duration, and request exceptions.

Important parameters in logs are the input and output parameters, which can help you troubleshoot faults. Under the hood, ShenYu is built on Reactor's asynchronous, non-blocking model. It is a reactive gateway that follows the publish-subscribe pattern, so the request body and response body can only be retrieved once. If the log plugin retrieves the body, the following subscribers will not be able to retrieve it. Because Access Log collection is an auxiliary function, it must be completely free of side effects.

The decorator pattern is used to collect input parameters: a class that inherits from ServerHttpRequestDecorator wraps the request and overrides the getBody method in a side-effect-free manner. Operators whose names start with do have no side effects on the stream, so the log collection code is inserted in these operators. At the same time, the DataBuffer must be read through a read-only buffer to avoid side effects. Because the body is read as a stream, the doFinally callback indicates that the request body has been fully read.

The collection of output parameters also uses the decorator pattern: a class that inherits from ServerHttpResponseDecorator wraps the response and overrides the writeWith method. Similarly, in the writeWith method, the body parameter is converted first, the response body is collected in doOnNext, and doFinally indicates that the response body has been fully collected. At this point, the log can be sent to the buffer queue.
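The read-only requirement can be illustrated with plain java.nio buffers. The sketch below assumes the body bytes are available as a ByteBuffer. SketchBodyPeek and peek are hypothetical names, and the snippet deliberately avoids Spring and ShenYu classes, so it only shows the buffer behavior, not the decorator itself.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper that copies body bytes for logging without consuming them. */
final class SketchBodyPeek {
    /**
     * Reads the remaining bytes through a read-only view. The view shares the
     * content but has its own position and limit, so the original buffer is
     * left untouched for the real subscriber.
     */
    static String peek(ByteBuffer body) {
        ByteBuffer view = body.asReadOnlyBuffer();
        byte[] copy = new byte[view.remaining()];
        view.get(copy); // advances only the view's position
        return new String(copy, StandardCharsets.UTF_8);
    }
}
```

After peek returns, the position of the original buffer is unchanged, which is the property the log plugin relies on when it observes a body that another subscriber still has to consume.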

When an exception occurs, ShenYu provides a global exception handler, which also handles exceptions in log collection.

The troubleshooting process is listed below:

① A Metrics exception alert is received, but it is often impossible to know which node has an error.

② Associate the corresponding error Trace. The main purpose of the Trace is to locate the faulty node.

③ View the error link and locate the source.

④ Locate exception logs at the source.

The preceding process shows that logs and Traces need to interact with each other. All logs generated by one request must be concatenated using Trace IDs.

The Access Log records the request duration but cannot accurately show which nodes take a long time. It also records request failures but cannot accurately show which nodes fail. In addition, there are various kinds of errors, and more observability data can speed up troubleshooting.

The ShenYu log plugin can be associated with Tracing Analysis in two ways:

To associate the ShenYu log plugin with Tracing Analysis, the Tracing Analysis plugin (such as SkyWalking) must have access to the ServerWebExchange. The plugin saves the Trace ID in the context of the ServerWebExchange, and the ShenYu log plugin reads the Trace ID from there to implement the association.
If you only use the SkyWalking log toolset, the association is built in. You configure the Trace ID variable in the log configuration file, and SkyWalking uses bytecode enhancement technology to intercept the convert method and write the Trace ID to the output log.

As shown in the figure on the right, each log entry has a Trace ID field. Click the Trace ID field to view all links that the request passes through.

The asynchronous log collection solution has two requirements:

Performance: The gateway is required to achieve the ultimate performance. Therefore, log collection and all auxiliary functions cannot affect the gateway performance.
Resources: The gateway processes a large number of requests with high concurrency. Log collection requires the lowest possible resource consumption to ensure no side effects on the gateway.

An Access Log is generated for each request. ShenYu first determines whether the log meets the configured conditions. The conditions are based on certain fields, which are configured in the Admin and sent to the Logging-RocketMQ plugin. In addition, you can configure sampling. If the log does not meet the sampling requirements, it is discarded.

After these checks, the plugin determines whether the cache queue has sufficient capacity. If the queue is full or the downstream is abnormal, the log is also discarded. If logs were sent directly to the RocketMQ cluster, each log would trigger an I/O call, which is time-consuming. If logs are put into an in-memory cache queue instead, the time cost is negligible. Therefore, a cache queue is introduced here.

The log consumer in the backend continuously retrieves logs from the cache queue. The retrieved logs can be compressed and are sent in OneWay mode. Compression is configurable.
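A minimal sketch of such a bounded buffer is shown below, using only the JDK. SketchLogBuffer, collect, and drainTo are hypothetical names, logs are modeled as plain strings, and the sender is passed in as a callback that stands in for compression and the one-way send. This is not ShenYu's actual collector.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

/** Hypothetical bounded in-memory buffer between the request path and the log sender. */
final class SketchLogBuffer {
    private final BlockingQueue<String> queue;

    SketchLogBuffer(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Request path: never blocks. Returns false when the queue is full and the log is dropped. */
    boolean collect(String accessLog) {
        return queue.offer(accessLog);
    }

    /**
     * Consumer side: moves up to maxBatch logs out of the queue and hands them
     * to the sender in one call. Returns the number of logs handed over.
     */
    int drainTo(Consumer<List<String>> sender, int maxBatch) {
        List<String> batch = new ArrayList<>(maxBatch);
        queue.drainTo(batch, maxBatch);
        if (!batch.isEmpty()) {
            sender.accept(batch); // stand-in for "compress and send one-way"
        }
        return batch.size();
    }
}
```

In a real deployment, a background thread would call drainTo in a loop. The important property is that collect uses a non-blocking offer, so a full queue costs the gateway a dropped log rather than a stalled request.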

There are many ways to implement sampling. A common approach is to generate a random number for each request and take its remainder to determine whether the request log needs to be sampled. However, the ShenYu gateway does not do this because generating random numbers is a time-consuming operation, and the gateway has requirements for high performance.

ShenYu log sampling is implemented with a bitmap. Set the bitmap to 100 bits and set the sampling percentage. Then, set that proportion of bits to true and the others to false, and randomly shuffle them. Maintain an auto-increment counter in memory. For each request, increment the counter, take the remainder of the counter divided by the bitmap size, and check whether the corresponding bit is set. If it is set, the log is sampled. This avoids the cost of generating a random number for each request.
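The bitmap approach can be sketched with java.util.BitSet. SketchBitmapSampler is a hypothetical class written for this article, not ShenYu's sampler, and it assumes the percentage is an integer from 0 to 100.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical bitmap-based sampler: randomness is paid once, at construction. */
final class SketchBitmapSampler {
    private static final int SIZE = 100;
    private final BitSet bits = new BitSet(SIZE);
    private final AtomicLong counter = new AtomicLong();

    SketchBitmapSampler(int percent) {
        if (percent < 0 || percent > SIZE) {
            throw new IllegalArgumentException("percent must be between 0 and 100");
        }
        // Shuffle the 100 positions once, then mark the first `percent` of them.
        List<Integer> positions = new ArrayList<>(SIZE);
        for (int i = 0; i < SIZE; i++) {
            positions.add(i);
        }
        Collections.shuffle(positions);
        for (int i = 0; i < percent; i++) {
            bits.set(positions.get(i));
        }
    }

    /** Per request: one atomic increment, one remainder, one bit lookup. */
    boolean sample() {
        int index = (int) Math.floorMod(counter.getAndIncrement(), (long) SIZE);
        return bits.get(index);
    }
}
```

With a 30 percent setting, every window of 100 consecutive requests samples exactly 30 of them, and which 30 depends on the one-time shuffle.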

Fields in log collection include the client IP address, time, method, request header, response header, query parameter, request body, request URL, response body, response content length, RPC type, status, and upstream IP address. You can use these fields to determine which requests are time-consuming, count specific types of requests, and find request exceptions. When an exception occurs, you can use the upstream IP address to locate the upstream node where the exception occurs. In many cases, not every node in the cluster has a problem; only a few nodes may be faulty. As such, the upstream IP address is very valuable for troubleshooting.

A major challenge in log collection is how to obtain the IP address of the gRPC service provider in the log plugin. Unlike HTTP and Spring Cloud, gRPC performs load balancing at the bottom layer, so the IP address cannot be accessed at the business layer and cannot be associated with the reactive ServerWebExchange. Tracing Analysis also needs to obtain the upstream IP address. How does SkyWalking achieve this?

SkyWalking uses bytecode enhancement technology, which is non-intrusive, to solve the problem that the business layer cannot access IP addresses. However, obtaining the IP address is still not easy. The conventional method is to obtain the peer through Channel.authority(), but this method has several limitations:

It cannot adapt to the load balancing scenario.
It cannot adapt to the generic call scenario. The gateway uses generic calls, and generic calls do not have a channel.
It cannot adapt to the scenario of domain name resolution.

After analysis, the IP address corresponding to the underlying gRPC client stream can be obtained from the Netty client, which is the authority. However, due to initialization latency, the first call to the method returns an empty value. Since gRPC requests have listeners, the address can be read in the onClose method instead. Calling onClose indicates that the request has been sent, which means the client has completed initialization. However, this method is still not accurate enough, so another method is needed.

ClientCallImpl has a getAttributes method that returns the attributes of the Netty client stream. These attributes contain the Remote Addr, which is the upstream IP address. However, ClientCallImpl is package-private and cannot be accessed at the business layer, so you can only access the object and obtain the IP address through reflection.
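The reflection step can be illustrated with a stand-in class. The sketch below does not touch gRPC at all: SketchHiddenCall is a hypothetical class with a private remoteAddr field that plays the role of the inaccessible object, and SketchReflectiveReader shows only the general technique of reading a member that normal code cannot reach.

```java
import java.lang.reflect.Field;

/** Hypothetical stand-in for an object whose state is not accessible to business code. */
final class SketchHiddenCall {
    private final String remoteAddr;

    SketchHiddenCall(String remoteAddr) {
        this.remoteAddr = remoteAddr;
    }
}

/** Reads the private field through reflection. */
final class SketchReflectiveReader {
    static String readRemoteAddr(Object call) {
        try {
            Field field = call.getClass().getDeclaredField("remoteAddr");
            field.setAccessible(true); // bypass the private access check
            return (String) field.get(call);
        } catch (ReflectiveOperationException e) {
            // A missing or unreadable field means the address is unavailable.
            throw new IllegalStateException("remote address is not readable", e);
        }
    }
}
```

Reflection of this kind depends on internal names that can change between library versions, so in practice the lookup is usually guarded so that a failure degrades to a missing upstream IP address in the log rather than a failed request.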

The integration of RocketMQ is based on the following two considerations:

Load Shifting: Trillions of messages may be generated during peak hours. The larger the number of requests, the larger the number of corresponding logs. Without a distributed message queue, the log system may crash or discard a large number of logs. RocketMQ can be used for load shifting. In the face of trillions of message throughput, a large number of logs can be sent to RocketMQ for temporary storage. Then, consumers can continuously consume logs from RocketMQ.
Decoupling: ShenYu is an open-source project. Each company or project may have its own log system. How do they connect ShenYu to these log systems? The answer is decoupling. After integrating with RocketMQ, ShenYu can send logs to message queues. Each project or company can consume logs from RocketMQ based on business characteristics and store the logs in its log system. This decoupling mode makes it easier to connect and maintain systems, and the ShenYu community provides integrations with a variety of message queue products.

RocketMQ provides the following excellent features:

Financial-Grade Reliability: Discarding logs may cause significant statistical problems. For example, the loss of order logs may result in inaccurate transaction volume.
Low Latency: The ShenYu gateway requires high performance. Therefore, log delivery also requires high performance, and RocketMQ offers millisecond-level latency.
Trillions of Message Throughput: Trillions of logs may be generated during peak hours. If the performance of RocketMQ is not strong, it cannot support a large number of logs.
Massive Topics: The ShenYu gateway can be multi-tenant, which means different services can share the gateway cluster, and different logs can be sent to different clusters, which can achieve better isolation.
Ultra-Large-Scale Accumulation: A large number of messages will be accumulated in RocketMQ during peak business hours. However, consumers cannot consume all messages in a short time. Therefore, the message queue must be able to support ultra-large-scale accumulation.

When you use RocketMQ to collect logs, the gateway still requires high performance. Therefore, logs must be sent with the lowest possible latency. The OneWay sending method is similar to UDP: it does not wait for the broker to return a confirmation. It has the maximum throughput but may lose data in extreme cases, which is acceptable for logs.

In terms of configuration management, the Admin backend configures topics and parameters. Configurations are issued to the client to control which APIs are collected, the sample rate, and various filtering conditions (such as the maximum size of the body). Oversized bodies are discarded. You can adjust the log collection policy in real time. In addition, log compression is supported, and you can enable or disable log collection in real time.

Cluster deployment is required for RocketMQ because the availability of single-node deployment is low. Cluster deployment can achieve lower latency and support trillions of message throughput, massive topics, and ultra-large-scale accumulation. These are the benefits of cluster deployment over single-node deployment.

After logs are collected, you need to consume the logs, mainly through visualization. ShenYu is an open-source project that can be connected to various open-source projects (such as Kibana, Grafana, and Loki). Grafana provides all-in-one support for Metrics, Logging, and Trace in terms of observability.

Visualization focuses on the request volume, request duration, and request exceptions of each interface, as well as network aspects (such as byte throughput and the number of sent/received bytes). You can also perform aggregation operations and configure alerts on the results (such as determining which interfaces have a sudden increase in request volume, duration, or exceptions within a period).
