By Xinyu Jia
At the 2024 Apsara Conference, Jia Xinyu, a senior technical expert at Alibaba Cloud, gave a comprehensive analysis of the key technologies, advantages, and practical value of Elasticsearch Serverless for intelligent log analysis, and of how it enables efficient, cost-effective log data analysis. The content is divided into four parts: the core pain points of log analysis scenarios, an introduction to the Serverless log analysis capabilities, an interpretation of the key technologies, and a quick start.
As we all know, Elasticsearch (ES) is widely used in enterprise search and log scenarios, and many enterprises run ELK or EFK stacks as their log solution. Alibaba Cloud also has a large number of users, on the public cloud and within the group, who use ES as their log analysis engine. From long-term communication with these users, we found that the core pain point of log analysis scenarios is cost-effectiveness.
This pain point manifests in four specific ways.
Since Alibaba Cloud Elasticsearch went online, we have been working hard to solve these problems:
● In 2021, Alibaba Cloud launched the Indexing Service. Through read/write separation and write pooling, write throughput improved by 800% at the same specifications, and pay-by-traffic billing for writes was introduced, so users no longer need to worry about write performance.
● In 2022, Alibaba Cloud launched the OpenStore storage-compute separation architecture. Through multi-level intelligent caching and query optimization, storage costs dropped by 70% without degrading query performance, and storage became billed by actual volume.
● In 2023, building on the characteristics of log scenarios, Alibaba Cloud combined Indexing Service and OpenStore to launch the Serverless LogHouse log service internally, further shielding users from complex configuration and achieving out-of-the-box usage, with excellent results.
● In 2024, we launched Serverless log analysis on the public cloud to help cloud users reduce costs and increase efficiency.
So what results has the internal LogHouse achieved? Compared with the previous full log observability solution, we reduced costs by an average of 51.44%, saving the company millions of yuan per month. In the Cainiao and Amap (Gaode) businesses, costs fell by 78.04% and 64.86% respectively. The effect is remarkable.
What capabilities does Serverless log analysis provide to help users reduce costs and increase efficiency? It mainly relies on four key capabilities.
● Open-source compatible and out of the box. All of our APIs and SDKs are consistent with open source, so users can use them directly without any code changes or special SDKs, greatly reducing access costs.
● High performance at low cost. Through customized optimization for log analysis scenarios, we have greatly improved single-core processing power while significantly reducing storage costs.
● True pay-as-you-go. Unlike the earlier pay-by-traffic model, the Serverless version charges by actual CPU usage (CU), which brings substantial cost savings compared with paying by traffic.
● Intelligent scheduling, free of O&M. In the Serverless scenario, users do not need to care about cluster operations, or even about index shards and replicas. They only need to decide the retention period in days and the table structure, which greatly reduces O&M and configuration work.
On Serverless log analysis, you can continue to use the familiar APIs, including the native open-source ES APIs, directly. To stay fully compatible with open source, we do not even use concepts such as data streams or ILM from X-Pack; instead, we extend the native index directly. By adding special index settings such as retention days and write optimization, users can use indexes directly without replacing the SDK or adapting to special APIs. In addition, we provide a simple, easy-to-use console page, so even users unfamiliar with the API can easily create and modify indexes.
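To make the open-source compatibility concrete, here is a minimal sketch using the standard elasticsearch-py client. The endpoint, credentials, and the `index.retention.days` setting name are hypothetical placeholders standing in for the product's actual retention configuration:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(
    "https://my-app.es-serverless.aliyuncs.com:9200",  # hypothetical endpoint
    basic_auth=("user", "password"),                   # hypothetical credentials
)

# Plain native index API -- no data streams, no ILM policies.
es.indices.create(
    index="app-logs",
    settings={
        "index.retention.days": 7,  # hypothetical retention-days setting
    },
    mappings={
        "properties": {
            "@timestamp": {"type": "date"},
            "message": {"type": "text"},
            "level": {"type": "keyword"},
        }
    },
)
```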
As the following figure shows, the light blue on the far left is Serverless, the dark blue in the middle is the PaaS log-enhanced edition with Indexing Service, and the orange is open-source self-built under best practices.
On our benchmark dataset, the log-enhanced edition more than doubled the open-source best practice, from 2 MB/s to 4.32 MB/s, and the Serverless version more than doubled that again, reaching 9.25 MB/s. Overall, it is roughly 4.5 times the open-source best practice.
In terms of compression ratio, the Serverless version improves it from 1.93 to 5.8, roughly a threefold gain. Note that 1.93 is what open-source self-built achieves under best practices. Without deep ES tuning experience, using ES out of the box yields a ratio closer to 0.8, not 1.93, and the cost becomes very high.
Previously, 1 GB of benchmark data required 530 MB of space in open-source ES; now it needs only 176 MB. These optimizations let us achieve higher write performance and lower storage costs with the same resources.
What is the difference between paying by CU and paying by traffic? As shown in the figure, for the same 1 GB of traffic, the computing resources consumed by different table structures vary significantly, from a maximum of 110 CU down to a minimum of 46 CU, a nearly 2.4-fold difference. Therefore, in most scenarios, paying by CU saves nearly 50% compared with paying by traffic.
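As a back-of-the-envelope illustration of why this matters, the snippet below compares the two billing models for the same 1 GB of logs. The unit prices are invented placeholders; the 110 CU and 46 CU figures come from the example above:

```python
PRICE_PER_GB = 0.50   # hypothetical pay-by-traffic unit price
PRICE_PER_CU = 0.005  # hypothetical pay-by-CU unit price

heavy_schema_cu = 110  # CUs consumed by a complex table structure
light_schema_cu = 46   # CUs consumed by a simple table structure

# Pay-by-traffic charges both schemas the same, effectively priced for the worst case.
traffic_cost = 1 * PRICE_PER_GB

print(f"by traffic:          {traffic_cost:.3f}")
print(f"by CU, heavy schema: {heavy_schema_cu * PRICE_PER_CU:.3f}")
print(f"by CU, light schema: {light_schema_cu * PRICE_PER_CU:.3f}")
```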
This year, in addition to implementing pay-per-use for storage and writing, we have also achieved 100% actual usage-based payment for queries. As you can see in the image, the upper right shows the CPU usage for queries, while the lower right shows the QPS for queries. The two curves are essentially identical, which indicates that we are charging based on 100% of the usage.
The three lines above represent different metrics: the bottom line shows the actual cost incurred by users, the middle line indicates our quota limit, and the top green line represents our resource line. To provide a clearer visualization, we have set the resource line to a limit of 120, but in actual usage, there is no upper limit. As you can see, in a scenario where business levels are steadily increasing, it is almost impossible to encounter our quota limit. Overall, this is transparent and seamless for users. In addition to automatic resource scaling, we will also use automated methods to adjust cluster and index configurations to ensure that our indexes are always in an optimal state.
The overall architecture is shown in the diagram. On the left is our data plane, and on the right is our control plane.
The service layer forwards the corresponding traffic to the computing layer for read/write and metadata services, while the data layer at the bottom is a shared distributed storage. The control plane on the right includes the application management system, which is responsible for application lifecycle management, index metadata management, quota management, and the intelligent operation and maintenance system, which handles elastic decision-making and some internal automated operations. Open-source compatibility is achieved through the packaging of this service layer.
First, let's look at the engine architecture in the diagram. We employ a read/write separation architecture, routing read and write requests to the corresponding service nodes through the gateway layer. We then use OpenStore for storage-compute separation: data does not reside locally and does not require real-time replica synchronization; instead, it is pulled from OSS at scheduled intervals. This architecture means an index is built only once, local storage is no longer a bottleneck, and rapid elasticity becomes possible.
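The routing idea can be sketched in a few lines. The node pools and the routing rule below are illustrative stand-ins, not the product's actual gateway code:

```python
import random

WRITE_NODES = ["index-node-1", "index-node-2"]   # build segments, upload to OSS
READ_NODES = ["search-node-1", "search-node-2"]  # pull segments from OSS on schedule

def route(method: str, path: str) -> str:
    """Pick a backend node for an incoming REST call (toy rule)."""
    is_write = method in ("PUT", "POST") and "_search" not in path
    pool = WRITE_NODES if is_write else READ_NODES
    return random.choice(pool)

print(route("POST", "/app-logs/_doc"))     # -> an indexing node
print(route("POST", "/app-logs/_search"))  # -> a search node
```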
Next, we optimized the underlying ECS by switching from Intel x86 to the Yitian ARM architecture, which allowed us to raise the safety threshold from 30% to 50%, improving resource utilization by 66%. Why can we raise the safety threshold? In the three charts below, the dashed red line is the Yitian architecture and the solid yellow line is the x86 architecture. When CPU usage exceeds 40%, performance on x86 declines sharply, while performance on Yitian remains stable. This stability is why we can confidently raise the safety threshold.
In addition to architectural optimizations, we have also made significant enhancements to the ES kernel. To facilitate understanding, we will highlight four main write optimizations.
● In the upper left corner is Source Reuse DocValue. This is quite straightforward: we do not store the original source text but instead reconstruct it from DocValues at retrieval time, which greatly reduces storage space and speeds up writes (see the sketch after this list).
● The upper right corner features our customized LineAnalyzer tokenizer. The native ES tokenizer, due to its general-purpose design, involves a lot of redundant computations. We have rewritten the tokenization logic specifically for log scenarios, resulting in a 20% improvement in single-core performance just from this feature.
● The lower left corner indicates that we use ZSTD compression to replace the native ES compression, and through targeted routing technology, we ensure data continuity, further improving the compression ratio.
● Lastly, in the lower right, we do not create inverted indexes for Keyword or numeric fields, storing only DocValue. During queries, we utilize a Bloom filter for quick pruning, further enhancing write performance.
Through this series of kernel optimizations, we have ultimately achieved a 300% increase in single-core write capability and a 70% reduction in storage size.
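The kernel changes themselves are internal, but the first optimization, reconstructing the source from doc values instead of storing it, can be approximated with stock open-source ES features. A minimal sketch, assuming a local cluster and the elasticsearch-py client:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Disable the stored _source; keyword/numeric/date fields keep doc values by default.
es.indices.create(
    index="lean-logs",
    mappings={
        "_source": {"enabled": False},
        "properties": {
            "@timestamp": {"type": "date"},
            "level": {"type": "keyword"},
            "latency_ms": {"type": "long"},
        },
    },
)

es.index(index="lean-logs", document={
    "@timestamp": "2024-09-20T12:00:00Z", "level": "ERROR", "latency_ms": 42,
})
es.indices.refresh(index="lean-logs")

# Read the fields back from doc values at query time instead of from _source.
resp = es.search(
    index="lean-logs",
    docvalue_fields=["@timestamp", "level", "latency_ms"],
    query={"term": {"level": "ERROR"}},
)
print(resp["hits"]["hits"][0]["fields"])
```

In open source this trades away the full-fidelity source document; the product's kernel performs the reconstruction transparently.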
As we all know, ensuring query performance in a storage-computation separation scenario is a significant challenge. We address this by utilizing concurrent queries, trading space for time. This approach consolidates multiple I/O operations into a single I/O operation, temporarily storing the results in memory, allowing for subsequent direct reads from memory. This method accelerates OSS queries by 200%, as illustrated in the diagram below.
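A toy sketch of this space-for-time trade: instead of issuing many small ranged reads against remote storage, fetch one large block once and serve later reads from memory. `fetch_range` stands in for an OSS ranged GET, and the sizes are illustrative:

```python
REMOTE = bytes(512 * 1024)  # stand-in for an index file stored in OSS
BLOCK = 64 * 1024           # consolidate small reads into 64 KB blocks
cache: dict[int, bytes] = {}

def fetch_range(offset: int, length: int) -> bytes:
    """One simulated round trip to object storage (an OSS ranged GET)."""
    return REMOTE[offset:offset + length]

def read(offset: int, length: int) -> bytes:
    """Serve a small read; assumes it does not cross a block boundary."""
    block_id = offset // BLOCK
    if block_id not in cache:                     # one consolidated I/O ...
        cache[block_id] = fetch_range(block_id * BLOCK, BLOCK)
    start = offset - block_id * BLOCK
    return cache[block_id][start:start + length]  # ... then reads hit memory

# Many adjacent small reads now cost a single remote fetch.
for off in range(0, 1024, 128):
    read(off, 128)
```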
In the lower left corner is the Kibana Discover page. Those who have used ES will know that queries over large time ranges can be very slow, often taking several minutes to refresh the page. For this scenario we implemented a series of custom optimizations, including query pruning and concurrent queries within a shard, which bring the display of billions of records down to under 10 seconds, more than a 10-fold increase in query speed.
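The product implements this concurrency inside the shard, but the general effect can be mimicked from the client side by slicing a wide time range and searching the slices in parallel. The index name and dates below are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def count_slice(start: str, end: str) -> int:
    """Count documents in one time slice of the wide range."""
    resp = es.count(index="app-logs", query={
        "range": {"@timestamp": {"gte": start, "lt": end}}
    })
    return resp["count"]

# Split one week into daily slices and query them concurrently.
days = [(f"2024-09-{d:02d}", f"2024-09-{d + 1:02d}") for d in range(1, 8)]
with ThreadPoolExecutor(max_workers=7) as pool:
    total = sum(pool.map(lambda p: count_slice(*p), days))
print(total)
```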
In the lower right corner, we have implemented adaptive degradation for slow queries to enhance query stability. We have rewritten the thread pool logic of ES and introduced a priority queue that automatically degrades slow queries based on historical query performance, thereby preventing them from affecting faster queries.
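A toy version of latency-based degradation might look like the following; the moving-average smoothing and the queue shape are our own illustrative choices, not the actual rewritten thread-pool code:

```python
import heapq
from collections import defaultdict

history = defaultdict(lambda: 1.0)   # moving-average latency per query pattern
queue: list[tuple[float, str]] = []  # min-heap ordered by expected latency

def record(query_id: str, latency: float, alpha: float = 0.3) -> None:
    """Update the historical latency estimate for a query pattern."""
    history[query_id] = (1 - alpha) * history[query_id] + alpha * latency

def submit(query_id: str) -> None:
    """Historically slow queries get a worse (larger) priority key."""
    heapq.heappush(queue, (history[query_id], query_id))

record("full-scan", 120.0)   # historically slow
record("error-lookup", 0.2)  # historically fast
submit("full-scan")
submit("error-lookup")
print(heapq.heappop(queue))  # the fast query is served first
```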
It is this series of custom optimizations, spanning architecture, host machines, and kernel levels, that has enabled us to achieve a performance that is 4.5 times stronger than open source in single-core capability and 3 times better in compression ratio, resulting in high performance at a low cost.
The left side of the diagram illustrates the pipeline for implementing pay-as-you-go. Before a request begins, we tag it with the corresponding application and index information, then collect statistics through an asynchronous reporting link. For different CPU specifications, we normalize usage through frequency capping and statistical adjustment, so billing does not vary across machine types.
In the upper right corner of the diagram, we have hijacked and recorded the corresponding thread pool status. As you can see, we have hijacked a large number of thread pools, but not all of them. This is because we do not account for system overhead, such as GC (Garbage Collection) and data migration; these costs are not charged to users but are instead borne by our Alibaba Cloud system. This approach actually reduces costs by about 20% compared to charging based on machine CU. Through these mechanisms, we enable users to truly pay only for their actual usage.
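A minimal sketch of the metering idea: measure per-request thread CPU time, tag it with the application and index, and hand the sample to an asynchronous reporter. The cross-machine-type normalization and the actual thread-pool hooks are internal details not shown here:

```python
import queue
import threading
import time

samples: queue.Queue = queue.Queue()

def metered(app: str, index: str, handler, *args):
    """Run a request handler and record only its own CPU time."""
    start = time.thread_time()       # CPU time of this thread only, not GC etc.
    result = handler(*args)
    cpu = time.thread_time() - start
    samples.put((app, index, cpu))   # hand off to the async reporting link
    return result

def reporter() -> None:
    while True:
        app, index, cpu = samples.get()
        # aggregate into CUs and ship to the billing pipeline (not shown)
        print(f"{app}/{index}: {cpu:.6f}s CPU")

threading.Thread(target=reporter, daemon=True).start()
metered("my-app", "app-logs", sum, range(1_000_000))
```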
Operations and maintenance (O&M) can be primarily divided into two types of automation: one for resource scheduling and the other for configuration tuning.
First, resource scheduling aims to keep resource utilization high. Specifically, we rebalance based on shard CU consumption: we introduced shard CU consumption into ES's rebalance strategy, which keeps load levels balanced across nodes. Beyond balancing within an ES cluster, we apply the same method across multiple service clusters to ensure overall equilibrium.
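One simple way to picture CU-aware rebalancing is a greedy placement that always puts the most CU-hungry shard on the currently least-loaded node. The real strategy is wired into ES's rebalance logic, so this is only an illustrative sketch:

```python
import heapq

shard_cu = {"logs-0": 40, "logs-1": 25, "logs-2": 25, "logs-3": 10, "logs-4": 10}
nodes = [(0, "node-a"), (0, "node-b"), (0, "node-c")]  # (load, name) min-heap
heapq.heapify(nodes)

assignment = {}
for shard, cu in sorted(shard_cu.items(), key=lambda kv: -kv[1]):
    load, name = heapq.heappop(nodes)       # least-loaded node so far
    assignment[shard] = name
    heapq.heappush(nodes, (load + cu, name))

print(assignment)  # node loads end up at 40 / 35 / 35 CU
```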
The second aspect is configuration tuning, which aims to ensure that instances reach optimal performance. Configuration includes cluster configuration and index configuration. Since our default configurations may not be suitable for all table structures, we continuously gather basic information from each node, such as CU usage, memory usage, I/O usage, and network throughput, among more than ten metrics. Based on the utilization of these resources and thread conditions, we assess whether the current cluster and index are experiencing bottlenecks. If any bottlenecks are detected, corresponding adjustments will be made.
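The tuning loop can be imagined as a set of threshold rules over the collected metrics. The thresholds and suggested actions below are illustrative guesses, not the product's actual decision logic:

```python
THRESHOLDS = {"cpu": 0.50, "heap": 0.75, "io_util": 0.80}  # illustrative limits

def diagnose(metrics: dict[str, float]) -> list[str]:
    """Flag bottlenecks and suggest configuration adjustments."""
    actions = []
    if metrics["cpu"] > THRESHOLDS["cpu"]:
        actions.append("scale out compute / rebalance hot shards")
    if metrics["heap"] > THRESHOLDS["heap"]:
        actions.append("tune index buffer and cache sizes")
    if metrics["io_util"] > THRESHOLDS["io_util"]:
        actions.append("adjust merge throttling and refresh interval")
    return actions

print(diagnose({"cpu": 0.62, "heap": 0.40, "io_util": 0.85}))
```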
Finally, let me briefly introduce how to get started in just three steps.
Step 1: Open the console and create an application.
Step 2: Fill in the corresponding basic information.
Step 3: Wait for 1 to 2 minutes, and the creation will be completed.
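Once the application is ready, ingestion is just the familiar bulk API against the application's ES-compatible endpoint. A sketch with a hypothetical endpoint and credentials:

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(
    "https://my-app.es-serverless.aliyuncs.com:9200",  # hypothetical endpoint
    basic_auth=("user", "password"),                   # hypothetical credentials
)

# Stream 100 sample log documents through the standard bulk helper.
docs = ({"_index": "app-logs",
         "@timestamp": f"2024-09-20T12:00:{i % 60:02d}Z",
         "message": f"request {i} ok"} for i in range(100))
helpers.bulk(es, docs)
```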
Through the above introduction, you should now have an understanding of the four core capabilities of log analysis-based Serverless and the key technological implementations behind them. We believe that through a high-performance, low-cost, true pay-as-you-go, zero-operations Serverless product, we can help you effectively reduce costs in log scenarios.