
High-cardinality GroupBy Query Acceleration in SLS SQL

This article provides a detailed introduction to the high-cardinality GroupBy query acceleration technology in SLS.

By Hanjie Gu (Zhishao)

1. High-Cardinality GroupBy

In simple terms, high-cardinality GroupBy means grouping and analyzing data whose number of distinct values (distinct count) is extremely high, for example counting, ranking, or computing averages and quantiles for each group.

High-cardinality aggregation calculations are essential in many operational analysis scenarios. They involve grouping and aggregating massive data with unique values to gain insights into user behavior, player paths in games, market trends, or product performance. For example, analyzing the sales distribution of different product categories across regions over a period on an e-commerce platform, or tracking the unique actions and paths of individual players in a game operations scenario, both require handling fields with extremely high cardinality, such as ItemId, RequestId, and TraceId, whose cardinalities can reach tens of millions or even billions.

The challenge is that, depending on data volume and complexity, such analyses can take anywhere from seconds to minutes or even hours of SQL execution time. The perception that high-cardinality GroupBy is too slow has become common among users and is a key optimization focus for many databases and OLAP engines. SLS SQL has also been working continuously on this problem and has implemented corresponding performance optimizations. This article introduces the implementation principles, query acceleration methods, and applicable scenarios in SLS.

2. Implementation Principles of GroupBy

GroupBy is a fundamental aggregation capability in almost all OLAP engines. A distributed computing engine typically uses hash-based partitioning to distribute massive amounts of data across different nodes for grouping (bucketing). Each node then aggregates the data within its groups, sorts or limits the results with a heap (often a PriorityQueue), and finally outputs what the user needs, such as a Top N ranking.

During this process, pre-aggregation may also be used: leveraging data locality, the raw data is partially aggregated (PartialAgg) before being sent to the final aggregation nodes (FinalAgg), which reduces the cost of transmitting data over the network.

Overall, the GroupBy aggregation process generally involves the following four stages: DataSource -> PartialAgg -> FinalAgg -> Output.

In this process, DataSource and PartialAgg are usually executed together, while FinalAgg and Output are handled by other nodes in the distributed network.
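To make this concrete, here is a sketch of a typical Top N GroupBy query written in SLS query syntax (a search condition before the pipe, SQL after it). The query is illustrative only, and the field name anticipates the test schema introduced below: each shard scans its raw logs (DataSource), pre-aggregates local counts per group (PartialAgg), ships the partial results to the final aggregation nodes by hash (FinalAgg), and a heap then produces the Top 100 for Output.

* | SELECT ClientIP, count(*) AS pv
    GROUP BY ClientIP
    ORDER BY pv DESC
    LIMIT 100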


3. High-cardinality GroupBy Query Acceleration in SLS

With the foundational knowledge above, let's dive right into the high-cardinality GroupBy query acceleration in SLS, which can handle data with cardinalities of hundreds of millions and beyond.

To more objectively evaluate and analyze the performance improvements below, we need to first clarify the details of our test data and test cases.

Test data

We used simulated data similar to Nginx service access logs and stored it in a Project/Logstore, with the number of CUs for the SQL instance set to 5000. The schema of the test data is as follows:

{
  RequestId: varchar, /* Test data ensures each RequestId is globally unique */
  ClientIP: varchar,
  Method: varchar,
  Latency: int,
  Status: int,
  ...
}

Test cases

We prepared three test cases, each corresponding to a different business analysis scenario:

  1. High-cardinality single-column aggregation: Perform a GroupBy count on 2.8 billion request logs by the RequestId field (actual cardinality is 2.8 billion).
  2. High-cardinality multi-column aggregation: Perform three-column GroupBy counts on 4.5 billion request logs by the ClientIP, Status, and Latency fields (actual cardinality is 1.5 billion).
  3. Low-cardinality aggregation: Perform a frequency count of TOP 100 on 1.5 trillion request logs by the Latency field (actual cardinality is 7.35 million).
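The exact statements are not shown in this article; assuming standard SLS query syntax, the three cases correspond roughly to the following sketches.

(Case 1 sketch: single-column high-cardinality aggregation)
* | SELECT RequestId, count(*) AS cnt GROUP BY RequestId

(Case 2 sketch: three-column high-cardinality aggregation)
* | SELECT ClientIP, Status, Latency, count(*) AS cnt
    GROUP BY ClientIP, Status, Latency

(Case 3 sketch: low-cardinality TOP 100 frequency count)
* | SELECT Latency, count(*) AS cnt
    GROUP BY Latency ORDER BY cnt DESC LIMIT 100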

Test description

  1. Our system is designed with multi-level caching. To eliminate the influence of caching on the tests, we add a not keyword filter to each query (a sketch follows this list). This ensures that every query undergoes full physical execution, providing a fair comparison of overall performance.
  2. The tests use a real online service (located in Shanghai). The test data is stored in the SLS Logstore. However, due to variations in the number of shards and data distribution characteristics, actual results may vary slightly for different users. Nevertheless, results for similar scales should be comparable.
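To illustrate the first point, one possible form of the cache-busting filter is to exclude a placeholder term that never occurs in the data, so the query cannot reuse cached results and must execute fully. The term below is hypothetical.

(Cache-busting variant of the Case 1 sketch)
not cache_bust_placeholder | SELECT RequestId, count(*) AS cnt GROUP BY RequestId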

Case 1:

High-cardinality single-column aggregation: Perform a GroupBy count on 2.8 billion request logs by the RequestId field (actual cardinality is 2.8 billion).

Firstly, we conducted a benchmark test using the basic SQL mode. The query took approximately 17 seconds.


Secondly, we switched to the enhanced SQL mode and set the session parameter to set session hash_partition_count=40 (the default maximum value is 20). The query time decreased to 10 seconds.
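For reference, session parameters in the enhanced SQL mode are attached to the query itself. The sketch below assumes the usual SLS convention of placing the set session statement in the SQL part, before the SELECT; treat the exact placement as an assumption and follow the SLS documentation.

(Case 1 sketch with a higher FinalAgg parallelism)
* | set session hash_partition_count=40;
    SELECT RequestId, count(*) AS cnt GROUP BY RequestId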

What is the underlying mechanism of this improvement?

As mentioned earlier, the underlying parallelism (DataSource and PartialAgg) was already well distributed. However, with a 2.8 billion-row high-cardinality test dataset where every RequestId is unique, the FinalAgg stage still faces significant aggregation pressure even when the underlying parallelism is sufficient. By default, the parallelism of the FinalAgg stage is related to the number of shards but is capped at 20. We opened up this setting so that users can adjust it dynamically based on their specific needs.


Thirdly, we set the session parameter to set session hash_partition_count=64/128/200, which further reduced the query time to 7 seconds/4.5 seconds/3.7 seconds respectively.


The acceleration effect looks great.

By increasing the parallelism in the FinalAgg stage, we saw the query time drop from 17 seconds to 3.7 seconds. Additionally, we observed that the acceleration effect gradually converges as parallelism increases, due to the added network communication and scheduling overhead. Therefore, we set a system upper limit of 200 for this session parameter, as further increases may not yield significant additional benefits.

Actually, the query time could be further reduced.

Fourthly, we set the session parameter to set session high_cardinality_agg=true, and the query time dropped to 2.1 seconds!


What happened here?

Based on the data characteristics of high-cardinality scenarios, we implemented a different data flow pattern at the underlying level, which significantly improves efficiency, and exposed the corresponding session parameter so that users can enable it as needed.
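Enabling it looks the same as the previous parameter; again, the placement of the set session statement is an assumption based on the usual SLS convention.

(Case 1 sketch with the high-cardinality aggregation path enabled)
* | set session high_cardinality_agg=true;
    SELECT RequestId, count(*) AS cnt GROUP BY RequestId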

In the end, for this high-cardinality single-column aggregation, we reduced the query time from 17 seconds to about 2 seconds, an eightfold acceleration.

Case 2:

High-cardinality multi-column aggregation: Perform multi-column GroupBy counts on 4.5 billion request logs (actual combined cardinality is 1.5 billion).

Firstly, we conducted a benchmark test using the basic SQL mode. The query took approximately 24 seconds.


Secondly, we switched to the enhanced SQL mode and set the session parameter to set session hash_partition_count=40 (the default maximum value is 20). The query time decreased to 11 seconds.


Thirdly, we increased the parallelism by setting the session parameter to set session hash_partition_count=64/128/200, which further reduced the query time to 7.3 seconds/5.9 seconds/5.8 seconds respectively.


Fourthly, we set the session parameter to set session high_cardinality_agg=true, and the query time dropped to 2.9 seconds!


In the end, for this high-cardinality multi-column aggregation, we reduced the query time from 24 seconds to about 3 seconds, again an eightfold acceleration.

Case 3:

Low-cardinality aggregation: Perform a frequency count of TOP 100 on 1.5 trillion request logs by the Latency field (actual cardinality is 7.35 million).

Firstly, we conducted a benchmark test using the basic SQL mode. Although the query only took 4.3 seconds, the result was truncated and not accurate due to data size limitations.


Secondly, we switched to the enhanced SQL mode, and the query took 23.4 seconds to return an accurate result. This demonstrates the power of the enhanced SQL mode when handling extremely large amounts of data.


Thirdly, we set the session parameter to set session hash_partition_count=40/64/128/200 (the default maximum value is 20), and the query latency remained at 22 to 23 seconds. This shows that in low-cardinality aggregation scenarios, increasing the parallelism of the FinalAgg stage has a limited effect on query performance.

Through system monitoring and analysis, we found that the computational pressure was primarily in the PartialAgg stage, and there was no bottleneck in the FinalAgg stage. Therefore, increasing the parallelism in FinalAgg did not significantly improve overall query performance.


Fourthly, we tried the high-cardinality optimization parameter set session high_cardinality_agg=true, but the query resulted in a timeout.

This indicates that the high-cardinality optimization parameter is not suitable for low-cardinality aggregation scenarios. For low-cardinality dimensions over extremely large datasets (1.5 trillion rows in this test), the default pre-aggregation approach (PartialAgg + FinalAgg) remains the most effective choice for efficient computation.

4. Conclusion and Recommendations

This article introduces the principles of high-cardinality GroupBy query acceleration in SLS and designs three typical test cases. Through actual testing on the SLS online service, we demonstrated an eightfold acceleration for high-cardinality aggregation queries and the ability to aggregate 1.5 trillion rows of data in around 20 seconds.

During the testing process, we explained the specific details and principles behind the performance, including why certain improvements work, why queries can be accelerated, and why some accelerations are not as significant.

Here are some practical recommendations for SLS users regarding query optimization:

1.  For small datasets, the default mode is enough.

2.  For large datasets with few shards, enable the enhanced SQL mode, which can significantly increase the underlying parallelism of the data processing.

3.  For high-cardinality aggregation, try the following two session parameters, which may accelerate the query several times:

  • One is set session hash_partition_count=<value>; choose an appropriate value, ideally between 20 and 64, since values that are too high may not provide additional benefits.
  • The other is set session high_cardinality_agg=true/false; whether to enable it depends on how dispersed (high-cardinality) the data is.

4.  For low-cardinality aggregation, the two session parameters above are not suitable, but accurate results and good query performance can already be achieved in the default enhanced SQL mode.
