
Simple Log Service:Accelerate the query and analysis of logs

Last Updated:Jun 25, 2024

You can use the following methods to accelerate the query and analysis of logs.

Increase the number of shards or enable Dedicated SQL

  • You can increase the number of shards to improve data read and write capabilities. However, this method takes effect only for incremental data. Shards are computing resources, and the calculation speed increases with the number of shards. Make sure that the number of data entries scanned per shard does not exceed 50 million. To increase the number of shards, you can split a shard. For more information, see Split a shard. For more information about the billing of shards, see Billing examples of active shards based on billing methods.

  • Dedicated SQL supports more concurrent analysis operations and can scan a larger amount of data. For more information, see Enable Dedicated SQL.
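Conceptually, splitting a shard means choosing a split key inside the shard's hash key range. The following sketch assumes the 32-hexadecimal-digit MD5 key ranges that shards cover; the helper name is illustrative and not part of any SDK:

```python
# Sketch: choose a split key at the midpoint of a shard's hash key range.
# Assumes shards cover ranges of 32-hex-digit MD5 keys; the function name
# is illustrative, not an SDK API.

def midpoint_split_key(begin_key: str, end_key: str) -> str:
    """Return the 32-hex-digit key halfway between begin_key and end_key."""
    begin = int(begin_key, 16)
    end = int(end_key, 16)
    if begin >= end:
        raise ValueError("begin_key must be smaller than end_key")
    mid = (begin + end) // 2
    return format(mid, "032x")  # zero-padded to 32 hex digits

# Example: split a shard that covers the full hash range down the middle.
full_begin = "00000000000000000000000000000000"
full_end = "ffffffffffffffffffffffffffffffff"
print(midpoint_split_key(full_begin, full_end))
# "7fffffffffffffffffffffffffffffff"
```

The returned key can then be passed to a shard-split request as the point at which the original range is divided.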

Shorten the time range and reduce the data amount of a query operation

  • A long time range slows down a query operation.

    To accelerate calculation, shorten the time range of the query.

  • A large amount of data slows down a query operation.

    We recommend that you reduce the amount of data that the query scans.
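As a sketch of the first recommendation, a narrow query window can be computed as Unix timestamps just before a query request is sent; the variable names below are illustrative:

```python
# Sketch: build a narrow query window (the last 15 minutes) instead of a
# broad one. Query APIs of this kind take Unix-timestamp from/to bounds;
# the variable names here are illustrative.
import time

WINDOW_SECONDS = 15 * 60  # query only the last 15 minutes

to_time = int(time.time())
from_time = to_time - WINDOW_SECONDS

print(from_time, to_time)  # pass these as the query's time range
```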

Repeatedly query data

If query results are inaccurate, you can run the same query multiple times. Each time you run the query, the underlying acceleration mechanism analyzes the existing query results, so repeated runs return progressively more accurate results.
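The repeat-until-accurate pattern can be sketched as follows. Here run_query is a hypothetical stand-in for a real query call whose response carries a complete/incomplete progress flag; it is stubbed for illustration:

```python
# Sketch of the "query repeatedly" pattern: rerun the same query until
# the service reports the result as complete. run_query is a stand-in
# for a real query call; here it is stubbed for illustration.

def query_until_complete(run_query, max_attempts=5):
    """Rerun run_query() until it reports completion or attempts run out."""
    result = None
    for _ in range(max_attempts):
        result = run_query()
        if result["complete"]:  # accurate result: stop retrying
            break
    return result

# Stub: pretends the first two runs return partial (inaccurate) results.
attempts = {"n": 0}

def fake_query():
    attempts["n"] += 1
    return {"complete": attempts["n"] >= 3, "count": 100 * attempts["n"]}

final = query_until_complete(fake_query)
print(final)  # the third run returns the complete result
```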

Optimize analytic statements

A time-consuming query statement typically has the following characteristics:

  • The GROUP BY clause is used to group analysis results based on one or more columns of the string type.

  • The GROUP BY clause is used to group analysis results based on more than five columns.

  • Operations that generate strings are included in the statement.

You can use the following methods to optimize analytic statements:

  • Do not include operations that generate strings.

    For example, if you use the date_format function to generate a timestamp string in a specified format, query efficiency is low. We recommend that you use the date_trunc or time_series function instead. The following statement is an example of the inefficient pattern:

    * | select date_format(from_unixtime(__time__), '%H_%i') as t, count(1) group by t
  • Do not group analysis results based on one or more columns of the string type.

    For example, if you use the GROUP BY clause to group analysis results based on one or more columns of the string type, hash values must be calculated for the strings, and this hash calculation can account for more than 50% of the overall calculation workload. Examples:

    • Efficient query statement

      * | select count(1) as pv , from_unixtime(__time__-__time__%3600) as time group by __time__-__time__%3600
    • Inefficient query statement

      * | select count(1) as pv , date_trunc('hour',__time__) as time group by time

    Both query statements calculate the number of logs per hour. However, the second statement converts each timestamp into a string, such as 2021-12-12 00:00:00, and then groups the analysis results based on the strings. The first statement truncates each timestamp to the hour as a number and groups the analysis results based on these numbers; the timestamps are converted into strings only after grouping.

  • If you want to group analysis results based on multiple columns, we recommend that you place the field that has more distinct values before the field that has fewer distinct values.

    For example, if the province field has 13 distinct values and the uid field has 100 million distinct values, we recommend that you place the uid field before the province field in the GROUP BY clause. Examples:

    • Efficient query statement

      * | select province,uid,count(1) group by uid,province
    • Inefficient query statement

      * | select province,uid,count(1) group by province,uid
  • Use approximate functions.

    Approximate functions perform better than their exact counterparts because they sacrifice a small amount of precision for speed. Examples:

    • Efficient query statement

      * | select approx_distinct(ip)
    • Inefficient query statement

      * | select count(distinct(ip))
  • Specify only the columns that you want to query in an analytic statement. Do not query all columns.

    In an analytic statement, we recommend that you query only the columns that are required in the calculation. If you want to query all columns, use the search syntax. Examples:

    • Efficient query statement

      * | select a, b, c
    • Inefficient query statement

      * | select *
  • Place the columns that are not used for grouping in an aggregate function.

    For example, if the values of the userid and username columns match each other one to one, you need to specify only the userid column in the GROUP BY clause and can retrieve the username by using the arbitrary aggregate function. Examples:

    • Efficient query statement

      * | select userid, arbitrary(username), count(1) group by userid
    • Inefficient query statement

      * | select userid, username, count(1) group by userid,username
  • Do not use the IN clause.

    We recommend that you do not use the IN clause in an analytic statement. Instead, use OR conditions in the search statement. Examples:

    • Efficient query statement

      key: a or key: b or key: c | select count(1)
    • Inefficient query statement

      * | select count(1) where key in ('a','b','c')
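The numeric truncation used in the efficient GROUP BY example above, __time__ - __time__%3600, can be checked outside the query language. A minimal sketch:

```python
# Sketch of the numeric time truncation used in the efficient GROUP BY
# example: ts - ts % 3600 rounds a Unix timestamp down to the hour, so
# rows can be grouped by an integer instead of a formatted string.
from datetime import datetime, timezone

def truncate_to_hour(ts: int) -> int:
    return ts - ts % 3600

ts = 1639267384  # 2021-12-12 00:03:04 UTC, for illustration
hour = truncate_to_hour(ts)
print(hour, datetime.fromtimestamp(hour, tz=timezone.utc))
```

Grouping by this integer avoids building and hashing a string per row, which is the source of the overhead described above.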
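To illustrate why approximate functions are faster, the following is a minimal k-minimum-values (KMV) sketch in Python. This is not the algorithm behind approx_distinct (engines of this kind typically use HyperLogLog); it only shows how keeping a small, fixed number of hashes yields a cardinality estimate instead of tracking every distinct value:

```python
# Illustration of why approximate distinct counting is cheap: a
# k-minimum-values (KMV) sketch estimates cardinality from only the k
# smallest hash values, using O(k) memory instead of storing every
# distinct value exactly.
import hashlib
import heapq

def kmv_estimate(items, k=1024):
    """Estimate the number of distinct items from the k smallest hashes."""
    hashes = set()
    for item in items:
        h = int.from_bytes(hashlib.md5(str(item).encode()).digest()[:8], "big")
        hashes.add(h)
    if len(hashes) <= k:
        return len(hashes)  # few distinct values: count is exact
    # If the k-th smallest hash sits at fraction p of the 64-bit hash
    # space, roughly (k - 1) / p distinct values were hashed.
    kth = heapq.nsmallest(k, hashes)[-1]
    return int((k - 1) * (2 ** 64) / kth)

est = kmv_estimate(range(10000))
print(est)  # close to 10000, but not exact
```

An exact count(distinct ...) must keep every distinct value in memory and compare against it; the sketch trades a few percent of precision for bounded memory and less work per row.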