aliyun-codec is an index compression plug-in developed by Alibaba Cloud. You can use the plug-in to compress various types of documents in an index at the underlying layer of Elasticsearch. You can also use the source_reuse_doc_values feature provided by the plug-in. The plug-in is suitable for scenarios that involve large write volumes or high index storage costs, such as logging and time series data analysis. In these scenarios, the plug-in can significantly reduce the storage costs of indexes.
Background information
The aliyun-codec plug-in supports various compression algorithms and the source_reuse_doc_values feature. This topic describes how to use the index compression feature and the source_reuse_doc_values feature of the plug-in.
The following description provides information about a performance test performed on the plug-in:
Test environment
- Dataset: the cluster logs of an Alibaba Cloud Elasticsearch cluster.
- Data volume: a single index that stores 1.2 TiB of data and has 22 primary shards.
- Index configuration: Compression is enabled for row-oriented, column-oriented, and inverted documents. The zstd compression algorithm is used for the index.
Test results
Compared with a cluster on which the aliyun-codec plug-in is installed but index compression is not enabled, a cluster on which index compression is enabled delivers the following results:
- Write throughput: remains unchanged.
- Overall index size: reduced by 40%.
- Latency of I/O-intensive queries: reduced by 50%.
Compared with a cluster on which the aliyun-codec plug-in is installed but source_reuse_doc_values is not enabled, a cluster on which source_reuse_doc_values is enabled delivers the following results:
- Write throughput: remains unchanged.
- Overall index size: reduced by up to 40%. The actual reduction depends on the proportion of fields in the index on which the source_reuse_doc_values feature takes effect.
- Latency of I/O-intensive queries: varies with factors such as the proportion of fields on which the source_reuse_doc_values feature takes effect and the disk types of nodes. Your actual test results prevail.
Prerequisites
An Alibaba Cloud Elasticsearch V7.10.0 cluster is created.
For more information, see Create an Alibaba Cloud Elasticsearch cluster.
The kernel version of the Elasticsearch cluster is upgraded based on your business requirements:
- To use the index compression feature, upgrade the kernel version to V1.5.0 or later.
- To use both the index compression and source_reuse_doc_values features, upgrade the kernel version to V1.6.0 or later.
For more information about how to upgrade the kernel version of an Elasticsearch cluster, see Upgrade the version of a cluster.
The aliyun-codec plug-in is installed for the Elasticsearch cluster. By default, the aliyun-codec plug-in is installed for an Elasticsearch V7.10.0 cluster.
You can check whether the plug-in is installed for the cluster on the Plug-ins page in the Elasticsearch console. If the plug-in is not installed, install it first. For more information, see Install and remove a built-in plug-in.
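Alternatively, you can check the installed plug-ins from Kibana Dev Tools by using the standard Elasticsearch catalog API. The exact plug-in name shown in the response is an assumption; look for an entry such as aliyun-codec:

```
GET _cat/plugins?v
```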
Limits
Only Alibaba Cloud Elasticsearch V7.10.0 clusters whose kernel versions are V1.5.0 or later support the index compression feature of the aliyun-codec plug-in. If you use an Alibaba Cloud Elasticsearch V6.7.0 cluster, only the codec-compression plug-in can be used for compression. For more information, see Use the codec-compression plug-in of the beta version.
Only Alibaba Cloud Elasticsearch V7.10.0 clusters whose kernel versions are V1.6.0 or later support the source_reuse_doc_values feature of the aliyun-codec plug-in. By default, the index compression feature is enabled in the default index template aliyun_default_index_template of such clusters. This indicates that the index.codec setting in the default index template of such clusters is set to ali.
Use the index compression feature
- Log on to the Kibana console of your Elasticsearch cluster and go to the homepage of the Kibana console as prompted. For more information about how to log on to the Kibana console, see Log on to the Kibana console.
Note: In this example, an Elasticsearch V7.10.0 cluster is used. Operations on clusters of other versions may differ. The actual operations in the console prevail.
- In the upper-right corner of the page that appears, click Dev tools.
- On the Console tab, run a command to enable index compression.
For example, you can run the following command to enable index compression for an existing index named test:
PUT test/_settings
{
  "index.codec": "ali"
}
By default, after index compression is enabled for the index, the system uses the zstd compression algorithm to compress the row-oriented, column-oriented, and inverted documents in the index.
You can also use different compression algorithms for different types of documents in the index. The following code provides an example on how to use the zstd algorithm to compress row-oriented documents and column-oriented documents but not to compress inverted documents in the test index.
Note: If you want to disable index compression for a specific type of document, set the related parameter to an empty string (""). For example, the index.postings.compression parameter is set to "" in the following code.
PUT test/_settings
{
  "index.codec": "ali",
  "index.doc_value.compression.default": "zstd",
  "index.postings.compression": "",
  "index.source.compression": "zstd"
}
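After you modify the settings, you can confirm that they took effect by querying the index settings with the standard Elasticsearch settings API. The response contains the compression-related values that you configured:

```
GET test/_settings
```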
The following parameters are related to index compression:

index.doc_value.compression.default
- lz4: The lz4 compression algorithm is used to compress column-oriented documents.
- zstd: The zstd compression algorithm is used to compress column-oriented documents.
Note: The aliyun-codec plug-in can compress only the column-oriented documents that contain fields of the number, date, keyword, and ip types.

index.postings.compression
- zstd: The zstd compression algorithm is used to compress inverted documents.

index.source.compression
- zstd: The zstd compression algorithm is used to compress row-oriented documents. The block size is 128 KB.
- zstd_1024: The zstd compression algorithm is used to compress row-oriented documents. The block size is 1,024 KB.
- zstd_dict: The zstd compression algorithm is used to compress row-oriented documents, and the dict feature is used to store data in the documents. zstd_dict provides a higher compression ratio but lower read and write performance than zstd.
- best_compression: The best_compression algorithm provided by open source Elasticsearch is used to compress row-oriented documents.
- default: The default compression algorithm provided by open source Elasticsearch is used to compress row-oriented documents.

index.postings.pfor.enabled
Specifies whether to optimize encoding for inverted documents. Valid values: true and false.
This optimization was introduced in open source Elasticsearch 8.0. It can reduce storage space for the keyword, match_only_text, and text fields by 14.4% and reduce the overall disk size by 3.5%. The aliyun-codec plug-in makes this feature available on Alibaba Cloud Elasticsearch clusters of earlier versions.
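Based on the parameter description above, the optimization can presumably be enabled in the same way as the other index-level settings. This is a sketch; whether the setting can be updated dynamically on an existing index is an assumption, so verify it against your cluster:

```
PUT test/_settings
{
  "index.postings.pfor.enabled": true
}
```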
Use the source_reuse_doc_values feature
Enable the source_reuse_doc_values feature
Run the following command to enable the source_reuse_doc_values feature when you create an index:
PUT test
{
"settings": {
"index": {
"ali_codec_service": {
"source_reuse_doc_values": {
"enabled": true
}
}
}
}
}
You can enable the source_reuse_doc_values feature only when you create an index. The feature cannot be disabled after it is enabled.
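Because the feature can be set only at index creation time, you may want to configure it together with index compression in the same create request. The following sketch combines the settings shown in this topic, assuming both features are available on your cluster:

```
PUT test
{
  "settings": {
    "index": {
      "codec": "ali",
      "ali_codec_service": {
        "source_reuse_doc_values": {
          "enabled": true
        }
      }
    }
  }
}
```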
Modify the configurations related to the source_reuse_doc_values feature
The underlying layer of open source Elasticsearch stores multiple copies of data. For example, data is stored in _source, inverted documents, and doc_values at the same time. The source_reuse_doc_values feature prunes the JSON-formatted data stored in _source to reduce the size of the overall index.
After you enable the source_reuse_doc_values feature, you can modify the configurations related to this feature based on your business requirements.
Modify the threshold for the number of fields on which the source_reuse_doc_values feature can take effect.
If the number of fields on which the source_reuse_doc_values feature takes effect exceeds the threshold that you specify, Elasticsearch reports an error or disables the source_reuse_doc_values feature. The default threshold is 50. You can run the following command to modify the threshold:
PUT _cluster/settings
{
  "persistent": {
    "apack.ali_codec_service.source_reuse_doc_values.max_fields": 100
  }
}
Specify whether Elasticsearch must report an error when the number of fields on which the source_reuse_doc_values feature takes effect exceeds the threshold that you specify. Valid values:
- true: If the number of fields on which the source_reuse_doc_values feature takes effect exceeds the threshold, Elasticsearch reports an error.
- false: If the number of fields on which the source_reuse_doc_values feature takes effect exceeds the threshold, Elasticsearch disables the source_reuse_doc_values feature.
You can run the following command to modify the setting:
PUT _cluster/settings
{
  "persistent": {
    "apack.ali_codec_service.source_reuse_doc_values.strict_max_fields": true
  }
}
Modify the number of concurrent threads used to read the values of fields on which the source_reuse_doc_values feature takes effect.
When you read data from a document, the system uses concurrent threads to read the values of fields on which the source_reuse_doc_values feature takes effect in the document and combines the values. To reduce time costs, you can modify the number of concurrent threads used to read the values of fields on which the source_reuse_doc_values feature takes effect. The default number of concurrent threads is 5. You can run the following command to modify the number:
PUT test/_settings
{
  "index": {
    "ali_codec_service": {
      "source_reuse_doc_values": {
        "fetch_slice": 2
      }
    }
  }
}
Modify the sizes of the thread pool and queue that are used to read the values of fields on which the source_reuse_doc_values feature takes effect.
The default size of the thread pool is the same as the total number of vCPUs of data nodes in the cluster. The default size of the queue is 1,000. You can modify the two configurations only by modifying the YML configuration file of your cluster. For more information about how to modify a YML configuration file, see Configure the YML file. You can add the following configuration information to the YML configuration file of your cluster to modify the configurations:
apack.doc_values_fetch:
  size: 8
  queue_size: 1000