Configure an index table for the online search system - OpenSearch

Sample configuration

{
  "realtime": true,
  "cluster_config": {
    "table_name": "my_table"
  },
  "online_index_config": {
    "online_keep_version_count": 2,
    "on_disk_flush_realtime_index": true,
    "enable_async_dump_segment": true,
    "build_config": {
      "build_total_memory": 2048
    },
    "max_realtime_memory_use": 4096,
    "need_read_remote_index": true,
    "need_deploy_index": true,
    "load_config": [
      {
        "file_patterns": [
          "/index/title",
          "/attribute/.*"
        ],
        "load_strategy": "mmap",
        "load_strategy_param": {
          "slice": 409600,
          "lock": true,
          "interval": 2
        },
        "remote": false,
        "deploy": true
      }
    ],
    "speedup_primary_key_reader": true
  },
  "build_option_config": {
    "async_queue_size": 10000,
    "max_recover_time": 600,
    "async_build": true
  }
}

Parameters

realtime: specifies whether to enable the real-time mode. Valid values: true and false.
cluster_config:
- table_name: the name of the index table to be configured.
online_index_config: the online configuration parameters of the index table. The following parameters are included:
- online_keep_version_count: the maximum number of incremental index versions that the online search system retains. If the number of retained versions exceeds this value, the system deletes the earliest versions until the number no longer exceeds this value.
- on_disk_flush_realtime_index: specifies whether to dump real-time indexes to local disks. Valid values: true and false. Default value: false. If you enable this feature and the size of the real-time indexes that are being built exceeds the value of the build_total_memory parameter, the system dumps the indexes to local disks and loads them based on the loading policy specified by the load_config parameter. This allows you to build more real-time indexes with the same amount of real-time memory. We recommend that you enable this feature for applications that update a large amount of data.
- enable_async_dump_segment: specifies whether to enable the asynchronous dump feature for the real-time indexes. Valid values: true and false. Default value: false. If you enable this feature, the index dump does not block the real-time index building. We recommend that you set this parameter to true.
- build_config: the parameters that are used to configure the real-time index building.
  - build_total_memory: the maximum memory size that can be used to build the real-time indexes. Unit: MB. If the memory used exceeds this value, the system dumps the real-time indexes. The on_disk_flush_realtime_index parameter specifies whether the indexes are dumped to memory or to local disks. The value of the build_total_memory parameter must be less than that of the max_realtime_memory_use parameter.
- max_realtime_memory_use: the maximum memory size that the real-time indexes can occupy. Unit: MB. This includes the memory occupied by the real-time indexes that are being built and by the real-time indexes that have been dumped. If the memory occupied by the real-time indexes exceeds this value, real-time index building stops. After the real-time data is applied to the online search system through incremental indexes and index merging, the data is cleared from the real-time indexes and the memory is released. The value of the max_realtime_memory_use parameter must be greater than that of the build_total_memory parameter.
- need_read_remote_index: specifies whether to enable the storage-computing separation feature. Default value: false. After the feature is enabled, Searcher workers can read index data directly from the distributed storage system. This reduces the startup time of Searcher workers, increases their data capacity, and saves costs. However, query performance decreases.
- need_deploy_index: specifies whether to distribute indexes to local disks. Default value: true. To improve retrieval performance, you can distribute indexes to local disks in advance.
- load_config: the loading policy of the index table. The policy specifies how indexes are loaded into memory, which gives you fine-grained control over memory, reduces memory overhead, and improves retrieval performance. For more information, see Configure an index table loading policy. You can use the need_read_remote_index, need_deploy_index, and load_config parameters together to reduce costs and minimize the impact of storage-computing separation on retrieval performance.
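The build_total_memory and max_realtime_memory_use parameters constrain each other, so it can help to check a configuration before you deploy it. The following Python snippet is a minimal sketch of such a check. The function name check_realtime_memory is hypothetical and is not part of OpenSearch. The snippet only assumes the JSON layout shown in the sample configuration.

```python
import json
from typing import List


def check_realtime_memory(config: dict) -> List[str]:
    """Return a list of problems in the real-time memory settings (empty if none).

    Hypothetical helper for illustration; it is not part of OpenSearch.
    """
    online = config.get("online_index_config", {})
    build_mem = online.get("build_config", {}).get("build_total_memory")
    max_mem = online.get("max_realtime_memory_use")

    problems = []
    if build_mem is None or max_mem is None:
        # No default values are assumed for missing parameters,
        # so there is nothing to compare.
        return problems
    if build_mem >= max_mem:
        problems.append(
            "build_total_memory (%s MB) must be less than "
            "max_realtime_memory_use (%s MB)" % (build_mem, max_mem)
        )
    return problems


# Values taken from the sample configuration.
sample = json.loads("""
{
  "online_index_config": {
    "build_config": {"build_total_memory": 2048},
    "max_realtime_memory_use": 4096
  }
}
""")

print(check_realtime_memory(sample))  # prints []
```

An empty list means that the two values satisfy the documented relationship. The check does not verify anything else about the configuration.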
build_option_config: the parameters that are used to configure the process of loading the index table for the online search system.
- async_build: specifies whether to read real-time data and build indexes in asynchronous mode. Default value: false. We recommend that you set this parameter to true in scenarios where data needs to be updated in real time or a large amount of data needs to be updated.
- async_queue_size: the size of the task queue. This parameter is available only when the async_build parameter is set to true. Default value: 1000.
- max_recover_time: the maximum amount of time that the system can spend tracing real-time data from the backtracked message queues. Unit: seconds. When the process of a Searcher worker starts, the system traces data from the backtracked message queues. For example, if you set the max_recover_time parameter to 600, the system traces real-time data for up to 10 minutes after the process starts and the indexes are built. After 10 minutes, the system provides services even if the real-time data is still delayed. If the amount of real-time data is small and the process detects that the real-time data is not delayed, the system immediately sets the status of the process to normal and provides services.
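The max_recover_time behavior described above reduces to one rule: a Searcher worker provides services when its real-time data is no longer delayed or when the time limit is reached, whichever comes first. The following Python snippet is a minimal sketch of that rule. The function ready_to_serve and its arguments are hypothetical names used for illustration and do not reflect the actual implementation.

```python
def ready_to_serve(elapsed_seconds: float, data_delayed: bool,
                   max_recover_time: float) -> bool:
    """Decide whether a worker can provide services.

    Hypothetical model of the documented rule:
    - If the real-time data is not delayed, serve immediately.
    - Otherwise, serve only after max_recover_time seconds have elapsed.
    """
    if not data_delayed:
        return True
    return elapsed_seconds >= max_recover_time


# With max_recover_time set to 600 seconds, as in the sample configuration:
print(ready_to_serve(5, False, 600))    # True: data is already up to date
print(ready_to_serve(300, True, 600))   # False: still delayed, limit not reached
print(ready_to_serve(600, True, 600))   # True: limit reached, serve despite the delay
```

A larger value keeps a worker out of service longer but lets it start with fresher data. A smaller value brings the worker online sooner with possibly stale real-time data.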