OpenSearch: Configure an index table loading policy

Last Updated: Feb 28, 2024

Overview

An index table loading policy is a set of loading policies for index files and describes how each index file of the table is loaded. When the system loads an index table, it checks each index file against the configured policies in order and applies the first policy whose file_patterns match the file.

Sample configuration

{
    "load_config":[
        {
            "file_patterns":[
                "_ATTRIBUTE_",
                "/index/title/.*",
                "/index/body/dictionary"
            ],
            "load_strategy":"mmap",
            "lifecycle":"hot",
            "load_strategy_param":{
                "lock":true,
                "partial_lock":true,
                "advise_random":false,
                "slice":4194304,
                "interval":2
            },
            "remote" : false,
            "deploy" : true,
            "warmup_strategy":"sequential"
        },
        {
            "file_patterns":[
                "_SUMMARY_"
            ],
            "load_strategy":"cache",
            "load_strategy_param":{
                "global_cache":false,
                "direct_io":true,
                "cache_size":4096
            },
            "remote" : true,
            "deploy" : false
        },
        {
            "warmup_strategy":"none",
            "file_patterns":[
                ".*"
            ],
            "load_strategy":"mmap",
            "load_strategy_param":{
                "lock":false
            }
        }
    ]
}
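
The first-match rule described in the overview can be sketched in Python as follows. This is an illustration only, not the engine's implementation. The find_policy function and the MACROS table are hypothetical names, the macro expansions are copied from the Parameters section of this topic, and anchoring each pattern at the start of the relative path (re.match) is an assumption about how the patterns are applied.

```python
import re

# Built-in macros and their expansions as listed in the Parameters section.
MACROS = {
    "_ATTRIBUTE_": "/attribute/.*",
    "_INDEX_": "/index/.*",
    "_SUMMARY_": "/summary/",
}

def find_policy(load_config, relative_path):
    """Return the first policy whose file_patterns match relative_path, or None."""
    for policy in load_config:
        for pattern in policy["file_patterns"]:
            regex = MACROS.get(pattern, pattern)
            # Assumption: a pattern matches if it matches from the start of the path.
            if re.match(regex, relative_path):
                return policy
    return None

# A reduced version of the sample configuration above.
load_config = [
    {"file_patterns": ["_ATTRIBUTE_", "/index/title/.*", "/index/body/dictionary"],
     "load_strategy": "mmap"},
    {"file_patterns": ["_SUMMARY_"], "load_strategy": "cache"},
    {"file_patterns": [".*"], "load_strategy": "mmap"},
]

policy = find_policy(load_config, "/index/body/posting")
print(policy["file_patterns"])  # falls through to the catch-all ".*" policy
```

Under these assumptions, /index/body/dictionary is handled by the first policy, while /index/body/posting matches only the final catch-all policy. This is why the .* policy is placed last.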

Parameters

  • file_patterns: the regular expressions that are used to match index files to this loading policy. Each pattern is matched against the file path relative to the segment directory. For more information about the directory structure of index tables, see the Directory structure of index files section of this topic. For example, to set an independent loading policy for the inverted index named title, use the pattern /index/title/.*. The files of the title index are stored in the title directory under the index directory, and .* matches all files in that directory. The following built-in macros are provided to simplify the patterns:

    • _ATTRIBUTE_: equivalent to /attribute/.*, which indicates all forward indexes.

    • _INDEX_: equivalent to /index/.*, which indicates all inverted indexes.

    • _SUMMARY_: equivalent to /summary/, which indicates all summary indexes.

  • load_strategy: the loading policy. Valid values: mmap and cache.

  • load_strategy_param: the parameters that are used to configure the loading policy.

    • Parameters that are used to configure the mmap loading policy

      • lock: specifies whether to enable the lock mode for the mmap loading policy. Default value: false. After the lock mode is enabled, the indexes are locked in memory and are not swapped out. This ensures query performance but increases memory overhead.

      • partial_lock: specifies whether to enable the partial lock mode for inverted indexes. Default value: false. After the partial lock mode is enabled, only the first-level dictionary of inverted indexes is locked into the memory. The second-level dictionary is not locked to save memory.

      • advise_random: specifies whether to reduce the number of read-ahead requests to the disk. Default value: false. In scenarios where the indexes are oversized and some indexes cannot be loaded into the memory, disk I/O may be a performance bottleneck for queries. If you set this parameter to true, the number of read-ahead requests to the disk can be significantly reduced and the query performance can be improved.

      • slice and interval: control the speed at which indexes are prefetched and loaded. The system reads the amount of data specified by slice at a time and then sleeps for the period specified by interval. Unit of slice: bytes. Unit of interval: milliseconds. The two parameters must be used together. Default value of slice: 4194304. Default value of interval: 0, which indicates that throttling is disabled.

    • Parameters that are used to configure the cache loading policy

      • direct_io: specifies whether to read files in Direct I/O mode. Default value: false. If you read data from an SSD in Direct I/O mode, query performance is improved.

      • global_cache: specifies whether to enable the global block cache. Default value: false. The size of the global block cache is specified by using an environment variable. The global block cache is unavailable, so we recommend that you set this parameter to false.

      • cache_size: the size of the block cache. This parameter takes effect only when the global_cache parameter is set to false. Default value: 1. Unit: MB.

      • block_size: the size of a block. Default value: 4096. Unit: bytes.

  • remote: specifies whether to read the index files that match the value of the file_patterns parameter from the remote distributed storage system. Valid values: true and false. This parameter takes effect only when the need_read_remote_index parameter is set to true. If the need_read_remote_index parameter is set to false, the remote parameter is always treated as false.

  • deploy: specifies whether to distribute the index files that match the value of the file_patterns parameter to local disks. Valid values: true and false. This parameter takes effect only when the need_deploy_index parameter is set to true. If the need_deploy_index parameter is set to false, the deploy parameter is always treated as false.

  • warmup_strategy: the prefetching policy. This parameter takes effect only when the load_strategy parameter is set to mmap. The default value is none, which indicates that the system does not prefetch data. To prefetch data, set the parameter to sequential. A value of sequential specifies that the system prefetches data in sequence.
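
The defaults listed above can be applied to a policy as in the following Python sketch. It is a reading aid rather than the engine's logic. The resolve_policy function is a hypothetical helper, and treating load_strategy as a required key is an assumption because this topic does not state a default for it.

```python
# Defaults as documented in the parameter list above.
MMAP_DEFAULTS = {"lock": False, "partial_lock": False, "advise_random": False,
                 "slice": 4194304, "interval": 0}
CACHE_DEFAULTS = {"direct_io": False, "global_cache": False,
                  "cache_size": 1, "block_size": 4096}

def resolve_policy(policy):
    """Return (load_strategy_param with defaults applied, effective warmup_strategy)."""
    strategy = policy["load_strategy"]  # Assumption: always specified.
    if strategy == "mmap":
        defaults = MMAP_DEFAULTS
    elif strategy == "cache":
        defaults = CACHE_DEFAULTS
    else:
        raise ValueError(f"unsupported load_strategy: {strategy!r}")
    params = {**defaults, **policy.get("load_strategy_param", {})}
    # warmup_strategy takes effect only for mmap; otherwise it is treated as none.
    warmup = policy.get("warmup_strategy", "none") if strategy == "mmap" else "none"
    if warmup not in ("none", "sequential"):
        raise ValueError(f"unsupported warmup_strategy: {warmup!r}")
    return params, warmup

params, warmup = resolve_policy({
    "load_strategy": "mmap",
    "load_strategy_param": {"lock": True},
    "warmup_strategy": "sequential",
})
print(params["lock"], params["interval"], warmup)
```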

Examples

Example of the mmap loading policy

{
    "load_config":[
        {
            "file_patterns":[
                "/attribute/price/.*", # The attribute field named price.
                "/index/title/.*", # The inverted index named title.
                "/index/body/dictionary", # The dictionary of the inverted index named body.
                "/index/vector/aitheta.*" # The vector index named vector.
            ],
            "load_strategy":"mmap",
            "load_strategy_param":{
                "lock":true, # The lock mode is enabled for the mmap loading policy.
                "partial_lock":true, # The partial lock mode is enabled. Only the first-level dictionary of inverted indexes is locked.
                "slice":4194304, # During the prefetching, the system reads 4 MB of data at a time at an interval of 2 ms.
                "interval":2
            },
            "remote" : false, # The system does not read the index files that match the value of the file_patterns parameter from the remote distributed storage system.
            "deploy" : true, # The system distributes the indexes to local disks.
            "warmup_strategy":"sequential" # The system prefetches data in sequence.
        },
        {
            "file_patterns":[
                "/attribute/tags", # The attribute field named tags.
                "/index/description/.*" # The inverted index named description.
            ],
            "load_strategy":"mmap",
            "load_strategy_param":{
                "lock":false
            },
            "remote" : false,
            "deploy" : true,
            "warmup_strategy":"none"
        }
    ]
}
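
To get a feel for the slice and interval values used in this example, the following Python sketch estimates the throttling they impose. The function names are hypothetical. The sketch assumes one sleep per read and ignores the time spent on the reads themselves, so the results are only bounds.

```python
def min_sleep_ms(file_size_bytes, slice_bytes=4194304, interval_ms=0):
    """Lower bound on the total sleep time while prefetching one file."""
    reads = -(-file_size_bytes // slice_bytes)  # ceiling division
    # Assumption: the system sleeps once after every read.
    return reads * interval_ms

def max_prefetch_mib_per_s(slice_bytes, interval_ms):
    """Upper bound on the prefetch rate; unlimited when interval is 0."""
    if interval_ms == 0:
        return float("inf")
    return slice_bytes * 1000 / interval_ms / (1024 * 1024)

# A 1 GiB file with slice=4194304 and interval=2, as in the example above.
print(min_sleep_ms(1024 ** 3, 4194304, 2))   # 512
print(max_prefetch_mib_per_s(4194304, 2))    # 2000.0
```

With these settings, prefetching a 1 GiB file takes 256 reads and sleeps for at least 512 ms in total, and the prefetch rate cannot exceed 2000 MiB per second.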

Example of the cache loading policy

{
    "load_config":[
        {
            "file_patterns":[
                "_ATTRIBUTE_" # All attribute fields.
            ],
            "load_strategy":"cache",
            "load_strategy_param":{
                "global_cache":false, # The global block cache is disabled.
                "direct_io":true, # The system reads files in Direct I/O mode.
                "cache_size":20480 # The cache size is 20 GB.
            },
            "remote" : false, # The system does not read the index files that match the value of the file_patterns parameter from the remote distributed storage system.
            "deploy" : true # The system distributes the indexes to local disks.
        },
        {
            "file_patterns":[
                "/summary/data" # The data files of a summary index.
            ],
            "load_strategy":"cache",
            "load_strategy_param":{
                "global_cache":false,
                "direct_io":true,
                "cache_size":4096
            },
            "remote" : false,
            "deploy" : true
        },
        {
            "warmup_strategy":"none",
            "file_patterns":[
                ".*"
            ],
            "load_strategy":"mmap",
            "load_strategy_param":{
                "lock":false
            }
        }
    ]
}
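
The cache_size and block_size parameters together determine how many blocks a block cache can hold. The following Python sketch shows the arithmetic. The cache_blocks function is a hypothetical helper, and it assumes that 1 MB equals 1024 × 1024 bytes, which is consistent with the comment above that describes 20480 as 20 GB.

```python
def cache_blocks(cache_size_mb=1, block_size_bytes=4096):
    """Number of whole blocks that fit in a block cache of cache_size_mb."""
    # Assumption: 1 MB = 1024 * 1024 bytes.
    return cache_size_mb * 1024 * 1024 // block_size_bytes

print(cache_blocks(20480))  # 5242880 blocks for the attribute policy
print(cache_blocks(4096))   # 1048576 blocks for the summary policy
print(cache_blocks())       # 256 blocks with the default values
```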

Storage-computing separation

# To enable storage-computing separation, set the need_read_remote_index parameter to true.
{
    "load_config":[
        {
            "file_patterns":[
                "/index/title/.*" # The inverted index named title.
            ],
            "load_strategy":"mmap",
            "load_strategy_param":{
                "lock":true, # The lock mode is enabled for the mmap loading policy.
                "partial_lock":true, # The partial lock mode is enabled. Only the first-level dictionary of inverted indexes is locked.
                "slice":4194304, # During the prefetching, the system reads 4 MB of data at a time at an interval of 2 ms.
                "interval":2
            },
            "remote" : false, # The system does not read the index files that match the value of the file_patterns parameter from the remote distributed storage system.
            "deploy" : true, # The system distributes the indexes to local disks.
            "warmup_strategy":"sequential" # The system prefetches data in sequence.
        },
        {
            "file_patterns":[
                "_ATTRIBUTE_" # All attribute fields.
            ],
            "load_strategy":"cache",
            "load_strategy_param":{
                "global_cache":false, # The global block cache is disabled.
                "direct_io":true, # The system reads files in Direct I/O mode.
                "cache_size":20480 # The cache size is 20 GB.
            },
            "remote": true, # The system reads the index files that match the value of the file_patterns parameter from the remote distributed storage system.
            "deploy" : false # The system does not distribute the indexes to local disks.
        },
        {
            "file_patterns":[
                "/summary/data" # The data files of a summary index.
            ],
            "load_strategy":"cache",
            "load_strategy_param":{
                "global_cache":false,
                "direct_io":true,
                "cache_size":4096
            },
            "remote": true, # The system reads the index files that match the value of the file_patterns parameter from the remote distributed storage system.
            "deploy" : false # The system does not distribute the indexes to local disks.
        },
        {
            "warmup_strategy":"none",
            "file_patterns":[
                ".*"
            ],
            "load_strategy":"mmap",
            "load_strategy_param":{
                "lock":false
            }
        }
    ]
}
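
The interaction between the per-policy remote and deploy parameters and the need_read_remote_index and need_deploy_index switches can be summarized in Python as follows. The effective_flags function is a hypothetical helper that restates the rules from the Parameters section. It takes explicit values because this topic does not state what applies when remote or deploy is omitted.

```python
def effective_flags(remote, deploy, need_read_remote_index, need_deploy_index):
    """Return the (remote, deploy) values that actually take effect."""
    # remote is honored only when need_read_remote_index is true.
    effective_remote = remote and need_read_remote_index
    # deploy is honored only when need_deploy_index is true.
    effective_deploy = deploy and need_deploy_index
    return effective_remote, effective_deploy

# The _ATTRIBUTE_ policy above: "remote": true, "deploy": false.
print(effective_flags(True, False, need_read_remote_index=True, need_deploy_index=True))
# The same policy when need_read_remote_index is false.
print(effective_flags(True, False, need_read_remote_index=False, need_deploy_index=True))
```

In the second call, neither flag is in effect, so a policy like this relies on need_read_remote_index being set to true for its files to be read at all.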

Directory structure of index files

  |-- generation_0
      |-- partition_0_65535
          |-- index_format_version
          |-- index_partition_meta
          |-- schema.json
          |-- segment_0
              |-- attribute
                  `-- attribute_name
                      `-- data
              |-- deletionmap
              |-- deploy_index
              |-- index
                  |-- index_name
                      |-- bitmap_dictionary
                      |-- bitmap_posting
                      |-- dictionary
                      `-- posting
                  `-- vector_index_name
                      |-- aitheta.index
                      `-- aitheta.index.addr
              |-- summary
                  |-- data
                  `-- offset
              `-- segment_info
          |-- adaptive_bitmap__meta
              |-- deploy_index
              `-- dictionary_name
          |-- truncate_meta
              |-- deploy_index
              `-- truncate_meta_file
          `-- version.0

| Item | Description |
| --- | --- |
| generation | The identifier that is used by OpenSearch Retrieval Engine Edition to distinguish the versions of full indexes. |
| partition | The basic unit for a Searcher worker to load indexes. If a partition contains an excessive amount of data, the performance of a Searcher worker decreases. You can split online data into multiple partitions to ensure the retrieval efficiency of each Searcher worker. |
| segment | The basic unit of an index. A segment stores data for inverted indexes and forward indexes. The builder generates a segment for each index dump. Segments can be merged based on merge policies. The segments that are available in a partition are displayed in the version field. |
| index | The basic unit of an inverted index. |
| attribute | The basic unit of a forward index. |
| deletionmap | The information about the documents that are deleted. |
| index_format_version | The version of an index. The index version is used to check whether the index file meets the binary requirements. |
| index_partition_meta | The global sorting information of an index. The global sorting information includes sorting fields and sorting orders, such as ascending order and descending order. |
| schema.json | The configuration file that is used to configure indexes. The file contains information about fields, indexes, attributes, and summaries. OpenSearch Retrieval Engine Edition uses this file to load indexes. |
| version.0 | The version number of the index file. This field contains the segment that OpenSearch Retrieval Engine Edition needs to load and the timestamp of the most recent document in the partition. When OpenSearch Retrieval Engine Edition builds indexes for real-time data, the system filters out the outdated original documents based on the timestamps in the incremental index. |
| segment_info | The summary information about segments. In this field, you can obtain information about the number of documents in the segment, whether the segment is merged, the locator information, and the timestamp of the most recent document. |
| dictionary | The dictionary of an inverted index. |
| posting | The posting lists of an inverted index. |
| bitmap_dictionary | The dictionary of high-frequency words if you create a bitmap index for high-frequency words. |
| bitmap_posting | The posting lists of high-frequency words if you create a bitmap index for high-frequency words. |
| aitheta.index | The vector index files. |
| aitheta.index.addr | The metadata of the vector indexes. |
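
Because file_patterns are matched against paths relative to the segment directory, it can help to derive that relative path from a full path in the directory structure above. The following Python sketch shows one way to do so. The relative_to_segment function is a hypothetical helper, and recognizing the segment directory by the name segment_ followed by digits is an assumption based on the segment_0 entry in the tree.

```python
from pathlib import PurePosixPath

def relative_to_segment(path):
    """Return the path below the segment directory, or None if there is none."""
    parts = PurePosixPath(path).parts
    for i, part in enumerate(parts):
        # Assumption: segment directories are named segment_<number>.
        if part.startswith("segment_") and part[len("segment_"):].isdigit():
            return "/" + "/".join(parts[i + 1:])
    return None

print(relative_to_segment(
    "generation_0/partition_0_65535/segment_0/index/index_name/dictionary"))
# /index/index_name/dictionary
```

Files that are stored outside a segment directory, such as schema.json, have no segment-relative path in this sketch, so the function returns None for them.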