Configure an index table

Overview

The configuration of an index table is one of the most important configurations in OpenSearch Vector Search Edition. When you configure an index table, you can specify the data format of the original document and how to create indexes based on the data in the document. The indexes include inverted indexes, forward indexes, and summary indexes.

Configuration overview

{
    "table_name":"sample",
    "fields":[

    ],
    "indexs":[

    ],
    "attributes":[

    ],
    "summarys":{

    },
    "dictionaries":[

    ],
    "adaptive_dictionaries":[

    ],
    "enable_ttl":true,
    "ttl_field_name":"ttl_field",
    "default_ttl":86400
}

table_name: the name of an index table. This parameter is used to identify an index table. The system generates a configuration file named $table_name_schema.json for the index table.
fields: the fields on which an index is created.
indexs: the configurations of inverted indexes.
attributes: the configurations of forward indexes.
summarys: the configurations of summary indexes.
dictionaries: the dictionaries that are required to create a bitmap index. If you do not create a bitmap index, you do not need to set this parameter. After you create high-frequency dictionaries or adaptive dictionaries for high-frequency terms, index space can be reduced and retrieval performance can be improved.
adaptive_dictionaries: the dictionaries that are required to create an adaptive bitmap index. High-frequency terms and the bitmap index are generated based on the rules that you configure. You can also leave this parameter empty as needed.
file_compress: declares the parameters and aliases for various file compression methods. These methods are used to compress files for forward, inverted, and summary indexes.
enable_ttl: specifies whether to use the time to live (TTL) feature. Expired data is automatically cleaned. The default value is false.
ttl_field_name: the name of the field that stores the TTL of each document in the table. If this field is left empty in the original document, the value of the default_ttl parameter is used. If you do not set the ttl_field_name parameter, the built-in field ops_doc_time_to_live_in_seconds is used. If you set the ttl_field_name parameter, the specified field must be a single-value field of the UINT32 type.
default_ttl: the default TTL. If the enable_ttl parameter is configured and the default_ttl parameter is not configured, the default value std::numeric_limits<int64_t>::max() >> 20 is used. If the default_ttl parameter is configured and the enable_ttl parameter is not configured, the system automatically sets the enable_ttl parameter to true.
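
The defaulting rules for the three TTL parameters can be sketched as follows. This is a minimal illustration of the rules stated above, not the engine's code: resolve_ttl is a hypothetical helper, and the handling of an explicit "enable_ttl":false combined with a configured default_ttl is an assumption, because the text does not cover that case.

```python
# Sketch of the TTL defaulting rules described above (not the engine's code).
INT64_MAX = 2**63 - 1
# Equivalent of the C++ expression std::numeric_limits<int64_t>::max() >> 20
FALLBACK_DEFAULT_TTL = INT64_MAX >> 20
BUILTIN_TTL_FIELD = "ops_doc_time_to_live_in_seconds"


def resolve_ttl(schema: dict) -> dict:
    """Return the effective TTL settings for a schema dict (hypothetical helper)."""
    has_default = "default_ttl" in schema
    if "enable_ttl" in schema:
        # Assumption: an explicit enable_ttl value always wins.
        enabled = bool(schema["enable_ttl"])
    else:
        # default_ttl without enable_ttl implicitly turns TTL on.
        enabled = has_default
    if not enabled:
        return {"enable_ttl": False}
    return {
        "enable_ttl": True,
        "ttl_field_name": schema.get("ttl_field_name", BUILTIN_TTL_FIELD),
        "default_ttl": schema["default_ttl"] if has_default else FALLBACK_DEFAULT_TTL,
    }


if __name__ == "__main__":
    print(resolve_ttl({"default_ttl": 86400}))
    print(resolve_ttl({"enable_ttl": True}))
```

The fallback value evaluates to 2^43 - 1, which is effectively "never expire" for any practical TTL.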

fields configuration

    "fields":[
        {
            "field_name":"title",
            "field_type":"TEXT",
            "analyzer":"chn_standard"
        },
        {
            "field_name":"dup_title",
            "field_type":"TEXT",
            "analyzer":"fuzzy",
            "user_defined_param" : {
                "copy_from" : "title"
            }
        },
        {
            "field_name":"category",
            "field_type":"INTEGER",
            "multi_value":true,
            "compress_type":"uniq|equal"
        },
        {
            "field_name":"mlr_features",
            "field_type":"INTEGER",
            "multi_value":true,
            "updatable_multi_value":true
        },
        {
            "field_name":"feature",
            "field_type":"FLOAT",
            "multi_value":true,
            "fixed_multi_value_count":32,
            "compress_type":"uniq|fp16",
            "updatable_multi_value":true
        },
        {
            "field_name":"user_id",
            "field_type":"INTEGER"
        },
        {
            "field_name":"price",
            "field_type":"INTEGER",
            "enable_null":true,
            "default_null_string":"default_null"
        },
        {
            "field_name":"product_id",
            "field_type":"LONG"
        },
        {
            "field_name":"product_type",
            "field_type":"UINT8",
            "compress_type":"equal"
        },
        {
            "field_name":"bitwords",
            "field_type":"STRING",
            "multi_value":true
        },
        {
            "field_name":"date",
            "field_type":"DATE"
        },
        {
            "field_name":"time",
            "field_type":"TIME"
        },
        {
            "field_name":"timestamp",
            "field_type":"TIMESTAMP",
            "default_time_zone":"+0800"
        }
    ]

field_name: the name of a field.
field_type: the field type. For more information, see Built-in field types in OpenSearch Vector Search Edition.
analyzer: the analyzer used for fields of the TEXT type. An analyzer must be specified for fields of the TEXT type, but cannot be specified for fields of other types. For information about the built-in analyzers supported by OpenSearch Vector Search Edition, see Analyzers. If you want to specify a different analyzer for a field of the TEXT type, extend a new field. You can add the field in the schema and set the user_defined_param parameter. For more information, see the configuration of the dup_title field in the preceding code block.
multi_value: specifies that the field is multivalued. If you specify the field as an attribute, a multivalued attribute is specified. The default value is false.
updatable_multi_value: specifies whether the multivalued field can be updated. The default value is false. You can set the updatable_multi_value parameter for multivalued fields of the following types: INT8, UINT8, INT16, UINT16, INTEGER (32-bit integer), UINT32, LONG (64-bit integer), UINT64, FLOAT, DOUBLE, and STRING. You can also set this parameter for single-value fields of the STRING type. If you set the updatable_multi_value parameter to true, you can also set the u32offset_threshold parameter, whose default value is 0xFFFFFFFFL. In most cases, you can ignore this parameter. If the maximum offset exceeds the value of the u32offset_threshold parameter, the offset file uses 8-byte offsets. Otherwise, the offset file uses 4-byte offsets.
fixed_multi_value_count: the fixed number of values in a multivalued field. If this parameter is configured, the field is a fixed-length multivalued field in the attribute index. Fields of the following types can be configured as fixed-length multivalued attributes: INT8, INT16, INTEGER (32-bit integer), LONG (64-bit integer), UINT8, UINT16, UINT32, UINT64, FLOAT, and DOUBLE. You can also set the fixed_multi_value_count parameter to specify a fixed length for single-value fields of the STRING type. This parameter cannot be set for multivalued fields of the STRING type.
compress_type: the compression method that is used when the field is stored as an attribute. The value can be uniq, equal, or uniq|equal. This parameter is left empty by default, which indicates that files are not compressed.
If you set this parameter to uniq for multivalued attributes or string attributes, data is stored after being compressed by removing duplicates. If you set this parameter to equal, equal-value compression is implemented for offset data.
For single-value attributes of the INTEGER type, you cannot set this parameter to uniq, but you can set it to equal to enable equal value compression. You can also configure equal value compression for single-value attributes of the FLOAT or DOUBLE floating-point type.
Other compression methods: For fixed-length multivalued fields of the FLOAT type, in addition to uniq and equal, you can also select one of the following compression methods that compromise precision for compression and storage: fp16, block_fp, and int8#[absMax]. For single-value fields of the FLOAT type, you can also select fp16 or int8#[absMax] for compression.
The compression rate is close to 50% for fp16 and block_fp. The encoding space of fp16 is slightly smaller than the encoding space of block_fp, but the precision loss of fp16 is greater than the precision loss of block_fp.
int8#[absMax]: You must specify the maximum absolute value of the values to encode. For example, if int8#[absMax] is set to int8#1.5, the range of the encoded value is [-1.5, 1.5]. The size of the data after compression is 25% of the original size. The precision loss depends on the value of absMax. A larger absMax causes greater precision loss.
enable_null: specifies whether null values are allowed for this field. If enable_null is true, the field that corresponds to the index cannot be fixed-length multivalued or single-value and the compression type for the field cannot be set to equal.
default_null_string: the literal value of a null value. The default value is __NULL__.
You can specify default_time_zone in the +/-HHMM format for a field of the TIMESTAMP type. The time is in Coordinated Universal Time (UTC). For example, the value +0800 corresponds to UTC+8. After the default time zone is configured, field values without time zone information are converted into timestamps in UTC based on the specified default time zone when the field of the TIMESTAMP type is parsed. If the field is stored as a summary field, it is displayed based on the default time zone.
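
To illustrate why a larger absMax causes greater precision loss under int8#[absMax], the following sketch uses symmetric linear quantization onto the range [-127, 127]. This mapping is an assumption made for illustration; the text above does not specify the exact encoding that the engine uses, and the function names are hypothetical.

```python
# Illustrative int8 quantization, assuming a symmetric linear mapping
# of [-abs_max, abs_max] onto the integers [-127, 127].

def int8_encode(value: float, abs_max: float) -> int:
    """Clamp value to [-abs_max, abs_max] and quantize it to an 8-bit code."""
    clamped = max(-abs_max, min(abs_max, value))
    return round(clamped / abs_max * 127)


def int8_decode(code: int, abs_max: float) -> float:
    """Map an 8-bit code back to an approximate float value."""
    return code / 127 * abs_max


def quantization_step(abs_max: float) -> float:
    """Distance between two adjacent representable values."""
    return abs_max / 127


if __name__ == "__main__":
    for abs_max in (1.5, 10.0):
        restored = int8_decode(int8_encode(0.3, abs_max), abs_max)
        print(abs_max, restored, abs(restored - 0.3))
```

Because only 8 bits are stored per value instead of 32, the stored size is one quarter of the original. The step between representable values grows linearly with absMax, so choosing the smallest absMax that still covers the data keeps the error low.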

indexs configuration

"indexs":
[
        {index1},
        {index2},
        ...
        {indexn}
]

In the indexs configuration, a list of indexes is configured. Each item represents a complete index configuration. For more information about the index types and configurations supported by OpenSearch Vector Search Edition, see Inverted indexes.

attributes configuration

"attributes": [
  "user_id", 
  "product_id", 
  "category"
]

In the attributes configuration, a list of fields is configured. These fields must be declared in the fields configuration. Fields of all types can be configured as attributes except the TEXT type.
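
The two constraints above can be checked before a schema is submitted. The following is a minimal sketch, assuming the schema has been loaded into a Python dict with the same keys as the JSON configuration; check_attributes is a hypothetical helper, not part of any OpenSearch SDK.

```python
# Sketch: validate the attributes list against the fields list.

def check_attributes(schema: dict) -> list:
    """Return human-readable problems found in schema["attributes"]."""
    # Map each declared field name to its upper-cased type.
    field_types = {
        f["field_name"]: f["field_type"].upper()
        for f in schema.get("fields", [])
    }
    problems = []
    for name in schema.get("attributes", []):
        if name not in field_types:
            problems.append(f"{name}: not declared in fields")
        elif field_types[name] == "TEXT":
            problems.append(f"{name}: TEXT fields cannot be attributes")
    return problems


if __name__ == "__main__":
    schema = {
        "fields": [
            {"field_name": "title", "field_type": "TEXT"},
            {"field_name": "user_id", "field_type": "INTEGER"},
        ],
        "attributes": ["user_id", "title", "product_id"],
    }
    for problem in check_attributes(schema):
        print(problem)
```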

Note:

The following section describes the values of fields of the TIME, DATE, and TIMESTAMP types stored in forward indexes:

DATE: the number of days elapsed from January 1, 1970 to the specified date. A single value is stored in four bytes. If necessary, you can multiply the returned value by 86,400,000 (the total number of milliseconds in a single day) to convert the returned value into a timestamp.
TIME: the number of milliseconds elapsed from 00:00:00 to the specified time. A single value is stored in four bytes.
TIMESTAMP: the number of milliseconds elapsed from January 1, 1970 to the specified timestamp. A single value is stored in eight bytes.
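
The stored values described above can be converted back into readable dates and times as follows. This is a minimal sketch that uses only the Python standard library; the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

MS_PER_DAY = 86_400_000  # milliseconds in one day
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def date_to_timestamp_ms(days: int) -> int:
    """Convert a stored DATE value (days since 1970-01-01) to milliseconds."""
    return days * MS_PER_DAY


def decode_date(days: int):
    """Convert a stored DATE value to a calendar date."""
    return (EPOCH + timedelta(days=days)).date()


def decode_time(ms: int):
    """Convert a stored TIME value (milliseconds since 00:00:00) to a time of day."""
    return (datetime.min + timedelta(milliseconds=ms)).time()


def decode_timestamp(ms: int) -> datetime:
    """Convert a stored TIMESTAMP value (milliseconds since the epoch) to a UTC datetime."""
    return EPOCH + timedelta(milliseconds=ms)


if __name__ == "__main__":
    print(decode_date(19723))
    print(decode_time(3_600_000))
    print(decode_timestamp(date_to_timestamp_ms(19723)))
```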

summarys configuration

"summarys":
{
        "summary_fields":["id", "company_id", "subject", "cat_id"],
        "compress":false 
}

summary_fields: the fields that are contained in the summary. Fields of all types can be contained in the summary. The fields must be declared in the fields configuration.
compress: specifies whether to compress the summary by using zlib. The value true specifies that the summary is compressed. The value false specifies that the summary is not compressed. The default value is false.

dictionaries configuration

"dictionaries":[
    {
        "dictionary_name":"bitmap1",
        "content":"a;an"
    },
    {
        "dictionary_name":"bitmap2",
        "content":"of;and"
    }
]

dictionary_name: the name of a dictionary.
content: all terms in the dictionary. The terms are separated with semicolons (;).

adaptive_dictionaries configuration

"adaptive_dictionaries":[
    {
        "adaptive_dictionary_name":"df",
        "dict_type":"DOC_FREQUENCY",
        "threshold":1500000
    },
    {
        "adaptive_dictionary_name":"percent",
        "dict_type":"PERCENT",
        "threshold":30
    },
    {
        "adaptive_dictionary_name":"size",
        "dict_type":"INDEX_SIZE"
    }
]

adaptive_dictionary_name: the name of the rule for generating an adaptive high-frequency dictionary.
dict_type: the type of the rule for generating adaptive high-frequency terms. The following three types are supported:
DOC_FREQUENCY: specifies that terms whose document frequency (df) is equal to or greater than the specified threshold are used as high-frequency terms.
PERCENT: specifies that terms for which df divided by totalDocCount and multiplied by 100 is equal to or greater than the specified threshold are used as high-frequency terms.
INDEX_SIZE: compares the size of the bitmap index generated based on terms and the size of the original index. If the size of the bitmap index is smaller, the terms are used as high-frequency terms.

Note:

In most cases, if index terms are enumerable, such as terms a, b, and c, and the inverted index is rarely used in queries, we recommend that you set the dict_type parameter to INDEX_SIZE. If terms are not enumerable and the inverted index is frequently used in queries, we recommend that you set the dict_type parameter to PERCENT or DOC_FREQUENCY. We recommend that you specify a threshold based on performance test results. The empirical threshold is 5% of the total number of documents. For example, if the total number of documents is 10 million, you can set the threshold to 500,000 for DOC_FREQUENCY and 5 for PERCENT.
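
The three rule types and the 5% rule of thumb can be sketched as follows. This is an illustration of the rules as described above, not the engine's implementation. The function names are hypothetical, and for INDEX_SIZE the two sizes are taken as inputs because the text does not describe how they are measured.

```python
# Sketch of the adaptive high-frequency term rules described above.

def is_high_frequency(dict_type, df, total_doc_count, threshold=None,
                      bitmap_size=None, original_size=None):
    """Decide whether a term counts as high-frequency under the given rule."""
    if dict_type == "DOC_FREQUENCY":
        return df >= threshold
    if dict_type == "PERCENT":
        # df / total_doc_count * 100 >= threshold, in integer arithmetic
        return df * 100 >= threshold * total_doc_count
    if dict_type == "INDEX_SIZE":
        # Assumption: both sizes are supplied by the caller, in the same unit.
        return bitmap_size < original_size
    raise ValueError(f"unknown dict_type: {dict_type}")


def suggested_thresholds(total_doc_count, percent=5):
    """Return (DOC_FREQUENCY threshold, PERCENT threshold) for the rule of thumb."""
    return total_doc_count * percent // 100, percent


if __name__ == "__main__":
    total = 10_000_000
    print(suggested_thresholds(total))
    print(is_high_frequency("PERCENT", 500_000, total, threshold=5))
```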