OpenSearch:Configure an index table

Last Updated:Feb 28, 2024

Overview

The configuration of an index table is one of the most important configurations in OpenSearch Retrieval Engine Edition. When you configure an index table, you can specify the data format of the original document and how to create indexes based on the data in the document. The indexes include inverted indexes, forward indexes, and summary indexes.

Configure an index table

Configuration overview

{
    "table_name":"sample",
    "fields":[

    ],
    "indexs":[

    ],
    "attributes":[

    ],
    "summarys":{

    },
    "dictionaries":[

    ],
    "adaptive_dictionaries":[

    ],
    "enable_ttl":true,
    "ttl_field_name":"ttl_field",
    "default_ttl":86400
}
  • table_name: the name of an index table. This parameter is used to identify an index table. The system generates a configuration file named $table_name_schema.json for the index table.

  • fields: the fields on which an index is created.

  • indexs: the configuration of inverted indexes.

  • attributes: the configurations of forward indexes.

  • summarys: the configurations of summary indexes.

  • dictionaries: the dictionaries that are required to create a bitmap index. If you do not create a bitmap index, you do not need to set this parameter. After you create high-frequency dictionaries or adaptive dictionaries for high-frequency terms, index space can be reduced and retrieval performance can be improved.

  • adaptive_dictionaries: creates an adaptive bitmap index. High-frequency terms and the bitmap index are generated based on the rules that you configure. You can also leave this parameter empty as needed.

  • file_compress: declares the parameters and aliases for various file compression methods. These methods are used to compress files for forward, inverted, and summary indexes.

  • enable_ttl: specifies whether to use the time to live (TTL) feature. If TTL is enabled, expired data is automatically cleaned up. The default value is false.

  • ttl_field_name: specifies the name of the field for which TTL is enabled in the table. If this field is left empty in the original document, the default value of the ttl_field_name parameter is used. If the user does not set the ttl_field_name parameter, the built-in field ops_doc_time_to_live_in_seconds is used. If the user sets the ttl_field_name parameter, the field must be of the UINT32 type and single-value.

  • default_ttl: specifies the default TTL. If the enable_ttl parameter is set and the default_ttl parameter is not set, the default value std::numeric_limits<int64_t>::max() >> 20 is used for the default_ttl parameter. If the default_ttl parameter is set and the enable_ttl parameter is not set, the system automatically sets the enable_ttl parameter to true.
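
The way enable_ttl, ttl_field_name, and default_ttl fall back to their defaults can be sketched in Python. This is only an illustration of the rules described above, not the engine's implementation. The helper name resolve_ttl is hypothetical, and the sketch assumes that "not set" means the key is absent from the schema.

```python
INT64_MAX = 2**63 - 1
# Documented fallback: std::numeric_limits<int64_t>::max() >> 20
FALLBACK_DEFAULT_TTL = INT64_MAX >> 20
BUILT_IN_TTL_FIELD = "ops_doc_time_to_live_in_seconds"


def resolve_ttl(schema):
    """Return the effective TTL settings for a schema dict (hypothetical helper)."""
    default_ttl = schema.get("default_ttl")
    if "enable_ttl" in schema:
        enable_ttl = bool(schema["enable_ttl"])
    else:
        # default_ttl set without enable_ttl: TTL is switched on automatically.
        enable_ttl = default_ttl is not None
    if enable_ttl and default_ttl is None:
        # enable_ttl set without default_ttl: use the documented fallback.
        default_ttl = FALLBACK_DEFAULT_TTL
    return {
        "enable_ttl": enable_ttl,
        "ttl_field_name": schema.get("ttl_field_name", BUILT_IN_TTL_FIELD),
        "default_ttl": default_ttl,
    }


print(resolve_ttl({"default_ttl": 86400}))
print(resolve_ttl({"enable_ttl": True}))
```

The fallback value is very large (2^43 - 1), so a document without an explicit TTL effectively never expires.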

fields configuration

    "fields":[
        {
            "field_name":"title",
            "field_type":"TEXT",
            "analyzer":"chn_standard"
        },
        {
            "field_name":"dup_title",
            "field_type":"TEXT",
            "analyzer":"fuzzy",
            "user_defined_param" : {
                "copy_from" : "title"
            }
        },
        {
            "field_name":"category",
            "field_type":"INTEGER",
            "multi_value":true,
            "compress_type":"uniq|equal"
        },
        {
            "field_name":"mlr_features",
            "field_type":"INTEGER",
            "multi_value":true,
            "updatable_multi_value":true
        },
        {
            "field_name":"feature",
            "field_type":"FLOAT",
            "multi_value":true,
            "fixed_multi_value_count":32,
            "compress_type":"uniq|fp16",
            "updatable_multi_value":true
        },
        {
            "field_name":"user_id",
            "field_type":"INTEGER"
        },
        {
            "field_name":"price",
            "field_type":"INTEGER",
            "enable_null":true,
            "default_null_string":"default_null"
        },
        {
            "field_name":"product_id",
            "field_type":"LONG"
        },
        {
            "field_name":"product_type",
            "field_type":"UINT8",
            "compress_type":"equal"
        },
        {
            "field_name":"bitwords",
            "field_type":"STRING",
            "multi_value":true
        },
        {
            "field_name":"date",
            "field_type":"DATE"
        },
        {
            "field_name":"time",
            "field_type":"TIME"
        },
        {
            "field_name":"timestamp",
            "field_type":"TIMESTAMP",
            "default_time_zone":"+0800"
        }
    ]
  • field_name: the name of the field.

  • field_type: the field type. For more information, see Built-in field types in OpenSearch Retrieval Engine Edition.

  • analyzer: the analyzer used for fields of the TEXT type. You must specify an analyzer for a field of the TEXT type, and you cannot specify an analyzer for a field of any other type. For information about the built-in analyzers supported by OpenSearch Retrieval Engine Edition, see Analyzers. To analyze the same text with a different analyzer, add another field to the schema and set its user_defined_param parameter so that the field copies the content of the original field. For an example, see the configuration of the dup_title field in the preceding code block.

  • multi_value: specifies whether the field is multi-value. If a multi-value field is configured as an attribute, the attribute is a multi-value attribute. The default value is false.

  • updatable_multi_value: specifies whether the multi-value field can be updated. The default value is false. You can set this parameter for multi-value fields of the following types: INT8, UINT8, INT16, UINT16, INTEGER (32-bit integer), UINT32, LONG (64-bit integer), UINT64, FLOAT, DOUBLE, and STRING. You can also set this parameter for single-value fields of the STRING type. If you set updatable_multi_value to true, you can also set the u32offset_threshold parameter, whose default value is 0xFFFFFFFFL. If the maximum offset exceeds the value of u32offset_threshold, the file uses 8-byte offsets. Otherwise, the file uses 4-byte offsets. In most cases, you can ignore this parameter.

  • fixed_multi_value_count: specifies the fixed number of values in a multi-value field. If this parameter is set, the field is a fixed-length multi-value field in the attribute index. Fields of the following types can be configured as fixed-length multi-value attributes: INT8, INT16, INT32, INT64, UINT8, UINT16, UINT32, UINT64, FLOAT, and DOUBLE. You can also set fixed_multi_value_count to specify a fixed length for single-value fields of the STRING type. This parameter cannot be set for multi-value fields of the STRING type.

  • compress_type: specifies the compression method that is used when the field is stored as an attribute. The value can be uniq, equal, or uniq|equal. The default value is an empty string, which specifies that the data is not compressed.

    • If you set this parameter to uniq for multi-value attributes or string attributes, duplicate values are removed before the data is stored. If you set this parameter to equal, equal-value compression is applied to the offset data.

    • For single-value attributes of the INTEGER type, you cannot set this parameter to uniq, but you can set it to equal to enable equal-value compression. You can also configure equal-value compression for single-value attributes of the FLOAT or DOUBLE type.

    • Other compression methods: For fixed-length multi-value fields of the FLOAT type, in addition to uniq and equal, you can select one of the following lossy compression methods, which trade precision for storage space: fp16, block_fp, and int8#[absMax]. For single-value fields of the FLOAT type, you can select fp16 or int8#[absMax].

      • fp16 and block_fp: The compression rate is close to 50%. The encoding space of fp16 is slightly smaller than the encoding space of block_fp, but the precision loss of fp16 is greater than the precision loss of block_fp.

      • int8#[absMax]: You must specify the maximum absolute value of the encoded values. For example, int8#1.5 specifies that the range of the encoded values is [-1.5, 1.5]. The compressed data is 25% of the original size. The precision loss depends on the value of absMax. A larger absMax causes greater precision loss.

  • enable_null: specifies whether null values are allowed for this field. If enable_null is true, the field that corresponds to the index cannot be fixed-length multi-value or single-value and the compression type for the field cannot be set to equal.

  • default_null_string: specifies the literal value of a null value. The default value is "__NULL__".

  • default_time_zone: specifies the default time zone of a field of the TIMESTAMP type, in the +/-HHMM format. The value is an offset from Coordinated Universal Time (UTC). For example, the value +0800 corresponds to UTC+8. After the default time zone is configured, field values that carry no time zone information are interpreted in the default time zone and converted into UTC timestamps when the field is parsed. If the field is stored as a summary field, the field is displayed based on the default time zone.
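
The storage ratios quoted for the lossy FLOAT compression methods can be illustrated with a short Python sketch. It compares the byte size of 32-bit floats with IEEE 754 half precision (fp16) and with a simple symmetric 8-bit quantization. The quantization scheme, the helper names, and the sample values are assumptions for illustration. The engine's actual int8#[absMax] and block_fp encodings are not described in this topic and may differ.

```python
import struct


def fp16_roundtrip(x):
    """Encode x as IEEE 754 half precision (2 bytes) and decode it again."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def int8_encode(x, abs_max):
    """Assumed scheme: clip to [-abs_max, abs_max], then scale to [-127, 127]."""
    clipped = max(-abs_max, min(abs_max, x))
    return round(clipped / abs_max * 127)


def int8_decode(q, abs_max):
    """Map a quantized value back to a float in [-abs_max, abs_max]."""
    return q / 127 * abs_max


values = [0.1, -0.75, 1.25]  # sample data, not taken from the engine
n = len(values)

fp32_size = len(struct.pack(f"<{n}f", *values))
fp16_size = len(struct.pack(f"<{n}e", *values))
int8_size = len(struct.pack(f"<{n}b", *[int8_encode(v, 1.5) for v in values]))
print(fp32_size, fp16_size, int8_size)

# The step between adjacent int8 codes grows with abs_max, so precision drops.
for abs_max in (1.5, 15.0):
    restored = int8_decode(int8_encode(0.1, abs_max), abs_max)
    print(abs_max, abs(restored - 0.1))
```

In this sketch, the worst-case rounding error of the 8-bit scheme is abs_max / 254, which is why choosing the smallest absMax that still covers the data keeps precision loss low.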

indexs configuration

"indexs":
[
        {index1},
        {index2},
        …
        {indexn}
]

In the indexs configuration, a list of indexes is configured. Each item in the list represents a complete index configuration. For more information about the index types and configurations supported by OpenSearch Retrieval Engine Edition, see Overview of inverted indexes.

attributes configuration

"attributes": [
  "user_id", 
  "product_id", 
  "category"
]

In the attributes configuration, a list of fields is configured. These fields must be declared in the fields configuration. Fields of all types can be configured as attributes except the TEXT type.

Note:

The following list describes how the values of fields of the DATE, TIME, and TIMESTAMP types are stored in the forward index:

  • DATE: the number of days from January 1, 1970 to the specified date. A single value is stored in four bytes. If necessary, you can multiply the returned value by 86,400,000 (the total number of milliseconds in a single day) to convert the returned value into a timestamp.

  • TIME: the number of milliseconds from 00:00:00 to the specified time. A single value is stored in four bytes.

  • TIMESTAMP: the number of milliseconds from January 1, 1970 to the specified timestamp. A single value is stored in eight bytes.
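
The stored values described in this note can be converted back into calendar values with a few lines of Python. This is a minimal sketch with hypothetical helper names. It assumes that the epoch for DATE and TIMESTAMP values is January 1, 1970 in UTC.

```python
from datetime import date, datetime, time, timedelta, timezone

MS_PER_DAY = 86_400_000  # milliseconds in one day
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def date_attr_to_date(days):
    """DATE attribute: days since January 1, 1970."""
    return date(1970, 1, 1) + timedelta(days=days)


def date_attr_to_timestamp_ms(days):
    """Convert a DATE attribute into a millisecond timestamp."""
    return days * MS_PER_DAY


def time_attr_to_time(ms):
    """TIME attribute: milliseconds since 00:00:00."""
    return (datetime.min + timedelta(milliseconds=ms)).time()


def timestamp_attr_to_datetime(ms):
    """TIMESTAMP attribute: milliseconds since January 1, 1970 (UTC assumed)."""
    return EPOCH + timedelta(milliseconds=ms)


days = (date(2024, 2, 28) - date(1970, 1, 1)).days
print(date_attr_to_date(days))
print(timestamp_attr_to_datetime(date_attr_to_timestamp_ms(days)))
print(time_attr_to_time(3_723_004))
```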

summarys configuration

"summarys":
{
        "summary_fields":["id", "company_id", "subject", "cat_id"],
        "compress":false 
}
  • summary_fields: specifies the fields that are contained in the summary. Fields of all types can be contained in the summary. The fields must be declared in the fields configuration.

  • compress: specifies whether to compress the summary by using zlib. The value true specifies that the summary is compressed. The value false specifies that the summary is not compressed. The default value is false.
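
Several of the rules stated so far lend themselves to a quick pre-flight check before a schema is submitted: a TEXT field needs an analyzer, other types must not set one, attributes must be declared and must not be TEXT, and summary fields must be declared. The following Python sketch checks only these rules. The function check_schema and the sample schema are hypothetical and are not part of OpenSearch.

```python
def check_schema(schema):
    """Return a list of human-readable problems found in a schema dict."""
    problems = []
    fields = {f["field_name"]: f for f in schema.get("fields", [])}

    for name, field in fields.items():
        is_text = field["field_type"].upper() == "TEXT"
        if is_text and "analyzer" not in field:
            problems.append(f"{name}: TEXT field needs an analyzer")
        if not is_text and "analyzer" in field:
            problems.append(f"{name}: only TEXT fields may set an analyzer")

    for name in schema.get("attributes", []):
        if name not in fields:
            problems.append(f"{name}: attribute is not declared in fields")
        elif fields[name]["field_type"].upper() == "TEXT":
            problems.append(f"{name}: TEXT fields cannot be attributes")

    for name in schema.get("summarys", {}).get("summary_fields", []):
        if name not in fields:
            problems.append(f"{name}: summary field is not declared in fields")

    return problems


sample = {
    "table_name": "sample",
    "fields": [
        {"field_name": "title", "field_type": "TEXT", "analyzer": "chn_standard"},
        {"field_name": "user_id", "field_type": "INTEGER"},
    ],
    "attributes": ["user_id", "title", "price"],
    "summarys": {"summary_fields": ["title", "user_id"]},
}

for problem in check_schema(sample):
    print(problem)
```

For the sample above, the check reports that title cannot be an attribute because it is a TEXT field and that price is not declared in fields.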

dictionaries configuration

"dictionaries":[
    {
        "dictionary_name":"bitmap1",
        "content":"a;an"
    },
    {
        "dictionary_name":"bitmap2",
        "content":"of;and"
    }
]
  • dictionary_name: the name of the dictionary.

  • content: lists all terms in the dictionary. The terms are separated with semicolons (;).

adaptive_dictionaries configuration

"adaptive_dictionaries":[
    {
        "adaptive_dictionary_name":"df",
        "dict_type":"DOC_FREQUENCY",
        "threshold":1500000
    },
    {
        "adaptive_dictionary_name":"percent",
        "dict_type":"PERCENT",
        "threshold":30
    },
    {
        "adaptive_dictionary_name":"size",
        "dict_type":"INDEX_SIZE"
    }
]
  • adaptive_dictionary_name: the name of the rule for generating an adaptive high-frequency dictionary.

  • dict_type: the type of the rule for generating adaptive high-frequency terms. The following three types are supported:

    • DOC_FREQUENCY: specifies that high-frequency terms are the terms whose document frequency (df) is equal to or greater than the specified threshold.

    • PERCENT: specifies that high-frequency terms are the terms for which df divided by totalDocCount and then multiplied by 100 is equal to or greater than the specified threshold.

    • INDEX_SIZE: compares the size of the bitmap index generated for a term with the size of the original index for that term. If the bitmap index is smaller, the term is a high-frequency term.

Note:

In most cases, if the index terms are enumerable, such as a, b, and c, and the inverted index is infrequently used in queries, we recommend that you set the dict_type parameter to INDEX_SIZE. If the terms are not enumerable and the inverted index is frequently used in queries, we recommend that you set the dict_type parameter to PERCENT or DOC_FREQUENCY. Specify the threshold based on performance testing results. As a starting point, we recommend a threshold that corresponds to 5% of the total number of documents. For example, if the total number of documents is 10 million, set the threshold to 500,000 for DOC_FREQUENCY or to 5 for PERCENT.
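
The DOC_FREQUENCY and PERCENT rules, and the 5% recommendation above, can be checked with a small Python sketch. The function names are hypothetical, and the code only mirrors the threshold comparisons described in this section. INDEX_SIZE is not modeled because it depends on the actual index sizes.

```python
def is_high_frequency(df, total_doc_count, dict_type, threshold):
    """Apply the documented threshold rule for one term."""
    if dict_type == "DOC_FREQUENCY":
        return df >= threshold
    if dict_type == "PERCENT":
        # Same as df / total_doc_count * 100 >= threshold, in integer arithmetic.
        return df * 100 >= threshold * total_doc_count
    raise ValueError(f"rule not modeled in this sketch: {dict_type}")


def thresholds_for_share(total_doc_count, percent):
    """Return (DOC_FREQUENCY threshold, PERCENT threshold) for a given share."""
    return total_doc_count * percent // 100, percent


total = 10_000_000
df_threshold, percent_threshold = thresholds_for_share(total, 5)
print(df_threshold, percent_threshold)

# A term that appears in 600,000 documents passes both rules.
print(is_high_frequency(600_000, total, "DOC_FREQUENCY", df_threshold))
print(is_high_frequency(600_000, total, "PERCENT", percent_threshold))
```

With 10 million documents, both rules select the same terms. They diverge as the document count changes, because a DOC_FREQUENCY threshold is absolute while a PERCENT threshold scales with the index.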