OpenSearch:Vector indexes

Last Updated:Aug 27, 2024

Overview

Vector-based retrieval represents commodity data and content data as vectors and builds a vector index library from them. You can then query the library with one or more user vectors or commodity vectors to retrieve the top-k commodities or content items that are closest in vector distance.
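The idea of top-k retrieval by vector distance can be sketched in a few lines of Python. This is only an illustration of the concept with an exhaustive scan over a small in-memory dictionary. The names squared_euclidean, top_k, and library are hypothetical and are not part of OpenSearch, which uses approximate index structures instead of a full scan.

```python
from typing import Dict, List, Sequence, Tuple


def squared_euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors of the same dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query: Sequence[float],
          library: Dict[int, Sequence[float]],
          k: int) -> List[Tuple[int, float]]:
    """Return the k (id, distance) pairs closest to the query, nearest first."""
    scored = [(doc_id, squared_euclidean(query, vec)) for doc_id, vec in library.items()]
    scored.sort(key=lambda pair: pair[1])  # smaller distance means more similar
    return scored[:k]


# Hypothetical toy library: document ID -> 2-dimensional vector.
library = {1: [0.0, 0.0], 2: [1.0, 0.0], 3: [3.0, 4.0]}
nearest = top_k([0.9, 0.1], library, k=2)
nearest_ids = [doc_id for doc_id, _ in nearest]
```

Here nearest_ids lists the two documents closest to the query vector, nearest first.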

Sample code for configuring a vector index

Configure a vector index without categories

{
  "table_name": "test_vector",
  "summarys": {
    "summary_fields": [
      "id",
      "vector_field"
    ]
  },
  "indexs": [
    {
      "index_name": "pk",
      "index_type": "PRIMARYKEY64",
      "index_fields": "id",
      "has_primary_key_attribute": true,
      "is_primary_key_sorted": false
    },
    {
      "index_name": "embedding",
      "index_type": "CUSTOMIZED",
      "index_fields": [
        {
          "boost": 1,
          "field_name": "id"
        },
        {
          "boost": 1,
          "field_name": "vector_field"
        }
      ],
      "indexer": "aitheta2_indexer",
      "parameters": {
        "enable_rt_build": "false",
        "min_scan_doc_cnt": "20000",
        "vector_index_type": "Qc",
        "major_order": "col",
        "builder_name": "QcBuilder",
        "distance_type": "SquaredEuclidean",
        "embedding_delimiter": ",",
        "enable_recall_report": "false",
        "is_embedding_saved": "false",
        "linear_build_threshold": "5000",
        "dimension": "128",
        "search_index_params": "{\"proxima.qc.searcher.scan_ratio\":0.01}",
        "searcher_name": "QcSearcher",
        "build_index_params": "{\"proxima.qc.builder.quantizer_class\":\"Int8QuantizerConverter\",\"proxima.qc.builder.quantize_by_centroid\":true,\"proxima.qc.builder.optimizer_class\":\"BruteForceBuilder\",\"proxima.qc.builder.thread_count\":10,\"proxima.qc.builder.optimizer_params\":{\"proxima.linear.builder.column_major_order\":true},\"proxima.qc.builder.store_original_features\":false,\"proxima.qc.builder.train_sample_count\":3000000,\"proxima.qc.builder.train_sample_ratio\":0.5}"
      }
    }
  ],
  "attributes": [
    "id",
    "vector_field"
  ],
  "fields": [
    {
      "field_name": "id",
      "field_type": "INTEGER"
    },
    {
      "user_defined_param": {
        "multi_value_sep": ","
      },
      "field_name": "vector_field",
      "field_type": "FLOAT",
      "multi_value": true
    }
  ]
}

Configure a vector index with categories

{
  "table_name": "test_vector",
  "summarys": {
    "summary_fields": [
      "id",
      "vector_field",
      "category_id"
    ]
  },
  "indexs": [
    {
      "index_name": "pk",
      "index_type": "PRIMARYKEY64",
      "index_fields": "id",
      "has_primary_key_attribute": true,
      "is_primary_key_sorted": false
    },
    {
      "index_name": "embedding",
      "index_type": "CUSTOMIZED",
      "index_fields": [
        {
          "boost": 1,
          "field_name": "id"
        },
        {
          "field_name": "category_id",
          "boost": 1
        },
        {
          "boost": 1,
          "field_name": "vector_field"
        }
      ],
      "indexer": "aitheta2_indexer",
      "parameters": {
        "enable_rt_build": "false",
        "min_scan_doc_cnt": "20000",
        "vector_index_type": "Qc",
        "major_order": "col",
        "builder_name": "QcBuilder",
        "distance_type": "SquaredEuclidean",
        "embedding_delimiter": ",",
        "enable_recall_report": "false",
        "is_embedding_saved": "false",
        "linear_build_threshold": "5000",
        "dimension": "128",
        "search_index_params": "{\"proxima.qc.searcher.scan_ratio\":0.01}",
        "searcher_name": "QcSearcher",
        "build_index_params": "{\"proxima.qc.builder.quantizer_class\":\"Int8QuantizerConverter\",\"proxima.qc.builder.quantize_by_centroid\":true,\"proxima.qc.builder.optimizer_class\":\"BruteForceBuilder\",\"proxima.qc.builder.thread_count\":10,\"proxima.qc.builder.optimizer_params\":{\"proxima.linear.builder.column_major_order\":true},\"proxima.qc.builder.store_original_features\":false,\"proxima.qc.builder.train_sample_count\":3000000,\"proxima.qc.builder.train_sample_ratio\":0.5}"
      }
    }
  ],
  "attributes": [
    "id",
    "vector_field",
    "category_id"
  ],
  "fields": [
    {
      "field_name": "id",
      "field_type": "INTEGER"
    },
    {
      "user_defined_param": {
        "multi_value_sep": ","
      },
      "field_name": "vector_field",
      "field_type": "FLOAT",
      "multi_value": true
    },
    {
      "field_name": "category_id",
      "field_type": "INTEGER"
    }
  ]
}
Important
  • Categories let you restrict a vector search to specific categories. For example, images can belong to different categories. If you build a vector index without categories and apply a category filter only to the vectors that are already retrieved, the filter may remove all of them, and no results may be returned.

  • If you configure a vector index as an administrator, the escape characters in the values of the build_index_params and search_index_params parameters must be removed.
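The escaped and unescaped forms of build_index_params and search_index_params can be related with a short Python sketch. It assumes that the escaped form in the samples above is ordinary JSON string escaping, which is how the samples read. The variable names are hypothetical, and only a subset of the builder settings from the sample is shown.

```python
import json

# Builder and searcher settings as regular Python dicts (values taken from the sample above).
build_params = {
    "proxima.qc.builder.quantizer_class": "Int8QuantizerConverter",
    "proxima.qc.builder.quantize_by_centroid": True,
    "proxima.qc.builder.thread_count": 10,
}
search_params = {"proxima.qc.searcher.scan_ratio": 0.01}

# Compact JSON text without backslashes: the unescaped form.
build_params_text = json.dumps(build_params, separators=(",", ":"))
search_params_text = json.dumps(search_params, separators=(",", ":"))

# Embedding that text as a string value in the schema escapes the inner quotes,
# which yields the \" sequences shown in the sample configuration.
parameters = {
    "build_index_params": build_params_text,
    "search_index_params": search_params_text,
}
schema_fragment = json.dumps(parameters, indent=2)

# Parsing twice recovers the original settings.
round_trip = json.loads(json.loads(schema_fragment)["build_index_params"])
```

In this sketch, search_params_text is the unescaped text and schema_fragment contains the escaped text.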

Parameter description

  • field_name: the fields that are used to build the vector index, listed as entries of index_fields. The fields must be of the RAW data type. Specify at least two fields: a primary key field, whose value is of the INTEGER data type or is the hash value of the primary key, and a field that contains the vectors. To build a vector index based on categories, add a category field. The category field is of the RAW type and its value is of the INTEGER data type. The fields in the index must be listed in the same order as in the fields parameter. If a category field exists, the order must be: primary key field, category field, vector field.

  • index_name: the name of the vector index.

  • index_type: the type of the vector index. Set the value to CUSTOMIZED.

  • indexer: the plug-in that you want to use to build the vector index. Set the value to aitheta2_indexer.

  • parameters: the parameters that are used to configure a builder and a searcher for the vector index.

    • dimension: the number of dimensions.

    • embedding_delimiter: the vector delimiter. The default delimiter is a comma (,).

    • distance_type: the type of the distance. Valid values:

      • InnerProduct: calculates the inner product.

      • SquaredEuclidean: calculates the squared Euclidean distance. Specify SquaredEuclidean for data that is normalized.

    • major_order: the method that you want to use to store your data. Valid values:

      • col: uses a column store for the data. If you set the major_order parameter to col, the value of the dimension parameter must be 2 to the power of n, where n is a positive integer. The col value provides better performance than the row value.

      • row: uses a row store for the data. This is the default value.

    • builder_name: the type of builder that you want to use for the vector index. We recommend that you set the parameter to one of the following values. For information about other parameter values, contact technical support.

      • QcBuilder

      • LinearBuilder: builds indexes in order. We recommend that you set the builder_name parameter to LinearBuilder if the number of documents is less than 10,000.

    • searcher_name: the type of searcher that you want to use for the vector index. The value of the searcher_name parameter must match the value of the builder_name parameter. If you want to use GPU resources, contact technical support.

      • QcSearcher: runs searches by using a CPU. Set the searcher_name parameter to QcSearcher if the builder_name parameter is set to QcBuilder.

      • LinearSearcher: runs exhaustive linear-scan searches by using a CPU. Set the searcher_name parameter to LinearSearcher if the builder_name parameter is set to LinearBuilder.

    • build_index_params: the parameters that you want to configure for the builder type that you specified for the builder_name parameter. For more information, see Proxima Builder.

    • search_index_params: the parameters that you want to configure for the searcher type that you specified for the searcher_name parameter. For more information, see Proxima Searcher.

    • linear_build_threshold: the document-count threshold below which the system uses LinearBuilder and LinearSearcher instead of the configured builder and searcher. LinearBuilder reduces memory usage and returns lossless retrieval results, but its performance degrades as the number of documents grows. Default value: 10000.

    • min_scan_doc_cnt: the minimum number of candidate documents to scan during retrieval. This parameter serves a similar purpose to the proxima.qc.searcher.scan_ratio parameter. Default value: 10000. If you specify both min_scan_doc_cnt and proxima.qc.searcher.scan_ratio, the setting that yields the larger number of candidates takes effect.

      • Do not specify an excessively large value for the min_scan_doc_cnt or proxima.qc.searcher.scan_ratio parameter. Otherwise, system performance degrades and query latency increases.

      • In most cases, if you want to retrieve top-k vectors, we recommend that you use max(10000, 100*topk) as the value of the min_scan_doc_cnt parameter and use max(10000, 100*topk)/total_doc_cnt as the value of the proxima.qc.searcher.scan_ratio parameter. Adjust both values based on your performance requirements, recall rate, and number of documents.

      • The two parameters exist to meet the requirements of real-time and multi-category scenarios. If you are a regular user, you can configure only the proxima.qc.searcher.scan_ratio parameter.

    • enable_recall_report: specifies whether to report the recall rate. Default value: false.

    • is_embedding_saved: specifies whether to save an original vector. Default value: false. If you enable INT8 quantization or FP16 quantization and enable real-time retrieval, make sure that you set the is_embedding_saved parameter to true. Otherwise, incremental vectors fail to be built in batches.
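Several of the constraints in the parameter description can be checked before a schema is submitted. The following Python sketch covers three of them: the power-of-2 rule for major_order=col, the listed distance_type values, and the builder/searcher pairing. The function names and messages are hypothetical, this is not an official validator, and builders other than the two listed here are deliberately not checked because their rules are not described in this topic.

```python
from typing import Dict, List

# Builder/searcher pairs and distance types listed in the parameter description.
MATCHING_SEARCHER = {"QcBuilder": "QcSearcher", "LinearBuilder": "LinearSearcher"}
VALID_DISTANCE_TYPES = ("InnerProduct", "SquaredEuclidean")


def is_power_of_two(value: int) -> bool:
    """True for 2, 4, 8, ... (2 to the power of n, where n is a positive integer)."""
    return value >= 2 and (value & (value - 1)) == 0


def check_parameters(params: Dict[str, str]) -> List[str]:
    """Return a list of problems found; an empty list means none of the checks failed."""
    problems = []
    dimension = int(params["dimension"])  # parameter values are strings in the schema
    major_order = params.get("major_order", "row")  # row is the documented default
    if major_order == "col" and not is_power_of_two(dimension):
        problems.append(f"major_order=col requires a power-of-2 dimension, got {dimension}")
    distance_type = params.get("distance_type")
    if distance_type is not None and distance_type not in VALID_DISTANCE_TYPES:
        problems.append(f"unexpected distance_type: {distance_type}")
    expected = MATCHING_SEARCHER.get(params.get("builder_name"))
    if expected is not None and params.get("searcher_name") != expected:
        problems.append(f"searcher_name should be {expected} for the configured builder")
    return problems


sample = {
    "dimension": "128",
    "major_order": "col",
    "distance_type": "SquaredEuclidean",
    "builder_name": "QcBuilder",
    "searcher_name": "QcSearcher",
}
sample_problems = check_parameters(sample)
```

With the values from the sample configuration, sample_problems is empty. Changing dimension to "100" or searcher_name to "LinearSearcher" adds one entry each.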

Syntax for queries

Syntax for a regular query

HA3 syntax

query=index_name:'0.1,0.2,0.98,0.6;0.3,0.4,0.98,0.6...'
Note: The index_name parameter specifies the name of your vector index. Specify the vectors that you want to query after the colon (:).
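A query string in this format can be assembled programmatically. The sketch below is one way to do it in Python, assuming that vector components are joined with commas and vectors with semicolons as in the sample. The helper names format_vectors and ha3_vector_query are hypothetical. The optional n and sf arguments correspond to the suffixes described later in this topic and are appended in that fixed order.

```python
from typing import Optional, Sequence


def format_vectors(vectors: Sequence[Sequence[float]]) -> str:
    """Join components with commas and vectors with semicolons."""
    return ";".join(",".join(str(x) for x in vec) for vec in vectors)


def ha3_vector_query(index_name: str,
                     vectors: Sequence[Sequence[float]],
                     n: Optional[int] = None,
                     sf: Optional[float] = None) -> str:
    """Build the query clause; n and sf are optional and keep the order n, sf."""
    value = format_vectors(vectors)
    if n is not None:
        value += f"&n={n}"
    if sf is not None:
        value += f"&sf={sf}"
    return f"query={index_name}:'{value}'"


vectors = [[0.1, 0.2, 0.98, 0.6], [0.3, 0.4, 0.98, 0.6]]
basic_query = ha3_vector_query("index_name", vectors)
filtered_query = ha3_vector_query("index_name", vectors, n=10, sf=0.8)
```

The resulting strings still need to be URL-encoded if they are sent as part of a URL.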

SQL syntax

query=select proxima_score('index_name') as score,id from table_name where MATCHINDEX('index_name', ?) order by score asc limit 5&&kvpair=timeout:1000,iquan.plan.cache.enable:true;urlencode_data:false;iquan.plan.prepare.level:jni.post.optimize;dynamic_params:[["0.892704,0.783731"]]
Note: The index_name parameter specifies the name of your vector index. The dynamic_params parameter in the kvpair clause specifies the vectors that you want to query.
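The SQL form can be generated in the same way. The following Python sketch reproduces the sample above and uses json.dumps to write the dynamic_params value as a nested JSON array. The function name sql_vector_query is hypothetical, and the kvpair settings are copied verbatim from the sample rather than derived from any API reference.

```python
import json
from typing import Sequence


def sql_vector_query(index_name: str,
                     table_name: str,
                     vector: Sequence[float],
                     limit: int = 5) -> str:
    """Build the SQL query clause and its kvpair clause for a single query vector."""
    sql = (
        f"select proxima_score('{index_name}') as score,id from {table_name} "
        f"where MATCHINDEX('{index_name}', ?) order by score asc limit {limit}"
    )
    vector_text = ",".join(str(x) for x in vector)
    # The vector replaces the ? placeholder through dynamic_params.
    dynamic_params = json.dumps([[vector_text]], separators=(",", ":"))
    kvpair = (
        "timeout:1000,iquan.plan.cache.enable:true;urlencode_data:false;"
        "iquan.plan.prepare.level:jni.post.optimize;"
        f"dynamic_params:{dynamic_params}"
    )
    return f"query={sql}&&kvpair={kvpair}"


sql_query = sql_vector_query("index_name", "table_name", [0.892704, 0.783731])
```

Keeping the vector in dynamic_params rather than in the SQL text leaves the statement itself unchanged between queries.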

Syntax for a query for top n vectors

HA3 syntax

query=index_name:'0.1,0.2,0.98,0.6;0.3,0.4,0.98,0.6&n=10'
Note: The index_name parameter specifies the name of your vector index. Specify the vectors that you want to query after the colon (:) and before the ampersand (&). The n parameter specifies the number of top-ranked vectors to return.

SQL syntax

query=select proxima_score('index_name') as score,id from table_name where MATCHINDEX('index_name', ?) order by score asc limit 5&&kvpair=timeout:1000,iquan.plan.cache.enable:true;urlencode_data:false;iquan.plan.prepare.level:jni.post.optimize;dynamic_params:[["0.892704,0.783731&n=10"]]
Note: The index_name parameter specifies the name of your vector index. The dynamic_params parameter in the kvpair clause specifies the vectors that you want to query. The n parameter specifies the number of top-ranked vectors to return.

Syntax for a query that includes a specific threshold value

HA3 syntax

query=index_name:'0.1,0.2,0.98,0.6;0.3,0.4,0.98,0.6&n=10&sf=0.8'
Note: The index_name parameter specifies the name of your vector index. Specify the vectors that you want to query after the colon (:) and before the first ampersand (&). The sf parameter specifies the threshold value based on which the system filters documents. For example, assume that sf is set to 4.0. If the search_type parameter is set to ip in the schema.json file, documents whose inner product is less than 4.0 are filtered out. If the search_type parameter is set to l2 in the schema.json file, documents whose Euclidean distance is greater than 2.0 are filtered out. The Euclidean threshold is 2.0 rather than 4.0 because, for performance reasons, OpenSearch Vector Search Edition calculates the squared Euclidean distance.
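The filtering rule in the note can be expressed as a small predicate. The Python sketch below is an interpretation, not the engine's implementation: it assumes that sf is compared directly with the inner product when search_type is ip, and with the squared Euclidean distance when search_type is l2, which is what the 4.0 and 2.0 figures in the note suggest. The function name passes_threshold is hypothetical.

```python
def passes_threshold(score: float, sf: float, search_type: str) -> bool:
    """Return True if a document is kept.

    score is the inner product for "ip" and the squared Euclidean distance
    for "l2" (assumption based on the note above).
    """
    if search_type == "ip":
        return score >= sf  # inner products less than sf are filtered out
    if search_type == "l2":
        return score <= sf  # squared distances greater than sf are filtered out
    raise ValueError(f"unsupported search_type: {search_type}")


# With sf=4.0, a Euclidean distance of 2.0 (squared: 4.0) is kept and 2.5 (squared: 6.25) is not.
kept = passes_threshold(2.0 ** 2, 4.0, "l2")
dropped = passes_threshold(2.5 ** 2, 4.0, "l2")
```

Under this reading, choosing sf for l2 means squaring the Euclidean distance that you want to allow.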

SQL syntax

query=select proxima_score('index_name') as score,id from table_name where MATCHINDEX('index_name', ?) order by score asc limit 5&&kvpair=timeout:1000,iquan.plan.cache.enable:true;urlencode_data:false;iquan.plan.prepare.level:jni.post.optimize;dynamic_params:[["0.892704,0.783731&n=10&sf=0.8"]]
Note: The index_name parameter specifies the name of your vector index. The dynamic_params parameter in the kvpair clause specifies the vectors that you want to query. The sf parameter specifies the threshold value based on which the system filters documents.

Syntax for a query based on categories

HA3 syntax

query=aitheta_index_name:'16#0.1,0.2,0.98,0.6;1512#0.3,0.4,0.98,0.6&n=200'
// The query must be URL-encoded.
query=aitheta_index_name:'16%230.1%2c0.2%2c0.98%2c0.6%3b1512%230.3%2c0.4%2c0.98%2c0.6%26n%3d200'
Note: If you want to query vectors based on categories, you must specify the category IDs and the vectors to be queried. Separate a category ID and a vector with a number sign (#). The number signs in the query must be URL-encoded. Separate multiple categories with commas (,) and separate multiple vectors with semicolons (;).
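The URL-encoded form shown above can be produced with Python's standard urllib.parse.quote function. The sketch assumes one category ID per vector, as in the sample. The helper name category_query_value is hypothetical. Note that quote emits uppercase hexadecimal digits (%2C) where the sample uses lowercase (%2c). Percent-encoding is case-insensitive, so the two forms are equivalent.

```python
from typing import Sequence, Tuple
from urllib.parse import quote


def category_query_value(category_vectors: Sequence[Tuple[int, Sequence[float]]],
                         n: int) -> str:
    """Build '<category>#<vector>;<category>#<vector>&n=<n>' before encoding."""
    parts = []
    for category_id, vector in category_vectors:
        vector_text = ",".join(str(x) for x in vector)
        parts.append(f"{category_id}#{vector_text}")
    return ";".join(parts) + f"&n={n}"


raw_value = category_query_value(
    [(16, [0.1, 0.2, 0.98, 0.6]), (1512, [0.3, 0.4, 0.98, 0.6])],
    n=200,
)
# safe="" encodes every reserved character, including #, comma, semicolon, & and =.
encoded_value = quote(raw_value, safe="")
encoded_query = f"query=aitheta_index_name:'{encoded_value}'"
```

Encoding only the value between the single quotation marks keeps the clause name and the colon readable, as in the sample.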

SQL syntax

query=select proxima_score('index_name') as score,id from table_name where MATCHINDEX('index_name', ?) order by score asc limit 5&&kvpair=timeout:1000,iquan.plan.cache.enable:true;urlencode_data:false;iquan.plan.prepare.level:jni.post.optimize;dynamic_params:[["16%230.1%2c0.2%2c0.98%2c0.6%3b1512%230.3%2c0.4%2c0.98%2c0.6%26n%3d200"]]
Note: The index_name parameter specifies the name of your vector index. The dynamic_params parameter in the kvpair clause specifies the vectors that you want to query.
Note: The value of the dynamic_params parameter must be URL-encoded.
Note: If you want to query vectors based on categories, you must specify the category IDs and the vectors to be queried. Separate a category ID and a vector with a number sign (#). The number signs in the query must be URL-encoded. Separate multiple categories with commas (,) and separate multiple vectors with semicolons (;).

Syntax for a query with retrieval parameters

HA3 syntax

query=index_name:'0.1,0.2,0.98,0.6;0.3,0.4,0.98,0.6&n=10&sf=0.8&search_params={"proxima.qc.searcher.scan_ratio":0.001,"proxima.general.searcher.scan_count":10000}'
Note: The search_params parameter specifies the parameters that you want to configure for vector retrieval. The value must be in the JSON format. For more information about the proxima.qc.searcher.scan_ratio parameter, see the "Parameter description" section in this topic. The proxima.general.searcher.scan_count parameter is equivalent to the min_scan_doc_cnt parameter.
Note: The order of the n, sf, and search_params parameters cannot be changed.
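The search_params value can be derived from the sizing guidance in the "Parameter description" section, which recommends max(10000, 100*topk) candidates. The Python sketch below applies that formula and appends the n, sf, and search_params suffixes in the required order. The function names are hypothetical, and capping the ratio at 1.0 is an assumption added here for small tables, not something this topic states.

```python
import json
from typing import Dict, Union


def recommended_scan_settings(top_k: int, total_doc_cnt: int) -> Dict[str, Union[int, float]]:
    """Apply max(10000, 100*topk) and express it as both a count and a ratio."""
    if total_doc_cnt <= 0:
        raise ValueError("total_doc_cnt must be positive")
    scan_count = max(10000, 100 * top_k)
    scan_ratio = min(1.0, scan_count / total_doc_cnt)  # cap at 1.0 (assumption)
    return {
        "proxima.qc.searcher.scan_ratio": scan_ratio,
        "proxima.general.searcher.scan_count": scan_count,
    }


def retrieval_suffix(n: int, sf: float, search_params: Dict[str, Union[int, float]]) -> str:
    """Append n, sf, and search_params in the fixed order required by the query syntax."""
    params_text = json.dumps(search_params, separators=(",", ":"))
    return f"&n={n}&sf={sf}&search_params={params_text}"


settings = recommended_scan_settings(top_k=10, total_doc_cnt=10_000_000)
suffix = retrieval_suffix(10, 0.8, settings)
```

For a table of 10,000,000 documents and a top-10 query, this yields the same scan_ratio and scan_count values as the sample above.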

SQL syntax

query=select proxima_score('index_name') as score,id from table_name where MATCHINDEX('index_name', ?) order by score asc limit 5&&kvpair=timeout:1000,iquan.plan.cache.enable:true;urlencode_data:false;iquan.plan.prepare.level:jni.post.optimize;dynamic_params:[["0.892704,0.783731&n=10&sf=0.8&search_params={\"proxima.qc.searcher.scan_ratio\":0.001,\"proxima.general.searcher.scan_count\":10000}"]]

Note: The index_name parameter specifies the name of your vector index. The dynamic_params parameter in the kvpair clause specifies the vectors that you want to query. The search_params parameter specifies the parameters that you want to configure for vector retrieval. The value must be in the JSON format. For more information about the proxima.qc.searcher.scan_ratio parameter, see the "Parameter description" section in this topic. The proxima.general.searcher.scan_count parameter is equivalent to the min_scan_doc_cnt parameter.
Note: The order of the n, sf, and search_params parameters cannot be changed.

Syntax for a query that sorts returned results by similarity

HA3 syntax

query=index_name:'0.1,0.2,0.98,0.6;0.3,0.4,0.98,0.6...'&&kvpairs=first_formula:proxima_score(index_name)&&sort=+RANK
Note: The index_name parameter specifies the name of your vector index. Specify the vectors that you want to query after the colon (:) and before the first ampersand (&). The kvpairs clause specifies proxima_score(index_name) as the rough sort expression. The sort clause sorts query results in ascending order based on the similarity scores.

SQL syntax

query=select proxima_score('index_name') as score,id from table_name where MATCHINDEX('index_name', ?) order by score asc limit 5&&kvpair=timeout:1000,iquan.plan.cache.enable:true;urlencode_data:false;iquan.plan.prepare.level:jni.post.optimize;dynamic_params:[["0.892704,0.783731"]]
Note: The index_name parameter specifies the name of your vector index. The dynamic_params parameter in the kvpair clause specifies the vectors that you want to query. The proxima_score('index_name') function is used to obtain the scores of the vectors. The order by clause specifies that the vectors are sorted based on the scores. asc specifies ascending order. desc specifies descending order.