Perform a hybrid query

Background information

Hybrid query of OpenSearch Vector Search Edition

The hybrid query feature combines semantic search with keyword search to improve the performance of text search.

When you use OpenSearch Vector Search Edition, you can perform a hybrid query by using sparse and dense vectors. Compared with the traditional multimodal search feature based on text search and vector search, the hybrid query feature is provided based on sparse and dense vectors. OpenSearch combines a dense vector and a sparse vector into a single vector. The sparse vector is generated after you perform text vectorization. The dense vector is a traditional vector. Sparse and dense vectors represent different types of information and support different types of searches.

To perform a hybrid query, perform the following steps:

Use a dense vector model to generate dense vectors.
Use a sparse vector model to generate sparse vectors.
Purchase an OpenSearch Vector Search Edition instance and create a sparse and dense vector-based index table for the instance.
Add the dense and sparse vector data to the index table.
Perform a hybrid query by using the sparse and dense vectors.
OpenSearch Vector Search Edition returns the queried results.

Dense vectors

The basic vectors in OpenSearch Vector Search Edition are dense vectors. Semantic search is provided based on dense vectors. Semantic search returns results with the highest similarity based on the calculated distance if no result is exactly matched. If you want to use traditional text search, OpenSearch performs text analysis on the search query and documents based on a specific analyzer, matches the analysis results, and then returns only the documents whose analysis results exactly match the analysis results of the search query.

The limits of the text search feature do not apply to the semantic search feature. The semantic search feature can return documents that are semantically similar to the search query.

Sparse vectors

A sparse vector has a large number of dimensions. Values in only a small number of dimensions are non-zero values. When you use sparse vectors for keyword search, each sparse vector represents a document, each dimension represents a word in the dictionary, and the value of the dimension indicates the importance of the word in the document. The keyword search algorithm calculates the relevance of a document based on metrics such as the number and frequency of the matched keywords in the document.

Representation of a sparse vector:

V=[0,0,0,0,2,0,4,0,0,0]

The sparse representation of the V vector is (10,[4,6],[2,4]).

10 represents the number of elements in the vector, [4,6] represents the subscripts of non-zero elements in the vector, and [2,4] represents the values of non-zero elements in the vector.

You can perform text vectorization by using a sparse vector model. The following sample code shows a sparse vector that is generated after specific text is vectorized:

{
  "indices": [0, 100, 40, 50, 20],
  "values": [0.5, 0.9, 0.3, 0.7, 0.6]
}

In OpenSearch Vector Search Edition, two independent multi-value fields are used to represent sparse vector subscripts and values. The number of values must be the same in the two fields. The positions of the values in the two fields have a one-to-one correspondence.

In the preceding example, indices represent the subscripts of non-zero elements in the array of the sparse vector.

values represent the values of non-zero elements in the array of the sparse vector.

When you write or query data, subscripts must be arranged in ascending order, and values need to be adjusted based on the subscripts. The following code shows the adjusted multi-value fields:

{
  "indices": [0, 20, 40, 50, 100],
  "values": [0.5, 0.6, 0.3, 0.7, 0.9]
}

Query weight

In a hybrid query, the final score of the same document is the sum of the distance of the dense vector and the distance of the sparse vector. If you want to configure different weights for the sparse vector and dense vector, you can perform configurations based on the following code:

{
    "vector": [v * weight for v in dense_vector],
    "sparseData": {
        "indices": sparse_data["indices"],
        "values": [v * (1 - weight) for v in sparse_data["values"]]
    }
}

Purchase an OpenSearch Vector Search Edition instance

For more information, see Purchase an OpenSearch Vector Search Edition instance.

Configure an instance

On the details page of the purchased instance, the instance is in the Pending Configuration state. The system automatically deploys an instance that contains no data. The number and specifications of Query Result Searcher (QRS) workers and Searcher workers are the same as those you purchase. Before you can use the instance for search, perform the following steps: configure a table, add a data source, configure fields, configure an index schema, and then perform reindexing for the instance.

1. Configure the basic information of a table

In the left-side pane on the instance details page, click Table Management. On the Table Management page, click Add Table. In the Basic Table Information step of the Create wizard, configure the Table Name, Data Shards, Number of Resources for Data Updates, and Scenario Template parameters. In this topic, the Scenario Template parameter is set to Vector: Semantic Search for Text, and Hybrid Search by Dense and Sparse Vectors is selected.

Parameters:

Table Name: the name of the table. You can enter a custom table name.
Data Shards: the number of data shards contained in the table. Enter a positive integer in the range of 1 to 256. You can perform sharding to accelerate the full indexing and improve the performance of a single query. If you create multiple index tables in an existing OpenSearch instance, make sure that the index tables contain the same number of shards. Alternatively, make sure that at least one index table contains one shard and other index tables contain the same number of shards.
Number of Resources for Data Updates: the number of resources that are used for data updates. By default, a free quota of two resources for data updates is provided for each data source. Each resource consists of 4 CPU cores and 8 GB of memory. You are charged for resources that exceed the free quota. For more information, see Billing overview of OpenSearch Vector Search Edition.

2. Add a data source

In the Data Synchronization step, add a data source. You can add an Object Storage Service (OSS) data source, a MaxCompute data source, a Data Lake Formation (DLF) data source, or an API data source. In this example, MaxCompute + API is selected for Full Data Source. Configure the Project Name, AccessKey, AccessKey Secret, Table name, Partition, Timestamp, and Automatic Reindexing parameters. Then, click Check to check the data source information. If the check is passed, click Next.

For more information about MaxCompute data sources, see Create a table for a MaxCompute data source.
For more information about API data sources, see Create a table for an API data source.
For more information about OSS data sources, see Create a table for an OSS data source.
For more information about DLF data sources, see Create a table for a DLF data source.

3. Configure fields

In the Field Configuration step, configure fields. If you select Hybrid Search by Dense and Sparse Vectors in the Basic Table Information step, you must define id as the primary key field, vector as the dense vector field, sparse_values as the sparse vector value field, and sparse_indices as the sparse vector subscript field.

Take note of the following information:

The primary key field and vector field are required. For the primary key field, you must set the Type parameter to INT or STRING and select the Primary Key column. For the vector field, you must set the Type parameter to FLOAT and select the Vector Field column.
The vector field is required. You can specify multiple vectors in the vector field. You must define the vector field as a multi-value field of the FLOAT type.
You must configure the sparse vector subscript and sparse vector value fields in pairs. The sparse vector subscript field is defined as a multi-value field of the INT32 type, and the sparse vector value field is defined as a multi-value field of the FLOAT type.
When you configure a vector index, you must specify the fields in the order of the primary key field, namespace field, and vector field. The namespace field is optional. The preceding figure provides an example.

4. Configure the index schema

In the Index Schema step, configure the index schema. You must set the Hybrid Search parameter to Enable and configure the primary key, vector, sparse vector subscript, and sparse vector value fields in the Fields Contained parameter.

Parameters:

Index Name: This parameter is required. You can enter a custom index name.
Hybrid Search: Set the value to Enable.
After you enable Hybrid Search, configure the sparse vector subscript and sparse vector value fields.
Vector Dimension: the dimensions of the dense vector.
By default, after you enable Hybrid Search, the Distance Type parameter is set to InnerProduct, and the Vector Index Algorithm parameter is set to HNSW. You cannot modify these parameters.
You must separately configure parameters for the advanced configurations of the vector index. For more information, see Common configurations of vector indexes.

Note

If you use the test data provided in this topic, set the Vector Dimension parameter to 1536.

5. Confirm the creation

In the Confirm step, click Confirm.

6. View the change history

In the left-side pane on the instance details page, click Change History. On the page that appears, you can view all finite-state machines (FSMs) related to the processes of creating a table, creating indexes, and performing reindexing for full data. After the search engine is built, you can perform query tests in the instance.

7. Perform query tests

{
    "tableName": "dense_sparse_tb",
    "indexName": "vector",
    "vector": [
        0.1,
        0.2,
        0.3,
        0.4,
        0.5
    ],
    "sparseData": {
        "indices": [
            0,
            2
        ],
        "values": [
            1.2,
            2.4
        ]
    },
    "topK": 2,
    "order": "DESC"
}

tableName: the name of the table.
indexName: the name of the index. In this example, the indexName parameter is set to vector.
vector: the dense vector to be queried.
sparseData: the sparse vector to be queried.
- indices: the subscripts of the elements in the sparse vector.
- values: the values of the elements in the sparse vector.
topK: the Top K results to be returned.
order: the order in which the results are sorted. A value of DESC specifies that the results are sorted in descending order.

Syntax

For more information, see Hybrid query.

Use SDKs to perform vector-based queries

Use an SDK to perform vector-based queries or primary key-based queries. For more information, see Query data.
Use an SDK to upload or delete documents. For more information, see Update data.