Build an end-to-end image search service by using OpenSearch Vector Search Edition

This topic describes how to build an image search service by using OpenSearch Vector Search Edition if no vector data is available.

To implement image search capabilities such as image search based on specified images or text, you can import the image source data to OpenSearch and perform operations such as image vectorization and vector search in OpenSearch.

Architecture

image_409b3ae0cc76

You can use one of the following three service portfolios to upload images and build an image search engine:

OSS + MaxCompute + OpenSearch Vector Search Edition: You can upload images to an Object Storage Service (OSS) bucket and store the business table data and the image URL that corresponds to each data entry in MaxCompute. The image URL refers to the URL of each image in the OSS bucket. Example: /image/1.jpg.
MaxCompute + OpenSearch Vector Search Edition: You can store Base64-encoded images and the corresponding table data in MaxCompute.
API + OpenSearch Vector Search Edition: You can call an operation provided by OpenSearch Vector Search Edition to push Base64-encoded images and the corresponding table data to an OpenSearch Vector Search Edition instance.
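
For the second and third portfolios, each image has to be Base64-encoded before it is stored in MaxCompute or pushed through the API. The following is a minimal sketch that uses only the Python standard library. The function name encode_image_bytes and the sample bytes are illustrative assumptions, and any size limit or required Base64 variant should be checked against the OpenSearch documentation.

```python
import base64


def encode_image_bytes(raw: bytes) -> str:
    """Return the standard Base64 text form of raw image bytes."""
    return base64.b64encode(raw).decode("ascii")


# Stand-in bytes (the 8-byte PNG file signature). In practice, read a real
# image with open(path, "rb").read() and pass the result to the function.
sample = b"\x89PNG\r\n\x1a\n"
encoded = encode_image_bytes(sample)
print(encoded)
```

The returned string is plain ASCII, so it can be stored in a STRING column or placed in a JSON request body without further escaping.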

In this example, the first service portfolio is used to build an image search engine.

Preparations

1. Create an AccessKey pair

When you create an Alibaba Cloud account and log on to the console for the first time, the system prompts you to create an AccessKey pair before you perform subsequent operations.

You must create an AccessKey pair for your Alibaba Cloud account because the AccessKey pair is required when you create and use an OpenSearch instance.
After you create an AccessKey pair for your Alibaba Cloud account, you can create an AccessKey pair for a Resource Access Management (RAM) user. This way, you can access the OpenSearch instance as the RAM user. For more information about how to grant permissions to RAM users, see Create and authorize RAM users.

2. Create an OSS bucket

image_409ce896ccx9

Add a tag to the OSS bucket. Both the key and the value of the tag must be opensearch.

In this example, 1,000 images are uploaded to the OSS bucket.

The following figure shows some of the uploaded images.

image_409d84d0cc9c

Purchase an OpenSearch Vector Search Edition instance

For more information, see Purchase an OpenSearch Vector Search Edition instance.

Configure the instance

The purchased instance is in the Pending Configuration state. Before you can perform search operations, you must configure a table for the instance.

In the Basic Table Information step, configure the following parameters and click Next.

Parameters:

Table Name: the name of the table. You can customize the table name.
Data Shards: the number of data shards contained in the table. Enter a positive integer in the range of 1 to 256. Sharding accelerates full indexing and improves the performance of a single query. If you create multiple index tables in an existing OpenSearch instance, make sure that all index tables contain the same number of shards. Alternatively, make sure that at least one index table contains one shard and the other index tables contain the same number of shards.
Number of Resources for Data Updates: the number of resources used for data updates. By default, OpenSearch provides a free quota of two resources for data updates for each data source in an OpenSearch Vector Search Edition instance. Each resource consists of 4 CPU cores and 8 GB of memory. You are charged for resources that exceed the free quota. For more information, see Billing overview of OpenSearch Vector Search Edition for the international site (alibabacloud.com).
Scenario Template: the template that is used to create the table. In this example, Vector: Image Search is selected.
Data Processing: In this example, Convert Raw Data to Vector Data is selected.
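
The shard rule for multiple index tables can be checked before you create a new table. The sketch below encodes one reading of that rule: every count must be between 1 and 256, and after single-shard tables are ignored, all remaining tables must share one shard count. The function name is hypothetical and the check is not part of any OpenSearch SDK.

```python
def shard_counts_are_compatible(counts: list[int]) -> bool:
    """Check the shard counts of all index tables in one instance."""
    # Every table needs a shard count in the documented range of 1 to 256.
    if not counts or any(not 1 <= c <= 256 for c in counts):
        return False
    # Ignoring single-shard tables, at most one distinct count may remain.
    return len({c for c in counts if c != 1}) <= 1


print(shard_counts_are_compatible([4, 4, 4]))  # all tables share one count
print(shard_counts_are_compatible([1, 8, 8]))  # one single-shard table plus equal counts
print(shard_counts_are_compatible([2, 4]))     # two different multi-shard counts
```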

In the Data Synchronization step, configure the following parameters to add a data source, and click Check to check the data source information. If the data source information passes the check, click Next.

Parameters:

Full Data Source: the type of the data source. Valid values: MaxCompute + API, OSS + API, and API. In this example, MaxCompute + API is selected.
Project: the name of the MaxCompute project that you want to access.
AccessKey: the AccessKey ID of your Alibaba Cloud account or a RAM user within the Alibaba Cloud account.
AccessKey Secret: the AccessKey secret that corresponds to the AccessKey ID.
Table: the name of the MaxCompute table that you want to access.
Partition Key: the partition key of the MaxCompute data source. This parameter is required. Example: ds=20170626.
Timestamp: If incremental data is pushed by using API operations, this parameter specifies the point in time from which the system synchronizes incremental data from the data source. You can select a date and time within the previous 72 hours.
Automatic Reindexing: specifies whether to enable the automatic reindexing feature. If the automatic reindexing feature is enabled, the system automatically performs reindexing for the index table that references the data source each time the system detects a data change in the data source.

Note

If you enable the automatic reindexing feature, you must create a done table. For more information, see the Configure automatic reindexing section of the "MaxCompute data source" topic.
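
Two of the values above follow simple rules that are easy to get wrong by hand: the partition key has the form column=value (ds=20170626 in the example), and the timestamp must fall within the previous 72 hours. The following sketch builds and checks these values. The column name ds is taken from the example, the helper names are hypothetical, and treating the 72-hour bound as inclusive is an assumption.

```python
from datetime import date, datetime, timedelta, timezone


def partition_key(day: date, column: str = "ds") -> str:
    """Format a partition key such as ds=20170626 for the given day."""
    return f"{column}={day:%Y%m%d}"


def within_sync_window(start: datetime, now: datetime, hours: int = 72) -> bool:
    """Return True if start lies within the last `hours` hours before now."""
    return timedelta(0) <= now - start <= timedelta(hours=hours)


print(partition_key(date(2017, 6, 26)))

now = datetime.now(timezone.utc)
print(within_sync_window(now - timedelta(hours=24), now))
```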

In the Field Configuration step, configure fields for the table and click Next.

You must configure at least a primary key field, a vector field, and a field that requires embedding.

Images stored in an OSS bucket

If you select the Vector: Image Search template, OpenSearch automatically generates the following preset fields: id, cate_id, vector, and vector_source_image. The id field specifies the primary key. The cate_id field specifies the category ID. The vector field specifies the vector. The vector_source_image field specifies the URL of the stored image. After you configure a MaxCompute data source, the fields that are synchronized from the data source are displayed below the preset fields.

3.1. Configure the vector_source_image field. You must set the Type parameter to STRING and select the check box in the Require Embedding column for the field.

You can modify the name of the preset field based on the corresponding business table field. Make sure that the advanced configurations of the field are correct.

To configure parameters for the advanced configurations of the vector_source_image field, click Edit in the Advanced Configurations column. In the panel that appears, configure the following parameters.

Parameters:

Vectorization Model
- clip: converts a regular image to a vector.
- clip_ecom: converts an e-commerce image to a vector.
Data Type: In this example, image(path) is selected.
Source Content Type: In this example, oss is selected.
OSS Bucket: the name of the OSS bucket.

Note: If the source data comes from an OSS bucket, a service-linked role named AliyunServiceRoleForSearchEngine is required.

Important

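The value of the vector_source_image field must be identical to the path of the image in the OSS bucket. For example, if the OSS path of an image is /test image/lake.jpg, the value of the vector_source_image field must be /test image/lake.jpg.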
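
When you fill the vector_source_image column from a list of OSS object keys, a small helper can keep the values in the form shown in the example. The sketch below assumes, based on that example only, that the value starts with a single forward slash and keeps spaces as literal characters instead of URL-encoding them. The function name is hypothetical.

```python
def to_source_image_value(object_key: str) -> str:
    """Return the object key with exactly one leading slash, not URL-encoded."""
    return "/" + object_key.lstrip("/")


print(to_source_image_value("test image/lake.jpg"))
```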

3.2. Configure the vector field. You must set the Type parameter to FLOAT and select the check box in the Vector Field column for the field.

Base64-encoded images

3.1. Configure the vector_source_image field. You must set the Type parameter to STRING and select the check box in the Require Embedding column for the field.

You can modify the name of the preset field based on the corresponding business table field. Make sure that the advanced configurations of the field are correct.

Vectorization Model: You can select a model to convert a regular image or an e-commerce image to a vector. Valid values:
- clip: converts a regular image to a vector.
- clip_ecom: converts an e-commerce image to a vector.

Data Type: In this example, Image (Base64-encoded) is selected.

3.2. Configure the vector field. You must set the Type parameter to FLOAT and select the check box in the Vector Field column for the field.

In the Index Schema step, configure the index schema and click Next.

Note:

The name of the vector index is the same as the name of the vector field.
The Fields Contained section contains the following fields: Primary Key Field, Vector Field, and Namespace. The Namespace field is optional.
Configure parameters for the advanced configurations based on your business requirements.

Note

The default dimension of the vector that is generated from an image is 512 and cannot be modified.

After you complete the configurations, click Confirm.

On the instance details page, click Change History in the left-side navigation pane to view the creation progress of the table.

After the full data is synchronized, you can perform search tests.

Perform query tests

Perform query tests in the OpenSearch console

For more information about how to query data in the OpenSearch console, see Query tests.

Use an SDK to perform query tests

Sample query request: See the Prediction query section of the "Query data" topic.

Sample results:

{
    "totalCount": 5,
    "result": [
        {
            "id": 5,
            "score": 1.103209137916565
        },
        {
            "id": 3,
            "score": 1.1278988122940064
        },
        {
            "id": 2,
            "score": 1.1326735019683838
        }
    ],
    "totalTime": 242.615
}

result: the returned results. Each entry contains the primary key (id) and the score of a matched document.
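
A response of this shape can be handled with the standard json module. The following sketch parses the sample results shown above and lists each id with its score. It makes no assumption about how the response is obtained or about what the score measures. The variable names are illustrative.

```python
import json

# The sample response from this topic, embedded as a string for illustration.
sample = """
{
    "totalCount": 5,
    "result": [
        {"id": 5, "score": 1.103209137916565},
        {"id": 3, "score": 1.1278988122940064},
        {"id": 2, "score": 1.1326735019683838}
    ],
    "totalTime": 242.615
}
"""

response = json.loads(sample)

# Collect (id, score) pairs in the order in which the service returned them.
hits = [(item["id"], item["score"]) for item in response["result"]]

for doc_id, score in hits:
    print(f"id={doc_id} score={score:.4f}")
```

In this sample, totalCount is 5 while result holds three entries, so code that processes the response should iterate over result instead of relying on totalCount for the number of returned entries.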

Syntax

For more information about the syntax for prediction-based queries, see the Image vectorization-based query section of the "Prediction-based query" topic.
For more information about the syntax for primary key-based queries, see Primary key-based query.
For more information about the syntax for filter expressions, see Filter expression.

Use an SDK to perform vector-based queries

Use an SDK to perform vector-based queries or primary key-based queries. For more information, see Query data.
Use an SDK to add or delete documents. For more information, see Update data.