OpenSearch: Build an end-to-end image search solution

Last Updated: Aug 26, 2024

This topic describes how to build an image search service by using OpenSearch Vector Search Edition when you have source images but no vector data.

To implement image search capabilities such as image search based on specified images or text, you can import the image source data to OpenSearch and perform several operations such as image vectorization and vector search in OpenSearch.

Architecture

image_409b3ae0cc76

You can use one of the following three service portfolios to upload images and build an image search engine:

  • OSS + MaxCompute + OpenSearch Vector Search Edition: You can upload images to an Object Storage Service (OSS) bucket and store the business table data and the image URL that corresponds to each data entry in MaxCompute. The image URL refers to the URL of each image in the OSS bucket. Example: /image/1.jpg.

  • MaxCompute + OpenSearch Vector Search Edition: You can store Base64-encoded images and the corresponding table data in MaxCompute.

  • API + OpenSearch Vector Search Edition: You can call an operation provided by OpenSearch Vector Search Edition to push Base64-encoded images and the corresponding table data to an OpenSearch Vector Search Edition instance.

In this example, the first service portfolio is used to build an image search engine.

Preparations

1. Create an AccessKey pair

When you create an Alibaba Cloud account and log on to the console for the first time, the system prompts you to create an AccessKey pair before you perform subsequent operations.

  • You must create an AccessKey pair for your Alibaba Cloud account because the AccessKey pair is required when you create and use an OpenSearch instance.

  • After you create an AccessKey pair for your Alibaba Cloud account, you can create an AccessKey pair for a Resource Access Management (RAM) user. This way, you can access the OpenSearch instance as the RAM user. For more information about how to grant permissions to RAM users, see Create and authorize RAM users.

2. Create an OSS bucket

image_409ce896ccx9

  1. Create a bucket and upload your images to the bucket. For more information, see Get started by using the OSS console.

  2. Create the opensearch tag for the OSS bucket. Both the key and the value of the tag are opensearch.

image

In this example, 1,000 images are uploaded to the OSS bucket.

image_409d36b4ccr4

The following figure shows some of the uploaded images.

image_409d84d0cc9c

Purchase an OpenSearch Vector Search Edition instance

For more information, see Purchase an OpenSearch Vector Search Edition instance.

Configure an instance

The purchased instance is in the Pending Configuration state. Before you can perform search operations, you must configure a table for the instance.

image.png

  1. In the Basic Table Information step, configure the following parameters and click Next.

image.png

Parameters:

  • Table Name: the name of the table. You can customize the table name.

  • Data Shards: the number of data shards contained in the table. If you create multiple index tables in an OpenSearch instance, make sure that the index tables contain the same number of shards. Alternatively, make sure that at least one index table contains one shard and other index tables contain the same number of shards.

  • Number of Resources for Data Updates: the number of resources used for data updates. By default, a free quota of two resources for data updates is provided for each index. Each resource consists of 4 CPU cores and 8 GB of memory. You are charged for resources that exceed the free quota. For more information, see Billing of OpenSearch Vector Search Edition for the international site (alibabacloud.com).

  • Scenario Template: the template that is used to create the table. In this example, Vector: Image Search is selected.

  • Data Processing: In this example, Convert Raw Data to Vector Data is selected.

  2. In the Data Synchronization step, configure the following parameters to add a data source, and click Check to check the data source information. If the check is passed, click Next.

image.png

Parameters:

  • Full Data Source: the type of the data source. Valid values: MaxCompute + API, OSS + API, and API. In this example, MaxCompute + API is selected.

  • Project: the name of the MaxCompute project that you want to access.

  • AccessKey: the AccessKey ID of your Alibaba Cloud account or a RAM user within the Alibaba Cloud account.

  • AccessKey Secret: the AccessKey secret that corresponds to the AccessKey ID.

  • Table: the name of the MaxCompute table that you want to access.

  • Partition Key: the partition key of the MaxCompute data source. This parameter is required. Example: ds=20170626.

  • Timestamp: If incremental data is pushed by using API operations, this parameter specifies the point in time from which the system synchronizes incremental data from the data source. You can select a date and time within the previous 72 hours.

  • Automatic Reindexing: specifies whether to enable the automatic reindexing feature. If the automatic reindexing feature is enabled, the system automatically performs reindexing for the index table that references the data source each time the system detects a data change in the data source.

Note
  • If you enable the automatic reindexing feature, you must create a done table. For more information, see the Configure automatic reindexing section of the "MaxCompute data source" topic.
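The Partition Key and Timestamp parameters both depend on dates, so it can help to generate and check them in code before you fill in the form. The following sketch assumes that the partition column is named ds and uses the yyyymmdd format, as in the ds=20170626 example. The helper names partition_spec and within_sync_window are hypothetical.

```python
from datetime import date, datetime, timedelta


def partition_spec(day, key="ds"):
    """Format a partition value such as ds=20170626 (assumed yyyymmdd layout)."""
    return f"{key}={day:%Y%m%d}"


def within_sync_window(ts, now, window_hours=72):
    """Return True if ts lies within the previous window_hours before now."""
    age = now - ts
    return timedelta(0) <= age <= timedelta(hours=window_hours)


print(partition_spec(date(2017, 6, 26)))  # ds=20170626

now = datetime(2024, 8, 26, 12, 0)
print(within_sync_window(now - timedelta(hours=24), now))  # True
print(within_sync_window(now - timedelta(hours=96), now))  # False
```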

  3. In the Field Configuration step, configure fields for the table and click Next.

image.png

You must configure at least a primary key field, a vector field, and a field that requires embedding.

Images stored in an OSS bucket

If you select the Vector: Image Search template, OpenSearch automatically generates the following preset fields: id, cate_id, vector, and vector_source_image. The id field specifies the primary key. The cate_id field specifies the category ID. The vector field specifies the vector. The vector_source_image field specifies the URL of the stored image. After you configure a MaxCompute data source, the fields that are synchronized from the data source are displayed below the preset fields.

3.1. Configure the vector_source_image field. You must set the Type parameter to STRING and select Require Embedding for the field.

You can modify the name of the preset field based on the corresponding business table field. Make sure that the advanced configurations of the field are correct.

image.png

To configure parameters for the advanced configurations of the vector_source_image field, click Edit in the Advanced Configurations column. In the panel that appears, configure the following parameters.

image

Parameters:

  • Vectorization Model: the model that is used to convert images to vectors. Valid values:

    • clip: converts a regular image to a vector.

    • clip_ecom: converts an e-commerce image to a vector.

  • Data Type: In this example, image(path) is selected.

  • Source Content Type: In this example, oss is selected.

  • OSS Bucket: the name of the OSS bucket.

Note: If the source data comes from an OSS bucket, a service-linked role named AliyunServiceRoleForSearchEngine is required.

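Important

The value of the vector_source_image field must be identical to the path of the image in the OSS bucket. For example, if the OSS URL of an image is /test image/lake.jpg, the value of the vector_source_image field must be /test image/lake.jpg. The following figures show this example.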

OSS URL

image.png

Field value in MaxCompute

image.png
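When you generate the MaxCompute rows from a list of OSS object keys, a small helper can keep the vector_source_image values consistent with the example above. This is a sketch under two assumptions that are inferred from the /test image/lake.jpg example rather than stated by the product: the value starts with a slash, and it is not URL-encoded, so spaces stay as they are. The function name to_vector_source_image is hypothetical.

```python
def to_vector_source_image(object_key):
    """Map an OSS object key to the value stored in vector_source_image.

    Assumption: the field holds the bucket-relative path with a leading
    slash and without URL encoding, as in "/test image/lake.jpg".
    """
    if not object_key or object_key.endswith("/"):
        raise ValueError("expected the key of an image object, not a directory")
    # Normalize to exactly one leading slash and leave the rest untouched.
    return "/" + object_key.lstrip("/")


print(to_vector_source_image("test image/lake.jpg"))  # /test image/lake.jpg
print(to_vector_source_image("/image/1.jpg"))         # /image/1.jpg
```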

3.2. Configure the vector field. You must set the Type parameter to FLOAT and select Vector Field for the field.

image.png

Base64-encoded images

If you select the Vector: Image Search template, OpenSearch automatically generates the following preset fields: id, cate_id, vector, and vector_source_image. The id field specifies the primary key. The cate_id field specifies the category ID. The vector field specifies the vector. The vector_source_image field specifies the Base64-encoded image. After you configure a MaxCompute data source, the fields that are synchronized from the data source are displayed below the preset fields.

3.1. Configure the vector_source_image field. You must set the Type parameter to STRING and select Require Embedding for the field.

You can modify the name of the preset field based on the corresponding business table field. Make sure that the advanced configurations of the field are correct.

image.png

To configure parameters for the advanced configurations of the vector_source_image field, click Edit in the Advanced Configurations column. In the panel that appears, configure the following parameters.

image.png

  • Vectorization Model: You can select a model to convert a regular image or an e-commerce image to a vector. Valid values:

    • clip: converts a regular image to a vector.

    • clip_ecom: converts an e-commerce image to a vector.

  • Data Type: In this example, Image (Base64-encoded) is selected.
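For the Base64-encoded data type, each image must be converted to a Base64 string before it is written to the vector_source_image field. The following sketch uses the Python standard library. It assumes that the field expects the standard Base64 alphabet without a data URI prefix, which this topic does not specify. The helper names are hypothetical.

```python
import base64


def encode_image_bytes(image_bytes):
    """Return the Base64 text for raw image bytes (standard alphabet)."""
    if not image_bytes:
        raise ValueError("image is empty")
    return base64.b64encode(image_bytes).decode("ascii")


def encode_image_file(path):
    """Read an image file in binary mode and return its Base64 text."""
    with open(path, "rb") as f:
        return encode_image_bytes(f.read())


# Round trip on a few bytes to show that the encoding is lossless.
sample = b"\x89PNG\r\n\x1a\n"
encoded = encode_image_bytes(sample)
print(base64.b64decode(encoded) == sample)  # True
```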

3.2. Configure the vector field. You must set the Type parameter to FLOAT and select Vector Field for the field.

image.png

  4. In the Index Schema step, configure the index schema and click Next.

image.png

Note:

  • The name of the vector index is the same as the name of the vector field.

  • The Fields Contained section contains the following fields: Primary Key Field, Vector Field, and Namespace. The Namespace field is optional.

  • Configure parameters for the advanced configurations based on your business requirements.

Note

The default dimension of the vector that is generated from an image is 512 and cannot be modified.
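Because the dimension is fixed at 512, any vector that you supply yourself, for example in a vector-based query, must have exactly 512 components. The following is a minimal client-side check. The constant and function names are hypothetical, and the check is a convenience sketch rather than something that OpenSearch requires you to run.

```python
IMAGE_VECTOR_DIMENSION = 512  # fixed dimension stated in the note above


def check_image_vector(vector):
    """Validate the length of a vector and return it as a list of floats."""
    if len(vector) != IMAGE_VECTOR_DIMENSION:
        raise ValueError(
            f"expected {IMAGE_VECTOR_DIMENSION} dimensions, got {len(vector)}"
        )
    return [float(x) for x in vector]


query_vector = check_image_vector([0.0] * 512)
print(len(query_vector))  # 512
```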

  5. After you complete the configurations, click Confirm.

image.png

  6. To view the creation progress of the table, click Change History in the left-side pane on the instance details page, and then click the Data Source Changes tab.

image.png

After the full data is synchronized, you can perform search tests.

Perform search tests

Use the OpenSearch console

For more information, see Query tests.

SDK reference

Sample search: For more information, see the Prediction query section of the "Query data" topic.

Sample results:

{
    "totalCount": 5,
    "result": [
        {
            "id": 5,
            "score": 1.103209137916565
        },
        {
            "id": 3,
            "score": 1.1278988122940064
        },
        {
            "id": 2,
            "score": 1.1326735019683838
        }
    ],
    "totalTime": 242.615
}

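result: the returned results. Each entry contains the id of a matched document, which is the value of the primary key field, and the score of the document.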
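A response in the shape of the sample above can be read with the standard json module. The following sketch assumes that you have the response body as JSON text. An SDK might already return a parsed object, in which case the json.loads call is not needed. The function name extract_hits is hypothetical.

```python
import json

# Response body copied from the sample results above.
SAMPLE_RESPONSE = """
{
    "totalCount": 5,
    "result": [
        {"id": 5, "score": 1.103209137916565},
        {"id": 3, "score": 1.1278988122940064},
        {"id": 2, "score": 1.1326735019683838}
    ],
    "totalTime": 242.615
}
"""


def extract_hits(response_text):
    """Return (id, score) pairs in the order in which they were returned."""
    body = json.loads(response_text)
    return [(item["id"], item["score"]) for item in body.get("result", [])]


hits = extract_hits(SAMPLE_RESPONSE)
for doc_id, score in hits:
    print(doc_id, score)
```

The id values can then be used to look up the matching rows, and therefore the image paths, in your business table.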

Syntax

Use an SDK to perform vector-based queries

  • Use an SDK to perform vector-based queries or primary key-based queries. For more information, see Query data.

  • Use an SDK to upload or delete documents. For more information, see Update data.