This topic describes how to select Object Storage Service (OSS) + API as the data source when you create a table.
The process consists of the following two steps:
Activate OSS
Add an OSS + API data source
Step 1: Activate OSS
1. Activate the service
Activate OSS. Make sure that OSS and your OpenSearch service are in the same region.
2. Create a bucket
Before you upload files to OSS, you must create a bucket to store the files. For more information, see Create buckets.
Path: OSS Management Console → Bucket List → Create Bucket
After you create a bucket, add an OpenSearch tag to the bucket.
Path: Bucket Management → Bucket Tags → Create Tag
3. Configure the OSS file format
A. OSS file format
To ensure that the transferred files can be used as a data source for index building, make sure that:
Files are encoded in UTF-8.
File data is converted to the HA3 format or JSON format.
I. JSON format configuration
Example of multiple records:
{"field_double": ["100.0", "221.123", "500.3333333"], "field_int32": ["100", "200", "300"], "title": "Huawei Mate 9 Kirin 960 chip Leica dual camera", "color": "Red", "empty_int32": "", "price": "3599", "CMD": "add", "nid": "1", "gather_cn_str": "", "desc": ["str1", "str2", "str3"], "brand": "Huawei", "size": "5.9","__subdocs__":[{"sub_pk":"100","sub_field1":"200","sub_field2":["100","200","300"]},{"sub_pk":"200","sub_field1":"200","sub_field2":["100","200","300"]}]}
{"field_double": ["100.0", "221.123", "500.3333333", "100.0", "221.123", "500.3333333"], "field_int32": ["100", "200", "300", "100", "200", "300"], "title": "Huawei/Huawei P10 Plus full network phone", "color": "Blue", "empty_int32": "", "price": "4388", "CMD": "add", "nid": "2", "gather_cn_str": "colorBlue", "desc": ["str1", "str2", "str3", "str1", "str2", "str3"], "brand": "Huawei", "size": "5.5","__subdocs__":[{"sub_pk":"100","sub_field1":"200","sub_field2":["100","200","300"]},{"sub_pk":"200","sub_field1":"200","sub_field2":["100","200","300"]}]}
Description:
All JSON fields are of the string type. When the engine builds an index, it converts the fields to the types specified in the schema.
'\n' represents a line feed. A single record cannot contain line feeds.
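The two rules above (all values are strings, one record per line) can be sketched in Python as follows. This is a minimal illustration, not an official tool: the helper names `stringify`, `to_record_line`, and `write_records` are hypothetical, and mapping `None` to an empty string is an assumption based on the `"empty_int32": ""` field in the sample records.

```python
import json


def stringify(value):
    """Recursively convert scalars, lists, and sub-documents to string values."""
    if isinstance(value, (list, tuple)):
        return [stringify(v) for v in value]
    if isinstance(value, dict):
        return {k: stringify(v) for k, v in value.items()}
    if value is None:
        return ""  # assumption: empty fields are written as "" as in the sample
    return str(value)


def to_record_line(doc, cmd="add"):
    """Serialize one document as a single JSON line with a CMD field."""
    record = {"CMD": cmd}
    record.update(stringify(doc))
    # json.dumps escapes line feeds inside strings, so the output has no raw '\n'.
    line = json.dumps(record, ensure_ascii=False)
    if "\n" in line:
        raise ValueError("a record must not contain a raw line feed")
    return line


def write_records(path, docs):
    """Write one record per line, encoded in UTF-8."""
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for doc in docs:
            f.write(to_record_line(doc) + "\n")
```

For example, `to_record_line({"nid": 1, "price": 3599})` yields a line in which both `nid` and `price` are strings, matching the style of the sample records.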
II. HA3 format configuration
First, look at the content of a complete data file named standard_sample.data.
CMD=add^_
PK=12345321^_
url=http://www.aliyun.com/index.html^_
title=Alibaba Cloud Computing Co., Ltd.^_
body=xxxxxx xxx^_
time=3123423421^_
multi_value_field=1234^]324^]342^_
bidwords=mp3^\price=35.8^Ptime=13867236221^]mp4^\price=32.8^Ptime=13867236221^_
^^
CMD=delete^_
PK=12345321^_
File separator definitions: The data file above contains two commands: add and delete. Each command consists of multiple lines, and each line is a key-value pair. Commands are separated by '^^\n', key-value pairs are separated by '^_\n', and multiple values are separated by '^]'. The following table describes the file separators.
C++ encoding | ASCII hexadecimal | Description | Display format in emacs/vi | Input method in emacs | Input method in vi
"\x1F\n" | 1F0A | Key-value separator | ^_ (followed by a line feed) | C-q C-7 | C-v C-7 |
"\x1E\n" | 1E0A | Command separator | ^^ (followed by a line feed) | C-q C-6 | C-v C-6 |
"\x1D" | 1D | Multi-value separator | ^] | C-q C-5 | C-v C-5 |
"\x1C" | 1C | Section weight flag | ^\ | C-q C-4 | C-v C-4 |
"\x1D" | 1D | Section separator | ^] | C-q C-5 | C-v C-5 |
"\x03" | 03 | Sub-doc field separator | ^C | C-q C-c | C-v C-c |
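As a rough illustration of how the separators in the table fit together, the following Python sketch builds a command in HA3 format. The function names are hypothetical, and only the key-value, command, and multi-value separators are handled; the section weight flag and the sub-doc field separator are left out.

```python
KV_SEP = "\x1f\n"   # key-value separator, displayed as ^_ followed by a line feed
CMD_SEP = "\x1e\n"  # command separator, displayed as ^^ followed by a line feed
MULTI_SEP = "\x1d"  # multi-value separator, displayed as ^]


def encode_value(value):
    """Join multi-value fields with the multi-value separator."""
    if isinstance(value, (list, tuple)):
        return MULTI_SEP.join(str(v) for v in value)
    return str(value)


def encode_command(cmd, fields):
    """Build one command: the CMD line first, then one key-value pair per line."""
    parts = ["CMD=" + cmd + KV_SEP]
    for key, value in fields.items():
        parts.append(key + "=" + encode_value(value) + KV_SEP)
    return "".join(parts) + CMD_SEP


def write_ha3(path, commands):
    """Write (cmd, fields) pairs to a UTF-8 file without newline translation."""
    with open(path, "w", encoding="utf-8", newline="") as f:
        for cmd, fields in commands:
            f.write(encode_command(cmd, fields))
```

Opening the file with `newline=""` keeps the line feeds exactly as written, which matters because the separators are defined as byte sequences ending in `0A`.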
Command format definitions
Add command format: The add command adds new content to the index. The first line of the add command must be CMD=add, followed by the fields of the document. The fields can be listed in the same order as the fields in the schema. Every field that appears must be defined in the fields of the schema.
CMD=add^_
PK=12345321^_
url=http://www.aliyun.com/index.html^_
title=Alibaba Cloud Computing Co., Ltd.^_
body=xxxxxx xxx^_
time=3123423421^_
multi_value_field=1234^]324^]342^_
bidwords=mp3^\price=35.8^Ptime=13867236221^]mp4^\price=32.8^Ptime=13867236221^_
^^
Delete command format: The delete command removes specified content from the index. The first line of the delete command must be CMD=delete, followed by the field defined as the primary key in the index schema and the field used for partition hash. If the two fields are the same, only one field needs to appear.
CMD=delete^_
PK=12345321^_
^^
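Before uploading, it can help to read a data file back and check the command rules described above. The following Python sketch does this under stated assumptions: `parse_ha3` and `check_command` are hypothetical helpers, the primary key field is assumed to be named PK as in the sample, and the partition hash field is assumed to be the same as the primary key.

```python
KV_SEP = "\x1f\n"   # key-value separator (^_ plus line feed)
CMD_SEP = "\x1e\n"  # command separator (^^ plus line feed)
MULTI_SEP = "\x1d"  # multi-value separator (^])


def parse_ha3(text):
    """Split HA3 text into a list of field dictionaries, one per command."""
    commands = []
    for block in text.split(CMD_SEP):
        if not block.strip():  # skip the empty remainder after the last command
            continue
        fields = {}
        for pair in block.split(KV_SEP):
            if not pair:
                continue
            key, _, value = pair.partition("=")
            fields[key] = value.split(MULTI_SEP) if MULTI_SEP in value else value
        commands.append(fields)
    return commands


def check_command(fields, primary_key="PK"):
    """Return a list of problems found in one parsed command."""
    problems = []
    if not fields or next(iter(fields)) != "CMD":
        problems.append("the first line must be the CMD field")
    cmd = fields.get("CMD")
    if cmd not in ("add", "delete"):
        problems.append("CMD must be add or delete")
    if cmd == "delete" and primary_key not in fields:
        problems.append("a delete command must include the primary key field")
    return problems
```

The parser splits on the explicit separator strings rather than calling `str.splitlines()`, because Python treats the control characters `\x1c`, `\x1d`, and `\x1e` as line boundaries and would break records in the wrong places.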
B. Upload files to the bucket
Return to the control page of the newly created bucket. In the left navigation pane, choose Files → Files → Upload.
In the Upload section, select the method to scan files, select the content to upload, and then click Upload to complete the file upload.
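The upload can also be scripted instead of using the console. The following Python sketch is one possible approach, not a documented procedure: `object_key` and `upload_files` are hypothetical helpers, and the `bucket` argument is assumed to be an object that offers `put_object_from_file(key, filename)`, such as a `Bucket` from the Alibaba Cloud oss2 SDK. Files are always placed in a folder because the OSS Path setting in Step 2 does not accept files in the root directory.

```python
import os


def object_key(folder, filename):
    """Build the object key for a file stored under a folder in the bucket."""
    folder = folder.strip("/")
    if not folder:
        raise ValueError("files must be placed in a folder, not the bucket root")
    return folder + "/" + os.path.basename(filename)


def upload_files(bucket, local_paths, folder):
    """Upload each local file into the folder and return the object keys used."""
    # Assumption: with the oss2 SDK installed, a bucket could be created as
    #   bucket = oss2.Bucket(oss2.Auth(key_id, key_secret), endpoint, bucket_name)
    # where all four values are placeholders for your own settings.
    keys = []
    for path in local_paths:
        key = object_key(folder, path)
        bucket.put_object_from_file(key, path)
        keys.append(key)
    return keys
```

Passing the bucket in as an argument keeps the helper independent of any credentials, so it can be tried out with a stand-in object before a real upload.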
Step 2: Add an OSS + API data source
Purchase an instance: For more information, see Purchase a Vector Search Edition instance.
1. Basic table information
Find the purchased Vector Search Edition instance in the Instance List. In the Actions column, click Manage. In the left navigation pane, choose Table Management → Add Table.
After you complete the basic table information, click Next.
Configuration description:
Table Name: You can customize the table name.
Number of Data Shards: Enter a positive integer that does not exceed 256. Data shards are used to improve the full build speed and single-query performance.
Number of Data Update Resources: The number of resources used for data updates. Each index provides two free 4-core 8 GB update resources by default. Resources beyond the free quota will incur fees. For more information, see Vector Search Edition international site billing documentation.
Scenario Template: Three templates are available: General Template, Vector: Image Search Template, and Vector: Text Semantic Search Template.
2. Data synchronization
Select "Object Storage Service (OSS) + API" for Full Data Source and complete the other configurations. After the Data Source Verification is passed, click Next.
Configuration description:
OSS Path: The path of the OSS files. The path must start with / and cannot contain the ?, =, or & characters. Files cannot be placed in the root directory of the bucket. Place them in a folder and enter the path of that folder.
OSS Bucket: The name of the OSS bucket.
Data Format: The transferred files must be in HA3 or JSON format. Otherwise, the file data cannot be transferred successfully.
Data Source Verification: You can proceed to the next step after the verification is passed.
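The OSS Path rules above can be checked locally before you submit the form. The following Python sketch is only an approximation of those rules as stated in this topic; `validate_oss_path` is a hypothetical helper, and the console's own Data Source Verification remains the authoritative check.

```python
FORBIDDEN_CHARS = "?=&"


def validate_oss_path(path):
    """Return a list of rule violations for an OSS Path value."""
    problems = []
    if not path.startswith("/"):
        problems.append("path must start with /")
    bad = sorted(set(path) & set(FORBIDDEN_CHARS))
    if bad:
        problems.append("path contains forbidden characters: " + " ".join(bad))
    if not path.strip("/"):
        problems.append("path must point to a folder, not the root directory")
    return problems
```

An empty list means that the value satisfies all three rules. For example, `/opensearch/full` passes, whereas `/` is rejected because it refers to the root directory.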
3. Field configuration
Vector Search Edition presets fields based on the scenario template that you select and automatically imports fields from the full data source into the field list. After you complete the Field Configuration and "Data Pre-processing Required - Configure" settings, click Next.
Field meanings:
id (primary key)
source_image (source image)
namespace (namespace)
source_image_vector (source image vector)
Configuration description:
The primary key field and vector field are required. For the primary key field, you must set the Type parameter to an integer type or the STRING type and select the option button in the Primary Key column. For the vector field, you must set the Type parameter to FLOAT and select the check box in the Vector Field column.
The vector field is a multi-value FLOAT type by default. The multi-value separator uses a comma as the default delimiter, but you can also enter a custom multi-value separator.
When a field is missing or empty in the data, the system automatically supplements the default value. For numeric types, the default value is 0. For STRING types, the default value is an empty string. You can customize the default values.
Data Pre-processing Required - Configure: Click Configure to enter the data pre-processing configuration interface for the source_image field.
Data Source: There are two data type options: OSS Object Storage and Base64 Encoding. In this example, we select OSS Object Storage.
OSS Object Storage: Store the images in a folder in OSS and enter the OSS path. The images are imported directly from OSS.
Base64 Encoding: Encode the images in Base64 first, and then store them in a database or send them directly by using the API.
Pre-processing Template: The templates that are displayed depend on the type of data to be pre-processed (text or image). Image data is pre-processed in this example, so the following templates are displayed: Image Vectorization, OCR Image Text Recognition, and OCR Image Text Recognition + Image Vectorization. In this example, the Image Vectorization template is selected.
Service List:
After selecting a pre-processing template, the service list under the template automatically appears, showing the types of models used in the template.
Available model sources:
Built-in Models: These models are fewer in type and quantity but can be called for free.
Custom Models: Users can customize models according to their needs by performing Add Model operations in Model List > Custom Models. For more information, see Custom models.
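To make the default-value and multi-value separator rules from the field configuration description concrete, the following Python sketch mimics them locally, for example to preview how a record would look after import. It is not the engine's implementation. The helper names are hypothetical, and the numeric type names other than FLOAT are assumptions.

```python
# Assumption: type names other than FLOAT and STRING are illustrative only.
NUMERIC_TYPES = {"INT32", "INT64", "FLOAT", "DOUBLE"}


def default_for(field_type):
    """Numeric types default to 0, and STRING types default to an empty string."""
    return 0 if field_type in NUMERIC_TYPES else ""


def fill_defaults(record, schema, overrides=None):
    """Fill missing or empty fields with custom or built-in default values."""
    overrides = overrides or {}
    filled = dict(record)
    for name, field_type in schema.items():
        if filled.get(name) in (None, ""):
            filled[name] = overrides.get(name, default_for(field_type))
    return filled


def parse_vector(text, separator=","):
    """Split a multi-value FLOAT field using the configured separator."""
    return [float(x) for x in text.split(separator) if x.strip()]
```

The `overrides` argument corresponds to customized default values, and the `separator` argument corresponds to a custom multi-value separator. For example, `parse_vector("1;2", separator=";")` handles a vector field that uses a semicolon instead of the default comma.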
4. Index schema
Configure the index schema, and then click Next.
Configuration description:
Included Fields: The primary key field and vector field must be filled in. The namespace field is optional and can be left empty.
You can configure Vector Dimension, Real-time Indexing, Distance Type, and Vector Index Algorithm based on your business requirements.
More Advanced Configuration: You can use the default parameters directly or adjust them according to your business requirements. For more information, see General vector index configuration. After you complete the settings on the Index Schema page, click Next to go to the Confirm Creation page.
5. Confirm creation
Click Confirm Creation. After the creation is complete, wait for 2 minutes, and then return to the Instance List page. When the instance status is "Normal", you can run searches and tests by clicking Query Test in the Actions column.