This topic describes how to use a free open source model to generate vectors from text data stored in Tablestore.
Overview
ModelScope is a next-generation open source model-as-a-service sharing platform that provides AI developers with flexible, easy-to-use, one-stop model services at low costs to simplify the application of models. ModelScope brings together industry-leading pre-trained models so that developers can avoid redundant research and development and reduce costs. This way, ModelScope provides a greener, open source AI development environment and model services.
To generate vectors from text data stored in Tablestore by using a free open source model, perform the following steps:
Install Tablestore SDK for Python and ModelScope dependencies: Before you use an open source model to generate vectors and use Tablestore features, you must install Tablestore SDK for Python and ModelScope dependencies.
Select and download an open source model: ModelScope provides a large number of embedding models that you can use to convert text data to vectors. You can select and download a model from the model library.
Generate vectors and write the vectors to Tablestore: Use the downloaded open source model to generate vectors and write the vector data to a data table in Tablestore.
Verify results: Use the data read operations of Tablestore or the KNN vector query feature of search indexes to query vector data.
Additional information
Programming language: Python
Recommended Python version: Python V3.9 or later
Test environment: The examples in this topic are tested in CentOS 7 and macOS.
Usage notes
The number of dimensions, the data type, and the distance measurement algorithm that you specify for vectors in a Tablestore search index must match those of the vectors that the open source model generates from text data. For example, the damo/nlp_corom_sentence-embedding_chinese-tiny open source model generates 256-dimensional Float32 vectors for which the Euclidean distance algorithm is recommended. When you create a Tablestore search index for these vectors, you must specify 256 as the number of dimensions, Float32 as the data type, and the Euclidean distance algorithm as the distance measurement algorithm.
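Because a mismatch between the model's output dimension and the search index configuration causes write or query errors, it can help to validate a generated vector string before writing it. The following stdlib-only helper is illustrative (the function name and constant are not part of the Tablestore SDK):

```python
import json

# Must equal the dimension configured in the search index (256 for the
# damo/nlp_corom_sentence-embedding_chinese-tiny model).
INDEX_DIMENSION = 256

def check_vector_string(vector_string: str, expected_dimension: int = INDEX_DIMENSION) -> bool:
    """Return True if the string is a JSON array of numbers with the expected length."""
    values = json.loads(vector_string)
    return (
        isinstance(values, list)
        and len(values) == expected_dimension
        and all(isinstance(v, (int, float)) for v in values)
    )

# A 256-dimensional vector passes the check; a 768-dimensional one does not.
ok = check_vector_string(json.dumps([0.1] * 256))
mismatch = check_vector_string(json.dumps([0.1] * 768))
```

Running such a check before each write fails fast on the client instead of at query time.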
Prerequisites
An Alibaba Cloud account or a RAM user that has the permissions to manage Tablestore is created.
To use a RAM user to perform operations in this topic, you must use your Alibaba Cloud account to create a RAM user and then attach the AliyunOTSFullAccess policy to the RAM user. This way, the RAM user is granted the permissions to manage Tablestore. For more information, see Grant permissions to a RAM user.
An AccessKey pair is created for your Alibaba Cloud account or RAM user. For more information, see Create an AccessKey pair.
The name and endpoint of a Tablestore instance are obtained. For more information, see Query endpoints.
The AccessKey pair of your Alibaba Cloud account or RAM user and the name and endpoint of the Tablestore instance are configured in the environment variables.
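The sample code later in this topic reads the connection information from environment variables named end_point, access_key_id, access_key_secret, and instance_name. A minimal configuration sketch follows; all values are placeholders that you must replace with your own information:

```shell
# Placeholders: replace each value with your own information.
export end_point="https://your-instance.cn-hangzhou.ots.aliyuncs.com"
export access_key_id="your-access-key-id"
export access_key_secret="your-access-key-secret"
export instance_name="your-instance"
```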
1. Install Tablestore SDK for Python and ModelScope dependencies
Run the following commands to install Tablestore SDK for Python and ModelScope dependencies:
# Install Tablestore SDK for Python.
pip install tablestore
# Install ModelScope dependencies.
pip install "modelscope[framework]" -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html
pip install --use-pep517 "modelscope[nlp]" -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html
pip install torch torchvision torchaudio
2. Select and download an open source model
2.1 Select an open source model
ModelScope provides a large number of embedding models that you can use to convert text data to vectors. You can select a model from the model library.
The following table describes the models that are frequently used. Select a model based on your business requirements.
If the vectors that are generated by using an open source model of ModelScope are not normalized, use the Euclidean distance algorithm as the distance measurement algorithm for the vectors in Tablestore.
| Model ID | Applicable field | Vector dimensions | Recommended distance measurement algorithm |
| --- | --- | --- | --- |
| damo/nlp_corom_sentence-embedding_chinese-base | Chinese, general (base) | 768 | Euclidean distance |
| damo/nlp_corom_sentence-embedding_english-base | English, general (base) | 768 | Euclidean distance |
| damo/nlp_corom_sentence-embedding_chinese-base-ecom | Chinese, e-commerce (base) | 768 | Euclidean distance |
| damo/nlp_corom_sentence-embedding_chinese-base-medical | Chinese, healthcare (base) | 768 | Euclidean distance |
| damo/nlp_corom_sentence-embedding_chinese-tiny | Chinese, general (tiny) | 256 | Euclidean distance |
| damo/nlp_corom_sentence-embedding_english-tiny | English, general (tiny) | 256 | Euclidean distance |
| damo/nlp_corom_sentence-embedding_chinese-tiny-ecom | Chinese, e-commerce (tiny) | 256 | Euclidean distance |
| damo/nlp_corom_sentence-embedding_chinese-tiny-medical | Chinese, healthcare (tiny) | 256 | Euclidean distance |
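Because the search index configuration must match the selected model, it can be convenient to derive the dimension from the model ID instead of hard-coding it in two places. The following lookup table simply restates the dimensions from the table above; the dictionary name is illustrative:

```python
# Vector dimensions from the model table above, keyed by model ID.
MODEL_DIMENSIONS = {
    'damo/nlp_corom_sentence-embedding_chinese-base': 768,
    'damo/nlp_corom_sentence-embedding_english-base': 768,
    'damo/nlp_corom_sentence-embedding_chinese-base-ecom': 768,
    'damo/nlp_corom_sentence-embedding_chinese-base-medical': 768,
    'damo/nlp_corom_sentence-embedding_chinese-tiny': 256,
    'damo/nlp_corom_sentence-embedding_english-tiny': 256,
    'damo/nlp_corom_sentence-embedding_chinese-tiny-ecom': 256,
    'damo/nlp_corom_sentence-embedding_chinese-tiny-medical': 256,
}

model_id = 'damo/nlp_corom_sentence-embedding_chinese-tiny'
# Use this value as the dimension when you create the search index.
dimension = MODEL_DIMENSIONS[model_id]
```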
2.2 Download the open source model
After you determine the model that you want to use, run the modelscope download --model {ModelID} command in the command line to download the model. In the command, replace {ModelID} with the ID of the model that you want to use. In this example, the damo/nlp_corom_sentence-embedding_chinese-tiny model is downloaded. Sample command:
modelscope download --model damo/nlp_corom_sentence-embedding_chinese-tiny
3. Generate vectors and write the vectors to Tablestore
You can use an open source model to convert data on the client before it is written to Tablestore, or to convert data that already exists in Tablestore. Then, write the generated vectors to Tablestore.
The following sample code provides an example on how to use Tablestore SDK for Python to create a data table and search index in Tablestore, use an open source model to generate vectors that have 256 dimensions, and write the vectors to Tablestore:
import json
import os
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
from tablestore import OTSClient, TableMeta, TableOptions, ReservedThroughput, CapacityUnit, FieldSchema, FieldType, VectorDataType, VectorOptions, VectorMetricType, \
SearchIndexMeta, AnalyzerType, Row, INF_MIN, INF_MAX, Direction, OTSClientError, OTSServiceError, Condition, RowExistenceExpectation
# Select an appropriate model and enter the model ID.
pipeline_se = pipeline(Tasks.sentence_embedding, model='damo/nlp_corom_sentence-embedding_chinese-tiny')

def text_to_vector_string(text: str) -> str:
    inputs = {'source_sentence': [text]}
    result = pipeline_se(input=inputs)
    # Convert the returned result to the format supported by Tablestore: a JSON array of float32 numbers. Example: [1, 5.1, 4.7, 0.08].
    return json.dumps(result["text_embedding"].tolist()[0])
def create_table():
    table_meta = TableMeta(table_name, [('PK_1', 'STRING')])
    table_options = TableOptions()
    reserved_throughput = ReservedThroughput(CapacityUnit(0, 0))
    tablestore_client.create_table(table_meta, table_options, reserved_throughput)
def create_search_index():
    index_meta = SearchIndexMeta([
        # Support match queries on data of the Keyword type.
        FieldSchema('field_string', FieldType.KEYWORD, index=True, enable_sort_and_agg=True),
        # Support range queries on data of the Long type.
        FieldSchema('field_long', FieldType.LONG, index=True, enable_sort_and_agg=True),
        # Support full-text search on this field.
        FieldSchema('field_text', FieldType.TEXT, index=True, analyzer=AnalyzerType.MAXWORD),
        # Support KNN vector queries on this field. In this example, the Euclidean distance algorithm is used as the distance measurement algorithm and the vectors have 256 dimensions.
        FieldSchema("field_vector", FieldType.VECTOR,
                    vector_options=VectorOptions(
                        data_type=VectorDataType.VD_FLOAT_32,
                        dimension=256,
                        metric_type=VectorMetricType.VM_EUCLIDEAN
                    )),
    ])
    tablestore_client.create_search_index(table_name, index_name, index_meta)
def write_data_to_table():
    for i in range(100):
        pk = [('PK_1', str(i))]
        text = "a string that can be used in full-text search. Use an embedding model to convert the value of this field to a vector and write the vector to the field_vector field for KNN vector query."
        vector = text_to_vector_string(text)
        columns = [
            ('field_string', 'str-%d' % (i % 5)),
            ('field_long', i),
            ('field_text', text),
            ('field_vector', vector),
        ]
        tablestore_client.put_row(table_name, Row(pk, columns))
def get_range_and_update_vector():
    # Specify the start primary key for the range query. INF_MIN is a special flag that specifies the minimum value.
    inclusive_start_primary_key = [('PK_1', INF_MIN)]
    # Specify the end primary key for the range query. INF_MAX is a special flag that specifies the maximum value.
    exclusive_end_primary_key = [('PK_1', INF_MAX)]
    total = 0
    try:
        while True:
            consumed, next_start_primary_key, row_list, next_token = tablestore_client.get_range(
                table_name,
                Direction.FORWARD,
                inclusive_start_primary_key,
                exclusive_end_primary_key,
                ["field_text"],  # Specify the fields that you want to return.
                5000,
                max_version=1,
            )
            for row in row_list:
                total += 1
                # Obtain the value of the field_text field that is read.
                text_field_content = row.attribute_columns[0][1]
                # Regenerate the vector based on the value of the field_text field.
                vector = text_to_vector_string(text_field_content)
                update_of_attribute_columns = {
                    'PUT': [('field_vector', vector)],
                }
                update_row = Row(row.primary_key, update_of_attribute_columns)
                condition = Condition(RowExistenceExpectation.IGNORE)
                # Update the row of data.
                tablestore_client.update_row(table_name, update_row, condition)
            if next_start_primary_key is not None:
                inclusive_start_primary_key = next_start_primary_key
            else:
                break
    # In most cases, client exceptions are caused by parameter errors or network exceptions.
    except OTSClientError as e:
        print('get row failed, http_status:%d, error_message:%s' % (e.get_http_status(), e.get_error_message()))
    # In most cases, server exceptions are caused by parameter or throttling errors.
    except OTSServiceError as e:
        print('get row failed, http_status:%d, error_code:%s, error_message:%s, request_id:%s' % (e.get_http_status(), e.get_error_code(), e.get_error_message(), e.get_request_id()))
    print("Processed data in total:", total)
if __name__ == '__main__':
    # Initialize a Tablestore client.
    end_point = os.environ.get('end_point')
    access_id = os.environ.get('access_key_id')
    access_key_secret = os.environ.get('access_key_secret')
    instance_name = os.environ.get('instance_name')
    tablestore_client = OTSClient(end_point, access_id, access_key_secret, instance_name)
    table_name = "python_demo_table_name"
    index_name = "python_demo_index_name"
    # Create a data table.
    create_table()
    # Create a search index.
    create_search_index()
    # Method 1: Convert data from the client that has not been written to Tablestore to vectors and write the vectors to Tablestore.
    write_data_to_table()
    # Method 2: Convert the existing data in Tablestore to vectors and write the vectors to Tablestore.
    get_range_and_update_vector()
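The text_to_vector_string function in the sample code serializes the model output as a JSON array string. The following stdlib-only round trip illustrates that format without requiring the model; the embedding values are made up:

```python
import json

# Stand-in for a model output. In the real script, this list comes from
# pipeline_se(input=inputs)["text_embedding"].tolist()[0].
embedding = [0.12, -0.98, 0.5, 0.08]

# Serialize to the string format that Tablestore supports for vector fields:
# a JSON array of numbers.
vector_string = json.dumps(embedding)

# The string parses back into the same list of floats.
restored = json.loads(vector_string)
```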
4. Verify results
View the vectors that are written to Tablestore in the Tablestore console. You can call the GetRow, BatchGetRow, or GetRange operation or use the KNN vector query feature of search indexes to query vector data.
Use data read operations to query vector data
After vector data is written to a data table in Tablestore, you can use data read operations to read the data. For more information, see Read data.
Use the KNN vector query feature to query vector data
If you configure vector fields when you create a search index, you can use the KNN vector query feature to query vector data. For more information, see KNN vector query.
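A KNN vector query with the Euclidean metric returns the rows whose vectors have the smallest L2 distance to the query vector. The following stdlib-only sketch illustrates that ranking; the stored vector strings are hypothetical and stand in for field_vector values:

```python
import json
import math

def euclidean_distance(a, b):
    # L2 distance, the metric configured as VM_EUCLIDEAN in the search index.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical stored vector strings, as they would be written to field_vector.
stored = {
    'row-0': json.dumps([0.0, 0.0, 1.0]),
    'row-1': json.dumps([0.9, 0.1, 0.0]),
    'row-2': json.dumps([1.0, 0.0, 0.0]),
}
query = [1.0, 0.0, 0.0]

# Rank rows by ascending distance: the nearest neighbors come first.
ranked = sorted(stored, key=lambda pk: euclidean_distance(query, json.loads(stored[pk])))
```

Tablestore performs this ranking on the server side over the indexed vectors; this sketch only shows what the Euclidean metric measures.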
Billing
When you use Tablestore, the data in data tables and search indexes occupies storage space. When you read or write data in data tables or use the KNN vector query feature of search indexes to query vector data, computing resources are consumed. You are charged for the consumed computing resources based on the read throughput and write throughput.