Create a search index - Tablestore - Alibaba Cloud Documentation Center

You can create one or more search indexes for a table in the Tablestore console or by using the Tablestore CLI or Tablestore SDKs. Then, you can use the search indexes to query and analyze data in the table. This topic describes how to create a search index.

Optimal method to create search indexes

We recommend that you determine the number of search indexes that you want to create for a data table based on your query requirements.

If you have a data table that contains the id, name, age, city, and sex fields, you can create search indexes to query data by name, age, or city by using one of the following methods:

Method 1: Create a search index for each field
If you use this method, you must create the following search indexes: name_index, age_index, and city_index.
- To query students by city, use the city_index search index. To query students by age, use the age_index search index.
- However, if you want to query students who are younger than 12 years old and live in City A, this method does not work.
The implementation of this method is similar to that of the global secondary index feature. However, this method is not cost-effective. We recommend that you use Method 2 in this case.
Method 2 (recommended): Create one search index for multiple fields
In this method, a search index named student_index is created. This search index consists of the following fields: name, age, and city.
- To query students by city, query the city field in the student_index search index. To query students by age, query the age field in the student_index search index.
- To query students who are younger than 12 years old and live in City A, query the age and city fields in the student_index search index.
This method makes full use of the advantages of search indexes and is more cost-effective. We recommend that you use this method if you want to create a search index.

Limits

Timeliness of search index creation
After you create a search index, you must wait for a few seconds before you can use the search index. During this period, you can write data to the data table, but you cannot query the metadata of the index or query data by using the index.
When you create a search index, make sure that the types of fields in the search index match the types of fields in the data table for which the search index is created. For more information, see Mappings of basic data types.
Number of search indexes
For more information, see Search index limits.

Parameters

When you create a search index, you must configure the tableName, indexName, and indexSchema parameters. You must also configure the fieldSchemas, indexSetting, and indexSort parameters in the indexSchema parameter. The following table describes the parameters.

Parameter	Description
tableName	The name of the data table.
indexName	The name of the search index.
fieldSchemas	The list of field schemas. In each field schema, configure the following parameters: fieldName (required): specifies the name of the field in the search index. The value is used as a column name. Type: String. A field in a search index can be a primary key column or an attribute column of the data table for which the search index is created. fieldType (required): the type of the field. Specify the type in the FieldType.XXX format. For more information, see Mappings of basic data types. isAnArray (optional): specifies whether the value is an array. Type: Boolean. If you set this parameter to true, the field stores data as an array. Data written to the field must be a JSON array. Example: ["a","b","c"]. The values of fields of the Nested type are arrays. If you set the fieldType parameter to Nested, skip this parameter. analyzer (optional): the type of analyzer that you want to use. If you set the fieldType parameter to Text, you can configure this parameter. If you do not configure this parameter, the default analyzer type single-word tokenization is used. For more information, see Tokenization. isVirtualField (optional): specifies whether the field is a virtual column. Type: Boolean. Default value: false. The default value specifies that the field is not a virtual column. If you set this parameter to true, you can use a virtual column. For more information, see Virtual columns. sourceFieldName (optional): the name of the source field to which the virtual column is mapped in the data table. Type: String. If you set the isVirtualField parameter to true, you must configure this parameter. dateFormats (optional): the format of dates. Type: String. If you set the fieldType parameter to Date, you must configure this parameter. For more information, see Date data type. enableHighlighting (optional): specifies whether to enable the highlight feature. Type: Boolean. Default value: false. The default value specifies that the highlight feature is disabled. If you set this parameter to true, you can use the highlight feature. Only Text fields support the highlight feature. For more information, see Highlight. Important You can configure this parameter only by using Tablestore SDKs. enableSortAndAgg (optional): specifies whether to enable sorting and aggregation. Type: Boolean. Default value: true. The default value specifies that sorting and aggregation are enabled. Sorting can be enabled only for fields for which the enableSortAndAgg parameter is set to true. For more information, see Perform sorting and paging. Important You can configure this parameter only by using Tablestore SDKs. Sorting and aggregation are not supported for Text fields. If you want to perform sorting or aggregation on a field of the Text type, you can use the virtual column feature and set the type of the virtual column to which the Text field is mapped to Keyword. For more information, see Virtual columns. vectorOptions (optional): the properties of the Vector field. If you set the fieldType parameter to Vector, you must configure this parameter. You can use the following parameters to specify the properties of the Vector field: dataType: the type of vector data. Only float32 is supported. If you want to use other types of vector data, submit a ticket. dimension: the number of dimensions of the vector. For information about the limits on the number of dimensions of a vector, see Search index limits. metricType: the algorithm that you want to use to measure the distance between vectors. Valid values: euclidean, cosine, and dot_product. euclidean: the Euclidean distance algorithm that measures the shortest path between two vectors in a multi-dimensional space. The Euclidean distance algorithm in Tablestore does not perform the final square root calculation. This is done to improve performance. A greater value that is obtained by using the Euclidean distance algorithm indicates a higher similarity between two vectors. cosine: the cosine similarity algorithm that calculates the cosine of the angle between two vectors in a vector space. A greater value that is obtained by using the cosine similarity algorithm indicates a higher similarity between two vectors. In most cases, this algorithm is used to calculate the similarity between text data. dot_product: the dot product algorithm that multiplies the corresponding coordinates of two vectors of the same dimension and adds the products. A greater value that is obtained by using the dot product algorithm indicates a higher similarity between two vectors. For more information, see Appendix: distance measurement algorithms for vectors.
indexSetting	The settings of the search index, including the settings of the routingFields parameter. routingFields (optional): the custom routing fields. You can specify specific primary key columns as routing fields. Tablestore distributes data that is written to a search index across different partitions based on the specified routing fields. Data with the same routing field values is distributed to the same partition.
indexSort	The presorting settings of the search index, including the settings of the sorters parameter. If you do not configure the indexSort parameter, field values are sorted by primary key. Note If you set the fieldType parameter to Nested, you cannot configure the indexSort parameter. sorters (optional): the presorting method for the search index. Valid values: PrimaryKeySort and FieldSort. For more information, see Perform sorting and paging.
timeToLive	Optional. The retention period of data in the search index. Unit: seconds. The default value is -1, which specifies that the data never expires. The time to live (TTL) must be at least 86,400 seconds (one day) or -1. A value of -1 specifies that the data never expires. If the retention period of data exceeds the value of the timeToLive parameter, the data expires. Tablestore automatically deletes the expired data. For more information, see Configure the TTL of a search index.

Usage notes

To perform full-text search, you can set the type of the field in a search index to Text and specify the type of the analyzer for the Text field. Then, you can use match query or match phrase query to query data. For information about tokenization, see Tokenization.
You can map fields in data tables to virtual columns in search indexes to query new fields and the data of new field types without the need to modify the storage schema and the data in the tables. For more information, see Virtual columns.
You can map String or Integer fields in data tables to Date fields in search indexes. This way, you can query data of the Date type. For more information, see Date data type.
If you want to perform an approximate nearest neighbor search based on vectors, you can set the type of the field to Vector in a search index and use the k-nearest neighbor (KNN) vector query feature to query data. For more information, see Overview.
If you want to store data as arrays in a search index, you can store the data in String fields in the data table and set the type of the mapped fields in the search index to Array.
When you write data to the data table, the values of the fields to which the Array fields in the search index are mapped must be JSON arrays. Example: ["a","b","c"].
Note
Search indexes support Array fields. However, data tables do not support Array fields. For more information, see Array and Nested data types.
If you want to use a field to store data that has a hierarchical relationship or one-to-many relationship, you can set the type of the field in the search index to Nested, and use nested query to query data.
When you write data to the data table, the values of the fields to which the Nested fields in the search index are mapped must be JSON arrays. Example: [{"tagName":"tag1", "score":0.8}, {"tagName":"tag2", "score":0.2}].
If the data that you want to store in the data table is geographical location information, you can set the type of the field in the search index to Geo-point, and use geo-distance query, geo-bounding box query, or geo-polygon query to query data.
By default, the system uses the partition key as the routing key. You can specify custom routing fields to use other primary key columns as the routing key. This reduces the query latency because the partition range scanned by Tablestore is narrowed. For more information, see How do I use routing fields?
By default, if you use a search index to query data, the query results are returned and sorted in ascending order of the primary key. You can specify a presorting method to sort the query results based on the values of specific fields or in descending order of the primary key. For more information, see Index presorting.
You can use the TTL feature to automatically delete historical data in a search index. This reduces storage usage and storage costs. For more information, see Specify the TTL of a search index.
If you want to highlight the matched query strings in the query results, you can enable the highlight feature. For more information, see Highlight.

Methods

You can create a search index in the Tablestore console or by using the Tablestore CLI or Tablestore SDKs.

Before you create a search index, make sure that the following preparations are complete:

An Alibaba Cloud account or a RAM user that has Tablestore operation permissions is created. For information about how to grant Tablestore operation permissions to a RAM user, see Use a RAM policy to grant permissions to a RAM user.
If you want to use Tablestore SDKs or the Tablestore CLI to create a search index, an AccessKey pair is created for your Alibaba Cloud account or RAM user. For more information, see Create an AccessKey pair.
A data table is created. For more information, see Operations on a data table.
If you want to use Tablestore SDKs to create a search index, an OTSClient instance is initialized. For more information, see Initialize an OTSClient instance.
If you want to use the Tablestore CLI to create a search index, the Tablestore CLI is downloaded and started as expected. In addition, the instance access information is configured. For more information, see Download the Tablestore CLI and Start the Tablestore CLI and configure access information.

Use the Tablestore console

You can create a search index in the Tablestore console.

Go to the Indexes tab.
1. Log on to the Tablestore console.
2. In the top navigation bar, select a resource group and a region.
3. On the Overview page, click the name of the instance that you want to manage or click Manage Instance in the Actions column of the instance.
4. On the Tables tab of the Instance Details tab, click the name of the data table for which you want to create a search index or click Indexes in the Actions column of the data table.
On the Indexes tab, click Create Search Index.

In the Create Index dialog box, configure the parameters of the search index.

Retain the default index name that is generated by the system, or enter an index name based on your business requirements.
Select the method that you want to use to generate a schema for the search index.
Important
The values of the Field Name and Field Type parameters must match the values of these parameters in the data table. For information about the mappings between field types in data tables and field types in search indexes, see Mappings of basic data types.
- If you set the Schema Generation Type parameter to Manual, specify the names and types for the fields. Specify whether to turn on Array for each field based on your business requirements.
- If you set the Schema Generation Type parameter to Auto Generate, the system automatically uses the primary key columns and attribute columns of the data table as index fields. You can select field types and specify whether to turn on Array based on your business requirements.
Note
To optimize indexing performance in specific cases, use virtual columns. For more information, see Virtual columns.

If you want to configure advanced settings, such as the Routing Key, Time to Live, and Pre-sorting parameters, turn on Advanced Settings. The following table describes the parameters.

Parameter	Description
Routing Key	The custom routing fields. You can select one or more primary key columns as the routing fields. Tablestore calculates the distribution of the index data based on the values of the routing fields. The records with the same values of the routing fields are distributed to the same data partition.
Time to Live	The retention period of data in the search index. Unit: seconds. The default value is -1, which specifies that the data never expires. The TTL must be at least 86,400 seconds (one day) or -1. A value of -1 specifies that the data never expires. The TTL of a search index that is created for a data table must be less than or equal to the TTL of the data table. If you want the system to automatically clear the historical data in the search index, specify a value that is greater than or equal to 86400. If the retention period exceeds the value of this parameter, the data expires and Tablestore automatically deletes the expired data.
Pre-sorting	The default order in which the data is returned. Valid values: Default and Custom. A value of Default specifies that the data is sorted based on the primary key. A value of Custom specifies that the data is sorted based on the field that you specify. Configure the Pre-sorting parameter based on your business requirements. Important Search indexes that contain Nested fields do not support index presorting.

Click OK.
After you create the search index, click Index Details in the Actions column of the search index on the Indexes tab. On the Index Details dialog box, you can view the parameter settings in the Basic Index Information, Index Meters, Routing Key, Index Fields, and Pre-sorting sections.

Use the Tablestore CLI

You can run the create_search_index command by using the Tablestore CLI to create a search index. For more information, see Search index.

Run the create_search_index command to create a search index named search_index.
```
create_search_index -n search_index
```

Enter the index schema as prompted:

The index schema includes the settings of the search index (IndexSetting), the list of field schemas (FieldSchemas), and the presorting settings for the search index (IndexSort). For more information about the schema of a search index, see Create a search index.

 {

    "IndexSetting": {
        "RoutingFields": null
    },
    "FieldSchemas": [
        {
            "FieldName": "gid",
            "FieldType": "LONG",
            "Index": true,
            "EnableSortAndAgg": true,
            "Store": true,
            "IsArray": false,
            "IsVirtualField": false
        },
        {
            "FieldName": "uid",
            "FieldType": "LONG",
            "Index": true,
            "EnableSortAndAgg": true,
            "Store": true,
            "IsArray": false,
            "IsVirtualField": false
        },
        {
            "FieldName": "col2",
            "FieldType": "LONG",
            "Index": true,
            "EnableSortAndAgg": true,
            "Store": true,
            "IsArray": false,
            "IsVirtualField": false
        },
        {
            "FieldName": "col3",
            "FieldType": "TEXT",
            "Index": true,
            "Analyzer": "single_word",
            "AnalyzerParameter": {
                "CaseSensitive": true,
                "DelimitWord": null
            },
            "EnableSortAndAgg": false,
            "Store": true,
            "IsArray": false,
            "IsVirtualField": false
        },
        {
            "FieldName": "col1",
            "FieldType": "KEYWORD",
            "Index": true,
            "EnableSortAndAgg": true,
            "Store": true,
            "IsArray": false,
            "IsVirtualField": false
        },
        {
            "FieldName": "col3V",
            "FieldType": "LONG",
            "Index": true,
            "EnableSortAndAgg": true,
            "Store": true,
            "IsArray": false,
            "IsVirtualField": true,
            "SourceFieldNames": [
                "col3"
            ]
        }
    ]
}

Use Tablestore SDKs

You can create a search index by using the following Tablestore SDKs: Tablestore SDK for Java, Tablestore SDK for Go, Tablestore SDK for Python, Tablestore SDK for Node.js, Tablestore SDK for .NET, and Tablestore SDK for PHP. In this example, Tablestore SDK for Java is used to create a search index.

Create a search index by using the default configurations

The following sample code provides an example on how to create a search index by using the default configurations. In this example, the search index consists of the following fields: the Col_Keyword field of the Keyword type, the Col_Long field of the Long type, and the Col_Vector field of the Vector type. The data in the search index is presorted based on the primary key of the data table and never expires.

private static void createSearchIndex(SyncClient client) {
    CreateSearchIndexRequest request = new CreateSearchIndexRequest();
    // Specify the name of the data table. 
    request.setTableName("<TABLE_NAME>"); 
    // Specify the name of the search index. 
    request.setIndexName("<SEARCH_INDEX_NAME>"); 
    IndexSchema indexSchema = new IndexSchema();
    indexSchema.setFieldSchemas(Arrays.asList(
            // Specify the names and types of the fields. 
            new FieldSchema("Col_Keyword", FieldType.KEYWORD), 
            new FieldSchema("Col_Long", FieldType.LONG)));
            // Specify the vector type. 
            new FieldSchema("Col_Vector", FieldType.VECTOR).setIndex(true)
                    // Set the vector dimension to 4 and the distance measurement algorithm for vectors to the dot product algorithm. 
                    .setVectorOptions(new VectorOptions(VectorDataType.FLOAT_32, 4, VectorMetricType.DOT_PRODUCT))
    ));
    request.setIndexSchema(indexSchema);
    // Call the client to create the search index. 
    client.createSearchIndex(request); 
}

Create a search index with the indexSort parameter specified

The following sample code provides an example on how to create a search index with the indexSort parameter specified. In this example, the search index consists of the following fields: the Col_Keyword field of the KEYWORD type, the Col_Long field of the Long type, the Col_Text field of the Text type, and the Timestamp field of the Long type. The data in the search index is presorted based on the Timestamp field.

private static void createSearchIndexWithIndexSort(SyncClient client) {
    CreateSearchIndexRequest request = new CreateSearchIndexRequest();
    // Specify the name of the data table. 
    request.setTableName("<TABLE_NAME>"); 
    // Specify the name of the search index. 
    request.setIndexName("<SEARCH_INDEX_NAME>"); 
    IndexSchema indexSchema = new IndexSchema();
    indexSchema.setFieldSchemas(Arrays.asList(
            new FieldSchema("Col_Keyword", FieldType.KEYWORD),
            new FieldSchema("Col_Long", FieldType.LONG),
            new FieldSchema("Col_Text", FieldType.TEXT),
            new FieldSchema("Timestamp", FieldType.LONG)
                    .setEnableSortAndAgg(true)));
    // Presort data based on the Timestamp field. 
    indexSchema.setIndexSort(new Sort(
            Arrays.<Sort.Sorter>asList(new FieldSort("Timestamp", SortOrder.ASC))));
    request.setIndexSchema(indexSchema);
    // Call the client to create the search index. 
    client.createSearchIndex(request);
}

Create a search index with the TTL specified

Important

Make sure that update operations on the data table are prohibited.

The following sample code provides an example on how to create a search index with the TTL specified. In this example, the search index consists of the following fields: the Col_Keyword field of the KEYWORD type and the Col_Long field of the Long type. The TTL of the search index is seven days.

// Use Tablestore SDK for Java V5.12.0 or later to create a search index. 
public static void createIndexWithTTL(SyncClient client) {
    int days = 7;
    CreateSearchIndexRequest createRequest = new CreateSearchIndexRequest();
    // Specify the name of the data table. 
    createRequest.setTableName("<TABLE_NAME>");
    // Specify the name of the search index. 
    createRequest.setIndexName("<SEARCH_INDEX_NAME>");
    IndexSchema indexSchema = new IndexSchema();
    indexSchema.setFieldSchemas(Arrays.asList(
            // Specify the names and types of the fields. 
            new FieldSchema("Col_Keyword", FieldType.KEYWORD), 
            new FieldSchema("Col_Long", FieldType.LONG)));
    createRequest.setIndexSchema(indexSchema);
    // Specify the TTL of the search index. 
    createRequest.setTimeToLiveInDays(days);
    // Create the search index. 
    client.createSearchIndex(createRequest);
}

Create a search index with virtual columns specified

The following sample code provides an example on how to create a search index with virtual columns specified. In this example, the search index consists of the following fields: the Col_Keyword field of the KEYWORD type and the Col_Long field of the LONG type. In addition, the following virtual columns are created: Col_Keyword_Virtual_Long of the LONG type and Col_Long_Virtual_Keyword of the KEYWORD type. The Col_Keyword_Virtual_Long column is mapped to the Col_Keyword field. The Col_Long_Virtual_Keyword column is mapped to the Col_Long field.

private static void createSearchIndex(SyncClient client) {
    CreateSearchIndexRequest request = new CreateSearchIndexRequest();
    // Specify the name of the data table. 
    request.setTableName("<TABLE_NAME>"); 
    // Specify the name of the search index. 
    request.setIndexName("<SEARCH_INDEX_NAME>"); 
    IndexSchema indexSchema = new IndexSchema();
    indexSchema.setFieldSchemas(Arrays.asList(
        // Specify the names and types of the fields. 
        new FieldSchema("Col_Keyword", FieldType.KEYWORD), 
        // Specify the names and types of the fields. 
        new FieldSchema("Col_Keyword_Virtual_Long", FieldType.LONG) 
             // Specify whether the field is a virtual column. 
            .setVirtualField(true) 
             // Specify the name of the source field to which the virtual column is mapped. 
            .setSourceFieldName("Col_Keyword"), 
        new FieldSchema("Col_Long", FieldType.LONG),
        new FieldSchema("Col_Long_Virtual_Keyword", FieldType.KEYWORD)
            .setVirtualField(true)
            .setSourceFieldName("Col_Long")));
    request.setIndexSchema(indexSchema);
    // Call the client to create the search index. 
    client.createSearchIndex(request); 
}

Create a search index with the highlight feature enabled

The following sample code provides an example on how to create a search index with the highlight feature enabled. In this example, the search index consists of the following fields: the Col_Keyword field of the KEYWORD type, the Col_Long field of the LONG type, and the Col_Text field of the TEXT type. In addition, the highlight feature is enabled for the Col_Text field.

private static void createSearchIndexwithHighlighting(SyncClient client) {
    CreateSearchIndexRequest request = new CreateSearchIndexRequest();
    // Specify the name of the data table. 
    request.setTableName("<TABLE_NAME>"); 
    // Specify the name of the search index. 
    request.setIndexName("<SEARCH_INDEX_NAME>"); 
    IndexSchema indexSchema = new IndexSchema();
    indexSchema.setFieldSchemas(Arrays.asList(
            // Specify the names and types of the fields. 
            new FieldSchema("Col_Keyword", FieldType.KEYWORD), 
            new FieldSchema("Col_Long", FieldType.LONG),
            // Enable the highlight feature for the Col_Text field. 
            new FieldSchema("Col_Text", FieldType.TEXT).setIndex(true).setEnableHighlighting(true)
    ));
    request.setIndexSchema(indexSchema);
    // Call the client to create the search index. 
    client.createSearchIndex(request); 
}

What to do next

After you create a search index, you can use the search index to query, analyze, and export data.

Operation	Available feature
Query data	Select a query method based on your business requirements. Basic queries: match all query, term query, terms query, fuzzy query (wildcard query, prefix query, suffix query, tokenization-based wildcard query), range query, exists query, geo query (suitable for Geo-point fields), nested query (suitable for Nested fields), and collapse (distinct) Queries based on multiple conditions: Boolean query Full-text search (suitable for Text fields): tokenization, highlight, match query, and match phrase query KNN vector query (suitable for Vector fields): Overview When you call the Search operation to query data, you can sort or paginate the rows that meet the query conditions. For more information, see Perform sorting and paging.
Analyze data	Aggregation
Export data	Parallel scan

References

After you create a search index, you can perform operations on the search index based on your business requirements.

You can specify the TTL for a search index to delete historical data in the search index or extend the retention period of data in the search index. For more information, see Configure the TTL of a search index.
You can dynamically modify the schema of a search index to add, update, or remove index columns in the search index. For more information, see Dynamically modify the schema of a search index.
You can call the ListSearchIndex operation to query all search indexes that are created for a data table. For more information, see Query search indexes.
You can call the DescribeSearchIndex operation to query the description of a search index, such as the field information and search index configurations. For more information, see Query the description of a search index.
You can delete a search index that you no longer require. For more information, see Delete a search index.
You can also use the SQL query feature or compute engines, such as MaxCompute, Spark, Hive, HadoopMR, Function Compute, Flink, and PrestoDB, to compute and analyze data in tables. For more information, see SQL query and Overview.

Appendix: distance measurement algorithms for vectors

The following table describes the algorithms that you can use to measure the distance between vectors. A greater value that is obtained by using an algorithm indicates a higher similarity.

Metric type	Formula	Performance	Description
Euclidean distance (euclidean)		Relatively high	Measures the shortest path between two vectors in a multi-dimensional space. The Euclidean distance algorithm in Tablestore does not perform the final square root calculation. This is done to improve performance. A greater value that is obtained by using the Euclidean distance algorithm indicates a higher similarity between two vectors.
Dot product (dot_product)		Highest	Multiplies the corresponding coordinates of two vectors of the same dimension and adds the products. A greater value that is obtained by using the dot product algorithm indicates a higher similarity between two vectors. Float32 vectors must be normalized before they are written to tables. For example, you can use the L2 norm to normalize Float32 vectors. If you do not normalize Float32 vectors before they are written to tables, you may encounter issues such as inaccurate query results, slow construction of vector indexes, and poor query performance.
Cosine similarity (cosine)		Relatively low	Calculates the cosine of the angle between two vectors in a vector space. A greater value that is obtained by using the cosine similarity algorithm indicates a higher similarity between two vectors. In most cases, this algorithm is used to calculate the similarity between text data. If 0 is used as a divisor, the cosine similarity cannot be calculated because it does not allow 0 to be used as a divisor. Therefore, the sum of squares of Float32 vectors cannot be 0. The process to calculate cosine similarity is complex. We recommend that you normalize vectors before you write data to tables and use the dot product algorithm to measure the distance between vectors.

The following sample code provides an example on how to normalize vectors:

  public static float[] l2normalize(float[] v, boolean throwOnZero) {
    double squareSum = 0.0f;
    int dim = v.length;
    for (float x : v) {
      squareSum += x * x;
    }
    if (squareSum == 0) {
      if (throwOnZero) {
        throw new IllegalArgumentException("can't normalize a zero-length vector");
      } else {
        return v;
      }
    }
    double length = Math.sqrt(squareSum);
    for (int i = 0; i < dim; i++) {
      v[i] /= length;
    }
    return v;
  }