Tokenization-based wildcard query - Tablestore - Alibaba Cloud Documentation Center

If you perform a wildcard query to search for the *word* string, you can use a tokenization-based wildcard query (fuzzy tokenization in combination with match phrase query) to ensure better query performance.

Background information

Fuzzy query is a common requirement in databases. For example, you can perform a fuzzy query to query file names and mobile numbers. To perform fuzzy queries in Tablestore, you can use the wildcard query feature of search indexes. The wildcard query feature is similar to the LIKE operator in MySQL. However, the wildcard query feature supports only up to 32 characters in the string that is used for a wildcard query, and the query performance declines as the data size increases.

To resolve these issues, search indexes support tokenization-based wildcard queries to ensure high performance in fuzzy queries. When you use tokenization-based wildcard queries, Tablestore does not limit the length of the string that is used for a query. However, if the column value exceeds 1,024 characters in length, the system truncates the column value and performs tokenization only for the first 1,024 characters.

Scenarios

You can select a method that fits your scenario to perform a fuzzy query.

If you use *word* for a wildcard query, you can use tokenization-based wildcard queries to perform a fuzzy query. For example, if you use "123" to query mobile numbers that contain 123 at any position, you can use tokenization-based wildcard queries to perform a fuzzy query.
In this case, a tokenization-based wildcard query improves the query performance by more than 10 times than a wildcard query.
For example, a data table contains a column named file_name, the column type is Text, and the tokenization method is fuzzy tokenization (FuzzyAnalyzer) for the column in the search index. If you use the search index to query the rows in which the file_name column value is 2021 woRK@Hangzhou, you must perform a match phrase query (MatchPhraseQuery) and set the tokens to consecutive substrings for the query.
- If the token for the query is 2021, 20, 21, work, WORK, @, Hang, zhou, Hangzhou, or @Hangzhou, the rows in which the file_name column value is 2021 woRK@Hangzhou can match the token.
- If the token for the query is 21work, 2021Hangzhou, 2120, or #Hangzhou, the rows in which the file_name column value is 2021 woRK@Hangzhou cannot match the token.
For other complex queries, you can use wildcard queries for fuzzy queries. For more information, see Wildcard query.

Use fuzzy tokenization for a fuzzy query

To use tokenization-based wildcard queries for a fuzzy query, perform the following steps:

Create a search index. When you create a search index, set the field type to Text and the tokenization method to fuzzy tokenization (FuzzyAnalyzer) for the specified column, and retain the default settings for other parameters. For more information, see Create a search index.
Note
If a search index exists, you can add a virtual column for the specified column by dynamically modifying the schema of the search index. Then, set the field type to Text and the tokenization method to fuzzy tokenization for the virtual column. For more information, see Dynamically modify the schema of a search index and Virtual columns.
Use the search index to query data. When you use the search index to query data, perform a match phrase query. For more information, see Match phrase query.

Appendix: test cases

The following test cases show how to use tokenization-based wildcard queries to perform a fuzzy query:

import com.alicloud.openservices.tablestore.SyncClient;
import com.alicloud.openservices.tablestore.model.ColumnValue;
import com.alicloud.openservices.tablestore.model.PrimaryKey;
import com.alicloud.openservices.tablestore.model.PrimaryKeyBuilder;
import com.alicloud.openservices.tablestore.model.PrimaryKeyValue;
import com.alicloud.openservices.tablestore.model.PutRowRequest;
import com.alicloud.openservices.tablestore.model.RowPutChange;
import com.alicloud.openservices.tablestore.model.search.CreateSearchIndexRequest;
import com.alicloud.openservices.tablestore.model.search.CreateSearchIndexResponse;
import com.alicloud.openservices.tablestore.model.search.FieldSchema;
import com.alicloud.openservices.tablestore.model.search.FieldType;
import com.alicloud.openservices.tablestore.model.search.IndexSchema;
import com.alicloud.openservices.tablestore.model.search.SearchQuery;
import com.alicloud.openservices.tablestore.model.search.SearchRequest;
import com.alicloud.openservices.tablestore.model.search.SearchResponse;
import com.alicloud.openservices.tablestore.model.search.query.QueryBuilders;

import java.util.Arrays;
import java.util.Collections;

import static org.junit.Assert.assertEquals;


public class TestFuzzy {
    private static final String tableName = "analysis_test";
    private static final String indexName = "analysis_test_index";


    public void testFuzzyMatchPhrase(SyncClient client) throws Exception {
        // Specify the schema of a search index. 
        IndexSchema indexSchema = new IndexSchema();
        indexSchema.setFieldSchemas(Collections.singletonList(
                // Note: If you specify Text as the type of the name field that is mapped to the name column of the Keyword type in a data table and set the tokenization method for the field, exceptions may occur in the query. 
                // If you want to retain both the Keyword and Text types, see the example provided in the "Virtual columns" topic. If you use *abc* to match the name field, only the name field of the Text type is required. The name field of the Keyword type is not required. 
                new FieldSchema("name", FieldType.TEXT).setAnalyzer(FieldSchema.Analyzer.Fuzzy)
        ));
        // Create the search index. 
        {
            CreateSearchIndexRequest request = new CreateSearchIndexRequest();
            request.setTableName(tableName);
            request.setIndexName(indexName);
            request.setIndexSchema(indexSchema);
            CreateSearchIndexResponse response = client.createSearchIndex(request);
        }

        // Write a row of data to the data table. 
        PrimaryKey primaryKey = PrimaryKeyBuilder.createPrimaryKeyBuilder()
                .addPrimaryKeyColumn("pk1", PrimaryKeyValue.fromString("1"))
                .addPrimaryKeyColumn("pk2", PrimaryKeyValue.fromLong(1))
                .addPrimaryKeyColumn("pk3", PrimaryKeyValue.fromBinary(new byte[]{1, 2, 3}))
                .build();
        RowPutChange rowPutChange = new RowPutChange(tableName, primaryKey);
        // Specify the attribute column of the row. 
        rowPutChange.addColumn("name", ColumnValue.fromString("TheLionKing1024x768P.mp4"));
        PutRowRequest request = new PutRowRequest(rowPutChange);
        client.putRow(request);

        // Wait until the row of data is synchronized to the search index. 
        Thread.sleep(1000 * 180);

        // Use *abc* for the query. 
        assertMatchPhraseQuery(client, tableName, indexName, "name", "The", 1);
        assertMatchPhraseQuery(client, tableName, indexName, "name", "TheLion", 1);
        assertMatchPhraseQuery(client, tableName, indexName, "name", "The Lion", 0);
        assertMatchPhraseQuery(client, tableName, indexName, "name", "TheLionKing102", 1);
        assertMatchPhraseQuery(client, tableName, indexName, "name", "TheLionKing1024", 1);
        assertMatchPhraseQuery(client, tableName, indexName, "name", "TheLionKing1024x", 1);
        assertMatchPhraseQuery(client, tableName, indexName, "name", "TheLionKing1024x7", 1);
        assertMatchPhraseQuery(client, tableName, indexName, "name", "TheLionKing1024x768P.mp4", 1);
        assertMatchPhraseQuery(client, tableName, indexName, "name", "24x768P.mp4", 1);
        assertMatchPhraseQuery(client, tableName, indexName, "name", "24x76 8P.mp4", 0);
        assertMatchPhraseQuery(client, tableName, indexName, "name", "24x7 P.mp4", 0);
    }

    // Use a virtual column. 
    public void testFuzzyMatchPhraseWithVirtualField(SyncClient client) throws Exception {
        // Specify the schema of a search index. 
        IndexSchema indexSchema = new IndexSchema();
        indexSchema.setFieldSchemas(Arrays.asList(
                // Set the type of the name field to Keyword, which facilitates equivalent queries. 
                new FieldSchema("name", FieldType.KEYWORD).setIndex(true).setStore(true),
                // Create a virtual column named name_virtual_text and set the field type to Text and the tokenization method to Fuzzy for the virtual column. The data source of the virtual column is the name field. 
                new FieldSchema("name_virtual_text", FieldType.TEXT).setIndex(true).setAnalyzer(FieldSchema.Analyzer.Fuzzy).setVirtualField(true).setSourceFieldName("name")
        ));
        // Create the search index. 
        {
            CreateSearchIndexRequest request = new CreateSearchIndexRequest();
            request.setTableName(tableName);
            request.setIndexName(indexName);
            request.setIndexSchema(indexSchema);
            CreateSearchIndexResponse response = client.createSearchIndex(request);
        }

        // Write a row of data to the data table. 
        PrimaryKey primaryKey = PrimaryKeyBuilder.createPrimaryKeyBuilder()
                .addPrimaryKeyColumn("pk1", PrimaryKeyValue.fromString("1"))
                .addPrimaryKeyColumn("pk2", PrimaryKeyValue.fromLong(1))
                .addPrimaryKeyColumn("pk3", PrimaryKeyValue.fromBinary(new byte[]{1, 2, 3}))
                .build();
        RowPutChange rowPutChange = new RowPutChange(tableName, primaryKey);
        // Specify the attribute column of the row. 
        rowPutChange.addColumn("name", ColumnValue.fromString("TheLionKing1024x768P.mp4"));
        PutRowRequest request = new PutRowRequest(rowPutChange);
        client.putRow(request);

        // Wait until the row of data is synchronized to the search index. 
        Thread.sleep(1000 * 180);

        // Use *abc* for the query. 
        // Note: The field for the query is name_virtual_text instead of name. 
        assertMatchPhraseQuery(client, tableName, indexName, "name_virtual_text", "The", 1);
        assertMatchPhraseQuery(client, tableName, indexName, "name_virtual_text", "TheLion", 1);
        assertMatchPhraseQuery(client, tableName, indexName, "name_virtual_text", "The Lion", 0);
        assertMatchPhraseQuery(client, tableName, indexName, "name_virtual_text", "TheLionKing102", 1);
        assertMatchPhraseQuery(client, tableName, indexName, "name_virtual_text", "TheLionKing1024", 1);
        assertMatchPhraseQuery(client, tableName, indexName, "name_virtual_text", "TheLionKing1024x", 1);
        assertMatchPhraseQuery(client, tableName, indexName, "name_virtual_text", "TheLionKing1024x7", 1);
        assertMatchPhraseQuery(client, tableName, indexName, "name_virtual_text", "TheLionKing1024x768P.mp4", 1);
        assertMatchPhraseQuery(client, tableName, indexName, "name_virtual_text", "24x768P.mp4", 1);
        assertMatchPhraseQuery(client, tableName, indexName, "name_virtual_text", "24x76 8P.mp4", 0);
        assertMatchPhraseQuery(client, tableName, indexName, "name_virtual_text", "24x7 P.mp4", 0);
    }

    // Perform a match phrase query. 
    public static void assertMatchPhraseQuery(SyncClient client, String tableName, String indexName, String fieldName, String searchContent, long exceptCount) {
        SearchRequest searchRequest = new SearchRequest();
        searchRequest.setTableName(tableName);
        searchRequest.setIndexName(indexName);
        SearchQuery searchQuery = new SearchQuery();
        // Perform a match phrase query to query data that matches the tokens. 
        searchQuery.setQuery(QueryBuilders.matchPhrase(fieldName, searchContent).build());
        searchQuery.setLimit(0);
        // Specify that the total number of matched rows is returned. If you are not concerned about the total number of matched rows, set this parameter to false for better performance. 
        searchQuery.setGetTotalCount(true);
        searchRequest.setSearchQuery(searchQuery);
        SearchResponse response = client.search(searchRequest);
        assertEquals(String.format("field:[%s], searchContent:[%s]", fieldName, searchContent), exceptCount, response.getTotalCount());
    }
}

FAQ

References

When you use a search index to query data, you can use the following query methods: term query, terms query, match all query, match query, match phrase query, prefix query, range query, wildcard query, fuzzy query, Boolean query, geo query, nested query, KNN vector query, and exists query. You can select query methods based on your business requirements to query data from multiple dimensions.
You can sort or paginate rows that meet the query conditions by using the sorting and paging features. For more information, see Perform sorting and paging.
You can use the collapse (distinct) feature to collapse the result set based on a specific column. This way, data of the specified type appears only once in the query results. For more information, see Collapse (distinct).
If you want to analyze data in a data table, you can use the aggregation feature of the Search operation or execute SQL statements. For example, you can obtain the minimum and maximum values, sum, and total number of rows. For more information, see Aggregation and SQL query.
If you want to obtain all rows that meet the query conditions without the need to sort the rows, you can call the ParallelScan and ComputeSplits operations to use the parallel scan feature. For more information, see Parallel scan.