After you specify a tokenization method for a TEXT field, Tablestore tokenizes the values of the field into multiple tokens based on that method. You cannot specify tokenization methods for non-TEXT fields. Sorting and aggregation are not supported for TEXT fields. If you want to sort or aggregate a field of the TEXT type, you can use the virtual column feature and map the TEXT field to a virtual column of the Keyword type.
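A minimal sketch of this workaround with the Tablestore SDK for Java, assuming the SDK's virtual column API (FieldSchema#setVirtualField and FieldSchema#setSourceFieldName); all field names are placeholders:

```java
import com.alicloud.openservices.tablestore.model.search.FieldSchema;
import com.alicloud.openservices.tablestore.model.search.FieldType;

// TEXT field: supports full-text search, but not sorting or aggregation.
FieldSchema textField = new FieldSchema("Col_Text", FieldType.TEXT);
textField.setIndex(true);

// Virtual Keyword column backed by the same source column, so the value
// can be sorted and aggregated. (Assumed API; see the Virtual columns topic.)
FieldSchema virtualField = new FieldSchema("Col_Text_Keyword", FieldType.KEYWORD);
virtualField.setIndex(true);
virtualField.setEnableSortAndAgg(true);
virtualField.setVirtualField(true);
virtualField.setSourceFieldName("Col_Text");
```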
Background information
In most cases, you can use match query (MatchQuery) and match phrase query (MatchPhraseQuery) to query a field of the TEXT type. You can also use term query (TermQuery), terms query (TermsQuery), prefix query (PrefixQuery), and wildcard query (WildcardQuery) to query a field of the TEXT type based on your business requirements.
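For example, a match query against a TEXT field with the Tablestore SDK for Java might look like the following sketch; the endpoint, credentials, table name, index name, field name, and keyword are placeholders:

```java
import com.alicloud.openservices.tablestore.SyncClient;
import com.alicloud.openservices.tablestore.model.search.SearchQuery;
import com.alicloud.openservices.tablestore.model.search.SearchRequest;
import com.alicloud.openservices.tablestore.model.search.SearchResponse;
import com.alicloud.openservices.tablestore.model.search.query.MatchQuery;

SyncClient client = new SyncClient("<endpoint>", "<accessKeyId>", "<accessKeySecret>", "<instanceName>");

// Match query: the keyword is tokenized with the same analyzer as the field,
// and rows that contain the resulting tokens are returned.
MatchQuery matchQuery = new MatchQuery();
matchQuery.setFieldName("Col_Text");
matchQuery.setText("hang");

SearchQuery searchQuery = new SearchQuery();
searchQuery.setQuery(matchQuery);

SearchRequest searchRequest = new SearchRequest("exampleTable", "exampleIndex", searchQuery);
SearchResponse response = client.search(searchRequest);
```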
Tokenization methods
The following tokenization methods are supported: single-word tokenization, delimiter tokenization, minimum semantic unit-based tokenization, maximum semantic unit-based tokenization, and fuzzy tokenization.
Single-word tokenization (SingleWord)
This tokenization method applies to all natural languages such as Chinese, English, and Japanese. By default, the tokenization method for TEXT fields is single-word tokenization.
After you specify single-word tokenization for a TEXT field, Tablestore performs tokenization based on the following rules:
Chinese texts are tokenized based on each Chinese character. For example, "杭州" is tokenized into "杭" and "州". You can use match query (MatchQuery) or match phrase query (MatchPhraseQuery) and specify "杭" as the keyword to query the rows that contain "杭州".
Text that consists of letters and digits is tokenized based on spaces and punctuation marks.
If you set the caseSensitive parameter to false, tokens are not case-sensitive. Tablestore converts all letters in the tokens into lowercase letters and stores the lowercase tokens. For example, "Hang Zhou" is tokenized into "hang" and "zhou". You can use match query (MatchQuery) or match phrase query (MatchPhraseQuery) and specify "hang", "HANG", or "Hang" as the keyword to query the rows that contain "Hang Zhou".
If you set the caseSensitive parameter to true, tokens are case-sensitive. Tablestore stores the tokens in a case-sensitive manner. For example, "Hang Zhou" is tokenized into "Hang" and "Zhou". You can use match query (MatchQuery) or match phrase query (MatchPhraseQuery) and specify "Hang" or "Zhou" as the keyword to query the rows that contain "Hang Zhou".
Alphanumeric strings, such as model numbers, are also tokenized based on spaces and punctuation marks. However, by default, these strings are not split into smaller words. For example, "iphone6" is tokenized only into "iphone6". When you use match query (MatchQuery) or match phrase query (MatchPhraseQuery), you must specify "iphone6" as the keyword. No results are returned if you specify "iphone" as the keyword.
You can set the delimitWord parameter to true to separate letters from digits. This way, "iphone6" is tokenized into "iphone" and "6". You can use match query (MatchQuery) or match phrase query (MatchPhraseQuery) and specify "iphone" or "6" as the keyword to match the rows that contain "iphone6".
The following table describes the parameters for single-word tokenization.
| Parameter | Description |
| --- | --- |
| caseSensitive | Specifies whether tokens are case-sensitive. Default value: false. If you set this parameter to false, all letters are converted into lowercase letters. If you do not want Tablestore to convert letters into lowercase letters, set this parameter to true. |
| delimitWord | Specifies whether to split alphanumeric strings into letters and digits. Default value: false. If you set this parameter to false, alphanumeric strings are not split into smaller words. If you set this parameter to true, letters are separated from digits. For example, "iphone6" is tokenized into "iphone" and "6". |
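The following sketch shows how these parameters might be set when you create a search index with the Tablestore SDK for Java. The table name, index name, and field name are placeholders, and the SingleWordAnalyzerParameter(caseSensitive, delimitWord) constructor is assumed from the SDK's analysis package:

```java
import java.util.Collections;

import com.alicloud.openservices.tablestore.SyncClient;
import com.alicloud.openservices.tablestore.model.search.CreateSearchIndexRequest;
import com.alicloud.openservices.tablestore.model.search.FieldSchema;
import com.alicloud.openservices.tablestore.model.search.FieldType;
import com.alicloud.openservices.tablestore.model.search.IndexSchema;
import com.alicloud.openservices.tablestore.model.search.analysis.SingleWordAnalyzerParameter;

SyncClient client = new SyncClient("<endpoint>", "<accessKeyId>", "<accessKeySecret>", "<instanceName>");

// TEXT field that uses single-word tokenization: case-insensitive tokens,
// alphanumeric strings split into letters and digits ("iphone6" -> "iphone", "6").
FieldSchema fieldSchema = new FieldSchema("Col_Text", FieldType.TEXT);
fieldSchema.setIndex(true);
fieldSchema.setAnalyzer(FieldSchema.Analyzer.SingleWord);
fieldSchema.setAnalyzerParameter(new SingleWordAnalyzerParameter(false, true)); // caseSensitive, delimitWord

IndexSchema indexSchema = new IndexSchema();
indexSchema.setFieldSchemas(Collections.singletonList(fieldSchema));

CreateSearchIndexRequest request = new CreateSearchIndexRequest();
request.setTableName("exampleTable");
request.setIndexName("exampleIndex");
request.setIndexSchema(indexSchema);
client.createSearchIndex(request);
```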
Delimiter tokenization (Split)
Tablestore provides general dictionary-based tokenization. However, specific industries require custom dictionaries for tokenization. To meet this requirement, Tablestore provides delimiter tokenization. You can tokenize your data by using a custom method, join the tokens with a delimiter, write the data to Tablestore, and then specify delimiter tokenization for the field.
Delimiter tokenization applies to all natural languages, such as Chinese, English, and Japanese.
After you specify delimiter tokenization for a TEXT field, Tablestore tokenizes the values of the field based on the specified delimiter. For example, if a field value is "badminton,ping pong,rap" and you set the delimiter parameter to a comma (,), the value is tokenized into "badminton", "ping pong", and "rap", and these tokens are indexed. When you use match query (MatchQuery) or match phrase query (MatchPhraseQuery) and specify "badminton", "ping pong", "rap", or "badminton,ping pong" as the keyword, the row is returned.
The following table describes the parameters for delimiter tokenization.
| Parameter | Description |
| --- | --- |
| caseSensitive | Specifies whether tokens are case-sensitive. Default value: false. If you set this parameter to false, all letters are converted into lowercase letters. If you do not want Tablestore to convert letters into lowercase letters, set this parameter to true. Important: Tablestore SDK for Java V5.17.2 or later supports this parameter. |
| delimiter | The delimiter. Default value: a whitespace character. You can specify a custom delimiter. |
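A sketch of specifying delimiter tokenization for a field, following the same index-creation flow as the preceding example; the field name and the comma delimiter are placeholders:

```java
import com.alicloud.openservices.tablestore.model.search.FieldSchema;
import com.alicloud.openservices.tablestore.model.search.FieldType;
import com.alicloud.openservices.tablestore.model.search.analysis.SplitAnalyzerParameter;

// TEXT field whose values, such as "badminton,ping pong,rap", are split on commas.
FieldSchema fieldSchema = new FieldSchema("Col_Tags", FieldType.TEXT);
fieldSchema.setIndex(true);
fieldSchema.setAnalyzer(FieldSchema.Analyzer.Split);
fieldSchema.setAnalyzerParameter(new SplitAnalyzerParameter(","));
```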
Minimum semantic unit-based tokenization (MinWord)
This tokenization method applies to the Chinese language in full-text search scenarios.
After you specify minimum semantic unit-based tokenization for a TEXT field, Tablestore tokenizes the values of the field into the minimum number of semantic units.
Maximum semantic unit-based tokenization (MaxWord)
This tokenization method applies to the Chinese language in full-text search scenarios.
After you specify maximum semantic unit-based tokenization for a TEXT field, Tablestore tokenizes the values of the field into the maximum number of semantic units. Different semantic units may contain the same characters. As a result, the total length of the tokens exceeds the length of the original text, which increases the volume of index data.
This tokenization method generates more tokens and increases the probability that rows are matched. However, the index size is greatly increased. Match query (MatchQuery) is more suitable for this tokenization method. If you use match phrase query (MatchPhraseQuery) together with this tokenization method, rows may fail to be matched because the keyword is also tokenized based on maximum semantic units and the resulting tokens may overlap.
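Minimum and maximum semantic unit-based tokenization have no parameter tables in this topic; you only select the analyzer for the field. A sketch with the Tablestore SDK for Java, in which the field names are placeholders:

```java
import com.alicloud.openservices.tablestore.model.search.FieldSchema;
import com.alicloud.openservices.tablestore.model.search.FieldType;

// Minimum semantic unit-based tokenization for a Chinese TEXT field.
FieldSchema minWordField = new FieldSchema("Col_Title", FieldType.TEXT);
minWordField.setIndex(true);
minWordField.setAnalyzer(FieldSchema.Analyzer.MinWord);

// Maximum semantic unit-based tokenization: more tokens and a larger index,
// better used with match query (MatchQuery) than match phrase query.
FieldSchema maxWordField = new FieldSchema("Col_Summary", FieldType.TEXT);
maxWordField.setIndex(true);
maxWordField.setAnalyzer(FieldSchema.Analyzer.MaxWord);
```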
Fuzzy tokenization
This tokenization method applies to all natural languages such as Chinese, English, and Japanese in scenarios that involve short text content, such as titles, movie names, book titles, file names, and directory names.
You can use fuzzy tokenization together with match phrase query to return query results at low latency. The combination of fuzzy tokenization and match phrase query outperforms wildcard query (WildcardQuery). However, the index size is greatly increased.
After you specify fuzzy tokenization for a TEXT field, Tablestore performs tokenization by using n-grams. The number of characters in a token ranges from the value of the minChars parameter to the value of the maxChars parameter. For example, you can use this tokenization method in search-as-you-type scenarios, such as populating a drop-down suggestion list.
To perform a fuzzy query, you must perform a match phrase query (MatchPhraseQuery) on the field for which fuzzy tokenization is used. If you have additional query requirements on the field, use the virtual column feature. For more information about the virtual column feature, see Virtual columns.
Limits
Fuzzy tokenization supports TEXT field values of up to 1,024 characters in length. If a value exceeds 1,024 characters, Tablestore discards the excess characters and tokenizes only the first 1,024 characters.
To prevent an excessive increase of index data, the difference between the values of the maxChars and minChars parameters must not exceed 6.
Parameters
| Parameter | Description |
| --- | --- |
| minChars | The minimum number of characters in a token. Default value: 1. |
| maxChars | The maximum number of characters in a token. Default value: 7. |
| caseSensitive | Specifies whether tokens are case-sensitive. Default value: false. If you set this parameter to false, all letters are converted into lowercase letters. If you do not want Tablestore to convert letters into lowercase letters, set this parameter to true. Important: Tablestore SDK for Java V5.17.2 or later supports this parameter. |
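A sketch of configuring fuzzy tokenization and querying the field with match phrase query, using the Tablestore SDK for Java. The FuzzyAnalyzerParameter(minChars, maxChars) constructor is assumed from the SDK's analysis package, and all names and values are placeholders (keep the difference between maxChars and minChars at 6 or less):

```java
import com.alicloud.openservices.tablestore.model.search.FieldSchema;
import com.alicloud.openservices.tablestore.model.search.FieldType;
import com.alicloud.openservices.tablestore.model.search.SearchQuery;
import com.alicloud.openservices.tablestore.model.search.SearchRequest;
import com.alicloud.openservices.tablestore.model.search.analysis.FuzzyAnalyzerParameter;
import com.alicloud.openservices.tablestore.model.search.query.MatchPhraseQuery;

// TEXT field that uses fuzzy (n-gram) tokenization with token lengths 1 to 7.
FieldSchema fieldSchema = new FieldSchema("Col_FileName", FieldType.TEXT);
fieldSchema.setIndex(true);
fieldSchema.setAnalyzer(FieldSchema.Analyzer.Fuzzy);
fieldSchema.setAnalyzerParameter(new FuzzyAnalyzerParameter(1, 7)); // minChars, maxChars

// Fuzzy queries against this field must use match phrase query.
MatchPhraseQuery query = new MatchPhraseQuery();
query.setFieldName("Col_FileName");
query.setText("report_2");

SearchQuery searchQuery = new SearchQuery();
searchQuery.setQuery(query);
SearchRequest searchRequest = new SearchRequest("exampleTable", "exampleIndex", searchQuery);
// client.search(searchRequest) returns matching rows; client is an initialized SyncClient.
```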
Comparison
The following table compares the tokenization methods.
| Item | Single-word tokenization | Delimiter tokenization | Minimum semantic unit-based tokenization | Maximum semantic unit-based tokenization | Fuzzy tokenization |
| --- | --- | --- | --- | --- | --- |
| Index size increase | Small | Small | Small | Medium | Large |
| Relevance | Weak | Weak | Medium | Relatively strong | Relatively strong |
| Applicable languages | All | All | Chinese | Chinese | All |
| Length limit | None | None | None | None | 1,024 characters |
| Recall rate | High | Low | Low | Medium | High |