This topic describes the analyzers provided by AnalyticDB for MySQL and their usage and segmentation effects.
Overview
AnalyticDB for MySQL provides a variety of built-in analyzers to implement full-text indexes, including the AliNLP analyzer, IK analyzer, Standard analyzer, Ngram analyzer, Edge_ngram analyzer, and Pattern analyzer. You can use the default analyzer or another built-in analyzer to analyze text based on your business scenarios. The default analyzer is determined based on the following rules:
For clusters of a version earlier than V3.1.4.15, the AliNLP analyzer is used by default.
For clusters of V3.1.4.15 or later, the IK analyzer is used by default.
For more information about how to view the minor engine version of a cluster, see How do I view the minor version of a cluster?
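The version rule above only works if version numbers are compared component by component. Compared as plain text, "3.1.4.9" sorts after "3.1.4.15". The following Python sketch illustrates the rule outside the database. The helper names `parse_version` and `default_analyzer` are hypothetical, and the returned strings reuse the analyzer names that appear after WITH ANALYZER in this topic.

```python
def parse_version(version: str) -> tuple:
    """Turn 'V3.1.4.15' or '3.1.4.15' into a tuple of integers for comparison."""
    return tuple(int(part) for part in version.lstrip("Vv").split("."))


def default_analyzer(version: str) -> str:
    """Return the default analyzer name for a cluster minor version.

    Rule taken from this topic: earlier than V3.1.4.15 -> AliNLP, otherwise IK.
    """
    return "ik" if parse_version(version) >= (3, 1, 4, 15) else "alinlp"


print(default_analyzer("V3.1.4.9"))   # alinlp
print(default_analyzer("V3.1.4.15"))  # ik
```

Comparing integer tuples avoids the string-ordering pitfall, so V3.1.4.9 is correctly treated as earlier than V3.1.4.15.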
Specify an analyzer
Syntax
FULLTEXT INDEX idx_name(`column_name`) [ WITH ANALYZER analyzer_name ] [ WITH DICT tbl_dict_name];
Parameters
idx_name: the name of the full-text index.
column_name: the name of the column on which to create the full-text index.
WITH ANALYZER analyzer_name: specifies the analyzer.
WITH DICT tbl_dict_name: specifies the custom dictionary. AnalyticDB for MySQL supports custom dictionaries. For more information, see Custom dictionaries for full-text indexes.
Examples
Specify an analyzer when you create a table that contains a full-text index.
CREATE TABLE `tbl_fulltext_demo` (
`id` int,
`content` varchar,
`content_alinlp` varchar,
`content_ik` varchar,
`content_standard` varchar,
`content_ngram` varchar,
`content_edge_ngram` varchar,
FULLTEXT INDEX fidx_c(`content`), -- Use the default analyzer.
FULLTEXT INDEX fidx_alinlp(`content_alinlp`) WITH ANALYZER alinlp,
FULLTEXT INDEX fidx_ik(`content_ik`) WITH ANALYZER ik,
FULLTEXT INDEX fidx_standard(`content_standard`) WITH ANALYZER standard,
FULLTEXT INDEX fidx_ngram(`content_ngram`) WITH ANALYZER ngram,
FULLTEXT INDEX fidx_edge_ngram(`content_edge_ngram`) WITH ANALYZER edge_ngram,
PRIMARY KEY (`id`)
) DISTRIBUTED BY HASH(id);
AliNLP analyzer
The AliNLP analyzer is a natural language analysis toolkit provided by Alibaba Cloud and DAMO Academy based on natural language processing (NLP) technologies. It divides continuous natural-language text into meaningful segments and allows you to use the entities and stop words contained in a custom full-text dictionary. Supported languages include Chinese, English, Indonesian, Malay, Thai, Vietnamese, French, and Spanish.
Configuration parameters
For more information about how to query and modify analyzer configurations, see the "Query and modify analyzer configuration parameters" section of this topic.
Parameter | Description |
FULLTEXT_SPLIT_GRANULARITY | The segmentation granularity. The value must be an integer from 2 to 8. The default value is 2. |
FULLTEXT_FILTER_ST_CONVERT_ENABLED | Specifies whether to enable stem conversion. Default value: false. Valid values: true (stem conversion is enabled; for example, men is converted to man, and cars is converted to car) and false (stem conversion is disabled). |
FULLTEXT_TOKENIZER_CASE_SENSITIVE | Specifies whether the text segmentation is case-sensitive. Default value: false. Valid values: true (case-sensitive) and false (not case-sensitive). |
Segmentation effects
The default configuration provides the following segmentation effects:
Query the segmentation effects for English text.
/*+ mode=two_phase*/ SELECT fulltext_alinlp_test('Hello world');
The following information is returned:
[hello, , world]
When you use a SELECT statement to query the segmentation effects, you must add /*+ mode=two_phase*/ to the beginning of the statement.
IK analyzer
The IK analyzer is an open source lightweight Chinese analyzer kit. This analyzer allows you to use the entities and stop words contained in a custom full-text dictionary.
Configuration parameters
For more information about how to query and modify analyzer configurations, see the "Query and modify analyzer configuration parameters" section of this topic.
Parameter | Description |
CSTORE_IK_SEGMENTER_USE_SMART_ENABLE | The segmentation granularity. Default value: false. Valid values: true (coarse-grained segmentation) and false (fine-grained segmentation). |
CSTORE_IK_SEGMENTER_LETTER_MIN_LENGTH | The minimum length of a segment. The value must be an integer from 2 to 16. The default value is 3. |
CSTORE_IK_SEGMENTER_LETTER_MAX_LENGTH | The maximum length of a segment. The value must be an integer from 2 to 256. The default value is 128. |
Segmentation effects
The default configuration provides the following segmentation effects:
Query the segmentation effects for English text.
/*+ mode=two_phase*/ SELECT fulltext_ik_test('Hello world');
The following information is returned:
[hello, world, or]
When you use a SELECT statement to query the segmentation effects, you must add /*+ mode=two_phase*/ to the beginning of the statement.
Standard analyzer
The standard analyzer applies different segmentation rules to different languages. For English text, this analyzer converts the text to lowercase and removes stop words and punctuation marks before segmentation. For Chinese text, this analyzer segments the text directly into individual characters. The standard analyzer allows you to use the entities and stop words contained in a custom full-text dictionary.
Configuration parameters
For more information about how to query and modify analyzer configurations, see the "Query and modify analyzer configuration parameters" section of this topic.
Parameter | Description |
FULLTEXT_MAX_TOKEN_LENGTH | The maximum length of the text that can be segmented. The value must be an integer from 1 to 1048576. The default value is 255. |
Segmentation effects
The default configuration provides the following segmentation effects:
Query the segmentation effects for English text.
/*+ mode=two_phase*/ SELECT fulltext_standard_test('Hello world');
The following information is returned:
[hello, world]
When you use a SELECT statement to query the segmentation effects, you must add /*+ mode=two_phase*/ to the beginning of the statement.
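The language-specific rules described for the standard analyzer can be approximated outside the database. The following Python sketch is only an illustration of the description in this section, not the analyzer's implementation. The function name `standard_tokens` and the stop-word set `EXAMPLE_STOP_WORDS` are hypothetical, and the real stop-word list is not given in this topic.

```python
import re

# Illustrative stop words only; the analyzer's actual list is an unknown here.
EXAMPLE_STOP_WORDS = {"a", "an", "and", "of", "the", "to"}


def standard_tokens(text, stop_words=EXAMPLE_STOP_WORDS):
    """Approximate the described behavior of the standard analyzer."""
    lowered = text.lower()
    # A run of ASCII letters or digits is one token; punctuation is skipped.
    # Each CJK ideograph becomes a token of its own.
    candidates = re.findall(r"[a-z0-9]+|[\u4e00-\u9fff]", lowered)
    return [token for token in candidates if token not in stop_words]


print(standard_tokens("Hello world"))  # ['hello', 'world']
```

With these assumptions, `standard_tokens("全文索引")` yields one token per character, which mirrors the character-level segmentation described for Chinese text.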
Ngram analyzer
The ngram analyzer segments text based on the value of the FULLTEXT_NGRAM_TOKEN_SIZE parameter and allows you to use the entities and stop words contained in a custom full-text dictionary. The ngram analyzer can improve the efficiency of fuzzy retrieval.
Configuration parameters
For more information about how to query and modify analyzer configurations, see the "Query and modify analyzer configuration parameters" section of this topic.
Parameter | Description |
FULLTEXT_NGRAM_TOKEN_SIZE | The length of a segment. The value must be an integer from 1 to 8. The default value is 2. |
Segmentation effects
The default configuration provides the following segmentation effects:
Query the segmentation effects for English text.
/*+ mode=two_phase*/ SELECT fulltext_ngram_test('Hello world');
The following information is returned:
[he, el, ll, lo, o , w, wo, or, rl, ld]
When you use a SELECT statement to query the segmentation effects, you must add /*+ mode=two_phase*/ to the beginning of the statement.
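The result above is consistent with a sliding window of FULLTEXT_NGRAM_TOKEN_SIZE characters that moves one character at a time over the lowercased text. The entries displayed as "o" and "w" appear to be the windows that contain the space. The following Python sketch reproduces that behavior for intuition only. The function name `ngram_tokens` is hypothetical, and lowercasing is an assumption inferred from the returned result.

```python
def ngram_tokens(text, size=2):
    """Return every window of `size` consecutive characters, left to right.

    `size` plays the role of FULLTEXT_NGRAM_TOKEN_SIZE (default 2).
    Lowercasing is an assumption based on the sample result in this topic.
    """
    lowered = text.lower()
    return [lowered[i:i + size] for i in range(len(lowered) - size + 1)]


print(ngram_tokens("Hello world"))
# ['he', 'el', 'll', 'lo', 'o ', ' w', 'wo', 'or', 'rl', 'ld']
```

Because every substring of the configured length becomes a token, a fuzzy search term can be matched by looking up its own windows instead of scanning the full text.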
Edge_ngram analyzer
The edge_ngram analyzer uses the same segmentation method as the ngram analyzer. This analyzer is ideal for prefix-based segmentation and word retrieval scenarios and allows you to use the entities and stop words contained in a custom full-text dictionary.
Configuration parameters
For more information about how to query and modify analyzer configurations, see the "Query and modify analyzer configuration parameters" section of this topic.
Parameter | Description |
FULLTEXT_MIN_GRAM_SIZE | The minimum length of the prefix segment. The value must be an integer from 1 to 8. The default value is 1. |
FULLTEXT_MAX_GRAM_SIZE | The maximum length of the prefix segment. The maximum length must be greater than the minimum length. The value must be an integer from 1 to 16. The default value is 2. |
Segmentation effects
The default configuration provides the following segmentation effects:
Query the segmentation effects for English text.
/*+ mode=two_phase*/ SELECT fulltext_edge_ngram_test('Hello world');
The following information is returned:
[h, he]
When you use a SELECT statement to query the segmentation effects, you must add /*+ mode=two_phase*/ to the beginning of the statement.
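The result above suggests that the edge_ngram analyzer keeps only the windows anchored at the start of the text, with lengths from FULLTEXT_MIN_GRAM_SIZE to FULLTEXT_MAX_GRAM_SIZE. The following Python sketch illustrates this reading. The function name `edge_ngram_tokens` is hypothetical, and both the lowercasing and the anchoring at the start of the whole text are assumptions inferred from the sample result.

```python
def edge_ngram_tokens(text, min_gram=1, max_gram=2):
    """Return the prefixes of `text` with lengths min_gram..max_gram.

    min_gram and max_gram stand in for FULLTEXT_MIN_GRAM_SIZE and
    FULLTEXT_MAX_GRAM_SIZE (defaults 1 and 2 in this topic).
    """
    lowered = text.lower()
    longest = min(max_gram, len(lowered))
    return [lowered[:length] for length in range(min_gram, longest + 1)]


print(edge_ngram_tokens("Hello world"))  # ['h', 'he']
```

Under the same assumptions, raising the maximum length to 4 would produce the prefixes h, he, hel, and hell, which is why this analyzer suits prefix matching.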
Pattern analyzer
The pattern analyzer segments text based on regular expressions. This analyzer does not support the entities and stop words contained in a custom full-text dictionary, and you cannot execute SQL statements to query its segmentation effects.
Syntax
FULLTEXT INDEX fidx_name(`column_name`) WITH ANALYZER pattern_tokenizer("Custom_rule") [ WITH DICT `tbl_dict_name` ];
Parameters
Custom_rule: the regular expression.
Configuration parameters
For more information about how to query and modify analyzer configurations, see the "Query and modify analyzer configuration parameters" section of this topic.
Parameter | Description |
FULLTEXT_TOKENIZER_CASE_SENSITIVE | Specifies whether the text segmentation is case-sensitive. Default value: false. Valid values: true (case-sensitive) and false (not case-sensitive). |
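Because the segmentation effects of the pattern analyzer cannot be queried with SQL, it can help to try a regular expression locally before you create the index. The following Python sketch is a rough stand-in only. It assumes that the regular expression describes the separators between segments, which this topic does not state. It also uses Python's `re` syntax, which may differ from the syntax that the analyzer accepts. The function name `pattern_tokens` is hypothetical.

```python
import re


def pattern_tokens(text, custom_rule, case_sensitive=False):
    """Split `text` on every match of `custom_rule` and drop empty pieces.

    Assumption: the regular expression matches separators, not tokens.
    `case_sensitive` mirrors FULLTEXT_TOKENIZER_CASE_SENSITIVE (default false).
    """
    pieces = [piece for piece in re.split(custom_rule, text) if piece]
    if not case_sensitive:
        pieces = [piece.lower() for piece in pieces]
    return pieces


print(pattern_tokens("Order-2024,Shipped;OK", r"[,;-]"))
# ['order', '2024', 'shipped', 'ok']
```

Avoid capturing groups in the expression when you use this sketch, because `re.split` also returns the text matched by capturing groups.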
Query and modify analyzer configuration parameters
AnalyticDB for MySQL allows you to query and modify analyzer configuration parameters.
Query configuration parameters
Execute the SHOW adb_config statement to query configuration parameters.
Syntax
show adb_config key <analyzer_param>;
Parameters
analyzer_param: the name of the configuration parameter.
Examples
show adb_config key FULLTEXT_NGRAM_TOKEN_SIZE;
Note: The SHOW adb_config statement can be used to query both default and modified configuration parameters.
Execute the SELECT statement to query configuration parameters.
Syntax
SELECT `key`, `value`, `update_time` FROM INFORMATION_SCHEMA.kepler_meta_configs WHERE `key` = '<analyzer_param>';
Parameters
analyzer_param: the name of the configuration parameter.
Examples
SELECT `key`, `value`, `update_time` FROM INFORMATION_SCHEMA.kepler_meta_configs WHERE `key` = 'FULLTEXT_NGRAM_TOKEN_SIZE';
Note: The SELECT statement can be used to query only modified configuration parameters. If you use this statement to query a parameter that still has its default value, null is returned.
Modify configuration parameters
Syntax
set adb_config <analyzer_param>=<value>;
Parameters
analyzer_param: the name of the configuration parameter.
value: the value of the configuration parameter.
Examples
set adb_config FULLTEXT_NGRAM_TOKEN_SIZE=3;
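If the statements are generated by a script, the value ranges listed in this topic can be checked before a statement is sent to the cluster. The following Python sketch shows one possible way to do that. The names `DOCUMENTED_RANGES` and `build_set_statement` are hypothetical, and the sketch only builds the statement text. It does not connect to a cluster, and it does not check the rule that FULLTEXT_MAX_GRAM_SIZE must be greater than FULLTEXT_MIN_GRAM_SIZE.

```python
# Integer ranges copied from the parameter tables in this topic.
DOCUMENTED_RANGES = {
    "FULLTEXT_SPLIT_GRANULARITY": (2, 8),
    "CSTORE_IK_SEGMENTER_LETTER_MIN_LENGTH": (2, 16),
    "CSTORE_IK_SEGMENTER_LETTER_MAX_LENGTH": (2, 256),
    "FULLTEXT_MAX_TOKEN_LENGTH": (1, 1048576),
    "FULLTEXT_NGRAM_TOKEN_SIZE": (1, 8),
    "FULLTEXT_MIN_GRAM_SIZE": (1, 8),
    "FULLTEXT_MAX_GRAM_SIZE": (1, 16),
}


def build_set_statement(param, value):
    """Return a `set adb_config` statement after validating the value range."""
    if param not in DOCUMENTED_RANGES:
        raise KeyError(f"no documented integer range for {param}")
    low, high = DOCUMENTED_RANGES[param]
    is_plain_int = isinstance(value, int) and not isinstance(value, bool)
    if not is_plain_int or not low <= value <= high:
        raise ValueError(f"{param} must be an integer from {low} to {high}")
    return f"set adb_config {param}={value};"


print(build_set_statement("FULLTEXT_NGRAM_TOKEN_SIZE", 3))
# set adb_config FULLTEXT_NGRAM_TOKEN_SIZE=3;
```

Rejecting out-of-range values on the client side gives a clear error message before any configuration change is attempted.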