This topic describes the analyzers provided by AnalyticDB for MySQL and their usage and segmentation effects.
Overview
- For clusters of a version earlier than V3.1.4.15, the AliNLP analyzer is used by default.
- For clusters of V3.1.4.15 or later, the IK analyzer is used by default.
Specify an analyzer
Syntax
FULLTEXT INDEX idx_name(`column_name`) [ WITH ANALYZER analyzer_name ] [ WITH DICT tbl_dict_name];
Parameters
- idx_name: the name of the full-text index.
- column_name: the name of the column on which to create the full-text index.
- WITH ANALYZER analyzer_name: specifies the analyzer.
- WITH DICT tbl_dict_name: specifies the custom dictionary. AnalyticDB for MySQL supports custom dictionaries. For more information, see Custom dictionaries for full-text indexes.
Examples
CREATE TABLE `tbl_fulltext_demo` (
`id` int,
`content` varchar,
`content_alinlp` varchar,
`content_ik` varchar,
`content_standard` varchar,
`content_ngram` varchar,
`content_edge_ngram` varchar,
FULLTEXT INDEX fidx_c(`content`), -- Use the default analyzer.
FULLTEXT INDEX fidx_alinlp(`content_alinlp`) WITH ANALYZER alinlp,
FULLTEXT INDEX fidx_ik(`content_ik`) WITH ANALYZER ik,
FULLTEXT INDEX fidx_standard(`content_standard`) WITH ANALYZER standard,
FULLTEXT INDEX fidx_ngram(`content_ngram`) WITH ANALYZER ngram,
FULLTEXT INDEX fidx_edge_ngram(`content_edge_ngram`) WITH ANALYZER edge_ngram,
PRIMARY KEY (`id`)
) DISTRIBUTE BY HASH(id);
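Before you choose an analyzer for an index, it can help to run the same sample text through each analyzer's test function. The following sketch reuses only the test functions and the /*+ mode=two_phase*/ hint described in the analyzer sections of this topic. The sample text is arbitrary. The pattern analyzer is omitted because its segmentation effects cannot be queried by using SQL statements.

```sql
-- Compare how each analyzer segments the same sample text.
-- Every statement starts with the two_phase hint.
/*+ mode=two_phase*/ SELECT fulltext_alinlp_test('Hello world');
/*+ mode=two_phase*/ SELECT fulltext_ik_test('Hello world');
/*+ mode=two_phase*/ SELECT fulltext_standard_test('Hello world');
/*+ mode=two_phase*/ SELECT fulltext_ngram_test('Hello world');
/*+ mode=two_phase*/ SELECT fulltext_edge_ngram_test('Hello world');
```

The results depend on the current analyzer configuration parameters, so the output may differ from the default segmentation effects listed in each analyzer section.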
AliNLP analyzer
The AliNLP analyzer is a natural language analyzer kit provided by Alibaba Cloud and DAMO Academy based on natural language processing (NLP) technologies. It divides continuous natural language text into appropriate segments and allows you to use the entities and stop words contained in a custom full-text dictionary. A variety of languages are supported, including Chinese, English, Indonesian, Malay, Thai, Vietnamese, French, and Spanish.
Configuration parameters
For more information about how to query and modify analyzer configurations, see the "Query and modify analyzer configuration parameters" section of this topic.
Parameter | Description |
---|---|
FULLTEXT_SPLIT_GRANULARITY | The segmentation granularity. The value must be an integer from 2 to 8. The default value is 2. |
FULLTEXT_FILTER_ST_CONVERT_ENABLED | Specifies whether to enable stem conversion. Default value: false. Valid values: true and false. |
FULLTEXT_TOKENIZER_CASE_SENSITIVE | Specifies whether the text segmentation is case-sensitive. Default value: false. Valid values: true and false. |
Segmentation effects
The default configuration provides the following segmentation effects:
- Query the segmentation effects for English text.
/*+ mode=two_phase*/ SELECT fulltext_alinlp_test('Hello world');
The following information is returned:
[hello, , world]
Note When you query segmentation effects, add the /*+ mode=two_phase*/ hint to the beginning of the statement.
IK analyzer
The IK analyzer is an open source lightweight Chinese analyzer kit. This analyzer allows you to use the entities and stop words contained in a custom full-text dictionary.
Configuration parameters
For more information about how to query and modify analyzer configurations, see the "Query and modify analyzer configuration parameters" section of this topic.
Parameter | Description |
---|---|
CSTORE_IK_SEGMENTER_USE_SMART_ENABLE | The segmentation granularity. Default value: false. Valid values: true (coarse-grained segmentation) and false (fine-grained segmentation). |
CSTORE_IK_SEGMENTER_LETTER_MIN_LENGTH | The minimum length of a segment. The value must be an integer from 2 to 16. The default value is 3. |
CSTORE_IK_SEGMENTER_LETTER_MAX_LENGTH | The maximum length of a segment. The value must be an integer from 2 to 256. The default value is 128. |
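To see how the granularity switch changes IK output, you can modify the parameter and rerun the test function. This is a minimal sketch that combines the set adb_config syntax and the fulltext_ik_test function from this topic. It assumes that the new value takes effect for the test function without further steps, which this topic does not confirm.

```sql
-- Switch the IK segmentation granularity (default value: false).
set adb_config CSTORE_IK_SEGMENTER_USE_SMART_ENABLE=true;

-- Rerun the test function and compare the result with the default output.
/*+ mode=two_phase*/ SELECT fulltext_ik_test('Hello world');

-- Restore the default value after the comparison.
set adb_config CSTORE_IK_SEGMENTER_USE_SMART_ENABLE=false;
```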
Segmentation effects
The default configuration provides the following segmentation effects:
- Query the segmentation effects for English text.
/*+ mode=two_phase*/ SELECT fulltext_ik_test('Hello world');
The following information is returned:
[hello, world, or]
Note When you query segmentation effects, add the /*+ mode=two_phase*/ hint to the beginning of the statement.
Standard analyzer
The standard analyzer applies distinct text segmentation rules for different languages. For English text, this analyzer converts the text to lowercase and removes stop words and punctuation marks before segmentation. For Chinese text, this analyzer segments the text directly to individual characters. The standard analyzer allows you to use the entities and stop words contained in a custom full-text dictionary.
Configuration parameters
For more information about how to query and modify analyzer configurations, see the "Query and modify analyzer configuration parameters" section of this topic.
Parameter | Description |
---|---|
FULLTEXT_MAX_TOKEN_LENGTH | The maximum length of the text that can be segmented. The value must be an integer from 1 to 1048576. The default value is 255. |
Segmentation effects
The default configuration provides the following segmentation effects:
- Query the segmentation effects for English text.
/*+ mode=two_phase*/ SELECT fulltext_standard_test('Hello world');
The following information is returned:
[hello, world]
Note When you query segmentation effects, add the /*+ mode=two_phase*/ hint to the beginning of the statement.
Ngram analyzer
The ngram analyzer segments text based on the value of the FULLTEXT_NGRAM_TOKEN_SIZE parameter and allows you to use the entities and stop words contained in a custom full-text dictionary. The ngram analyzer can improve the efficiency of fuzzy retrieval.
Configuration parameters
For more information about how to query and modify analyzer configurations, see the "Query and modify analyzer configuration parameters" section of this topic.
Parameter | Description |
---|---|
FULLTEXT_NGRAM_TOKEN_SIZE | The length of a segment. The value must be an integer from 1 to 8. The default value is 2. |
Segmentation effects
The default configuration provides the following segmentation effects:
- Query the segmentation effects for English text.
/*+ mode=two_phase*/ SELECT fulltext_ngram_test('Hello world');
The following information is returned:
[he, el, ll, lo, o , w, wo, or, rl, ld]
Note When you query segmentation effects, add the /*+ mode=two_phase*/ hint to the beginning of the statement.
Edge_ngram analyzer
The edge_ngram analyzer uses the same segmentation method as the ngram analyzer. This analyzer is ideal for prefix-based segmentation and word retrieval scenarios and allows you to use the entities and stop words contained in a custom full-text dictionary.
Configuration parameters
For more information about how to query and modify analyzer configurations, see the "Query and modify analyzer configuration parameters" section of this topic.
Parameter | Description |
---|---|
FULLTEXT_MIN_GRAM_SIZE | The minimum length of the prefix segment. The value must be an integer from 1 to 8. The default value is 1. |
FULLTEXT_MAX_GRAM_SIZE | The maximum length of the prefix segment. The maximum length must be greater than the minimum length. The value must be an integer from 1 to 16. The default value is 2. |
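The two prefix-length parameters interact, so the order in which you change them can matter. The following sketch uses the set adb_config syntax from this topic with illustrative values (2 and 4). The maximum is raised first so that it stays greater than the minimum after each statement. Whether the cluster enforces that constraint at the moment a value is set is an assumption.

```sql
-- Raise the maximum prefix length first (default value: 2, valid range: 1 to 16).
set adb_config FULLTEXT_MAX_GRAM_SIZE=4;

-- Then raise the minimum prefix length (default value: 1, valid range: 1 to 8).
set adb_config FULLTEXT_MIN_GRAM_SIZE=2;

-- Check the resulting prefix segments.
/*+ mode=two_phase*/ SELECT fulltext_edge_ngram_test('Hello world');
```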
Segmentation effects
The default configuration provides the following segmentation effects:
- Query the segmentation effects for English text.
/*+ mode=two_phase*/ SELECT fulltext_edge_ngram_test('Hello world');
The following information is returned:
[h, he]
Note When you query segmentation effects, add the /*+ mode=two_phase*/ hint to the beginning of the statement.
Pattern analyzer
The pattern analyzer segments text based on regular expressions. This analyzer does not allow you to use the entities and stop words contained in a custom full-text dictionary, and you cannot execute SQL statements to query its segmentation effects.
Syntax
FULLTEXT INDEX fidx_name(`column_name`) WITH ANALYZER pattern_tokenizer("Custom_rule") [ WITH DICT `tbl_dict_name` ];
Parameters
Custom_rule: the regular expression.
Configuration parameters
For more information about how to query and modify analyzer configurations, see the "Query and modify analyzer configuration parameters" section of this topic.
Parameter | Description |
---|---|
FULLTEXT_TOKENIZER_CASE_SENSITIVE | Specifies whether the text segmentation is case-sensitive. Default value: false. Valid values: true and false. |
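The syntax above can be applied as in the following sketch. The table name tbl_pattern_demo, the column tags, and the index name fidx_tags are hypothetical. The rule "," assumes that the regular expression describes the separator between segments, which this topic does not state explicitly.

```sql
-- Hypothetical table: segment a comma-separated tag column with the pattern analyzer.
CREATE TABLE `tbl_pattern_demo` (
`id` int,
`tags` varchar,
FULLTEXT INDEX fidx_tags(`tags`) WITH ANALYZER pattern_tokenizer(","),
PRIMARY KEY (`id`)
) DISTRIBUTE BY HASH(id);
```

Because the segmentation effects of the pattern analyzer cannot be queried by using SQL statements, verify the rule by searching sample data in a test table.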
Query and modify analyzer configuration parameters
AnalyticDB for MySQL allows you to query and modify analyzer configuration parameters.
Query configuration parameters
- Execute the SHOW adb_config statement to query configuration parameters.
Syntax
show adb_config key <analyzer_param>;
Parameters
analyzer_param: the name of the configuration parameter.
Examples
show adb_config key FULLTEXT_NGRAM_TOKEN_SIZE;
Note The SHOW adb_config statement can be used to query both the default and modified configuration parameters.
- Execute the SELECT statement to query configuration parameters.
Syntax
SELECT `key`, `value`, `update_time` FROM INFORMATION_SCHEMA.kepler_meta_configs WHERE `key` = '<analyzer_param>';
Parameters
analyzer_param: the name of the configuration parameter.
Examples
SELECT `key`, `value`, `update_time` FROM INFORMATION_SCHEMA.kepler_meta_configs WHERE `key` = 'FULLTEXT_NGRAM_TOKEN_SIZE';
Note The SELECT statement can be used to query only the modified configuration parameters. If you use this statement to query the default configuration parameters, null is returned.
Modify configuration parameters
Syntax
set adb_config <analyzer_param>=<value>;
Parameters
- analyzer_param: the name of the configuration parameter.
- value: the value of the configuration parameter.
Examples
set adb_config FULLTEXT_NGRAM_TOKEN_SIZE=3;
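A change can be verified by chaining the statements described in this section. The following sketch uses only syntax from this topic. The final test query assumes that the new value applies to the test function immediately.

```sql
-- Modify the ngram segment length (valid range: 1 to 8).
set adb_config FULLTEXT_NGRAM_TOKEN_SIZE=3;

-- SHOW returns both default and modified values.
show adb_config key FULLTEXT_NGRAM_TOKEN_SIZE;

-- The metadata table lists only parameters that have been modified.
SELECT `key`, `value`, `update_time` FROM INFORMATION_SCHEMA.kepler_meta_configs WHERE `key` = 'FULLTEXT_NGRAM_TOKEN_SIZE';

-- Check the segmentation effects with the new segment length.
/*+ mode=two_phase*/ SELECT fulltext_ngram_test('Hello world');
```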