Analyzers for full-text indexes

This topic describes the analyzers provided by AnalyticDB for MySQL and their usage and segmentation effects.

Overview

AnalyticDB for MySQL provides a variety of built-in analyzers to implement full-text indexes, including the AliNLP analyzer, IK analyzer, Standard analyzer, Ngram analyzer, Edge_ngram analyzer, and Pattern analyzer. You can use the default analyzer or another built-in analyzer to analyze text based on your business scenarios. The default analyzer is determined based on the following rules:

For clusters of a version earlier than V3.1.4.15, the AliNLP analyzer is used by default.
For clusters of V3.1.4.15 or later, the IK analyzer is used by default.

Note

For more information about how to view the minor engine version of a cluster, see How do I view the minor version of a cluster?
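
The version rule above can be sketched as a small comparison. This is only an illustration: the helper names `parse_version` and `default_analyzer` are hypothetical, and the returned strings reuse the analyzer names that appear in the WITH ANALYZER examples of this topic.

```python
def parse_version(version: str) -> tuple:
    """Turn 'V3.1.4.15' or '3.1.4.15' into a tuple of ints for comparison."""
    return tuple(int(part) for part in version.lstrip("Vv").split("."))


def default_analyzer(cluster_version: str) -> str:
    """Return the default analyzer documented for the given minor version."""
    threshold = (3, 1, 4, 15)
    # Tuples compare element by element, so 3.1.10.2 is later than 3.1.4.15.
    return "ik" if parse_version(cluster_version) >= threshold else "alinlp"


print(default_analyzer("3.1.4.14"))   # alinlp
print(default_analyzer("V3.1.4.15"))  # ik
print(default_analyzer("3.1.10.2"))   # ik
```

Comparing integer tuples instead of raw strings avoids the common mistake of treating 3.1.10 as earlier than 3.1.4.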

Specify an analyzer

Syntax

FULLTEXT INDEX idx_name(`column_name`) [ WITH ANALYZER analyzer_name ] [ WITH DICT tbl_dict_name];

Parameters

idx_name: the name of the full-text index.
column_name: the name of the column on which to create the full-text index.
WITH ANALYZER analyzer_name: specifies the analyzer.
WITH DICT tbl_dict_name: specifies the custom dictionary. AnalyticDB for MySQL supports custom dictionaries. For more information, see Custom dictionaries for full-text indexes.

Examples

Specify an analyzer when you create a table that contains a full-text index.

CREATE TABLE `tbl_fulltext_demo` (
  `id` int,
  `content` varchar,
  `content_alinlp` varchar,
  `content_ik` varchar,
  `content_standard` varchar,
  `content_ngram` varchar,
  `content_edge_ngram` varchar,
  FULLTEXT INDEX fidx_c(`content`),  -- Use the default analyzer.
  FULLTEXT INDEX fidx_alinlp(`content_alinlp`) WITH ANALYZER alinlp,
  FULLTEXT INDEX fidx_ik(`content_ik`) WITH ANALYZER ik,
  FULLTEXT INDEX fidx_standard(`content_standard`) WITH ANALYZER standard,
  FULLTEXT INDEX fidx_ngram(`content_ngram`) WITH ANALYZER ngram,
  FULLTEXT INDEX fidx_edge_ngram(`content_edge_ngram`) WITH ANALYZER edge_ngram,
  PRIMARY KEY (`id`)
) DISTRIBUTED BY HASH(id);

AliNLP analyzer

The AliNLP analyzer is a natural language analyzer kit provided by Alibaba Cloud and DAMO Academy based on natural language processing (NLP) technologies. This analyzer allows you to use the entities and stop words contained in a custom full-text dictionary. The AliNLP analyzer splits continuous natural language text into appropriate segments. It supports a variety of languages, including Chinese, English, Indonesian, Malay, Thai, Vietnamese, French, and Spanish.

Configuration parameters

For more information about how to query and modify analyzer configurations, see the "Query and modify analyzer configuration parameters" section of this topic.

Parameter	Description
FULLTEXT_SPLIT_GRANULARITY	The segmentation granularity. The value must be an integer from 2 to 8. The default value is 2.
FULLTEXT_FILTER_ST_CONVERT_ENABLED	Specifies whether to enable stem conversion, which reduces a word to its base form. For example, men is converted to man, and cars is converted to car. Default value: false. Valid values: true: enables stem conversion. false: disables stem conversion.
FULLTEXT_TOKENIZER_CASE_SENSITIVE	Specifies whether the text segmentation is case-sensitive. Default value: false. Valid values: true: The text segmentation is case-sensitive. false: The text segmentation is case-insensitive.

Segmentation effects

The default configuration provides the following segmentation effects:

Query the segmentation effects for English text.
```
/*+ mode=two_phase*/ SELECT fulltext_alinlp_test('Hello world');
```
The following information is returned:
```
[hello,  , world]
```

Note

When you use a SELECT statement to query the segmentation effects, you must add /*+ mode=two_phase*/ to the beginning of the statement.

IK analyzer

The IK analyzer is an open source lightweight Chinese analyzer kit. This analyzer allows you to use the entities and stop words contained in a custom full-text dictionary.

Configuration parameters

For more information about how to query and modify analyzer configurations, see the "Query and modify analyzer configuration parameters" section of this topic.

Parameter	Description
CSTORE_IK_SEGMENTER_USE_SMART_ENABLE	The segmentation granularity. Default value: false. Valid values: true: coarse-grained segmentation in ik_smart mode. false: fine-grained segmentation in ik_max_word mode.
CSTORE_IK_SEGMENTER_LETTER_MIN_LENGTH	The minimum length of a segment. The value must be an integer from 2 to 16. The default value is 3.
CSTORE_IK_SEGMENTER_LETTER_MAX_LENGTH	The maximum length of a segment. The value must be an integer from 2 to 256. The default value is 128.

Segmentation effects

The default configuration provides the following segmentation effects:

Query the segmentation effects for English text.
```
/*+ mode=two_phase*/ SELECT fulltext_ik_test('Hello world');
```
The following information is returned:
```
[hello, world, or]
```

Note

When you use a SELECT statement to query the segmentation effects, you must add /*+ mode=two_phase*/ to the beginning of the statement.

Standard analyzer

The standard analyzer applies different segmentation rules to different languages. For English text, this analyzer converts the text to lowercase and removes stop words and punctuation marks before segmentation. For Chinese text, this analyzer segments the text directly into individual characters. The standard analyzer allows you to use the entities and stop words contained in a custom full-text dictionary.

Configuration parameters

For more information about how to query and modify analyzer configurations, see the "Query and modify analyzer configuration parameters" section of this topic.

Parameter	Description
FULLTEXT_MAX_TOKEN_LENGTH	The maximum length of the text that can be segmented. The value must be an integer from 1 to 1048576. The default value is 255.

Segmentation effects

The default configuration provides the following segmentation effects:

Query the segmentation effects for English text.
```
/*+ mode=two_phase*/ SELECT fulltext_standard_test('Hello world');
```
The following information is returned:
```
[hello, world]
```

Note

When you use a SELECT statement to query the segmentation effects, you must add /*+ mode=two_phase*/ to the beginning of the statement.
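
The rules described for the standard analyzer can be approximated in a few lines of Python. This is a rough sketch, not the engine's implementation: the `STOP_WORDS` set is a hypothetical placeholder (the actual stop word list is not given in this topic), the function name `standard_tokens` is made up for illustration, and FULLTEXT_MAX_TOKEN_LENGTH is not modeled.

```python
import re

# Hypothetical stop word list for illustration only; the real list is not documented here.
STOP_WORDS = {"a", "an", "the", "of", "and"}

# A run of ASCII letters or digits is one token; each CJK ideograph is its own token.
TOKEN_RE = re.compile(r"[a-z0-9]+|[\u4e00-\u9fff]")


def standard_tokens(text: str) -> list:
    """Lowercase, drop punctuation and stop words, split Chinese per character."""
    tokens = TOKEN_RE.findall(text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


print(standard_tokens("Hello world"))     # ['hello', 'world']
print(standard_tokens("The end, 你好!"))  # ['end', '你', '好']
```

The first call reproduces the `[hello, world]` result shown above. The second call only shows what this sketch does with mixed text and is not verified against a cluster.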

Ngram analyzer

The ngram analyzer segments text based on the value of the FULLTEXT_NGRAM_TOKEN_SIZE parameter and allows you to use the entities and stop words contained in a custom full-text dictionary. The ngram analyzer can improve the efficiency of fuzzy retrieval.

Configuration parameters

For more information about how to query and modify analyzer configurations, see the "Query and modify analyzer configuration parameters" section of this topic.

Parameter	Description
FULLTEXT_NGRAM_TOKEN_SIZE	The length of a segment. The value must be an integer from 1 to 8. The default value is 2.

Segmentation effects

The default configuration provides the following segmentation effects:

Query the segmentation effects for English text.

/*+ mode=two_phase*/ SELECT fulltext_ngram_test('Hello world');

The following information is returned:

[he, el, ll, lo, o ,  w, wo, or, rl, ld]

Note

When you use a SELECT statement to query the segmentation effects, you must add /*+ mode=two_phase*/ to the beginning of the statement.
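
To make the ngram result above easier to follow, the following Python sketch slides a fixed-size window over the text. It is an approximation under two assumptions: that the analyzer lowercases the input (as the returned segments suggest) and that whitespace is kept as an ordinary character. The function name `ngram_tokens` is hypothetical.

```python
def ngram_tokens(text: str, token_size: int = 2) -> list:
    """Slide a window of token_size characters over the lowercased text."""
    if not 1 <= token_size <= 8:
        # Range documented for FULLTEXT_NGRAM_TOKEN_SIZE.
        raise ValueError("token_size must be an integer from 1 to 8")
    lowered = text.lower()
    return [lowered[i:i + token_size] for i in range(len(lowered) - token_size + 1)]


print(ngram_tokens("Hello world"))
# ['he', 'el', 'll', 'lo', 'o ', ' w', 'wo', 'or', 'rl', 'ld']
```

Because every substring of the configured length becomes a segment, a fuzzy search term can be matched by looking up its own segments instead of scanning the full text.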

Edge_ngram analyzer

The edge_ngram analyzer uses the same segmentation method as the ngram analyzer. This analyzer is ideal for prefix-based segmentation and word retrieval scenarios and allows you to use the entities and stop words contained in a custom full-text dictionary.

Configuration parameters

For more information about how to query and modify analyzer configurations, see the "Query and modify analyzer configuration parameters" section of this topic.

Parameter	Description
FULLTEXT_MIN_GRAM_SIZE	The minimum length of the prefix segment. The value must be an integer from 1 to 8. The default value is 1.
FULLTEXT_MAX_GRAM_SIZE	The maximum length of the prefix segment. The maximum length must be greater than the minimum length. The value must be an integer from 1 to 16. The default value is 2.

Segmentation effects

The default configuration provides the following segmentation effects:

Query the segmentation effects for English text.
```
/*+ mode=two_phase*/ SELECT fulltext_edge_ngram_test('Hello world');
```
The following information is returned:
```
[h, he]
```

Note

When you use a SELECT statement to query the segmentation effects, you must add /*+ mode=two_phase*/ to the beginning of the statement.
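
The `[h, he]` result can be reproduced with a prefix-only variant of the ngram window. The sketch below assumes that prefixes are taken from the start of the whole text rather than from each word, which is what the documented output suggests but is not stated explicitly. The function name `edge_ngram_tokens` is hypothetical.

```python
def edge_ngram_tokens(text: str, min_gram: int = 1, max_gram: int = 2) -> list:
    """Return the prefixes of the lowercased text with lengths min_gram..max_gram."""
    if max_gram <= min_gram:
        # This topic states that the maximum length must be greater than the minimum.
        raise ValueError("max_gram must be greater than min_gram")
    lowered = text.lower()
    longest = min(max_gram, len(lowered))
    return [lowered[:n] for n in range(min_gram, longest + 1)]


print(edge_ngram_tokens("Hello world"))  # ['h', 'he']
```

Raising the maximum length yields longer prefixes, which is why this analyzer suits prefix-based retrieval.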

Pattern analyzer

The pattern analyzer segments text based on regular expressions. This analyzer does not allow you to use the entities and stop words contained in a custom full-text dictionary, and you cannot execute SQL statements to query its segmentation effects.

Syntax

FULLTEXT INDEX fidx_name(`column_name`) WITH ANALYZER pattern_tokenizer("Custom_rule") [ WITH DICT `tbl_dict_name` ];

Parameters

Custom_rule: the regular expression.

Configuration parameters

For more information about how to query and modify analyzer configurations, see the "Query and modify analyzer configuration parameters" section of this topic.

Parameter	Description
FULLTEXT_TOKENIZER_CASE_SENSITIVE	Specifies whether the text segmentation is case-sensitive. Default value: false. Valid values: true: The text segmentation is case-sensitive. false: The text segmentation is case-insensitive.
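
Because the segmentation effects of the pattern analyzer cannot be queried with SQL, a local sketch can help when drafting a rule. The Python example below assumes that the regular expression matches the separators between segments; this topic does not say whether the rule matches separators or the segments themselves, and the engine's regular expression dialect may differ from Python's. The function name `pattern_tokens` is hypothetical.

```python
import re


def pattern_tokens(text: str, custom_rule: str, case_sensitive: bool = False) -> list:
    """Split text wherever custom_rule matches and drop empty pieces."""
    pieces = [piece for piece in re.split(custom_rule, text) if piece]
    # Mirrors FULLTEXT_TOKENIZER_CASE_SENSITIVE: false (the default) lowercases segments.
    return pieces if case_sensitive else [piece.lower() for piece in pieces]


print(pattern_tokens("Error,Disk;FULL", r"[,;]"))
# ['error', 'disk', 'full']
print(pattern_tokens("Error,Disk;FULL", r"[,;]", case_sensitive=True))
# ['Error', 'Disk', 'FULL']
```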

Query and modify analyzer configuration parameters

AnalyticDB for MySQL allows you to query and modify analyzer configuration parameters.

Query configuration parameters

Execute the SHOW adb_config statement to query configuration parameters.
Syntax
```
show adb_config key <analyzer_param>;
```
Parameters
analyzer_param: the name of the configuration parameter.
Examples
```
show adb_config key FULLTEXT_NGRAM_TOKEN_SIZE;
```
Note
The SHOW adb_config statement can be used to query both the default and modified configuration parameters.
Execute the SELECT statement to query configuration parameters.
Syntax
```
SELECT `key`, `value`, `update_time`
FROM INFORMATION_SCHEMA.kepler_meta_configs
WHERE `key` = '<analyzer_param>';
```
Parameters
analyzer_param: the name of the configuration parameter.
Examples
```
SELECT `key`, `value`, `update_time`
FROM INFORMATION_SCHEMA.kepler_meta_configs
WHERE `key` = 'FULLTEXT_NGRAM_TOKEN_SIZE';
```
Note
The SELECT statement can be used to query only the modified configuration parameters. If you use this statement to query the default configuration parameters, null is returned.

Modify configuration parameters

Syntax

set adb_config <analyzer_param>=<value>;

Parameters

analyzer_param: the name of the configuration parameter.
value: the value of the configuration parameter.

Examples

set adb_config FULLTEXT_NGRAM_TOKEN_SIZE=3;
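
Before you run a set adb_config statement, it can be useful to check the new value against the ranges listed in the parameter tables of this topic. The following Python sketch does this for the integer parameters only. The helper `build_set_statement` is hypothetical; it only builds the statement text and does not connect to a cluster.

```python
# Integer ranges copied from the parameter tables in this topic.
INT_PARAMS = {
    "FULLTEXT_SPLIT_GRANULARITY": (2, 8),
    "CSTORE_IK_SEGMENTER_LETTER_MIN_LENGTH": (2, 16),
    "CSTORE_IK_SEGMENTER_LETTER_MAX_LENGTH": (2, 256),
    "FULLTEXT_MAX_TOKEN_LENGTH": (1, 1048576),
    "FULLTEXT_NGRAM_TOKEN_SIZE": (1, 8),
    "FULLTEXT_MIN_GRAM_SIZE": (1, 8),
    "FULLTEXT_MAX_GRAM_SIZE": (1, 16),
}


def build_set_statement(param: str, value: int) -> str:
    """Validate value against the documented range and return the SQL text."""
    if param not in INT_PARAMS:
        raise KeyError(f"unknown or non-integer parameter: {param}")
    low, high = INT_PARAMS[param]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, int) or not low <= value <= high:
        raise ValueError(f"{param} must be an integer from {low} to {high}")
    return f"set adb_config {param}={value};"


print(build_set_statement("FULLTEXT_NGRAM_TOKEN_SIZE", 3))
# set adb_config FULLTEXT_NGRAM_TOKEN_SIZE=3;
```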