Use the zhparser extension to perform Chinese word segmentation - AnalyticDB

This topic describes how to use the zhparser extension to perform Chinese word segmentation during a full-text search in AnalyticDB for PostgreSQL.

Full-text search overview

By default, PostgreSQL performs word segmentation based on spaces and punctuation marks. This approach does not work for Chinese text, in which words are not separated by spaces, so PostgreSQL does not support Chinese word segmentation out of the box. AnalyticDB for PostgreSQL can be integrated with the zhparser extension to support Chinese word segmentation.

In most cases, you can use one of the following methods to perform a full-text search:

Query data in a table:

SELECT name FROM <table...>
WHERE to_tsvector('english', name) @@ to_tsquery('english', 'friend');

Create a Generalized Inverted Index (GIN) index:

CREATE INDEX <idx_...> ON <table...> USING gin(to_tsvector('english', name));
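
The two methods are usually combined. The following is a minimal sketch that assumes a hypothetical table named articles and a hypothetical index named idx_articles_name; neither name comes from this topic. For the planner to consider the index, the to_tsvector expression in the WHERE clause must match the indexed expression, including the configuration name.

```sql
-- Hypothetical table and data, used only for illustration.
CREATE TABLE articles (id int, name text);
INSERT INTO articles VALUES (1, 'A friend in need'), (2, 'Unrelated title');

-- Expression index on the same to_tsvector call that the query uses.
CREATE INDEX idx_articles_name ON articles
USING gin(to_tsvector('english', name));

-- The WHERE clause repeats the indexed expression so that the index can be used.
SELECT name FROM articles
WHERE to_tsvector('english', name) @@ to_tsquery('english', 'friend');
```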

Configure the zhparser extension

Install the zhparser extension.
Before you use the zhparser extension to perform Chinese word segmentation during a full-text search in an AnalyticDB for PostgreSQL instance, you must install the zhparser extension on the Extensions page of the instance. For more information, see Install, update, and uninstall extensions.
Execute the following statement to create a text search configuration named zh_cn that uses zhparser as its parser:
```
CREATE TEXT SEARCH CONFIGURATION zh_cn (PARSER = zhparser);
```
After the configuration is complete, you can run the \dF command in psql to list text search configurations or the \dFp command to list text search parsers.
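
If you are not connected through psql, the same information can be read from the system catalogs. The following sketch joins the standard PostgreSQL catalogs pg_ts_config and pg_ts_parser; it assumes that these catalogs are readable by your account on the instance.

```sql
-- Show the parser that the zh_cn configuration is bound to.
SELECT c.cfgname, p.prsname
FROM pg_ts_config c
JOIN pg_ts_parser p ON p.oid = c.cfgparser
WHERE c.cfgname = 'zh_cn';
```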

Query the token types that are used for word segmentation.

Execute the following statement to query the token types that zhparser supports:

SELECT ts_token_type('zhparser');

The following result is returned:

          ts_token_type
---------------------------------
 (97,a,"adjective")
 (98,b,"differentiation")
 (99,c,"conjunction")
 (100,d,"adverb")
 (101,e,"exclamation")
 (102,f,"position")
 (103,g,"root")
 (104,h,"head")
 (105,i,"idiom")
 (106,j,"abbreviation")
 (107,k,"tail")
 (108,l,"tmp")
 (109,m,"numeral")
 (110,n,"noun")
 (111,o,"onomatopoeia")
 (112,p,"prepositional")
 (113,q,"quantity")
 (114,r,"pronoun")
 (115,s,"space")
 (116,t,"time")
 (117,u,"auxiliary")
 (118,v,"verb")
 (119,w,"punctuation")
 (120,x,"unknown")
 (121,y,"modal")
 (122,z,"status")
(26 rows)

Execute the following statement to query the token type mappings of the zh_cn configuration:

SELECT * FROM pg_ts_config_map 
WHERE mapcfg=(SELECT oid FROM pg_ts_config WHERE cfgname='zh_cn');
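
The pg_ts_config_map table stores token types and dictionaries as numeric IDs. As a possible convenience, the following sketch resolves those IDs to names by joining pg_ts_dict and the output of ts_token_type. It uses only standard PostgreSQL catalogs and functions, but whether this join runs unchanged on your instance version is an assumption.

```sql
-- List each mapped token type of zh_cn with its alias, description, and dictionary.
SELECT t.alias, t.description, d.dictname
FROM pg_ts_config_map m
JOIN pg_ts_config c ON c.oid = m.mapcfg
JOIN pg_ts_dict d ON d.oid = m.mapdict
JOIN ts_token_type('zhparser') AS t ON t.tokid = m.maptokentype
WHERE c.cfgname = 'zh_cn'
ORDER BY t.tokid, m.mapseqno;
```

An empty result means that no token types are mapped yet, in which case to_tsvector returns no lexemes for zh_cn.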

Add or remove token types.
- Add token types.
  Execute the following statement to add nouns, verbs, adjectives, idioms, exclamations, and temporary idioms as token types that are used for word segmentation:
```
ALTER TEXT SEARCH CONFIGURATION zh_cn ADD MAPPING FOR n,v,a,i,e,l WITH simple;
```
- Remove token types.
  Execute the following statement to remove nouns, verbs, adjectives, idioms, exclamations, and temporary idioms from token types that are used for word segmentation:
```
ALTER TEXT SEARCH CONFIGURATION zh_cn DROP MAPPING IF EXISTS FOR n,v,a,i,e,l;
```
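
To decide which token types to map, it can help to see how a sample sentence is tokenized. The following sketch uses the standard PostgreSQL function ts_debug; it assumes that ts_debug works with the zhparser parser on your instance in the same way as with the built-in parser.

```sql
-- Show each token, its token type, and the lexemes produced by the mapped dictionaries.
-- Tokens whose type is not mapped in zh_cn have no dictionaries and produce no lexemes.
SELECT alias, description, token, dictionaries, lexemes
FROM ts_debug('zh_cn', '有两种方法进行全文检索');
```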

Use the following functions to test the Chinese word segmentation feature during a full-text search:

to_tsvector:

SELECT to_tsvector('zh_cn', '有两种方法进行全文检索');

The following result is returned:

 to_tsvector
---------------------------------------
'全文检索':4 '方法':2 '有':1 '进行':3
(1 row)

to_tsquery:

SELECT to_tsquery('zh_cn', '有两种方法进行全文检索');

The following result is returned:

 to_tsquery
-------------------------------------
 '有' & '方法' & '进行' & '全文检索'
(1 row)
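
The two functions are typically combined with the @@ operator to search a table. The following is a minimal sketch that assumes a hypothetical table named docs_zh and a hypothetical index named idx_docs_zh_content, and that the n, v, a, i, e, and l mappings have been added to zh_cn as described above.

```sql
-- Hypothetical table and data, used only for illustration.
CREATE TABLE docs_zh (id int, content text);
INSERT INTO docs_zh VALUES (1, '有两种方法进行全文检索');

-- GIN index on the zh_cn tsvector of the content column.
CREATE INDEX idx_docs_zh_content ON docs_zh
USING gin(to_tsvector('zh_cn', content));

-- Search for rows whose content contains the word 全文检索.
SELECT id, content FROM docs_zh
WHERE to_tsvector('zh_cn', content) @@ to_tsquery('zh_cn', '全文检索');
```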

Custom dictionaries

The zhparser extension supports custom dictionaries in AnalyticDB for PostgreSQL. You can add data to or remove data from the custom dictionary table named zhparser.zhprs_custom_word to add or remove custom words. The zhparser.zhprs_custom_word table has the following data structure.

Note

You do not need to manually create a dictionary table. When you install the zhparser extension, the system automatically creates a custom dictionary table named zhparser.zhprs_custom_word.

CREATE TABLE zhparser.zhprs_custom_word
(
    word text PRIMARY KEY,                                   -- Custom word
    tf FLOAT DEFAULT '1.0',                                  -- The term frequency (TF) of the word. Default value: 1.0.
    idf FLOAT DEFAULT '1.0',                                 -- The inverse document frequency (IDF) of the word. Default value: 1.0.
    attr CHAR DEFAULT '@', CHECK(attr = '@' OR attr = '!')   -- The type of the word. Valid values: @ (new word) and ! (stop word).
);

Add custom dictionary configurations

To add the custom segmentation configuration to zh_cn, execute the following statement, which maps the x (unknown) token type to the simple dictionary:

ALTER TEXT SEARCH CONFIGURATION zh_cn ADD MAPPING FOR x WITH simple;

Add a word to the custom dictionary table

INSERT INTO zhparser.zhprs_custom_word(word, attr) VALUES('两种方法', '@');

Remove a word from the custom dictionary table

DELETE FROM zhparser.zhprs_custom_word WHERE word='两种方法';

Query the custom dictionary table

SELECT * FROM zhparser.zhprs_custom_word;

Load the custom dictionary table

After you add words to or remove words from the zhparser.zhprs_custom_word table, you must reload the table to allow the modifications to take effect. Execute the following statement to reload the zhparser.zhprs_custom_word table:

SELECT sync_zhprs_custom_word();
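
The attr column also accepts ! to mark a stop word. The following sketch registers a stop word, reloads the dictionary, and then removes the entry again. The choice of 进行 as the stop word is only an illustration, and the resulting segmentation output is not shown because it depends on your instance.

```sql
-- Mark 进行 as a stop word (attr = '!'); the word is chosen only for illustration.
INSERT INTO zhparser.zhprs_custom_word(word, attr) VALUES('进行', '!');

-- Reload the custom dictionary so that the change takes effect.
SELECT sync_zhprs_custom_word();

-- Inspect how the sample sentence is segmented now.
SELECT to_tsvector('zh_cn', '有两种方法进行全文检索');

-- Undo the change and reload the dictionary again.
DELETE FROM zhparser.zhprs_custom_word WHERE word='进行';
SELECT sync_zhprs_custom_word();
```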

Check the Chinese word segmentation results

Execute the following statement before and after you configure the zhparser.zhprs_custom_word table, and compare the two results to check how the custom dictionary changes the word segmentation:

SELECT to_tsvector('zh_cn', '有两种方法进行全文检索');

The following word segmentation results are returned:

Before configuring the zhparser.zhprs_custom_word table

              to_tsvector
---------------------------------------
 '全文检索':4 '方法':2 '有':1 '进行':3
(1 row)

After configuring the zhparser.zhprs_custom_word table

                to_tsvector
-------------------------------------------
 '两种方法':2 '全文检索':4 '有':1 '进行':3
(1 row)

References

For information about full-text search, see Full Text Search.
For information about the functions and operators that can be used for full-text search, see Text Search Functions and Operators.