AnalyticDB:pg_jieba

Last Updated: May 31, 2024

AnalyticDB for PostgreSQL allows you to use the pg_jieba extension to perform Chinese word segmentation and implement efficient Chinese full-text search.

Introduction

Jieba is a commonly used tool for Chinese word segmentation. The pg_jieba extension introduces the Chinese word segmentation capability of Jieba into PostgreSQL databases to help you implement efficient Chinese full-text search. AnalyticDB for PostgreSQL allows you to use the pg_jieba extension for distributed queries.

Prerequisites

Before you use the pg_jieba extension, make sure that the following requirements are met:

  • The AnalyticDB for PostgreSQL instance that you want to manage is in elastic storage mode.

  • The instance runs a supported minor version: 6.6.2.1 or later for AnalyticDB for PostgreSQL V6.0, or 7.0.5 or later for AnalyticDB for PostgreSQL V7.0.

    Note

    For information about how to view the minor version of an AnalyticDB for PostgreSQL instance, see View the minor engine version.

Install the pg_jieba extension

  1. Before you use Jieba, install the pg_jieba extension on the Extensions page of the AnalyticDB for PostgreSQL instance. For more information, see Install, update, and uninstall extensions.

  2. Switch to the public schema of the specified database and execute the following statement to check whether the pg_jieba extension is installed:

    SELECT * FROM pg_extension WHERE extname = 'pg_jieba';

    If a row similar to the following is returned, the pg_jieba extension is installed. If no row is returned, the pg_jieba extension is not installed for the public schema of the specified database.

    +--------+--------+--------+--------+
    |oid     |extname |extowner|...     |
    +--------+--------+--------+--------+
    |17194   |pg_jieba|10      |...     |
    +--------+--------+--------+--------+
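The check above returns every column of pg_extension. As a minimal sketch, the query can be narrowed to the columns that matter for troubleshooting. It relies only on the standard PostgreSQL catalogs pg_extension and pg_namespace; the values returned depend on your instance and are not shown here.

-- Show the installed version of pg_jieba and the schema that holds its objects.
SELECT e.extname,
       e.extversion,
       n.nspname AS extension_schema
FROM pg_extension e
JOIN pg_namespace n ON n.oid = e.extnamespace
WHERE e.extname = 'pg_jieba';

An empty result has the same meaning as in the check above: the extension is not installed in the current database.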

Chinese word segmentation

After you install the pg_jieba extension, you can use the extension to perform Chinese word segmentation.

Example 1:

SELECT to_tsvector('jiebacfg', '有两种方法进行全文检索');

The following result is returned:

+---------------------------------------+
|               to_tsvector             |  
+---------------------------------------+
|'两种':2 '全文检索':5 '方法':3 '进行':4   |
+---------------------------------------+
(1 row)
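In the result above, positions start at 2, which suggests that the first character (有) was discarded, presumably as a stop word. To inspect how each token is handled, the standard PostgreSQL function ts_debug can be called with the same configuration. This is a sketch only: the token types and lexemes that are reported depend on the installed pg_jieba version, so no output is shown.

-- List each token that the jiebacfg configuration produces and the lexemes it is reduced to.
-- Tokens with empty or NULL lexemes are not indexed.
SELECT alias, token, lexemes
FROM ts_debug('jiebacfg', '有两种方法进行全文检索');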

Example 2:

SELECT to_tsvector('jiebacfg', '有两种方法进行全文检索') @@ to_tsquery('jiebacfg', '全文检索');

The following result is returned. The value t indicates that the text matches the query:

+----------+
| ?column? |  
+----------+
| t        |
+----------+
(1 row)
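The match operator in Example 2 can also be applied to a column. The following is a minimal sketch and not a prescribed schema: the table name articles, its columns, and the index name are hypothetical, and the GIN index assumes that your instance version supports GIN expression indexes.

-- Hypothetical table that stores Chinese text.
CREATE TABLE articles (
    id   int,
    body text
) DISTRIBUTED BY (id);

INSERT INTO articles VALUES (1, '有两种方法进行全文检索');

-- Optional: index the segmented text (assumes GIN indexes are available).
CREATE INDEX articles_body_fts_idx
    ON articles USING gin (to_tsvector('jiebacfg', body));

-- Return the rows whose body contains the word 全文检索.
SELECT id, body
FROM articles
WHERE to_tsvector('jiebacfg', body) @@ to_tsquery('jiebacfg', '全文检索');

The WHERE clause repeats the indexed expression exactly, which is what allows the planner to consider the index.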

Custom dictionaries

The pg_jieba extension supports custom dictionaries in AnalyticDB for PostgreSQL. You can add data to or remove data from the custom dictionary table named jieba.jieba_custom_word to add or remove custom words.

Note
  • You do not need to manually create a dictionary table. When the pg_jieba extension is installed, the system automatically creates a custom dictionary table named jieba.jieba_custom_word.

  • The jieba.jieba_custom_word table has the following data structure:

    CREATE TABLE jieba.jieba_custom_word
    (
    	word    text primary key,     -- Custom word
    	weight  float8 default '1.0', -- Weight
    	type    text   default 'x'    -- Part of speech
    );

Apply for permissions to use the custom dictionary table

Submit a ticket to apply for permissions to use the jieba.jieba_custom_word table. Then, you can add words to the jieba.jieba_custom_word table, remove words from the table, query the table, and use the table to perform Chinese word segmentation.

Add a word to the custom dictionary table

INSERT INTO jieba.jieba_custom_word VALUES ('两种方法');
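The statement above supplies only the word and relies on the column defaults from the table definition. As a sketch, the equivalent statement with every column spelled out looks like this. The weight and type values simply repeat the documented defaults and are not tuning recommendations.

-- Same effect as the single-value INSERT: weight and type are set to their defaults explicitly.
INSERT INTO jieba.jieba_custom_word (word, weight, type)
VALUES ('两种方法', 1.0, 'x');

Because word is the primary key, run either this statement or the one above for a given word, not both.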

Remove a word from the custom dictionary table

DELETE FROM jieba.jieba_custom_word WHERE word='两种方法';

Query the custom dictionary table

SELECT * FROM jieba.jieba_custom_word;

Load the custom dictionary table

After you add words to or remove words from the jieba.jieba_custom_word table, you must reload the table to allow the modifications to take effect. Execute the following statement to reload the jieba.jieba_custom_word table:

SELECT jieba.jieba_load_user_dict();
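Removals need the same reload step as additions. The following sketch runs the statements from this topic in sequence to remove a word and then check the effect. After the reload, the word 两种方法 should no longer be kept as a single token, although the exact output is not reproduced here.

-- 1. Remove the custom word.
DELETE FROM jieba.jieba_custom_word WHERE word = '两种方法';

-- 2. Reload the dictionary so that the removal takes effect.
SELECT jieba.jieba_load_user_dict();

-- 3. Segment the sample sentence again to check the result.
SELECT to_tsvector('jiebacfg', '有两种方法进行全文检索');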

Check the Chinese word segmentation effect

Execute the following sample statement before and after you configure the jieba.jieba_custom_word table to check the Chinese word segmentation effect:

SELECT to_tsvector('jiebacfg', '有两种方法进行全文检索');

The following results are returned:

Before configuring the jieba.jieba_custom_word table:

+---------------------------------------+
|               to_tsvector             |  
+---------------------------------------+
|'两种':2 '全文检索':5 '方法':3 '进行':4   |
+---------------------------------------+
(1 row)

After configuring the jieba.jieba_custom_word table:

+---------------------------------------+
|               to_tsvector             |  
+---------------------------------------+
| '两种方法':2 '全文检索':4 '进行':3       |
+---------------------------------------+
(1 row)

References