All Products
Search
Document Center

PolarDB:pg_jieba

Last Updated:Oct 22, 2024

The open source third-party extension pg_jieba is used for full-text search in Chinese.

Prerequisites

The extension is supported on the PolarDB for PostgreSQL clusters that run the following engines:

  • PostgreSQL 15 (revision version 15.7.2.0 or later)

  • PostgreSQL 14 (revision version 14.5.2.0 or later)

  • PostgreSQL 11 (revision version 1.1.28 or later)

Note

You can execute the following statements to view the revision version of a PolarDB for PostgreSQL cluster:

  • PostgreSQL 15 and PostgreSQL 14

    SELECT version();
  • PostgreSQL 11

    SHOW polar_version;

Overview

The pg_jieba extension provides the following text search modes:

  • jiebacfg: exact mode. This mode precisely splits the text without redundant words.

  • jiebaqry: full mode. This mode obtains all matching words in the text. Redundant words may exist.

  • jiebacfg_pos: exact mode (with subscript position). The subscript information of each word is added to the result of the exact mode, The stopwords that are ignored in exact mode (highly frequent words that are not used in the search) are also displayed.

Usage

Create and delete the extension

  • Create the pg_jieba extension.

    CREATE EXTENSION pg_jieba;
  • Delete the pg_jieba extension.

    DROP EXTENSION pg_jieba;
Note

Only privileged accounts are authorized to execute the preceding statements.

Examples

  • Example 1:

    • Exact mode

      SELECT * FROM to_tsvector('jiebacfg', '小明硕士毕业于中国科学院计算所,后在日本京都大学深造');

      Sample result:

                                         to_tsvector
      ----------------------------------------------------------------------------------
       '中国科学院':5 '小明':1 '日本京都大学':10 '毕业':3 '深造':11 '硕士':2 '计算所':6
      (1 row)
    • Full-mode

      SELECT * FROM to_tsvector('jiebaqry', '小明硕士毕业于中国科学院计算所,后在日本京都大学深造');

      Sample result:

                                                                                 to_tsvector
      -----------------------------------------------------------------------------------------------------------------------------------------------------------------
       '中国':5 '中国科学院':9 '京都':16 '大学':17 '学院':7 '小明':1 '日本':15 '日本京都大学':18 '毕业':3 '深造':19 '硕士':2 '科学':6 '科学院':8 '计算':10 '计算所':11
      (1 row)
    • Exact mode (with subscript position)

      SELECT * FROM to_tsvector('jiebacfg_pos', '小明硕士毕业于中国科学院计算所,后在日本京都大学深造');

      Sample result:

                                                                     to_tsvector
      ------------------------------------------------------------------------------------------------------------------------------------------
       '中国科学院:7':5 '于:6':4 '后:16':8 '在:17':9 '小明:0':1 '日本京都大学:18':10 '毕业:4':3 '深造:24':11 '硕士:2':2 '计算所:12':6 ',:15':7
      (1 row)
  • Example 2:

    • Exact mode

      SELECT * FROM to_tsvector('jiebacfg', '李小福是创新办主任也是云计算方面的专家');

      Sample result:

                                  to_tsvector
      -------------------------------------------------------------------
       '专家':11 '主任':5 '云计算':8 '创新':3 '办':4 '方面':9 '李小福':1
      (1 row)
    • Full-mode

      SELECT * FROM to_tsvector('jiebaqry', '李小福是创新办主任也是云计算方面的专家');

      Sample result:

                                       to_tsvector
      -----------------------------------------------------------------------------
       '专家':12 '主任':5 '云计算':9 '创新':3 '办':4 '方面':10 '李小福':1 '计算':8
      (1 row)
    • Exact mode (with subscript position)

      SELECT * FROM to_tsvector('jiebacfg_pos', '李小福是创新办主任也是云计算方面的专家');

      Sample result:

                                                              to_tsvector
      ---------------------------------------------------------------------------------------------------------------------------
       '专家:17':11 '主任:7':5 '也:9':6 '云计算:11':8 '创新:4':3 '办:6':4 '方面:14':9 '是:10':7 '是:3':2 '李小福:0':1 '的:16':10
      (1 row)

Extended features

The pg_jieba extension allows you to configure multiple custom dictionaries and switch between the dictionaries.

Note

To use custom dictionaries, you need to add pg_jieba to the shared_preload_libraries parameter. You can configure shared_preload_libraries in the console. For more information, see Specify cluster parameters. The cluster is restarted after you configure this parameter. Procedd with caution.

  1. Insert data into the first custom dictionary. By default, data is inserted in to the first custom dictionary. The first custom dictionary is represented by 0. The weight value of the first custom dictionary is 10.

    INSERT INTO jieba_user_dict VALUES ('阿里云');
    INSERT INTO jieba_user_dict VALUES ('研发工程师',0,10);
  2. Use the dictionary predefined in the pg_jieba extension to segment Chinese text.

    SELECT * FROM to_tsvector('jiebacfg', 'zth是阿里云的一个研发工程师');

    Sample result:

                         to_tsvector
    ------------------------------------------------------
     'zth':1 '一个':6 '云':4 '工程师':8 '研发':7 '阿里':3
    (1 row)
                        
  3. Switch to custom dictionary 0.

    SELECT jieba_load_user_dict(0);

    Sample result:

     jieba_load_user_dict
    ----------------------
    
    (1 row)
  4. Use custom dictionary 0 to split words.

    SELECT * FROM to_tsvector('jiebacfg', 'zth是阿里云的一个研发工程师');

    Sample result:

                    to_tsvector
    --------------------------------------------
     'zth':1 '一个':5 '研发工程师':6 '阿里云':3
    (1 row)