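Configure synonyms

Alibaba Cloud Elasticsearch (ES) lets you configure synonyms. You can upload a custom synonym dictionary file to update the synonym dictionary of an Elasticsearch instance. After the dictionary is updated, newly created indexes are searched using the updated dictionary.

Considerations

After you upload a synonym dictionary file, you do not need to restart the nodes of your Alibaba Cloud Elasticsearch instance. The file is deployed to the nodes in the background. The time it takes for the updated dictionary to take effect depends on the number of nodes.
For example, suppose the index index-aliyun uses the synonym dictionary file aliyun.txt, and you upload a new synonym dictionary file that overwrites the existing one. The index index-aliyun does not load the new file automatically. We therefore recommend that you recreate the index after you update the synonym dictionary. Otherwise, only new indexes use the updated dictionary.
The synonym dictionary file must be a UTF-8 encoded .txt file. Each line contains exactly one synonym expression. Example: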
```
ipod, i-pod, i pod => ipod, i-pod, i pod
foo => foo bar
```
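
As a pre-upload check, the following sketch tests the two rules stated above: the content must decode as UTF-8, and every non-comment line must hold one well-formed synonym expression. The helper name check_synonym_file and the individual checks are assumptions made for illustration. They are deliberately strict and are not the validation that Alibaba Cloud Elasticsearch itself performs.

```python
def check_synonym_file(raw: bytes) -> list[str]:
    """Return a list of problems found in a synonym dictionary (hypothetical helper)."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return [f"not valid UTF-8: {exc.reason}"]

    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        rule = line.strip()
        if not rule or rule.startswith("#"):
            continue  # blank lines and comments carry no rule
        sides = rule.split("=>")
        if len(sides) > 2:
            problems.append(f"line {number}: more than one '=>'")
            continue
        for side in sides:
            terms = [term.strip() for term in side.split(",")]
            if any(not term for term in terms):
                problems.append(f"line {number}: empty term")
                break
    return problems


sample = "ipod, i-pod, i pod => ipod, i-pod, i pod\nfoo => foo bar\n".encode("utf-8")
print(check_synonym_file(sample))          # the two example rules pass
print(check_synonym_file(b"foo,, bar\n"))  # the doubled comma is reported as an empty term
```

Running the check locally before you upload avoids waiting for a background deployment only to find that a rule was malformed.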

You can configure synonyms by using a filter. Sample code:

PUT /test_index
{
    "settings": {
        "index" : {
            "analysis" : {
                "analyzer" : {
                    "synonym" : {
                        "tokenizer" : "whitespace",
                        "filter" : ["synonym"]
                    }
                },
                "filter" : {
                    "synonym" : {
                        "type" : "synonym",
                        "synonyms_path" : "analysis/synonym.txt",
                        "tokenizer" : "whitespace"
                    }
                }
            }
        }
    }
}

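filter: configures a synonym filter that reads the file analysis/synonym.txt. The path is relative to the config directory.
tokenizer: the tokenizer that is used to tokenize the synonyms. The default value is whitespace. Additional settings:
- ignore_case: The default value is false.
- expand: The default value is true.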
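
To show where the two additional settings go, the following sketch builds the same request body as the sample above with expand and ignore_case written out explicitly. The helper name synonym_index_body is hypothetical. The setting names and defaults are the ones listed above.

```python
import json


def synonym_index_body(synonyms_path: str, expand: bool = True, ignore_case: bool = False) -> dict:
    """Build a create-index body with one synonym filter (hypothetical helper)."""
    return {
        "settings": {
            "index": {
                "analysis": {
                    "analyzer": {
                        "synonym": {"tokenizer": "whitespace", "filter": ["synonym"]}
                    },
                    "filter": {
                        "synonym": {
                            "type": "synonym",
                            "synonyms_path": synonyms_path,
                            "expand": expand,            # default listed above: true
                            "ignore_case": ignore_case,  # default listed above: false
                        }
                    },
                }
            }
        }
    }


body = synonym_index_body("analysis/synonym.txt", ignore_case=True)
print(json.dumps(body, indent=2))  # usable as the body of PUT /test_index
```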

Two synonym formats are supported: Solr and WordNet.

Solr synonyms

The following example shows the format of a Solr synonym file.

# Blank lines and lines starting with pound are comments.
# Explicit mappings match any token sequence on the LHS of "=>"
# and replace with all alternatives on the RHS.  These types of mappings
# ignore the expand parameter in the schema.
# Examples:
i-pod, i pod => ipod
sea biscuit, sea biscit => seabiscuit
# Equivalent synonyms may be separated with commas and give
# no explicit mapping.  In this case the mapping behavior will
# be taken from the expand parameter in the schema.  This allows
# the same synonym file to be used in different synonym handling strategies.
# Examples:
ipod, i-pod, i pod
foozball , foosball
universe , cosmos
lol, laughing out loud
# If expand==true, "ipod, i-pod, i pod" is equivalent
# to the explicit mapping:
ipod, i-pod, i pod => ipod, i-pod, i pod
# If expand==false, "ipod, i-pod, i pod" is equivalent
# to the explicit mapping:
ipod, i-pod, i pod => ipod
# Multiple synonym mapping entries are merged.
foo => foo bar
foo => baz
# is equivalent to
foo => foo bar, baz
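
The comments in the file above describe three behaviors: explicit mappings, the effect of expand on comma-separated groups, and the merging of repeated entries. The following sketch models those behaviors on plain strings. The function parse_solr_rules is a simplified illustration, not the parser that Elasticsearch uses. It ignores tokenization and multi-word matching.

```python
def parse_solr_rules(lines, expand=True):
    """Map each input term to its replacement terms (simplified model of the rules above)."""
    mapping = {}
    for line in lines:
        rule = line.strip()
        if not rule or rule.startswith("#"):
            continue  # comments and blank lines
        if "=>" in rule:
            # Explicit mapping: the expand parameter is ignored.
            left, right = rule.split("=>", 1)
            inputs = [t.strip() for t in left.split(",") if t.strip()]
            outputs = [t.strip() for t in right.split(",") if t.strip()]
        else:
            inputs = [t.strip() for t in rule.split(",") if t.strip()]
            # expand=True keeps every alternative; expand=False keeps only the first.
            outputs = list(inputs) if expand else inputs[:1]
        for term in inputs:
            merged = mapping.setdefault(term, [])
            for output in outputs:
                if output not in merged:  # repeated entries are merged
                    merged.append(output)
    return mapping


rules = ["ipod, i-pod, i pod", "foo => foo bar", "foo => baz"]
print(parse_solr_rules(rules, expand=True))
print(parse_solr_rules(rules, expand=False))
```

With expand=True every term in the group maps to all three alternatives. With expand=False every term maps to ipod only. In both cases foo maps to foo bar and baz.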

You can also define the synonyms of a filter directly in the index settings. In that case, use synonyms instead of synonyms_path. Sample code:

PUT /test_index
{
    "settings": {
        "index" : {
            "analysis" : {
                "filter" : {
                    "synonym" : {
                        "type" : "synonym",
                        "synonyms" : [
                            "i-pod, i pod => ipod",
                            "begin, start"
                        ]
                    }
                }
            }
        }
    }
}

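Note: For large synonym sets, we recommend that you define them in a file and use synonyms_path. Defining a large synonym set inline with synonyms increases the size of the cluster unnecessarily.

WordNet synonyms

The following example defines synonyms in the WordNet format.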

PUT /test_index
{
    "settings": {
        "index" : {
            "analysis" : {
                "filter" : {
                    "synonym" : {
                        "type" : "synonym",
                        "format" : "wordnet",
                        "synonyms" : [
                            "s(100000001,1,'abstain',v,1,0).",
                            "s(100000001,2,'refrain',v,1,0).",
                            "s(100000001,3,'desist',v,1,0)."
                        ]
                    }
                }
            }
        }
    }
}

This example uses synonyms to define WordNet synonyms. You can also use synonyms_path to define WordNet synonyms.
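
Each WordNet entry is a Prolog fact of the form s(synset_id, word_number, 'word', type, sense_number, tag_count), and words that share a synset ID are synonyms of each other. The following sketch groups the three entries from the request above by synset ID. The function name group_wordnet_synsets is hypothetical, and the pattern does not handle words that contain apostrophes.

```python
import re
from collections import defaultdict

# s(synset_id, word_number, 'word', type, sense_number, tag_count).
S_LINE = re.compile(r"^s\((\d+),(\d+),'([^']*)',([a-z]),(\d+),(\d+)\)\.$")


def group_wordnet_synsets(lines):
    """Group words by synset ID; words in one synset are treated as synonyms."""
    synsets = defaultdict(list)
    for line in lines:
        match = S_LINE.match(line.strip())
        if match is None:
            raise ValueError(f"not a WordNet s() line: {line!r}")
        synset_id, word = match.group(1), match.group(3)
        synsets[synset_id].append(word)
    return dict(synsets)


entries = [
    "s(100000001,1,'abstain',v,1,0).",
    "s(100000001,2,'refrain',v,1,0).",
    "s(100000001,3,'desist',v,1,0).",
]
print(group_wordnet_synsets(entries))
```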

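Procedure

1. Log on to the Alibaba Cloud Elasticsearch console, upload the synonym file, and save it. Make sure that the uploaded file has taken effect.
2. When you create an index and configure its settings, specify "synonyms_path": "analysis/your_dict_name.txt" and add a mapping to the index. Then configure synonyms for the specified fields.
3. Verify the synonyms, then add test data and run a search test.

Example 1

The following example shows how to configure synonyms by using a filter.

On the [ES Cluster Configuration] page, click [Synonym Configuration] to the right of [Word Segmentation Configuration].
On the [Synonym Configuration] page, click [Upload] and select a synonym dictionary file. Then click [Save]. In this example, a TXT file created as described in the "Configure synonyms" section is uploaded.
After the Elasticsearch instance has applied the change and its status changes to [Active], you can use the synonym dictionary.
In this example, the file aliyun_synonyms.txt is uploaded for testing. The file contains the line begin, start.

Configure and test the synonym dictionary.

Log on to the Kibana console.

On the [Console] tab, send the following request to create an index.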

PUT aliyun-index-test
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "by_smart": {
            "type": "custom",
            "tokenizer": "ik_smart",
            "filter": ["by_tfr", "by_sfr"],
            "char_filter": ["by_cfr"]
          },
          "by_max_word": {
            "type": "custom",
            "tokenizer": "ik_max_word",
            "filter": ["by_tfr", "by_sfr"],
            "char_filter": ["by_cfr"]
          }
        },
        "filter": {
          "by_tfr": {
            "type": "stop",
            "stopwords": [" "]
          },
          "by_sfr": {
            "type": "synonym",
            "synonyms_path": "analysis/aliyun_synonyms.txt"
          }
        },
        "char_filter": {
          "by_cfr": {
            "type": "mapping",
            "mappings": ["| => |"]
          }
        }
      }
    }
  }
}

Send the following request to configure the synonym field title.

PUT /aliyun-index-test/_mapping/doc
{
"properties": {
 "title": {
   "type": "text",
   "analyzer": "by_max_word",
   "search_analyzer": "by_smart"
 }
}
}

Send the following request to verify the synonyms.

GET /aliyun-index-test/_analyze
{
"analyzer": "by_smart",
"text":"begin"
}

If the request succeeds, the following result is returned.

{
"tokens": [
 {
   "token": "begin",
   "start_offset": 0,
   "end_offset": 5,
   "type": "ENGLISH",
   "position": 0
 },
 {
   "token": "start",
   "start_offset": 0,
   "end_offset": 5,
   "type": "SYNONYM",
   "position": 0
 }
]
}
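
In this response, the synonym is applied when start appears at the same position as begin with the type SYNONYM. The following sketch groups the tokens of an _analyze response by position so that this is easy to check. The helper name tokens_by_position is hypothetical, and the response is copied from the result above.

```python
from collections import defaultdict


def tokens_by_position(analyze_response):
    """Group the tokens of an _analyze response by their position."""
    grouped = defaultdict(list)
    for token in analyze_response["tokens"]:
        grouped[token["position"]].append((token["token"], token["type"]))
    return dict(grouped)


response = {
    "tokens": [
        {"token": "begin", "start_offset": 0, "end_offset": 5, "type": "ENGLISH", "position": 0},
        {"token": "start", "start_offset": 0, "end_offset": 5, "type": "SYNONYM", "position": 0},
    ]
}
print(tokens_by_position(response))
```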

Send the following requests to add data for further testing.

PUT /aliyun-index-test/doc/1
{
"title": "Shall I begin?"
}

PUT /aliyun-index-test/doc/2
{
"title": "I start work at nine."
}

Send the following request to run a search test.

GET /aliyun-index-test/_search
{
 "query" : { "match" : { "title" : "begin" }},
 "highlight" : {
     "pre_tags" : ["<red>", "<blue>"],
     "post_tags" : ["</red>", "</blue>"],
     "fields" : {
         "title" : {}
     }
 }
}

If the request succeeds, the following result is returned.

{
"took": 11,
"timed_out": false,
"_shards": {
 "total": 5,
 "successful": 5,
 "failed": 0
},
"hits": {
 "total": 2,
 "max_score": 0.41048482,
 "hits": [
   {
     "_index": "aliyun-index-test",
     "_type": "doc",
     "_id": "2",
     "_score": 0.41048482,
     "_source": {
       "title": "I start work at nine."
     },
     "highlight": {
       "title": [
         "I <red>start</red> work at nine."
       ]
     }
   },
   {
     "_index": "aliyun-index-test",
     "_type": "doc",
     "_id": "1",
     "_score": 0.39556286,
     "_source": {
       "title": "Shall I begin?"
     },
     "highlight": {
       "title": [
         "Shall I <red>begin</red>?"
       ]
     }
   }
 ]
}
}

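Example 2

The following example shows how to define the synonyms directly in the filter and use an IK tokenizer.

Log on to the Kibana console. On the [Console] tab, send the following request.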
```
PUT /my_index
{
 "settings": {
     "analysis": {
         "analyzer": {
             "my_synonyms": {
                 "filter": [
                     "lowercase",
                     "my_synonym_filter"
                 ],
                 "tokenizer": "ik_smart"
             }
         },
         "filter": {
             "my_synonym_filter": {
                 "synonyms": [
                     "begin,start"
                 ],
                 "type": "synonym"
             }
         }
     }
 }
}
```
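This request does the following:
1. Configures the synonym filter my_synonym_filter with an inline synonym dictionary.
2. Creates the my_synonyms analyzer, which uses the IK tokenizer ik_smart to tokenize text.
3. After ik_smart tokenizes the text, the lowercase filter converts all tokens to lowercase, and my_synonym_filter then matches the tokens against the synonym dictionary.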
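
The order of these steps can be sketched in a few lines of Python. This is a toy model only: a regular expression stands in for ik_smart, which is not reproduced here, and the SYNONYMS table is the expanded form of the rule "begin,start". The function name analyze is an assumption made for illustration.

```python
import re

SYNONYMS = {"begin": ["start"], "start": ["begin"]}  # expanded form of "begin,start"


def analyze(text):
    """Return (token, position, type) tuples for a simplified analysis chain."""
    # Stand-in tokenizer: keeps runs of letters and digits (not the real ik_smart).
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    result = []
    for position, token in enumerate(tokens):
        token = token.lower()                    # lowercase filter
        result.append((token, position, "ENGLISH"))
        for synonym in SYNONYMS.get(token, []):  # synonym filter, same position
            result.append((synonym, position, "SYNONYM"))
    return result


print(analyze("Shall I begin?"))
```

Because lowercasing runs before the synonym lookup, the input Begin would match the dictionary entry begin as well.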

Send the following request to configure the synonym field title.

PUT /my_index/_mapping/doc
{
"properties": {
 "title": {
   "type": "text",
   "analyzer": "my_synonyms"
 }
}
}

Send the following request to verify the synonyms.

GET /my_index/_analyze
{
 "analyzer":"my_synonyms",
 "text":"Shall I begin?"
}

If the request succeeds, the following result is returned.

{
"tokens": [
 {
   "token": "shall",
   "start_offset": 0,
   "end_offset": 5,
   "type": "ENGLISH",
   "position": 0
 },
 {
   "token": "i",
   "start_offset": 6,
   "end_offset": 7,
   "type": "ENGLISH",
   "position": 1
 },
 {
   "token": "begin",
   "start_offset": 8,
   "end_offset": 13,
   "type": "ENGLISH",
   "position": 2
 },
 {
   "token": "start",
   "start_offset": 8,
   "end_offset": 13,
   "type": "SYNONYM",
   "position": 2
 }
]
}

Send the following requests to add data for further testing.

PUT /my_index/doc/1
{
"title": "Shall I begin?"
}

PUT /my_index/doc/2
{
"title": "I start work at nine."
}

Send the following request to run a search test.

GET /my_index/_search
{
"query" : { "match" : { "title" : "begin" }},
"highlight" : {
  "pre_tags" : ["<red>", "<blue>"],
  "post_tags" : ["</red>", "</blue>"],
  "fields" : {
      "title" : {}
  }
}
}

If the request succeeds, the following result is returned.

{
"took": 11,
"timed_out": false,
"_shards": {
 "total": 5,
 "successful": 5,
 "failed": 0
},
"hits": {
 "total": 2,
 "max_score": 0.41913947,
 "hits": [
   {
     "_index": "my_index",
     "_type": "doc",
     "_id": "2",
     "_score": 0.41913947,
     "_source": {
       "title": "I start work at nine."
     },
     "highlight": {
       "title": [
         "I <red>start</red> work at nine."
       ]
     }
   },
   {
     "_index": "my_index",
     "_type": "doc",
     "_id": "1",
     "_score": 0.39556286,
     "_source": {
       "title": "Shall I begin?"
     },
     "highlight": {
       "title": [
         "Shall I <red>begin</red>?"
       ]
     }
   }
 ]
}
}

Part of the content in this document is drawn from the official Elasticsearch documentation. For more information, see "Synonym Token Filter" and "Using Synonyms".