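An analyzer parses a long document and splits it into multiple tokens, which are then stored in the index. In most scenarios you can use one of the built-in analyzers that TairSearch provides, and you can also define a custom analyzer as needed. This topic describes how to use TairSearch analyzers.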
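Analyzer workflow
A TairSearch analyzer consists of three parts that run in this order: Character Filter, Tokenizer, and Token Filter. The Character Filter and Token Filter stages can be empty. Each part works as follows:
Character Filter: preprocesses the document. An analyzer can have zero or more Character Filters, which run in the specified order. For example, a Character Filter can replace "(:" with "happy".
Tokenizer: splits the input document into multiple tokens. An analyzer can have only one Tokenizer. For example, the Whitespace Tokenizer splits "I am very happy" into ["I", "am", "very", "happy"].
Token Filter: processes the tokens produced by the Tokenizer. An analyzer can have zero or more Token Filters, which run in the specified order. For example, the Stop Token Filter removes stopwords.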
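The three-stage flow described above can be sketched in a few lines of Python. This is only an illustration of the ordering, not TairSearch code: the function names (analyze, replace_emoticon, whitespace_tokenizer, stop_filter) and the short stopword set are assumptions made for the example.

```python
from typing import Callable, Iterable, List

# Hypothetical stand-ins for the three stages; these are not TairSearch APIs.
CharFilter = Callable[[str], str]
Tokenizer = Callable[[str], List[str]]
TokenFilter = Callable[[List[str]], List[str]]


def analyze(text: str,
            char_filters: Iterable[CharFilter],
            tokenizer: Tokenizer,
            token_filters: Iterable[TokenFilter]) -> List[str]:
    """Run the character filters, then the single tokenizer, then the token filters."""
    for char_filter in char_filters:      # zero or more, applied in order
        text = char_filter(text)
    tokens = tokenizer(text)              # exactly one tokenizer
    for token_filter in token_filters:    # zero or more, applied in order
        tokens = token_filter(tokens)
    return tokens


def replace_emoticon(text: str) -> str:
    # Character Filter: replace "(:" with "happy".
    return text.replace("(:", "happy")


def whitespace_tokenizer(text: str) -> List[str]:
    # Tokenizer: split at whitespace.
    return text.split()


def stop_filter(tokens: List[str]) -> List[str]:
    # Token Filter: drop stopwords (illustrative subset only).
    stopwords = {"a", "an", "the", "is"}
    return [token for token in tokens if token not in stopwords]


print(analyze("I am very (:", [replace_emoticon], whitespace_tokenizer, [stop_filter]))
```

Because both filter lists may be empty, the same function also models an analyzer that consists of a Tokenizer only.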
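Built-in analyzers
Standard
Splits a document based on the Unicode text segmentation algorithm, converts the tokens (the output of the Tokenizer) to lowercase, and filters out stopwords. Suitable for most languages.
Components:
Tokenizer: Standard Tokenizer.
Token Filter: LowerCase Token Filter and Stop Token Filter.
Note: if no Character Filter is listed for an analyzer, the analyzer has no Character Filter.
Optional parameters:
stopwords: the stopwords that the analyzer filters out. The value must be an array, and each stopword must be a string. If you set this parameter, the specified stopwords overwrite the default stopwords. The default stopwords are: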
["a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the", "their", "then", "there", "these", "they", "this", "to", "was", "will", "with"]
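max_token_length: the maximum length of each token. Default value: 255. A token that exceeds this length is split at intervals of the specified maximum length.
Sample configuration:
# Default configuration: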
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"standard"
}
}
}
}
# Configuration with custom stopwords:
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"my_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_analyzer":{
"type":"standard",
"max_token_length":10,
"stopwords":[
"memory",
"disk",
"is",
"a"
]
}
}
}
}
}
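The effect of max_token_length can be pictured with the following sketch. It assumes that length is counted in characters and that an over-long token is cut into consecutive chunks of at most max_token_length characters; the source does not state how TairSearch counts length, and split_by_max_length is a hypothetical helper name.

```python
from typing import List


def split_by_max_length(token: str, max_token_length: int = 255) -> List[str]:
    """Cut a token into consecutive chunks of at most max_token_length characters."""
    if max_token_length <= 0:
        raise ValueError("max_token_length must be a positive integer")
    return [token[start:start + max_token_length]
            for start in range(0, len(token), max_token_length)]


# With max_token_length set to 10, as in the sample above, a 13-character
# token is cut into a 10-character chunk and a 3-character chunk.
print(split_by_max_length("configuration", 10))
```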
Stop
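Splits a document at any non-letter character, converts the tokens to lowercase, and filters out stopwords.
Components:
Tokenizer: LowerCase Tokenizer.
Token Filter: Stop Token Filter.
Optional parameters:
stopwords: the stopwords that the analyzer filters out. The value must be an array, and each stopword must be a string. If you set this parameter, the specified stopwords overwrite the default stopwords. The default stopwords are: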
["a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the", "their", "then", "there", "these", "they", "this", "to", "was", "will", "with"]
Sample configuration:
# Default configuration:
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"stop"
}
}
}
}
# Configuration with custom stopwords:
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"my_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_analyzer":{
"type":"stop",
"stopwords":[
"memory",
"disk",
"is",
"a"
]
}
}
}
}
}
Jieba
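The recommended analyzer for Chinese. It splits a document based on a pre-trained dictionary or a specified dictionary, uses the Jieba search engine mode, converts English tokens to lowercase, and filters out stopwords.
Components:
Tokenizer: Jieba Tokenizer.
Token Filter: LowerCase Token Filter and Stop Token Filter.
Optional parameters:
userwords: a custom dictionary. The value must be an array, and each word must be a string. The specified words are appended to the default dictionary. For the default dictionary, see the Jieba default dictionary.
Important:
To improve segmentation quality, Jieba has a large built-in dictionary that occupies about 20 MB of memory. Only one copy of the dictionary is kept in memory. The dictionary is loaded the first time Jieba is used, which may cause slight latency jitter on the first use of the Jieba analyzer.
Words in the custom dictionary cannot contain spaces or the following special characters: \t, \n, the comma (,), and the ideographic full stop (。).
use_hmm: specifies whether to use a Hidden Markov Model (HMM) to identify words that are not in the dictionary. Valid values: true (default, enabled) and false (disabled).
stopwords: the stopwords that the analyzer filters out. The value must be an array, and each stopword must be a string. If you set this parameter, the specified stopwords overwrite the default stopwords. For the default stopwords, see the Jieba default stopwords.
Sample configuration:
# Default configuration: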
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"jieba"
}
}
}
}
# Configuration with custom stopwords:
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"my_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_analyzer":{
"type":"jieba",
"stopwords":[
"memory",
"disk",
"is",
"a"
],"userwords":[
"Redis",
"開源免費",
"靈活"
],
"use_hmm":true
}
}
}
}
}
IK
A Chinese analyzer that is compatible with the IK analyzer plug-in for Elasticsearch (ES). It provides two modes: ik_max_word and ik_smart. The ik_max_word mode extracts every possible token from the document. The ik_smart mode builds on ik_max_word and performs a second pass over the tokens to select the most likely ones.
Take the following document as an example: "Redis是完全開源免費的,遵守BSD協議,是一個靈活的高效能key-value資料結構儲存,可以用來作為資料庫、緩衝和訊息佇列。Redis比其他key-value緩衝產品有以下三個特點:Redis支援資料的持久化,可以將記憶體中的資料儲存在磁碟中,重啟的時候可以再次載入到記憶體使用。" The two modes produce the following tokens:
ik_max_word: redis 是 完全 全開 開源 免費 的 遵守 bsd 協議 是 一個 一 個 靈活 的 高效能 效能 key-value key value 資料結構 資料 結構 儲存 可以用 可以 用來 來作 作為 資料庫 資料 庫 緩衝 和 訊息 隊列 redis 比 其他 key-value key value 緩衝 產品 有 以下 三個 三 個 特點 redis 支援 資料 的 持久 化 可以 將 記憶體 中 的 資料 儲存 存在 磁碟 中 重啟 的 時候 可以 再次 載入 載到 記憶體 使用
ik_smart: redis 是 完全 開源 免費 的 遵守 bsd 協議 是 一個 靈活 的 高效能 key-value 資料結構 儲存 可以 用來 作為 資料庫 緩衝 和 訊息 隊列 redis 比 其他 key-value 緩衝 產品 有 以下 三個 特點 redis 支援 資料 的 持久 化 可以 將 記憶體 中 的 資料 保 存在 磁碟 中 重啟 的 時候 可以 再次 加 載到 記憶體 使用
Components:
Tokenizer: IK Tokenizer.
Optional parameters:
stopwords: the stopwords that the analyzer filters out. The value must be an array, and each stopword must be a string. If you set this parameter, the specified stopwords overwrite the default stopwords. The default stopwords are:
["a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the", "their", "then", "there", "these", "they", "this", "to", "was", "will", "with"]
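userwords: a custom dictionary. The value must be an array, and each word must be a string. The specified words are appended to the default dictionary. For the default dictionary, see the IK default dictionary.
quantifiers: a custom quantifier dictionary. The value must be an array, and each word must be a string. The specified words are appended to the default quantifier dictionary. For the default quantifier dictionary, see the IK default quantifier dictionary.
enable_lowercase: specifies whether to convert uppercase letters to lowercase. Valid values: true (default, enabled) and false (disabled).
Important: the conversion controlled by this parameter takes place before tokenization. If the custom dictionary contains uppercase letters, set this parameter to false.
Sample configuration:
# Default configuration: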
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"ik_smart"
},
"f1":{
"type":"text",
"analyzer":"ik_max_word"
}
}
}
}
# Configuration with custom stopwords:
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"my_ik_smart_analyzer"
},
"f1":{
"type":"text",
"analyzer":"my_ik_max_word_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_ik_smart_analyzer":{
"type":"ik_smart",
"stopwords":[
"memory",
"disk",
"is",
"a"
],"userwords":[
"Redis",
"開源免費",
"靈活"
],
"quantifiers":[
"納秒"
],
"enable_lowercase":false
},
"my_ik_max_word_analyzer":{
"type":"ik_max_word",
"stopwords":[
"memory",
"disk",
"is",
"a"
],"userwords":[
"Redis",
"開源免費",
"靈活"
],
"quantifiers":[
"納秒"
],
"enable_lowercase":false
}
}
}
}
}
Pattern
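Splits a document based on the specified regular expression. The text matched by the regular expression is used as the delimiter. For example, if the regular expression is "aaa" and the document is "bbbaaaccc", the resulting tokens are "bbb" and "ccc". The lowercase parameter determines whether English tokens are converted to lowercase, and stopwords are filtered out.
Components:
Tokenizer: Pattern Tokenizer.
Token Filter: LowerCase Token Filter and Stop Token Filter.
Optional parameters:
pattern: the regular expression. The text that it matches is used as the delimiter. Default value: \W+. For more information about the syntax, see RE2.
stopwords: the stopwords that the analyzer filters out. The value must be an array, and each stopword must be a string. If you set this parameter, the specified stopwords overwrite the default stopwords. The default stopwords are: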
["a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the", "their", "then", "there", "these", "they", "this", "to", "was", "will", "with"]
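lowercase: specifies whether to convert tokens to lowercase. Valid values: true (default, enabled) and false (disabled).
flags: specifies whether the regular expression is case-sensitive. By default, this parameter is empty, which means that matching is case-sensitive. Set it to CASE_INSENSITIVE to make matching case-insensitive.
Sample configuration:
# Default configuration: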
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"pattern"
}
}
}
}
# Configuration with custom stopwords:
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"my_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_analyzer":{
"type":"pattern",
"pattern":"\\'([^\\']+)\\'",
"stopwords":[
"aaa",
"@"
],
"lowercase":false,
"flags":"CASE_INSENSITIVE"
}
}
}
}
}
Whitespace
Splits a document at whitespace.
Components:
Tokenizer: Whitespace Tokenizer.
Optional parameters: none.
Sample configuration:
# Default configuration:
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"whitespace"
}
}
}
}
Simple
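Splits a document at any non-letter character and converts the tokens to lowercase.
Components:
Tokenizer: LowerCase Tokenizer.
Optional parameters: none.
Sample configuration:
# Default configuration: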
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"simple"
}
}
}
}
Keyword
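Does not split the document. The whole document is output as a single token.
Components:
Tokenizer: Keyword Tokenizer.
Optional parameters: none.
Sample configuration:
# Default configuration: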
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"keyword"
}
}
}
}
Language
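Analyzers for specific languages. Supported values: chinese, arabic, cjk, brazilian, czech, german, greek, persian, french, dutch, and russian.
Optional parameters:
stopwords: the stopwords that the analyzer filters out. The value must be an array, and each stopword must be a string. If you set this parameter, the specified stopwords overwrite the default stopwords. For the default stopwords of each language, see Appendix 4.
Note: the stopwords of the chinese analyzer cannot be modified.
stem_exclusion: the words (terms) that are excluded from stemming. Stemming reduces a word to its stem, for example "apples" to "apple". By default, this parameter is empty. The value must be an array, and each word must be a string.
Note: only the brazilian, german, french, and dutch analyzers support this parameter.
Sample configuration:
# Default configuration: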
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"arabic"
}
}
}
}
# Configuration with custom stopwords:
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"my_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_analyzer":{
"type":"german",
"stopwords":[
"ein"
],
"stem_exclusion":[
"speicher"
]
}
}
}
}
}
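Custom analyzers
A TairSearch analyzer runs the Character Filter, Tokenizer, and Token Filter stages in that order. You can configure the char_filter, tokenizer, and filter parameters as needed.
Configuration method: in properties, set analyzer to the name of the custom analyzer, for example my_custom_analyzer. Then, in settings, specify the configuration of that custom analyzer (my_custom_analyzer).
Parameters:
Parameter | Description |
type (required) | Set the value to custom, which indicates a custom analyzer. |
char_filter (optional) | The character filters that preprocess the document before tokenization. By default, this parameter is empty, which means that no preprocessing is performed. Currently only the Mapping Character Filter is supported. For its parameters, see Appendix 1. |
tokenizer (required) | The tokenizer. Exactly one tokenizer must be specified. Valid values: whitespace, lowercase, standard, classic, letter, keyword, jieba, pattern, ik_max_word, and ik_smart. For more information, see Appendix 2. |
filter (optional) | The token filters that process the tokens produced by the tokenizer, for example by removing stopwords or converting tokens to lowercase. Multiple token filters can be specified. By default, this parameter is empty, which means that no processing is performed. Valid values: classic, elision, lowercase, snowball, stop, asciifolding, length, arabic_normalization, and persian_normalization. For more information, see Appendix 3. |
Sample configuration:
# Custom analyzer configuration:
# This example configures two Character Filters named emoticons and conjunctions, the Whitespace Tokenizer, and the Lowercase and Stop Token Filters.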
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"my_custom_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_custom_analyzer":{
"type":"custom",
"tokenizer":"whitespace",
"filter":[
"lowercase",
"stop"
],
"char_filter": [
"emoticons",
"conjunctions"
]
}
},
"char_filter":{
"emoticons":{
"type":"mapping",
"mappings":[
":) => _happy_",
":( => _sad_"
]
},
"conjunctions":{
"type":"mapping",
"mappings":[
"&=>and"
]
}
}
}
}
}
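When such a schema is generated from application code, it can help to check the built-in names before sending it. The following sketch assembles the same structure as the sample above and rejects tokenizer or filter names that are not in the parameter table. The helper names build_custom_analyzer and build_schema are hypothetical, and the check deliberately covers built-in names only; user-defined tokenizers or filters (declared under settings.analysis) would need to be added to the allowed sets.

```python
import json
from typing import Dict, List, Optional

# Built-in names copied from the parameter table above.
SUPPORTED_TOKENIZERS = {"whitespace", "lowercase", "standard", "classic", "letter",
                        "keyword", "jieba", "pattern", "ik_max_word", "ik_smart"}
SUPPORTED_FILTERS = {"classic", "elision", "lowercase", "snowball", "stop",
                     "asciifolding", "length", "arabic_normalization",
                     "persian_normalization"}


def build_custom_analyzer(tokenizer: str,
                          filters: Optional[List[str]] = None,
                          char_filters: Optional[List[str]] = None) -> Dict:
    """Return the body of a custom analyzer after checking built-in names."""
    if tokenizer not in SUPPORTED_TOKENIZERS:
        raise ValueError(f"unsupported built-in tokenizer: {tokenizer}")
    unknown = [name for name in (filters or []) if name not in SUPPORTED_FILTERS]
    if unknown:
        raise ValueError(f"unsupported built-in token filters: {unknown}")
    analyzer: Dict = {"type": "custom", "tokenizer": tokenizer}
    if filters:
        analyzer["filter"] = list(filters)
    if char_filters:
        analyzer["char_filter"] = list(char_filters)  # user-defined names
    return analyzer


def build_schema(field: str, analyzer_name: str, analyzer: Dict,
                 char_filter_defs: Optional[Dict] = None) -> Dict:
    """Wrap the analyzer in the mappings/settings layout used by the samples."""
    analysis: Dict = {"analyzer": {analyzer_name: analyzer}}
    if char_filter_defs:
        analysis["char_filter"] = char_filter_defs
    return {
        "mappings": {"properties": {field: {"type": "text", "analyzer": analyzer_name}}},
        "settings": {"analysis": analysis},
    }


schema = build_schema(
    "f0",
    "my_custom_analyzer",
    build_custom_analyzer("whitespace", ["lowercase", "stop"], ["emoticons", "conjunctions"]),
    {
        "emoticons": {"type": "mapping", "mappings": [":) => _happy_", ":( => _sad_"]},
        "conjunctions": {"type": "mapping", "mappings": ["&=>and"]},
    },
)
print(json.dumps(schema, ensure_ascii=False, indent=2))
```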
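Appendix 1: Supported Character Filters
Mapping Character Filter
Uses the mappings parameter to define key-value mappings. When a key is matched in the text, it is replaced with the corresponding value. For example, ":) => _happy_" replaces ":)" with "_happy_". Multiple Mapping Character Filters can be configured.
Parameters:
mappings (required): an array. Each element must contain =>, for example "&=>and".
Sample configuration: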
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"my_custom_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_custom_analyzer":{
"type":"custom",
"tokenizer":"standard",
"char_filter": [
"emoticons"
]
}
},
"char_filter":{
"emoticons":{
"type":"mapping",
"mappings":[
":) => _happy_",
":( => _sad_"
]
}
}
}
}
}
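The replacement behavior of a Mapping Character Filter can be approximated as follows. This is a simplified sketch: it assumes that whitespace around => is ignored (as the ":) => _happy_" and "&=>and" entries suggest) and applies the rules one after another, whereas the actual matching strategy of TairSearch is not described in this topic. The function names parse_mappings and apply_mappings are hypothetical.

```python
from typing import List, Tuple


def parse_mappings(mappings: List[str]) -> List[Tuple[str, str]]:
    """Turn entries such as ':) => _happy_' into (key, value) pairs."""
    rules = []
    for entry in mappings:
        if "=>" not in entry:
            raise ValueError(f"mapping entry must contain '=>': {entry!r}")
        key, value = entry.split("=>", 1)
        # Assumption: whitespace around the arrow is not part of key or value.
        rules.append((key.strip(), value.strip()))
    return rules


def apply_mappings(text: str, rules: List[Tuple[str, str]]) -> str:
    """Replace every occurrence of each key with its value, rule by rule."""
    for key, value in rules:
        text = text.replace(key, value)
    return text


rules = parse_mappings([":) => _happy_", ":( => _sad_", "&=>and"])
print(apply_mappings("I feel :) & :(", rules))
```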
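Appendix 2: Supported Tokenizers
whitespace
Splits a document at whitespace.
Optional parameters:
max_token_length: the maximum length of each token. Default value: 255. A token that exceeds this length is split at intervals of the specified maximum length.
Sample configuration:
# Default configuration: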
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"my_custom_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_custom_analyzer":{
"type":"custom",
"tokenizer":"whitespace"
}
}
}
}
}
# Custom configuration:
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"my_custom_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_custom_analyzer":{
"type":"custom",
"tokenizer":"token1"
}
},
"tokenizer":{
"token1":{
"type":"whitespace",
"max_token_length":2
}
}
}
}
}
standard
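Splits a document based on the Unicode text segmentation algorithm. Suitable for most languages.
Optional parameters:
max_token_length: the maximum length of each token. Default value: 255. A token that exceeds this length is split at intervals of the specified maximum length.
Sample configuration:
# Default configuration: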
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"my_custom_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_custom_analyzer":{
"type":"custom",
"tokenizer":"standard"
}
}
}
}
}
# Custom configuration:
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"my_custom_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_custom_analyzer":{
"type":"custom",
"tokenizer":"token1"
}
},
"tokenizer":{
"token1":{
"type":"standard",
"max_token_length":2
}
}
}
}
}
classic
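Splits a document based on English grammar and applies special handling to acronyms, company names, email addresses, and Internet IP addresses, as follows:
Splits words at punctuation and removes the punctuation. A period that is not followed by whitespace is treated as part of the token. For example, red.apple is not split, whereas red.[space] apple is split into red and apple.
Splits words at hyphens. If a token contains a digit, the whole token is interpreted as a product number and is not split.
Recognizes email addresses and Internet host names as single tokens.
Optional parameters:
max_token_length: the maximum length of each token. Default value: 255. A token that exceeds this length is skipped.
Sample configuration:
# Default configuration: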
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"my_custom_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_custom_analyzer":{
"type":"custom",
"tokenizer":"classic"
}
}
}
}
}
# Custom configuration:
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"my_custom_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_custom_analyzer":{
"type":"custom",
"tokenizer":"token1"
}
},
"tokenizer":{
"token1":{
"type":"classic",
"max_token_length":2
}
}
}
}
}
letter
Splits a document at any non-letter character. Suitable for European languages, but not for Asian languages.
Sample configuration:
# Default configuration:
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"my_custom_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_custom_analyzer":{
"type":"custom",
"tokenizer":"letter"
}
}
}
}
}
lowercase
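Splits a document at any non-letter character and converts all tokens to lowercase. The Lowercase Tokenizer produces the same result as the Letter Tokenizer combined with the LowerCase Token Filter, but needs one fewer pass over the text.
Sample configuration:
# Default configuration: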
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"my_custom_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_custom_analyzer":{
"type":"custom",
"tokenizer":"lowercase"
}
}
}
}
}
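The equivalence between the Lowercase Tokenizer and the Letter Tokenizer followed by a lowercase filter can be checked with a small sketch. It approximates "letter" with Python's Unicode-aware character class [^\W\d_], which is an assumption; the exact letter definition used by TairSearch is not given here, and the function names are illustrative.

```python
import re
from typing import List

# Runs of letters; digits, underscores, punctuation, and whitespace act as separators.
_LETTERS = re.compile(r"[^\W\d_]+")


def letter_tokenizer(text: str) -> List[str]:
    """Split at every non-letter character (first pass)."""
    return _LETTERS.findall(text)


def lowercase_filter(tokens: List[str]) -> List[str]:
    """Lowercase each token (second pass)."""
    return [token.lower() for token in tokens]


def lowercase_tokenizer(text: str) -> List[str]:
    """Split and lowercase in a single pass over the matches."""
    return [match.group().lower() for match in _LETTERS.finditer(text)]


text = "Redis-6 is FAST"
print(lowercase_filter(letter_tokenizer(text)))
print(lowercase_tokenizer(text))
```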
keyword
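Does not split the document. The whole document is output as a single token. This tokenizer is typically combined with Token Filters. For example, the Keyword Tokenizer combined with the Lowercase Token Filter converts the entire input document to lowercase.
Sample configuration:
# Default configuration: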
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"my_custom_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_custom_analyzer":{
"type":"custom",
"tokenizer":"keyword"
}
}
}
}
}
jieba
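The recommended tokenizer for Chinese. It splits a document based on a pre-trained dictionary or a specified dictionary.
Optional parameters:
userwords: a custom dictionary. The value must be an array, and each word must be a string. The specified words are appended to the default dictionary. For the default dictionary, see the Jieba default dictionary.
Important: words in the custom dictionary cannot contain spaces or the following special characters: \t, \n, the comma (,), and the ideographic full stop (。).
use_hmm: specifies whether to use a Hidden Markov Model (HMM) to identify words that are not in the dictionary. Valid values: true (default, enabled) and false (disabled).
Sample configuration:
# Default configuration: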
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"my_custom_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_custom_analyzer":{
"type":"custom",
"tokenizer":"jieba"
}
}
}
}
}
# Custom configuration:
{
"mappings":{
"properties":{
"f1":{
"type":"text",
"analyzer":"my_custom_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_custom_analyzer":{
"type":"custom",
"tokenizer":"token1"
}
},
"tokenizer":{
"token1":{
"type":"jieba",
"userwords":[
"Redis",
"開源免費",
"靈活"
],
"use_hmm":true
}
}
}
}
}
pattern
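Splits a document based on the specified regular expression. The text matched by the regular expression can be used either as the delimiter or as the target token.
Optional parameters:
pattern: the regular expression. Default value: \W+. For more information about the syntax, see RE2.
group: specifies whether the text matched by the regular expression is the delimiter or the target token. Valid values:
-1 (default): the matched text is the delimiter. For example, if the regular expression is "aaa" and the document is "bbbaaaccc", the resulting tokens are "bbb" and "ccc".
0 or a positive integer: the matched text is the target token. 0 selects the entire match, and 1 or greater selects the capture group with that index. For example, if the regular expression is "a(b+)c" and the document is "abbbcdefabc": when group is 0, the tokens are "abbbc" and "abc"; when group is 1, the first capture group b+ is used and the tokens are "bbb" and "b".
flags: specifies whether the regular expression is case-sensitive. By default, this parameter is empty, which means that matching is case-sensitive. Set it to CASE_INSENSITIVE to make matching case-insensitive.
Sample configuration:
# Default configuration: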
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"my_custom_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_custom_analyzer":{
"type":"custom",
"tokenizer":"pattern"
}
}
}
}
}
# Custom configuration:
{
"mappings":{
"properties":{
"f1":{
"type":"text",
"analyzer":"my_custom_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_custom_analyzer":{
"type":"custom",
"tokenizer":"pattern_tokenizer"
}
},
"tokenizer":{
"pattern_tokenizer":{
"type":"pattern",
"pattern":"AB(A(\\w+)C)",
"flags":"CASE_INSENSITIVE",
"group":2
}
}
}
}
}
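The group semantics of the pattern tokenizer can be reproduced with Python's re module. Treat this as a sketch only: TairSearch refers to RE2 syntax, which differs from Python's re in some features, and pattern_tokenize with its case_insensitive argument is a hypothetical helper that stands in for the pattern, group, and flags parameters.

```python
import re
from typing import List


def pattern_tokenize(text: str, pattern: str = r"\W+", group: int = -1,
                     case_insensitive: bool = False) -> List[str]:
    """Tokenize text with a regular expression.

    group == -1: matches are delimiters; the text between them becomes tokens.
    group >= 0:  matches are tokens; 0 is the whole match, n is capture group n.
    """
    regex = re.compile(pattern, re.IGNORECASE if case_insensitive else 0)
    if group == -1:
        pieces, last = [], 0
        for match in regex.finditer(text):
            pieces.append(text[last:match.start()])
            last = match.end()
        pieces.append(text[last:])
        return [piece for piece in pieces if piece]  # drop empty tokens
    return [match.group(group) for match in regex.finditer(text) if match.group(group)]


print(pattern_tokenize("bbbaaaccc", "aaa"))                 # delimiter mode
print(pattern_tokenize("abbbcdefabc", "a(b+)c", group=0))   # whole match
print(pattern_tokenize("abbbcdefabc", "a(b+)c", group=1))   # first capture group
```

The delimiter branch iterates over matches instead of calling re.split, because re.split would also return the text of capture groups when the pattern contains any.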
IK
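A Chinese tokenizer. Valid values: ik_max_word and ik_smart. ik_max_word extracts every possible token from the document. ik_smart builds on ik_max_word and performs a second pass over the tokens to select the most likely ones.
Optional parameters:
stopwords: the stopwords that the tokenizer filters out. The value must be an array, and each stopword must be a string. If you set this parameter, the specified stopwords overwrite the default stopwords. The default stopwords are: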
["a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the", "their", "then", "there", "these", "they", "this", "to", "was", "will", "with"]
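userwords: a custom dictionary. The value must be an array, and each word must be a string. The specified words are appended to the default dictionary. For the default dictionary, see the IK default dictionary.
quantifiers: a custom quantifier dictionary. The value must be an array, and each word must be a string. The specified words are appended to the default quantifier dictionary. For the default quantifier dictionary, see the IK default quantifier dictionary.
enable_lowercase: specifies whether to convert uppercase letters to lowercase. Valid values: true (default, enabled) and false (disabled).
Important: the conversion controlled by this parameter takes place before tokenization. If the custom dictionary contains uppercase letters, set this parameter to false.
Sample configuration:
# Default configuration: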
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"my_custom_ik_smart_analyzer"
},
"f1":{
"type":"text",
"analyzer":"my_custom_ik_max_word_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_custom_ik_smart_analyzer":{
"type":"custom",
"tokenizer":"ik_smart"
},
"my_custom_ik_max_word_analyzer":{
"type":"custom",
"tokenizer":"ik_max_word"
}
}
}
}
}
# Custom configuration:
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"my_custom_ik_smart_analyzer"
},
"f1":{
"type":"text",
"analyzer":"my_custom_ik_max_word_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_custom_ik_smart_analyzer":{
"type":"custom",
"tokenizer":"my_ik_smart_tokenizer"
},
"my_custom_ik_max_word_analyzer":{
"type":"custom",
"tokenizer":"my_ik_max_word_tokenizer"
}
},
"tokenizer":{
"my_ik_smart_tokenizer":{
"type":"ik_smart",
"userwords":[
"中文分詞器",
"自訂stopwords"
],
"stopwords":[
"關於",
"測試"
],
"quantifiers":[
"納秒"
],
"enable_lowercase":false
},
"my_ik_max_word_tokenizer":{
"type":"ik_max_word",
"userwords":[
"中文分詞器",
"自訂stopwords"
],
"stopwords":[
"關於",
"測試"
],
"quantifiers":[
"納秒"
],
"enable_lowercase":false
}
}
}
}
}
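Appendix 3: Supported Token Filters
classic
Removes the trailing 's from tokens and the periods (.) from acronyms. For example, "Fig." is converted to "Fig".
Sample configuration:
# Default configuration: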
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"my_custom_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_custom_analyzer":{
"type":"custom",
"tokenizer":"classic",
"filter":["classic"]
}
}
}
}
}
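A rough Python approximation of the classic Token Filter is shown below. How TairSearch decides that a token is an acronym is not described in this topic, so the sketch assumes a simple rule: a token made only of letter groups that each end with a period. The name classic_filter and that rule are illustrative assumptions.

```python
import re
from typing import List

# Assumed acronym shape: one or more letter groups, each followed by a period.
_ACRONYM = re.compile(r"(?:[A-Za-z]+\.)+")


def classic_filter(tokens: List[str]) -> List[str]:
    """Strip a trailing 's and remove the periods of acronym-like tokens."""
    cleaned = []
    for token in tokens:
        if token.lower().endswith("'s"):
            token = token[:-2]                 # Tair's -> Tair
        elif _ACRONYM.fullmatch(token):
            token = token.replace(".", "")     # Fig. -> Fig
        cleaned.append(token)
    return cleaned


print(classic_filter(["Tair's", "Fig.", "I.B.M.", "red.apple"]))
```

Tokens such as red.apple are left unchanged because they do not end with a period and therefore do not match the assumed acronym shape.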
elision
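Removes the specified elisions (elided articles such as l') from the beginning of tokens. Commonly used for French.
Optional parameters:
articles (required when you customize the filter): the elided articles to remove. The value must be an array, and each entry must be a string. Default value: ["l", "m", "t", "qu", "n", "s", "j"]. The specified values overwrite the default list.
articles_case (optional): specifies whether matching of the articles ignores case. Valid values: true (case-insensitive) and false (default, case-sensitive).
Sample configuration:
# Default configuration: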
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"my_custom_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_custom_analyzer":{
"type":"custom",
"tokenizer":"whitespace",
"filter":["elision"]
}
}
}
}
}
# Custom configuration:
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"my_custom_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_custom_analyzer":{
"type":"custom",
"tokenizer":"whitespace",
"filter":["elision_filter"]
}
},
"filter":{
"elision_filter":{
"type":"elision",
"articles":["l", "m", "t", "qu", "n", "s", "j"],
"articles_case":true
}
}
}
}
}
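The behavior of the elision filter can be sketched as follows, assuming that the filter removes a leading article together with the apostrophe that follows it, and that only the ASCII apostrophe (') is considered. The function name elision_filter is hypothetical; the default articles and the meaning of articles_case are taken from the parameter description above.

```python
from typing import Iterable, List

DEFAULT_ARTICLES = ("l", "m", "t", "qu", "n", "s", "j")


def elision_filter(tokens: List[str],
                   articles: Iterable[str] = DEFAULT_ARTICLES,
                   articles_case: bool = False) -> List[str]:
    """Drop a leading elided article such as l' from each token.

    articles_case=True makes the article match case-insensitive.
    """
    known = {a.lower() for a in articles} if articles_case else set(articles)
    result = []
    for token in tokens:
        prefix, apostrophe, rest = token.partition("'")
        key = prefix.lower() if articles_case else prefix
        if apostrophe and rest and key in known:
            result.append(rest)       # l'avion -> avion
        else:
            result.append(token)      # unchanged
    return result


print(elision_filter(["l'avion", "L'avion", "aujourd'hui"]))
print(elision_filter(["l'avion", "L'avion", "aujourd'hui"], articles_case=True))
```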
lowercase
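Converts all tokens to lowercase.
Optional parameters:
language: the language of the token filter. Valid values: greek and russian. If this parameter is not set, English is used.
Sample configuration:
# Default configuration: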
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"my_custom_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_custom_analyzer":{
"type":"custom",
"tokenizer":"whitespace",
"filter":["lowercase"]
}
}
}
}
}
# Custom configuration:
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"my_custom_greek_analyzer"
},
"f1":{
"type":"text",
"analyzer":"my_custom_russian_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_custom_greek_analyzer":{
"type":"custom",
"tokenizer":"whitespace",
"filter":["greek_lowercase"]
},
"my_custom_russian_analyzer":{
"type":"custom",
"tokenizer":"whitespace",
"filter":["russian_lowercase"]
}
},
"filter":{
"greek_lowercase":{
"type":"lowercase",
"language":"greek"
},
"russian_lowercase":{
"type":"lowercase",
"language":"russian"
}
}
}
}
}
snowball
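Reduces all tokens to their stems. For example, cats is converted to cat.
Optional parameters:
language: the language of the token filter. Valid values: english (default), german, french, and dutch.
Sample configuration:
# Default configuration: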
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"my_custom_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_custom_analyzer":{
"type":"custom",
"tokenizer":"whitespace",
"filter":["snowball"]
}
}
}
}
}
# Custom configuration:
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"my_custom_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_custom_analyzer":{
"type":"custom",
"tokenizer":"standard",
"filter":["my_filter"]
}
},
"filter":{
"my_filter":{
"type":"snowball",
"language":"english"
}
}
}
}
}
stop
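Removes the stopwords listed in the specified stopword array from the tokens.
Optional parameters:
stopwords: the stopword array. Each stopword must be a string. If you set this parameter, the specified stopwords overwrite the default stopwords. The default stopwords are: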
["a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the", "their", "then", "there", "these", "they", "this", "to", "was", "will", "with"]
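ignore_case: specifies whether stopword matching ignores case. Valid values: true (case-insensitive) and false (default, case-sensitive).
Sample configuration:
# Default configuration: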
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"my_custom_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_custom_analyzer":{
"type":"custom",
"tokenizer":"whitespace",
"filter":["stop"]
}
}
}
}
}
# Custom configuration:
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"my_custom_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_custom_analyzer":{
"type":"custom",
"tokenizer":"standard",
"filter":["stop_filter"]
}
},
"filter":{
"stop_filter":{
"type":"stop",
"stopwords":[
"the"
],
"ignore_case":true
}
}
}
}
}
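The interaction between the stopword list and the case-sensitivity switch can be illustrated with a short sketch. The default list is copied from this section; the function name stop_filter and its Python signature are assumptions made for the example.

```python
from typing import Iterable, List

DEFAULT_STOPWORDS = [
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in",
    "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the",
    "their", "then", "there", "these", "they", "this", "to", "was", "will", "with",
]


def stop_filter(tokens: List[str],
                stopwords: Iterable[str] = DEFAULT_STOPWORDS,
                ignore_case: bool = False) -> List[str]:
    """Remove stopwords; custom stopwords replace the default list entirely."""
    if ignore_case:
        lowered = {word.lower() for word in stopwords}
        return [token for token in tokens if token.lower() not in lowered]
    exact = set(stopwords)
    return [token for token in tokens if token not in exact]


tokens = ["The", "quick", "the", "fox"]
print(stop_filter(tokens, ["the"]))                     # case-sensitive match
print(stop_filter(tokens, ["the"], ignore_case=True))   # case-insensitive match
```

With case-sensitive matching, a capitalized stopword such as "The" survives unless a lowercase filter runs before the stop filter.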
asciifolding
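Converts letters, digits, and symbols that are outside the Basic Latin Unicode block (the first 128 ASCII characters, U+0000 to U+007F) to their ASCII equivalents, if such equivalents exist. For example, é is converted to e.
Sample configuration:
# Default configuration: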
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"my_custom_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_custom_analyzer":{
"type":"custom",
"tokenizer":"standard",
"filter":["asciifolding"]
}
}
}
}
}
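For accented letters, the effect of ASCII folding can be approximated with Unicode decomposition in Python's standard library. This is not how TairSearch implements the filter and it covers fewer characters: letters without a canonical decomposition (for example ß or ø) are left unchanged by this sketch. The function name ascii_fold is hypothetical.

```python
import unicodedata


def ascii_fold(token: str) -> str:
    """Decompose characters and drop combining marks, e.g. é -> e."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(ascii_fold("é"))
print(ascii_fold("résumé"))
```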
length
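Removes tokens whose length falls outside the specified range.
Optional parameters:
min: the minimum token length. The value must be an integer. Default value: 0.
max: the maximum token length. The value must be an integer. Default value: 2^31 - 1.
Sample configuration:
# Default configuration: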
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"my_custom_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_custom_analyzer":{
"type":"custom",
"tokenizer":"whitespace",
"filter":["length"]
}
}
}
}
}
# Custom configuration:
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"my_custom_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_custom_analyzer":{
"type":"custom",
"tokenizer":"whitespace",
"filter":["length_filter"]
}
},
"filter":{
"length_filter":{
"type":"length",
"max":5,
"min":2
}
}
}
}
}
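The length filter amounts to a range check on each token. The sketch below assumes that both bounds are inclusive and that length is counted in characters; neither detail is stated in this topic, and length_filter is a hypothetical name.

```python
from typing import List


def length_filter(tokens: List[str], min_length: int = 0,
                  max_length: int = 2**31 - 1) -> List[str]:
    """Keep tokens whose length lies within [min_length, max_length]."""
    return [token for token in tokens if min_length <= len(token) <= max_length]


# Same bounds as the custom configuration above: min 2, max 5.
print(length_filter(["a", "is", "cache", "storage"], min_length=2, max_length=5))
```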
Normalization
Normalizes the characters that are specific to a language. Valid values: arabic_normalization and persian_normalization. We recommend that you use this filter with the Standard Tokenizer.
Sample configuration:
# Default configuration:
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"my_arabic_analyzer"
},
"f1":{
"type":"text",
"analyzer":"my_persian_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_arabic_analyzer":{
"type":"custom",
"tokenizer":"standard",
"filter":["arabic_normalization"]
},
"my_persian_analyzer":{
"type":"custom",
"tokenizer":"standard",
"filter":["persian_normalization"]
}
}
}
}
}
Appendix 4: Default stopwords of the built-in Language analyzers
arabic
["من","ومن","منها","منه","في","وفي","فيها","فيه","و","ف","ثم","او","أو","ب","بها","به","ا","أ","اى","اي","أي","أى","لا","ولا","الا","ألا","إلا","لكن","ما","وما","كما","فما","عن","مع","اذا","إذا","ان","أن","إن","انها","أنها","إنها","انه","أنه","إنه","بان","بأن","فان","فأن","وان","وأن","وإن","التى","التي","الذى","الذي","الذين","الى","الي","إلى","إلي","على","عليها","عليه","اما","أما","إما","ايضا","أيضا","كل","وكل","لم","ولم","لن","ولن","هى","هي","هو","وهى","وهي","وهو","فهى","فهي","فهو","انت","أنت","لك","لها","له","هذه","هذا","تلك","ذلك","هناك","كانت","كان","يكون","تكون","وكانت","وكان","غير","بعض","قد","نحو","بين","بينما","منذ","ضمن","حيث","الان","الآن","خلال","بعد","قبل","حتى","عند","عندما","لدى","جميع"]
cjk
["with","will","to","this","there","then","the","t","that","such","s","on","not","no","it","www","was","is","","into","their","or","in","if","for","by","but","they","be","these","at","are","as","and","of","a"]
brazilian
["uns","umas","uma","teu","tambem","tal","suas","sobre","sob","seu","sendo","seja","sem","se","quem","tua","que","qualquer","porque","por","perante","pelos","pelo","outros","outro","outras","outra","os","o","nesse","nas","na","mesmos","mesmas","mesma","um","neste","menos","quais","mediante","proprio","logo","isto","isso","ha","estes","este","propios","estas","esta","todas","esses","essas","toda","entre","nos","entao","em","eles","qual","elas","tuas","ela","tudo","do","mesmo","diversas","todos","diversa","seus","dispoem","ou","dispoe","teus","deste","quer","desta","diversos","desde","quanto","depois","demais","quando","essa","deles","todo","pois","dele","dela","dos","de","da","nem","cujos","das","cujo","durante","cujas","portanto","cuja","contudo","ele","contra","como","com","pelas","assim","as","aqueles","mais","esse","aquele","mas","apos","aos","aonde","sua","e","ao","antes","nao","ambos","ambas","alem","ainda","a"]
czech
["a","s","k","o","i","u","v","z","dnes","cz","tímto","budeš","budem","byli","jseš","muj","svým","ta","tomto","tohle","tuto","tyto","jej","zda","proc","máte","tato","kam","tohoto","kdo","kterí","mi","nám","tom","tomuto","mít","nic","proto","kterou","byla","toho","protože","asi","ho","naši","napište","re","což","tím","takže","svých","její","svými","jste","aj","tu","tedy","teto","bylo","kde","ke","pravé","ji","nad","nejsou","ci","pod","téma","mezi","pres","ty","pak","vám","ani","když","však","neg","jsem","tento","clánku","clánky","aby","jsme","pred","pta","jejich","byl","ješte","až","bez","také","pouze","první","vaše","která","nás","nový","tipy","pokud","muže","strana","jeho","své","jiné","zprávy","nové","není","vás","jen","podle","zde","už","být","více","bude","již","než","který","by","které","co","nebo","ten","tak","má","pri","od","po","jsou","jak","další","ale","si","se","ve","to","jako","za","zpet","ze","do","pro","je","na","atd","atp","jakmile","pricemž","já","on","ona","ono","oni","ony","my","vy","jí","ji","me","mne","jemu","tomu","tem","temu","nemu","nemuž","jehož","jíž","jelikož","jež","jakož","nacež"]
german
["wegen","mir","mich","dich","dir","ihre","wird","sein","auf","durch","ihres","ist","aus","von","im","war","mit","ohne","oder","kein","wie","was","es","sie","mein","er","du","daß","dass","die","als","ihr","wir","der","für","das","einen","wer","einem","am","und","eines","eine","in","einer"]
greek
["ο","η","το","οι","τα","του","τησ","των","τον","την","και","κι","κ","ειμαι","εισαι","ειναι","ειμαστε","ειστε","στο","στον","στη","στην","μα","αλλα","απο","για","προσ","με","σε","ωσ","παρα","αντι","κατα","μετα","θα","να","δε","δεν","μη","μην","επι","ενω","εαν","αν","τοτε","που","πωσ","ποιοσ","ποια","ποιο","ποιοι","ποιεσ","ποιων","ποιουσ","αυτοσ","αυτη","αυτο","αυτοι","αυτων","αυτουσ","αυτεσ","αυτα","εκεινοσ","εκεινη","εκεινο","εκεινοι","εκεινεσ","εκεινα","εκεινων","εκεινουσ","οπωσ","ομωσ","ισωσ","οσο","οτι"]
persian
["انان","نداشته","سراسر","خياه","ايشان","وي","تاكنون","بيشتري","دوم","پس","ناشي","وگو","يا","داشتند","سپس","هنگام","هرگز","پنج","نشان","امسال","ديگر","گروهي","شدند","چطور","ده","و","دو","نخستين","ولي","چرا","چه","وسط","ه","كدام","قابل","يك","رفت","هفت","همچنين","در","هزار","بله","بلي","شايد","اما","شناسي","گرفته","دهد","داشته","دانست","داشتن","خواهيم","ميليارد","وقتيكه","امد","خواهد","جز","اورده","شده","بلكه","خدمات","شدن","برخي","نبود","بسياري","جلوگيري","حق","كردند","نوعي","بعري","نكرده","نظير","نبايد","بوده","بودن","داد","اورد","هست","جايي","شود","دنبال","داده","بايد","سابق","هيچ","همان","انجا","كمتر","كجاست","گردد","كسي","تر","مردم","تان","دادن","بودند","سري","جدا","ندارند","مگر","يكديگر","دارد","دهند","بنابراين","هنگامي","سمت","جا","انچه","خود","دادند","زياد","دارند","اثر","بدون","بهترين","بيشتر","البته","به","براساس","بيرون","كرد","بعضي","گرفت","توي","اي","ميليون","او","جريان","تول","بر","مانند","برابر","باشيم","مدتي","گويند","اكنون","تا","تنها","جديد","چند","بي","نشده","كردن","كردم","گويد","كرده","كنيم","نمي","نزد","روي","قصد","فقط","بالاي","ديگران","اين","ديروز","توسط","سوم","ايم","دانند","سوي","استفاده","شما","كنار","داريم","ساخته","طور","امده","رفته","نخست","بيست","نزديك","طي","كنيد","از","انها","تمامي","داشت","يكي","طريق","اش","چيست","روب","نمايد","گفت","چندين","چيزي","تواند","ام","ايا","با","ان","ايد","ترين","اينكه","ديگري","راه","هايي","بروز","همچنان","پاعين","كس","حدود","مختلف","مقابل","چيز","گيرد","ندارد","ضد","همچون","سازي","شان","مورد","باره","مرسي","خويش","برخوردار","چون","خارج","شش","هنوز","تحت","ضمن","هستيم","گفته","فكر","بسيار","پيش","براي","روزهاي","انكه","نخواهد","بالا","كل","وقتي","كي","چنين","كه","گيري","نيست","است","كجا","كند","نيز","يابد","بندي","حتي","توانند","عقب","خواست","كنند","بين","تمام","همه","ما","باشند","مثل","شد","اري","باشد","اره","طبق","بعد","اگر","صورت","غير","جاي","بيش","ريزي","اند","زيرا","چگونه","بار","لطفا","مي","درباره","من","ديده","همين","گذاري","برداري","علت","گذاشته","هم","فوق","نه","ها","شوند","اباد","همواره","هر","اول","خواهند","چهار","نام","امروز","مان","هاي","قبل","كنم","سعي","تازه","را","هستند","زير","جلوي","عنوان","بود"]
french
["ô","être","vu","vous","votre","un","tu","toute","tout","tous","toi","tiens","tes","suivant","soit","soi","sinon","siennes","si","se","sauf","s","quoi","vers","qui","quels","ton","quelle","quoique","quand","près","pourquoi","plus","à","pendant","partant","outre","on","nous","notre","nos","tienne","ses","non","qu","ni","ne","mêmes","même","moyennant","mon","moins","va","sur","moi","miens","proche","miennes","mienne","tien","mien","n","malgré","quelles","plein","mais","là","revoilà","lui","leurs","","toutes","le","où","la","l","jusque","jusqu","ils","hélas","ou","hormis","laquelle","il","eu","nôtre","etc","est","environ","une","entre","en","son","elles","elle","dès","durant","duquel","été","du","voici","par","dont","donc","voilà","hors","doit","plusieurs","diverses","diverse","divers","devra","devers","tiennes","dessus","etre","dessous","desquels","desquelles","ès","et","désormais","des","te","pas","derrière","depuis","delà","hui","dehors","sans","dedans","debout","vôtre","de","dans","nôtres","mes","d","y","vos","je","concernant","comme","comment","combien","lorsque","ci","ta","néanmoins","lequel","chez","contre","ceux","cette","j","cet","seront","que","ces","leur","certains","certaines","puisque","certaine","certain","passé","cependant","celui","lesquelles","celles","quel","celle","devant","cela","revoici","eux","ceci","sienne","merci","ce","c","siens","les","avoir","sous","avec","pour","parmi","avant","car","avait","sont","me","auxquels","sien","sa","excepté","auxquelles","aux","ma","autres","autre","aussi","auquel","aujourd","au","attendu","selon","après","ont","ainsi","ai","afin","vôtres","lesquels","a"]
dutch
["andere","uw","niets","wil","na","tegen","ons","wordt","werd","hier","eens","onder","alles","zelf","hun","dus","kan","ben","meer","iets","me","veel","omdat","zal","nog","altijd","ja","want","u","zonder","deze","hebben","wie","zij","heeft","hoe","nu","heb","naar","worden","haar","daar","der","je","doch","moet","tot","uit","bij","geweest","kon","ge","zich","wezen","ze","al","zo","dit","waren","men","mijn","kunnen","wat","zou","dan","hem","om","maar","ook","er","had","voor","of","als","reeds","door","met","over","aan","mij","was","is","geen","zijn","niet","iemand","het","hij","een","toen","in","toch","die","dat","te","doen","ik","van","op","en","de"]
russian
["а","без","более","бы","был","была","были","было","быть","в","вам","вас","весь","во","вот","все","всего","всех","вы","где","да","даже","для","до","его","ее","ей","ею","если","есть","еще","же","за","здесь","и","из","или","им","их","к","как","ко","когда","кто","ли","либо","мне","может","мы","на","надо","наш","не","него","нее","нет","ни","них","но","ну","о","об","однако","он","она","они","оно","от","очень","по","под","при","с","со","так","также","такой","там","те","тем","то","того","тоже","той","только","том","ты","у","уже","хотя","чего","чей","чем","что","чтобы","чье","чья","эта","эти","это","я"]