Analyzers are used to parse a document and split it into the words (tokens) that are saved to indexes. In most cases, you can use the built-in analyzers of TairSearch or create custom analyzers that suit your needs. This topic describes how to use TairSearch analyzers.
Workflow of a TairSearch analyzer
A TairSearch analyzer consists of character filters, a tokenizer, and token filters, which are applied sequentially. Character filters and token filters can be left empty. Description:
Character filter: preprocesses documents. You can configure zero or more character filters that run in the specified sequence for a TairSearch analyzer. For example, a character filter can replace "(:" with "happy".
Tokenizer: splits a document into multiple tokens. You can specify only a single tokenizer for each TairSearch analyzer. For example, you can use the whitespace tokenizer to split "I am very happy" into ["I", "am", "very", "happy"].
Token filter: processes the tokens that are generated by the specified tokenizer. You can configure zero or more token filters that run in the specified sequence for a TairSearch analyzer. For example, you can use the stop token filter to filter out stop words. A minimal sketch of the complete pipeline follows this list.
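The following Python sketch simulates this pipeline with a mapping character filter, a whitespace tokenizer, and lowercase and stop token filters. It is an illustration of the workflow only, not the TairSearch implementation, and the sample mappings and stop words are chosen for the example.
# Simulation of the analyzer pipeline (illustration only, not the TairSearch implementation).
CHAR_MAPPINGS = {"(:": "happy"}   # character filter: replacement rules (sample values)
STOPWORDS = {"i", "am", "very"}   # token filter: stop words to drop (sample values)

def analyze(document):
    # 1. Character filters: preprocess the raw document.
    for key, value in CHAR_MAPPINGS.items():
        document = document.replace(key, value)
    # 2. Tokenizer: split the document into tokens (whitespace tokenizer).
    tokens = document.split()
    # 3. Token filters: convert to lowercase, then remove stop words.
    tokens = [token.lower() for token in tokens]
    return [token for token in tokens if token not in STOPWORDS]

print(analyze("I am very (:"))    # ['happy']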
Built-in analyzers
Standard
The standard analyzer splits a document based on the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29, converts tokens to lowercase letters, and filters out stop words. The analyzer works well for most languages.
Components:
Character filter: none. The absence of a component indicates that the analyzer does not use it.
Optional parameters:
stopwords: the stop words to be filtered out. Data type: ARRAY. Each stop word must be a string. After you specify this parameter, the default stop words are overwritten. Default stop words:
["a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the", "their", "then", "there", "these", "they", "this", "to", "was", "will", "with"]
max_token_length: the maximum number of characters that are allowed for a token. Default value: 255. Tokens that exceed the maximum length are split based on the specified maximum length.
Configuration example:
# Default configuration:
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"standard"
}
}
}
}
# Use of custom stop words:
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"my_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_analyzer":{
"type":"standard",
"max_token_length":10,
"stopwords":[
"memory",
"disk",
"is",
"a"
]
}
}
}
}
}
Stop
The stop analyzer splits a document into tokens at any non-letter character, converts tokens to lowercase letters, and filters out stop words.
Optional parameters:
stopwords: the stop words to be filtered out. Data type: ARRAY. Each stop word must be a string. After you specify this parameter, the default stop words are overwritten. Default stop words:
["a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the", "their", "then", "there", "these", "they", "this", "to", "was", "will", "with"]
Configuration example:
# Default configuration:
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"stop"
}
}
}
}
# Use of custom stop words:
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"my_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_analyzer":{
"type":"stop",
"stopwords":[
"memory",
"disk",
"is",
"a"
]
}
}
}
}
}
Jieba
The jieba analyzer is recommended for documents in Chinese. It splits a document based on a trained or specified dictionary, converts English tokens to lowercase letters, and filters out stop words.
Optional parameters:
userwords: a dictionary of user-defined words. Data type: ARRAY. Each word must be a string. After you specify this parameter, the user-defined words are added to the default dictionary. For more information, see the default dictionary of jieba.
Important: The jieba analyzer has a large built-in dictionary that is 20 MB in size. Only a single copy of this dictionary is retained in memory, and the dictionary is loaded only when jieba is used for the first time. This may cause a slight jitter in latency when jieba is first used.
Words in the custom dictionary cannot contain spaces or the following special characters: \t, \n, , (comma), and 。(Chinese full stop).
use_hmm: specifies whether to use a hidden Markov model (HMM) to handle words that are not included in the dictionary. Valid values: true and false. Default value: true.
stopwords: the stop words to be filtered out. Data type: ARRAY. Each stop word must be a string. After you specify this parameter, the default stop words are overwritten. For more information, see the default stop words of jieba.
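The effect of use_hmm can be previewed with the open-source jieba Python package, which uses the same dictionary-plus-HMM approach. The sketch below is an illustration only and assumes that the jieba package is installed; it is not a TairSearch API.
# Illustration of HMM-based handling of out-of-vocabulary words with the
# open-source jieba package (pip install jieba). Not a TairSearch API.
import jieba

text = "他来到了网易杭研大厦"   # "杭研" is not in the default dictionary

# With the HMM enabled, unseen words such as "杭研" can still form a token.
print(list(jieba.cut(text, HMM=True)))
# With the HMM disabled, characters that are not in the dictionary stay separate.
print(list(jieba.cut(text, HMM=False)))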
Configuration example:
# Default configuration:
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"jieba"
}
}
}
}
# Use of custom stop words:
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"my_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_analyzer":{
"type":"jieba",
"stopwords":[
"memory",
"disk",
"is",
"a"
],"userwords":[
"Redis",
"open-source",
"flexible"
],
"use_hmm":true
}
}
}
}
}
IK
The IK analyzer is used for documents in Chinese and is compatible with the IK analyzer plug-in of Alibaba Cloud Elasticsearch. IK supports the ik_max_word and ik_smart modes. In ik_max_word mode, IK identifies all possible tokens. In ik_smart mode, IK filters the results of the ik_max_word mode to return the most likely tokens.
Components:
Tokenizer: IK
Optional parameters:
stopwords: the stop words to be filtered out. Data type: ARRAY. Each stop word must be a string. After you specify this parameter, the default stop words are overwritten. Default stop words:
["a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the", "their", "then", "there", "these", "they", "this", "to", "was", "will", "with"]
userwords: a dictionary of user-defined words. Data type: ARRAY. Each word must be a string. After you specify this parameter, the user-defined words are added to the default dictionary. For more information, see the default dictionary of IK.
quantifiers: a dictionary of user-defined quantifiers. Data type: ARRAY. After you specify this parameter, the user-defined quantifiers are added to the default dictionary. For more information, see the default quantifier dictionary of IK.
enable_lowercase: specifies whether to convert uppercase letters to lowercase letters. Valid values: true and false. Default value: true.
Important: If the custom dictionary contains uppercase letters, set this parameter to false because the lowercase conversion is performed before the document is split.
Configuration example:
# Default configuration:
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"ik_smart"
},
"f1":{
"type":"text",
"analyzer":"ik_max_word"
}
}
}
}
# Use of custom stop words:
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"my_ik_smart_analyzer"
},
"f1":{
"type":"text",
"analyzer":"my_ik_max_word_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_ik_smart_analyzer":{
"type":"ik_smart",
"stopwords":[
"memory",
"disk",
"is",
"a"
],"userwords":[
"Redis",
"open-source",
"flexible"
],
"quantifiers":[
"ns"
],
"enable_lowercase":false
},
"my_ik_max_word_analyzer":{
"type":"ik_max_word",
"stopwords":[
"memory",
"disk",
"is",
"a"
],"userwords":[
"Redis",
"open-source",
"flexible"
],
"quantifiers":[
"ns"
],
"enable_lowercase":false
}
}
}
}
}
Pattern
The pattern analyzer splits a document based on the specified regular expression. The text that matches the regular expression is used as a delimiter. For example, if the "aaa" regular expression is used to split "bbbaaaccc", the results are "bbb" and "ccc". You can also set the lowercase parameter to convert tokens to lowercase letters and specify stop words to be filtered out. A short sketch of this splitting behavior follows the parameter list.
Optional parameters:
pattern: the regular expression. The text that matches the regular expression is used as a delimiter. Default value: \W+. For more information about the syntax of regular expressions, visit GitHub.
stopwords: the stop words to be filtered out. The dictionary of stop words must be an array, and each stop word must be a string. After you specify stop words, the default stop words are overwritten. Default stop words:
["a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the", "their", "then", "there", "these", "they", "this", "to", "was", "will", "with"]
lowercase: specifies whether to convert tokens to lowercase letters. Valid values: true and false. Default value: true.
flags: specifies whether the regular expression is case-sensitive. By default, this parameter is left empty, which indicates that the regular expression is case-sensitive. A value of CASE_INSENSITIVE indicates that the regular expression is case-insensitive.
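The delimiter semantics can be previewed with Python's re module, as in the following sketch. This is an illustration only; TairSearch uses the regular-expression engine referenced above, so verify complex patterns against that engine.
# Preview of pattern-analyzer splitting (illustration only, using Python's re module).
import re

# The matched text acts as a delimiter: "aaa" splits "bbbaaaccc" into two tokens.
print(re.split(r"aaa", "bbbaaaccc"))                       # ['bbb', 'ccc']

# The default pattern \W+ splits on runs of non-word characters.
print(re.split(r"\W+", "Tair-Search, by Alibaba"))         # ['Tair', 'Search', 'by', 'Alibaba']

# CASE_INSENSITIVE corresponds to case-insensitive matching.
print(re.split(r"aaa", "bbbAAAccc", flags=re.IGNORECASE))  # ['bbb', 'ccc']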
Configuration example:
# Default configuration:
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"pattern"
}
}
}
}
# Use of custom stop words:
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"my_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_analyzer":{
"type":"pattern",
"pattern":"\\'([^\\']+)\\'",
"stopwords":[
"aaa",
"@"
],
"lowercase":false,
"flags":"CASE_INSENSITIVE"
}
}
}
}
}
Whitespace
The whitespace analyzer splits a document into tokens whenever it encounters a whitespace character.
Components:
Tokenizer: whitespace
Optional parameters: none
Configuration example:
# Default configuration:
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"whitespace"
}
}
}
}
Simple
The simple analyzer splits a document into tokens at any non-letter character and converts tokens to lowercase letters.
Components:
Tokenizer: lowercase
Optional parameters: none
Configuration example:
# Default configuration:
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"simple"
}
}
}
}
Keyword
The keyword analyzer converts a document to a token without splitting the document.
Components:
Tokenizer: keyword
Optional parameters: none
Configuration example:
# Default configuration:
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"keyword"
}
}
}
}
Language
The language analyzer is available for the following languages: chinese, arabic, cjk, brazilian, czech, german, greek, persian, french, dutch, and russian.
Optional parameters:
stopwords: the stop words to be filtered out. The dictionary of stop words must be an array, and each stop word must be a string. After you specify stop words, the default stop words are overwritten. For more information about the default stop words for different languages, see the Appendix 4: Default stop words of the built-in language analyzer for different languages section of this topic.
Note: You cannot modify the stop words of the chinese analyzer.
stem_exclusion: the words whose stems are not extracted. For example, if you extract the stem of "apples", the result is "apple". By default, this parameter is left empty. The value of the stem_exclusion parameter must be an array, and each word must be a string.
Note: This parameter is supported only for the brazilian, german, french, and dutch analyzers.
Configuration example:
# Default configuration:
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"arabic"
}
}
}
}
# Use of custom stop words:
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"my_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_analyzer":{
"type":"german",
"stopwords":[
"ein"
],
"stem_exclusion":[
"speicher"
]
}
}
}
}
}
Custom analyzers
A custom TairSearch analyzer is defined as a combination of character filters, a tokenizer, and token filters, which are applied sequentially. You can specify the corresponding char_filter, tokenizer, and filter parameters based on your needs.
Method: Set the analyzer parameter in properties to the name of a custom analyzer, such as my_custom_analyzer, and then configure the my_custom_analyzer analyzer in settings.
The following list describes the parameters.
type: the type of the analyzer. This parameter is required and must be set to custom.
char_filter: the character filters that preprocess documents. This parameter is optional and can be set only to mapping. By default, this parameter is left empty, which indicates that TairSearch does not preprocess documents. For the fields of the mapping character filter, see the Appendix 1: Supported character filters section of this topic.
tokenizer: the tokenizer. This parameter is required. You can specify only a single tokenizer. Valid values: whitespace, lowercase, standard, classic, letter, keyword, jieba, pattern, ik_max_word, and ik_smart. For more information, see the Appendix 2: Supported tokenizers section of this topic.
filter: the token filters that process the generated tokens, for example, by converting them to lowercase letters or filtering out stop words. This parameter is optional, and you can specify multiple token filters. By default, this parameter is left empty, which indicates that TairSearch does not process the tokens. Valid values: classic, elision, lowercase, snowball, stop, asciifolding, length, arabic_normalization, and persian_normalization. For more information, see the Appendix 3: Supported token filters section of this topic.
Configuration example:
# Configure the custom analyzer:
# In this example, emoticons and conjunctions are specified as the character filters. In addition, the whitespace tokenizer and the lowercase and stop token filters are specified.
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"my_custom_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_custom_analyzer":{
"type":"custom",
"tokenizer":"whitespace",
"filter":[
"lowercase",
"stop"
],
"char_filter": [
"emoticons",
"conjunctions"
]
}
},
"char_filter":{
"emoticons":{
"type":"mapping",
"mappings":[
":) => _happy_",
":( => _sad_"
]
},
"conjunctions":{
"type":"mapping",
"mappings":[
"&=>and"
]
}
}
}
}
}
Appendix 1: Supported character filters
Mapping Character Filter
You can configure key-value pairs in mappings. When a key is identified in a document, it is replaced with the corresponding value. For example, ":) => _happy_" indicates that ":)" is replaced with "_happy_". You can specify multiple character filters.
Parameters:
mappings: the mapping rules. This parameter is required. Data type: ARRAY. Each element must include =>. Example: "&=>and".
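A mapping character filter is essentially a sequence of string replacements. The following Python sketch illustrates the rule format; it is not the TairSearch implementation.
# Illustration of a mapping character filter (not the TairSearch implementation).
MAPPINGS = [":) => _happy_", ":( => _sad_", "&=>and"]

def apply_mappings(text, mappings):
    for rule in mappings:
        # Each rule has the form "key => value"; whitespace around both parts is ignored.
        key, value = (part.strip() for part in rule.split("=>", 1))
        text = text.replace(key, value)
    return text

print(apply_mappings("Tom & Jerry :)", MAPPINGS))   # Tom and Jerry _happy_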
Configuration example:
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"my_custom_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_custom_analyzer":{
"type":"custom",
"tokenizer":"standard",
"char_filter": [
"emoticons"
]
}
},
"char_filter":{
"emoticons":{
"type":"mapping",
"mappings":[
":) => _happy_",
":( => _sad_"
]
}
}
}
}
}
Appendix 2: Supported tokenizers
whitespace
The whitespace tokenizer splits a document into tokens whenever it encounters a whitespace character.
Optional parameters:
max_token_length: the maximum number of characters that are allowed for a token. Default value: 255. Tokens that exceed the maximum length are split based on the specified maximum length.
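The following Python sketch illustrates how an overlong token is handled under max_token_length: it is cut into consecutive chunks of at most the configured length. This is an illustration only, not the TairSearch implementation.
# Illustration of whitespace tokenization with max_token_length (not a TairSearch API).
def whitespace_tokenize(text, max_token_length=255):
    tokens = []
    for word in text.split():
        # Cut tokens that exceed the limit into chunks of at most max_token_length.
        for start in range(0, len(word), max_token_length):
            tokens.append(word[start:start + max_token_length])
    return tokens

print(whitespace_tokenize("redis is blazingly fast", max_token_length=2))
# ['re', 'di', 's', 'is', 'bl', 'az', 'in', 'gl', 'y', 'fa', 'st']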
Configuration example:
# Default configuration:
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"my_custom_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_custom_analyzer":{
"type":"custom",
"tokenizer":"whitespace"
}
}
}
}
}
# Custom configuration:
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"my_custom_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_custom_analyzer":{
"type":"custom",
"tokenizer":"token1"
}
},
"tokenizer":{
"token1":{
"type":"whitespace",
"max_token_length":2
}
}
}
}
}
standard
The standard tokenizer splits a document based on the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. The tokenizer works well for most languages.
Optional parameters:
max_token_length: the maximum number of characters that are allowed for a token. Default value: 255. Tokens that exceed the maximum length are split based on the specified maximum length.
Configuration example:
# Default configuration:
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"my_custom_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_custom_analyzer":{
"type":"custom",
"tokenizer":"standard"
}
}
}
}
}
# Custom configuration:
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"my_custom_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_custom_analyzer":{
"type":"custom",
"tokenizer":"token1"
}
},
"tokenizer":{
"token1":{
"type":"standard",
"max_token_length":2
}
}
}
}
}
classic
The classic tokenizer splits a document based on English grammar and handles acronyms, company names, email addresses, and IP addresses in a special way:
Splits a document by punctuation and removes the punctuation marks. A period (.) that is not followed by a whitespace is not treated as punctuation. For example, red.apple is not split, whereas red. apple is split into red and apple.
Splits a document by hyphens. If a token contains numbers, the token is interpreted as a product number and is not split.
Identifies email addresses and hostnames as single tokens.
Optional parameters:
max_token_length: the maximum number of characters that are allowed for a token. Default value: 255. Tokens that exceed the maximum length are skipped.
Configuration example:
# Default configuration:
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"my_custom_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_custom_analyzer":{
"type":"custom",
"tokenizer":"classic"
}
}
}
}
}
# Custom configuration:
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"my_custom_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_custom_analyzer":{
"type":"custom",
"tokenizer":"token1"
}
},
"tokenizer":{
"token1":{
"type":"classic",
"max_token_length":2
}
}
}
}
}
letter
The letter tokenizer splits a document into tokens at any non-letter character and works well for European languages.
Configuration example:
# Default configuration:
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"my_custom_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_custom_analyzer":{
"type":"custom",
"tokenizer":"letter"
}
}
}
}
}
lowercase
The lowercase tokenizer splits a document into tokens at any non-letter character and converts all tokens to lowercase letters. The output of the lowercase tokenizer is the same as that of the letter tokenizer combined with the lowercase token filter. However, the lowercase tokenizer traverses the document only once, which makes it faster.
Configuration example:
# Default configuration:
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"my_custom_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_custom_analyzer":{
"type":"custom",
"tokenizer":"lowercase"
}
}
}
}
}
keyword
The keyword tokenizer converts a document to a token without splitting the document. Typically, the keyword tokenizer is used together with a token filter such as the lowercase token filter to convert documents to lowercase letters.
Configuration example:
# Default configuration:
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"my_custom_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_custom_analyzer":{
"type":"custom",
"tokenizer":"keyword"
}
}
}
}
}
jieba
The jieba tokenizer is recommended for the Chinese language. It splits a document based on a trained or specified dictionary.
Optional parameters:
userwords: a dictionary of user-defined words. Data type: ARRAY. Each word must be a string. After you specify this parameter, the user-defined words are added to the default dictionary. For more information, see the default dictionary of jieba.
Important: Words in the custom dictionary cannot contain spaces or the following special characters: \t, \n, , (comma), and 。(Chinese full stop).
use_hmm: specifies whether to use a hidden Markov model (HMM) to handle words that are not included in the dictionary. Valid values: true and false. Default value: true.
Configuration example:
# Default configuration:
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"my_custom_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_custom_analyzer":{
"type":"custom",
"tokenizer":"jieba"
}
}
}
}
}
# Custom configuration:
{
"mappings":{
"properties":{
"f1":{
"type":"text",
"analyzer":"my_custom_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_custom_analyzer":{
"type":"custom",
"tokenizer":"token1"
}
},
"tokenizer":{
"token1":{
"type":"jieba",
"userwords":[
"Redis",
"open-source",
"flexible"
],
"use_hmm":true
}
}
}
}
}
pattern
The pattern tokenizer splits a document based on the specified regular expression. The words that are matched by the regular expression are used as delimiters or identified as tokens.
Optional parameters:
pattern: the regular expression. Default value: \W+. For more information, visit GitHub.
group: specifies whether the text that matches the regular expression is used as a delimiter or as a token. Default value: -1. Valid values:
-1: the matched text is used as a delimiter. For example, if you use the "aaa" regular expression to split "bbbaaaccc", the results are "bbb" and "ccc".
0 or an integer greater than 0: the matched text is identified as tokens. A value of 0 indicates that the whole regular expression is used to match tokens. A value of 1 or greater indicates that the corresponding capture group in the regular expression is used. For example, assume that the "a(b+)c" regular expression is used to split "abbbcdefabc". If group is set to 0, the results are "abbbc" and "abc". If group is set to 1, the first capture group b+ in "a(b+)c" is used to match tokens, and the results are "bbb" and "b". A sketch of these group semantics follows the parameter list.
flags: specifies whether the specified regular expression is case-sensitive. By default, this parameter is left empty, which indicates that the regular expression is case-sensitive. A value of CASE_INSENSITIVE indicates that the specified regular expression is case-insensitive.
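The group values map directly onto standard capture-group semantics, as the following Python sketch shows. This is an illustration only; TairSearch uses the regular-expression engine referenced above.
# Illustration of the pattern tokenizer's group parameter (not a TairSearch API).
import re

# group = -1: the matched text is used as a delimiter.
print(re.split(r"aaa", "bbbaaaccc"))                        # ['bbb', 'ccc']

text, pattern = "abbbcdefabc", r"a(b+)c"
# group = 0: every whole match becomes a token.
print([m.group(0) for m in re.finditer(pattern, text)])     # ['abbbc', 'abc']
# group = 1: the first capture group of every match becomes a token.
print([m.group(1) for m in re.finditer(pattern, text)])     # ['bbb', 'b']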
Configuration example:
# Default configuration:
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"my_custom_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_custom_analyzer":{
"type":"custom",
"tokenizer":"pattern"
}
}
}
}
}
# Custom configuration:
{
"mappings":{
"properties":{
"f1":{
"type":"text",
"analyzer":"my_custom_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_custom_analyzer":{
"type":"custom",
"tokenizer":"pattern_tokenizer"
}
},
"tokenizer":{
"pattern_tokenizer":{
"type":"pattern",
"pattern":"AB(A(\\w+)C)",
"flags":"CASE_INSENSITIVE",
"group":2
}
}
}
}
}
IK
The IK tokenizer splits documents in Chinese. IK supports the ik_max_word and ik_smart modes. In ik_max_word mode, IK identifies all possible tokens. In ik_smart mode, IK filters the results of the ik_max_word mode to identify the most possible tokens.
Optional parameters:
stopwords: the stop words to be filtered out. Data type: ARRAY. Each stop word must be a string. After you specify this parameter, the default stop words are overwritten. Default stop words:
["a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the", "their", "then", "there", "these", "they", "this", "to", "was", "will", "with"]
userwords: a dictionary of user-defined words. Data type: ARRAY. Each word must be a string. After you specify this parameter, the user-defined words are added to the default dictionary. For more information, see the default dictionary of IK.
quantifiers: a dictionary of user-defined quantifiers. Data type: ARRAY. After you specify this parameter, the user-defined quantifiers are added to the default dictionary. For more information, see the default quantifier dictionary of IK.
enable_lowercase: specifies whether to convert uppercase letters to lowercase letters. Valid values: true and false. Default value: true.
Important: If the custom dictionary contains uppercase letters, set this parameter to false because the lowercase conversion is performed before the document is split.
Configuration example:
# Default configuration:
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"my_custom_ik_smart_analyzer"
},
"f1":{
"type":"text",
"analyzer":"my_custom_ik_max_word_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_custom_ik_smart_analyzer":{
"type":"custom",
"tokenizer":"ik_smart"
},
"my_custom_ik_max_word_analyzer":{
"type":"custom",
"tokenizer":"ik_max_word"
}
}
}
}
}
# Custom configuration:
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"my_custom_ik_smart_analyzer"
},
"f1":{
"type":"text",
"analyzer":"my_custom_ik_max_word_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_custom_ik_smart_analyzer":{
"type":"custom",
"tokenizer":"my_ik_smart_tokenizer"
},
"my_custom_ik_max_word_analyzer":{
"type":"custom",
"tokenizer":"my_ik_max_word_tokenizer"
}
},
"tokenizer":{
"my_ik_smart_tokenizer":{
"type":"ik_smart",
"userwords":[
"The tokenizer for the Chinese language",
"The custom stop words"
],
"stopwords":[
"about",
"test"
],
"quantifiers":[
"ns"
],
"enable_lowercase":false
},
"my_ik_max_word_tokenizer":{
"type":"ik_max_word",
"userwords":[
"The tokenizer for the Chinese language",
"The custom stop words"
],
"stopwords":[
"about",
"test"
],
"quantifiers":[
"ns"
],
"enable_lowercase":false
}
}
}
}
}
Appendix 3: Supported token filters
classic
The classic token filter removes the possessive 's from the end of tokens and removes periods (.) from acronyms. For example, Fig. is converted to Fig.
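The following Python sketch is a rough approximation of these two rules, for illustration only; the actual filter is more selective about which tokens it modifies.
# Rough approximation of the classic token filter (not the TairSearch implementation).
def classic_filter(token):
    if token.endswith("'s"):
        token = token[:-2]            # drop the trailing possessive 's
    return token.replace(".", "")     # drop periods, for example Fig. -> Fig

print([classic_filter(t) for t in ["Alibaba's", "Fig.", "I.B.M."]])
# ['Alibaba', 'Fig', 'IBM']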
Configuration example:
# Default configuration:
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"my_custom_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_custom_analyzer":{
"type":"custom",
"tokenizer":"classic",
"filter":["classic"]
}
}
}
}
}
elision
The elision token filter removes specified elisions from the beginning of tokens. This filter primarily applies to the French language.
Optional parameters:
articles: the specified elisions. This parameter is required if you want to specify custom elisions. Data type: ARRAY. Each element in the array must be a string. Default value: ["l", "m", "t", "qu", "n", "s", "j"]. After you specify this parameter, the default value is overwritten.
articles_case: specifies whether the elisions are case-sensitive. This parameter is optional. Valid values: true and false. Default value: false.
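In French, an elision is an article attached to the following word with an apostrophe, for example l'avion. The following Python sketch illustrates the removal; it is an illustration only, not the TairSearch implementation.
# Illustration of the elision token filter (not the TairSearch implementation).
ARTICLES = ("l", "m", "t", "qu", "n", "s", "j")

def strip_elisions(tokens, articles=ARTICLES):
    result = []
    for token in tokens:
        head, sep, tail = token.partition("'")
        # Drop the leading article when the part before the apostrophe is a configured elision.
        result.append(tail if sep and head.lower() in articles else token)
    return result

print(strip_elisions(["l'avion", "j'aime", "Paris"]))   # ['avion', 'aime', 'Paris']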
Configuration example:
# Default configuration:
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"my_custom_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_custom_analyzer":{
"type":"custom",
"tokenizer":"whitespace",
"filter":["elision"]
}
}
}
}
}
# Custom configuration:
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"my_custom_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_custom_analyzer":{
"type":"custom",
"tokenizer":"whitespace",
"filter":["elision_filter"]
}
},
"filter":{
"elision_filter":{
"type":"elision",
"articles":["l", "m", "t", "qu", "n", "s", "j"],
"articles_case":true
}
}
}
}
}
lowercase
The lowercase token filter converts tokens to lowercase letters.
Optional parameters:
language: the language that the token filter uses. Valid values: greek and russian. If you do not specify this parameter, the token filter uses English.
Configuration example:
# Default configuration:
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"my_custom_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_custom_analyzer":{
"type":"custom",
"tokenizer":"whitespace",
"filter":["lowercase"]
}
}
}
}
}
# Custom configuration:
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"my_custom_greek_analyzer"
},
"f1":{
"type":"text",
"analyzer":"my_custom_russian_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_custom_greek_analyzer":{
"type":"custom",
"tokenizer":"whitespace",
"filter":["greek_lowercase"]
},
"my_custom_russian_analyzer":{
"type":"custom",
"tokenizer":"whitespace",
"filter":["russian_lowercase"]
}
},
"filter":{
"greek_lowercase":{
"type":"lowercase",
"language":"greek"
},
"russian_lowercase":{
"type":"lowercase",
"language":"russian"
}
}
}
}
}
snowball
The snowball token filter extracts the stem of each token. For example, the token filter extracts cat from cats.
Optional parameters:
language: the language that the token filter uses. Valid values: english, german, french, and dutch. Default value: english.
Configuration example:
# Default configuration:
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"my_custom_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_custom_analyzer":{
"type":"custom",
"tokenizer":"whitespace",
"filter":["snowball"]
}
}
}
}
}
# Custom configuration:
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"my_custom_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_custom_analyzer":{
"type":"custom",
"tokenizer":"standard",
"filter":["my_filter"]
}
},
"filter":{
"my_filter":{
"type":"snowball",
"language":"english"
}
}
}
}
}
stop
The stop token filter removes stop words from tokens based on the specified array of stop words.
Optional parameters:
stopwords: the array of stop words. Each stop word must be a string. After you specify this parameter, the default stop words are overwritten. Default stop words:
["a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the", "their", "then", "there", "these", "they", "this", "to", "was", "will", "with"]
ignore_case: specifies whether to ignore case when matching stop words. Valid values: true and false. Default value: false.
Configuration example:
# Default configuration:
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"my_custom_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_custom_analyzer":{
"type":"custom",
"tokenizer":"whitespace",
"filter":["stop"]
}
}
}
}
}
# Custom configuration:
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"my_custom_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_custom_analyzer":{
"type":"custom",
"tokenizer":"standard",
"filter":["stop_filter"]
}
},
"filter":{
"stop_filter":{
"type":"stop",
"stopwords":[
"the"
],
"ignore_case":true
}
}
}
}
}
asciifolding
The asciifolding token filter converts alphabetic, numeric, and symbolic characters that are not included in the Basic Latin Unicode block to their ASCII equivalents. For example, this token filter converts é to e.
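A rough Python approximation of the folding uses Unicode decomposition, as shown below. This is an illustration only; the actual filter covers many more character mappings.
# Rough approximation of the asciifolding token filter (not the TairSearch implementation).
import unicodedata

def ascii_fold(token):
    # Decompose accented characters and drop the non-ASCII combining marks.
    decomposed = unicodedata.normalize("NFKD", token)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(ascii_fold("café"), ascii_fold("Ångström"))   # cafe Angstrom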
Configuration example:
# Default configuration:
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"my_custom_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_custom_analyzer":{
"type":"custom",
"tokenizer":"standard",
"filter":["asciifolding"]
}
}
}
}
}
length
The length token filter removes tokens shorter or longer than specified character lengths.
Optional parameters:
min: the minimum number of characters allowed for a token. Data type: INTEGER. Default value: 0.
max: the maximum number of characters allowed for a token. Data type: INTEGER. Default value: 2^31 - 1.
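The filtering rule can be summarized in a few lines of Python, for illustration only; this is not the TairSearch implementation.
# Illustration of the length token filter (not the TairSearch implementation).
def length_filter(tokens, min_len=0, max_len=2**31 - 1):
    # Keep only tokens whose length is within [min_len, max_len].
    return [token for token in tokens if min_len <= len(token) <= max_len]

print(length_filter(["a", "redis", "tair", "in-memory"], min_len=2, max_len=5))
# ['redis', 'tair']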
Configuration example:
# Default configuration:
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"my_custom_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_custom_analyzer":{
"type":"custom",
"tokenizer":"whitespace",
"filter":["length"]
}
}
}
}
}
# Custom configuration:
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"my_custom_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_custom_analyzer":{
"type":"custom",
"tokenizer":"whitespace",
"filter":["length_filter"]
}
},
"filter":{
"length_filter":{
"type":"length",
"max":5,
"min":2
}
}
}
}
}
Normalization
The normalization token filters normalize specific characters of a specific language. Valid values: arabic_normalization and persian_normalization. We recommend that you use these token filters together with the standard tokenizer.
Configuration example:
# Default configuration:
{
"mappings":{
"properties":{
"f0":{
"type":"text",
"analyzer":"my_arabic_analyzer"
},
"f1":{
"type":"text",
"analyzer":"my_persian_analyzer"
}
}
},
"settings":{
"analysis":{
"analyzer":{
"my_arabic_analyzer":{
"type":"custom",
"tokenizer":"arabic",
"filter":["arabic_normalization"]
},
"my_persian_analyzer":{
"type":"custom",
"tokenizer":"arabic",
"filter":["persian_normalization"]
}
}
}
}
}
Appendix 4: Default stop words of the built-in language analyzer for different languages
arabic
["من","ومن","منها","منه","في","وفي","فيها","فيه","و","ف","ثم","او","أو","ب","بها","به","ا","أ","اى","اي","أي","أى","لا","ولا","الا","ألا","إلا","لكن","ما","وما","كما","فما","عن","مع","اذا","إذا","ان","أن","إن","انها","أنها","إنها","انه","أنه","إنه","بان","بأن","فان","فأن","وان","وأن","وإن","التى","التي","الذى","الذي","الذين","الى","الي","إلى","إلي","على","عليها","عليه","اما","أما","إما","ايضا","أيضا","كل","وكل","لم","ولم","لن","ولن","هى","هي","هو","وهى","وهي","وهو","فهى","فهي","فهو","انت","أنت","لك","لها","له","هذه","هذا","تلك","ذلك","هناك","كانت","كان","يكون","تكون","وكانت","وكان","غير","بعض","قد","نحو","بين","بينما","منذ","ضمن","حيث","الان","الآن","خلال","بعد","قبل","حتى","عند","عندما","لدى","جميع"]
cjk
["with","will","to","this","there","then","the","t","that","such","s","on","not","no","it","www","was","is","","into","their","or","in","if","for","by","but","they","be","these","at","are","as","and","of","a"]
brazilian
["uns","umas","uma","teu","tambem","tal","suas","sobre","sob","seu","sendo","seja","sem","se","quem","tua","que","qualquer","porque","por","perante","pelos","pelo","outros","outro","outras","outra","os","o","nesse","nas","na","mesmos","mesmas","mesma","um","neste","menos","quais","mediante","proprio","logo","isto","isso","ha","estes","este","propios","estas","esta","todas","esses","essas","toda","entre","nos","entao","em","eles","qual","elas","tuas","ela","tudo","do","mesmo","diversas","todos","diversa","seus","dispoem","ou","dispoe","teus","deste","quer","desta","diversos","desde","quanto","depois","demais","quando","essa","deles","todo","pois","dele","dela","dos","de","da","nem","cujos","das","cujo","durante","cujas","portanto","cuja","contudo","ele","contra","como","com","pelas","assim","as","aqueles","mais","esse","aquele","mas","apos","aos","aonde","sua","e","ao","antes","nao","ambos","ambas","alem","ainda","a"]
czech
["a","s","k","o","i","u","v","z","dnes","cz","tímto","budeš","budem","byli","jseš","muj","svým","ta","tomto","tohle","tuto","tyto","jej","zda","proc","máte","tato","kam","tohoto","kdo","kterí","mi","nám","tom","tomuto","mít","nic","proto","kterou","byla","toho","protože","asi","ho","naši","napište","re","což","tím","takže","svých","její","svými","jste","aj","tu","tedy","teto","bylo","kde","ke","pravé","ji","nad","nejsou","ci","pod","téma","mezi","pres","ty","pak","vám","ani","když","však","neg","jsem","tento","clánku","clánky","aby","jsme","pred","pta","jejich","byl","ješte","až","bez","také","pouze","první","vaše","která","nás","nový","tipy","pokud","muže","strana","jeho","své","jiné","zprávy","nové","není","vás","jen","podle","zde","už","být","více","bude","již","než","který","by","které","co","nebo","ten","tak","má","pri","od","po","jsou","jak","další","ale","si","se","ve","to","jako","za","zpet","ze","do","pro","je","na","atd","atp","jakmile","pricemž","já","on","ona","ono","oni","ony","my","vy","jí","ji","me","mne","jemu","tomu","tem","temu","nemu","nemuž","jehož","jíž","jelikož","jež","jakož","nacež"]
german
["wegen","mir","mich","dich","dir","ihre","wird","sein","auf","durch","ihres","ist","aus","von","im","war","mit","ohne","oder","kein","wie","was","es","sie","mein","er","du","daß","dass","die","als","ihr","wir","der","für","das","einen","wer","einem","am","und","eines","eine","in","einer"]
greek
["ο","η","το","οι","τα","του","τησ","των","τον","την","και","κι","κ","ειμαι","εισαι","ειναι","ειμαστε","ειστε","στο","στον","στη","στην","μα","αλλα","απο","για","προσ","με","σε","ωσ","παρα","αντι","κατα","μετα","θα","να","δε","δεν","μη","μην","επι","ενω","εαν","αν","τοτε","που","πωσ","ποιοσ","ποια","ποιο","ποιοι","ποιεσ","ποιων","ποιουσ","αυτοσ","αυτη","αυτο","αυτοι","αυτων","αυτουσ","αυτεσ","αυτα","εκεινοσ","εκεινη","εκεινο","εκεινοι","εκεινεσ","εκεινα","εκεινων","εκεινουσ","οπωσ","ομωσ","ισωσ","οσο","οτι"]
persian
["انان","نداشته","سراسر","خياه","ايشان","وي","تاكنون","بيشتري","دوم","پس","ناشي","وگو","يا","داشتند","سپس","هنگام","هرگز","پنج","نشان","امسال","ديگر","گروهي","شدند","چطور","ده","و","دو","نخستين","ولي","چرا","چه","وسط","ه","كدام","قابل","يك","رفت","هفت","همچنين","در","هزار","بله","بلي","شايد","اما","شناسي","گرفته","دهد","داشته","دانست","داشتن","خواهيم","ميليارد","وقتيكه","امد","خواهد","جز","اورده","شده","بلكه","خدمات","شدن","برخي","نبود","بسياري","جلوگيري","حق","كردند","نوعي","بعري","نكرده","نظير","نبايد","بوده","بودن","داد","اورد","هست","جايي","شود","دنبال","داده","بايد","سابق","هيچ","همان","انجا","كمتر","كجاست","گردد","كسي","تر","مردم","تان","دادن","بودند","سري","جدا","ندارند","مگر","يكديگر","دارد","دهند","بنابراين","هنگامي","سمت","جا","انچه","خود","دادند","زياد","دارند","اثر","بدون","بهترين","بيشتر","البته","به","براساس","بيرون","كرد","بعضي","گرفت","توي","اي","ميليون","او","جريان","تول","بر","مانند","برابر","باشيم","مدتي","گويند","اكنون","تا","تنها","جديد","چند","بي","نشده","كردن","كردم","گويد","كرده","كنيم","نمي","نزد","روي","قصد","فقط","بالاي","ديگران","اين","ديروز","توسط","سوم","ايم","دانند","سوي","استفاده","شما","كنار","داريم","ساخته","طور","امده","رفته","نخست","بيست","نزديك","طي","كنيد","از","انها","تمامي","داشت","يكي","طريق","اش","چيست","روب","نمايد","گفت","چندين","چيزي","تواند","ام","ايا","با","ان","ايد","ترين","اينكه","ديگري","راه","هايي","بروز","همچنان","پاعين","كس","حدود","مختلف","مقابل","چيز","گيرد","ندارد","ضد","همچون","سازي","شان","مورد","باره","مرسي","خويش","برخوردار","چون","خارج","شش","هنوز","تحت","ضمن","هستيم","گفته","فكر","بسيار","پيش","براي","روزهاي","انكه","نخواهد","بالا","كل","وقتي","كي","چنين","كه","گيري","نيست","است","كجا","كند","نيز","يابد","بندي","حتي","توانند","عقب","خواست","كنند","بين","تمام","همه","ما","باشند","مثل","شد","اري","باشد","اره","طبق","بعد","اگر","صورت","غير","جاي","بيش","ريزي","اند","زيرا","چگونه","بار","لطفا","مي","درباره","من","ديده","همين","گذاري","برداري","علت","گذاشته","هم","فوق","نه","ها","شوند","اباد","همواره","هر","اول","خواهند","چهار","نام","امروز","مان","هاي","قبل","كنم","سعي","تازه","را","هستند","زير","جلوي","عنوان","بود"]
french
["ô","être","vu","vous","votre","un","tu","toute","tout","tous","toi","tiens","tes","suivant","soit","soi","sinon","siennes","si","se","sauf","s","quoi","vers","qui","quels","ton","quelle","quoique","quand","près","pourquoi","plus","à","pendant","partant","outre","on","nous","notre","nos","tienne","ses","non","qu","ni","ne","mêmes","même","moyennant","mon","moins","va","sur","moi","miens","proche","miennes","mienne","tien","mien","n","malgré","quelles","plein","mais","là","revoilà","lui","leurs","","toutes","le","où","la","l","jusque","jusqu","ils","hélas","ou","hormis","laquelle","il","eu","nôtre","etc","est","environ","une","entre","en","son","elles","elle","dès","durant","duquel","été","du","voici","par","dont","donc","voilà","hors","doit","plusieurs","diverses","diverse","divers","devra","devers","tiennes","dessus","etre","dessous","desquels","desquelles","ès","et","désormais","des","te","pas","derrière","depuis","delà","hui","dehors","sans","dedans","debout","vôtre","de","dans","nôtres","mes","d","y","vos","je","concernant","comme","comment","combien","lorsque","ci","ta","nບnmoins","lequel","chez","contre","ceux","cette","j","cet","seront","que","ces","leur","certains","certaines","puisque","certaine","certain","passé","cependant","celui","lesquelles","celles","quel","celle","devant","cela","revoici","eux","ceci","sienne","merci","ce","c","siens","les","avoir","sous","avec","pour","parmi","avant","car","avait","sont","me","auxquels","sien","sa","excepté","auxquelles","aux","ma","autres","autre","aussi","auquel","aujourd","au","attendu","selon","après","ont","ainsi","ai","afin","vôtres","lesquels","a"]
dutch
["andere","uw","niets","wil","na","tegen","ons","wordt","werd","hier","eens","onder","alles","zelf","hun","dus","kan","ben","meer","iets","me","veel","omdat","zal","nog","altijd","ja","want","u","zonder","deze","hebben","wie","zij","heeft","hoe","nu","heb","naar","worden","haar","daar","der","je","doch","moet","tot","uit","bij","geweest","kon","ge","zich","wezen","ze","al","zo","dit","waren","men","mijn","kunnen","wat","zou","dan","hem","om","maar","ook","er","had","voor","of","als","reeds","door","met","over","aan","mij","was","is","geen","zijn","niet","iemand","het","hij","een","toen","in","toch","die","dat","te","doen","ik","van","op","en","de"]
russian
["а","без","более","бы","был","была","были","было","быть","в","вам","вас","весь","во","вот","все","всего","всех","вы","где","да","даже","для","до","его","ее","ей","ею","если","есть","еще","же","за","здесь","и","из","или","им","их","к","как","ко","когда","кто","ли","либо","мне","может","мы","на","надо","наш","не","него","нее","нет","ни","них","но","ну","о","об","однако","он","она","они","оно","от","очень","по","под","при","с","со","так","также","такой","там","те","тем","то","того","тоже","той","только","том","ты","у","уже","хотя","чего","чей","чем","что","чтобы","чье","чья","эта","эти","это","я"]