The Application of Natural Language Processing in OpenSearch

This article examines common problems in building search engines through the NLP techniques used in OpenSearch.

By ELK Geek, with special guest, Xie Pengjun (Chengchen), Senior Algorithm Expert of Alibaba Cloud AI

Introduction: Building a search engine raises many search-quality optimization problems, and a large share of them relate to Natural Language Processing (NLP). This article examines these problems through the NLP techniques used in OpenSearch.

Natural Language Processing

Research on NLP aims to achieve effective communication between humans and computers through natural language. It is a science that integrates linguistics, psychology, computer science, mathematics, and statistics. It covers many topics, such as the analysis, extraction, understanding, conversion, and generation of natural languages and symbolic languages.

The Stages of AI

  • Computing Intelligence: The ability to outperform humans in some areas by relying on computing power and the capacity to store massive amounts of data. A representative example is AlphaGo from Google DeepMind. Backed by the computing power of Google TPUs and a combination of algorithms such as Monte Carlo Tree Search (MCTS) and reinforcement learning, AlphaGo makes good decisions by processing massive amounts of information about the game of Go, and so it outperforms humans in computational ability.
  • Perceptual Intelligence: The ability to identify important elements in unstructured data. For example, it can analyze a query to identify information such as people's names, places, and institutions.
  • Cognitive Intelligence: Building on perceptual intelligence, cognitive intelligence understands the meaning of those elements and makes inferences from them. For example, the Chinese sentences "谢霆锋是谁的儿子" ("Whose son is Nicholas Tse?") and "谁是谢霆锋的儿子" ("Who is Nicholas Tse's son?") contain exactly the same characters, but their meanings differ. Telling them apart is the kind of problem cognitive intelligence aims to solve.
  • Creative Intelligence: The ability of computers to create sentences that conform to common sense, semantics, and logic, based on an understanding of semantics. For example, computers can automatically write novels, compose music, and chat naturally with people.
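
The two example sentences in the Cognitive Intelligence item can be checked with a few lines of Python. This is only an illustrative sketch (the helper name `bag_of_chars` is made up here): it shows that a representation that ignores character order cannot tell the two questions apart, which is why order-aware understanding is needed.

```python
from collections import Counter

q1 = "谢霆锋是谁的儿子"  # "Whose son is Nicholas Tse?"
q2 = "谁是谢霆锋的儿子"  # "Who is Nicholas Tse's son?"


def bag_of_chars(text: str) -> Counter:
    """Order-insensitive representation: character -> number of occurrences."""
    return Counter(text)


# The two queries are identical as bags of characters ...
same_bag = bag_of_chars(q1) == bag_of_chars(q2)
# ... but they are different strings with different meanings.
same_text = q1 == q2

print(same_bag, same_text)
```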

NLP research spans all of the stages above, which makes NLP essential to achieving comprehensive AI.

The Development Trend of NLP

  1. Breakthroughs in deep language models will drive progress in important natural language technologies.
  2. NLP services on public clouds will evolve from general-purpose functions to customized services.
  3. Natural language technologies will gradually become closely integrated with industries and scenarios to create greater value.

The Capabilities of Alibaba Group's NLP Platform

[Figure 1]

From bottom to top, the capabilities of the NLP platform are divided into NLP data, basic NLP capabilities, NLP application technologies, and high-level applications.

NLP data, which includes language dictionaries, entity knowledge dictionaries, syntactic dictionaries, and sentiment analysis dictionaries, is the basis for many algorithms. Basic NLP technologies include lexical analysis, syntactic analysis, text analysis, and deep models. On top of these sit vertical NLP technologies, including Q&A and conversation technologies, anti-spam technology, and address resolution. Together, these technologies support many applications, and OpenSearch is one that draws heavily on NLP capabilities.

Applications and Typical NLP Technologies in OpenSearch

[Figure 2]

  • The infrastructure of OpenSearch includes Alibaba Cloud's basic products and proprietary search systems, such as HA3, RTP, and Dii, that were built for the search scenarios of Alibaba Cloud's ecosystem.
  • The basic management platform handles the collection, management, and training of offline data.
  • The algorithm module is divided into two parts. One handles query parsing, including multi-grained word segmentation (MWS), entity recognition, error correction, and rewriting. The other handles relevance and ranking, including text relevance, prediction of Click-Through Rate (CTR) and Conversion Rate (CVR), and Learning to Rank (LTR).

In the figure, the parts with orange backgrounds are related to NLP.

The goal of OpenSearch is to provide all-in-one, out-of-the-box intelligent search services. Alibaba Cloud will make these algorithms available to users in the form of industry templates, scenarios, and peripheral services.

The NLP Analysis Procedure in OpenSearch

A search starts with a keyword. For example, when a user searches for "aj1北卡兰新款球鞋" in Chinese (roughly, "AJ1 North Carolina blue new sneakers"), the analysis procedure works like this:

[Figure 3]
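
As a rough illustration of how query-parsing stages can be chained for this query, here is a toy sketch. Everything in it is hypothetical: the stage order, the correction table, the lexicon, and the entity labels are invented for the example, and it assumes that 北卡兰 in the query is a misspelling of the colorway name 北卡蓝. It is not the OpenSearch implementation.

```python
from dataclasses import dataclass, field

# Hypothetical resources, invented for this example only.
CORRECTIONS = {"北卡兰": "北卡蓝"}
LEXICON = {"aj1", "北卡蓝", "新款", "球鞋"}
ENTITY_TYPES = {"aj1": "MODEL", "北卡蓝": "COLOR", "球鞋": "CATEGORY"}


@dataclass
class QueryAnalysis:
    raw: str
    corrected: str = ""
    tokens: list = field(default_factory=list)
    entities: list = field(default_factory=list)


def fix_spelling(query: str) -> str:
    """Toy error correction: replace known misspellings."""
    for wrong, right in CORRECTIONS.items():
        query = query.replace(wrong, right)
    return query


def segment(query: str, lexicon: set) -> list:
    """Toy segmentation: forward maximum matching against the lexicon."""
    max_len = max(len(word) for word in lexicon)
    tokens, i = [], 0
    while i < len(query):
        # Try the longest piece first; a single character always matches.
        for n in range(min(max_len, len(query) - i), 0, -1):
            piece = query[i:i + n]
            if n == 1 or piece in lexicon:
                tokens.append(piece)
                i += n
                break
    return tokens


def analyze(raw: str) -> QueryAnalysis:
    corrected = fix_spelling(raw)
    tokens = segment(corrected, LEXICON)
    # Toy entity recognition: dictionary lookup per token.
    entities = [(t, ENTITY_TYPES[t]) for t in tokens if t in ENTITY_TYPES]
    return QueryAnalysis(raw, corrected, tokens, entities)


result = analyze("aj1北卡兰新款球鞋")
print(result.tokens)    # ['aj1', '北卡蓝', '新款', '球鞋']
print(result.entities)
```

A production system replaces each toy stage with a learned model, as the sections below describe, but the data flow from raw query to tokens and entities stays similar.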

Cross-Domain Word Segmentation

Alibaba Cloud has provided a series of open models for cross-domain word segmentation in OpenSearch.

Word Segmentation Challenges

  1. Segmentation quality drops sharply when a domain contains many out-of-vocabulary words, the so-called "new words" of that field.
  2. Customizing a word segmentation model for each new user is expensive, because the process runs from data labeling all the way through model training.

Solution

  1. A term-formation model can be built from statistical features such as mutual information and left and right neighbor entropy. With it, a domain dictionary can be quickly built from user data.
  2. By combining a word segmentation model from a source domain with a dictionary from the target domain, a tokenizer for the target domain can be quickly built using distant supervision.
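
To make the first step concrete, the sketch below scores candidate terms with pointwise mutual information (how strongly the parts of a candidate stick together) and with the entropy of the characters that appear to its left and right (how freely it combines with its surroundings). This pairing is a standard recipe for new-word discovery; whether it matches the production model is an assumption, and the function names, thresholds, and sample corpus are made up for illustration.

```python
import math
from collections import Counter, defaultdict


def score_candidates(corpus, max_len=3):
    """Return {candidate: (pmi, left_entropy, right_entropy)} for character n-grams."""
    ngram = Counter()
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    total_chars = 0
    for sent in corpus:
        total_chars += len(sent)
        for n in range(1, max_len + 1):
            for i in range(len(sent) - n + 1):
                w = sent[i:i + n]
                ngram[w] += 1
                if n > 1:
                    # Sentence boundaries count as context symbols too.
                    left[w][sent[i - 1] if i > 0 else "<s>"] += 1
                    right[w][sent[i + n] if i + n < len(sent) else "</s>"] += 1

    def prob(w):
        # Simplification: every n-gram length shares the same denominator.
        return ngram[w] / total_chars

    def entropy(counter):
        total = sum(counter.values())
        return -sum(c / total * math.log(c / total) for c in counter.values())

    scores = {}
    for w in ngram:
        if len(w) < 2:
            continue
        # Cohesion: the weakest split point decides the score.
        pmi = min(
            math.log(prob(w) / (prob(w[:k]) * prob(w[k:])))
            for k in range(1, len(w))
        )
        scores[w] = (pmi, entropy(left[w]), entropy(right[w]))
    return scores


def extract_terms(scores, min_pmi, min_entropy):
    """Keep candidates that are both cohesive and free on both sides."""
    return sorted(
        w for w, (pmi, le, re) in scores.items()
        if pmi >= min_pmi and min(le, re) >= min_entropy
    )


corpus = ["新款球鞋上市", "红色球鞋打折", "球鞋新款推荐", "买球鞋送袜子"]
scores = score_candidates(corpus)
print(extract_terms(scores, min_pmi=1.5, min_entropy=1.0))  # ['球鞋']
```

In this tiny corpus, "球鞋" (sneakers) is kept because its two characters always occur together and it appears in four different contexts, while a fragment such as "款球" is rejected because its neighbors never vary.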

[Figure 4]
The figure above shows the automatic cross-domain word segmentation framework.

Users only need to provide some corpus data from their business, and Alibaba Cloud automatically builds a customized word segmentation model from it. This greatly improves efficiency and meets customers' needs quickly.
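
One way to picture the distant supervision step is as automatic labeling: the output of a source-domain tokenizer is re-merged wherever adjacent pieces form a term in the target-domain dictionary, and the merged result is converted into character tags that can train a new tokenizer without manual annotation. The sketch below assumes that reading. The function names, the BMES tag scheme, and the sample tokens and dictionary are illustrative choices, not details from OpenSearch.

```python
def merge_with_dictionary(tokens, domain_dict, max_span=4):
    """Merge runs of adjacent tokens that together form a dictionary term."""
    merged, i = [], 0
    while i < len(tokens):
        # Try the longest run first; a single token is always accepted.
        for j in range(min(len(tokens), i + max_span), i, -1):
            candidate = "".join(tokens[i:j])
            if j - i == 1 or candidate in domain_dict:
                merged.append(candidate)
                i = j
                break
    return merged


def to_bmes(tokens):
    """Convert a segmentation into per-character B/M/E/S training tags."""
    tags = []
    for tok in tokens:
        if len(tok) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(tok) - 2) + ["E"])
    return tags


# Hypothetical output of a general-purpose tokenizer that does not know
# the product terms "aj1" and "北卡蓝".
source_tokens = ["aj", "1", "北", "卡", "蓝", "新款", "球鞋"]
domain_dict = {"aj1", "北卡蓝"}

pseudo_gold = merge_with_dictionary(source_tokens, domain_dict)
print(pseudo_gold)                    # ['aj1', '北卡蓝', '新款', '球鞋']
print("".join(to_bmes(pseudo_gold)))  # BMEBMEBEBE
```

The resulting tags are noisy, because the dictionary can be wrong or incomplete, which is why such labels are usually treated as weak supervision rather than ground truth.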

Across various domains, this technology delivers better results than general-purpose open-source word segmentation models.

[Figure 5]

Named Entity Recognition (NER)

NER identifies the important elements in text. For example, it can recognize and extract people's names, places, and times from queries.
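
NER models are commonly framed as sequence labeling, where each token receives a tag and entities are read off the tag sequence afterwards. The sketch below shows that decoding step for the widely used BIO scheme. The scheme, the label names, and the helper `bio_to_spans` are assumptions made for illustration; the article does not say which tagging scheme OpenSearch uses.

```python
def bio_to_spans(tokens, tags):
    """Collect (text, label) entities from BIO tags such as B-LOC / I-LOC / O."""
    spans = []
    start, label = None, None
    # A trailing "O" acts as a sentinel that flushes the last open entity.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if not continues and label is not None:
            spans.append((" ".join(tokens[start:i]), label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


tokens = ["Flights", "from", "Hangzhou", "to", "New", "York", "tomorrow"]
tags = ["O", "O", "B-LOC", "O", "B-LOC", "I-LOC", "B-TIME"]
print(bio_to_spans(tokens, tags))
# [('Hangzhou', 'LOC'), ('New York', 'LOC'), ('tomorrow', 'TIME')]
```

For Chinese text the tokens are usually single characters and would be joined without spaces.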

Challenges and Difficulties

NER is a heavily researched NLP task that still poses many challenges. It faces difficulties such as boundary ambiguity, semantic ambiguity, and nesting ambiguity, especially in Chinese, which lacks native word separators.

Solution

  • The architecture of the NER model in OpenSearch is shown in the upper-right corner of the following figure.
  • Many OpenSearch users have accumulated large dictionaries and entity libraries. To make full use of them, Alibaba Cloud built the GraphNER framework, which integrates this knowledge into a BERT-based model. As the table in the lower-right corner of the figure shows, this approach achieves the best NER results on Chinese.

[Figure 6]

Spelling Correction

The error correction steps of OpenSearch include mining, training, evaluation, and online prediction.

The main spelling correction model is based on a statistical machine translation model and a neural machine translation model. It is also backed by a complete set of methods covering performance, display style, and intervention.
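
Statistical approaches to correction are classically framed as a noisy channel: choose the candidate c that maximizes P(c) · P(q | c), where P(c) says how plausible the candidate is and P(q | c) says how likely the user was to type q when meaning c. The toy corrector below follows that idea with word counts as P(c) and a fixed error probability as P(q | c). It is a teaching sketch with made-up counts, limited to single edits on Latin letters, and is not the OpenSearch model.

```python
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """All strings one delete, transpose, replace, or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in ALPHABET]
    inserts = [l + c + r for l, r in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def correct(word, counts, error_prob=0.05):
    """Return the candidate with the highest P(candidate) * P(word | candidate)."""
    total = sum(counts.values())
    scored = {}
    if word in counts:
        # The user typed a known word and made no error.
        scored[word] = counts[word] / total * (1 - error_prob)
    for cand in edits1(word):
        if cand != word and cand in counts:
            # The user meant cand and made one edit.
            scored[cand] = counts[cand] / total * error_prob
    return max(scored, key=scored.get) if scored else word


# Made-up query-log counts standing in for a language model.
counts = Counter({"sneakers": 50, "speakers": 30, "sneaker": 20})
print(correct("sneekers", counts))  # sneakers
print(correct("sneakers", counts))  # sneakers
```

Translation-based correctors generalize this scheme by learning both factors from mined query pairs instead of fixing them by hand.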

[Figure 7]

Semantic Matching

The emergence of deep language models has greatly improved many NLP tasks, especially semantic matching.

Alibaba DAMO Academy has also proposed many innovations based on BERT and developed its own StructBERT model. The main innovation of StructBERT is that it adds training objectives on word order and sentence order when training the deep language model, so that the model learns sentence structure through multi-task learning alongside the usual language modeling objective. However, a single general-purpose StructBERT model cannot serve customers in every domain equally well, so Alibaba Cloud needs to adapt StructBERT to different domains. To that end, a three-stage paradigm for semantic matching has been proposed to quickly produce customized semantic matching models for customers.
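
To give a feel for a word-order objective, the sketch below builds one training example: a short span of tokens is shuffled, and the model's task would be to predict the original tokens in their original order. The span length of 3 follows the trigram setting reported for StructBERT as far as we know; treat it, the function name, and the return format as assumptions. No model is trained here, only the data corruption step is shown.

```python
import random


def shuffle_span(tokens, span_len=3, rng=None):
    """Shuffle one random span; return (corrupted, start, original_span)."""
    rng = rng or random.Random()
    tokens = list(tokens)
    if len(tokens) < span_len:
        return tokens, None, []
    start = rng.randrange(len(tokens) - span_len + 1)
    original = tokens[start:start + span_len]
    shuffled = original[:]
    rng.shuffle(shuffled)
    corrupted = tokens[:start] + shuffled + tokens[start + span_len:]
    return corrupted, start, original


tokens = ["new", "blue", "running", "shoes", "for", "men"]
corrupted, start, target = shuffle_span(tokens, rng=random.Random(0))
# Model input: `corrupted`. Prediction target: `target` at positions
# start .. start + 2. A random shuffle can occasionally leave the span unchanged.
print(corrupted, start, target)
```

A sentence-order objective can be built in the same spirit by pairing sentences in the original, swapped, or random order and asking the model to classify which case it sees.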

Process details are shown in the figure below:

[Figure 8]

Services Based on NLP Algorithms

The system architecture of the algorithm-based services includes offline computing, online engines, and product consoles.

In the figure, the light blue area shows the algorithm-related features that NLP provides in OpenSearch. Users can try and use these features directly in the console.

[Figure 9]
