By ELK Geek, with special guest Xie Pengjun (Chengchen), Senior Algorithm Expert at Alibaba Cloud AI
Introduction: Building a search engine raises many result-quality optimization issues, and a large share of them are related to Natural Language Processing (NLP). This article analyzes these issues by walking through the NLP techniques used in OpenSearch.
NLP research aims to enable effective communication between humans and computers through language. It draws on linguistics, psychology, computer science, mathematics, and statistics, and covers topics such as the analysis, extraction, understanding, conversion, and generation of natural and symbolic languages.
NLP is a necessary component of comprehensive AI.
From bottom to top, the capabilities of the NLP platform are divided into NLP data, NLP basic capabilities, NLP application technologies, and high-level applications.
NLP data is the basis for many algorithms. It includes language dictionaries, substantive knowledge dictionaries, syntactic dictionaries, and sentiment analysis dictionaries. Basic NLP technologies include lexical analysis, syntactic analysis, text analysis, and deep models. On top of these basic technologies sit vertical NLP technologies, including Q&A and conversation technologies, anti-spam technology, and address resolution. Combined, these technologies support many applications. OpenSearch is one of them, and it is an NLP-intensive application.
Parts with orange backgrounds are related to NLP
The goal of OpenSearch is to create all-in-one and out-of-the-box intelligent search services. Alibaba Cloud will open these algorithms to users in the form of industry templates, scenarios, and peripheral services.
A search starts with a keyword. For example, when a user searches for "aj1北卡兰新款球鞋" in Chinese (roughly, "new AJ1 North Carolina blue sneakers"), query analysis works as follows:
Alibaba Cloud has provided a series of open models for cross-domain word segmentation in OpenSearch.
The figure above shows the automatic cross-domain word segmentation framework.
Users only need to provide some corpus data from their own business, and Alibaba Cloud automatically builds a customized word segmentation model from it. This approach greatly improves efficiency and meets customer needs quickly.
Across various domains, this technology yields better results than general-purpose open-source word segmentation models.
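To illustrate why domain-specific vocabulary matters for segmentation, here is a minimal sketch using forward maximum matching against a dictionary. This is a simple stand-in, not the learned cross-domain model described above; the function name `forward_max_match` and both vocabularies are assumptions made up for this example.

```python
def forward_max_match(text, vocab, max_len=4):
    """Greedy longest-match segmentation against a dictionary (illustrative only)."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical vocabularies: a general one, and one extended with domain terms.
general_vocab = {"新款", "球鞋"}
domain_vocab = general_vocab | {"aj1", "北卡兰"}

query = "aj1北卡兰新款球鞋"
general_tokens = forward_max_match(query, general_vocab)
domain_tokens = forward_max_match(query, domain_vocab)

print(general_tokens)  # ['a', 'j', '1', '北', '卡', '兰', '新款', '球鞋']
print(domain_tokens)   # ['aj1', '北卡兰', '新款', '球鞋']
```

With only the general vocabulary, the product terms fall apart into single characters. Adding domain terms keeps them intact, which is the effect a customized segmentation model aims for.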
Named entity recognition (NER) identifies the important elements of a query. For example, NER can recognize and extract people's names, places, and times.
NER is an active research area in NLP and still poses many challenges. It faces difficulties such as boundary ambiguity, semantic ambiguity, and nesting ambiguity. These are especially pronounced in Chinese, which has no native word separators.
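NER models are commonly framed as sequence labeling, where each token receives a tag and entities are read off the tag sequence. The sketch below shows only that decoding step, assuming the widely used BIO tagging scheme. The article does not state which scheme OpenSearch uses, and the labels `PER`, `LOC`, and `TIME` and the sample sentence are hypothetical.

```python
def decode_bio(tokens, tags, sep=" "):
    """Turn a BIO tag sequence into a list of (entity_text, entity_type) pairs."""
    entities = []
    current_tokens, current_type = [], None

    def flush():
        # Close the entity that is currently open, if any.
        if current_tokens:
            entities.append((sep.join(current_tokens), current_type))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # An O tag or an inconsistent I- tag closes the open entity.
            flush()
            current_tokens, current_type = [], None
    flush()
    return entities


# Hypothetical tagger output for a short query.
tokens = ["Li", "Lei", "visited", "Hangzhou", "in", "2020"]
tags = ["B-PER", "I-PER", "O", "B-LOC", "O", "B-TIME"]
entities = decode_bio(tokens, tags)
print(entities)  # [('Li Lei', 'PER'), ('Hangzhou', 'LOC'), ('2020', 'TIME')]
```

Boundary ambiguity shows up here as uncertainty about where a `B-` tag should start and where the entity should end. For Chinese text, tagging is often done per character and the pieces are joined with `sep=""`.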
The error correction steps of OpenSearch include mining, training, evaluation, and online prediction.
The main spelling correction model is based on a statistical translation model and a neural network translation model. It is complemented by a complete set of mechanisms covering performance, display style, and manual intervention.
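The four steps (mining, training, evaluation, and online prediction) can be sketched end to end with a deliberately simple model. The example below replaces the translation models with a relative-frequency rewrite table mined from query sessions. All function names, the session data, and the `threshold` parameter are assumptions for illustration, not the OpenSearch implementation.

```python
from collections import Counter, defaultdict


def mine_pairs(sessions):
    """Mining: count (original, rewrite) pairs from consecutive queries in a session."""
    counts = Counter()
    for queries in sessions:
        for original, rewrite in zip(queries, queries[1:]):
            if original != rewrite:
                counts[(original, rewrite)] += 1
    return counts


def train(counts):
    """Training: estimate P(rewrite | original) by relative frequency."""
    totals = Counter()
    for (original, _), n in counts.items():
        totals[original] += n
    table = defaultdict(dict)
    for (original, rewrite), n in counts.items():
        table[original][rewrite] = n / totals[original]
    return table


def predict(table, query, threshold=0.6):
    """Online prediction: return the best rewrite only when it is confident enough."""
    candidates = table.get(query)
    if not candidates:
        return query
    best, prob = max(candidates.items(), key=lambda kv: kv[1])
    return best if prob >= threshold else query


def evaluate(table, labeled):
    """Evaluation: fraction of labeled queries corrected to the expected form."""
    correct = sum(predict(table, query) == gold for query, gold in labeled)
    return correct / len(labeled)


# Hypothetical query sessions: users retype a query after a misspelling.
sessions = [
    ["snekers", "sneakers"],
    ["snekers", "sneakers"],
    ["snekers", "sneakers"],
    ["snekers", "sneaker sale"],
]
table = train(mine_pairs(sessions))
corrected = predict(table, "snekers")
accuracy = evaluate(table, [("snekers", "sneakers"), ("sneakers", "sneakers")])
print(corrected, accuracy)  # sneakers 1.0
```

The confidence threshold is one simple place to attach intervention: queries below it are left unchanged rather than rewritten.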
The emergence of deep language models has greatly improved performance on many NLP tasks, especially semantic matching.
Alibaba DAMO Academy has proposed many innovations based on BERT and developed its own StructBERT model. The main innovation of StructBERT is that it adds training objectives for word and term order to the training of deep language models. It also adds more diverse objectives for sentence structure prediction, so that the model is trained through multi-task learning. However, a single general-purpose StructBERT model cannot serve customers in different domains as is, so Alibaba Cloud needs to adapt StructBERT to each domain. To that end, it has proposed a three-stage paradigm for semantic matching, which is used to quickly produce customized semantic models for customers.
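A word-order objective can be pictured as follows: shuffle a short span of tokens and ask the model to recover the original order. The sketch below only builds one such training example and does not train a model. The function name, the span length of 3, and the sample tokens are assumptions for illustration, not details taken from this article.

```python
import random


def make_word_order_example(tokens, span_len=3, rng=None):
    """Shuffle one random span; return (corrupted tokens, span start, original span)."""
    if rng is None:
        rng = random.Random()
    if len(tokens) < span_len:
        raise ValueError("sequence is shorter than the span length")
    start = rng.randrange(len(tokens) - span_len + 1)
    target = tokens[start:start + span_len]  # what the model should reconstruct
    shuffled = target[:]
    rng.shuffle(shuffled)
    corrupted = tokens[:start] + shuffled + tokens[start + span_len:]
    return corrupted, start, target


# Hypothetical tokenized query; a fixed seed keeps the example reproducible.
tokens = ["new", "aj1", "north", "carolina", "blue", "sneakers"]
corrupted, start, target = make_word_order_example(tokens, rng=random.Random(0))
print(corrupted, start, target)
```

A model trained on such examples has to rely on how words normally combine, which is the kind of structural signal that helps semantic matching. Note that a shuffle can occasionally leave the span in its original order.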
Process details are shown in the figure below:
The system architecture of the algorithm-based services consists of offline computing, online engines, and product consoles.
As shown in the figure, the light blue area shows the algorithm-related features provided by NLP in OpenSearch. Users can experience and use these features directly in the console.