By Yijia
Elasticsearch can return search results in under a second. As a distributed system, an Elasticsearch cluster scales out easily and can handle petabytes of data. Search results are sorted by relevance score, so the most relevant results come first.
Distributed architecture: Elasticsearch automatically distributes large amounts of data to multiple servers.
Provides highly automated query methods such as fuzzy search, and features such as relevance ranking and highlighting.
On a community website, data such as user logins over the last week and the usage of each feature over the last month can be analyzed.
Due to its distributed architecture, a large number of servers can be utilized for storing and retrieving data.
Examples include personnel retrieval, equipment retrieval, in-app search, and order search.
The classic combination of ELK (Elasticsearch/Logstash/Kibana) can achieve log collection, log storage, and log analysis.
For instance, the community group purchase prompt can automatically notify users of a purchase when the offer price falls below a certain value.
It can also analyze a competitor's top 10 best-selling items for operational analysis.
For example, in a community setting, it is necessary to analyze user consumption amounts and commodity categories in a certain area, output the corresponding report data, and predict the top-selling commodities based on regional and population characteristics. Elasticsearch handles the data analysis and mining, while Kibana provides data visualization.
Lucene is an information retrieval toolkit written in Java and distributed as a JAR package. It is only a framework, and using it well is complex.
A Lucene-based query server with an HTTP interface. It is a search engine system that encapsulates many Lucene details.
A distributed, near real-time search engine built on Lucene for massive amounts of data. Its strategy is to index every field so that each one can be searched.
(1) Solr uses Zookeeper for distributed management, while Elasticsearch itself has distributed coordination management capabilities.
(2) Solr's built-in feature set is more comprehensive than that of Elasticsearch, while Elasticsearch focuses more on core features, and most of its advanced features are provided by third-party plug-ins.
(3) Solr performs better than Elasticsearch in traditional search applications, while Elasticsearch performs better than Solr in real-time search applications.
At present, Elasticsearch 7.x is still the mainstream version, and the latest release is 7.8.
Optimizations: a JDK is bundled by default, the upgrade to Lucene 8 significantly improves Top-K query performance, and circuit breakers are introduced to help avoid out-of-memory (OOM) errors.
IK analyzer is an open-source, lightweight Chinese word segmentation toolkit written in Java. Starting with version 3.0, IK analyzer became a general-purpose word segmentation component for Java that is independent of the Lucene project, while still providing a default implementation optimized for Lucene.
IK analyzer 3.0 has the following features:
settings: index-level configuration, such as the number of shards and the number of replicas of the index.
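As a minimal sketch of what such a settings block looks like, the following builds the JSON body that could be sent when creating an index. The function name and the default values (3 shards, 1 replica) are illustrative assumptions, not recommendations from this article.

```python
import json


def create_index_body(shards=3, replicas=1):
    """Build a create-index request body with shard and replica settings.

    The defaults here are illustrative assumptions only.
    """
    return {
        "settings": {
            "number_of_shards": shards,      # fixed once the index is created
            "number_of_replicas": replicas,  # can be changed later
        }
    }


# Print the body that would accompany a request such as PUT /<index-name>
print(json.dumps(create_index_body(), indent=2))
```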
A geographic coordinate point refers to a point on the Earth's surface that can be described using latitude and longitude. Geographic coordinate points are used for calculating the distance between two coordinates and determining if a coordinate is within a specific area. To create a geographic coordinate point, you need to explicitly declare the field type as geo_point.
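The sketch below illustrates both uses described above. It builds a mapping that explicitly declares a geo_point field and a geo_distance filter that selects documents within a radius of a point. The field names (`name`, `location`) and the 5km radius are hypothetical. A small haversine helper is included only to show locally what "distance between two coordinates" means; it approximates the Earth as a sphere and is not how Elasticsearch itself is implemented.

```python
import math

# Mapping for a hypothetical index: "location" must be declared as geo_point.
mapping_body = {
    "mappings": {
        "properties": {
            "name": {"type": "text"},
            "location": {"type": "geo_point"},
        }
    }
}


def geo_distance_query(lat, lon, distance="5km", field="location"):
    """Build a query body matching documents within `distance` of a point."""
    return {
        "query": {
            "bool": {
                "filter": {
                    "geo_distance": {
                        "distance": distance,
                        field: {"lat": lat, "lon": lon},
                    }
                }
            }
        }
    }


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in kilometers on a spherical Earth (approximation)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))
```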
Dynamic mapping is used to determine the data type of a field and automatically add new fields to the type mapping.
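To make the idea concrete, here is a simplified Python approximation of how a field type might be inferred from a JSON value under default dynamic mapping. It is a sketch for illustration only: the function name is hypothetical, date detection is approximated with `datetime.fromisoformat` rather than Elasticsearch's own date formats, and many details of the real behavior are omitted.

```python
from datetime import datetime


def infer_field_mapping(value):
    """Approximate the mapping that dynamic mapping would pick for a JSON value."""
    if value is None:
        return None  # null values do not add a field
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return {"type": "boolean"}
    if isinstance(value, int):
        return {"type": "long"}
    if isinstance(value, float):
        return {"type": "float"}
    if isinstance(value, dict):
        properties = {}
        for key, inner in value.items():
            inner_mapping = infer_field_mapping(inner)
            if inner_mapping is not None:
                properties[key] = inner_mapping
        return {"properties": properties}
    if isinstance(value, list):
        # Arrays take their type from the first non-null element.
        first = next((item for item in value if item is not None), None)
        return infer_field_mapping(first)
    if isinstance(value, str):
        try:
            datetime.fromisoformat(value)  # rough stand-in for date detection
            return {"type": "date"}
        except ValueError:
            # Plain strings become text with a keyword sub-field.
            return {
                "type": "text",
                "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
            }
    raise TypeError(f"unsupported JSON value: {value!r}")
```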
Full-text query
Term-level query
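The difference between the two families can be sketched with request bodies. A full-text query such as `match` analyzes the search text before matching, while a term-level query such as `term` looks up the exact value without analysis. The helper names and the field names used below (`title`, `status`) are hypothetical.

```python
def match_query(field, text):
    """Full-text query: the text is analyzed (tokenized) before matching."""
    return {"query": {"match": {field: text}}}


def term_query(field, value):
    """Term-level query: the value is matched exactly, without analysis."""
    return {"query": {"term": {field: {"value": value}}}}


# Typical pairing: match on an analyzed text field, term on a keyword field.
full_text_body = match_query("title", "distributed search engine")
exact_body = term_query("status", "published")
```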
Aggregation analysis is an important database feature. It performs aggregate calculations over the dataset matched by a query, such as finding the maximum and minimum values or calculating the sum and average of a field (or of the result of an expression).
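As a rough sketch, the metric aggregations mentioned above could be requested with a body like the one built below. The aggregation names (`max_value`, `min_value`, and so on) and the `price` field are assumptions for illustration. The `local_metrics` helper only shows, on an in-memory list, the values such a request would be expected to compute.

```python
def metric_aggs_body(field):
    """Build a search body that returns only max/min/sum/avg of `field`."""
    return {
        "size": 0,  # skip the hits, return aggregation results only
        "aggs": {
            "max_value": {"max": {"field": field}},
            "min_value": {"min": {"field": field}},
            "sum_value": {"sum": {"field": field}},
            "avg_value": {"avg": {"field": field}},
        },
    }


def local_metrics(values):
    """Compute the same four metrics locally, for comparison."""
    return {
        "max": max(values),
        "min": min(values),
        "sum": sum(values),
        "avg": sum(values) / len(values),
    }


body = metric_aggs_body("price")
```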
If the Completion Suggester returns zero matches, you can assume that the user made an input error and try the Phrase Suggester. If there is still no match, try the Term Suggester.
In terms of precision, Completion > Phrase > Term, while in terms of recall, the opposite is true.
In terms of performance, the Completion Suggester is the fastest. If prefix matching alone meets the business requirements, it is ideal to use only the Completion Suggester. The Phrase and Term Suggesters are slower because they search the inverted index. Keep the amount of data used by the suggesters as small as possible. Ideally, after a warm-up period, the index can be fully mapped into memory.
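The fallback order described above (Completion first, then Phrase, then Term) can be sketched as a small chain. Everything here is illustrative: the three suggester functions are toy in-memory stand-ins built on prefix matching and `difflib`, not calls to Elasticsearch, and the vocabulary is made up.

```python
import difflib

VOCAB = ["elasticsearch", "elastic stack", "kibana", "logstash"]
WORDS = sorted({word for entry in VOCAB for word in entry.split()})


def completion_suggest(text):
    """Toy stand-in: prefix matching only, like a completion lookup."""
    return [entry for entry in VOCAB if entry.startswith(text.lower())]


def phrase_suggest(text):
    """Toy stand-in: correct a multi-word phrase word by word."""
    words = text.lower().split()
    if len(words) < 2:
        return []
    corrected = []
    for word in words:
        close = difflib.get_close_matches(word, WORDS, n=1, cutoff=0.7)
        if not close:
            return []
        corrected.append(close[0])
    return [" ".join(corrected)]


def term_suggest(text):
    """Toy stand-in: fuzzy match of the whole input against the vocabulary."""
    return difflib.get_close_matches(text.lower(), VOCAB, n=3, cutoff=0.6)


def suggest_with_fallback(text, chain):
    """Try each (name, suggester) in order; return the first non-empty result."""
    for name, suggester in chain:
        options = suggester(text)
        if options:
            return name, options
    return None, []


CHAIN = [
    ("completion", completion_suggest),
    ("phrase", phrase_suggest),
    ("term", term_suggest),
]
```

Calling `suggest_with_fallback("elas", CHAIN)` is answered by the completion stand-in alone, so the slower stand-ins are never invoked. They only run when the faster ones return nothing.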
When loading data for the first time, set the number of replicas to 0 and change it back after the write completes. This avoids indexing every document into the replicas during the initial load.
It avoids having to check whether a document already exists before writing it.
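A minimal sketch of that sequence follows. The `put_settings` and `write_documents` callables are hypothetical placeholders for whatever client code updates index settings and performs the bulk write. The `try/finally` is a design choice of this sketch so that the replica count is restored even if the load fails.

```python
def initial_load(index, put_settings, write_documents, final_replicas=1):
    """Disable replicas, run the initial write, then restore the replica count.

    `put_settings(index, body)` and `write_documents(index)` are hypothetical
    callables supplied by the caller.
    """
    put_settings(index, {"index": {"number_of_replicas": 0}})
    try:
        write_documents(index)
    finally:
        # Restore replicas whether or not the load succeeded.
        put_settings(index, {"index": {"number_of_replicas": final_replicas}})


# Dry run with recording stand-ins instead of a real cluster.
log = []
initial_load(
    "orders-v1",
    put_settings=lambda index, body: log.append(("settings", index, body)),
    write_documents=lambda index: log.append(("write", index)),
    final_replicas=2,
)
```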
The binary type is not applicable. Use different analyzers for the title and text to speed up.
External data import
The solution based on scroll + bulk + index alias
Reindex API solution
Participation and Flexibility: Self-developed > scroll + bulk > reindex
Stability and reliability: Self-developed < scroll + bulk < reindex
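For the solution based on scroll + bulk + index alias, the final step is typically to repoint the alias from the old index to the new one in a single request, so that readers never see a missing alias. The sketch below only builds the body for such an alias update. The alias and index names are hypothetical.

```python
def alias_swap_actions(alias, old_index, new_index):
    """Build an alias-update body that moves `alias` from old_index to new_index.

    Both actions are sent in one request so the switch happens together.
    """
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Hypothetical names: applications query "orders", data lives in versioned indexes.
swap_body = alias_swap_actions("orders", "orders-v1", "orders-v2")
```

Because applications only ever query the alias, the rebuilt index can be put into service without changing application code.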
For example, if a super administrator wants to send an announcement or advertisement to all users in a province, the easiest method is to use from + size, but this is unrealistic for a result set of that size because of the deep paging problem.
| Paging method | Performance | Advantages | Disadvantages | Scenarios |
| --- | --- | --- | --- | --- |
| from + size | Low | Flexible and simple to implement. | Suffers from the deep paging problem. | Relatively small data volumes where the deep paging problem can be tolerated. |
| scroll | Medium | Resolves the deep paging problem. | Works on a snapshot, so it does not reflect real-time changes to the data. Maintenance cost is high because a scroll_id must be maintained. | Exporting large amounts of data, where very large result sets must be read in full. |
| search_after | High | Best performance, no deep paging problem, and reflects real-time changes to the data. | Continuous paging is more complicated to implement because each query depends on the results of the previous query. | Paging through large amounts of data. |
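The dependency between consecutive search_after queries can be sketched as follows. Each request carries the sort values of the last hit from the previous page. The field names (`created_at`, `id`) are hypothetical, with `id` assumed to be a unique tiebreaker. The `make_fake_search` helper is an in-memory stand-in that imitates the paging semantics so the loop can run without a cluster.

```python
def search_after_body(page_size, sort_fields, last_sort_values=None):
    """Build a search body; pass the previous page's last sort values to continue."""
    body = {"size": page_size, "sort": [{field: "asc"} for field in sort_fields]}
    if last_sort_values is not None:
        body["search_after"] = list(last_sort_values)
    return body


def scan_all(search, page_size, sort_fields):
    """Yield every hit by chaining search_after requests until a page is empty."""
    last = None
    while True:
        hits = search(search_after_body(page_size, sort_fields, last))
        if not hits:
            return
        for hit in hits:
            yield hit
        last = hits[-1]["sort"]  # input for the next request


def make_fake_search(docs, sort_fields):
    """In-memory stand-in for a search call (an assumption for this demo)."""
    ordered = sorted(docs, key=lambda d: tuple(d[f] for f in sort_fields))

    def search(body):
        after = tuple(body["search_after"]) if "search_after" in body else None
        hits = []
        for doc in ordered:
            key = tuple(doc[f] for f in sort_fields)
            if after is None or key > after:
                hits.append({"_source": doc, "sort": list(key)})
            if len(hits) == body["size"]:
                break
        return hits

    return search


FIELDS = ("created_at", "id")
DOCS = [{"created_at": t, "id": i} for t, i in [(3, "c"), (1, "a"), (2, "b"), (3, "d"), (5, "e")]]
all_ids = [hit["_source"]["id"] for hit in scan_all(make_fake_search(DOCS, FIELDS), 2, FIELDS)]
```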
Disclaimer: The views expressed herein are for reference only and don't necessarily represent the official views of Alibaba Cloud.