By Xunjian from Alibaba Cloud Storage
This article introduces best practices for search indexes, based on the problems that many users encounter across common business scenarios.
The physical structure design of different types of fields varies in search indexes:
Although Keyword supports range queries and Long supports equality queries, performance in those cases is much worse, and the gap widens as the data volume grows. Therefore, you must plan the field types in advance when you import data from a wide table (primary table) in Tablestore.
Some common designs that are wrongly applied are described below:
The primary table of Tablestore is range-partitioned on the partition key. The design of the primary key affects the synchronization speed of search indexes and, in some scenarios, how well queries scale horizontally.
The primary key design affects the speed of index synchronization. Primary keys should spread writes across as many partitions of the primary table as possible. This matters most when the write TPS is high, because it avoids problems such as all writes landing on a single partition or on the tail partition. Currently, search indexes only support asynchronous synchronization from the primary table to the index. An architecture upgrade for real-time synchronization is still under design.
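One common way to spread writes across partitions is to prefix the partition key with a short hash of the natural key. The sketch below is only an illustration under that assumption: the class name, the MD5-based prefix, and the four-character prefix length are hypothetical choices and do not come from the article or from Tablestore itself.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: builds a partition key whose leading characters are
// a hash of the natural key, so sequential keys no longer sort next to each other.
final class PartitionKeys {
    static String hashedPartitionKey(String naturalKey) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(naturalKey.getBytes(StandardCharsets.UTF_8));
            // Use the first two digest bytes as a 4-character hex prefix (assumed length).
            String prefix = String.format("%02x%02x", digest[0] & 0xff, digest[1] & 0xff);
            return prefix + "_" + naturalKey;
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required on every Java platform, so this is not expected.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(hashedPartitionKey("user42"));
        System.out.println(hashedPartitionKey("user43"));
    }
}
```

The trade-off is that range scans over the natural key are no longer possible on the primary table, so this only suits tables that are read by exact key or through the search index.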
A counterexample:
If every query must carry a certain field (as in the Taobao app, where each query carries UserID=xxx), routing optimization is recommended. By default, search index data is hash partitioned on all primary keys of the primary table, so a query has to access every partition of the index engine. Adding a routing field to the index changes the data distribution: data is hash partitioned on the route key instead of on all primary keys, so the rows for a given route key fall on one or more specific partitions.
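The effect of routing on data placement can be sketched with a toy model. This is not how the index engine is implemented: String.hashCode stands in for the real hash function, which the article does not specify, the partition count is arbitrary, and the sketch covers only the simplest case in which each route key maps to exactly one partition.

```java
import java.util.Set;
import java.util.TreeSet;

// Toy model of hash partitioning, for illustration only.
final class RoutingSketch {
    // Assumed placement rule: hash the key and take it modulo the partition count.
    static int partitionOf(String key, int partitions) {
        return Math.floorMod(key.hashCode(), partitions);
    }

    // Default behavior: hash the full primary key (here userId plus a row id).
    static Set<Integer> partitionsByFullKey(String userId, int rows, int partitions) {
        Set<Integer> hit = new TreeSet<>();
        for (int i = 0; i < rows; i++) {
            hit.add(partitionOf(userId + "#" + i, partitions));
        }
        return hit;
    }

    // With routing: hash only the route key (here userId).
    static Set<Integer> partitionsByRouteKey(String userId, int rows, int partitions) {
        Set<Integer> hit = new TreeSet<>();
        for (int i = 0; i < rows; i++) {
            hit.add(partitionOf(userId, partitions));
        }
        return hit;
    }

    public static void main(String[] args) {
        System.out.println("full key:  " + partitionsByFullKey("user42", 1000, 8));
        System.out.println("route key: " + partitionsByRouteKey("user42", 1000, 8));
    }
}
```

In this model, a query that carries the user ID only needs to visit the partitions returned by partitionsByRouteKey, instead of every partition.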
Please see Use of Search Index Routing Fields for more information about how to use routing. Routing optimization can bring the following benefits:
Common Mistakes:
If the query is very complex (too many conditions, nesting that is too deep, or too many elements in a Terms query), the query latency is likely to be relatively high. Therefore, we recommend simplifying the query and removing unnecessary conditions as much as possible. In addition, the server automatically performs query rewriting and query optimization, so users generally do not need to pay special attention to this. If you find that the query latency is high, you can contact Search Index R&D to optimize the query.
Tablestore currently supports only the ordinary Double type, not BigDecimal. However, fields such as monetary amounts need exact values. Therefore, we recommend storing such fields as Long. For example, with a fixed scale of four decimal places, 5.32 yuan could be stored as 53200.
MatchQuery and MatchPhraseQuery are designed specifically for full-text index scenarios on Text fields. MatchQuery and TermQuery may return the same results for fields of the Keyword type. However, MatchQuery adds a word segmentation step and performs relatively poorly. Therefore, do not use MatchQuery for fields of the Keyword type.
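A minimal sketch of the Long-based storage described above, using only the Java standard library. The class name and the scale of four decimal places are assumptions chosen to match the 5.32 to 53200 example; pick the scale that fits your currency rules.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

// Hypothetical helper that converts exact decimal amounts to and from a scaled long.
final class MoneyCodec {
    // Assumed scale: 4 decimal places, so 5.32 becomes 53200.
    static final int SCALE = 4;

    // Convert a decimal string to the long value that would be stored in a Long column.
    static long toScaledLong(String amount) {
        return new BigDecimal(amount)
                // UNNECESSARY throws ArithmeticException if the input has more than SCALE decimals.
                .setScale(SCALE, RoundingMode.UNNECESSARY)
                .unscaledValue()
                .longValueExact();
    }

    // Convert a stored long back to an exact decimal value.
    static BigDecimal fromScaledLong(long stored) {
        return BigDecimal.valueOf(stored, SCALE);
    }

    public static void main(String[] args) {
        long stored = toScaledLong("5.32");
        System.out.println(stored);                                   // 53200
        System.out.println(fromScaledLong(stored).toPlainString());   // 5.3200
    }
}
```

Because the stored value is an integer, range queries and sorting on the Long field behave exactly, without the rounding issues of Double.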
For patterns such as *word*, that is, for any substring query requirement, you can use the fuzzy tokenization method (fuzzy tokenization combined with the phrase matching query MatchPhraseQuery) to implement fuzzy queries with better performance. Please see Fuzzy query for more information.
We recommend using token pagination for deep pagination in search indexes. If you need to persist a token (of the byte[] type), encode it as a Base64 string and then store it. If you convert it directly to a string (such as new String(token)), the token content will be lost.
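The difference between the two encodings can be shown with the Java standard library alone. In this sketch the class name and the sample bytes are made up for illustration; a real token would come from the search response.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

// Hypothetical helper for persisting an opaque pagination token as text.
final class TokenCodec {
    // Safe: Base64 maps every byte sequence to a string and back without loss.
    static String encode(byte[] token) {
        return Base64.getEncoder().encodeToString(token);
    }

    static byte[] decode(String stored) {
        return Base64.getDecoder().decode(stored);
    }

    // Unsafe, shown for comparison: bytes that are not valid UTF-8 are replaced
    // during decoding, so the original token cannot be recovered.
    static byte[] lossyRoundTrip(byte[] token) {
        return new String(token, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] token = {(byte) 0xFF, (byte) 0xFE, 0x00, 0x41}; // sample bytes, not a real token
        System.out.println(Arrays.equals(token, decode(encode(token))));  // true
        System.out.println(Arrays.equals(token, lossyRoundTrip(token)));  // false
    }
}
```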
This section addresses the case in which users need personalized column names but the maximum number of index fields supported by search index is insufficient. Let’s suppose there are 1,000 users in the system, and each user has personalized column names. If each user needs 10 fields in the search index, a total of 10 × 1,000 = 10,000 fields are required. However, the current search index does not support that many fields. We map logical fields to physical fields to solve the problem, so all users can share the same fields of the search index. The details are listed below:
1. Index Design: Let’s assume you only need two data types: Keyword and Long. Create a search index in advance that contains 200 fields. The number of fields of each type can be customized according to business needs, along with any other necessary non-personalized fields. The fixed field names in the index are listed below:
2. Prepare a meta table. A Tablestore table or a table in another database will do. If this table is not large, it is best to cache it in memory. The relationship with the preceding index is listed below:
3. Data writes and queries need to be performed based on the mapping of meta tables.
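The three steps above can be sketched as an in-memory lookup. Everything named here is hypothetical: the class, the method names, and physical field names such as keyword_1 are placeholders, since the article's actual field list is not shown. The map stands in for a cached copy of the meta table.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory cache of the meta table:
// userId -> (logical column name -> physical index field).
final class FieldMapper {
    private final Map<String, Map<String, String>> meta = new HashMap<>();

    // Record that a user's logical column is stored in a shared physical field.
    void register(String userId, String logicalName, String physicalField) {
        meta.computeIfAbsent(userId, k -> new HashMap<>()).put(logicalName, physicalField);
    }

    // Look up the physical field to use in a write or a query condition.
    String resolve(String userId, String logicalName) {
        Map<String, String> fields = meta.get(userId);
        if (fields == null || !fields.containsKey(logicalName)) {
            throw new IllegalArgumentException("No mapping for " + userId + "/" + logicalName);
        }
        return fields.get(logicalName);
    }

    // Translate a row keyed by logical names into one keyed by physical names.
    Map<String, Object> toPhysicalRow(String userId, Map<String, Object> logicalRow) {
        Map<String, Object> physicalRow = new HashMap<>();
        for (Map.Entry<String, Object> e : logicalRow.entrySet()) {
            physicalRow.put(resolve(userId, e.getKey()), e.getValue());
        }
        return physicalRow;
    }
}
```

Two users can map different logical names, for example "city" and "brand", to the same physical field keyword_1. Queries would then normally also filter on the user ID so that one user's values are not matched against another's.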
Search index currently recommends keeping a single index at 20 billion rows or fewer. However, this is not a hard limit. If you expect more than 20 billion rows, you can evaluate and design the solution together with the search index development team. For example, suppose a user's largest log table currently has 6.1 billion rows and grows by 2.1 billion rows every year. It will not exceed 20 billion rows within three to five years, so table sharding is not required. If the existing data exceeds 20 billion rows or is likely to, and the growth rate is fast, you can consider table sharding. The specific design can be worked out together with the search index development team, which also helps avoid potential problems at large data volumes.
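The growth estimate in the example can be checked with simple arithmetic. The sketch below assumes linear growth, and the class and method names are hypothetical. The 20 billion figure is the recommendation quoted above, not a hard limit.

```java
// Hypothetical helper for a linear row-count projection.
final class CapacityPlanner {
    // Recommended upper bound for a single index, as stated in the article.
    static final long RECOMMENDED_MAX_ROWS = 20_000_000_000L;

    // Projected row count after the given number of years, assuming constant yearly growth.
    static long projectedRows(long currentRows, long yearlyGrowth, int years) {
        return currentRows + yearlyGrowth * years;
    }

    // True if the projection exceeds the recommended bound within the horizon.
    static boolean exceedsRecommendation(long currentRows, long yearlyGrowth, int years) {
        return projectedRows(currentRows, yearlyGrowth, years) > RECOMMENDED_MAX_ROWS;
    }

    public static void main(String[] args) {
        // 6.1 billion rows today, growing by 2.1 billion rows per year.
        System.out.println(projectedRows(6_100_000_000L, 2_100_000_000L, 5));         // 16600000000
        System.out.println(exceedsRecommendation(6_100_000_000L, 2_100_000_000L, 5)); // false
    }
}
```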