Tablestore – Best Practice in Diversified Search Index

By Xunjian from Alibaba Cloud Storage

This article introduces some best practices using search indexes based on the problems many users and business scenarios face during the process.

Index Field Type

The physical structure design of different types of fields varies in search indexes:

The bottom layer of the Keyword field type uses FST and inverted chaining. Therefore, equivalent queries on the Keyword field type TermQuery are fast.
Long-type fields use BKD-tree at the underlying layer. Therefore, range queries on Long-type fields are fast.

Although Keyword supports range queries and Long supports equivalent queries, the performance is much worse. The larger the data volume, the larger the performance difference is. Therefore, you must plan the field types in advance when you import data from a wide table (primary table) in Tablestore.

Some common designs that are wrongly applied are described below:

The values of enumeration types (such as Type and Status) are 0,1, and 2. They are numbers but are saved as strings and indexed as Keyword. The TermQuery or TermsQuery operation is fast (similar to the In operation in SQL).
UserId is a long-shaped number, but UserId is equivalent query (TermQuery), so UserId needs to be stored in a string.
Status (such as isDeleted) had the valid value of 0 or 1. You must save the value as a string or bool.
If the existing service uses the wrong data type, and it is not convenient to modify the data type of the existing service, you can use Virtual columns and Dynamically modify schemas to correct the field type to achieve the purpose of acceleration. Dynamically modify the schema function to provide a smooth upgrade to the original service, and virtual columns provide field type correction.

Primary Key Design

The primary table of Tablestore is partitioned by Range based on the partition key. The design of the primary key affects the synchronization speed of search indexes and the horizontal expansion of queries in some scenarios.

Primary keys need to be as discrete as possible (such as hashing with md5). Some common counterexamples are using the auto-increment ID and using the current timestamp as the partition key. Please see the Tablestore table design for more information.
If you need to pull batch data from the primary table based on the prefix of the primary key, you can design some special primary keys. Then, you can perform a GetRange query on the primary table to pull data quickly.
If you often use a certain field in search indexes to perform equivalent queries on TermQuery (like the case in Taobao App where each query contains UserID=xxx), we recommend putting UserId in the primary key. Please see Index Routing Optimization later in this article for more information.

The primary key design has an impact on the speed of index synchronization. The requirement for primary keys is to distribute writes to different partitions of the primary table as much as possible. Therefore, if the write TPS is high, it is necessary to distribute writes to multiple partitions of the primary table as much as possible to avoid problems (such as single write partitions and tail write partitions). Currently, search indexes only support asynchronous writing from the primary table to the index. The architecture upgrade of real-time synchronization is still under design.

A counterexample:

The user splices UserID + commodity ID as the partition key, and the commodity ID is auto-incrementing. If a UserID is a large user and the TPS written by the large user is too high, there will be a write tail problem. The last partition of the Tablestore main table will always be written, so the write is not friendly to synchronization delay and query. If there's no need to perform a range query based on UserID on the main table, set the partition key to md5 (commodity ID). If it needs to perform a batch query based on UserID on the main table, set the partition key to UserID + md5 (commodity ID).

Index Presorting

If most queries in search indexes are sorted according to a certain mode, setting pre-sorting can save sorting time. The performance optimization effect of pre-sorting will be better with more data being hit. Search indexes are pre-sorted based on all primary keys of the primary table by default. Currently, this feature only supports the use of SDKs to create pre-sorted search indexes. Please see Create search indexes for more information.

Index Routing Optimization

If the query must carry a certain field (like the case of the Taobao app where each query carries UserID=xxx), routing optimization is recommended. By default, the data distribution of search indexes is hash partitioned based on all primary keys of the primary table. Therefore, all partitions of the index engine are accessed during queries. After indexes are added with routes, the data distribution can be changed. Data is hash partitioned based on route keys (instead of all previous primary keys). Therefore, a certain route key must fall on one or more partitions.

Please see Use of Search Index Routing Fields for more information about how to use routing. Routing optimization can bring the following benefits:

It reduces long-tail queries. Before routing is used, each query will access each partition of the index engine. According to the barrel effect, the overall query latency depends on the slowest partition. Therefore, if one partition has burrs or network cards, the overall query request will slow down.
It supports higher QPS query capability. Requests that carry routes only access some partitions of the index engine and do not access all partitions. Therefore, there is no read amplification, which consumes fewer resources on the cluster and can support higher QPS.
If the search index with routing is properly designed, theoretically, there is no upper limit.

Common Mistakes:

The user has only two UserIDs, but there are 2 billion rows of data in the table, which means one UserID has 1 billion data, and the index partition will be too large after the route is added. Users can contact search index R&D for evaluation and special solution processing to deal with it. In general, it is recommended that the value of the routing key (such as the preceding UserID) be as rich as possible. The total data under the same routing key should not be too much (e.g., no more than 0.1 billion). If there is too much data, consider concatenating multiple unchanged fields into the routing key.
If users set UserID as the routing key but later come across scenarios with no UserID specified in the query, all partitions in the engine are accessed during the query without affecting the integrity of the query results.

Query Optimization

If the query is very complex (there are too many conditions, too deep nesting, and too many elements in the Terms query), the query latency is likely to be relatively high. Therefore, we recommend simplifying the query and removing unnecessary conditions as much as possible. In addition, the server automatically performs query rewriting and query optimization. In general, users do not need to pay special attention. If you find that the query latency is high, you can contact Search Index R&D to optimize the query.

Floating-Point Numbers (Currency, Price, etc.)

Tablestore only supports ordinary Double (but not BigDecimal type) for the time being. However, the business side needs to be very accurate in fields such as money. Therefore, it is recommended to use Long for the storage of such fields. For example, 5.32 yuan could be stored as 53200.

Full-Text Index and Fuzzy Queries

MatchQuery and MatchPhraseQuery are queries designed specifically for full-text index scenarios of text-type fields. MatchQuery and TermQuery may have the same query results for fields of the Keyword type. However, MatchQuery has additional word segmentation processes and relatively poor performance. Therefore, do not use MatchQuery for fields of the Keyword type.

For scenarios where the query mode in wildcard query (WildcardQuery) is *word*, which means that for any substring query requirements, you can use the fuzzy word segmentation method (fuzzy word segmentation and phrase matching query MatchPhraseQuery combined) to implement fuzzy queries with better performance. Please see Fuzzy query for more information.

Paging and Token Encoding

We recommend using token pagination for deep pagination in search indexes. If you need to persist a token (of the byte[] type), you can use Base64 encoding as a string and then store it. If you directly perform string encoding (such as new String(token)) the token content will be lost.

Logical Fields and Physical Field Mapping

This section solves the problem in which users need personalized column names while the maximum number of index fields supported by search index is insufficient. Let’s suppose there are 1,000 users in the system, and each user has a personalized column name. Then, a total of 101,000=10,000 fields are required when each user needs to use 10 fields in the search index. However, the current search index does not support so many fields. We use the logical fields and physical field mapping* idea to solve the problem, so all users can share some fields of the search index. The details are listed below:

1. Index Design: Let’s assume you only need two data types: Keyword and Long. Then, create an index in search index in advance. This index contains 200 fields. The number of fields of different types can be customized according to business needs and other necessary non-personalized fields. The fixed field names in the index are listed below:

keyword_1,keyword_2,....,keyword_100
long_1,long_2,...,long_100
Other Business Necessary Fields

2. Prepare a meta table. Tablestore's table or other database tables will do. If the content of this table is not large, it is best to cache it in memory. The relationship with the preceding index is listed below:

If User 1 has fields a, b, and c of the Keyword type, a row of data is recorded for User 1 in the meta table: field a-> index keyword_1, field b-> index keyword_2, field c-> index keyword_3, and the same for Long type.
If User 2 has Keyword type fields b, c, and d, record a row of data for User 2 in the meta table: field b-> index keyword_1, field c-> index keyword_2, field d-> index keyword_3, and the same for Long type.

3. Data writes and queries need to be performed based on the mapping of meta tables.

Data Writing: Write physical fields to logical columns based on the meta mapping table.
Data Query: First, convert the user query fields based on the meta mapping table. For example, if you want to check User 2 for b=4 && d=5, convert it to keyword_1=4 && keyword_3=5.

Preparation before Importing Large Quantities of Data

After you create a Tablestore table for the first time and before importing data, you can contact R&D to pre-partition the table if there are a lot of data (such as more than 1 billion rows).
If the amount of data is too large, we recommend importing data from the primary table before creating a search index. This significantly improves the speed of indexing the existing data.

Table Sharding Problem

Search index currently recommends an index with 20 billion rows or less. However, this does not mean that the maximum number of rows is 20 billion. If the maximum number of rows is more than 20 billion, you can evaluate and design together with search index development. For example, the current number of rows of a user's largest log table is 6.1 billion. It increases by 2.1 billion every year and will not exceed 20 billion in three to five years. Therefore, table sharding is not required. If the stock of data exceeds 20 billion or has the potential to do so, and the growth rate is fast, you can consider table sharding. The specific design can be evaluated and designed together with the development of search indexes. At the same time, some potential problems can be avoided when the amount of data is large.

Community

Tablestore – Best Practice in Diversified Search Index

Index Field Type

Primary Key Design

Index Presorting

Index Routing Optimization

Query Optimization

Floating-Point Numbers (Currency, Price, etc.)

Full-Text Index and Fuzzy Queries

Paging and Token Encoding

Logical Fields and Physical Field Mapping

Preparation before Importing Large Quantities of Data

Table Sharding Problem

Read previous post:

Read next post:

Alibaba Cloud Community

You may also like

Comments

Alibaba Cloud Community

Related Products

Storage Capacity Unit

Hybrid Cloud Storage

Hybrid Cloud Distributed Storage

Data Lake Storage Solution

A Free Trial That Lets You Build Big!