Search indexes are used for multi-dimensional data queries and statistical analysis in big data scenarios based on inverted indexes and column stores. If your business requires complex queries and data analysis, you can create a search index and specify the required attributes as the fields of the search index. Then, you can query and analyze data by using the search index. For example, you can use a search index to perform queries based on non-primary key columns, Boolean queries, fuzzy queries, full-text search, and k-nearest neighbor (KNN) vector queries. You can also use a search index to obtain maximum and minimum values, collect statistics about the number of rows, and group query results.
Background information
Search indexes can solve complex query problems in big data scenarios. Other systems such as databases and search engines can also solve data query problems.
The following figure shows the differences among Tablestore, databases, and search engines.
Tablestore provides all features of databases and search engines except JOIN operations, transactions, and relevance of search results. Tablestore also offers the high data reliability of databases and supports the advanced queries of search engines. Therefore, Tablestore can replace the common architecture that consists of databases and search engines. If you do not need JOIN operations, transactions, or complex relevance of search results, we recommend that you use the search index feature of Tablestore.
Introduction
Search indexes are used for multi-dimensional data queries and statistical analysis in big data scenarios based on inverted indexes and column stores. Search indexes support various query methods, including query based on non-primary key columns, prefix query, fuzzy query, Boolean query, nested query, geo query, full-text search, and KNN vector query. Search indexes also support multiple aggregation operations. You can perform aggregation operations to obtain the maximum and minimum values, count and distinct count of rows, sums, averages, and percentile statistics, group results by specific conditions, and display data as histograms.
The following figure shows how inverted indexes and column stores are used for search indexes, as well as the structure of a multi-dimensional spatial index.
Compared with indexes of traditional database services such as MySQL, the search index feature of Tablestore is not subject to the leftmost matching principle. Therefore, the search index feature can be used in more scenarios. In most cases, only one search index is required for a data table. For example, a data table about student information contains the following columns: student name, ID, gender, grade, class, and home address. If you want to query the data in the data table by specifying a combination of conditions, such as querying students named Zhang San in Grade Three, querying male students who live within one kilometer of the school, and querying students in Class Two, Grade Three from the specified residential community, you can create a search index and add these columns to the search index.
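The following minimal Python sketch illustrates why an inverted index is not bound by the leftmost matching principle: each (column, value) pair maps to a set of row IDs, so conditions can be combined in any order. It is a toy in-memory model, not the Tablestore implementation. The sample rows, the column names, and the helpers `build_inverted_index` and `query_all` are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample rows; the column names are illustrative only.
students = {
    1: {"name": "Zhang San", "gender": "male", "grade": 3, "class_no": 2},
    2: {"name": "Li Si", "gender": "female", "grade": 3, "class_no": 2},
    3: {"name": "Zhang San", "gender": "male", "grade": 1, "class_no": 4},
}

def build_inverted_index(rows):
    """Map each (column, value) pair to the set of row IDs that contain it."""
    index = defaultdict(set)
    for row_id, row in rows.items():
        for column, value in row.items():
            index[(column, value)].add(row_id)
    return index

def query_all(index, **conditions):
    """Return the sorted row IDs that match every column=value condition."""
    postings = [index.get(item, set()) for item in conditions.items()]
    return sorted(set.intersection(*postings)) if postings else []

index = build_inverted_index(students)
print(query_all(index, name="Zhang San", grade=3))  # [1]
print(query_all(index, class_no=2, grade=3))        # [1, 2]
```

Because every column has its own posting sets, the same index answers both queries without a dedicated composite index for each combination of columns.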
Aside from queries based on primary key columns in data tables, Tablestore provides the following two index schemas for accelerated data queries: secondary index and search index. The following table describes the differences among the three types of indexes.
Index type | Description | Scenario |
Primary key of a data table | A data table is similar to a large map. Data tables support queries based only on primary key columns. | Primary key-based queries are suitable for scenarios in which the values of all primary key columns or the primary key prefix can be determined. |
Secondary index | You can create one or more index tables for a data table and perform queries by using the primary key columns of the index tables. | Secondary indexes are suitable for scenarios in which the columns to be queried can be determined, the number of columns to be queried is small, and the values of all primary key columns or the primary key prefix can be determined. |
Search index | Search indexes use inverted indexes, Bkd-trees, and column stores for various query scenarios. | Search indexes are suitable for all query and analysis scenarios in which queries based on the primary key columns and secondary indexes of data tables cannot meet your business requirements. For example, you can use search indexes to perform queries based on non-primary key columns, Boolean queries, relational queries, full-text search, geo queries, prefix queries, fuzzy queries, nested queries, exists queries, and aggregation operations. |
Typical scenarios
Search indexes can be widely used in various application systems for data query and analysis. The following table describes some scenarios in which search indexes can be used.
Application system | Scenario |
E-commerce platform | You can use search indexes on e-commerce platforms to classify products and filter attributes. This facilitates product search and filtering for customers. |
Social media application | You can use search indexes in social media applications to query the follower and friend connections between users and provide the recommendation and matching features based on the interests of users. |
Log analysis | You can use search indexes to query logs based on conditions such as keywords and time ranges. This allows you to identify issues in an efficient manner and analyze log data. |
Data analysis for IoT | You can use search indexes to query and analyze device data in Internet of things (IoT) scenarios. For example, you can filter device data and collect statistics based on device types and geographic locations. |
Application performance monitoring | You can use search indexes to aggregate and query metric data, which is essential for monitoring application performance. For example, you can filter and aggregate data by time range and application name. |
Location-based service | You can use search indexes to query geographic locations and search for nearby points of interest such as stores, attractions, and services. |
Text search engine | You can use search indexes for full-text queries and relevance-based sorts in a text search engine. This way, you can search for information such as documents and articles in an efficient manner. |
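As a rough illustration of the location-based scenario, the following Python sketch filters points of interest by great-circle distance using the haversine formula. It scans every point, whereas a search index would rely on a spatial index instead. The coordinates, the store names, and the helper `nearby` are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def nearby(points, center, radius_km):
    """Return the names of the points that lie within radius_km of center."""
    lat0, lon0 = center
    return [name for name, (lat, lon) in points.items()
            if haversine_km(lat0, lon0, lat, lon) <= radius_km]

# Hypothetical points of interest as (latitude, longitude) pairs.
stores = {
    "store_a": (30.2500, 120.1600),
    "store_b": (30.2550, 120.1600),  # roughly 0.56 km north of store_a
    "store_c": (30.3500, 120.1600),  # roughly 11 km north of store_a
}

print(nearby(stores, (30.2500, 120.1600), radius_km=1.0))  # ['store_a', 'store_b']
```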
Key features
Search indexes support the following key features:
Database query acceleration
Queries based on primary key columns and non-primary key columns
Boolean queries
Geo queries
AND, OR, NOT, and exists queries
IN queries
Fuzzy queries, including wildcard queries, prefix queries, and suffix queries
Nested queries
Sorting
Paging
Aggregation operations to obtain the maximum and minimum values, sums, averages, count and distinct count of rows, and percentile statistics, group results by specific conditions, and display data as histograms
Full-text search
KNN vector queries
Quick data filtering
For more information, see Features.
Disaster recovery capability
By default, search indexes provide the disaster recovery capability in regions that support the zone-redundant storage (ZRS) feature. In these regions, data is stored in multiple zones. If a fault such as a power outage, network outage, or fire occurs in a zone, data reads and writes are not affected. Search indexes support the ZRS feature in the following regions: China (Beijing), China (Shanghai), China (Hangzhou), China (Shenzhen), China (Zhangjiakou), China (Ulanqab), China (Hong Kong), Japan (Tokyo), Indonesia (Jakarta), Singapore, and Germany (Frankfurt).
In the preceding regions, existing search indexes are upgraded. Therefore, both existing and newly created search indexes support the ZRS feature. For more information, see ZRS.
Limits
For more information, see Search index limits.
Usage notes
To create a search index for a data table, you do not need to specify predefined columns for the data table.
Applicable model
Search indexes are applicable only to the Wide Column model.
Index synchronization
After a search index is created for a data table, data is written to the data table first. After the data is written to the data table, a success message is returned. At the same time, another asynchronous thread reads the newly written data from the data table and writes the data to the search index. The write performance of Tablestore is not affected when data is being asynchronously synchronized from a data table to a search index.
In most cases, the latency for synchronizing data to a search index is within 3 seconds. You can view the latency in real time in the Tablestore console.
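The write path described above can be sketched in Python as follows. This is a simplified in-process model under stated assumptions, not how Tablestore is implemented: the dictionaries `data_table` and `search_index`, the queue `pending`, and the functions `put_row` and `sync_worker` are hypothetical stand-ins.

```python
import queue
import threading

data_table = {}          # stands in for the data table
search_index = {}        # stands in for the search index
pending = queue.Queue()  # primary keys that still need to be indexed

def put_row(pk, row):
    """Write to the data table and return at once; indexing happens later."""
    data_table[pk] = row
    pending.put(pk)
    return "OK"

def sync_worker():
    """Background thread that copies newly written rows into the index."""
    while True:
        pk = pending.get()
        if pk is None:  # sentinel value: stop the worker
            pending.task_done()
            break
        search_index[pk] = data_table[pk]
        pending.task_done()

worker = threading.Thread(target=sync_worker, daemon=True)
worker.start()

status = put_row("device-1", {"temperature": 21})  # returns before indexing
pending.join()     # wait until the index has caught up
pending.put(None)  # ask the worker to stop
worker.join()
print(status, search_index)
```

The writer never waits for the index, which is why a row that was just written may briefly be missing from query results.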
Time to live (TTL)
If the UpdateRow operation is disabled for a data table, you can use the TTL feature of the search index that is created for the data table. For more information, see Specify the TTL of a search index.
If you want to retain data only for a period of time and the time field does not need to be updated, you can implement the TTL feature by partitioning a data table into several data tables based on the time field. The following table describes the principle, rule, and benefits of table partitioning by time.
Item | Table partitioning by time |
Principle | Partition a data table based on fixed periods of time, such as daily, weekly, monthly, or annually. Then, create a search index for each partitioned table. This way, you can retain tables for the specified periods of time based on your business requirements. For example, to retain data for a six-month period, you can store the data for each month in a data table. Label the data tables in sequence from table_1 to table_6 and create a search index for each data table. Make sure that each data table and its search index exclusively store data for their respective month. To implement the TTL feature, you only need to delete the data tables, and their associated search indexes, that have been retained for longer than six months. When you query data by using a search index, you only need to query a single data table if that data table contains all the data that falls within your desired time range. If the data that falls within your desired time range spans multiple data tables, you need to query all these data tables and merge the query results. |
Rule | The size of a single search index can be up to 50 billion rows. To ensure the optimal query performance, we recommend that you limit the size of a single search index to 20 billion rows or fewer. |
Benefits | You can adjust the data storage duration based on the number of data tables retained. Query performance depends on data volume. After a data table is partitioned into multiple data tables, the data volume of each data table has an upper limit. This ensures better query performance and avoids high query latency or timeouts. |
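The following Python sketch shows one way to work with monthly partitions: it lists the tables that a time-range query has to cover and the tables that fall outside the retention period. The `events_YYYYMM` naming scheme and the helpers `tables_for_range` and `expired_tables` are assumptions made for illustration. Deleting the tables and merging query results are left to the caller.

```python
from datetime import date

def table_name(year, month):
    """Hypothetical naming scheme: one data table per calendar month."""
    return f"events_{year:04d}{month:02d}"

def tables_for_range(start, end):
    """List the monthly tables that overlap the range [start, end]."""
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append(table_name(year, month))
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return names

def expired_tables(existing, today, keep_months):
    """Return the tables older than the newest keep_months months."""
    # Count months from year 0 to find the oldest month that is still kept.
    oldest_kept = today.year * 12 + (today.month - 1) - (keep_months - 1)
    cutoff = table_name(oldest_kept // 12, oldest_kept % 12 + 1)
    return sorted(name for name in existing if name < cutoff)

to_query = tables_for_range(date(2024, 11, 20), date(2025, 1, 5))
print(to_query)  # ['events_202411', 'events_202412', 'events_202501']

existing = ["events_202411", "events_202412", "events_202501", "events_202506"]
print(expired_tables(existing, date(2025, 6, 15), keep_months=6))
# ['events_202411', 'events_202412']
```

A query whose time range maps to more than one table name must be sent to each table, and the results merged on the client side.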
Max versions
You cannot create a search index for a data table for which you specified the max versions parameter.
You can specify the timestamp when you write data to a column that allows only a single version. If you write data with a greater version number first and then write data with a smaller version number, the data with the greater version number may be overwritten by data with the smaller version number.
The results of the Search and ParallelScan requests may not include the timestamp attribute.
API operations
Tablestore provides API operations to manage search indexes and implement the features of search indexes. You can call the Search or ParallelScan operation to implement the features of search indexes. Most features that are provided by the two API operations are the same. However, to improve the performance and throughput, the ParallelScan operation does not provide some features of the Search operation.
Category | Operation | Description |
Management operations | CreateSearchIndex | Creates a search index. |
Management operations | DescribeSearchIndex | Queries the details about a search index. |
Management operations | ListSearchIndex | Queries a list of search indexes. |
Management operations | DeleteSearchIndex | Deletes a search index. |
Query operations | Search | Supports all features of search indexes. You can call the Search operation to query data by using all supported query methods and analyze data by performing sorting and aggregation operations. The query results are returned based on the specified order. |
Query operations | ParallelScan | Exports data in parallel. You can call the ParallelScan operation to query data by using all supported query methods. However, to ensure faster retrieval of query results, the ParallelScan operation does not support analysis features such as sorting and aggregation. The ParallelScan operation offers superior query performance compared to the Search operation. When the ParallelScan operation processes a single query request that includes a parallel scan task, its throughput is five times that of the Search operation. Before you call this operation, you must call the ComputeSplits operation to query the maximum number of parallel scan tasks for a single ParallelScan request. |
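The relationship between ComputeSplits and ParallelScan can be sketched with a local simulation in Python. The code does not call the Tablestore SDK: `compute_splits` and `scan_split` are hypothetical stand-ins that mimic asking for the maximum number of parallel tasks and then running one scan task per split. The sample rows are invented.

```python
from concurrent.futures import ThreadPoolExecutor

# Invented sample rows: every third row is an error log.
ROWS = [{"id": i, "level": "error" if i % 3 == 0 else "info"} for i in range(12)]

def compute_splits(rows, rows_per_task=4):
    """Stand-in for ComputeSplits: how many scan tasks may run in parallel."""
    return max(1, -(-len(rows) // rows_per_task))  # ceiling division

def scan_split(rows, task_id, max_parallel, predicate):
    """Stand-in for one ParallelScan task: scan only this task's share of rows."""
    return [row for i, row in enumerate(rows)
            if i % max_parallel == task_id and predicate(row)]

def parallel_scan(rows, predicate):
    """Run one scan task per split and concatenate the partial results."""
    max_parallel = compute_splits(rows)
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        futures = [pool.submit(scan_split, rows, task_id, max_parallel, predicate)
                   for task_id in range(max_parallel)]
        results = []
        for future in futures:
            results.extend(future.result())
    return results  # rows arrive per task; no global sort order is applied

errors = parallel_scan(ROWS, lambda row: row["level"] == "error")
print(sorted(row["id"] for row in errors))  # [0, 3, 6, 9]
```

Skipping the global sort is what allows each task to return its rows independently, which matches the trade-off described for the ParallelScan operation.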
Procedure
Step | Operation | Description |
1 | Create a search index | After you create a search index for a data table, you can query data in the data table based on the fields that are used to create the search index. |
2 | Use the search index to query data | The following query methods are provided: match all query, match query, match phrase query, term query, terms query, prefix query, range query, wildcard query, Boolean query, nested query, geo query, geo-bounding box query, geo-polygon query, exists query, and collapse (distinct). Use query methods based on your business requirements. If you use a search index to query data, the field values can be tokenized into multiple tokens based on the tokenization method that you specify. The rows that meet the query conditions can be returned by order and page based on the sorting and paging methods that you specify. For more information, see Tokenization and Sorting and paging. |
3 | Use the search index to analyze data | To analyze data in the data table by using the search index, you can perform aggregation operations to obtain the minimum value, maximum value, sum, average value, count and distinct count of rows, and percentile statistics. You can also perform aggregation operations to group results by field value, range, geographical location, filter, histogram, or date histogram. |
4 | Scan data in parallel | If you do not have requirements on the order of query results, you can use the parallel scan feature to obtain query results in an efficient manner. |
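To make the query and analysis steps more concrete, the following Python sketch computes a few of the aggregations named above (minimum, maximum, sum, average, count, distinct count, and grouping by field value) and applies sorting with paging to a small in-memory data set. It only mirrors the semantics of these operations. The sample rows and the helpers `aggregate`, `group_by`, and `sort_and_page` are hypothetical and unrelated to the Tablestore SDK.

```python
from collections import defaultdict

# Invented sample rows.
orders = [
    {"city": "Hangzhou", "amount": 30},
    {"city": "Beijing", "amount": 10},
    {"city": "Hangzhou", "amount": 50},
    {"city": "Shanghai", "amount": 20},
]

def aggregate(rows, field):
    """Compute basic statistics over one numeric field."""
    values = [row[field] for row in rows]
    return {"min": min(values), "max": max(values), "sum": sum(values),
            "avg": sum(values) / len(values), "count": len(values)}

def group_by(rows, key, field):
    """Group rows by the value of key and aggregate field within each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return {value: aggregate(members, field) for value, members in groups.items()}

def sort_and_page(rows, field, offset, limit):
    """Sort rows by field in ascending order and return one page."""
    return sorted(rows, key=lambda row: row[field])[offset:offset + limit]

stats = aggregate(orders, "amount")
by_city = group_by(orders, "city", "amount")
distinct_cities = len({row["city"] for row in orders})
page = sort_and_page(orders, "amount", offset=1, limit=2)

print(stats["avg"], by_city["Hangzhou"]["sum"], distinct_cities)  # 27.5 80 3
print([row["amount"] for row in page])                            # [20, 30]
```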
Methods
You can use search indexes in the Tablestore console or by using the Tablestore CLI or Tablestore SDKs.
Billing rules
The billable items of a search index include the read throughput, data storage size, and outbound traffic over the Internet of the search index. The read throughput comprises the reserved read throughput and the metered read throughput. You are charged for the metered read throughput on a pay-as-you-go basis. For more information, see Billable items of search indexes.
FAQ
How do I select between a secondary index and a search index?
What do I do if no data is found by calling the Search operation?
What are the differences between the GetRange and Search operations?
Does Tablestore support the storage of data in the JSON format?
Why are reserved read CUs generated when I use a search index?
Can I modify the reserved read CU settings for a search index?
References
Tablestore allows you to query and analyze data by using the SQL query feature. For more information, see Overview.
Note: You can also use compute engines such as MaxCompute, Spark, Hive, HadoopMR, Function Compute, and Realtime Compute for Apache Flink to analyze data in Tablestore. For more information, see Overview.
Tablestore provides examples of using search indexes in the following scenarios: e-commerce orders, store searches, geo-fence, and intelligent metadata. For more information, go to the Scenario Demos page in the Tablestore console.