By Donggu
Lindorm's Wide Table Engine provides petabyte-level storage capability: data is partitioned by primary key range and evenly distributed across machines. Lindorm also offers SQL and indexing support, delivering an experience similar to that of a relational database. However, note that the Lindorm wide table engine is fundamentally a distributed NoSQL store built on the LSM-Tree (Log-Structured Merge-Tree) storage architecture, which differs from traditional relational databases in several respects.
Therefore, before using Lindorm wide tables, a basic understanding of principles such as data modeling and data distribution can help you use Lindorm more effectively and avoid performance problems, hotspots, and other issues caused by unreasonable business modeling.
The Lindorm Wide Table Engine is a row storage engine. If you create a sample table orders using the following SQL statement:
CREATE TABLE orders (
  channel VARCHAR NOT NULL,   -- Payment channel: Alipay, WeChat, etc.
  id VARCHAR NOT NULL,        -- Order ID
  ts TIMESTAMP NOT NULL,      -- Time at which the order event occurred
  status VARCHAR,             -- Order status
  location VARCHAR,           -- Location where the order occurred
  PRIMARY KEY(channel, id, ts)   -- The primary key is composed of channel, id, and ts
) WITH (DYNAMIC_COLUMNS='true'); -- Enable the dynamic column feature: non-primary key columns can be written without being predefined
The following data model can be obtained:
Primary key columns: channel, id, ts. Non-primary key columns: status, location, and so on.

| channel | id | ts | status | location | …… |
|---|---|---|---|---|---|
| alipay | a0001 | 1705786502000 | 0 | shanghai | …… |
| alipay | a0002 | 1705786502001 | 1 | beijing | …… |
| …… | …… | …… | …… | …… | …… |
| unionpay | u0001 | 1705786502056 | 0 | hangzhou | …… |
| unionpay | u0002 | 1705786502068 | 1 | nanjing | …… |
| …… | …… | …… | …… | …… | …… |
| wechat | w0001 | 1705786502056 | 0 | shanghai | …… |
| wechat | w0002 | 1705786502068 | 0 | shanghai | …… |
| …… | …… | …… | …… | …… | …… |
Columns in a row are divided into primary key columns and non-primary key columns. A primary key can consist of multiple columns and acts as a clustered index: rows are stored sorted by the primary key. In the sample table orders, rows with equal values in the channel column are stored together; rows with the same channel value are sorted by the id column, and rows with the same channel and id values are sorted by ts.
Leveraging the fact that the primary key is a clustered index enables more efficient queries. During query execution, Lindorm, like MySQL, follows the leftmost prefix match principle: specify equality conditions on as many leading primary key columns as possible to narrow the query range. When designing a table's primary key, place the most commonly used equality query conditions at the leftmost positions of the primary key. Once a range condition is applied to a primary key column, leftmost prefix matching stops at that column; even if the remaining primary key columns carry equality conditions, the scan can no longer use the storage order for them and must filter a large number of rows within the range.
Here is a sample query for the orders table:
-- All rows where channel is 'alipay' and id is greater than 'a0089' must be read in order to find the rows that satisfy ts=1705786502068.
SELECT * FROM orders WHERE channel='alipay' AND id > 'a0089' AND ts = 1705786502068;
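For contrast, a hedged example of a query that follows the leftmost prefix fully: fixing both channel and id with equality conditions (reusing the sample values above) lets the condition on ts be resolved within a single, narrow primary key range.
-- Both leading primary key columns are fixed by equality, so only one narrow
-- primary key range needs to be examined to evaluate the condition on ts.
SELECT * FROM orders WHERE channel='alipay' AND id='a0089' AND ts=1705786502068;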
If the leftmost primary key column is not specified, querying on the other primary key columns results in a full table scan, even though a primary key column appears in the query conditions, as shown below:
-- This query scans the whole table.
SELECT * FROM orders WHERE id = 'a0089';
In the statement above, the query condition specifies the primary key column id, but id sits in the middle of the primary key. The system therefore still has to filter all rows to find those with id='a0089', which is very inefficient. If this is one of your main query patterns, it is recommended to place the id column at the leftmost position of the primary key when designing the table, or to create an index table on the id column to speed up the query.
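For illustration, a hedged sketch of the second option (the index name is assumed; see the secondary index documentation referenced later in this article for the full syntax):
-- A secondary index whose leftmost column is id lets queries that filter on id
-- be served from the index table instead of scanning the entire main table.
CREATE INDEX idx_id ON orders(id);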
Lindorm supports the dynamic definition of non-primary key columns, meaning that non-primary key columns do not need to be defined in the schema: you can write non-primary key columns with any column name, similar to how HBase operates. For detailed usage instructions, please refer to Dynamic Columns. Additionally, Lindorm supports using wildcards to define non-primary key columns. For example, if you define columns matching *_str as STRING, all columns whose names end in _str will be written as STRING. For more information, see Wildcard Columns.
Lindorm supports updating non-primary key columns. When writing data, you do not need to specify all non-primary key columns, but at least one non-primary key column must be specified; writing only the primary key is not supported. Because Lindorm wide table storage is ordered by primary key, filtering on a non-primary key column that has no index may trigger a full table scan, and Lindorm rejects such queries by default. Therefore, to query non-primary key columns efficiently, either restrict the primary key range or create indexes on the non-primary key columns.
-- If the primary key is not specified and only non-primary key columns are used as query conditions, the request will scan the entire table.
SELECT * FROM orders WHERE location = 'shanghai';
-- A scan that specifies the primary key will scan all rows where channel = 'alipay' and filter them to return only the rows where location = 'shanghai'.
SELECT * FROM orders WHERE location = 'shanghai' AND channel = 'alipay';
Lindorm is a distributed database where data in tables is partitioned according to primary key ranges and automatically distributed across the nodes (machines) of the instance. As shown in the diagram:
In Lindorm, partitions are referred to as Regions. A Region stores a contiguous segment of the table's data, and all Regions, ordered by primary key range and connected end to end, form the entire table space. Suppose we write a row of data: {alipay, a100999, 1705786502068}. This row will be stored in Region_3. When the data in Region_3 grows beyond the threshold (8 GB by default), or the system detects a read/write hotspot in the Region, the Region is split. The two resulting Regions cover the upper and lower halves of the original Region's range, and the system assigns them to different servers based on load to achieve load balancing.
The system does not guarantee that rows with the same prefix are placed in the same partition. For example, it cannot ensure that all rows meeting the condition channel=alipay are stored in the same Region: if there are enough rows with channel=alipay, they will be distributed across multiple Regions. For instance, in the diagram above, rows with channel=alipay are distributed across Region_1, Region_2, and Region_3. During actual writes, you do not need to worry about a particular data range becoming overloaded, nor do you need to specify partitions with PARTITION BY when creating the table. No intervention is required; the system automatically chooses appropriate split points.
Since Lindorm supports automatic partitioning, there is no need to define partition ranges when creating a table; the system splits partitions automatically as data is written. If you have specific requirements for data distribution, you can specify the number of pre-partitions when creating the table, and the system will distribute the table's Regions across multiple machines to balance reads and writes. Note that the number of partitions specified during table creation is only the initial number of Regions; the system will still split Regions as data is written. For how to specify pre-partitions during table creation, refer to CREATE TABLE.
If you anticipate a large initial write volume or plan to use Bulkload for batch data import, it is recommended to set a suitable number of pre-partitions during table creation to prevent a single server from exceeding its load during the initial write phase.
If you write data using SQL or the HBase API, you can set the number of pre-partitions to the number of nodes multiplied by 4 at table creation. The initial number of partitions should not be excessive, so choose a value appropriate to your workload.
For Bulkload batch data imports, it is recommended to specify the number of pre-partitions as the data volume (GB) divided by 8 during table creation. This allows data to be evenly distributed across each Region, preventing any single Region from holding too much data and triggering splits.
Additionally, you need to ensure that the ranges of the pre-partitioned Regions match the data writing pattern. Otherwise, even with a large number of pre-partitions, data will skew into a few Regions and performance will fall short of expectations.
In a distributed system, the key is to ensure that data requests are evenly distributed across each Region, enabling the system to achieve horizontal scaling and handle more data or requests. However, when the system encounters a hotspot, it can cause a single machine to become a performance bottleneck.
● Single Row Hotspot: Frequent reads and writes to a single row of data can generate a single row hotspot. Since requests for the same row will always be sent to the same server, the performance limit of that server determines the overall system performance limit. In this case, the hotspot issue cannot be resolved through horizontal scaling but can only be handled by upgrading the server's configuration to process more requests. Therefore, in wide table design, it's crucial to consider hotspot issues and avoid generating single row hotspots.
● Small Range Hotspot: Frequent reads and writes within a very small range of data can create a small range hotspot. Since Lindorm partitions data based on the range of the primary key, a small range of data is likely to be distributed within the same partition, causing requests to be sent to the same server. Although Lindorm's hotspot self-healing feature can recognize small range hotspots and automatically split the Region to distribute the data across different nodes, you can still choose more dispersed primary keys to avoid small range requests. For example, in the sample table orders, you can apply a HASH function to the id column before storing it to minimize hotspot issues.
● Increasing Primary Key Write Hotspot: An increasing primary key means that every newly written primary key value is larger than the existing ones. Since Lindorm partitions data based on the range of the primary key, even if a Region splits, subsequent writes land only in the child Region that covers the larger key values, so the load cannot be dispersed. For instance, in the example table orders, if id is increasing and the write volume of orders meeting the condition channel=alipay is very large, then even if Region_3 splits into Region_3_a and Region_3_b, only Region_3_b will bear the write load. Because write hotspots caused by increasing primary keys cannot be resolved at the system level, it is recommended to avoid increasing primary keys where possible when designing the primary key; in particular, the first column of the primary key should not be an increasing value.
Important
If you are migrating data from a system that uses HASH partitioning, such as Cassandra, to Lindorm, note that in systems like Cassandra the first primary key column is the HASH partition key: even if the values in this column increase monotonically, the HASH function distributes them evenly across partitions. However, if you migrate such an incrementing primary key to Lindorm as-is, it will cause serious hotspot issues.
The primary key design is crucial in a Lindorm wide table, as data distribution and data sorting are closely related to the primary key. A good primary key is key to the reasonable allocation of wide table resources and the correct usage of the wide table. For suggestions on primary key design, please refer to Design primary keys for Lindorm wide tables.
Key points for designing the primary key are as follows:
● Keep the Primary Key Short: The primary key should be as short as possible while ensuring the uniqueness of the row. Do not include JSON, webpage content, etc., in the primary key.
● Distribute the First Column of the Primary Key: If you use an incrementing value such as an order ID, consider storing it in a transformed form such as hash(id) + id or reverse(id) (see the sketch after this list).
● Avoid Incremental Values in the First Column: The first column of the primary key should not contain incrementing values, to avoid serious hotspot issues. If an incremental field in the primary key is unavoidable, place it in the second or third column to keep the leftmost part of the primary key as dispersed as possible. For example, in the orders table, if the first primary key column is channel and the IDs within different channels increment, the writes will still be distributed across different Regions, mitigating the risk of a single write hotspot.
● Optimize for Query Performance: The primary key is crucial for query performance (in the absence of an index table). Therefore, the primary key design should be considered in conjunction with the main query patterns.
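As a hedged illustration of the reverse(id) approach (the sample values are reused from earlier, the transformation is assumed to be applied on the client before writing, and UPSERT is used as the write statement):
-- Instead of storing the raw incrementing id 'a0089', store its reversed form
-- '9800a' so that consecutive ids spread across different primary key ranges.
UPSERT INTO orders(channel, id, ts, status) VALUES('alipay', '9800a', 1705786502068, '0');
-- Point lookups must apply the same transformation to the id before querying.
SELECT * FROM orders WHERE channel='alipay' AND id='9800a' AND ts=1705786502068;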
When queries based solely on the primary key cannot meet business performance requirements, a secondary index can be created on the columns being queried. For a detailed introduction, please refer to the section on Secondary Indexes.
A secondary index is an index table established with the index column as the primary key. The indexed column essentially serves as the primary key of the index table, so the design of the index column must still follow the principles of primary key design. Columns with low dispersion or incrementing values are not suitable for indexing.
Queries using a secondary index also adhere to the leftmost matching principle. For example, if a composite index is created on columns a, b, and c, and the query condition specifies only column b or c but not the leftmost indexed column a (such as SELECT * FROM tablename WHERE b=xx;), the query will still perform a full table scan.
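A hedged sketch of this situation (the index name, table name, and column names mirror the generic example above and are illustrative only):
-- Composite index on (a, b, c); its leftmost indexed column is a.
CREATE INDEX idx_abc ON tablename(a, b, c);
-- Hits the index: the leftmost indexed column a is constrained by equality.
SELECT * FROM tablename WHERE a='x1' AND b='x2';
-- Does not hit the index (a is not constrained), so the main table is scanned.
SELECT * FROM tablename WHERE b='x2';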
After creating a secondary index, data will be written to both the main table and the index table simultaneously. A single data write will involve multiple operations, such as re-reading the main table and writing to the index table, which can affect the write performance of the main table. To ensure the read and write performance of the main table, the number of secondary index tables should not be excessive.
Additionally, columns indexed by a secondary index cannot be modified. If an existing index table fails to meet query requirements, the only option is to delete the index and create a new one. Therefore, secondary indexes are more suitable for relatively static query scenarios. If there are many columns requiring indexing and composite queries, it is recommended to use Lindorm search indexes and columnar storage indexes.
Below is a comparison of the three types of indexes:
| Index Type | Applicable Scenarios | Additional Dependencies | Real-Time Visibility |
|---|---|---|---|
| Secondary Index | Suitable for relatively fixed query scenarios. | None | Yes |
| Search Index | Suitable for online query scenarios with many index columns and a variety of query conditions. | Requires enabling the search index feature and purchasing search engine nodes. | Data must be synchronized to the search engine, which introduces some delay. See Search Index Overview. |
| Columnar Storage Index | Suitable for scenarios where analytical queries are run on certain columns. | Requires enabling the columnar storage index feature and purchasing computing engine nodes. | Data must be converted to the columnar storage format, which introduces some delay. See Column Index. |
Data writing operations need to follow reasonable primary key design to ensure data can be evenly distributed across each server, thereby achieving optimal performance. Batch writing can save the number of RPC calls compared to single-row writing, and servers can process row writes in bulk, making it easier to reach higher throughput. However, when using batch writing, it is not always better to include more rows. Too much data in a single batch may cause server OOM (Out Of Memory) or Full GC (Garbage Collection), thus affecting service stability. Therefore, it is recommended to control the number of rows per batch within a reasonable range, with a batch size not exceeding 2 MB.
Lindorm supports both HBase API and SQL for data access and writing, but the two access methods cannot be mixed. If you create a table using SQL and define each column's data type (e.g., INT, LONG, etc.), using the HBase API to write to these columns may result in the inability to read using SQL. Tables created using SQL can only be read and written through SQL. Conversely, tables created using the HBase API can be accessed through SQL, and you can use column mapping to add a schema. For specific operations, please refer to Use SQL statements to access an ApsaraDB for HBase table.
Lindorm wide table commonly has two types of queries: point lookup and range query.
If all the primary keys of the table are specified during the query, it is a point lookup. For example:
SELECT * FROM orders WHERE channel='alipay' AND id='a0001' AND ts=1705786502000;
SELECT * FROM orders WHERE channel='alipay' AND id='a0001' AND ts IN (1705786502000, 1705786502222, 1705786502333);
SELECT * FROM orders WHERE channel='alipay' AND id IN ('a0001', 'a0002', 'a0003') AND ts IN (1705786502000, 1705786502222, 1705786502333);
SELECT * FROM orders WHERE channel IN ('alipay', 'wechat', 'unionpay') AND id IN ('a0001', 'a0002', 'a0003') AND ts IN (1705786502000, 1705786502222, 1705786502333);
In the above example statements, the first SQL statement is a single-row point query, while the others are multi-row point queries. The second statement specifies multiple values in its IN condition and therefore returns multiple rows from the Lindorm wide table. In the third statement, the primary key columns id and ts each carry an IN condition with three values, so the query actually checks 3×3=9 rows; similarly, the fourth statement queries 3×3×3=27 rows. The more IN conditions a point query contains, the more cross combinations (a Cartesian product) are generated, increasing the number of row requests sent to the server. When executing a multi-row point query, a Lindorm wide table returns all results at once: the more rows queried, the more rows the server must process and the larger the returned dataset, which makes server OOM (Out Of Memory) or Full GC (Garbage Collection) more likely. Therefore, during point queries, control the number of rows looked up, reduce the number of values in each IN condition, and minimize the number of IN combinations.
Lindorm imposes a limit on the number of batch point queries, with a default maximum of 2,000 point query requests in one execution. If this limit is exceeded, the error Multi Get Plan query too many rows in one select is returned. If you have a large batch point query requirement and an assessment indicates that memory will not become a bottleneck, you can contact Lindorm technical support (DingTalk ID: s0s3eg3) to request a relaxation of this limit.
If no primary key is specified in the query conditions, or only part of the primary key is specified, then the query is a range query. For example:
SELECT * FROM orders;
SELECT * FROM orders WHERE channel='alipay';
SELECT * FROM orders WHERE channel='alipay' AND id='a0001';
SELECT * FROM orders WHERE channel='alipay' AND id IN ('a0001', 'a0002', 'a0003');
SELECT * FROM orders WHERE channel IN ('alipay', 'wechat', 'unionpay') AND id IN ('a0001', 'a0002', 'a0003');
In the above example SQL statements, the complete primary key is never specified, so all of them are range queries. The results of range queries are returned in a streaming fashion: even for a full table scan, the server can return all data via streaming without running out of memory. Therefore, there is no limit on the number of rows in a single range query, and you do not need to add conditions such as LIMIT or OFFSET to the query statement to implement pagination or cursors.
The data in Lindorm wide tables is ordered by primary key. Even if the SELECT statement does not include an ORDER BY clause, the system returns query results in primary key order. However, if a query request hits an index table, the results are returned in the order of the index table by default; the order of the data in the index table is determined by the index columns, so the index columns dictate the order in which results are returned.
The columns in the ORDER BY clause follow the leftmost matching principle. To sort efficiently, do not skip the leftmost primary key columns and sort directly on middle primary key columns in the ORDER BY clause: sorting on middle primary key columns or non-primary key columns requires substantial sorting computation. It is therefore recommended to place the primary key column you need to sort by at the leftmost position, or to create a secondary index on the non-primary key column that requires sorting. Additionally, if you want query results returned in descending order (ORDER BY ... DESC), it is advisable to add the DESC keyword when creating the table so that the data is stored in descending order and queries achieve optimal performance. For detailed instructions on using ORDER BY, please refer to Use ORDER BY in large result sets.
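A hedged sketch of the descending-order setup (the table name, columns, and the exact placement of the DESC keyword in the primary key definition are assumptions for illustration; see the referenced ORDER BY documentation for the authoritative syntax):
CREATE TABLE orders_desc (
  channel VARCHAR NOT NULL,
  id VARCHAR NOT NULL,
  ts TIMESTAMP NOT NULL,
  status VARCHAR,
  PRIMARY KEY(channel, id, ts DESC)  -- ts declared descending at creation time (assumed syntax)
);
-- Rows for one channel/id pair are then stored newest-first, so this query can
-- return results in storage order without extra sorting.
SELECT * FROM orders_desc WHERE channel='alipay' AND id='a0001' ORDER BY ts DESC;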
Lindorm is a NoSQL database based on the LSM-Tree storage structure. Because of multi-version updates, delete markers, and similar data (that is, a row may have been modified or deleted multiple times), the underlying storage does not record the table's row count in its metadata. An accurate row count therefore requires scanning the entire table, and the larger the table, the longer this takes. If you only need an estimated row count, please refer to Count the number of rows in a table.
If the count request has filter conditions, the query time depends on how much table data must be scanned to evaluate those conditions. For example, SELECT count(*) FROM orders WHERE channel = 'alipay' reads and counts all rows that meet the condition channel = 'alipay'.
When a row is deleted from a Lindorm wide table, the data is not cleaned up immediately. Instead, a delete marker is written first, and the data is only removed completely after the system performs a COMPACTION operation. Until the data is thoroughly cleaned up, both the deleted data and the delete marker are still read and filtered out by the query logic, so a large number of delete operations increases the number of delete markers and affects query performance. It is recommended to set a TTL (time-to-live) to phase out unneeded data rather than deleting it directly. If business operations require substantial deletions, you can shorten the COMPACTION cycle to clean up deleted data and delete markers more quickly. For how to set the TTL and the COMPACTION cycle, please refer to ALTER TABLE.
If secondary indexes are used, updating data in the main table first deletes the corresponding pre-update row in the secondary index table and then writes the new row. Updating the main table therefore also generates a large number of delete markers in the secondary index table, and frequent updates to the main table will affect the query performance of the index table. You can mitigate the impact of delete markers by shortening the COMPACTION cycle.
Important:
Shortening the COMPACTION cycle increases the system load. Set it to a reasonable value.
The Lindorm wide-table engine is based on the LSM-Tree storage structure. If you are unfamiliar with features of the LSM-Tree structure such as multi-versioning, timestamps, and TTL, improper use may lead to write or query results that do not meet expectations. In such cases, refer to Common causes for unexpected query results.
When using Lindorm wide-tables, you need to continually monitor instance metrics such as CPU, network, and disk levels to avoid performance drops due to insufficient resources. Additionally, keep an eye on the number of files on the server, the size of Regions, and whether there is a backlog in Compaction to prevent exceeding Lindorm's limits, which could impact data writes and queries. For details on Lindorm usage restrictions, refer to Quotas and Limits, and for key monitoring items to watch in monitoring and alerts, refer to Monitoring and Alerts Best Practices.
For other common errors and their solutions, please refer to the FAQ.