This topic describes the basic principles, scenarios, and policies of specifying table groups and shard counts in Hologres for optimized query performance. This topic also provides answers to some frequently asked questions about table groups and shard counts.
Recommended instance specifications
In practice, you may need to determine the most suitable range of the shard count when the data amount is predictable. The most suitable shard count is not only related to the amount of data to be stored, but also related to the actual access frequency, the actual amount of accessed data, the computing load type such as point query or data analysis, the write throughput, and the number of tables in the table group. Therefore, you cannot obtain the accurate value for the most suitable shard count. The following table describes the recommended shard counts and instance specifications that correspond to specific data amount ranges. You can select suitable settings based on the estimated data amount.
Number of data rows | Recommended shard count | Recommended instance specifications |
Less than 40 million | 10 to 20 | More than 32 CPU cores |
40 million to 400 million | 20 to 40 | More than 64 CPU cores |
400 million to 4 billion | 40 to 80 | More than 128 CPU cores |
4 billion to 40 billion | 80 to 240 | More than 256 CPU cores. In this case, you may need to create a table group. |
40 billion to 400 billion | 160 to 400 | More than 512 CPU cores. In this case, you can create multiple table groups. |
The recommended shard counts and instance specifications that correspond to different numbers of data rows in the preceding table are not the only criteria. A table whose data amount is small can also be added to a table group that has a large shard count. A table whose data amount is large can also be added to a table group that has a single shard. You must select an appropriate shard count based on your business scenario to meet requirements of high concurrency, high computing efficiency, and high data concentration, and prevent unnecessary shuffle overheads.
We recommend that you configure no more than 3,000 tables (including partitioned child tables) in a table group. If a large number of tables exist in a table group, metadata accumulates, and the execution of data definition language (DDL) statements slow down.
You can determine whether to create a table group and how to select a table group for your tables based on the best practices in the following plans:
Plan 1: Use the default table group
If your Hologres instance meets the following conditions, we recommend that you use the default table group. After you upgrade or downgrade the specifications of a Hologres instance, the shard count of the default table group does not change. You can execute the following statement to query the shard count:
SELECT * FROM hologres.hg_table_group_properties; -- Sample result tablegroup_name | property_key | property_value -----------------+---------------+---------------- test_tg_default | is_default_tg | 1 test_tg_default | shard_count | 40 test_tg_default | tg_version | 1 test_tg_default | table_num | 1 (4 rows)
Data amount
The shard count of the default table group meets the requirement of the data amount. In this case, you can create tables in the default table group.
Overall size
The total data amount of all tables is controllable and predictable. The usage mode does not significantly change.
Local join
You need to perform efficient local join operations on tables in the default table group.
Plan 2: Create a table group
The default table group cannot meet your requirements, and you may need multiple table groups. In most cases, multiple table groups may be required in your instance under the following conditions:
Data amount
The shard count of the existing table group is inappropriate for the estimated data amount in the current table. If the shard count is large and the data amount is small, excessive small files are generated and the I/O overhead is high. If the shard count is small and the data amount is large, the concurrency of queries is reduced. In this case, multiple table groups are required.
Independent load
The existing table group contains a large number of tables, and data needs to be simultaneously written to most of the tables. As a result, the instance load is high. In addition, the table to be created requires high query and write throughput. In this case, multiple table groups are required to make data write and query independent to some extent. For more information, see topics about how to isolate computing resources. If you determine that the existing table group cannot meet the write and query requirements, multiple table groups are also required.
Table correlation
If a set of tables in your business have a unique data write or query pattern, have or will have local join requirements, and have few or no correlation with tables in the existing table group, you can create multiple independent table groups for the set of tables. Local join operations can be performed only on tables with the join key as the distribution key and in the same table group. You can create a table group for a set of tables that have a strong correlation with each other but have few correlation and a low probability of local join with tables in the existing table group.
Instance resource scaling
If your instance has been scaled in or out by more than five times, the original shard count may no longer meet the requirements. In this case, you can change the default table group. The number of shards must be greater than the number of compute nodes and less than 60% of the total number of CPU cores.
Plan 3: Add tables to multiple table groups
If you need to plan multiple table groups, we recommend that you plan the roles and significance of the table groups and the table group to which each table belongs before stress testing and production. You can consider the following factors during planning:
Data amount
The shard count is determined by the amount of data to be stored in a table. A table group that has a large shard count is appropriate for large tables, and a table group that has a small shard count is appropriate for small and medium tables.
Required write performance
The shard count has a positive correlation with data write performance. The write capability of a single shard has an upper limit. More shards indicate higher write concurrency and throughput. If you need to write data to a table at a high records per second (RPS), a larger shard count may be required. If the CPU utilization for a single core is 100%, a single shard in Hologres writes data at an RPS of 3,000 to 5,000 (1 KB per record). You can estimate the required shard count based on your required RPS. Each shard also needs to perform read operations such as data query. Therefore, the CPU utilization for data write cannot reach 100%. A shard that uses a 1/3 CPU core writes data at an RPS of 1,000 (1 KB per record). For example, if you want to write data at an RPS of 60,000 and the size of each record is 1 KB, the shard count must be greater than 60. The shard count can be fine-tuned.
Load of each table group
When you create a table group, you must consider the number of tables to be added to this table group. If a large number of tables will be added to this table group in the future, and most of the tables are frequently accessed, a small shard count may fail to support highly concurrent queries.
FAQ
I have an instance that has 512 CPU cores and use it to perform online analytical processing (OLAP) on a real-time event table. The table contains about 20 billion to 40 billion rows. In this case, how do I specify the table group and shard count?
A single computing load is involved so that you can use one table group. The default shard count for an instance that has 512 CPU cores is 160. If the event table contains a large number of columns, such as hundreds of columns, you can appropriately increase the shard count to improve OLAP concurrency. For example, change the shard count of the default table group in the database to 200 or more to store the event table.
I have an instance that has 256 CPU cores and a large number of column-oriented tables, and use the instance to perform fast OLAP in milliseconds. Each table contains tens of millions of rows. I want to group data based on multiple fields and query details based on conditions. How do I specify the table group and shard count?
A single computing load is involved so that you can use one table group. The default shard count for an instance that has 256 CPU cores is 120. For the table that contains tens of millions of rows, we recommend that you specify 10 to 20 shards. Especially for aggregate operations such as grouping, more shards cause more shuffle overheads, and analysis in milliseconds cannot be implemented. Therefore, the default table group may not be able to meet your requirements. To achieve a better effect, you can change the shard count of the default table group in the database to a value between 16 and 40 based on the specific situation, and perform a stress test.
How do I check whether a slow query is caused by an inappropriate shard count?
For an inappropriate shard count, the shard count may be too large or too small.
If the shard count is too large, the query startup overhead or shuffle overhead is large. The query startup overhead can be learned from the start query cost line in the execution result of the EXPLAIN ANALYZE statement. The shuffle overhead can be determined based on the Max_GetNext_Time size of Redistribution Motion in the execution result of the EXPLAIN ANALYZE statement. In Hologres V0.10 and later, you can view these overheads of historical queries in slow query logs.
If the shard count is too small, the CPU utilization cannot reach 100% during long-term computing, the overhead of scanning data is large because of insufficient concurrency, or the data write performance is poor. The overhead of scanning data can be determined based on the Max_GetNext_Time size of Scan Node in the execution result of the EXPLAIN ANALYZE statement. If the CPU utilization for a single core is 100%, a single shard in Hologres writes data at an RPS of 3,000 to 5,000. You can compare the actual RPS with this range to check whether the shard count is too small.
The queries per second (QPS) is not high enough in point query scenarios. Is this issue caused by insufficient shards?
First of all, determine whether another cause exists. For example, analysis query rather than point query is performed, no indexes are used, shards are not split, or the CPU utilization reaches 100%. If other possible causes do not result in the issue, a single SQL statement achieves the highest performance, and the QPS still does not meet your requirement, you can increase the shard count to increase the backend concurrency of the point query.
How do I troubleshoot data skew of a shard?
Hologres provides an internal field hg_shard_id. The field specifies the ID of the shard in which data resides. You can execute an SQL statement to check whether data skew exists in a shard.
SELECT hg_shard_id, COUNT(1) FROM <Table_Name> GROUP BY hg_shard_id ORDER BY COUNT(1) DESC;
If the amount of data in a shard is significantly larger than that in other shards, data skew exists. In this case, you may need to adjust the distribution key.