Senior Technical Experts from Alibaba Discuss Data Warehouse Tuning

Part 1 of this 2-part series summarizes the best practices in the design of AnalyticDB tables, data writing, efficient queries, and some common problems.

We have accumulated some best practices using the promotion and application of the cloud-native data warehouse named AnalyticDB for MySQL (hereinafter referred to as AnalyticDB) in various business lines of Alibaba Group and all walks of life. This article organizes those practices for your reference and hopes to be helpful to readers. This article summarizes the best practices in the design of AnalyticDB tables, data writing, efficient queries, and some common problems.

Note:

Before reading this article, we recommend reading the official documentation of AnalyticDB.
The best practices written in this article are mainly for AnalyticDB 3.0, but AnalyticDB 2.0 is also applicable in principle.

1. Best Practices for Table Design

AnalyticDB is a distributed data warehouse that can bring excellent performance experience in the real-time analysis of large amounts of data. It is expected to pay attention to the following rules when designing tables to give full play to the performance of AnalyticDB in data analysis.

1.1 Select the Appropriate Table Type (Dimension Table or Common Table)

Dimension Table: It is also known as a broadcast table, which is a concept in data warehouse, and generally stores dimension data. The statements to create a table in AnalyticDB contain the keywords of DISTRIBUTED BY BROADCAST. These tables store one copy of data in each node of a cluster. Therefore, the amount of data in a dimension table should not be too large. We recommend storing no more than 20,000 rows of data in each dimension table.

Note: If the dimension table is too large, it will cause the expansion of data storage space. More nodes equal greater expansion. At the same time, it will cause performance degradation during real-time writes, and the IOPS will be relatively high.

Common Table: It is also known as a partition table or fact table, which generally stores the thematic data of a business. The amount of data that can be stored in a common table is usually relatively large, and a common table can store tens of millions or trillions of pieces of data. You can shard data by the level-1 partition and manage the lifecycle of data by the level-2 partition.

1.2 Select Appropriate Distribution Keys (Level-1 Partition Keys)

If you want to create a common table in AnalyticDB, you must use the DISTRIBUTED BY HASH(column_name,...) statement to specify distribution keys. Then, the table is partitioned by the hash value of the columns specified by column_name.

AnalyticDB allows you to use multiple fields as distribution keys.

You can specify distribution keys based on the following rules:

Select fields whose values are evenly distributed as the distribution keys, such as the transaction ID, device ID, user ID, or auto-increment columns.
Select fields that can be used to join tables as the distribution keys. For example, you can select the user_id field as the distribution key when you perform user profile analysis.

Note: Uneven distribution keys can lead to uneven data distribution, affecting the efficiency of writing and querying. In addition, it is prone to cause a full single-node disk and make the entire cluster locked and unavailable. In addition to special business scenarios, we preferentially consider whether data is uniform when creating a table and the issue of JOIN KEY alignment.

1.3 Select Appropriate Partition Keys (Level-2 Partition Keys)

You need to consider creating a level-2 partition table for tables with a large amount of data to split the data further. After setting the level-2 partition, it can also bring two benefits:

1) Perform Lifecycle Management on Data: For example, after a certain number of level-2 partitions are set, expired level-2 partitions will be eliminated automatically.

2) When the query condition includes a field related to the level-2 partition, the level-2 partition can be clipped to improve the query performance.

Directly use the value of ds for partition – PARTITION BY VALUE(ds)
Use the day for partition after ds conversion – PARTITION BY VALUE (DATE_FORMAT(ds, '%Y%m%d'))
Use the month for partition after ds conversion PARTITION BY VALUE (DATE_FORMAT(ds, '%Y%m'))
Use the year for partition after ds conversion – PARTITION BY VALUE (DATE_FORMAT(ds, '%Y'))

Notes on the Level-2 Partition:

Please plan the level-2 partition keys of all tables in instances in advance. Make full use of the level-2 partition. Do not let the amount of data of each level-2 partition be too small. If you perform level-2 partition by day and the amount of data per day is very small, you can consider performing level-2 partition by month. If the amount of data in the level-2 partition is too small, there will be a lot of metadata in the database that needs to save the partition data. The metadata is stored in the memory, and too much metadata will occupy much memory space, resulting in GC or OOM in the system and high IOPS of real-time writing.

Recommendations for the amount of data of the level-2 partition:

1.4 Select Appropriate Primary Keys

Defining a primary key in a table can realize data deduplication (REPLACE INTO) and data update (DELETE and UPDATE). Operations (such as DELETE and UPDATE) are supported only for tables for which a primary key is specified.

You can specify a primary key based on the following rules:

If you select a single field of the numeric type as the primary key, the performance of the table is better.
If the numeric single primary key cannot meet your business needs, you can use a string or a combination of multiple fields as the primary key.
The primary key must contain distribution keys and partition keys. If a level-2 partition key is defined in the table, the primary key must contain a level-2 partition key.

Note: The field used as the primary key should not be too large, and the length of the field should not be too long. Otherwise, the write performance will be affected.

1.5 Select Appropriate Clustered Indexes

A clustered index sorts one or more fields to ensure data with the same or similar fields are stored in the same or similar places on the disk. When the fields in the clustered index are used as query conditions, the query results remain in the same places on the disk, which can reduce the I/O of the disk.

You can specify clustered indexes based on the following rule:

The fields of the filter conditions that the query must carry can be designed as clustered indexes. For example, on a data analysis platform for e-commerce sellers, they can only view their data. A seller ID can be defined as the clustered index to ensure data locality and improve data query performance.

Note: Currently, only one clustered index is supported, but one clustered index can contain multiple columns. Currently, clustered indexes are of little help to performance unless they perform point query on very scattered data.

1.6 Design Appropriate Data Types

We recommend using numeric data when possible but using fewer string data.

AnalyticDB performs far better in processing numeric data than in processing string data. Here are the reasons:

Numeric data has a fixed length and requires less memory and storage space.
Numeric data is faster for computing, especially in data association scenarios.
From the internal index mechanism, the string data is suitable for equivalent queries and range queries, while time data and numeric data have better performance.
Select a field length as small as possible, such as Boolean or Byte type. Int type can be used when the data length is not very long.
The same field is designed in the same business model to have the same data type and field length, and the field naming is also consistent. In particular, you should pay more attention to fields that involve primary and foreign key associations to avoid implicit conversion caused by field associations of different data types.

Suggestions for processing common string data:

The string data contains character prefixes or suffixes, such as E12345 and E12346. It is recommended to remove the prefix or map the prefix to a number.
The field only has a few values, such as a country name. It is recommended to code each country, and each country corresponds to a unique number.
Avoid using the Varchar character type for storage for data like time or date. Try using the Date, Timestamp, or Int type.
We recommend using the Double type for storage for longitude or latitude data in the geography.

If you do not know the data distribution characteristics of your business before creating a table, you can use optimization suggestions to optimize the data after importing the data. Please visit the table creation diagnosis page of the console for AnalyticDB to view related problems and optimization suggestions.

2. Best Practices for Data Writing

2.1 Real-Time Writing

2.1.1 Submit in Batch Packaging

When writing data to a table, you can execute INSERT INTO and REPLACE INTO statements to write data in batch packaging mode to improve data write performance. Here are the considerations:

The number of rows of data written by each INSERT or REPLACE statement can be greater than 1,000. However, the total amount of data to be written cannot exceed 16 MB.
When data is written in batch packaging mode, the write latency of a single batch is relatively high, but the overall performance is improved.
If an error occurs, you need to retry operations to make sure data is written. The retry operations may cause repeatedly written data. Then, you can remove the duplicate rows based on the primary key of the table.
If you do not need to modify the original data, you can use INSERT INTO to write data, which is more than three times more efficient than REPLACE INTO.

For example:

INSERT INTO test

    (id, name,sex,age,login_time)

VALUES

  (1,'dcs',0,23,'2018-03-02 10:00:00'),

  (2,'hl',0,23,'2018-03-02 10:01:00'),

  (3,'xx',0,23,'2018-03-02 10:02:00')

......;

2.1.2 Update Data

There are many ways to update data. The differences are listed below:

High-frequency primary key-based row-level overwrite updates. If applications can fill all columns, use REPLACE INTO VALUES for the batch package.
High-frequency primary key-based row-level overwrite updates. If applications cannot fill all columns, use INSERT ON DUPLICATE KEY UPDATE for the batch package.
Use the UPDATE SET WHERE for any condition update at low frequency.

Note: UPDATE needs to look up the table to fill in the missing old values in the update, so it queries one more time than the REPLACE INTO. The performance is lower. Therefore, we do not recommend performing high-frequency and large-scale UPDATE operations. If the performance of the online UPDATE cannot meet the requirements, you need to consider replacing it with the REPLACE INTO and fill in the old values by the application end.

2.1.3 Delete Data

There are many ways to delete data. The differences are listed below:

If you want to delete data under the low-frequency primary key condition, use the DELETE FROM WHERE primary key = xxx.
If you want to delete any low-frequency condition, use DELETE FROM WHERE.
If you delete a single level-2 partition, use TRUNCATE PARTITION.
If you delete a single table (including all level-2 partitions), use TRUNCATE TABLE or DROP TABLE.

2.2 Batch Import

2.2.1 How to Choose Batch or Real-Time Import

We recommend using INSERT OVERWRITE SELECT for batch importing and importing AnalyticDB from ODPS and OSS. There are two reasons. First, batch import is suitable for the large amount of data imports and has good performance. Second, batch import is suitable for data warehouse semantics. The old data can be checked during import, and the new data can be switched with one click after the import is completed. If the import fails, the new data will be rolled back without affecting the query of the old data.
For importing AnalyticDB from ApsaraDB RDS, MySQL, and AnalyticDB, according to the amount of data, if the amount is small (millions of tables), we recommend using INSERT INTO SELECT for real-time import. If the amount is large, we recommend using INSERT OVERWRITE SELECT for batch import.
INSERT OVERWRITE SELECT or INSERT INTO SELECT operations cannot be used for the same table at the same time. Otherwise, the data will be overwritten.

2.2.2 Import Concurrency and Resource Description

The import of a single table will be queued serially within the system, while the import of multiple tables will generate several parallel import tasks. The degree of parallelism can be adjusted, and the default degree of parallelism is 2. Considering resource control, tasks exceeding the degree of parallelism will also be queued.
Data import, like queries, consumes the computing resources of AnalyticDB instances. Therefore, we recommend importing data when the QPS is low and using scheduled tasks to avoid peak periods.

3. Best Practices for Efficient Queries

The benefit of AnalyticDB is that it can achieve real-time online analysis when faced with complex queries in massive data scenarios. The query tuning of AnalyticDB is compatible with the general methods of database query optimization and provides some special optimization methods to make full use of its performance in distributed computing.

3.1 General Rules for Query Optimization

According to the article written by Ye Zhengsheng in ORACLE DBA Notes in the early years, data access optimization conforms to the following funnel rule:

3.1.1 Reduce Data Access (Reduce Disk Access)

Use filter conditions as much as possible to filter data in advance, thus reducing the amount of data involved in calculations. For example, filter data that can be filtered in advance in subqueries.

3.1.2 Return Less Data (Reduce Network Transmission or Disk Access)

In OLAP databases, since the number of columns in a table is often large and it is based on column storage or column-row mixed storage, SELECT * operations will lead to more request I/O. Therefore, avoid SELECT * queries as much as possible.

3.1.3 Reduce the Number of Interactions (Reduce Network Transmission)

We recommend using the batch import mentioned above to reduce the number of interactions.

3.1.4 Reduce Server CPU Overhead (Reduce CPU and Memory Overhead)

Reduce unnecessary sorting and paging, especially the sorting in subqueries.
Reduce COUNT DISTINCT operations as much as possible while meeting the business requirements.
Under the premise of satisfying the business, especially with large amounts of data, approximate calculations similar to Hyperloglog are used to replace accurate calculations.

3.1.5 Utilize More Resources (Increase Resources)

When designing a table, we recommend avoiding partition skew and placing storage and computing on a certain node. We recommend hashing data evenly on all nodes as much as possible to make full use of the capabilities of all machines and maximize the performance of distributed databases.
AnalyticDB is a typical system for large-scale MPP parallel processing. It has done a lot of optimization at the kernel level and can make full use of more resources.

3.2 Optimization for Special Scenarios of AnalyticDB

3.2.1 Best Practices for Queries of External Tables

We do not recommend using external tables for complex calculations. External table computing pulls all data. Therefore, complex external table computing can cause severe GC and add more pressure on network bandwidth.
External tables do not support DML operations, such as DELETE, UPDATE, and TRUNCATED. Perform DML operations in the source table to modify external table data.

3.2.2 Reasonable Use of Indexes

Rational use of indexes is a very important method for database tuning, and AnalyticDB is no exception. Indexes are created for each column by default in AnalyticDB, but there are exceptions. If the value of cardinality in a column is low and the selectivity of indexes is low, the performance of index queries may be even worse. We recommend disabling the feature of automatic index creation when creating a table during this time. If the table has been created, you can use the following SQL statement to delete the index or use hints to bypass the index.

ALTER TABLE table_name DROP FULLTEXT INDEX index_name;
--Method 1: Delete the index of an enumerated column
/+no_index_columns=[t_order_content.fdelete;fdbid]/
--Method 2: Use hints to make the query bypass the index

3.2.3 Clever Use of Clustered Indexes

When a query condition must contain a column, especially when the data in the column is stored very scattered, creating a clustered index for the column will significantly improve the performance. You can use an SQL statement similar to the following one to add a clustered index:

ALTER TABLE table_name ADD CLUSTERED INDEX index_cls (d_fdbid);

Note: If there has been data in the table, the direct ADD CLUSTER INDEX will not sort the existing data. You need to rebuild the table and add the clustered column keyword when creating the table or perform a build operation on the table after adding the clustered index: build table table_name force=true.

3.2.4 Reduce Data Interaction between Nodes

Distributed databases give full play to the advantages of distributed computing but can increase the network overhead across nodes sometimes. In particular, when the amount of data requested is small, but the data is scattered in more nodes, the cross-network overhead is high. This article provides the following two ideas:

Join within local nodes as much as possible and make full use of the local Join feature to reduce cross-network access. The specific method is to use the level-1 partition key association.
Perform aggregation analysis within local nodes as much as possible to reduce the amount of data used to access shuffle across networks. The specific method is to perform GROUP BY on the level-1 partition key as far as possible.

4. Best Practices for AnalyticDB Connections

In terms of usage, the compatibility of AnalyticDB with MySQL is more than 99%. AnalyticDB supports multiple connection methods, including MySQL command lines, JDBC connection, Python connection, C# connection, and PHP connection.

Please refer to the official document for more detailed usage.

After considering the length of the article, we divided this article into two parts. Part 2 focuses on the Best Practices in Business Industry Online. Part 2 will help you understand the data warehouse AnalyticDB through different usage scenarios and discuss popular FAQs.

Community

Senior Technical Experts from Alibaba Discuss Data Warehouse Tuning – Part 1