AnalyticDB for MySQL: Optimize data import performance

Updated: Jul 29, 2024

Although AnalyticDB for MySQL provides several data import methods to meet requirements in different scenarios, import performance can be affected by a variety of factors. For example, inappropriate table modeling may lead to data skew, and inappropriate configurations for data import may lead to ineffective resource utilization. This topic describes how to optimize the data import performance in different scenarios.

Optimize the performance when you use external tables to import data

Check the distribution key

The distribution key determines how table data is distributed across shards and therefore how the data is imported in parallel. If data is unevenly distributed, the shards that contain large amounts of data take longer to import, which causes data skew and slows down the entire data import job. To prevent these issues, make sure that data is evenly distributed across shards when you import data. For information about how to select a distribution key, see the "Select a distribution key" section of the Schema design topic.

Determine whether the distribution key is appropriate:

  • Before data is imported, determine whether the distribution key is appropriate based on the meaning of the distribution key. For example, assume that the l_discount column is selected as the distribution key in the Lineitem table and has only 11 distinct values. Data with the same value in the l_discount column is distributed to the same shard, which causes severe data skew issues and affects the data import performance. In this case, the l_orderkey column is a more appropriate distribution key. Data can be evenly distributed because the order ID values are all unique.

  • After data is imported, determine whether the distribution key is appropriate based on the diagnostics on distribution key reasonability. For information about how to view the distribution key diagnostics, see the "Diagnostics on distribution field skew" section of the Data modeling diagnostics topic.
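
For example, a minimal sketch of a Lineitem table that distributes data by l_orderkey instead of l_discount might look like the following statement. The column list is abridged, and the DDL follows the CREATE TABLE syntax described in the Schema design topic:

CREATE TABLE lineitem (
  l_orderkey BIGINT NOT NULL,
  l_discount DECIMAL(15, 2) NOT NULL,
  l_shipdate DATE NOT NULL
  -- Other Lineitem columns are omitted.
)
-- Order IDs are unique, so rows are spread evenly across all shards.
DISTRIBUTED BY HASH(l_orderkey);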

Check the partition key

If a partition is written to AnalyticDB for MySQL by using the INSERT OVERWRITE INTO SELECT statement, data in the existing partition that has the same name is overwritten. Data in each shard is imported into the corresponding partitions. To prevent external sorting, which reduces data import performance, we recommend that you do not import a large number of partitions at a time. For information about how to select a partition key, see the "Select a partition key" section of the Schema design topic.

Determine whether the partition key is appropriate:

  • Before data is imported, determine whether the partition key is appropriate based on the business data requirements and data distribution. For example, assume that the l_shipdate column is used to partition the Lineitem table, and the table data spans seven years. If data is partitioned by year, the table contains seven partitions. If data is partitioned by date, the table contains more than 2,000 partitions and each partition contains about 30 million rows of data. In this case, partitioning the table by month or year is more appropriate.

  • After data is imported, determine whether the partition key is appropriate based on the diagnostics on partition key reasonability. For information about how to view the partition key diagnostics, see the "Diagnostics on partition field reasonability" section of the Data modeling diagnostics topic.
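
For example, continuing the abridged Lineitem sketch above, monthly partitioning on l_shipdate can be declared as follows. The LIFECYCLE value of 84 is an assumption that retains seven years of monthly partitions:

CREATE TABLE lineitem (
  l_orderkey BIGINT NOT NULL,
  l_shipdate DATE NOT NULL
  -- Other Lineitem columns are omitted.
)
DISTRIBUTED BY HASH(l_orderkey)
-- One partition per month (for example, 202407): seven years of data
-- produces 84 partitions instead of more than 2,000 daily partitions.
PARTITION BY VALUE(DATE_FORMAT(l_shipdate, '%Y%m')) LIFECYCLE 84;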

Check the index

By default, AnalyticDB for MySQL creates indexes on all columns when a table is created, and creating indexes on all columns of a wide table consumes a large amount of resources. When you import data to a wide table, we recommend that you use a primary key index. The primary key index can be used for deduplication, but using a large number of primary key columns reduces performance. For information about how to select a primary key index, see the "Select a primary key" section of the Schema design topic.

Determine whether the index is appropriate:

  • In most cases, you do not need to specify a primary key index in batch import scenarios because data has been deduplicated during batch computing.

  • On the Monitoring Information page of the AnalyticDB for MySQL console, click the Table Information Statistics tab to view the amount of table data, index data, and primary key index data. If the amount of index data exceeds the amount of table data, check whether long character strings exist in columns. For such columns, indexes may take a long time to be created and occupy a large amount of storage space. We recommend that you delete such indexes. For more information, see the "Delete an index" section of the ALTER TABLE topic.

    Note

    Primary key indexes cannot be deleted. If you want to delete a primary key index, you must create another table.
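
For example, assume that the l_comment column of the Lineitem table stores long character strings and a hypothetical index named l_comment_idx was created on it. The following statement sketches how such an index can be deleted by using the syntax described in the ALTER TABLE topic:

-- l_comment_idx is a hypothetical index name. Query the actual index name
-- of the long-string column before you delete the index.
ALTER TABLE lineitem DROP KEY l_comment_idx;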

Add a hint to accelerate import

You can add the /*+ direct_batch_load=true*/ hint before a data import statement to accelerate the import job.

Note

This hint is supported only for AnalyticDB for MySQL Data Warehouse Edition (V3.0) clusters of V3.1.5 in elastic mode for Cluster Edition. If performance is not improved, submit a ticket.

Example:

SUBMIT JOB /*+ direct_batch_load=true*/ INSERT OVERWRITE adb_table
SELECT * FROM adb_external_table;

Use the elastic import feature to accelerate import

Note
  • You can use elastic import only for AnalyticDB for MySQL clusters of V3.1.10.0 and later.

  • You can use elastic import only for AnalyticDB for MySQL Data Lakehouse Edition (V3.0) clusters that have job resource groups.

  • You can use elastic import to import only MaxCompute data and CSV, Parquet, and ORC data that is stored in Object Storage Service (OSS).

  • When you use elastic import, make sure that job resource groups have sufficient resources to prevent long waiting time for jobs, long execution durations, and job failures.

AnalyticDB for MySQL allows you to run multiple elastic import jobs at the same time, and increase the maximum amount of resources for an elastic import job to accelerate the job. For more information, see Data import methods.

Example:

/*+ elastic_load=true, elastic_load_configs=[adb.load.resource.group.name=resource_group]*/
SUBMIT JOB INSERT OVERWRITE adb_table SELECT * FROM adb_external_table;

For more information, see the "Hint parameters" section of the Use external tables to import data to Data Lakehouse Edition topic.

Optimize the performance when you use DataWorks to import data

Optimize job configurations

  • Optimize the value of Data Records Per Write

    This parameter specifies the number of data records that are imported per batch. The default value is 2048. We recommend that you use the default value in most cases.

    However, if the size of a single data record reaches hundreds of KB, we recommend that you modify the value of this parameter. For example, if the size of a single data record is 512 KB, you can set the parameter to 16. This ensures that the amount of data that is imported per batch does not exceed 8 MB and prevents high memory usage of frontend nodes.

  • Optimize the channel control configurations

    • The data synchronization performance of a data import job is proportional to the value of Expected Maximum Concurrency. We recommend that you increase the value of this parameter.

      Important

      A greater value of Expected Maximum Concurrency occupies more DataWorks resources. Increase the value only to an appropriate level.

    • We recommend that you turn on Distributed Execution for better synchronization performance.

Common issues and solutions

  • When a small amount of data is imported from the client, the data can be processed by the database server in a timely manner, but the CPU and disk I/O resources are underutilized. As a result, the write response time and write transactions per second (TPS) may not meet your expectations.

    Solution: Increase the values of Data Records Per Write and Expected Maximum Concurrency to increase the amount of imported data. The data import performance linearly increases as the amount of imported data increases.

  • If the imported table has data skew issues, some cluster nodes are overloaded and import performance is reduced. In this case, the CPU and disk I/O resources are underutilized, and the write response time is prolonged. You can view the data skew diagnostic results of the imported table on the Data Modeling Diagnostics tab of the Diagnostics and Optimization page.

    Solution: Redesign the table schema and import the data again. For more information, see Table schema design.

Optimize the performance when you use a JDBC-based program to import data

Optimize the client

  • Batch data import from the client

    • When you use a Java Database Connectivity (JDBC)-based program to import data, we recommend that you batch import data to reduce network and connection overheads. Do not import a single record of data unless special requirements apply.

    • We recommend that you import 2,048 data records per batch. If the size of a single data record reaches hundreds of KB, we recommend that you import no more than 8 MB of data per batch. You can calculate the optimal number of data records per batch by dividing 8 MB by the size of a single data record. Importing too much data per batch may cause high memory usage on frontend nodes and reduce the import performance. For the batched write pattern, see the example after this list.

  • Concurrency configurations of the client

    • When you import data from the client, we recommend that you import data concurrently, because a single process cannot fully utilize system resources and the batching operations on the client slow down the import. You can accelerate data import by using multiple concurrent processes.

    • The appropriate degree of concurrency depends on the batch size, the data source, and the workload of the host where the client is located. We recommend that you determine the degree of concurrency by testing. For example, if the import performance does not meet your requirements, double the number of concurrent processes. If the import speed decreases, gradually reduce the number of concurrent processes.
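
The following sketch shows the batched write pattern that a JDBC-based program can send to AnalyticDB for MySQL. The table name adb_table and its columns are assumptions used for illustration:

-- One batch: a single multi-row INSERT statement sent over one connection.
-- Keep each batch at about 2,048 rows, or below 8 MB in total if rows are large.
INSERT INTO adb_table (id, col1, col2)
VALUES
  (1, 'value-1', 100),
  (2, 'value-2', 200),
  (3, 'value-3', 300);
-- Repeat with the next batch of rows, and run several such connections
-- concurrently to fully utilize system resources.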

Common issues and solutions

If the performance does not meet your requirements when you use a JDBC-based program to import data to AnalyticDB for MySQL, first check the client for performance bottlenecks.

  • Make sure that the data source can produce data at a high speed. If the data comes from other systems or files, check the client for output bottlenecks.

  • Make sure that data can be processed at a high speed. Check whether data is produced and consumed in sync, and make sure that a sufficient amount of data is waiting to be imported to AnalyticDB for MySQL.

  • While ensuring appropriate workloads on the host where the client is located, check the CPU utilization or disk I/O usage to determine whether system resources are sufficient.