Use external tables to import data
AnalyticDB for MySQL allows you to access different data sources and create external tables to map external data sources. This way, you can concurrently read external data and import the data to AnalyticDB for MySQL. This data import method maximizes the use of cluster resources to improve import performance.
Basic characteristics
This method is suitable for importing a large amount of data in a single job.
A data import job consumes large amounts of cluster resources to ensure high performance. We recommend that you use this method during off-peak hours.
The imported data is invisible when the data import job is in progress, and becomes visible only after the job is complete.
If you import a partition by using this method, data of the existing partition that uses the same name is overwritten.
When a data import job is completed by using an external table, table indexes are created to improve query performance.
Common scenarios
Data initialization in data warehouses
If you want to initialize and import terabytes of data to AnalyticDB for MySQL for analysis, we recommend that you store the data in OSS or HDFS and then use external tables to import the data.
Accelerated analysis of batch data
Batch data is stored in batch data warehouses such as MaxCompute. Each day, gigabytes or terabytes of data are imported from batch data warehouses to AnalyticDB for MySQL for accelerated analysis.
Methods
You can import data by using the regular import or elastic import method. By default, the regular import method is used. If you use the regular import method, data is read from compute nodes and indexes are created on storage nodes. The regular import method consumes computing and storage resources. If you use the elastic import method, data is read and indexes are created for Serverless Spark jobs. The elastic import method consumes resources of job resource groups and generates fees. The following table describes the comparison between the regular import method and the elastic import method.
Comparison between the regular import method and the elastic import method
Item | Regular import | Elastic import |
Supported minor versions | No limits. | V3.1.10.0 and later. |
Use scenarios | | You want to consume fewer storage node resources to accelerate data import. An import job involves multiple tables. You want to import a large amount of data to partitioned tables.
|
Supported data sources | | |
How to enable | By default, the regular import method is enabled. | By default, the elastic import method is enabled for AnalyticDB for MySQL clusters of V3.1.10 and later. |
Limits | No limits. | You can use elastic import to import data to AnalyticDB for MySQL Data Lakehouse Edition (V3.0) based on external tables by using only the INSERT OVERWRITE INTO statement. Take note of the following limits for SQL operations: You can use elastic import to import data from a single table, but not multiple joined tables. Elastic import supports only the WHERE and LIMIT clauses. Other syntax such as ORDER BY is not supported. When you run an elastic import job, you can use asterisks (*), column names, default values, or constants for the column and value fields in the SELECT statement. Other syntax such as SQL functions is not supported.
You can use elastic import to import data only to fact tables. Elastic import supports the following data types: BOOLEAN , TINYINT , SMALLINT , INT , BIGINT , FLOAT , DOUBLE , DECIMAL , VARCHAR , DATE , TIME , DATETIME , and TIMESTAMP .
|
Concurrent import jobs | The regular import method supports up to two concurrent jobs. | The elastic import method supports up to 32 concurrent jobs. Note To improve import concurrency, you can specify the adb.load.job.max.acu parameter to increase the maximum amount of resources that can be used by an elastic import job. The default value of the adb.load.job.max.acu parameter is the number of shards plus 1. You can set the adb.load.job.max.acu parameter to K × default value (K ≥ 1). We recommend that you set the maximum value of K to a value that is less than or equal to the number of partitions of the partitioned table in the current import job. Use the following methods to query the numbers of shards and partitions: Execute the following statement to query the number of shards in an AnalyticDB for MySQL cluster:
SELECT count(1) FROM information_schema.kepler_meta_shards;
For information about how to view the number of partitions, see the "View the data size of a table" section of the Storage analysis topic.
|
Consumed resources | The regular import method reads source data from interactive resource groups (resident compute nodes) and creates indexes on storage nodes. This process consumes a large amount of resources of interactive resource groups (resident compute nodes) and storage nodes. Note When you run a regular import job, the values of the storage node metrics, such as CPU utilization and I/O usage, are high. This affects the read and write performance. | The elastic import method reads source data from and creates indexes on job resource groups (dynamically scaled compute nodes). This process does not consume storage node resources. Only a small amount of storage node resources is consumed to download the data of the built hot partitions, including data formats and indexes, to the on-premises device. Note An elastic import job consumes a small amount of storage node resources, but does not consume the resources of interactive resource groups (resident compute nodes). On the monitoring page, the values of the storage node metrics, such as CPU utilization and I/O usage, are low and the transactions per second (TPS) metric is not displayed. |
Import speed | The import speed of regular import jobs varies based on the import configuration, import method, and table schema. For information about the data import optimization methods for different scenarios, see Optimize data import performance. | Sufficient resources of job resource groups ensure high concurrency and increased speed of elastic import jobs. You can run multiple elastic import jobs and increase the maximum amount of resources that can be used by an elastic import job to improve the import speed. |
Recommendations:
An elastic import job requires at least 2 minutes to 3 minutes to complete and is not suitable for small amounts of data. If an import job must be completed within 3 minutes, we recommend that you use the regular import method.
An elastic import job requires a longer period of time to complete compared with a regular import job that uses the same resources. If you want to accelerate an import job, we recommend that you increase the maximum amount of resources that can be used by the elastic import job.
Use DataWorks to import data
DataWorks provides a visualized data import method that allows you to import data from a variety of data sources to AnalyticDB for MySQL. Compared with external tables, DataWorks is more suitable when you want to import only a small amount of data.
Note
DataWorks To import several hundred gigabytes of data, we recommend that you use external tables instead of DataWorks. For more information, see the "Use external tables to import data" section of this topic.
Common scenarios
Data import every minute or hour
You want to import small amounts of data to AnalyticDB for MySQL every minute or hour for analysis.
Data import from multiple external data sources
You want to import data from multiple data sources such as Tablestore, Redis, and PostgreSQL to AnalyticDB for MySQL.
Methods
Optimization of data import performance
For information about how to improve the performance of the DataWorks-based data import method, see the "Optimize the performance when you use DataWorks to import data" section of the Optimize data import performance topic.
Use DTS to import data
Data Transmission Service (DTS) is a real-time data streaming service. DTS supports data transmission between data sources such as relational database management system (RDBMS) databases, NoSQL databases, and online analytical processing (OLAP) databases. DTS has the following advantages over traditional data migration and synchronization tools: diverse features, high performance, security, reliability, and ease of use. DTS helps simplify data transmission and allows you to focus on business development. You can use DTS to import data to AnalyticDB for MySQL from a variety of data sources for real-time analysis.
Common scenarios
Real-time data synchronization within seconds
You want to synchronize data in real time from RDS for MySQL or PolarDB for MySQL to AnalyticDB for MySQL for analysis.
Multi-source aggregation
You want to synchronize data from multiple RDS for MySQL instances or PolarDB for MySQL clusters to an AnalyticDB for MySQL cluster for analysis. You can use the multi-table merging feature of DTS to synchronize multiple source tables that use the same schema to a single table of an AnalyticDB for MySQL cluster.
Methods
Use a JDBC-based program to import data
If the method in which external tables or DataWorks is used to import data cannot meet your business requirements, you can use a JDBC-based program to import data in data cleansing and other complex scenarios that involve unstructured data.
Common scenarios
Import after data preprocessing
You want to parse and import the generated log files to AnalyticDB for MySQL in real time.
On-premises data import
You want to import on-premises data that fails to be uploaded to OSS, HDFS, or MaxCompute to AnalyticDB for MySQL.
Usage notes
Optimization of data import performance
For information about how to improve the performance of the data import method that uses a JDBC-based program, see the "Optimize the performance when you use a JDBC-based program to import data" section of the Optimize data import performance topic.
Use data synchronization to import data
AnalyticDB for MySQL provides a variety of data synchronization features, such as AnalyticDB Pipeline Service (APS), metadata discovery, and Hive data migration. The data synchronization features allow you to discover and organize OSS metadata, synchronize data in real time from Simple Log Service (SLS) and ApsaraMQ for Kafka to AnalyticDB for MySQL, and migrate Hive data to OSS for subsequent data analysis.
Common scenarios
Low-cost storage and analysis
If you want to store and analyze a large amount of logs such as SLS data and messages such as Kafka data in a cost-effective manner, you can use the APS feature to synchronize data in real time from SLS and Kafka to AnalyticDB for MySQL.
Metadata discovery
A large amount of OSS data is difficult to manage and analyze due to the absence of metadata. AnalyticDB for MySQL uses the metadata discovery feature to build metadata levels, such as databases, tables, and partitions, and identify data formats and field information to generate a table schema. The metadata built by using the metadata discovery feature allows data to be identified by other analysis engines for subsequent analysis.
Hive data migration
If you want AnalyticDB for MySQL to perform operations on the data stored in Hive, including metadata management and subsequent analysis, you can use the Hive data migration feature to migrate Hive data to OSS. Then, AnalyticDB for MySQL automatically organizes the metadata and performs subsequent analysis.
Methods