Data Lake Formation (DLF) provides a metadata export tool to help you export DLF metadata to Hive metastores.
Prerequisites
An E-MapReduce (EMR) cluster is created. The ApsaraDB RDS database that you specify in the export tool must be the metadatabase configured for the Metastore service in the EMR cluster. The synchronization task runs as a Spark job in the EMR cluster and uses the Metastore service of the cluster to export the metadata.
The locations of all metadata are in Object Storage Service (OSS). If the locations of all metadata are in Apsara File Storage for HDFS (HDFS), the HDFS namespace for the locations of the databases and tables must be the same as that of the EMR cluster. Otherwise, an error occurs during metadata export. This means that cross-cluster metadata export is not supported if the locations of all metadata are in HDFS.
The ApsaraDB RDS database specified in the export tool contains metadata tables. For more information about how to initialize ApsaraDB RDS metadata, see Configure a self-managed ApsaraDB RDS for MySQL database.
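You can quickly verify this prerequisite by checking whether Hive metastore tables, such as DBS and TBLS, exist in the ApsaraDB RDS database. The following sketch uses the MySQL client with the sample host, user, and schema from the configuration file in this topic; replace these values with your own settings.
mysql -h emr-header-1 -P 3306 -u root -p -e "SHOW TABLES FROM hivemeta LIKE 'DBS';"
If the DBS table is returned, the metastore schema has been initialized.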
Procedure
1. Prepare a configuration file
Prepare a YAML file based on your environment and upload the file to your OSS path, for example by using ossutil as shown after the sample file. The following is an example of the file.
!!com.aliyun.dlf.migrator.app.config.MigratorConfig
clientInfo:
  accessKeyId: <Your AccessKey ID.>
  accessKeySecret: <Your AccessKey secret.>
  endPoint: dlf-vpc.cn-hangzhou.aliyuncs.com
  regionId: cn-hangzhou
  catalogId: <Your catalog ID. The default value is the user ID of your Alibaba Cloud account.>
mysql:
  connectionUri: jdbc:mysql://emr-header-1:3306/hivemeta
  driver: com.mysql.cj.jdbc.Driver
  userName: root
  password: xxxx
runOptions:
  batchSize: 100
  lowerCaseTableNames: false
  schema: hivemeta
  records: oss://xxxx/migrator/validate/log/
  objectTypes:
    - database
    - table
    - partition
    - function
  operations:
    - validate
  fixIfInConsistence: true
  fixMode: to_hive
  validateDatabases: [db1,db2]
  excludeTables: [aa,bb]
  excludeTablePrefixes: [xx,yy]
  ignoreValidateCreateTime: true
  skipFixTime: 1
  ignoreDropOperation: false
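For example, if you use the ossutil tool, you can upload the configuration file to your OSS path by running the following command. The bucket and object path are placeholders; replace them with your own OSS path.
ossutil cp migrator_config_validate.yml oss://xxxx/migrator_config_validate.yml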
Descriptions of configuration items
Client-side DLF SDK configurations
Parameter | Required | Description |
accessKeyId | Yes | The AccessKey ID of your Alibaba Cloud account. |
accessKeySecret | Yes | The AccessKey secret of your Alibaba Cloud account. |
endPoint | Yes | The endpoint of DLF. The value is dlf-vpc.[region-id].aliyuncs.com. |
catalogId | Yes | The ID of the DLF data catalog. |
Your Alibaba Cloud account must have read and write permissions, such as ListDatabase and ListTables, on all DLF metadatabases and tables. You can log on to the DLF console and configure the required permissions on the Data Permissions page.
Parameters related to a MySQL metadatabase
Parameter | Required | Description |
connectionUri | Yes | The JDBC URL that is used to connect to the MySQL metadatabase. |
driver | Yes | The MySQL driver. You do not need to modify the setting of this parameter in most cases. |
userName | Yes | The username that is used to access the metadatabase. |
password | Yes | The password that is used to access the metadatabase. |
runOptions parameters
Parameter | Required | Description |
schema | Yes | The name of the Hive metadatabase. Note: The database name in the connectionUri value for the MySQL metadatabase must be the same as the database name specified by the schema parameter. If you modify one of these settings, you must also modify the other. |
batchSize | Yes | The batch size for DLF SDK calls. The maximum value is 500. If the batch size is too large, calls may time out. If it is too small, calls are inefficient. In most cases, 100 is a suitable value. |
lowerCaseTableNames | Yes | Specifies whether the names of the tables in your ApsaraDB RDS metadatabase are in lowercase letters. The value false indicates that the table names are in uppercase letters. |
records | Yes | The path to store the running result logs of the tool. The logs contain processing records and error information. |
objectTypes | Yes | The types of objects to be exported, which can be database, function, table, and partition. |
operations | Yes | Set the value to validate. |
fixIfInConsistence | Yes | Set the value to true. |
fixMode | Yes | The value to_hive indicates that Hive metadata is compared based on DLF metadata. The value to_dlf indicates that DLF metadata is compared based on Hive metadata. |
validateDatabases | No | The names of databases to compare. Only the specified databases are exported. |
excludeTables | No | The names of tables to be excluded from the comparison. |
excludeTablePrefixes | No | The prefixes of table names. Tables whose names start with the specified prefixes are excluded from the comparison. |
compareTotalNumber | No | Specifies whether to return the final result of comparing the amounts of DLF metadata and ApsaraDB RDS metadata. Default value: false. |
ignoreValidateCreateTime | No | Specifies whether to ignore table creation time during comparison. |
skipFixTime (unit: minutes) | No | Only metadata created earlier than the specified number of minutes before the current time is compared. Metadata created within this period is skipped. Default value: 240. |
ignoreDropOperation | No | Set this parameter to true if you do not want to delete the metadata after it is exported. |
locationMappings | No | Specifies whether an object supports location mapping. You need to set this parameter if you want to change the location of an object such as a database, table, or partition. |
source | No | The source path for replacement. We recommend that you end the path with a forward slash (/). |
target | No | The destination path for replacement. |
hiveConfPath | No | The path of the Hive configuration file in the cluster. Default value: /etc/ecm/hive-conf/hive-site.xml. |
kerberosInfo | No | The configuration that you must add to runOptions if the task runs in a cluster with Kerberos authentication enabled. |
2. Prepare to run the Spark job
Log on to the header node of the EMR cluster.
Obtain the JAR package of the export tool.
wget https://dlf-lib.oss-cn-hangzhou.aliyuncs.com/migrator_jar/application-1.1.jar
Write an execution script for the Spark job.
vim /root/migrator_validate.sh
Enter the following content:
#!/bin/bash
source /etc/profile
spark-submit --master yarn --deploy-mode client --driver-memory 12G --executor-memory 8G --executor-cores 2 --num-executors 4 --conf spark.sql.shuffle.partitions=200 --conf spark.sql.adaptive.enabled=false --class com.aliyun.dlf.migrator.app.MigratorApplication application-1.1.jar oss://xxxx/migrator_config_validate.yml
The path of the configuration file at the end of the code block is the OSS path to which you uploaded the configuration file in the section "1. Prepare a configuration file" of this topic.
Grant execute permissions on the script.
chmod +x /root/migrator_validate.sh
3. Run the job
Manual execution:
Run the preceding script directly.
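For example, run the script that you created in the previous step; optionally redirect its output to a log file, as the scheduled command below does:
sh /root/migrator_validate.sh > /root/validate.txt 2>&1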
Scheduled execution:
crontab -e
0 22 * * * nohup sh /root/migrator_validate.sh > /root/validate.txt 2>&1 &
In this example, the cron job is configured to run at 22:00 every day.
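After the cron job is configured, you can confirm the schedule and monitor the log file written by the scheduled command:
crontab -l
tail -f /root/validate.txt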