AnalyticDB: Import data from a Hive data source

Last Updated: Aug 20, 2024

AnalyticDB for MySQL allows you to migrate data from a Hive data source to an Object Storage Service (OSS) bucket. This topic describes how to use the data migration feature of AnalyticDB for MySQL to migrate data from a Hive data source to an OSS bucket, analyze data, and manage the data migration job.

Overview

AnalyticDB for MySQL allows you to migrate data from a Hive data source to an OSS bucket. You can migrate both the metadata and the data of the data source to OSS, and you can concurrently migrate multiple databases and tables.

Prerequisites

  • An AnalyticDB for MySQL Data Lakehouse Edition cluster is created.

  • A job resource group is created for the AnalyticDB for MySQL cluster. For more information, see Create a resource group.

  • A database account is created for the AnalyticDB for MySQL cluster.

  • One of the following clusters is created in the same region as the AnalyticDB for MySQL cluster:

    • An E-MapReduce (EMR) cluster that has Business Scenario set to Data Lake, Metadata set to Self-managed RDS or Built-in MySQL, is configured with the Hive service, and has Hive Storage Mode set to HDFS (the Data Lake Storage check box cleared). For information about how to create an EMR cluster, see Create a cluster.

      Important
      • The data of a Hive data source in an EMR cluster that has Metadata set to DLF Unified Metadata cannot be migrated to an OSS bucket.

      • The data of a Hive data source in an EMR cluster that has Hive Storage Mode set to Data Lake Storage is stored in OSS. In this case, you can use metadata discovery to import the data to AnalyticDB for MySQL. For more information, see Use metadata discovery to import data to Data Lakehouse Edition.

    • A Cloudera's Distribution Including Apache Hadoop (CDH) cluster that is deployed on an Elastic Compute Service (ECS) instance.

Billing rules

If you use the data migration feature of AnalyticDB for MySQL to migrate data to OSS, fees are incurred by both AnalyticDB for MySQL (for the resources that run the migration job) and OSS (for the storage and requests that are used).

Procedure

Create a Hive data source

Note

If you want to migrate data from an existing Hive data source, skip this step and create a data migration job. For more information, see the "Create a data migration job" section of this topic.

  1. Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. On the Data Lakehouse Edition tab, find the cluster that you want to manage and click the cluster ID.

  2. In the left-side navigation pane, choose Data Ingestion > Data Sources.

  3. In the upper-right corner of the page, click Create Data Source.

  4. On the Create Data Source page, configure the parameters that are described in the following table.

    EMR cluster

    • Data Source Type: the type of the data source. Select Hive.

    • Data Source Name: the name of the data source. By default, the system generates a name based on the data source type and the current time. You can modify the name based on your business requirements.

    • Data Source Description: the description of the data source. For example, you can enter the use case and business limits.

    • Deployment Mode: the deployment mode of the EMR cluster. Select Alibaba Cloud Instance.

    • Instance: the EMR cluster to which the Hive data source belongs.

    • Hive Metastore URI: the Uniform Resource Identifier (URI) of the Hive metastore. Specify this parameter in the thrift://<IP address of the master node>:<Port number> format. The default port number is 9083.

      To view the private IP address of the master node, log on to the EMR console and click EMR on ECS in the left-side navigation pane. Click the ID of the EMR cluster that you want to manage, click the Nodes tab, and then click the + icon to the left of the emr-master node.

    ECS-based CDH cluster

    • Data Source Type: the type of the data source. Select Hive.

    • Data Source Name: the name of the data source. By default, the system generates a name based on the data source type and the current time. You can modify the name based on your business requirements.

    • Data Source Description: the description of the data source. For example, you can enter the use case and business limits.

    • Deployment Mode: the deployment mode of the CDH cluster. Select ECS-based CDH.

    • Instance: the ECS instance on which the CDH cluster is deployed.

    • Hive Metastore URI: the URI of the Hive metastore. In this case, use the public IP address of the ECS instance on which the CDH cluster is deployed. Specify this parameter in the thrift://<IP address of the ECS instance>:<Port number> format. The default port number is 9083. A minimal reachability check is sketched after this parameter list.

    • Host Configuration Information: the mappings between the IP addresses and host names of the CDH cluster nodes. Specify one mapping per line in the <IP address> <host name> format. Examples:

      192.168.2.153 master.cdh
      192.168.2.154 node1.cdh
      192.168.2.155 node2.cdh
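    Before you create the data source, you may want to confirm that the Hive Metastore URI that you plan to enter points to an endpoint that is reachable over the network. The following Python sketch is only an illustration: the thrift:// URI is a placeholder, the helper name is hypothetical, and the check verifies only TCP reachability of the host and port, not Thrift authentication or metastore health.

      import socket
      from urllib.parse import urlparse

      def metastore_endpoint_reachable(uri: str, timeout: float = 5.0) -> bool:
          """Return True if the host and port in a thrift:// URI accept TCP connections."""
          parsed = urlparse(uri)  # for example, thrift://192.168.0.10:9083
          host = parsed.hostname
          port = parsed.port or 9083  # 9083 is the default Hive metastore port
          try:
              with socket.create_connection((host, port), timeout=timeout):
                  return True
          except OSError:
              return False

      # Placeholder URI: replace it with the URI of your Hive metastore.
      print(metastore_endpoint_reachable("thrift://192.168.0.10:9083"))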

  5. Click Create.

Create a data migration job

  1. In the left-side navigation pane, click Data Migration.

  2. In the upper-right corner, click Create Migration Job.

  3. On the Hive Data Source tab of the Create Migration Job page, configure the parameters in the Source and Destination Settings, Database/Table Migration Settings, and Migration Settings sections.

    Parameters in the Source and Destination Settings section

    • Job Name: the name of the data migration job. By default, the system generates a name based on the data source type and the current time. You can modify the name based on your business requirements.

    • Data Source: the data source from which to migrate data. You can select an existing Hive data source or create a data source.

    • Destination Type: the data storage type in AnalyticDB for MySQL. Only Data Lake - OSS Storage is supported.

    • OSS Path: the OSS path in which the data of the AnalyticDB for MySQL cluster is stored.

      Important
      • All buckets that reside in the same region as the AnalyticDB for MySQL cluster are displayed. Configure this parameter based on your business requirements. After you configure this parameter, it cannot be modified.

      • We recommend that you select an empty directory that does not have a nested relationship with the directory of another data migration job. This prevents historical data from being overwritten. For example, if the OSS paths of two data migration jobs are oss://adb_demo/test/sls1/ and oss://adb_demo/test/, data is overwritten during migration because one path is nested within the other. A minimal sketch of this nesting check follows this parameter list.
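    To make the nesting rule above concrete, the following Python sketch treats two OSS paths as conflicting if, after a trailing slash is normalized, one path is a prefix of the other. The helper name is hypothetical and the paths are the example paths from this section; this is only an illustration of the rule, not part of the migration feature.

      def oss_paths_are_nested(path_a: str, path_b: str) -> bool:
          """Return True if one OSS path is a parent directory (prefix) of the other."""
          a = path_a.rstrip("/") + "/"
          b = path_b.rstrip("/") + "/"
          return a.startswith(b) or b.startswith(a)

      # The example paths from this section are nested, so their jobs would overwrite each other.
      print(oss_paths_are_nested("oss://adb_demo/test/sls1/", "oss://adb_demo/test/"))   # True
      # Sibling directories are safe to use for separate migration jobs.
      print(oss_paths_are_nested("oss://adb_demo/test/sls1/", "oss://adb_demo/test2/"))  # False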

    Parameters in the Database/Table Migration Settings section

    Important

    If a database name or table name is added to both the whitelist and the blacklist, the blacklist prevails and the system does not migrate the database or table.

    • Database/Table Migration Whitelist: the names of the databases and tables that you want to migrate. Specify the names as regular expressions. Separate multiple regular expressions with commas (,). A sketch of how these expressions filter databases and tables follows this parameter list.

    • Database/Table Migration Blacklist: the names of the databases and tables that you do not want to migrate. Specify the names as regular expressions. Separate multiple regular expressions with commas (,).
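    The following Python sketch illustrates how comma-separated regular expressions can select databases and tables, with the blacklist taking precedence as described above. The pattern strings, table names, and full-match semantics are assumptions made only for illustration; the exact matching behavior is determined by the data migration job itself.

      import re

      def matches_any(name: str, patterns: str) -> bool:
          """Return True if the name fully matches any of the comma-separated regular expressions."""
          return any(re.fullmatch(p.strip(), name) for p in patterns.split(",") if p.strip())

      def should_migrate(name: str, whitelist: str, blacklist: str) -> bool:
          # The blacklist prevails: a name that matches both lists is not migrated.
          if matches_any(name, blacklist):
              return False
          return matches_any(name, whitelist)

      # Hypothetical patterns and table names.
      whitelist = r"sales_db\..*, log_db\.events_.*"
      blacklist = r"sales_db\.tmp_.*"
      for table in ["sales_db.orders", "sales_db.tmp_orders", "log_db.events_2024"]:
          print(table, should_migrate(table, whitelist, blacklist))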

    Parameters in the Migration Settings section

    • Handling Same Named Destination Table: the way in which to handle a table that has the same name as an existing table in the destination. Valid values:

      • Skip Migration: skips the table and continues to migrate other tables.

      • Report Error and Abort Migration: pauses the entire data migration job.

    • Job Resource Group: the job resource group that runs the data migration job.

    • Required ACUs: the number of ACUs that are required for the job resource group to run the data migration job. Valid values: 4 to the maximum number of computing resources that are available in the job resource group. To improve the stability and performance of data migration, we recommend that you specify a relatively large number of ACUs.

    • Parallel Tasks: the number of tasks that run in parallel. Valid values: 1 to 8. Default value: 1.

      Each task migrates one table and requires at least 4 ACUs. If the available ACUs are not sufficient for all tasks to run at the same time, the remaining tasks run in sequence. A sketch of this relationship follows this parameter list.

    • Advanced Settings: the custom settings for the data migration job. If you want to configure custom settings, contact technical support.
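    As a rough illustration of the relationship between Required ACUs and Parallel Tasks described above, the following Python sketch bounds the number of tasks that can run at the same time by both values. The 4-ACU-per-task minimum comes from this section, but the simple division is an assumption made for illustration, not an exact model of the scheduler.

      def max_concurrent_tasks(required_acus: int, parallel_tasks: int, acus_per_task: int = 4) -> int:
          """Upper bound on the number of migration tasks that can run at once."""
          return min(parallel_tasks, required_acus // acus_per_task)

      print(max_concurrent_tasks(required_acus=8, parallel_tasks=4))   # 2: only two 4-ACU tasks fit into 8 ACUs
      print(max_concurrent_tasks(required_acus=32, parallel_tasks=8))  # 8: enough ACUs for all parallel tasks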

  4. Click Submit.

Start the data migration job

  1. On the Data Migration page, find the data migration job that you created and click Start in the Actions column.

  2. In the upper-right corner, click Search. If the status of the job changes to Starting, the data migration job is started.

Analyze data

After the data migration job is completed, you can use metadata discovery to import data from OSS to Data Lakehouse Edition. Then, you can use Spark JAR Development to analyze the data that is imported to Data Lakehouse Edition. For more information about Spark development, see Spark editor and Overview.

    1. In the left-side navigation pane, choose Job Development > Spark JAR Development.

    2. Enter SQL statements in the default template and click Run Now.

      -- Here is just an example of SparkSQL. Modify the content and run your spark program.
      
      conf spark.driver.resourceSpec=medium;
      conf spark.executor.instances=2;
      conf spark.executor.resourceSpec=medium;
      conf spark.app.name=Spark SQL Test;
      conf spark.adb.connectors=oss;
      
      -- Here are your sql statements
      show tables from lakehouse20220413156_adbTest;
    3. (Optional) On the Applications tab, find an application and click Log in the Actions column to view the Spark SQL running log of the application.

Manage the data migration job

On the Data Migration page, you can perform the following operations on the data migration job in the Actions column:

  • Start: starts the data migration job.

  • View Details: shows the details of the data migration job, such as the data migration job settings and the number of migrated tables.

  • Edit: modifies the configuration properties of the data migration job.

  • Pause: pauses the data migration job.

  • Delete: deletes the data migration job.