DataWorks: Develop a MaxCompute MapReduce task

Last Updated: Sep 19, 2023

MaxCompute supports MapReduce APIs. You can create and commit ODPS MR nodes that call the MapReduce Java APIs to write MapReduce programs and process MaxCompute data.

Prerequisites

Important

The required resources must be uploaded, committed, and deployed before you create an ODPS MR node.

Background information

MapReduce is a programming framework for distributed computing. It combines the business logic code that you write with its built-in components to produce a complete distributed computing program, which lets you run jobs concurrently on a Hadoop cluster. MaxCompute provides two versions of MapReduce APIs. For more information, see Overview.

  • MaxCompute MapReduce: the native MapReduce API of MaxCompute. This API version runs fast and lets you develop programs efficiently without exposing file systems.

  • Extended MaxCompute MapReduce (MR2): an extension of MaxCompute MapReduce. This API version supports more complex job scheduling logic. MR2 is implemented in the same way as MaxCompute MapReduce.

In DataWorks, you can create, commit, and deploy an ODPS MR node to schedule a MaxCompute MapReduce task and integrate the MaxCompute MapReduce task with other jobs.

Limits

For information about the limits on MaxCompute MapReduce tasks, see Limits.

Simple code editing example

This example shows how to use an ODPS MR node. In this example, the number of occurrences of each string in the wc_in table is counted and the result is written to the wc_out table.
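The counting that this example performs can be previewed locally. The following is a minimal sketch in plain Java that does not use the MaxCompute SDK or the code in the JAR package. The class name WordCountSketch and its members are hypothetical. The hard-coded rows mirror the sample data that the node code in step 2 inserts into wc_in, and treating every column value (both key and value) as a word is an assumption inferred from the returned result in step 3.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical local sketch of the word-count logic.
// It runs in memory and does not call any MaxCompute API.
final class WordCountSketch {

    // The four sample rows of wc_in, each as {key, value}.
    static final List<String[]> SAMPLE_ROWS = List.of(
        new String[] {"project", "val_pro"},
        new String[] {"problem", "val_pro"},
        new String[] {"package", "val_a"},
        new String[] {"pad", "val_a"});

    // "Map" step: treat every column value of every row as one word (assumption).
    // "Reduce" step: sum the occurrences of each word.
    // A TreeMap keeps the words sorted, which only makes the output easier to read.
    static Map<String, Long> count(List<String[]> rows) {
        Map<String, Long> counts = new TreeMap<>();
        for (String[] row : rows) {
            for (String column : row) {
                counts.merge(column, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Print one "word<TAB>count" line per distinct word.
        count(SAMPLE_ROWS).forEach((word, cnt) -> System.out.println(word + "\t" + cnt));
    }
}
```

Under these assumptions, val_a and val_pro each occur twice because they appear in the value column of two rows, while each key occurs once. The sorted order is a property of the TreeMap in this sketch and says nothing about how MaxCompute orders rows.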

  1. Upload, commit, and deploy the resource mapreduce-examples.jar. For more information, see Create and use MaxCompute resources.

    Note

    For information about the implementation logic inside the mapreduce-examples.jar package, see WordCount example.

  2. Enter and run the following code on the ODPS MR node:

    -- Create an input table. 
    CREATE TABLE if not exists wc_in (key STRING, value STRING);
    -- Create an output table. 
    CREATE TABLE if not exists wc_out (key STRING, cnt BIGINT);
    -- Create a dual table.
    drop table if exists dual;
    create table dual(id bigint); -- If no dual table exists in the current workspace, create a dual table.
    -- Initialize the dual table.
    insert overwrite table dual select count(*) from dual;
    -- Insert the sample data into the wc_in table.
    insert overwrite table wc_in select * from (
      select 'project','val_pro' from dual
      union all
      select 'problem','val_pro' from dual
      union all
      select 'package','val_a' from dual
      union all
      select 'pad','val_a' from dual
    ) b;
    -- Reference the uploaded JAR resource. To reference the resource, find the JAR resource in the resource list, right-click the JAR resource, and then select Reference Resources. 
    --@resource_reference{"mapreduce-examples.jar"}
    jar -resources mapreduce-examples.jar -classpath ./mapreduce-examples.jar com.aliyun.odps.mapred.open.example.WordCount wc_in wc_out

    Code description:

    • --@resource_reference: This statement appears when you right-click the resource name and select Reference Resources.

    • -resources: the name of the referenced JAR resource.

    • -classpath: the path of the referenced JAR resource. You need to enter only ./ and the name of the referenced resource.

    • com.aliyun.odps.mapred.open.example.WordCount: the main class in the JAR resource that is called when the node runs.

    • wc_in: the name of the created input table of the ODPS MR node.

    • wc_out: the name of the created output table of the ODPS MR node.

    • If you use multiple JAR resources in a single ODPS MR node, separate the paths of the referenced JAR resources with commas (,), such as -classpath ./xxxx1.jar,./xxxx2.jar.

    After the node runs, the returned result is OK.

  3. Query data in the output table wc_out by using an ODPS SQL node.

    select * from wc_out;

    Returned result:

    +------------+------------+
    | key        | cnt        |
    +------------+------------+
    | package    | 1          |
    | pad        | 1          |
    | problem    | 1          |
    | project    | 1          |
    | val_a      | 2          |
    | val_pro    | 2          |
    +------------+------------+
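The jar command from step 2 follows a fixed pattern, which the following Java sketch assembles as a string. The class JarCommandSketch and the method build are hypothetical helpers and are not part of DataWorks or MaxCompute. Joining multiple resource names in -resources with commas is an assumption. The code description in step 2 states the comma rule only for -classpath.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper that formats the jar command text for an ODPS MR node.
// It only builds a string and does not submit anything to MaxCompute.
final class JarCommandSketch {

    static String build(List<String> jars, String mainClass, String inputTable, String outputTable) {
        // -resources: names of the referenced JAR resources (comma-joined by assumption).
        String resources = String.join(",", jars);
        // -classpath: "./" plus each resource name, separated with commas.
        String classpath = jars.stream()
            .map(jar -> "./" + jar)
            .collect(Collectors.joining(","));
        return "jar -resources " + resources
            + " -classpath " + classpath
            + " " + mainClass
            + " " + inputTable
            + " " + outputTable;
    }

    public static void main(String[] args) {
        // Rebuilds the command used in this topic.
        System.out.println(build(
            List.of("mapreduce-examples.jar"),
            "com.aliyun.odps.mapred.open.example.WordCount",
            "wc_in",
            "wc_out"));
    }
}
```

With the single resource from this topic, the sketch reproduces the command shown in step 2. Passing two resource names produces a -classpath value in the form ./xxxx1.jar,./xxxx2.jar.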

Advanced code editing examples

For information about how to develop MaxCompute MapReduce tasks in other scenarios, see the following topics:

What to do next

After you complete the development of a task by using the created node, you can perform the following operations:

  • Configure scheduling properties: You can configure properties for periodic scheduling of the node. If you want the system to periodically schedule and run the task, you must configure items for the node, such as rerun settings and scheduling dependencies. For more information, see Overview.

  • Debug the node: You can debug and test the code of the node to check whether the code logic meets your expectations. For more information, see Debugging procedure.

  • Deploy the node: After you complete all development operations, you can deploy the node. After the node is deployed, the system periodically schedules the node based on the scheduling properties of the node. For more information, see Deploy nodes.

  • FAQ about MaxCompute MapReduce: You can review the frequently asked questions about MaxCompute MapReduce so that you can identify and troubleshoot issues efficiently when exceptions occur.