All Products
Search
Document Center

:Create an EMR Spark Streaming node

Last Updated:Sep 11, 2024

E-MapReduce (EMR) Spark Streaming nodes can be used to process streaming data with high throughput. This type of node supports fault tolerance and can help you restore data streams on which errors occur. This topic describes how to create and use an EMR Spark Streaming node to develop data.

Prerequisites

  • An Alibaba Cloud EMR cluster is created and registered to DataWorks. For more information, see Register an EMR cluster to DataWorks.

  • (Required if you use a RAM user to develop tasks) The RAM user is added to the DataWorks workspace as a member and is assigned the Develop or Workspace Administrator role. The Workspace Administrator role has more permissions than necessary. Exercise caution when you assign the Workspace Administrator role. For more information about how to add a member, see Add workspace members and assign roles to them.

  • A serverless resource group is purchased and configured. The configurations include association with a workspace and network configuration. For more information, see Create and use a serverless resource group.

  • A workflow is created in DataStudio.

    Development operations in different types of compute engines are performed based on workflows in DataStudio. Therefore, before you create a node, you must create a workflow. For more information, see Create a workflow.

Limits

  • This type of node can be run only on a serverless resource group or an exclusive resource group for scheduling. We recommend that you use a serverless resource group.

  • EMR Spark Streaming nodes that you create in DataWorks cannot be used to develop data in Spark clusters created on the EMR on ACK page.

Step 1: Create an EMR Spark Streaming node

  1. Go to the DataStudio page.

    Log on to the DataWorks console. In the top navigation bar, select the desired region. Then, choose Data Modeling and Development > DataStudio in the left-side navigation pane. On the page that appears, select the desired workspace from the drop-down list and click Go to DataStudio.

  2. Create an EMR Spark Streaming node.

    1. Find the desired workflow, right-click the name of the workflow, and then choose Create Node > EMR > EMR Spark Streaming.

      Note

      Alternatively, you can move the pointer over Create icon and choose Create Node > EMR > EMR Spark Streaming.

    2. In the Create Node dialog box, configure the Name, Engine Instance, Node Type, and Path parameters. Click Confirm. The configuration tab of the EMR Spark Streaming node appears.

      Note

      The node name can contain only letters, digits, underscores (_), and periods (.).

Step 2: Develop an EMR Spark Streaming task

You can develop a Spark Streaming task on the configuration tab of the EMR Spark Streaming node.

Create and reference an EMR JAR resource

If you use an EMR DataLake cluster, you can perform the following steps to reference an EMR JAR resource:

  1. Prepare the EMR JAR sample code.

    spark-submit --master yarn
    --deploy-mode cluster
    --name SparkPi
    --driver-memory 4G
    --driver-cores 1
    --num-executors 5
    --executor-memory 4G
    --executor-cores 1
    --class org.apache.spark.examples.JavaSparkPi
    hdfs:///tmp/jars/spark-examples_2.11-2.4.8.jar 100
    Note

    If an EMR Spark Streaming node depends on large amounts of resources, the resources cannot be uploaded by using the DataWorks console. In this case, you can store the resources in Hadoop Distributed File System (HDFS) and then reference the resources in the code of the EMR Spark Streaming node. Sample code:

  2. Create an EMR JAR resource. For more information, see Create and use an EMR resource. The first time you use an EMR JAR resource, click Authorize to authorize DataWorks to access the EMR JAR resource.

  3. Reference the EMR JAR resource.

    1. Open the EMR Spark Streaming node. The configuration tab of the node appears.

    2. Find the resource that you want to reference below Resource in the EMR folder, right-click the resource name, and then select Insert Resource Path.

    3. If the clause that is in the ##@resource_reference{""} format appears on the configuration tab of the EMR Spark Streaming node, the resource is referenced. Then, run the following code. You must replace the information in the following code with the actual information. The information includes the resource package name, bucket name, and directory.

      ##@resource_reference{"examples-1.2.0-shaded.jar"}
      --master yarn-cluster --executor-cores 2 --executor-memory 2g --driver-memory 1g --num-executors 2 --class com.aliyun.emr.example.spark.streaming.JavaLoghubWordCount examples-1.2.0-shaded.jar <logService-project> <logService-store> <group> <endpoint> <access-key-id> <access-key-secret>

Develop SQL code

On the configuration tab of the EMR Spark Streaming node, write code for the node. Sample code:

spark-submit --master yarn-cluster --executor-cores 2 --executor-memory 2g --driver-memory 1g --num-executors 2 --class com.aliyun.emr.example.spark.streaming.JavaLoghubWordCount examples-1.2.0-shaded.jar <logService-project> <logService-store> <group> <endpoint> <access-key-id> <access-key-secret>
Note
  • In this example, the examples-1.2.0-shaded.jar JAR package is uploaded in the DataWorks console.

  • You must replace access-key-id and access-key-secret with the AccessKey ID and AccessKey secret of your Alibaba Cloud account. To obtain the AccessKey ID and AccessKey secret, you can log on to the DataWorks console, move the pointer over the profile picture in the upper-right corner, and then select AccessKey Management.

  • You cannot add comments when you write code for the EMR Spark Streaming node.

  • If multiple EMR data sources are associated with DataStudio in your workspace, you must select one from the data sources based on your business requirements. If only one EMR data source is associated with DataStudio in your workspace, you do not need to select a data source.

(Optional) Configure advanced parameters

You can configure advanced parameters on the Advanced Settings tab of the configuration tab of the current node. For more information about how to configure the parameters, see Spark Configuration. The following table describes the advanced parameters that can be configured.

DataLake cluster: EMR on ECS

Advanced parameter

Description

queue

The scheduling queue to which jobs are committed. Default value: default. For information about EMR YARN, see YARN schedulers.

priority

The priority. Default value: 1.

Others

You can also add a SparkConf parameter on the Advanced Settings tab for the EMR Spark Streaming node. When you commit the code for the EMR Spark Streaming node in DataWorks, DataWorks adds the custom parameter to the command. For example, you can add a custom parameter whose key is "spark.driver.memory" : "2g".

Run the Spark Streaming task

  1. In the toolbar, click the 高级运行 icon. In the Parameters dialog box, select the desired resource group from the Resource Group Name drop-down list and click Run.

    Note
    • If you want to access a data source over the Internet or a virtual private cloud (VPC), you must use the resource group for scheduling that is connected to the data source. For more information, see Network connectivity solutions.

    • If you want to change the resource group in subsequent operations, you can click the 高级运行 (Run with Parameters) icon to change the resource group in the Parameters dialog box.

  2. Click the 保存 icon in the top toolbar to save the SQL statements.

  3. Optional. Perform smoke testing.

    You can perform smoke testing on the node in the development environment when you commit the node or after you commit the node. For more information, see Perform smoke testing.

Step 3: Configure scheduling properties

If you want the system to periodically run a task on the node, you can click Properties in the right-side navigation pane on the configuration tab of the node to configure task scheduling properties based on your business requirements. For more information, see Overview.

Note

You must configure the Rerun and Parent Nodes parameters on the Properties tab before you commit the task.

Step 4: Deploy the task

After a task on a node is configured, you must commit and deploy the task. After you commit and deploy the task, the system runs the task on a regular basis based on scheduling configurations.

  1. Click the 保存 icon in the top toolbar to save the task.

  2. Click the 提交 icon in the top toolbar to commit the task.

    In the Submit dialog box, configure the Change description parameter. Then, determine whether to review task code after you commit the task based on your business requirements.

    Note
    • You must configure the Rerun and Parent Nodes parameters on the Properties tab before you commit the task.

    • You can use the code review feature to ensure the code quality of tasks and prevent task execution errors caused by invalid task code. If you enable the code review feature, the task code that is committed can be deployed only after the task code passes the code review. For more information, see Code review.

If you use a workspace in standard mode, you must deploy the task in the production environment after you commit the task. To deploy a task on a node, click Deploy in the upper-right corner of the configuration tab of the node. For more information, see Deploy nodes.

What to do next

After you commit and deploy the task, the task is periodically run based on the scheduling configurations. You can click Operation Center in the upper-right corner of the configuration tab of the corresponding node to go to Operation Center and view the scheduling status of the task. For more information, see View and manage auto triggered nodes.