You can create jobs in a project to develop your data processing tasks. This topic describes job-related operations, such as creating, configuring, and running jobs.
Prerequisites
A project is created. For more information, see Manage projects.
Create a job
- Go to the Data Platform tab.
    - Log on to the Alibaba Cloud EMR console by using your Alibaba Cloud account.
    - In the top navigation bar, select the region where your cluster resides and select a resource group based on your business requirements.
    - Click the Data Platform tab.
- In the Projects section of the page that appears, find the project that you want to manage and click Edit Job in the Actions column.
- Create a job.
Configure a job
For more information about how to develop and configure each type of job, see Jobs. This section describes how to configure the parameters of a job on the Basic Settings, Advanced Settings, Shared Libraries, and Alert Settings tabs in the Job Settings panel.
Add annotations
You can add annotations in the job code to configure job parameters. An annotation is in the following format:
!!! @<Annotation name>: <Annotation content>
An annotation must start with three exclamation points (!!!). Add one annotation in a line.
Annotation name | Description |
---|---|
rem | Adds a comment. |
env | Adds an environment variable. |
var | Adds a custom variable. |
resource | Adds a resource file. |
sharedlibs | Adds dependency libraries. This annotation is valid only in Streaming SQL jobs. Separate multiple dependency libraries with commas (,). |
scheduler.queue | Specifies the queue to which the job is submitted. |
scheduler.vmem | Specifies the memory required to run the job. Unit: MiB. |
scheduler.vcores | Specifies the number of vCores required to run the job. |
scheduler.priority | Specifies the priority of the job. Valid values: 1 to 100. |
scheduler.user | Specifies the user who submits the job. |
- Invalid annotations, such as unknown annotations or annotations whose content is in an invalid format, are automatically skipped.
- Job parameters specified in annotations take precedence over those specified in the Job Settings panel. If a parameter is specified in both places, the value in the annotation takes effect.
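For example, the following annotations could be placed at the top of a job script to override the scheduling parameters configured in the Job Settings panel. This is a hypothetical sketch: the queue name and resource values are placeholders, not values from your cluster.
```
!!! @rem: run this job in the dev queue with 2 vCores and 2048 MiB of memory
!!! @scheduler.queue: dev
!!! @scheduler.vmem: 2048
!!! @scheduler.vcores: 2
!!! @scheduler.priority: 50
```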
Run a job
- Run the job that you created.
    - On the job page, click Run in the upper-right corner to run the job.
    - In the Run Job dialog box, select a resource group and the cluster that you created.
    - Click OK.
- View running details.
Operations that you can perform on jobs
Operation | Description |
---|---|
Clone Job | Clones the configurations of a job to generate a new job in the same folder. |
Rename Job | Renames a job. |
Delete Job | Deletes a job. You can delete a job only if the job is not associated with a workflow, or if the associated workflow is not running or being scheduled. |
Job submission modes
Spark jobs are submitted by the spark-submit process, which serves as the launcher in the data development module. In most cases, this process occupies more than 600 MiB of memory. The Memory (MB) parameter in the Job Settings panel specifies the size of the memory allocated to the launcher.
Job submission mode | Description |
---|---|
Header/Gateway Node | In this mode, the spark-submit process runs on the master node and is not monitored by YARN. Because each spark-submit process requests a large amount of memory, submitting a large number of jobs consumes a large portion of the master node's resources, which undermines cluster stability. |
Worker Node | In this mode, the spark-submit process runs on a core node, occupies a YARN container, and is monitored by YARN. This mode reduces the resource usage on the master node. |
Memory consumed by a job instance = Memory consumed by the launcher + Memory consumed by a job that corresponds to the job instance
Memory consumed by a job = Memory consumed by the spark-submit logical module (not the process) + Memory consumed by the driver + Memory consumed by the executor
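As a rough, hypothetical example of this formula: if the launcher occupies 600 MiB and the corresponding Spark job consumes 512 MiB for the spark-submit logical module, 1,024 MiB for the driver, and 4,096 MiB for the executors, the job instance consumes about 600 + 512 + 1,024 + 4,096 = 6,232 MiB in total.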
Launch mode of Spark application | Job submission mode | Process in which spark-submit and the driver run | Process description |
---|---|---|---|
yarn-client mode | Submit a job in LOCAL mode. | The driver runs in the same process as spark-submit. | The process used to submit a job runs on the master node and is not monitored by YARN. |
yarn-client mode | Submit a job in YARN mode. | The driver runs in the same process as spark-submit. | The process used to submit a job runs on a core node, occupies a YARN container, and is monitored by YARN. |
yarn-cluster mode | Submit a job in LOCAL or YARN mode. | The driver runs in a different process from spark-submit. | The driver occupies a YARN container. |
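For reference, these two launch modes correspond to the standard spark-submit deploy modes. The following commands are a generic sketch of how the deploy mode determines where the driver runs; the class name and JAR path are placeholders, and the data development module invokes spark-submit on your behalf as described above.
```
# yarn-client mode: the driver runs inside the spark-submit process on the submitting node.
spark-submit --master yarn --deploy-mode client --class com.example.MyApp my-app.jar

# yarn-cluster mode: the driver runs in a YARN container (the ApplicationMaster) on the cluster.
spark-submit --master yarn --deploy-mode cluster --class com.example.MyApp my-app.jar
```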