E-MapReduce (EMR) Serverless Spark allows you to edit and run jobs by using SQL code. This topic describes how to create, start, and maintain a Spark SQL job.
Prerequisites
An Alibaba Cloud account is created. For more information, go to the Alibaba Cloud signup page.
The required roles are assigned to your Alibaba Cloud account. For more information, see Assign roles to an Alibaba Cloud account.
A workspace and an SQL compute are created. For more information, see Create a workspace and Manage SQL computes.
Step 1: Create and publish jobs
A job must be published before it can be added to a workflow.
Go to the development page.
Log on to the EMR console.
In the left-side navigation pane, choose EMR Serverless > Spark.
On the Spark page, click the name of the workspace in which you want to create and publish a job.
In the left-side navigation pane of the EMR Serverless Spark page, click Development.
Create a job named users_task.
On the Development tab, click New.
In the New dialog box, specify a job name in the Name field, set Task Type to the default value SparkSQL, and then click OK. In this example, the job name is users_task.
This way, a Spark SQL job is created.
Copy the following code to the new Spark SQL job tab named users_task:
CREATE TABLE IF NOT EXISTS students (
    name VARCHAR(64),
    address VARCHAR(64)
) USING PARQUET
PARTITIONED BY (data_date STRING);

INSERT OVERWRITE TABLE students PARTITION (data_date = '${ds}')
VALUES
    ('Ashua Hill', '456 Erica Ct, Cupertino'),
    ('Brian Reed', '723 Kern Ave, Palo Alto');
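The ${ds} variable is replaced at run time. As an illustration only, if the job runs on 2023-09-19, ${ds} resolves to the previous day and the INSERT statement is executed as follows (the date is assumed for the example):

```sql
-- Hypothetical result of variable substitution when ${ds} = '2023-09-18'
INSERT OVERWRITE TABLE students PARTITION (data_date = '2023-09-18')
VALUES
    ('Ashua Hill', '456 Erica Ct, Cupertino'),
    ('Brian Reed', '723 Kern Ave, Palo Alto');
```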
The following date and time variables are supported in job code. By default, each variable resolves to the day before the job is created.
Variable | Data type | Description
{data_date}, {ds}, {dt} | str | A variable that indicates the date. Format: YYYY-MM-DD. Example: 2023-09-18.
{data_date_nodash}, {ds_nodash}, {dt_nodash} | str | A variable that indicates the date. Format: YYYYMMDD. Example: 20230918.
{ts} | str | A timestamp. Format: YYYY-MM-DDTHH:MM:SS. Example: 2023-09-18T16:07:43.
{ts_nodash} | str | A timestamp. Format: YYYYMMDDHHMMSS. Example: 20230918160743.
Select a database from the Default Database drop-down list and select a running SQL compute from the SQL compute drop-down list.
You can also select Create SQL compute from the SQL compute drop-down list to create a new SQL compute. For more information about SQL computes, see Manage SQL computes.
Click Run to run the job that you created.
You can view the running results in the Results section in the lower part of the page. If an exception occurs, you can view the exception in the Problems section.
Publish the users_task job.
Note: The parameters specified for a job are published along with the job and are used when the job runs in production. When the job runs in the SQL editor, the parameters of the SQL compute are used instead.
On the new Spark SQL job tab, click Publish.
In the Publish Task dialog box, optionally enter a description in the Commit Message field. Then, click OK.
Create a job named users_count.
On the Development tab, click New.
In the New dialog box, specify a job name in the Name field, set Task Type to the default value SparkSQL, and then click OK. In this example, the job name is users_count.
This way, a Spark SQL job is created.
Copy the following code to the new Spark SQL job tab named users_count:
SELECT COUNT(1) FROM students;
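If the workflow later runs for multiple dates, you can also count rows per partition. The following variant is an optional addition, not part of the tutorial's job, and groups by the partition column:

```sql
-- Optional variant: count rows in each data_date partition
SELECT data_date, COUNT(1) AS row_count
FROM students
GROUP BY data_date
ORDER BY data_date;
```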
Select a database from the Default Database drop-down list and select a running SQL compute from the SQL compute drop-down list.
You can also select Create SQL Compute from the SQL compute drop-down list to create a new SQL compute. For more information about SQL computes, see Manage SQL computes.
Click Run to run the job that you created.
You can view the running results in the Execution Results section in the lower part of the page. If an exception occurs, you can view the exception in the Execution Issues section.
Publish the users_count task.
Note: The parameters specified for a job are published along with the job and are used when the job runs in production. When the job runs in the SQL editor, the parameters of the SQL compute are used instead.
On the new Spark SQL job tab, click Publish.
In the Publish dialog box, optionally enter a description in the Commit Message field. Then, click OK.
Step 2: Create a workflow and its nodes
In the left-side navigation pane, click Workflows.
On the Workflows page, click Create Workflow.
In the Create Workflow panel, specify a workflow name in the Name field and click Next. In this example, the name is set to spark_workflow_task.
You can configure the parameters in the Other Settings section based on your business requirements. For more information about the parameters, see Manage workflows.
Add the users_task job to the workflow as a node.
On the node canvas of the workflow that you created, click Add Node.
In the Add Node panel, select users_task from the Source drop-down list and click Save.
Add the users_count job to the workflow as a node.
On the node canvas, click Add Node.
In the Add Node panel, select users_count from the Source Path drop-down list, select users_task from the Upstream Nodes drop-down list, and then click Save.
Click Publish Workflow above the node canvas.
In the Publish dialog box, you can enter information in the Publish Message field. Then, click OK.
Step 3: Run the workflow
In the Name column on the Workflows page, click the name of the workflow that you created. In this example, the workflow name is spark_workflow_task.
On the Workflow Runs tab, click Manual Run.
Note: If you configured Schedule Period when you created the workflow, you can turn on the switch in the Schedule Status column of the workflow on the Workflows page to run the workflow on a schedule.
In the Start Workflow dialog box, click OK.
Step 4: View the states of workflow instances
On the Workflows page, click the name of the workflow that you want to view, such as spark_workflow_task.
On the Workflow Runs tab, you can view all runs of the workflow, together with the duration and state of each run.
Click the ID of a workflow run in the Workflow Run ID column on the Workflow Runs tab or click the Workflow Runs Diagram tab to view the corresponding workflow run diagram.
Click a node in the diagram. In the message that appears, you can perform operations or view information based on your business requirements.
For more information, see View node runs.
For example, you can click Spark UI to go to the Spark Jobs page. On this page, you can view the real-time information about the Spark job.
You can also click the value to the right of Job Run ID to go to the Job History page. On this page, you can view the metrics, health check results, and logs.
Step 5: Perform O&M operations on the workflow
On the Workflows page, click the name of the workflow on which you want to perform O&M operations. On the Workflow Runs tab, you can perform the following operations:
In the Workflow Runs section, you can edit specific parameters.
In the Workflow Runs Diagram section, you can view all workflow runs. You can go to the Workflow Runs Diagram tab by clicking a value in the Workflow Run ID column.
Step 6: View job details
In the left-side navigation pane, click Development.
Create a new job. Then, enter and run the following statement to query the data that was written to the students table:
SELECT * FROM students;
The query returns the rows that the users_task job wrote to the students table.
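Because the students table is partitioned by data_date, you can also restrict the scan to a single partition. The date below is illustrative and assumes the job wrote the 2023-09-18 partition:

```sql
-- Read only the partition written for the assumed date 2023-09-18
SELECT * FROM students WHERE data_date = '2023-09-18';
```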
References
For information about how to create a resource queue, see Manage resource queues.
For information about how to create an SQL compute, see Manage SQL computes.