E-MapReduce: Get started with the development of Spark SQL jobs

Last Updated: Oct 28, 2024

E-MapReduce (EMR) Serverless Spark allows you to edit and run jobs by using SQL code. This topic describes how to create, start, and maintain a Spark SQL job.

Prerequisites

A workspace is created. For more information, see Create a workspace.

Step 1: Create and publish jobs

Important

A job must be published before it can be added to a workflow.

  1. Go to the Development tab.

    1. Log on to the EMR console.

    2. In the left-side navigation pane, choose EMR Serverless > Spark.

    3. On the Spark page, find the desired workspace and click the name of the workspace.

    4. In the left-side navigation pane of the EMR Serverless Spark page, click Data Development.

  2. Create a job named users_task.

    1. On the Development tab, click Create.

    2. In the Create dialog box, set the Name parameter to users_task, use the default value SparkSQL for the Type parameter, and then click OK.

    3. On the configuration tab of the users_task job, copy the following code to the code editor:

      -- Create a partitioned Parquet table to hold the sample user records.
      CREATE TABLE IF NOT EXISTS students (
        name VARCHAR(64),
        address VARCHAR(64)
      )
      USING PARQUET
      PARTITIONED BY (data_date STRING);

      -- Overwrite the partition that corresponds to the ${ds} date variable.
      INSERT OVERWRITE TABLE students PARTITION (data_date = '${ds}') VALUES
        ('Ashua Hill', '456 Erica Ct, Cupertino'),
        ('Brian Reed', '723 Kern Ave, Palo Alto');

      The following table describes the supported date and time variables. The default value of each variable is the day before the date on which you create the job.

      Variable              Data type    Description
      {data_date}           str          A variable that indicates the date. Format: YYYY-MM-DD. Example: 2023-09-18.
      {ds}                  str          Same as {data_date}.
      {dt}                  str          Same as {data_date}.
      {data_date_nodash}    str          A variable that indicates the date. Format: YYYYMMDD. Example: 20230918.
      {ds_nodash}           str          Same as {data_date_nodash}.
      {dt_nodash}           str          Same as {data_date_nodash}.
      {ts}                  str          A variable that indicates the timestamp. Format: YYYY-MM-DDTHH:MM:SS. Example: 2023-09-18T16:07:43.
      {ts_nodash}           str          A variable that indicates the timestamp. Format: YYYYMMDDHHMMSS. Example: 20230918160743.
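
      The users_task job above writes its rows into the partition that corresponds to the date variable. The following is a minimal sketch of reading that partition back, assuming the variable is referenced with the ${...} syntax used in the job code and resolves to the same value that users_task used when it wrote the partition:

      -- Read only the partition that matches the date variable.
      SELECT name, address
      FROM students
      WHERE data_date = '${ds}';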

    4. Select a database from the Default Database drop-down list and select a running session from the SQL Sessions drop-down list.

      You can also select Create SQL Session from the SQL Sessions drop-down list to create a session. For information about session management, see Manage SQL sessions.

    5. Click Run to run the job that you created.

      You can view the results on the Execution Results tab in the lower part of the configuration tab. If an exception occurs, you can view its details on the Execution Issues tab.

  3. Publish the users_task job.

    Note

    The parameters that are specified for a job are published together with the job and take effect when the job runs in the production environment. The parameters of the SQL session take effect only when the job runs in the SQL editor.

    1. In the upper-right corner of the configuration tab of the job, click Publish.

    2. In the Publish dialog box, configure the Remarks parameter and click OK.

  4. Create a job named users_count.

    1. On the Development tab, click Create.

    2. In the Create dialog box, set the Name parameter to users_count, use the default value SparkSQL for the Type parameter, and then click OK.

    3. On the configuration tab of the users_count job, copy the following code to the code editor:

      -- Count all rows in the students table across all partitions.
      SELECT COUNT(1) FROM students;

      An optional sketch that verifies the written data before you publish follows this list.

    4. Select a database from the Default Database drop-down list and select a running session from the SQL Sessions drop-down list.

      You can also select Create SQL Session from the SQL Sessions drop-down list to create a session. For information about session management, see Manage SQL sessions.

    5. Click Run to run the job that you created.

      You can view the results on the Execution Results tab in the lower part of the configuration tab. If an exception occurs, you can view its details on the Execution Issues tab.

  5. Publish the users_count job.

    Note

    The parameters that are specified for a job are published together with the job and take effect when the job runs in the production environment. The parameters of the SQL session take effect only when the job runs in the SQL editor.

    1. In the upper-right corner of the configuration tab of the job, click Publish.

    2. In the Publish dialog box, configure the Remarks parameter and click OK.
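
  Before you publish, you can optionally verify the data that the users_task job wrote. The following is a minimal check that you can run in the SQL editor, assuming the students table was created in the selected default database and the ${ds} variable resolves in the editor the same way it does in the job; SHOW PARTITIONS and the filtered count are standard Spark SQL statements, not tutorial-specific commands:

    -- List the partitions that currently exist in the students table.
    SHOW PARTITIONS students;

    -- Count only the rows in the partition that users_task wrote.
    SELECT COUNT(1) FROM students WHERE data_date = '${ds}';

  If users_task has run once with the default variable value, the count returns 2, which matches the two rows in the INSERT statement.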

Step 2: Create a workflow and nodes

  1. In the left-side navigation pane of the EMR Serverless Spark page, click Workflows.

  2. On the Workflows tab of the Workflows page, click Create Workflow.

  3. In the Create Workflow panel, configure the Name parameter and click Next. In this example, the Name parameter is set to spark_workflow_task.

    You can configure the parameters in the Other Settings section based on your business requirements. For more information about parameter settings, see Manage workflows.

  4. Add the users_task job to the workflow as a node.

    1. On the canvas that appears, click Add Node.

    2. In the Add Node panel, select the published users_task job from the Source File Path drop-down list and click Save.

  5. Add the users_count job to the workflow as a node.

    1. On the canvas, click Add Node.

    2. In the Add Node panel, select the published users_count job from the Source File Path drop-down list, select users_task from the Upstream Node drop-down list, and then click Save.

  6. Click Publish Workflow in the upper-right corner of the canvas of the workflow.


  7. In the Publish dialog box, configure the Remarks parameter and click OK.

Step 3: Run the workflow

  1. On the Workflows tab of the Workflows page, click the name of the created workflow in the Name column.

  2. In the upper-right corner of the Workflow Runs tab, click Manually Run.

    Note

    If you specified a scheduling cycle when you created the workflow, you can turn on the switch in the Scheduling Status column of the workflow on the Workflows tab of the Workflows page.

  3. In the Run message, click OK.

Step 4: View the status of a workflow run

  1. On the Workflows tab of the Workflows page, find the desired workflow and click the workflow name in the Name column.

  2. In the Workflow Runs section of the Workflow Runs tab, view all generated workflow runs and information such as the running duration and status of each workflow run.


  3. In the Workflow Runs section, click the ID of a workflow run in the Workflow Run ID column, or click the Workflow Run Graph tab, to view the directed acyclic graph (DAG) of the workflow run.

  4. Click a node in the DAG. In the dialog box that appears, perform operations or view information based on your business requirements.

    For more information, see the View node runs section of the "Manage workflow runs and node runs" topic.


    For example, you can click Spark UI to go to the Spark Jobs page. On this page, you can view real-time information about the Spark SQL job.


  5. Click the value of the Job Run ID parameter to go to the Job History page, where you can view details on the Metrics, Diagnosis, and Log Exploration tabs.


Step 5: Perform O&M on the workflow

On the Workflows tab of the Workflows page, click the name of the workflow on which you want to perform O&M operations. On the Workflow Runs tab, you can perform the following operations:

  • In the Workflow Information section, you can edit specific parameters.

  • In the Workflow Runs section, you can view all workflow runs. You can click the ID of a workflow run in the Workflow Run ID column to open the DAG of that workflow run.


Step 6: View data

  1. In the left-side navigation pane of the EMR Serverless Spark page, click Data Development.

  2. Create a Spark SQL job and execute the following statement to query the data in the table (a partition-filtered variant is shown after these steps):

    SELECT * FROM students;

    The rows that you inserted by using the users_task job are returned.
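
    If you only need the rows for a specific date, you can filter on the partition column so that Spark reads a single partition instead of scanning the whole table. A minimal variant, assuming a partition exists for 2023-09-18 (the example date used by the date variables in Step 1):

      -- Prune the scan to a single partition.
      SELECT * FROM students WHERE data_date = '2023-09-18';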
