E-MapReduce: Get started with SQL

Last Updated: Jul 15, 2024

E-MapReduce (EMR) Serverless Spark allows you to edit and run jobs by using SQL code. This topic describes how to develop and publish Spark SQL jobs, orchestrate the jobs in a workflow, and then run and maintain the workflow.

Prerequisites

A workspace is created.

Step 1: Create and publish jobs

Important

A job must be published before it can be added to a workflow.

  1. Go to the development page.

    1. Log on to the EMR console.

    2. In the left-side navigation pane, choose EMR Serverless > Spark.

    3. On the Spark page, click the name of the workspace in which you want to create and publish a job.

    4. In the left-side navigation pane of the EMR Serverless Spark page, click Development.

  2. Create a job named users_task.

    1. On the Development tab, click New.

    2. In the New dialog box, specify a job name in the Name field, set Task Type to the default value SparkSQL, and then click OK. In this example, the job name is users_task.

      A Spark SQL job named users_task is created.

    3. Copy the following code to the new Spark SQL job tab named users_task:

      CREATE TABLE IF NOT EXISTS students (
        name VARCHAR(64),
        address VARCHAR(64)
      )
      USING PARQUET
      PARTITIONED BY (data_date STRING);
      
      
      INSERT OVERWRITE TABLE students PARTITION (data_date = '${ds}') VALUES
        ('Ashua Hill', '456 Erica Ct, Cupertino'),
        ('Brian Reed', '723 Kern Ave, Palo Alto');

      The following date and time variables are supported. The default value of each variable is the day before the job is created. A sketch that shows how these variables can be used in queries appears at the end of Step 1.

      • {data_date}, {ds}, {dt} (str): A variable that indicates the date. Format: YYYY-MM-DD. Example: 2023-09-18.

      • {data_date_nodash}, {ds_nodash}, {dt_nodash} (str): A variable that indicates the date. Format: YYYYMMDD. Example: 20230918.

      • {ts} (str): A timestamp. Format: YYYY-MM-DDTHH:MM:SS. Example: 2023-09-18T16:07:43.

      • {ts_nodash} (str): A timestamp. Format: YYYYMMDDHHMMSS. Example: 20230918160743.

    4. Select a database from the Default Database drop-down list and select a running SQL compute from the SQL compute drop-down list.

      You can also select Create SQL compute from the SQL compute drop-down list to create a new SQL compute. For more information about SQL computes, see Manage SQL computes.

    5. Click Run to run the job that you created.

      You can view the running results in the Results section in the lower part of the page. If an exception occurs, you can view the exception in the Problems section.

  3. Publish the users_task job.

    Note

    The parameters specified for a job are published along with the job and are used when the published job runs in production, for example, as part of a workflow. The parameters of the SQL compute are used when the job runs in the SQL editor.

    1. On the new Spark SQL job tab, click Publish.

    2. In the Publish Task dialog box, enter a description in the Commit Message field as needed, and then click OK.

  4. Create a job named users_count.

    1. On the Development tab, click New.

    2. In the New dialog box, specify a job name in the Name field, set Task Type to the default value SparkSQL, and then click OK. In this example, the job name is users_count.

      A Spark SQL job named users_count is created.

    3. Copy the following code to the new Spark SQL job tab named users_count:

      SELECT COUNT(1) FROM students;
    4. Select a database from the Default Database drop-down list and select a running SQL compute from the SQL compute drop-down list.

      You can also select Connect to SQL Compute from the SQL compute drop-down list to create a new SQL compute. For more information about SQL computes, see Manage SQL computes.

    5. Click Run to run the job that you created.

      You can view the running results in the Execution Results section in the lower part of the page. If an exception occurs, you can view the exception in the Execution Issues section.

  5. Publish the users_count job.

    Note

    The parameters specified for a job are published along with the job and are used when the published job runs in production, for example, as part of a workflow. The parameters of the SQL compute are used when the job runs in the SQL editor.

    1. On the new Spark SQL job tab, click Publish.

    2. In the Publish dialog box, enter a description in the Commit Message field as needed, and then click OK.
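
The following statements are a minimal sketch of how the date and time variables can be combined with the students table that the preceding jobs create. The sketch assumes that the variables are replaced at run time as described in Step 1; the dates shown in the comments are illustrative examples only.

  -- Query only the partition that the current run wrote. At run time, ${ds} is
  -- replaced with a date string such as 2023-09-18 (illustrative example).
  SELECT name, address
  FROM students
  WHERE data_date = '${ds}';

  -- Count the rows in each partition to check that each run wrote its own partition.
  SELECT data_date, COUNT(1) AS row_count
  FROM students
  GROUP BY data_date
  ORDER BY data_date;

  -- List all partitions of the students table.
  SHOW PARTITIONS students;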

Step 2: Create a workflow and its nodes

  1. In the left-side navigation pane, click Workflows.

  2. On the Workflows page, click Create Workflow.

  3. In the Create Workflow panel, specify a name in the Name field and click Next. In this example, the workflow name is spark_workflow_task.

    You can configure the parameters in the Other Settings section based on your business requirements. For more information about the parameters, see Manage workflows.

  4. Add the users_task job to the workflow as a node.

    1. On the node canvas of the workflow that you created, click Add Node.

    2. In the Add Node panel, select users_task from the Source drop-down list and click Save.

  5. Add the users_count job to the workflow as a node.

    1. On the node canvas, click Add Node.

    2. In the Add Node panel, select users_count from the Source Path drop-down list, select users_task from the Upstream Nodes drop-down list, and then click Save.

  6. Click Publish Workflow above the node canvas.

  7. In the Publish dialog box, enter a description in the Publish Message field as needed, and then click OK.

Step 3: Run the workflow

  1. In the Name column on the Workflows page, click the name of the workflow that you created. In this example, the workflow name is spark_workflow_task.

  2. On the Workflow Runs tab, click Manual Run.

    Note

    If you configured Schedule Period when you created the workflow, you can turn on the switch in the Schedule Status column of the workflow on the Workflows page to run the workflow on a schedule.

  3. In the Start Workflow dialog box, click OK.

Step 4: View the states of workflow runs

  1. On the Workflows page, click the name of the workflow that you want to view, such as spark_workflow_task.

  2. On the Workflow Runs tab, you can view all runs of the workflow, including the duration and state of each run.

  3. Click the ID of a workflow run in the Workflow Run ID column on the Workflow Runs tab or click the Workflow Runs Diagram tab to view the corresponding workflow run diagram.

  4. Click a node in the diagram. In the message that appears, you can perform operations or view information based on your business requirements.

    For more information, see View node runs.

    For example, you can click Spark UI to go to the Spark Jobs page. On this page, you can view the real-time information about the Spark job.

  5. You can also click the value to the right of Job Run ID to go to the Job History page. On this page, you can view the metrics, health check results, and logs.

Step 5: Perform O&M operations on the workflow

On the Workflows page, click the name of the workflow on which you want to perform O&M operations. On the Workflow Runs page, you can perform the following operations:

  • In the Workflow Runs section, you can edit specific parameters.

  • In the Workflow Runs Diagram section, you can view all workflow runs. You can go to the Workflow Runs Diagram tab by clicking a value in the Workflow Run ID column.

Step 6: View job details

  1. In the left-side navigation pane, click Development.

  2. Create a new Spark SQL job. Enter and run the following statement to query the data that the workflow wrote to the students table:

    SELECT * FROM students;

    The query returns the rows that the users_task node inserted into the students table.
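
    If you only want to inspect the table definition or the rows of a single run, statements such as the following can be used. This is a sketch: the date literal is an assumed example and must be replaced with a data_date value that was actually written.

    -- Show the table schema, the partition column, and other table details.
    DESCRIBE TABLE EXTENDED students;

    -- Query a single partition. Replace the assumed example date with the
    -- data_date value of the run that you want to inspect.
    SELECT name, address, data_date
    FROM students
    WHERE data_date = '2023-09-18';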
