Container Service for Kubernetes: Use Spark Operator to run Spark jobs

Last Updated: Oct 11, 2024

Apache Spark is a computing engine for large-scale data processing that is widely used for analytics workloads in big data and machine learning scenarios. Spark Operator automates the deployment of Spark jobs and manages their lifecycle in Kubernetes clusters. This topic describes how to use Spark Operator to run Spark jobs in ACK clusters, which helps data engineers quickly and efficiently run and manage big data processing jobs.

Prerequisites

Introduction to Spark Operator

Spark Operator is designed to run Spark jobs in Kubernetes clusters and automate their lifecycle management. You can submit and manage Spark jobs by using CustomResourceDefinitions (CRDs) such as SparkApplication and ScheduledSparkApplication. Spark Operator can efficiently monitor and optimize the execution of Spark jobs by leveraging Kubernetes features such as auto scaling, health checks, and resource management. ACK provides the ack-spark-operator component based on the open source kubeflow/spark-operator component. For more information, see Spark Operator | Kubeflow.
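
For recurring workloads, the ScheduledSparkApplication CRD wraps a regular SparkApplication specification with a cron-style schedule. The following is a minimal sketch that assumes the CRDs installed by ack-spark-operator follow the upstream kubeflow/spark-operator v1beta2 schema; the schedule and resource sizes are illustrative only.

    apiVersion: sparkoperator.k8s.io/v1beta2
    kind: ScheduledSparkApplication
    metadata:
      name: spark-pi-scheduled
      namespace: default
    spec:
      schedule: "@every 1h"          # Cron-style schedule for recurring runs.
      concurrencyPolicy: Forbid      # Do not start a new run while the previous run is still running.
      template:                      # A regular SparkApplication spec that is used for each scheduled run.
        type: Scala
        mode: cluster
        image: registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/spark:3.5.2
        mainClass: org.apache.spark.examples.SparkPi
        mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.5.2.jar
        sparkVersion: 3.5.2
        driver:
          cores: 1
          memory: 512m
          serviceAccount: spark-operator-spark
        executor:
          instances: 1
          cores: 1
          memory: 512m
        restartPolicy:
          type: Never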

Advantages:

  • Simplified management: Spark Operator automates the deployment of Spark jobs and manages their lifecycle by using declarative job configurations in Kubernetes.

  • Support for multi-tenancy: You can use the namespace and resource quota mechanisms of Kubernetes to allocate and isolate resources for each tenant, and use the Kubernetes node selection mechanism to ensure that Spark jobs use dedicated resources. See the resource quota sketch after this list.

  • Elastic resource provisioning: Elastic container instances or elastic node pools provide large amounts of elastic resources during peak hours, which helps balance performance and costs.
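
To illustrate the multi-tenancy point, the following is a minimal sketch of a standard Kubernetes ResourceQuota that caps the resources available to Spark jobs in one namespace. The namespace name and the limits are assumptions for this example; adjust them to your own tenants.

    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: spark-jobs-quota
      namespace: spark-team-a      # Hypothetical namespace dedicated to one team's Spark jobs.
    spec:
      hard:
        requests.cpu: "16"         # Total CPU that driver and executor pods in this namespace may request.
        requests.memory: 64Gi      # Total memory that pods in this namespace may request.
        limits.cpu: "32"
        limits.memory: 128Gi
        pods: "50"                 # Upper bound on concurrently running pods.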

Applicable scenarios:

  • Data analysis: Data scientists can use Spark for interactive data analysis and data cleansing.

  • Batch data computing: You can run scheduled batch jobs to process large datasets.

  • Real-time data processing: The Spark Streaming library provides the capability to process real-time data streams.

Procedure overview

This topic describes how to use Spark Operator to run and manage Spark jobs in ACK clusters to efficiently process big data.

  1. Install the ack-spark-operator component: Install Spark Operator in the ACK cluster to manage and run Spark jobs.

  2. Submit a Spark job: Create and submit the configuration file of a Spark job to run a data processing task.

  3. View the Spark job: Monitor the status of the job and obtain detailed information and logs.

  4. Access the Spark web UI: View the execution of the Spark job on the web interface.

  5. Update the Spark job: Modify the job configuration, such as the job arguments and the number of executors, based on your business requirements.

  6. Delete the Spark job: You can delete Spark jobs that are completed or no longer required to reduce costs.

Step 1: Install the ack-spark-operator component

  1. Log on to the ACK console. In the left-side navigation pane, choose Marketplace > Marketplace.

  2. On the Marketplace page, click the App Catalog tab. Find and click ack-spark-operator.

  3. On the ack-spark-operator page, click Deploy.

  4. In the Deploy panel, select a cluster and namespace, and then click Next.

  5. In the Parameters step, configure the parameters and click OK.

    The following list describes some of the parameters. You can find the parameter configurations in the Parameters section on the ack-spark-operator page.

    • controller.replicas: The number of controller replicas. Default value: 1.

    • webhook.replicas: The number of webhook replicas. Default value: 1.

    • spark.jobNamespaces: The namespaces in which Spark jobs can run. If this parameter is left empty, Spark jobs can run in all namespaces. Separate multiple namespaces with commas (,). Default value: ["default"]. Examples: [""] allows all namespaces, and ["ns1","ns2","ns3"] allows only the specified namespaces.

    • spark.serviceAccount.name: A ServiceAccount named spark-operator-spark and the corresponding role-based access control (RBAC) resources are automatically created in each namespace specified by spark.jobNamespaces. You can specify a custom ServiceAccount name and then use that name when you submit a Spark job. Default value: spark-operator-spark.
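
    These parameter names map directly onto the component's Helm-style values. The following is a minimal sketch of how the corresponding Parameters YAML might look; treat the nesting as an assumption based on the parameter names above and keep any defaults that the console already shows.

    controller:
      replicas: 1                    # controller.replicas
    webhook:
      replicas: 1                    # webhook.replicas
    spark:
      jobNamespaces:                 # spark.jobNamespaces
        - default
      serviceAccount:
        name: spark-operator-spark   # spark.serviceAccount.name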

Step 2: Submit a Spark job

You can create a SparkApplication YAML file to submit a Spark job for data processing.

  1. Create a SparkApplication YAML file named spark-pi.yaml.

    apiVersion: sparkoperator.k8s.io/v1beta2
    kind: SparkApplication
    metadata:
      name: spark-pi
      namespace: default     # Make sure that the namespace is in the namespace list specified by spark.jobNamespaces. 
    spec:
      type: Scala
      mode: cluster
      image: registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/spark:3.5.2
      imagePullPolicy: IfNotPresent
      mainClass: org.apache.spark.examples.SparkPi
      mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.5.2.jar
      arguments:
      - "1000"
      sparkVersion: 3.5.2
      driver:
        cores: 1
        coreLimit: 1200m
        memory: 512m
        serviceAccount: spark-operator-spark   # Replace spark-operator-spark with the custom name that you specified. 
      executor:
        instances: 1
        cores: 1
        coreLimit: 1200m
        memory: 512m
      restartPolicy:
        type: Never
  2. Run the following command to submit a Spark job:

    kubectl apply -f spark-pi.yaml

    Expected output:

    sparkapplication.sparkoperator.k8s.io/spark-pi created
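
    After the job is submitted, Spark Operator runs spark-submit on your behalf, which creates the driver pod; the driver then requests the executor pods. If you configured multiple namespaces in spark.jobNamespaces, you can list Spark jobs across all of them with a standard kubectl query:

    kubectl get sparkapplication --all-namespaces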

Step 3: View the Spark job

You can run the following commands to query the status, pod information, and logs of the Spark job:

  1. Run the following command to view the status of the Spark job:

    kubectl get sparkapplication spark-pi

    Expected output:

    NAME       STATUS      ATTEMPTS   START                  FINISH       AGE
    spark-pi   SUBMITTED   1          2024-06-04T03:17:11Z   <no value>   15s
  2. Run the following command to check the status of the pods that run the Spark job by filtering with the sparkoperator.k8s.io/app-name=spark-pi label:

    kubectl get pod -l sparkoperator.k8s.io/app-name=spark-pi

    Expected output:

    NAME                               READY   STATUS    RESTARTS   AGE
    spark-pi-7272428fc8f5f392-exec-1   1/1     Running   0          13s
    spark-pi-7272428fc8f5f392-exec-2   1/1     Running   0          13s
    spark-pi-driver                    1/1     Running   0          49s

    After the Spark job is completed, all executor pods are automatically deleted by the driver.

  3. Run the following command to view the details of the Spark job:

    kubectl describe sparkapplication spark-pi

    Expected output (the output varies based on the current job status):

    Name:         spark-pi
    Namespace:    default
    Labels:       <none>
    Annotations:  <none>
    API Version:  sparkoperator.k8s.io/v1beta2
    Kind:         SparkApplication
    Metadata:
      Creation Timestamp:  2024-06-04T03:16:59Z
      Generation:          1
      Resource Version:    1350200
      UID:                 1a1f9160-5dbb-XXXX-XXXX-be1c1fda4859
    Spec:
      Arguments:
        1000
      Driver:
        Core Limit:  1200m
        Cores:       1
        Memory:           512m
        Service Account:  spark-operator-spark
      Executor:
        Core Limit:  1200m
        Cores:       1
        Instances:   1
        Memory:               512m
      Image:                  registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/spark:3.5.2
      Image Pull Policy:      IfNotPresent
      Main Application File:  local:///opt/spark/examples/jars/spark-examples_2.12-3.5.2.jar
      Main Class:             org.apache.spark.examples.SparkPi
      Mode:                   cluster
      Restart Policy:
        Type:         Never
      Spark Version:  3.5.2
      Type:           Scala
    Status:
      Application State:
        State:  COMPLETED
      Driver Info:
        Pod Name:             spark-pi-driver
        Web UI Address:       172.XX.XX.92:0
        Web UI Port:          4040
        Web UI Service Name:  spark-pi-ui-svc
      Execution Attempts:     1
      Executor State:
        spark-pi-26c5XXXXX1408337-exec-1:  COMPLETED
      Last Submission Attempt Time:        2024-06-04T03:17:11Z
      Spark Application Id:                spark-0042dead12XXXXXX43675f09552a946
      Submission Attempts:                 1
      Submission ID:                       117ee161-3951-XXXX-XXXX-e7d24626c877
      Termination Time:                    2024-06-04T03:17:55Z
    Events:
      Type    Reason                     Age   From            Message
      ----    ------                     ----  ----            -------
      Normal  SparkApplicationAdded      91s   spark-operator  SparkApplication spark-pi was added, enqueuing it for submission
      Normal  SparkApplicationSubmitted  79s   spark-operator  SparkApplication spark-pi was submitted successfully
      Normal  SparkDriverRunning         61s   spark-operator  Driver spark-pi-driver is running
      Normal  SparkExecutorPending       56s   spark-operator  Executor [spark-pi-26c5XXXXX1408337-exec-1] is pending
      Normal  SparkExecutorRunning       53s   spark-operator  Executor [spark-pi-26c5XXXXX1408337-exec-1] is running
      Normal  SparkDriverCompleted       35s   spark-operator  Driver spark-pi-driver completed
      Normal  SparkApplicationCompleted  35s   spark-operator  SparkApplication spark-pi completed
      Normal  SparkExecutorCompleted     35s   spark-operator  Executor [spark-pi-26c5XXXXX1408337-exec-1] completed
  4. Run the following command to view the last 20 log entries of the driver pod:

    kubectl logs --tail=20 spark-pi-driver

    Expected output:

    24/05/30 10:05:30 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
    24/05/30 10:05:30 INFO DAGScheduler: ResultStage 0 (reduce at SparkPi.scala:38) finished in 7.942 s
    24/05/30 10:05:30 INFO DAGScheduler: Job 0 is finished. Cancelling potential speculative or zombie tasks for this job
    24/05/30 10:05:30 INFO TaskSchedulerImpl: Killing all running tasks in stage 0: Stage finished
    24/05/30 10:05:30 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 8.043996 s
    Pi is roughly 3.1419522314195225
    24/05/30 10:05:30 INFO SparkContext: SparkContext is stopping with exitCode 0.
    24/05/30 10:05:30 INFO SparkUI: Stopped Spark web UI at http://spark-pi-1e18858fc8f56b14-driver-svc.default.svc:4040
    24/05/30 10:05:30 INFO KubernetesClusterSchedulerBackend: Shutting down all executors
    24/05/30 10:05:30 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each executor to shut down
    24/05/30 10:05:30 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed.
    24/05/30 10:05:30 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
    24/05/30 10:05:30 INFO MemoryStore: MemoryStore cleared
    24/05/30 10:05:30 INFO BlockManager: BlockManager stopped
    24/05/30 10:05:30 INFO BlockManagerMaster: BlockManagerMaster stopped
    24/05/30 10:05:30 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
    24/05/30 10:05:30 INFO SparkContext: Successfully stopped SparkContext
    24/05/30 10:05:30 INFO ShutdownHookManager: Shutdown hook called
    24/05/30 10:05:30 INFO ShutdownHookManager: Deleting directory /var/data/spark-14ed60f1-82cd-4a33-b1b3-9e5d975c5b1e/spark-01120c89-5296-4c83-8a20-0799eef4e0ee
    24/05/30 10:05:30 INFO ShutdownHookManager: Deleting directory /tmp/spark-5f98ed73-576a-41be-855d-dabdcf7de189
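
    While the job is still running, you can also watch the status transitions and stream the driver logs continuously by using standard kubectl options:

    # Watch the SparkApplication status until it reaches COMPLETED or FAILED.
    kubectl get sparkapplication spark-pi -w

    # Stream the driver logs instead of viewing only the last entries.
    kubectl logs -f spark-pi-driver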

Step 4: Access the Spark web UI

Spark provides a web UI that you can use to monitor the execution of a Spark job. To access the Spark web UI, run the kubectl port-forward command to map a port in the cluster to a local port. The web UI is available only while the Spark job is running, that is, while the driver pod is in the Running state. After the Spark job is completed, the web UI becomes inaccessible.

When you deploy the ack-spark-operator component, controller.uiService.enable is set to true by default and a Service is automatically created for the web UI. You can map the port of this Service to a local port to access the web UI. If you set controller.uiService.enable to false when you deploy the component, no Service is created. In this case, you can access the web UI by mapping the port of the driver pod.
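
To confirm that the web UI Service exists before you forward a port, you can query it by the Service name shown in the driver info in Step 3:

    kubectl get service spark-pi-ui-svc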

Important

Port forwarding with the kubectl port-forward command is suitable only for testing environments and is not suitable for production environments. Exercise caution when you use this method.

  1. You can map the port of the Service or the port of the pod to a local port based on your business requirements. The following section describes the relevant commands:

    • Run the following command to access the web UI by mapping the port of the Service:

      kubectl port-forward services/spark-pi-ui-svc 4040
    • Run the following command to access the web UI by mapping the port of the pod:

      kubectl port-forward pods/spark-pi-driver 4040

      Expected output:

      Forwarding from 127.0.0.1:4040 -> 4040
      Forwarding from [::1]:4040 -> 4040
  2. Access the web UI through http://127.0.0.1:4040.

(Optional) Step 5: Update the Spark job

To modify the parameters of a Spark job, you can update the YAML file of the Spark job.

  1. Modify the YAML file named spark-pi.yaml. For example, change the arguments parameter to 10000 and the number of executor instances (executor.instances) to 2.

    apiVersion: sparkoperator.k8s.io/v1beta2
    kind: SparkApplication
    metadata:
      name: spark-pi
    spec:
      type: Scala
      mode: cluster
      image: registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/spark:3.5.2
      imagePullPolicy: IfNotPresent
      mainClass: org.apache.spark.examples.SparkPi
      mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.5.2.jar
      arguments:
      - "10000"
      sparkVersion: 3.5.2
      driver:
        cores: 1
        coreLimit: 1200m
        memory: 512m
        serviceAccount: spark-operator-spark   # Replace spark-operator-spark with the custom name that you specified. 
      executor:
        instances: 2
        cores: 1
        coreLimit: 1200m
        memory: 512m
      restartPolicy:
        type: Never
  2. Run the following command to update the Spark job:

    kubectl apply -f spark-pi.yaml
  3. Run the following command to view the status of the Spark job:

    kubectl get sparkapplication spark-pi

    The Spark job runs again. Expected output:

    NAME       STATUS    ATTEMPTS   START                  FINISH       AGE
    spark-pi   RUNNING   1          2024-06-04T03:37:34Z   <no value>   20m
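
    To confirm that two executor pods are now running, you can reuse the label selector from Step 3 (pod names vary from run to run):

    kubectl get pod -l sparkoperator.k8s.io/app-name=spark-pi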

(Optional) Step 6: Delete the Spark job

After you perform all steps in this topic, if you no longer require the Spark job, you can release the associated resources by using either of the following commands.

Run the following command to delete the Spark job created in the preceding step:

kubectl delete -f spark-pi.yaml

You can also run the following command:

kubectl delete sparkapplication spark-pi
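
To confirm that the Spark job and its pods have been removed, query the resource again. After the deletion is complete, kubectl is expected to report that the resource is not found.

kubectl get sparkapplication spark-pi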