Apache Spark is a computing engine for large-scale data processing. It is widely used for analytical workloads in big data and machine learning scenarios. Spark Operator automates the deployment of Spark jobs and manages their lifecycle in Kubernetes clusters. This topic describes how to use Spark Operator to run Spark jobs in ACK clusters so that data engineers can quickly and efficiently run and manage big data processing jobs.
Prerequisites
An ACK Pro cluster or ACK Serverless Pro cluster that runs Kubernetes 1.24 or later is created. For more information, see Create an ACK managed cluster, Create an ACK Serverless cluster, and Manually upgrade ACK clusters.
kubectl is used to connect to the ACK cluster. For more information, see Obtain the kubeconfig file of a cluster and use kubectl to connect to the cluster.
Introduction to Spark Operator
Spark Operator is designed to run Spark jobs in Kubernetes clusters and automate the lifecycle management of Spark jobs. You can submit and manage Spark jobs by using custom resources such as SparkApplication and ScheduledSparkApplication, which are defined by CustomResourceDefinitions (CRDs). Spark Operator can efficiently monitor and optimize the execution of Spark jobs by leveraging Kubernetes features such as auto scaling, health checks, and resource management. ACK provides the ack-spark-operator component based on the open source kubeflow/spark-operator project. For more information, see Spark Operator | Kubeflow.
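A SparkApplication describes a single Spark job, as shown in Step 2 of this topic. A ScheduledSparkApplication wraps the same job specification in a cron-style schedule so that the job runs periodically. The following manifest is a minimal sketch rather than an example from this topic: the name spark-pi-scheduled and the hourly schedule are example values, and the template reuses the image and ServiceAccount from the spark-pi example used later in this topic.

apiVersion: sparkoperator.k8s.io/v1beta2
kind: ScheduledSparkApplication
metadata:
  name: spark-pi-scheduled       # Example name.
  namespace: default             # Must be in the namespace list specified by spark.jobNamespaces.
spec:
  schedule: "@every 1h"          # Cron expressions such as "0 * * * *" are also supported.
  concurrencyPolicy: Forbid      # Do not start a new run while the previous run is still in progress.
  template:                      # Same structure as the spec of a SparkApplication.
    type: Scala
    mode: cluster
    image: registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/spark:3.5.2
    mainClass: org.apache.spark.examples.SparkPi
    mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.5.2.jar
    sparkVersion: 3.5.2
    driver:
      cores: 1
      memory: 512m
      serviceAccount: spark-operator-spark
    executor:
      instances: 1
      cores: 1
      memory: 512m
    restartPolicy:
      type: Never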
Advantages:
Simplified management: Spark Operator automates the deployment of Spark jobs and manages their lifecycle by using declarative job configurations in Kubernetes.
Support for multi-tenancy: You can use the namespace and resource quota mechanisms of Kubernetes to allocate and isolate resources for each tenant. You can also use the Kubernetes node selection mechanism to ensure that Spark jobs run on dedicated resources. A sketch of this setup follows the Applicable scenarios list below.
Elastic resource provisioning: Elastic container instances or elastic node pools are used to provide a large amount of elastic resources during peak hours to balance performance and costs.
Applicable scenarios:
Data analysis: Data scientists can use Spark for interactive data analysis and data cleansing.
Batch data computing: You can run scheduled batch jobs to process large datasets.
Real-time data processing: The Spark Streaming library provides the capability to process real-time data streams.
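To illustrate the multi-tenancy advantage described above, the following manifest is a minimal sketch of a dedicated tenant namespace with a Kubernetes ResourceQuota that caps the total CPU, memory, and pod count that Spark drivers and executors in the namespace can consume. The namespace name spark-team-a and the quota values are examples only; for Spark jobs to run in the namespace, it must also be added to spark.jobNamespaces as described in Step 1.

apiVersion: v1
kind: Namespace
metadata:
  name: spark-team-a              # Example tenant namespace.
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: spark-team-a-quota
  namespace: spark-team-a
spec:
  hard:
    requests.cpu: "16"            # Total CPU that pods in the namespace can request.
    requests.memory: 64Gi         # Total memory that pods in the namespace can request.
    limits.cpu: "32"
    limits.memory: 128Gi
    pods: "50"                    # Upper bound on concurrent driver and executor pods.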
Procedure overview
This topic describes how to use Spark Operator to run and manage Spark jobs in ACK clusters to efficiently process big data.
Install the ack-spark-operator component: Install Spark Operator in the ACK cluster to manage and run Spark jobs.
Submit a Spark job: Create and submit the configuration file of a Spark job to run a data processing task.
View the Spark job: Monitor the status of the job and obtain detailed information and logs.
Access the Spark web UI: View the execution of the Spark job on the web interface.
Update the Spark job: Modify the job configuration based on your business requirements; the job reruns with the updated parameters.
Delete the Spark job: Delete Spark jobs that are completed or no longer required to reduce costs.
Step 1: Install the ack-spark-operator component
Log on to the ACK console. In the left-side navigation pane, choose Marketplace > Marketplace.
On the Marketplace page, click the App Catalog tab. Find and click ack-spark-operator.
On the ack-spark-operator page, click Deploy.
In the Deploy panel, select a cluster and namespace, and then click Next.
In the Parameters step, configure the parameters and click OK.
The following list describes commonly used parameters. You can find all parameter configurations in the Parameters section on the ack-spark-operator page.
controller.replicas
The number of controller replicas. Default value: 1.

webhook.replicas
The number of webhook replicas. Default value: 1.

spark.jobNamespaces
The namespaces in which Spark jobs are allowed to run. If this parameter is left empty, Spark jobs can run in all namespaces. Separate multiple namespaces with commas (,). Default value: ["default"]. Examples: [""] allows all namespaces; ["ns1","ns2","ns3"] limits Spark jobs to the specified namespaces.

spark.serviceAccount.name
When the component is installed, a ServiceAccount named spark-operator-spark and the corresponding role-based access control (RBAC) resources are automatically created in each namespace specified by spark.jobNamespaces. You can specify a custom name for the ServiceAccount and then use that name when you submit a Spark job. Default value: spark-operator-spark.
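If you want to keep these settings under version control, the console parameters map to Helm-style values of the underlying chart. The following values file is a sketch that keeps the default replica counts, allows jobs in two namespaces, and keeps the default ServiceAccount name; the spark-team-a namespace is an example only.

controller:
  replicas: 1                     # Number of controller replicas.
webhook:
  replicas: 1                     # Number of webhook replicas.
spark:
  jobNamespaces:                  # Namespaces that are allowed to run Spark jobs.
    - default
    - spark-team-a                # Example namespace; it must exist in the cluster.
  serviceAccount:
    name: spark-operator-spark    # ServiceAccount created in each job namespace.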
Step 2: Submit a Spark job
You can create a SparkApplication YAML file to submit a Spark job for data processing.
Create a SparkApplication YAML file named spark-pi.yaml:

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: default                        # Make sure that the namespace is in the namespace list specified by spark.jobNamespaces.
spec:
  type: Scala
  mode: cluster
  image: registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/spark:3.5.2
  imagePullPolicy: IfNotPresent
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.5.2.jar
  arguments:
  - "1000"
  sparkVersion: 3.5.2
  driver:
    cores: 1
    coreLimit: 1200m
    memory: 512m
    serviceAccount: spark-operator-spark    # Replace spark-operator-spark with the custom name if you specified one.
  executor:
    instances: 1
    cores: 1
    coreLimit: 1200m
    memory: 512m
  restartPolicy:
    type: Never
Run the following command to submit a Spark job:
kubectl apply -f spark-pi.yaml
Expected output:
sparkapplication.sparkoperator.k8s.io/spark-pi created
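If the job stays in the SUBMITTED or FAILED state, the events recorded for the SparkApplication object usually reveal the cause, such as an inaccessible image or a missing ServiceAccount. The following command is a sketch: the field selector only filters the events in the default namespace down to the spark-pi object.

kubectl get events -n default \
  --field-selector involvedObject.kind=SparkApplication,involvedObject.name=spark-pi \
  --sort-by=.lastTimestamp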
Step 3: View the Spark job
You can run the following commands to query the status, pod information, and logs of the Spark job.
Run the following command to view the status of the Spark job:
kubectl get sparkapplication spark-pi
Expected output:
NAME       STATUS      ATTEMPTS   START                  FINISH       AGE
spark-pi   SUBMITTED   1          2024-06-04T03:17:11Z   <no value>   15s
Run the following command to query the pods that run the Spark job by filtering on the sparkoperator.k8s.io/app-name=spark-pi label:

kubectl get pod -l sparkoperator.k8s.io/app-name=spark-pi
Expected output:
NAME                               READY   STATUS    RESTARTS   AGE
spark-pi-7272428fc8f5f392-exec-1   1/1     Running   0          13s
spark-pi-7272428fc8f5f392-exec-2   1/1     Running   0          13s
spark-pi-driver                    1/1     Running   0          49s
After the Spark job is completed, all executor pods are automatically deleted by the driver.
Run the following command to view the details of the Spark job:
kubectl describe sparkapplication spark-pi
Run the following command to view the last 20 log entries of the driver pod:
kubectl logs --tail=20 spark-pi-driver
Expected output:
24/05/30 10:05:30 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
24/05/30 10:05:30 INFO DAGScheduler: ResultStage 0 (reduce at SparkPi.scala:38) finished in 7.942 s
24/05/30 10:05:30 INFO DAGScheduler: Job 0 is finished. Cancelling potential speculative or zombie tasks for this job
24/05/30 10:05:30 INFO TaskSchedulerImpl: Killing all running tasks in stage 0: Stage finished
24/05/30 10:05:30 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 8.043996 s
Pi is roughly 3.1419522314195225
24/05/30 10:05:30 INFO SparkContext: SparkContext is stopping with exitCode 0.
24/05/30 10:05:30 INFO SparkUI: Stopped Spark web UI at http://spark-pi-1e18858fc8f56b14-driver-svc.default.svc:4040
24/05/30 10:05:30 INFO KubernetesClusterSchedulerBackend: Shutting down all executors
24/05/30 10:05:30 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each executor to shut down
24/05/30 10:05:30 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed.
24/05/30 10:05:30 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
24/05/30 10:05:30 INFO MemoryStore: MemoryStore cleared
24/05/30 10:05:30 INFO BlockManager: BlockManager stopped
24/05/30 10:05:30 INFO BlockManagerMaster: BlockManagerMaster stopped
24/05/30 10:05:30 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
24/05/30 10:05:30 INFO SparkContext: Successfully stopped SparkContext
24/05/30 10:05:30 INFO ShutdownHookManager: Shutdown hook called
24/05/30 10:05:30 INFO ShutdownHookManager: Deleting directory /var/data/spark-14ed60f1-82cd-4a33-b1b3-9e5d975c5b1e/spark-01120c89-5296-4c83-8a20-0799eef4e0ee
24/05/30 10:05:30 INFO ShutdownHookManager: Deleting directory /tmp/spark-5f98ed73-576a-41be-855d-dabdcf7de189
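In scripts or pipelines, it can be more convenient to read the job state directly from the SparkApplication status than to parse the table output. The following command is a sketch based on the status fields populated by the open source operator; the exact field layout may vary slightly between operator versions.

# Print the current state of the job, for example SUBMITTED, RUNNING, COMPLETED, or FAILED.
kubectl get sparkapplication spark-pi -o jsonpath='{.status.applicationState.state}{"\n"}'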
Step 4: Access the Spark web UI
Spark provides a web UI to monitor the execution of Spark jobs. You can run the kubectl port-forward command to map a port in the cluster to a local port and access the Spark web UI. The web UI is available only while the Spark job is running and the driver pod is in the Running state. After the Spark job is completed, the web UI becomes inaccessible.
When you deploy the ack-spark-operator component, controller.uiService.enable is set to true by default and a Service is automatically created for the web UI. You can map the port of this Service to a local port to access the web UI. If you set controller.uiService.enable to false when you deploy the component, no Service is created. In this case, you can access the web UI by mapping the port of the driver pod.
Port forwarding by using the kubectl port-forward command is suitable only for testing environments and is not suitable for production environments. Exercise caution when you use this method.
You can map the port of the Service or the port of the pod to a local port based on your business requirements. The following section describes the relevant commands:
Run the following command to access the web UI by mapping the port of the Service:
kubectl port-forward services/spark-pi-ui-svc 4040
Run the following command to access the web UI by mapping the port of the pod:
kubectl port-forward pods/spark-pi-driver 4040
Expected output:
Forwarding from 127.0.0.1:4040 -> 4040
Forwarding from [::1]:4040 -> 4040
Access the web UI through http://127.0.0.1:4040.
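If local port 4040 is already in use, you can map the web UI to a different local port. The following command is a sketch in which 8080 is an arbitrary example port and spark-pi-ui-svc is the Service created for the spark-pi job shown above.

# Forward local port 8080 to port 4040 of the web UI Service, then open http://127.0.0.1:8080.
kubectl port-forward services/spark-pi-ui-svc 8080:4040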
(Optional) Step 5: Update the Spark job
To modify the parameters of a Spark job, you can update the YAML file of the Spark job.
Modify the YAML file named spark-pi.yaml. For example, change the arguments parameter to 10000 and the number of executor instances to 2:

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
spec:
  type: Scala
  mode: cluster
  image: registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/spark:3.5.2
  imagePullPolicy: IfNotPresent
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.5.2.jar
  arguments:
  - "10000"
  sparkVersion: 3.5.2
  driver:
    cores: 1
    coreLimit: 1200m
    memory: 512m
    serviceAccount: spark-operator-spark    # Replace spark-operator-spark with the custom name if you specified one.
  executor:
    instances: 2
    cores: 1
    coreLimit: 1200m
    memory: 512m
  restartPolicy:
    type: Never
Run the following command to update the Spark job:
kubectl apply -f spark-pi.yaml
Run the following command to view the status of the Spark job:
kubectl get sparkapplication spark-pi
The Spark job runs again. Expected output:
NAME       STATUS    ATTEMPTS   START                  FINISH       AGE
spark-pi   RUNNING   1          2024-06-04T03:37:34Z   <no value>   20m
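To follow the state transitions of the rerun without repeatedly running the command, you can watch the resource. The following command is a sketch that uses the standard kubectl watch flag:

# Watch the SparkApplication until it reaches the COMPLETED or FAILED state. Press Ctrl+C to stop watching.
kubectl get sparkapplication spark-pi --watch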
(Optional) Step 6: Delete the Spark job
After you complete the steps in this topic, if you no longer require the Spark job, you can release the associated resources by using one of the following commands.
Run the following command to delete the Spark job created in the preceding step:
kubectl delete -f spark-pi.yaml
You can also run the following command:
kubectl delete sparkapplication spark-pi
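After the deletion, you can verify that the SparkApplication and its pods have been removed. The following commands are a sketch; both should return no resources once the cleanup is complete.

# The SparkApplication object should no longer exist.
kubectl get sparkapplication spark-pi

# The driver and executor pods labeled with the job name should also be removed.
kubectl get pod -l sparkoperator.k8s.io/app-name=spark-pi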