Apache Spark is a computing engine for large-scale data processing. It is widely used for analytical workloads in big data and machine learning scenarios. Spark Operator automates the deployment of Spark jobs and manages their lifecycle in Kubernetes clusters. This topic describes how to use Spark Operator to run Spark jobs in ACK clusters so that data engineers can quickly and efficiently run and manage big data processing jobs.
Prerequisites
An ACK Pro cluster or ACK Serverless Pro cluster that runs Kubernetes 1.24 or later is created. For more information, see Create an ACK managed cluster, Create an ACK Serverless cluster, and Manually upgrade ACK clusters.
kubectl is used to connect to the ACK cluster. For more information, see Obtain the kubeconfig file of a cluster and use kubectl to connect to the cluster.
Introduction to Spark Operator
Spark Operator is designed to run Spark jobs in Kubernetes clusters and automate the lifecycle management of Spark jobs. You can submit and manage Spark jobs by using custom resources such as SparkApplication and ScheduledSparkApplication, which are defined by CustomResourceDefinitions (CRDs). Spark Operator can efficiently monitor and optimize the execution of Spark jobs by leveraging Kubernetes features such as auto scaling, health checks, and resource management. ACK provides the ack-spark-operator component based on the open source kubeflow/spark-operator project. For more information, see Spark Operator | Kubeflow.
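A SparkApplication describes a single Spark job, as shown in Step 2 of this topic. A ScheduledSparkApplication wraps the same job specification in a cron-style schedule so that the job runs periodically. The following manifest is a minimal sketch rather than an example from this topic: the name spark-pi-scheduled and the hourly schedule are example values, and the template reuses the image and ServiceAccount from the spark-pi example used later in this topic.

apiVersion: sparkoperator.k8s.io/v1beta2
kind: ScheduledSparkApplication
metadata:
  name: spark-pi-scheduled       # Example name.
  namespace: default             # Must be in the namespace list specified by spark.jobNamespaces.
spec:
  schedule: "@every 1h"          # Cron expressions such as "0 * * * *" are also supported.
  concurrencyPolicy: Forbid      # Do not start a new run while the previous run is still in progress.
  template:                      # Same structure as the spec of a SparkApplication.
    type: Scala
    mode: cluster
    image: registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/spark:3.5.2
    mainClass: org.apache.spark.examples.SparkPi
    mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.5.2.jar
    sparkVersion: 3.5.2
    driver:
      cores: 1
      memory: 512m
      serviceAccount: spark-operator-spark
    executor:
      instances: 1
      cores: 1
      memory: 512m
    restartPolicy:
      type: Never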
Advantages:
Simplified management: Spark Operator automates the deployment of Spark jobs and manages their lifecycle by using declarative job configurations in Kubernetes.
Support for multi-tenancy: You can use the namespace and resource quota mechanisms of Kubernetes to allocate and isolate resources for each tenant. You can also use the Kubernetes node selection mechanism to ensure that Spark jobs run on dedicated resources. A sketch of this setup follows the Applicable scenarios list below.
Elastic resource provisioning: Elastic container instances or elastic node pools are used to provide a large amount of elastic resources during peak hours to balance performance and costs.
Applicable scenarios:
Data analysis: Data scientists can use Spark for interactive data analysis and data cleansing.
Batch data computing: You can run scheduled batch jobs to process large datasets.
Real-time data processing: The Spark Streaming library provides the capability to process real-time data streams.
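To illustrate the multi-tenancy advantage described above, the following manifest is a minimal sketch of a dedicated tenant namespace with a Kubernetes ResourceQuota that caps the total CPU, memory, and pod count that Spark drivers and executors in the namespace can consume. The namespace name spark-team-a and the quota values are examples only; for Spark jobs to run in the namespace, it must also be added to spark.jobNamespaces as described in Step 1.

apiVersion: v1
kind: Namespace
metadata:
  name: spark-team-a              # Example tenant namespace.
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: spark-team-a-quota
  namespace: spark-team-a
spec:
  hard:
    requests.cpu: "16"            # Total CPU that pods in the namespace can request.
    requests.memory: 64Gi         # Total memory that pods in the namespace can request.
    limits.cpu: "32"
    limits.memory: 128Gi
    pods: "50"                    # Upper bound on concurrent driver and executor pods.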
Procedure overview
This topic describes how to use Spark Operator to run and manage Spark jobs in ACK clusters to efficiently process big data.
Install the ack-spark-operator component: Install Spark Operator in the ACK cluster to manage and run Spark jobs.
Submit a Spark job: Create and submit the configuration file of a Spark job to run a data processing task.
View the Spark job: Monitor the status of the job and obtain detailed information and logs.
Access the Spark web UI: View the execution of the Spark job on the web interface.
Update the Spark job: Modify the job configuration based on your business requirements; the job reruns with the updated parameters.
Delete the Spark job: Delete Spark jobs that are completed or no longer required to reduce costs.
Step 1: Install the ack-spark-operator component
Log on to the ACK console. In the left-side navigation pane, choose Marketplace > Marketplace.
On the Marketplace page, click the App Catalog tab. Find and click ack-spark-operator.
On the ack-spark-operator page, click Deploy.
In the Deploy panel, select a cluster and namespace, and then click Next.
In the Parameters step, configure the parameters and click OK.
The following list describes commonly used parameters. You can find all parameter configurations in the Parameters section on the ack-spark-operator page.
controller.replicas
The number of controller replicas. Default value: 1.

webhook.replicas
The number of webhook replicas. Default value: 1.

spark.jobNamespaces
The namespaces in which Spark jobs are allowed to run. If this parameter is left empty, Spark jobs can run in all namespaces. Separate multiple namespaces with commas (,). Default value: ["default"]. Examples: [""] allows all namespaces; ["ns1","ns2","ns3"] limits Spark jobs to the specified namespaces.

spark.serviceAccount.name
When the component is installed, a ServiceAccount named spark-operator-spark and the corresponding role-based access control (RBAC) resources are automatically created in each namespace specified by spark.jobNamespaces. You can specify a custom name for the ServiceAccount and then use that name when you submit a Spark job. Default value: spark-operator-spark.
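If you want to keep these settings under version control, the console parameters map to Helm-style values of the underlying chart. The following values file is a sketch that keeps the default replica counts, allows jobs in two namespaces, and keeps the default ServiceAccount name; the spark-team-a namespace is an example only.

controller:
  replicas: 1                     # Number of controller replicas.
webhook:
  replicas: 1                     # Number of webhook replicas.
spark:
  jobNamespaces:                  # Namespaces that are allowed to run Spark jobs.
    - default
    - spark-team-a                # Example namespace; it must exist in the cluster.
  serviceAccount:
    name: spark-operator-spark    # ServiceAccount created in each job namespace.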
Step 2: Submit a Spark job
You can create a SparkApplication YAML file to submit a Spark job for data processing.
Create a SparkApplication YAML file named spark-pi.yaml:

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: default                        # Make sure that the namespace is in the namespace list specified by spark.jobNamespaces.
spec:
  type: Scala
  mode: cluster
  image: registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/spark:3.5.2
  imagePullPolicy: IfNotPresent
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.5.2.jar
  arguments:
  - "1000"
  sparkVersion: 3.5.2
  driver:
    cores: 1
    coreLimit: 1200m
    memory: 512m
    serviceAccount: spark-operator-spark    # Replace spark-operator-spark with the custom name if you specified one.
  executor:
    instances: 1
    cores: 1
    coreLimit: 1200m
    memory: 512m
  restartPolicy:
    type: Never
Run the following command to submit a Spark job:
kubectl apply -f spark-pi.yaml
Expected output:
sparkapplication.sparkoperator.k8s.io/spark-pi created
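If the job stays in the SUBMITTED or FAILED state, the events recorded for the SparkApplication object usually reveal the cause, such as an inaccessible image or a missing ServiceAccount. The following command is a sketch: the field selector only filters the events in the default namespace down to the spark-pi object.

kubectl get events -n default \
  --field-selector involvedObject.kind=SparkApplication,involvedObject.name=spark-pi \
  --sort-by=.lastTimestamp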
Step 3: View the Spark job
You can run the following commands to query the status, pod information, and logs of the Spark job.
Run the following command to view the status of the Spark job:
kubectl get sparkapplication spark-pi
Expected output:
NAME       STATUS      ATTEMPTS   START                  FINISH       AGE
spark-pi   SUBMITTED   1          2024-06-04T03:17:11Z   <no value>   15s
Run the following command to query the pods that run the Spark job by filtering on the sparkoperator.k8s.io/app-name=spark-pi label:

kubectl get pod -l sparkoperator.k8s.io/app-name=spark-pi
Expected output:
NAME                               READY   STATUS    RESTARTS   AGE
spark-pi-7272428fc8f5f392-exec-1   1/1     Running   0          13s
spark-pi-7272428fc8f5f392-exec-2   1/1     Running   0          13s
spark-pi-driver                    1/1     Running   0          49s
After the Spark job is completed, all executor pods are automatically deleted by the driver.
Run the following command to view the details of the Spark job:
kubectl describe sparkapplication spark-pi
Run the following command to view the last 20 log entries of the driver pod:
kubectl logs --tail=20 spark-pi-driver
Expected output:
24/05/30 10:05:30 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
24/05/30 10:05:30 INFO DAGScheduler: ResultStage 0 (reduce at SparkPi.scala:38) finished in 7.942 s
24/05/30 10:05:30 INFO DAGScheduler: Job 0 is finished. Cancelling potential speculative or zombie tasks for this job
24/05/30 10:05:30 INFO TaskSchedulerImpl: Killing all running tasks in stage 0: Stage finished
24/05/30 10:05:30 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 8.043996 s
Pi is roughly 3.1419522314195225
24/05/30 10:05:30 INFO SparkContext: SparkContext is stopping with exitCode 0.
24/05/30 10:05:30 INFO SparkUI: Stopped Spark web UI at http://spark-pi-1e18858fc8f56b14-driver-svc.default.svc:4040
24/05/30 10:05:30 INFO KubernetesClusterSchedulerBackend: Shutting down all executors
24/05/30 10:05:30 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each executor to shut down
24/05/30 10:05:30 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed.
24/05/30 10:05:30 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
24/05/30 10:05:30 INFO MemoryStore: MemoryStore cleared
24/05/30 10:05:30 INFO BlockManager: BlockManager stopped
24/05/30 10:05:30 INFO BlockManagerMaster: BlockManagerMaster stopped
24/05/30 10:05:30 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
24/05/30 10:05:30 INFO SparkContext: Successfully stopped SparkContext
24/05/30 10:05:30 INFO ShutdownHookManager: Shutdown hook called
24/05/30 10:05:30 INFO ShutdownHookManager: Deleting directory /var/data/spark-14ed60f1-82cd-4a33-b1b3-9e5d975c5b1e/spark-01120c89-5296-4c83-8a20-0799eef4e0ee
24/05/30 10:05:30 INFO ShutdownHookManager: Deleting directory /tmp/spark-5f98ed73-576a-41be-855d-dabdcf7de189
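In scripts or pipelines, it can be more convenient to read the job state directly from the SparkApplication status than to parse the table output. The following command is a sketch based on the status fields populated by the open source operator; the exact field layout may vary slightly between operator versions.

# Print the current state of the job, for example SUBMITTED, RUNNING, COMPLETED, or FAILED.
kubectl get sparkapplication spark-pi -o jsonpath='{.status.applicationState.state}{"\n"}'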
Step 4: Access the Spark web UI
Spark provides a web UI to monitor the execution of Spark jobs. You can run the kubectl port-forward command to map a port in the cluster to a local port and access the Spark web UI. The web UI is available only while the Spark job is running and the driver pod is in the Running state. After the Spark job is completed, the web UI becomes inaccessible.
When you deploy the ack-spark-operator component, controller.uiService.enable is set to true by default and a Service is automatically created for the web UI. You can map the port of this Service to a local port to access the web UI. If you set controller.uiService.enable to false when you deploy the component, no Service is created. In this case, you can access the web UI by mapping the port of the driver pod.
Port forwarding by using the kubectl port-forward command is suitable only for testing environments and is not suitable for production environments. Exercise caution when you use this method.
You can map the port of the Service or the port of the pod to a local port based on your business requirements. The following section describes the relevant commands:
Run the following command to access the web UI by mapping the port of the Service:
kubectl port-forward services/spark-pi-ui-svc 4040
Run the following command to access the web UI by mapping the port of the pod:
kubectl port-forward pods/spark-pi-driver 4040
Expected output:
Forwarding from 127.0.0.1:4040 -> 4040
Forwarding from [::1]:4040 -> 4040
Access the web UI through http://127.0.0.1:4040.
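If local port 4040 is already in use, you can map the web UI to a different local port. The following command is a sketch in which 8080 is an arbitrary example port and spark-pi-ui-svc is the Service created for the spark-pi job shown above.

# Forward local port 8080 to port 4040 of the web UI Service, then open http://127.0.0.1:8080.
kubectl port-forward services/spark-pi-ui-svc 8080:4040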
(Optional) Step 5: Update the Spark job
To modify the parameters of a Spark job, you can update the YAML file of the Spark job.
Modify the YAML file named spark-pi.yaml. For example, change the arguments parameter to 10000 and the number of executor instances to 2:

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
spec:
  type: Scala
  mode: cluster
  image: registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/spark:3.5.2
  imagePullPolicy: IfNotPresent
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.5.2.jar
  arguments:
  - "10000"
  sparkVersion: 3.5.2
  driver:
    cores: 1
    coreLimit: 1200m
    memory: 512m
    serviceAccount: spark-operator-spark    # Replace spark-operator-spark with the custom name if you specified one.
  executor:
    instances: 2
    cores: 1
    coreLimit: 1200m
    memory: 512m
  restartPolicy:
    type: Never
Run the following command to update the Spark job:
kubectl apply -f spark-pi.yaml
Run the following command to view the status of the Spark job:
kubectl get sparkapplication spark-pi
The Spark job runs again. Expected output:
NAME       STATUS    ATTEMPTS   START                  FINISH       AGE
spark-pi   RUNNING   1          2024-06-04T03:37:34Z   <no value>   20m
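To follow the state transitions of the rerun without repeatedly running the command, you can watch the resource. The following command is a sketch that uses the standard kubectl watch flag:

# Watch the SparkApplication until it reaches the COMPLETED or FAILED state. Press Ctrl+C to stop watching.
kubectl get sparkapplication spark-pi --watch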
(Optional) Step 6: Delete the Spark job
After you complete the steps in this topic, if you no longer require the Spark job, you can release the associated resources by using one of the following commands.
Run the following command to delete the Spark job created in the preceding step:
kubectl delete -f spark-pi.yaml
You can also run the following command:
kubectl delete sparkapplication spark-pi
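After the deletion, you can verify that the SparkApplication and its pods have been removed. The following commands are a sketch; both should return no resources once the cleanup is complete.

# The SparkApplication object should no longer exist.
kubectl get sparkapplication spark-pi

# The driver and executor pods labeled with the job name should also be removed.
kubectl get pod -l sparkoperator.k8s.io/app-name=spark-pi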