
Elastic Container Instance: Use ECI to run Spark jobs

Updated: Jul 06, 2024

Running Spark jobs on ECI in a Kubernetes cluster provides auto scaling, automated deployment, high availability, and other benefits, which improves the efficiency and stability of Spark jobs. This topic describes how to install Spark Operator in an ACK Serverless cluster and use ECI to run Spark jobs.

Background information

Apache Spark is an open source project widely used for data analytics and is commonly applied to well-known big data and machine learning workloads. Starting from Apache Spark 2.3.0, you can run and manage Spark resources on Kubernetes.

Spark Operator is an operator designed specifically for Spark on Kubernetes. It allows developers to submit Spark jobs to a Kubernetes cluster by using CRDs. Spark Operator provides the following benefits:

  • It compensates for native Spark's limited support for Kubernetes.

  • It integrates quickly with storage, monitoring, logging, and other components in the Kubernetes ecosystem.

  • It supports advanced Kubernetes features such as failure recovery, auto scaling, and scheduling optimization.

Preparations

  1. Create an ACK Serverless cluster.

    Create an ACK Serverless cluster in the Container Service console. For more information, see Create an ACK Serverless cluster.

    Important

    If you need to pull images over the Internet, or if your training jobs need to access the Internet, configure an Internet NAT Gateway.

    You can manage and access the ACK Serverless cluster by using kubectl.

  2. Create an OSS bucket.

    You need to create an OSS bucket to store test data, test results, and logs generated during testing. For more information about how to create a bucket, see Create a bucket.

Install Spark Operator

  1. Install Spark Operator.

    1. In the left-side navigation pane of the Container Service console, choose Marketplace > App Marketplace.

    2. On the App Catalog tab, find and click ack-spark-operator.

    3. Click Deploy in the upper-right corner.

    4. In the panel that appears, select the target cluster and complete the configuration as prompted.

  2. Create a ServiceAccount, Role, and RoleBinding.

    Spark jobs need a ServiceAccount to obtain the permissions to create Pods, so you must create a ServiceAccount, a Role, and a RoleBinding. A sample YAML file is shown below. Modify the namespace of the three resources as needed.

    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: spark
      namespace: default
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: Role
    metadata:
      namespace: default
      name: spark-role
    rules:
    - apiGroups: [""]
      resources: ["pods"]
      verbs: ["*"]
    - apiGroups: [""]
      resources: ["services"]
      verbs: ["*"]
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: spark-role-binding
      namespace: default
    subjects:
    - kind: ServiceAccount
      name: spark
      namespace: default
    roleRef:
      kind: Role
      name: spark-role
      apiGroup: rbac.authorization.k8s.io

Build a Spark job image

You need to compile the JAR package of the Spark job and package the image with a Dockerfile.

The following example uses the Spark base image of Alibaba Cloud Container Service. Set the Dockerfile content as follows:

FROM registry.aliyuncs.com/acs/spark:ack-2.4.5-latest
RUN mkdir -p /opt/spark/jars
# If you need to use OSS (to read data from OSS or to write event logs to OSS), add the following JAR packages to the image
ADD https://repo1.maven.org/maven2/com/aliyun/odps/hadoop-fs-oss/3.3.8-public/hadoop-fs-oss-3.3.8-public.jar $SPARK_HOME/jars
ADD https://repo1.maven.org/maven2/com/aliyun/oss/aliyun-sdk-oss/3.8.1/aliyun-sdk-oss-3.8.1.jar $SPARK_HOME/jars
ADD https://repo1.maven.org/maven2/org/aspectj/aspectjweaver/1.9.5/aspectjweaver-1.9.5.jar $SPARK_HOME/jars
ADD https://repo1.maven.org/maven2/org/jdom/jdom/1.1.3/jdom-1.1.3.jar $SPARK_HOME/jars
COPY SparkExampleScala-assembly-0.1.jar /opt/spark/jars
Important

If the Spark image is large, pulling it takes a long time. You can use an ImageCache to accelerate image pulling. For more information, see Manage ImageCache and Use ImageCache to accelerate Pod creation.
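If you prefer to create the cache ahead of time, rather than relying on the k8s.aliyun.com/eci-auto-imc annotation used in the job template in this topic, an ImageCache resource could look roughly like the following. This is a hypothetical sketch: the resource name is a placeholder, and the exact CRD schema should be verified against the ImageCache documentation.

```yaml
# Hypothetical example; verify the apiVersion and schema against the ImageCache documentation.
apiVersion: eci.aliyun.com/v1
kind: ImageCache
metadata:
  name: spark-image-cache        # placeholder name
spec:
  images:
  # Cache the Spark base image used by the jobs in this topic
  - registry.aliyuncs.com/acs/spark:ack-2.4.5-latest
```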

You can also use the Alibaba Cloud Spark base image. Alibaba Cloud provides a Spark 2.4.5 base image that is optimized for Kubernetes scenarios (scheduling and elasticity) and greatly improves scheduling and startup speed. You can enable the optimization by setting the Helm chart variable enableAlibabaCloudFeatureGates: true. If you want even faster startup, you can also set enableWebhook: false.
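For example, when you install the ack-spark-operator chart, these two switches could be set in a custom values file. This is only a sketch: the flag names come from the chart variables mentioned above, while the file name is arbitrary.

```yaml
# values-custom.yaml (hypothetical file name) for the ack-spark-operator chart

# Enable Alibaba Cloud optimizations for scheduling and elasticity
enableAlibabaCloudFeatureGates: true

# Disable the webhook for faster startup; note that webhook-based Pod
# mutation features become unavailable when this is set to false
enableWebhook: false
```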

Write a job template and submit the job

Create a YAML configuration file for the Spark job and deploy it.

  1. Create a file named spark-pi.yaml.

    A typical job template is shown below. For more information, see spark-on-k8s-operator.

    apiVersion: "sparkoperator.k8s.io/v1beta2"
    kind: SparkApplication
    metadata:
      name: spark-pi
      namespace: default
    spec:
      type: Scala
      mode: cluster
      image: "registry.aliyuncs.com/acs/spark:ack-2.4.5-latest"
      imagePullPolicy: Always
      mainClass: org.apache.spark.examples.SparkPi
      mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.5.jar"
      sparkVersion: "2.4.5"
      restartPolicy:
        type: Never
      driver:
        cores: 2
        coreLimit: "2"
        memory: "3g"
        memoryOverhead: "1g"
        labels:
          version: 2.4.5
        serviceAccount: spark
        annotations:
          k8s.aliyun.com/eci-kube-proxy-enabled: 'true'
          k8s.aliyun.com/eci-auto-imc: "true"
        tolerations:
        - key: "virtual-kubelet.io/provider"
          operator: "Exists"
      executor:
        cores: 2
        instances: 1
        memory: "3g"
        memoryOverhead: "1g"
        labels:
          version: 2.4.5
        annotations:
          k8s.aliyun.com/eci-kube-proxy-enabled: 'true'
          k8s.aliyun.com/eci-auto-imc: "true"
        tolerations:
        - key: "virtual-kubelet.io/provider"
          operator: "Exists"
  2. Deploy the Spark job.

    kubectl apply -f spark-pi.yaml

Configure log collection

Taking the collection of Spark stdout logs as an example, you can inject environment variables into the envVars fields of the Spark driver and executor to enable automatic log collection. For more information, see Customize log collection for ECI.

envVars:
   aliyun_logs_test-stdout_project: test-k8s-spark
   aliyun_logs_test-stdout_machinegroup: k8s-group-app-spark
   aliyun_logs_test-stdout: stdout

When you submit the job, set the environment variables of the driver and the executor as shown above to enable automatic log collection.
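In the spark-pi.yaml template above, the envVars entries sit under both the driver and the executor sections. A sketch showing only the driver side:

```yaml
driver:
  envVars:
    aliyun_logs_test-stdout_project: test-k8s-spark
    aliyun_logs_test-stdout_machinegroup: k8s-group-app-spark
    aliyun_logs_test-stdout: stdout
  # ...other driver fields as in the template above; repeat the same
  # envVars under the executor section to collect executor logs as well
```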

Configure a history server

A history server is used to audit Spark jobs. You can add the sparkConf field to the SparkApplication CRD to write events to OSS, and then have the history server read them from OSS for display. A sample configuration is as follows:

sparkConf:
   "spark.eventLog.enabled": "true"
   "spark.eventLog.dir": "oss://bigdatastore/spark-events"
   "spark.hadoop.fs.oss.impl": "org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem"
   # oss bucket endpoint such as oss-cn-beijing.aliyuncs.com
   "spark.hadoop.fs.oss.endpoint": "oss-cn-beijing.aliyuncs.com"
   "spark.hadoop.fs.oss.accessKeySecret": ""
   "spark.hadoop.fs.oss.accessKeyId": ""

Alibaba Cloud also provides a spark-history-server chart. You can search for ack-spark-history-server on the Marketplace > App Marketplace page of the Container Service console and install it. During installation, configure the OSS information in the parameters. Example:

oss:
  enableOSS: true
  # Please input your accessKeyId
  alibabaCloudAccessKeyId: ""
  # Please input your accessKeySecret
  alibabaCloudAccessKeySecret: ""
  # oss bucket endpoint such as oss-cn-beijing.aliyuncs.com
  alibabaCloudOSSEndpoint: "oss-cn-beijing.aliyuncs.com"
  # oss file path such as oss://bucket-name/path
  eventsDir: "oss://bigdatastore/spark-events"

After the installation is complete, you can find the external endpoint of ack-spark-history-server among the Services on the cluster details page. Visit the endpoint to view the archived historical jobs.

View job results

  1. Check the running status of the Pods.

    kubectl get pods

    Expected output:

    NAME                            READY   STATUS    RESTARTS   AGE
    spark-pi-1547981232122-driver   1/1     Running   0          12s
    spark-pi-1547981232122-exec-1   1/1     Running   0          3s
  2. View the real-time Spark UI.

    kubectl port-forward spark-pi-1547981232122-driver 4040:4040
  3. View the status of the SparkApplication.

    kubectl describe sparkapplication spark-pi

    Expected output:

    Name:         spark-pi
    Namespace:    default
    Labels:       <none>
    Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                    {"apiVersion":"sparkoperator.k8s.io/v1alpha1","kind":"SparkApplication","metadata":{"annotations":{},"name":"spark-pi","namespace":"default"...}
    API Version:  sparkoperator.k8s.io/v1alpha1
    Kind:         SparkApplication
    Metadata:
      Creation Timestamp:  2019-01-20T10:47:08Z
      Generation:          1
      Resource Version:    4923532
      Self Link:           /apis/sparkoperator.k8s.io/v1alpha1/namespaces/default/sparkapplications/spark-pi
      UID:                 bbe7445c-1ca0-11e9-9ad4-062fd7c19a7b
    Spec:
      Deps:
      Driver:
        Core Limit:  200m
        Cores:       0.1
        Labels:
          Version:        2.4.0
        Memory:           512m
        Service Account:  spark
        Volume Mounts:
          Mount Path:  /tmp
          Name:        test-volume
      Executor:
        Cores:      1
        Instances:  1
        Labels:
          Version:  2.4.0
        Memory:     512m
        Volume Mounts:
          Mount Path:         /tmp
          Name:               test-volume
      Image:                  gcr.io/spark-operator/spark:v2.4.0
      Image Pull Policy:      Always
      Main Application File:  local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar
      Main Class:             org.apache.spark.examples.SparkPi
      Mode:                   cluster
      Restart Policy:
        Type:  Never
      Type:    Scala
      Volumes:
        Host Path:
          Path:  /tmp
          Type:  Directory
        Name:    test-volume
    Status:
      Application State:
        Error Message:
        State:          COMPLETED
      Driver Info:
        Pod Name:             spark-pi-driver
        Web UI Port:          31182
        Web UI Service Name:  spark-pi-ui-svc
      Execution Attempts:     1
      Executor State:
        Spark - Pi - 1547981232122 - Exec - 1:  COMPLETED
      Last Submission Attempt Time:             2019-01-20T10:47:14Z
      Spark Application Id:                     spark-application-1547981285779
      Submission Attempts:                      1
      Termination Time:                         2019-01-20T10:48:56Z
    Events:
      Type    Reason                     Age                 From            Message
      ----    ------                     ----                ----            -------
      Normal  SparkApplicationAdded      55m                 spark-operator  SparkApplication spark-pi was added, Enqueuing it for submission
      Normal  SparkApplicationSubmitted  55m                 spark-operator  SparkApplication spark-pi was submitted successfully
      Normal  SparkDriverPending         55m (x2 over 55m)   spark-operator  Driver spark-pi-driver is pending
      Normal  SparkExecutorPending       54m (x3 over 54m)   spark-operator  Executor spark-pi-1547981232122-exec-1 is pending
      Normal  SparkExecutorRunning       53m (x4 over 54m)   spark-operator  Executor spark-pi-1547981232122-exec-1 is running
      Normal  SparkDriverRunning         53m (x12 over 55m)  spark-operator  Driver spark-pi-driver is running
      Normal  SparkExecutorCompleted     53m (x2 over 53m)   spark-operator  Executor spark-pi-1547981232122-exec-1 completed
  4. View the log collection results.

    NAME                            READY   STATUS      RESTARTS   AGE
    spark-pi-1547981232122-driver   0/1     Completed   0          1m

    When the SparkApplication is in the Succeed state or the Spark driver Pod is in the Completed state, you can view the collected logs.

    kubectl logs spark-pi-1547981232122-driver
    Pi is roughly 3.152155760778804