Container Service for Kubernetes: Create TensorFlow jobs

Last Updated: Jul 31, 2024

You can create TensorFlow jobs on a Distributed Cloud Container Platform for Kubernetes (ACK One) Fleet instance in the same way as you create them in a single cluster. The Fleet instance dynamically schedules each TensorFlow job to an associated cluster based on the resource requests of the job and the remaining resources in the associated clusters. This topic describes how to create a TensorFlow job and query the status of the job.

Prerequisites

  • By default, the TensorFlow CustomResourceDefinition (CRD) of the training operator is created on the Fleet instance. The API version of the TensorFlow CRD is kubeflow.org/v1.

  • The administrator of the Fleet instance can run the following command to query the CRD:

    kubectl get crd tfjobs.kubeflow.org
  • To modify the CRD, the administrator can edit the kubeflow.org_tfjobs.yaml file and run the following command to apply the changes:

    kubectl apply -f manifests/base/crds/kubeflow.org_tfjobs.yaml
  • The training operator is downloaded by the administrator and installed on all the associated clusters.

  • The kubeconfig file of the Fleet instance is obtained in the Distributed Cloud Container Platform for Kubernetes (ACK One) console and a kubectl client is connected to the Fleet instance.

  • The AMC command-line tool is installed. For more information, see Use AMC.
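
The CRD prerequisite above can also be checked from a script. The following is a minimal sketch, assuming that the current kubectl context points to the Fleet instance. The function names tfjob_crd_versions and has_tfjob_v1 are hypothetical helpers introduced for illustration and are not part of kubectl or AMC.

```shell
# Sketch: check that the TFJob CRD on the Fleet instance defines version v1.
# Assumption: the current kubectl context points to the Fleet instance.

# Print the version names defined by the TFJob CRD, separated by spaces.
tfjob_crd_versions() {
  kubectl get crd tfjobs.kubeflow.org -o jsonpath='{.spec.versions[*].name}'
}

# Succeed only if "v1" is one of the versions defined by the CRD.
has_tfjob_v1() {
  local v
  for v in $(tfjob_crd_versions); do
    if [ "$v" = "v1" ]; then
      return 0
    fi
  done
  return 1
}

# Example usage (not executed here):
#   has_tfjob_v1 && echo "TFJob CRD kubeflow.org/v1 is available"
```

If has_tfjob_v1 fails, ask the administrator of the Fleet instance to check the CRD with the commands listed above.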

Procedure

  1. Developers can use the following YAML template to create TensorFlow jobs on the Fleet instance.

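    In this example, the job is named dist-mnist-for-e2e-test and is created in the demo namespace. The job runs two parameter server (PS) replicas and two Worker replicas, each of which requests 2 CPU cores and 2 GiB of memory.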

    apiVersion: "kubeflow.org/v1"
    kind: "TFJob"
    metadata:
      name: "dist-mnist-for-e2e-test"
      namespace: demo
    spec:
      tfReplicaSpecs:
        PS:
          replicas: 2
          restartPolicy: Never
          template:
            spec:
              containers:
                - name: tensorflow
                  image: acr-multiple-clusters-registry.cn-hangzhou.cr.aliyuncs.com/ack-multiple-clusters/tf-dist-mnist-test:v1.0
                  resources:
                    requests:
                      memory: "2Gi"
                      cpu: "2"
                    limits:
                      memory: "2Gi"
                      cpu: "2"
        Worker:
          replicas: 2
          restartPolicy: Never
          template:
            spec:
              containers:
                - name: tensorflow
                  image: acr-multiple-clusters-registry.cn-hangzhou.cr.aliyuncs.com/ack-multiple-clusters/tf-dist-mnist-test:v1.0
                  resources:
                    requests:
                      memory: "2Gi"
                      cpu: "2"
                    limits:
                      memory: "2Gi"
                      cpu: "2"
  2. Run the following command on the Fleet instance to query the scheduling result of the job.

    If no output is returned, the job has not been scheduled. In this case, check whether the specified namespace exists and whether you have a sufficient namespace quota. If the specified namespace does not exist or the namespace quota of your account is exhausted, the job remains in the pending state.

    kubectl get tfjob dist-mnist-for-e2e-test -n demo -o jsonpath='{.metadata.annotations.scheduling\.x-k8s\.io/placement}'
  3. Check the status of the TensorFlow job.

    • Run the following command on the Fleet instance to query the status of the job:

      kubectl get tfjob dist-mnist-for-e2e-test -n demo

      Expected output:

      NAME                      STATE     AGE
      dist-mnist-for-e2e-test   Running   ***
    • Run the following command on the Fleet instance to query the status of the pods that run the job:

      kubectl amc get pod -j tfjob/dist-mnist-for-e2e-test -n demo

      Expected output:

      Run on ManagedCluster managedcluster-c1***e5
      NAME                               READY   STATUS      RESTARTS   AGE
      dist-mnist-for-e2e-test-ps-0       1/1     Running     0          ***
      dist-mnist-for-e2e-test-ps-1       1/1     Running     0          ***
      dist-mnist-for-e2e-test-worker-0   1/1     Running     0          ***
      dist-mnist-for-e2e-test-worker-1   1/1     Running     0          ***
    • Run the following command on the Fleet instance to print the logs of a pod that is created for the job:

      kubectl amc logs dist-mnist-for-e2e-test-worker-0 -j tfjob/dist-mnist-for-e2e-test -n demo

      Expected output:

      Run on ManagedCluster managedcluster-c1***e5
      ...
      Training ends @ ***
      Training elapsed time: *** s
      ...
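
The scheduling check in step 2 can be automated by polling the placement annotation until it is set. The following is a minimal sketch, assuming that the current kubectl context points to the Fleet instance. The function names placement and wait_for_placement, the default retry count and interval, and the file name tfjob.yaml are hypothetical and chosen only for illustration.

```shell
# Sketch: wait until the Fleet instance records a placement for a TFJob.
# Assumption: the current kubectl context points to the Fleet instance.

# Print the placement annotation of a TFJob. Prints nothing if it is not set.
# Arguments: <job-name> <namespace>
placement() {
  kubectl get tfjob "$1" -n "$2" \
    -o jsonpath='{.metadata.annotations.scheduling\.x-k8s\.io/placement}'
}

# Poll the placement annotation until it is non-empty.
# Arguments: <job-name> <namespace> [attempts] [interval-seconds]
wait_for_placement() {
  local job=$1
  local ns=$2
  local attempts=${3:-30}
  local interval=${4:-5}
  local i=0
  local result
  while [ "$i" -lt "$attempts" ]; do
    result=$(placement "$job" "$ns")
    if [ -n "$result" ]; then
      printf '%s\n' "$result"
      return 0
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  echo "No placement for TFJob $job in namespace $ns." >&2
  echo "Check that the namespace exists and has sufficient quota." >&2
  return 1
}

# Example usage (not executed here). tfjob.yaml is an assumed file name
# for the YAML template in step 1:
#   kubectl apply -f tfjob.yaml
#   wait_for_placement dist-mnist-for-e2e-test demo &&
#     kubectl get tfjob dist-mnist-for-e2e-test -n demo
```

Polling with a bounded number of attempts keeps the script from waiting indefinitely when the job stays in the pending state.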