You can create TensorFlow jobs on a Distributed Cloud Container Platform for Kubernetes (ACK One) Fleet instance in the same way you create TensorFlow jobs in a cluster. The Fleet instance will dynamically schedule TensorFlow jobs to associated clusters based on the resource requests of the TensorFlow jobs and the remaining resources in the associated clusters. This topic describes how to create a TensorFlow job and query the status of the job.
Prerequisites
By default, the TensorFlow CustomResourceDefinition (CRD) of the training operator is created on the Fleet instance. The API version of the TensorFlow CRD is kubeflow.org/v1.
The administrator of the Fleet instance can run the following command to query the CRD:
kubectl get crd tfjobs.kubeflow.org
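Expected output (a hedged sample; the creation timestamp varies by environment and is omitted here):
NAME                  CREATED AT
tfjobs.kubeflow.org   ***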
To modify the CRD, the administrator can edit the kubeflow.org_tfjobs.yaml file and then run the following command to apply the changes:
kubectl apply -f manifests/base/crds/kubeflow.org_tfjobs.yaml
The training operator is downloaded by the administrator and installed on all the associated clusters.
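The following command is a minimal sketch of one possible installation path, assuming the upstream Kubeflow training operator manifests are used and that kubectl points at the kubeconfig of each associated cluster in turn; the actual installation method may differ.
# Run against each associated cluster, not against the Fleet instance.
# A specific release can be pinned by appending ?ref=<tag> to the kustomize URL.
kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone"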
The kubeconfig file of the Fleet instance is obtained from the Distributed Cloud Container Platform for Kubernetes (ACK One) console and a kubectl client is connected to the Fleet instance.
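A minimal sketch of verifying the connection, assuming the downloaded kubeconfig is saved as $HOME/.kube/fleet-kubeconfig (a hypothetical path):
# Point kubectl at the Fleet instance and run a read-only command to confirm access.
export KUBECONFIG=$HOME/.kube/fleet-kubeconfig
kubectl get ns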
The AMC command-line tool is installed. For more information, see Use AMC.
Procedure
Developers can use the following YAML template to create TensorFlow jobs on the Fleet instance.
In this example, the job is named dist-mnist-for-e2e-test and is created in the demo namespace.
apiVersion: "kubeflow.org/v1"
kind: "TFJob"
metadata:
  name: "dist-mnist-for-e2e-test"
  namespace: demo
spec:
  tfReplicaSpecs:
    PS:
      replicas: 2
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: tensorflow
              image: acr-multiple-clusters-registry.cn-hangzhou.cr.aliyuncs.com/ack-multiple-clusters/tf-dist-mnist-test:v1.0
              resources:
                requests:
                  memory: "2Gi"
                  cpu: "2"
                limits:
                  memory: "2Gi"
                  cpu: "2"
    Worker:
      replicas: 2
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: tensorflow
              image: acr-multiple-clusters-registry.cn-hangzhou.cr.aliyuncs.com/ack-multiple-clusters/tf-dist-mnist-test:v1.0
              resources:
                requests:
                  memory: "2Gi"
                  cpu: "2"
                limits:
                  memory: "2Gi"
                  cpu: "2"
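Run the following command to submit the job to the Fleet instance, assuming the manifest above is saved locally as tfjob.yaml (a hypothetical file name):
kubectl apply -f tfjob.yaml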
Run the following command on the Fleet instance to query the scheduling result of the job.
If no output is returned, the job failed to be scheduled. Check whether the specified namespace exists and whether the namespace quota of your account is sufficient. If the namespace does not exist or the quota is exhausted, the job remains in the Pending state.
kubectl get tfjob dist-mnist-for-e2e-test -n demo -o jsonpath='{.metadata.annotations.scheduling\.x-k8s\.io/placement}'
Check the status of the TensorFlow job.
Run the following command on the Fleet instance to query the status of the job:
kubectl get tfjob dist-mnist-for-e2e-test -n demo
Expected output:
NAME                      STATE     AGE
dist-mnist-for-e2e-test   Running   ***
Run the following command on the Fleet instance to query the status of the pods that run the job:
kubectl amc get pod -j tfjob/dist-mnist-for-e2e-test -n demo
Expected output:
Run on ManagedCluster managedcluster-c1***e5
NAME                               READY   STATUS    RESTARTS   AGE
dist-mnist-for-e2e-test-ps-0       1/1     Running   0          ***
dist-mnist-for-e2e-test-ps-1       1/1     Running   0          ***
dist-mnist-for-e2e-test-worker-0   1/1     Running   0          ***
dist-mnist-for-e2e-test-worker-1   1/1     Running   0          ***
Run the following command on the Fleet instance to print the logs of a pod created for the job. In this example, the logs of the dist-mnist-for-e2e-test-worker-0 pod are queried.
kubectl amc logs dist-mnist-for-e2e-test-worker-0 -j tfjob/dist-mnist-for-e2e-test -n demo
Expected output:
Run on ManagedCluster managedcluster-c1***e5
...
Training ends @ ***
Training elapsed time: *** s
...