You can create TensorFlow jobs on a Distributed Cloud Container Platform for Kubernetes (ACK One) Fleet instance in the same way you create TensorFlow jobs in a cluster. The Fleet instance will dynamically schedule TensorFlow jobs to associated clusters based on the resource requests of the TensorFlow jobs and the remaining resources in the associated clusters. This topic describes how to create a TensorFlow job and query the status of the job.
Prerequisites
By default, the TensorFlow CustomResourceDefinition (CRD) of the training operator is created on the Fleet instance. The API version of the TensorFlow CRD is kubeflow.org/v1.
The administrator of the Fleet instance can run the following command to query the CRD:
kubectl get crd tfjobs.kubeflow.org
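Expected output (a hedged sample; the creation timestamp varies by environment and is omitted here):
NAME                  CREATED AT
tfjobs.kubeflow.org   ***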
To modify the CRD, the administrator can edit the kubeflow.org_tfjobs.yaml file and then run the following command to apply the changes:
kubectl apply -f manifests/base/crds/kubeflow.org_tfjobs.yaml
The training operator is downloaded by the administrator and installed on all the associated clusters.
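The following command is a minimal sketch of one possible installation path, assuming the upstream Kubeflow training operator manifests are used and that kubectl points at the kubeconfig of each associated cluster in turn; the actual installation method may differ.
# Run against each associated cluster, not against the Fleet instance.
# A specific release can be pinned by appending ?ref=<tag> to the kustomize URL.
kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone"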
The kubeconfig file of the Fleet instance is obtained from the Distributed Cloud Container Platform for Kubernetes (ACK One) console and a kubectl client is connected to the Fleet instance.
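A minimal sketch of verifying the connection, assuming the downloaded kubeconfig is saved as $HOME/.kube/fleet-kubeconfig (a hypothetical path):
# Point kubectl at the Fleet instance and run a read-only command to confirm access.
export KUBECONFIG=$HOME/.kube/fleet-kubeconfig
kubectl get ns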
The AMC command-line tool is installed. For more information, see Use AMC.
Procedure
Developers can use the following YAML template to create TensorFlow jobs on the Fleet instance.
In this example, the job is named dist-mnist-for-e2e-test and is created in the demo namespace.
apiVersion: "kubeflow.org/v1"
kind: "TFJob"
metadata:
  name: "dist-mnist-for-e2e-test"
  namespace: demo
spec:
  tfReplicaSpecs:
    PS:
      replicas: 2
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: tensorflow
              image: acr-multiple-clusters-registry.cn-hangzhou.cr.aliyuncs.com/ack-multiple-clusters/tf-dist-mnist-test:v1.0
              resources:
                requests:
                  memory: "2Gi"
                  cpu: "2"
                limits:
                  memory: "2Gi"
                  cpu: "2"
    Worker:
      replicas: 2
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: tensorflow
              image: acr-multiple-clusters-registry.cn-hangzhou.cr.aliyuncs.com/ack-multiple-clusters/tf-dist-mnist-test:v1.0
              resources:
                requests:
                  memory: "2Gi"
                  cpu: "2"
                limits:
                  memory: "2Gi"
                  cpu: "2"
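Run the following command to submit the job to the Fleet instance, assuming the manifest above is saved locally as tfjob.yaml (a hypothetical file name):
kubectl apply -f tfjob.yaml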
Run the following command on the Fleet instance to query the scheduling result of the job.
If no output is returned, the job failed to be scheduled. Check whether the specified namespace exists and whether the namespace quota of your account is sufficient. If the namespace does not exist or the quota is exhausted, the job remains in the Pending state.
kubectl get tfjob dist-mnist-for-e2e-test -n demo -o jsonpath='{.metadata.annotations.scheduling\.x-k8s\.io/placement}'
Check the status of the TensorFlow job.
Run the following command on the Fleet instance to query the status of the job:
kubectl get tfjob dist-mnist-for-e2e-test -n demo
Expected output:
NAME                      STATE     AGE
dist-mnist-for-e2e-test   Running   ***
Run the following command on the Fleet instance to query the status of the pods that run the job:
kubectl amc get pod -j tfjob/dist-mnist-for-e2e-test -n demo
Expected output:
Run on ManagedCluster managedcluster-c1***e5
NAME                               READY   STATUS    RESTARTS   AGE
dist-mnist-for-e2e-test-ps-0       1/1     Running   0          ***
dist-mnist-for-e2e-test-ps-1       1/1     Running   0          ***
dist-mnist-for-e2e-test-worker-0   1/1     Running   0          ***
dist-mnist-for-e2e-test-worker-1   1/1     Running   0          ***
Run the following command on the Fleet instance to print the logs of a pod created for the job. In this example, the logs of the dist-mnist-for-e2e-test-worker-0 pod are queried.
kubectl amc logs dist-mnist-for-e2e-test-worker-0 -j tfjob/dist-mnist-for-e2e-test -n demo
Expected output:
Run on ManagedCluster managedcluster-c1***e5
...
Training ends @ ***
Training elapsed time: *** s
...