This topic describes how to submit a TensorFlow training job and a cron job in AI Developer Console.
Prerequisites
AI Developer Console and the scheduling component of the cloud-native AI component set are installed in the professional Kubernetes cluster. The cluster must run Kubernetes 1.20 or later.
A Resource Access Management (RAM) user is created in the RAM console by the cluster administrator. A quota group is added and associated with the RAM user. For more information, see Step 1: Create a quota group for the RAM user.
A dataset or source code repository is configured for the training job. For more information, see Configure datasets and source code repositories for a training job.
Submit a TensorFlow training job
Log on to AI Developer Console. For more information, see Step 2: Log on to AI Developer Console.
In the left-side navigation pane of AI Developer Console, click Submit Job.
In the Basic Information section:
Configure parameters such as Job Name, Job Type (default type: TF-Stand-alone), Namespace, and Execution Command.
Important: For Namespace, you can select only a namespace that the cluster administrator has allocated to you. Set the other parameters based on your requirements.
Optional: Turn on Tensorboard to visualize the training results.
Optional: Turn on Cron to configure a cron job.
Cron Schedule: Enter a standard cron expression. For more information about how to use cron expressions, see How I use cron in Linux.
Concurrency Policy: Select how the cron job behaves when a scheduled run is due but the current training job is still in progress. Valid values:
Allow: creates a new training job while the current training job is still running.
Forbid: does not create a new training job until the current training job is finished.
Replace: replaces the current training job with a new training job.
History Record Limit: TensorFlow training jobs that are created by the cron job are retained in the cluster. If the number of retained jobs exceeds this limit, the system deletes the oldest TensorFlow training jobs first.
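The concurrency policies and the history record limit described above can be illustrated with a small simulation. This is a minimal sketch of the described behavior, not the console's implementation. The function names on_schedule_tick and prune_history are hypothetical and exist only for this example.

```python
def on_schedule_tick(job_running, policy):
    """Decide what to do when the cron schedule fires.

    Returns a tuple (start_new_job, stop_current_job).
    Hypothetical helper: it only models the three policies named in this topic.
    """
    if not job_running:
        # Nothing is in progress, so every policy starts a new job.
        return True, False
    if policy == "Allow":
        # The new job runs alongside the job that is still in progress.
        return True, False
    if policy == "Forbid":
        # This scheduled run is skipped because the current job is unfinished.
        return False, False
    if policy == "Replace":
        # The current job is stopped and a new job takes its place.
        return True, True
    raise ValueError(f"unknown concurrency policy: {policy!r}")


def prune_history(jobs, limit):
    """Split retained jobs into (kept, deleted) for a history record limit.

    `jobs` must be ordered from oldest to newest. The oldest jobs are
    deleted first once the number of jobs exceeds `limit`.
    """
    if limit < 0:
        raise ValueError("limit must be zero or greater")
    excess = max(len(jobs) - limit, 0)
    return jobs[excess:], jobs[:excess]


if __name__ == "__main__":
    for policy in ("Allow", "Forbid", "Replace"):
        print(policy, on_schedule_tick(job_running=True, policy=policy))
    print(prune_history(["job-1", "job-2", "job-3", "job-4"], limit=2))
```

In this sketch, Forbid skips the scheduled run entirely rather than queuing it. Whether the console queues or skips the run is not stated in this topic, so treat that detail as an assumption.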
In the Resources section, configure the following parameters for the training job: Instances Count, Image, CPU (Cores) (default value: 4), Memory (GB) (default value: 8 GB), and GPU (Card Numbers) (default value: 0).
In the Advance Configuration section, configure the Label, Annotation, and NodeSelector parameters for Kubernetes objects.
Click Submit Job.
In the left-side navigation pane of AI Developer Console, click Job List to view the information about a job, such as the name and status of the job.
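To make the relationship between the form fields concrete, the following sketch collects the values from the Basic Information, Resources, and Advance Configuration sections into one Kubernetes-style settings dictionary. It is an illustration under stated assumptions, not the manifest that the console actually generates. The function build_job_settings, the dictionary layout, the job name, the namespace, the image tag, the default of one instance, and the mapping of GB to the Gi unit are all assumptions made for this example. Only the defaults of 4 CPU cores, 8 GB of memory, and 0 GPUs come from this topic.

```python
def build_job_settings(name, namespace, command, image,
                       instances=1, cpu_cores=4, memory_gb=8, gpus=0,
                       labels=None, annotations=None, node_selector=None):
    """Collect console form values into a Kubernetes-style dictionary.

    Hypothetical helper. The layout is illustrative and is not the
    schema used by AI Developer Console.
    """
    if instances < 1:
        raise ValueError("instances must be at least 1")
    if cpu_cores <= 0 or memory_gb <= 0 or gpus < 0:
        raise ValueError("resource values must be positive (GPUs may be 0)")

    # Kubernetes expresses resource quantities as strings.
    # Assumption: the Memory (GB) field is mapped to the Gi unit.
    limits = {"cpu": str(cpu_cores), "memory": f"{memory_gb}Gi"}
    if gpus > 0:
        # nvidia.com/gpu is the resource name exposed by the NVIDIA device plugin.
        limits["nvidia.com/gpu"] = str(gpus)

    return {
        "metadata": {
            "name": name,
            "namespace": namespace,
            "labels": dict(labels or {}),
            "annotations": dict(annotations or {}),
        },
        "replicas": instances,
        "container": {
            "image": image,
            "command": ["sh", "-c", command],
            "resources": {"limits": limits},
        },
        "nodeSelector": dict(node_selector or {}),
    }


if __name__ == "__main__":
    settings = build_job_settings(
        name="tf-mnist",                       # placeholder job name
        namespace="team-a",                    # placeholder namespace
        command="python train.py",             # placeholder execution command
        image="tensorflow/tensorflow:2.15.0",  # placeholder image
        node_selector={"gpu": "false"},
    )
    print(settings)
```

Keeping the GPU entry out of the limits when the value is 0 mirrors the common practice of requesting a GPU only when one is needed. The console may handle a value of 0 differently.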