Arena is a lightweight client for managing Kubernetes-based machine learning tasks. It streamlines data preparation, model development, model training, and model prediction across the complete machine learning lifecycle, which improves the efficiency of data scientists. Arena is also deeply integrated with the basic services of Alibaba Cloud. It supports GPU sharing and Cloud Paralleled File System (CPFS), and it works with deep learning frameworks optimized by Alibaba Cloud to maximize the performance and utilization of the heterogeneous computing resources provided by Alibaba Cloud. This topic describes how to configure the Arena client.
Prerequisites
A Container Service for Kubernetes (ACK) cluster that contains GPU-accelerated nodes is created. For more information, see Create an ACK cluster with GPU-accelerated nodes or Create an ACK dedicated cluster with GPU-accelerated nodes.
Nodes in the cluster can access the Internet.
The Arena component is installed. For more information, see Deploy the cloud-native AI suite.
Step 1: Configure the Arena client
Connect to the cluster.
ACK dedicated clusters
Use SSH to log on to a master node of the ACK dedicated cluster and run the arena command on the node. For more information, see Use SSH to connect to the master nodes of an ACK dedicated cluster.
ACK managed clusters
ACK managed clusters do not contain master nodes. Therefore, you must install the Arena client on your on-premises machine, such as a computer that runs macOS. For more information, see Obtain the kubeconfig file of a cluster and use kubectl to connect to the cluster.
Note: You can run the kubectl get nodes command to check whether the configurations in the kubeconfig file are correct.
Configure the Arena client.
Download the Arena client.
Decompress the package.
Linux
tar -xvf arena-installer-0.9.16-881780f-linux-amd64.tar.gz
macOS
tar -xvf arena-installer-0.9.16-881780f-darwin-amd64.tar.gz
Install Arena.
cd arena-installer
bash install.sh --only-binary
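The decompress and install steps above can be combined into one script. The following is a minimal sketch, assuming the installer archive has already been downloaded to the current directory under the file name used in this topic. The function names arena_archive_name and install_arena are illustrative and are not part of Arena.

```shell
# Sketch: choose the Arena installer archive for the current OS, unpack it, and install it.
# Assumption: the archive is already in the current directory under the name used in this topic.

ARENA_VERSION="0.9.16-881780f"   # version string taken from the file names in this topic

# Print the archive name for a kernel name as reported by `uname -s`.
arena_archive_name() {
  case "$1" in
    Linux)  echo "arena-installer-${ARENA_VERSION}-linux-amd64.tar.gz" ;;
    Darwin) echo "arena-installer-${ARENA_VERSION}-darwin-amd64.tar.gz" ;;
    *)      echo "unsupported OS: $1" >&2; return 1 ;;
  esac
}

# Unpack the archive and run the installer in a subshell so the caller's directory is unchanged.
install_arena() {
  local archive
  archive="$(arena_archive_name "$(uname -s)")" || return 1
  [ -f "$archive" ] || { echo "archive not found: $archive" >&2; return 1; }
  tar -xvf "$archive" || return 1
  (cd arena-installer && bash install.sh --only-binary)
}

# Usage:
# install_arena
```

Keeping the file-name selection in its own function makes it easy to check the result before anything is unpacked.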
Optional: Install bash-completion. After you install bash-completion, you can press Tab in the CLI to automatically complete a partially typed command.
Install bash-completion.
bash-completion for CentOS or other yum-based Linux distributions
sudo yum install bash-completion -y
bash-completion for Debian or Ubuntu
sudo apt-get install bash-completion
macOS
brew install bash-completion@2
Enable the auto completion feature in the profile file.
Linux
echo "source <(arena completion bash)" >> ~/.bashrc
chmod u+x ~/.bashrc
macOS
echo "source $(brew --prefix)/etc/profile.d/bash_completion.sh" >> ~/.bashrc
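Running the echo commands above more than once adds duplicate lines to the profile file. The following is a minimal sketch of an idempotent alternative. The helper name append_once is hypothetical and is not part of Arena or bash-completion.

```shell
# Sketch: append a line to a profile file only if that exact line is not already present.
# `append_once` is a hypothetical helper name used for illustration.
append_once() {
  local line="$1" file="$2"
  touch "$file"
  # -q: quiet, -x: match the whole line, -F: treat the pattern as a fixed string
  grep -qxF -e "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Usage on Linux, with the same line as in the step above:
# append_once 'source <(arena completion bash)' ~/.bashrc
```

The new setting takes effect in shells that are started after the profile file is updated.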
Step 2: Test whether Arena works as expected
You can perform the following steps to check whether Arena works as expected:
Run the following command to query the available GPU resources in the cluster:
arena top node
The output shows information about the nodes and GPUs. This indicates that Arena works as expected.
NAME                        IPADDRESS      ROLE    STATUS  GPU(Total)  GPU(Allocated)
cn-huhehaote.192.168.X.XXX  192.168.0.117  <none>  ready   8           0
cn-huhehaote.192.168.X.XXX  192.168.0.118  <none>  ready   8           0
cn-huhehaote.192.168.X.XXX  192.168.0.119  <none>  ready   8           0
cn-huhehaote.192.168.X.XXX  192.168.0.120  <none>  ready   8           0
-----------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster: 0/32 (0%)
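In a script, the GPU totals can be read from the summary at the end of the output. The following is a minimal sketch, assuming the summary keeps the "Allocated/Total GPUs In Cluster: allocated/total" format shown above. The helper name gpu_summary is hypothetical.

```shell
# Sketch: read `arena top node` output on stdin and print "<allocated> <total>".
# `gpu_summary` is a hypothetical helper; it relies only on the summary format shown in this topic.
gpu_summary() {
  grep -oE 'Allocated/Total GPUs In Cluster: *[0-9]+/[0-9]+' \
    | head -n 1 \
    | sed -E 's#.*: *([0-9]+)/([0-9]+)#\1 \2#'
}

# Usage:
# arena top node | gpu_summary
```

The function prints nothing if the summary is not found, which a calling script can treat as a sign that the cluster could not be queried.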
Run the following command to submit a training job:
arena submit tf \
    --name=firstjob \
    --gpus=1 \
    --image=registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/tf-mnist-standalone:gpu \
    "python /app/main.py"
Expected output:
configmap/firstjob-tfjob created
configmap/firstjob-tfjob labeled
tfjob.kubeflow.org/firstjob created
INFO[0001] The Job firstjob has been submitted successfully
INFO[0001] You can run `arena get firstjob --type tfjob` to check the job status
Run the following command to query all jobs:
arena list
Expected output:
NAME      STATUS   TRAINER  AGE  NODE
firstjob  RUNNING  TFJOB    5s   192.168.X.XXX
Run the following command to query the status of the submitted job:
arena get firstjob
Expected output:
STATUS: SUCCEEDED
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 52s

NAME      STATUS     TRAINER  AGE  INSTANCE          NODE
firstjob  SUCCEEDED  TFJOB    14m  firstjob-chief-0  192.168.X.XXX
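To wait for a job to finish instead of checking it manually, the status can be polled in a loop. The following is a minimal sketch, assuming the arena get output starts with a "STATUS:" line as shown above. The helper names job_status and wait_for_job are hypothetical, and FAILED is an assumed status string because only RUNNING and SUCCEEDED appear in this topic.

```shell
# Sketch: extract the job status from `arena get <job>` output and poll until the job finishes.
# `job_status` and `wait_for_job` are hypothetical helper names.

# Read `arena get` output on stdin and print the value of the first "STATUS:" line.
job_status() {
  awk '/^STATUS:/ { print $2; exit }'
}

# Poll every 10 seconds, up to max_attempts times (default 60).
# Assumption: a failed job reports FAILED; this string is not shown in this topic.
wait_for_job() {
  local job="$1" max_attempts="${2:-60}" attempt=0 status
  while [ "$attempt" -lt "$max_attempts" ]; do
    status="$(arena get "$job" | job_status)"
    echo "$job: ${status:-unknown}"
    case "$status" in
      SUCCEEDED) return 0 ;;
      FAILED)    return 1 ;;
    esac
    attempt=$((attempt + 1))
    sleep 10
  done
  echo "timed out waiting for $job" >&2
  return 2
}

# Usage:
# wait_for_job firstjob
```

The attempt limit keeps the loop from running forever if the status line is missing or the job never reaches a final state.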
Run the following command to query the log of the job:
arena logs --tail=10 firstjob
Expected output:
Accuracy at step 910: 0.9694
Accuracy at step 920: 0.9687
Accuracy at step 930: 0.9676
Accuracy at step 940: 0.9678
Accuracy at step 950: 0.9704
Accuracy at step 960: 0.9692
Accuracy at step 970: 0.9721
Accuracy at step 980: 0.9696
Accuracy at step 990: 0.9675
Adding run metadata for 999
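The most recent accuracy value can be pulled out of the log for use in a script. The following is a minimal sketch, assuming the training image prints one "Accuracy at step N: value" entry per line as in the expected output above. The helper name last_accuracy is hypothetical.

```shell
# Sketch: read training logs on stdin and print the last reported step and its accuracy.
# `last_accuracy` is a hypothetical helper; the log format is specific to this sample image.
last_accuracy() {
  awk '
    /^Accuracy at step [0-9]+:/ { step = $4; acc = $5 }   # remember the latest matching line
    END {
      if (step != "") {
        sub(/:$/, "", step)                               # drop the trailing colon from "990:"
        print step, acc
      }
    }
  '
}

# Usage:
# arena logs --tail=10 firstjob | last_accuracy
```

Lines that do not match the accuracy pattern, such as "Adding run metadata for 999", are ignored.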