Container Service for Kubernetes: Configure the Arena client

Last Updated: Aug 30, 2024

Arena is a lightweight client for managing Kubernetes-based machine learning tasks. It streamlines data preparation, model development, model training, and model prediction across the full machine learning lifecycle, which makes data scientists more efficient. Arena is also deeply integrated with the basic services of Alibaba Cloud: it supports GPU sharing and Cloud Paralleled File System (CPFS), and it works with deep learning frameworks optimized by Alibaba Cloud to maximize the performance and utilization of the heterogeneous computing resources that Alibaba Cloud provides. This topic describes how to configure the Arena client.

Prerequisites

Step 1: Configure the Arena client

  1. Connect to the cluster.

    ACK dedicated clusters

    Use SSH to log on to a master node of the ACK dedicated cluster and run the arena command on the node. For more information, see Use SSH to connect to the master nodes of an ACK dedicated cluster.

    ACK managed clusters

    The control plane of an ACK managed cluster is managed by ACK, so the cluster has no master nodes that you can log on to. You must therefore install the Arena client on your local machine, such as a computer that runs macOS, and connect to the cluster from there. For more information, see Obtain the kubeconfig file of a cluster and use kubectl to connect to the cluster.

    Note

    You can run the kubectl get nodes command to check whether the configurations in the kubeconfig file are correct.

  2. Configure the Arena client.

    1. Download the Arena client.

    2. Decompress the package.

      Linux

      tar -xvf arena-installer-0.9.16-881780f-linux-amd64.tar.gz

      macOS

      tar -xvf arena-installer-0.9.16-881780f-darwin-amd64.tar.gz
    3. Install Arena.

      cd arena-installer
      bash install.sh --only-binary
  3. Optional: Install bash-completion. After you install bash-completion and enable it for Arena, you can press Tab in the CLI to complete a partially typed arena command.

    1. Install bash-completion.

      bash-completion for CentOS or other yum-based Linux distributions

      sudo yum install bash-completion -y

      bash-completion for Debian or Ubuntu

      sudo apt-get install bash-completion

      macOS

      brew install bash-completion@2
    2. Enable the auto completion feature in the profile file.

      Linux

      echo "source <(arena completion bash)" >> ~/.bashrc
      chmod u+x ~/.bashrc

      macOS

      echo "source $(brew --prefix)/etc/profile.d/bash_completion.sh" >> ~/.bashrc
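
The download, decompress, and install sub-steps above can be tied together in a small helper script. The following is a minimal sketch, not an official installer: the function names arena_package_name and install_arena are hypothetical, the version string is copied from the file names shown in this topic, and the script assumes an amd64 machine and a package that has already been downloaded to the current directory.

```shell
# Sketch only: helper names are hypothetical and not part of Arena.
# ARENA_VERSION mirrors the package file names shown in this topic;
# change it to match the package that you downloaded.
ARENA_VERSION="0.9.16-881780f"

# Map a kernel name (as printed by `uname -s`) to the package file name.
# Assumes an amd64 machine, as in the file names above.
arena_package_name() {
  case "$1" in
    Linux)  echo "arena-installer-${ARENA_VERSION}-linux-amd64.tar.gz" ;;
    Darwin) echo "arena-installer-${ARENA_VERSION}-darwin-amd64.tar.gz" ;;
    *)      echo "unsupported OS: $1" >&2; return 1 ;;
  esac
}

# Decompress a previously downloaded package and install only the arena binary.
install_arena() {
  pkg="$1"
  [ -f "$pkg" ] || { echo "package not found: $pkg" >&2; return 1; }
  tar -xvf "$pkg" || return 1
  # Run the installer in a subshell so the caller's working directory is unchanged.
  ( cd arena-installer && bash install.sh --only-binary )
}
```

To use the sketch, run install_arena "$(arena_package_name "$(uname -s)")" from the directory that contains the downloaded package. The function stops with an error message if the package file is missing instead of calling tar on a nonexistent file.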

Step 2: Test whether Arena works as expected

You can perform the following steps to check whether Arena works as expected:

  1. Run the following command to query the available GPU resources in the cluster:

    arena top node

    If the output lists the nodes and their GPUs, as in the following example, Arena works as expected:

    NAME                        IPADDRESS      ROLE    STATUS  GPU(Total)  GPU(Allocated)
    cn-huhehaote.192.168.X.XXX  192.168.0.117  <none>  ready   8           0
    cn-huhehaote.192.168.X.XXX  192.168.0.118  <none>  ready   8           0
    cn-huhehaote.192.168.X.XXX  192.168.0.119  <none>  ready   8           0
    cn-huhehaote.192.168.X.XXX  192.168.0.120  <none>  ready   8           0
    -----------------------------------------------------------------------------------------
    Allocated/Total GPUs In Cluster:
    0/32 (0%)
  2. Run the following command to submit a training job:

    arena submit tf \
          --name=firstjob \
          --gpus=1 \
          --image=registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/tf-mnist-standalone:gpu \
          "python /app/main.py"

    Expected output:

    configmap/firstjob-tfjob created
    configmap/firstjob-tfjob labeled
    tfjob.kubeflow.org/firstjob created
    INFO[0001] The Job firstjob has been submitted successfully
    INFO[0001] You can run `arena get firstjob --type tfjob` to check the job status
  3. Run the following command to query all jobs:

    arena list

    Expected output:

    NAME      STATUS   TRAINER  AGE  NODE
    firstjob  RUNNING  TFJOB    5s   192.168.X.XXX
  4. Run the following command to query the status of the submitted job:

    arena get firstjob

    Expected output:

    STATUS: SUCCEEDED
    NAMESPACE: default
    PRIORITY: N/A
    TRAINING DURATION: 52s
    NAME      STATUS     TRAINER  AGE  INSTANCE          NODE
    firstjob  SUCCEEDED  TFJOB    14m  firstjob-chief-0  192.168.X.XXX
  5. Run the following command to query the log of the job:

    arena logs --tail=10 firstjob

    Expected output:

    Accuracy at step 910: 0.9694
    Accuracy at step 920: 0.9687
    Accuracy at step 930: 0.9676
    Accuracy at step 940: 0.9678
    Accuracy at step 950: 0.9704
    Accuracy at step 960: 0.9692
    Accuracy at step 970: 0.9721
    Accuracy at step 980: 0.9696
    Accuracy at step 990: 0.9675
    Adding run metadata for 999
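
In the steps above, arena list shows the job as RUNNING and a later arena get shows it as SUCCEEDED, so a script that submits a job usually has to wait before it reads the logs. The following is a minimal sketch of such a wait loop. The function names job_status and wait_for_job are hypothetical, the parsing assumes that arena get prints a line of the form "STATUS: <value>" as in the sample output above, and FAILED is assumed to be the status of an unsuccessful job. The output format may differ between Arena versions.

```shell
# Sketch only: helper names are hypothetical and not part of Arena.

# Read `arena get` output on stdin and print the job status.
# Assumes a line of the form "STATUS: <value>", as in the sample output above.
job_status() {
  awk '$1 == "STATUS:" { print $2; exit }'
}

# Poll `arena get <job>` until the job succeeds or fails, or until the
# attempts run out. Arguments: job name, attempts (default 60),
# seconds between attempts (default 10).
wait_for_job() {
  job="$1"
  attempts="${2:-60}"
  interval="${3:-10}"
  while [ "$attempts" -gt 0 ]; do
    status="$(arena get "$job" | job_status)"
    case "$status" in
      SUCCEEDED) echo "$job succeeded"; return 0 ;;
      FAILED)    echo "$job failed" >&2; return 1 ;;  # assumed failure status
    esac
    attempts=$((attempts - 1))
    sleep "$interval"
  done
  echo "timed out waiting for $job" >&2
  return 2
}
```

For example, wait_for_job firstjob && arena logs --tail=10 firstjob prints the last log lines only after the job has finished successfully. The distinct return codes (0, 1, 2) let a calling script tell success, failure, and timeout apart.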