
Elastic GPU Service:Deploy an NGC container environment for deep learning development

Last Updated:Sep 27, 2024

NVIDIA GPU Cloud (NGC) is a deep learning ecosystem that is developed by NVIDIA. NGC allows you to access deep learning software stacks free of charge and use the stacks to build development environments for deep learning. This topic uses the TensorFlow deep learning framework as an example to describe how to deploy an NGC environment on a GPU-accelerated instance for deep learning development.

Background information

  • The NGC website provides images of different versions of mainstream deep learning frameworks, such as Caffe, Caffe2, Microsoft Cognitive Toolkit (CNTK), MXNet, TensorFlow, Theano, and Torch. You can select an image to deploy an NGC container environment based on your business requirements. In this example, the TensorFlow deep learning framework is used.

  • Alibaba Cloud provides NGC container images that are optimized for NVIDIA Pascal GPUs in Alibaba Cloud Marketplace. When you create GPU-accelerated instances, you can select these images to quickly deploy NGC container environments and instantly access optimized deep learning frameworks. This allows you to set up pre-installed development environments and to develop and deploy services efficiently. The NGC container images also provide optimized algorithm frameworks and are continuously updated.

Limits

You can deploy an NGC environment on an instance that belongs to one of the following instance families:

  • gn5i, gn6v, gn6i, gn6e, gn7i, gn7e, and gn7s

  • ebmgn6i, ebmgn6v, ebmgn6e, ebmgn7i, and ebmgn7e

For more information, see GPU-accelerated compute-optimized instance families.

Prerequisites

Note

Before you deploy an NGC environment on a GPU-accelerated instance, make sure that an NGC account is created on the NGC website.

Before you deploy an NGC environment, obtain the URL of the TensorFlow container image.

  1. Log on to the NGC website.

  2. Enter TensorFlow in the search box. Find the TensorFlow card and click TensorFlow.


  3. On the TensorFlow page, click the Tags tab and copy the URL of the required TensorFlow container image.

    In this example, the URL of the 22.05-tf1-py3 image is nvcr.io/nvidia/tensorflow:22.05-tf1-py3. This URL is used to download the TensorFlow image on a GPU-accelerated instance.


    Important

    The CUDA version in the TensorFlow image must match the driver version of the GPU-accelerated instance. Otherwise, the TensorFlow development environment fails to be deployed. For more information about the relationships between TensorFlow image versions, CUDA versions, and driver versions of GPU-accelerated instances, see TensorFlow Release Notes.
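The version-matching rule in this note can be sketched as a small check. The mapping below covers only the image used in this topic (22.05-tf1-py3 ships CUDA 11.7 and requires an NVIDIA driver of version 515 or later, per the TensorFlow Release Notes); the helper name and data structure are illustrative, not part of any NVIDIA tool.

```python
# Minimum driver requirements per NGC TensorFlow image tag.
# Only the image used in this topic is listed; extend as needed
# by consulting the TensorFlow Release Notes.
MIN_DRIVER_FOR_IMAGE = {
    "22.05-tf1-py3": (11, 7, 515),  # (CUDA major, CUDA minor, min driver major)
}

def driver_is_compatible(image_tag: str, driver_version: str) -> bool:
    """Return True if the instance driver meets the image's minimum version."""
    _, _, min_major = MIN_DRIVER_FOR_IMAGE[image_tag]
    driver_major = int(driver_version.split(".")[0])
    return driver_major >= min_major

print(driver_is_compatible("22.05-tf1-py3", "515.48.07"))  # True
print(driver_is_compatible("22.05-tf1-py3", "470.82.01"))  # False
```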

Procedure

This topic uses a gn7i instance as an example to show how to deploy an NGC environment when you create a GPU-accelerated instance.

  1. Create a GPU-accelerated instance.

    For more information, see Create an instance on the Custom Launch tab. Configure the following key parameters:

      • Region: Select a region where GPU-accelerated instances are available. You can go to the Instance Types Available for Each Region page to view the GPU-accelerated instance types that are available in each region.

      • Instance: Select an instance type. In this example, gn7i is used.

      • Image:

        1. On the Marketplace Images tab, click Select Image from Alibaba Cloud Marketplace (with Operating System).

        2. In the Alibaba Cloud Marketplace dialog box, enter NVIDIA GPU Cloud Virtual Machine Image in the search box and click Search.

        3. Find the image that you want to use and click Select.

      • Public IP Address: Select Assign Public IPv4 Address.

        Note: If no public IP address is assigned, you must associate an elastic IP address (EIP) with the instance after the instance is created. For more information, see Associate one or more EIPs with an instance.

      • Security Group: Select a security group for which TCP port 22 is enabled. If your instance must support HTTPS or Deep Learning GPU Training System (DIGITS) 6, also enable TCP port 443 for HTTPS or TCP port 5000 for DIGITS 6.
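The inbound port requirements for the security group can be summarized as follows. The dictionary is only an illustrative summary of the ports listed in this step, not an API of any Alibaba Cloud SDK.

```python
# Inbound TCP ports required by the security group, per the step above.
REQUIRED_TCP_PORTS = {
    "SSH": 22,         # always required to connect to the instance
    "HTTPS": 443,      # only if the instance must support HTTPS
    "DIGITS 6": 5000,  # only if DIGITS 6 is used
}

print(sorted(REQUIRED_TCP_PORTS.values()))  # [22, 443, 5000]
```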

  2. Use one of the following methods to connect to the instance:

    • Workbench: See Connect to a Linux instance by using a password or key.

    • VNC: See Connect to an instance by using VNC.

  3. Run the nvidia-smi command to view information about the current GPU.

    In this example, the driver version is 515.48.07. The driver version of the instance (515 or later) matches the CUDA version (11.7) in the 22.05-tf1-py3 TensorFlow image.

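If you prefer to check the driver version programmatically instead of reading the nvidia-smi table by eye, the version string can be parsed from the command's banner line. The sketch below runs a regex against a hard-coded sample line for illustration; in practice you would capture the real nvidia-smi output.

```python
import re

# Illustrative first banner line of `nvidia-smi` output (hard-coded sample).
sample = "| NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7 |"

# Extract the version number reported after "Driver Version:".
match = re.search(r"Driver Version:\s*([\d.]+)", sample)
driver_version = match.group(1) if match else None
print(driver_version)  # 515.48.07
```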

  4. Run the following command to download the TensorFlow container image:

    docker pull nvcr.io/nvidia/tensorflow:22.05-tf1-py3

    Important

    It may take a long time to download the TensorFlow container image.

  5. Run the following command to view information about the downloaded TensorFlow container image:

    docker image ls


  6. Run the following command to deploy the TensorFlow development environment by running the container:

    docker run --gpus all --rm -it nvcr.io/nvidia/tensorflow:22.05-tf1-py3

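The flags of the docker run command above each matter for this workflow. The sketch below assembles the same command as a list purely to annotate the flags; it does not invoke Docker.

```python
# The docker run invocation from the step above, annotated flag by flag.
cmd = [
    "docker", "run",
    "--gpus", "all",  # expose all of the instance's GPUs to the container
    "--rm",           # automatically remove the container when it exits
    "-it",            # allocate an interactive terminal
    "nvcr.io/nvidia/tensorflow:22.05-tf1-py3",  # image pulled in the previous step
]
print(" ".join(cmd))
```

Note that because of --rm, changes made inside the container are discarded when it exits unless you commit them to a new image, which is why the final step of this topic saves the image with docker commit.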

  7. Run the following commands in sequence to run a simple test for TensorFlow:

    python
    import tensorflow as tf
    hello = tf.constant('Hello, TensorFlow!')
    with tf.compat.v1.Session() as sess:
        result = sess.run(hello)
        print(result.decode())
    

    If TensorFlow runs as expected, the Hello, TensorFlow! message is displayed.

  8. Save the modified TensorFlow image.

    1. Open a new remote connection window to the instance. Keep the TensorFlow container running in the original window.

    2. Run the following command to query the ID of the running container (the value in the CONTAINER ID column):

      docker ps


    3. Run the following command to save the modified TensorFlow image:

      # Replace CONTAINER_ID with the container ID that is returned by the docker ps command, such as f76a5a4347d. 
      docker commit -m "commit docker" CONTAINER_ID nvcr.io/nvidia/tensorflow:22.05-tf1-py3

      Important

      Commit the image while the container is still running. The container was started with the --rm option, so it is removed when it exits and uncommitted modifications are lost.
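The CONTAINER_ID value that docker commit needs is the first column of the docker ps output. A minimal parsing sketch, using a hard-coded sample of that output (the ID f76a5a4347d is the example value from the comment above):

```python
# Illustrative `docker ps` output (hard-coded sample, not live Docker data).
sample = (
    "CONTAINER ID   IMAGE                                     COMMAND       STATUS\n"
    'f76a5a4347d    nvcr.io/nvidia/tensorflow:22.05-tf1-py3   "/bin/bash"   Up 5 minutes'
)

# The container ID is the first whitespace-separated field of the first data row.
container_id = sample.splitlines()[1].split()[0]
print(container_id)  # f76a5a4347d
```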