
Elastic GPU Service:Deploy an NGC container environment for deep learning development

Last Updated:Sep 27, 2024

NVIDIA GPU Cloud (NGC) is a deep learning ecosystem that is developed by NVIDIA. NGC allows you to access deep learning software stacks free of charge and use the stacks to build development environments for deep learning. This topic uses the TensorFlow deep learning framework as an example to describe how to deploy an NGC environment on a GPU-accelerated instance for deep learning development.

Background information

  • The NGC website provides images of different versions of mainstream deep learning frameworks, such as Caffe, Caffe2, Microsoft Cognitive Toolkit (CNTK), MXNet, TensorFlow, Theano, and Torch. You can select an image to deploy an NGC container environment based on your business requirements. In this example, the TensorFlow deep learning framework is used.

  • Alibaba Cloud provides NGC container images that are optimized for NVIDIA Pascal GPUs in Alibaba Cloud Marketplace. When you create GPU-accelerated instances, you can select these images to quickly deploy NGC container environments and instantly access optimized deep learning frameworks. This allows you to set up pre-installed development environments and to develop and deploy services efficiently. The NGC container images also provide optimized algorithm frameworks and are continuously updated.

Limits

You can deploy an NGC environment on an instance that belongs to one of the following instance families:

  • gn5i, gn6v, gn6i, gn6e, gn7i, gn7e, and gn7s

  • ebmgn6i, ebmgn6v, ebmgn6e, ebmgn7i, and ebmgn7e

For more information, see GPU-accelerated compute-optimized instance families.

Prerequisites

Note

Before you deploy an NGC environment on a GPU-accelerated instance, make sure that an NGC account is created on the NGC website.

Before you deploy an NGC environment, obtain the URL of the TensorFlow container image.

  1. Log on to the NGC website.

  2. Enter TensorFlow in the search box. Find the TensorFlow card and click TensorFlow.


  3. On the TensorFlow page, click the Tags tab and copy the URL of the required TensorFlow container image.

    In this example, the URL of the 22.05-tf1-py3 image is nvcr.io/nvidia/tensorflow:22.05-tf1-py3. This URL is used to download the TensorFlow image on a GPU-accelerated instance.


    Important

    The CUDA version in the TensorFlow image must match the driver version of the GPU-accelerated instance. Otherwise, the TensorFlow development environment fails to be deployed. For more information about the relationships between TensorFlow image versions, CUDA versions, and driver versions of GPU-accelerated instances, see TensorFlow Release Notes.
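The version-matching rule in this note can be sketched as a small check. The mapping below covers only the image used in this topic (22.05-tf1-py3 ships CUDA 11.7 and requires an NVIDIA driver of version 515 or later, per the TensorFlow Release Notes); the helper name and data structure are illustrative, not part of any NVIDIA tool.

```python
# Minimum driver requirements per NGC TensorFlow image tag.
# Only the image used in this topic is listed; extend as needed
# by consulting the TensorFlow Release Notes.
MIN_DRIVER_FOR_IMAGE = {
    "22.05-tf1-py3": (11, 7, 515),  # (CUDA major, CUDA minor, min driver major)
}

def driver_is_compatible(image_tag: str, driver_version: str) -> bool:
    """Return True if the instance driver meets the image's minimum version."""
    _, _, min_major = MIN_DRIVER_FOR_IMAGE[image_tag]
    driver_major = int(driver_version.split(".")[0])
    return driver_major >= min_major

print(driver_is_compatible("22.05-tf1-py3", "515.48.07"))  # True
print(driver_is_compatible("22.05-tf1-py3", "470.82.01"))  # False
```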

Procedure

This topic uses a gn7i instance as an example to show how to deploy an NGC environment when you create a GPU-accelerated instance.

  1. Create a GPU-accelerated instance.

    For more information, see Create an instance on the Custom Launch tab. Configure the following key parameters:

      • Region: Select a region where GPU-accelerated instances are available. You can go to the Instance Types Available for Each Region page to view the GPU-accelerated instance types that are available in each region.

      • Instance: Select an instance type. In this example, gn7i is used.

      • Image:

        1. On the Marketplace Images tab, click Select Image from Alibaba Cloud Marketplace (with Operating System).

        2. In the Alibaba Cloud Marketplace dialog box, enter NVIDIA GPU Cloud Virtual Machine Image in the search box and click Search.

        3. Find the image that you want to use and click Select.

      • Public IP Address: Select Assign Public IPv4 Address.

        Note: If no public IP address is assigned, you must associate an elastic IP address (EIP) with the instance after the instance is created. For more information, see Associate one or more EIPs with an instance.

      • Security Group: Select a security group for which TCP port 22 is enabled. If your instance must support HTTPS or Deep Learning GPU Training System (DIGITS) 6, also enable TCP port 443 for HTTPS or TCP port 5000 for DIGITS 6.
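The inbound port requirements for the security group can be summarized as follows. The dictionary is only an illustrative summary of the ports listed in this step, not an API of any Alibaba Cloud SDK.

```python
# Inbound TCP ports required by the security group, per the step above.
REQUIRED_TCP_PORTS = {
    "SSH": 22,         # always required to connect to the instance
    "HTTPS": 443,      # only if the instance must support HTTPS
    "DIGITS 6": 5000,  # only if DIGITS 6 is used
}

print(sorted(REQUIRED_TCP_PORTS.values()))  # [22, 443, 5000]
```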

  2. Use one of the following methods to connect to the instance:

    • Workbench: See Connect to a Linux instance by using a password or key.

    • VNC: See Connect to an instance by using VNC.

  3. Run the nvidia-smi command to view information about the current GPU.

    In this example, the driver version is 515.48.07. The driver version of the instance (515 or later) matches the CUDA version (11.7) in the 22.05-tf1-py3 TensorFlow image.

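If you prefer to check the driver version programmatically instead of reading the nvidia-smi table by eye, the version string can be parsed from the command's banner line. The sketch below runs a regex against a hard-coded sample line for illustration; in practice you would capture the real nvidia-smi output.

```python
import re

# Illustrative first banner line of `nvidia-smi` output (hard-coded sample).
sample = "| NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7 |"

# Extract the version number reported after "Driver Version:".
match = re.search(r"Driver Version:\s*([\d.]+)", sample)
driver_version = match.group(1) if match else None
print(driver_version)  # 515.48.07
```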

  4. Run the following command to download the TensorFlow container image:

    docker pull nvcr.io/nvidia/tensorflow:22.05-tf1-py3

    Important

    It may take a long time to download the TensorFlow container image.

  5. Run the following command to view information about the downloaded TensorFlow container image:

    docker image ls


  6. Run the following command to deploy the TensorFlow development environment by running the container:

    docker run --gpus all --rm -it nvcr.io/nvidia/tensorflow:22.05-tf1-py3

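The flags of the docker run command above each matter for this workflow. The sketch below assembles the same command as a list purely to annotate the flags; it does not invoke Docker.

```python
# The docker run invocation from the step above, annotated flag by flag.
cmd = [
    "docker", "run",
    "--gpus", "all",  # expose all of the instance's GPUs to the container
    "--rm",           # automatically remove the container when it exits
    "-it",            # allocate an interactive terminal
    "nvcr.io/nvidia/tensorflow:22.05-tf1-py3",  # image pulled in the previous step
]
print(" ".join(cmd))
```

Note that because of --rm, changes made inside the container are discarded when it exits unless you commit them to a new image, which is why the final step of this topic saves the image with docker commit.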

  7. Run the following commands in sequence to run a simple test for TensorFlow:

    python
    import tensorflow as tf
    hello = tf.constant('Hello, TensorFlow!')
    with tf.compat.v1.Session() as sess:
        result = sess.run(hello)
        print(result.decode())
    

    If TensorFlow runs as expected, the Hello, TensorFlow! message is displayed.

  8. Save the modified TensorFlow image.

    1. Open a new remote connection window to the instance. Keep the TensorFlow container running in the original window.

    2. Run the following command to query the ID of the running container (the value in the CONTAINER ID column):

      docker ps


    3. Run the following command to save the modified TensorFlow image:

      # Replace CONTAINER_ID with the container ID that is returned by the docker ps command, such as f76a5a4347d. 
      docker commit -m "commit docker" CONTAINER_ID nvcr.io/nvidia/tensorflow:22.05-tf1-py3

      Important

      Commit the image while the container is still running. The container was started with the --rm option, so it is removed when it exits and uncommitted modifications are lost.
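The CONTAINER_ID value that docker commit needs is the first column of the docker ps output. A minimal parsing sketch, using a hard-coded sample of that output (the ID f76a5a4347d is the example value from the comment above):

```python
# Illustrative `docker ps` output (hard-coded sample, not live Docker data).
sample = (
    "CONTAINER ID   IMAGE                                     COMMAND       STATUS\n"
    'f76a5a4347d    nvcr.io/nvidia/tensorflow:22.05-tf1-py3   "/bin/bash"   Up 5 minutes'
)

# The container ID is the first whitespace-separated field of the first data row.
container_id = sample.splitlines()[1].split()[0]
print(container_id)  # f76a5a4347d
```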