NVIDIA GPU Cloud (NGC) is a deep learning ecosystem that is developed by NVIDIA. NGC allows you to access deep learning software stacks free of charge and use the stacks to build development environments for deep learning. This topic uses the TensorFlow deep learning framework as an example to describe how to deploy an NGC environment on a GPU-accelerated instance for deep learning development.
Background information
The NGC website provides images of different versions of mainstream deep learning frameworks, such as Caffe, Caffe2, Microsoft Cognitive Toolkit (CNTK), MXNet, TensorFlow, Theano, and Torch. You can select an image to deploy an NGC container environment based on your business requirements. In this example, the TensorFlow deep learning framework is used.
Alibaba Cloud provides NGC container images that are optimized for NVIDIA Pascal GPUs in Alibaba Cloud Marketplace. When you create GPU-accelerated instances, you can select these images to quickly deploy NGC container environments and instantly access optimized deep learning frameworks. This allows you to pre-install development environments and develop and deploy services efficiently. The NGC container images also provide optimized algorithm frameworks and are continuously updated.
Limits
You can deploy an NGC environment on an instance that belongs to one of the following instance families:
gn5i, gn6v, gn6i, gn6e, gn7i, gn7e, and gn7s
ebmgn6i, ebmgn6v, ebmgn6e, ebmgn7i, and ebmgn7e
For more information, see GPU-accelerated compute-optimized instance families.
Prerequisites
Before you deploy an NGC environment on a GPU-accelerated instance, make sure that an NGC account is created on the NGC website.
Before you deploy an NGC environment, obtain the URL of the TensorFlow container image.
Enter TensorFlow in the search box. Find the TensorFlow card and click TensorFlow.
On the TensorFlow page, click the Tags tab and copy the URL of the required TensorFlow container image.
In this example, the URL of the 22.05-tf1-py3 image is nvcr.io/nvidia/tensorflow:22.05-tf1-py3. This URL is used to download the TensorFlow image on a GPU-accelerated instance.
Important: The CUDA version in the TensorFlow image must match the driver version of the GPU-accelerated instance. Otherwise, the TensorFlow development environment fails to be deployed. For more information about the mapping between TensorFlow image versions, CUDA versions, and driver versions of GPU-accelerated instances, see TensorFlow Release Notes.
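The compatibility note above can be sketched as a small check. This is a hypothetical helper, not part of any NVIDIA tooling: the mapping encodes only the single pairing stated in this topic (the 22.05-tf1-py3 image, built against CUDA 11.7, requires GPU driver 515 or later). Consult TensorFlow Release Notes for the authoritative table.

```python
# Hypothetical mapping: image tag -> minimum GPU driver major version.
# Only the pairing from this topic is encoded (22.05-tf1-py3 needs 515+).
MIN_DRIVER_MAJOR = {"22.05-tf1-py3": 515}

def driver_matches(image_tag: str, driver_version: str) -> bool:
    """Return True if the driver's major version meets the image's minimum."""
    major = int(driver_version.split(".")[0])
    return major >= MIN_DRIVER_MAJOR[image_tag]

print(driver_matches("22.05-tf1-py3", "515.48.07"))  # True: 515 >= 515
print(driver_matches("22.05-tf1-py3", "470.82.01"))  # False: 470 < 515
```

You can compare the first field of the driver version reported by nvidia-smi against the minimum before pulling an image.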
Procedure
This topic uses a gn7i instance as an example to show how to deploy an NGC environment when you create a GPU-accelerated instance.
Create a GPU-accelerated instance.
For more information, see Create an instance on the Custom Launch tab. The following section describes how to configure key parameters:
Region: Select a region where GPU-accelerated instances are available. You can go to the Instance Types Available for Each Region page to view the GPU-accelerated instance types that are available in each region.
Instance: Select an instance type. In this example, gn7i is used.
Image: On the Marketplace Images tab, click Select Image from Alibaba Cloud Marketplace (with Operating System). In the Alibaba Cloud Marketplace dialog box, enter NVIDIA GPU Cloud Virtual Machine Image in the search box and click Search. Find the image that you want to use and click Select.
Public IP Address: Select Assign Public IPv4 Address. Note: If no public IP address is assigned, you must associate an elastic IP address (EIP) with the instance after the instance is created. For more information, see Associate one or more EIPs with an instance.
Security Group: Select a security group. You must enable TCP port 22 for the security group. If your instance must support HTTPS or Deep Learning GPU Training System (DIGITS) 6, also enable TCP port 443 for HTTPS or TCP port 5000 for DIGITS 6.
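After the security group is configured, you can verify from your local machine that the required ports are reachable. The following sketch uses only the Python standard library; is_port_open is a hypothetical helper, and the demo connects to a local listener rather than a real instance, so substitute your instance's public IP address and port in practice.

```python
import socket

def is_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Demo: a local listening socket stands in for the instance's open port.
listener = socket.socket()
listener.bind(("127.0.0.1", 0))        # bind to an ephemeral port
listener.listen(1)
port = listener.getsockname()[1]
print(is_port_open("127.0.0.1", port))  # True: the port accepts connections
listener.close()
```

For the real instance, call is_port_open with its public IP address and port 22 (or 443/5000 if enabled).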
Use one of the following methods to connect to the instance:
Workbench
VNC
Run the nvidia-smi command to view information about the current GPU.
In this example, the driver version is 515.48.07. The driver version of the instance (515 or later) matches the CUDA version (11.7) in the 22.05-tf1-py3 TensorFlow image.
Run the following command to download the TensorFlow container image:
docker pull nvcr.io/nvidia/tensorflow:22.05-tf1-py3
Important: It may take a long time to download the TensorFlow container image.
Run the following command to view information about the downloaded TensorFlow container image:
docker image ls
Run the following command to start the container and deploy the TensorFlow development environment:
docker run --gpus all --rm -it nvcr.io/nvidia/tensorflow:22.05-tf1-py3
Run the following commands in sequence to run a simple test for TensorFlow:
python
import tensorflow as tf
hello = tf.constant('Hello, TensorFlow!')
with tf.compat.v1.Session() as sess:
    result = sess.run(hello)
    print(result.decode())
If TensorFlow loads the GPU device as expected, the Hello, TensorFlow! result appears. The following figure shows an example.
Save the modified TensorFlow image.
On the GPU connection page, open a new window for remote connection.
Run the following command to query the container ID, which is specified by CONTAINER_ID:
docker ps
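If you script this step, the container ID can be extracted from the docker ps output programmatically. A minimal Python sketch, assuming the output is shaped like the default docker ps table; container_ids is a hypothetical helper, and the sample ID f76a5a4347d comes from the example in this topic.

```python
def container_ids(docker_ps_output: str) -> list:
    """Extract the CONTAINER ID column: the first token of each data row."""
    lines = docker_ps_output.strip().splitlines()
    return [line.split()[0] for line in lines[1:]]  # lines[0] is the header

# Sample text shaped like default `docker ps` output (assumed for the demo).
sample = """CONTAINER ID   IMAGE                                     COMMAND   STATUS
f76a5a4347d    nvcr.io/nvidia/tensorflow:22.05-tf1-py3   "bash"    Up 5 minutes"""
print(container_ids(sample))  # ['f76a5a4347d']
```

In a real script you would feed the function the captured stdout of docker ps.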
Run the following command to save the modified TensorFlow image:
# Replace CONTAINER_ID with the container ID that is queried by using the docker ps command, such as f76a5a4347d.
docker commit -m "commit docker" CONTAINER_ID nvcr.io/nvidia/tensorflow:22.05-tf1-py3
Important: Make sure that the modified TensorFlow image is properly saved. Otherwise, the modifications may be lost the next time you log on to the instance.