This topic provides answers to some frequently asked questions about Data Science Workshop (DSW).
Questions
What is DSW?
DSW is a cloud-native machine learning and data science platform provided by Platform for AI (PAI). DSW comes with JupyterLab, WebIDE, and Terminal. You can also establish a remote connection between your on-premises device and DSW over SSH to use various computing resources and environments provided by DSW. DSW allows you to write and execute code online, submit the code as offline tasks, and then download the generated trained models.
How do I download a folder from Notebook?
Notebook in DSW is a development environment based on open source JupyterLab, which does not let you download a folder directly by right-clicking it. However, the Notebook, WebIDE, and Terminal development environments that DSW provides are interconnected in the backend. You can therefore use Linux commands in Terminal to package the folder into an archive file, and then right-click the archive file in Notebook to download it.
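The packaging step can be sketched as follows. This is a minimal example that assumes the tar utility is available in Terminal. The folder name my_project and the archive name my_project.tar.gz are placeholders, not names that DSW defines.

```shell
# Hypothetical folder used only for illustration; replace it with your own folder.
mkdir -p my_project
echo "print('hello')" > my_project/main.py

# Package the folder into a single compressed archive.
tar -czf my_project.tar.gz my_project

# List the archive contents to confirm that the folder was packaged.
tar -tzf my_project.tar.gz
```

The archive is written to the current directory, so it should appear next to the original folder in the Notebook file browser, where you can right-click it to download it.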
What do I do if I fail to connect to a DSW instance by using the proxy client and the "client_loop: send disconnect: Broken pipe" message appears?
When you connect to a DSW instance by using SSH, if the session is idle for a long period of time, it is automatically disconnected.
To resolve this issue, we recommend that you directly connect to a DSW instance.
How do I mount and use my NAS file system on a DSW instance?
A DSW instance comes with a system disk to temporarily store data. After the instance is stopped or deleted, the data is cleared. To permanently store data, you must mount your File Storage NAS (NAS) file system on the instance. All your NAS files are stored in the /nas directory. You can view and use the files in this directory by using DSW Terminal.
The latest version of DSW allows you to mount your NAS file system on a DSW instance only when you create the DSW instance. For more information, see Create and manage DSW instances. After a DSW instance is created, you cannot modify the instance information or change the mounted NAS file system.
If a NAS file system is mounted on a DSW instance, the NAS file system is used to permanently store data. The temporary storage of the DSW instance is no longer used.
What do I do if the "insufficient capacity of ephemeral storage" message appears when I create an image?
Cause: The system checks the available capacity of the system disk when you create an image. If the available capacity is smaller than the size of the write layer, an error is reported.
Solution: In DSW Terminal, run the df -h command to check the disk usage of the file system. Make sure that the capacity used by overlay does not exceed the available capacity of /dev/vda4. If it does, configure the Custom Excluded Path parameter when you create the image.
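The check described above might look like the following in Terminal. This is a sketch only: the file system names in the output depend on the instance, and using du to look for large directories that could be candidates for Custom Excluded Path is an assumption about a reasonable workflow, not a documented DSW procedure.

```shell
# Show usage for all mounted file systems in human-readable units.
# Compare the "Used" value of the overlay row with the "Avail" value of the /dev/vda4 row.
df -h

# Show only the root file system.
df -h /

# List the five largest entries in the current directory to find what takes up space.
du -sh ./* 2>/dev/null | sort -rh | head -n 5
```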
What do I do if the startup period is too long when I use DSW trial resources or the "The charge of current ECI instance has been stopped, but the related resources are still being cleaned" message appears?
Cause: The trial resources are public resources and may be insufficient during peak hours. If you start a DSW instance during peak hours, the instance may take up to half an hour to start. If no resources are available within 1 hour, the system prompts that the instance type is unavailable in the current region.
Solution: If the startup period is too long, you can perform the following operations:
Select another region.
Select another instance type. You cannot change the type of an instance that is in the Waiting state. You need to stop the instance before you select another instance type.
Run the instance during off-peak hours, such as non-working hours.
How do I use a third-party library in DSW?
To install a third-party library in DSW, run the relevant commands in DSW Terminal. Sample code:
# Install a third-party library in the Python 3 environment.
pip install --user xxx
# Install a third-party library in the Python 2 environment.
source activate python2
pip install --user xxx
Replace xxx with the name of the third-party library that you want to install. After the installation is complete, choose Kernel > Restart Kernel to restart the service.
Why does the system require logon again when I pause for a period of time during the execution of machine learning code?
To ensure security, a logon session in DSW is valid for 3 hours. After the session expires, you must log on again. Task execution is not affected by logon session timeout. If you need to run a task that takes a long time to complete, we recommend that you use the nohup command on the Terminal interface in DSW to run the task in the background.
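A minimal sketch of running a task in the background with nohup is shown below. The script long_task.sh and the log file long_task.log are hypothetical names used for illustration. In practice, the task would be your own training command.

```shell
# Hypothetical long-running task, used only for illustration.
cat > long_task.sh <<'EOF'
#!/bin/sh
sleep 2
echo "task finished"
EOF

# Start the task in the background so that it keeps running after the session ends.
# Standard output and standard error are redirected to a log file.
nohup sh long_task.sh > long_task.log 2>&1 &
task_pid=$!
echo "Started background task with PID $task_pid"

# Only for this demonstration: wait for the short task so that the log can be shown.
wait "$task_pid"
cat long_task.log
```

For a real task, you would skip the wait step, and check progress later by running tail -f long_task.log.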
I established an FTP connection by using ECS and uploaded and downloaded files to a NAS file system. What do I do if the "mount:wrong fs type,bad option,bad superblock" message appears after I run the mount command?
Solution: Before you run the mount command, install the nfs-utils package by running the following command:
yum install nfs-utils
How do I use DSW to read data from OSS?
Go to the Terminal interface of a DSW instance and run ossutil commands to upload and download objects. Perform the following steps:
Download, install, and then configure ossutil on the Terminal interface of a DSW instance. For more information, see Install ossutil.
Upload an object to an Object Storage Service (OSS) bucket or download an object from an OSS bucket to the DSW instance. For more information, see ossutil command reference.
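The upload and download step can be sketched with the ossutil cp command. This example assumes that ossutil is already installed and configured. The helper oss_copy, the variable OSSUTIL_BIN, and the bucket name examplebucket are hypothetical and exist only for illustration.

```shell
# Path to the ossutil binary. Assumption: ossutil is installed and configured.
OSSUTIL_BIN=${OSSUTIL_BIN:-ossutil}

# Copy between a local path and an OSS path in either direction.
# The first argument is the source and the second is the destination.
oss_copy() {
    "$OSSUTIL_BIN" cp "$1" "$2"
}

# Example calls (not executed here because they need a real, configured bucket):
#   oss_copy oss://examplebucket/datasets/train.csv ./train.csv        # download
#   oss_copy ./model.tar.gz oss://examplebucket/models/model.tar.gz    # upload
```

The direction of the transfer is determined only by which argument carries the oss:// prefix. To copy a whole directory, ossutil cp also accepts the -r option.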
Why am I unable to access the services deployed in DSW over the Internet?
DSW provides debugging services for models in the development phase and does not support Internet access. If you need to access the deployed service over the Internet, you can use Elastic Algorithm Service (EAS) to deploy the trained model. For more information, see EAS overview and Call a service over a public endpoint.
Why does the third-party library I installed fail to take effect?
If a library is not found when you import it after you install it by using the pip command, restart the service. If the error persists, check whether the library was installed in the environment that you are using. By default, third-party libraries for DSW are installed in the Python 3 environment. Before you can install a third-party library in another environment, you must manually switch to that environment. Sample code:
# Install a third-party library in the Python 2 environment.
source activate python2
pip install --user xxx
# Install a third-party library in the TensorFlow 2.0 environment.
source activate tf2
pip install --user xxx
Replace xxx with the name of the third-party library that you want to install.
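To check which environment is active and whether a library is visible to it, a sketch like the following may help. It assumes that the active environment exposes a Python 3 interpreter as python3. The variable PYTHON_BIN and the helper module_available are hypothetical names, and json stands in for the name of your library.

```shell
# Interpreter to inspect. Assumption: the active environment exposes Python 3 as "python3".
PYTHON_BIN=${PYTHON_BIN:-python3}

# Print the interpreter path so that you can see which environment is active.
"$PYTHON_BIN" -c 'import sys; print(sys.executable)'

# Print the directory that "pip install --user" installs packages into.
"$PYTHON_BIN" -c 'import site; print(site.getusersitepackages())'

# Return success if the given top-level module is visible to the interpreter.
module_available() {
    "$PYTHON_BIN" -c 'import importlib.util, sys; sys.exit(0 if importlib.util.find_spec(sys.argv[1]) else 1)' "$1"
}

# "json" is part of the standard library and stands in for your library name.
if module_available json; then
    echo "json: found"
else
    echo "json: not found"
fi
```

If the interpreter path does not belong to the environment in which you installed the library, switch to that environment and run the check again.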
How do I deploy a model that is generated by DSW?
Use EAS to deploy a model service
You can run commands in DSW Terminal to use the built-in EASCMD client to deploy a model service. For more information, see Create and manage DSW instances.
Download a model to an on-premises device for deployment
You can right-click a model that is generated by DSW and download the model to an on-premises device.
How is DSW billed?
DSW can be billed based on the subscription or pay-as-you-go billing method. You can select a billing method based on your business requirements. For more information, see Billing of DSW.
How do I view the bills of DSW?
If you use the pay-as-you-go billing method, you can view billing details by choosing Expenses > User Center in the top navigation bar of the Alibaba Cloud Management Console. For more information, see View billing details.
Why am I unable to start Docker in DSW?
DSW runs in a container. Therefore, you cannot install Docker for DSW. The underlying virtual machine is equipped with a specific version of CUDA that cannot be changed before delivery. You can use the NVIDIA System Management Interface (nvidia-smi) to query the CUDA version.
What do I do if I fail to start a DSW instance and the "The cluster resources are fully utilized" message appears?
If a DSW instance fails to start and the "The cluster resources are fully utilized. Please try later or other regions." message appears, you can try the following methods to resolve the issue:
Change the instance type. Some instance types may have higher resource availability.
Change the region. Some regions may have higher resource availability.
Create DSW instances during off-peak hours, such as evenings or weekends. Resource availability is generally higher during off-peak hours.
If the issue persists, contact your account manager.
What do I do if the "back-off 10s restarting failed container=dsw-notebook pod" message appears when I start a DSW instance?
The message means that your system disk is full. You can resize the system disk by clicking Change Settings in the Actions column of the instance.
After you resize the system disk, you are charged for the system disk regardless of whether the instance is running. If you want to stop all billing related to the DSW instance, delete the DSW instance. Before you delete the instance, make sure that the necessary data is backed up.
What do I do if I fail to start a DSW instance and the "available zone with vSwitch is out of stock" message appears?
A virtual private cloud (VPC) is configured for the DSW instance that you created. The vSwitch of the VPC requires computing resources to reside in the same zone as the vSwitch. This may result in a resource shortage. To resolve this issue, perform the following steps:
On the Interactive Modeling (DSW) page, find the DSW instance that you want to manage and click the instance name to go to the Instance Details page.
On the Instance Settings tab, click Change Settings.
Modify the network configuration and leave the VPC parameter empty.
Note: If you need to use a VPC, we recommend that you create a vSwitch and a DSW instance in another region to expand the scope of available resources and prevent resource insufficiency.
What do I do if I fail to start a DSW instance and the "Your resource usage has exceeded the default limitation. Please contact us via ticket system to raise the limitation." message appears?
When you create DSW instances, each Alibaba Cloud account can use only two GPUs in each region. If the resource usage exceeds the limit, this issue may occur. If you want to increase the quota, submit a ticket.
Why am I unable to use bash features such as auto-completion in Terminal?
Bash features are disabled by default in specific images. You need to enter bash in Terminal and press the Enter key to enable bash features.
What do I do if the specifications of a DSW instance do not meet the requirements for AI development in DSW?
Perform the following steps to update DSW instance specifications:
On the Interactive Modeling (DSW) page, find the DSW instance that you want to manage and click the instance name to go to the Instance Details page.
On the Instance Settings tab, click Change Settings.
In the Change Instance Settings panel, update the instance specifications.
Note: When you update the specifications of a running DSW instance, the update operation immediately restarts the instance. Make sure that you have saved the data in the instance.
What do I do if the "Input/output error" message appears when PAI accesses the mount directory after I mount an OSS dataset?
This issue occurs because OSS access permissions (AliyunPAIDLCAccessingOSSRole) are not granted to PAI. For more information about how to grant OSS access permissions to PAI, see Authorize the service-linked role.
How do I release memory when my memory usage is high?
You can use one of the following methods to resolve the issue:
If the memory usage is so high that you cannot use the command line, click Stop Instance in the upper-right corner of the development page. You can also go to the instance page of the DSW console and click Stop in the Actions column of the instance that you want to stop. Wait until the instance is stopped before you open the instance.
If you can use the command line in the instance, run the top command on the Terminal interface of the instance to view the memory usage of all processes. In the output, %MEM is the percentage of memory used and PID is the process ID. To end a process, run the following command:
kill PID
Replace PID with the ID of the process that you want to end. After the command is run, you can see the reduction in memory usage.
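The steps above can be sketched as a non-interactive sequence. This example assumes that the procps version of ps is available in the image. The sleep process is a hypothetical stand-in for a process that uses too much memory.

```shell
# List the five processes that use the most memory (a non-interactive alternative to top).
ps -eo pid,%mem,rss,comm --sort=-%mem | head -n 6

# Hypothetical stand-in for a memory-hungry process.
sleep 300 &
target_pid=$!

# Send the default signal (SIGTERM) so that the process can exit cleanly.
kill "$target_pid"
wait "$target_pid" 2>/dev/null || true

# Confirm that the process is gone.
if kill -0 "$target_pid" 2>/dev/null; then
    echo "process $target_pid is still running"
else
    echo "process $target_pid has ended"
fi
```

If a process ignores the default signal, kill -9 PID forces it to stop, but the process then has no chance to save its state.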