Data Science Workshop (DSW) allows you to mount datasets or Object Storage Service (OSS) paths to expand the storage space of DSW instances, persist data, and share data. This topic describes how to mount datasets or OSS paths in DSW.
Background information
Public and dedicated resource groups have limited storage capacity and do not support persistent storage. To expand storage and ensure data persistence, you can mount an Apsara File Storage NAS (NAS), OSS, or Cloud Parallel File Storage (CPFS) dataset, or an OSS path, to a specified path of a DSW instance.
DSW instances of a public resource group use a free disk with limited capacity, which is cleared 15 days after the instance is stopped or deleted.
DSW instances of a dedicated resource group store data on the system disk, which is temporary storage that is cleared when the instance is stopped or deleted.
You can mount storage at startup or dynamically:
Mount at startup: When configuring an instance, set the Dataset or Mount Settings parameters. Restart the instance for the changes to take effect.
Dynamic mount: Use the SDK for PAI to configure mounting within the DSW instance without the need to restart the instance.
Additionally, you can configure mounting by setting the Jindo parameter of the dataset in scenarios that require high performance, such as quick read/write, incremental read/write, consistent read/write, and read-only.
JindoData is a suite developed by the Alibaba Cloud big data team for storage acceleration of data lake systems. For more information, see Overview.
Limits
You cannot mount multiple datasets to the same path.
We recommend that you do not frequently perform write operations on the path to which an OSS path is mounted.
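For example, one way to follow this recommendation is to write frequently updated files, such as logs or checkpoints, to a local path and copy the finished files to the OSS mount path in a single operation. The following is a minimal sketch; the local directory and file names are hypothetical, and /mnt/data_oss is the OSS mount path used later in this topic.
import shutil
from pathlib import Path

# Hypothetical local working directory for frequently updated files.
local_dir = Path("/tmp/checkpoints")
local_dir.mkdir(parents=True, exist_ok=True)

# Write intermediate files locally to avoid frequent writes to the OSS mount.
checkpoint = local_dir / "model_epoch_10.ckpt"
checkpoint.write_bytes(b"...")  # Placeholder for real checkpoint data.

# Copy the finished file to the OSS mount path in one operation.
oss_dir = Path("/mnt/data_oss/checkpoints")
oss_dir.mkdir(parents=True, exist_ok=True)
shutil.copy2(checkpoint, oss_dir / checkpoint.name)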
Configure mounting
Mount at startup
When configuring an instance, set the Dataset or Mount Settings parameters. Restart the instance for the changes to take effect.
Mount a dataset
Compared to mounting an OSS path, mounting a dataset allows for version management and dataset acceleration. Perform the following steps to mount a dataset:
Create a dataset.
In the PAI console, choose AI Asset Management > Datasets. Create a custom or public dataset. For more information, see Create and manage datasets.
Mount the dataset.
Create an instance or modify the settings of an existing instance. On the Create Instance or Change Settings page, click Add next to Mount Settings. Then, select the dataset and specify Mount Path. Mount Mode is left empty by default. For more information, see Mount modes.
Note the following items if you mount a custom dataset:
If you use a CPFS dataset, you must configure a virtual private cloud (VPC) for the instance. The VPC you select must be the same as the VPC where the CPFS dataset resides. Otherwise, the DSW instance may fail to be created.
If you use a NAS dataset, you must configure network settings and select a security group for the instance.
If you select a dedicated resource group, NAS provides better support for the Filesystem in Userspace (FUSE) interface than OSS. Therefore, the first dataset that you add must be of the NAS type, and its mount path must be set to the default DSW working directory /home/admin/workspace.
Mount an OSS path
Create an OSS bucket.
Activate OSS and create a bucket.
Note: The region where the bucket resides must be the same as the region where PAI resides. You cannot change the region of a bucket after the bucket is created.
Mount an OSS path.
Create an instance or modify the settings of an existing instance. On the Create Instance or Change Settings page, click Add next to Mount Settings. Select the created OSS bucket and specify Mount Path. Mount Mode is left empty by default. For more information, see Mount modes.
Dynamic mount
Use the SDK for PAI to configure mounting within the DSW instance without the need to restart the instance.
Make preparations
Install the PAI SDK for Python in the terminal of a DSW instance. The version of Python must be 3.8 or later.
python -m pip install "pai>=0.4.11"
Use one of the following methods to configure an AccessKey pair for PAI SDK for Python to access PAI:
Method 1: Configure the default role or a custom RAM role for the DSW instance. Go to the Change Settings page of the instance, click Expand, and select a RAM role. For more information, see Configure RAM roles for a DSW instance.
Method 2: Manually configure an AccessKey pair by using the CLI provided by PAI SDK for Python. Run the following command to configure the parameters. For more information, see Install and configure PAI SDK for Python.
python -m pai.toolkit.config
Sample code
The dynamic mount feature allows you to mount a specific OSS bucket directory to a DSW instance without the need to restart the DSW instance. Sample code:
Mount to the default path.
The data is mounted to the default mount path of the instance.
from pai.dsw import mount

# Mount an OSS path.
mount_point = mount("oss://<YourBucketName>/Path/Data/Directory/")

# Mount a dataset. Specify the ID of the dataset.
# mount_point = mount("d-m7rsmu350********")
Mount to a specific path.
The dynamic mount feature requires data to be mounted to a specific path, or a subpath of that path, in the container. You can obtain the default dynamic mount path by using the API provided by PAI SDK for Python.
import os

from pai.dsw import mount, default_dynamic_mount_path

# Obtain the default dynamic mount path of the instance.
default_path = default_dynamic_mount_path()
# Join the target subpath under the default dynamic mount path.
mount_point = mount(
    "oss://<YourBucketName>/Path/Data/Directory",
    mount_point=os.path.join(default_path, "tmp/output/model"),
)
View all mounted data configurations in the instance.
from pai.dsw import list_dataset_configs

# View all mounted data configurations.
print(list_dataset_configs())
Unmount mounted data.
from pai.dsw import mount, unmount

mount_point = mount("oss://<YourBucketName>/Path/Data/Directory/")

# Specify the mounted path, which is the MountPath returned by list_dataset_configs.
# Wait a few seconds for the unmounting to take effect.
unmount(mount_point)
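If you created multiple dynamic mounts, you can combine list_dataset_configs with unmount to release them programmatically. The following is a minimal sketch; it assumes that list_dataset_configs returns an iterable of mappings that contain the MountPath field mentioned in the comment above. The exact return structure is not documented here, so verify it by printing the configurations first.
from pai.dsw import list_dataset_configs, unmount

# Unmount every dynamically mounted path. The MountPath field name is an
# assumption based on the comment in the previous example; inspect the
# output of list_dataset_configs() to confirm the actual structure.
for config in list_dataset_configs():
    unmount(config["MountPath"])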
Mount modes
This section provides suggestions on how to configure JindoFuse in specific scenarios. The suggested configurations may not be optimal in all scenarios. For more information, see User guide of JindoFuse.
Only OSS-type custom datasets support mount modes.
Quick Read/Write: ensures quick reads and writes. However, data inconsistency may occur during concurrent reads or writes. You can mount training data and models to the mount path of this mode. We recommend that you do not use the mount path of this mode as the working directory.
{ "fs.oss.download.thread.concurrency": "Twice the number of CPU cores", "fs.oss.upload.thread.concurrency": "Twice the number of CPU cores", "fs.jindo.args": "-oattr_timeout=3 -oentry_timeout=0 -onegative_timeout=0 -oauto_cache -ono_symlink" }
Incremental Read/Write: ensures data consistency during incremental writes. If original data is overwritten, data inconsistency may occur. The read speed is slightly slower. You can use this mode to save model weight files during training.
{ "fs.oss.upload.thread.concurrency": "Twice the number of CPU cores", "fs.jindo.args": "-oattr_timeout=3 -oentry_timeout=0 -onegative_timeout=0 -oauto_cache -ono_symlink" }
Consistent Read/Write: ensures data consistency during concurrent reads or writes and is suitable for scenarios that require high data consistency and do not require quick reads. You can use this mode to save the code of your projects.
{ "fs.jindo.args": "-oattr_timeout=0 -oentry_timeout=0 -onegative_timeout=0 -oauto_cache -ono_symlink" }
Read-only: allows only reads. You can use this mode to mount public datasets.
{ "fs.oss.download.thread.concurrency": "Twice the number of CPU cores", "fs.jindo.args": "-oro -oattr_timeout=7200 -oentry_timeout=7200 -onegative_timeout=7200 -okernel_cache -ono_symlink" }
View mount configurations
In the instance list on the Data Science Workshop (DSW) page, click Open in the Actions column of the DSW instance that you want to manage.
In the top navigation bar of the Data Science Workshop page, click the Terminal tab. Follow the instructions to open the terminal.
On the Terminal page, run the following commands to check whether the NAS and OSS datasets are mounted:
# View all mounted datasets.
mount
# Query the mount path of a NAS dataset.
mount | grep nas
# Query the mount path of an OSS dataset.
mount | grep oss
If the mount paths appear in the output, the datasets are mounted.
NAS datasets are mounted to the /mnt/data_nas, /mnt/workspace, and /home/admin/workspace paths. /mnt/data_nas indicates the mount path that you specified when you created the DSW instance. The other two paths are the default working directories of DSW provided for your first NAS dataset. As long as your NAS resources and server work as expected, your data and code persist.
The OSS dataset is mounted to the /mnt/data_oss path.
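To confirm that data written to a NAS mount path persists, you can write a test file and read it back. The following is a minimal sketch that assumes the /mnt/data_nas mount path described above; the file name is hypothetical.
from pathlib import Path

# Write a small test file to the NAS mount path and read it back.
test_file = Path("/mnt/data_nas/mount_check.txt")  # Hypothetical file name.
test_file.write_text("mount check\n")
print(test_file.read_text())  # Expected output: mount check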