Cloud Parallel File Storage: Data flow overview

Last Updated: Dec 16, 2024

To enable high-speed data transmission between a CPFS Intelligent Computing Edition file system and an OSS Bucket, you must create a data flow and a data flow task.

Feature description

CPFS Intelligent Computing Edition supports the following data flow features:

  • Account-level data flow

    Supports data flows between the file system and OSS Buckets that belong to the same account or to a different account.

  • Directory-level data flow

    You can create a data flow to establish a mapping from any subdirectory in the CPFS Intelligent Computing Edition file system to any prefix in the OSS Bucket, enabling finer-grained access control and more flexible data transmission.

  • Data import and export

    Supports data import and export between the CPFS Intelligent Computing Edition file system and OSS by creating batch tasks or stream tasks. Batch tasks are suitable for preloading datasets before a computing task starts; stream tasks are suitable for scenarios in which a model's Checkpoint files are continuously written back and preloaded during training. If a task fails, you can check the failure reason in the task report.

    Warning

    When files are exported, CPFS Intelligent Computing Edition writes the file modification timestamp to custom metadata on the OSS object, named x-oss-meta-alihbr-sync-mtime. Do not delete or modify this metadata; otherwise, the file modification timestamp recorded in the file system will be incorrect.
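
If you want to confirm that an exported file carried its modification time into OSS, the following minimal sketch reads that metadata with the oss2 Python SDK. The endpoint, bucket name, object key, and credentials below are placeholders for illustration.

```python
import oss2

# Placeholder credentials and location; replace with your own values.
auth = oss2.Auth('<access_key_id>', '<access_key_secret>')
bucket = oss2.Bucket(auth, 'https://oss-cn-hangzhou.aliyuncs.com', '<bucket-name>')

# head_object returns the object's response headers, which include
# custom metadata such as the timestamp written by CPFS on export.
meta = bucket.head_object('datasets/train/part-00001')  # placeholder key
mtime = meta.headers.get('x-oss-meta-alihbr-sync-mtime')
print('CPFS sync mtime:', mtime)  # read it, but never modify or delete it
```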

Limits

  • Data flow

    • CPFS Intelligent Computing Edition version 2.4.0 and later supports same-account data flows; version 2.6.0 and later also supports cross-account data flows.

    • A single CPFS Intelligent Computing Edition file system supports up to 10 data flows.

    • A file path in the CPFS Intelligent Computing Edition file system can only be linked to one OSS Bucket.

    • Data flows cannot be created with OSS Buckets in other regions; the OSS Bucket must be in the same region as the CPFS Intelligent Computing Edition file system.

  • Limits of data flows on file systems

    • Non-empty directories in file system paths associated with data flows cannot be renamed; otherwise, a Permission Denied or Directory not empty error occurs.

    • Use special characters in directory and file names with caution. Only uppercase and lowercase letters, digits, exclamation marks (!), hyphens (-), underscores (_), half-width periods (.), asterisks (*), and half-width parentheses (()) are supported. Double half-width periods (..), backslashes (\), and forward slashes (/) are not supported. For a quick way to check names against these rules, see the sketch after this list.

    • Overly long paths are not supported: the maximum path length for data flows is 1023 characters.

  • Data flow task limits

    • Stream tasks are supported only by CPFS Intelligent Computing Edition version 2.6.0 and later, and they can be used only through the OpenAPI.

    • A maximum of 4 batch tasks can run simultaneously under a single data flow, with no limit on stream tasks.

    • Import limits

      • Symlink files are converted into regular files that contain data when they are imported into CPFS Intelligent Computing Edition; the symlink information is lost.

      • If multiple versions exist in the OSS Bucket, only the latest version is copied.

      • File names or subdirectory names longer than 255 bytes are not supported.

    • Export limits

      • When Symlink files are synchronized to OSS, the files that they point to are not synchronized; the symlinks become empty objects without data.

      • Files of the Hardlink type are synchronized to OSS as regular files.

      • Files of the Socket, Device, and Pipe types become empty objects without data when they are exported to the OSS Bucket.

      • Directory paths longer than 1023 characters are not supported.
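
Because the naming and length rules above are easy to violate in automated pipelines, the following self-contained Python sketch checks a path against the supported character set, the 1023-character path limit, and the 255-byte name limit from the import limits. It is an illustrative helper, not an official tool; the function name check_dataflow_path is made up here.

```python
import re

# Characters the limits above list as supported in directory and file names:
# letters, digits, !, -, _, ., *, and parentheses.
_NAME_OK = re.compile(r'^[A-Za-z0-9!\-_.*()]+$')

def check_dataflow_path(path: str) -> list[str]:
    """Return a list of rule violations for a CPFS data flow path."""
    problems = []
    if len(path) > 1023:
        problems.append('path longer than 1023 characters')
    for name in path.strip('/').split('/'):
        if '..' in name:
            problems.append(f'{name!r}: double period (..) is not supported')
        elif '\\' in name:
            problems.append(f'{name!r}: backslash (\\) is not supported')
        elif not _NAME_OK.match(name):
            problems.append(f'{name!r}: contains unsupported characters')
        if len(name.encode('utf-8')) > 255:
            problems.append(f'{name!r}: name longer than 255 bytes')
    return problems

print(check_dataflow_path('checkpoints/epoch-01/model(1).ckpt'))  # []
print(check_dataflow_path('data/bad name/a..b'))  # two violations
```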

Usage flow

  1. Create a data flow.

    For specific operations, see Create same-account data flow or Create cross-account data flow.

  2. Create batch tasks or stream tasks.

    For specific operations, see Manage data flow tasks or Best practices for data flow stream tasks. A hedged SDK sketch of both steps follows this procedure.
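
For orientation, here is a minimal Python sketch of the two steps using the Alibaba Cloud NAS SDK (alibabacloud_nas20170626). The CreateDataFlow and CreateDataFlowTask operations exist in the NAS OpenAPI, but the request field names, the placeholder IDs and paths, and the TaskAction/DataType values shown below are assumptions modeled on the API's naming conventions; confirm the exact parameters in the API reference before use.

```python
from alibabacloud_nas20170626.client import Client
from alibabacloud_nas20170626 import models as nas_models
from alibabacloud_tea_openapi import models as open_api_models

client = Client(open_api_models.Config(
    access_key_id='<access_key_id>',          # placeholder credentials
    access_key_secret='<access_key_secret>',
    endpoint='nas.cn-hangzhou.aliyuncs.com',  # same region as the file system
))

# Step 1: create a directory-level data flow that maps a CPFS subdirectory
# to a prefix in an OSS Bucket (field names are assumptions).
flow = client.create_data_flow(nas_models.CreateDataFlowRequest(
    file_system_id='<cpfs-file-system-id>',
    source_storage='oss://<bucket-name>',
    file_system_path='/datasets/imagenet/',   # CPFS subdirectory
    source_storage_path='/imagenet/',         # OSS prefix
))
data_flow_id = flow.body.data_flow_id

# Step 2: create a batch import task to preload the dataset before training
# (the TaskAction and DataType values are assumptions).
client.create_data_flow_task(nas_models.CreateDataFlowTaskRequest(
    file_system_id='<cpfs-file-system-id>',
    data_flow_id=data_flow_id,
    task_action='Import',
    data_type='MetaAndData',
))
```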

Performance metrics

  • Data import

    • Throughput for GB-scale and larger files:

      • The maximum throughput for importing a single file is 5 GB/s.

      • The maximum throughput for importing multiple files is 100 GB/s.

    • Number of MB-level files processed per second: 1000 for single-directory and multi-directory import.

  • Data export

    • Throughput for GB-scale and larger files:

      • The maximum throughput for exporting a single file is 5 GB/s.

      • The maximum throughput for exporting multiple files is 100 GB/s.

    • Number of MB-level files processed per second: 1200 for single-directory and multi-directory export.

Note

The actual throughput is limited by the bandwidth of OSS and the throughput capability of the CPFS Intelligent Computing Edition file system, and is also affected by file size, number of files, and data volume. For information about OSS bandwidth, see Bandwidth; for the throughput capability of CPFS Intelligent Computing Edition, see Product specifications.
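
The two metrics interact: large files are gated by throughput, while many small files are gated by the per-second file count. As an illustrative, ideal-case calculation, importing a single 2 TB file at 5 GB/s takes about 2048 / 5 ≈ 410 seconds, whereas importing one million 1 MB files at 1000 files per second takes about 1,000,000 / 1000 = 1000 seconds, even though that is only about half the data volume.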

Billing

The data flow feature of CPFS Intelligent Computing Edition is currently in public preview and is free to use.