OSS-HDFS (JindoFS) is a cloud-native data lake storage feature. OSS-HDFS provides centralized metadata management capabilities and is fully compatible with the Hadoop Distributed File System (HDFS) API. You can use OSS-HDFS to manage data in data lake-based computing scenarios in the big data and AI fields.
Usage notes
After OSS-HDFS is enabled for a bucket, data written by using OSS-HDFS is stored in the .dlsdata/ directory of the bucket. To ensure the availability of the OSS-HDFS service and prevent data loss, do not perform write operations on the .dlsdata/ directory or on objects in the directory by using methods that are not supported by OSS-HDFS. For example, do not rename the directory, delete the directory, or delete objects in the directory.
After you enable OSS-HDFS, risks such as data loss, data contamination, and data inaccessibility may arise if you use other Object Storage Service (OSS) features to write data to the .dlsdata/ directory. For more information, see Usage notes.
Billing rules
Metadata management fees
Metadata management is a billable item of OSS-HDFS. However, you are not currently charged for this item.
Data storage fees
When you use OSS-HDFS, data blocks are stored in Object Storage Service (OSS). Therefore, the billing methods of OSS apply to data blocks in OSS-HDFS. For more information, see Billing overview.
Benefits
You can use OSS-HDFS without modifying your existing Hadoop and Spark applications. After simple configuration, you can access and manage data in OSS-HDFS in much the same way as you manage data in HDFS, while taking advantage of OSS characteristics such as unlimited storage space, elastic scalability, and high security, reliability, and availability.
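To illustrate this compatibility, the following minimal Java sketch accesses an OSS-HDFS bucket through the standard Hadoop FileSystem API. The bucket name is a placeholder, and the sketch assumes that JindoSDK is on the classpath and that the OSS-HDFS endpoint and credentials are already configured (for example, in core-site.xml).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.net.URI;

public class OssHdfsQuickStart {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The oss:// scheme is served by the OSS-HDFS (JindoSDK) implementation.
        // "examplebucket" is a placeholder; endpoint and credentials are assumed
        // to be configured outside this program.
        FileSystem fs = FileSystem.get(URI.create("oss://examplebucket/"), conf);

        // Standard HDFS-style operations work unchanged.
        Path dir = new Path("oss://examplebucket/warehouse/demo");
        fs.mkdirs(dir);
        try (FSDataOutputStream out = fs.create(new Path(dir, "hello.txt"))) {
            out.writeBytes("hello OSS-HDFS\n");
        }
        for (FileStatus status : fs.listStatus(dir)) {
            System.out.println(status.getPath() + " " + status.getLen());
        }
        fs.close();
    }
}
```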
OSS-HDFS serves as the basis of cloud-native data lakes. You can use OSS-HDFS to analyze exabytes of data, manage hundreds of millions of objects, and obtain terabytes of throughput. OSS-HDFS provides both a flat namespace and a hierarchical namespace to meet your big data storage requirements. The hierarchical namespace allows you to manage objects in a hierarchical directory structure. OSS-HDFS automatically converts the storage structure between the two namespaces so that object metadata is managed in a centralized manner. Hadoop users can access objects in OSS-HDFS without copying objects or converting their format, which improves job performance and reduces maintenance costs.
Features
| Feature | Description | References |
| --- | --- | --- |
| Snapshot (trial) | You can use snapshots created by using the Snapshot command to restore accidentally deleted data or to back up data so that services can continue when an error occurs. The snapshot feature of OSS-HDFS works in the same manner as the snapshot feature of HDFS and supports directory-level operations. | |
| RootPolicy | You can use RootPolicy to configure a custom prefix for OSS-HDFS. This way, jobs can run on OSS-HDFS without modifying the original access prefix. | |
| ProxyUser | The ProxyUser command authorizes a user to perform operations, such as accessing sensitive data, on behalf of other users. | |
| UserGroupsMapping | The UserGroupsMapping command manages mappings between users and user groups. | |
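Because the table above notes that OSS-HDFS snapshots work in the same manner as HDFS snapshots, a hedged sketch using the generic Hadoop snapshot API looks like the following. The bucket, directory, and snapshot name are placeholders, and the sketch assumes that OSS-HDFS exposes the standard Hadoop snapshot interface and that snapshots have been allowed on the directory.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.net.URI;

public class SnapshotSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(
                URI.create("oss://examplebucket/"), new Configuration());
        Path dir = new Path("oss://examplebucket/warehouse/demo");

        // Assumes snapshots were already allowed on this directory by an
        // administrator (an admin-side step outside this generic API).
        // createSnapshot returns the path of the new snapshot.
        Path snapshot = fs.createSnapshot(dir, "before-cleanup");
        System.out.println("Created snapshot: " + snapshot);

        // A deleted file can later be recovered by copying it back from the
        // snapshot path. Remove the snapshot explicitly when it is no longer needed.
        fs.deleteSnapshot(dir, "before-cleanup");
        fs.close();
    }
}
```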
Scenarios
OSS-HDFS is suitable for computing scenarios in the big data and AI fields. You can use OSS-HDFS in the following scenarios:
Offline data warehousing with Hive and Spark
OSS-HDFS supports operations on files and directories and allows you to manage permissions on files and directories. OSS-HDFS also supports atomic directory operations and millisecond-level rename operations. In addition, OSS-HDFS supports features such as setting file times by using setTimes, extended attributes (XAttrs), access control lists (ACLs), and local cache acceleration. This makes OSS-HDFS suitable for offline data warehousing with Hive and Spark. When you use extract, transform, and load (ETL) jobs to process data, OSS-HDFS provides better performance than OSS Standard buckets.
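The setTimes and XAttr features mentioned above are part of the standard Hadoop FileSystem API, so a minimal sketch looks like the following. The bucket and file paths are placeholders, and the sketch assumes the same JindoSDK setup as in the earlier example.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.net.URI;
import java.nio.charset.StandardCharsets;

public class MetadataSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(
                URI.create("oss://examplebucket/"), new Configuration());
        Path file = new Path("oss://examplebucket/warehouse/demo/hello.txt");

        // setTimes updates the modification and access time in milliseconds
        // since the epoch; pass -1 to leave a value unchanged.
        fs.setTimes(file, System.currentTimeMillis(), -1);

        // Extended attributes (XAttrs) attach small pieces of user-defined
        // metadata to a file. The attribute name used here is hypothetical.
        fs.setXAttr(file, "user.owner-team",
                "data-platform".getBytes(StandardCharsets.UTF_8));
        byte[] value = fs.getXAttr(file, "user.owner-team");
        System.out.println(new String(value, StandardCharsets.UTF_8));
        fs.close();
    }
}
```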
OLAP
OSS-HDFS supports basic file operations, such as append, truncate, flush, and pwrite, and provides POSIX support by using JindoFuse. This way, when you use ClickHouse for online analytical processing (OLAP), you can replace local disks with OSS-HDFS to decouple storage from computing. The caching system of OSS-HDFS helps reduce access latency and improve performance at low costs.
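The append, flush, and truncate operations mentioned above map directly onto the Hadoop FileSystem API, as in the following sketch. The bucket and file path are placeholders, and the sketch assumes the target file already exists.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.net.URI;

public class AppendTruncateSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(
                URI.create("oss://examplebucket/"), new Configuration());
        Path log = new Path("oss://examplebucket/olap/events.log");

        // Append to an existing file, then flush so the data becomes visible
        // to readers without closing the stream.
        try (FSDataOutputStream out = fs.append(log)) {
            out.writeBytes("new event\n");
            out.hflush();
        }

        // truncate cuts the file to the given length; it returns true if the
        // operation completed immediately.
        boolean done = fs.truncate(log, 0);
        System.out.println("Truncated immediately: " + done);
        fs.close();
    }
}
```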
Decoupling of storage from computing for HBase
OSS-HDFS supports operations on files and directories and flush operations. You can use OSS-HDFS instead of HDFS to decouple storage from computing for HBase. Compared with the combination of HBase and OSS Standard buckets, in which HDFS is still required to store write-ahead log (WAL) files, the combination of HBase and OSS-HDFS provides a more streamlined architecture. For more information, see Use OSS-HDFS as the storage backend of HBase.
Real-time computing
OSS-HDFS supports flush and truncate operations. You can use OSS-HDFS instead of HDFS to store sink data and checkpoints in Flink real-time computing scenarios.
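As a rough illustration of the checkpoint use case, the following sketch points Flink checkpoint storage at an oss:// path. It assumes Flink 1.13 or later (for setCheckpointStorage) and that the OSS-HDFS (JindoSDK) file system is available to the Flink runtime; the bucket, path, and interval are placeholders.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class FlinkCheckpointSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Take a checkpoint every 60 seconds and store checkpoint data in
        // OSS-HDFS instead of HDFS.
        env.enableCheckpointing(60_000);
        env.getCheckpointConfig()
                .setCheckpointStorage("oss://examplebucket/flink/checkpoints");

        // ... define sources, transformations, and sinks here, then call
        // env.execute("job-with-oss-hdfs-checkpoints");
    }
}
```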
Data migration
As a cloud-native data lake storage service, OSS-HDFS allows you to migrate data from on-premises HDFS clusters to Alibaba Cloud, preserves the experience of HDFS users, and provides scalable, cost-effective storage. You can use Jindo DistCp to migrate data from HDFS to OSS-HDFS. During migration, HDFS checksums can be used to verify data integrity.
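The checksum-based integrity check mentioned above can be pictured with the generic Hadoop checksum API. This is an illustrative sketch, not the Jindo DistCp implementation: the host name, port, bucket, and paths are placeholders, and comparable checksum algorithms must be available on both file systems for the comparison to be meaningful (getFileChecksum may return null if a file system cannot provide one).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.net.URI;

public class ChecksumVerifySketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path src = new Path("hdfs://namenode:8020/data/part-00000");
        Path dst = new Path("oss://examplebucket/data/part-00000");

        FileSystem srcFs = FileSystem.get(URI.create(src.toString()), conf);
        FileSystem dstFs = FileSystem.get(URI.create(dst.toString()), conf);

        // Compare the source and destination checksums after migration.
        FileChecksum srcSum = srcFs.getFileChecksum(src);
        FileChecksum dstSum = dstFs.getFileChecksum(dst);
        System.out.println("Match: "
                + (srcSum != null && srcSum.equals(dstSum)));
    }
}
```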
Supported engines
| Ecosystem | Engine/Platform | References |
| --- | --- | --- |
| Open source ecosystem | Flink | |
| Open source ecosystem | Flume | |
| Open source ecosystem | Hadoop | |
| Open source ecosystem | HBase | |
| Open source ecosystem | Hive | |
| Open source ecosystem | Impala | |
| Open source ecosystem | Presto | |
| Open source ecosystem | Spark | |
| Alibaba Cloud ecosystem | EMR | |
| Alibaba Cloud ecosystem | Flink | |
| Alibaba Cloud ecosystem | Flume | Use Flume to synchronize data from an EMR Kafka cluster to a bucket with OSS-HDFS enabled |
| Alibaba Cloud ecosystem | HBase | Use OSS-HDFS as the underlying storage of HBase on an EMR cluster |
| Alibaba Cloud ecosystem | Hive | Use Hive on an EMR cluster to process data stored in OSS-HDFS |
| Alibaba Cloud ecosystem | Impala | Use Impala on an EMR cluster to query data stored in OSS-HDFS |
| Alibaba Cloud ecosystem | Presto | Use Presto on an EMR cluster to query data stored in OSS-HDFS |
| Alibaba Cloud ecosystem | Spark | Use Spark on an EMR cluster to process data stored in OSS-HDFS |
| Alibaba Cloud ecosystem | Sqoop | Use Apache Sqoop on an EMR cluster to read and write data stored in OSS-HDFS |