OSS-HDFS (JindoFS) is a cloud-native data lake storage feature. OSS-HDFS provides centralized metadata management capabilities and is fully compatible with the Hadoop Distributed File System (HDFS) API. You can use OSS-HDFS to manage data in data lake-based computing scenarios in the big data and AI fields.
Usage notes
After OSS-HDFS is enabled for a bucket, data written by using OSS-HDFS is stored in the .dlsdata/ directory of the bucket. To ensure the availability of the OSS-HDFS service and prevent data loss, do not perform write operations on the .dlsdata/ directory or on objects in the directory by using methods that are not supported by OSS-HDFS. For example, do not rename the directory, delete the directory, or delete objects in the directory.
After you enable OSS-HDFS, risks such as data loss, data contamination, and data inaccessibility may arise if you use other Object Storage Service (OSS) features to write data to the .dlsdata/ directory. For more information, see Usage notes.
Billing rules
Metadata management fees
Metadata management is a billable item of OSS-HDFS. However, you are not currently charged for this item.
Data storage fees
When you use OSS-HDFS, data blocks are stored in Object Storage Service (OSS). Therefore, the billing methods of OSS apply to data blocks in OSS-HDFS. For more information, see Billing overview.
Benefits
You can use OSS-HDFS without modifying your existing Hadoop and Spark applications. After simple configuration, you can access and manage data in OSS-HDFS in much the same way as you manage data in HDFS, while taking advantage of OSS characteristics such as unlimited storage space, elastic scalability, and high security, reliability, and availability.
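To illustrate this compatibility, the following minimal Java sketch accesses an OSS-HDFS bucket through the standard Hadoop FileSystem API. The bucket name is a placeholder, and the sketch assumes that JindoSDK is on the classpath and that the OSS-HDFS endpoint and credentials are already configured (for example, in core-site.xml).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.net.URI;

public class OssHdfsQuickStart {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The oss:// scheme is served by the OSS-HDFS (JindoSDK) implementation.
        // "examplebucket" is a placeholder; endpoint and credentials are assumed
        // to be configured outside this program.
        FileSystem fs = FileSystem.get(URI.create("oss://examplebucket/"), conf);

        // Standard HDFS-style operations work unchanged.
        Path dir = new Path("oss://examplebucket/warehouse/demo");
        fs.mkdirs(dir);
        try (FSDataOutputStream out = fs.create(new Path(dir, "hello.txt"))) {
            out.writeBytes("hello OSS-HDFS\n");
        }
        for (FileStatus status : fs.listStatus(dir)) {
            System.out.println(status.getPath() + " " + status.getLen());
        }
        fs.close();
    }
}
```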
OSS-HDFS serves as the basis of cloud-native data lakes. You can use OSS-HDFS to analyze exabytes of data, manage hundreds of millions of objects, and obtain terabytes of throughput. OSS-HDFS provides both a flat namespace and a hierarchical namespace to meet your big data storage requirements. The hierarchical namespace allows you to manage objects in a hierarchical directory structure. OSS-HDFS automatically converts the storage structure between the two namespaces so that object metadata is managed in a centralized manner. Hadoop users can access objects in OSS-HDFS without copying objects or converting their format, which improves job performance and reduces maintenance costs.
Features
| Feature | Description | References |
| --- | --- | --- |
| Snapshot (trial) | You can use snapshots created by using the Snapshot command to restore accidentally deleted data or to back up data so that services can continue when an error occurs. The snapshot feature of OSS-HDFS works in the same manner as the snapshot feature of HDFS and supports directory-level operations. | |
| RootPolicy | You can use RootPolicy to configure a custom prefix for OSS-HDFS. This way, jobs can run on OSS-HDFS without modifying the original access prefix. | |
| ProxyUser | The ProxyUser command authorizes a user to perform operations, such as accessing sensitive data, on behalf of other users. | |
| UserGroupsMapping | The UserGroupsMapping command manages mappings between users and user groups. | |
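Because the table above notes that OSS-HDFS snapshots work in the same manner as HDFS snapshots, a hedged sketch using the generic Hadoop snapshot API looks like the following. The bucket, directory, and snapshot name are placeholders, and the sketch assumes that OSS-HDFS exposes the standard Hadoop snapshot interface and that snapshots have been allowed on the directory.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.net.URI;

public class SnapshotSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(
                URI.create("oss://examplebucket/"), new Configuration());
        Path dir = new Path("oss://examplebucket/warehouse/demo");

        // Assumes snapshots were already allowed on this directory by an
        // administrator (an admin-side step outside this generic API).
        // createSnapshot returns the path of the new snapshot.
        Path snapshot = fs.createSnapshot(dir, "before-cleanup");
        System.out.println("Created snapshot: " + snapshot);

        // A deleted file can later be recovered by copying it back from the
        // snapshot path. Remove the snapshot explicitly when it is no longer needed.
        fs.deleteSnapshot(dir, "before-cleanup");
        fs.close();
    }
}
```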
Scenarios
OSS-HDFS is suitable for computing scenarios in the big data and AI fields. You can use OSS-HDFS in the following scenarios:
Offline data warehousing with Hive and Spark
OSS-HDFS supports operations on files and directories and allows you to manage permissions on files and directories. OSS-HDFS also supports atomic directory operations and millisecond-level rename operations. In addition, OSS-HDFS supports features such as setting file times by using setTimes, extended attributes (XAttrs), access control lists (ACLs), and local cache acceleration. This makes OSS-HDFS suitable for offline data warehousing with Hive and Spark. When you use extract, transform, and load (ETL) jobs to process data, OSS-HDFS provides better performance than OSS Standard buckets.
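The setTimes and XAttr features mentioned above are part of the standard Hadoop FileSystem API, so a minimal sketch looks like the following. The bucket and file paths are placeholders, and the sketch assumes the same JindoSDK setup as in the earlier example.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.net.URI;
import java.nio.charset.StandardCharsets;

public class MetadataSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(
                URI.create("oss://examplebucket/"), new Configuration());
        Path file = new Path("oss://examplebucket/warehouse/demo/hello.txt");

        // setTimes updates the modification and access time in milliseconds
        // since the epoch; pass -1 to leave a value unchanged.
        fs.setTimes(file, System.currentTimeMillis(), -1);

        // Extended attributes (XAttrs) attach small pieces of user-defined
        // metadata to a file. The attribute name used here is hypothetical.
        fs.setXAttr(file, "user.owner-team",
                "data-platform".getBytes(StandardCharsets.UTF_8));
        byte[] value = fs.getXAttr(file, "user.owner-team");
        System.out.println(new String(value, StandardCharsets.UTF_8));
        fs.close();
    }
}
```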
OLAP
OSS-HDFS supports basic file operations, such as append, truncate, flush, and pwrite, and provides POSIX support by using JindoFuse. This way, when you use ClickHouse for online analytical processing (OLAP), you can replace local disks with OSS-HDFS to decouple storage from computing. The caching system of OSS-HDFS helps reduce access latency and improve performance at low costs.
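The append, flush, and truncate operations mentioned above map directly onto the Hadoop FileSystem API, as in the following sketch. The bucket and file path are placeholders, and the sketch assumes the target file already exists.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.net.URI;

public class AppendTruncateSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(
                URI.create("oss://examplebucket/"), new Configuration());
        Path log = new Path("oss://examplebucket/olap/events.log");

        // Append to an existing file, then flush so the data becomes visible
        // to readers without closing the stream.
        try (FSDataOutputStream out = fs.append(log)) {
            out.writeBytes("new event\n");
            out.hflush();
        }

        // truncate cuts the file to the given length; it returns true if the
        // operation completed immediately.
        boolean done = fs.truncate(log, 0);
        System.out.println("Truncated immediately: " + done);
        fs.close();
    }
}
```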
Decoupling of storage from computing for HBase
OSS-HDFS supports operations on files and directories and flush operations. You can use OSS-HDFS instead of HDFS to decouple storage from computing for HBase. Compared with the combination of HBase and OSS Standard buckets, in which HDFS is still required to store write-ahead log (WAL) files, the combination of HBase and OSS-HDFS provides a more streamlined architecture. For more information, see Use OSS-HDFS as the storage backend of HBase.
Real-time computing
OSS-HDFS supports flush and truncate operations. You can use OSS-HDFS instead of HDFS to store sink data and checkpoints in Flink real-time computing scenarios.
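As a rough illustration of the checkpoint use case, the following sketch points Flink checkpoint storage at an oss:// path. It assumes Flink 1.13 or later (for setCheckpointStorage) and that the OSS-HDFS (JindoSDK) file system is available to the Flink runtime; the bucket, path, and interval are placeholders.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class FlinkCheckpointSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Take a checkpoint every 60 seconds and store checkpoint data in
        // OSS-HDFS instead of HDFS.
        env.enableCheckpointing(60_000);
        env.getCheckpointConfig()
                .setCheckpointStorage("oss://examplebucket/flink/checkpoints");

        // ... define sources, transformations, and sinks here, then call
        // env.execute("job-with-oss-hdfs-checkpoints");
    }
}
```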
Data migration
As a cloud-native data lake storage service, OSS-HDFS allows you to migrate data from on-premises HDFS clusters to Alibaba Cloud, preserves the experience of HDFS users, and provides scalable, cost-effective storage. You can use Jindo DistCp to migrate data from HDFS to OSS-HDFS. During migration, HDFS checksums can be used to verify data integrity.
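The checksum-based integrity check mentioned above can be pictured with the generic Hadoop checksum API. This is an illustrative sketch, not the Jindo DistCp implementation: the host name, port, bucket, and paths are placeholders, and comparable checksum algorithms must be available on both file systems for the comparison to be meaningful (getFileChecksum may return null if a file system cannot provide one).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.net.URI;

public class ChecksumVerifySketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path src = new Path("hdfs://namenode:8020/data/part-00000");
        Path dst = new Path("oss://examplebucket/data/part-00000");

        FileSystem srcFs = FileSystem.get(URI.create(src.toString()), conf);
        FileSystem dstFs = FileSystem.get(URI.create(dst.toString()), conf);

        // Compare the source and destination checksums after migration.
        FileChecksum srcSum = srcFs.getFileChecksum(src);
        FileChecksum dstSum = dstFs.getFileChecksum(dst);
        System.out.println("Match: "
                + (srcSum != null && srcSum.equals(dstSum)));
    }
}
```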
Supported engines
| Ecosystem | Engine/Platform | References |
| --- | --- | --- |
| Open source ecosystem | Flink | |
| Open source ecosystem | Flume | |
| Open source ecosystem | Hadoop | |
| Open source ecosystem | HBase | |
| Open source ecosystem | Hive | |
| Open source ecosystem | Impala | |
| Open source ecosystem | Presto | |
| Open source ecosystem | Spark | |
| Alibaba Cloud ecosystem | EMR | |
| Alibaba Cloud ecosystem | Flink | |
| Alibaba Cloud ecosystem | Flume | Use Flume to synchronize data from an EMR Kafka cluster to a bucket with OSS-HDFS enabled |
| Alibaba Cloud ecosystem | HBase | Use OSS-HDFS as the underlying storage of HBase on an EMR cluster |
| Alibaba Cloud ecosystem | Hive | Use Hive on an EMR cluster to process data stored in OSS-HDFS |
| Alibaba Cloud ecosystem | Impala | Use Impala on an EMR cluster to query data stored in OSS-HDFS |
| Alibaba Cloud ecosystem | Presto | Use Presto on an EMR cluster to query data stored in OSS-HDFS |
| Alibaba Cloud ecosystem | Spark | Use Spark on an EMR cluster to process data stored in OSS-HDFS |
| Alibaba Cloud ecosystem | Sqoop | Use Apache Sqoop on an EMR cluster to read and write data stored in OSS-HDFS |