Object Storage Service (OSS) provides data lake management to help you build and manage data lakes efficiently and to meet requirements for massive data storage, analysis, and migration. With OSS-HDFS at its core and the OSS accelerator feature alongside it, data lake management integrates seamlessly with the big data ecosystem. It is compatible with the Hadoop Distributed File System (HDFS) API, supports a hierarchical namespace, and optimizes data management in real-time computing scenarios. This significantly improves data analysis performance, reduces storage costs, and simplifies the migration of data from HDFS to the cloud.
OSS-HDFS
OSS-HDFS (JindoFS) is a cloud-native data lake storage feature. OSS-HDFS provides centralized metadata management capabilities and is fully compatible with the HDFS API. You can use OSS-HDFS to manage data in data lake-based computing scenarios in the big data and AI fields.
OSS-HDFS integrates with ecosystem tools such as Hadoop, Hive, Presto, and Spark from the open source ecosystem, and MaxCompute and Simple Log Service (SLS) from the Alibaba Cloud ecosystem. As a result, OSS-HDFS supports full lifecycle management that covers data storage and analysis without the need for additional development.
Scenarios
Offline data warehousing: OSS-HDFS supports operations on files and directories.
Online analytical processing (OLAP): OSS-HDFS supports basic file-related operations, such as append, truncate, and flush, which meets the requirements of decoupling storage from computing.
Decoupling of storage from computing for HBase: OSS-HDFS supports operations on files and directories and flush operations. You can use OSS-HDFS instead of HDFS to decouple storage from computing for HBase.
Real-time computing: OSS-HDFS supports flush operations and truncate operations. You can use OSS-HDFS instead of HDFS to store sinks and checkpoints in real-time computing scenarios of Apache Flink.
Data migration: As a cloud-native data lake storage service, OSS-HDFS lets you migrate data from HDFS to Alibaba Cloud, improves the experience of HDFS users, and provides scalable and cost-effective storage.
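The scenarios above rely on three file operations: append, flush, and truncate. The following minimal sketch shows what these operations mean, using a local file and the Python standard library only. It does not call OSS-HDFS or any HDFS client. The function name `demo_append_flush_truncate` and the file name `events.log` are hypothetical and chosen for illustration.

```python
import os
import tempfile


def demo_append_flush_truncate(path):
    """Exercise append, flush, and truncate on a local file (illustration only)."""
    with open(path, "ab") as f:
        # Append: add a record to the end of the file without rewriting it.
        f.write(b"record-1\n")
        # Flush: push buffered bytes to the operating system and then to disk,
        # so that the record survives if the writer fails later.
        f.flush()
        os.fsync(f.fileno())
        f.write(b"record-2\n")
        # Simulate a writer that stops in the middle of a record.
        f.write(b"partial")

    # Truncate: cut the file back to the last complete record, as a job
    # might do when it recovers from a checkpoint.
    valid_length = len(b"record-1\n") + len(b"record-2\n")
    os.truncate(path, valid_length)

    with open(path, "rb") as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    result = demo_append_flush_truncate(os.path.join(tmp, "events.log"))
```

After the run, `result` holds only the two complete records. A stream processing job that stores sinks and checkpoints needs the same guarantees from its storage layer, which is why the real-time computing scenario depends on flush and truncate support.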
Benefits
Compatibility: After a simple configuration, you can use OSS-HDFS to access and manage data in the same way as in HDFS, without changing existing Hadoop applications.
Automatic scaling: You can take advantage of OSS characteristics, such as unlimited storage capacity, elastic scalability, high security, reliability, and availability.
Hierarchical namespace: You can use the hierarchical namespace feature to manage objects in the hierarchical directory structure. OSS-HDFS automatically converts the storage structure between the flat namespace and the hierarchical namespace to help you manage object metadata in a centralized manner.
High performance: You can use OSS-HDFS to analyze exabytes of data, manage hundreds of millions of objects, and obtain terabyte-level throughput.
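The relationship between a flat namespace and a hierarchical namespace, mentioned in the benefits above, can be sketched as follows. This is an illustration of the concept only, not the way OSS-HDFS stores metadata. The helper names `build_tree` and `list_dir` and the sample object keys are hypothetical.

```python
def build_tree(keys):
    """Convert flat object keys into a nested dict.

    Directories map to dicts and files map to None.
    A key that ends with "/" is treated as an empty directory marker.
    """
    root = {}
    for key in keys:
        parts = [p for p in key.split("/") if p]
        if not parts:
            continue  # skip empty keys
        node = root
        for name in parts[:-1]:
            # Create intermediate directories on demand.
            node = node.setdefault(name, {})
        if key.endswith("/"):
            node.setdefault(parts[-1], {})
        else:
            node[parts[-1]] = None
    return root


def list_dir(tree, path):
    """Return the sorted entry names directly under a directory path."""
    node = tree
    for name in [p for p in path.split("/") if p]:
        node = node[name]
    return sorted(node)


flat_keys = [
    "warehouse/sales/part-0.parquet",
    "warehouse/sales/part-1.parquet",
    "warehouse/users/part-0.parquet",
    "tmp/",
]
tree = build_tree(flat_keys)
sales_files = list_dir(tree, "warehouse/sales")
```

In the flat view, listing a directory means scanning keys that share a prefix. In the hierarchical view, the directory is a node that can be looked up directly, which is the kind of centralized metadata management that the hierarchical namespace benefit describes.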
OSS accelerator
For business scenarios that require low latency and high throughput, such as AI, data warehousing, and big data analytics, accelerators cache hotspot files on high-performance Non-Volatile Memory Express (NVMe) SSDs to reduce data write latency and improve throughput. This significantly optimizes the performance of Realtime Compute jobs and helps you quickly deploy stable stream processing pipelines in the cloud.
Scenarios
Low-latency data sharing: The OSS accelerator feature is suitable for scenarios in which you want to quickly access uploaded data, such as uploading and analyzing the images of mobile applications.
Model inference: The OSS accelerator feature is suitable for scenarios in which you want to quickly load and switch the model objects to improve inference efficiency.
Big data analysis: The OSS accelerator feature is suitable for scenarios in which you query large datasets and the query range is uncertain. The accelerator reduces query latency in these scenarios.
Multi-level acceleration: The OSS accelerator feature can work with client-side caching to achieve multi-level acceleration and improve data access efficiency.
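The multi-level acceleration scenario above can be sketched as a lookup chain: a client-side cache first, then a shared accelerator cache, then the origin storage. The following code is a simplified model under stated assumptions, not the OSS accelerator implementation. The class name `MultiLevelReader` is hypothetical, plain dicts stand in for all three storage levels, and cache invalidation is omitted.

```python
class MultiLevelReader:
    """Read through a private client cache, a shared accelerator cache, and origin storage."""

    def __init__(self, origin, accelerator_cache):
        self.client_cache = {}                      # private to this client
        self.accelerator_cache = accelerator_cache  # shared between clients
        self.origin = origin                        # stands in for the bucket
        self.hits = {"client": 0, "accelerator": 0, "origin": 0}

    def read(self, key):
        # Level 1: the client-side cache is the fastest path.
        if key in self.client_cache:
            self.hits["client"] += 1
            return self.client_cache[key]
        # Level 2: the shared accelerator cache.
        if key in self.accelerator_cache:
            self.hits["accelerator"] += 1
            data = self.accelerator_cache[key]
        else:
            # Level 3: fall back to origin storage and warm the accelerator on read.
            self.hits["origin"] += 1
            data = self.origin[key]
            self.accelerator_cache[key] = data
        self.client_cache[key] = data
        return data


origin = {"model.bin": b"weights"}
shared_accelerator = {}
reader_a = MultiLevelReader(origin, shared_accelerator)
reader_b = MultiLevelReader(origin, shared_accelerator)

reader_a.read("model.bin")  # served from origin, warms the accelerator
reader_b.read("model.bin")  # served from the shared accelerator cache
reader_b.read("model.bin")  # served from reader_b's client-side cache
```

Only the first read reaches origin storage. The second client benefits from the warmed accelerator cache, and repeated reads by the same client stay local.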
Benefits
Low latency: The NVMe SSD media of an accelerator provide millisecond-level download latency. The accelerator delivers better performance for hot data queries in data warehouses and for inference model downloads.
High throughput: The bandwidth of an accelerator increases linearly with its cache capacity and provides burst throughput of up to hundreds of Gbit/s, which meets the requirements for fast reading of large amounts of data.
Automatic scaling: The accelerator supports cache capacities from 50 GB to 100 TB. You can scale the cache capacity of an accelerator up or down online based on your business requirements. This helps reduce costs.
Data consistency: The OSS accelerator feature ensures that you can read the latest versions of the objects.
Multiple warmup policies: The accelerator provides three warmup policies: warmup during read, synchronous warmup, and asynchronous warmup. These policies ensure that the engine can read the latest data.
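The capacity range and the linear relationship between bandwidth and cache capacity described above can be expressed as a small planning helper. This is a sketch under stated assumptions. The function names are hypothetical, 1 TB is assumed to equal 1,024 GB, and the bandwidth rate per TB is a caller-supplied placeholder because the text above does not state an actual rate.

```python
MIN_CAPACITY_GB = 50           # lower bound stated above
MAX_CAPACITY_GB = 100 * 1024   # 100 TB, assuming 1 TB = 1,024 GB


def validate_capacity_gb(capacity_gb):
    """Return the capacity if it is inside the supported range, else raise ValueError."""
    if not MIN_CAPACITY_GB <= capacity_gb <= MAX_CAPACITY_GB:
        raise ValueError(
            f"cache capacity must be between {MIN_CAPACITY_GB} GB "
            f"and {MAX_CAPACITY_GB} GB, got {capacity_gb} GB"
        )
    return capacity_gb


def estimate_bandwidth(capacity_gb, rate_per_tb):
    """Estimate bandwidth under a purely linear model.

    rate_per_tb is a hypothetical, caller-supplied value, not a published figure.
    The result is expressed in the same unit as rate_per_tb.
    """
    validate_capacity_gb(capacity_gb)
    return capacity_gb / 1024 * rate_per_tb


# With a placeholder rate of 2.0 units per TB, doubling capacity doubles bandwidth.
small = estimate_bandwidth(2048, rate_per_tb=2.0)
large = estimate_bandwidth(4096, rate_per_tb=2.0)
```

A helper like this makes it easy to reason about online scaling: pick the smallest capacity whose estimated bandwidth covers the peak read load, then scale down when the load drops.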