SmartData is a storage service for the E-MapReduce (EMR) Jindo engine. SmartData provides centralized storage, caching, computing optimization, and feature extension for EMR computing engines. SmartData includes JindoFileSystem (JindoFS), JindoTable, and related tool sets. This topic describes the updates in SmartData 3.0.0.
Optimization of JindoFS storage
- The stand-alone configuration of Namespace Service is improved. In stand-alone mode, metadata can be updated and asynchronously written to a Tablestore instance.
- The configuration of using a Tablestore instance as the metadata storage backend is removed. The Tablestore-based HA solution is no longer supported.
- File data can be stored in Archive mode in OSS to save costs.
- The JindoFS tiered storage commands Archive, Unarchive, and Status are provided. You can use these commands to archive or unarchive the data in a specified directory and to view the progress and status of archive operations.
- The ls2 command is provided to view file information.
- Offline export, analysis, and query of fsimage are supported.
- Cross-cluster access to JindoFS is implemented.
For more information about the tiered storage commands of JindoFS, see Use tiered storage commands of JindoFS.
Optimization of JindoFS caching
- The structure of disks used to cache data is optimized. Dependencies on the system disk are removed. Data disks are independent of each other. The operation of taking disk resources offline is enhanced.
- The cache service is improved. Node fault tolerance and node disconnection operations are enhanced.
- The policy for selecting the disk to which a cache block is written is optimized. Round-robin scheduling is used by default.
- Read and write processes are improved. Fault tolerance is enhanced.
- The JindoFS tiered storage commands Cache, Uncache, and Status are provided. You can use these commands to cache or preload the data in a specified directory and to view the cache progress and status.
- An issue that caused small files to occupy excessive cache space is fixed. Related metrics are now measured correctly.
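The round-robin disk selection mentioned in the list above can be illustrated with a minimal sketch. This is not the JindoFS implementation. The class name, method names, and disk paths are hypothetical and only show how cache blocks could be spread evenly across independent data disks while a disk is taken offline.

```python
from typing import List


class RoundRobinDiskSelector:
    """Hypothetical helper: picks data disks in rotation, one per cache block."""

    def __init__(self, disks: List[str]) -> None:
        if not disks:
            raise ValueError("at least one data disk is required")
        self._disks = list(disks)
        self._cursor = 0  # index of the disk that receives the next block

    def select(self) -> str:
        """Return the disk for the next cache block and advance the rotation."""
        disk = self._disks[self._cursor]
        self._cursor = (self._cursor + 1) % len(self._disks)
        return disk

    def take_offline(self, disk: str) -> None:
        """Remove a disk from the rotation without disturbing the order."""
        if len(self._disks) == 1:
            raise ValueError("cannot take the last data disk offline")
        index = self._disks.index(disk)
        self._disks.pop(index)
        # Keep the cursor on the same "next" disk after the removal.
        if index < self._cursor:
            self._cursor -= 1
        self._cursor %= len(self._disks)


if __name__ == "__main__":
    # Assumed mount points, for illustration only.
    selector = RoundRobinDiskSelector(["/mnt/disk1", "/mnt/disk2", "/mnt/disk3"])
    for block_id in range(5):
        print(block_id, selector.select())
```

Because each disk receives every n-th block, no single disk fills up first, and removing a disk only shortens the rotation.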
Optimization of JindoTable computing
- JindoTable provides the -optimize command. You can use this command to optimize Hive tables, for example, by merging small files in partitions.
- JindoTable provides the -archive, -unarchive, and -status commands. You can use these commands to archive or unarchive the data in specified tables or partitions and to view the progress and status of archive operations.
- JindoTable provides the -cache, -uncache, and -status commands. You can use these commands to cache or preload the data in specified tables or partitions and to view the cache progress and status.
- MaxCompute tables can be exported to JindoFS so that structured data can be preloaded before machine learning training.
For more information about JindoTable, see Use JindoTable.
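To make the idea of merging small files in a partition more concrete, the following sketch plans which files could be combined. It is an illustration only and does not reflect how the -optimize command works internally. The function name, the size thresholds, and the file names are assumptions.

```python
from typing import Dict, List


def plan_merges(file_sizes: Dict[str, int],
                small_threshold: int,
                target_size: int) -> List[List[str]]:
    """Group small files into batches whose total size stays within target_size.

    file_sizes maps a file name to its size in bytes. Files at or above
    small_threshold are left alone. Batches that contain a single file are
    dropped because merging them gains nothing.
    """
    if small_threshold > target_size:
        raise ValueError("small_threshold must not exceed target_size")

    small_files = sorted(name for name, size in file_sizes.items()
                         if size < small_threshold)

    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for name in small_files:
        size = file_sizes[name]
        # Start a new batch when adding this file would exceed the target.
        if current and current_size + size > target_size:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)

    return [batch for batch in batches if len(batch) > 1]


if __name__ == "__main__":
    # Hypothetical files in one partition, sizes in MB for readability.
    partition = {"part-0": 10, "part-1": 20, "part-2": 200,
                 "part-3": 30, "part-4": 50}
    print(plan_merges(partition, small_threshold=64, target_size=64))
```

Each returned batch would become one merged file, which reduces the number of files that a query on the partition has to open.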
JindoFS OSS feature extensions
- Integration of Ranger permissions on the client is supported. Operations that are performed on OSS by using JindoFS can be checked against Ranger permissions.
- Operation audit on the client is supported. Records of operations that are performed on OSS by using JindoFS are used for auditing.
- A Hadoop credential provider is supported. You can use common Hadoop methods to configure AccessKey pair information of OSS.
- A Flink connector is supported. OSS can serve as a source, sink, or checkpoint storage for Flink.
- A Lite edition of the JindoFS OSS SDK, such as a Hadoop connector, is provided for non-standard environments, such as a self-managed data center.
JindoManager system management
You can access a web UI to view the status and file statistics of the JindoFS storage system and the cache metrics of the JindoFS cache system.
JindoTools
The distribution mechanism of Jindo DistCp is improved. Different distribution packages are offered for EMR clusters and non-EMR clusters.
Jindo DistCp provides a Lite edition, which is suitable for non-standard environments, such as a self-managed data center.