SmartData is a storage service for the E-MapReduce (EMR) Jindo engine. SmartData provides centralized storage, caching, and computing optimization for EMR computing engines and extends storage features. SmartData is composed of JindoFS, JindoTable, and related tools. This topic describes the updates in SmartData 3.1.x.
Background information
Limits on SmartData 3.1.0:
- You can turn on the meta-cache switch to enable the cache mode of JindoFS to cache metadata. However, we recommend that you enable the cache mode only in training scenarios. If you use the cache mode in analysis scenarios, inappropriate parameter configurations may cause data synchronization failures between the current path and other OSS paths.
- The name of a namespace in JindoFS can contain only letters, digits, and hyphens (-).
- The size of a large file that you want to copy by using Jindo DistCp cannot exceed 78 GB.
- Although JindoFS in block storage mode supports the checksum feature, Jindo DistCp does not support this feature.
Documentation updates
Compared with SmartData 3.0.0, the following updates are made to the documentation of SmartData 3.1.0:
Feature updates
JindoFS-based storage optimization
- JindoFS can perform file checksums by using the MD5MD5CRC or COMPOSITE_CRC algorithm and inherits the checksum interfaces of open source HDFS. If the MD5MD5CRC algorithm is used, JindoFS provides an extended interface that accepts a block size as input, together with related shell commands. This facilitates file comparison between JindoFS and HDFS.
- JindoFS supports transparent file compression. It allows you to specify compression policies for directories, compress data blocks of files that are newly written to the directories, and then store them in OSS. This feature significantly reduces the storage space and read/write workloads for data that has a high compression ratio.
- JindoFS supports flush semantics for data writes. After you call the flush API to flush data in a file, the data is persisted to the current location and is readable.
- The following issue is resolved: if a file is stored in a deeply nested directory, the hadoop fs -ls -R command cannot finish because the executing thread remains in the waiting state.
- The hadoop fs -stat command is enhanced to return more information, such as atime and permissions.
- The HDFS client paths for the Jindo system can be modified. This reduces the workload of modifying paths when you migrate cluster data.
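The practical difference between the two checksum algorithms above can be illustrated conceptually: a digest built from per-block CRCs (the MD5MD5CRC family) depends on the block size used to group the data, while a single checksum over the whole stream does not. The sketch below is a simplified model of that distinction, not JindoFS's exact implementation.

```python
import hashlib
import zlib

def md5_of_block_crcs(data: bytes, block_size: int) -> str:
    """Simplified model of an MD5-over-per-block-CRC digest:
    CRC32 each block, then MD5 the concatenated CRC values.
    The result depends on block_size."""
    crcs = b"".join(
        zlib.crc32(data[i:i + block_size]).to_bytes(4, "big")
        for i in range(0, len(data), block_size)
    )
    return hashlib.md5(crcs).hexdigest()

def whole_stream_crc(data: bytes) -> int:
    """A single CRC over the full stream is independent of any block
    boundaries, which is the idea behind a composite CRC."""
    return zlib.crc32(data)

data = b"example payload " * 1024

# Same data, different block sizes -> different block-based digests.
print(md5_of_block_crcs(data, 4096) == md5_of_block_crcs(data, 8192))  # False
# The stream-level CRC is unaffected by how the data is blocked.
print(whole_stream_crc(data))
```

This is why the extended interface mentioned above accepts a block size: comparing MD5MD5CRC-style checksums across systems only works when both sides group the data identically.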
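The flush semantics described above resemble flushing on a local file system: once the flush call returns, a concurrent reader can see the flushed bytes even though the writer has not closed the file. A minimal local-file analogy (standard Python I/O, not the JindoFS API itself):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "stream.log")

writer = open(path, "wb")
writer.write(b"record-1\n")
writer.flush()  # after flush returns, the data is persisted and readable

# A second, independent handle already sees the flushed bytes,
# even though the writer has not closed the file yet.
with open(path, "rb") as reader:
    print(reader.read())  # b'record-1\n'

writer.close()
```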
JindoFS-based caching optimization
- JindoFS optimizes the caching of a large number of small files in machine learning training scenarios and improves the small-file caching efficiency and read performance.
- The cache command can be executed to preload directories of small files. This improves the preloading efficiency.
- Data caching can be automatically triggered. You can specify the directory whose status you want to track and configure a time interval for directory checks. The system then checks the directory at the specified interval and triggers data caching if any new file is found.
JindoTable-based computing optimization
- JindoTable Dump TF supports two-dimensional arrays.
- The Jindo mc dump command supports the GZIP compression format. The -c parameter can be used in the command.
JindoManager-based system management
JindoManager is added to manage the Jindo system. For example, you can use it to perform O&M operations and monitor service status. JindoManager provides a web UI where you can view the status of each Jindo service.
Optimization based on Jindo tools
- Jindo DistCp optimizes the Job Committer logic for processing small files. This reduces the number of requests sent to OSS and improves the performance of DistCp when a large number of small files exist.
- Jindo DistCp optimizes the policy of dividing files into batches and improves the overall copy performance.
Ecosystem support for JindoFS
- You can run Flink streaming jobs to write data into JindoFS and store the data in block storage or cache mode. Flink streaming jobs can automatically recover from failures. Exactly-once semantics can be achieved when the jobs are combined with data sources that support data replay, such as Kafka.
- Flink supports the entropy injection feature. If you run Flink streaming jobs to write data into OSS or JindoFS in block storage or cache mode, you can use the entropy injection feature to replace a specific part of the destination path with a random string. This improves data write efficiency.
- The JindoFS TensorFlow connector is provided to support the TensorFlow FileSystem and uses native I/O interfaces. The following versions are supported: TensorFlow 1.15, later 1.x versions, and 2.x versions later than 2.3.
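The entropy injection feature above can be pictured as a simple template substitution: a marker in the configured destination path is replaced with a random string per written file, spreading objects across many object-store prefixes. A generic sketch of that substitution (the `_entropy_` marker name follows Flink's S3 convention; the exact JindoFS configuration keys are not specified here):

```python
import secrets
import string

def inject_entropy(path_template: str, marker: str = "_entropy_",
                   length: int = 4) -> str:
    """Replace `marker` in the template with a random alphanumeric
    string so that writes spread across object-store prefixes."""
    alphabet = string.ascii_lowercase + string.digits
    entropy = "".join(secrets.choice(alphabet) for _ in range(length))
    return path_template.replace(marker, entropy)

print(inject_entropy("oss://bucket/_entropy_/checkpoints/job-1"))
# e.g. oss://bucket/x7f2/checkpoints/job-1
```

Because object stores often shard load by key prefix, randomizing the prefix avoids hot partitions and improves write throughput, which is the efficiency gain the feature targets.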