Release notes

JindoData is a suite developed by the Alibaba Cloud big data team for storage acceleration of data lake systems. JindoData provides end-to-end solutions for data lake systems of Alibaba Cloud and other vendors in big data and AI scenarios. This topic describes the features supported by JindoData.

Background information

JindoData is an upgrade of SmartData, the component that Alibaba Cloud originally developed for E-MapReduce (EMR). For more information, see Overview.

JindoData V4.6.x

Overview

JindoData V4.6.x supports smooth migration from Hadoop Distributed File System (HDFS) to OSS-HDFS, which simplifies the data migration process. JindoFS supports the Object Storage Service (OSS) inventory feature, which helps you better understand the distribution and ownership of data. JindoFS also significantly optimizes the performance of du and count operations by synchronizing existing data and incremental data. JindoSDK V4.6.x supports verification by file and by data block to improve the stability of data write links. JindoSDK also supports multiple access protocols, so you can use different protocols to access the path of a backend service.

JindoData V4.6.11

The following updates are made in JindoData V4.6.11:

The following issue is fixed in JindoSDK: JindoCommitter uses an earlier version of the MapReduce API to write data in the Aliyun EMR Hadoop 2.8.5 environment.
JindoTable is optimized. The number of days for which a table or partition in OSS is unarchived can be specified. For more information, see Use JindoTable to archive and unarchive tables or partitions in OSS.

JindoData V4.6.10

The following updates are made in JindoData V4.6.10:

The pread logic of JindoFS is optimized.
In JindoSDK, multiple tasks can be committed in parallel. This improves the task committing performance.
The path modification logic of JindoSDK is optimized.
The issue that occurs when objects are appended by using JindoFuse is fixed.

JindoData V4.6.8

The following updates are made in JindoData V4.6.8:

The period for which you want to retain data in the recycle bin can be specified for a client in JindoFS.
MALLOC_CONF can be used to optimize memory usage in JindoSDK.
Graceful shutdown can be performed on JindoFuse when an OSS-HDFS path is mounted.
Wildcards can be used to search for the cached and prefetched list of files in JindoFSx.
The issue that clearing the cache does not take effect is fixed in JindoFSx.

JindoData V4.6.7

The following updates are made in JindoData V4.6.7:

The graceful shutdown mechanism is supported in JindoFuse.
The export of logs is optimized in JindoFuse.
The issue that O_APPEND or O_TRUNC is not supported by JindoFuse when an OSS path is mounted is fixed.

JindoData V4.6.6

The parallelism of distjob or distcp tasks is optimized, and the maximum parallelism does not exceed the total number of tasks.
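
The cap described above amounts to a simple clamp. The following is a minimal sketch under stated assumptions, not the JindoData implementation: the function name `effective_parallelism` and its parameters are hypothetical and are not options of distjob or distcp.

```python
def effective_parallelism(requested: int, total_tasks: int) -> int:
    """Return the parallelism to use: never more workers than tasks.

    Hypothetical helper for illustration only.
    """
    if requested < 1:
        raise ValueError("requested parallelism must be at least 1")
    if total_tasks < 0:
        raise ValueError("total_tasks must not be negative")
    # Extra workers beyond the number of tasks would have nothing to do.
    return min(requested, total_tasks)


if __name__ == "__main__":
    print(effective_parallelism(requested=10, total_tasks=4))
    print(effective_parallelism(requested=3, total_tasks=100))
```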

JindoData V4.6.5

JindoData V4.6.5 includes a number of fixes and optimizations based on JindoData V4.6.4.

A ServiceLoader entry is added for the OSS scheme to point to JindoOssFileSystem.
The exception handling of the isDirectory() method is optimized. For a path that contains an asterisk (*), the isDirectory() method returns false instead of throwing an IllegalPath exception.
The Hadoop SDK is optimized. The following issue is fixed: ConcurrentModificationException may be reported in some scenarios when multiple Hadoop configuration items are modified in parallel.
The retry logic that the JindoMagicCommitter client uses to write data to OSS is optimized for cases in which exceptions occur on temporary directories or disks are damaged. This ensures that jobs write data as expected and prevents the following InvalidPart exception: One or more of the specified parts could not be found or the specified entity tag might not have matched the part's entity tag.

JindoData V4.6.4

JindoData V4.6.4 supports the multi-platform feature.

For information about the supported platforms, see Download JindoData.
On the Java platform, multiple jindo-core packages can be deployed to implement the multi-platform feature. By default, jindo-core supports platforms that run mainstream Linux operating systems. If you want to use jindo-core on platforms that run other operating systems, you must introduce the extension packages for those platforms.
The dependency packages that are supported by the multi-platform feature are uploaded to the Maven repository of JindoData. For example, if you want to access OSS by using a Maven project, you can refer to the dependency configuration in jindosdk_ide_hadoop.md.

For example, if you want to deploy a Hadoop cluster on a platform that runs the mainstream Linux operating system, you must add jindo-core-4.6.4.jar and jindo-sdk-4.6.4.jar to a specified classpath. If you want to run and debug code on a platform that runs the macOS operating system, you must install the jindo-core-4.6.4.jar and jindo-sdk-4.6.4.jar packages and introduce the jindo-core-macos-10_14-x86_64-4.6.4.jar extension package.

Note

You can go to the Download JindoData topic to download jindosdk-4.6.10-macos-10_14-x86_64.tar.gz, which contains the jindo-core-4.6.4.jar, jindo-sdk-4.6.4.jar, and jindo-core-macos-10_14-x86_64-4.6.4.jar packages that are required for this example.
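
The classpath requirements in this example can be summarized as a small lookup. The following is a minimal sketch: the function name `required_jars` and the platform keys `linux` and `macos` are illustrative assumptions, and only the JAR file names are taken from the example above. Other platforms are deliberately not covered.

```python
from typing import Dict, List

# Extension packages per platform, as named in the example above.
# The keys "linux" and "macos" are hypothetical labels for this sketch.
EXTENSION_JARS: Dict[str, List[str]] = {
    "linux": [],  # jindo-core supports mainstream Linux by default
    "macos": ["jindo-core-macos-10_14-x86_64-{version}.jar"],
}

BASE_JARS = ["jindo-core-{version}.jar", "jindo-sdk-{version}.jar"]


def required_jars(platform_key: str, version: str = "4.6.4") -> List[str]:
    """List the JAR files to place on the classpath for a platform."""
    if platform_key not in EXTENSION_JARS:
        raise ValueError("platform not covered by this sketch: " + platform_key)
    names = BASE_JARS + EXTENSION_JARS[platform_key]
    return [name.format(version=version) for name in names]


if __name__ == "__main__":
    for key in ("linux", "macos"):
        print(key, required_jars(key))
```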

JindoData V4.6.2

Many issues that exist in JindoData V4.6.1 are resolved in JindoData V4.6.2. The following issues are resolved for JindoFS:

JindoFS

The following issue is fixed: The service is stuck when data in hot and cold tiered storage is converted to the Standard storage class.
The following issue is fixed: The service is stuck due to an empty manifest file generated by tiered storage.
The execution of tiered-storage tasks is accelerated.
The logic of the RootPolicy feature is fixed.
The issue that the service fails due to occasionally failed setAcl operations is fixed.
The following low-probability issue is fixed: The storage capacity is exhausted by manifest files of databases.
The batch metadata import feature of the data migration service is fixed.

JindoData V4.6.1

JindoFS
- Redundant log output is reduced.
- The following issue is fixed: The size of a file is incorrectly calculated because the file is not closed when you export the metadata from the file.
JindoFSx
- Automatic cleanup is supported for temporary cache directories.
JindoSDK and tools
- The oversized output of JindoSDK is optimized.
- By default, paths can be optimized for du and count operations on the server side.
- The frequency of STS token updates is reduced to prevent throttling caused by frequent requests.
- The name of the RAM role contained in the URL for password-free access is changed to lowercase. This prevents the issue that the token for password-free access fails to be refreshed in Elastic Compute Service (ECS).
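
One common way to reduce the frequency of token updates, as the STS item above describes, is to cache a token and refresh it only shortly before it expires. The following is a minimal sketch of that general technique, not the JindoSDK implementation. The class name `CachedCredentialProvider`, the `fetch` callback, and the 300-second margin are all assumptions made for illustration.

```python
import time
from typing import Callable, Optional, Tuple


class CachedCredentialProvider:
    """Cache a token and refresh it only when it is close to expiry.

    `fetch` is a hypothetical callback that returns (token, ttl_seconds).
    """

    def __init__(
        self,
        fetch: Callable[[], Tuple[str, float]],
        refresh_margin_s: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._margin = refresh_margin_s
        self._clock = clock
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        now = self._clock()
        # Call the token service only if there is no token yet or the
        # cached token is within the refresh margin of its expiry.
        if self._token is None or now >= self._expires_at - self._margin:
            self._token, ttl = self._fetch()
            self._expires_at = now + ttl
        return self._token
```

Because repeated calls to `get()` reuse the cached token, the token service is contacted roughly once per token lifetime instead of once per request.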

JindoData V4.6.0

JindoFS
- Allows you to export metadata from the files of an OSS-HDFS bucket. You can better understand data distribution and perform secondary development by using this feature.
- Significantly optimizes the performance of du and count operations by synchronizing existing data and incremental data on the server side.
- Supports smooth data migration from HDFS to OSS-HDFS, which greatly simplifies the data migration process.
- Supports multiple access protocols. You can use different protocols to access the path of a backend service.
JindoFSx
- The following issue is fixed: When the JindoFSx client writes data to the cache, the client exits due to an exception.
- The following issue is fixed: The JindoFSx client exits due to an exception when metrics are reported.
- The following issue is fixed: Memory leaks occur when you use Ranger for JindoFSx.
JindoSDK and tools
- JindoSDK supports cyclic redundancy check (CRC) and MD5 checksum verification, and verifies writes by file and by data block.
- Jindo Sync is supported for data synchronization without depending on the Hadoop environment.
- JindoSDK supports OSS-HDFS TensorFlow Connector.

JindoData V4.5.x

JindoData V4.5.1

Overview
JindoData V4.5.1 is a minor upgrade from V4.5.0 that provides major fixes and improvements. JindoFS improves service stability and handles some exceptions. JindoFS and JindoFSx further improve the adaptive configurations and prefetch algorithms to improve the prefetch efficiency. Many fixes and optimizations are made to Jindo DistCp to improve the stability of data copy operations. JindoFuse adopts a new underlying architecture to further improve performance.
Major features
- JindoFS
  - Memory usage issues are resolved.
  - Exception handling and log-based alerting are supported for the ASSUME_ROLE error.
  - A dynamic AccessKey pair can be updated during a retry.
  - The adaptive configurations and prefetch algorithms are optimized for JindoFS to improve the prefetch efficiency.
  - The read/write URLs are fixed for random writes to files.
  - The checkAccess method is supported.
- JindoFSx
  - The adaptive configurations and prefetch algorithms are optimized for JindoFSx to improve the prefetch efficiency.
  - Spaces are supported in paths.
  - A possible hotspot issue in multi-replica data writes is fixed.
- JindoSDK and tools
  - Jindo commands fully cover Hadoop commands.
  - Jindo commands provide native support for HDFS, which greatly improves the performance and user experience.
  - Jindo DistCp can be connected to CloudMonitor.
  - Jindo DistCp supports file checksums for objects migrated from OSS to an HDFS path.
  - Jindo DistCp supports parameters for job splitting.
  - Jindo DistCp fixes the handling logic of the error reported when a source file is deleted during a copy process.
  - JindoSDK optimizes the memory usage of random reads.
- Support for Portable Operating System Interface (POSIX) by using JindoFuse
  - JindoFuse uses the low-level API to greatly improve the performance of operations such as readdir.
  - The following issue is fixed: An error occurs when the root directory is listed after data is mounted by using JindoFSx.

JindoData V4.5.0

Overview
JindoFS prioritizes the optimization of metadata operations, which significantly improves metadata-related performance. The tiered storage feature for hot and cold data is improved, and the Infrequent Access (IA) and Cold Archive storage classes are supported. The batch write feature is supported to optimize the performance of large-scale extract, transform, and load (ETL) tasks. In terms of SDKs and ecosystem components, JindoSDK for Java is provided without depending on the Hadoop environment.
Major features
- JindoFS
  - JindoFS prioritizes the optimization of metadata operations, which significantly improves metadata-related performance.
  - The tiered storage feature for hot and cold data is improved, and the IA and Cold Archive storage classes are supported.
  - The batch write feature is supported to optimize the performance of large-scale ETL tasks.
  - The following issue is fixed: An exception occurs when OSS is accessed due to an authorization error on the server side.
- JindoFSx
  - The issue of file handle leaks in the storage service is fixed.
  - Thread safety issues in metrics reporting on the client are fixed.
  - The performance of recursively creating parent directories is optimized.
  - The performance of the path modification feature is optimized.
- JindoSDK and tools
  - The adaptive configurations and prefetch algorithms are optimized for JindoSDK to improve the prefetch efficiency.
  - JindoSDK supports the atomicity of the rename operation based on Tablestore.
  - Jindo DistCp optimizes the performance of the diff command and supports generating files of the diff command.
  - Errors for retries are handled in JindoSDK in a unified manner, and the issue that retries of the client fail due to the change of the server IP address is fixed.
  - JindoSDK for Java removes the Hadoop dependencies. JindoSDK is on the same level as Hadoop SDK and Object SDK.
- Support for POSIX by using JindoFuse
  The following issue is fixed in JindoFuse: Memory leaks occur when list operations are performed in cache.

JindoData V4.4.x

Overview
JindoFS provides tiered storage and archiving features based on the tiered storage capability of OSS. The tiered storage capability is compatible with the storage policies of HDFS. These features help you store data that is not frequently accessed at lower costs. This reduces the total storage costs. JindoFS also supports the audit logging feature of HDFS. This significantly improves compatibility with the Apache HDFS API and enhances features and data migration capabilities. In addition, you can use JindoFS to quickly import data to OSS and migrate data from a semi-hosted JindoFS cluster to OSS-HDFS. The features of JindoFS are implemented by using the Alibaba Cloud OSS-HDFS service. For more information, see What is OSS-HDFS?.
For JindoFSx, JindoData V4.4.x allows you to cache data on an on-premises client and accelerates access to the data cached on the client. This facilitates the caching of metadata and optimizes the cache-based acceleration capability of File Storage NAS.
SDKs and ecosystem components significantly improve the performance and throughput of operations. Object SDK provided by JindoData is compatible with the API operations of OSS and supports the acceleration capability of JindoFSx. This improves the performance of various operations. The JindoDistJob tool is provided. This tool allows you to migrate full or incremental metadata of files from a semi-hosted JindoFS cluster to OSS-HDFS without the need to copy data blocks. The Jindo DistCp tool is significantly enhanced to implement lossless migration of data from Apache HDFS to JindoFS. This ensures that the metadata of files can be backed up.
Major features
- JindoFS
  - JindoFS provides tiered storage and archiving features and is compatible with HDFS storage policies.
  - JindoFS provides the batch data import capability.
  - JindoFS supports the audit logging feature of HDFS.
  - JindoFS supports the concat and symlink methods.
  - JindoFS optimizes the background cleanup capability for data files.
  - JindoFS optimizes the performance of lease and lock operations.
- JindoFSx
  - JindoFSx provides caching plug-ins and supports client-side caching mode.
  - JindoFSx supports plug-in-based authentication. This way, KRB5 and Simple Authentication and Security Layer (SASL) library dependencies are not required.
  - JindoFSx significantly optimizes the metadata caching performance and improves the cache acceleration capability of NAS.
- JindoSDK and tools
  - JindoSDK provides comprehensive support for HTTPS and improves fault tolerance in poor network environments.
  - JindoSDK no longer depends on KRB5 files and SASL libraries.
  - JindoSDK supports OSS API operations and seamlessly connects to the cache-based acceleration capability of JindoFSx. This significantly improves the performance of operations.
  - The JindoDistJob tool is added. This tool allows you to quickly migrate data from a semi-hosted JindoFS cluster in block storage mode to OSS-HDFS.
  - Jindo DistCp significantly improves the capability of migrating data from Apache HDFS to JindoFS and supports lossless migration of metadata.
- Support for POSIX by using JindoFuse
  JindoFuse optimizes the performance of sequential reads of large files.

JindoData V4.3.x

Overview
JindoData V4.3.0 fully supports the multicloud architecture and data lake storage solutions that span multiple clouds, storage platforms, acceleration and scaling methods, protocols, and programming languages. JindoFS significantly improves its POSIX support. The JindoFSx system supports Kerberos and Ranger security authentication for the first time. JindoSDK and ecosystem tools are greatly improved in terms of test coverage.
Major features
- JindoSDK and tools
  - JindoSDK supports multicloud storage services, such as Amazon Simple Storage Service (S3), IBM Cloud Object Storage (COS), and HUAWEI CLOUD Object Storage Service (OBS).
  - JindoSDK provides the JindoTable tool.
  - JindoSDK optimizes Flink connectors.
  - JindoSDK optimizes the Jindo DistCp tool.
- JindoFSx
  - JindoFSx supports various cloud storage services, such as Amazon S3, IBM COS, and HUAWEI CLOUD OBS.
  - JindoFSx optimizes the data caching and metadata caching features.
  - JindoFSx allows you to perform authentication by using Kerberos and Ranger.
  - JindoFSx optimizes its observability metrics.
  - JindoFSx is connected to Fluid.
- JindoFS
  - JindoFS supports POSIX locks and the fallocate() API operation.
  - JindoFS supports upgrades of clusters that run in block storage mode.
- Support for POSIX by using JindoFuse
  - JindoFuse provides xattr-related methods, such as setxattr(), getxattr(), listxattr(), and removexattr().
  - JindoFuse supports POSIX locks and the fallocate() method.
  - JindoFuse allows you to write data to OSS in the following modes: append, flush, and write-while-read.

JindoData V4.2.x

Overview
JindoData V4.2.0 significantly improves the JindoFSx system. The cache acceleration feature is supported for Apache HDFS and File Storage NAS. Various tools are enhanced and provided, including JindoFuse, Jindo DistCp, and JindoTable.
Major features
- JindoFSx
  - The transparent data caching feature (hdfs://) and access acceleration by mounting a unified namespace of JindoFSx (fsx://) are supported for Apache HDFS.
  - Access acceleration by mounting a unified namespace of JindoFSx (fsx://) is supported for NAS.
  - JindoFSx is integrated with OSS-HDFS and supports write paths.
- JindoSDK and tools
  - JindoSDK for C or C++ is supported for the first time, and methods similar to those of POSIX are provided.
  - Support for POSIX by using JindoFuse is provided, and the JindoFuse tool is improved and built based on JindoSDK for C or C++.
  - Data migration is supported for Jindo DistCp. Jindo DistCp is refactored and improved by simplifying and removing unnecessary features in V3.x to enhance ease of use and robustness.
  - The JindoTable tool is supported. JindoTable is refactored and improved by simplifying and removing the least frequently used features in V3.x to enhance ease of use and robustness.

JindoData V4.1.x

Overview
JindoData V4.1.0 supports major features such as random writes for OSS-HDFS, and the JindoFSx storage acceleration system is also supported to provide distributed caching for OSS and OSS-HDFS.
Major features
- JindoFS
  - Capabilities of JindoFS
    - Random writes to files are supported. Modifications and writes can be performed on files.
    - The HDFS recycle bin is supported. The system automatically deletes files from the recycle bin based on the specific expiration time.
    - The HDFS snapshot feature is improved. Seek and write operations are supported for snapshots.
    - The mechanism for deleting directories is improved to greatly improve operation performance.
    - The NsWorker framework is implemented. The metadata service can allocate heavy processing workloads to the follower and learner nodes.
  - Support for JindoShell CLI commands
    - You can set the expiration time for the HDFS recycle bin by using commands.
    - The snapshotDiff command can be used to compare the differences between two snapshots.
    - The dumpFile command is improved to export the information about the file on which random writes are performed.
  - Support for POSIX by using JindoFuse
    Seek and write operations are supported for files.
- JindoFSx
  - Core capabilities of JindoFSx
    - Access to OSS is accelerated by using the transparent data caching feature. The prefix oss:// remains unchanged.
    - Transparent data caching for the OSS-HDFS service is supported. The prefix oss:// remains unchanged.
    - You can accelerate access to OSS or OSS-HDFS by mounting a unified namespace of JindoFSx. To do so, use the prefix fsx://.
    - Metadata caching acceleration is supported for a large number of files.
    - Metadata caching acceleration is supported for a large number of small files in AI training scenarios.
    - P2P acceleration is supported in scenarios where a large number of training nodes start to prefetch content and load model files at the same time. This greatly improves the cache read performance.
  - Support for Hadoop by using JindoSDK
    - JindoOssFileSystem is provided to support the transparent data caching feature for OSS and OSS-HDFS.
    - JindoFsxFileSystem is provided to support access acceleration by mounting a unified namespace of JindoFSx.
  - Support for JindoShell CLI commands
    - Data caching commands of JindoFSx are supported.
    - Metadata caching commands of JindoFSx are supported.
    - Commands for managing unified namespaces of JindoFSx are supported.
  - Support for POSIX by using JindoFuse
    - You can mount the file system in user space (FUSE) to the oss:// path. This way, you can read data from and write data to the JindoFSx cache.
    - You can mount FUSE to the fsx:// path. This way, you can read data from and write data to the JindoFSx cache.
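
The two access methods in this release differ only in the path prefix that a client uses. The following minimal sketch shows how a client could tell them apart by scheme. The function name `describe_access`, the description strings, and the bucket and namespace names in the usage lines are hypothetical. Only the prefixes oss:// and fsx:// come from the notes above.

```python
from urllib.parse import urlparse

# Map each path prefix described above to the acceleration method it implies.
ACCESS_MODES = {
    "oss": "transparent data caching (path prefix unchanged)",
    "fsx": "unified namespace mounted in JindoFSx",
}


def describe_access(path: str) -> str:
    """Return the acceleration method implied by the scheme of a path."""
    scheme = urlparse(path).scheme
    if scheme not in ACCESS_MODES:
        raise ValueError("scheme not covered by this sketch: " + repr(scheme))
    return ACCESS_MODES[scheme]


if __name__ == "__main__":
    # Hypothetical bucket and namespace names.
    print(describe_access("oss://examplebucket/data/part-0"))
    print(describe_access("fsx://example-namespace/data/part-0"))
```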

JindoData V4.0.x

Overview
JindoData is developed by Alibaba Cloud based on the original EMR SmartData component. JindoData V4.0.0 is the first version released after the architecture of SmartData is upgraded. The last major version of the SmartData component is V3.8.0. JindoData V4.0.0 mainly supports OSS and OSS-HDFS.
Note
JindoFSx is not released in JindoData V4.0.0.
Major features
- OSS
  - Support for Hadoop by using JindoSDK
    - Hadoop SDK for Java is provided for OSS. Hadoop SDK for Java is fully compatible with the Hadoop OSS connector and provides optimal performance.
    - Multiple methods, such as manual configuration, ECS role assignment, and the password-free feature, can be used to configure a credential provider.
    - OSS provides the Archive and Cold Archive storage classes for Standard and Cold Archive objects.
  - Support for JindoShell CLI commands
    - Additional Hadoop and HDFS Shell commands are provided. This way, OSS can provide easy-to-use commands for Hadoop users.
    - The ls2 command is provided on the basis of Hadoop ls commands. You can use this command to view the storage status of a file or object, such as whether the data is stored in the Standard, IA, or Archive storage class of OSS.
    - The archive command is supported. You can use this command to change the storage class of an object in a specific directory.
    - The restore command is supported. You can use this command to restore an object in a specific directory.
  - Support for POSIX by using JindoFuse
    An optimized FUSE client is provided for OSS. The client is implemented based on complete native code and provides industry-leading performance.
  - Support for data migration by using Jindo DistCp
    Data in self-managed HDFS clusters can be migrated to OSS. JindoData optimizes the migration of a large number of large and small files.
- OSS-HDFS
  - JindoFS
    - You can enable the OSS-HDFS service for OSS buckets to provide the metadata acceleration feature. OSS-HDFS is binary compatible with Apache HDFS and allows HDFS to migrate data to the cloud.
    - OSS-HDFS provides native support for directories and greatly optimizes operations on directories. Atomic and millisecond-granularity rename operations on super-large directories are supported.
    - OSS-HDFS provides native support for files. HDFS leases, the one-write and multi-read feature, and the multi-write and multi-read feature are supported.
    - The following operations are supported on files: append, flush, sync, and truncate.
    - HDFS snapshots are supported. The number of snapshots is nearly unlimited to facilitate data backup, disaster recovery, and data restoration.
    - Permissions on files are supported. You can run commands in the JindoShell CLI to perform the mapping of users to groups.
    - Access control by using Hadoop proxy user rules is supported.
  - Support for Hadoop by using JindoSDK
    JindoSDK provides HDFS APIs to allow you to access and manage data in OSS-HDFS.
  - Support for JindoShell CLI commands
    - Additional Hadoop and HDFS Shell commands are provided. This way, OSS-HDFS can provide easy-to-use commands for Hadoop users.
    - The HDFS snapshot feature can be used by running HDFS commands and JindoShell commands. For more information, see Snapshot (trial).
    - You can run commands to perform the mapping of users to groups.
    - You can run commands to configure rules for Hadoop proxy users.
  - Support for POSIX by using JindoFuse
    An optimized FUSE client is provided for OSS-HDFS. The client is implemented based on complete native code and provides industry-leading performance.
Known issues
- JindoSDK does not allow you to write files that are larger than 80 GB in size to OSS.
- JindoSDK does not allow you to write data to OSS in append mode.
- JindoSDK does not support client-side encryption of data that is uploaded to OSS.
- JindoSDK does not support JindoFS in block mode and cache mode.
- OSS-HDFS does not allow you to update JindoFS in block mode. You can use only the Jindo DistCp tool to migrate data from JindoFS in block mode to JindoFS.