In the field of big data computing and storage, many engines have been developed to handle different business scenarios and data scales. For example, the computing engines include Hive for data analysis, Presto for interactive analysis, Spark for iterative computing, and Flink for stream processing. The storage services include the log storage system SLS and the distributed file system HDFS. These engines and systems effectively meet the business requirements of specific fields. However, they also face serious data island problems.
Using these systems together on the same set of data requires a lot of ETL work, which is challenging. More importantly, this kind of usage is currently very common in the business processes of many companies. Data processing, dumping, and the resulting latency incur high costs and delay business decision-making. The key to solving this problem lies in interconnecting the metadata of the engines. By building a unified data lake metadata service view that meets the needs of the various engines, you can achieve better data sharing. You can also avoid the extra ETL cost and the processing delay.
The data lake metadata services aim to build a unified metadata view across different storage systems, formats, and computing engines. They provide unified permissions and metadata, and they work with and extend the metadata services of the open-source big data ecosystem. They also support automatic metadata acquisition, so that metadata set up once can be used many times. As a result, they are not only compatible with the open-source ecosystem but also easy to use. In addition, metadata must support tracing and auditing. This requires the data lake unified metadata services to have the following capabilities:
By abstracting files or directories of different storage formats, the unified metadata services can be used by various engines. This also eliminates the inconsistencies that arise when multiple engines each use their own metadata service independently.
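The abstraction described above can be sketched as a small engine-neutral catalog. This is only an illustration under stated assumptions: the class names `Column`, `TableMetadata`, and `UnifiedCatalog`, the `oss://example-bucket/...` and `hdfs://example-cluster/...` locations, and the type names are all hypothetical and do not correspond to any real service API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Column:
    name: str
    type: str  # engine-neutral type name (assumed), e.g. "string", "bigint"


@dataclass
class TableMetadata:
    database: str
    name: str
    location: str        # file or directory URI (hypothetical values below)
    storage_format: str  # e.g. "parquet", "orc", "csv"
    columns: List[Column] = field(default_factory=list)


class UnifiedCatalog:
    """In-memory stand-in for a shared metadata service (hypothetical)."""

    def __init__(self) -> None:
        self._tables: Dict[Tuple[str, str], TableMetadata] = {}

    def register(self, table: TableMetadata) -> None:
        # One entry per (database, table); every engine sees the same record.
        self._tables[(table.database, table.name)] = table

    def get(self, database: str, name: str) -> TableMetadata:
        return self._tables[(database, name)]


catalog = UnifiedCatalog()
catalog.register(TableMetadata(
    "logs", "clicks", "oss://example-bucket/logs/clicks/", "parquet",
    [Column("user_id", "bigint"), Column("url", "string")],
))
catalog.register(TableMetadata(
    "logs", "errors", "hdfs://example-cluster/logs/errors.csv", "csv",
    [Column("ts", "string"), Column("message", "string")],
))

# Any engine client would resolve a table the same way, whatever its format.
clicks = catalog.get("logs", "clicks")
errors = catalog.get("logs", "errors")
```

Because both tables are described by the same record type, a client for any engine can look up location, format, and columns without knowing which engine created the entry.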
Flexible, cross-engine metadata management can be realized in many ways. This makes it easier to integrate metadata services, expand their capabilities, and reduce management costs.
1) Web Console, SDK, and various engine clients and interfaces
2) Automatic metadata discovery
Automatic metadata discovery is another core component of metadata management. It automatically collects metadata for data scattered across various file systems. This greatly broadens the range of scenarios that unified metadata services can cover and reduces the cost and complexity of metadata management. Automatic metadata discovery can:
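For instance, a discovery job might walk a directory tree, treat each directory that holds data files as one table, and infer the format and columns from the files. The sketch below is a minimal local-filesystem illustration, not the actual service: the function name `discover_tables`, the one-directory-per-table convention, and the shape of the returned dictionary are all assumptions, and only CSV headers are inspected for column names.

```python
import csv
import os
from typing import Dict, List

# File extensions this sketch recognizes (an assumed, non-exhaustive set).
KNOWN_FORMATS = {".csv": "csv", ".json": "json", ".parquet": "parquet", ".orc": "orc"}


def discover_tables(root: str) -> Dict[str, dict]:
    """Scan `root` and return one metadata entry per directory with data files."""
    tables: Dict[str, dict] = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        data_files = sorted(
            f for f in filenames
            if os.path.splitext(f)[1].lower() in KNOWN_FORMATS
        )
        if not data_files:
            continue
        first = data_files[0]
        fmt = KNOWN_FORMATS[os.path.splitext(first)[1].lower()]
        columns: List[str] = []
        if fmt == "csv":
            # Use the header row of the first CSV file as the column list.
            with open(os.path.join(dirpath, first), newline="") as handle:
                columns = next(csv.reader(handle), [])
        table_name = os.path.basename(os.path.normpath(dirpath))
        tables[table_name] = {
            "location": dirpath,
            "format": fmt,
            "columns": columns,
            "file_count": len(data_files),
        }
    return tables
```

A real crawler would also need to handle object storage, partitioned layouts, schema changes, and binary formats, none of which this sketch attempts.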
The data lake metadata services support big data and interconnected ecosystems, and they can be further improved to support more big data engines. Through open services, storage capabilities, unified permissions, and metadata management, they save business users management, labor, and storage costs, and so provide greater business value.