Implementation and Challenges of Data Lake Metadata Services

Status Quo of the Big Data Engine

In big data computing and storage, many engines have been developed for different business scenarios and data scales. Computing engines include Hive for data analysis, Presto for interactive analysis, Spark for iterative computing, and Flink for stream processing. Storage services include the log storage system SLS and the distributed file system HDFS. These engines and systems serve their specific domains well, but together they create serious data silo problems.

Using several of these systems on the same data set requires a large amount of ETL work, yet this usage pattern is very common in companies today. Data processing, dumping, and the resulting latency are costly, and they delay business decision-making. The key to solving this problem is to interconnect engine metadata. A unified data lake metadata service view that meets the needs of the various engines enables better data sharing and avoids the extra ETL cost and delay.

Design of Data Lake Metadata Services

The data lake metadata services aim to build a unified metadata view across different storage systems, formats, and computing engines. They provide unified permissions and metadata, and they are compatible with and extend the metadata services of the open-source big data ecosystem. They also support automatic metadata acquisition, so that metadata is set up once and used many times. As a result, the services are both compatible with the open-source ecosystem and easy to use. In addition, metadata must support tracing and auditing. This requires the data lake unified metadata services to have the following capabilities:

Provide a unified permission and metadata management module. This module is the foundation for interconnecting the various engines and storage systems. The permission and metadata model must meet business needs for permission isolation and support the permission models of the current engines.
Provide storage and service for large-scale metadata. Raising the capacity limits of the metadata services makes it possible to serve ultra-large-scale data.
Provide a unified metadata management view for storage. Structuring the data held in various storage systems (objects, files, logs, and others) simplifies data management and enables further analysis and processing.
Support a rich set of computing engines. Engines access and compute on data through the unified metadata service view, which meets the requirements of different scenarios. For example, PAI, MaxCompute, and Hive can all compute on and analyze the same OSS data. With support for diverse engines, it becomes easier to move a workload from one business scenario to another.
Support tracing and auditing of metadata operations.
Support automatic metadata discovery and collection. The data lake unified metadata services create metadata and keep it consistent by automatically detecting directories, files, and file formats in file storage. This makes stored data easier to maintain and manage automatically.
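To make the permission and auditing capabilities above more concrete, here is a minimal in-memory sketch in Python. It is not the API of the actual service: the class MetadataService, its methods, and the grant tuple layout are all hypothetical names chosen for illustration. It only shows the idea that every engine goes through one catalog, that access is checked against one set of grants, and that every attempt is recorded for tracing.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataService:
    """Toy unified catalog. All names here are hypothetical."""

    tables: dict = field(default_factory=dict)     # (database, table) -> schema
    grants: set = field(default_factory=set)       # (principal, database, table, action)
    audit_log: list = field(default_factory=list)  # one record per attempted operation

    def grant(self, principal, database, table, action):
        self.grants.add((principal, database, table, action))

    def _check(self, principal, database, table, action):
        allowed = (principal, database, table, action) in self.grants
        # Record every attempt, allowed or denied, so operations can be traced.
        self.audit_log.append((principal, action, f"{database}.{table}", allowed))
        if not allowed:
            raise PermissionError(f"{principal} may not {action} {database}.{table}")

    def create_table(self, principal, database, table, schema):
        self._check(principal, database, table, "create")
        self.tables[(database, table)] = dict(schema)

    def get_table(self, principal, database, table):
        self._check(principal, database, table, "read")
        return self.tables[(database, table)]
```

In this sketch, two engines that hold a "read" grant on the same table receive the same schema from get_table, and a denied call still leaves an entry in audit_log.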

Architecture of the Data Lake Metadata Services

The Upper Layer of Metadata Services Is the Engine Access Layer

The upper layer of the metadata services connects to various engines and meets their metadata access requirements by providing SDKs and plug-ins for various protocols. It also provides a view for analyzing and processing the underlying file systems.
The plug-in system is compatible with the E-MapReduce (EMR) engines, so users can use EMR services out of the box. The integration is transparent to users throughout the process. It also avoids the poor extensibility of traditional storage backends such as MySQL.

Metadata Services Provide the Storage View

By abstracting files and directories of different storage formats, the unified metadata services can be used by various engines. This also avoids the metadata inconsistencies that arise when each engine maintains its own metadata service.

Metadata Management and Automatic Discovery

Flexible, cross-engine metadata management is available through several channels. This makes it easy to integrate the metadata services, extend their capabilities, and reduce management costs.

1) Web Console, SDK, and various engine clients and interfaces

Support the Data Definition Language (DDL) operations that open-source ecosystem engines perform on databases, tables, and partitions.
Provide multi-version metadata management and tracing capabilities.
In the future, open up metadata capabilities so that ETL processes and open-source tools can connect through plug-ins. This improves the overall ecosystem.
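Multi-version metadata management can be pictured as keeping every past schema of a table instead of overwriting it. The following Python sketch assumes a simple append-only version list. The class VersionedTable and its methods are hypothetical and do not reflect how the service stores versions internally.

```python
import copy


class VersionedTable:
    """Append-only history of a table's schema. Hypothetical illustration."""

    def __init__(self, name, schema):
        self.name = name
        self._versions = [copy.deepcopy(schema)]  # version 0 is the initial schema

    @property
    def current(self):
        # Return a copy so callers cannot mutate stored history.
        return copy.deepcopy(self._versions[-1])

    def alter(self, **columns):
        """Add or change columns and return the new version number."""
        new_schema = self.current
        new_schema.update(columns)
        self._versions.append(new_schema)
        return len(self._versions) - 1

    def at(self, version):
        """Return the schema as it was at the given version number."""
        return copy.deepcopy(self._versions[version])
```

Because old versions are never modified, a change can be traced by comparing at(n) with at(n + 1).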

2) Automatic metadata discovery

Automatic metadata discovery is another core component of metadata management. It automatically collects metadata for data scattered across various file systems. This greatly broadens the range of scenarios the unified metadata services can cover and reduces the cost and complexity of metadata management. Automatic metadata discovery can:

Automatically analyze the directory hierarchy and create metadata such as databases, tables, and partitions dynamically and incrementally.
Automatically analyze file formats, including regular text formats and open-source big data formats such as Parquet and ORC.
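The two discovery steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the service's actual crawler: it assumes a Hive-style layout of database/table/key=value/file, it guesses the format from the file extension only, and the names discover and FORMAT_BY_SUFFIX are hypothetical. Incremental updates are not modeled; the function simply rebuilds the result from the paths it is given.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Hypothetical mapping from file extension to a format label.
FORMAT_BY_SUFFIX = {".parquet": "parquet", ".orc": "orc", ".csv": "csv", ".txt": "text"}


def discover(paths):
    """Infer databases, tables, and partitions from Hive-style paths.

    Assumes the layout <database>/<table>/<key>=<value>/.../<file>.
    Returns {database: {table: {"format": str, "partitions": set}}}.
    """
    catalog = defaultdict(dict)
    for raw in paths:
        parts = PurePosixPath(raw).parts
        if len(parts) < 3:
            continue  # too shallow to name a database, a table, and a file
        database, table, *middle, filename = parts
        # Partition directories look like "dt=2020-01-01".
        partition = tuple(
            tuple(segment.split("=", 1)) for segment in middle if "=" in segment
        )
        suffix = PurePosixPath(filename).suffix.lower()
        fmt = FORMAT_BY_SUFFIX.get(suffix, "unknown")
        # The first file seen for a table decides its recorded format.
        entry = catalog[database].setdefault(table, {"format": fmt, "partitions": set()})
        if partition:
            entry["partitions"].add(partition)
    return dict(catalog)
```

For example, passing the path sales/orders/dt=2020-01-01/part-0.parquet yields a database named sales, a table named orders in Parquet format, and one partition with key dt.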

Future of Metadata Services

The data lake metadata services support big data and interconnected ecosystems, and they can be extended to serve more big data engines. Through open services, storage capabilities, unified permissions, and metadata management, they reduce management, labor, and storage costs for business users and thus deliver greater business value.
