Simplified End-to-End Data Platform

This article introduces the essential services and practical implementation guidance for building a simple data pipeline and data warehouse on Alibaba Cloud.

By Ananda Budi Prasetya, Solution Architect Lead, Alibaba Cloud Indonesia.

Many organizations want to serve data from several sources to their stakeholders in order to support data-driven decision-making. However, building a robust data platform that serves as a single source of truth presents significant challenges.

Alibaba Cloud offers several services that can be combined into a simple data pipeline: one that ingests data from several sources into one place and lets users build a data warehouse on top of it.

In this brief blog post, we explain which essential services are needed, along with practical implementation guidance.

Let's first take a look at the high-level architecture of this simple data platform pipeline:

We can divide our architecture into several layers:

1. Extraction Layer

In this layer, we use a change data capture (CDC) mechanism to incrementally capture all changes from the source databases and apply them to the destination database. To do this, we can use Data Transmission Service (DTS) and connect it to our source database. DTS reads the database log and performs CDC-based migration.
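To make the CDC idea concrete, here is a minimal sketch in Python of replaying row-level changes onto a destination table. It is not how DTS is implemented or configured: the `ChangeEvent` shape, the `apply_changes` function, and the in-memory `replica` dictionary are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class ChangeEvent:
    """One row-level change read from a source database log (hypothetical shape)."""
    op: str                               # "insert", "update", or "delete"
    key: int                              # primary key of the affected row
    row: Optional[Dict[str, Any]] = None  # new row image; None for deletes


def apply_changes(target: Dict[int, Dict[str, Any]], events: List[ChangeEvent]) -> None:
    """Replay change events, in log order, onto an in-memory copy of the table."""
    for event in events:
        if event.op in ("insert", "update"):
            # Upsert: the latest row image for a key always wins.
            target[event.key] = dict(event.row or {})
        elif event.op == "delete":
            target.pop(event.key, None)
        else:
            raise ValueError(f"unknown operation: {event.op}")


# The destination starts empty and only ever receives the changes,
# never a full re-read of the source table.
replica: Dict[int, Dict[str, Any]] = {}
log = [
    ChangeEvent("insert", 1, {"customer": "A", "amount": 100}),
    ChangeEvent("insert", 2, {"customer": "B", "amount": 50}),
    ChangeEvent("update", 1, {"customer": "A", "amount": 120}),
    ChangeEvent("delete", 2),
]
apply_changes(replica, log)
print(replica)
```

The point of the sketch is that the destination converges to the source's current state by applying only the logged changes in order, which is what makes incremental sync cheaper than repeated full copies.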

For sources other than databases, such as spreadsheets or APIs, we can use batch ingestion to fetch the latest data with Data Integration, which is part of DataWorks.

To store the ingestion results, we use AnalyticDB for PostgreSQL, which acts as our data lake. It is a cloud-native real-time data warehouse service that supports high-performance, low-latency ad hoc analysis of structured, semi-structured, and unstructured data.

2. Transformation Layer

Since our aim is to minimize the need for complex data transformation pipelines, we can build our data warehouse, or even a data mart, on the Real-time Materialized View (RMV) feature in AnalyticDB for PostgreSQL. We simply create the RMV with a SQL query that acts as the data transformation.

With this approach, we don't need to think about transformation scheduling. Whenever the source is updated, the RMV refreshes with the latest changes, and the data is ready to be used by visualization tools or other consumers.
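The following Python sketch illustrates why no scheduler is needed when a view is maintained incrementally. It is a toy model, not the AnalyticDB for PostgreSQL implementation: the `RegionTotalsView` class, the `(region, amount)` row shape, and the `on_change` hook are assumptions made for illustration. The view plays the role of a query such as `SELECT region, SUM(amount) FROM orders GROUP BY region`.

```python
from collections import defaultdict
from typing import Dict, Optional, Tuple

Row = Tuple[str, int]  # (region, amount), a hypothetical "orders" row


class RegionTotalsView:
    """Toy incrementally maintained view: total amount per region."""

    def __init__(self) -> None:
        self._sum: Dict[str, int] = defaultdict(int)
        self._count: Dict[str, int] = defaultdict(int)

    def on_change(self, old: Optional[Row], new: Optional[Row]) -> None:
        """Adjust the view for one base-table change.

        insert: old is None; delete: new is None; update: both are set.
        """
        if old is not None:
            region, amount = old
            self._sum[region] -= amount
            self._count[region] -= 1
            if self._count[region] == 0:
                # No rows left in this group, so the group disappears.
                del self._sum[region]
                del self._count[region]
        if new is not None:
            region, amount = new
            self._sum[region] += amount
            self._count[region] += 1

    def rows(self) -> Dict[str, int]:
        """Current contents of the view."""
        return dict(self._sum)


view = RegionTotalsView()
view.on_change(None, ("west", 100))           # insert
view.on_change(None, ("east", 40))            # insert
view.on_change(("west", 100), ("west", 130))  # update
view.on_change(("east", 40), None)            # delete
print(view.rows())
```

Each change to the base table adjusts the stored result directly, so the view is current as soon as the change is applied and there is no periodic recomputation to schedule.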

3. Reporting Layer

As stated in the previous section, we can use the RMV as the reporting layer, that is, the layer connected directly to our business users. There they can view or create dashboards and reports based on flattened data that the RMV shapes for specific business needs.

4. Monitoring Layer

To complete the data platform, we can use Alibaba Cloud's CloudMonitor service to oversee the whole process, from upstream to downstream. For example, we can monitor whether the CDC data sync in Data Transmission Service (DTS) has a delay or latency of more than 20 minutes.
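The 20-minute rule above boils down to a simple threshold check. The sketch below shows that logic in plain Python; it does not call CloudMonitor or DTS, and the names `MAX_SYNC_DELAY`, `is_sync_delayed`, and `last_synced_at` are hypothetical. In practice the last-sync timestamp would come from whatever metric the monitoring service exposes.

```python
from datetime import datetime, timedelta, timezone

# Threshold taken from the example in the text: alert above 20 minutes.
MAX_SYNC_DELAY = timedelta(minutes=20)


def is_sync_delayed(
    last_synced_at: datetime,
    now: datetime,
    threshold: timedelta = MAX_SYNC_DELAY,
) -> bool:
    """Return True when the time since the last applied change exceeds the threshold."""
    return now - last_synced_at > threshold


# Fixed timestamps keep the example deterministic.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
on_time = now - timedelta(minutes=5)
stalled = now - timedelta(minutes=35)

print(is_sync_delayed(on_time, now))  # False
print(is_sync_delayed(stalled, now))  # True
```

A delay of exactly 20 minutes does not trigger the check here, matching the "more than 20 minutes" wording; whether the boundary should be inclusive is a choice for the alert rule.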

Summary

As demonstrated above, building a simple data pipeline platform is very feasible with the Alibaba Cloud Big Data stack. Some of our customers are non-technical users who prioritize their data and want a data platform that serves as a single source of truth for their company. With Alibaba Cloud, we can deliver a straightforward solution: a unified data platform that serves as a single source of truth and enables your organization to leverage data effectively.
