DataWorks allows you to develop data and implement data governance in a visualized manner. This topic describes two recommended solutions: a big data warehouse on the cloud and a real-time data warehouse.
Big data warehousing solution
We recommend that you build a big data warehouse on the cloud based on the following architecture:
- Customers: This solution is suitable for customers from all industries.
- Benefits: The big data warehouse solution follows Alibaba Cloud best practices and features high performance, low cost, and a serverless architecture. You can use this solution to build O&M-free, fully managed big data warehouses, so that the big data developers of an enterprise can focus on the development, production, and governance of business data.
- Product portfolio: MaxCompute, Realtime Compute for Apache Flink, and DataWorks.
- Use scenarios:
  - User data comes from various sources, such as the cloud and external data sources. Data from the different sources is integrated into a single data warehouse for data cleansing and data modeling.
  - The application scenarios are complex. You can use a big data warehouse to perform speech recognition, semantic analysis, and sentiment analysis on unstructured speech and natural language text. You can also build an enterprise-class data management platform to process structured data, which helps reduce computing and storage costs.
  - The data warehouse must support a wide range of applications. You can use machine learning algorithms for complex data analysis, BI reports for charts, dashboard products for data display, and custom methods for other types of data consumption.
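The first use scenario, in which records from different sources are integrated into one warehouse and cleansed, can be illustrated with a minimal Python sketch. It is not DataWorks or MaxCompute code: the `Order` schema, the field names, and the `cleanse` and `integrate` helpers are hypothetical, and the sources are plain in-memory lists.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Order:
    """Unified warehouse row (hypothetical example schema)."""
    order_id: str
    amount: float
    source: str


def cleanse(raw: dict, source: str) -> Optional[Order]:
    """Normalize one raw record; return None if it cannot be used."""
    # Different sources may name the key differently (assumed here).
    order_id = str(raw.get("id") or raw.get("order_id") or "").strip()
    if not order_id:
        return None
    try:
        amount = float(raw.get("amount", ""))
    except (TypeError, ValueError):
        return None
    return Order(order_id, round(amount, 2), source)


def integrate(sources: dict) -> list:
    """Merge all sources into one list, keeping the first row per order_id."""
    seen = set()
    rows = []
    for name, records in sources.items():
        for raw in records:
            row = cleanse(raw, name)
            if row is None or row.order_id in seen:
                continue
            seen.add(row.order_id)
            rows.append(row)
    return rows
```

In an actual deployment, this kind of logic would typically be expressed as data integration and ETL nodes rather than as application code. The sketch only shows the shape of the problem: one target schema, per-record validation, and deduplication across sources.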
Real-time data warehousing solution
We recommend that you build a real-time data warehouse based on the following architecture:
- Customers: This solution is suitable for Internet industry scenarios, such as e-commerce, gaming, and social networking, in which large amounts of data must be queried in real time.
- Benefits:
  - Alibaba Cloud real-time data warehouses can be seamlessly integrated with offline data warehouses.
  - A single cost-effective storage system can satisfy the requirements for real-time and offline computing.
- Product portfolio: DataHub, Realtime Compute for Apache Flink, Hologres, MaxCompute, DataWorks, and Quick BI or DataV.
- Use scenarios:
  - Data collection: You can use DataWorks to collect offline data and DataHub to collect real-time data.
  - Data development: You can use DataWorks for end-to-end data development, which covers data integration, extract, transform, and load (ETL), data computing, and node scheduling, monitoring, and alerting. DataWorks provides security control capabilities to eliminate security risks in the data development process. DataWorks also provides unified APIs through the DataService Studio module.
  - Real-time data processing: You can use Realtime Compute for Apache Flink to perform real-time ETL and import the results to databases. Then, you can use Hologres to build real-time data warehouses and application marts and to run real-time interactive queries and analysis on large amounts of data.
  - Interactive analysis: You can use a real-time data warehouse to perform real-time, offline, and federated queries. Historical offline data is stored in MaxCompute, and real-time analysis data is stored in Hologres. You can use Alibaba Cloud Quick BI or a third-party data analysis tool, such as Tableau, to visualize data and build data applications for various business units.
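The idea behind a federated query, in which historical data comes from the offline store and recent data comes from the real-time store, can be sketched in plain Python. This is a conceptual illustration only: the two lists stand in for MaxCompute and Hologres tables, and the `federated_daily_totals` function and its `cutoff` parameter are hypothetical names, not part of any product API.

```python
from collections import defaultdict
from datetime import date


def federated_daily_totals(offline_rows, realtime_rows, cutoff):
    """Sum amounts per day across two stores.

    Days before `cutoff` are read from the offline store and days on or
    after `cutoff` from the real-time store, so no day is counted twice.
    Each row is a (day, amount) tuple.
    """
    totals = defaultdict(float)
    for day, amount in offline_rows:
        if day < cutoff:
            totals[day] += amount
    for day, amount in realtime_rows:
        if day >= cutoff:
            totals[day] += amount
    # Return the days in chronological order.
    return dict(sorted(totals.items()))


# Stand-ins for historical (offline) and current (real-time) data.
offline = [
    (date(2024, 1, 1), 10.0),
    (date(2024, 1, 1), 5.0),
    (date(2024, 1, 2), 7.0),
]
realtime = [
    (date(2024, 1, 3), 4.0),
    (date(2024, 1, 3), 6.0),
]

result = federated_daily_totals(offline, realtime, cutoff=date(2024, 1, 3))
```

The cutoff date is one simple way to decide which store is authoritative for a given day. A real deployment would express the same split in the query layer of the warehouse rather than in application code.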