By Shen Hong and Zuoshang
As the most popular open-source database, MySQL is used by many enterprises in various business systems. In addition to online business logic reading and writing, MySQL databases require data analysis, such as BI reports, visual dashboards, and big data applications. However, MySQL databases are struggling to cope with data analysis needs due to the MySQL architecture. The industry offers many solutions to solve this problem. Data Lake Analytics (DLA) is a good choice for low-cost scenarios. This article first introduces the DLA solution for readers who might be unfamiliar with it. The DLA service is a serverless, interactive, and federated query service. You may use standard SQL statements to analyze and integrate data from data sources, such as Object Storage Service (OSS) devices, databases (such as PostgreSQL and MySQL), and NoSQL, such as Table store.
MySQL databases are widely used in various business systems. In addition to online business logic reading and writing, MySQL databases require data analysis, such as BI reports, visual dashboards, and big data applications. When the data volume of a single MySQL database reaches a certain level, using MySQL for data analysis leads to poor performance, which will affect the read and write performance of online businesses. In this case, we need a new data analysis solution.
MySQL data needs to be analyzed with log data. Some companies use open-source big data systems (such as Hive, Hadoop, and Spark) to build data warehouses. This solution works but involves extremely high human and resource costs. We must find a way to analyze MySQL data with data from other systems at a low cost.
When the data volume in a MySQL database exceeds the limit of a single database instance, database administrators (DBAs) split the data in a database into multiple tables in multiple databases. This makes data analysis very complicated. Therefore, a new analysis solution is required.
Which of the following factors must be considered to resolve the problems in the MySQL analysis scenarios? Which solution should we select? The key factors to keep in mind while answering these questions include:
1) Costs: The costs here involve the cost in time, labor, and risk. While selecting a solution, consider its cost-effectiveness.
2) Capabilities: Capabilities include functions and performance. In terms of functions, check whether the solution provides a complete set of analysis capabilities and scalability. In terms of performance, check whether the solution meets the requirements for timeliness and parallelism, especially in the case of massive data.
3) Maintainability: A good product is maintainable. This means you can use it in a simple way, and when a problem occurs, it is easy to troubleshoot.
4) Ease of Use: The product is easy to use. You can use data analysis services easily.
A variety of solutions are available for MySQL data analysis, including direct analysis on read-only MySQL instances, user-created open-source data warehouses, and DLA construction solutions. Let's take a closer look at the advantages and disadvantages of these solutions.
Purchase an additional server to build a read-only secondary MySQL instance, and then perform data analysis based on the read-only instance. The advantages and disadvantages of this solution are as follows:
Disadvantages
Advantages
Use open-source big data systems such as Hive, Hadoop, and Spark to build a data warehouse. Then, synchronize data from MySQL databases to the data warehouse and analyze the data based on Spark or Hive.
Disadvantages
Advantages
Use open-source or commercial AnalyticDB to synchronize data through data synchronization tools, and then perform SQL-based data analysis.
Disadvantages
Advantages
Alibaba Cloud DLA construction solution supports MySQL data analysis at a low cost.
Advantages
Let's take a closer look at the evaluation data and technical principles of the DLA construction solution.
As shown in the following figure, the monthly subscription fee for an additional high-performance ApsaraDB RDS for MySQL instance is RMB 2,344. We can use TPC-H with 10 GB of data as an example. If you run 22 SQL statements for TPC-H once a day, the monthly cost of using DLA is only RMB 26.64, an average daily cost of less than RMB 1. You only need to pay 1% of the cost for high-performance analysis. In addition, the columnar storage of DLA only uses 3 GB, while the native MySQL storage may need about 20 GB.
After the DLA construction solution synchronizes data from the source database, the data is stored in the compressed column-storage format. For example, 10 GB of TPC-H data consumes about 20 GB of storage space in MySQL, but only about 3 GB of storage space in the compressed column-storage format. DLA helps achieve efficient analytics at a very low cost. We can use TPC-H with 10 GB of data as an example. It takes 5.5s to run the 22 SQL statements for TPC-H in DLA, but 345.5s in MySQL with four SQL statements failing. The following figure compares the time consumed by 22 SQL statements for TPC-H in MySQL and DLA.
The DLA construction solution supports multiple data sources, including user-created MySQL, SQL Server, PostgreSQL, Oracle, ApsaraDB RDS for MySQL, Apsara PolarDB, and AnalyticDB data sources. Being different from traditional data warehouses, DLA is simple to use. Synchronize data to DLA through simple configurations and build data warehouses in one click.
The DLA construction solution supports the automatic synchronization of updated data and metadata operations, such as adding and deleting tables and adding, modifying, and deleting columns. In database sharding and table sharding scenarios, use the DLA construction solution to merge the logical tables distributed across multiple databases into one table so that you analyze data in one table. In addition, this solution allows synchronizing an unlimited number of tables.
The DLA team is developing an incremental DLA construction solution to support the incremental synchronization of data from source databases, which eliminates all impact on the source databases and significantly improve the timeliness of data analysis. The incremental construction solution will be available soon.
DLA-based queries have no impact on the source databases. When the DLA solution is used to synchronize data from source databases, the impact on the source databases is maintained at less than 10%. The following figure shows the CPU utilization for different specifications of source databases. As the specifications of a database increase, the number of connections automatically increases, but the average CPU utilization of the source database is kept below 10%.
We comprehensively optimized the DLA construction solution to minimize the impact of synchronization on the source database:
This minimizes the impact on the source databases. You may also manually increase the number of connections to speed up the synchronization.
Complete the following steps for using DLA to build a low-cost MySQL analytics solution. (For more information, visit the official documentation)
1) Log on to the DLA console. In the upper-left corner of the page, select the region where your DLA service is located.
2) In the left-side navigation pane, click Solutions. On the Solutions page, click Wizard in One-click Data Warehouse.
3) Configure the parameters as prompted.
4) Click Create. You can now use DLA.
Insights from Alibaba Cloud Experts: 3 Ways to Deal with High Concurrency
ApsaraDB RDS for PostgreSQL 12: Improved High Concurrency Performance for Large Partitioned Tables
Rupal_Click2Cloud - August 8, 2022
Alibaba Clouder - September 28, 2020
Alibaba Clouder - January 20, 2021
ApsaraDB - November 17, 2020
ApsaraDB - April 19, 2019
Alibaba Clouder - November 14, 2018
An end-to-end solution to efficiently build a secure data lake
Learn MoreA premium, serverless, and interactive analytics service
Learn MoreBuild a Data Lake with Alibaba Cloud Object Storage Service (OSS) with 99.9999999999% (12 9s) availability, 99.995% SLA, and high scalability
Learn MoreRealtime Compute for Apache Flink offers a highly integrated platform for real-time data processing, which optimizes the computing of Apache Flink.
Learn MoreMore Posts by ApsaraDB