Learning about Distributed Systems – Part 25: Kylin in a New Way

Space for Time

In the previous articles, we mentioned that Spark emerged because MR was too slow, and that it significantly improved performance. However, Spark is still not fast enough for some workloads. The MPP architecture, which grew out of traditional relational databases, meets our requirements for high-performance queries. Combined with HDFS and concepts such as the virtual segment, it also solves the scalability problem to a certain extent.

However, neither batch processing nor MPP may be fast enough in some complex scenarios, for example, when the query conditions are complex, many tables are joined, and the amount of computation is particularly large.

Multi-Dimensional OLAP (MOLAP) is such a scenario. As a type of OLAP, MOLAP usually involves many complex dimensions. These dimensions may be combined arbitrarily, resulting in an explosion of computation and data.

This scenario can also be optimized within the MPP architecture, but doing so is difficult and costly. Apache Kylin offers a different solution.

The idea behind Apache Kylin is not groundbreaking. Put simply, it trades space for time. Many fields solve problems with this idea (hash tables, for example), but applying it to MOLAP is what makes Kylin distinctive.

This means all statistics are computed in advance and stored in an online storage engine. When a query arrives, the corresponding result is looked up directly, without complex computation at query time.

The overall architecture is not complex:

Let's simplify it. It is mainly divided into three parts:

Query Server is the core service of Kylin. It handles SQL parsing and optimization, metadata management, cube construction, and retrieval of the precomputed results.
Build Cluster is an independent computing cluster and can reuse existing MapReduce or Spark clusters. The source data is usually stored in Hive.
Storage Cluster stores the precomputed statistics, in HBase by default.

Build tasks usually need to run periodically, so a task scheduling system (such as Oozie, Azkaban, or Airflow) is also needed alongside Kylin, although it is not shown in the architecture diagram.

Trade Time with Less Space

The core concept in Kylin, and in MOLAP generally, is the cube. As the name implies, a cube is a three-dimensional structure. With more dimensions, it becomes a multi-dimensional cube.

Each dimension of a cube corresponds to a dimension in OLAP. Choosing different subsets of the dimensions gives different combinations. Each combination is called a cuboid, and all cuboids together form the cube.

MOLAP is computationally intensive because there are too many cuboid combinations. For a 20-dimension cube, the number of cuboids is 2^20, or about one million. Once the cardinality of each dimension is taken into account, the amount of computation is daunting.
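The growth described above can be made concrete with a small sketch. The function names below are illustrative and are not part of Kylin. The count of 2^n includes the empty combination (the grand total), and the cell estimate is only an upper bound that assumes every value combination actually occurs in the data.

```python
from itertools import combinations
from math import prod


def count_cuboids(num_dimensions: int) -> int:
    """Each dimension is either in a cuboid or not, so there are 2^n cuboids."""
    return 2 ** num_dimensions


def enumerate_cuboids(dimensions):
    """List every dimension combination, including the empty one (grand total)."""
    return [
        combo
        for size in range(len(dimensions) + 1)
        for combo in combinations(dimensions, size)
    ]


def max_cells(cardinalities):
    """Upper bound on aggregated rows summed over all cuboids.

    Each dimension contributes (cardinality + 1) choices: one of its values,
    or "not grouped by". Real data is sparse, so the actual count is lower.
    """
    return prod(c + 1 for c in cardinalities)


print(count_cuboids(20))                        # 1048576
print(len(enumerate_cuboids(["X", "A", "B"])))  # 8, including the empty cuboid
print(max_cells([10] * 20))                     # 20 dimensions, 10 values each
```

Even with only ten distinct values per dimension, the upper bound on stored cells is 11^20, which is why precomputing everything without pruning is rarely practical.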

Even though Kylin uses precomputation to reduce query latency, precomputation does not reduce the total amount of computation.

Dimension pruning is the most intuitive and effective way to reduce computation, and it is where Kylin focuses its optimization.

Pruning can only be done from a business perspective. Otherwise, cuboids that queries actually need may be removed by mistake. Kylin abstracts the following common methods from past business experience to reduce the number of cuboids:

Aggregation Group: You can select a set of dimensions that are related by certain business rules, and then specify those rules to reduce the number of cuboids.
Derived: You can set the non-primary-key fields of a dimension table as derived dimensions. These fields are not added to the cuboids. Instead, they are represented by the foreign key of the fact table (the primary key of the dimension table), which reduces the number of cuboids.

Aggregation Group supports the following rules:

Mandatory: If you specify that a dimension is mandatory, cuboids that do not contain this dimension are not computed.
Hierarchy: If dimensions A, B, and C form a hierarchy (such as province, city, and district), only the three cuboids A, AB, and ABC are retained, and the other cuboids are discarded.
Joint: If A and B are joint, they either appear together or not at all, and all cuboids that contain only one of them are discarded.
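The three rules above can be sketched as a filter over all dimension combinations. This is a minimal illustration of the rules as described here, not Kylin's actual cuboid scheduler, and the function and parameter names are assumptions.

```python
from itertools import combinations


def keep_cuboid(cuboid, mandatory=(), hierarchies=(), joints=()):
    """Return True if the cuboid survives all three pruning rules."""
    dims = set(cuboid)
    # Mandatory: every mandatory dimension must be present.
    if not set(mandatory) <= dims:
        return False
    # Hierarchy: a deeper level may appear only if all higher levels appear,
    # so the present levels must form a prefix (A, AB, ABC).
    for levels in hierarchies:
        present = [d in dims for d in levels]
        if any(not upper and lower for upper, lower in zip(present, present[1:])):
            return False
    # Joint: the dimensions appear all together or not at all.
    for group in joints:
        hits = sum(d in dims for d in group)
        if hits not in (0, len(group)):
            return False
    return True


def prune_cuboids(dimensions, mandatory=(), hierarchies=(), joints=()):
    """Enumerate all combinations and keep those that pass the rules."""
    return [
        combo
        for size in range(len(dimensions) + 1)
        for combo in combinations(dimensions, size)
        if keep_cuboid(combo, mandatory, hierarchies, joints)
    ]


dims = ["date", "province", "city", "district"]
kept = prune_cuboids(
    dims,
    mandatory=["date"],
    hierarchies=[("province", "city", "district")],
)
for cuboid in kept:
    print(cuboid)
```

With these hypothetical rules, only 4 of the 16 possible combinations remain: date alone, and date combined with each prefix of the province, city, district hierarchy.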

The derived dimension will be easier to understand through the following example:

As shown in the preceding tables, when no processing is performed, the dimensions have the following combinations:

XAB, XA, XB, AB, X, A, B (Dim prefix omitted)

But both A and B can be determined by X. If we set A to derived, the dimension combination becomes:

XB, X, B

The number of combinations is reduced from 7 to 3. Storage and computing overhead are reduced correspondingly.

However, queries on the derived dimension A are still supported, so a conversion is required. First, fetch the results aggregated by X, then map each X to its A value according to the dimension table, and finally aggregate by A.

Since the precomputed results do not contain A, this conversion and aggregation is done at query time, which has some impact on the response time. Compared with the savings in precomputation resources, this is usually acceptable. Strictly speaking, you need to decide whether to use a derived dimension according to the business scenario.
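The query-time conversion can be sketched as follows. This is a simplified illustration that assumes an additive measure (a SUM) and in-memory dictionaries. The names `cuboid_by_x` and `x_to_a` are hypothetical stand-ins for the precomputed cuboid and the dimension table lookup.

```python
from collections import defaultdict


def aggregate_by_derived(cuboid_by_key, key_to_derived):
    """Roll up a cuboid keyed by X into a result keyed by the derived dimension A.

    cuboid_by_key:  {x_value: summed_measure}, the precomputed cuboid on X
    key_to_derived: {x_value: a_value}, taken from the dimension table
    """
    result = defaultdict(int)
    for key, measure in cuboid_by_key.items():
        # Replace X with its A value, then aggregate rows that share the same A.
        result[key_to_derived[key]] += measure
    return dict(result)


# Precomputed totals per X (the dimension table's primary key).
cuboid_by_x = {1: 10, 2: 5, 3: 7}
# Dimension table: each X determines exactly one A.
x_to_a = {1: "a1", 2: "a1", 3: "a2"}

print(aggregate_by_derived(cuboid_by_x, x_to_a))  # {'a1': 15, 'a2': 7}
```

The extra loop is the cost paid at query time. It grows with the cardinality of X, which is one reason the decision depends on the business scenario.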

In addition to cuboid pruning, Kylin provides other methods to reduce computation. If a scenario does not need precise deduplication, you can use a count distinct based on HyperLogLog for approximate deduplication. If precise deduplication is required, you can use a count distinct based on a bitmap.
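To illustrate why a bitmap suits precise deduplication in a precomputed store, here is a minimal sketch that uses a Python integer as an uncompressed bitmap. It assumes the values are small non-negative integer IDs. Production systems typically use compressed bitmap structures, and the function names here are illustrative only.

```python
def to_bitmap(ids):
    """Set one bit per distinct non-negative integer ID."""
    bitmap = 0
    for i in ids:
        bitmap |= 1 << i
    return bitmap


def merge(a, b):
    """Union of two bitmaps; duplicates across inputs collapse automatically."""
    return a | b


def cardinality(bitmap):
    """Exact distinct count is the number of set bits."""
    return bin(bitmap).count("1")


day1 = to_bitmap([1, 2, 3])
day2 = to_bitmap([3, 4])

# Adding per-day distinct counts double-counts user 3.
print(cardinality(day1) + cardinality(day2))  # 5
# Merging the bitmaps first gives the exact answer.
print(cardinality(merge(day1, day2)))         # 4
```

The point of the sketch is that distinct counts cannot simply be summed when results are rolled up, whereas bitmaps can be merged and then counted, at the price of storing more than a single number per cell.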

The storage of the computing results is another point that needs optimization.

Kylin stores the results in HBase by default. Considering the data structure and querying methods, HBase is indeed a good choice.

However, Kylin defines the data of each partition as a segment, and each segment corresponds to an HBase table. This puts a lot of pressure on HBase.

The number of segments may increase rapidly, resulting in a rapid increase in the number of HBase tables and a sharp increase in metadata management pressure. These factors may pose great burdens on the stability and performance of the cluster.

There are two ideas to solve this problem.

One idea is to merge segments to reduce the number of tables.

Merging alleviates the problem, but once you need to refresh some historical data, you can only refresh the entire segment. You may find that a single day's data is abnormal and still have to rebuild the data for a whole year.

This is another scenario that requires a trade-off. One option is to merge only segments older than a certain time window, delaying the merge so that large-scale recomputation is avoided as much as possible.
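The trade-off can be sketched with segments modeled as half-open day ranges. This is a toy model rather than Kylin's merge logic. It assumes the segments are sorted and contiguous, and the names `merge_old_segments`, `refresh_cost`, and `lag_days` are hypothetical.

```python
def merge_old_segments(segments, today, lag_days):
    """Merge every segment that ends at or before (today - lag_days) into one.

    segments: sorted, contiguous list of (start_day, end_day) half-open ranges.
    Recent segments stay separate so that late fixes remain cheap.
    """
    cutoff = today - lag_days
    old = [s for s in segments if s[1] <= cutoff]
    recent = [s for s in segments if s[1] > cutoff]
    merged = [(old[0][0], old[-1][1])] if old else []
    return merged + recent


def refresh_cost(segments, bad_day):
    """Number of days that must be rebuilt to fix one bad day."""
    for start, end in segments:
        if start <= bad_day < end:
            return end - start
    raise ValueError(f"day {bad_day} is not covered by any segment")


daily = [(d, d + 1) for d in range(10)]  # ten one-day segments
merged = merge_old_segments(daily, today=10, lag_days=3)

print(merged)                   # [(0, 7), (7, 8), (8, 9), (9, 10)]
print(refresh_cost(merged, 8))  # 1: a recent day is still its own segment
print(refresh_cost(merged, 2))  # 7: an old day forces a rebuild of the merged range
```

The table count drops from ten to four, while fixes inside the lag window stay cheap. Fixes to older data become more expensive, which is the cost this approach accepts.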

Another more thorough idea is to replace HBase.

The community version considered this idea early on, and some companies in the industry have their own implementations (such as Kylin on Druid).

However, the commercial version of Kylin, Kyligence Enterprise, finally adopted a solution based on a resident (long-running) Spark session plus Parquet. The open-source version is following the same path, and this approach looks set to become the mainstream solution.

In the past, it was generally believed that databases were the right place to store result data. However, with the extensive optimizations in Spark and Parquet, you can get sufficient performance without one, at least for the MOLAP scenario. This is not surprising. A database is also built on a custom file format, combined with a large number of optimizations to achieve high performance.

After solving the two major problems of dimensional pruning and storage engine replacement, Kylin has earned a place in the OLAP field by using precomputation.

I think this is also the most important thing when we are designing our architecture. In many cases, innovation does not need to be groundbreaking. A little change in idea may bring unexpected gains.

In the last ten articles, from MR to Spark and from MPP to Kylin, we have focused on the topic of batch processing and solved one problem after another. However, this does not mean that the frameworks mentioned later are better than the earlier ones and can replace them.

Most of the time, there are no silver bullets, even if many frameworks have the ambition to solve all problems. More often than not, we need to choose the most suitable framework for a specific application scenario, while other frameworks may be more suitable in other scenarios. Indeed, in large companies, these frameworks often coexist.

This is a carefully conceived series of 20-30 articles. I hope to give everyone a core grasp of the distributed system in a storytelling way. Stay tuned for the next one!
