×
Community Blog Seven Common Errors in ClickHouse Materialized View

Seven Common Errors in ClickHouse Materialized View

This article discusses the principles and best practices of Materialized View, as well as common problems and their solutions that may arise during use.

By Simian and Zhouyou

1. Introduction

In OLAP business scenarios, data needs to be stored and processed efficiently. To improve data processing performance in ClickHouse, Materialized View is an effective approach. This article delves into the working principles and best practices of Materialized View (hereinafter referred to as MV) and addresses common issues that may arise during its usage.

2. Introduction to Materialized View

Materialized View can be seen as a special type of trigger. When data is inserted, MV stores the result of SELECT query performed on the data into a target table.

It is designed to accelerate queries by preprocessing data during insertion.

CREATE MATERIALIZED VIEW mv TO dest
AS SELECT a, b, count() AS cnt
FROM source
GROUP BY a, b

As shown above, source is the source table of MV and dest is the target table of MV.

When using MV, the following three principles need to be kept in mind:

  1. Only the source table Insert triggers updates to MV. MV does not detect other operations on the source table, such as truncate, alter delete, alter update, drop partition, drop table, and rename. Note two special scenarios: data replication of ReplicatedMergeTree between replicas does not trigger MV updates, and the distributed table forwarding data to its local table triggers MV updates.
  2. MV does not query the source table but the inserted data block. One exception is that if populate is specified when creating MV, it will perform a query on the source table and initialize the results into MV.
  3. MV supports any engine source table, including engines that are not able to store data, such as Null and Kafka.

3. Typical Usage Scenarios

3.1 Used to Pre-aggregate Data

Here, the target table engines for MV are mainly SummingMergeTree or AggregatingMergeTree. MV stores the aggregation result of the data in the source table. The Insert performed on the source table triggers the updates to the MV aggregation data. The amount of data in MV may be significantly reduced compared to that in the source table. Querying MV can directly acquire the aggregation results without scanning a large amount of data in the source table each time.

For example, the following example uses SummingMergeTree to aggregate the hits value for each date/project.

-- Create the target table of SummingMergeTree.
CREATE TABLE wikistat_top_projects
(
    `date` Date,
    `project` LowCardinality(String),
    `hits` UInt32
) ENGINE = SummingMergeTree
ORDER BY (date, project);

-- Create an MV.
CREATE MATERIALIZED VIEW wikistat_top_projects_mv TO wikistat_top_projects AS
SELECT
    date(time) AS date,
    project,
    sum(hits) AS hits
FROM wikistat
GROUP BY date, project;

3.2 Data redundancy to support data queries in different dimensions

The query efficiency of ClickHouse MergeTree is greatly affected by the design of the primary key and the order by key of the table. We can create a replication table through MV. The two tables have different primary keys and order by keys to meet different query requirements.

3.3 Data extraction and transformation

For example, there is a Null table where data from multiple data sources is inserted. The Null table has multiple MVs. They perform different data transformations and write the results to different destinations.

3.4 Used with Kafka or RabbitMQ

Kafka and RabbitMQ do not have data persistence capabilities. They can be used together with MV for data persistence.

4. FAQ

The logic of MV can be summarized in just one simple sentence: when the data is inserted into the source table, MV replicates the data to the target table.

ClickHouse provides almost no restrictions on usage. Users can flexibly use MV according to their needs and can match it with other capabilities, such as MV+Distributed Table and MV+ReplicatedMergeTree. In practice, we often encounter the following situations: MV has been constructed, and the data has indeed been inserted into the source table. However, when we query within MV, the results fail to match expectations. Data is lost or duplicated.

4.1 Data aggregation should use SummingMergeTree or AggregatingMergeTree.

Consider the need to use MV to aggregate data in the source table and the customer mistakenly uses MergeTree as the target table of MV.

Error Example: Use common MergeTree for aggregation

CREATE TABLE source (a Int64, b Int64)
ENGINE = MergeTree PARTITION BY (a) ORDER BY (a,b);

-- The target table incorrectly uses MergeTree.
CREATE TABLE dest (a Int64, b Int64, cnt Int64)
ENGINE = MergeTree PARTITION BY (a) ORDER BY (a,b);

-- Create an MV based on the dest table.
CREATE MATERIALIZED VIEW mv to dest
AS SELECT a, b, count() as cnt
FROM source GROUP BY a, b;

-- Insert data into the source table.
insert into source(a,b) values(1, 1), (1,1), (1,2);
insert into source(a,b) values(1, 1), (1,1), (1,2);

-- The actual MV query result does not meet expectations.
SELECT a,b,count() FROM dest GROUP BY a, b;
┌─a─┬─b─┬─count()─┐
│ 1 │ 2 │       2 │
│ 1 │ 1 │       2 │
└───┴───┴─────────┘

-- Expected results
SELECT a, b, count() as cnt FROM source GROUP BY a, b
┌─a─┬─b─┬─sum─┐
│ 1 │ 2 │   2 │
│ 1 │ 1 │   4 │
└───┴───┴─────┘

Correct Example 1: Use SummingMergeTree

Since MV involves aggregation operations, SummingMergeTree or AggregatingMergeTree should be used.

First, if the aggregation operation is sum, you can use SummingMergeTree.

CREATE TABLE source (a Int64, b Int64)
ENGINE = MergeTree PARTITION BY (a) ORDER BY (a,b);

-- Create the target table of SummingMergeTree.
-- Note that ORDER BY(a,b ) specifies a,b as the aggregation key of SummingMergeTree.
CREATE TABLE dest_2 (a Int64, b Int64, sum UInt64)
ENGINE = SummingMergeTree ORDER BY (a, b);

-- Create an MV.
CREATE MATERIALIZED VIEW mv_2 to dest_2
AS SELECT a, b, sum(a) as sum
FROM source GROUP BY a, b;

insert into source(a,b) values(1, 1), (1,1), (1,2);
insert into source(a,b) values(1, 1), (1,1), (1,2);

-- Why is the data not aggregated according to a,b?
-- The reason is that SummingMergeTree is aggregated asynchronously during merge in the backend.
select * from mv_2;
┌─a─┬─b─┬─sum─┐
│ 1 │ 1 │   2 │
│ 1 │ 2 │   1 │
│ 1 │ 1 │   2 │
│ 1 │ 2 │   1 │
└───┴───┴─────┘

-- Therefore, group by needs to be added when querying SummingMergeTree.
select a,b,sum(a) as sum from mv_2 group by a;
┌─a─┬─b─┬───sum──┐
│ 1 │ 1 │      2 │
│ 1 │ 2 │      4 │
└───┴───┴────────┘

-- Alternatively, you can manually perform optimize table to trigger the SummingMergeTree aggregation.
optimize table dest_2 final;

-- The group by is not required after optimize.
select * from mv_2;
┌─a─┬─b─┬─sum─┐
│ 1 │ 1 │   4 │
│ 1 │ 2 │   2 │
└───┴───┴─────┘

Correct Example 2: Use AggregatingMergeTree

If there are other aggregation operations, select AggregatingMergeTree. The following describes the implementation of AggregatingMergeTree.

CREATE TABLE source (a Int64, b Int64)
ENGINE = MergeTree PARTITION BY (a) ORDER BY (a,b);

-- Create the target table of AggregatingMergeTree.
CREATE TABLE dest_3(
  a Int64, b Int64, 
  cnt AggregateFunction(count, Int64))
ENGINE = AggregatingMergeTree order by (a, b);

-- Create an MV.
CREATE MATERIALIZED VIEW mv_3 TO dest_3 AS
              SELECT a, b, countState(a) AS cnt
              FROM source GROUP BY a, b

insert into source(a,b) values(1, 1), (1,1), (1,2);
insert into source(a,b) values(1, 1), (1,1), (1,2);

-- Use Merge to query data.
select a,b,countMerge(cnt) from mv_3 group by a,b;
┌─a─┬─b─┬─countMerge(cnt)─┐
│ 1 │ 2 │               2 │
│ 1 │ 1 │               4 │
└───┴───┴─────────────────┘

4.2 The aggregation engine needs to ensure that ORDER BY is consistent with GROUP BY in SQL.

When you use SummingMergeTree, make sure that its ORDER BY is consistent with the GROUP BY fields.

1

4.3 Keep in mind that MV replication only queries Insert Block, not the original table.

As shown in the MV definition below, each insert only performs SELECT on the inserted data, rather than querying the entire source table.

2

4.4 The field name of MV must be the same as the field of the query result.

3

4.5 MV is combined with ReplicationMergeTree.

Note:

  1. ReplicatedMergeTree data replication between replicas does not trigger MV updates.
  2. Data replication of MV from the source table to the target table will be conducted to ReplicatedMergeTree for replication between replicas.

4

If you need to keep the data consistency of MV between replicas, define the target table of MV as ReplicatedMergeTree.

The data replication process is as follows:

  1. Perform Insert in Node1 and write data to Table of the replica.
  2. MV in the replica replicates simultaneously during Insert.
  3. ReplicatedMergeTree replicates data to Table in Node2.
  4. The MV target table of Node1 is ReplicatedMergeTree, which triggers data replication to the MV target table of Node2.

5

Data Duplication Issues

There is one scenario in MV and ReplicatedMergeTree that causes data duplication.

If you perform Populate on ReplicatedMergeTree, data duplication is triggered.

6

For details on this issue, please refer to: https://github.com/ClickHouse/ClickHouse/issues/9081

4.6 MV and JOIN

Note that only INSERT of the driving table triggers MV updates, but not that of the driven table.

If you want the MV to be updated when both the driving table and the driven table are inserted, you need to create two MVs. The example is as follows:

CREATE MATERIALIZED VIEW mv1 TO dest
SELECT ...
FROM source left join some_dimension on (...)

CREATE MATERIALIZED VIEW mv2 TO dest
SELECT ...
FROM some_dimension right join source on (...)

4.7 MV and Distributed Table

It can be divided into four cases to discuss:

Solution MV Source MV Target INSERT INTO Data in Target Data Replication Process, Distributed(D), Local(L)
Local Local Distributed YES 1. Write data to Table L and insert data into Table D asynchronously.
2. Trigger MV updates of Table L.
Local YES 1. Write data to Table L and trigger MV updates.
Distributed Distributed Distributed YES 1. Write data to Table D and MV of Table D is updated.
2. Write data to Table L.
Local NO 1. Write data to Table L and MV updates of Table D are not triggered.
Local Distributed Distributed YES 1. Write data to Table D and asynchronously insert Table D into Table L.
2. Trigger MV updates of Table L.
Local YES 1. Write data to Table L and trigger MV updates.
Distributed Local Distributed YES 1. Write data to Table D and asynchronously insert Table D into Table L.
2. Trigger MV updates of Table L.
Local NO 1. Write data to Table L and MV updates of Table D are not triggered.

Solution ① is more common and is used in single-shard scenarios. This solution is the simplest and most efficient. It does not cause data asynchrony.

Solutions ② and ③ are used in multi-shard scenarios. Data is distributed to different shards based on the sharding key rule.

In solution ②, if data is written to Local, data in MV is not replicated.

Appendix

[1] https://clickhouse.com/blog/using-materialized-views-in-clickhouse
[2] https://developer.aliyun.com/article/1327456
[3] https://dencrane.github.io/Everything_you_should_know_about_materialized_views_commented.pdf
[4] https://github.com/ClickHouse/ClickHouse/issues/9081
[5] https://github.com/ClickHouse/ClickHouse/issues/8336


Discover what and how you can use our products to build >>

lQLPKILhF9YTktfNBKzNBZiwsJleyQwofNYGiQokzCpPAA_1432_1196

0 1 0
Share on

ApsaraDB

423 posts | 90 followers

You may also like

Comments