1. 准备测试用表

  ClickHouse 底层提供了基于 RBO(Rule Based Optimization)的 SQL 优化实现,会对一些低效的查询语句自动进行优化,这些优化的方式实际上也是我们写高效查询的一些指导。

  1. 上传官方数据集(具体参考https://clickhouse.com/docs/zh/getting-started/example-datasets/metrica/),将visits_v1.tarhits_v1.tar上传到虚拟机,解压到 ClickHouse 数据路径下:
    1
    2
    3
    4
    5
    6
    7
    // 解压到clickhouse数据路径 
    tar -xvf hits_v1.tar -C /var/lib/clickhouse
    tar -xvf visits_v1.tar -C /var/lib/clickhouse

    //修改所属用户
    chown -R clickhouse:clickhouse /var/lib/clickhouse/data/datasets
    chown -R clickhouse:clickhouse /var/lib/clickhouse/metadata/datasets
  2. 重启clickhouse-server
    1
    systemctl restart clickhouse-server
  3. 执行查询:
    1
    2
    clickhouse-client --query "SELECT COUNT(*) FROM datasets.hits_v1"
    clickhouse-client --query "SELECT COUNT(*) FROM datasets.visits_v1"

  官方的 tar 包,包含了建库、建表语句、数据内容,这种方式不需要手动建库、建表。hits_v1表有 130 多个字段,880 多万条数据,visits_v1表有 180 多个字段,160 多万条数据。

2. COUNT优化

  在调用 count 时,如果使用的是count()或者count(*),且没有where条件时,会直接使用system.tablestotal_row
  例如:

1
2
3
4
5
6
7
EXPLAIN SELECT count() FROM datasets.hits_v1;

Union
Expression (Projection)
Expression (Before ORDER BY and SELECT)
MergingAggregated
ReadNothing (Optimized trivial count)

  注意Optimized trivial count,这是对 count 的优化。
  如果 count 具体的列字段,则不会使用此项优化:

1
2
3
4
5
6
7
8
EXPLAIN SELECT count(CounterID) FROM datasets.hits_v1;

Union
Expression (Projection)
Expression (Before ORDER BY and SELECT)
Aggregating
Expression (Before GROUP BY)
ReadFromStorage (Read from MergeTree)

3. 消除子查询重复字段

  下面语句子查询中有两个重复的 id 字段,会被去重:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
EXPLAIN SYNTAX SELECT 
a.UserID,
b.VisitID,
a.URL,
b.UserID
FROM
hits_v1 AS a
LEFT JOIN (
SELECT
UserID,
UserID as HaHa,
VisitID
FROM
visits_v1) AS b
USING (UserID)
limit 3;

// 返回优化语句:
SELECT
UserID,
VisitID,
URL,
b.UserID
FROM
hits_v1 AS a
ALL LEFT JOIN
(
SELECT
UserID,
VisitID
FROM
visits_v1
) AS b
USING (UserID)
LIMIT 3

4. 谓词下推

  当 group by 有 having 子句,但是没有 with cube、with rollup 或者 with totals 修饰的时候,having 过滤会下推到 where 提前过滤。谓词下推优化规则涉及到一个参数 enable_optimize_predicate_expression,默认是1,打开谓词下推,设置为 0 就会关闭谓词下推优化。例如下面的查询,HAVING name 变成了 WHERE name,在 group by 之前过滤:

1
2
3
4
5
6
7
8
9
10
11
EXPLAIN SYNTAX SELECT UserID FROM hits_v1 GROUP BY UserID HAVING UserID = '8585742290196126178';

// 返回优化语句
SELECT
UserID
FROM
hits_v1
WHERE
UserID = '8585742290196126178'
GROUP BY
UserID

  子查询也支持谓词下推:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
EXPLAIN SYNTAX
SELECT
*
FROM
(
SELECT
UserID
FROM
visits_v1
)
WHERE
UserID = '8585742290196126178'

// 返回优化后的语句
SELECT
UserID
FROM
(
SELECT
UserID
FROM
visits_v1
WHERE
UserID = '8585742290196126178'
)
WHERE
UserID = '8585742290196126178'

  再来一个复杂例子:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
EXPLAIN SYNTAX 
SELECT * FROM (
SELECT
*
FROM
(
SELECT
UserID
FROM visits_v1)
UNION ALL
SELECT
*
FROM
(
SELECT
UserID
FROM visits_v1)
)
WHERE
UserID = '8585742290196126178'

// 返回优化后的语句
SELECT UserID
FROM
(
SELECT UserID
FROM
(
SELECT
UserID
FROM
visits_v1
WHERE
UserID = '8585742290196126178'
)
WHERE
UserID = '8585742290196126178'
UNION ALL
SELECT
UserID
FROM
(
SELECT
UserID
FROM
visits_v1
WHERE
UserID = '8585742290196126178'
)
WHERE
UserID = '8585742290196126178'
)
WHERE
UserID = '8585742290196126178'

5. 聚合计算外推

  聚合函数内的计算,会外推,例如:

1
2
3
4
5
6
7
8
9
10
11
EXPLAIN SYNTAX
SELECT
sum(UserID * 2)
FROM
visits_v1

// 返回优化后的语句
SELECT
sum(UserID) * 2
FROM
visits_v1

6. 聚合函数消除

  如果对聚合键,也就是 group by key 使用 min、max、any 聚合函数,则将函数消除,例如:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
EXPLAIN SYNTAX
SELECT
sum(UserID * 2),
max(VisitID),
max(UserID)
FROM
visits_v1
GROUP BY
UserID

// 返回优化后的语句
SELECT
sum(UserID) * 2,
max(VisitID),
UserID
FROM
visits_v1
GROUP BY
UserID

7. 删除重复的order by key

  例如下面的语句,重复的聚合键 id 字段会被去重:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
EXPLAIN SYNTAX
SELECT
*
FROM
visits_v1
ORDER BY
UserID ASC,
UserID ASC,
VisitID ASC,
VisitID ASC

// 返回优化后的语句
select
···
FROM
visits_v1
ORDER BY
UserID ASC,
VisitID ASC

8. 删除重复的limit by key

  例如下面的语句,重复声明的 name 字段会被去重:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
EXPLAIN SYNTAX
SELECT
*
FROM
visits_v1
LIMIT 3 BY
VisitID,
VisitID
LIMIT 10

// 返回优化后的语句
select
···
FROM
visits_v1
LIMIT 3 BY
VisitID
LIMIT 10

9. 删除重复的 USING Key

  例如下面的语句,重复的关联键 id 字段会被去重:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
EXPLAIN SYNTAX
SELECT
a.UserID,
a.UserID,
b.VisitID,
a.URL,
b.UserID
FROM
hits_v1 AS a
LEFT JOIN
visits_v1 AS b
USING (UserID, UserID)

// 返回优化后的语句
SELECT
UserID,
UserID,
VisitID,
URL,
b.UserID
FROM
hits_v1 AS a
ALL LEFT JOIN
visits_v1 AS b
USING (UserID)

10. 标量替换

  如果子查询只返回一行数据,在被引用的时候用标量替换,例如下面语句中的 total_disk_usage 字段:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
EXPLAIN SYNTAX
WITH
(
SELECT
sum(bytes)
FROM
system.parts
WHERE
active
) AS total_disk_usage
SELECT
(sum(bytes) / total_disk_usage) * 100 AS table_disk_usage,
table
FROM
system.parts
GROUP BY
table
ORDER BY
table_disk_usage DESC
LIMIT 10;

// 返回优化后的语句
WITH CAST(0, \'UInt64\') AS total_disk_usage
SELECT
(sum(bytes) / total_disk_usage) * 100 AS table_disk_usage,
table
FROM
system.parts
GROUP BY
table
ORDER BY
table_disk_usage DESC
LIMIT 10

11. 三元运算优化

  如果开启了 optimize_if_chain_to_multiif 参数,三元运算符会被替换成 multiIf 函数,例如:

1
2
3
4
5
6
7
8
9
10
11
12
13
EXPLAIN SYNTAX 
SELECT
number = 1 ? 'hello' : (number = 2 ? 'world' : 'clickhouse')
FROM
numbers(10)
settings optimize_if_chain_to_multiif = 1;

// 返回优化后的语句
SELECT
multiIf(number = 1, 'hello', number = 2, 'world', 'clickhouse')
FROM
numbers(10)
SETTINGS optimize_if_chain_to_multiif = 1;

参考文献

  【1】https://clickhouse.com/docs/zh/
  【2】https://www.bilibili.com/video/BV1Yh411z7os?from=search&seid=4579023877699743987&spm_id_from=333.337.0.0
  【3】https://clickhouse.com/docs/zh/