SAMPLE function reference - MaxCompute (cloud-native big data computing service)

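SAMPLE is a data sampling function. It samples data according to the x and y settings, based on the values of all column_name columns that are read, and keeps only the rows that meet the sampling condition.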

Syntax

boolean sample(<x>[, <y>[, <column_name1>, <column_name2>[,...]]])

Parameters

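x, y: x is required. Both are of the BIGINT type and must be integer constants greater than 0. The data is hashed into x parts and the y-th part is taken.
y is optional. If it is omitted, the first part is taken by default. If y is omitted, column_name must be omitted as well.
An error is raised if x or y is of another type or is less than or equal to 0, or if y is greater than x. If either x or y is NULL, NULL is returned.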
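column_name: optional. The target column or columns for sampling. If this parameter is omitted, rows are sampled randomly according to the values of x and y. The column can be of any type and its values can be NULL. No implicit type conversion is performed. If column_name is the constant NULL, an error is returned.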
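Note
- To avoid data skew caused by NULL values, NULL values in column_name are hashed uniformly across the x parts. If column_name is not specified, the sampled data may be unevenly distributed. Specifying column_name is recommended for better output.
- Random sampling is currently supported only for columns of the following data types: bigint, datetime, boolean, double, string, binary, char, varchar.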
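
The argument rules above can be sketched as the following calls. This is an illustrative sketch only: it assumes the mf_sample table used in the examples section of this page, and the comments on the invalid calls restate the rules above rather than quoting actual MaxCompute error output.

```sql
-- Hash into 4 parts and take the first part
-- (y is omitted, so column_name must be omitted too).
select * from mf_sample where sample(4);

-- Hash into 4 parts and take the 3rd part; no column is given,
-- so rows are sampled randomly.
select * from mf_sample where sample(4, 3);

-- Hash on the values of col1 and col2 into 4 parts and take the 3rd part.
select * from mf_sample where sample(4, 3, col1, col2);

-- Calls that the rules above describe as errors:
--   sample(0)           x must be greater than 0
--   sample(2, 3)        y must not be greater than x
--   sample(2, 1, NULL)  column_name must not be the constant NULL
```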

Return value

A value of the BOOLEAN type is returned.

Examples

Suppose a table named mf_sample exists and contains the following data:

+------------+--------+--------+---------+-----------+
| id         | col1   | col2   | col3    | col4      |
+------------+--------+--------+---------+-----------+
| 3          | eee    | rrr    | tttt    | ggggg     |
| 4          | yyy    | uuuu   | iiiii   | ccccccc   |
| 1          | "abc"  | "bcd"  | "rthg"  | "ahgjeog" |
| 2          | "a1bc" | "bc1d" | "rt1hg" | "ahgjeog" |
+------------+--------+--------+---------+-----------+

Randomly hash the table data into 2 parts and take the 1st part:

select * from mf_sample where sample(2,1);

The following result is returned:

+------------+--------+--------+---------+-----------+
| id         | col1   | col2   | col3    | col4      |
+------------+--------+--------+---------+-----------+
| 3          | eee    | rrr    | tttt    | ggggg     |
| 1          | "abc"  | "bcd"  | "rthg"  | "ahgjeog" |
+------------+--------+--------+---------+-----------+

Hash the table data into 2 parts based on the id column and take the 1st part:

select * from mf_sample where sample(2,1,id);

The following result is returned:

+------------+--------+--------+---------+-----------+
| id         | col1   | col2   | col3    | col4      |
+------------+--------+--------+---------+-----------+
| 4          | yyy    | uuuu   | iiiii   | ccccccc   |
| 2          | "a1bc" | "bc1d" | "rt1hg" | "ahgjeog" |
+------------+--------+--------+---------+-----------+
