By Zhenyu
MaxCompute (formerly known as ODPS) is a leading distributed big data processing platform developed by Alibaba Cloud. It is widely utilized, especially within the Alibaba Group, and supports the core businesses of multiple business units (BUs). ODPS V2.0, continuously optimizing performance, aims to enhance the user experience and expression capabilities of SQL, as well as improve the productivity of ODPS developers.
Building upon the SQL engine of ODPS V2.0, MaxCompute simplifies SQL compilation and enhances the language's expressiveness. We present a series of articles titled Unleash the Power of MaxCompute to explore the capabilities of MaxCompute (ODPS V2.0).
In the previous article, I introduced basic data types and built-in functions. In this article, I will introduce the complex type functions.
The original ODPS supports two complex types, ARRAY and MAP. However, these types may still be insufficient in certain scenarios.
• Scenario 1
In my project, I needed to generate an intermediate table where it would be optimal to have an array in one of the columns. Storing each element of the array in separate rows would result in excessive data volume due to the repetition of other columns. After extensively searching through documentation, I found that the only way to achieve this was by converting the source data column to STRING, using wm_concat to aggregate the values, and then using the split function to break it into an ARRAY<STRING>
. Although the original type information was lost, STRING seemed to work. Okay, let's proceed. In a subsequent operation, I needed to access the last element of the array. I attempted to use the array subscript with the size function, my_array[size(my_array)]
, but encountered an error stating that the subscript must be constant. However, my array was not of a fixed length. I wondered if there was a function to reverse the array, but there wasn't. Ultimately, I had to give up using arrays...
• Scenario 2
My task was to generate a curve for each advertisement, representing the expected number of impressions and clicks as the advertiser's bid increased. The most natural approach would involve a data structure with bids, impression times, and click times. However, ODPS does not support this. Consequently, I had to encode the information into a string, perform the necessary operations, and then decode it. This process was tiresome and inefficient, but there was no alternative...
MaxCompute uses an SQL engine based on ODPS V2.0, which greatly improves complex types and provides supporting built-in functions to generally solve the above problems.
MaxCompute Studio supports complex data types from V2.8.0.2. If you are using an older version, we recommend upgrading to the latest version.
First, please install MaxCompute Studio, connect to a MaxCompute project, and create a new MaxCompute script file, as follows.
After running, you can find the newly created table in the Project Explorer of MaxCompute Studio. View the details of the table and preview the data, as shown in the following figure.
You can see that MaxCompute supports the ARRAY, MAP, and STRUCT types. They can be nested.
MaxCompute Studio allows you to import and export data in tables. For more information, see this document.
• The following table describes the complex types supported by MaxCompute.
Type | Definition example | Construction example | Access example |
ARRAY | array<int>, array<struct<a:int, b:string>> |
array(1, 2, 3), array(array(1, 2), array(3, 4)) |
a[1], a[x][y] |
MAP | map<string, string>, map<tinyint, array<string>> |
map("k1", "v1", "k2", "v2"), map(1Y, array('a', 'b'), 2Y, array('x', 'y')) |
m['k1'], m[2Y][id] |
STRUCT | struct<x:int, y:int>, struct<a:array<int>, b:map<int, int>> |
named_struct('x', 1, 'y', 2), named_struct('a', array(1, 2), 'b', map(1, 7, 2, 8) |
s.x, s.b[1] |
Return type | Signature | Annotations |
MAP<K, V> |
map(K key1, V value1, K key2, V value2, ...) |
Use a given key/value pair to create a map. All keys are of the same type and must be of the basic type. All values are of the same type and can be of any type. |
ARRAY<K> |
map_keys(Map<K, V> m) |
Return all the keys of the map in the parameter as an array. If you enter NULL, NULL is returned. |
ARRAY<V> |
map_values(MAP<K, V> m) |
Return all the values of the map in the parameter as an array. If you enter NULL, NULL is returned. |
int |
size(MAP<K, V>) |
Obtain the number of given MAP elements. |
TABLE<K, V> |
explode(MAP<K, V>) |
The table-valued function expands the given map, with one row for each key/value and two columns for each row corresponding to the key and value respectively. |
ARRAY<T> |
array(T value1, T value2, ...) |
Use the given value to construct an ARRAY. All values are of the same type. |
int |
size(ARRAY<T>) |
Obtain the number of given ARRAY elements. |
boolean |
array_contains(ARRAY<T> a, value v) |
Check whether a given ARRAY a contains v. |
ARRAY<T> |
sort_array(ARRAY<T>) |
Sort the given array. |
ARRAY<T> |
collect_list(T col) |
An aggregation function that aggregates the expression specified by col into an array within a given group. |
ARRAY<T> |
collect_set(T col) |
An aggregation function that aggregates the expression specified by col into a set array without duplicate elements within a given group. |
TABLE<T> |
explode(ARRAY<T>) |
A Table-valued function that expands the given ARRAY, with one row for each value. One column for each row corresponds to the relevant array elements. |
TABLE (int, T) |
posexplode(ARRAY<T>) |
A table-valued function that expands the given ARRAY, with one row for each value. Two columns for each row correspond to the subscript and array elements of the array starting from 0. |
STRUCT<col1:T1, col2:T2, ...> |
struct(T1 value1, T2 value2, ...) |
Create a struct by using a given value list. Each value can be of any type. The names of the fields that generate the struct are col1, col2, ... |
STRUCT<name1:T1, name2:T2, ...> |
named_struct(name1, value1, name2, value2, ...) |
Use the given name/value list to create a struct. Each value can be of any type. The names of the fields that generate the struct are name1, name2, ... |
TABLE (f1 T1, f2 T2, ...) |
inline(ARRAY<STRUCT<f1:T1, f2:T2, ...>>) |
A table-valued function that expands a given struct array. Each element corresponds to one row, and each struct element in each row corresponds to one column. |
The original ODPS does not support accessing any complex types in UDFs. MaxCompute Java UDFs support all complex types. Python UDFs will also support them in the near future.
The UDFs of ODPS are classified into three types: UDFs, UDAFs, and UDTFs.
@Resolve("array<string>,struct<a1:bigint,b1:string>,string->map<string,bigint>,struct<b1:bigint, b2:binary>")
java.util.List
, map corresponds to java.util.Map
, and struct corresponds to com.aliyun.odps.data.Struct
. Note that com.aliyun.odps.data.Struct cannot see the field name and field type from reflection. Therefore, you need to use @Resolve
as an aid. That is, if you need to use a struct in a UDF, you must also mark the @Resolve
annotation on the UDF class. This annotation only affects the overloads that contain com.aliyun.odps.data.Struct
in the parameters or return values. Currently, only one @Resolve
annotation can be provided in a class. So, there can only be one overload with a struct parameter or return value in a UDF. This is a restriction now. We are improving it and will remove it later.For example, the following code defines a UDF with three overloads. The first parameter uses an array as the parameter, the second parameter uses the map as the parameter, and the third parameter uses struct as the parameter. The third overloads use struct as the parameter or return value. So, you must add a @Resolve
annotation to the UDF class to specify the specific type of the struct.
@Resolve("struct<a:bigint>,string->string")
public class UdfArray extends UDF {
public String evaluate(List<String> vals, Long len) {
return vals.get(len.intValue());
}
public String evaluate(Map<String,String> map, String key) {
return map.get(key);
}
public String evaluate(Struct struct, String key) {
return struct.getFieldValue("a") + key;
}
}
You can directly pass complex types into a UDF:
create function my_index as 'UdfArray' using 'myjar.jar';
select id, my_index(array('red', 'yellow', 'green'), colorOrdinal) as color_name from colors;
MaxCompute expands its support for complex data types, enabling better adaptation to various application scenarios. MaxCompute will continue to enhance the type system in terms of compatibility and expression capabilities. Starting from the next article in this series, I will introduce other improvements of MaxCompute in SQL.
MaxCompute Unleashed - Part 2: Basic Data Types and Built-in Functions
137 posts | 20 followers
FollowAlibaba Cloud MaxCompute - January 22, 2024
Alibaba Cloud MaxCompute - January 22, 2024
Alibaba Cloud MaxCompute - January 29, 2024
Alibaba Cloud MaxCompute - January 29, 2024
Alibaba Cloud MaxCompute - February 7, 2024
Alibaba Cloud MaxCompute - January 29, 2024
137 posts | 20 followers
FollowConduct large-scale data warehousing with MaxCompute
Learn MoreAlibaba Cloud provides big data consulting services to help enterprises leverage advanced data technology.
Learn MoreAlibaba Cloud experts provide retailers with a lightweight and customized big data consulting service to help you assess your big data maturity and plan your big data journey.
Learn MoreRealtime Compute for Apache Flink offers a highly integrated platform for real-time data processing, which optimizes the computing of Apache Flink.
Learn MoreMore Posts by Alibaba Cloud MaxCompute