MaxCompute Unleashed - Part 3: Complex Type Functions

By Zhenyu

MaxCompute (formerly known as ODPS) is a leading distributed big data processing platform developed by Alibaba Cloud. It is widely utilized, especially within the Alibaba Group, and supports the core businesses of multiple business units (BUs). ODPS V2.0, continuously optimizing performance, aims to enhance the user experience and expression capabilities of SQL, as well as improve the productivity of ODPS developers.

Building upon the SQL engine of ODPS V2.0, MaxCompute simplifies SQL compilation and enhances the language's expressiveness. We present a series of articles titled Unleash the Power of MaxCompute to explore the capabilities of MaxCompute (ODPS V2.0).

In the previous article, I introduced basic data types and built-in functions. In this article, I will introduce the complex type functions.

The original ODPS supports two complex types, ARRAY and MAP. However, these types may still be insufficient in certain scenarios.

• Scenario 1

In my project, I needed to generate an intermediate table where it would be optimal to have an array in one of the columns. Storing each element of the array in separate rows would result in excessive data volume due to the repetition of other columns. After extensively searching through documentation, I found that the only way to achieve this was by converting the source data column to STRING, using wm_concat to aggregate the values, and then using the split function to break it into an ARRAY<STRING>. Although the original type information was lost, STRING seemed to work. Okay, let's proceed. In a subsequent operation, I needed to access the last element of the array. I attempted to use the array subscript with the size function, my_array[size(my_array)], but encountered an error stating that the subscript must be constant. However, my array was not of a fixed length. I wondered if there was a function to reverse the array, but there wasn't. Ultimately, I had to give up using arrays...

• Scenario 2

My task was to generate a curve for each advertisement, representing the expected number of impressions and clicks as the advertiser's bid increased. The most natural approach would involve a data structure with bids, impression times, and click times. However, ODPS does not support this. Consequently, I had to encode the information into a string, perform the necessary operations, and then decode it. This process was tiresome and inefficient, but there was no alternative...

MaxCompute uses an SQL engine based on ODPS V2.0, which greatly improves complex types and provides supporting built-in functions to generally solve the above problems.

The Extension of Complex Types

MaxCompute Studio supports complex data types from V2.8.0.2. If you are using an older version, we recommend upgrading to the latest version.

First, please install MaxCompute Studio, connect to a MaxCompute project, and create a new MaxCompute script file, as follows.

After running, you can find the newly created table in the Project Explorer of MaxCompute Studio. View the details of the table and preview the data, as shown in the following figure.

You can see that MaxCompute supports the ARRAY, MAP, and STRUCT types. They can be nested.

MaxCompute Studio allows you to import and export data in tables. For more information, see this document.

• The following table describes the complex types supported by MaxCompute.

Type	Definition example	Construction example	Access example
ARRAY	`array<int>, array<struct<a:int, b:string>>`	`array(1, 2, 3), array(array(1, 2), array(3, 4))`	`a[1], a[x][y]`
MAP	`map<string, string>, map<tinyint, array<string>>`	`map("k1", "v1", "k2", "v2"), map(1Y, array('a', 'b'), 2Y, array('x', 'y'))`	`m['k1'], m[2Y][id]`
STRUCT	`struct<x:int, y:int>, struct<a:array<int>, b:map<int, int>>`	`named_struct('x', 1, 'y', 2), named_struct('a', array(1, 2), 'b', map(1, 7, 2, 8)`	`s.x, s.b[1]`

Complex Type Constructions and Operation Functions

Return type	Signature	Annotations
`MAP<K, V>`	`map(K key1, V value1, K key2, V value2, ...)`	Use a given key/value pair to create a map. All keys are of the same type and must be of the basic type. All values are of the same type and can be of any type.
`ARRAY<K>`	`map_keys(Map<K, V> m)`	Return all the keys of the map in the parameter as an array. If you enter NULL, NULL is returned.
`ARRAY<V>`	`map_values(MAP<K, V> m)`	Return all the values of the map in the parameter as an array. If you enter NULL, NULL is returned.
`int`	`size(MAP<K, V>)`	Obtain the number of given MAP elements.
`TABLE<K, V>`	`explode(MAP<K, V>)`	The table-valued function expands the given map, with one row for each key/value and two columns for each row corresponding to the key and value respectively.
`ARRAY<T>`	`array(T value1, T value2, ...)`	Use the given value to construct an ARRAY. All values are of the same type.
`int`	`size(ARRAY<T>)`	Obtain the number of given ARRAY elements.
`boolean`	`array_contains(ARRAY<T> a, value v)`	Check whether a given ARRAY a contains v.
`ARRAY<T>`	`sort_array(ARRAY<T>)`	Sort the given array.
`ARRAY<T>`	`collect_list(T col)`	An aggregation function that aggregates the expression specified by col into an array within a given group.
`ARRAY<T>`	`collect_set(T col)`	An aggregation function that aggregates the expression specified by col into a set array without duplicate elements within a given group.
`TABLE<T>`	`explode(ARRAY<T>)`	A Table-valued function that expands the given ARRAY, with one row for each value. One column for each row corresponds to the relevant array elements.
`TABLE (int, T)`	`posexplode(ARRAY<T>)`	A table-valued function that expands the given ARRAY, with one row for each value. Two columns for each row correspond to the subscript and array elements of the array starting from 0.
`STRUCT<col1:T1, col2:T2, ...>`	`struct(T1 value1, T2 value2, ...)`	Create a struct by using a given value list. Each value can be of any type. The names of the fields that generate the struct are col1, col2, ...
`STRUCT<name1:T1, name2:T2, ...>`	`named_struct(name1, value1, name2, value2, ...)`	Use the given name/value list to create a struct. Each value can be of any type. The names of the fields that generate the struct are name1, name2, ...
`TABLE (f1 T1, f2 T2, ...)`	`inline(ARRAY<STRUCT<f1:T1, f2:T2, ...>>)`	A table-valued function that expands a given struct array. Each element corresponds to one row, and each struct element in each row corresponds to one column.

Use Complex Types in UDFs

The original ODPS does not support accessing any complex types in UDFs. MaxCompute Java UDFs support all complex types. Python UDFs will also support them in the near future.

Representation of Complex Types in JAVA UDFs

The UDFs of ODPS are classified into three types: UDFs, UDAFs, and UDTFs.

UDAFs and UDTFs use @Resolve annotation to specify the signature. After MaxCompute V2.0 is published, users can go through the Resolve annotation. For example, @Resolve("array<string>,struct<a1:bigint,b1:string>,string->map<string,bigint>,struct<b1:bigint, b2:binary>")
UDFs use the signature of the evaluate method to map the input and output types of UDFs. In this case, refer to the mapping between the MaxCompute type and the Java type. Array corresponds to java.util.List, map corresponds to java.util.Map, and struct corresponds to com.aliyun.odps.data.Struct. Note that com.aliyun.odps.data.Struct cannot see the field name and field type from reflection. Therefore, you need to use @Resolve as an aid. That is, if you need to use a struct in a UDF, you must also mark the @Resolve annotation on the UDF class. This annotation only affects the overloads that contain com.aliyun.odps.data.Struct in the parameters or return values. Currently, only one @Resolve annotation can be provided in a class. So, there can only be one overload with a struct parameter or return value in a UDF. This is a restriction now. We are improving it and will remove it later.

Use Cases

For example, the following code defines a UDF with three overloads. The first parameter uses an array as the parameter, the second parameter uses the map as the parameter, and the third parameter uses struct as the parameter. The third overloads use struct as the parameter or return value. So, you must add a @Resolve annotation to the UDF class to specify the specific type of the struct.

@Resolve("struct<a:bigint>,string->string")
public class UdfArray extends UDF {
  public String evaluate(List<String> vals, Long len) {
    return vals.get(len.intValue());
  }

  public String evaluate(Map<String,String> map, String key) {
    return map.get(key);
  }

  public String evaluate(Struct struct, String key) {
    return struct.getFieldValue("a") + key;
  }
}

You can directly pass complex types into a UDF:

create function my_index as 'UdfArray' using 'myjar.jar';
select id, my_index(array('red', 'yellow', 'green'), colorOrdinal) as color_name from colors;

Summary

MaxCompute expands its support for complex data types, enabling better adaptation to various application scenarios. MaxCompute will continue to enhance the type system in terms of compatibility and expression capabilities. Starting from the next article in this series, I will introduce other improvements of MaxCompute in SQL.

Community

MaxCompute Unleashed - Part 3: Complex Type Functions

The Extension of Complex Types

Complex Type Constructions and Operation Functions

Use Complex Types in UDFs

Representation of Complex Types in JAVA UDFs

Use Cases

Summary

Read previous post:

Read next post:

Alibaba Cloud MaxCompute

You may also like

Comments

Alibaba Cloud MaxCompute

Related Products

MaxCompute

Big Data Consulting for Data Technology Solution

Big Data Consulting Services for Retail Solution

Realtime Compute for Apache Flink