By Haiqing
MaxCompute (formerly known as ODPS) is a leading distributed big data processing platform developed by Alibaba Cloud. It is widely used, especially within the Alibaba Group, where it supports the core businesses of multiple business units (BUs). While continuously optimizing performance, ODPS V2.0 also aims to improve the SQL user experience and expressiveness and to raise the productivity of ODPS developers.
Building upon the SQL engine of ODPS V2.0, MaxCompute simplifies SQL compilation and enhances the language's expressiveness. We present a series of articles titled Unleash the Power of MaxCompute to explore the capabilities of MaxCompute (ODPS V2.0).
The previous article explained how to incorporate scripts in various languages into MaxCompute SQL. The main advantage of SELECT TRANSFORM is the ability to run scripts in different languages without creating functions or uploading resources; even when resources do need to be written, MaxCompute imposes no format or dependency requirements on them. This article introduces another feature that takes this advantage a step further: User Defined Types, or UDTs.
A task that is trivial in another language, such as a single method call on a built-in Java class, may have no corresponding built-in function in MaxCompute. Writing a UDF for such a simple task feels unnecessarily heavyweight and offers a poor experience.
SELECT TRANSFORM allows you to embed scripts directly into SQL statements, which improves code readability (unlike the "black box" nature of UDFs, a script inside the SQL is immediately understandable) and maintainability (there is no need to commit multiple code files, particularly when the SQL and the script files live in separate repositories). However, this does not work for languages such as Java that must be compiled before execution. Is there a way for these languages to enjoy the same benefits?
There's a need to invoke third-party libraries within SQL to perform certain functions. Ideally, this could be done directly in SQL without wrapping the functionality in a UDF layer.
UDTs offer viable solutions to these scenarios. The following sections will introduce how to use UDTs in detail.
Many examples in this article are demonstrated in MaxCompute Studio. If you have not installed MaxCompute Studio, see its installation guide; you will also need to create a MaxCompute project to import and test the examples.
Note: The provided examples are for reference and require MaxCompute Studio to run.
UDTs (User Defined Types) in MaxCompute allow you to directly reference classes or objects from third-party languages in SQL statements to retrieve data or invoke methods.
While other SQL engines also have the concept of UDTs, they differ significantly from MaxCompute's. In many SQL engines, a UDT resembles the struct composite type in MaxCompute; some dialects also let you define custom types with the CREATE TYPE statement, as in Oracle databases. UDTs in MaxCompute are closer to the CREATE TYPE concept: a MaxCompute UDT contains not only data fields but also methods. In addition, MaxCompute simplifies usage by allowing classes to be referenced directly in SQL, with no special DDL syntax needed to define the type mapping.
A simple example is as follows:
set odps.sql.type.system.odps2=true; -- Enable the new type system, because the following statement uses Integer, which maps to the new INT type.
SELECT java.lang.Integer.MAX_VALUE;
The output of the above example:
+------------+
| max_value  |
+------------+
| 2147483647 |
+------------+
As in Java, the java.lang package name can be omitted, so the preceding example can be abbreviated as:
set odps.sql.type.system.odps2=true;
SELECT Integer.MAX_VALUE;
As you can see, the above example directly writes an expression similar to a Java expression in the select list, and this expression is indeed executed according to the semantics of Java. This example demonstrates the capabilities of MaxCompute UDTs.
You can use UDFs to implement all features provided by UDTs. For example, if you use a UDF to implement the preceding example, you need to perform the following operations.
First, define a UDF class.
package com.aliyun.odps.test;

public class IntegerMaxValue extends com.aliyun.odps.udf.UDF {
    public Integer evaluate() {
        return Integer.MAX_VALUE;
    }
}
Next, compile the UDF into a jar package, then upload the jar and create a function:
add jar odps-test.jar;
create function integer_max_value as 'com.aliyun.odps.test.IntegerMaxValue' using 'odps-test.jar';
Finally, use the function in SQL:
select integer_max_value();
UDTs simplify the preceding procedures and allow developers to easily extend SQL features in other languages.
The preceding example shows Java's ability to access static fields, but UDTs are far more powerful. For example:
-- Sample data
@table1 := select * from values ('100000000000000000000') as t(x);
@table2 := select * from values (100L) as t(y);
-- Code logic
@a := select new java.math.BigInteger(x) x from @table1; -- Create an object by using the new operator.
@b := select java.math.BigInteger.valueOf(y) y from @table2; -- Call a static method.
select /*+mapjoin(b)*/ x.add(y).toString() from @a a join @b b; -- Call an instance method.
The output of the preceding example is 100000000000000000100.
This example also shows a feature that is hard to implement with UDFs: subquery results may contain UDT columns. For example, the column x produced by variable @a is of the java.math.BigInteger type rather than a built-in type. You can pass UDT data to another operator and call its methods there, and UDT data can also participate in data shuffling. The execution diagram of the preceding example in MaxCompute Studio is as follows:
The figure contains three stages: M1, R2, and J3. Readers familiar with the MapReduce principle will know that multiple stages appear because the join requires a data reshuffle. Typically, different stages run in different processes or even on different physical machines. Double-click the box representing M1 to see the following:
M1 performs only the new java.math.BigInteger(x) operation. Similarly, clicking the box representing J3 shows that J3 performs both the java.math.BigInteger.valueOf(y) operation and the x.add(y).toString() operation, in different operators:
These operations are executed not only in different stages but possibly in different processes and on different physical machines. UDTs encapsulate all of this, so the code reads almost as if everything ran in the same JVM.
UDTs also allow you to upload jar packages and reference them directly. For example, to reuse the jar package from the preceding UDF example with UDTs:
set odps.sql.type.system.odps2=true;
set odps.sql.session.resources=odps-test.jar; -- Specify the jars to reference. These jars must be uploaded to the project in advance as jar-type resources.
select new com.aliyun.odps.test.IntegerMaxValue().evaluate();
If writing the fully qualified class name is cumbersome, you can use a flag to specify default packages, similar to Java's import statement:
set odps.sql.type.system.odps2=true;
set odps.sql.session.resources=odps-test.jar;
set odps.sql.session.java.imports=com.aliyun.odps.test.*; -- Specify the default package
select new IntegerMaxValue().evaluate();
Several flags are provided to make UDTs more convenient to use:
set odps.sql.session.resources=foo.sh,bar.txt;
Note that this is the same flag used to specify resources in SELECT TRANSFORM, so it affects both SELECT TRANSFORM and UDTs.
UDTs support the following operations:

- Instantiating objects with the new operator, including arrays, for example new Integer[] { 1, 2, 3 }.
- Calling methods, including static methods. Note: static import is currently not supported.
- Accessing fields, including static fields.
- Operators, which keep their SQL semantics. Note: String.valueOf(1) + String.valueOf(2) is 3 (the strings are implicitly converted to DOUBLE and added), not '12' (in Java, string addition means concatenation). Another operation that is easy to confuse is =. In SQL, = is not an assignment but an equality test. For Java objects, the equals method should be used to determine equality; equality determined by the equal sign has no guaranteed behavior (in UDT scenarios, the concept of "the same object" cannot be guaranteed; see the restrictions below).
- A one-to-one mapping between built-in types and specific Java types, which also works in UDTs. Data of built-in types can call Java methods directly, such as '123'.length() and 1L.hashCode(), and UDT results can be passed to built-in functions, such as chr(Long.valueOf('100')): Long.valueOf returns data of the java.lang.Long type, while the built-in function chr accepts data of the BIGINT type. Note: new types such as INT require set odps.sql.type.system.odps2=true; to be used; otherwise, an error occurs.
- Generic type inference. Take java.util.Arrays.asList(new java.math.BigInteger('1')) as an example: the compiler knows from the parameter type that the method returns the java.util.List<java.math.BigInteger> type. Note that constructors need explicit type parameters; otherwise, java.lang.Object is used, which is consistent with Java. The result of new java.util.ArrayList(java.util.Arrays.asList('1', '2')) is the java.util.ArrayList<Object> type, while the result of new java.util.ArrayList<String>(java.util.Arrays.asList('1', '2')) is the java.util.ArrayList<String> type.
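To see these rules in action, here is a minimal sketch (assuming the odps2 type flag, as in the earlier examples) that exercises the operator semantics and the type mapping described above:
set odps.sql.type.system.odps2=true;
select
'123'.length(), -- A Java method called on a built-in STRING; returns 3.
String.valueOf(1) + String.valueOf(2), -- SQL semantics: implicit conversion to DOUBLE and addition, not Java concatenation.
chr(Long.valueOf('100')); -- A java.lang.Long result accepted where the built-in BIGINT is expected; returns 'd'.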
UDTs are also subject to the following restrictions:

- UDTs do not have a clear definition of equal objects. This is caused by data reshuffling. As the join example in the first part shows, objects may be transmitted between different processes and different physical machines, and during transmission two references to one object may end up referring to different objects (for example, the object is first shuffled to two machines and then shuffled back together). Therefore, do not use the = operator to determine equality; use the equals method instead.
- Currently, UDT objects cannot be used directly as grouping keys. However, to get the effect of group by new java.math.BigInteger('123'), you can write group by new java.math.BigInteger('123').hashCode(): since the return value of hashCode is INT, the value can be used as the built-in type INT (corresponding to the preceding rule on mapping between built-in types and specific Java types). Note: This restriction is planned to be removed in future versions.
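A minimal sketch of that workaround (assuming the odps2 type flag, as in the earlier examples):
set odps.sql.type.system.odps2=true;
@t := select * from values ('123'), ('123'), ('456') as t(s);
select new java.math.BigInteger(s).hashCode() h, count(*) cnt
from @t
group by new java.math.BigInteger(s).hashCode(); -- Group by the INT hash instead of the BigInteger object.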
A set of type conversion rules also applies in UDTs, including implicit conversion between Java types and the built-in types they map to.

Currently, UDT objects cannot be stored. This means that UDT objects cannot be inserted into tables (in fact, DDL does not support UDTs, so such tables cannot be created), except when they are implicitly converted to built-in types. Likewise, the final output of a query cannot be of a UDT type. However, all Java classes have the toString() method, and the java.lang.String type is a legal output type, so you can call this method to inspect UDT data during debugging.
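For example, this minimal query is legal because its final output is a string:
select new java.math.BigInteger('123').toString(); -- Prints 123; selecting the BigInteger object itself as the final output would not be allowed.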
Alternatively, the following flag is available:
set odps.sql.udt.display.tostring=true;
With this flag, MaxCompute automatically wraps every column whose final output is a UDT with java.util.Objects.toString(...), which is convenient for debugging. The flag is typically used only for debugging, because it applies to queries that print to the screen and cannot be applied to INSERT statements.

UDTs also support Java arrays, as the following example shows:
set odps.sql.type.system.odps2=true;
set odps.sql.udt.display.tostring=true;
select
new Integer[10], -- Create an array with 10 elements.
new Integer[] {c1, c2, c3}, -- Initialize an array that contains three elements.
new Integer[][] { new Integer[] {c1, c2}, new Integer[] {c3, c4} }, -- Create a multi-dimensional array.
new Integer[] {c1, c2, c3} [2], -- Access the elements in the array using indexes.
java.util.Arrays.asList(c1, c2, c3) -- Create a List<Integer>, which can also be used as array<int>; this is another way to create built-in array data.
from values (1,2,3,4) as t(c1, c2, c3, c4);
The UDT runtime comes with a built-in Gson dependency (v2.2.4), so you can use Gson directly:
set odps.sql.type.system.odps2=true;
set odps.sql.session.java.imports=java.util.*,com.google.gson.*; -- Import multiple packages at a time, separated by commas.
@a := select new Gson() gson; -- Build a GSON object
select
gson.toJson(new ArrayList<Integer>(Arrays.asList(1, 2, 3))), -- Converts any object into a JSON string
cast(gson.fromJson('["a","b","c"]', List.class) as array<string>) -- Deserialize a JSON string. Note that with this Gson interface, direct deserialization yields List<Object>, so the result is cast to array<string> for later use.
from @a;
Compared with get_json_object, this usage is more convenient. When multiple parts of a JSON string need to be extracted, deserializing the string once into structured data and then reading the parts is also much more efficient.
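As an illustration, a hedged sketch of extracting several fields after a single deserialization (the JSON content and the field names a, b, and c are made up for the example):
set odps.sql.type.system.odps2=true;
@j := select '{"a":"x","b":"y","c":"z"}' str;
@m := select new com.google.gson.Gson().fromJson(str, java.util.Map.class) m from @j; -- Deserialize once.
select m.get('a').toString() a, m.get('b').toString() b, m.get('c').toString() c from @m; -- Read multiple parts without re-parsing the string.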
In addition to GSON, the built-in dependencies of MaxCompute runtime include commons-logging (V1.1.1), commons-lang (V2.5), commons-io (V2.4), and protobuf-java (V2.4.1).
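For example, a minimal sketch that calls the bundled commons-lang (v2.5) directly (in that version, StringUtils lives in the org.apache.commons.lang package):
set odps.sql.type.system.odps2=true;
set odps.sql.session.java.imports=org.apache.commons.lang.*;
select StringUtils.capitalize('hello'), -- Returns 'Hello'.
StringUtils.reverse('abc'); -- Returns 'cba'.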
The built-in array and map types are mapped to java.util.List and java.util.Map, respectively. As a result:
set odps.sql.type.system.odps2=true;
set odps.sql.session.java.imports=java.util.*;
select
size(new ArrayList<Integer>()), -- Call the built-in function size on ArrayList data.
array(1,2,3).size(), -- Call the List method size on the built-in array type.
sort_array(new ArrayList<Integer>()), -- Sort the data in an ArrayList.
al[1], -- Java List objects do not support subscripts, but the built-in array type does.
Objects.toString(a), -- Convert data of the array type to a string with this method.
array(1,2,3).subList(1, 2) -- Get a subList.
from (select new ArrayList<Integer>(array(1,2,3)) as al, array(1,2,3) as a) t;
This mapping also enables some special features, such as deduplicating an array:
select cast (new java.util.ArrayList<Long>(new java.util.HashSet<Long>(array(1L, 2L, 2L))) as array<bigint>); -- Output [1,2]
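Because the built-in map type maps to java.util.Map in the same way, Java Map methods can also be called on built-in map data; a small sketch:
set odps.sql.type.system.odps2=true;
select map('a', 1, 'b', 2).size(), -- Call the java.util.Map method size on a built-in map; returns 2.
map('a', 1, 'b', 2).containsKey('a'); -- Returns true.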
The principle of UDT aggregation is to use the COLLECT_SET or COLLECT_LIST function to aggregate the data of a group into a List, and then apply a scalar UDT method to that List to obtain the aggregate value of the group.
For example, the following example finds the median of BigInteger values (because the data is of the java.math.BigInteger type, the built-in MEDIAN function cannot be used directly):
set odps.sql.session.java.imports=java.math.*;
@test_data := select * from values (1),(2),(3),(5) as t(value);
@a := select collect_list(new BigInteger(value)) values from @test_data; -- Aggregate the data to a list
@b := select sort_array(values) as values, values.size() cnt from @a; -- To obtain the median, first sort the data.
@c := select if(cnt % 2 == 1, new BigDecimal(values[cnt div 2]), new BigDecimal(values[cnt div 2 - 1].add(values[cnt div 2])).divide(new BigDecimal(2))) med from @b;
-- The final result.
select med.toString() from @c;
Because collect_list first gathers all data of a group in one place, partial aggregation is impossible, so this method is less efficient than a built-in aggregator or a UDAF. Use built-in aggregators whenever they can express the logic. Gathering all data of a group in one place also increases the risk of data skew.
However, if the logic of the UDAF itself requires collecting all data together (for example, functionality similar to wm_concat), the preceding method may actually be more efficient than a UDAF (note: not more efficient than the built-in aggregator).
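For example, a hedged sketch of wm_concat-style concatenation built this way (it relies on the bundled commons-lang and on the rule that the built-in array type can be used where java.util.Collection is expected):
set odps.sql.type.system.odps2=true;
set odps.sql.session.java.imports=org.apache.commons.lang.*;
@t := select * from values ('a'), ('b'), ('c') as t(s);
select StringUtils.join(collect_list(s), ',') from @t; -- Returns 'a,b,c'.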
Table-valued functions accept multiple rows and columns as input and can produce multiple rows and columns as output. They can be implemented with UDTs by combining the techniques described above.
The following example shows how to split a json string and output the splitting result in multiple columns:
@a := select '[{"a":"1","b":"2"},{"a":"1","b":"2"}]' str; -- Sample data
@b := select new com.google.gson.Gson().fromJson(str, java.util.List.class) l from @a; -- Deserialize the JSON string.
@c := select cast(e as java.util.Map<Object,Object>) m from @b lateral view explode(l) t as e; -- Use explode to expand the list into multiple rows.
@d := select m.get('a') as a, m.get('b') as b from @c; -- Output the result in multiple columns.
select a.toString() a, b.toString() b from @d; -- The final output. Columns a and b in variable @d are of the Object type.
In UDFs, resource files can be read through an ExecutionContext object. UDTs provide the same capability: the com.aliyun.odps.udt.UDTExecutionContext.get() method returns such an ExecutionContext object.
The following example reads the resource file 1.txt into a String object and outputs it:
set odps.sql.session.resources=1.txt;
select new String(com.aliyun.odps.udt.UDTExecutionContext.get().readResourceFile('1.txt')) text;
By default, UDT objects cannot be stored on disk. However, there are ways to persist them: the basic idea is either to serialize the object into BINARY or STRING for persistence, or to expand the object into key data that can be converted into built-in types.
The UDT is defined as follows:
public class Shape {
    public List<Point> points;

    public Shape(List<Point> points) {
        this.points = points;
    }
}

public class Point {
    public int x;
    public int y;

    public Point(int x, int y) {
        this.x = x;
        this.y = y;
    }
}
To expand an object into a built-in type:
@data := select key, shape from ...;
@exploded := select key, point from @data lateral view explode(shape.points) t as point;
@expanded := select key, point.x, point.y from @exploded;
insert into table points select * from @expanded;
Reconstruct when needed:
select key, new Shape(collect_list(new Point(x, y))) as shape from points group by key;
Or serialize the object into BINARY.
The biggest problem with flat expansion is the serialization and deserialization work involved. Alternatively, the object can be converted directly into BINARY. For example, extend the Shape class as follows:
// Add the following methods to the Shape class.
public com.aliyun.odps.data.Binary serialize() {
    ByteBuffer buffer = ByteBuffer.allocate(points.size() * 8 + 4);
    buffer.putInt(points.size());
    for (Point point : points) {
        buffer.putInt(point.x);
        buffer.putInt(point.y);
    }
    return new com.aliyun.odps.data.Binary(buffer.array());
}

public static Shape deserialize(com.aliyun.odps.data.Binary bytes) {
    ByteBuffer buffer = ByteBuffer.wrap(bytes.data());
    int size = buffer.getInt();
    List<Point> points = new ArrayList<>(size);
    for (int i = 0; i < size; i++) {
        points.add(new Point(buffer.getInt(), buffer.getInt()));
    }
    return new Shape(points);
}
-- Call serialize() when persistence is required.
select key, shape.serialize() data from ...
-- Call the deserialize method when you need to read data.
select key, Shape.deserialize(data) as Shape from ...
It may be more convenient to use an existing serialization framework directly. For example, define Shape with Protocol Buffers:
// The definition of Shape
message Point
{
required int32 x = 1;
required int32 y = 2;
}
message Shape
{
repeated Point points = 1;
}
Then, the protobuf methods can be called directly in SQL:
select key, new com.aliyun.odps.data.Binary(shape.toByteArray()) from ...
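Reading the data back should work symmetrically; a hedged sketch, assuming the jar containing the generated protobuf classes is referenced through odps.sql.session.resources (parseFrom is the standard deserialization method that protobuf-java generates):
select key, Shape.parseFrom(data.data()) shape from ...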
This feature can be used together with MaxCompute Studio to maximize its value.
When it comes to features, UDTs offer several advantages.

In terms of performance, the execution process of UDTs is very similar to that of UDFs, with comparable performance. Additional optimizations have been implemented for UDTs, resulting in slightly better performance in certain scenarios. For example, values[x].add(values[y]).divide(java.math.BigInteger.valueOf(2)) looks like multiple UDT method calls, but in fact there is only one call at runtime: UDTs focus on fine-grained data processing, so chains of method calls like this do not incur the overhead of multiple function calls.

In terms of security control, UDTs are subject to the same sandbox policy as UDFs. Restricted operations can be performed by enabling sandbox isolation or by applying to join a sandbox whitelist.
This article introduces the features of UDTs, which allow the direct inclusion of Java expressions in SQL and the referencing of JDK classes. This feature greatly extends the functionality of SQL.
There are still several areas for improvement in UDTs, as mentioned throughout this article.