All Products
Search
Document Center

MaxCompute:Overview

最終更新日:Jul 10, 2024

MaxCompute introduces user-defined types (UDTs) based on the new-generation SQL engine. UDTs allow you to reference classes or objects of third-party programming languages in SQL statements to call methods or obtain data.

Introduction

UDTs supported by many SQL engines are similar to the STRUCT type in MaxCompute. UDTs supported by MaxCompute are similar to the CREATE TYPE statement. A UDT contains both fields and methods. You do not need to use DDL statements to define new data types in MaxCompute. Instead, MaxCompute allows you to reference new data types directly in SQL statements. The following examples show how to use UDTs.

For example, to call the java.lang package in SQL statements, you can use one of the following methods:

  • Use UDTs to call java.lang

    -- Enable new data types. The following example uses a new type of INTEGER (INT). 
    set odps.sql.type.system.odps2=true;    
    SELECT java.lang.Integer.MAX_VALUE;

    Similar to Java, the java.lang package can be omitted. Therefore, the preceding statement is equivalent to the following statement:

    set odps.sql.type.system.odps2=true;
    SELECT Integer.MAX_VALUE;

    The following result is returned:

    +-----------+
    | max_value |
    +-----------+
    | 2147483647 |
    +-----------+
  • Use user-defined functions (UDFs) to call java.lang

    1. Develop code. The following code defines a UDF class.

      package com.aliyun.odps.test;
      public class IntegerMaxValue extends com.aliyun.odps.udf.UDF {
        public Integer evaluate() {
           return Integer.MAX_VALUE;
        } 
      }
    2. Compile the UDF into a JAR package, upload the package, and create a function.

      add jar odps-test.jar;
      create function integer_max_value as 'com.aliyun.odps.test.IntegerMaxValue' using 'odps-test.jar';
    3. Call the function in the SQL statement.

      select integer_max_value();

    In this example, UDTs simplify the procedure for you to use other programming languages to extend SQL features.

Scenarios

UDTs are suitable for the following scenarios:

  • You want to use some features that are not provided by MaxCompute but can be implemented in other programming languages.

    For example, to implement some features, you need only to call built-in Java classes once. However, MaxCompute does not provide methods to implement these features. If you use UDFs to run these tasks, the procedure is complex.

  • You want to call a third-party library in SQL statements to implement the related features. In this scenario, UDFs allow you to directly use a function provided by a third-party library in a SQL statement, instead of wrapping the function inside a UDF.

  • You want to directly call the source code of a third-party programming language in SQL statements. SELECT TRANSFORM allows you to include scripts in SQL statements to make these SQL statements easier to read and maintain. For some programming languages, such as Java, the source code can be executed only after it is compiled. You can use UDTs to reference objects and classes of these languages in SQL statements.

Limits

You cannot use UDFs, UDAFs, or UDTs to read data from the following types of tables:

  • Table on which schema evolution is performed

  • Table that contains complex data types

  • Table that contains JSON data types

  • Transactional table

Implementation

The following example shows how to run a UDT.

-- Sample data. 
@table1 := select * from values ('100000000000000000000') as t(x);
@table2 := select * from values (100L) as t(y);
-- Code logic. 
-- Create an object by using the new method. 
@a := select new java.math.BigInteger(x) x from @table1; 
-- Call a static method.          
@b := select java.math.BigInteger.valueOf(y) y from @table2;  
-- Call an instance method.     
select /*+mapjoin(b)*/ x.add(y).toString() from @a a join @b b;  

-- The following result is displayed: 
100000000000000000100

The following figure shows the process.

This UDT has three stages: M1, R2, and J3. If a JOIN operation is used in MapReduce, data must be reshuffled. As a result, data is processed at multiple stages. The processes and physical machines that process data vary at different stages.

Only the new java.math.BigInteger(x) method is called at the M1 stage.

The java.math.BigInteger.valueOf(y) and x.add(y).toString() methods are separately called at the J3 stage. These methods are called at different stages and executed in different processes and on different physical machines. The UDT encapsulates these stages to achieve an effect like all the stages are implemented on the same Java Virtual Machine (JVM).

The preceding example shows that the result of a subquery supports UDT columns. The x column retrieved by variable a is of the java.math.BigInteger type rather than a built-in type. You can transfer the UDT data to another operator and then call its method. You can also use the UDT data in a data shuffle.

Features

  • UDTs support only Java. By default, all classes of SDK for Java can be referenced by UDTs.

    Note

    The JDK runtime environment is JDK 1.8. A version later than JDK 1.8 may not be supported.

  • UDTs also allow you to upload JAR packages and directly reference these packages. Specific flags are provided for UDTs.

    • set odps.sql.session.resources: specifies the resource that you want to reference. You can specify multiple resources and separate them with commas (,). Example: set odps.sql.session.resources=foo.sh,bar.txt;.

      Note

      This flag works the same as the flag used to specify resources in the SELECT TRANSFORM statement. Therefore, this flag controls two features. For example, you can use a UDT to reference the UDF JAR package that we have mentioned in Overview.

      set odps.sql.type.system.odps2=true;
      set odps.sql.session.resources=odps-test.jar; 
      -- Specify the JAR package that you want to reference. This package must be uploaded to the required project. 
      select new com.aliyun.odps.test.IntegerMaxValue().evaluate();
    • odps.sql.session.java.imports: specifies the default Java package. You can specify multiple packages and separate them with commas (,). This flag is similar to the IMPORT statement in Java. You can specify a classpath, such as java.math.BigInteger, or use *. Static import is not supported.

      For example, you can use a UDT to reference the UDF JAR package that we have mentioned in Overview.

      set odps.sql.type.system.odps2=true;
      set odps.sql.session.resources=odps-test.jar;
      set odps.sql.session.java.imports=com.aliyun.odps.test.*;  
      -- Specify the default JAR package. 
      select new IntegerMaxValue().evaluate();
  • UDTs support resource access. In MaxCompute SQL, you can call the static method com.aliyun.odps.udf.impl.UDTExecutionContext.get() to obtain the ExecutionContext object. Then, you can use this object to access the current ExecutionContext class and then access resources, such as files and tables.

  • UDTs support the following operations:

    • Create objects by using the new method.

    • Create arrays by using the new method. Initializer lists can be used. Example: new Integer[] { 1, 2, 3 }.

    • Call methods, including static methods.

    • Access fields, including static fields.

    Note
    • Only public methods and public fields are supported.

    • The identifiers in UDTs contain the names of packages, classes, methods, and fields. All identifiers are case-sensitive.

    • Anonymous classes and lambda expressions are not supported.

    • UDTs are used in expressions. Functions that do not return values cannot be called in expressions. This issue will be resolved in later versions.

  • UDTs support the following data types:

    • UDTs support SQL type conversions, such as cast(1 as java.lang.Object). UDTs do not support Java type conversions, such as (Object)1.

    • Java data types are mapped to built-in data types. The mapping can be applied to UDTs. For more information, see the data type mapping table in Java UDFs.

      • You can directly call the method of the Java type to which the built-in type is mapped. Example: '123'.length() , 1L.hashCode().

      • UDTs can be used in built-in functions and UDFs. For example, in chr(Long.valueOf('100')), Long.valueOf returns a value of the java.lang.Long type. The CHR built-in function supports the built-in BIGINT type.

      • The data of a Java primitive type is automatically converted into the boxing type and the preceding two rules apply.

      Note

      For specific new built-in data types, you must use set odps.sql.type.system.odps2=true; to declare these types. Otherwise, an error occurs.

    • The following type conversion rules apply in UDTs:

      • UDT objects can be implicitly converted into the objects of their base classes.

      • UDT objects can be forcibly converted into the objects of their base classes or subclasses.

      • The data type conversion between two objects without inheritance follows the original conversion rules. However, such conversions may result in changes to the data. For example, data of the java.lang.Long type can be forcibly converted into the java.lang.Integer type. This conversion uses the rules that are used to convert the built-in BIGINT type into the INT type. This process may result in changes to the data and even loss of data precision.

      Note

      UDT objects cannot be saved to disks. This means that UDT objects cannot be inserted into tables because DDL statements do not support UDTs. However, if the data type can be implicitly converted into one of the built-in types, you can create tables that contain UDT objects. BINARY is a built-in type and supports automatic serialization. You can save the byte[] arrays to disks. The saved byte[] arrays can be deserialized to the BINARY type. To save UDTs, you must call serialization and deserialization methods to convert the data type into BINARY.

      The output cannot be a UDT. However, you can call the toString() method to convert the data type into the java.lang.String type because the toString() method supports all Java classes. You can use this method to check UDT data during debugging.

      You can also add the set odps.sql.udt.display.tostring=true; flag to enable MaxCompute to convert all output UDT data into strings by using the java.util.Objects.toString(...) method. This facilitates debugging. This flag is typically used for debugging because it can be applied only to PRINT statements. It cannot be applied to INSERT statements.

    • UDTs support Java generics. For example, the compiler can determine that the value returned by java.util.Arrays.asList(new java.math.BigInteger('1')) is of the java.util.List<java.math.BigInteger> type based on the parameter type.

      Note

      You must specify the type parameter in a constructor function or use java.lang.Object. This is the same as Java.

      For example, the result of new java.util.ArrayList(java.util.Arrays.asList('1', '2')) is of the java.util.ArrayList<Object> type. The result of new java.util.ArrayList<String>(java.util.Arrays.asList('1', '2')) is of the java.util.ArrayList<String> type.

  • All operators use the semantics of MaxCompute SQL instead of UDTs.

    • Combination of strings: The result of String.valueOf(1) + String.valueOf(2) is 3. The two strings are implicitly converted into DOUBLE-type values and summed. If you use Java string concatenation to combine the strings, the result is 12.

    • = operations: The = operator in SQL statements is used as a comparison operator. It is used to compare one expression with another. You must call the equals method in Java to check whether two objects are equivalent. The = operator cannot be used to verify the equivalence of two objects.

  • UDTs do not have a clear definition of object equality. This is caused by data reshuffling. Objects may be transmitted between different processes or physical machines. During object transmission, an object may be referenced as two different objects. For example, an object may be shuffled to two machines and then reshuffled. Therefore, when you use UDTs, you must use the equals method instead of the = operator to verify the equivalence of two objects.

    Objects in the same row or column are correlated in some way. However, a correlation between objects in different rows or columns cannot be ensured.

  • UDTs cannot be used as shuffle keys in clauses, such as JOIN, GROUP BY, DISTRIBUTE BY, SORT BY, ORDER BY, or CLUSTER BY.

    UDTs can be used at the stages of expressions, but cannot be used as outputs. For example, you cannot call the group by new java.math.BigInteger('123') method. However, you can call the group by new java.math.BigInteger('123').hashCode() method. This is because the value returned by hashCode is of the int.class type, which can be used as the built-in INT type. This applies the rules of both the built-in types and specific Java types.

  • You can use UDTs to implement the feature provided by the SCALAR function. You can use the built-in functions COLLECT_SET and Other functions with UDTs to implement the features provided by aggregate and table-valued functions.

Benefits

UDTs provide the following benefits:

  • UDTs are easy to use. You do not need to define functions.

  • UDTs support all JDK features. This improves SQL flexibility.

  • UDT code can be stored in the same file as the SQL code. This facilitates code management.

  • You can directly reference the libraries of other programming languages and reuse code that you have written in other languages.

  • You can create object-oriented features.

Features to be improved:

  • Call functions that do not return values and functions that directly use transferred data. For functions that directly use transferred data, their return values are ignored. For example, if you call the add method provided by the List interface, this method returns the list that you have transferred.

  • Use anonymous classes and lambda expressions.

  • Use UDTs as shuffle keys.

  • Support more programming languages, such as Python.

Performance

UDTs run in a similar way to UDFs. Therefore, the performance of UDTs is almost the same as that of UDFs. The optimized computing engine improves the performance of UDTs in specific scenarios.

  • If a UDT object is used in different processes, it must be serialized and deserialized. If you use UDTs to perform operations that do not require data reshuffling, such as JOIN or AGGREGATE, the overheads of serialization and deserialization are avoided.

  • The runtime of UDTs is based on Codegen rather than reflection. Therefore, no performance loss occurs. Multiple UDTs can be executed in a single function call. In the preceding example, values[x].add(values[y]).divide(java.math.BigInteger.valueOf(2)) is called only once. Therefore, no additional interface overheads are caused even though the operational units of UDTs are small.

Security

Similar to UDFs, UDTs are limited by the Java sandbox model. To perform restricted operations, you must cancel sandbox isolation for the operations or apply to join a sandbox whitelist.