×
Community Blog MaxCompute Unleashed - Part 2: Basic Data Types and Built-in Functions

MaxCompute Unleashed - Part 2: Basic Data Types and Built-in Functions

Part 2 of the “Unleash the Power of MaxCompute” series describes the basic data types and built-in functions of MaxCompute.

By Zhenyu

MaxCompute (formerly known as ODPS) is a leading distributed big data processing platform developed by Alibaba Cloud. It is widely utilized, especially within the Alibaba Group, and supports the core businesses of multiple business units (BUs). ODPS V2.0, continuously optimizing performance, aims to enhance the user experience and expression capabilities of SQL, as well as improve the productivity of ODPS developers.

Building upon the SQL engine of ODPS V2.0, MaxCompute simplifies SQL compilation and enhances the language's expressiveness. We present a series of articles titled Unleash the Power of MaxCompute to explore the capabilities of MaxCompute (ODPS V2.0).

In the previous article, I introduced the usability improvements of the MaxCompute compiler. In this article, I will introduce the new basic data types and built-in functions.

The original ODPS had only six basic data types: BIGINT, DOUBLE, DECIMAL, STRING, DATETIME, and BOOLEAN. These types are generally sufficient, but may not be enough in certain scenarios.

  • Scenario 1

A project needed to migrate an ETL system that was originally running on SQL SERVER to MaxCompute due to a significant increase in data volume. It was discovered that some tables used VARCHAR while others used INT. These types were also used in multiple SQL scripts in the system and were involved in operations. After migrating to ODPS, VARCHAR was replaced with STRING and INT was replaced with BIGINT. [1]

After the migration, it was found that the data did not match the original system. Was this due to the truncation of VARCHAR and the overflow of INT? Or were there other reasons? Faced with the existing system, there was no choice but to check the code, run the data, and analyze it step by step. What was thought to be a relatively easy project ended up taking a couple of weeks to resolve. . .

  • Scenario 2

My project required storing binary data in a table for a speech recognition project. If each small segment of collected audio was stored as a field, it would be convenient to process it with a UDF. However, ODPS does not have a BINARY type. Okay, I decided to save it as STRING. However, writing the UDF became quite troublesome. In order to save the data, the byte[] had to be encoded into a STRING and decoded when reading it. A significant amount of code had to be written, resulting in slower execution speed. . .

MaxCompute uses an SQL engine based on ODPS V2.0, which significantly expands the basic data types and provides supporting built-in functions to address the aforementioned problems.

The Expansion of Basic Types

This article uses MaxCompute Studio as an example. First, please install MaxCompute Studio, connect to a MaxCompute project, and create a new MaxCompute script file, as follows.

1

After running, create another file to insert data, as follows.

2

After running, you can find the newly created table in the Project Explorer of MaxCompute Studio. View the details of the table and preview the data, as shown in the following figure.

3

The results can be seen here.

• When you create a table, you must first specify the MaxCompute new type system. For compatibility reasons, you must enable this setting. It can also be specified by default in MaxCompute Studio, as shown in the following figure.

4

If you do not use MaxCompute Studio, you can specify set odps.sql.type.system.odps2=true; in the script. Studio actually uses this switch in the backend to control whether new types are enabled. When the odps.sql.type.system.odps2 is set to true, in addition to using the new type, it also controls some behavior changes in other aspects. It will be explained in the relevant section.

If you want to enable the feature in a MaxCompute project by default, contact your project administrator and configure the feature in the project template.

• The following table lists the basic data types supported by MaxCompute after the expansion. The newly added data types include TINYINT, SMALLINT, INT, FLOAT, VARCHAR, TIMESTAMP, and BINARY.

Type Newly added or not Constant definition Description
TINYINT Yes 1Y, -127Y 8-bit signed integer, ranging from -128 to 127.
SMALLINT Yes 32767S, -100S 16-bit signed integer, ranging from -32,768 to 32,767.
INT Yes 1000, -15645787 [1] 32-bit signed integer, ranging from -2^31 to 2^31 - 1
BIGINT No 100000000000L, -1L 64-bit signed integer, ranging from -2^63 + 1 to 2^63 - 1
FLOAT Yes No 32-bit binary floating point type
DOUBLE No 3.1415926 1E+7 64-bit binary floating point type
DECIMAL No 3.5BD, 99999999999.9999999BD The precise numeric type based on the decimal system.
precision: indicates the number of digits in a value. Valid values: 1 to 38.
scale: indicates the number of digits to the right of the decimal point in a value. Valid values: 0 to 18.
VARCHAR Yes None [2] n is the length. Range from 1 to 65535.
STRING No "abc", 'bcd', "alibaba", and 'inc' [3] The string type. The maximum length is 8 MB.
BINARY Yes No A binary number. The maximum length is 8 MB.
DATETIME No DATETIME '2017-11-11 00:00:00' The DATETIME type.
Valid values: 0001-01-01 00:00:00.000 to 9999-12-31 23:59:59.999, accurate to the millisecond.
TIMESTAMP Yes TIMESTAMP '2017-11-11 00:00:00.123456789' The TIMESTAMP type.
Valid values: 0001-01-01 00:00:00.000000000 to 9999-12-31 23:59:59.999999999, accurate to the nanosecond. [4]
BOOLEAN No TRUE,FALSE Valid values: TRUE or FALSE

The following table describes the new implicit conversion rules. [5]

boolean tinyint smallint int bigint float double decimal string varchar timestamp binary
boolean to TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
tinyint to FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE
smallint to FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE
int to FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE
bigint to FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE
float to FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE
double to FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE
decimal to FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE
string to FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE
varchar to FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE
timestamp to FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE
binary to FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE

In addition, the definition of constants of the DECIMAL type and the DATETIME type is added. 100BD is DECIMAL with a value of 100, and datetime '2017-11-11 00:00:00 'is a constant of the DATETIME type. The convenience of constant definition is that it can be directly used in the values clause and the values table, which will be described separately later.

The Expansion of Built-in Functions

For any programming language, including SQL, no matter how powerful the language itself is, if there are no rich functions supported by the latter class library, it will still be very inconvenient when it is applied. MaxCompute with new data types greatly enriches the built-in functions, as follows.

• Mathematical functions

log2, log10, bin, hex, unhex, degrees, radians, sign, e, pi, factorial, cbrt, shiftleft, shiftright, shiftrightunsigned

• Date functions

unix_timestamp, year, quarter, month, day, dayofmonth, hour, minute, second, millisecond, nanosecond, from_utc_timestamp, current_timestamp, add_months, last_day, next_day, months_between

• String functions

concat_ws, lpad, rpad, replace, soundex, substring_index, base64, unbase64

• Aggregate functions

corr

Most of these functions are compatible with the built-in functions of Hive. For more information, see The document of Hive. Different from Hive, these functions provided by MaxCompute are efficient versions implemented by using local code.

The new TIMESTAMP type supports nanosecond-level precision. To be compatible with it, MaxCompute adds specific millisecond and nanosecond functions to retrieve milliseconds of TIMESTAMP and DATETIME and nanoseconds of TIMESTAMP.

As mentioned in the previous article in this series, MaxCompute supports new coercion writing methods. For example, if you want to force a BIGINT variable to be converted to STRING, you can directly write string(a_bigint), which is equivalent to cast(a_bigint as string). Which form to use depends entirely on your preference.

It should be noted that all functions used to support new types, such as current_timestamp, also need set odps.sql.type.system.odps2=true;. Otherwise, compilation errors will be reported.

The Expansion of Partition Types

The support for partition types has also been expanded. Currently, partition types support TINYINT, SMALLINT, INT, BIGINT, VARCHAR, and STRING [6].

In addition, when the original ODPS is used for dynamic partitioning, if the type of partition column is not strictly consistent with the type of the column in the relevant SELECT list, an error will be reported. MaxCompute supports the implicit conversions of data types. For example,

set odps.sql.type.system.odps2=true;
create table parttable(a int, b double) partitioned by (p string);
insert into parttable partition(p) (p, a) select key, value, current_timestmap() from src;
select * from parttable;

Return

a b p
0 NULL 2017-01-23 22:30:47.130406621
0 NULL 2017-01-23 22:30:47.130406621

You can see that the value of the partition column p is implicitly converted from the TIMESTAMP type.

Use a UDF

Currently, Java UDFs of MaxCompute V2.0 support new types. Python UDFs will be implemented as soon as possible. Java UDFs use the new type as follows.

  1. UDAFs and UDTFs use the @Resolve annotation to obtain signatures. MaxCompute V2.0 supports using new types in annotations, such as @Resolve("smallint->varchar(10)")
  2. The UDF obtains the signature through the reflection analysis of evaluate. In this case, the built-in type of max compute and the JAVA type meet the one-to-one mapping relationship.
max compute type java type
tinyint java.lang.Byte
smallint java.lang.Short
int java.lang.Integer
bigint java.lang.Long
float java.lang.Float
double java.lang.Double
decimal java.math.BigDecimal
boolean java.lang.Boolean
string java.lang.String
varchar com.aliyun.odps.data.Varchar
binary com.aliyun.odps.data.Binary
datetime java.util.Date
timestamp java.sql.Timestamp
array java.util.List
map java.util.Map
struct com.aliyun.odps.data.Struct

It should be noted that the java type corresponding to the ARRAY type is List.

Summary

MaxCompute significantly enhances basic data types and built-in functions to cater to diverse application scenarios, yet relying solely on basic types can still be cumbersome in complex situations. Please look forward to the next article.

Annotations

● [1]:

  1. If an INT constant exceeds the value range of INT, it will be converted into BIGINT. If it exceeds the value range of BIGINT, it will be converted into DOUBLE.
  2. In the original ODPS, due to historical reasons, all INT types in SQL scripts are converted to BIGINT, for example,
create table a_bigint_table(a int); -- INT here is actually treated as BIGINT.
select cast(id as int) from mytable; -- INT here is actually treated as BIGINT

To be compatible with the original ODPS mode, MaxCompute retains this conversion even if the odps.sql.type.system.odps2 is not set to true. But a warning is reported, indicating that INT is treated as BIGINT. If the case occurs in your script, we recommend that you rewrite all your scripts to BIGINT to avoid confusion.

● [2]:

The VARCHAR constant can be represented by implicit conversion of the STRING constant.

● [3]:

The STRING constant supports the connection. For example, 'abc' 'xyz' will be parsed as 'abcxyz'. Different parts can be written on different lines.

● [4]:

Due to the limitations of the underlying system, the current call current_timestamp does not reach nanosecond precision. For example,

meta_dev>set odps.sql.type.system.odps2=true;select nanosecond(current_timestamp());

The output is similar to

+------+
| _c0  |
+------+
| 877000000 |
+------+

The TIMASTAMP constant and external data imports can support nanosecond precision.

● [5]:

In the original ODPS, due to historical reasons, DOUBLE can be implicitly converted to BIGINT. This conversion may cause data loss, which is not allowed by general database systems. To be compatible with the original ODPS mode, MaxCompute still allows this conversion even if the odps.sql.type.system.odps2 is not set to true. But a warning is reported. If the odps.sql.type.system.odps2 is set to true, this implicit type conversion is not allowed.

● [6]:

In the original ODPS, due to historical reasons, although the partition type can be specified as BIGINT, all other cases are processed as STRING except that the schema of the table indicates that it is BIGINT. For example,

create table parttest (a bigint) partitioned by (pt bigint);
insert into parttest partition(pt) select 1, 2 from dual;
insert into parttest partition(pt) select 1, 10 from dual;
select * from parttest where pt >= 2;

The returned result has only one line because 10 is compared by string and 2. It can not go back. To be compatible with the original ODPS mode, MaxCompute still treats it in this way even if the odps.sql.type.system.odps2 is not set to true. If odps.sql.type.system.odps2 is set to true, partitions of the BIGINT type are processed in strict accordance with the BIGINT type.

0 1 0
Share on

Alibaba Cloud MaxCompute

137 posts | 19 followers

You may also like

Comments

Alibaba Cloud MaxCompute

137 posts | 19 followers

Related Products