By Zhenyu
MaxCompute (formerly known as ODPS) is a leading distributed big data processing platform developed by Alibaba Cloud. It is widely utilized, especially within the Alibaba Group, and supports the core businesses of multiple business units (BUs). ODPS V2.0, continuously optimizing performance, aims to enhance the user experience and expression capabilities of SQL, as well as improve the productivity of ODPS developers.
Building upon the SQL engine of ODPS V2.0, MaxCompute simplifies SQL compilation and enhances the language's expressiveness. We present a series of articles titled Unleash the Power of MaxCompute to explore the capabilities of MaxCompute (ODPS V2.0).
In the previous article, I introduced the usability improvements of the MaxCompute compiler. In this article, I will introduce the new basic data types and built-in functions.
The original ODPS had only six basic data types: BIGINT, DOUBLE, DECIMAL, STRING, DATETIME, and BOOLEAN. These types are generally sufficient, but may not be enough in certain scenarios.
A project needed to migrate an ETL system that was originally running on SQL SERVER to MaxCompute due to a significant increase in data volume. It was discovered that some tables used VARCHAR while others used INT. These types were also used in multiple SQL scripts in the system and were involved in operations. After migrating to ODPS, VARCHAR was replaced with STRING and INT was replaced with BIGINT. [1]
After the migration, it was found that the data did not match the original system. Was this due to the truncation of VARCHAR and the overflow of INT? Or were there other reasons? Faced with the existing system, there was no choice but to check the code, run the data, and analyze it step by step. What was thought to be a relatively easy project ended up taking a couple of weeks to resolve. . .
My project required storing binary data in a table for a speech recognition project. If each small segment of collected audio was stored as a field, it would be convenient to process it with a UDF. However, ODPS does not have a BINARY type. Okay, I decided to save it as STRING. However, writing the UDF became quite troublesome. In order to save the data, the byte[] had to be encoded into a STRING and decoded when reading it. A significant amount of code had to be written, resulting in slower execution speed. . .
MaxCompute uses an SQL engine based on ODPS V2.0, which significantly expands the basic data types and provides supporting built-in functions to address the aforementioned problems.
This article uses MaxCompute Studio as an example. First, please install MaxCompute Studio, connect to a MaxCompute project, and create a new MaxCompute script file, as follows.
After running, create another file to insert data, as follows.
After running, you can find the newly created table in the Project Explorer of MaxCompute Studio. View the details of the table and preview the data, as shown in the following figure.
The results can be seen here.
• When you create a table, you must first specify the MaxCompute new type system. For compatibility reasons, you must enable this setting. It can also be specified by default in MaxCompute Studio, as shown in the following figure.
If you do not use MaxCompute Studio, you can specify set odps.sql.type.system.odps2=true;
in the script. Studio actually uses this switch in the backend to control whether new types are enabled. When the odps.sql.type.system.odps2
is set to true, in addition to using the new type, it also controls some behavior changes in other aspects. It will be explained in the relevant section.
If you want to enable the feature in a MaxCompute project by default, contact your project administrator and configure the feature in the project template.
• The following table lists the basic data types supported by MaxCompute after the expansion. The newly added data types include TINYINT, SMALLINT, INT, FLOAT, VARCHAR, TIMESTAMP, and BINARY.
Type | Newly added or not | Constant definition | Description |
TINYINT | Yes | 1Y, -127Y | 8-bit signed integer, ranging from -128 to 127. |
SMALLINT | Yes | 32767S, -100S | 16-bit signed integer, ranging from -32,768 to 32,767. |
INT | Yes | 1000, -15645787 [1] | 32-bit signed integer, ranging from -2^31 to 2^31 - 1 |
BIGINT | No | 100000000000L, -1L | 64-bit signed integer, ranging from -2^63 + 1 to 2^63 - 1 |
FLOAT | Yes | No | 32-bit binary floating point type |
DOUBLE | No | 3.1415926 1E+7 | 64-bit binary floating point type |
DECIMAL | No | 3.5BD, 99999999999.9999999BD | The precise numeric type based on the decimal system. • precision: indicates the number of digits in a value. Valid values: 1 to 38. • scale: indicates the number of digits to the right of the decimal point in a value. Valid values: 0 to 18. |
VARCHAR | Yes | None [2] | n is the length. Range from 1 to 65535. |
STRING | No | "abc", 'bcd', "alibaba", and 'inc' [3] | The string type. The maximum length is 8 MB. |
BINARY | Yes | No | A binary number. The maximum length is 8 MB. |
DATETIME | No | DATETIME '2017-11-11 00:00:00' | The DATETIME type. Valid values: 0001-01-01 00:00:00.000 to 9999-12-31 23:59:59.999, accurate to the millisecond. |
TIMESTAMP | Yes | TIMESTAMP '2017-11-11 00:00:00.123456789' | The TIMESTAMP type. Valid values: 0001-01-01 00:00:00.000000000 to 9999-12-31 23:59:59.999999999, accurate to the nanosecond. [4] |
BOOLEAN | No | TRUE,FALSE | Valid values: TRUE or FALSE |
The following table describes the new implicit conversion rules. [5]
boolean | tinyint | smallint | int | bigint | float | double | decimal | string | varchar | timestamp | binary | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
boolean to | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
tinyint to | FALSE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | FALSE | FALSE |
smallint to | FALSE | FALSE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | FALSE | FALSE |
int to | FALSE | FALSE | FALSE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | FALSE | FALSE |
bigint to | FALSE | FALSE | FALSE | FALSE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | FALSE | FALSE |
float to | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE | TRUE | TRUE | TRUE | TRUE | FALSE | FALSE |
double to | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE | TRUE | TRUE | TRUE | FALSE | FALSE |
decimal to | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE | TRUE | TRUE | FALSE | FALSE |
string to | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE | TRUE | TRUE | TRUE | FALSE | FALSE |
varchar to | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE | TRUE | TRUE | TRUE | FALSE | FALSE |
timestamp to | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE | TRUE | TRUE | FALSE |
binary to | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE |
In addition, the definition of constants of the DECIMAL type and the DATETIME type is added. 100BD is DECIMAL with a value of 100, and datetime '2017-11-11 00:00:00 'is a constant of the DATETIME type. The convenience of constant definition is that it can be directly used in the values clause and the values table, which will be described separately later.
For any programming language, including SQL, no matter how powerful the language itself is, if there are no rich functions supported by the latter class library, it will still be very inconvenient when it is applied. MaxCompute with new data types greatly enriches the built-in functions, as follows.
• Mathematical functions
log2, log10, bin, hex, unhex, degrees, radians, sign, e, pi, factorial, cbrt, shiftleft, shiftright, shiftrightunsigned
• Date functions
unix_timestamp, year, quarter, month, day, dayofmonth, hour, minute, second, millisecond, nanosecond, from_utc_timestamp, current_timestamp, add_months, last_day, next_day, months_between
• String functions
concat_ws, lpad, rpad, replace, soundex, substring_index, base64, unbase64
• Aggregate functions
corr
Most of these functions are compatible with the built-in functions of Hive. For more information, see The document of Hive. Different from Hive, these functions provided by MaxCompute are efficient versions implemented by using local code.
The new TIMESTAMP type supports nanosecond-level precision. To be compatible with it, MaxCompute adds specific millisecond
and nanosecond
functions to retrieve milliseconds of TIMESTAMP
and DATETIME
and nanoseconds of TIMESTAMP.
As mentioned in the previous article in this series, MaxCompute supports new coercion writing methods. For example, if you want to force a BIGINT variable to be converted to STRING, you can directly write string(a_bigint)
, which is equivalent to cast(a_bigint as string)
. Which form to use depends entirely on your preference.
It should be noted that all functions used to support new types, such as current_timestamp
, also need set odps.sql.type.system.odps2=true;
. Otherwise, compilation errors will be reported.
The support for partition types has also been expanded. Currently, partition types support TINYINT, SMALLINT, INT, BIGINT, VARCHAR, and STRING [6].
In addition, when the original ODPS is used for dynamic partitioning, if the type of partition column is not strictly consistent with the type of the column in the relevant SELECT list, an error will be reported. MaxCompute supports the implicit conversions of data types. For example,
set odps.sql.type.system.odps2=true;
create table parttable(a int, b double) partitioned by (p string);
insert into parttable partition(p) (p, a) select key, value, current_timestmap() from src;
select * from parttable;
Return
a | b | p |
0 | NULL | 2017-01-23 22:30:47.130406621 |
0 | NULL | 2017-01-23 22:30:47.130406621 |
You can see that the value of the partition column p is implicitly converted from the TIMESTAMP type.
Currently, Java UDFs of MaxCompute V2.0 support new types. Python UDFs will be implemented as soon as possible. Java UDFs use the new type as follows.
max compute type | java type |
tinyint | java.lang.Byte |
smallint | java.lang.Short |
int | java.lang.Integer |
bigint | java.lang.Long |
float | java.lang.Float |
double | java.lang.Double |
decimal | java.math.BigDecimal |
boolean | java.lang.Boolean |
string | java.lang.String |
varchar | com.aliyun.odps.data.Varchar |
binary | com.aliyun.odps.data.Binary |
datetime | java.util.Date |
timestamp | java.sql.Timestamp |
array | java.util.List |
map | java.util.Map |
struct | com.aliyun.odps.data.Struct |
It should be noted that the java type corresponding to the ARRAY type is List.
MaxCompute significantly enhances basic data types and built-in functions to cater to diverse application scenarios, yet relying solely on basic types can still be cumbersome in complex situations. Please look forward to the next article.
● [1]:
create table a_bigint_table(a int); -- INT here is actually treated as BIGINT.
select cast(id as int) from mytable; -- INT here is actually treated as BIGINT
To be compatible with the original ODPS mode, MaxCompute retains this conversion even if the odps.sql.type.system.odps2 is not set to true. But a warning is reported, indicating that INT is treated as BIGINT. If the case occurs in your script, we recommend that you rewrite all your scripts to BIGINT to avoid confusion.
● [2]:
The VARCHAR constant can be represented by implicit conversion of the STRING constant.
● [3]:
The STRING constant supports the connection. For example, 'abc' 'xyz' will be parsed as 'abcxyz'. Different parts can be written on different lines.
● [4]:
Due to the limitations of the underlying system, the current call current_timestamp does not reach nanosecond precision. For example,
meta_dev>set odps.sql.type.system.odps2=true;select nanosecond(current_timestamp());
The output is similar to
+------+
| _c0 |
+------+
| 877000000 |
+------+
The TIMASTAMP constant and external data imports can support nanosecond precision.
● [5]:
In the original ODPS, due to historical reasons, DOUBLE can be implicitly converted to BIGINT. This conversion may cause data loss, which is not allowed by general database systems. To be compatible with the original ODPS mode, MaxCompute still allows this conversion even if the odps.sql.type.system.odps2 is not set to true. But a warning is reported. If the odps.sql.type.system.odps2 is set to true, this implicit type conversion is not allowed.
● [6]:
In the original ODPS, due to historical reasons, although the partition type can be specified as BIGINT, all other cases are processed as STRING except that the schema of the table indicates that it is BIGINT. For example,
create table parttest (a bigint) partitioned by (pt bigint);
insert into parttest partition(pt) select 1, 2 from dual;
insert into parttest partition(pt) select 1, 10 from dual;
select * from parttest where pt >= 2;
The returned result has only one line because 10 is compared by string and 2. It can not go back. To be compatible with the original ODPS mode, MaxCompute still treats it in this way even if the odps.sql.type.system.odps2 is not set to true. If odps.sql.type.system.odps2 is set to true, partitions of the BIGINT type are processed in strict accordance with the BIGINT type.
MaxCompute Unleashed - Part 1: Harnessing Compiler Errors and Warnings Effectively
137 posts | 19 followers
FollowAlibaba Cloud MaxCompute - January 22, 2024
Alibaba Cloud MaxCompute - January 29, 2024
Alibaba Cloud MaxCompute - January 29, 2024
Alibaba Cloud MaxCompute - January 22, 2024
Alibaba Cloud MaxCompute - January 29, 2024
Alibaba Cloud MaxCompute - February 18, 2024
137 posts | 19 followers
FollowConduct large-scale data warehousing with MaxCompute
Learn MoreAlibaba Cloud provides big data consulting services to help enterprises leverage advanced data technology.
Learn MoreAlibaba Cloud experts provide retailers with a lightweight and customized big data consulting service to help you assess your big data maturity and plan your big data journey.
Learn MoreRealtime Compute for Apache Flink offers a highly integrated platform for real-time data processing, which optimizes the computing of Apache Flink.
Learn MoreMore Posts by Alibaba Cloud MaxCompute