All Products
Search
Document Center

MaxCompute:Overview of Open Storage

Last Updated:Jan 07, 2026

MaxCompute provides the Storage API to improve integration with the big data ecosystem and allow external engines to access data in MaxCompute. Mainstream third-party compute engines can call the Storage API to directly access the underlying storage of MaxCompute. This significantly improves the efficiency of data access and interaction.

Open Storage overview

The Storage API is a data service interface that provides an efficient, low-latency, and secure method for data access. It allows mainstream third-party compute engines to access the MaxCompute storage system. This improves the integration and data processing efficiency between MaxCompute and open source compute engines or machine learning engines. Compute engines such as Spark, StarRocks, Presto, and Flink can use a connector to directly access data stored in MaxCompute. This simplifies the data usage process and improves data access performance. The following figure shows the architecture.

image

Scenarios

The Storage API is ideal for scenarios that require data accessibility and multi-engine computing. If enterprises or developers need to flexibly switch between computing frameworks or use specific engine features to process data in MaxCompute, the Storage API acts as a bridge to facilitate data exchange and enable diverse data processing.

Key features

  • High throughput: The Storage API supports efficient columnar data reads, predicate pushdown for data filtering before transmission, and the Arrow data format.

  • Secure and user-friendly: Offers direct readaccess to the underlying storage with table semantics. This abstracts away storage complexities while adhering to security policies such as project isolation, access control, and data encryption.

  • Ecosystem integration: Spark on EMR and StarRocks can use a connector to directly access data in MaxCompute. This simplifies compute engine integration.

Scope

  • Third-party engines that access MaxCompute can read standard tables, partitioned tables, clustered tables, Delta Tables, and materialized views. However, they cannot read external tables or logical views in MaxCompute.

  • Reading data of the JSON type is not supported.

  • For the pay-as-you-go Storage API, the number of concurrent requests is limited to 1,000 per tenant, and the transmission rate is limited to 10 MB/s per concurrent request.

Data transmission resources

When a third-party engine uses the MaxCompute Storage API for data transmission tasks, you can use either exclusive resource groups for Data Transmission Service (subscription) resources. The following table describes the resources.

Resource group name

Billing description

Supported regions

Usage instructions

Subscription fees for exclusive data transmission resources

This resource group is based on the subscription billing method. You are charged based on the number of concurrent instances that you purchase.

  • China (Beijing)

  • China (Hangzhou)

  • China (Shanghai)

  • China (Shenzhen)

  • China (Hong Kong)

  • Indonesia (Jakarta)

  • Singapore

  • US (Silicon Valley)

  • US (Virginia)

  • Japan (Tokyo)

  • Germany (Frankfurt)

Purchase and use an exclusive resource group for Data Transmission Service

On the Resource Observation page, you can view usage details for exclusive resource groups for Data Transmission Service (subscription). For more information, see Resource Observation.

Usage examples

Arrow data type mapping

The MaxCompute Storage API uses Apache Arrow data types to ensure an efficient and structured representation of data during storage and transmission. When you write data using the MaxCompute Storage API, the data is not computed or processed. For example, duplicate keys are not removed from Map data. The original data structure is retained unless the storage engine has specific restrictions.

The following table describes the mappings between MaxCompute and Apache Arrow data types.

MaxCompute data type

Arrow data type

TINYINT

Int8Type

SMALLINT

Int16Type

INT

Int32Type

BIGINT

Int64Type

FLOAT

FloatType

DOUBLE

DoubleType

BOOLEAN

BooleanType

DECIMAL

Decimal128Type

Note

The precision of the MaxCompute DECIMAL type is higher than that of the Storage API Decimal128Type.

  • When you read data, the MaxCompute DECIMAL type is converted to the Arrow decimal(38,18) type for output. An exception is thrown if an overflow occurs.

  • When you write data, the Arrow decimal(precision,scale) type is converted to the MaxCompute DECIMAL(38,18) type. The precision and scale must be the same.

DECIMAL(precision, scale)

Decimal128Type

STRING

StringType

BINARY

BinaryType

VARCHAR

StringType

CHAR

StringType

DATETIME

TimestampType[1]

The time unit is milliseconds, and the timezone is UTC.

TIMESTAMP

TimestampType[2]

The time unit is nanoseconds, and the timezone is UTC.

Note

The TIMESTAMP type supports a wider value range. If you use the Storage API to read or write TIMESTAMP data that is beyond the precision range of TimestampType, the high-precision part of the value is truncated. This causes a loss of precision.

DATE

Date32Type

INTERVAL_DAY_TIME

DayTimeIntervalType

Note

The INTERVAL_DAY_TIME type is accurate to nanoseconds. The Storage API DayTimeIntervalType type is accurate to milliseconds. If you use the Storage API to read or write INTERVAL_DAY_TIME data, the nanosecond part is truncated. This causes a loss of precision.

INTERVAL_YEAR_MONTH

MonthIntervalType

ARRAY

ListType

MAP

MapType

Note

If a MAP contains duplicate keys:

  • Write phase: The Storage API does not process duplicate keys in a Map when writing data. The original data structure is retained and written to storage.

  • Query phase: When you query data, the SQL engine uses the "last write wins" principle and returns a result that complies with the semantics of unique keys for the Map data type.

For example, if you write the raw data {'a': 1, 'b': 2, 'a': 3}, the query result is {'a': 3, 'b': 2}.

STRUCT

StructType

JSON

StringType