All Products
Search
Document Center

MaxCompute:Instructions for using JSON data types in MaxCompute

Last Updated:Nov 19, 2024

MaxCompute supports JSON data types. This helps improve the computing and analytics performance of tables that store data of a JSON type. This topic provides instructions for using JSON data types in MaxCompute.

Introduction to JSON data types

Background information

Semi-structured data is a mix of structured data and unstructured data. Semi-structured data has flexible schemas that do not have strong constraints. In most cases, the schemas of semi-structured data are self-descriptive. JSON data is typical semi-structured data. To strengthen the support for semi-structured data in MaxCompute SQL, MaxCompute supports items such as schema evolution, the JSON STRING type, built-in complex type functions, and Lambda expressions. However, you must process semi-structured data in a standardized manner and import the processed data to tables that have schemas and are used to store structured data. If your business data changes, you must use the schema evolution feature to explicitly execute DDL statements to modify table schemas.image.png

The preceding mode has strong constraints on schemas. In this mode, you cannot quickly import semi-structured data to a system. Data whose schemas do not conform to schema specifications can only be discarded during data import. To resolve this issue, the MaxCompute development team designs JSON data types. The JSON data types are suitable for semi-structured data that does not have strong constraints on schemas and allow you to fully utilize the column-oriented storage mode. This helps meet requirements for high flexibility and high performance.

Basic principles

JSON data types are new data types introduced in MaxCompute. You can use JSON data types in a similar way as other data types. The support for JSON data types frees you from managing schemas. After you insert JSON data into a MaxCompute table, MaxCompute automatically extracts a public schema and optimizes the schema. To improve performance, we recommend that you store JSON data in the column-oriented storage mode. The following code shows an example:

CREATE TABLE json_table
(
    json_val  json
);

CREATE TABLE string_table
(
    string_val  STRING
);

INSERT INTO string_table VALUES
        ('{"a":1, "b":2}')
        ,('{"a":"key", "b":2}')
        ,('{"c":3}');

INSERT INTO json_table
SELECT  json_parse(string_val)
FROM    string_table;

When you insert data into a MaxCompute table, MaxCompute automatically extracts a public schema <"a":binary, "b":bigint, "c":bigint>. When you query data from the table, MaxCompute can prune columns based on the public schema. This reduces the amount of data that needs to be read and improves query efficiency. The following code shows a query example.

SELECT  json_val["b"]
        ,json_val["c"]
FROM    json_table
;  
-- When you query data from the table, MaxCompute performs a column pruning operation and retains only the b and c variables.
+-----+-----+
| _c0 | _c1 |
+-----+-----+
| 2   | NULL |
| 2   | NULL |
| NULL | 3   |
+-----+-----+

MaxCompute stores data that does not belong to the public schema by using the BINARY type instead of the original STRING type. This can reduce the storage space that is required. Compared with user-defined functions (UDFs), JSON data types help improve the efficiency of conversions between the STRING type and a JSON data type.

Use JSON data types

Note

For new MaxCompute projects, the odps.sql.type.json.enable parameter is set to true by default. For existing MaxCompute projects, the odps.sql.type.json.enable parameter is set to false by default. Therefore, if you want to use JSON data types for an existing MaxCompute project, you must run the set odps.sql.type.json.enable=true; command to enable the JSON data type feature. You can execute the setproject; code to confirm the value of the odps.sql.type.json.enable parameter.

Limits

  • The supported development tools include the MaxCompute client (odpscmd), MaxCompute Studio, and DataWorks. External ecosystems such as Dataphin are not supported. If you need to use JSON data types supported by MaxCompute in an external ecosystem, you must check whether the external ecosystem supports the JSON data types. The following table describes the items that you must take note of when you use the MaxCompute client (odpscmd) or MaxCompute Studio as the development tool for data of a JSON type.

    MaxCompute client (odpscmd)

    MaxCompute Studio

    • You must upgrade the MaxCompute client to V0.46.5 or later. Otherwise, you cannot execute the desc json_table statement or download data of JSON types by using Tunnel.

    • You must set the use_instance_tunnel parameter to false. This parameter is included in the odps_config.ini file in the conf folder in the installation path of the client. Otherwise, an error is reported when you perform a query.

    MaxCompute Studio allows you to query JSON data. MaxCompute Studio does not allow you to upload or download JSON data.

  • If you want to use another engine, such as Hologres, to read data from a MaxCompute table, JSON data in the table cannot be read.

  • Columns of a JSON type cannot be added to a MaxCompute table.

  • You are not allowed to compare data of a JSON type with data of other types, execute SQL statements that contain the ORDER BY or GROUP BY clause on data of a JSON type, or use columns of a JSON type as join keys.

  • For data of the JSON NUMBER type, the integer part is stored by using the BIGINT type, and the decimal part is stored by using the DOUBLE type. If the integer part exceeds the range that is supported by the BIGINT type, an integer overflow occurs. When the decimal part is converted into the DOUBLE type, precision loss occurs.

  • Strings that are used to generate JSON data cannot contain \u0000, which is the null character in Unicode.

  • Java UDFs and Python UDFs do not support JSON data types.

  • Clustered tables cannot store JSON data.

  • SDKs for Java in versions earlier than V0.44.0 and PyODPS in versions earlier than V0.11.4.1 do not support JSON data types

  • Delta tables do not support JSON data types.

  • The JSON data type can be used nested, supporting up to 20 levels of nesting.

Literal constants

JSON data types are defined based on JSON standards and include BOOLEAN, NUMBER, STRING, NULL, ARRAY, and OBJECT. For data of the JSON NUMBER type, different parts are separately stored by using the BIGINT type and DOUBLE type. If a part of data of the JSON NUMBER type exceeds the specified range, precision loss occurs. JSON 'null' is different from SQL null.

JSON 'null'
JSON '123'
JSON '123.34'
JSON 'true'
JSON '{"id":123,"name":"MaxCompute"}'
JSON '[12, 34]'

The constants that you use must conform to JSON standards. For example, JSON '{id:123,"name":"MaxCompute"}' is an invalid JSON string. id must be enclosed in double quotation marks (").

Define JSON data types

You can create a table that contains data of a JSON type in the same way as you create a table that contains data of basic data types. You do not need to specify schemas.

CREATE TABLE mf_json_table (json_val JSON);

Generate JSON data

You can use one of the following methods to generate JSON data:

  • JSON Literal

    insert into mf_json_table values (json '123');
  • JSON functions

    -- json_object and json_array are built-in functions of MaxCompute.
    insert into mf_json_table select json_object("key",123, "value", "abc");
    
    select * from mf_json_table;
    
    -- The following result is returned:
    +----------+
    | json_val |
    +----------+
    | 123      |
    | {"key":123,"value":"abc"} |
    +----------+
    
    
    insert into mf_json_table select json_array("key",234, "value", "abc");
    
    select * from mf_json_table;
    
    -- The following result is returned:
    +----------+
    | json_val |
    +----------+
    | 123      |
    | ["key",234,"value","abc"] |
    | {"key":123,"value":"abc"} |
    +----------+
  • Type conversion

    You must take note of the differences between the cast function and the json_parse function. For more information, see Complex type functions.

    insert into mf_json_table select cast("abc" as json);
    select * from mf_json_table;
    -- The following result is returned:
    +----------+
    | json_val |
    +----------+
    | 123      |
    | "abc"    |
    | ["key",234,"value","abc"] |
    | {"key":123,"value":"abc"} |
    +----------+

Access JSON data

You can use the indexing method, json_extract function, or get_json_object function to access JSON data. The returned results are of a JSON type.

Indexing method

Access to JSON data is in the strict mode. The indexing method supports access based on a specified index and access based on a field name. If the JSON path is inconsistent with the actual path of JSON data, NULL is returned.

json_val['a'] [0][1] is equivalent to json_extract(json_val, 'strict $.a[0][1]').

-- 123 is returned.
SELECT v['id'] 
  FROM VALUES (JSON '{"id":123}') as t(v);
  
-- 12 is returned.
SELECT v[0] 
  FROM VALUES (JSON '[12, 34]') as t(v);
  
-- 1 is returned.
select v['x']['a']  from values (json '{"x": {"a": 1, "b": 2}}') as t(v);

-- NULL is returned.
SELECT v[0] 
FROM VALUES (JSON '{"id":123}') as t(v);

-- NULL is returned.
SELECT v['not_exists'] 
FROM VALUES (JSON '{"id":123}') as t(v);

JSON functions

In this example, the json_extract or get_json_object function is used to access JSON data.

-- Access JSON data by using the get_json_object function. MaxCompute is returned.
SELECT GET_JSON_OBJECT(v, '$.x.name')
  FROM VALUES (JSON '{"x": {"id": 1001, "name": "MaxCompute"}}') as t(v);
  
-- The following result is returned:
+-----+
| _c0 |
+-----+
| MaxCompute |
+-----+

-- Access JSON data by using the json_extract function. "MaxCompute" of a JSON type is returned.
SELECT JSON_EXTRACT(v, '$.x.name') 
  FROM VALUES (JSON '{"x": {"id": 1001, "name": "MaxCompute"}}') as t(v);
  
-- The following result is returned:
+-----+
| _c0 |
+-----+
| "MaxCompute" |
+-----+
Note

The newly designed JSON data types use a more standard method to parse a JSON path, which is different from parsing by using the get_json_object function of MaxCompute. The two parsing methods may be incompatible with each other. We recommend that you use the json_extract function for the parsing. For more built-in functions that are supported by MaxCompute SQL, see Complex type functions.

Specifications for a JSON path

A JSON path specifies the position of a node in JSON data and helps you find the desired node and obtain the desired data with ease. A JSON path is used as a parameter of a JSON function. The JSON path parser used for JSON data types is the same as the JSON path parser used in PostgreSQL and is a subset of PostgreSQL. The following code shows an example:

  • Sample JSON data:

    { "name": "Molly",
      "phones": [ 
        { "phonetype": "work",
        "phone#": "650-506-7000" 
        },
        { "phonetype": "cell",
          "phone#": "650-555-5555" 
        }
      ]
    }
  • Sample JSON path: The parsing result obtained based on $.phones[1]."phone#' is "650-555-5555".

The following table describes specifications for a JSON path based on the preceding JSON data.

Variable

Operator

accessor

  • member accessor: $.phone. Special characters such as $."sf*" are supported.

  • wildcard member accessor: $.*.

  • element accessor: $[1, 2, 4 to 7].

  • wildcard element accessor: $[*].

mode

Valid values: lax and strict. Default value: lax.

  • lax: The lax mode contains the wrapping and unwrapping processes. Example for a JSON path in the lax mode: 'lax $.phones.phonetype'.

    The results of the following expressions are returned based on the preceding JSON data:

    • $[0]: wrap object [{....}] indicates that data whose index is 0 is accessed and {....} is returned.

    • $[1]: wrap object [{....}] indicates that data whose index is 1 is accessed and NULL is returned.

    • $.name.*: The value of the name parameter is Molly, and a value of the OBJECT type is expected to be returned. NULL is returned.

    • $.name[*]: The value of the name parameter is Molly, and a value of the ARRAY type is expected to be returned. The value Molly is wrapped into ["Molly"], and ["Molly"] is returned.

    • $.phones.phonetype: The value of the phones parameter is of the ARRAY type. The value is unwrapped into two values of the OBJECT type. The system continues to obtain the values of the phonetype parameter in the two values of the OBJECT type and finally returns ["work","cell"].

    • $.phones[*].phonetype: The values of the phonetype parameter are directly obtained, and ["work","cell"] is returned.

  • strict: The strict mode requires that a JSON path be the same as the actual path of JSON data to be accessed. If the paths are different, NULL is returned. Example for a JSON path in the strict mode: 'strict $.phones.phonetype'.

    The results of the following expressions are returned based on the preceding JSON data:

    • strict $.phones.phonetype: The value of a subnode of the phones parameter is of the ARRAY type, and a value of the OBJECT type is expected to be returned. NULL is returned.

    • strict $.address: The address variable does not exist. Therefore, NULL is returned.

Important

The lax mode does not support column pruning. The strict mode supports column pruning.

Example

-- If the odps.sql.type.json.enable parameter is set to false for your project, you must execute the following statements:
set odps.sql.type.json.enable=true;
create table json_table(json_val json);

create table mf_string_table(string_val string);
insert into mf_string_table values('{"a":1, "b":2}');

insert into json_table select json_parse(string_val) 
                         from mf_string_table 
                         where json_valid(string_val);


select * from json_table where json_val is not null;
-- The following result is returned:
+----------+
| json_val |
+----------+
| {"a":1,"b":2} |
+----------+

select json_val['b'] from json_table where json_val is not null;
-- The following result is returned:
+-----+
| _c0 |
+-----+
| 2   |
+-----+