This topic provides examples of how to perform common table operations in typical scenarios by using the MaxCompute SDK for Python (PyODPS).
Query all tables
You can use the list_tables() method of an entry object to query all tables in a project.
for table in odps.list_tables():
    print(table.name)  # Process each table in the project.
Check whether a table exists
You can use the exist_table() method of an entry object to check whether a table exists. Then, use the get_table() method to obtain information about the table.
t = odps.get_table('table_name')
t.schema
odps.Schema {
c_int_a bigint
c_int_b bigint
c_double_a double
c_double_b double
c_string_a string
c_string_b string
c_bool_a boolean
c_bool_b boolean
c_datetime_a datetime
c_datetime_b datetime
}
t.lifecycle
-1
print(t.creation_time)
2014-05-15 14:58:43
t.is_virtual_view
False
t.size
1408
t.schema.columns
[<column c_int_a, type bigint>,
<column c_int_b, type bigint>,
<column c_double_a, type double>,
<column c_double_b, type double>,
<column c_string_a, type string>,
<column c_string_b, type string>,
<column c_bool_a, type boolean>,
<column c_bool_b, type boolean>,
<column c_datetime_a, type datetime>,
<column c_datetime_b, type datetime>]
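The existence check and the lookup above can be combined into a small guard. The following is a sketch; get_table_if_exists is a hypothetical helper name, not part of PyODPS, and it works with any object that exposes exist_table() and get_table(), such as an entry object.

```python
def get_table_if_exists(o, name):
    """Return the table object if it exists, or None otherwise.

    `o` is any object exposing exist_table()/get_table(),
    such as a MaxCompute entry object.
    """
    if o.exist_table(name):
        return o.get_table(name)
    return None
```

Usage: `t = get_table_if_exists(odps, 'table_name')`, then check `t` for None before reading `t.schema`.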
Create a table schema
You can use one of the following methods to create a table schema:
Create a schema based on table columns and optional partitions.
from odps.models import Schema, Column, Partition

columns = [
    Column(name='num', type='bigint', comment='the column'),
    Column(name='num2', type='double', comment='the column2'),
]
partitions = [Partition(name='pt', type='string', comment='the partition')]
schema = Schema(columns=columns, partitions=partitions)
After you create a schema, you can obtain information about columns and partitions.
Obtain information about all columns.
print(schema.columns)
Sample response:
[<column num, type bigint>, <column num2, type double>, <partition pt, type string>]
Obtain information about partition key columns.
print(schema.partitions)
Sample response:
[<partition pt, type string>]
Obtain the names of non-partition key columns.
print(schema.names)
Sample response:
['num', 'num2']
Obtain the data types of non-partition key columns.
print(schema.types)
Sample response:
[bigint, double]
Create a schema by calling the Schema.from_lists() method. This method is easier to call, but you cannot directly configure comments for columns and partitions.
from odps.models import Schema

schema = Schema.from_lists(['num', 'num2'], ['bigint', 'double'], ['pt'], ['string'])
print(schema.columns)
Sample response:
[<column num, type bigint>, <column num2, type double>, <partition pt, type string>]
Create a table
You can call the o.create_table()
method to create a table by using a table schema or by specifying the names and data types of columns. When you create a table, you must make sure that the data types of columns in the table are valid.
Use a table schema to create a table
When you use a table schema to create a table, you must create a schema before you create a table.
# Create a table schema.
from odps.models import Schema
schema = Schema.from_lists(['num', 'num2'], ['bigint', 'double'], ['pt'], ['string'])
# Create a table by using the schema that you created.
table = o.create_table('my_new_table', schema)
# Create a table only if no table with the same name exists.
table = o.create_table('my_new_table', schema, if_not_exists=True)
# Configure the lifecycle of the table.
table = o.create_table('my_new_table', schema, lifecycle=7)
You can run print(o.exist_table('my_new_table')) to check whether the table is successfully created. If True is returned, the table exists.
Create a table by specifying the names and data types of the columns to be contained in the table
# Create a partitioned table named my_new_table with specified common columns and partition key columns.
table = o.create_table('my_new_table', ('num bigint, num2 double', 'pt string'), if_not_exists=True)
# Create a non-partitioned table named my_new_table02.
table = o.create_table('my_new_table02', 'num bigint, num2 double', if_not_exists=True)
You can run print(o.exist_table('my_new_table')) to check whether the table is successfully created. If True is returned, the table exists.
Create a table by specifying the names and data types of the columns to be contained in the table: new data types in the MaxCompute V2.0 data type edition
By default, when you create a table, only the BIGINT, DOUBLE, DECIMAL, STRING, DATETIME, BOOLEAN, MAP, and ARRAY data types are supported. If you need to use other data types such as TINYINT and STRUCT, you must set options.sql.use_odps2_extension
to True. Example:
from odps import options
options.sql.use_odps2_extension = True
table = o.create_table('my_new_table', 'cat smallint, content struct<title:varchar(100), body:string>')
Delete a table
You can call the delete_table()
method to delete an existing table.
o.delete_table('my_table_name', if_exists=True)  # Delete the table only if it exists.
t.drop()  # If you already hold a table object, call its drop() method instead.
Manage table partitions
Check whether a table is partitioned.
table = o.get_table('my_new_table')
if table.schema.partitions:
    print('Table %s is partitioned.' % table.name)
Iterate over all the partitions in a table.
table = o.get_table('my_new_table')
# Iterate over all partitions and display each partition name.
for partition in table.partitions:
    print(partition.name)
# Iterate over level-2 partitions under the partition named test.
for partition in table.iterate_partitions(spec='pt=test'):
    print(partition.name)
# Iterate over partitions that meet the dt>20230119 condition.
for partition in table.iterate_partitions(spec='dt>20230119'):
    print(partition.name)
Important: In PyODPS 0.11.3 and later, you can specify logical expressions for iterate_partitions, such as dt>20230119 in the preceding example.
Check whether a partition exists.
table = o.get_table('my_new_table')
table.exist_partition('pt=test,sub=2015')
Obtain information about a partition.
table = o.get_table('my_new_table')
partition = table.get_partition('pt=test')
print(partition.creation_time)
partition.size
Create a partition.
t = o.get_table('my_new_table')
t.create_partition('pt=test', if_not_exists=True)  # Create the partition only if it does not already exist.
Delete an existing partition.
t = o.get_table('my_new_table')
t.delete_partition('pt=test', if_exists=True)  # Set if_exists to True so that the call succeeds even if the partition does not exist.
partition.drop()  # If you already hold a partition object, call its drop() method instead.
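The partition specs in the calls above are plain strings such as 'pt=test,sub=2015'. If you build specs from variables, a small helper can keep the formatting in one place. The following is a sketch; make_partition_spec is a hypothetical helper, not part of PyODPS.

```python
def make_partition_spec(parts):
    """Build a partition spec string such as 'pt=test,sub=2015'
    from an ordered mapping of partition column -> value."""
    return ','.join('%s=%s' % (k, v) for k, v in parts.items())
```

Usage: `table.exist_partition(make_partition_spec({'pt': 'test', 'sub': 2015}))`.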
Read data from a table
You can read data from a table in different ways.
Use the head() method to retrieve the first records of a table. At most 10,000 records can be retrieved this way.
t = o.get_table('dual')
for record in t.head(3):
    print(record)  # Process each record.
Open a reader by using the with statement.
with t.open_reader(partition='pt=test') as reader:
    count = reader.count
    # The slice can be read repeatedly until all `count` records are consumed.
    # This code can be rewritten to read different slices in parallel.
    for record in reader[5:10]:
        print(record)  # Process one record.
Open a reader without using the with statement.
reader = t.open_reader(partition='pt=test')
count = reader.count
# The slice can be read repeatedly until all `count` records are consumed.
# This code can be rewritten to read different slices in parallel.
for record in reader[5:10]:
    print(record)  # Process one record.
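The comments above suggest splitting the read into slices so that different workers can call reader[start:end] in parallel. One way to compute the slice boundaries from reader.count is sketched below; slice_ranges is a hypothetical helper, independent of PyODPS.

```python
def slice_ranges(count, num_workers):
    """Split `count` records into contiguous (start, end) ranges,
    one per worker, so each worker can call reader[start:end]."""
    base, extra = divmod(count, num_workers)
    ranges = []
    start = 0
    for i in range(num_workers):
        # The first `extra` workers get one additional record each.
        end = start + base + (1 if i < extra else 0)
        if end > start:
            ranges.append((start, end))
        start = end
    return ranges
```

For example, `slice_ranges(10, 3)` yields `[(0, 4), (4, 7), (7, 10)]`; each range can then be handed to a separate process or thread that opens its own reader.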
Read data directly into a pandas DataFrame.
with t.open_reader(partition='pt=test') as reader:
    pd_df = reader.to_pandas()
Write data to a table
Similar to open_reader, you can use the open_writer method of a table object to open a writer and write data to the table.
Open a writer by using the with statement.
with t.open_writer(partition='pt=test') as writer:
    # Records can be provided as plain lists.
    records = [[111, 'aaa', True],
               [222, 'bbb', False],
               [333, 'ccc', True],
               [444, 'Chinese', False]]
    writer.write(records)  # Records can be any iterable object.

    # Records can also be Record objects.
    records = [t.new_record([111, 'aaa', True]),
               t.new_record([222, 'bbb', False]),
               t.new_record([333, 'ccc', True]),
               t.new_record([444, 'Chinese', False])]
    writer.write(records)
If the specified partition does not exist, set the create_partition parameter to True to create a partition. Example:
with t.open_writer(partition='pt=test', create_partition=True) as writer:
    records = [[111, 'aaa', True],  # Records can be provided as plain lists.
               [222, 'bbb', False],
               [333, 'ccc', True],
               [444, 'Chinese', False]]
    writer.write(records)  # Records can be any iterable object.
An easier way is to use the write_table method of the MaxCompute object to write data.
records = [[111, 'aaa', True],  # Records can be provided as plain lists.
           [222, 'bbb', False],
           [333, 'ccc', True],
           [444, 'Chinese', False]]
o.write_table('test_table', records, partition='pt=test', create_partition=True)
Note:
Each time you call write_table, MaxCompute generates a file on the server. This operation is time-consuming, and a large number of small files reduces the efficiency of subsequent queries. Therefore, we recommend that you write multiple records at a time, or provide a generator object, when you use the write_table method.
When you use the write_table method, new data is appended to existing data. PyODPS provides no option to overwrite existing data, so you must manually remove the data that you want to overwrite. For a non-partitioned table, call table.truncate(). For a partitioned table, delete the partitions first.
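To follow the batching advice above, small incoming records can be grouped into larger chunks so that each write_table call sends one sizable batch instead of many tiny ones. The following is a minimal sketch; chunked is a hypothetical helper, not part of PyODPS, and the batch size of 10000 is an illustrative choice.

```python
def chunked(records, size):
    """Yield lists of at most `size` records, so each write call
    sends one large batch instead of many small ones."""
    batch = []
    for rec in records:
        batch.append(rec)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # Flush the final, possibly smaller, batch.
        yield batch
```

Usage: `for batch in chunked(records, 10000): o.write_table('test_table', batch, partition='pt=test', create_partition=True)`.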
Use the Arrow format to read and write data
Apache Arrow is a cross-language format that supports data exchange between different platforms. MaxCompute has supported reading and writing table data in the Arrow format since 2021, and PyODPS 0.11.2 and later versions support this feature. After you install pyarrow in your Python environment, pass arrow=True when you call open_reader or open_writer to read or write Arrow RecordBatches.
import pandas as pd
import pyarrow as pa
with t.open_writer(partition='pt=test', create_partition=True, arrow=True) as writer:
records = [[111, 'aaa', True],
[222, 'bbb', False],
[333, 'ccc', True],
[444, 'Chinese', False]]
df = pd.DataFrame(records, columns=["int_val", "str_val", "bool_val"])
# Write a RecordBatch.
batch = pa.RecordBatch.from_pandas(df)
writer.write(batch)
# You can also use Pandas DataFrame directly.
writer.write(df)