This topic provides examples of how to perform common table operations in typical scenarios by using the MaxCompute SDK for Python (PyODPS).
Query all tables
You can use the list_tables() method of an entry object to query all tables in a project.
for table in odps.list_tables():
    print(table.name)  # Process each table in the project.
Check whether a table exists
You can use the exist_table() method of an entry object to check whether a table exists. Then, use the get_table() method to obtain information about the table.
t = odps.get_table('table_name')
t.schema
odps.Schema {
c_int_a bigint
c_int_b bigint
c_double_a double
c_double_b double
c_string_a string
c_string_b string
c_bool_a boolean
c_bool_b boolean
c_datetime_a datetime
c_datetime_b datetime
}
t.lifecycle
-1
print(t.creation_time)
2014-05-15 14:58:43
t.is_virtual_view
False
t.size
1408
t.schema.columns
[<column c_int_a, type bigint>,
<column c_int_b, type bigint>,
<column c_double_a, type double>,
<column c_double_b, type double>,
<column c_string_a, type string>,
<column c_string_b, type string>,
<column c_bool_a, type boolean>,
<column c_bool_b, type boolean>,
<column c_datetime_a, type datetime>,
<column c_datetime_b, type datetime>]
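The existence check and the lookup above can be combined into a small guard. The following is a sketch; get_table_if_exists is a hypothetical helper name, not part of PyODPS, and it works with any object that exposes exist_table() and get_table(), such as an entry object.

```python
def get_table_if_exists(o, name):
    """Return the table object if it exists, or None otherwise.

    `o` is any object exposing exist_table()/get_table(),
    such as a MaxCompute entry object.
    """
    if o.exist_table(name):
        return o.get_table(name)
    return None
```

Usage: `t = get_table_if_exists(odps, 'table_name')`, then check `t` for None before reading `t.schema`.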
Create a table schema
You can use one of the following methods to create a table schema:
Create a schema based on table columns and optional partitions.
from odps.models import Schema, Column, Partition

columns = [
    Column(name='num', type='bigint', comment='the column'),
    Column(name='num2', type='double', comment='the column2'),
]
partitions = [Partition(name='pt', type='string', comment='the partition')]
schema = Schema(columns=columns, partitions=partitions)
After you create a schema, you can obtain information about columns and partitions.
Obtain information about all columns.
print(schema.columns)
Sample response:
[<column num, type bigint>, <column num2, type double>, <partition pt, type string>]
Obtain information about partition key columns.
print(schema.partitions)
Sample response:
[<partition pt, type string>]
Obtain the names of non-partition key columns.
print(schema.names)
Sample response:
['num', 'num2']
Obtain the data types of non-partition key columns.
print(schema.types)
Sample response:
[bigint, double]
Create a schema by calling the Schema.from_lists() method. This method is easier to call, but you cannot directly configure comments for columns and partitions.
from odps.models import Schema

schema = Schema.from_lists(['num', 'num2'], ['bigint', 'double'], ['pt'], ['string'])
print(schema.columns)
Sample response:
[<column num, type bigint>, <column num2, type double>, <partition pt, type string>]
Create a table
You can call the o.create_table()
method to create a table by using a table schema or by specifying the names and data types of columns. When you create a table, you must make sure that the data types of columns in the table are valid.
Use a table schema to create a table
When you use a table schema to create a table, you must create a schema before you create a table.
# Create a table schema.
from odps.models import Schema
schema = Schema.from_lists(['num', 'num2'], ['bigint', 'double'], ['pt'], ['string'])
# Create a table by using the schema that you created.
table = o.create_table('my_new_table', schema)
# Create a table only if no table with the same name exists.
table = o.create_table('my_new_table', schema, if_not_exists=True)
# Configure the lifecycle of the table.
table = o.create_table('my_new_table', schema, lifecycle=7)
You can run print(o.exist_table('my_new_table')) to check whether the table is successfully created. If True is returned, the table exists.
Create a table by specifying the names and data types of the columns to be contained in the table
# Create a partitioned table named my_new_table with specified common columns and partition key columns.
table = o.create_table('my_new_table', ('num bigint, num2 double', 'pt string'), if_not_exists=True)
# Create a non-partitioned table named my_new_table02.
table = o.create_table('my_new_table02', 'num bigint, num2 double', if_not_exists=True)
You can run print(o.exist_table('my_new_table')) to check whether the table is successfully created. If True is returned, the table exists.
Create a table by specifying the names and data types of the columns to be contained in the table: new data types in the MaxCompute V2.0 data type edition
By default, when you create a table, only the BIGINT, DOUBLE, DECIMAL, STRING, DATETIME, BOOLEAN, MAP, and ARRAY data types are supported. If you need to use other data types such as TINYINT and STRUCT, you must set options.sql.use_odps2_extension
to True. Example:
from odps import options
options.sql.use_odps2_extension = True
table = o.create_table('my_new_table', 'cat smallint, content struct<title:varchar(100), body:string>')
Delete a table
You can call the delete_table()
method to delete an existing table.
o.delete_table('my_table_name', if_exists=True)  # Delete the table only if it exists.
t.drop()  # If you already hold a table object, call its drop() method instead.
Manage table partitions
Check whether a table is partitioned.
table = o.get_table('my_new_table')
if table.schema.partitions:
    print('Table %s is partitioned.' % table.name)
Iterate over all the partitions in a table.
table = o.get_table('my_new_table')
# Iterate over all partitions and display each partition name.
for partition in table.partitions:
    print(partition.name)
# Iterate over level-2 partitions under the partition named test.
for partition in table.iterate_partitions(spec='pt=test'):
    print(partition.name)
# Iterate over partitions that meet the dt>20230119 condition.
for partition in table.iterate_partitions(spec='dt>20230119'):
    print(partition.name)
Important: In PyODPS 0.11.3 and later, you can specify logical expressions for iterate_partitions, such as dt>20230119 in the preceding example.
Check whether a partition exists.
table = o.get_table('my_new_table')
table.exist_partition('pt=test,sub=2015')
Obtain information about a partition.
table = o.get_table('my_new_table')
partition = table.get_partition('pt=test')
print(partition.creation_time)
partition.size
Create a partition.
t = o.get_table('my_new_table')
t.create_partition('pt=test', if_not_exists=True)  # Create the partition only if it does not already exist.
Delete an existing partition.
t = o.get_table('my_new_table')
t.delete_partition('pt=test', if_exists=True)  # Set if_exists to True so that the call succeeds even if the partition does not exist.
partition.drop()  # If you already hold a partition object, call its drop() method instead.
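The partition specs in the calls above are plain strings such as 'pt=test,sub=2015'. If you build specs from variables, a small helper can keep the formatting in one place. The following is a sketch; make_partition_spec is a hypothetical helper, not part of PyODPS.

```python
def make_partition_spec(parts):
    """Build a partition spec string such as 'pt=test,sub=2015'
    from an ordered mapping of partition column -> value."""
    return ','.join('%s=%s' % (k, v) for k, v in parts.items())
```

Usage: `table.exist_partition(make_partition_spec({'pt': 'test', 'sub': 2015}))`.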
Read data from a table
You can read data from a table in different ways.
Use the head() method to retrieve the first records of a table. At most 10,000 records can be retrieved this way.
t = o.get_table('dual')
for record in t.head(3):
    print(record)  # Process each record.
Open a reader by using the with statement.
with t.open_reader(partition='pt=test') as reader:
    count = reader.count
    # The slice can be read repeatedly until all `count` records are consumed.
    # This code can be rewritten to read different slices in parallel.
    for record in reader[5:10]:
        print(record)  # Process one record.
Open a reader without using the with statement.
reader = t.open_reader(partition='pt=test')
count = reader.count
# The slice can be read repeatedly until all `count` records are consumed.
# This code can be rewritten to read different slices in parallel.
for record in reader[5:10]:
    print(record)  # Process one record.
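The comments above suggest splitting the read into slices so that different workers can call reader[start:end] in parallel. One way to compute the slice boundaries from reader.count is sketched below; slice_ranges is a hypothetical helper, independent of PyODPS.

```python
def slice_ranges(count, num_workers):
    """Split `count` records into contiguous (start, end) ranges,
    one per worker, so each worker can call reader[start:end]."""
    base, extra = divmod(count, num_workers)
    ranges = []
    start = 0
    for i in range(num_workers):
        # The first `extra` workers get one additional record each.
        end = start + base + (1 if i < extra else 0)
        if end > start:
            ranges.append((start, end))
        start = end
    return ranges
```

For example, `slice_ranges(10, 3)` yields `[(0, 4), (4, 7), (7, 10)]`; each range can then be handed to a separate process or thread that opens its own reader.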
Read data directly into a pandas DataFrame.
with t.open_reader(partition='pt=test') as reader:
    pd_df = reader.to_pandas()
Write data to a table
Similar to open_reader, you can use the open_writer method of a table object to open a writer and write data to the table.
Open a writer by using the with statement.
with t.open_writer(partition='pt=test') as writer:
    # Records can be provided as plain lists.
    records = [[111, 'aaa', True],
               [222, 'bbb', False],
               [333, 'ccc', True],
               [444, 'Chinese', False]]
    writer.write(records)  # Records can be any iterable object.

    # Records can also be Record objects.
    records = [t.new_record([111, 'aaa', True]),
               t.new_record([222, 'bbb', False]),
               t.new_record([333, 'ccc', True]),
               t.new_record([444, 'Chinese', False])]
    writer.write(records)
If the specified partition does not exist, set the create_partition parameter to True to create a partition. Example:
with t.open_writer(partition='pt=test', create_partition=True) as writer:
    records = [[111, 'aaa', True],  # Records can be provided as plain lists.
               [222, 'bbb', False],
               [333, 'ccc', True],
               [444, 'Chinese', False]]
    writer.write(records)  # Records can be any iterable object.
An easier way is to use the write_table method of the MaxCompute object to write data.
records = [[111, 'aaa', True],  # Records can be provided as plain lists.
           [222, 'bbb', False],
           [333, 'ccc', True],
           [444, 'Chinese', False]]
o.write_table('test_table', records, partition='pt=test', create_partition=True)
Note:
Each time you call write_table, MaxCompute generates a file on the server. This operation is time-consuming, and a large number of small files reduces the efficiency of subsequent queries. Therefore, we recommend that you write multiple records at a time, or provide a generator object, when you use the write_table method.
When you use the write_table method, new data is appended to existing data. PyODPS provides no option to overwrite existing data, so you must manually remove the data that you want to overwrite. For a non-partitioned table, call table.truncate(). For a partitioned table, delete the partitions first.
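To follow the batching advice above, small incoming records can be grouped into larger chunks so that each write_table call sends one sizable batch instead of many tiny ones. The following is a minimal sketch; chunked is a hypothetical helper, not part of PyODPS, and the batch size of 10000 is an illustrative choice.

```python
def chunked(records, size):
    """Yield lists of at most `size` records, so each write call
    sends one large batch instead of many small ones."""
    batch = []
    for rec in records:
        batch.append(rec)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # Flush the final, possibly smaller, batch.
        yield batch
```

Usage: `for batch in chunked(records, 10000): o.write_table('test_table', batch, partition='pt=test', create_partition=True)`.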
Use the Arrow format to read and write data
Apache Arrow is a cross-language format that supports data exchange between different platforms. MaxCompute has supported reading and writing table data in the Arrow format since 2021, and PyODPS 0.11.2 and later versions support this feature. After you install pyarrow in your Python environment, pass arrow=True when you call open_reader or open_writer to read or write Arrow RecordBatches.
import pandas as pd
import pyarrow as pa
with t.open_writer(partition='pt=test', create_partition=True, arrow=True) as writer:
records = [[111, 'aaa', True],
[222, 'bbb', False],
[333, 'ccc', True],
[444, 'Chinese', False]]
df = pd.DataFrame(records, columns=["int_val", "str_val", "bool_val"])
# Write a RecordBatch.
batch = pa.RecordBatch.from_pandas(df)
writer.write(batch)
# You can also use Pandas DataFrame directly.
writer.write(df)