This topic describes the APIs that are specific to MaxFrame, including Session, Input/Output, Execute, and Fetch. These APIs provide a convenient way to process data in MaxFrame tasks.
Session
new_session
API name: new_session. For more information about the source code, see new_session.
new_session(
    session_id: str = None,
    default: bool = True,
    new: bool = True,
    odps_entry: Optional[ODPS] = None
)
Description: starts a MaxFrame task session.
Input parameters
Parameter | Data type | Required | Description
session_id | String | No | The session identifier. This parameter is used to specify a unique identifier for a new session. If this parameter is not specified, MaxFrame automatically generates a default identifier.
default | Boolean | No | Specifies whether to use the created session as the default session. Default value: True.
new | Boolean | No | Specifies whether to create a session. Default value: True. If this parameter is set to False, an existing session is reused based on session_id.
odps_entry | ODPS | Yes | The MaxCompute entry object. For more information, see Create a MaxCompute entry point.
Return value
The session object.
Sample code
import os

from maxframe import new_session
from odps import ODPS

# Use the MaxFrame account to initialize MaxCompute.
o = ODPS(
    # Set the environment variable ALIBABA_CLOUD_ACCESS_KEY_ID to the AccessKey ID of your Alibaba Cloud account.
    # Set the environment variable ALIBABA_CLOUD_ACCESS_KEY_SECRET to the AccessKey secret of your Alibaba Cloud account.
    # We recommend that you do not hard-code the actual AccessKey ID and AccessKey secret.
    os.getenv('ALIBABA_CLOUD_ACCESS_KEY_ID'),
    os.getenv('ALIBABA_CLOUD_ACCESS_KEY_SECRET'),
    project='your-default-project',
    endpoint='your-end-point',
)

# Initialize the MaxFrame session.
session = new_session(odps_entry=o)
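Because the sample reads both credentials with os.getenv, an unset variable silently becomes None. The following is a minimal sketch of checking the variables before a session is created. The helper names load_credentials and open_session are hypothetical and are not part of MaxFrame; the ODPS and new_session calls mirror the sample above, and the SDK imports are deferred so that the validation step can run on its own.

```python
import os
from typing import Mapping, Tuple

# Environment variable names taken from the sample above.
REQUIRED_VARS = ("ALIBABA_CLOUD_ACCESS_KEY_ID", "ALIBABA_CLOUD_ACCESS_KEY_SECRET")


def load_credentials(env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Return (access_key_id, access_key_secret) or raise if either is unset."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env[REQUIRED_VARS[0]], env[REQUIRED_VARS[1]]


def open_session(project: str, endpoint: str):
    """Hypothetical wrapper: build the ODPS entry object and start a session."""
    # Imported here so that load_credentials works without the SDKs installed.
    from maxframe import new_session
    from odps import ODPS

    access_id, access_secret = load_credentials()
    entry = ODPS(access_id, access_secret, project=project, endpoint=endpoint)
    return new_session(odps_entry=entry)
```

Calling open_session('your-default-project', 'your-end-point') would then fail early with a readable message instead of passing None to ODPS.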
Input/Output-related APIs
read_odps_table
API name: read_odps_table. For more information about the source code, see read_odps_table.
read_odps_table(
    table_name: Union[str, Table],
    partitions: Union[None, str, List[str]] = None,
    columns: Optional[List[str]] = None,
    index_col: Union[None, str, List[str]] = None,
    odps_entry: ODPS = None,
    string_as_binary: bool = None,
    append_partitions: bool = False
)
Description: reads data from a MaxCompute table and builds a DataFrame object. You can specify columns to use as indexes. If you do not specify indexes, a RangeIndex is generated.
Input parameters
Parameter | Data type | Required | Description
table_name | String/Table | Yes | The name of the MaxCompute table or table object from which you want to read data.
partitions | String/List | No | The table partition or partition list from which you want to read data. The format is <partition_name>=<partition_value>. If you do not specify this parameter, data from all partitions in the table is read.
columns | List | No | The names of columns from which you want to read data. The format is <column1>, <column2>, .... If you do not specify this parameter, data from all columns except partition key columns is read.
index_col | String/List | No | The names of columns that are used as indexes.
odps_entry | ODPS | No | The ODPS entry object. For more information, see Initialize an ODPS entry point.
string_as_binary | Boolean | No | Specifies whether to read string data in the binary form.
append_partitions | Boolean | No | Specifies whether to read data from partition key columns. The default value is False. If this parameter is set to True and the columns parameter is not specified, data from all columns, including partition key columns, is read.
Return value
The DataFrame object.
Sample code
import maxframe.dataframe as md

df = md.read_odps_table('BIGDATA_PUBLIC_DATASET.data_science.maxframe_ml_100k_users', index_col='user_id', columns=['age', 'sex'])
print(df.execute().fetch())

# Return value
user_id  age  sex
1        24   M
2        53   F
3        23   M
4        24   M
5        33   F
...      ...  ...
939      26   F
940      32   M
941      20   M
942      48   F
943      22   M
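The interaction between columns and append_partitions can be modeled as a small pure function. This is only an illustration of the rules stated in the parameter descriptions, not MaxFrame's implementation. The name resolve_columns and the partition key column pt are hypothetical, and the case where both columns and append_partitions=True are given is not covered by the description, so the sketch assumes that the explicit list wins.

```python
from typing import List, Optional


def resolve_columns(
    data_columns: List[str],
    partition_columns: List[str],
    columns: Optional[List[str]] = None,
    append_partitions: bool = False,
) -> List[str]:
    """Model which columns are read, following the parameter descriptions."""
    if columns is not None:
        # Assumption: an explicit column list takes precedence.
        return list(columns)
    if append_partitions:
        # No explicit list: data columns plus partition key columns.
        return list(data_columns) + list(partition_columns)
    # Default: all columns except partition key columns.
    return list(data_columns)


print(resolve_columns(["age", "sex"], ["pt"]))
print(resolve_columns(["age", "sex"], ["pt"], append_partitions=True))
```

The first call prints only the data columns, and the second also includes pt.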
read_odps_query
API name: read_odps_query. For more information about the source code, see read_odps_query.
read_odps_query(
    query: str,
    odps_entry: ODPS = None,
    index_col: Union[None, str, List[str]] = None,
    string_as_binary: bool = None
)
Description: reads the result of a MaxCompute SQL query and creates a DataFrame object. You can specify columns to use as indexes. If you do not specify indexes, a RangeIndex is generated.
Input parameters
Parameter | Data type | Required | Description
query | String | Yes | The MaxCompute SQL statement whose result you want to read.
odps_entry | ODPS | No | The MaxCompute entry object. For more information, see Create a MaxCompute entry point.
index_col | String/List | No | The names of columns that are used as indexes.
string_as_binary | Boolean | No | Specifies whether to read string data in the binary form.
Return value
The DataFrame object.
Sample code
import maxframe.dataframe as md

df = md.read_odps_query('select user_id, age, sex FROM `BIGDATA_PUBLIC_DATASET.data_science.maxframe_ml_100k_users`')
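The query argument is a plain string, so it can be assembled from a column list before it is passed to read_odps_query. The sketch below uses a hypothetical helper, build_select, that quotes the table name in backticks as in the sample. It does not escape or validate identifiers, so it is not suitable for untrusted input.

```python
from typing import Sequence


def build_select(table: str, columns: Sequence[str]) -> str:
    """Build a SELECT statement with the table name quoted in backticks."""
    if not columns:
        raise ValueError("columns must not be empty")
    return "SELECT {} FROM `{}`".format(", ".join(columns), table)


query = build_select(
    "BIGDATA_PUBLIC_DATASET.data_science.maxframe_ml_100k_users",
    ["user_id", "age", "sex"],
)
print(query)
```

The resulting string can then be used as md.read_odps_query(query, index_col='user_id').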
to_odps_table
API name: to_odps_table. For more information about the source code, see to_odps_table.
to_odps_table(
    table: Union[Table, str],
    partition: Optional[str] = None,
    partition_col: Union[None, str, List[str]] = None,
    overwrite: bool = False,
    unknown_as_string: Optional[bool] = None,
    index: bool = True,
    index_label: Union[None, str, List[str]] = None,
    lifecycle: Optional[int] = None
)
Description: writes a DataFrame object to a MaxCompute table. If the table does not exist in MaxCompute, MaxFrame automatically creates the table.
Input parameters
Parameter | Data type | Required | Description
table | String/Table | Yes | The name of the table or table object to which you want to write DataFrame data.
partition | String | No | The partition to which you want to write data. For example, pt1=xxx, pt2=yyy.
partition_col | String/List | No | The names of columns that are used as partition key columns in DataFrame.
overwrite | Boolean | No | Specifies whether to overwrite data if the table or partition already exists. Default value: False.
unknown_as_string | Boolean | No | Specifies whether to process data of an unrecognized type as the STRING data type. Default value: False. If this parameter is set to True, the object type in DataFrame is processed as the STRING data type. An error may occur.
index | Boolean | No | Specifies whether to store indexes. Default value: True.
index_label | String/List | No | The names of the index columns in the output table. If you do not specify this parameter, a single-level index is named `index`, and the levels of a multi-level index are named `level_x`, where x is the level of the index.
lifecycle | int | No | The lifecycle of the output table, in days. The value must be a positive integer. If the table already exists, this value overwrites the existing lifecycle setting of the table.
Return value
The DataFrame object.
Sample code
import maxframe.dataframe as md

df = md.read_odps_query('select user_id, age, sex FROM `BIGDATA_PUBLIC_DATASET.data_science.maxframe_ml_100k_users`', index_col='user_id')
output_df = df.to_odps_table('output_table', lifecycle=7)
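The partition parameter takes a string of key=value pairs, as in the pt1=xxx, pt2=yyy example. A minimal sketch of building that string from a dictionary follows. The helper name partition_spec is hypothetical, and emitting no space after the comma is an assumption, because the description does not state whether whitespace matters.

```python
from typing import Mapping


def partition_spec(parts: Mapping[str, object]) -> str:
    """Format partition key/value pairs as 'pt1=xxx,pt2=yyy'."""
    if not parts:
        raise ValueError("at least one partition key is required")
    # Dictionaries keep insertion order, so the partition levels stay in order.
    return ",".join("{}={}".format(key, value) for key, value in parts.items())


spec = partition_spec({"pt1": "xxx", "pt2": "yyy"})
print(spec)  # pt1=xxx,pt2=yyy
```

The result could then be passed as df.to_odps_table('output_table', partition=spec, lifecycle=7).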
to_odps_model
API name: to_odps_model.
to_odps_model(
    model_name: str,
    model_version: str = None,
    schema: str = None,
    project: str = None,
    description: Optional[str] = None,
    version_description: Optional[str] = None,
    create_model: bool = True,
    set_default_version: bool = False
)
Description: saves an XGBoost model that is trained by a MaxFrame job as a MaxCompute model object.
Input parameters
Parameter | Data type | Required | Description
model_name | String | Yes | The model name. If project and schema are specified separately in the job, specify only the model name. Otherwise, specify the model name in the project.schema.model format.
model_version | String | No | The model version. If you do not specify this parameter, the system automatically generates a version.
schema | String | No | The schema to which the model belongs. If you do not specify this parameter, the default schema is "default".
project | String | No | The project to which the model belongs.
description | String | No | The model description.
version_description | String | No | The description of the model version.
create_model | Boolean | No | Specifies whether to automatically create the model if it does not exist. Default value: True.
set_default_version | Boolean | No | Specifies whether to set the current version as the default version of the model. Default value: False.
Return value
A Scalar object. You can call .execute() to trigger the model saving operation.
Sample code
import maxframe.dataframe as md
from maxframe.learn.contrib.xgboost import XGBClassifier

# Train an XGBoost model.
# X, y, and cols are placeholders for your own feature data, labels, and column names.
X_df = md.DataFrame(X, columns=cols)
clf = XGBClassifier(n_estimators=10)
clf.fit(X_df, y)

# Save the model to MaxCompute.
clf.to_odps_model(
    model_name="my_model",
    # To include the project and schema in the name, use the following format:
    # model_name="project.schema.my_model"
    model_version="version1"
).execute()
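The two ways of locating a model (a bare name plus separate project and schema, or a single project.schema.model name) can be kept apart with a small helper that builds the keyword arguments. This is a sketch under assumptions: model_target is a hypothetical name, and rejecting a mix of both styles is a choice made here, not a rule stated for to_odps_model.

```python
from typing import Dict, Optional


def model_target(
    name: str, project: Optional[str] = None, schema: Optional[str] = None
) -> Dict[str, str]:
    """Build keyword arguments for to_odps_model from a model location."""
    if "." in name:
        # Qualified name: project and schema are already part of the name.
        if project or schema:
            raise ValueError("pass project/schema separately or in the name, not both")
        if len(name.split(".")) != 3:
            raise ValueError("expected the project.schema.model format")
        return {"model_name": name}
    kwargs = {"model_name": name}
    if project:
        kwargs["project"] = project
    if schema:
        kwargs["schema"] = schema
    return kwargs


print(model_target("my_model", project="my_project", schema="default"))
print(model_target("my_project.default.my_model"))
```

The returned dictionary could be unpacked into the call, for example clf.to_odps_model(**model_target("my_model"), model_version="version1").execute().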
Execute
execute
API name: execute. For more information about the source code, see execute.
execute(
    session: SessionType = None
)
Description: calls the execute method to start a data processing task.
Input parameters
Parameter | Data type | Required | Description
session | Session | No | The session that is used to run a data processing task. For more information about how to create a session, see new_session. If this parameter is not specified, the global session initialized using new_session is used.
Return value
N/A.
Sample code
import maxframe.dataframe as md

df = md.read_odps_query('select user_id, age, sex FROM BIGDATA_PUBLIC_DATASET.data_science.maxframe_ml_100k_users', index_col='user_id')
df.execute()
Fetch
fetch
API name: fetch. For more information about the source code, see fetch.
fetch(
    session: SessionType = None
)
Description: returns the result data to the on-premises environment.
Input parameters
Parameter | Data type | Required | Description
session | Session | No | The session that is used to obtain the result data. For more information about how to create a session, see new_session. If this parameter is not specified, the global session initialized using new_session is used.
Return value
A pandas DataFrame or Series.
Sample code
import maxframe.dataframe as md

df = md.read_odps_query('select user_id, age, sex FROM `BIGDATA_PUBLIC_DATASET.data_science.maxframe_ml_100k_users`', index_col='user_id')
res = df.execute().fetch()
print(res)

# Obtain the returned result.
user_id  age  sex
1        24   M
2        53   F
3        23   M
4        24   M
5        33   F
...      ...  ..
939      26   F
940      32   M
941      20   M
942      48   F
943      22   M

[943 rows x 2 columns]