MaxCompute MaxFrame is a distributed computing framework developed by Alibaba Cloud. It is compatible with pandas interfaces and performs distributed processing automatically, combining pandas' ease of use with the processing scale and computing efficiency of MaxCompute.
MaxFrame allows you to analyze data in a distributed environment using the same APIs as pandas. With MaxFrame, you can perform data analysis and computation on MaxCompute at performance tens of times that of open-source pandas. This article introduces how to use common pandas operators with MaxFrame.
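As a quick illustration of the API compatibility, the following minimal sketch contrasts the same expression in pandas and in MaxFrame. It assumes that md.DataFrame can wrap a pandas DataFrame and that a MaxFrame session exists before execute() is called, as shown in the scripts later in this article:
import pandas as pd
import maxframe.dataframe as md

# pandas: eager, runs locally.
pdf = pd.DataFrame({"price": [5000, 4000], "quantity": [10, 7]})
print((pdf["price"] * pdf["quantity"]).sum())

# MaxFrame: the same expression, but lazy. It builds a computation graph
# that runs on MaxCompute only when .execute() is called (which requires
# a session created with new_session(o), as shown below).
mdf = md.DataFrame(pdf)
total = (mdf["price"] * mdf["quantity"]).sum()
# total.execute().fetch()  # triggers distributed execution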
Prerequisites: MaxFrame has been installed. For more information, please refer to Preparations.
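If MaxFrame is not installed yet, it can typically be installed with pip install maxframe; see Preparations for the full setup steps.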
1. Run the following script in a Python environment where MaxFrame is installed to prepare the test tables and test data:
from odps import ODPS
from odps.df import DataFrame as ODPSDataFrame
from maxframe.session import new_session
import maxframe.dataframe as md
import pandas as pd
import os

o = ODPS(
    # Set the environment variable ALIBABA_CLOUD_ACCESS_KEY_ID to the AccessKey ID of your Alibaba Cloud account.
    # Set the environment variable ALIBABA_CLOUD_ACCESS_KEY_SECRET to the AccessKey secret of your Alibaba Cloud account.
    # We recommend that you do not hard-code your AccessKey ID or AccessKey secret in your code.
    os.getenv('ALIBABA_CLOUD_ACCESS_KEY_ID'),
    os.getenv('ALIBABA_CLOUD_ACCESS_KEY_SECRET'),
    project='your-default-project',
    endpoint='your-end-point',
)
data_sets = [
    {
        "table_name": "product",
        "table_schema": "index bigint, product_id bigint, product_name string, current_price bigint",
        "source_type": "records",
        "records": [
            [1, 100, 'Nokia', 1000],
            [2, 200, 'Apple', 5000],
            [3, 300, 'Samsung', 9000],
        ],
    },
    {
        "table_name": "sales",
        "table_schema": "index bigint, sale_id bigint, product_id bigint, user_id bigint, year bigint, quantity bigint, price bigint",
        "source_type": "records",
        "records": [
            [1, 1, 100, 101, 2008, 10, 5000],
            [2, 2, 300, 101, 2009, 7, 4000],
            [3, 4, 100, 102, 2011, 9, 4000],
            [4, 5, 200, 102, 2013, 6, 6000],
            [5, 8, 300, 102, 2015, 10, 9000],
            [6, 9, 100, 102, 2015, 6, 2000],
        ],
        "lifecycle": 5,
    },
]
def prepare_data(o: ODPS, data_sets, suffix="", drop_if_exists=False):
    for index, data in enumerate(data_sets):
        table_name = data.get("table_name")
        table_schema = data.get("table_schema")
        source_type = data.get("source_type")
        if not table_name or not table_schema or not source_type:
            raise ValueError(f"Dataset at index {index} is missing one or more required keys: 'table_name', 'table_schema', or 'source_type'.")
        lifecycle = data.get("lifecycle", 5)
        table_name += suffix
        print(f"Processing {table_name}...")
        if drop_if_exists:
            print(f"Deleting {table_name}...")
            o.delete_table(table_name, if_exists=True)
        o.create_table(name=table_name, table_schema=table_schema, lifecycle=lifecycle, if_not_exists=True)
        if source_type == "local_file":
            file_path = data.get("file")
            if not file_path:
                raise ValueError(f"Dataset at index {index} with source_type 'local_file' is missing the 'file' key.")
            sep = data.get("sep", ",")
            pd_df = pd.read_csv(file_path, sep=sep)
            ODPSDataFrame(pd_df).persist(table_name, drop_table=True)
        elif source_type == "records":
            records = data.get("records")
            if not records:
                raise ValueError(f"Dataset at index {index} with source_type 'records' is missing the 'records' key.")
            with o.get_table(table_name).open_writer() as writer:
                writer.write(records)
        else:
            raise ValueError(f"Unknown data set source_type: {source_type}")
        print(f"Processed {table_name} Done")
prepare_data(o, data_sets, "_maxframe_demo", True)
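Before running the script, the credentials must be available as environment variables. A minimal sketch (the values below are placeholders, not real keys; exporting them in your shell works equally well):
import os

# Make the AccessKey pair visible to os.getenv() in the script above.
os.environ["ALIBABA_CLOUD_ACCESS_KEY_ID"] = "<your-access-key-id>"
os.environ["ALIBABA_CLOUD_ACCESS_KEY_SECRET"] = "<your-access-key-secret>"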
Parameters:
• project: the name of the MaxCompute project to use.
• endpoint: the endpoint of your MaxCompute project, for example, http://service.cn-chengdu.maxcompute.aliyun.com/api. For more information, please refer to Configure endpoints.

2. Execute the following SQL statements to query the data in the sales_maxframe_demo and product_maxframe_demo tables:
-- Query the sales_maxframe_demo table.
SELECT * FROM sales_maxframe_demo;
-- The following result is returned:
+------------+------------+------------+------------+------------+------------+------------+
| index | sale_id | product_id | user_id | year | quantity | price |
+------------+------------+------------+------------+------------+------------+------------+
| 1 | 1 | 100 | 101 | 2008 | 10 | 5000 |
| 2 | 2 | 300 | 101 | 2009 | 7 | 4000 |
| 3 | 4 | 100 | 102 | 2011 | 9 | 4000 |
| 4 | 5 | 200 | 102 | 2013 | 6 | 6000 |
| 5 | 8 | 300 | 102 | 2015 | 10 | 9000 |
| 6 | 9 | 100 | 102 | 2015 | 6 | 2000 |
+------------+------------+------------+------------+------------+------------+------------+
-- Query the product_maxframe_demo table.
SELECT * FROM product_maxframe_demo;
-- The following result is returned:
+------------+------------+--------------+---------------+
| index | product_id | product_name | current_price |
+------------+------------+--------------+---------------+
| 1 | 100 | Nokia | 1000 |
| 2 | 200 | Apple | 5000 |
| 3 | 300 | Samsung | 9000 |
+------------+------------+--------------+---------------+
Scenario 1: Query the product_name corresponding to each sale_id in the sales_maxframe_demo table, together with the year and price of each sale.
• Sample code
from odps import ODPS
from maxframe.session import new_session
import maxframe.dataframe as md
import os

o = ODPS(
    # Set the environment variable ALIBABA_CLOUD_ACCESS_KEY_ID to the AccessKey ID of your Alibaba Cloud account.
    # Set the environment variable ALIBABA_CLOUD_ACCESS_KEY_SECRET to the AccessKey secret of your Alibaba Cloud account.
    # We recommend that you do not hard-code your AccessKey ID or AccessKey secret in your code.
    os.getenv('ALIBABA_CLOUD_ACCESS_KEY_ID'),
    os.getenv('ALIBABA_CLOUD_ACCESS_KEY_SECRET'),
    project='your-default-project',
    endpoint='your-end-point',
)
session = new_session(o)
# The session ID is a string that identifies the MaxFrame session. It is useful for debugging and tracking the task status.
print(session.session_id)
sales = md.read_odps_table("sales_maxframe_demo", index_col="index")
product = md.read_odps_table("product_maxframe_demo", index_col="product_id")
# df is not executed immediately; computation is triggered only when df.execute() is called.
# This way, all computations are completed inside the MaxCompute cluster, avoiding unnecessary intermediate data transfers and bottlenecks.
df = sales.merge(product, left_on="product_id", right_index=True)
df = df[["product_name", "year", "price"]]
print(df.execute().fetch())
# Save the result to the MaxCompute table and destroy the Session.
md.to_odps_table(df, "result_df", overwrite=True).execute()
session.destroy()
• The following result is returned:
index product_name year price
1 Nokia 2008 5000
2 Samsung 2009 4000
3 Nokia 2011 4000
4 Apple 2013 6000
5 Samsung 2015 9000
6 Nokia 2015 2000
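Note that execute() runs the computation inside MaxCompute, while fetch() downloads the result to the local client. For large results, it is preferable to write them to a MaxCompute table with md.to_odps_table(...).execute(), as the sample code above does, rather than fetching them locally.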
• Performance comparison
In a data sample where the sales table has 50 million records (size: 1.96 GB) and the product table has 100,000 records (size: 3 MB), the computation yields the following time comparison results:
Environment                  | Time (unit: seconds)
Local Pandas (version 1.3.5) | 65.8
MaxFrame                     | 22
Scenario 2: Query the sales information (quantity and price) of each product in the first year it was sold.
• Sample code
from odps import ODPS
from maxframe.session import new_session
import maxframe.dataframe as md
import os

o = ODPS(
    # Set the environment variable ALIBABA_CLOUD_ACCESS_KEY_ID to the AccessKey ID of your Alibaba Cloud account.
    # Set the environment variable ALIBABA_CLOUD_ACCESS_KEY_SECRET to the AccessKey secret of your Alibaba Cloud account.
    # We recommend that you do not hard-code your AccessKey ID or AccessKey secret in your code.
    os.getenv('ALIBABA_CLOUD_ACCESS_KEY_ID'),
    os.getenv('ALIBABA_CLOUD_ACCESS_KEY_SECRET'),
    project='your-default-project',
    endpoint='your-end-point',
)
session = new_session(o)
# The session ID is a string that identifies the MaxFrame session. It is useful for debugging and tracking the task status.
print(session.session_id)
# Aggregate to get the first year of each product.
min_year_df = md.read_odps_table("sales_maxframe_demo", index_col="index")
min_year_df = min_year_df.groupby('product_id', as_index=False).agg(first_year=('year', 'min'))
# Find the corresponding sales record through join.
sales = md.read_odps_table("sales_maxframe_demo", index_col=['product_id', 'year'])
result_df = md.merge(sales, min_year_df,
                     left_index=True,
                     right_on=['product_id', 'first_year'],
                     how='inner')
# result_df is not executed immediately; computation is triggered only when result_df.execute() is called.
# This way, all computations are completed inside the MaxCompute cluster, avoiding unnecessary intermediate data transfers and bottlenecks.
result_df = result_df[['product_id', 'first_year', 'quantity', 'price']]
print(result_df.execute().fetch())
# Destroy the Session.
session.destroy()
• The following result is returned:
product_id first_year quantity price
100 100 2008 10 5000
300 300 2009 7 4000
200 200 2013 6 6000
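In this example, reading the sales table with index_col=['product_id', 'year'] lets the merge match the left index against the right frame's ['product_id', 'first_year'] columns, so only each product's first-year sales rows are kept.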
• Performance comparison
In a data sample where the sales table has 50 million records (size: 1.96 GB) and the product table has 100,000 records (size: 3 MB), the computation yields the following time comparison results:
Environment                  | Time (unit: seconds)
Local Pandas (version 1.3.5) | 186
MaxFrame                     | 21
Scenario 3: Query the ID of the product on which each user has spent the most in total. This scenario involves multiple groupby, join, drop_duplicates, and sort_values operations.
• Sample code
from odps import ODPS
from maxframe.session import new_session
import maxframe.dataframe as md
import os

o = ODPS(
    os.getenv('ALIBABA_CLOUD_ACCESS_KEY_ID'),
    os.getenv('ALIBABA_CLOUD_ACCESS_KEY_SECRET'),
    project='your-default-project',
    endpoint='your-end-point',
)
session = new_session(o)
# The session ID is a string that identifies the MaxFrame session. It is useful for debugging and tracking the task status.
print(session.session_id)
sales = md.read_odps_table("sales_maxframe_demo", index_col="index")
product = md.read_odps_table("product_maxframe_demo", index_col="product_id")
sales['total'] = sales['price'] * sales['quantity']
product_cost_df = sales.groupby(['product_id', 'user_id'], as_index=False).agg(user_product_total=('total','sum'))
product_cost_df = product_cost_df.merge(product, left_on="product_id", right_index=True, how='right')
user_cost_df = product_cost_df.groupby('user_id').agg(max_total=('user_product_total', 'max'))
merge_df = product_cost_df.merge(user_cost_df, left_on='user_id', right_index=True)
# result_df is not executed immediately; computation is triggered only when result_df.execute() is called.
# This way, all computations are completed inside the MaxCompute cluster, avoiding unnecessary intermediate data transfers and bottlenecks.
result_df = merge_df[merge_df['user_product_total'] == merge_df['max_total']][['user_id', 'product_id']].drop_duplicates().sort_values(['user_id'], ascending=[True])
print(result_df.execute().fetch())
# Destroy the Session.
session.destroy()
• The following result is returned:
user_id product_id
100 101 100
300 102 300
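The filter merge_df['user_product_total'] == merge_df['max_total'] keeps, for each user, only the rows whose per-product total equals that user's maximum spend; drop_duplicates then removes duplicate (user_id, product_id) pairs, and sort_values orders the result by user_id.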
• Performance comparison
In a data sample where the sales table has 50 million records (size: 1.96 GB) and the product table has 100,000 records (size: 3 MB), the computation yields the following time comparison results:
Environment                  | Time (unit: seconds)
Local Pandas (version 1.3.5) | 176
MaxFrame                     | 85
In summary, MaxFrame is compatible with pandas interfaces and performs distributed processing automatically, providing powerful data processing capabilities while greatly improving processing scale and computing efficiency.