Introduction
MaxFrame is a distributed computing framework for Alibaba Cloud MaxCompute. It provides a Python programming interface to address two key problems in traditional Python data processing: performance bottlenecks and inefficient data movement. With MaxFrame, you can perform distributed processing and analysis of petabyte-scale big data directly on MaxCompute for tasks such as visual data exploration, scientific computing, and machine learning and AI development. This meets the growing demand for efficient big data processing and AI development in the Python ecosystem.
Scenarios
Interactive data exploration
MaxFrame lets you explore, manipulate, and visualize massive datasets interactively without being constrained by client memory. The experience is similar to working in a local Jupyter Notebook.
Large-scale data pre-processing (ETL)
For tasks such as data cleaning, format conversion, and feature engineering on terabyte-scale raw data, you can use expressive and maintainable Python code instead of complex SQL and user-defined function (UDF) logic. The code still runs with the performance of distributed execution.
AI and machine learning
In a model development workflow, MaxFrame unifies the experience of data processing and model training. You can use MaxFrame to prepare training data efficiently, and use the image feature to make libraries such as scikit-learn and XGBoost available, so that you can build end-to-end AI workflows.
Scope
Supported regions
China (Hangzhou), China (Shanghai), China (Beijing), China (Shenzhen), China (Hong Kong), Japan (Tokyo), Singapore, Indonesia (Jakarta), Germany (Frankfurt), US (Silicon Valley), and US (Virginia).
Supported environments
Local Python development environments.
MaxCompute Notebook.
DataWorks Notebook.
DataWorks Data Development PyODPS 3 task nodes.
Billing
MaxFrame is billed based on the compute resource usage of jobs. It supports the subscription billing method.
Subscription: Jobs consume the quota of your purchased resource groups without incurring additional charges.
For more information, see Analyze MaxCompute bill and usage details.
Core advantages
Compared with other Python development tools, MaxFrame aligns better with familiar development habits, offers more efficient data processing, provides more elastic computing resources, and delivers a more convenient development experience.
Pandas-compatible API: MaxFrame provides an API that is highly compatible with Pandas. This allows for the smooth migration of existing code to the MaxCompute platform and significantly reduces learning and migration costs.
Server-side distributed execution: MaxFrame jobs run directly within the MaxCompute cluster. Data does not need to be pulled to a local client. This eliminates performance bottlenecks caused by insufficient client memory and enables efficient processing of petabyte-scale data.
Elastic computing resources: MaxFrame uses the MaxCompute serverless architecture to allocate compute resources on demand. You can process data of any scale without cluster management.
Simplified development environment: MaxFrame provides built-in Python 3.7 and Python 3.11 environments with pre-installed common libraries such as Pandas and XGBoost. You can manage third-party dependencies with simple annotations. This greatly simplifies environment configuration and dependency management and is more convenient than manually packaging and uploading UDF dependencies.
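As a minimal sketch of the Pandas-compatible API and server-side execution described above, the following assumes that the `maxframe` and `pyodps` packages are installed and that the source table has a string column named `a`. The helper names, the table names, and the `MAXCOMPUTE_PROJECT` and `MAXCOMPUTE_ENDPOINT` variable names are hypothetical and chosen only for illustration.

```python
from typing import Dict, Mapping


def odps_config_from_env(env: Mapping[str, str]) -> Dict[str, str]:
    """Collect connection settings from environment-style variables.

    MAXCOMPUTE_PROJECT and MAXCOMPUTE_ENDPOINT are names made up for this
    sketch; use whatever configuration source your project prefers.
    """
    return {
        "access_id": env["ALIBABA_CLOUD_ACCESS_KEY_ID"],
        "secret_access_key": env["ALIBABA_CLOUD_ACCESS_KEY_SECRET"],
        "project": env["MAXCOMPUTE_PROJECT"],
        "endpoint": env["MAXCOMPUTE_ENDPOINT"],
    }


def add_prefix_on_maxcompute(
    config: Dict[str, str], source_table: str, target_table: str
) -> None:
    """Prefix column `a` of source_table and write the result to target_table."""
    # Imported lazily so this module can be loaded without the SDKs installed.
    from odps import ODPS
    from maxframe import new_session
    import maxframe.dataframe as md

    # Create a MaxFrame session bound to the MaxCompute project.
    session = new_session(ODPS(**config))
    try:
        # Build the computation with Pandas-style syntax. Nothing is pulled
        # to the client here; the DataFrame is a handle to remote data.
        df = md.read_odps_table(source_table)
        df["a"] = "prefix_" + df["a"]  # assumes a string column named `a`

        # execute() submits the job, which runs inside MaxCompute.
        md.to_odps_table(df, target_table).execute()
    finally:
        session.destroy()
```

With credentials set in the environment, a call such as `add_prefix_on_maxcompute(odps_config_from_env(os.environ), "source_table", "target_table")` would run the transformation on the server side. Only the job description leaves the client, not the table data.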
The following table compares MaxFrame with other Python development tools:
Comparison item | MaxFrame | PyODPS | Mars | SQL + UDFs |
Development interface | Compatible with Pandas. | The syntax and interface differ significantly from Pandas DataFrame. | | Requires the use of two interfaces: SQL and Python. |
Data processing | At runtime, data does not need to be pulled to a local client for processing. This reduces unnecessary local data transfers and improves job execution efficiency. | In PyODPS, the | Distributed execution is supported for only some operators. A cluster must be created during initialization, which can be slow and unstable. | Supports distributed jobs based on MaxCompute SQL capabilities. |
Computing resources | Not limited by local resource size, so it avoids the performance bottleneck of single-machine Python. | Limited by local resource size. | Limited by resource size. You must specify the worker, CPU, and memory sizes. | Enables elastic computing for SQL jobs based on the MaxCompute serverless architecture. |
Development experience | Provides an out-of-the-box interactive development environment with offline scheduling capabilities. It includes built-in common libraries and supports dependency management through annotations, eliminating the need for manual packaging. | Provides an out-of-the-box interactive development environment and offline scheduling capabilities. | You must prepare the corresponding runtime environment and start a Mars cluster. | Dependency packages for Python UDFs must be manually packaged and uploaded. |
How it works
MaxFrame hides the complexity of distributed processing. The automated workflow is as follows:
Code submission: You write and run Python code in a client, such as a Notebook. The MaxFrame SDK captures the code and submits it to MaxCompute.
Parsing and optimization: After the MaxCompute execution engine receives the job, it parses the syntax, performs logical optimization, and transforms the job into a physical plan that can be run in parallel.
Distributed execution: The optimized plan is distributed to multiple compute nodes in the MaxCompute cluster, which read the data directly and compute in parallel.
Result return: After the computation is complete, the results are aggregated and returned to your client.
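The four steps above can be illustrated with a toy model that uses only the Python standard library. This is not MaxFrame's engine or API: the `LazyColumn` class is hypothetical, a thread pool stands in for the compute nodes, and "optimization" is reduced to fusing the recorded operations into a single pass over the data.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class LazyColumn:
    """Toy deferred dataset: records operations instead of running them."""

    data: List[int]
    ops: List[Callable[[int], int]] = field(default_factory=list)

    def map(self, fn: Callable[[int], int]) -> "LazyColumn":
        # Step 1 (code submission): only record the operation; nothing runs yet.
        return LazyColumn(self.data, self.ops + [fn])

    def sum(self, workers: int = 4, chunk_size: int = 3) -> int:
        # Step 2 (parsing and optimization): fuse all recorded operations
        # into one function so each element is visited a single time.
        def fused(x: int) -> int:
            for op in self.ops:
                x = op(x)
            return x

        # Split the input into chunks that can be processed independently.
        chunks = [
            self.data[i:i + chunk_size]
            for i in range(0, len(self.data), chunk_size)
        ]

        # Step 3 (distributed execution): each worker handles one chunk.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            partials = list(pool.map(lambda c: sum(fused(x) for x in c), chunks))

        # Step 4 (result return): aggregate the partial results for the client.
        return sum(partials)


# Building the pipeline is cheap; work happens only when sum() is called.
col = LazyColumn(list(range(10))).map(lambda x: x * 2).map(lambda x: x + 1)
total = col.sum()
print(total)
```

The point of the sketch is the separation between describing a computation and running it. Because the client only builds a description, the engine is free to reorder, fuse, and parallelize the work before any data is touched.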