This topic describes the features of Mars, the differences between Mars and PyODPS, and the scenarios of using Mars and the PyODPS DataFrame API.
After MaxFrame of MaxCompute is officially released, it can be used to replace PyODPS DataFrame and Mars APIs. MaxFrame provides better operator compatibility and enhanced distributed capabilities. We recommend that new MaxCompute users develop Python jobs based on MaxFrame and new Python jobs be developed based on MaxFrame.
Scenarios
Use Mars and the PyODPS DataFrame API in the following scenarios:
Mars
You often call the to_pandas() method of the PyODPS DataFrame API to convert a PyODPS DataFrame into a pandas DataFrame.
You are familiar with the pandas API, but do not want to learn the PyODPS DataFrame API.
You want to use indexes.
You want to maintain the data order after you create a DataFrame.
The Mars DataFrame API provides the iloc method to retrieve rows and obtain data from specific rows. For example,
df.iloc[10]
is used to obtain data in the tenth row. The Mars DataFrame API also provides thedf.shift()
anddf.ffill()
methods, both of which can be used only in scenarios where the data order is maintained.You want to run NumPy or scikit-learn in a parallel and distributed manner, or run TensorFlow, PyTorch, and XGBoost in a distributed manner.
You want to process data whose volume is less than 1 TB.
PyODPS DataFrame
You want to use MaxCompute to schedule jobs. The PyODPS DataFrame API compiles operations on DataFrames to MaxCompute SQL statements. If you want to schedule jobs by using MaxCompute, we recommend that you use the PyODPS DataFrame API.
You want to schedule jobs in a more stable environment. The PyODPS DataFrame API compiles operations to MaxCompute SQL statements for execution. MaxCompute is stable, which means that PyODPS is stable. Mars is new and less stable. Therefore, we recommend that you use the PyODPS DataFrame API if you require high stability.
If you want to process data whose volume is larger than 1 TB, we recommend that you use the PyODPS DataFrame API.
Differences between Mars and PyODPS
API
Mars
The Mars DataFrame API is fully compatible with pandas. The Mars tensor API is compatible with NumPy. The Mars learn API is compatible with scikit-learn.
PyODPS
PyODPS provides only the DataFrame API, which is different from the pandas API.
Table of contents
Mars
The Mars DataFrame API supports operations based on indexes, including row indexes and column indexes. The following code shows an example of how to use the Mars DataFrame API:
In [1]: import mars.dataframe as md In [5]: import mars.tensor as mt In [7]: df = md.DataFrame(mt.random.rand(10, 3), index=md.date_range('2020-5-1', periods=10)) In [9]: df.loc['2020-5'].execute() Out[9]: 0 1 2 2020-05-01 0.061912 0.507101 0.372242 2020-05-02 0.833663 0.818519 0.943887 2020-05-03 0.579214 0.573056 0.319786 2020-05-04 0.476143 0.245831 0.434038 2020-05-05 0.444866 0.465851 0.445263 2020-05-06 0.654311 0.972639 0.443985 2020-05-07 0.276574 0.096421 0.264799 2020-05-08 0.106188 0.921479 0.202131 2020-05-09 0.281736 0.465473 0.003585 2020-05-10 0.400000 0.451150 0.956905
PyODPS
PyODPS does not support index-based operations.
Data order
Mars
After a Mars DataFrame is created, it maintains the data order. The Mars DataFrame API provides time series methods such as
shift
, and missing value handling methods such asffill
andbfill
.In [3]: df = md.DataFrame([[1, None], [None, 1]]) In [4]: df.execute() Out[4]: 0 1 0 1.0 NaN 1 NaN 1.0 In [5]: df.ffill().execute() # Fill the missing value with the value in the previous row. Out[5]: 0 1 0 1.0 NaN 1 1.0 1.0
PyODPS
PyODPS processes and stores data by using MaxCompute, which does not maintain the data order. Therefore, PyODPS does not ensure the data order or support time series methods.
Execution
Mars
Mars consists of a client and a distributed execution layer. You can call the
o.create_mars_cluster
method to create a Mars cluster in MaxCompute and submit computing jobs to the Mars cluster. This process significantly reduces the costs for scheduling. Mars outperforms PyODPS in processing smaller amounts of data.PyODPS
PyODPS is a client and does not contain any servers. When you use the PyODPS DataFrame API, the system compiles the operations to MaxCompute SQL statements for execution. Therefore, the operations supported by the PyODPS DataFrame API depend on MaxCompute SQL. Every time you call the
execute
method, a MaxCompute job is submitted for the cluster to schedule.
Usage notes
Mars is a unified distributed computing framework based on tensors. Mars can use parallel and distributed computing technologies to accelerate data processing for Python data science libraries such as NumPy, pandas and scikit-learn.
Mars provides the following common APIs:
The Mars tensor API mimics the NumPy API and can process large multidimensional arrays, which are also called tensors. The following code shows an example of how to use the Mars tensor API:
import mars.tensor as mt a = mt.random.rand(10000, 50) b = mt.random.rand(50, 5000) a.dot(b).execute()
Mars DataFrame
The Mars DataFrame API mimics the pandas API and can process and analyze a large amount of data. The following code shows an example of how to use the Mars DataFrame API:
import mars.dataframe as md ratings = md.read_csv('Downloads/ml-20m/ratings.csv') movies = md.read_csv('Downloads/ml-20m/movies.csv') movie_rating = ratings.groupby('movieId', as_index=False).agg({'rating': 'mean'}) result = movie_rating.merge(movies[['movieId', 'title']], on='movieId') result.sort_values(by='rating', ascending=False).execute()
Mars learn
The Mars learn API mimics the scikit-learn API. The Mars learn API can integrate TensorFlow, PyTorch, and XGBoost. The following code shows an example of how to use the Mars learn API:
import mars.dataframe as md from mars.learn.neighbors import NearestNeighbors df = md.read_csv('data.csv') nn = NearestNeighbors(n_neighbors=10) nn.fit(df) neighbors = nn.kneighbors(df).fetch()
References
Technical support
If you encounter any issues when you use Mars, click the link to join the DingTalk group for technical support.