This topic describes window functions that are supported by the DataFrame API.
grouped = iris.groupby('name')
grouped.mutate(grouped.sepallength.cumsum(), grouped.sort('sepallength').row_number()).head(10)
name sepallength_sum row_number
0 Iris-setosa 250.3 1
1 Iris-setosa 250.3 2
2 Iris-setosa 250.3 3
3 Iris-setosa 250.3 4
4 Iris-setosa 250.3 5
5 Iris-setosa 250.3 6
6 Iris-setosa 250.3 7
7 Iris-setosa 250.3 8
8 Iris-setosa 250.3 9
9 Iris-setosa 250.3 10
- Use window functions to select columns.
iris['name', 'sepallength', iris.groupby('name').sort('sepallength').sepallength.cumcount()].head(5) name sepallength sepallength_count 0 Iris-setosa 4.3 1 1 Iris-setosa 4.4 2 2 Iris-setosa 4.4 3 3 Iris-setosa 4.4 4 4 Iris-setosa 4.5 5
- Use window functions to aggregate data by scalar. The processing method is a combination
of grouping and aggregation.
from odps.df import Scalar iris.groupby(Scalar(1)).sort('sepallength').sepallength.cumcount()
The following table lists the window functions that are supported by the DataFrame
API.
Window function | Description |
---|---|
cumsum | Calculates the cumulative sum. |
cummean | Calculates the cumulative mean. |
cummedian | Calculates the cumulative median. |
cumstd | Calculates the cumulative standard deviation. |
cummax | Calculates the cumulative maximum. |
cummin | Calculates the cumulative minimum. |
cumcount | Calculates the cumulative sum. |
lag | Retrieves the value of a row at a given offset that precedes the current row. The system determines the number of the row to retrieve the value based on the following formula: Number of the current row - Offset value. |
lead | Retrieves the value of a row at a given offset that follows the current row. The system determines the number of the row to retrieve the value based on the following formula: Number of the current row + Offset value. |
rank | Calculates the rank of a row in an ordered group. |
dense_rank | Calculates the dense rank of a row in an ordered group. |
percent_rank | Calculates the relative rank of a row in an ordered group. |
row_number | Calculates the row number. Row numbers start from 1. |
qcut | Divides a group of data into N bins based on the data sequence and returns the number of the bin that contains the current data. If data is not evenly distributed in bins, more data is distributed to the first bin by default. |
nth_value | Retrieves the Nth value in the group. |
cume_dist | Calculates the proportion of rows whose values are no greater than the current value to all rows in a group. |
The following tables list parameters that are supported by different window functions.
- The
rank
,dense_rank
,percent_rank
, androw_number
window functions support the following parameters.Parameter Description sort The keyword that is used to sort rows. This parameter is an empty string by default. ascending Specifies whether to sort rows in ascending order. This parameter is set to True by default. - The
lag
andlead
window functions support the following parameters in addition to the parameters that are supported by therank
function.Parameter Description offset The number of rows that are between the current row and the row where you want to retrieve data. default The value to return if the row at the specified offset does not exist. - The
cumsum
,cummax
,cummin
,cummean
,cummedian
,cumcount
, andcumstd
window functions support the following parameters in addition to the parameters that are supported by therank
function.Parameter Description unique Specifies whether to enable data deduplication. This parameter is set to False by default. preceding The start of a window. following The end of a window.