This topic describes how to call the DataFrame API to perform column operations.
Column operations
from odps.df import DataFrame
iris = DataFrame(o.get_table('pyodps_iris'))
lens = DataFrame(o.get_table('pyodps_ml_100k_lens'))
If you add a constant or execute a sin function to a sequence object, you add the constant or execute the sin function to all the elements of the sequence object.
Null-related functions
isnull
function to check whether the value of a field is null. You can use the notnull
function to check whether the value of a field is not null. You can use the fillna
function to replace null values with specified values. >>> iris.sepallength.isnull().head(5)
sepallength
0 False
1 False
2 False
3 False
4 False
Logic functions
ifelse
function to check whether a BOOLEAN field is true. If the field is true, the parameter
numbered as 0 is returned. If the field is false, the parameter numbered as 1 is returned.
>>> (iris.sepallength > 5).ifelse('gt5', 'lte5').rename('cmp5').head(5)
cmp5
0 gt5
1 lte5
2 lte5
3 lte5
4 lte5
switch
function to handle multiple conditions. >>> iris.sepallength.switch(4.9, 'eq4.9', 5.0, 'eq5.0', default='noeq').rename('equalness').head(5)
equalness
0 noeq
1 eq4.9
2 noeq
3 noeq
4 eq5.0
>>> from odps.df import switch
>>> switch(iris.sepallength == 4.9, 'eq4.9', iris.sepallength == 5.0, 'eq5.0', default='noeq').rename('equalness').head(5)
equalness
0 noeq
1 eq4.9
2 noeq
3 noeq
4 eq5.0
>>> iris[iris.sepallength > 5, 'cmp5'] = 'gt5'
>>> iris[iris.sepallength <= 5, 'cmp5'] = 'lte5'
>>> iris.head(5)
cmp5
0 gt5
1 lte5
2 lte5
3 lte5
4 lte5
Mathematical operations
>>> (iris.sepallength * 10).log().head(5)
sepallength
0 3.931826
1 3.891820
2 3.850148
3 3.828641
4 3.912023
>>> fields = [iris.sepallength,
>>> (iris.sepallength / 2).rename('sepallength divided by 2'),
>>> (iris.sepallength ** 2).rename('sepallength squared')]
>>> iris[fields].head(5)
sepallength sepallength divided by 2 sepallength squared
0 5.1 2.55 26.01
1 4.9 2.45 24.01
2 4.7 2.35 22.09
3 4.6 2.30 21.16
4 5.0 2.50 25.00
Arithmetic operation | Description |
---|---|
abs | Returns the absolute value of the given number. |
sqrt | Returns the square root of the given number. |
sin | N/A |
sinh | N/A |
cos | N/A |
cosh | N/A |
tan | N/A |
tanh | N/A |
arccos | N/A |
arccosh | N/A |
arcsin | N/A |
arcsinh | N/A |
arctan | N/A |
arctanh | N/A |
exp | Returns e raised to the power of the given number. |
expm1 | Returns e raised to the power of the given number, minus 1. |
log | Returns the logarithm of the given number by using a supplied base. |
log2 | N/A |
log10 | N/A |
log1p | log(1+x) |
radians | Converts the values in radians to degrees. |
degrees | Converts the values in degrees to radians. |
ceil | Returns the smallest integer that is no less than the given number. |
floor | Returns the largest integer that is no greater than the given number. |
trunc | Returns a number truncated to the specified decimal place. |
>>> (iris.sepallength < 5).head(5)
sepallength
0 False
1 True
2 True
3 True
4 False
3 <= iris.sepallength <= 5
. However, you can use the between
function to check whether the value of iris.sepallength
is within a specified interval. >>> (iris.sepallength.between(3, 5)).head(5)
sepallength
0 False
1 True
2 True
3 True
4 True
between
function specifies an interval that includes endpoints. To specify an open interval,
set the inclusive
parameter to False. >>> (iris.sepallength.between(3, 5, inclusive=False)).head(5)
sepallength
0 False
1 True
2 True
3 True
4 False
String-related operations
>>> fields = [
>>> iris.name.upper().rename('upper_name'),
>>> iris.name.extract('Iris(.*)', group=1)
>>> ]
>>> iris[fields].head(5)
upper_name name
0 IRIS-SETOSA -setosa
1 IRIS-SETOSA -setosa
2 IRIS-SETOSA -setosa
3 IRIS-SETOSA -setosa
4 IRIS-SETOSA -setosa
Operation | Description |
---|---|
capitalize | N/A |
contains | Returns whether the given string contains a substring. The substring is a regular
expression if the regex parameter is set to True. The regex parameter is set to True by default.
|
count | Returns the number of occurrences of the specified string. |
endswith | Ends the given string with the specified string. |
startswith | Starts the given string with the specified string. |
extract | Extracts a regular expression. If the group is not specified, substrings that satisfy the pattern of a regular expression are returned. If the group is specified, the specified group is returned. |
find | Searches from left to right and returns the position of the first occurrence of the specified substring. If no matching substring exists, the value -1 is returned. |
rfind | Searches from right to left and returns the position of the first occurrence of the specified substring. If no matching substring exists, the value -1 is returned. |
replace | Replaces the substrings that satisfy the pattern of a regular expression with another substring. If you specify the n parameter, the substrings are replaced n times. |
get | Returns the string at the specified position. |
len | Returns the length of the given string. |
ljust | Pads the given string with the character specified by fillchar on the right until the string reaches the length specified by width . The default pad character is a space.
|
rjust | Pads the given string with the character specified by fillchar on the left until the string reaches the length specified by width . The default pad character is a space.
|
lower | Converts the given string to lowercase. |
upper | Converts the given string to uppercase. |
lstrip | Removes spaces on the left side of the given string, including blank lines. |
rstrip | Removes spaces on the right side of the given string, including blank lines. |
strip | Removes spaces on both sides of the given string, including blank lines. |
split | Splits the given string at the specified delimiter and returns a value of LIST<STRING> type. |
pad | Pads the given string with the character specified by fillchar at the specified position, which can be on the left side, right side, or both sides
of the string. The default pad character is a space.
|
repeat | Repeats n times. |
slice | Performs slice operations. |
swapcase | Converts all the uppercase characters to lowercase and all the lowercase characters to uppercase in the given string. |
title | Returns a titlecased version of the given string where words start with an uppercase
character and the remaining characters are lowercase. This operation is the same as
the str.title operation.
|
zfill | Pads the given string with the character 0 on the left side of the string until the
string reaches the length specified by width .
|
isalnum | Returns True if all characters in the given string are alphanumeric. Otherwise, False
is returned. This operation is the same as the str.isalnum operation.
|
isalpha | Returns True if all characters in the given string are alphabetic. Otherwise, False
is returned. This operation is the same as the str.isalpha operation.
|
isdigit | Returns True if all the characters in the given string are digits. Otherwise, False
is returned. This operation is the same as the str.isdigit operation.
|
isspace | Returns True if all the characters in the given string are spaces. Otherwise, False
is returned. This operation is the same as the str.isspace operation.
|
islower | Returns True if all the cased characters in the given string are lowercase. Otherwise,
False is returned. This operation is the same as the str.islower operation.
|
isupper | Returns True if all the cased characters in the given string are uppercase. Otherwise,
False is returned. This operation is the same as the str.isupper operation.
|
istitle | Returns True if the given string is a titlecased string. Otherwise, False is returned.
This operation is the same as the str.istitle operation.
|
isnumeric | Returns True if all characters in the given string are numeric. Otherwise, False is
returned. This operation is the same as the str.isnumeric operation.
|
isdecimal | Returns True if all characters in the given string are decimal characters. Otherwise,
False is returned. This operation is the same as the str.isdecimal operation.
|
todict | Splits the given string at the specified delimiter into a dict and returns a value of DICT<STRING, STRING> type. The two input parameters are the project delimiter and key-value delimiter. |
strptime |
Converts the given string representing a time to the specified format. The time format is the same as the time format in the standard Python library. For more information about the time formats in Python, see Basic date and time types. |
Time-related operations
>>> df = lens[[lens.unix_timestamp.astype('datetime').rename('dt')]]
>>> df[df.dt,
>>> df.dt.year.rename('year'),
>>> df.dt.month.rename('month'),
>>> df.dt.day.rename('day'),
>>> df.dt.hour.rename('hour')].head(5)
dt year month day hour
0 1998-04-08 11:02:00 1998 4 8 11
1 1998-04-08 10:57:55 1998 4 8 10
2 1998-04-08 10:45:26 1998 4 8 10
3 1998-04-08 10:25:52 1998 4 8 10
4 1998-04-08 10:44:19 1998 4 8 10
Time-related attribute | Description |
---|---|
year | N/A |
month | N/A |
day | N/A |
hour | N/A |
minute | N/A |
second | N/A |
weekofyear | Returns a number representing the week of the year where the provided date falls. Monday is taken as the first day of a week. |
weekday | Returns a number representing the day of the week where the provided date falls. |
dayofweek | Returns a number representing the day of the week where the provided date falls. |
strftime |
Converts the given string representing a time to the specified format. The time format is the same as the time in the standard Python library. For more information about the time formats in Python, see Basic date and time types. |
PyODPS also supports the addition and subtraction of time. For example, you can retrieve the date three days before the current date. You can subtract one date column from another to obtain the difference in the number of milliseconds.
>>> df
a b
0 2016-12-06 16:43:12.460001 2016-12-06 17:43:12.460018
1 2016-12-06 16:43:12.460012 2016-12-06 17:43:12.460021
2 2016-12-06 16:43:12.460015 2016-12-06 17:43:12.460022
>>> from odps.df import day
>>> df.a - day(3)
a
0 2016-12-03 16:43:12.460001
1 2016-12-03 16:43:12.460012
2 2016-12-03 16:43:12.460015
>>> (df.b - df.a).dtype
int64
>>> (df.b - df.a).rename('a')
a
0 3600000
1 3600000
2 3600000
Type | Description |
---|---|
year | N/A |
month | N/A |
day | N/A |
hour | N/A |
minute | N/A |
second | N/A |
millisecond | N/A |
Collection-related operations
PyODPS supports collections of LIST and DICT types. You can use subscripts to retrieve
an item from both types. You can also use the len
method to retrieve the number of items in each collection.
In addition, collections of LIST and DICT types support the explode
method. You can use the explode method to display the content of a collection. For
collections of LIST type, the explode
method returns one column by default. If you set the pos
parameter to True, the explode method returns two columns. One of the columns indicates
the serial number of each value in the array. The explode method is similar to the
enumerate
method in Python. For collections of DICT type, the explode
method returns two columns. The two columns indicate the keys and values, respectively.
You can pass in column names to the explode
method as the names of the generated columns.
The following examples show how to use the explode method:
>>> df
id a b
0 1 [a1, b1] {'a2': 0, 'b2': 1, 'c2': 2}
1 2 [c1] {'d2': 3, 'e2': 4}
>>> df[df.id, df.a[0], df.b['b2']]
id a b
0 1 a1 1
1 2 c1 NaN
>>> df[df.id, df.a.len(), df.b.len()]
id a b
0 1 2 3
1 2 1 2
>>> df.a.explode()
a
0 a1
1 b1
2 c1
>>> df.a.explode(pos=True)
a_pos a
0 0 a1
1 1 b1
2 0 c1
>>> # Specify column names.
>>> df.a.explode(['pos', 'value'], pos=True)
pos value
0 0 a1
1 1 b1
2 0 c1
>>> df.b.explode()
b_key b_value
0 a2 0
1 b2 1
2 c2 2
3 d2 3
4 e2 4
>>> # Specify column names.
>>> df.b.explode(['key', 'value'])
key value
0 a2 0
1 b2 1
2 c2 2
3 d2 3
4 e2 4
You can use the explode
method with Collection together. In this way, the columns generated by the explode
method are combined with the original columns.
>>> df[df.id, df.a.explode()]
id a
0 1 a1
1 1 b1
2 2 c1
>>> df[df.id, df.a.explode(), df.b.explode()]
id a b_key b_value
0 1 a1 a2 0
1 1 a1 b2 1
2 1 a1 c2 2
3 1 b1 a2 0
4 1 b1 b2 1
5 1 b1 c2 2
6 2 c1 d2 3
7 2 c1 e2 4
len
and explode
methods, collections of LIST type support the following two methods.
Method | Description |
---|---|
contains(v) | Checks whether the given list contains a specified element. |
sort | Sorts the given list and returns a value of LIST type. |
Method | Description |
---|---|
keys | Retrieves DICT keys and returns a value of LIST type. |
values | Retrieves DICT values and returns a value of LIST type. |
Other operations
You can use theisin
operation to check whether the elements of a sequence object exist in a specified
collection. You can use the notin
operation to check whether the elements of a sequence object do not exist in a specified
collection. >>> iris.sepallength.isin([4.9, 5.1]).rename('sepallength').head(5)
sepallength
0 True
1 True
2 False
3 False
4 False
You can use the cut
operation to divide data in a sequence object into several segments.
>>> iris.sepallength.cut(range(6), labels=['0-1', '1-2', '2-3', '3-4', '4-5']).rename('sepallength_cut').head(5)
sepallength_cut
0 None
1 4-5
2 4-5
3 4-5
4 4-5
You can use the include_under
and include_over
operations to specify the maximum and minimum values, respectively.
>>> labels = ['0-1', '1-2', '2-3', '3-4', '4-5', '5-']
>>> iris.sepallength.cut(range(6), labels=labels, include_over=True).rename('sepallength_cut').head(5)
sepallength_cut
0 5-
1 4-5
2 4-5
3 4-5
4 4-5
Call built-in functions or UDFs in MaxCompute
func
function. The returned value of the func function is of STRING type by default. You
can use the rtype
parameter to specify the type of the returned value. >>> from odps.df import func
>>>
>>> iris[iris.name, func.rand(rtype='float').rename('rand')][:4]
>>> iris[iris.name, func.rand(10, rtype='float').rename('rand')][:4]
>>> # Call UDFs defined in MaxCompute. You must specify the column name if the column name cannot be automatically determined.
>>> iris[iris.name, func.your_udf(iris.sepalwidth, iris.sepallength, rtype='float').rename('new_col')]
>>> # Call UDFs from other projects. You can specify the column name by using the name parameter.
>>> iris[iris.name, func.your_udf(iris.sepalwidth, iris.sepallength, rtype='float', project='udf_project', name='new_col')]
func
function.