This topic describes how to call the DataFrame API to perform column operations.

Column operations

from odps.df import DataFrame
iris = DataFrame(o.get_table('pyodps_iris'))
lens = DataFrame(o.get_table('pyodps_ml_100k_lens'))

If you add a constant or execute a sin function to a sequence object, you add the constant or execute the sin function to all the elements of the sequence object.

Null-related functions

The DataFrame API provides null-related built-in functions, including isnull, notnull, and fillna. You can use the isnull function to check whether the value of a field is null. You can use the notnull function to check whether the value of a field is not null. You can use the fillna function to replace null values with specified values.
>>> iris.sepallength.isnull().head(5)
   sepallength
0        False
1        False
2        False
3        False
4        False

Logic functions

The DataFrame API provides the logic functions ifelse and switch. You can use the ifelse function to check whether a BOOLEAN field is true. If the field is true, the parameter numbered as 0 is returned. If the field is false, the parameter numbered as 1 is returned.
>>> (iris.sepallength > 5).ifelse('gt5', 'lte5').rename('cmp5').head(5)
   cmp5
0   gt5
1  lte5
2  lte5
3  lte5
4  lte5
You can use the switch function to handle multiple conditions.
>>> iris.sepallength.switch(4.9, 'eq4.9', 5.0, 'eq5.0', default='noeq').rename('equalness').head(5)
   equalness
0       noeq
1      eq4.9
2       noeq
3       noeq
4      eq5.0
>>> from odps.df import switch
>>> switch(iris.sepallength == 4.9, 'eq4.9', iris.sepallength == 5.0, 'eq5.0', default='noeq').rename('equalness').head(5)
   equalness
0       noeq
1      eq4.9
2       noeq
3       noeq
4      eq5.0
In Python on MaxCompute (PyODPS) V0.7.8 and later, you can modify the values of a column in a dataset based on the specified conditions.
>>> iris[iris.sepallength > 5, 'cmp5'] = 'gt5'
>>> iris[iris.sepallength <= 5, 'cmp5'] = 'lte5'
>>> iris.head(5)
   cmp5
0   gt5
1  lte5
2  lte5
3  lte5
4  lte5

Mathematical operations

In numerical fields, a sequence object supports mathematical operations such as addition (+), subtraction (-), multiplication (*), and division (/). The log and sin functions are also supported.
>>> (iris.sepallength * 10).log().head(5)
   sepallength
0     3.931826
1     3.891820
2     3.850148
3     3.828641
4     3.912023
>>> fields = [iris.sepallength,
>>>           (iris.sepallength / 2).rename('sepallength divided by 2'),
>>>           (iris.sepallength ** 2).rename('sepallength squared')]
>>> iris[fields].head(5)
   sepallength  sepallength divided by 2  sepallength squared
0          5.1              2.55             26.01
1          4.9              2.45             24.01
2          4.7              2.35             22.09
3          4.6              2.30             21.16
4          5.0              2.50             25.00
The following table describes the supported arithmetic operations.
Arithmetic operation Description
abs Returns the absolute value of the given number.
sqrt Returns the square root of the given number.
sin N/A
sinh N/A
cos N/A
cosh N/A
tan N/A
tanh N/A
arccos N/A
arccosh N/A
arcsin N/A
arcsinh N/A
arctan N/A
arctanh N/A
exp Returns e raised to the power of the given number.
expm1 Returns e raised to the power of the given number, minus 1.
log Returns the logarithm of the given number by using a supplied base.
log2 N/A
log10 N/A
log1p log(1+x)
radians Converts the values in radians to degrees.
degrees Converts the values in degrees to radians.
ceil Returns the smallest integer that is no less than the given number.
floor Returns the largest integer that is no greater than the given number.
trunc Returns a number truncated to the specified decimal place.
You can compare a sequence object with another sequence or scalar objects.
>>> (iris.sepallength < 5).head(5)
   sepallength
0        False
1         True
2         True
3         True
4        False
The DataFrame API does not support sequential comparison, such as 3 <= iris.sepallength <= 5. However, you can use the between function to check whether the value of iris.sepallength is within a specified interval.
>>> (iris.sepallength.between(3, 5)).head(5)
   sepallength
0        False
1         True
2         True
3         True
4         True
By default, the between function specifies an interval that includes endpoints. To specify an open interval, set the inclusive parameter to False.
>>> (iris.sepallength.between(3, 5, inclusive=False)).head(5)
   sepallength
0        False
1         True
2         True
3         True
4        False

String-related operations

The DataFrame API provides a number of string-related operations for sequence or scalar objects.
>>> fields = [
>>>     iris.name.upper().rename('upper_name'),
>>>     iris.name.extract('Iris(.*)', group=1)
>>> ]
>>> iris[fields].head(5)
    upper_name     name
0  IRIS-SETOSA  -setosa
1  IRIS-SETOSA  -setosa
2  IRIS-SETOSA  -setosa
3  IRIS-SETOSA  -setosa
4  IRIS-SETOSA  -setosa
The following table describes the string-related operations.
Operation Description
capitalize N/A
contains Returns whether the given string contains a substring. The substring is a regular expression if the regex parameter is set to True. The regex parameter is set to True by default.
count Returns the number of occurrences of the specified string.
endswith Ends the given string with the specified string.
startswith Starts the given string with the specified string.
extract Extracts a regular expression. If the group is not specified, substrings that satisfy the pattern of a regular expression are returned. If the group is specified, the specified group is returned.
find Searches from left to right and returns the position of the first occurrence of the specified substring. If no matching substring exists, the value -1 is returned.
rfind Searches from right to left and returns the position of the first occurrence of the specified substring. If no matching substring exists, the value -1 is returned.
replace Replaces the substrings that satisfy the pattern of a regular expression with another substring. If you specify the n parameter, the substrings are replaced n times.
get Returns the string at the specified position.
len Returns the length of the given string.
ljust Pads the given string with the character specified by fillchar on the right until the string reaches the length specified by width. The default pad character is a space.
rjust Pads the given string with the character specified by fillchar on the left until the string reaches the length specified by width. The default pad character is a space.
lower Converts the given string to lowercase.
upper Converts the given string to uppercase.
lstrip Removes spaces on the left side of the given string, including blank lines.
rstrip Removes spaces on the right side of the given string, including blank lines.
strip Removes spaces on both sides of the given string, including blank lines.
split Splits the given string at the specified delimiter and returns a value of LIST<STRING> type.
pad Pads the given string with the character specified by fillchar at the specified position, which can be on the left side, right side, or both sides of the string. The default pad character is a space.
repeat Repeats n times.
slice Performs slice operations.
swapcase Converts all the uppercase characters to lowercase and all the lowercase characters to uppercase in the given string.
title Returns a titlecased version of the given string where words start with an uppercase character and the remaining characters are lowercase. This operation is the same as the str.title operation.
zfill Pads the given string with the character 0 on the left side of the string until the string reaches the length specified by width.
isalnum Returns True if all characters in the given string are alphanumeric. Otherwise, False is returned. This operation is the same as the str.isalnum operation.
isalpha Returns True if all characters in the given string are alphabetic. Otherwise, False is returned. This operation is the same as the str.isalpha operation.
isdigit Returns True if all the characters in the given string are digits. Otherwise, False is returned. This operation is the same as the str.isdigit operation.
isspace Returns True if all the characters in the given string are spaces. Otherwise, False is returned. This operation is the same as the str.isspace operation.
islower Returns True if all the cased characters in the given string are lowercase. Otherwise, False is returned. This operation is the same as the str.islower operation.
isupper Returns True if all the cased characters in the given string are uppercase. Otherwise, False is returned. This operation is the same as the str.isupper operation.
istitle Returns True if the given string is a titlecased string. Otherwise, False is returned. This operation is the same as the str.istitle operation.
isnumeric Returns True if all characters in the given string are numeric. Otherwise, False is returned. This operation is the same as the str.isnumeric operation.
isdecimal Returns True if all characters in the given string are decimal characters. Otherwise, False is returned. This operation is the same as the str.isdecimal operation.
todict Splits the given string at the specified delimiter into a dict and returns a value of DICT<STRING, STRING> type. The two input parameters are the project delimiter and key-value delimiter.
strptime

Converts the given string representing a time to the specified format. The time format is the same as the time format in the standard Python library. For more information about the time formats in Python, see Basic date and time types.

Time-related operations

You can call time-related built-in functions to manage sequence or scalar objects of DATETIME type.
>>> df = lens[[lens.unix_timestamp.astype('datetime').rename('dt')]]
>>> df[df.dt,
>>>    df.dt.year.rename('year'),
>>>    df.dt.month.rename('month'),
>>>    df.dt.day.rename('day'),
>>>    df.dt.hour.rename('hour')].head(5)
                    dt  year  month  day  hour
0  1998-04-08 11:02:00  1998      4    8    11
1  1998-04-08 10:57:55  1998      4    8    10
2  1998-04-08 10:45:26  1998      4    8    10
3  1998-04-08 10:25:52  1998      4    8    10
4  1998-04-08 10:44:19  1998      4    8    10
The following table describes the time-related attributes.
Time-related attribute Description
year N/A
month N/A
day N/A
hour N/A
minute N/A
second N/A
weekofyear Returns a number representing the week of the year where the provided date falls. Monday is taken as the first day of a week.
weekday Returns a number representing the day of the week where the provided date falls.
dayofweek Returns a number representing the day of the week where the provided date falls.
strftime

Converts the given string representing a time to the specified format. The time format is the same as the time in the standard Python library. For more information about the time formats in Python, see Basic date and time types.

PyODPS also supports the addition and subtraction of time. For example, you can retrieve the date three days before the current date. You can subtract one date column from another to obtain the difference in the number of milliseconds.

>>> df
                           a                          b
0 2016-12-06 16:43:12.460001 2016-12-06 17:43:12.460018
1 2016-12-06 16:43:12.460012 2016-12-06 17:43:12.460021
2 2016-12-06 16:43:12.460015 2016-12-06 17:43:12.460022
>>> from odps.df import day
>>> df.a - day(3)
                           a
0 2016-12-03 16:43:12.460001
1 2016-12-03 16:43:12.460012
2 2016-12-03 16:43:12.460015
>>> (df.b - df.a).dtype
int64
>>> (df.b - df.a).rename('a')
         a
0  3600000
1  3600000
2  3600000
The following table describes the supported DATETIME types.
Type Description
year N/A
month N/A
day N/A
hour N/A
minute N/A
second N/A
millisecond N/A

Collection-related operations

PyODPS supports collections of LIST and DICT types. You can use subscripts to retrieve an item from both types. You can also use the len method to retrieve the number of items in each collection.

In addition, collections of LIST and DICT types support the explode method. You can use the explode method to display the content of a collection. For collections of LIST type, the explode method returns one column by default. If you set the pos parameter to True, the explode method returns two columns. One of the columns indicates the serial number of each value in the array. The explode method is similar to the enumerate method in Python. For collections of DICT type, the explode method returns two columns. The two columns indicate the keys and values, respectively. You can pass in column names to the explode method as the names of the generated columns.

The following examples show how to use the explode method:

>>> df
   id         a                            b
0   1  [a1, b1]  {'a2': 0, 'b2': 1, 'c2': 2}
1   2      [c1]           {'d2': 3, 'e2': 4}
>>> df[df.id, df.a[0], df.b['b2']]
   id   a    b
0   1  a1    1
1   2  c1  NaN
>>> df[df.id, df.a.len(), df.b.len()]
   id  a  b
0   1  2  3
1   2  1  2
>>> df.a.explode()
    a
0  a1
1  b1
2  c1
>>> df.a.explode(pos=True)
   a_pos   a
0      0  a1
1      1  b1
2      0  c1
>>> # Specify column names.
>>> df.a.explode(['pos', 'value'], pos=True)
   pos value
0    0    a1
1    1    b1
2    0    c1
>>> df.b.explode()
  b_key  b_value
0    a2        0
1    b2        1
2    c2        2
3    d2        3
4    e2        4
>>> # Specify column names.
>>> df.b.explode(['key', 'value'])
  key  value
0  a2      0
1  b2      1
2  c2      2
3  d2      3
4  e2      4

You can use the explode method with Collection together. In this way, the columns generated by the explode method are combined with the original columns.

>>> df[df.id, df.a.explode()]
   id   a
0   1  a1
1   1  b1
2   2  c1
>>> df[df.id, df.a.explode(), df.b.explode()]
   id   a b_key  b_value
0   1  a1    a2        0
1   1  a1    b2        1
2   1  a1    c2        2
3   1  b1    a2        0
4   1  b1    b2        1
5   1  b1    c2        2
6   2  c1    d2        3
7   2  c1    e2        4
In addition to the len and explode methods, collections of LIST type support the following two methods.
Method Description
contains(v) Checks whether the given list contains a specified element.
sort Sorts the given list and returns a value of LIST type.
Collections of DICT type also support the following two methods.
Method Description
keys Retrieves DICT keys and returns a value of LIST type.
values Retrieves DICT values and returns a value of LIST type.

Other operations

You can use the isin operation to check whether the elements of a sequence object exist in a specified collection. You can use the notin operation to check whether the elements of a sequence object do not exist in a specified collection.
>>> iris.sepallength.isin([4.9, 5.1]).rename('sepallength').head(5)
   sepallength
0         True
1         True
2        False
3        False
4        False

You can use the cut operation to divide data in a sequence object into several segments.

>>> iris.sepallength.cut(range(6), labels=['0-1', '1-2', '2-3', '3-4', '4-5']).rename('sepallength_cut').head(5)
   sepallength_cut
0             None
1              4-5
2              4-5
3              4-5
4              4-5

You can use the include_under and include_over operations to specify the maximum and minimum values, respectively.

>>> labels = ['0-1', '1-2', '2-3', '3-4', '4-5', '5-']
>>> iris.sepallength.cut(range(6), labels=labels, include_over=True).rename('sepallength_cut').head(5)
   sepallength_cut
0               5-
1              4-5
2              4-5
3              4-5
4              4-5

Call built-in functions or UDFs in MaxCompute

To call built-in functions or user defined functions (UDFs) in MaxCompute to create columns, you can use the func function. The returned value of the func function is of STRING type by default. You can use the rtype parameter to specify the type of the returned value.
>>> from odps.df import func
>>>
>>> iris[iris.name, func.rand(rtype='float').rename('rand')][:4]
>>> iris[iris.name, func.rand(10, rtype='float').rename('rand')][:4]
>>> # Call UDFs defined in MaxCompute. You must specify the column name if the column name cannot be automatically determined.
>>> iris[iris.name, func.your_udf(iris.sepalwidth, iris.sepallength, rtype='float').rename('new_col')]
>>> # Call UDFs from other projects. You can specify the column name by using the name parameter.
>>> iris[iris.name, func.your_udf(iris.sepalwidth, iris.sepallength, rtype='float', project='udf_project', name='new_col')]
Note The Pandas backend does not allow you to execute expressions that contain the func function.