Column operations

from odps.df import DataFrame
iris = DataFrame(o.get_table('pyodps_iris'))
lens = DataFrame(o.get_table('pyodps_ml_100k_lens'))

If you add a constant or execute a sin function to a sequence object, you add the constant or execute the sin function to all the elements of the sequence object.

Null-related functions

The DataFrame API provides null-related built-in functions, including isnull, notnull, and fillna. You can use the isnull function to check whether the value of a field is null. You can use the notnull function to check whether the value of a field is not null. You can use the fillna function to replace null values with specified values.

>>> iris.sepallength.isnull().head(5)
   sepallength
0        False
1        False
2        False
3        False
4        False

Logic functions

The DataFrame API provides the logic functions ifelse and switch. You can use the ifelse function to check whether a BOOLEAN field is true. If the field is true, the parameter numbered as 0 is returned. If the field is false, the parameter numbered as 1 is returned.

>>> (iris.sepallength > 5).ifelse('gt5', 'lte5').rename('cmp5').head(5)
   cmp5
0   gt5
1  lte5
2  lte5
3  lte5
4  lte5

You can use the switch function to handle multiple conditions.

>>> iris.sepallength.switch(4.9, 'eq4.9', 5.0, 'eq5.0', default='noeq').rename('equalness').head(5)
   equalness
0       noeq
1      eq4.9
2       noeq
3       noeq
4      eq5.0

>>> from odps.df import switch
>>> switch(iris.sepallength == 4.9, 'eq4.9', iris.sepallength == 5.0, 'eq5.0', default='noeq').rename('equalness').head(5)
   equalness
0       noeq
1      eq4.9
2       noeq
3       noeq
4      eq5.0

In Python on MaxCompute (PyODPS) V0.7.8 and later, you can modify the values of a column in a dataset based on the specified conditions.

>>> iris[iris.sepallength > 5, 'cmp5'] = 'gt5'
>>> iris[iris.sepallength <= 5, 'cmp5'] = 'lte5'
>>> iris.head(5)
   cmp5
0   gt5
1  lte5
2  lte5
3  lte5
4  lte5

Mathematical operations

In numerical fields, a sequence object supports mathematical operations such as addition (+), subtraction (-), multiplication (*), and division (/). The log and sin functions are also supported.

>>> (iris.sepallength * 10).log().head(5)
   sepallength
0     3.931826
1     3.891820
2     3.850148
3     3.828641
4     3.912023

>>> fields = [iris.sepallength,
>>>           (iris.sepallength / 2).rename('sepallength divided by 2'),
>>>           (iris.sepallength ** 2).rename('sepallength squared')]
>>> iris[fields].head(5)
   sepallength  sepallength divided by 2  sepallength squared
0          5.1              2.55             26.01
1          4.9              2.45             24.01
2          4.7              2.35             22.09
3          4.6              2.30             21.16
4          5.0              2.50             25.00

The following table describes the supported arithmetic operations.


Arithmetic operation	Description
abs	Returns the absolute value of the given number.
sqrt	Returns the square root of the given number.
sin	N/A
sinh	N/A
cos	N/A
cosh	N/A
tan	N/A
tanh	N/A
arccos	N/A
arccosh	N/A
arcsin	N/A
arcsinh	N/A
arctan	N/A
arctanh	N/A
exp	Returns e raised to the power of the given number.
expm1	Returns e raised to the power of the given number, minus 1.
log	Returns the logarithm of the given number by using a supplied base.
log2	N/A
log10	N/A
log1p	log(1+x)
radians	Converts the values in radians to degrees.
degrees	Converts the values in degrees to radians.
ceil	Returns the smallest integer that is no less than the given number.
floor	Returns the largest integer that is no greater than the given number.
trunc	Returns a number truncated to the specified decimal place.

You can compare a sequence object with another sequence or scalar objects.

>>> (iris.sepallength < 5).head(5)
   sepallength
0        False
1         True
2         True
3         True
4        False

The DataFrame API does not support sequential comparison, such as 3 <= iris.sepallength <= 5. However, you can use the between function to check whether the value of iris.sepallength is within a specified interval.

>>> (iris.sepallength.between(3, 5)).head(5)
   sepallength
0        False
1         True
2         True
3         True
4         True

By default, the between function specifies an interval that includes endpoints. To specify an open interval, set the inclusive parameter to False.

>>> (iris.sepallength.between(3, 5, inclusive=False)).head(5)
   sepallength
0        False
1         True
2         True
3         True
4        False

String-related operations

The DataFrame API provides a number of string-related operations for sequence or scalar objects.

>>> fields = [
>>>     iris.name.upper().rename('upper_name'),
>>>     iris.name.extract('Iris(.*)', group=1)
>>> ]
>>> iris[fields].head(5)
    upper_name     name
0  IRIS-SETOSA  -setosa
1  IRIS-SETOSA  -setosa
2  IRIS-SETOSA  -setosa
3  IRIS-SETOSA  -setosa
4  IRIS-SETOSA  -setosa

The following table describes the string-related operations.


Operation	Description
capitalize	N/A
contains	Returns whether the given string contains a substring. The substring is a regular expression if the `regex` parameter is set to True. The regex parameter is set to True by default.
count	Returns the number of occurrences of the specified string.
endswith	Ends the given string with the specified string.
startswith	Starts the given string with the specified string.
extract	Extracts a regular expression. If the group is not specified, substrings that satisfy the pattern of a regular expression are returned. If the group is specified, the specified group is returned.
find	Searches from left to right and returns the position of the first occurrence of the specified substring. If no matching substring exists, the value -1 is returned.
rfind	Searches from right to left and returns the position of the first occurrence of the specified substring. If no matching substring exists, the value -1 is returned.
replace	Replaces the substrings that satisfy the pattern of a regular expression with another substring. If you specify the n parameter, the substrings are replaced n times.
get	Returns the string at the specified position.
len	Returns the length of the given string.
ljust	Pads the given string with the character specified by `fillchar` on the right until the string reaches the length specified by `width`. The default pad character is a space.
rjust	Pads the given string with the character specified by `fillchar` on the left until the string reaches the length specified by `width`. The default pad character is a space.
lower	Converts the given string to lowercase.
upper	Converts the given string to uppercase.
lstrip	Removes spaces on the left side of the given string, including blank lines.
rstrip	Removes spaces on the right side of the given string, including blank lines.
strip	Removes spaces on both sides of the given string, including blank lines.
split	Splits the given string at the specified delimiter and returns a value of LIST<STRING> type.
pad	Pads the given string with the character specified by `fillchar` at the specified position, which can be on the left side, right side, or both sides of the string. The default pad character is a space.
repeat	Repeats n times.
slice	Performs slice operations.
swapcase	Converts all the uppercase characters to lowercase and all the lowercase characters to uppercase in the given string.
title	Returns a titlecased version of the given string where words start with an uppercase character and the remaining characters are lowercase. This operation is the same as the `str.title` operation.
zfill	Pads the given string with the character 0 on the left side of the string until the string reaches the length specified by `width`.
isalnum	Returns True if all characters in the given string are alphanumeric. Otherwise, False is returned. This operation is the same as the `str.isalnum` operation.
isalpha	Returns True if all characters in the given string are alphabetic. Otherwise, False is returned. This operation is the same as the `str.isalpha` operation.
isdigit	Returns True if all the characters in the given string are digits. Otherwise, False is returned. This operation is the same as the `str.isdigit` operation.
isspace	Returns True if all the characters in the given string are spaces. Otherwise, False is returned. This operation is the same as the `str.isspace` operation.
islower	Returns True if all the cased characters in the given string are lowercase. Otherwise, False is returned. This operation is the same as the `str.islower` operation.
isupper	Returns True if all the cased characters in the given string are uppercase. Otherwise, False is returned. This operation is the same as the `str.isupper` operation.
istitle	Returns True if the given string is a titlecased string. Otherwise, False is returned. This operation is the same as the `str.istitle` operation.
isnumeric	Returns True if all characters in the given string are numeric. Otherwise, False is returned. This operation is the same as the `str.isnumeric` operation.
isdecimal	Returns True if all characters in the given string are decimal characters. Otherwise, False is returned. This operation is the same as the `str.isdecimal` operation.
todict	Splits the given string at the specified delimiter into a dict and returns a value of DICT<STRING, STRING> type. The two input parameters are the project delimiter and key-value delimiter.
strptime	Converts the given string representing a time to the specified format. The time format is the same as the time format in the standard Python library. For more information about the time formats in Python, see Basic date and time types.

Time-related operations

You can call time-related built-in functions to manage sequence or scalar objects of DATETIME type.

>>> df = lens[[lens.unix_timestamp.astype('datetime').rename('dt')]]
>>> df[df.dt,
>>>    df.dt.year.rename('year'),
>>>    df.dt.month.rename('month'),
>>>    df.dt.day.rename('day'),
>>>    df.dt.hour.rename('hour')].head(5)
                    dt  year  month  day  hour
0  1998-04-08 11:02:00  1998      4    8    11
1  1998-04-08 10:57:55  1998      4    8    10
2  1998-04-08 10:45:26  1998      4    8    10
3  1998-04-08 10:25:52  1998      4    8    10
4  1998-04-08 10:44:19  1998      4    8    10

The following table describes the time-related attributes.


Time-related attribute	Description
year	N/A
month	N/A
day	N/A
hour	N/A
minute	N/A
second	N/A
weekofyear	Returns a number representing the week of the year where the provided date falls. Monday is taken as the first day of a week.
weekday	Returns a number representing the day of the week where the provided date falls.
dayofweek	Returns a number representing the day of the week where the provided date falls.
strftime	Converts the given string representing a time to the specified format. The time format is the same as the time in the standard Python library. For more information about the time formats in Python, see Basic date and time types.

PyODPS also supports the addition and subtraction of time. For example, you can retrieve the date three days before the current date. You can subtract one date column from another to obtain the difference in the number of milliseconds.

>>> df
                           a                          b
0 2016-12-06 16:43:12.460001 2016-12-06 17:43:12.460018
1 2016-12-06 16:43:12.460012 2016-12-06 17:43:12.460021
2 2016-12-06 16:43:12.460015 2016-12-06 17:43:12.460022
>>> from odps.df import day
>>> df.a - day(3)
                           a
0 2016-12-03 16:43:12.460001
1 2016-12-03 16:43:12.460012
2 2016-12-03 16:43:12.460015
>>> (df.b - df.a).dtype
int64
>>> (df.b - df.a).rename('a')
         a
0  3600000
1  3600000
2  3600000

The following table describes the supported DATETIME types.


Type	Description
year	N/A
month	N/A
day	N/A
hour	N/A
minute	N/A
second	N/A
millisecond	N/A

Collection-related operations

PyODPS supports collections of LIST and DICT types. You can use subscripts to retrieve an item from both types. You can also use the len method to retrieve the number of items in each collection.

In addition, collections of LIST and DICT types support the explode method. You can use the explode method to display the content of a collection. For collections of LIST type, the explode method returns one column by default. If you set the pos parameter to True, the explode method returns two columns. One of the columns indicates the serial number of each value in the array. The explode method is similar to the enumerate method in Python. For collections of DICT type, the explode method returns two columns. The two columns indicate the keys and values, respectively. You can pass in column names to the explode method as the names of the generated columns.

The following examples show how to use the explode method:

>>> df
   id         a                            b
0   1  [a1, b1]  {'a2': 0, 'b2': 1, 'c2': 2}
1   2      [c1]           {'d2': 3, 'e2': 4}
>>> df[df.id, df.a[0], df.b['b2']]
   id   a    b
0   1  a1    1
1   2  c1  NaN
>>> df[df.id, df.a.len(), df.b.len()]
   id  a  b
0   1  2  3
1   2  1  2
>>> df.a.explode()
    a
0  a1
1  b1
2  c1
>>> df.a.explode(pos=True)
   a_pos   a
0      0  a1
1      1  b1
2      0  c1
>>> # Specify column names.
>>> df.a.explode(['pos', 'value'], pos=True)
   pos value
0    0    a1
1    1    b1
2    0    c1
>>> df.b.explode()
  b_key  b_value
0    a2        0
1    b2        1
2    c2        2
3    d2        3
4    e2        4
>>> # Specify column names.
>>> df.b.explode(['key', 'value'])
  key  value
0  a2      0
1  b2      1
2  c2      2
3  d2      3
4  e2      4

You can use the explode method with Collection together. In this way, the columns generated by the explode method are combined with the original columns.

>>> df[df.id, df.a.explode()]
   id   a
0   1  a1
1   1  b1
2   2  c1
>>> df[df.id, df.a.explode(), df.b.explode()]
   id   a b_key  b_value
0   1  a1    a2        0
1   1  a1    b2        1
2   1  a1    c2        2
3   1  b1    a2        0
4   1  b1    b2        1
5   1  b1    c2        2
6   2  c1    d2        3
7   2  c1    e2        4

In addition to the len and explode methods, collections of LIST type support the following two methods.


Method	Description
contains(v)	Checks whether the given list contains a specified element.
sort	Sorts the given list and returns a value of LIST type.

Collections of DICT type also support the following two methods.


Method	Description
keys	Retrieves DICT keys and returns a value of LIST type.
values	Retrieves DICT values and returns a value of LIST type.

Other operations

You can use the isin operation to check whether the elements of a sequence object exist in a specified collection. You can use the notin operation to check whether the elements of a sequence object do not exist in a specified collection.

>>> iris.sepallength.isin([4.9, 5.1]).rename('sepallength').head(5)
   sepallength
0         True
1         True
2        False
3        False
4        False

You can use the cut operation to divide data in a sequence object into several segments.

>>> iris.sepallength.cut(range(6), labels=['0-1', '1-2', '2-3', '3-4', '4-5']).rename('sepallength_cut').head(5)
   sepallength_cut
0             None
1              4-5
2              4-5
3              4-5
4              4-5

You can use the include_under and include_over operations to specify the maximum and minimum values, respectively.

>>> labels = ['0-1', '1-2', '2-3', '3-4', '4-5', '5-']
>>> iris.sepallength.cut(range(6), labels=labels, include_over=True).rename('sepallength_cut').head(5)
   sepallength_cut
0               5-
1              4-5
2              4-5
3              4-5
4              4-5

Call built-in functions or UDFs in MaxCompute

To call built-in functions or user defined functions (UDFs) in MaxCompute to create columns, you can use the func function. The returned value of the func function is of STRING type by default. You can use the rtype parameter to specify the type of the returned value.

>>> from odps.df import func
>>>
>>> iris[iris.name, func.rand(rtype='float').rename('rand')][:4]
>>> iris[iris.name, func.rand(10, rtype='float').rename('rand')][:4]
>>> # Call UDFs defined in MaxCompute. You must specify the column name if the column name cannot be automatically determined.
>>> iris[iris.name, func.your_udf(iris.sepalwidth, iris.sepallength, rtype='float').rename('new_col')]
>>> # Call UDFs from other projects. You can specify the column name by using the name parameter.
>>> iris[iris.name, func.your_udf(iris.sepalwidth, iris.sepallength, rtype='float', project='udf_project', name='new_col')]

Note The Pandas backend does not allow you to execute expressions that contain the func function.