Query objects - Object Storage Service - Alibaba Cloud Documentation Center

This topic describes how to use SelectObject in OSS SDK for Python to query CSV and JSON objects.

Usage notes

In this topic, the public endpoint of the China (Hangzhou) region is used. If you want to access OSS from other Alibaba Cloud services in the same region as OSS, use an internal endpoint. For more information about OSS regions and endpoints, see Regions, endpoints and open ports.
In this topic, an OSSClient instance is created by using an OSS endpoint. If you want to create an OSSClient instance by using custom domain names or Security Token Service (STS), see Initialization.
To query objects, you must have the oss:GetObject permission. For more information, see Attach a custom policy to a RAM user.
Only objects in the CSV and JSON formats can be queried by using SelectObject.

Examples

The following code provides an example on how to query CSV and JSON objects:

import oss2
from oss2.credentials import EnvironmentVariableCredentialsProvider

def select_call_back(consumed_bytes, total_bytes =  None):
        print('Consumed Bytes:' + str(consumed_bytes) + '\n')

# Obtain access credentials from environment variables. Before you run the sample code, make sure that the OSS_ACCESS_KEY_ID and OSS_ACCESS_KEY_SECRET environment variables are configured. 
auth = oss2.ProviderAuthV4(EnvironmentVariableCredentialsProvider())

# Specify the endpoint of the region in which the bucket is located. For example, if the bucket is located in the China (Hangzhou) region, set the endpoint to https://oss-cn-hangzhou.aliyuncs.com. 
endpoint = "https://oss-cn-hangzhou.aliyuncs.com"

# Specify the ID of the region that maps to the endpoint. Example: cn-hangzhou. This parameter is required if you use the signature algorithm V4.
region = "cn-hangzhou"

# Specify the name of your bucket.
bucket = oss2.Bucket(auth, endpoint, "yourBucketName", region=region)

key ='python_select.csv'
content ='Tom Hanks,USA,45\r\n'*1024
filename ='python_select.csv'

# Upload a CSV file. 
bucket.put_object(key, content)
# Configure the parameters for the SelectObject operation. 
csv_meta_params = {'RecordDelimiter': '\r\n'}
select_csv_params = {'CsvHeaderInfo': 'None',
                    'RecordDelimiter': '\r\n',
                    'LineRange': (500, 1000)}

csv_header = bucket.create_select_object_meta(key, csv_meta_params)
print(csv_header.rows)
print(csv_header.splits)
result = bucket.select_object(key, "select * from ossobject where _3 > 44", select_call_back, select_csv_params)
select_content = result.read()
print(select_content)

result = bucket.select_object_to_file(key, filename,
      "select * from ossobject where _3 > 44", select_call_back, select_csv_params)
bucket.delete_object(key)

###JSON DOCUMENT
key =  'python_select.json'
content =  "{\"contacts\":[{\"key1\":1,\"key2\":\"hello world1\"},{\"key1\":2,\"key2\":\"hello world2\"}]}"
filename =  'python_select.json'
# Upload a JSON DOCUMENT object. 
bucket.put_object(key, content)
select_json_params = {'Json_Type': 'DOCUMENT'}
result = bucket.select_object(key, "select s.key2 from ossobject.contacts[*] s where s.key1 = 1", None, select_json_params)
select_content = result.read()
print(select_content)

result = bucket.select_object_to_file(key, filename,
      "select s.key2 from ossobject.contacts[*] s where s.key1 = 1", None, select_json_params)
bucket.delete_object(key)

###JSON LINES
key =  'python_select_lines.json'
content =  "{\"key1\":1,\"key2\":\"hello world1\"}\n{\"key1\":2,\"key2\":\"hello world2\"}"
filename =  'python_select.json'
# Upload a JSON LINES object. 
bucket.put_object(key, content)
select_json_params = {'Json_Type': 'LINES'}
json_header = bucket.create_select_object_meta(key,select_json_params)
print(json_header.rows)
print(json_header.splits)

result = bucket.select_object(key, "select s.key2 from ossobject s where s.key1 = 1", None, select_json_params)
select_content =  result.read()
print(select_content)
result = bucket.select_object_to_file(key, filename,
           "select s.key2 from ossobject s where s.key1 = 1", None, select_json_params)
bucket.delete_object(key)

Python SelectObject

This section describes the elements used in the SelectObject operation in Python. These elements include select_object, select_object_to_file, and create_select_object_meta.

select_object

The following code provides an example on how to specify parameters for select_object:
```
def select_object(self, key, sql,
                   progress_callback=None,
                   select_params=None
                   byte_range=None
                   headers=None
                   ):
```
The preceding code example is used to execute an SQL statement on the object that has the specified key and return the query results.
- sql: the SQL statement that does not need to be Base64-encoded.
- Progress_callback: optional. This parameter specifies a callback function used to report the query progress.
- select_params: the parameters and actions of the SelectObject operation.
- headers: the header information included in the request. The header information functions the same as that for the GetObject operation. For example, you can configure the bytes field in the request header to specify the range of a CSV object to query.

The following table describes the parameters supported by select_params.

Parameter	Description
Json_Type	By default, the object is a CSV object if this parameter is not specified. The object is a JSON Document object if this parameter is set to DOCUMENT. The object is a JSON LINES object if this parameter is set to LINES.
CsvHeaderInfo	The header information of the CSV object. Valid values: None, Ignore, and Use. None: This object has no header information configured. Ignore: This object has header information configured, but the information is not used when the SQL statement is executed. Use: This object has header information configured, and the column names in the header information are used when the SQL statement is executed.
CommentCharacter	The comment character in the CSV object. The value of this parameter can be only one character in length. The default value is None, which indicates that no comment characters are allowed.
RecordDelimiter	The row delimiter in the CSV object. The value of this parameter can be only one or two characters in length. Default value: \n.
OutputRecordDelimiter	The row delimiter in the output result of the SELECT statement. Default value: \n.
FieldDelimiter	The column delimiter in the CSV object. The value of this parameter can be only one character in length. The default value is a comma (,).
OutputFieldDelimiter	The column delimiter in the output result of the SELECT statement. The default value is a comma (,).
QuoteCharacter	The quote character for the columns in the CSV object. The value of this parameter can be only one character in length. The default value is a double quotation mark ("). Row and column delimiters enclosed in quotation characters are processed as normal characters.
SplitRange	The split range in multipart query. The value of this parameter is a closed interval in the (start, end) format, which indicates the range of splits to query.
LineRange	The row range in multipart query. The value of this parameter is a closed interval in the (start, end) format, which indicates the range of rows to query.
CompressionType	The format in which the object is compressed. Valid values: GZIP and None. Default value: None.
KeepAllColumns	If this parameter is set to true, columns that are excluded by the SELECT statement in the CSV object are left empty in the output result. However, the column positions are kept. Default value: False. For example, the columns in the CSV object are firstname, lastname, and age. The SQL statement is select firstname, age from ossobject. If KeepAllColumns is set to true, the output result is firstname,,age, in which an extra comma (,) is added to indicate the position of the excluded lastname column. If KeepAllColumns is set to false, the output result is firstname,age. Note This parameter is intended to make the code that is used to process GetObject available to process SelectObject without modifications.
OutputRawData	If this parameter is set to True, SelectObject directly returns the raw data. If data is not returned for a long time, a timeout error may occur. If this parameter is set to False, the output data is encapsulated in frames. Default value: False.
EnablePayloadCrc	Specifies whether a cyclic redundancy check (CRC) value is calculated for each frame. Default value: False.
OutputHeader	The header information in the first line of the output result. This parameter applies only to CSV objects.
SkipPartialDataRecord	If this parameter is set to True for a CSV object, the current record is skipped when a column of this record has no data. If this parameter is set to True for a JSON object, the current record is skipped when a key of this record does not exist. If this parameter is set to False, a column without data is left empty in the output result. For example, a row includes the firstname, lastname, and age columns. The SQL statement is `select _1, _4 from ossobject`. If this parameter is set to True, this row is skipped. If this parameter is set to False, firstname,\n is returned.
MaxSkippedRecordsAllowed	The maximum number of skipped rows. The default value is 0, which indicates that an error is returned if a row is skipped.
ParseJsonNumberAsString	If this parameter is set to True, all numbers in the JSON object are parsed as strings. If this parameter is set to False, all numbers in the JSON object are parsed as integers or floating-point numbers. Default value: False. High-precision floating-point numbers in a JSON object suffer loss of precision when they are parsed as floating-point numbers. To keep the precision, you can set this parameter to True and use the CAST function to convert the parsed data to the decimal type.

Returned results of select_object: A SelectObjectResult object is returned. You can use the read() function or the _iter_ method to obtain all results.
Note
If you call the read() function to read all results when multiple results are returned, excess memory resources are used and you have to wait for a long time. We recommend that you use the _iter_ method (foreach chunk in result) to obtain results and process each chunk in the results. The _iter_ method reduces memory usage and enables the client to process each chunk request processed by the OSS server in a timely manner. This way, the client does not need to wait until all results are returned.

select_object_to_file
```
def select_object_to_file(self, key, filename, sql,
                   progress_callback=None,
                   select_params=None
                   headers=None
                   ):
```
The preceding code example is used to execute an SQL statement on the object that has the specified key and write the query results to another specified object.
Other parameters are the same as those of select_object.

create_select_object_meta

Syntax of create_select_object_meta
```
def create_select_object_meta(self, key, select_meta_params=None, header=None):
```
The preceding code example is used to create Select Meta for the object that has the specified key or obtain Select Meta from such an object. Select Meta includes the total number of rows, total number of columns (for CSV objects), and total number of splits in an object.
If Select Meta has been created for an object, this function does not re-create Select Meta unless the OverwriteIfExists parameter is set to true.
To create Select Meta for an object, the object must be completely scanned.

The following table describes the parameters supported by select_meta_params.

Parameter	Description
Json_Type	By default, the object is a CSV object if this parameter is not specified. If this parameter is specified, the value of this parameter must be LINES, which indicates that the object is a JSON LINES object. Note This operation does not apply to JSON Document objects.
RecordDelimiter	The row delimiter in the CSV object.
FieldDelimiter	The column delimiter in the CSV object.
QuoteCharacter	The quote character in the CSV object. Row and column delimiters enclosed in quote characters are processed as normal characters.
CompressionType	The format in which the object is compressed. No compression formats are supported. The default value is None.
OverwriteIfExists	Specifies whether the created Select Meta overwrites the original Select Meta. You do not need to set this parameter in most cases.

Returned results of create_select_object_meta: A GetSelectObjectMetaResult object is returned, which includes the rows and splits attributes. For a CSV object, the select_resp object in the result includes the columns attribute, which indicates the number of columns in the CSV object.

References

For the complete sample code that is used to query objects, visit GitHub.
For more information about the API operation that you can call to query objects, see SelectObject.