After logs are shipped from Simple Log Service to Object Storage Service (OSS), the logs can be stored in different formats. This topic describes the Parquet format.
Parameters
The following figure shows the parameters that you must configure if you specify parquet for Storage Format in a shipping rule. For more information, see Ship log data from Simple Log Service to OSS.
The following table describes the parameters.
Parameter | Description |
Key Name | The name of the log field that you want to ship to OSS. You can view log fields on the Raw Logs tab of a Logstore. We recommend that you add log fields one by one. When the log fields are shipped to OSS, they are stored in a Parquet file in the order in which you add them, and the names of the log fields are used as the names of the columns in the Parquet file. The log fields that you can ship to OSS include the fields in the log content and the reserved fields, such as __time__, __topic__, and __source__. For more information about reserved fields, see Reserved fields. The values of the columns in a Parquet file are null in the following scenarios:
Note
|
Type | The data type of the specified log field. The following data types are supported: STRING, BOOLEAN, INT32, INT64, FLOAT, and DOUBLE. When log fields of the STRING type are shipped from Simple Log Service to OSS, the log fields are converted to the byte_array type, which is supported in a Parquet file. In addition, the |
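The supported data types listed above can be checked before you configure a shipping rule. The following Python sketch is only an illustration: the ordered list of (key name, type) pairs and the check_columns function are hypothetical and are not part of Simple Log Service or any SDK.

```python
# Data types that the Type parameter accepts, as listed in the table above.
SUPPORTED_TYPES = ("STRING", "BOOLEAN", "INT32", "INT64", "FLOAT", "DOUBLE")


def check_columns(columns):
    """Validate a hypothetical ordered list of (key_name, type_name) pairs.

    Returns the column names in the order in which they were added,
    which is the order the columns take in the Parquet file.
    """
    names = []
    for key_name, type_name in columns:
        if type_name not in SUPPORTED_TYPES:
            raise ValueError(
                f"unsupported type {type_name!r} for field {key_name!r}"
            )
        names.append(key_name)
    return names


# Example configuration (field names taken from the samples in this topic).
columns = [("remote_addr", "STRING"), ("body_bytes_sent", "STRING")]
print(check_columns(columns))
```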
Sample URLs of OSS objects
After logs are shipped to OSS, the logs are stored in OSS buckets. The following table provides the sample URLs of the OSS objects that store the logs.
Compression type | Object suffix | Sample URL | Description |
Not compressed | .parquet | oss://oss-shipper-shenzhen/ecs_test/2016/01/26/20/54_1453812893059571256_937.parquet | You can download the OSS object to your computer and consume data in the object. For more information, see Data consumption. |
Snappy | .snappy.parquet | oss://oss-shipper-shenzhen/ecs_test/2016/01/26/20/54_1453812893059571256_937.snappy.parquet |
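The suffix rules in the preceding table can be applied in code, for example to decide how to handle a downloaded object. The following Python sketch is a minimal illustration. The describe_object function is hypothetical, and it assumes that the first component after oss:// is the bucket name.

```python
import posixpath
from urllib.parse import urlparse


def describe_object(url):
    """Split an oss:// URL and infer the compression type from the suffix."""
    parsed = urlparse(url)
    if parsed.scheme != "oss":
        raise ValueError(f"not an OSS URL: {url!r}")

    name = posixpath.basename(parsed.path)
    # Check the longer suffix first, because .snappy.parquet also ends in .parquet.
    if name.endswith(".snappy.parquet"):
        compression = "Snappy"
    elif name.endswith(".parquet"):
        compression = "Not compressed"
    else:
        raise ValueError(f"not a Parquet object: {name!r}")

    return {
        "bucket": parsed.netloc,          # assumed to be the bucket name
        "key": parsed.path.lstrip("/"),   # object key inside the bucket
        "compression": compression,
    }


info = describe_object(
    "oss://oss-shipper-shenzhen/ecs_test/2016/01/26/20/"
    "54_1453812893059571256_937.snappy.parquet"
)
print(info["compression"])
```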
Data consumption
You can consume data that is shipped to OSS by using E-MapReduce, Spark, or Hive. For more information, see LanguageManual DDL in the Apache Hive documentation.
You can also consume data by using inspection tools.
You can use the parquet-tools utility, which is distributed as a Python package, to inspect Parquet files, view details of the files, and read data. You can install the utility by running the following command or by using a different method:
pip3 install parquet-tools
View the data of columns in a Parquet file
Command
View the first two rows of the remote_addr and body_bytes_sent columns.
parquet-tools show -n 2 -c remote_addr,body_bytes_sent 44_1693464263000000000_2288ff590970d092.parquet
Response
+----------------+-------------------+
| remote_addr    | body_bytes_sent   |
|----------------+-------------------|
| 61.243.1.63    | b'1904'           |
| 112.235.74.182 | b'4996'           |
+----------------+-------------------+
View the content of a Parquet file in the CSV format
Command
parquet-tools csv -n 2 44_1693464263000000000_2288ff590970d092.parquet
Response
remote_addr,body_bytes_sent,time_local,request_method,request_uri,http_user_agent,remote_user,request_time,request_length,http_referer,host,http_x_forwarded_for,upstream_response_time,status
b'61.**.**.63',b'1904',b'31/Aug/2023:06:44:01',b'GET',b'/request/path-0/file-7',"b'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_5_8) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.801.0 Safari/535.1'",b'uh2z',b'49',b'4082',b'www.kwm.mock.com',b'www.ap.mock.com',b'222.**.**.161',b'2.63',b'200'
b'112.**.**.182',b'4996',b'31/Aug/2023:06:44:01',b'GET',b'/request/path-1/file-5',b'Mozilla/5.0 (Windows NT 6.1; de;rv:12.0) Gecko/20120403211507 Firefox/12.0',b'tix',b'71',b'1862',b'www.gx.mock.com',b'www.da.mock.com',b'36.**.**.237',b'2.43',b'200'
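In the CSV output, each value appears as a printed Python bytes literal such as b'1904', because fields of the STRING type are stored as byte_array. The following Python sketch shows one way to turn such output back into plain text by using only the standard library. The decode_cell and decode_csv functions are hypothetical helpers, the sample is shortened to three columns, and UTF-8 encoding is an assumption.

```python
import ast
import csv
import io


def decode_cell(cell):
    """Turn a printed bytes literal such as "b'1904'" into the text "1904"."""
    if isinstance(cell, str) and cell.startswith(("b'", 'b"')):
        try:
            value = ast.literal_eval(cell)
        except (ValueError, SyntaxError):
            return cell  # not a valid literal, leave the cell unchanged
        if isinstance(value, bytes):
            # Assumption: the original log text is UTF-8 encoded.
            return value.decode("utf-8", errors="replace")
    return cell


def decode_csv(text):
    """Parse CSV text and decode every cell of every row."""
    reader = csv.DictReader(io.StringIO(text))
    return [{key: decode_cell(val) for key, val in row.items()} for row in reader]


# Shortened sample based on the response shown above.
SAMPLE = """remote_addr,body_bytes_sent,request_method
b'61.**.**.63',b'1904',b'GET'
b'112.**.**.182',b'4996',b'GET'
"""

rows = decode_csv(SAMPLE)
print(rows[0]["body_bytes_sent"])  # prints 1904
```

The decoded values are still strings. Convert them to numbers in your own code if a column such as body_bytes_sent holds numeric text.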
View the details of a Parquet file
Command
parquet-tools inspect 44_1693464263000000000_2288ff590970d092.parquet
Response
############ file meta data ############
created_by: SLS version 1
num_columns: 14
num_rows: 4661
num_row_groups: 1
format_version: 1.0
serialized_size: 2345

############ Columns ############
remote_addr
body_bytes_sent
time_local
request_method
request_uri
http_user_agent
remote_user
request_time
request_length
http_referer
host
http_x_forwarded_for
upstream_response_time
status

############ Column(remote_addr) ############
name: remote_addr
path: remote_addr
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: None
converted_type (legacy): NONE
compression: UNCOMPRESSED (space_saved: 0%)
......
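The file metadata in the inspect output can be checked in a script, for example to confirm that num_columns matches the number of fields that you configured. The following Python sketch is a minimal illustration. The parse_file_metadata function is hypothetical, and it assumes that the output contains one "key: value" pair per line under the "file meta data" header.

```python
def parse_file_metadata(text):
    """Collect the key: value pairs of the "file meta data" section."""
    meta = {}
    in_meta = False
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("#"):
            # A header line starts a new section; only one section is wanted.
            in_meta = "file meta data" in line
            continue
        if in_meta and ": " in line:
            key, value = line.split(": ", 1)
            meta[key] = value
    return meta


# Shortened sample based on the response shown above.
SAMPLE = """############ file meta data ############
created_by: SLS version 1
num_columns: 14
num_rows: 4661

############ Columns ############
remote_addr
body_bytes_sent

############ Column(remote_addr) ############
name: remote_addr
physical_type: BYTE_ARRAY
"""

meta = parse_file_metadata(SAMPLE)
print(meta["num_columns"], meta["num_rows"])  # prints 14 4661
```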