Simple Log Service:Import Amazon S3 objects to Simple Log Service

Last Updated: Oct 06, 2024

You can import Amazon Simple Storage Service (S3) objects to Simple Log Service. After the objects are imported, you can perform operations such as query, analysis, and transformation on the log data in Simple Log Service. A single S3 object that you import cannot exceed 5 GB in size. For a compressed object, the 5 GB limit applies to the size after compression.

Prerequisites

  • Log files are uploaded to S3.

  • A project and a Logstore are created. For more information, see Create a project and Create a Logstore.

  • A custom policy that grants permissions to manage S3 resources is created. For more information, see Custom permissions. The following sample code provides an example of the custom policy.

    Note

    You can import S3 objects to Simple Log Service only after the custom policy is created and attached to the AWS identity whose AccessKey pair you use.

    {
      "Version": "1",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "s3:GetObject",
            "s3:ListBucket"
          ],
          "Resource": [
            "arn:aws:s3:::your_bucket_name",
            "arn:aws:s3:::your_bucket_name/*"
          ]
        }
      ]
    }
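
  The following sketch is not part of this procedure. It is a minimal check, assuming you use Python and boto3, that the AccessKey pair you plan to enter in the import configuration has the s3:ListBucket and s3:GetObject permissions granted by the policy above. The bucket name, prefix, region, and credentials are placeholders.

    # Minimal permission check (illustration only). Bucket name, prefix,
    # region, and credentials are placeholders.
    import boto3

    s3 = boto3.client(
        "s3",
        region_name="us-east-1",
        aws_access_key_id="YOUR_ACCESS_KEY_ID",
        aws_secret_access_key="YOUR_SECRET_ACCESS_KEY",
    )

    # s3:ListBucket
    resp = s3.list_objects_v2(Bucket="your_bucket_name", Prefix="csv/", MaxKeys=5)
    keys = [obj["Key"] for obj in resp.get("Contents", [])]
    print("Visible objects:", keys)

    # s3:GetObject on the first object found
    if keys:
        head = s3.get_object(Bucket="your_bucket_name", Key=keys[0])["Body"].read(256)
        print("First bytes of", keys[0], ":", head)

  If either call fails with an AccessDenied error, revisit the custom policy before you create the data import configuration.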

Create a data import configuration

  1. Log on to the Simple Log Service console.

  2. On the right side of the page that appears, click Quick Data Import. On the Data Import tab of the Import Data dialog box, click S3 - Data Import.

  3. Select the project and Logstore. Then, click Next.

  4. In the Import Configuration step, create a data import configuration.

    1. In the Import Configuration step, configure the following parameters.

      • Job Name: The name of the data import job.

      • Display Name: The display name of the data import job.

      • Job Description: The description of the data import job.

      • S3 Region: The region where the S3 bucket that stores the S3 objects to import resides.

      • AWS AccessKey ID: The AccessKey ID of your AWS account.

        Important: Make sure that your AccessKey pair has permissions to access the AWS resources that you want to manage.

      • AWS Secret AccessKey: The Secret AccessKey of your AWS account.

      • File Path Prefix Filter: The directory of the S3 objects. If you configure this parameter, the system can find the S3 objects that you want to import more efficiently. For example, if the S3 objects that you want to import are stored in the csv/ directory, set this parameter to csv/.

        If you leave this parameter empty, the system traverses the entire S3 bucket to find the S3 objects.

        Note: We recommend that you configure this parameter. The more objects an S3 bucket contains, the less efficient the import is when the entire bucket is traversed.

      • File Path Regex Filter: The regular expression that is used to filter S3 objects by path. If you configure this parameter, the system can find the S3 objects that you want to import more efficiently. Only the objects whose names, including paths, match the regular expression are imported. By default, this parameter is empty, which indicates that no filtering is performed.

        For example, if an S3 object that you want to import is named testdata/csv/bill.csv, you can set this parameter to (testdata/csv/)(.*). For an illustration, see the sketch after this procedure.

        For more information about how to debug a regular expression, see How do I debug a regular expression?

      • File Modification Time Filter: The modification time based on which S3 objects are filtered. If you configure this parameter, the system can find the S3 objects that you want to import more efficiently. Valid values:

        • All: Import all S3 objects that meet the specified conditions.

        • From Specific Time: Import only the S3 objects that are modified after a specific point in time.

        • Specific Time Range: Import only the S3 objects that are modified within a specific time range.

      • Data Format: The format of the S3 objects. Valid values:

        • CSV: You can use the first line of an S3 object as field names or specify custom field names. All lines except the first line are parsed as the values of log fields.

        • Single-line JSON: An S3 object is read line by line. Each line is parsed as a JSON object, and the fields in the JSON object become log fields.

        • Single-line Text Log: Each line in an S3 object is parsed as a log.

        • Multi-line Text Logs: Multiple lines in an S3 object are parsed as a single log. You can specify a regular expression to match the first line or the last line of a log.

      • Compression Format: The compression format of the S3 objects. Simple Log Service decompresses the S3 objects based on the specified format to read data.

      • Encoding Format: The encoding format of the S3 objects. UTF-8 and GBK are supported.

      • New File Check Cycle: If new objects are continuously generated in the specified directory, configure New File Check Cycle based on your business requirements. After you configure this parameter, the data import job keeps running in the background and automatically detects and reads new objects at the specified interval. The system ensures that data in an S3 object is not repeatedly written to Simple Log Service.

        If new objects are no longer generated in the specified directory, set New File Check Cycle to Never Check. The data import job then automatically exits after all objects that meet the specified conditions are read.

      Log Time Configuration

      • Time Field: The name of the time column in an S3 object. This parameter specifies the log time. If you set Data Format to CSV or Single-line JSON, you must configure this parameter.

      • Regular Expression to Extract Time: The regular expression that is used to extract the log time.

        For example, if a sample log is 127.0.0.1 - - [10/Sep/2018:12:36:49 +0800] "GET /index.html HTTP/1.1", you can set Regular Expression to Extract Time to [0-9]{0,2}\/[0-9a-zA-Z]+\/[0-9:,]+.

        Note: For other data formats, if you want to extract only part of the time field, you can specify a regular expression.

      • Time Field Format: The time format that is used to parse the value of the time field.

        • You can specify a time format that is supported by the Java SimpleDateFormat class, such as yyyy-MM-dd HH:mm:ss. For more information about the time format syntax, see Class SimpleDateFormat. For more information about common time formats, see Time formats.

        • You can specify an epoch time format, which can be epoch, epochMillis, epochMicro, or epochNano.

      • Time Zone: The time zone of the value of the time field. If the value of Time Field Format is an epoch time format, you do not need to configure this parameter.

        If you want daylight saving time to be handled when logs are parsed, select a UTC time zone. Otherwise, select a GMT time zone.

        Note: UTC+8 is used by default.

      If you set Data Format to CSV or Multi-line Text Logs, you must configure additional parameters. The following tables describe the parameters.

      • Additional parameters when you set Data Format to CSV

        • Delimiter: The delimiter for logs. The default value is a comma (,).

        • Quote: The quote that is used to enclose a CSV-formatted string.

        • Escape Character: The escape character for logs. The default value is a backslash (\).

        • First Line as Field Name: If you turn on First Line as Field Name, the first line in a CSV file is used to extract field names.

        • Custom Fields: If you turn off First Line as Field Name, you can specify custom field names. Separate multiple field names with commas (,).

        • Lines to Skip: The number of lines to skip. For example, if you set this parameter to 1, the first line of a CSV file is skipped, and log collection starts from the second line.

      • Additional parameters when you set Data Format to Multi-line Text Logs

        • Position to Match Regular Expression: Specifies how the regular expression is used. For an illustration of first-line matching, see the sketch after this procedure.

          • Regular Expression to Match First Line: The regular expression is used to match the first line of a log. The unmatched lines that follow are collected as part of the same log until the maximum number of lines is reached.

          • Regular Expression to Match Last Line: The regular expression is used to match the last line of a log. The unmatched lines that follow are collected as part of the next log until the maximum number of lines is reached.

        • Regular Expression: The regular expression. Specify the regular expression based on the log content.

          For more information about how to debug a regular expression, see How do I debug a regular expression?

        • Maximum Lines: The maximum number of lines allowed for a single log.

    2. Click Preview to preview the import result.

    3. After you confirm the result, click Next.

  5. Preview data, configure indexes, and then click Next.

    By default, full-text indexing is enabled in Simple Log Service. You can also configure field indexes based on collected logs in manual or automatic mode. To configure field indexes in automatic mode, click Automatic Index Generation. This way, Simple Log Service automatically creates field indexes. For more information, see Create indexes.

    Important

    If you want to query and analyze logs, you must enable full-text indexing or field indexing. If you enable both full-text indexing and field indexing, the system uses only field indexes.

  6. Click Query Log. On the query and analysis page that appears, check whether S3 data is imported.

    Wait approximately 1 minute. If the expected S3 data appears in the query results, the import is successful.
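
  The filter, time, and multi-line parameters described above are easier to see in action with a small example. The following Python sketch is only an illustration and is not part of the console workflow: the object keys, the multi-line sample lines, and the timestamp pattern are assumptions, while the path filter, time regular expression, and sample access log come from the examples in the parameter descriptions.

    # Illustration only. Object keys, the multi-line sample lines, and the
    # timestamp pattern are assumptions; the path filter and time regex are
    # the examples used in the parameter descriptions above.
    import re
    from datetime import datetime

    # File Path Regex Filter: (testdata/csv/)(.*) keeps only matching keys.
    path_filter = re.compile(r"(testdata/csv/)(.*)")
    object_keys = ["testdata/csv/bill.csv", "testdata/json/app.log", "backup/bill.csv"]
    print([k for k in object_keys if path_filter.fullmatch(k)])
    # ['testdata/csv/bill.csv']

    # Regular Expression to Extract Time applied to the sample access log.
    sample_log = '127.0.0.1 - - [10/Sep/2018:12:36:49 +0800] "GET /index.html HTTP/1.1"'
    time_text = re.search(r"[0-9]{0,2}\/[0-9a-zA-Z]+\/[0-9:,]+", sample_log).group(0)
    print(time_text)  # 10/Sep/2018:12:36:49

    # A Time Field Format of dd/MMM/yyyy:HH:mm:ss (Java SimpleDateFormat syntax)
    # corresponds roughly to this strptime pattern.
    print(datetime.strptime(time_text, "%d/%b/%Y:%H:%M:%S"))

    # Multi-line Text Logs: group lines into logs by matching the first line of
    # each log. The timestamp pattern and sample lines are assumptions.
    first_line = re.compile(r"\d{4}-\d{2}-\d{2} ")
    lines = [
        "2024-01-01 12:00:00 ERROR something failed",
        "  at example.Foo.bar(Foo.java:42)",
        "2024-01-01 12:00:01 INFO recovered",
    ]
    logs, current = [], []
    for line in lines:
        if first_line.match(line) and current:
            logs.append("\n".join(current))
            current = []
        current.append(line)
    logs.append("\n".join(current))
    print(logs)  # two logs: the ERROR log with its stack line, and the INFO log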

View a data import configuration

After you create a data import configuration, you can view the configuration details and related statistical reports in the Simple Log Service console.

  1. In the Projects section, click the project to which the data import configuration belongs.

  2. On the Log Storage > Logstores tab, click the Logstore to which the data import configuration belongs, choose Data Collection > Data Import, and then click the name of the data import configuration.

  3. On the Import Configuration Overview page, view the basic information and statistical reports of the data import configuration.


On the Import Configuration Overview page, you can perform the following operations on the data import configuration:

  • Modify the data import configuration

    To modify the data import configuration, click Edit Configurations. For more information, see Import configuration.

  • Start a data import job

    To start or resume a data import job, click Start.

  • Stop a data import job

    To stop a data import job, click Stop.

  • Delete the data import configuration

    To delete the data import configuration, click Delete Configuration.

    Warning

    After the data import configuration is deleted, it cannot be restored.

Billing

You are not charged for the data import feature of Simple Log Service. However, the feature calls AWS API operations, and AWS charges you for the traffic and requests that are generated. The daily fee for importing S3 objects is calculated from the variables that are described in the following table. You can view the fees in your AWS bill.

  • T: The total size of the data that is imported from S3 to Simple Log Service per day. Unit: GB.

  • p_read: The fee per GB of outbound data that flows over the Internet.

  • p_put: The fee per 10,000 PUT requests.

  • p_get: The fee per 10,000 GET requests.

  • New file check interval: The interval at which the system detects new objects. Unit: minutes. You can configure New File Check Cycle to specify this interval when you create a data import configuration.

  • N: The number of objects that are obtained based on File Path Prefix Filter.
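
  The exact billing formula is not reproduced here. As a rough illustration only, the following sketch shows one plausible way, under stated assumptions, to combine the variables above into a daily cost estimate; the way the variables are combined, the function name, and the sample prices are assumptions, not an official AWS or Simple Log Service formula.

    # Rough estimate only. How the variables are combined, the function name,
    # and the sample prices are assumptions; use your AWS bill for real numbers.
    def estimate_daily_cost_usd(T, p_read, p_put, p_get, check_interval_min, N):
        transfer = T * p_read                          # outbound Internet traffic
        get_requests = (N / 10_000) * p_get            # roughly one GET per object
        # Assume new-file checks issue list requests (billed at the PUT/LIST
        # rate) throughout a 24-hour day (1440 minutes).
        list_requests = (1440 / check_interval_min) / 10_000 * p_put if check_interval_min else 0
        return transfer + get_requests + list_requests

    # Example with made-up prices: 100 GB/day, $0.09/GB transfer, $0.005 per
    # 10,000 PUT/LIST and $0.0004 per 10,000 GET requests, a 10-minute check
    # cycle, and 50,000 objects.
    print(round(estimate_daily_cost_usd(100, 0.09, 0.005, 0.0004, 10, 50_000), 4))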

FAQ

  • Problem: No data is displayed during preview.

    Possible cause: The S3 bucket contains no objects, the objects contain no data, or no objects meet the filter conditions.

    Solution:

    • Check whether the S3 bucket contains objects that are not empty, or whether the CSV files contain only a header line. If no S3 objects contain data, wait until the objects contain data and then import them.

    • Modify File Path Prefix Filter, File Path Regex Filter, and File Modification Time Filter.

  • Problem: Garbled characters exist.

    Possible cause: The data format, compression format, or encoding format is not configured as expected.

    Solution: Check the actual format of the S3 objects and modify Data Format, Compression Format, or Encoding Format. To handle the garbled characters that are already imported, create a new Logstore and a new data import configuration.

  • Problem: The log time displayed in Simple Log Service is different from the actual log time.

    Possible cause: No time field is specified in the data import configuration, or the specified time format or time zone is invalid.

    Solution: Specify a time field, or specify a valid time format and time zone. For more information, see Log Time Configuration.

  • Problem: After data is imported, the data cannot be queried or analyzed.

    Possible cause: The data is not within the query time range, no indexes are configured, or the configured indexes do not take effect.

    Solution:

    • Check whether the time of the logs that you want to query is within the query time range. If it is not, adjust the query time range and query the data again.

    • Check whether indexes are configured for the Logstore to which the objects are imported. If they are not, configure indexes. For more information, see Create indexes and Reindex logs for a Logstore.

    • If indexes are configured for the Logstore and the volume of imported data is displayed as expected on the Data Processing Insight dashboard, the indexes may not have taken effect. In this case, reconfigure the indexes. For more information, see Reindex logs for a Logstore.

  • Problem: The number of imported data entries is less than expected.

    Possible cause: Some S3 objects contain lines that are larger than 3 MB, and these lines are discarded during the import. For more information, see Limits on collection.

    Solution: When you write data to an S3 object, make sure that the size of a single line does not exceed 3 MB.

  • Problem: The number of S3 objects and the total volume of data are large, but the import speed does not meet expectations. In most cases, the import speed can reach 80 MB/s.

    Possible cause: The number of shards in the Logstore is too small. For more information, see Limits on performance.

    Solution: Increase the number of shards in the Logstore to 10 or more and check the latency again. For more information, see Manage shards.

  • Problem: Some S3 objects failed to be imported to Simple Log Service.

    Possible cause: The filter conditions are invalid, or the size of an object exceeds 5 GB. For more information, see Limits on collection.

    Solution:

    • Check whether the S3 objects that you want to import meet the filter conditions. If they do not, modify the filter conditions.

    • Check whether the size of each S3 object that you want to import is less than 5 GB. If it is not, reduce the size of the object.

  • Problem: An error occurred in parsing an S3 object that is in the Multi-line Text Logs format.

    Possible cause: The regular expression that is specified to match the first line or the last line of a log is invalid.

    Solution: Check whether the regular expression that is specified to match the first line or the last line of a log is valid.

  • Problem: The latency to import new S3 objects is higher than expected.

    Possible cause: The number of existing S3 objects that are obtained based on File Path Prefix Filter exceeds the upper limit.

    Solution: If the number of existing S3 objects that are obtained based on File Path Prefix Filter exceeds one million, specify a more precise value for File Path Prefix Filter and create additional data import jobs. Otherwise, the efficiency of new file discovery is low.

Error handling

  • File read failure: If an S3 object fails to be completely read because a network exception occurs or the object is damaged, the data import job automatically retries to read the object. If the object still fails to be read after three retries, the object is skipped. The retry interval is the same as the value of New File Check Cycle. If New File Check Cycle is set to Never Check, the retry interval is 5 minutes.

  • Compression format parsing error: If an S3 object is in an invalid compression format, the data import job skips the object during decompression.

  • Data format parsing error: If data fails to be parsed, the data import job stores the original text content in the content field of the logs.

  • S3 bucket absence: The data import job periodically retries. After the S3 bucket is re-created, the data import job automatically resumes the import.

  • Permission error: If a permission error occurs when data is read from the S3 bucket or written to the Simple Log Service Logstore, the data import job periodically retries and does not skip any S3 objects. After the error is fixed, the data import job automatically resumes the import and imports data from the unprocessed objects in the S3 bucket to the Logstore.