After you obtain an offline log file, you can use a command line interface (CLI) to parse the log file and extract statistics such as the top 10 IP addresses, User-Agent headers, and request URLs of the visits. This topic describes how to use a CLI to analyze the offline logs of Alibaba Cloud CDN in a Linux environment.
Prerequisites
Offline logs are downloaded. For more information, see Download offline logs.
Usage notes
Naming rule for log files: Accelerated domain name_year_month_day_start time_end time[extension field].gz. The extension field starts with an underscore (_). Example: aliyundoc.com_2018_10_30_000000_010000_xx.gz.
Note: Names of some log files may not contain an extension field. Example: aliyundoc.com_2018_10_30_000000_010000.gz.
Sample log entry
[9/Jun/2015:01:58:09 +0800] 10.10.10.10 - 1542 "-" "GET http://www.aliyun.com/index.html" 200 191 2830 MISS "Mozilla/5.0 (compatible; AhrefsBot/5.0; +http://example.com/robot/)" "text/html"
Parse logs
Collect and prepare log data
Download the compressed log file aliyundoc.com_2018_10_30_000000_010000.gz. For more information, see Download offline logs.
Upload the log file to the local Linux server.
Log on to the local Linux server and run the following command to decompress the log file:
gzip -d aliyundoc.com_2018_10_30_000000_010000.gz
After you decompress the log file, the aliyundoc.com_2018_10_30_000000_010000 file is generated.
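If you downloaded multiple hourly log files, you can decompress and merge them before analysis. The following commands are a minimal sketch; the date in the file names and the merged_logs.txt output file name are examples:
# Decompress all log files of the domain name for October 30, 2018.
gzip -d aliyundoc.com_2018_10_30_*.gz
# Merge the decompressed files into a single file for analysis.
cat aliyundoc.com_2018_10_30_* > merged_logs.txt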
Identify and filter abnormal behaviors
Check requests
You can identify abnormal requests by analyzing the number of requests from each IP address in offline log data. In most cases, abnormal requests have the following characteristics:
Abnormally high volume of requests: You can analyze the access frequency of a single source IP address. If the number of requests from an IP address within a specific period of time significantly deviates from the normal value, data transmission abuse may have occurred.
A large number of requests in a short period of time: You can check whether sudden traffic spikes or abnormal periodic requests exist. For an example, see the sketch after the following note.
Collect the top 10 IP addresses that initiated the most requests.
cat [$Log_Txt] | awk '{print $3}' | sort | uniq -c | sort -nr | head -n 10
Note:
awk '{print $3}': extracts the third column of the log file, which contains the IP address. Columns are separated by spaces.
sort: sorts the IP addresses.
uniq -c: counts the number of occurrences of each IP address.
sort -nr: sorts the results by count in descending order.
head -n 10: returns the top 10 IP addresses that initiated the most requests.
[$Log_Txt]: the name of the log file. Example: aliyundoc.com_xxxxxxx.
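To detect sudden traffic spikes, you can also count the number of requests per minute. The following command is a minimal sketch that assumes the first space-separated field is the bracketed timestamp, as in the sample log entry:
# Keep the date, hour, and minute of each timestamp, then count requests per minute.
awk '{print $1}' [$Log_Txt] | cut -d: -f1-3 | tr -d '[' | sort | uniq -c | sort -nr | head -n 10
Minutes whose counts are far above the rest indicate possible traffic spikes.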
Analyze user agents
You can identify abnormal requests by analyzing the User-Agent headers in offline log data. In most cases, abnormal User-Agent headers have the following characteristics:
Abnormal or forged User-Agent headers: Many data transmission abuse tools use default or forged User-Agent headers. You can filter unusual, suspicious, or empty User-Agent headers for analysis.
Extract and collect User-Agent headers.
grep -o '"Mozilla[^"]*' [$Log_Txt] | cut -d'"' -f2 | sort | uniq -c | sort -nr | head -n 10
Filter suspicious User-Agent headers by removing common User-Agent headers.
grep -v -E "Firefox|Chrome|Safari|Edge" [$Log_Txt]
Count the number of rows where the User-Agent header does not contain Mozilla, which is the number of visits from non-browser clients.
awk '!/Mozilla/' [$Log_Txt] | wc -l
Note:
grep -o: displays only the matched content.
grep -v -E: displays the lines that do not match the specified extended regular expression.
wc -l: counts the number of lines, which is the number of visits.
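The preceding extraction command captures only User-Agent headers that start with Mozilla. As an alternative, the following command is a minimal sketch that counts all User-Agent headers, including empty ones. It assumes the log format of the sample entry, in which the User-Agent is the sixth field when a line is split on double quotation marks:
# Split each line on double quotation marks and extract the User-Agent field.
awk -F'"' '{print $6}' [$Log_Txt] | sort | uniq -c | sort -nr | head -n 10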
Analyze request patterns
You can identify abnormal requests by analyzing the request URLs in offline log data. In most cases, abnormal request URLs have the following characteristics:
High URL similarity: In most cases, data transmission abuse involves a large number of similar or identical URLs. You can analyze the URLs to identify abnormal requests.
High percentage of visits to specific types of resources: You can analyze frequently visited resources, such as images, CSS files, and JavaScript files. If specific resources receive a large number of abnormal visits, data transmission abuse may have occurred. For an example, see the sketch after the following command.
Count the top 10 visited URLs.
grep -oP '"[A-Z]+ \Khttps?://[^"]+' [$Log_Txt] | sort | uniq -c | sort -nr | head -n 10
Note: The pattern matches the quoted request field, such as "GET http://www.aliyun.com/index.html". \K discards the HTTP method so that only the URL is counted.
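To check the distribution of visits across resource types, you can count visits by file name extension. The following command is a minimal sketch; it assumes that the request is the fourth field when a line is split on double quotation marks, as in the sample log entry, and the listed extensions are examples that you can adjust:
# Extract the request field, then count visits by file name extension.
awk -F'"' '{print $4}' [$Log_Txt] | grep -oE '\.(html|css|js|png|jpg|jpeg|gif)\b' | sort | uniq -c | sort -nr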
Analyze HTTP status codes
You can identify abnormal requests by analyzing the HTTP status codes in offline log data. In most cases, HTTP status codes for abnormal requests have the following characteristics:
High percentage of HTTP status code 4xx or 5xx: If a large number of 4xx or 5xx status codes are returned for requests from an IP address, the IP address may be used to crawl content. For an example that aggregates 4xx and 5xx responses per IP address, see the sketch at the end of this section.
Count the number of each HTTP status code.
awk '{print $9}' [$Log_Txt] | sort | uniq -c | sort -nr
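To evaluate whether 4xx or 5xx status codes account for a high percentage of responses, you can compute the percentage of each status code. The following command is a minimal sketch that assumes the status code is the ninth space-separated field, as in the sample log entry:
# Count each status code and print its percentage of all requests.
awk '{codes[$9]++; total++} END {for (c in codes) printf "%s %.2f%%\n", c, 100 * codes[c] / total}' [$Log_Txt] | sort -k2 -nr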
Collect the top 10 IP addresses to which the HTTP status code 400 is returned.
awk '$9 == 400 {print $3}' [$Log_Txt] | sort | uniq -c | sort -nr | head -n 10
Note: $9 == 400 matches the status code field, which prevents false matches on other numeric fields, such as the response size.
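To find the IP addresses that trigger the most 4xx and 5xx responses, the following command is a minimal sketch that assumes the status code is the ninth space-separated field and the IP address is the third field, as in the sample log entry:
# Select requests whose status code starts with 4 or 5, then count them per IP address.
awk '$9 ~ /^[45]/ {print $3}' [$Log_Txt] | sort | uniq -c | sort -nr | head -n 10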