Nine Tips for Kubernetes Log Output

By Zhang Cheng (Yuanyi), Alibaba Cloud Storage Service Technical Expert

In recent years, more users have sought consulting on how to build a log system for Kubernetes or how to solve issues throughout this process. This article describes our years of experience in building log systems to give you a shortcut to successfully building a log system for Kubernetes. This article is the third in a Kubernetes-related series that focuses on our practical experience. The content is subject to updates, as the involved technology evolves.

Preface

In the previous article Design and Practices of Kubernetes Log Systems, I introduced how to build a Kubernetes log system from the global dimension. This article is centered on practice and introduces how to build a log monitoring system for Kubernetes step by step.

The first step in building a log system is to generate logs. This is often the most complicated and difficult step.

On the first working day after the Chinese New Year Festival in 2009, in a freezing office in Beijing, a group of engineers wrote the first line of code for "Apsara." The Apsara platform is the core technical platform of Alibaba Cloud. It was named after an Angkor mythological deity.

The Alibaba Cloud Apsara system was originally designed as a log system. Now, Logstores of the Apsara logging system are applied in all Apsara systems including Apsara Distributed File System, Apsara Name Service and Distributed Lock Synchronization System, Job Scheduler, and Cloud Network Management.

The Importance of Logs in Kubernetes

Logs are essentially records about how an application runs. Many functions are derived from logs, such as online monitoring, alerting, operation analysis, and security analysis. In turn, these functions have certain requirements for logs. The logs must be standardized to reduce the costs of log collection, parsing, and analysis.

In Kubernetes, the environment is highly dynamic and logs are volatile. It is necessary to collect logs to centralized storage in real-time. To facilitate log collection, log output must meet a series of requirements.

The following describes the common considerations for log output in Kubernetes. Items marked with asterisks (*) are Kubernetes-specific.

Select a log level
Normalize log content
Control log output
Use multiple log output targets
Control resource consumption of logs
Select a log library
Select a log output mode (*)
Flush logs into disks and select disk media (*)
Ensure a log retention period (*)

Select a Log Level

Every log must have a log level, which is used to determine the severity of log events. Logs are usually classified into six levels:

FATAL indicates very severe or unexpected errors. When encountering such errors, you must immediately trigger alerts and handle them manually.
ERROR indicates unexpected errors that may cause exceptions in some systems. Such errors do not affect the normal operation of core businesses and systems.
WARN indicates potential risks or information of concern for core paths.
INFO indicates the details of application execution. Such information includes the main execution process of each request.
DEBUG indicates logs used for offline debugging and analysis of application execution logic. For online applications, do not enable this level.
TRACE indicates the most detailed running trajectories, which may contain data content involved.

Programmers must set the log level properly. Here is some of my in-practice experience from development:

FATAL logs must be output because they involve severe errors that require manual processing.
It is hard for many programmers to tell the differences between ERROR logs and WARNING logs. Generally, ERROR logs require alerts but WARNING logs do not.
Log levels are used to indicate the severity of logs and control the log output of applications. Generally, only INFO and WARN logs can be enabled online.
You can generate more DEBUG logs to facilitate problem analysis.
All user request logs must be recorded.
For unknown external system calls, logs must cover as much information as possible.
Log libraries in applications must support log level changes during runtime. This allows you to temporarily change log levels for log analysis when problems occur.
Generally, when releasing a new feature, you can upgrade the log involved to a higher level to facilitate real-time monitoring. When the performance of the new feature becomes stable, you can adjust the log to its original level. When changing the log level, add an annotation for potential recovery.

Normalize Log Content

Programmers generally write code in unconstrained styles, resulting in a hodge-podge of log content. These logs can be understood only by the developers themselves, making analysis and alerting difficult. Therefore, you need a set of top-down log specifications to constrain developers, so that all logs appear to be output by one person and can be analyzed easily.

Log Fields

Generally, a log contains the required Time, Level, and Location fields. Specific modules, processes, or businesses may also require some common fields. For example:

Add a TraceID field to the log if a Trace system is used.
Add corresponding fields for some specific processes. For example, an order lifecycle log must contain information such as the order number and user ID. You can attach such information to the log instance of the process through Context.
In HTTP request logs, add the following fields: URL, Method, Status, Latency, Inflow, OutFlow, ClientIP, and UserAgent. For more information, see the NGINX log format.
If the logs of multiple modules are output into one stream or file, add a field that identifies each module name.

The log field specifications are preferred to be implemented in a top-down fashion from O&M platforms or middleware platforms. This is to restrict programmers of each module or process to output logs according to the specifications.

Log Formats

Generally, we recommend that you use the key-value pair format. Alibaba Cloud's Apsara Logstore uses this format:

[2019-12-30 21:45:30.611992]    [WARNING]       [958] [block_writer.cpp:671]  path:pangu://localcluster/index/3/prom/7/1577711464522767696_0_1577711517     min_time:1577712000000000       max_time:1577715600000000       normal_count:27595      config:prom     start_line:57315569     end_line:57343195       latency(ms):42  type:AddBlock

Logs in the key-value pair format are completely self-parsable and easy to understand. They are automatically parsed during log collection.

Another recommended log format is JSON. It is widely supported by many log libraries and most log collection agents.

{"addr":"tcp://0.0.0.0:10010","caller":"main.go:98","err":"listen tcp: address tcp://0.0.0.0:10010: too many colons in address","level":"error","msg":"Failed to listen","ts":"2019-03-08T10:02:47.469421Z"}

Note: In most scenarios, do not use non-readable log formats such as ProtoBuf and Binlog.

Line Feed of a Single Log

Do not output a log into multiple lines unless necessary. This increases the costs of log collection, parsing, and indexing.

Control Log Output

The output volume of logs directly affects disk utilization and application performance. An excessive output hinders log viewing, collection, and analysis. By contrast, an excessively low output hinders monitoring and can make troubleshooting impossible.

The log data volume must be properly controlled for online applications:

Output and collect all the request and response logs of service endpoints unless otherwise specified. You can adjust to-be-collected fields as needed.
In general, output all ERROR logs. If there are too many ERROR logs, you can output them through sampling.
Reduce invalid log output cases, especially looped log output.
Keep the output rate of request logs (such as Ingress and NGINX access logs) to within 5 MB/s, which is equivalent to 10 thousand 500-byte log entries every second. Keep the output rate of application logs to within 200 KB/s, which is equivalent to one hundred 2-KB log entries every second.

Use Multiple Log Output Targets

We recommend that you output different types of logs of one application to different targets or files to facilitate collection, viewing, and monitoring by category. For example:

Output access logs in a separate file. If there are not many domain names, you can create one file for each domain name.
Output ERROR logs in a separate file and configure monitoring alerts separately.
Output external system call logs in a separate file to facilitate subsequent reconciliation and auditing.
Middleware is normally provided by a unified platform. Therefore, output middleware-related logs in a separate file.

Control Resource Consumption of Logs

In a business system, logs are an auxiliary module and must not affect the normal operation of business. You must pay special attention to the resource consumption of the log module. When you select or develop a log library, carry out a performance test on it. Make sure the resource consumption of logs accounts for less than 5% of the overall CPU utilization under normal circumstances.

Note: Log output must be asynchronous and must not jam the business system.

Select a Log Library

There are dozens of open-source log libraries in each programming language. When you select a log library for your company or business, use a stable version of a mainstream log library whenever possible. For example:

For Java, use Log4J or LogBack.
For Golang, use go-kit.
Default integrated log libraries in Python apply to most scenarios. For details, see CookBook.
For C++, we recommend using spdlog for its high performance and cross-platform support.

Select a Log Output Mode

In a virtual machine or physical machine scenario, most applications output logs into files, and only certain system applications output logs into the syslog or journal. In a container scenario, there is one more standard output mode. In this mode, applications output logs to the stdout or stderr, and the logs are imported to the Docker log module. You can view the logs by running docker logs or kubectl logs.

The standard output mode of containers applies only to relatively simple applications and some system components of Kubernetes. By contrast, online service applications usually involve multiple levels (middleware) and interact with a variety of services. As a result, an online service application generates multiple types of logs. If all these logs are output in container standard output mode, it is difficult to differentiate and process them.

The container standard output mode greatly reduces the performance of DockerEngine. In a test, every additional output of 100,000 entries every second occupied one more CPU core (100% of the single core) of DockerEngine.

Flush Logs into Disks and Select Disk Media

In Kubernetes, you can directly integrate a log library with a log system. This allows logs to directly go to the backend of the log system for output without being flushed into a disk. In this way, logs do not need to be stored into disks or collected by agents, greatly improving the overall performance.

However, we recommend that you use this method only in scenarios with a large number of logs. In general cases, flushing logs into disks improves the overall reliability because it provides additional file caching. In the event of a network failure, data can be cached. This allows developers and O&M engineers to check log files when the log system is unavailable.

Kubernetes provides multiple storage methods. For an off-premises application, you can choose local storage, remote file storage, or object storage. Log writing involves high query per second (QPS) and directly affects application performance. Remote storage requires two or three additional network interactions. Therefore, we recommend that you use local storage, such as HostVolume and EmptyDir, to minimize the impact on data writing and collection performance.

Ensure a Log Retention Period

Compared with conventional virtual machine or physical machine scenarios, Kubernetes provides robust fault tolerance and powerful scheduling and scaling capabilities for the application layer and nodes. Kubernetes readily provides you with high reliability and ultimate elasticity but the dynamic creation or deletion of nodes or containers destroys relevant logs. Consequently, the log retention period may be unable to meet DevOps, audit, or other requirements.

Long-term log storage in a dynamic environment can only be implemented with centralized log storage. Through real-time log collection, logs of every node or container are collected to the centralized log system within seconds. In this way, logs can be used for backtracking when a node or container is down.

Summary

Log output is an essential part of log system construction. Enterprises and product lines must follow unified log specifications to ensure smooth log collection, analysis, monitoring, and visualization.

In the next article, I will talk about the best practices for planning log collection and storage for Kubernetes.

The first article of this blog series is available by clicking here.

Community

Nine Tips for Kubernetes Log Output

Preface

The Importance of Logs in Kubernetes

Select a Log Level

Normalize Log Content

Log Fields

Log Formats

Line Feed of a Single Log

Control Log Output

Use Multiple Log Output Targets

Control Resource Consumption of Logs

Select a Log Library

Select a Log Output Mode

Flush Logs into Disks and Select Disk Media

Ensure a Log Retention Period

Summary

Read previous post:

Read next post:

Alibaba Cloud Native Community

You may also like

Comments

Alibaba Cloud Native Community

Related Products

Simple Log Service

Managed Service for Prometheus

Application Real-Time Monitoring Service

Storage Capacity Unit