By Rainbond
Kubernetes has become the de facto standard for infrastructure in the cloud-native era, and more and more applications run on it. Its automated O&M capabilities solve most of the O&M problems of business systems, but some problems still have to be solved manually by O&M personnel. This article discusses whether there is a general approach to troubleshooting business O&M problems on Kubernetes, and whether other tools can simplify the troubleshooting process compared with traditional O&M.
First of all, it is necessary to clarify which problems count as business system problems in the Kubernetes field.
We assume you already have a robust Kubernetes environment: the running status of the business system is not affected by exceptions in the underlying runtime environment, and when errors occur in the business system, Kubernetes correctly collects its running status information.
With this assumption, we can limit business system problems to the period between deployment and normal operation. The problems discussed in this article therefore fall into three stages: defining the specification of the business system, scheduling the business system, and starting and running the business system.
The significance of solving business system problems is obvious because getting the business system running is a basic requirement. Neither a robust Kubernetes runtime environment nor the business system code generates direct value on its own. Real value is generated only when the code runs in a stable environment and provides services to end users.
Fortunately, most of these problems only need to be solved once, when a new business system is first deployed to the Kubernetes environment. Once the deployment is completed, the team can focus on iterating features, and releases can proceed in a continuous cycle, moving smoothly into a CI/CD process.
Beyond meeting this basic requirement, reducing the difficulty of solving such problems is also significant. In the cloud-native era, we advocate that every developer should be able to control their own business systems. This places a new requirement on developers: they must be able to use Kubernetes. In effect, O&M work is handed over to developers, and in practice this shift is rarely smooth. Enterprises can use a cloud-native application platform to make it easier for developers to deploy and debug their business systems on Kubernetes. Rainbond is such a cloud-native application management platform. Its ease of use lowers the learning threshold for developers and empowers business systems.
Normally, the staff responsible for deploying the business system define it through a declarative configuration file, the key part of which is called the specification. The specification fields are defined in strictly formatted YAML configuration files, and extensive knowledge of Kubernetes is required to fill in the correct keys and values. Mastering the format and content of these configuration files is the first high threshold for developers learning native Kubernetes.
In the native mode, the kubectl command line tool strictly verifies these configuration files. However, when verification fails, the error message is hard to read.
Let's use a simple YAML configuration file as an example:
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: my-nginx
  name: my-nginx
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-nginx
  template:
    metadata:
      labels:
        app: my-nginx
    spec:
      containers:
      - image: nginx
        name: nginx
        env:
        - name: DEMO_GREETING
          value: "true" # You must use quotation marks here because it is of string type.
        securityContext:
          privileged: true # You must not use quotation marks here because it is of the bool type.
There are two true values in the configuration, but one must be enclosed in quotation marks and the other must not, which easily trips up novices. If you apply a version of this file in which the environment variable value is left unquoted, the error returned by the system does locate the problem, but it is hard to read:
$ kubectl apply -f my-deployment.yaml
Error from server (BadRequest): error when creating "my-deployment.yaml": Deployment in version "v1" cannot be handled as a Deployment: v1.Deployment.Spec: v1.DeploymentSpec.Template: v1.PodTemplateSpec.Spec: v1.PodSpec.Containers: []v1.Container: v1.Container.Env: []v1.EnvVar: v1.EnvVar.Value: ReadString: expects " or n, but found t, error found in #10 byte of ...|,"value":true}],"ima|..., bigger context ...|ainers":[{"env":[{"name":"DEMO_GREETING","value":true}],"image":"nginx","name":"nginx"}]}}}}
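This kind of type mismatch can also be caught before the manifest reaches the API server. The following is a minimal Python sketch, not a feature of kubectl or Rainbond. The function name `find_non_string_env_values` is hypothetical, and the manifest is written as a Python dict that stands in for the parsed YAML file (an unquoted true parses to a boolean).

```python
def find_non_string_env_values(manifest):
    """Return (container, env name, value) for every env value that is not a string."""
    problems = []
    pod_spec = manifest["spec"]["template"]["spec"]
    for container in pod_spec.get("containers", []):
        for env in container.get("env", []):
            value = env.get("value")
            # Entries that use valueFrom have no "value" key, so skip None.
            if value is not None and not isinstance(value, str):
                problems.append((container["name"], env["name"], value))
    return problems


# Stand-in for the parsed, broken manifest: the env value was left unquoted.
broken = {
    "spec": {"template": {"spec": {"containers": [{
        "name": "nginx",
        "image": "nginx",
        "env": [{"name": "DEMO_GREETING", "value": True}],
        "securityContext": {"privileged": True},  # bool is correct here
    }]}}}
}

for container, name, value in find_non_string_env_values(broken):
    print(f"{container}: env {name} has a {type(value).__name__} value; quote it")
```

Only the env entries are checked, because securityContext.privileged is expected to be a bool and must stay unquoted.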
Such problems do not occur in a cloud-native application management platform like Rainbond. The product is designed to guard against common input errors. Users do not need to pay attention to the types of the values they enter, because the platform converts them.
The platform automatically encloses environment variables with quotation marks to match the string data type.
The platform uses an enabled/disabled state to indicate the bool data type.
For some special input, reasonable verification is carried out, and the feedback information is more user-friendly.
With these features, even novice users can correctly define the specifications of the business system.
After the specification of the business system is defined, you can submit it to the Kubernetes system. Next, Kubernetes will use its scheduling mechanism to allocate the business system to an appropriate host to run. During scheduling, the business system will be in the Pending state for a short period. However, if it stays in the Pending state for a long time, something has gone wrong during scheduling.
Kubernetes records every step before the business system enters the running state in the form of events. Once a Warning or a more serious event occurs, it indicates that the deployment of the business system is being blocked. Knowing how to view these events and understanding what they represent is helpful for troubleshooting scheduling problems.
Common problems that keep business systems in a Pending state for a long time include image pull failures and insufficient resources. With native Kubernetes, you must use the command line to obtain the event information of the corresponding pod.
$ kubectl describe pod <podName> -n <nameSpace>
When no compute node has sufficient memory to schedule the pods of the business system, the event information looks like this:
Events:
  Type     Reason            Age        From               Message
  ----     ------            ----       ----               -------
  Warning  FailedScheduling  <unknown>  default-scheduler  0/3 nodes are available: 3 Insufficient memory.
The event information of image pull failure is listed below:
Events:
  Type     Reason   Age                From                              Message
  ----     ------   ----               ----                              -------
  Warning  Failed   26s                kubelet, cn-shanghai.10.10.10.25  Error: ErrImagePull
  Normal   BackOff  26s                kubelet, cn-shanghai.10.10.10.25  Back-off pulling image "nginx_error"
  Warning  Failed   26s                kubelet, cn-shanghai.10.10.10.25  Error: ImagePullBackOff
  Normal   Pulling  15s (x2 over 29s)  kubelet, cn-shanghai.10.10.10.25  Pulling image "nginx_error"
Users need deep Kubernetes knowledge to interpret the event list. Developers need to find the keyword from the event list and then take action to solve the problem.
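The keyword lookup described above can be sketched in a few lines of Python. This is an illustration only: the `HINTS` table, the hint texts, and the function `explain_events` are hypothetical, and the event dicts mirror the `type`, `reason`, and `message` fields of Kubernetes Event objects, such as those returned by `kubectl get events -n <nameSpace> -o json`.

```python
MEMORY_HINT = "Free memory on the nodes, add nodes, or lower the pod's memory request."
IMAGE_HINT = "Check the image name, tag, and registry credentials."

# (keyword searched in reason + message, suggested action); assumed mapping
HINTS = [
    ("Insufficient memory", MEMORY_HINT),
    ("ErrImagePull", IMAGE_HINT),
    ("ImagePullBackOff", IMAGE_HINT),
]


def explain_events(events):
    """Return de-duplicated hints for the Warning events, in order of appearance."""
    hints = []
    for event in events:
        if event.get("type") != "Warning":
            continue  # Normal events do not block the deployment
        text = f"{event.get('reason', '')} {event.get('message', '')}"
        for keyword, action in HINTS:
            if keyword in text and action not in hints:
                hints.append(action)
    return hints


# Sample data copied from the two event lists shown above.
sample_events = [
    {"type": "Warning", "reason": "FailedScheduling",
     "message": "0/3 nodes are available: 3 Insufficient memory."},
    {"type": "Warning", "reason": "Failed", "message": "Error: ErrImagePull"},
    {"type": "Normal", "reason": "BackOff",
     "message": 'Back-off pulling image "nginx_error"'},
    {"type": "Warning", "reason": "Failed", "message": "Error: ImagePullBackOff"},
]

for hint in explain_events(sample_events):
    print(hint)
```

A real cluster produces many more event reasons than this table covers, so unknown warnings still need to be read manually.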
Rainbond was designed with the cost of troubleshooting in mind. Users can click the box representing the pod of a problematic business system to see its details. This page presents the description of the core problem, the current status of the pod, and the pod's description, which helps users locate the problem quickly.
After the business system is successfully scheduled to an appropriate host, Kubernetes starts the corresponding pod, and the business system will soon be ready to serve external traffic. However, this stage should not be taken lightly, because pods may still run abnormally when they start.
In most cases, a pod running normally is in a Running state. Developers can run the following command to obtain the status:
$ kubectl get pod <podName> -n <nameSpace>
However, if it is in an abnormal state, you may get the following results:
NAME                                 READY   STATUS             RESTARTS   AGE
demo-test-my-nginx-6b78f5fc8-f9rkz   0/1     CrashLoopBackOff   3          86s
CrashLoopBackOff is an abnormal state. Other abnormal states may also occur, such as OOMKilled and Evicted. Each type of error is handled differently, which requires extensive experience in troubleshooting Kubernetes problems.
For example, CrashLoopBackOff means that a container in the pod cannot run normally: the code hits an unrecoverable problem, reports an error, and exits, and Kubernetes keeps restarting it. The correct way to handle this problem is to query the log of the problematic pod to understand the exception at the business code level.
$ kubectl logs -f <podName> -n <nameSpace>
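The mapping from an abnormal status to a first troubleshooting command can be written down as a small lookup. The sketch below is an assumption about a reasonable first step for each status, not an official procedure, and the function name `next_step` is hypothetical. It only builds the command string and does not run it. The `--previous` flag of `kubectl logs` prints the log of the previous instance of the container, which is useful once a crashed container has already been restarted.

```python
def next_step(status, pod_name, namespace):
    """Suggest a first kubectl command for the STATUS shown by `kubectl get pod`."""
    if status == "CrashLoopBackOff":
        # The container keeps exiting: read the output of the previous, crashed run.
        return f"kubectl logs --previous {pod_name} -n {namespace}"
    if status in ("Pending", "OOMKilled", "Evicted"):
        # Events and termination details are shown in the pod description.
        return f"kubectl describe pod {pod_name} -n {namespace}"
    # Anything else: look at the current status again.
    return f"kubectl get pod {pod_name} -n {namespace}"


print(next_step("CrashLoopBackOff", "demo-test-my-nginx-6b78f5fc8-f9rkz", "default"))
```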
This troubleshooting approach is the same regardless of which business system is deployed, so it can be standardized. Rainbond builds it into the product: if the pod of a business system enters an abnormal state and this is captured in the operation record, users can click the abnormal operation record to jump directly to the log page and view the problem logs. This design implicitly guides users through the troubleshooting steps.
Note: There is also a special type of problem during operation. CrashLoopBackOff usually occurs when the pod is started, so users can easily capture it. However, OOMKilled usually occurs after the business system has been running for a long time. This kind of problem is difficult for users to capture because Kubernetes recovers automatically by restarting the pods of the problematic business system.
Rainbond will automatically record this abnormal state and leave a corresponding log for subsequent analysis to learn which container in the pod causes the memory leak.
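On native Kubernetes, evidence of an out-of-memory kill remains in the pod status after the automatic restart: each entry in `status.containerStatuses` keeps the previous termination under `lastState.terminated`. The following Python sketch reads that field from pod JSON, such as the output of `kubectl get pod <podName> -n <nameSpace> -o json`. The function name `oom_killed_containers` and the sample data are hypothetical.

```python
import json


def oom_killed_containers(pod):
    """Return the names of containers whose current or last termination was OOMKilled."""
    names = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        last = cs.get("lastState", {}).get("terminated") or {}
        current = cs.get("state", {}).get("terminated") or {}
        if "OOMKilled" in (last.get("reason"), current.get("reason")):
            names.append(cs["name"])
    return names


# Hypothetical, trimmed pod status: the container is running again after an OOM kill.
sample_pod = json.loads("""
{
  "status": {
    "containerStatuses": [
      {
        "name": "nginx",
        "restartCount": 4,
        "state": {"running": {"startedAt": "2023-07-05T06:14:00Z"}},
        "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}
      }
    ]
  }
}
""")

print(oom_killed_containers(sample_pod))
```

Because only the most recent termination is kept in the pod status, a check like this has to run periodically, or the result has to be stored elsewhere, to build a history for later analysis.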
To troubleshoot problems at each stage of a business system on native Kubernetes, developers must have a deep understanding of Kubernetes and be comfortable with interactive command line operations. This effectively raises the technical requirements for developers and imposes an O&M burden on them, resulting in a poor cloud-native adoption experience. Giving developers command line permissions to operate Kubernetes directly also tends to conflict with security regulations.
A cloud-native application management platform will be a good choice to enable developers to debug business systems in a reasonable way. The designer of the cloud-native application management platform has an in-depth understanding of the demands of developers. With simple and easy-to-use features and user-friendly design provided by Rainbond, developers can debug business systems more efficiently.