By Mingshan Zhao
OpenKruise is an open-source cloud-native application automation management suite. It is also a current incubating project hosted by the Cloud Native Computing Foundation (CNCF). It is a standard extension component based on Kubernetes that is widely used in production of internet scale company. It also closely follows upstream community standards and adapts to the technical improvement and best practices for internet-scale scenarios.
OpenKruise has released the latest version v1.4 on March 31, 2023 (ChangeLog), with the addition of the Job Sidecar Terminator feature. This article provides a comprehensive overview of the new version.
In Kubernetes, for Job workloads, it is commonly desired that when the main container completes its task and terminates, the Pod should enter a completed state. However, when these Pods have Long-Running Sidecar containers, the Sidecar container cannot terminate itself after the main container has exited, causing the Pod to remain in an incomplete state. The community's common solution to this problem usually involves modifying both the Main and Sidecar containers to use Volume sharing to achieve the effect of the Sidecar container exiting after the Main container has completed.
While the community's solution can solve this problem, it requires modification of the containers, especially for commonly used Sidecar containers, which incurs high costs for modification and maintenance.
To address this, we have added a controller called SidecarTerminator to Kruise. This controller is specifically designed to listen for completion status of the main container in this scenario and select an appropriate time to terminate the Sidecar container in the Pod, without requiring intrusive modification of the Main and Sidecar containers.
For pods running on regular nodes, it is very easy to use this feature since Kruise daemon can be installed. Users just need to add a special env to identify the target sidecar container in the pod, and the controller will use the ContainerRecreateRequest(CRR) capability provided by Kruise Daemon to terminate these sidecar containers at the appropriate time.
kind: Job
spec:
template:
spec:
containers:
- name: sidecar
env:
- name: KRUISE_TERMINATE_SIDECAR_WHEN_JOB_EXIT
value: "true"
- name: main
...
For some platforms that provide Serverless containers, such as ECI or Fargate, their pods can only run on virtual nodes such as Virtual-Kubelet. However, Kruise Daemon cannot be deployed and work on these virtual nodes, which makes it impossible to use the CRR capability to terminate containers.
Fortunately, we can use the Pod in-place upgrade mechanism provided by native Kubernetes to achieve the same goal: just construct a special image whose only purpose is to make the container exit quickly once started. In this way, when exiting the sidecar, just replace the original sidecar image with the fast exit image to achieve the purpose of exiting the sidecar.
kind: Job
spec:
template:
spec:
containers:
- name: sidecar
env:
- name: KRUISE_TERMINATE_SIDECAR_WHEN_JOB_EXIT_WITH_IMAGE
value: "example/quick-exit:v1.0.0"
- name: main
...
Replace "example/quick-exit:v1.0.0" with the fast exit image that you have prepared in step 1.
Containers with the environment variable KRUISE_TERMINATE_SIDECAR_WHEN_JOB_EXIT will be treated as sidecar containers, while other containers will be treated as main containers. The sidecar container will only be terminated after all main containers have completed:
Currently, whether it's a change in the state or metadata of a Pod,, the Pod Update event will trigger the CloneSet reconcile logic. CloneSet Reconcile is configured with three workers by default, which is not a problem for smaller cluster scenarios.
However, for larger or busy clusters, these unneccesary reconciles will block the true CloneSet reconcile and delay changes such as rolling updates of CloneSet. To solve this problem, you can turn on the feature-gate CloneSetEventHandlerOptimization to reduce some unnecessary enqueueing of reconciles.
If a Pod is directly deleted or evicted by other controller or user, the PVCs associated with the Pod still remain. When the CloneSet controller creates new Pods, it will reuse existing PVCs.
However, if the Node where the Pod is located experiences a failure, reusing existing PVCs may cause the new Pod to fail to start. For details, please refer to issue 1099. To solve this problem, you can set the disablePVCReuse=true field. After the Pod is evicted or deleted, the PVCs associated with the Pod will be automatically deleted and will no longer be reused.
apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
spec:
...
replicas: 4
scaleStrategy:
disablePVCReuse: true
CloneSet currently supports two lifecycle hooks, PreparingUpdate and PreparingDelete, which are used for graceful application termination. For details, please refer to the Community Documentation. In order to support graceful application deployment, a new state called PreNormal has been added, as follows:
apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
spec:
# define with finalizer
lifecycle:
preNormal:
finalizersHandler:
- example.io/unready-blocker
# or define with label
lifecycle:
preNormal:
labelsHandler:
example.io/block-unready: "true"
When CloneSet creates a Pod (including normal scaling and upgrades):
This is useful for some post-checks when creating Pods, such as checking if the Pod has been mounted to the SLB backend, so as to avoid traffic loss caused by new instance mounting failure after the old instance is destroyed during rolling upgrade.
When creating a CRR resource, if the container is in the process of starting up, the CRR will not restart the container again. If you want to force a container restart, you can enable the following field:
apiVersion: apps.kruise.io/v1alpha1
kind: ContainerRecreateRequest
spec:
...
strategy:
forceRecreate: true
When Kubelet creates a Pod, Kubelet will attach metadata to the container runtime using CRI interface. The image repository can use this metadata information to identify the business related to the starting container. Some container actions of low business value can be degraded to protect the overloaded repository.
OpenKruise's imagePullJob also supports similar capabilities, as follows:
apiVersion: apps.kruise.io/v1alpha1
kind: ImagePullJob
spec:
...
image: nginx:1.9.1
sandboxConfig:
annotations:
io.kubernetes.image.metrics.tags: "cluster=cn-shanghai"
labels:
io.kubernetes.image.app: "foo"
Welcome to get involved with OpenKruise by joining us in Github/Slack. Have something you’d like to broadcast to our community? Share your voice through the channels below:
Join the community on Slack.
A Quick View of OpenYurt v1.2: Cloud-Edge Traffic Peak Reduced by 90% Compared to Native Kubernetes
503 posts | 48 followers
FollowAlibaba Cloud Community - May 6, 2023
Alibaba Cloud Community - February 20, 2024
Alibaba Developer - March 31, 2021
Alibaba Developer - October 13, 2020
Alibaba Cloud Native Community - October 18, 2022
Alibaba Cloud Native Community - June 19, 2024
503 posts | 48 followers
FollowAccelerate and secure the development, deployment, and management of containerized applications cost-effectively.
Learn MoreMulti-source metrics are aggregated to monitor the status of your business and services in real time.
Learn MoreAlibaba Cloud Container Service for Kubernetes is a fully managed cloud container management service that supports native Kubernetes and integrates with other Alibaba Cloud products.
Learn MoreProvides a control plane to allow users to manage Kubernetes clusters that run based on different infrastructure resources
Learn MoreMore Posts by Alibaba Cloud Native Community