Container Service for Kubernetes (ACK) provides multiple methods to prevent container escapes, which allow attackers to access the host of a container. For example, you can forbid containers from running in privileged mode, require application pods to run as a non-root user, and disable automatic ServiceAccount token mounting. This topic describes how to configure pod security policies to harden your ACK clusters against attackers.
Prevent container escapes that allow attackers to escalate privileges
Kubernetes developers and O&M administrators must focus on how to prevent processes from escaping a container. A container escape allows an attacker to escalate privileges and take control of the host of the container. Preventing container escapes is important for the following reasons:
By default, the processes within a container run under the context of the root user in Linux. The operations that the root user can perform are restricted by the Linux capabilities that are assigned by Docker to the container. However, an attacker can exploit the default capabilities to escalate privileges or access sensitive information on the host, such as Secrets and ConfigMaps. The following list shows the default capabilities that are assigned to a Docker container. For more information, see the capabilities(7) Linux manual page.
cap_chown, cap_dac_override, cap_fowner, cap_fsetid, cap_kill, cap_setgid, cap_setuid, cap_setpcap, cap_net_bind_service, cap_net_raw, cap_sys_chroot, cap_mknod, cap_audit_write, cap_setfcap
To prevent container escapes, you must avoid running Docker containers in privileged mode because privileged containers are assigned all Linux capabilities of the root user.
All Kubernetes worker nodes use the node authorizer, which is a special-purpose authorization mode. The node authorizer is used to authorize all API requests that are sent by a kubelet. The node authorizer also allows a node to perform the following operations:
Read operations:
Services
Endpoints
Nodes
Pods
Secrets, ConfigMaps, persistent volumes (PVs), and persistent volume claims (PVCs) of pods that are deployed on the node on which the kubelet runs
Write operations:
Nodes and node status. You can enable the NodeRestriction admission controller to allow the kubelet to modify only the node on which the kubelet runs.
Pods and pod status. You can enable the NodeRestriction admission controller to allow the kubelet to modify only the pod that is bound to the node on which the kubelet runs.
Events
Authentication-related operations:
The read and write permissions on the CertificateSigningRequest (CSR) API for Transport Layer Security (TLS) bootstrapping
The permissions to create TokenReview and SubjectAccessReview for reviewing delegated identity authentication and authorization
By default, ACK clusters use the NodeRestriction admission controller. The NodeRestriction admission controller allows a kubelet to modify only a limited set of node attributes and pod objects that are bound to the node. However, the admission controller cannot prevent attackers from collecting sensitive information about the cluster environment by using the Kubernetes API. For more information, see NodeRestriction.
Suggestions on pod security
Forbid containers to run in privileged mode
Privileged containers inherit all Linux capabilities of the root user on the same host. In most scenarios, containers do not need these capabilities to handle workloads. You can create a pod security policy to forbid pods from running in privileged mode. A pod security policy is a group of constraints that a pod must meet before the pod can be created. ACK allows you to configure pod security policies based on Open Policy Agent (OPA) and Gatekeeper. The policies validate requests to create or update pods in your cluster against the security rules that you configure. If a request does not meet the configured rules, the request is rejected and an error is returned. You can also use the ACKPSPPrivilegedContainer security policy to forbid the deployment of privileged containers within the specified namespaces of the cluster.
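At the workload level, the same intent can be stated in the pod manifest itself. The following is a minimal sketch, not a definitive configuration: the pod name, container name, and image are placeholders, and the single capability that is added back is only an example of how a workload might keep what it actually needs.

```yaml
# Hypothetical pod: the names and the image are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: unprivileged-app
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0   # placeholder image
      securityContext:
        privileged: false          # the default value, stated explicitly
        capabilities:
          drop:
            - ALL                  # remove the default capability set
          add:
            - NET_BIND_SERVICE     # example only: add back what the app needs
```

Dropping all capabilities and adding back individual ones narrows what an attacker can do with the default capability set described earlier in this topic.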
Run pods as a non-root user
By default, all containers run as the root user. Attackers can exploit vulnerabilities in applications and gain access to the shell of a container that is running. This poses security risks. You can use multiple methods to mitigate the risks. You can delete the shell from the container image. You can also add the USER instruction to the Dockerfile or run the containers as a non-root user. The spec.securityContext attribute in the podSpec contains the runAsUser and runAsGroup fields. The two fields specify the user and user group under which containers are run. You can create an ACKPSPAllowedUsers policy to allow only the specified users and user groups to run containers.
Forbid users to run containers in Docker-in-Docker mode or mount Docker.sock to containers
You can efficiently build or deploy container images inside a Docker container by using the Docker-in-Docker method or mounting Docker.sock to containers. However, doing so grants control over the node to the processes that are running inside the container. For more information about how to build container images on Kubernetes, see Use a Container Registry Enterprise Edition instance to build an image, kaniko, and img.
Restrict the use of hostPath volumes, or allow only the mounting of hostPath volumes to directories that have specified prefixes and configure the volumes to be read-only
A hostPath volume mounts a directory from the host to a pod. In most cases, pods do not require hostPath volumes. Make sure that you understand the risks if you need to use hostPath volumes. By default, pods that run with root privileges have write permissions on the file systems that are exposed by using hostPath volumes. Attackers can modify the kubelet settings and create symbolic links to directories or files that are not directly exposed by hostPath volumes. For example, attackers can access /etc/shadow, install SSH keys, read Secrets that are mounted to the host, and perform other malicious activities. To mitigate the risks that arise from hostPath volumes, set readOnly to true in spec.containers.volumeMounts. Sample code:
volumeMounts:
- name: hostpath-volume
  readOnly: true
  mountPath: /host-path
You can also deploy the ACKPSPHostFilesystem policy to limit the host directories that can be mounted to pods in the specified namespaces of the cluster by using hostPath volumes.
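For context, the following sketch shows how a read-only mount fits into a complete pod manifest. The pod name, volume name, image, and host directory are assumptions chosen for illustration; replace them with the values that your workload requires.

```yaml
# Hypothetical pod: names, image, and the host path are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: hostpath-readonly
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0   # placeholder image
      volumeMounts:
        - name: host-logs
          mountPath: /host-logs
          readOnly: true           # the container cannot write through this mount
  volumes:
    - name: host-logs
      hostPath:
        path: /var/log             # example directory on the node
        type: Directory            # the directory must already exist on the node
```

Keeping the host path as narrow as possible limits what is exposed even if the read-only setting is later removed by mistake.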
Set resource requests and limits for each container to prevent resource contention and protect against DoS attacks
A pod without resource requests or limits can consume all of the resources on a host. If additional pods are scheduled to a node, the CPU or memory resources of the node may become insufficient. As a result, the kubelet may crash or pods may be evicted from the node. This issue cannot be entirely avoided. However, you can set resource requests and limits to minimize resource contention and reduce the risks from improperly programmed applications that consume excessive resources.
You can specify requests and limits for CPU and memory resources in the podSpec. You can set a resource quota or limit range on a namespace to enforce the use of requests and limits. A resource quota specifies the total amount of resources that are allocated to a namespace, such as CPU and memory resources. After you apply a resource quota to a namespace, the resource quota forces you to specify requests and limits for all containers deployed in the namespace. A limit range can be used to enforce fine-grained control on the resources that are allocated. You can set limit ranges to specify the maximum and minimum amounts of CPU and memory resources that each pod or container in a namespace can use. You can also use limit ranges to set default request values or limit values for containers that do not specify them. For more information, see Managing Resources for Containers.
You can also deploy the ACKContainerLimits policy to enforce resource limits on pods in the specified namespaces of the cluster.
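The three mechanisms described above can be sketched together as follows. The namespace team-a, the object names, the image, and all resource values are assumptions for illustration only; size them for your own workloads.

```yaml
# Resource quota: caps the total requests and limits in the namespace.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-a                # hypothetical namespace
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
---
# Limit range: per-container bounds, plus defaults for containers
# that do not specify requests or limits.
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults
  namespace: team-a
spec:
  limits:
    - type: Container
      min:
        cpu: 50m
        memory: 64Mi
      max:
        cpu: "2"
        memory: 2Gi
      defaultRequest:              # applied when a container sets no request
        cpu: 100m
        memory: 128Mi
      default:                     # applied when a container sets no limit
        cpu: 500m
        memory: 512Mi
---
# Pod that states its own requests and limits in the podSpec.
apiVersion: v1
kind: Pod
metadata:
  name: bounded-app
  namespace: team-a
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0   # placeholder image
      resources:
        requests:
          cpu: 250m
          memory: 256Mi
        limits:
          cpu: 500m
          memory: 512Mi
```

In this sketch the limit range supplies defaults, so pods that omit requests or limits can still be admitted under the quota.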
Forbid privilege escalation
Privilege escalation allows a process to change the security context under which it runs. For example, sudo and other binary files that have the SUID or SGID bit set allow a user to execute a file with the permissions of another user or user group. To prevent privilege escalation, you can use a pod security policy that sets the allowPrivilegeEscalation parameter to false, or set securityContext.allowPrivilegeEscalation to false in the podSpec.
You can also deploy the ACKPSPAllowPrivilegeEscalationContainer policy to enforce the configuration of the allowPrivilegeEscalation parameter for pods in the specified namespaces of the cluster.
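The following sketch shows where the allowPrivilegeEscalation field sits in a pod manifest, combined with the runAsUser and runAsGroup fields from the recommendation to run pods as a non-root user. The pod name, image, and the user and group IDs are assumptions for illustration.

```yaml
# Hypothetical pod: names, image, and IDs are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: no-privilege-escalation
spec:
  securityContext:                 # pod-level settings, apply to all containers
    runAsNonRoot: true             # refuse to start a container that would run as root
    runAsUser: 1000                # example UID
    runAsGroup: 3000               # example GID
  containers:
    - name: app
      image: registry.example.com/app:1.0   # placeholder image
      securityContext:             # container-level setting
        allowPrivilegeEscalation: false     # child processes cannot gain more privileges than the parent
```

Note that allowPrivilegeEscalation is set per container, while runAsUser and runAsGroup can be set once for the whole pod.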
Disable automatic ServiceAccount token mounting
For pods that do not need to access the Kubernetes API, you can disable automatic ServiceAccount token mounting in the podSpec of specific pods, or disable this feature for all pods that use a specific ServiceAccount.
apiVersion: v1
kind: Pod
metadata:
  name: pod-no-automount
spec:
  automountServiceAccountToken: false
After you disable automatic ServiceAccount token mounting for a pod, the pod can still access the Kubernetes API over the network. To prevent a pod from accessing the Kubernetes API, you must regulate access control on the endpoint of the ACK cluster and configure network policies to block the pod. For more information, see Use network policies in ACK clusters.
The following ServiceAccount disables automatic token mounting for all pods that use it:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: sa-no-automount
automountServiceAccountToken: false
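As a rough illustration of the network policy approach mentioned above, the following sketch denies all outbound traffic from pods that carry a given label, which also cuts off access to the Kubernetes API. The policy name, namespace, and label are hypothetical, and the sketch assumes that the network plugin of the cluster enforces NetworkPolicy objects. Because it blocks every egress destination, including DNS, treat it as a starting point and add egress rules for the traffic that the pod legitimately needs.

```yaml
# Hypothetical policy: the name, namespace, and label are placeholders.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all-egress
  namespace: default
spec:
  podSelector:
    matchLabels:
      api-access: "none"           # applies only to pods with this label
  policyTypes:
    - Egress
  # No egress rules are listed, so all outbound traffic from the
  # selected pods is denied.
```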
You can also deploy the ACKBlockAutomountToken policy to enforce the configuration of the automountServiceAccountToken: false field for application pods to prevent automatic ServiceAccount token mounting.
Disable service discovery
You can reduce the amount of information provided to a pod if the pod does not need to look up or call cluster services. You can configure the pod so that it neither uses CoreDNS for DNS resolution nor has the Services in its namespace exposed as environment variables. For more information, see Environment variables.
By default, the DNS policy of a pod is set to ClusterFirst, which requires the pod to use the in-cluster DNS service. If the DNS policy is set to Default, the pod uses the DNS resolution configurations from the underlying node. For more information, see Pod's DNS policy.
After you disable service links and change the DNS policy of a pod, the pod can still access the in-cluster DNS service. Attackers can enumerate services in an ACK cluster by accessing the in-cluster DNS service. For example, attackers can run the dig SRV *.*.svc.cluster.local @$CLUSTER_DNS_IP command to discover services in the cluster. For more information about how to prevent service discovery in a cluster, see Use network policies in ACK clusters.
apiVersion: v1
kind: Pod
metadata:
  name: pod-no-service-info
spec:
  dnsPolicy: Default # The value Default does not indicate the default setting of a DNS policy.
  enableServiceLinks: false
Configure container images to use a read-only file system
You can configure container images to use a read-only file system to prevent attackers from overwriting files in the file system that is used by your application. If your application must write data to the file system, you can set the application to write to a temporary directory or mount a volume to the application. To use a read-only root file system, set the following field in the securityContext of the container:
...
securityContext:
  readOnlyRootFilesystem: true
...
You can also deploy the ACKPSPReadOnlyRootFilesystem policy to enforce the use of a read-only root file system for pods in the specified namespaces of the cluster.
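The following sketch shows the setting in a complete pod manifest, together with a temporary directory for an application that still needs to write data. The pod name, image, volume name, and the /tmp mount path are assumptions for illustration.

```yaml
# Hypothetical pod: names, image, and the writable path are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: readonly-rootfs
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0   # placeholder image
      securityContext:
        readOnlyRootFilesystem: true        # the root file system of the container is read-only
      volumeMounts:
        - name: tmp
          mountPath: /tmp                   # the only writable path in this sketch
  volumes:
    - name: tmp
      emptyDir: {}                          # scratch space that is removed with the pod
```

An emptyDir volume keeps writes off the image file system and is discarded when the pod is deleted, so data that must persist needs a different volume type.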