Configure pod security policies and prevent container escapes that allow attackers to escalate privileges - Container Service for Kubernetes

Container Service for Kubernetes allows you to use multiple methods to prevent container escapes that allow attackers to access the host of containers. For example, you can forbid containers to run in privileged mode, require application pods to run as a non-root user, and disable automatic ServiceAccount token mounting. This topic describes how to configure pod security policies to enable security hardening for your ACK clusters to protect the ACK clusters from attackers.

Prevent container escapes that allow attackers to escalate privileges

Kubernetes developers and O&M administrators must focus on how to prevent processes from escaping a container. Container escapes allow attackers to escalate privileges and take control of the host of the container. Preventing container escapes is important for the following reasons:

By default, the processes within a container run under the context of the root user in Linux. The operations that the root user can perform are restricted by the Linux capabilities that Docker assigns to the container. However, an attacker can exploit the default capabilities to escalate privileges or access sensitive information on the host, such as Secrets and ConfigMaps. The following list shows the default capabilities that are assigned to a Docker container. For more information, see the capabilities(7) Linux manual page.
cap_chown, cap_dac_override, cap_fowner, cap_fsetid, cap_kill, cap_setgid, cap_setuid, cap_setpcap, cap_net_bind_service, cap_net_raw, cap_sys_chroot, cap_mknod, cap_audit_write, cap_setfcap.
To prevent container escapes, you must avoid running Docker containers in privileged mode because privileged containers are assigned all Linux capabilities of the root user.
All Kubernetes worker nodes use the node authorizer, which is a special-purpose authorization mode. The node authorizer is used to authorize all API requests that are sent by a kubelet. The node authorizer also allows a node to perform the following operations:
Read operations:
- Services
- Endpoints
- Nodes
- Pods
- Secrets, ConfigMaps, persistent volumes (PVs), and persistent volume claims (PVCs) of pods that are deployed on the node on which the kubelet runs
Write operations:
- Nodes and node status. You can enable the NodeRestriction admission controller to allow the kubelet to modify only the node on which the kubelet runs.
- Pods and pod status. You can enable the NodeRestriction admission controller to allow the kubelet to modify only the pod that is bound to the node on which the kubelet runs.
- Events
Authentication-related operations:
- The read and write permissions on the CertificateSigningRequest (CSR) API for Transport Layer Security (TLS) bootstrapping
- The permissions to create TokenReview and SubjectAccessReview for reviewing delegated identity authentication and authorization

By default, ACK clusters use the NodeRestriction admission controller. The NodeRestriction admission controller allows a kubelet to modify only a limited set of node attributes and pod objects that are bound to the node. However, the admission controller cannot prevent attackers from collecting sensitive information about the cluster environment by using the Kubernetes API. For more information, see NodeRestriction.

Suggestions on pod security

Forbid containers to run in privileged mode
Privileged containers inherit all Linux capabilities of the root user on the same host. In most scenarios, containers do not need these capabilities to handle workloads. You can create a pod security policy to forbid pods to run in privileged mode. The pod security policy is a group of constraints that a pod must meet before the pod can be created. ACK allows you to configure pod security policies based on Open Policy Agent (OPA) and Gatekeeper. The policies are used to validate requests for creating and updating pods in your cluster based on the security rules that you configure. If a request for creating or updating a pod does not meet the configured rules, the request is rejected and an error is returned. You can also use the ACKPSPPrivilegedContainer security policy to forbid the deployment of privileged containers within the specified namespaces of the cluster.
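As a minimal sketch of a pod that would pass such a policy, the following manifest explicitly disables privileged mode and drops all Linux capabilities. The pod name, container name, and image are hypothetical placeholders. Dropping all capabilities is an additional hardening step that this topic does not require; add back only the capabilities that your application needs.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-unprivileged                   # hypothetical name
spec:
  containers:
  - name: app                              # hypothetical container name
    image: registry.example.com/app:1.0    # placeholder image
    securityContext:
      privileged: false                    # do not run in privileged mode (also the Kubernetes default)
      capabilities:
        drop: ["ALL"]                      # assumption: the application needs none of the default capabilities
```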
Run pods as a non-root user
By default, all containers run as the root user. Attackers can exploit vulnerabilities in applications and gain access to the shell of a container that is running. This poses security risks. You can use multiple methods to mitigate the risks. You can delete the shell from the container image. You can also add the USER instruction to the Dockerfile or run the containers as a non-root user. The spec.securityContext attribute in the podSpec contains the runAsUser and runAsGroup fields. The two fields specify the user and user group under which containers are run. You can create an ACKPSPAllowedUsers policy to allow only the specified users and user groups to run containers.
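The following sketch shows how the runAsUser and runAsGroup fields can be set at the pod level. The UID and GID values, the pod name, and the image are illustrative assumptions. The runAsNonRoot field is an additional standard Kubernetes field, not mentioned above, that causes the kubelet to refuse to start a container that would run as UID 0.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-non-root                       # hypothetical name
spec:
  securityContext:
    runAsUser: 1000                        # assumed UID; must be valid for the image
    runAsGroup: 3000                       # assumed GID
    runAsNonRoot: true                     # reject containers that would run as root
  containers:
  - name: app                              # hypothetical container name
    image: registry.example.com/app:1.0    # placeholder image
```

Values set in spec.securityContext apply to every container in the pod unless a container overrides them in its own securityContext.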
Forbid users to run containers in Docker-in-Docker mode or mount Docker.sock to containers
You can efficiently build or deploy container images inside a Docker container by using the Docker-in-Docker method or mounting Docker.sock to containers. However, doing so grants control over the node to the processes that are running inside the container. For more information about how to build container images on Kubernetes, see Use a Container Registry Enterprise Edition instance to build an image, kaniko, and img.
Restrict the use of hostPath volumes, or allow only the mounting of hostPath volumes to directories that have specified prefixes and configure the volumes to be read-only
A hostPath volume mounts a directory from the host to a pod. In most cases, pods do not require hostPath volumes. Make sure that you understand the risks if you need to use hostPath volumes. By default, pods that run with root privileges have write permissions on the file systems that are exposed by using hostPath volumes. Attackers can modify the kubelet settings and create symbolic links to directories or files that are not directly exposed by hostPath volumes. For example, attackers can access /etc/shadow, install SSH keys, read Secrets that are mounted to the host, and perform other malicious activities. To mitigate the risks that arise from hostPath volumes, mount them as read-only by setting readOnly to true in spec.containers.volumeMounts. Sample code:
```
volumeMounts:
- name: hostpath-volume   # volume names must be lowercase DNS labels
  readOnly: true          # mount the host directory as read-only
  mountPath: /host-path
```
You can also deploy the ACKPSPHostFilesystem policy to limit the host directories that can be mounted to pods in the specified namespaces of the cluster by using hostPath volumes.
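For context, the volumeMounts fragment above belongs to a container, and the hostPath volume itself is declared under spec.volumes. A minimal sketch of a complete pod follows. The pod name, image, and the host directory /var/log/app are hypothetical placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-hostpath-readonly              # hypothetical name
spec:
  containers:
  - name: app                              # hypothetical container name
    image: registry.example.com/app:1.0    # placeholder image
    volumeMounts:
    - name: hostpath-volume                # must match the volume name below
      mountPath: /host-path
      readOnly: true                       # the container cannot write to the host directory
  volumes:
  - name: hostpath-volume
    hostPath:
      path: /var/log/app                   # assumed host directory; keep it as narrow as possible
      type: Directory                      # the directory must already exist on the node
```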
Set resource requests and limits for each container to prevent resource contention and protect against DoS attacks
A pod without resource requests or limits can consume all of the resources on a host. If additional pods are scheduled to a node, the CPU or memory resources of the node may become insufficient. As a result, the kubelet may crash or pods may be evicted from the node. This issue cannot be completely prevented. However, you can set resource requests and limits to minimize resource contention and reduce the risks from poorly written applications that consume excessive resources.
You can specify requests and limits for CPU and memory resources in the podSpec. You can also set a resource quota or limit range on a namespace to enforce the use of requests and limits. A resource quota specifies the total amount of resources, such as CPU and memory resources, that are allocated to a namespace. After you apply a resource quota to a namespace, the resource quota requires you to specify requests and limits for all containers deployed in the namespace. A limit range provides finer-grained control over resource allocation. You can set limit ranges to specify the maximum and minimum amounts of CPU and memory resources that each pod or container in a namespace can use. You can also use limit ranges to set default request or limit values for containers that do not specify them. For more information, see Managing Resources for Containers.
You can also deploy the ACKContainerLimits policy to enforce resource limits on pods in the specified namespaces of the cluster.
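The three mechanisms described above can be sketched as follows: a ResourceQuota that caps the namespace total, a LimitRange that sets per-container defaults and bounds, and a pod that declares its own requests and limits. The namespace demo, all object names, the image, and every numeric value are assumptions chosen for illustration only.

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota                      # hypothetical name
  namespace: demo                          # hypothetical namespace
spec:
  hard:
    requests.cpu: "4"                      # total CPU requests allowed in the namespace
    requests.memory: 8Gi
    limits.cpu: "8"                        # total CPU limits allowed in the namespace
    limits.memory: 16Gi
---
apiVersion: v1
kind: LimitRange
metadata:
  name: container-limits                   # hypothetical name
  namespace: demo
spec:
  limits:
  - type: Container
    defaultRequest:                        # applied when a container omits requests
      cpu: 250m
      memory: 128Mi
    default:                               # applied when a container omits limits
      cpu: 500m
      memory: 256Mi
    min:
      cpu: 100m
      memory: 64Mi
    max:
      cpu: "1"
      memory: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-with-resources                 # hypothetical name
  namespace: demo
spec:
  containers:
  - name: app                              # hypothetical container name
    image: registry.example.com/app:1.0    # placeholder image
    resources:
      requests:
        cpu: 250m
        memory: 128Mi
      limits:
        cpu: 500m
        memory: 256Mi
```

In this sketch the pod's values fall within the min and max bounds of the LimitRange, so the pod would be admitted as long as the namespace quota is not exhausted.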
Forbid privilege escalation
Privilege escalation allows a process to change the security context under which it runs. For example, sudo is a binary file that has the SUID or SGID bit set, which lets a user execute a file with the permissions of another user or user group. To prevent privilege escalation, you can use a pod security policy that sets the allowPrivilegeEscalation parameter to false, or set securityContext.allowPrivilegeEscalation to false in the podSpec.
You can also deploy the ACKPSPAllowPrivilegeEscalationContainer policy to enforce the configuration of the allowPrivilegeEscalation parameter for pods in the specified namespaces of the cluster.
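A minimal sketch of the podSpec setting is shown below. The allowPrivilegeEscalation field belongs to the securityContext of each container, not to the pod-level securityContext. The pod name, container name, and image are hypothetical placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-no-privilege-escalation        # hypothetical name
spec:
  containers:
  - name: app                              # hypothetical container name
    image: registry.example.com/app:1.0    # placeholder image
    securityContext:
      allowPrivilegeEscalation: false      # processes cannot gain more privileges than their parent
```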
Disable automatic ServiceAccount token mounting
For pods that do not need to access the Kubernetes API, you can disable automatic ServiceAccount token mounting in the podSpec of specific pods, or disable this feature for all pods that use a specific ServiceAccount.
```
apiVersion: v1
kind: Pod
metadata:
  name: pod-no-automount
spec:
  automountServiceAccountToken: false
```
After you disable automatic ServiceAccount token mounting for a pod, the pod can still access the Kubernetes API. To prevent a pod from accessing the Kubernetes API, you must regulate access control on the endpoint of the ACK cluster and configure network policies to block the pod. For more information, see Use network policies in ACK clusters.
```
apiVersion: v1
kind: ServiceAccount
metadata:
  name: sa-no-automount
automountServiceAccountToken: false
```
You can also deploy the ACKBlockAutomountToken policy to enforce the configuration of the automountServiceAccountToken: false field for application pods to prevent automatic ServiceAccount token mounting.
Disable service discovery
You can reduce the amount of information provided to a pod if the pod does not need to look up or call cluster services. To do so, set the Domain Name System (DNS) policy of the pod so that the pod does not use CoreDNS, and disable service links so that Services in the namespace of the pod are not exposed as environment variables. For more information, see Environment variables.
By default, the DNS policy of a pod is set to ClusterFirst, which requires the pod to use the in-cluster DNS service. If the DNS policy is set to Default, the pod is required to use the DNS resolution configurations from the underlying node. For more information, see Pod's DNS policy.
After you disable service links and change the DNS policy of a pod, the pod can still access the in-cluster DNS service. Attackers can enumerate services in an ACK cluster by accessing the in-cluster DNS service. For example, attackers can run the dig SRV *.*.svc.cluster.local @$CLUSTER_DNS_IP command to discover services in the cluster. For more information about how to prevent service discovery in a cluster, see Use network policies in ACK clusters.
```
apiVersion: v1
kind: Pod
metadata:
  name: pod-no-service-info
spec:
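  dnsPolicy: Default # "Default" inherits the DNS configuration of the node. It is not the default DNS policy, which is ClusterFirst.
  enableServiceLinks: false # Do not inject Service information as environment variables.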
```
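As a rough illustration of the network policy approach mentioned above, the following sketch denies all egress traffic, including DNS queries to the in-cluster DNS service, for pods that carry a given label. The namespace, policy name, and label are hypothetical, and the policy takes effect only if the network plugin of the cluster enforces NetworkPolicy resources.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all-egress                    # hypothetical name
  namespace: demo                          # hypothetical namespace
spec:
  podSelector:
    matchLabels:
      app: isolated                        # hypothetical label on the pods to isolate
  policyTypes:
  - Egress                                 # no egress rules are listed, so all outbound traffic is denied
```

Because this policy blocks all outbound traffic, a real workload would typically need additional egress rules for the destinations that it must reach.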
Configure container images to use a read-only file system
You can configure container images to use a read-only file system to prevent attackers from overwriting files in the file system that is used by your application. If your application must write data to the file system, you can set the application to write to a temporary directory or mount a volume to the application. You can configure a read-only root file system by setting the following field in the securityContext of each container:
```
...
securityContext:
  readOnlyRootFilesystem: true
...
```
You can also deploy the ACKPSPReadOnlyRootFilesystem policy to enforce the use of a read-only root file system for pods in the specified namespaces of the cluster.
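The following sketch combines a read-only root file system with a writable temporary directory, as suggested above for applications that must write data. It assumes that the application writes only to /tmp; the pod name, container name, image, and volume name are hypothetical placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-readonly-rootfs                # hypothetical name
spec:
  containers:
  - name: app                              # hypothetical container name
    image: registry.example.com/app:1.0    # placeholder image
    securityContext:
      readOnlyRootFilesystem: true         # the root file system of the container is mounted read-only
    volumeMounts:
    - name: tmp
      mountPath: /tmp                      # assumed to be the only path the application writes to
  volumes:
  - name: tmp
    emptyDir: {}                           # scratch space that is deleted together with the pod
```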