Kubernetes Taints and Tolerations

By Alwyn Botha, Alibaba Cloud Community Blog author.

Taints and tolerations are best understood by working through exercises, which is what this tutorial provides.

A taint marks a node as undesirable for Pods. A Pod specifies tolerations, meaning it will tolerate a node that carries certain taints.

You can use taints and tolerations to deliberately prevent certain Pods from running on a node, or to deliberately let only certain Pods run on a node (for example, Pods that need SSDs or GPUs).

This tutorial contains several examples of taints and tolerations to help you get practical experience of this abstract concept.

Prerequisites

This tutorial will work best if run on a Kubernetes cluster with only one node.

One node gets tainted and Pods are run to determine if they tolerate the taints on that one node.

If you have a large cluster, your Pod will automatically run on any of the untainted nodes.

To learn taints and tolerations fastest it is best to have access to only one node.

If you must run this tutorial on a multi-node cluster, you can simulate a single node by pinning every Pod spec below to the one node you control.

nodeName reference information: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#nodename

A caveat: setting nodeName: my-node-name binds the Pod directly to that node and bypasses the scheduler, so NoSchedule taints are not checked and the Pending results below will not reproduce. A nodeSelector on the kubernetes.io/hostname label targets the same ONE node while keeping the scheduler, and therefore the taint checks, in play.

Either way, you can then learn in a tiny environment that you completely control.

Unfortunately this part of the tutorial will make most sense after you have done all the exercises below.

So for the moment just believe me: run on a single-node cluster or pin your Pods to one node.

After you have followed this tutorial, redo the Pods that failed in a cluster with more than one node. You will see that those Pods, unschedulable on a cluster with one tainted node, get scheduled on the other (untainted) nodes. ( These previous 2 sentences will also make sense after you have done the complete tutorial. )

Taint Beginner Demo: NoSchedule

We add taints to a node using this syntax:

kubectl taint nodes node-name key=value:NoSchedule
kubectl taint nodes node-name key=value:NoExecute

You have to supply your own node name, key, and value.

Our first taint:

kubectl taint nodes minikube dedicated-app=my-dedi-app-a:NoSchedule

This taints our node called minikube with the key dedicated-app, the value my-dedi-app-a, and the effect NoSchedule: Pods that do not tolerate this taint cannot be scheduled on this node.
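The taint argument has the shape key=value:effect. As a rough illustration (this is not kubectl's actual parser), here is a minimal Python sketch that splits such a string into its three parts. The names Taint and parse_taint are hypothetical and exist only for this example.

```python
from typing import NamedTuple

# The three taint effects used in this tutorial.
VALID_EFFECTS = {"NoSchedule", "PreferNoSchedule", "NoExecute"}

class Taint(NamedTuple):
    key: str
    value: str
    effect: str

def parse_taint(spec: str) -> Taint:
    """Split 'key=value:effect' (the value part is optional) into a Taint."""
    key_value, sep, effect = spec.rpartition(":")
    if not sep or effect not in VALID_EFFECTS:
        raise ValueError(f"missing or unknown effect in {spec!r}")
    key, _, value = key_value.partition("=")
    if not key:
        raise ValueError(f"missing key in {spec!r}")
    return Taint(key, value, effect)

print(parse_taint("dedicated-app=my-dedi-app-a:NoSchedule"))
```

The effect is the part that decides what happens to Pods without a matching toleration; the key and value only identify the taint.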

Note the label below: app: my-dedi-app-a. It only identifies the Pod; labels play no part in taint matching.

nano mybusybox.yaml

apiVersion: v1
kind: Pod
metadata:
  name: mybusypod
  labels:
    app: my-dedi-app-a
spec:
  containers:
  - name: my-dedi-container-a
    image: busybox
    imagePullPolicy: IfNotPresent
    
    command: ['sh', '-c', 'sleep 3600']
        
  restartPolicy: Never
  terminationGracePeriodSeconds: 0

Our Pod uses busybox to sleep for 3600 seconds.

Note there are no tolerations in our Pod spec.

Our node is tainted, but our Pod does not have a toleration for that taint.

Theory suggests that this Pod will not be able to run on this node. Let's investigate:

Create Pod

kubectl create -f mybusybox.yaml
pod/mybusypod created

Get list of Pods:

kubectl get pods
NAME        READY   STATUS    RESTARTS   AGE
mybusypod   0/1     Pending   0          5s

Only relevant output from describe command:

kubectl describe pod/mybusypod

Name:               mybusypod
Labels:             app=my-dedi-app-a
Status:             Pending
Containers:
  my-dedi-container-a:
    Image:      busybox
   
Conditions:
  Type           Status
  PodScheduled   False
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  25s   default-scheduler  0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.

The last line explains what happened: 1 node(s) had taints that the pod didn't tolerate.

PodScheduled False means the Pod could not be scheduled onto this node.

Tainting works. It prevents Pods that do not have a toleration for that taint from running on that node.

( I only have one Kubernetes node. In a setup with several nodes, the scheduler considers every node until it finds one where this Pod can run, or it reports something like: 0/397 nodes are available: 397 node(s) had taints that the pod didn't tolerate. )

kubectl delete -f mybusybox.yaml
pod "mybusypod" deleted

Let's now add a toleration for that taint to our Pod:

Note that the last 5 lines add a toleration.

nano mybusybox.yaml

apiVersion: v1
kind: Pod
metadata:
  name: mybusypod
  labels:
    app: my-dedi-app-a
spec:
  containers:
  - name: my-dedi-container-a
    image: busybox
    imagePullPolicy: IfNotPresent
    
    command: ['sh', '-c', 'sleep 3600']
        
  restartPolicy: Never
  terminationGracePeriodSeconds: 0 
  
  tolerations:
  - key: "dedicated-app"
    operator: "Equal"
    value: "my-dedi-app-a"
    effect: "NoSchedule"

Create:

kubectl create -f mybusybox.yaml
pod/mybusypod created

List Pods:

kubectl get pods
NAME        READY   STATUS    RESTARTS   AGE
mybusypod   1/1     Running   0          4s

Success: the Pod is now running. It tolerated the node with that taint.
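To make the matching rule concrete, here is a minimal Python sketch of how an operator: Equal toleration can be compared against a taint: key, value, and effect must all agree. The dicts mirror the YAML fields; the function name tolerates is hypothetical, and the real scheduler handles more cases than this simplified model.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if an operator=Equal toleration matches the taint."""
    return (
        toleration.get("operator", "Equal") == "Equal"
        and toleration["key"] == taint["key"]
        and toleration.get("value", "") == taint.get("value", "")
        and toleration["effect"] == taint["effect"]
    )

taint = {"key": "dedicated-app", "value": "my-dedi-app-a", "effect": "NoSchedule"}
toleration = {"key": "dedicated-app", "operator": "Equal",
              "value": "my-dedi-app-a", "effect": "NoSchedule"}

# The Pod with the toleration matches the taint.
print(tolerates(toleration, taint))            # True
# The first Pod had no tolerations at all, so nothing can match.
print(any(tolerates(t, taint) for t in []))    # False
```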

Note the important Tolerations: dedicated-app=my-dedi-app-a:NoSchedule line below.

The other 2 tolerations are automatically added by Kubernetes to all Pods.

kubectl describe pod/mybusypod

Name:               mybusypod
Node:               minikube/10.0.2.15
Start Time:         Mon, 11 Feb 2019 07:53:59 +0200
Labels:             app=my-dedi-app-a
Status:             Running
Containers:
  my-dedi-container-a:
    Command:
    State:          Running
    Ready:          True
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Tolerations:     dedicated-app=my-dedi-app-a:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  11s   default-scheduler  Successfully assigned default/mybusypod to minikube
  Normal  Pulled     10s   kubelet, minikube  Container image "busybox" already present on machine
  Normal  Created    10s   kubelet, minikube  Created container
  Normal  Started    10s   kubelet, minikube  Started container

kubectl delete -f mybusybox.yaml
pod "mybusypod" deleted

NoExecute Taint

We keep our existing taint and toleration in place. ( It works )

We add another taint, with key dedicated-app-exec, value my-dedi-app-a, and effect NoExecute: Pods that do not tolerate it must not run on our node.

kubectl taint nodes minikube dedicated-app-exec=my-dedi-app-a:NoExecute
node/minikube tainted

Investigate first few lines of our node:

Note we now have 2 taints ( at bottom ).

kubectl describe node | head -n13

Name:               minikube
Roles:              master
Taints:             dedicated-app-exec=my-dedi-app-a:NoExecute
                    dedicated-app=my-dedi-app-a:NoSchedule

Our Pod spec is as before: no toleration for this second taint.

nano mybusybox.yaml

apiVersion: v1
kind: Pod
metadata:
  name: mybusypod
  labels:
    app: my-dedi-app-a
spec:
  containers:
  - name: my-dedi-container-a
    image: busybox
    imagePullPolicy: IfNotPresent
    
    command: ['sh', '-c', 'sleep 3600']
        
  restartPolicy: Never
  terminationGracePeriodSeconds: 0 
  
  tolerations:
  - key: "dedicated-app"
    operator: "Equal"
    value: "my-dedi-app-a"
    effect: "NoSchedule"

We expect our Pod to be unable to run on this tainted node.

kubectl create -f mybusybox.yaml
pod/mybusypod created

List Pods:

kubectl get pods
NAME        READY   STATUS    RESTARTS   AGE
mybusypod   0/1     Pending   0          3s

As expected, Pod is pending.

Investigate why: look at the final line in the output.

kubectl describe pod/mybusypod

Name:               mybusypod
Labels:             app=my-dedi-app-a
Status:             Pending
IP:
Containers:
  my-dedi-container-a:
    Image:      busybox
Conditions:
  Type           Status
  PodScheduled   False
Tolerations:     dedicated-app=my-dedi-app-a:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Warning  FailedScheduling  17s (x2 over 17s)  default-scheduler  0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.

Behavior as expected.

It is disappointing that the warning does not state WHICH taint the Pod did not tolerate, at least in the Kubernetes version used here. ( In production you will have many nodes, each with many taints, and long lists of tolerations on your Pods, so you have to go through those lists manually to find the taint that is not tolerated. )
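Going through the lists by hand amounts to a set difference. The sketch below is a simplification that assumes only operator: Equal tolerations and ignores the two default tolerations Kubernetes adds; it represents taints and tolerations as (key, value, effect) tuples copied from the kubectl describe output above.

```python
# Taints on the node, as (key, value, effect) tuples.
node_taints = [
    ("dedicated-app-exec", "my-dedi-app-a", "NoExecute"),
    ("dedicated-app", "my-dedi-app-a", "NoSchedule"),
]

# The Pod's operator=Equal tolerations in the same shape.
pod_tolerations = {
    ("dedicated-app", "my-dedi-app-a", "NoSchedule"),
}

# Simplification: an Equal toleration matches only an identical triple.
untolerated = [t for t in node_taints if t not in pod_tolerations]

for key, value, effect in untolerated:
    print(f"{key}={value}:{effect}")
# prints: dedicated-app-exec=my-dedi-app-a:NoExecute
```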

NoExecute Toleration

Now we specify a toleration for the new taint so that our Pod can run on the node. ( Note the last 5 lines. )

nano mybusybox.yaml

apiVersion: v1
kind: Pod
metadata:
  name: mybusypod
  labels:
    app: my-dedi-app-a
spec:
  containers:
  - name: my-dedi-container-a
    image: busybox
    imagePullPolicy: IfNotPresent
    
    command: ['sh', '-c', 'sleep 3600']
        
  restartPolicy: Never
  terminationGracePeriodSeconds: 0 
  
  tolerations:
  - key: "dedicated-app"
    operator: "Equal"
    value: "my-dedi-app-a"
    effect: "NoSchedule"
        
  - key: "dedicated-app-exec"
    operator: "Equal"
    value: "my-dedi-app-a"
    effect: "NoExecute"
    tolerationSeconds: 60

The tolerationSeconds: 60 setting specifies that our Pod tolerates the dedicated-app-exec taint for only 60 seconds. After 60 seconds it is evicted from the node.
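The arithmetic behind tolerationSeconds can be sketched as follows. This is a simplified model that assumes the countdown starts when the Pod lands on the already tainted node; the timestamp is made up for illustration and eviction_deadline is a hypothetical helper, not a Kubernetes API.

```python
from datetime import datetime, timedelta

def eviction_deadline(bound_at: datetime, toleration_seconds: int) -> datetime:
    """Earliest time the Pod may be evicted under this simplified model."""
    return bound_at + timedelta(seconds=toleration_seconds)

bound_at = datetime(2019, 2, 11, 8, 0, 0)   # made-up timestamp for illustration
deadline = eviction_deadline(bound_at, 60)
print(deadline)  # 2019-02-11 08:01:00

# A 'sleep 3600' container outlives the deadline, so the Pod gets evicted;
# a container that sleeps for only 10 seconds would finish first.
for run_seconds in (3600, 10):
    finishes_first = bound_at + timedelta(seconds=run_seconds) <= deadline
    print(run_seconds, "completes before eviction:", finishes_first)
```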

Create Pod

kubectl create -f mybusybox.yaml
pod/mybusypod created

kubectl get pods
NAME        READY   STATUS    RESTARTS   AGE
mybusypod   1/1     Running   0          2s

A minute later ...

kubectl get pods
No resources found.

The Pod was automatically evicted from our node.

which-end: frontend

A node may have any number of taints.

Pods may have any number of tolerations.

We now add another taint. First we add a matching label to our Pod, purely as a description: labels do not affect taint matching.

Below we add the which-end: frontend label.

nano mybusybox.yaml

apiVersion: v1
kind: Pod
metadata:
  name: mybusypod
  labels:
    app: my-dedi-app-a
    which-end: frontend
spec:
  containers:
  - name: my-dedi-container-a
    image: busybox
    imagePullPolicy: IfNotPresent
    
    command: ['sh', '-c', 'sleep 10']
        
  restartPolicy: Never
  terminationGracePeriodSeconds: 0 
  
  tolerations:
  - key: "dedicated-app"
    operator: "Equal"
    value: "my-dedi-app-a"
    effect: "NoSchedule"

  - key: "dedicated-app-exec"
    operator: "Equal"
    value: "my-dedi-app-a"
    effect: "NoExecute"
    tolerationSeconds: 60

We taint our node with a new taint using the which-end key.

kubectl taint nodes minikube which-end=frontend:NoSchedule
node/minikube tainted

kubectl describe node | head -n13
Name:               minikube
Taints:             dedicated-app-exec=my-dedi-app-a:NoExecute
                    dedicated-app=my-dedi-app-a:NoSchedule
                    which-end=frontend:NoSchedule

Our Pod has no toleration for this taint, so scheduling will fail.

Tolerations:     dedicated-app=my-dedi-app-a:NoSchedule
                 dedicated-app-exec=my-dedi-app-a:NoExecute for 60s
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s

kubectl create -f mybusybox.yaml
pod/mybusypod created

kubectl get pods
NAME        READY   STATUS    RESTARTS   AGE
mybusypod   0/1     Pending   0          2s

Events:
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Warning  FailedScheduling  12s (x2 over 12s)  default-scheduler  0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.

All this should be old news to you at this point.

In the next section you will learn an alternative way to tolerate taints.

kubectl delete -f mybusybox.yaml
pod "mybusypod" deleted

Wildcard Global Tolerations

Note the last 2 lines of our Pod spec:

  - key: "which-end"
    operator: "Exists"

This special syntax specifies that our Pod tolerates every taint with the key which-end, whatever its value. Because no effect is given, it matches every effect as well.

Note our Pod spec labels below : which-end: frontend ... our Pod provides front-end functionality.

nano mybusybox.yaml

apiVersion: v1
kind: Pod
metadata:
  name: mybusypod
  labels:
    app: my-dedi-app-a
    which-end: frontend
spec:
  containers:
  - name: my-dedi-container-a
    image: busybox
    imagePullPolicy: IfNotPresent
    
    command: ['sh', '-c', 'sleep 10']
        
  restartPolicy: Never
  terminationGracePeriodSeconds: 0 
  
  tolerations:
  - key: "dedicated-app"
    operator: "Equal"
    value: "my-dedi-app-a"
    effect: "NoSchedule"

  - key: "dedicated-app-exec"
    operator: "Equal"
    value: "my-dedi-app-a"
    effect: "NoExecute"
    tolerationSeconds: 60  
    
  - key: "which-end"
    operator: "Exists"

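Note the command 'sleep 10': we only let the Pod run for 10 seconds. Its NoExecute toleration has tolerationSeconds: 60, so the Pod runs to completion well before it would be evicted.

Create the Pod with kubectl create -f mybusybox.yaml as before, then list Pods: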

kubectl get pods

NAME        READY   STATUS    RESTARTS   AGE
mybusypod   1/1     Running   0          3s

NAME        READY   STATUS      RESTARTS   AGE
mybusypod   0/1     Completed   0          18s

List of tolerations for our Pod - note which-end at bottom. Pod tolerates ALL which-end taints.

Tolerations:     dedicated-app=my-dedi-app-a:NoSchedule
                 dedicated-app-exec=my-dedi-app-a:NoExecute for 60s
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
                 which-end

kubectl delete -f mybusybox.yaml
pod "mybusypod" deleted

Tolerate All Taints

Note the special syntax below: tolerate ALL taints.

  tolerations:
  - operator: "Exists"
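The two wildcard forms can be compared side by side. The following is a simplified Python model of the matching rule, written for illustration only: an empty effect matches every effect, an empty key matches every key, and operator: Exists ignores the value. The function name tolerates is hypothetical, and the taint list assumes the three taints currently on the node.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Simplified model of the toleration/taint matching rule."""
    effect = toleration.get("effect", "")
    key = toleration.get("key", "")
    if effect and effect != taint["effect"]:
        return False          # an empty effect matches every effect
    if key and key != taint["key"]:
        return False          # an empty key matches every key
    operator = toleration.get("operator", "Equal")
    if operator == "Exists":
        return True           # the value is ignored
    return operator == "Equal" and toleration.get("value", "") == taint.get("value", "")

node_taints = [
    {"key": "dedicated-app-exec", "value": "my-dedi-app-a", "effect": "NoExecute"},
    {"key": "dedicated-app", "value": "my-dedi-app-a", "effect": "NoSchedule"},
    {"key": "which-end", "value": "frontend", "effect": "NoSchedule"},
]

key_wildcard = {"key": "which-end", "operator": "Exists"}   # all which-end taints
tolerate_all = {"operator": "Exists"}                       # every taint

print([tolerates(key_wildcard, t) for t in node_taints])  # [False, False, True]
print([tolerates(tolerate_all, t) for t in node_taints])  # [True, True, True]
```

In this model the bare Exists toleration matches every taint, which is why the Pod spec below needs no other tolerations.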

nano mybusybox.yaml

apiVersion: v1
kind: Pod
metadata:
  name: mybusypod
  labels:
    app: my-dedi-app-a
    which-end: frontend
spec:
  containers:
  - name: my-dedi-container-a
    image: busybox
    imagePullPolicy: IfNotPresent
    
    command: ['sh', '-c', 'sleep 10']
        
  restartPolicy: Never
  terminationGracePeriodSeconds: 0 
  
  tolerations:
  - operator: "Exists"

Create Pod

kubectl create -f mybusybox.yaml
pod/mybusypod created

We see below that our Pod runs to completion with no problems.

kubectl get pods
NAME        READY   STATUS    RESTARTS   AGE
mybusypod   1/1     Running   0          2s

kubectl get pods
NAME        READY   STATUS    RESTARTS   AGE
mybusypod   1/1     Running   0          7s

kubectl get pods
NAME        READY   STATUS    RESTARTS   AGE
mybusypod   1/1     Running   0          12s

kubectl get pods
NAME        READY   STATUS      RESTARTS   AGE
mybusypod   0/1     Completed   0          16s

Investigate the kubectl describe pod/mybusypod output:

Tolerations:

Amazing: nothing in that output indicates that our Pod tolerates all taints.

kubectl delete -f mybusybox.yaml
pod "mybusypod" deleted

PreferNoSchedule

A NoSchedule taint prevents Pods from being scheduled on a node.

A PreferNoSchedule taint also discourages scheduling Pods on a node, BUT if no suitable untainted node can be found, the scheduler WILL place the Pod on that node.

From https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/

This is a "preference" or "soft" version of NoSchedule – the system will try to avoid placing a pod that does not tolerate the taint on the node, but it is not required.
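The difference between the hard and the soft effect can be sketched as a filter step followed by a preference step. The Python below is a toy model, not the real scheduler (which weighs many other factors): it assumes Equal-only matching on (key, value, effect) tuples, and pick_node and node-b are hypothetical names.

```python
def pick_node(nodes: dict, pod_tolerations: set):
    """Pick a node name, or None if every node is ruled out.

    nodes maps node name -> list of (key, value, effect) taints;
    pod_tolerations is a set of identical triples (Equal matching only).
    """
    candidates = []
    for name, taints in nodes.items():
        untolerated = [t for t in taints if t not in pod_tolerations]
        # Hard rule: any untolerated NoSchedule/NoExecute taint excludes the node.
        if any(effect in ("NoSchedule", "NoExecute") for _, _, effect in untolerated):
            continue
        # Soft rule: count untolerated PreferNoSchedule taints as a penalty.
        penalty = sum(1 for _, _, effect in untolerated if effect == "PreferNoSchedule")
        candidates.append((penalty, name))
    return min(candidates)[1] if candidates else None

soft = ("dedicated-app", "my-dedi-app-a", "PreferNoSchedule")
hard = ("dedicated-app", "my-dedi-app-a", "NoSchedule")

# Only one node, and it is softly tainted: the Pod still lands there.
print(pick_node({"minikube": [soft]}, set()))                 # minikube
# With an untainted alternative, this model prefers it.
print(pick_node({"minikube": [soft], "node-b": []}, set()))   # node-b
# A hard taint with no toleration leaves no candidate at all.
print(pick_node({"minikube": [hard]}, set()))                 # None
```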

Current taints on our node:

Taints:             dedicated-app-exec=my-dedi-app-a:NoExecute
                    dedicated-app=my-dedi-app-a:NoSchedule
                    which-end=frontend:NoSchedule

Let's remove all these taints. ( Note the hyphen at the end of each taint. I read it as: subtract this taint from the node. )

kubectl taint nodes minikube dedicated-app-exec:NoExecute-
kubectl taint nodes minikube dedicated-app:NoSchedule-
kubectl taint nodes minikube which-end:NoSchedule-
node/minikube untainted
node/minikube untainted
node/minikube untainted

Our node is now totally untainted:

Taints:             <none>
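The add and remove forms of the taint argument can be modeled on a plain list. This Python sketch only imitates the argument shapes used in this tutorial (key=value:effect to add, key:effect- to remove, and key- to remove every taint with that key); apply_taint_arg is a hypothetical helper, not part of kubectl, and other argument forms are not handled.

```python
def apply_taint_arg(taints: list, arg: str) -> list:
    """Return a new taint list after applying one 'kubectl taint'-style argument.

    'key=value:effect' adds a taint; 'key:effect-' removes the taint with that
    key and effect; 'key-' removes every taint with that key.
    """
    if arg.endswith("-"):
        key, _, effect = arg[:-1].partition(":")
        return [t for t in taints
                if not (t[0] == key and (not effect or t[2] == effect))]
    key_value, _, effect = arg.rpartition(":")
    key, _, value = key_value.partition("=")
    return taints + [(key, value, effect)]

taints = []
for arg in ("dedicated-app-exec=my-dedi-app-a:NoExecute",
            "dedicated-app=my-dedi-app-a:NoSchedule",
            "which-end=frontend:NoSchedule"):
    taints = apply_taint_arg(taints, arg)
print(len(taints))  # 3

for arg in ("dedicated-app-exec:NoExecute-",
            "dedicated-app:NoSchedule-",
            "which-end:NoSchedule-"):
    taints = apply_taint_arg(taints, arg)
print(taints)  # []
```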

Let's only add PreferNoSchedule taint - so that we can see how it works.

kubectl taint nodes minikube dedicated-app=my-dedi-app-a:PreferNoSchedule
node/minikube tainted

Our Pod spec specifies NO tolerations.

nano mybusybox.yaml

apiVersion: v1
kind: Pod
metadata:
  name: mybusypod
  labels:
    app: my-dedi-app-a
    which-end: frontend
spec:
  containers:
  - name: my-dedi-container-a
    image: busybox
    imagePullPolicy: IfNotPresent
    
    command: ['sh', '-c', 'sleep 10']
        
  restartPolicy: Never
  terminationGracePeriodSeconds: 0

Create Pod

kubectl create -f mybusybox.yaml
pod/mybusypod created

The scheduler could not find any untainted node, so the PreferNoSchedule taint does not stop this Pod from running on this node.

kubectl get pods
NAME        READY   STATUS              RESTARTS   AGE
mybusypod   0/1     ContainerCreating   0          2s

NAME        READY   STATUS    RESTARTS   AGE
mybusypod   1/1     Running   0          5s

NAME        READY   STATUS    RESTARTS   AGE
mybusypod   1/1     Running   0          8s

NAME        READY   STATUS      RESTARTS   AGE
mybusypod   0/1     Completed   0          13s

kubectl delete -f mybusybox.yaml
pod "mybusypod" deleted

Running Pods on Unschedulable Nodes

Extract from details about our node.

kubectl describe node | head -n13
Name:               minikube
Taints:             dedicated-app=my-dedi-app-a:PreferNoSchedule
Unschedulable:      false

Unschedulable: false means Schedulable: true

( Why the double negative: Unschedulable ... I do not know )

We can set the node to Unschedulable = true by cordoning the node.

kubectl cordon minikube
node/minikube cordoned

Node will not allow any new Pods to start running :

Create Pod

kubectl create -f mybusybox.yaml
pod/mybusypod created

kubectl get pods

NAME        READY   STATUS    RESTARTS   AGE
mybusypod   0/1     Pending   0          3s

Pending as expected

Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  7s    default-scheduler  0/1 nodes are available: 1 node(s) were unschedulable.

We use uncordon to allow Pods to run on node again.

kubectl uncordon minikube
node/minikube uncordoned

Our Pod starts running automatically.

kubectl get pods
NAME        READY   STATUS    RESTARTS   AGE
mybusypod   1/1     Running   0          45s

kubectl get pods
NAME        READY   STATUS    RESTARTS   AGE
mybusypod   1/1     Running   0          52s

kubectl get pods
NAME        READY   STATUS      RESTARTS   AGE
mybusypod   0/1     Completed   0          60s

kubectl delete -f mybusybox.yaml
pod "mybusypod" deleted

Note the last 3 lines of our Pod spec: this is how we tolerate the unschedulable condition.

nano mybusybox.yaml

apiVersion: v1
kind: Pod
metadata:
  name: mybusypod
  labels:
    app: my-dedi-app-a
    which-end: frontend
spec:
  containers:
  - name: my-dedi-container-a
    image: busybox
    imagePullPolicy: IfNotPresent
    
    command: ['sh', '-c', 'sleep 10']
        
  restartPolicy: Never
  terminationGracePeriodSeconds: 0 

  tolerations:
  - key: "node.kubernetes.io/unschedulable"
    operator: "Exists"

Mark the node as unschedulable again:

kubectl cordon minikube
node/minikube cordoned

Attempt to run our Pod :

kubectl create -f mybusybox.yaml
pod/mybusypod created

Success below: our Pod tolerates the unschedulable status.

kubectl get pods
NAME        READY   STATUS    RESTARTS   AGE
mybusypod   1/1     Running   0          2s

NAME        READY   STATUS    RESTARTS   AGE
mybusypod   1/1     Running   0          7s

NAME        READY   STATUS    RESTARTS   AGE
mybusypod   1/1     Running   0          11s

NAME        READY   STATUS      RESTARTS   AGE
mybusypod   0/1     Completed   0          15s

Using the same syntax you can ( during emergencies ) run Pods on nodes with these conditions:

node.kubernetes.io/unreachable
node.kubernetes.io/out-of-disk
node.kubernetes.io/memory-pressure
node.kubernetes.io/disk-pressure
node.kubernetes.io/network-unavailable
node.kubernetes.io/unschedulable

These are not tolerations you should add to your Pods in general. This trick is for emergency use only.

Most critical Kubernetes system daemons tolerate those taints. ( These daemons must continue to run during these out of resources conditions to keep the Kubernetes system running. )

Make node available again.

kubectl uncordon minikube
node/minikube uncordoned

kubectl delete -f mybusybox.yaml
pod "mybusypod" deleted

Cleanup

Determine list of taints on this node:

kubectl describe node | head -n13

Name:               minikube
Taints:             dedicated-app=my-dedi-app-a:PreferNoSchedule
Unschedulable:      false
Conditions:

Let's remove taints.

kubectl taint nodes minikube dedicated-app:PreferNoSchedule-
node/minikube untainted
