Kubernetes Scheduling
When we talk about scheduling, what we really want to know is:
what are the basic rules for assigning pods to nodes, and
in what order/priority are these resources managed?
These are the issues that the kube-scheduler deals with.
The kube-scheduler is responsible for evaluating and enforcing scheduling operations, including:
Pod affinity and anti-affinity.
Node taints and pod tolerations.
Priority and preemption, using PriorityClass.
Other scheduling constraints (e.g., resource requests and limits, node selectors).
These are what we will discuss in this part, under three corresponding topics: Affinity & Anti-Affinity, Taints & Tolerations, and Priority.
0. Assigning a Pod to a Specific Node
Before discussing the other rules, let's first look at one special case: when we need a pod to run on a specific node. Instead of using affinity, we can bypass affinity rules entirely and assign the pod directly to a node by setting nodeName under the pod's spec.
apiVersion: v1
kind: Pod
metadata:
  name: nginx-pod
spec:
  containers:
    - name: nginx-container
      image: nginx
  nodeName: node01
Here, nodeName: node01 specifies that the pod should only run on node01, effectively locking it to that node. This approach bypasses any affinity rules or constraints, as the scheduler no longer needs to select a node based on these rules.
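As a quick sanity check (assuming the manifest above has been applied), kubectl can show which node the pod was assigned to:
# the NODE column should show node01
kubectl get pod nginx-pod -o wide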
1. Affinity & Anti-Affinity
Affinity and anti-affinity act as "soft" preferences (or "hard" requirements) for where a pod should or should not run relative to other pods or nodes.
1.1 Affinity/Anti-Affinity Types
Node affinity - uses nodeSelectorTerms, which define specific criteria (e.g., labels) that nodes must meet for the pod to be scheduled there.
Inter-pod affinity/anti-affinity - uses podAffinityTerm, which defines conditions relative to other pods. For example, it might require certain pods to be co-located on the same node.
1.2 Scheduling Types
There are two types of scheduling rules: preferredDuringSchedulingIgnoredDuringExecution and requiredDuringSchedulingIgnoredDuringExecution.
preferredDuringSchedulingIgnoredDuringExecution (soft preference): this is a soft rule, meaning the scheduler will try to place the pod on nodes that meet these conditions but won't enforce it strictly. If a node matches the soft rule, the pod is more likely to be scheduled there; if no nodes match, the pod will still be scheduled on other nodes.
requiredDuringSchedulingIgnoredDuringExecution (hard constraint): this is a hard rule, meaning the pod must be scheduled on nodes that satisfy the specified conditions. If no nodes meet the hard constraint, the pod won't be scheduled at all.
1.3 Examples
Example with Node Affinity
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: "environment"
              operator: In
              values:
                - production
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
            - key: "diskType"
              operator: In
              values:
                - ssd
Here, the hard constraint requiredDuringSchedulingIgnoredDuringExecution specifies that the pod can only be scheduled on nodes labeled with environment=production. The soft preference preferredDuringSchedulingIgnoredDuringExecution then indicates a preference for nodes with diskType=ssd. The scheduler will try to prioritize ssd nodes but will still schedule on any production node if no ssd nodes are available.
Example with Pod Affinity
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: backend
        topologyKey: "kubernetes.io/hostname"
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: frontend
          topologyKey: "kubernetes.io/hostname"
The hard constraint here is that this pod must be scheduled on the same node as a pod labeled app=backend (using kubernetes.io/hostname as the topology key). The soft preference is to co-locate this pod with frontend pods on the same node if possible, but it's not strictly enforced.
1.4 topologyKey
The topologyKey controls how pods are spread across different topology domains (such as nodes, zones, or regions). For example, if you want high availability by spreading replicas across zones, you can use topologyKey in your affinity rules.
Example: Spreading Pods Across Zones
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: myapp
        topologyKey: "topology.kubernetes.io/zone"
In this example:
The topologyKey is set to topology.kubernetes.io/zone (the older failure-domain.beta.kubernetes.io/zone label is deprecated), which means the scheduler will not place two myapp pods in the same zone. This approach distributes replicas across different zones, making the application more resilient to zone-level failures.
Example: Ensuring Pods are on Different Nodes
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: myapp
        topologyKey: "kubernetes.io/hostname"
Here, setting topologyKey: kubernetes.io/hostname ensures that the scheduler never places two myapp pods on the same node. This is useful for spreading pods within the same zone but across multiple nodes for redundancy.
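If strict separation is too rigid (for example, when there are more replicas than nodes), the preferred variant expresses a best-effort spread instead. Here is a minimal sketch under that assumption:
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: myapp
          topologyKey: "kubernetes.io/hostname"
With this soft rule, the scheduler tries to avoid co-locating myapp pods but will still place them on a shared node when no better option exists.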
Above, we mentioned the concept of a "zone", which is widely used by cloud providers such as Azure and AWS; the corresponding term there is availability zone. For more information, check the Azure or AWS documentation.
2. Taints & Tolerations
Taints and tolerations act as a "hard" exclusion mechanism: a taint repels every pod that does not tolerate it, while a toleration allows (but does not require) a pod to be scheduled onto the tainted node.
2.1 Taints
To check a node's taints, or to taint a node or remove a taint with kubectl, use:
# check node taints
kubectl describe node node01 | grep Taints
# taint a node
kubectl taint nodes node01 key1=value1:NoSchedule
# remove the taint from the node. Notice: the minus sign (-) at the end.
kubectl taint nodes node01 key1=value1:NoSchedule-
# e.g.
kubectl taint nodes node01 nodeName=workerNode01:NoSchedule
The last command's taint means:
The taint key is "nodeName".
The taint value is "workerNode01".
The effect is "NoSchedule", which means pods that don't tolerate this taint will not be scheduled on this node.
Kubernetes nodes also come with built-in taints, such as:
node.kubernetes.io/not-ready: indicates the node is not ready. Pods without a toleration for this taint cannot be scheduled on these nodes.
node.kubernetes.io/unreachable: indicates the node is unreachable from the control plane.
These taints ensure workloads don't get scheduled on, or remain on, nodes with specific operational issues (see the documentation for more details).
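For the NoExecute variants of these taints, a pod can limit how long it stays bound to an affected node via tolerationSeconds. Below is a minimal sketch of such a toleration; the 300-second window is an illustrative choice (it matches the 5-minute default that the DefaultTolerationSeconds admission controller normally adds to pods):
tolerations:
  - key: "node.kubernetes.io/not-ready"
    operator: "Exists"
    effect: "NoExecute"
    # evict this pod only after the node has been not-ready for 5 minutes
    tolerationSeconds: 300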
2.2 Tolerations
To schedule pods on a node with taints, we need tolerations. Let's use the tainted node01 from above as an example:
kubectl taint nodes node01 nodeName=workerNode01:NoSchedule
and the pod manifest as:
apiVersion: v1
kind: Pod
metadata:
  name: redis-pod
spec:
  containers:
    - name: redis-container
      image: redis:latest
      ports:
        - containerPort: 6379
  tolerations:
    - key: "nodeName"
      operator: "Equal"
      value: "workerNode01"
      effect: "NoSchedule"
Because the toleration matches the node's taint, the pod is allowed to be scheduled on node01. Without this toleration, the scheduler would not consider node01 for this pod. Note that a toleration only permits scheduling onto the tainted node; it does not guarantee the pod lands there.
Here are more toleration examples:
tolerations:
  # matches a taint whose key is "key1" and whose value is "value1"
  - key: "key1"
    operator: "Equal"
    value: "value1"
    effect: "NoSchedule"
  # matches any taint with the key "example-key", regardless of its value
  - key: "example-key"
    operator: "Exists"
    effect: "NoSchedule"
3. Priority
Priority is assigned to pods using the PriorityClass resource (not by setting a numeric value directly on the pod). Each PriorityClass defines a specific priority level with an integer value, which is used to rank pods when deciding which ones are scheduled or preempted first during times of resource scarcity.
Pods without a PriorityClass inherit the default priority level, which is effectively zero.
The PriorityClass also influences eviction order in case of resource shortages, so that higher-priority pods are kept running longer than lower-priority ones.
3.1 Priority Class
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000
preemptionPolicy: Never
globalDefault: false
The default priority level can also be set explicitly by creating a PriorityClass with globalDefault: true. This priority level is applied to all pods that don't specify a different PriorityClass.
preemptionPolicy: defines whether a pod can preempt others (PreemptLowerPriority) or not (Never).
These settings in a PriorityClass allow fine control over which pods are prioritized and how they behave regarding preemption.
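A pod opts into a class by referencing it through priorityClassName in its spec. Here is a minimal sketch (the pod and container names are made up for illustration):
apiVersion: v1
kind: Pod
metadata:
  name: important-app   # hypothetical name, for illustration only
spec:
  priorityClassName: high-priority   # must match the PriorityClass created above
  containers:
    - name: app
      image: nginx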
You can also list the existing priority classes with:
kubectl get priorityclass
3.2 Pod Management via Priority
Resource scarcity: If a node doesn’t have enough resources, it may evict lower-priority pods to reclaim resources for higher-priority ones.
New high-priority pod: When a new high-priority pod is created, the scheduler may preempt lower-priority pods on a node to make room for it.
In both cases, lower-priority pods may be evicted (and possibly rescheduled onto other nodes) based on the priority hierarchy, ensuring critical workloads get the resources they need.
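Note that the high-priority class defined in 3.1 uses preemptionPolicy: Never, so its pods never evict others; they simply rank higher when scheduling and eviction decisions are made. A class whose pods do participate in preemption could look like the sketch below (the name batch-critical and the value 900 are made up for illustration; PreemptLowerPriority is also the default when preemptionPolicy is omitted):
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-critical   # hypothetical name, for illustration only
value: 900
preemptionPolicy: PreemptLowerPriority   # pods in this class may preempt lower-priority pods
globalDefault: false
description: "Illustrative class whose pods can preempt lower-priority workloads."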