Kubernetes Scheduling

When we talk about scheduling, what we really want to know is:

  1. what the basic rules are for assigning pods to nodes

  2. and in what order/priority these pods are handled

These are the issues that the kube-scheduler deals with.

The kube-scheduler is responsible for evaluating and enforcing scheduling operations, including:

  • Pod affinity and anti-affinity.

  • Node taints and pod tolerations.

  • Priority and preemption, using PriorityClass.

  • Other scheduling constraints (e.g., resource requests and limits, node selectors).

These are what we will discuss in this part, under three related topics: Affinity & Anti-Affinity, Taints & Tolerations, and Priority.

0. Assigning a Pod to an Exact Node

Before discussing the other rules, let’s first look at one special case: when we need a pod to run on a specific node. Instead of using affinity, we can bypass the scheduling rules and assign the pod directly to a node by setting nodeName under the pod’s spec.

apiVersion: v1
kind: Pod
metadata:
  name: nginx-pod
spec:
  containers:
  - name: nginx-container
    image: nginx
  nodeName: node01
  • Here, nodeName: node01 specifies that the pod should only run on node01, effectively locking it to the node.

  • This approach bypasses the scheduler entirely, so no affinity rules or other scheduling constraints are evaluated; the kubelet on node01 simply runs the pod.
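
Among the other scheduling constraints listed earlier, nodeSelector is the scheduler-aware way to pin a pod to a labeled node. A minimal sketch, assuming node01 carries the standard kubernetes.io/hostname: node01 label that the kubelet sets from the node's hostname:

apiVersion: v1
kind: Pod
metadata:
  name: nginx-selector-pod
spec:
  containers:
  - name: nginx-container
    image: nginx
  # unlike nodeName, this still goes through the scheduler, so taints,
  # resource requests, and affinity rules are respected
  nodeSelector:
    kubernetes.io/hostname: node01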

1. Affinity & Anti-Affinity

Affinity and anti-affinity act as "soft" preferences (or "hard" requirements) for where a pod should or should not run relative to other pods or nodes.

1.1 Affinity/Anti-Affinity Types

Node Affinity

  • Node affinity uses nodeSelectorTerms, which define specific criteria (e.g., labels) nodes must meet to schedule the pod.

Inter-pod Affinity

  • Pod affinity/anti-affinity uses podAffinityTerm, which defines conditions related to other pods. For example, it might require certain pods to be co-located on the same node.

1.2 Scheduling Types

There are two types of scheduling rules: preferredDuringSchedulingIgnoredDuringExecution and requiredDuringSchedulingIgnoredDuringExecution.

  • preferredDuringSchedulingIgnoredDuringExecution (soft preference):

    • This is a soft rule, meaning the scheduler will try to place the pod on nodes that meet these conditions but won't enforce it strictly.

    • If a node matches the soft rule, the pod will be more likely to be scheduled there, but if no nodes match, the pod will still be scheduled on other nodes.

  • requiredDuringSchedulingIgnoredDuringExecution (hard constraint):

    • This is a hard rule, meaning the pod must be scheduled on nodes that satisfy the specified conditions.

    • If no nodes meet the hard constraint, the pod won’t be scheduled at all.

1.3 Examples

Example with Node Affinity

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: "environment"
              operator: In
              values:
                - production
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
            - key: "diskType"
              operator: In
              values:
                - ssd
  • Here, the hard constraint requiredDuringSchedulingIgnoredDuringExecution specifies that the pod can only be scheduled on nodes labeled with environment=production.

  • The soft preference preferredDuringSchedulingIgnoredDuringExecution then indicates a preference for nodes with diskType=ssd. The scheduler will try to prioritize ssd nodes but will still schedule on any production node if no ssd nodes are available.
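
For context, this affinity block sits under the pod's spec in a full manifest. A minimal sketch with a hypothetical nginx pod (the environment and diskType node labels are assumed to already exist in your cluster):

apiVersion: v1
kind: Pod
metadata:
  name: web-pod
spec:
  containers:
  - name: nginx-container
    image: nginx
  affinity:
    nodeAffinity:
      # hard rule: only nodes labeled environment=production are eligible
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: "environment"
                operator: In
                values:
                  - production
      # soft rule: among eligible nodes, prefer those labeled diskType=ssd
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 1
          preference:
            matchExpressions:
              - key: "diskType"
                operator: In
                values:
                  - ssd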

Example with Pod Affinity

affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: backend
        topologyKey: "kubernetes.io/hostname"
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: frontend
          topologyKey: "kubernetes.io/hostname"
  • The hard constraint here is that this pod must be scheduled on the same node as a pod labeled app=backend (using kubernetes.io/hostname as the topology key).

  • The soft preference is to co-locate this pod with frontend pods on the same node if possible, but it’s not strictly enforced.
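
For the hard rule above to be satisfiable, at least one running pod must already carry the app=backend label on some node. A minimal sketch of such a pod (the name and image are hypothetical):

apiVersion: v1
kind: Pod
metadata:
  name: backend-pod
  labels:
    # the podAffinity labelSelector above matches on this label
    app: backend
spec:
  containers:
    - name: backend-container
      image: redis:latest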

1.4 topologyKey

The topologyKey defines the scope within which an affinity or anti-affinity rule is evaluated, such as a node, a zone, or a region. For example, if you want high availability by spreading replicas across zones, you can use a zone-level topologyKey in your anti-affinity rules.

Example: Spreading Pods Across Zones

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: myapp
        topologyKey: "topology.kubernetes.io/zone"

In this example:

  • The topologyKey is set to topology.kubernetes.io/zone (the current label that replaces the deprecated failure-domain.beta.kubernetes.io/zone), which means no two myapp pods will be placed in the same zone.

  • This approach helps distribute replicas across different zones, making the application more resilient to zone-level failures.
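
In practice this rule usually lives in a Deployment's pod template, so that every replica repels the others at the zone level. A minimal sketch, assuming a hypothetical 3-replica myapp Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp-container
          image: nginx
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: myapp
              # at most one replica per zone; a 4th replica in a
              # 3-zone cluster would stay Pending
              topologyKey: "topology.kubernetes.io/zone"

Because the rule is required, the replica count should not exceed the number of zones, or the surplus pods will remain Pending.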

Example: Ensuring Pods are on Different Nodes

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: myapp
        topologyKey: "kubernetes.io/hostname"

Here, setting topologyKey: kubernetes.io/hostname makes the scheduler place each myapp pod on a separate node; because the rule is required, a replica that cannot get a node of its own stays Pending. This is useful for spreading pods within the same zone but across multiple nodes for redundancy; a softer variant is sketched below.
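
When there can be more replicas than nodes, the required form above leaves the surplus replicas Pending. A softer variant (a sketch, reusing the hypothetical app: myapp label) only prefers spreading:

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: myapp
          # prefer separate nodes, but co-locate if there is no other choice
          topologyKey: "kubernetes.io/hostname"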

Above, we mentioned the concept of a “zone”, which is widely used by cloud providers such as Azure and AWS, where it corresponds to an Availability Zone. For more information, check the Azure or AWS documentation.

2. Taints & Tolerations

Taints and tolerations act as a "hard" exclusion criterion: a taint repels pods from a node, and a matching toleration allows (but does not require) a pod to land on that node.

2.1 Taints

To check a node's taints, add a taint to a node, or remove a taint with kubectl, use:

# check node Taints
kubectl describe node node01 | grep Taints
# taint node
kubectl taint nodes node1 key1=value1:NoSchedule
# remove the taint (key, value, and effect must match). Notice: the minus sign (-) at the end.
kubectl taint nodes node1 key1=value1:NoSchedule-
# eg.
kubectl taint nodes node01 nodeName=workerNode01:NoSchedule

The taint in the last command means:

  • The taint key is "nodeName"

  • The taint value is "workerNode01"

  • The effect is "NoSchedule" which means pods that don't tolerate this taint will not be scheduled on this node

Kubernetes also applies built-in taints to nodes automatically (via the node controller), such as:

  • node.kubernetes.io/not-ready: the node is not ready.

  • node.kubernetes.io/unreachable: the node is unreachable from the node controller.

  • node.kubernetes.io/memory-pressure and node.kubernetes.io/disk-pressure: the node is running low on memory or disk.

  • node.kubernetes.io/unschedulable: the node has been cordoned.

These taints ensure workloads don’t get scheduled on, or don’t remain on, nodes with specific operational issues (for the full list, please check the Kubernetes documentation).

2.2 Tolerations

To schedule pods on a node with taints, we need tolerations. Let’s use the tainted node01 from above as an example.

kubectl taint nodes node01 nodeName=workerNode01:NoSchedule

and the pod manifest as follows:

apiVersion: v1
kind: Pod
metadata:
  name: redis-pod
spec:
  containers:
    - name: redis-container
      image: redis:latest
      ports:
        - containerPort: 6379
  tolerations:
  - key: "nodeName"
    operator: "Equal"
    value: "workerNode01"
    effect: "NoSchedule"

Since the toleration matches the node's taint, the pod is allowed to be scheduled on node01 (though a toleration does not guarantee placement on that node). Without this toleration, the scheduler would not consider node01 for this pod.

Here are more toleration examples:

tolerations:
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoSchedule"
- key: "example-key"
  operator: "Exists"
  effect: "NoSchedule"

3. Priority

Priority is assigned to pods through the PriorityClass resource rather than by setting a numeric value on the pod directly. Each PriorityClass defines a specific priority level as an integer value, which is used to rank pods when deciding which ones are scheduled or preempted first during times of resource scarcity.

  • Pods without a PriorityClass inherit the default priority level, which is effectively zero.

  • The PriorityClass also influences eviction order in case of resource shortages, so that higher-priority pods are kept running longer than lower-priority ones.

3.1 Priority Class

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000
preemptionPolicy: Never
globalDefault: false
  • The default priority level can also be set explicitly by creating a PriorityClass with globalDefault: true. This priority level is applied to all pods that don’t specify a different PriorityClass.

  • preemptionPolicy: Defines whether a pod can preempt others (PreemptLowerPriority) or not (Never).

These settings in a PriorityClass allow fine control over which pods are prioritized and how they behave regarding preemption.
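
To attach this priority to a workload, the pod references the class by name through priorityClassName. A minimal sketch reusing the high-priority class defined above (the pod name and image are hypothetical):

apiVersion: v1
kind: Pod
metadata:
  name: critical-pod
spec:
  containers:
    - name: app-container
      image: nginx
  # resolved at admission time to the class's integer value (1000)
  priorityClassName: high-priority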

You can also list the PriorityClasses in the cluster with:

kubectl get priorityclass

3.2 Pod Management via Priority

  • Resource scarcity: If a node doesn’t have enough resources, it may evict lower-priority pods to reclaim resources for higher-priority ones.

  • New high-priority pod: When a new high-priority pod is created, the scheduler may preempt lower-priority pods on a node to make room for it.

In both cases, lower-priority pods may be evicted (and, if capacity allows, rescheduled onto other nodes) based on the priority hierarchy, ensuring critical workloads get the resources they need.
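
As a sketch of the preempting case, a PriorityClass with preemptionPolicy: PreemptLowerPriority (the default) lets pods that use it displace lower-priority pods when no node has room; the name, value, and description here are hypothetical:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-priority
value: 100000
# PreemptLowerPriority is the default policy; pods using this class may
# evict lower-priority pods from a node to free the resources they request
preemptionPolicy: PreemptLowerPriority
globalDefault: false
description: "For workloads that are allowed to displace lower-priority pods."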