MigrationController Plugin of Koordinator Descheduler: Ensuring Successful Scheduling of Pods After Eviction

The scheduler and the descheduler operate as independent components, with no communication or negotiation mechanism between them. After the descheduler evicts a pod, the replacement pod created as a result of the eviction competes for resources with normal pods (those created for reasons other than eviction). For instance, if a normal pod and a replacement pod require the same resources while the cluster’s remaining schedulable resources are tight or fragmented, the normal pod may be scheduled first. The replacement pod is then scheduled back onto its original node, or fails to be scheduled at all.

The MigrationController plugin addresses this issue by communicating with the koordinator scheduler through the Reservation resource, so that the koordinator scheduler reserves resources for the replacement pod before the original pod is evicted.

The arbitration mechanism, introduced in version 1.4.0, intervenes in the pod eviction process. Its purpose is to pause eviction while an application is under maintenance. For example, if the descheduler evicts pods during a deployment’s rolling update, the application can become unstable and may even be left with no ready pods. The arbitration mechanism adds an approval step to pod eviction, so that an eviction can be approved or rejected.

This article is based on koordinator v1.4.0.

The MigrationController plugin in the koordinator descheduler implements the filter and evict plugin extension points. It includes functionalities for resource reservation and the arbitration mechanism.

Terminology:

Reservation: A koordinator custom resource. When resource reservation is enabled, each evicted pod is associated with a reservation. It is used by the koordinator-scheduler to track the status of resource reservations, while the koordinator-descheduler updates the PodMigrationJob status based on this reservation.

PodMigrationJob: A koordinator custom resource. Each time the MigrationController evicts a pod, it creates a PodMigrationJob for that pod. It is used to track the eviction status of the pod.

arbitrator: A module within the MigrationController plugin responsible for filtering pod evictions and deciding whether a pod should be evicted.

In the koordinator-descheduler configuration file, enable the plugin under profiles[*].plugins.evict and add its arguments under profiles[*].pluginConfig:

```yaml
profiles:
  - name: koord-descheduler
    plugins:
      # ...
      evict:
        disabled:
          - name: "*"
        enabled:
          - name: MigrationController
    pluginConfig:
      - name: MigrationController
        args:
```

The resource reservation feature is enabled by default, with defaultJobMode: ReservationFirst set in the MigrationController plugin configuration.

```yaml
pluginConfig:
  - name: MigrationController
    args:
      defaultJobMode: ReservationFirst
```

A manually created PodMigrationJob also uses resource reservation by default; the mode can be stated explicitly through spec.mode.

```yaml
apiVersion: scheduling.koordinator.sh/v1alpha1
kind: PodMigrationJob
metadata:
  name: migrationjob-demo
spec:
  mode: ReservationFirst
```

You can also disable the resource reservation feature by setting defaultJobMode: EvictDirectly in the MigrationController plugin configuration.

```yaml
pluginConfig:
  - name: MigrationController
    args:
      defaultJobMode: EvictDirectly
```

For a manually created PodMigrationJob, you can disable resource reservation by setting spec.mode to EvictDirectly.

```yaml
apiVersion: scheduling.koordinator.sh/v1alpha1
kind: PodMigrationJob
metadata:
  name: migrationjob-demo
spec:
  mode: EvictDirectly
```

The koordinator-scheduler treats reservations as scheduling units similar to pods. It schedules these reservations based on their information (acting as virtual pods to hold a spot) and allocates a suitable node for resource reservation. When the new pod related to the reservation is created, the reserved resources are allocated to this new pod.
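
As a rough illustration of the “virtual pod” idea, a Reservation carries a pod template describing the resources to hold, plus an owner entry naming the pod that may later consume them. The sketch below follows the general shape of the koordinator Reservation API. The object names (reservation-demo, pod-demo-0), the image, the resource amounts, and the TTL are hypothetical, and in the migration flow the koordinator-descheduler creates this object itself rather than the user.

```yaml
apiVersion: scheduling.koordinator.sh/v1alpha1
kind: Reservation
metadata:
  name: reservation-demo          # hypothetical name
spec:
  template:                       # scheduled like a pod, but only holds resources
    namespace: default
    spec:
      schedulerName: koord-scheduler
      containers:
        - name: app               # hypothetical container
          image: nginx            # hypothetical image
          resources:
            requests:
              cpu: 500m           # illustrative amounts to reserve
              memory: 1Gi
  owners:                         # which pod may consume the reserved resources
    - object:
        namespace: default
        name: pod-demo-0          # hypothetical replacement pod
  ttl: 1h                         # illustrative expiry for the reservation
```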

Let’s take an example where both LowNodeLoad and MigrationController plugins are enabled. The following is the koordinator-descheduler configuration, with LowNodeLoad as a balance-type plugin and MigrationController as an evict-type plugin.

```yaml
profiles:
  - name: koord-descheduler
    plugins:
      deschedule:
        disabled:
          - name: "*"
      balance:
        enabled:
          - name: LowNodeLoad
      evict:
        disabled:
          - name: "*"
        enabled:
          - name: MigrationController
    pluginConfig:
      - name: MigrationController
        args:
          apiVersion: descheduler/v1alpha2
          kind: MigrationControllerArgs
          evictionPolicy: Eviction
          namespaces:
            exclude:
              - kube-system
          evictQPS: "10"
          evictBurst: 1
      - name: LowNodeLoad
        args:
          apiVersion: descheduler/v1alpha2
          kind: LowNodeLoadArgs
          evictableNamespaces:
            exclude:
              - kube-system
          useDeviationThresholds: false
          lowThresholds:
            cpu: 45
            memory: 55
          highThresholds:
            cpu: 75
            memory: 80
```

When the LowNodeLoad plugin reaches the phase of evicting pods from nodes, it calls the evict plugin (here configured as MigrationController) to execute the eviction.

[Figure: koordinator-descheduler MigrationController process]

When MigrationController executes evict, it performs the following steps:

  1. Calls the arbitrator to decide whether to proceed with the subsequent eviction process (arbitration mechanism).

  2. Creates a PodMigrationJob (a custom resource) to record the processing status of the pod migration.

  3. If the resource reservation feature is enabled in the MigrationController configuration (defaultJobMode is ReservationFirst, which is the default), creates a Reservation (a custom resource) so that the koordinator-scheduler can reserve resources for the newly created pod and track the reservation status. Once the koordinator-scheduler has reserved the resources, the koordinator-descheduler synchronizes the Reservation status to the PodMigrationJob.

  4. Executes the pod eviction.

  5. If the resource reservation feature is enabled, waits for the new pod to complete scheduling (i.e., until the reserved resources have been allocated to the new pod).

The arbitrator performs arbitration (deciding whether to continue the eviction) both before and after creating the job.

It includes some built-in eviction restrictions and also supports manual intervention: adding the annotation “scheduling.koordinator.sh/eviction-cost” with the maximum int32 value (2147483647) to a pod prevents that pod from being evicted.
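
A minimal sketch of a pod protected in this way is shown here. The pod name, container, and image are hypothetical; annotation values must be strings, so the number is quoted.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-demo                  # hypothetical pod
  annotations:
    # maximum int32 value: the arbitrator will not evict this pod
    scheduling.koordinator.sh/eviction-cost: "2147483647"
spec:
  containers:
    - name: app                   # hypothetical container
      image: nginx                # hypothetical image
```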

Built-in restrictions (any of which prevent eviction) include:

  1. The pod carries the annotation “scheduling.koordinator.sh/eviction-cost” set to the maximum int32 value.

  2. The defaultJobMode is ReservationFirst, and the pod’s Spec.SchedulerName is not in the MigrationController plugin’s SchedulerNames configuration item. In other words, when resource reservation is enabled, pods handled by a scheduler outside the configured scheduler list are not evicted.

  3. Non-evictable pods on the node (the same as in the official descheduler; see the descheduler plugin filtering rules).

  4. The pod’s namespace is not included in the MigrationController configuration Namespaces.Include.

  5. The pod’s namespace is included in the MigrationController configuration Namespaces.Exclude.

  6. The number of pods migrating per workload exceeds MaxMigratingPerWorkload, or the workload’s pod count is 1.

  7. The pod’s annotation does not have “descheduler.alpha.kubernetes.io/evict” and meets any of the following conditions:

    • The overall descheduler eviction has reached the rate limit in MigrationController’s ObjectLimiters configuration.

    • The number of pods on the node that have passed arbitration exceeds MaxMigratingPerNode in the MigrationController configuration.

    • The number of PodMigrationJobs in the pod’s namespace that have passed arbitration exceeds MaxMigratingPerNamespace in the MigrationController configuration.

    • The number of migrating pods in the workload exceeds MaxMigratingPerWorkload, or the number of migrating pods plus the unavailable pods in the workload exceeds MaxUnavailablePerWorkload.
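
Conversely, the limits in item 7 do not apply to a pod that carries the “descheduler.alpha.kubernetes.io/evict” annotation. A minimal sketch, assuming the value "true" (the rule above only requires the annotation to be present) and hypothetical pod, container, and image names:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-demo                  # hypothetical pod
  annotations:
    # presence of this key exempts the pod from the item 7 limits;
    # the value "true" is an assumption used for illustration
    descheduler.alpha.kubernetes.io/evict: "true"
spec:
  containers:
    - name: app                   # hypothetical container
      image: nginx                # hypothetical image
```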

After creating the PodMigrationJob, the above conditions are checked again (this is to check for manually created PodMigrationJobs).

For example, manually migrating a specified pod will create a PodMigrationJob for that pod.
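
A minimal sketch of such a manually created job follows. It assumes the spec.podRef field of the PodMigrationJob API to name the pod to migrate; the pod name, namespace, and TTL are hypothetical.

```yaml
apiVersion: scheduling.koordinator.sh/v1alpha1
kind: PodMigrationJob
metadata:
  name: migrationjob-demo
spec:
  mode: ReservationFirst          # reserve resources before evicting
  ttl: 5m                         # illustrative time-to-live for the job
  podRef:                         # assumed field: the pod to migrate
    namespace: default
    name: pod-demo-0              # hypothetical pod
```

Once created, the job goes through the same arbitration checks as jobs created by the descheduler itself.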

The MigrationController ultimately executes the eviction operation, supporting three types of eviction actions:

  1. Delete Method: Deletes the pod by calling the Delete method. This is configured by setting the MigrationController’s EvictionPolicy parameter to “Delete”.
  2. Evict Method: Evicts the pod by calling the Evict method. This is configured by setting the MigrationController’s EvictionPolicy parameter to “Eviction”.
  3. Soft Eviction: Does not remove the pod but adds an annotation to the pod with the key “scheduling.koordinator.sh/soft-eviction”. The value records the reason for the eviction, who triggered it, the time, and a JSON string of the DeleteOptions. This is configured by setting the MigrationController’s EvictionPolicy parameter to “SoftEviction”.

The MigrationController plugin supports the following configuration parameters:

| Parameter | Description | Default Value |
| --- | --- | --- |
| dryRun | Executes the eviction process without creating Reservations or evicting pods | false |
| evictFailedBarePods | Whether to evict orphaned pods in a failed state | false |
| evictLocalStoragePods | Whether to evict pods with local storage | false |
| evictSystemCriticalPods | Whether to evict SystemCritical priority pods | false |
| ignorePvcPods | Whether to skip (not evict) pods with PVCs | false |
| labelSelector | Only pods matching this LabelSelector will be evicted | nil |
| priorityThreshold | Only pods with a priority lower than this level will be evicted | nil |
| maxConcurrentReconciles | Maximum number of concurrent reconciles | 1 |
| namespaces.include | Namespaces to include for eviction. Checked before exclude | nil |
| namespaces.exclude | Namespaces to exclude from eviction. Checked after include | nil |
| nodeFit | Whether new pods can be scheduled on suitable nodes, including NodeAffinity and Taint toleration | false |
| nodeSelector | Filters target nodes and checks if they can accommodate the evicted pods | false |
| maxMigratingPerNode | Maximum number of migrating pods per node (0 means no limit) | 2 |
| maxMigratingPerNamespace | Maximum number of migrating pods per namespace (0 means no limit) | 0 |
| maxMigratingPerWorkload | Maximum number of migrating pods per workload. 0 applies replica-based defaults: for workloads with more than 10 replicas, 10% of the replicas; for workloads with 4 to 10 replicas, 2; for workloads with fewer than 4 replicas, 1 | 0 |
| maxUnavailablePerWorkload | Maximum number of unavailable pods per workload during migration. 0 applies replica-based defaults: for workloads with more than 10 replicas, 10% of the replicas; for workloads with 4 to 10 replicas, 2; for workloads with fewer than 4 replicas, 1 | 0 |
| skipCheckExpectedReplicas | If false, checks that maxMigratingPerWorkload and maxUnavailablePerWorkload are less than the workload’s expected replicas | false |
| objectLimiters | Eviction rate limits per workload dimension. Each dimension takes the parameters maxMigrating and duration, meaning the maximum number of pods that can be migrated within the duration window. This overlaps with maxMigratingPerWorkload, but the logic differs: maxMigratingPerWorkload is calculated from all current PodMigrationJobs, whereas objectLimiters tracks migrations with a rate limiter. Both share a similar issue: if a pod is created by a deployment and the workload is a ReplicaSet, the replica count during a deployment rolling update might not match the expected number of replicas | nil |
| defaultJobMode | Mode for executing PodMigrationJob: “ReservationFirst” or “EvictDirectly” | “ReservationFirst” |
| defaultJobTTL | Time-to-live for PodMigrationJob resources created by koordinator-descheduler | 5m |
| schedulerNames | List of scheduler names that can handle resource reservations | ["koord-scheduler"] |
| evictQPS | Global limit for pod evictions per second | 10 |
| evictBurst | Global limit for burst pod evictions | 1 |
| evictionPolicy | Method for executing pod evictions: “Eviction”, “Delete”, or “SoftEviction” | “Eviction” |
| defaultDeleteOptions | Parameters passed when evicting pods | nil |
| arbitrationArgs.enabled | Whether to enable the arbitration mechanism | true |
| arbitrationArgs.interval | Interval for executing arbitration on PodMigrationJobs | 500ms |
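
To show how several of these parameters fit together, here is a hedged configuration sketch that tightens the migration limits and keeps arbitration enabled. All values are illustrative rather than recommendations, and the `workload` key under objectLimiters is an assumption about how the per-workload dimension is named.

```yaml
pluginConfig:
  - name: MigrationController
    args:
      apiVersion: descheduler/v1alpha2
      kind: MigrationControllerArgs
      defaultJobMode: ReservationFirst
      defaultJobTTL: 5m
      schedulerNames:
        - koord-scheduler
      maxMigratingPerNode: 2          # at most 2 migrating pods per node
      maxMigratingPerNamespace: 10    # illustrative value; 0 means no limit
      maxMigratingPerWorkload: 1      # illustrative value
      maxUnavailablePerWorkload: 2    # illustrative value
      objectLimiters:
        workload:                     # assumed key for the workload dimension
          duration: 5m                # length of the rate-limit window
          maxMigrating: 1             # at most 1 migration per window
      arbitrationArgs:
        enabled: true
        interval: 500ms
```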

The MigrationController plugin addresses the issue of newly created pods being unschedulable after descheduler evictions. It allows for manual eviction of specific pods (by manually creating a PodMigrationJob) and enables certain pods to be exempt from eviction.

The resource reservation feature requires integration with the koordinator-scheduler. Therefore, both koordinator-descheduler and koordinator-scheduler need to be running simultaneously to leverage this functionality.
