207 changes: 185 additions & 22 deletions keps/sig-scheduling/5055-dra-device-taints-and-tolerations/README.md
@@ -70,8 +70,10 @@ SIG Architecture for cross-cutting KEPs).
- [User Stories](#user-stories)
- [Degraded Devices](#degraded-devices)
- [External Health Monitoring](#external-health-monitoring)
- [Safe Pod Eviction](#safe-pod-eviction)
- [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
- [kubectl describe DeviceTaintRule](#kubectl-describe-devicetaintrule)
- [API](#api)
- [Test Plan](#test-plan)
- [Prerequisite testing updates](#prerequisite-testing-updates)
@@ -167,6 +169,10 @@ ResourceClaim.
- Enable users to decide whether they want to keep running a workload in a degraded
mode while a device is unhealthy or prefer to get pods rescheduled.

- Publish information about devices ("device health") such that control plane
components or admins can decide how to react, without immediately affecting
scheduling or workloads.

### Non-Goals

- Not part of the plan for alpha: developing a kubectl command for managing device taints.
@@ -181,9 +187,17 @@ ResourceClaim.
A driver itself can detect problems which may or may not be tolerable for
workloads, like degraded performance due to overheating. Removing such devices
from the ResourceSlice would unconditionally prevent using them for new
pods. Instead, taints can be added to ResourceSlices to provide sufficient
information for deciding how to react to the indicated problems.

A control plane component or an admin then reacts to that information. They may
publish a DeviceTaintRule which prevents using the degraded device for new pods,
or which even evicts all pods currently using it so that the device can be
replaced or reset once it is idle.

Users can decide, at their own risk, to tolerate less critical taints in their
workloads. Admins who schedule maintenance pods need to add tolerations for
their own taints to get those pods scheduled.

#### External Health Monitoring

Expand All @@ -193,6 +207,22 @@ not supported by that DRA driver. When that component detects problems, it can
check its policy configuration and decide to take devices offline by creating
a DeviceTaintRule with a taint for affected devices.
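
As a rough illustration, here is what such a DeviceTaintRule could look like when
built with the Go types proposed later in this KEP. This is a sketch under
assumptions: the import path is a placeholder, the exact-match selector fields
are collapsed in this diff, and all driver, pool, and device names are made up.

```Go
package example

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/utils/ptr"

	// Assumption: placeholder import path for the types proposed in this KEP.
	resourceapi "k8s.io/api/resource/v1alpha3"
)

// newECCTaintRule builds a rule which takes one suspicious device offline for
// new pods.
func newECCTaintRule() *resourceapi.DeviceTaintRule {
	return &resourceapi.DeviceTaintRule{
		ObjectMeta: metav1.ObjectMeta{
			Name: "node-1-gpu-0-ecc-errors",
		},
		Spec: resourceapi.DeviceTaintRuleSpec{
			DeviceSelector: &resourceapi.DeviceTaintSelector{
				// Assumed exact-match fields; they are collapsed in this diff.
				Driver: ptr.To("gpu.example.com"),
				Pool:   ptr.To("node-1"),
				Device: ptr.To("gpu-0"),
			},
			// One taint per rule, as described in the KEP text.
			Taint: resourceapi.DeviceTaint{
				Key:    "monitoring.example.com/ecc-errors",
				Value:  "uncorrectable",
				Effect: resourceapi.DeviceTaintEffectNoSchedule,
				// Description is one of the new fields proposed in this KEP.
				Description: ptr.To("external health monitoring reported uncorrectable ECC errors"),
			},
		},
	}
}
```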

#### Safe Pod Eviction

Selecting the wrong set of devices in a DeviceTaintRule can have potentially
disastrous consequences, including quickly evicting all workloads using any
device in the cluster instead of those using a single device. To avoid this, a
cluster admin can first create a DeviceTaintRule with `None` as the effect so
that it has no immediate impact. `kubectl describe` then includes information
about the matched devices and how many pods would be evicted if the effect
were `NoExecute`. Then the admin can edit the DeviceTaintRule to set the
desired effect.

Once eviction starts, it happens at a low enough rate that admins have a chance
to delete the DeviceTaintRule before all pods are evicted if they made a
mistake after all. This rate is configurable to enable faster eviction, if
admins are sure that this is what they want. The DeviceTaintRule status
provides information about the progress.

> Review comment on lines +220 to +222:
>
> Reviewer: Is `effect: None` + `kubectl describe` by itself an adequate safeguard against erroneous taints? Or are there different kinds of mistakes that only rate limiting eviction would mitigate? In general I like the idea of `effect: None` essentially being a dry run instead of trying to determine in real time whether a taint is working while it is or is not actively evicting pods. Wondering if that's a worthwhile way we could narrow the scope here.
>
> Author: I don't think the rate limiting would uncover anything that `kubectl describe` may have missed, except perhaps a race (one pod shown as to be evicted, a large number of additional such pods created, then turning on eviction and deleting all of them).
>
> But primarily it is that `kubectl describe` is optional; some admin might forget to double-check. Then the rate limiting may help as a second line of defense.
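
A minimal sketch of this two-step flow, reusing the assumed imports and the
`newECCTaintRule` helper from the earlier example. The create and update API
calls are omitted; only the relevant spec changes are shown.

```Go
// dryRunThenEvict sketches the recommended workflow for a cluster admin.
func dryRunThenEvict() {
	// Step 1: create the rule as a dry run. Effect None has no impact on
	// scheduling or running pods, but `kubectl describe` shows the matched
	// devices and how many pods would be evicted.
	rule := newECCTaintRule()
	rule.Spec.Taint.Effect = resourceapi.DeviceTaintEffectNone
	// ... create the DeviceTaintRule, inspect it with `kubectl describe` ...

	// Step 2: after double-checking, switch to eviction and keep the rate
	// low enough to catch a mistake (the default is 100 pods/s).
	rule.Spec.Taint.Effect = resourceapi.DeviceTaintEffectNoExecute
	rule.Spec.Taint.EvictionsPerSecond = ptr.To[int64](5)
	// ... update the DeviceTaintRule ...
}
```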

### Risks and Mitigations

A device can be identified by its names (`<driver name>/<pool name>/<device
@@ -229,6 +259,24 @@ the purpose of this KEP:
- The ResourceClaim controller will remove such a completed pod from the claim's
`ReservedFor` and deallocate the claim once it has no consumers.

The semantics of the value associated with a taint key are defined by whoever
publishes taints with that key. DRA drivers should use the driver name as the
domain in the key to avoid conflicts. To support tolerating taints by value,
values should not require parsing to extract information. It's better to use
different keys with simple values than one key with a complex value.

> Review comment on lines +262 to +263:
>
> Reviewer: Is it worth drawing a parallel here to standardized device attributes? At least to call out that taint keys could also become standardized across vendors?
>
> Author: Yes, that's worth describing.

The taint value may or may not be sufficient to represent the state of a
device. Therefore the API also allows publishing structured information in JSON
format as a raw extension field, similar to the ResourceClaim device status.
For humans, a description field can be used to provide a summary or additional
explanation of why the taint was added.

If common patterns emerge, then Kubernetes could standardize the name, value,
and data for certain taints, giving them well-defined semantics. This is
similar to standardizing device attributes.

`Effect: None` can be used to publish taints which are merely informational.
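
To illustrate these conventions, here is a hypothetical driver-defined taint,
using the `DeviceTaint` type from the API section below and the assumed imports
from the earlier sketches plus `k8s.io/apimachinery/pkg/runtime`. The key,
value, and JSON payload are made up.

```Go
// A purely informational taint: driver name as the key domain, a simple value,
// a human-readable Description, and machine-readable Data.
var overheating = resourceapi.DeviceTaint{
	Key:         "gpu.example.com/overheating",
	Value:       "true",
	Effect:      resourceapi.DeviceTaintEffectNone,
	Description: ptr.To("GPU is throttled because the die temperature exceeded 90°C"),
	Data: &runtime.RawExtension{
		// Structured details for automated consumers, analogous to the
		// ResourceClaim device status.
		Raw: []byte(`{"temperatureCelsius": 94, "throttlePercent": 35}`),
	},
}
```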

Taints are cumulative:
- Taints defined by an admin in a DeviceTaintRule get added to the
set of taints defined by the DRA driver in a ResourceSlice.
@@ -258,30 +306,85 @@ Device and node taints are applied independently. A node taint applies to all
pods on a node, whereas a device taint affects claim allocation and only those
pods using the claim.

### kubectl describe DeviceTaintRule

An extension of `kubectl describe` for DeviceTaintRule uses the same code as
the eviction controller to calculate which devices are matched and which pods
would need to get evicted. The input for that code is the single
DeviceTaintRule which gets described and the complete set of ResourceSlices,
ResourceClaims and Pods.

Calculating the outcome on the client side was chosen because the alternative,
always updating the status of DeviceTaintRules on the server side, would cause
permanent churn regardless of whether someone wants the information or not.

The exact output will be decided during the implementation phase.
It can change at any time because the `kubectl describe` output does not fall
under any Kubernetes API stability guarantees.

### API

The ResourceSlice content gets extended. A natural place for a taint that
affects one specific device would be inside the `Device` struct. The problem
with that approach is that with 128 allowed devices per ResourceSlice the
maximum size of each taint would have to be very small to keep the worst-case
size of a ResourceSlice within the limits imposed by etcd and the API request
machinery.

Therefore a different approach is used where each ResourceSlice either provides
information about devices or about taints, but never both. Using the
ResourceSlice for taints instead of a different type has the advantage that the
existing mechanisms and code for publishing and consuming the information can
be reused. The same approach is likely to be used for other additional
ResourceSlice pool information, like mixin definitions. The downside is that at
least two ResourceSlices are necessary once taints get published by a DRA
driver.

```Go
type ResourceSliceSpec struct {
...

// Devices lists some or all of the devices in this pool.
//
// Must not have more than 128 entries. Either Devices or Taints may be set, but not both.
//
// +optional
// +listType=atomic
// +oneOf=ResourceSliceContent
Devices []Device

// If specified, these are driver-defined taints.
//
// The maximum number of taints is 32. Either Devices or Taints may be set, but not both.
//
// This is an alpha field and requires enabling the DRADeviceTaints
// feature gate.
//
// +optional
// +listType=atomic
// +featureGate=DRADeviceTaints
// +oneOf=ResourceSliceContent
Taints []SliceDeviceTaint
}

// DeviceTaintsMaxLength is the maximum number of taints per ResourceSlice.
const DeviceTaintsMaxLength = 32

// SliceDeviceTaint defines one taint within a ResourceSlice.
type SliceDeviceTaint struct {
// Device is the name of the device in the pool that the ResourceSlice belongs to
// which is affected by the taint. Multiple taints may affect the same device.
Device string

Taint DeviceTaint
}
```
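
As a sketch of how a driver could use this split, assuming the same imports as
in the earlier examples; pool metadata and most other fields are omitted and
all names are made up.

```Go
// One ResourceSlice carries the devices of the pool ...
var deviceSlice = resourceapi.ResourceSlice{
	ObjectMeta: metav1.ObjectMeta{Name: "node-1-gpus-devices"},
	Spec: resourceapi.ResourceSliceSpec{
		// ... Driver, Pool, NodeName, etc. ...
		Devices: []resourceapi.Device{
			{Name: "gpu-0"},
			{Name: "gpu-1"},
		},
	},
}

// ... and a second one, for the same driver and pool, carries the taints.
var taintSlice = resourceapi.ResourceSlice{
	ObjectMeta: metav1.ObjectMeta{Name: "node-1-gpus-taints"},
	Spec: resourceapi.ResourceSliceSpec{
		// ... same Driver and Pool as above ...
		Taints: []resourceapi.SliceDeviceTaint{{
			Device: "gpu-0",
			Taint: resourceapi.DeviceTaint{
				Key:    "gpu.example.com/overheating",
				Value:  "true",
				Effect: resourceapi.DeviceTaintEffectNoSchedule,
			},
		}},
	},
}
```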

DeviceTaint has all the fields of a v1.Taint, but the field documentation
differs slightly. In particular, PreferNoSchedule is not a valid effect.
Additional fields were added to cover use cases like device health information.

```Go
// The device this taint is attached to has the "effect" on
// any claim which does not tolerate the taint and, through the claim,
// to pods using the claim.
@@ -300,7 +403,7 @@ type DeviceTaint struct {

// The effect of the taint on claims that do not tolerate the taint
// and through such claims on the pods using them.
// Valid effects are None, NoSchedule and NoExecute. PreferNoSchedule as used for
// nodes is not valid here.
//
// +required
@@ -322,12 +425,51 @@ type DeviceTaint struct {
// This field was defined as "It is only written for NoExecute taints." for node taints.
// But in practice, Kubernetes never did anything with it (no validation, no defaulting,
// ignored during pod eviction in pkg/controller/tainteviction).

// Description is a human-readable explanation for the taint.
//
// The length must be smaller or equal to 1024.
//
// +optional
Description *string

// Review comment: I suppose another option could be to include this when
// applicable inside Data instead of having its own field. Or is the idea
// that Description would be shown by kubectl describe and Data would not?
//
// Author: Yes, that's exactly the idea behind it being a first-class field.

// Data contains arbitrary data specific to the taint key.
//
// The length of the raw data must be smaller or equal to 10 Ki.
//
// +optional
Data *runtime.RawExtension

// EvictionsPerSecond controls how quickly Pods get evicted if that is
// the effect of the taint. If multiple taints cause eviction
// of the same set of Pods, then the lowest rate defined in
// any of those taints applies.
//
// The default is 100 Pods/s.
//
// +optional
EvictionsPerSecond *int64
}

const (
// DefaultEvictionsPerSecond is the default for [DeviceTaint.EvictionsPerSecond]
// if none is specified explicitly.
DefaultEvictionsPerSecond = 100

// TaintDescriptionMaxLength is the maximum size of [DeviceTaint.Description].
TaintDescriptionMaxLength = 1024

// TaintDataMaxLength is the maximum size of [DeviceTaint.Data].
TaintDataMaxLength = 10 * 1024
)

// +enum
type DeviceTaintEffect string

const (
// No effect, the taint is purely informational.
DeviceTaintEffectNone DeviceTaintEffect = "None"

// Do not allow new pods to schedule which use a tainted device unless they tolerate the taint,
// but allow all pods submitted to Kubelet without going through the scheduler
// to start, and allow all already-running pods to continue running.
@@ -338,9 +480,6 @@ const (
)
```

Tolerations get added to a DeviceRequest:

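The actual field definitions are collapsed in this diff view. As a purely
hypothetical sketch of the expected shape, assuming the toleration type mirrors
`v1.Toleration` (names below are assumptions, not taken from this diff):

```Go
type DeviceRequest struct {
	// ... existing fields ...

	// Tolerations applies to all devices selected by this request.
	// (Hypothetical name and placement.)
	Tolerations []DeviceToleration
}

// DeviceToleration is assumed to mirror v1.Toleration: an empty Key together
// with operator Exists matches all taint keys, an empty Effect matches all
// effects, and TolerationSeconds bounds how long a NoExecute taint is tolerated.
type DeviceToleration struct {
	Key               string
	Operator          DeviceTolerationOperator
	Value             string
	Effect            DeviceTaintEffect
	TolerationSeconds *int64
}

type DeviceTolerationOperator string

const (
	DeviceTolerationOpExists DeviceTolerationOperator = "Exists"
	DeviceTolerationOpEqual  DeviceTolerationOperator = "Equal"
)
```
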
@@ -447,9 +586,15 @@ would be repetitive work. Instead, a
reacts to informer events for ResourceSlice and DeviceTaintRule and
maintains a set of updated ResourceSlices which also contain the taints
set via a DeviceTaintRule. The tracker provides the API of an informer
and thus can be used as a replacement for a ResourceSlice
informer.

It uses the type from `k8s.io/dynamic-resource-allocation/api` to represent
ResourceSlices. The difference compared to the `k8s.io/resource/v1` API is that
taints are stored in the device struct together with information about where
they came from. The eviction controller uses this to apply the DeviceTaintRule
rate limit and to update the DeviceTaintRuleStatus.
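
Conceptually, the tracker's patched view could look like the sketch below. This
is not the actual `k8s.io/dynamic-resource-allocation/api` definition, only an
illustration of the idea that each device carries its taints together with
their origin.

```Go
// trackedDevice is a conceptual sketch of a device in the tracker's output:
// it lists all taints that currently apply to the device.
type trackedDevice struct {
	Name   string
	Taints []trackedTaint
}

// trackedTaint records where a taint came from so that the eviction controller
// can apply the DeviceTaintRule's rate limit and update its status.
type trackedTaint struct {
	Taint resourceapi.DeviceTaint

	// Rule is set if the taint was added via a DeviceTaintRule rather than
	// published by the DRA driver in its ResourceSlice.
	Rule *resourceapi.DeviceTaintRule
}
```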

```Go
// DeviceTaintRule adds one taint to all devices which match the selector.
// This has the same effect as if the taint was specified directly
@@ -465,20 +610,16 @@ type DeviceTaintRule struct {
// Changing the spec automatically increments the metadata.generation number.
Spec DeviceTaintRuleSpec

// Status provides information about an ongoing pod eviction.
Status DeviceTaintRuleStatus
}

// DeviceTaintRuleSpec specifies the selector and one taint.
type DeviceTaintRuleSpec struct {
// DeviceSelector defines which device(s) the taint is applied to.
// All selector criteria must be satisfied for a device to
// match. The empty selector matches all devices. Without
// a selector, no devices are matched.
//
// +optional
DeviceSelector *DeviceTaintSelector
@@ -535,6 +676,28 @@ type DeviceTaintSelector struct {
// +listType=atomic
Selectors []DeviceSelector
}

// DeviceTaintRuleStatus provides information about an ongoing pod eviction.
type DeviceTaintRuleStatus struct {
// Review comment (pohly, Sep 22, 2025): While implementing the type updates I
// was reminded that bumping the spec increments the Generation. This was meant
// as preparation for conditions which can record the observed generation to
// document what they are based on.
//
// But now these new status fields aren't using conditions. Should they? There's
// no field to store data like "pods pending eviction" in machine-readable form
// in https://pkg.go.dev/k8s.io/apimachinery/pkg/apis/meta/v1#Condition. The
// numbers could be part of the message field:
//
//	Condition{
//		Type:               "InProgress",
//		Status:             metav1.ConditionTrue,
//		ObservedGeneration: rule.Generation,
//		LastTransitionTime: metav1.Now(),
//		Reason:             "PodsPendingEviction",
//		Message:            fmt.Sprintf("%d pods are pending eviction, %d evicted already", numPodsPending, numPodsEvicted),
//	}
//
// This becomes Status: ConditionFalse when there are no pods which need to be
// evicted.
//
// Author, in a follow-up: The advantage of conditions is that
// `kubectl wait --for=condition=InProgress=false DeviceTaintRule/my-taint-rule`
// can be used, which is nicer than
// `kubectl wait --for=jsonpath='{.status.podsPendingEviction}'=0`.

// PodsPendingEviction counts the number of Pods which still need to be evicted.
// Because taints with the NoExecute effect also prevent scheduling new pods,
// this number should eventually reach zero.
//
// The count gets updated periodically, so it is not guaranteed to be 100%
// accurate.
//
// +optional
PodsPendingEviction *int64

// PodsEvicted counts the number of Pods which were evicted because of the taint.
//
// This gets updated periodically, so it is not guaranteed to be 100%
// accurate. The actual count may be higher if the controller evicted
// some Pods and then gets restarted before updating this field.
//
// +optional
PodsEvicted *int64

// Review comment (pohly, Sep 22, 2025): I'm undecided about this field. As it
// stands, this is a lower bound: "at least this number of pods were evicted,
// but maybe also more". 99% of the time it'll be accurate, but guaranteeing
// that would imply evicting a pod and updating this field in a single atomic
// transaction. If the kube-controller-manager dies and restarts, it cannot
// figure out whether all evicted pods were accounted for.
//
// When using conditions, determining the old count becomes icky (need to parse
// a string on startup, which may not always work because IMHO we shouldn't make
// the string format part of the official API).
//
// Reviewer: Is the idea with the PodsEvicted field that it will monotonically
// increase forever? Would it make more sense to expose that as a metric tied to
// the lifetime of a kube-controller-manager instance? Then Prometheus would
// make it easier to add up the total evictions for a given DeviceTaintRule
// among all past and present kube-controller-manager pods without Kubernetes
// having to handle that itself.
//
// I think the API is still a good place to at least keep some sign of
// "in progress"/"done" but maybe doesn't need to detail the exact progress.
// For that I think a condition makes sense.
}
```

### Test Plan
@@ -20,7 +20,7 @@ stage: alpha
# The most recent milestone for which work toward delivery of this KEP has been
# done. This can be the current (upcoming) milestone, if it is being actively
# worked on.
latest-milestone: "v1.33"
latest-milestone: "v1.35"

# The milestone at which this feature was, or is targeted to be, at each stage.
milestone: