207 changes: 185 additions & 22 deletions keps/sig-scheduling/5055-dra-device-taints-and-tolerations/README.md
@@ -70,8 +70,10 @@ SIG Architecture for cross-cutting KEPs).
- [User Stories](#user-stories)
- [Degraded Devices](#degraded-devices)
- [External Health Monitoring](#external-health-monitoring)
- [Safe Pod Eviction](#safe-pod-eviction)
- [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
- [kubectl describe DeviceTaintRule](#kubectl-describe-devicetaintrule)
- [API](#api)
- [Test Plan](#test-plan)
- [Prerequisite testing updates](#prerequisite-testing-updates)
@@ -167,6 +169,10 @@ ResourceClaim.
- Enable users to decide whether they want to keep running a workload in a degraded
mode while a device is unhealthy or prefer to get pods rescheduled.

- Publish information about devices ("device health") such that control plane
components or admins can decide how to react, without immediately affecting
scheduling or workloads.

### Non-Goals

- Not part of the plan for alpha: developing a kubectl command for managing device taints.
@@ -181,9 +187,17 @@ ResourceClaim.
A driver itself can detect problems which may or may not be tolerable for
workloads, like degraded performance due to overheating. Removing such devices
from the ResourceSlice would unconditionally prevent using them for new
pods. Instead, taints can be added to ResourceSlices to provide sufficient
information for deciding how to react to the indicated problems.

A control plane component or an admin then reacts to that information. They may
publish a DeviceTaintRule which prevents using the degraded device for new pods,
or which even evicts all pods currently using it so that the device can be
replaced or reset once it is idle.

Users can decide, at their own risk, to tolerate less critical taints in their
workloads. Admins who schedule maintenance pods need to add tolerations for
their own taints to get those pods scheduled.

#### External Health Monitoring

Expand All @@ -193,6 +207,22 @@ not supported by that DRA driver. When that component detects problems, it can
check its policy configuration and decide to take devices offline by creating
a DeviceTaintRule with a taint for affected devices.
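
As a rough illustration, here is what such a DeviceTaintRule could look like when
built with the Go types proposed later in this KEP. This is a sketch under
assumptions: the import path is a placeholder, the exact-match selector fields
are collapsed in this diff, and all driver, pool, and device names are made up.

```Go
package example

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/utils/ptr"

	// Assumption: placeholder import path for the types proposed in this KEP.
	resourceapi "k8s.io/api/resource/v1alpha3"
)

// newECCTaintRule builds a rule which takes one suspicious device offline for
// new pods.
func newECCTaintRule() *resourceapi.DeviceTaintRule {
	return &resourceapi.DeviceTaintRule{
		ObjectMeta: metav1.ObjectMeta{
			Name: "node-1-gpu-0-ecc-errors",
		},
		Spec: resourceapi.DeviceTaintRuleSpec{
			DeviceSelector: &resourceapi.DeviceTaintSelector{
				// Assumed exact-match fields; they are collapsed in this diff.
				Driver: ptr.To("gpu.example.com"),
				Pool:   ptr.To("node-1"),
				Device: ptr.To("gpu-0"),
			},
			// One taint per rule, as described in the KEP text.
			Taint: resourceapi.DeviceTaint{
				Key:    "monitoring.example.com/ecc-errors",
				Value:  "uncorrectable",
				Effect: resourceapi.DeviceTaintEffectNoSchedule,
				// Description is one of the new fields proposed in this KEP.
				Description: ptr.To("external health monitoring reported uncorrectable ECC errors"),
			},
		},
	}
}
```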

#### Safe Pod Eviction

Selecting the wrong set of devices in a DeviceTaintRule can have potentially
disastrous consequences, including quickly evicting all workloads using any
device in the cluster instead of those using a single device. To avoid this, a
cluster admin can first create a DeviceTaintRule with `None` as the effect so
that it has no immediate impact. `kubectl describe` then includes information
about the matched devices and how many pods would be evicted if the effect
were `NoExecute`. Then the admin can edit the DeviceTaintRule to set the
desired effect.

Once eviction starts, it happens at a low enough rate that admins have a chance
to delete the DeviceTaintRule before all pods are evicted if they made a
mistake after all. This rate is configurable to enable faster eviction, if
admins are sure that this is what they want. The DeviceTaintRule status
provides information about the progress.

> Review comment on lines +220 to +222:
>
> Reviewer: Is `effect: None` + `kubectl describe` by itself an adequate safeguard against erroneous taints? Or are there different kinds of mistakes that only rate limiting eviction would mitigate? In general I like the idea of `effect: None` essentially being a dry run instead of trying to determine in real time whether a taint is working while it is or is not actively evicting pods. Wondering if that's a worthwhile way we could narrow the scope here.
>
> Author: I don't think the rate limiting would uncover anything that `kubectl describe` may have missed, except perhaps a race (one pod shown as to be evicted, a large number of additional such pods created, then turning on eviction and deleting all of them).
>
> But primarily it is that `kubectl describe` is optional; some admin might forget to double-check. Then the rate limiting may help as a second line of defense.
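
A minimal sketch of this two-step flow, reusing the assumed imports and the
`newECCTaintRule` helper from the earlier example. The create and update API
calls are omitted; only the relevant spec changes are shown.

```Go
// dryRunThenEvict sketches the recommended workflow for a cluster admin.
func dryRunThenEvict() {
	// Step 1: create the rule as a dry run. Effect None has no impact on
	// scheduling or running pods, but `kubectl describe` shows the matched
	// devices and how many pods would be evicted.
	rule := newECCTaintRule()
	rule.Spec.Taint.Effect = resourceapi.DeviceTaintEffectNone
	// ... create the DeviceTaintRule, inspect it with `kubectl describe` ...

	// Step 2: after double-checking, switch to eviction and keep the rate
	// low enough to catch a mistake (the default is 100 pods/s).
	rule.Spec.Taint.Effect = resourceapi.DeviceTaintEffectNoExecute
	rule.Spec.Taint.EvictionsPerSecond = ptr.To[int64](5)
	// ... update the DeviceTaintRule ...
}
```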

### Risks and Mitigations

A device can be identified by its names (`<driver name>/<pool name>/<device
@@ -229,6 +259,24 @@ the purpose of this KEP:
- The ResourceClaim controller will remove such a completed pod from the claim's
`ReservedFor` and deallocate the claim once it has no consumers.

The semantics of the value associated with a taint key are defined by whoever
publishes taints with that key. DRA drivers should use the driver name as the
domain in the key to avoid conflicts. To support tolerating taints by value,
values should not require parsing to extract information. It's better to use
different keys with simple values than one key with a complex value.

> Review comment on lines +262 to +263:
>
> Reviewer: Is it worth drawing a parallel here to standardized device attributes? At least to call out that taint keys could also become standardized across vendors?
>
> Author: Yes, that's worth describing.

The taint value may or may not be sufficient to represent the state of a
device. Therefore the API also allows publishing structured information in JSON
format as a raw extension field, similar to the ResourceClaim device status.
For humans, a description field can be used to provide a summary or additional
explanation of why the taint was added.

If common patterns emerge, then Kubernetes could standardize the name, value,
and data for certain taints, giving them well-defined semantics. This is
similar to standardizing device attributes.

`Effect: None` can be used to publish taints which are merely informational.
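
To illustrate these conventions, here is a hypothetical driver-defined taint,
using the `DeviceTaint` type from the API section below and the assumed imports
from the earlier sketches plus `k8s.io/apimachinery/pkg/runtime`. The key,
value, and JSON payload are made up.

```Go
// A purely informational taint: driver name as the key domain, a simple value,
// a human-readable Description, and machine-readable Data.
var overheating = resourceapi.DeviceTaint{
	Key:         "gpu.example.com/overheating",
	Value:       "true",
	Effect:      resourceapi.DeviceTaintEffectNone,
	Description: ptr.To("GPU is throttled because the die temperature exceeded 90°C"),
	Data: &runtime.RawExtension{
		// Structured details for automated consumers, analogous to the
		// ResourceClaim device status.
		Raw: []byte(`{"temperatureCelsius": 94, "throttlePercent": 35}`),
	},
}
```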

Taints are cumulative:
- Taints defined by an admin in a DeviceTaintRule get added to the
set of taints defined by the DRA driver in a ResourceSlice.
@@ -258,30 +306,85 @@ Device and node taints are applied independently. A node taint applies to all
pods on a node, whereas a device taint affects claim allocation and only those
pods using the claim.

### kubectl describe DeviceTaintRule

An extension of `kubectl describe` for DeviceTaintRule uses the same code as
the eviction controller to calculate which devices are matched and which pods
would need to get evicted. The input for that code is the single
DeviceTaintRule which gets described and the complete set of ResourceSlices,
ResourceClaims and Pods.

Calculating the outcome on the client side was chosen because the alternative,
always updating the status of DeviceTaintRules on the server side, would cause
permanent churn regardless of whether someone wants the information or not.

The exact output will be decided during the implementation phase.
It can change at any time because the `kubectl describe` output does not fall
under any Kubernetes API stability guarantees.

### API

The ResourceSlice content gets extended. A natural place for a taint that
affects one specific device would be inside the `Device` struct. The problem
with that approach is that with 128 allowed devices per ResourceSlice the
maximum size of each taint would have to be very small to keep the worst-case
size of a ResourceSlice within the limits imposed by etcd and the API request
machinery.

Therefore a different approach is used where each ResourceSlice either provides
information about devices or about taints, but never both. Using the
ResourceSlice for taints instead of a different type has the advantage that the
existing mechanisms and code for publishing and consuming the information can
be reused. The same approach is likely to be used for other additional
ResourceSlice pool information, like mixin definitions. The downside is that at
least two ResourceSlices are necessary once taints get published by a DRA
driver.

```Go
type ResourceSliceSpec struct {
...

// Devices lists some or all of the devices in this pool.
//
// Must not have more than 128 entries. Either Devices or Taints may be set, but not both.
//
// +optional
// +listType=atomic
// +oneOf=ResourceSliceContent
Devices []Device

// If specified, these are driver-defined taints.
//
// The maximum number of taints is 32. Either Devices or Taints may be set, but not both.
//
// This is an alpha field and requires enabling the DRADeviceTaints
// feature gate.
//
// +optional
// +listType=atomic
// +featureGate=DRADeviceTaints
// +oneOf=ResourceSliceContent
Taints []SliceDeviceTaint
}

// DeviceTaintsMaxLength is the maximum number of taints per ResourceSlice.
const DeviceTaintsMaxLength = 32

// SliceDeviceTaint defines one taint within a ResourceSlice.
type SliceDeviceTaint struct {
// Device is the name of the device in the pool that the ResourceSlice belongs to
// which is affected by the taint. Multiple taints may affect the same device.
Device string

Taint DeviceTaint
}
```
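
As a sketch of how a driver could use this split, assuming the same imports as
in the earlier examples; pool metadata and most other fields are omitted and
all names are made up.

```Go
// One ResourceSlice carries the devices of the pool ...
var deviceSlice = resourceapi.ResourceSlice{
	ObjectMeta: metav1.ObjectMeta{Name: "node-1-gpus-devices"},
	Spec: resourceapi.ResourceSliceSpec{
		// ... Driver, Pool, NodeName, etc. ...
		Devices: []resourceapi.Device{
			{Name: "gpu-0"},
			{Name: "gpu-1"},
		},
	},
}

// ... and a second one, for the same driver and pool, carries the taints.
var taintSlice = resourceapi.ResourceSlice{
	ObjectMeta: metav1.ObjectMeta{Name: "node-1-gpus-taints"},
	Spec: resourceapi.ResourceSliceSpec{
		// ... same Driver and Pool as above ...
		Taints: []resourceapi.SliceDeviceTaint{{
			Device: "gpu-0",
			Taint: resourceapi.DeviceTaint{
				Key:    "gpu.example.com/overheating",
				Value:  "true",
				Effect: resourceapi.DeviceTaintEffectNoSchedule,
			},
		}},
	},
}
```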

DeviceTaint has all the fields of a v1.Taint, but the field documentation
differs slightly. In particular, PreferNoSchedule is not a valid effect.
Additional fields were added to cover use cases like device health information.

```Go
// The device this taint is attached to has the "effect" on
// any claim which does not tolerate the taint and, through the claim,
// to pods using the claim.
@@ -300,7 +403,7 @@ type DeviceTaint struct {

// The effect of the taint on claims that do not tolerate the taint
// and through such claims on the pods using them.
// Valid effects are None, NoSchedule and NoExecute. PreferNoSchedule as used for
// nodes is not valid here.
//
// +required
@@ -322,12 +425,51 @@ type DeviceTaint struct {
// This field was defined as "It is only written for NoExecute taints." for node taints.
// But in practice, Kubernetes never did anything with it (no validation, no defaulting,
// ignored during pod eviction in pkg/controller/tainteviction).

// Description is a human-readable explanation for the taint.
//
// The length must be smaller or equal to 1024.
//
// +optional
Description *string

// Review comment: I suppose another option could be to include this when
// applicable inside Data instead of having its own field. Or is the idea
// that Description would be shown by kubectl describe and Data would not?
//
// Author: Yes, that's exactly the idea behind it being a first-class field.

// Data contains arbitrary data specific to the taint key.
//
// The length of the raw data must be smaller or equal to 10 Ki.
//
// +optional
Data *runtime.RawExtension

// EvictionsPerSecond controls how quickly Pods get evicted if that is
// the effect of the taint. If multiple taints cause eviction
// of the same set of Pods, then the lowest rate defined in
// any of those taints applies.
//
// The default is 100 Pods/s.
//
// +optional
EvictionsPerSecond *int64
}

const (
// DefaultEvictionsPerSecond is the default for [DeviceTaint.EvictionsPerSecond]
// if none is specified explicitly.
DefaultEvictionsPerSecond = 100

// TaintDescriptionMaxLength is the maximum size of [DeviceTaint.Description].
TaintDescriptionMaxLength = 1024

// TaintDataMaxLength is the maximum size of [DeviceTaint.Data].
TaintDataMaxLength = 10 * 1024
)

// +enum
type DeviceTaintEffect string

const (
// No effect, the taint is purely informational.
DeviceTaintEffectNone DeviceTaintEffect = "None"

// Do not allow new pods to schedule which use a tainted device unless they tolerate the taint,
// but allow all pods submitted to Kubelet without going through the scheduler
// to start, and allow all already-running pods to continue running.
@@ -338,9 +480,6 @@ const (
)
```

Tolerations get added to a DeviceRequest:

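The actual field definitions are collapsed in this diff view. As a purely
hypothetical sketch of the expected shape, assuming the toleration type mirrors
`v1.Toleration` (names below are assumptions, not taken from this diff):

```Go
type DeviceRequest struct {
	// ... existing fields ...

	// Tolerations applies to all devices selected by this request.
	// (Hypothetical name and placement.)
	Tolerations []DeviceToleration
}

// DeviceToleration is assumed to mirror v1.Toleration: an empty Key together
// with operator Exists matches all taint keys, an empty Effect matches all
// effects, and TolerationSeconds bounds how long a NoExecute taint is tolerated.
type DeviceToleration struct {
	Key               string
	Operator          DeviceTolerationOperator
	Value             string
	Effect            DeviceTaintEffect
	TolerationSeconds *int64
}

type DeviceTolerationOperator string

const (
	DeviceTolerationOpExists DeviceTolerationOperator = "Exists"
	DeviceTolerationOpEqual  DeviceTolerationOperator = "Equal"
)
```
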
@@ -447,9 +586,15 @@ would be repetitive work. Instead, a
reacts to informer events for ResourceSlice and DeviceTaintRule and
maintains a set of updated ResourceSlices which also contain the taints
set via a DeviceTaintRule. The tracker provides the API of an informer
and thus can be used as a replacement for a ResourceSlice
informer.

It uses the type from `k8s.io/dynamic-resource-allocation/api` to represent
ResourceSlices. The difference compared to the `k8s.io/resource/v1` API is that
taints are stored in the device struct together with information about where
they came from. The eviction controller uses this to apply the DeviceTaintRule
rate limit and to update the DeviceTaintRuleStatus.
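
Conceptually, the tracker's patched view could look like the sketch below. This
is not the actual `k8s.io/dynamic-resource-allocation/api` definition, only an
illustration of the idea that each device carries its taints together with
their origin.

```Go
// trackedDevice is a conceptual sketch of a device in the tracker's output:
// it lists all taints that currently apply to the device.
type trackedDevice struct {
	Name   string
	Taints []trackedTaint
}

// trackedTaint records where a taint came from so that the eviction controller
// can apply the DeviceTaintRule's rate limit and update its status.
type trackedTaint struct {
	Taint resourceapi.DeviceTaint

	// Rule is set if the taint was added via a DeviceTaintRule rather than
	// published by the DRA driver in its ResourceSlice.
	Rule *resourceapi.DeviceTaintRule
}
```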

```Go
// DeviceTaintRule adds one taint to all devices which match the selector.
// This has the same effect as if the taint was specified directly
@@ -465,20 +610,16 @@ type DeviceTaintRule struct {
// Changing the spec automatically increments the metadata.generation number.
Spec DeviceTaintRuleSpec

// Status provides information about an ongoing pod eviction.
Status DeviceTaintRuleStatus
}

// DeviceTaintRuleSpec specifies the selector and one taint.
type DeviceTaintRuleSpec struct {
// DeviceSelector defines which device(s) the taint is applied to.
// All selector criteria must be satisfied for a device to
// match. The empty selector matches all devices. Without
// a selector, no devices are matched.
//
// +optional
DeviceSelector *DeviceTaintSelector
@@ -535,6 +676,28 @@ type DeviceTaintSelector struct {
// +listType=atomic
Selectors []DeviceSelector
}

// DeviceTaintRuleStatus provides information about an ongoing pod eviction.
type DeviceTaintRuleStatus struct {
// Review comment (pohly, Sep 22, 2025): While implementing the type updates I
// was reminded that bumping the spec increments the Generation. This was meant
// as preparation for conditions which can record the observed generation to
// document what they are based on.
//
// But now these new status fields aren't using conditions. Should they? There's
// no field to store data like "pods pending eviction" in machine-readable form
// in https://pkg.go.dev/k8s.io/apimachinery/pkg/apis/meta/v1#Condition. The
// numbers could be part of the message field:
//
//	Condition{
//		Type:               "InProgress",
//		Status:             metav1.ConditionTrue,
//		ObservedGeneration: rule.Generation,
//		LastTransitionTime: metav1.Now(),
//		Reason:             "PodsPendingEviction",
//		Message:            fmt.Sprintf("%d pods are pending eviction, %d evicted already", numPodsPending, numPodsEvicted),
//	}
//
// This becomes Status: ConditionFalse when there are no pods which need to be
// evicted.
//
// Author, in a follow-up: The advantage of conditions is that
// `kubectl wait --for=condition=InProgress=false DeviceTaintRule/my-taint-rule`
// can be used, which is nicer than
// `kubectl wait --for=jsonpath='{.status.podsPendingEviction}'=0`.

// PodsPendingEviction counts the number of Pods which still need to be evicted.
// Because taints with the NoExecute effect also prevent scheduling new pods,
// this number should eventually reach zero.
//
// The count gets updated periodically, so it is not guaranteed to be 100%
// accurate.
//
// +optional
PodsPendingEviction *int64

// PodsEvicted counts the number of Pods which were evicted because of the taint.
//
// This gets updated periodically, so it is not guaranteed to be 100%
// accurate. The actual count may be higher if the controller evicted
// some Pods and then gets restarted before updating this field.
//
// +optional
PodsEvicted *int64

// Review comment (pohly, Sep 22, 2025): I'm undecided about this field. As it
// stands, this is a lower bound: "at least this number of pods were evicted,
// but maybe also more". 99% of the time it'll be accurate, but guaranteeing
// that would imply evicting a pod and updating this field in a single atomic
// transaction. If the kube-controller-manager dies and restarts, it cannot
// figure out whether all evicted pods were accounted for.
//
// When using conditions, determining the old count becomes icky (need to parse
// a string on startup, which may not always work because IMHO we shouldn't make
// the string format part of the official API).
//
// Reviewer: Is the idea with the PodsEvicted field that it will monotonically
// increase forever? Would it make more sense to expose that as a metric tied to
// the lifetime of a kube-controller-manager instance? Then Prometheus would
// make it easier to add up the total evictions for a given DeviceTaintRule
// among all past and present kube-controller-manager pods without Kubernetes
// having to handle that itself.
//
// I think the API is still a good place to at least keep some sign of
// "in progress"/"done" but maybe doesn't need to detail the exact progress.
// For that I think a condition makes sense.
}
```

### Test Plan
@@ -20,7 +20,7 @@ stage: alpha
# The most recent milestone for which work toward delivery of this KEP has been
# done. This can be the current (upcoming) milestone, if it is being actively
# worked on.
latest-milestone: "v1.33"
latest-milestone: "v1.35"

# The milestone at which this feature was, or is targeted to be, at each stage.
milestone: