# KEP 5055: DRA Device Taints: update for 1.35 (#5512)
[…]
- [User Stories](#user-stories)
  - [Degraded Devices](#degraded-devices)
  - [External Health Monitoring](#external-health-monitoring)
  - [Safe Pod Eviction](#safe-pod-eviction)
- [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
  - [kubectl describe DeviceTaintRule](#kubectl-describe-devicetaintrule)
  - [API](#api)
  - [Test Plan](#test-plan)
    - [Prerequisite testing updates](#prerequisite-testing-updates)
[…]
- Enable users to decide whether they want to keep running a workload in a degraded
  mode while a device is unhealthy or prefer to get pods rescheduled.

- Publish information about devices ("device health") such that control plane
  components or admins can decide how to react, without immediately affecting
  scheduling or workloads.

### Non-Goals

- Not part of the plan for alpha: developing a kubectl command for managing device taints.
[…]
A driver itself can detect problems which may or may not be tolerable for
workloads, like degraded performance due to overheating. Removing such devices
from the ResourceSlice would unconditionally prevent using them for new
pods. Instead, taints can be added to ResourceSlices to provide sufficient
information for decision making about the indicated problems.

A control plane component or the admins react to that information. They may
publish a DeviceTaintRule which prevents using the degraded device for new pods,
or even evict all pods currently using it so that the device can be replaced or
reset once it is idle.

Users can decide to tolerate less critical taints in their workload, at their
own risk. Admins scheduling maintenance pods need to tolerate their own taints
to get those pods scheduled.
#### External Health Monitoring
[…] not supported by that DRA driver. When that component detects problems, it can
check its policy configuration and decide to take devices offline by creating
a DeviceTaintRule with a taint for affected devices.

#### Safe Pod Eviction

Selecting the wrong set of devices in a DeviceTaintRule can have potentially
disastrous consequences, including quickly evicting all workloads using any
device in the cluster instead of those using a single device. To avoid this, a
cluster admin can first create a DeviceTaintRule such that it has no immediate
effect. `kubectl describe` then includes information about the matched devices
and how many pods would be evicted if the effect were to evict. Then the admin
can edit the DeviceTaintRule to set the desired effect.

Once eviction starts, it happens at a low enough rate that admins have a chance
to delete the DeviceTaintRule before all pods are evicted if they made a
mistake after all. This rate is configurable to enable faster eviction if
admins are sure that this is what they want. The DeviceTaintRule status
provides information about the progress.
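To make the workflow concrete, here is a minimal sketch using the Go types
proposed in this KEP. The object name, driver, pool and device are invented,
`resourceapi` stands for the Kubernetes resource API package containing these
types, and the clientset wiring is omitted:

```Go
import (
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/utils/ptr"
)

// Step 1: create the rule without any immediate effect (a "dry run").
var rule = resourceapi.DeviceTaintRule{
    ObjectMeta: metav1.ObjectMeta{Name: "drain-gpu-0"},
    Spec: resourceapi.DeviceTaintRuleSpec{
        DeviceSelector: &resourceapi.DeviceTaintSelector{
            Driver: ptr.To("gpu.example.com"),
            Pool:   ptr.To("worker-1"),
            Device: ptr.To("gpu-0"),
        },
        Taint: resourceapi.DeviceTaint{
            Key:    "example.com/maintenance",
            Effect: resourceapi.DeviceTaintEffectNone,
        },
    },
}

// Step 2: `kubectl describe devicetaintrule drain-gpu-0` shows the matched
// devices and the pods that would be evicted.

// Step 3: once the selection looks right, enable eviction, optionally with a
// deliberately low rate as an additional safety net.
func enableEviction(rule *resourceapi.DeviceTaintRule) {
    rule.Spec.Taint.Effect = resourceapi.DeviceTaintEffectNoExecute
    rule.Spec.Taint.EvictionsPerSecond = ptr.To[int64](1)
}
```

At the default of 100 evictions/s, a mistake can drain a thousand pods in
roughly ten seconds; a rate of 1/s stretches the same eviction to over a
quarter of an hour, leaving time to delete the rule.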
### Risks and Mitigations

A device can be identified by its names (`<driver name>/<pool name>/<device name>`) […]
[…] the purpose of this KEP:
- The ResourceClaim controller will remove such a completed pod from the claim's
  `ReservedFor` and deallocate the claim once it has no consumers.

The semantics of the value associated with a taint key are defined by whoever
publishes taints with that key. DRA drivers should use the driver name as
domain in the key to avoid conflicts. To support tolerating taints by value,
values should not require parsing to extract information. It's better to use
different keys with simple values than one key with a complex value.

> **Review comment** (on the preceding paragraph): Is it worth drawing a parallel
> here to standardized device attributes? At least to call out that taint keys
> could also become standardized across vendors?
>
> **Reply:** Yes, that's worth describing.
The taint value may or may not be sufficient to represent the state of a
device. Therefore the API also allows publishing structured information in JSON
format as a raw extension field, similar to the ResourceClaim device status.
For humans, a description field can be used to provide a summary or additional
explanations of why the taint was added.

If some common patterns emerge, then Kubernetes could standardize the name,
value and data for certain taints. Those taints then have a well-defined
semantic. This is similar to standardizing device attributes.

`Effect: None` can be used to publish taints which are merely informational.
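As an illustration of these guidelines (hypothetical keys and values, not
defined by this KEP, with `resourceapi` standing for the package containing the
types below): a driver reporting an overheating device might publish one taint
with a simple value plus machine-readable details, instead of encoding
everything into a single compound value.

```Go
import (
    "k8s.io/apimachinery/pkg/runtime"
    "k8s.io/utils/ptr"
)

// Good: driver name as domain, simple value, structured details in Data.
var overheated = resourceapi.DeviceTaint{
    Key:         "gpu.example.com/overheated",
    Value:       "true",
    Effect:      resourceapi.DeviceTaintEffectNone, // informational only
    Description: ptr.To("GPU is thermally throttled, performance degraded"),
    Data: &runtime.RawExtension{
        Raw: []byte(`{"temperatureCelsius": 93, "throttlePercent": 40}`),
    },
}

// Bad: a compound value that tolerations would have to parse, e.g.
// Key: "gpu.example.com/health", Value: "overheated=true,temp=93".
```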
Taints are cumulative:

- Taints defined by an admin in a DeviceTaintRule get added to the
  set of taints defined by the DRA driver in a ResourceSlice.
[…] Device and node taints are applied independently. A node taint applies to all
pods on a node, whereas a device taint affects claim allocation and only those
pods using the claim.

### kubectl describe DeviceTaintRule

An extension of `kubectl describe` for DeviceTaintRule uses the same code as
the eviction controller to calculate which devices are matched and which pods
would need to get evicted. The input for that code is the single
DeviceTaintRule which gets described and the complete set of ResourceSlices,
ResourceClaims and Pods.

Calculating the outcome on the client side was chosen because the alternative,
always updating the status of DeviceTaintRules on the server side, would cause
permanent churn regardless of whether someone wants the information or not.

The exact output will be decided during the implementation phase.
It can change at any time because the `kubectl describe` output does not fall
under any Kubernetes API stability guarantees.
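The shared calculation could look roughly like the following sketch. The
helpers `deviceMatches` and `tolerates` stand in for the real selector and
toleration evaluation and are not existing functions; only the API fields used
here come from the resource API:

```Go
import (
    v1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/types"
)

// podsPendingEviction is a sketch of the logic shared between the eviction
// controller and `kubectl describe`.
func podsPendingEviction(rule *resourceapi.DeviceTaintRule,
    slices []*resourceapi.ResourceSlice,
    claims []*resourceapi.ResourceClaim,
    pods map[types.UID]*v1.Pod) []*v1.Pod {
    // Devices matched by the rule, keyed by <driver>/<pool>/<device>.
    tainted := map[string]bool{}
    for _, slice := range slices {
        for _, device := range slice.Spec.Devices {
            if deviceMatches(rule.Spec.DeviceSelector, slice, device) {
                tainted[slice.Spec.Driver+"/"+slice.Spec.Pool.Name+"/"+device.Name] = true
            }
        }
    }
    var evict []*v1.Pod
    for _, claim := range claims {
        if claim.Status.Allocation == nil {
            continue
        }
        usesTainted := false
        for _, result := range claim.Status.Allocation.Devices.Results {
            if tainted[result.Driver+"/"+result.Pool+"/"+result.Device] &&
                !tolerates(claim, result, rule.Spec.Taint) {
                usesTainted = true
            }
        }
        if !usesTainted {
            continue
        }
        // Every pod which has reserved the claim would be evicted.
        for _, consumer := range claim.Status.ReservedFor {
            if pod, ok := pods[consumer.UID]; ok {
                evict = append(evict, pod)
            }
        }
    }
    return evict
}
```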
### API

The ResourceSlice content gets extended. A natural place for a taint that
affects one specific device would be inside the `Device` struct. The problem
with that approach is that with 128 allowed devices per ResourceSlice, the
maximum size of each taint would have to be very small to keep the worst-case
size of a ResourceSlice within the limits imposed by etcd and the API request
machinery. For example, at the limits proposed below (up to 1 KiB of
description and 10 KiB of data per taint), even one such taint on each of 128
devices would approach etcd's default request limit of roughly 1.5 MiB, whereas
32 taints in a dedicated ResourceSlice stay around 350 KiB.

Therefore a different approach is used where each ResourceSlice provides either
information about devices or about taints, but never both. Using the
ResourceSlice for taints instead of a different type has the advantage that the
existing mechanisms and code for publishing and consuming the information can
be reused. The same approach is likely to be used for other additional
ResourceSlice pool information, like mixin definitions. The downside is that at
least two ResourceSlices are necessary once taints get published by a DRA
driver.
```Go
type ResourceSliceSpec struct {
    ...

    // Devices lists some or all of the devices in this pool.
    //
    // Must not have more than 128 entries. Either Devices or Taints may be set, but not both.
    //
    // +optional
    // +listType=atomic
    // +oneOf=ResourceSliceContent
    Devices []Device

    // If specified, these are driver-defined taints.
    //
    // The maximum number of taints is 32. Either Devices or Taints may be set, but not both.
    //
    // This is an alpha field and requires enabling the DRADeviceTaints
    // feature gate.
    //
    // +optional
    // +listType=atomic
    // +featureGate=DRADeviceTaints
    // +oneOf=ResourceSliceContent
    Taints []SliceDeviceTaint
}

// DeviceTaintsMaxLength is the maximum number of taints per ResourceSlice.
const DeviceTaintsMaxLength = 32

// SliceDeviceTaint defines one taint within a ResourceSlice.
type SliceDeviceTaint struct {
    // Device is the name of the device in the pool that the ResourceSlice belongs to
    // which is affected by the taint. Multiple taints may affect the same device.
    Device string

    Taint DeviceTaint
}
```
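For illustration, a DRA driver could construct the taint-only companion slice
like this (a sketch: driver, pool and device names are invented, the remaining
`ResourcePool` fields are elided, and `resourceapi` refers to the package
containing the types above):

```Go
import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

var taintSlice = resourceapi.ResourceSlice{
    ObjectMeta: metav1.ObjectMeta{Name: "worker-1-gpu.example.com-taints"},
    Spec: resourceapi.ResourceSliceSpec{
        Driver: "gpu.example.com",
        Pool:   resourceapi.ResourcePool{Name: "worker-1" /* ... */},
        // Taints instead of Devices: this slice only carries taint information
        // for devices published in the other slice(s) of the same pool.
        Taints: []resourceapi.SliceDeviceTaint{{
            Device: "gpu-0",
            Taint: resourceapi.DeviceTaint{
                Key:    "gpu.example.com/overheated",
                Value:  "true",
                Effect: resourceapi.DeviceTaintEffectNoSchedule,
            },
        }},
    },
}
```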
DeviceTaint has all the fields of a v1.Taint, but the description is a bit
different. In particular, PreferNoSchedule is not valid. Other fields were added
to satisfy additional use cases like device health information.
```Go
// The device this taint is attached to has the "effect" on
// any claim which does not tolerate the taint and, through the claim,
// to pods using the claim.
type DeviceTaint struct {
    ...

    // The effect of the taint on claims that do not tolerate the taint
    // and through such claims on the pods using them.
    // Valid effects are None, NoSchedule and NoExecute. PreferNoSchedule as used for
    // nodes is not valid here.
    //
    // +required
    ...

    // This field was defined as "It is only written for NoExecute taints." for node taints.
    // But in practice, Kubernetes never did anything with it (no validation, no defaulting,
    // ignored during pod eviction in pkg/controller/tainteviction).

    // Description is a human-readable explanation for the taint.
    //
    // The length must be smaller or equal to 1024.
    //
    // +optional
    Description *string

    // Data contains arbitrary data specific to the taint key.
    //
    // The length of the raw data must be smaller or equal to 10 Ki.
    //
    // +optional
    Data *runtime.RawExtension

    // EvictionsPerSecond controls how quickly Pods get evicted if that is
    // the effect of the taint. If multiple taints cause eviction
    // of the same set of Pods, then the lowest rate defined in
    // any of those taints applies.
    //
    // The default is 100 Pods/s.
    //
    // +optional
    EvictionsPerSecond *int64
}

const (
    // DefaultEvictionsPerSecond is the default for [DeviceTaint.EvictionsPerSecond]
    // if none is specified explicitly.
    DefaultEvictionsPerSecond = 100

    // TaintDescriptionMaxLength is the maximum size of [DeviceTaint.Description].
    TaintDescriptionMaxLength = 1024

    // TaintDataMaxLength is the maximum size of [DeviceTaint.Data].
    TaintDataMaxLength = 10 * 1024
)

// +enum
type DeviceTaintEffect string

const (
    // No effect, the taint is purely informational.
    DeviceTaintEffectNone DeviceTaintEffect = "None"

    // Do not allow new pods to schedule which use a tainted device unless they tolerate the taint,
    // but allow all pods submitted to Kubelet without going through the scheduler
    // to start, and allow all already-running pods to continue running.
    ...
)
```

> **Review comment** (on `Description`): I suppose another option could be to
> include this when applicable inside […]
>
> **Reply:** Yes, that's exactly the idea behind it being a first-class field.
Tolerations get added to a DeviceRequest:
```Go
...
```
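The toleration block is elided in this diff. For orientation, a request
tolerating the hypothetical overheating taint from earlier for up to an hour
might look like the following sketch, based on the existing DeviceToleration
API; the exact placement of the fields within DeviceRequest may differ:

```Go
import "k8s.io/utils/ptr"

var request = resourceapi.DeviceRequest{
    Name:            "gpu",
    DeviceClassName: "gpu.example.com",
    Tolerations: []resourceapi.DeviceToleration{{
        Key:      "gpu.example.com/overheated",
        Operator: resourceapi.DeviceTolerationOpExists,
        Effect:   resourceapi.DeviceTaintEffectNoExecute,
        // Keep running for up to one hour after the taint appears.
        TolerationSeconds: ptr.To[int64](3600),
    }},
}
```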
[…] would be repetitive work. Instead, a tracker
reacts to informer events for ResourceSlice and DeviceTaintRule and
maintains a set of updated ResourceSlices which also contain the taints
set via a DeviceTaintRule. The tracker provides the API of an informer
and thus can be used as a replacement for a ResourceSlice
informer.

It uses the types from `k8s.io/dynamic-resource-allocation/api` to represent
ResourceSlices. The difference compared to the `k8s.io/resource/v1` API is that
taints are stored in the device struct together with information about where
they came from. The eviction controller uses this to apply the DeviceTaintRule
rate limit and to update the DeviceTaintRuleStatus.
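A consumer such as the scheduler might wire it up roughly like this; the
`NewTracker` constructor and its arguments are assumptions for illustration,
not the final API:

```Go
import "k8s.io/client-go/tools/cache"

// watchEffectiveSlices registers for ResourceSlices whose devices already
// carry the taints merged in from DeviceTaintRules.
func watchEffectiveSlices(sliceInformer, ruleInformer cache.SharedIndexInformer) {
    tracker := NewTracker(sliceInformer, ruleInformer)
    tracker.AddEventHandler(cache.ResourceEventHandlerFuncs{
        UpdateFunc: func(oldObj, newObj interface{}) {
            // React to changes in the effective taints here.
        },
    })
}
```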
```Go
// DeviceTaintRule adds one taint to all devices which match the selector.
// This has the same effect as if the taint was specified directly
...

type DeviceTaintRule struct {
    ...

    // Changing the spec automatically increments the metadata.generation number.
    Spec DeviceTaintRuleSpec

    // Status provides information about an on-going pod eviction.
    Status DeviceTaintRuleStatus
}

// DeviceTaintRuleSpec specifies the selector and one taint.
type DeviceTaintRuleSpec struct {
    // DeviceSelector defines which device(s) the taint is applied to.
    // All selector criteria must be satisfied for a device to
    // match. The empty selector matches all devices. Without
    // a selector, no devices are matched.
    //
    // +optional
    DeviceSelector *DeviceTaintSelector

    ...
}

type DeviceTaintSelector struct {
    ...

    // +listType=atomic
    Selectors []DeviceSelector
}

// DeviceTaintRuleStatus provides information about an on-going pod eviction.
type DeviceTaintRuleStatus struct {
    // PodsPendingEviction counts the number of Pods which still need to be evicted.
    // Because taints with the NoExecute effect also prevent scheduling new pods,
    // this number should eventually reach zero.
    //
    // The count gets updated periodically, so it is not guaranteed to be 100%
    // accurate.
    //
    // +optional
    PodsPendingEviction *int64

    // PodsEvicted counts the number of Pods which were evicted because of the taint.
    //
    // This gets updated periodically, so it is not guaranteed to be 100%
    // accurate. The actual count may be higher if the controller evicted
    // some Pods and then gets restarted before updating this field.
    //
    // +optional
    PodsEvicted *int64
}
```

> **Review comment** (on the status fields): While implementing the type updates
> I was reminded that bumping the spec increments the `metadata.generation`
> number. But now these new status fields aren't using conditions. Should they?
> There's no field to store data like "pods pending eviction" in machine-readable
> form in https://pkg.go.dev/k8s.io/apimachinery/pkg/apis/meta/v1#Condition. The
> numbers could be part of the message field: […]
>
> **Reply:** The advantage of conditions is that […]

> **Review comment** (on `PodsEvicted`): I'm undecided about this field. As it
> stands, this is a lower bound: "at least this number of pods were evicted, but
> maybe also more". 99% of the time it'll be accurate, but guaranteeing that
> would imply evicting a pod and updating this field in a single atomic
> transaction. If the kube-controller-manager dies and restarts, it cannot figure
> out whether all evicted pods were accounted for. When using conditions,
> determining the old count becomes icky (need to parse a string on startup,
> which may not always work, because IMHO we shouldn't make the string format
> part of the official API).
>
> **Reply:** Is the idea with the […]? I think the API is still a good place to
> at least keep some sign of "in progress"/"done", but maybe it doesn't need to
> detail the exact progress. For that I think a condition makes sense.
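For example, assuming the status fields above get added, a controller or script
could wait for a drain to finish like this (a sketch; the group/version of the
generated client accessor is an assumption):

```Go
import (
    "context"
    "time"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/util/wait"
    "k8s.io/client-go/kubernetes"
)

// waitForDrain polls the rule status until no pods are pending eviction.
func waitForDrain(ctx context.Context, client kubernetes.Interface, name string) error {
    return wait.PollUntilContextTimeout(ctx, 10*time.Second, 30*time.Minute, true,
        func(ctx context.Context) (bool, error) {
            rule, err := client.ResourceV1alpha3().DeviceTaintRules().Get(ctx, name, metav1.GetOptions{})
            if err != nil {
                return false, err
            }
            pending := rule.Status.PodsPendingEviction
            return pending != nil && *pending == 0, nil
        })
}
```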
### Test Plan

> **Review comment** (on the overall safeguards): Is `effect: None` +
> `kubectl describe` by itself an adequate safeguard against erroneous taints? Or
> are there different kinds of mistakes that only rate-limiting eviction would
> mitigate? In general I like the idea of `effect: None` essentially being a dry
> run, instead of trying to determine in real time whether a taint is working
> while it is or is not actively evicting pods. Wondering if that's a worthwhile
> way we could narrow the scope here.
>
> **Reply:** I don't think the rate limiting would uncover anything that
> `kubectl describe` may have missed, except perhaps a race (pods shown as to be
> evicted, a large number of additional such pods created, then eviction turned
> on and all of them deleted). But primarily it is that `kubectl describe` is
> optional; some admin might forget to double-check. Then the rate limiting may
> help as a second line of defense.