KEP 5055: DRA Device Taints: update for 1.35 #5512
@pohly: GitHub didn't allow me to request PR reviews from the following users: eero-t. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: pohly.
This now covers publishing purely informational taints ("Effect: None"), rate limited eviction and improved user experience (DeviceTaintRule status, `kubectl describe DeviceTaintRule`).
73ef40f to 100e650
pods. Instead, publishing ResourceSlices with a taint informs about the problem.
It includes sufficient information about the problem to enable decision making.
Otherwise updates look good to me, but I think this description would read better e.g. as:
- pods. Instead, publishing ResourceSlices with a taint informs about the problem.
- It includes sufficient information about the problem to enable decision making.
+ pods. Instead, taints can be added to ResourceSlices to provide sufficient information
+ for decision making on the indicated problems.
Once eviction starts, it happens at a low enough rate that admins have a chance
to delete the DeviceTaintRule before all pods are evicted if they made a
mistake after all. This rate is configurable to enable faster eviction, if
Is `effect: None` + `kubectl describe` by itself an adequate safeguard against erroneous taints? Or are there different kinds of mistakes that only rate limiting eviction would mitigate? In general I like the idea of `effect: None` essentially being a dry run instead of trying to determine in real time whether a taint is working while it is or is not actively evicting pods. Wondering if that's a worthwhile way we could narrow the scope here.
I don't think the rate limiting would uncover anything that `kubectl describe` may have missed, except perhaps a race (one pod shown as to be evicted, a large number of additional such pods created, then eviction turned on and all of them deleted).
But primarily it is that `kubectl describe` is optional; some admin might forget to double-check. Then the rate limiting may help as a second line of defense.
The semantic of the value associated with a taint key is defined by whoever
publishes taints with that key. DRA drivers should use the driver name as
Is it worth drawing a parallel here to standardized device attributes? At least to call out that taint keys could also become standardized across vendors?
Yes, that's worth describing.
ResourceClaims and Pods.

Calculating the outcome on the client side was chosen because the alternative,
always updating the status of DeviceTailRules on the server side, would cause
- always updating the status of DeviceTailRules on the server side, would cause
+ always updating the status of DeviceTaintRules on the server side, would cause
// EvictionRate controls how quickly Pods get evicted if that is
// the effect of the taint. If multiple taints cause eviction
Should this first sentence mention that the unit here is pods/s? Or rename the field to something like `EvictionsPerSecond`?
I like `EvictionsPerSecond`.
// The length must be smaller or equal to 1024.
//
// +optional
Description *string
I suppose another option could be to include this when applicable inside `Data` instead of having its own field. Or is the idea that `Description` would be shown by `kubectl describe` and `Data` would not?
Yes, that's exactly the idea behind it being a first-class field.
One-line PR description: update for 1.35
Issue link: DRA: device taints and tolerations #5055
Other comments: This now covers publishing purely informational taints ("Effect: None"), rate limited eviction and improved user experience (DeviceTaintRule status, `kubectl describe DeviceTaintRule`).
The "informational taint" is a potential alternative to #5469.
The one-of ResourceSlice is relevant for #5234.
/cc @nojnhuh @byako @eero-t @mortent