@pohly pohly commented Sep 4, 2025

  • One-line PR description: update for 1.35

  • Issue link: DRA: device taints and tolerations #5055

  • Other comments: This now covers publishing purely informational taints ("Effect: None"), rate limited eviction and improved user experience (DeviceTaintRule status, kubectl describe DeviceTaintRule).

The "informational taint" is a potential alternative to #5469.

The one-of ResourceSlice is relevant for #5234.

/cc @nojnhuh @byako @eero-t @mortent

@k8s-ci-robot

@pohly: GitHub didn't allow me to request PR reviews from the following users: eero-t.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.


@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Sep 4, 2025
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: pohly
Once this PR has been reviewed and has the lgtm label, please assign sanposhiho for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the kind/kep (Categorizes KEP tracking issues and PRs modifying the KEP directory) and sig/scheduling (Categorizes an issue or PR as relevant to SIG Scheduling.) labels Sep 4, 2025
@github-project-automation github-project-automation bot moved this to Needs Triage in SIG Scheduling Sep 4, 2025
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Sep 4, 2025
This now covers publishing purely informational taints ("Effect: None"),
rate limited eviction and improved user experience (DeviceTaintRule status,
`kubectl describe DeviceTaintRule`).
@pohly pohly force-pushed the dra-device-taints-1.35 branch from 73ef40f to 100e650 Compare September 4, 2025 09:29
Comment on lines +190 to +191
pods. Instead, publishing ResourceSlices with a taint informs about the problem.
It includes sufficient information about the problem to enable decision making.

Otherwise updates look good to me, but I think this description would read better e.g. as:

Suggested change
- pods. Instead, publishing ResourceSlices with a taint informs about the problem.
- It includes sufficient information about the problem to enable decision making.
+ pods. Instead, taints can be added to ResourceSlices to provide sufficient information
+ for decision making on the indicated problems.

Comment on lines +220 to +222
Once eviction starts, it happens at a low enough rate that admins have a chance
to delete the DeviceTaintRule before all pods are evicted if they made a
mistake after all. This rate is configurable to enable faster eviction, if

Is effect: None + kubectl describe by itself an adequate safeguard against erroneous taints? Or are there different kinds of mistakes that only rate limiting eviction would mitigate? In general I like the idea of effect: None essentially being a dry run instead of trying to determine in real-time whether a taint is working while it is or is not actively evicting pods. Wondering if that's a worthwhile way we could narrow the scope here.

Contributor Author

I don't think the rate limiting would uncover anything that kubectl describe may have missed, except perhaps a race (one pod shown as to be evicted, then a large number of additional such pods created, then eviction turned on and all of them deleted).

But primarily, kubectl describe is optional, so some admin might forget to double-check. Then the rate limiting may help as a second line of defense.
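To make the "effect: None as a dry run" idea from this thread concrete, here is a minimal Go sketch. The type `TaintEffect`, its constants, and `planEviction` are hypothetical names invented for illustration, not the actual resource.k8s.io API. It assumes that an informational taint reports the pods that would be affected without evicting them.

```go
package main

import "fmt"

// TaintEffect is a hypothetical stand-in for the taint effect discussed
// in this thread; the names below are illustrative only.
type TaintEffect string

const (
	EffectNone       TaintEffect = "None"       // purely informational
	EffectNoSchedule TaintEffect = "NoSchedule" // keeps new pods away
	EffectNoExecute  TaintEffect = "NoExecute"  // also evicts running pods
)

// planEviction returns the pods that would be evicted by a taint with the
// given effect, and whether eviction would actually be carried out.
// Assumption: with EffectNone the affected pods are only reported.
func planEviction(effect TaintEffect, podsUsingDevice []string) (affected []string, evict bool) {
	switch effect {
	case EffectNoExecute:
		return podsUsingDevice, true
	case EffectNone:
		// Dry run: show what would happen, change nothing.
		return podsUsingDevice, false
	default:
		// NoSchedule (and anything unknown) leaves running pods alone.
		return nil, false
	}
}

func main() {
	pods := []string{"pod-a", "pod-b"}
	for _, effect := range []TaintEffect{EffectNone, EffectNoSchedule, EffectNoExecute} {
		affected, evict := planEviction(effect, pods)
		fmt.Printf("effect=%s affected=%v evict=%t\n", effect, affected, evict)
	}
}
```

An admin following this flow would first publish the taint with the informational effect, inspect the reported pods, and only then switch to the evicting effect. Rate limiting remains useful for the case where that inspection is skipped.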

Comment on lines +262 to +263
The semantic of the value associated with a taint key is defined by whoever
publishes taints with that key. DRA drivers should use the driver name as

Is it worth drawing a parallel here to standardized device attributes? At least to call out that taint keys could also become standardized across vendors?

Contributor Author

Yes, that's worth describing.

ResourceClaims and Pods.

Calculating the outcome on the client side was chosen because the alternative,
always updating the status of DeviceTailRules on the server side, would cause

Suggested change
- always updating the status of DeviceTailRules on the server side, would cause
+ always updating the status of DeviceTaintRules on the server side, would cause

Comment on lines +609 to +610
// EvictionRate controls how quickly Pods get evicted if that is
// the effect of the taint. If multiple taints cause eviction

Should this first sentence mention the unit here being pods/s? Or rename the field something like EvictionsPerSecond?

Contributor Author

I like EvictionsPerSecond.
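A rough Go sketch of how a field named `EvictionsPerSecond` could carry its unit in the name and translate into pacing. The struct `evictionSettings`, the integer type, and the treatment of non-positive values as "no pacing" are assumptions of this sketch, not the API proposed in the KEP.

```go
package main

import (
	"fmt"
	"time"
)

// evictionSettings is a hypothetical container for the renamed field.
type evictionSettings struct {
	// EvictionsPerSecond is the maximum number of pods evicted per second.
	// Assumption for this sketch: values <= 0 mean "no pacing".
	EvictionsPerSecond int
}

// interval returns the pause between two consecutive evictions.
func (s evictionSettings) interval() time.Duration {
	if s.EvictionsPerSecond <= 0 {
		return 0
	}
	return time.Second / time.Duration(s.EvictionsPerSecond)
}

// schedule returns, for n pods, the offset from the start of eviction at
// which each pod would be evicted when evictions are spaced evenly.
func (s evictionSettings) schedule(n int) []time.Duration {
	offsets := make([]time.Duration, n)
	for i := range offsets {
		offsets[i] = time.Duration(i) * s.interval()
	}
	return offsets
}

func main() {
	s := evictionSettings{EvictionsPerSecond: 2}
	fmt.Println("interval:", s.interval())
	fmt.Println("offsets for 4 pods:", s.schedule(4))
}
```

With a low rate, the gap between offsets is the window in which an admin can still delete a mistaken DeviceTaintRule before every pod is gone.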

// The length must be smaller or equal to 1024.
//
// +optional
Description *string

I suppose another option could be to include this when applicable inside Data instead of having its own field. Or is the idea that Description would be shown by kubectl describe and Data would not?

Contributor Author

Yes, that's exactly the idea behind it being a first-class field.
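For illustration, the length constraint quoted above for the optional `Description` field could be checked as follows. The function name `validateDescription` and the choice to count bytes rather than characters are assumptions of this sketch; the quoted comment does not say which unit the limit uses.

```go
package main

import (
	"fmt"
	"strings"
)

// maxDescriptionLen follows the limit quoted in the API comment.
const maxDescriptionLen = 1024

// validateDescription checks an optional description. A nil pointer means
// the field is unset, which is valid because the field is optional.
// Assumption: the limit is measured in bytes.
func validateDescription(desc *string) error {
	if desc == nil {
		return nil
	}
	if len(*desc) > maxDescriptionLen {
		return fmt.Errorf("description is %d bytes, must be at most %d",
			len(*desc), maxDescriptionLen)
	}
	return nil
}

func main() {
	short := "GPU overheating, replacement scheduled"
	long := strings.Repeat("x", maxDescriptionLen+1)

	fmt.Println(validateDescription(nil))
	fmt.Println(validateDescription(&short))
	fmt.Println(validateDescription(&long))
}
```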

Labels

  • cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)

  • kind/kep (Categorizes KEP tracking issues and PRs modifying the KEP directory)

  • sig/scheduling (Categorizes an issue or PR as relevant to SIG Scheduling.)

  • size/L (Denotes a PR that changes 100-499 lines, ignoring generated files.)

Projects
Status: Needs Review
4 participants