KEP-5501: Reflect PreEnqueue rejections in Pod status #5510
Conversation
macsko commented on Sep 1, 2025
- One-line PR description: Add a KEP-5501
- Issue link: Reflect PreEnqueue rejections in pod status #5501
- Other comments:
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: macsko. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing
/assign @dom4ha @sanposhiho
/hold
Looks good!
- Gives plugin developers the flexibility to decide if a status update is valuable for their specific logic.

- **Cons:**
  - Could lead to perceived inconsistency, where some `PreEnqueue` rejections appear on the Pod status
Yes, it could be an inconsistency, but is that really any harm? If there is a need to display a human-readable message to the user, then an empty rejection message can be interpreted as some default message (e.g. one mentioning possible transient errors or other potentially vague reasons).
Is this point about not reporting the fact that a pod is waiting, or just about skipping a custom message?
> Is this point about not reporting the fact that a pod is waiting, or just about skipping a custom message?

Yes. See the DRA case (kubernetes/kubernetes#129698 (comment)) and the explanation in the "How should plugins provide the status message?" section:
> This flexibility is important for plugins that wish to avoid reporting transient rejections.
> For example, the `DynamicResources` plugin might observe a rejection because the scheduler processes a Pod faster
> than a `ResourceClaim` becomes visible through the watcher (see a comment).
> In such cases where the condition is expected to resolve in seconds, populating the status would be inappropriate noise.
- **Alternatives Considered:**
  - Actively clearing the message: The scheduler could clear the condition if,
    on a subsequent check, the original rejecting plugin no longer rejects the pod.
Would simply checking the diff between states "before" and "now" be enough?
WDYM?
You said the condition would be cleared if the original rejecting plugin no longer rejects the pod.
I am suggesting it may be simpler than that: updates should be sent whenever there is a difference between the conditions in the cache and the conditions after the current PreEnqueue run.
> updates should be sent whenever there is a difference between the conditions in the cache and the conditions after the current PreEnqueue run

But that looks like the current proposal, doesn't it? The only case where we wouldn't do this is a transition from Plugin A (which reported a message) to Plugin B (which didn't report one).
Yes, that's my point. My thinking is that it's better to display some generic default message for the failing Plugin B than to keep displaying a no-longer-correct message saying that Plugin A failed.
  - Actively clearing the message: The scheduler could clear the condition if,
    on a subsequent check, the original rejecting plugin no longer rejects the pod.
    This would provide a more accurate real-time status but could cause confusion
    if the condition appears and disappears while the Pod remains `Pending` for another, unreported reason.
Maybe displaying a default message to the user in case of an empty status (as mentioned in my comment above) would help solve this?
If our goal is to make the messaging meaningful and informative to users, then an explanatory default message could possibly help make things clearer for them?
I'm not sure such a default message would be that informative. It won't clearly answer why the pod is on hold. I believe the stale message might be better than sending such a placeholder, and it would also reduce the number of API calls sent.
As I wrote above - my personal opinion is that being not-very-informative is better than being misleading.
I might be wrong, but I assume that if a user looks at the status message, they are inclined to investigate the reason behind the pending state, e.g. check whether some resource is indeed not ready. And if they see that the resource is ready and the pod still keeps waiting for it, they might consider this a bug in k8s and report it, causing more bug-triaging work and a poor user experience.
Please note that in the case of the `Unschedulable` reason, when the pod is rejected by filters, we don't remove this message. It is displayed in the Pod status until the pod is retried. I'm not sure if we can treat a similar mechanism as misleading in the case of PreEnqueue.
That is a fair point, but the `Unschedulable` reason reflects that the pod is in the unschedulable queue (or pending a scheduling cycle) and the scheduler does not yet know if the pod will become schedulable.
In the case of PreEnqueue, the scheduler does know that the specific PreEnqueue plugin is no longer failing, but in the proposed scenario the scheduler doesn't surface this knowledge to the user.
That's right, but from the user's POV, the `Unschedulable` status may be outdated anyway, especially when the scheduling queue contains plenty of pods. So I expect users to be aware that the longer the status remains unchanged, the more likely it is to be outdated.
Obviously, the `NonReadyForScheduling` reason could have its own semantics. I'm just still not sure what the best choice is here.
Is there any sort of an "average scenario" where we could measure how often pods actually fail scheduling in preenqueue? Or would that look very different for various users?
It depends on the scenario, but:
- For SchedulingGates, PreEnqueue fails until all the gates are removed (so it could fail multiple times per pod).
- For DefaultPreemption, PreEnqueue fails until the API calls for all of the preemption's victims are made (this shouldn't take very long, so one PreEnqueue failure per pod is probably a good assumption).
- For DRA, PreEnqueue fails until all the ResourceClaims are available. Usually there is one failure per pod, unless someone forgets to create the ResourceClaim 😃
The integration tests listed above should cover all the scenarios,
so implementing e2e tests is a redundant effort.

### Graduation Criteria
Actually what is the intended behavior when the feature is disabled? Using the new API for PreEnqueue, but not filling the new optional field?
The PreEnqueue plugins would use the new interface, but the reported message would be ignored, i.e. no API call would be dispatched.
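A minimal sketch of that behavior, assuming a hypothetical `PreEnqueueResult` type with an optional `Message` field; the names and the helper below are illustrative, not the proposed implementation:

```go
package main

import "fmt"

// PreEnqueueResult is a hypothetical result type discussed in this KEP:
// a rejecting plugin may attach an optional, user-facing message.
type PreEnqueueResult struct {
	Message string // empty means "reject, but don't report to the Pod status"
}

// reportPreEnqueueRejection sketches how the framework could honor the
// SchedulerPreEnqueuePodStatus feature gate: plugins always go through the
// new interface, but the returned message is dropped when the gate is off,
// so no status-patch API call is dispatched.
func reportPreEnqueueRejection(podName string, res *PreEnqueueResult, gateEnabled bool) {
	if !gateEnabled || res == nil || res.Message == "" {
		return // feature disabled or plugin opted out: no API call
	}
	// In the real scheduler this would be an asynchronous status patch.
	fmt.Printf("patch status of %s: %s\n", podName, res.Message)
}

func main() {
	res := &PreEnqueueResult{Message: "waiting for ResourceClaim to be created"}
	reportPreEnqueueRejection("pod-a", res, false) // gate off: nothing happens
	reportPreEnqueueRejection("pod-a", res, true)  // gate on: patch dispatched
}
```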
During the beta period, the feature gate `SchedulerPreEnqueuePodStatus` is enabled by default,
so users don't need to opt in. This is a purely in-memory feature for the kube-scheduler,
so no special actions are required outside the scheduler.
Except that out-of-tree plugin code needs to be updated to use the new API.
Unless we don't consider them "outside of scheduler"?
Here, in the upgrade strategy, we assume that the scheduler is running the new version, so its plugins use the updated PreEnqueue interface. Enabling/disabling the feature gate doesn't impact the plugins, as they are already migrated.
Very well written, and a very valuable feature. I wonder how Workload Aware Scheduling may change the approach if we could write a workload status instead for repeating pods. Maybe we could even write the status to hypothetical PodGroups objects?
Currently, when a `PreEnqueue` plugin rejects a Pod, this decision is not communicated back to the user
via the Pod's status. The Pod simply remains `Pending`, leaving the user to manually inspect scheduler logs
or other components to diagnose the issue. This lack of feedback is particularly problematic for plugins
that operate transparently to the user, as the reason for the delay is completely hidden.
What does it mean exactly?
Do you have any specific sentence in mind? In general, this paragraph describes the current behavior, where we don't communicate the PreEnqueue rejection to users.
Only for Scheduling Gates do we do that, through the kube-apiserver: it knows that the pod would be rejected because it can infer that from the pod spec.
Maybe I should rephrase the sentence about operating transparently. I meant that other plugins, like DRA, reject the Pod based on more indirect rules, which can't be easily inferred from the Pod itself.
Ah, right, I highlighted "plugins that operate transparently to the user", but GitHub does not reflect that.
Maybe rephrase it?
(e.g., by returning an empty message described above).

- **Pros:**
  - Reduces API server load and potential "noise" for plugins that reject pods for very short,
Can't we mitigate this problem by introducing some sort of delay or rate limiting (for the number of PreEnqueue updates)? Could you document such a possibility, with your assessment of whether it does or does not make sense?
I'll take that into consideration and mention it in the KEP. We already have some kind of rate limiting, as we limit the number of dispatched goroutines in the API dispatcher. I'm not sure how a delay could help here.
Some more advanced mechanisms could be part of the Asynchronous API calls extension.
A delay can help avoid reporting things that have a follow-up call clearing the message, so only prolonged states would be reflected.
Introducing a delay shifts responsibility from the plugin to the framework. However, I'm not sure if that's really what we want. Why shouldn't the plugin, which has the most information about why it rejected the Pod, decide what to do with the message?
Even if we had such a delay mechanism, how should we set the delay so that it is correct? If it is too low, we may send irrelevant messages (as in the case of DRA or async preemption) and increase the load on the kube-apiserver. If it is too high, we will delay the status update, reducing debuggability and undermining the purpose for which we want this feature.
In addition, we must keep our current plugins in mind:
- SchedulingGates — its status is reported by the kube-apiserver. We could move it to the kube-scheduler, but this would involve unnecessary API calls that can be avoided by keeping the current behavior in the kube-apiserver. Also, keep in mind that this plugin has its own status reason (`SchedulingGated`), which would have to be differentiated somehow if we move it to the kube-scheduler.
- DefaultPreemption — I believe we should rely on the status returned by PostFilter, so it should not be overwritten with the less informative PreEnqueue message.
- DynamicResources — as mentioned, they may want to decide what and when they want to report.
  - Allows plugins to opt out of reporting by returning an empty message.

- **Cons:**
  - Requires a breaking change to the `PreEnqueue` plugin interface.
Can we use some standard message for beta, so as not to introduce a breaking change? Providing the info that a pod is waiting and which plugin blocks it would already improve the current state. We could also prevent plugin developers from creating non-parsable messages by disallowing variable messages. We could reevaluate this before moving to GA.
We shouldn't introduce such changes just before moving to GA, but rather do so during the beta phase. Keep in mind that v1.35 will include the final changes related to moving the interfaces to staging, so plugin developers will have to update their plugin interfaces anyway.
  - Provides a clear, logical progression for users to follow.

- **Cons:**
  - A stale message could be displayed if a Pod is first rejected by a reporting plugin (Plugin A)
Why can't the message be updated once that happens?
The reason is in the "Actively clearing the message" alternative below:

> This would provide a more accurate real-time status but could cause confusion
> if the condition appears and disappears while the Pod remains `Pending`
> for another, unreported reason.

Moreover, we need to consider the number of API calls we would like to send. I believe that if we can omit some of them, we should do so to mitigate the performance impact of this feature.
If we had a delay that addresses the problem of noise, we could require setting a status message for all plugins (even a default one like "Blocked by plugin X"). Of course, clearing would not have any delay and would cancel setting a message if it's no longer relevant. Also, a new message would instantly replace the previous one, but would be delayed as well, so we increase the likelihood that it's also canceled (no unnecessary noise).
I barely remember, but we had some argument against delaying API calls; maybe it was only for setting nominations, which are time-critical. In this case it should be rather safe.
+1 for default messages "blocked by plugin X" and to clearing messages / updating them with another plugin
The Cluster Autoscaler can be updated to recognize this condition as a valid trigger for scale-up.
When it sees a Pod with this condition, it could correctly request a new node from a node pool that provides the required resource.

### Risks and Mitigations
Another risk is generating quite significant traffic, especially for big workloads. I'm not sure how to mitigate it, but a possible option is to rate-limit setting the same type of message. I know it does not sound simple. In Workload Aware Scheduling we could set this kind of message on the workload object instead.
I'll add that risk to this section.
Remember that PreEnqueue mostly reports issues with the pod and its direct dependencies' states, not a filtering rejection (= no fit). If users create a large quantity of such pods, that's a problem anyway.
The difference might come with WAS or gang scheduling, which would use PreEnqueue to block pods from being tried. However, as you mentioned, it would be good to have some workload-level message for WAS to mitigate that.
For other plugins, please remember that such traffic would be generated anyway if we rejected the pod at (Pre)Filter instead of PreEnqueue.
A delay in setting the PreEnqueue status should help (assuming the wait is short). True, unschedulable pods need to get an error status anyway.
@@ -377,7 +377,8 @@ To prevent API server flooding when a pod is rejected for the same reason repeat
- Before dispatching a status patch, the scheduler will compare the new rejection message
  with the cached `lastPreEnqueueRejectionMessage`.
- The asynchronous call to the API server will only be dispatched if the status is new or different from the cached one.
  This deduplication is important for performance. Moreover, when the new message is empty or the whole `PreEnqueueResult` is `nil`,
  the call won't be send, as explained in the [how should outdated messages be handled](#4-how-should-outdated-messages-be-handled) section.
Suggested change:
- the call won't be send, as explained in the [how should outdated messages be handled](#4-how-should-outdated-messages-be-handled) section.
+ the call won't be sent, as explained in the [how should outdated messages be handled](#4-how-should-outdated-messages-be-handled) section.
I'm glad we can finally start this work, thanks to the async API call feature :)
I just skimmed through, and left some initial comments.
- Use the raw status message: This is simpler as it requires no interface changes.
  However, these messages are often not written well for end-users.
  It also makes it difficult for a plugin to conditionally opt out of reporting a status.
Hmm, actually this alternative looks better to me, because it would match the messages from other extension points: we're already using the messages from the statuses returned from Filter etc.

> It also makes it difficult for a plugin to conditionally opt out of reporting a status.

Do we need to allow plugins not to report the messages? That would just hide the reason the pod is stuck from users.
And, even if you disagree with me and we do want to implement such opt-out behavior, I'd like to just extend the semantics of framework.Status: e.g., if framework.Status has `code == Unschedulable` with `reasons == nil`, then we regard it as the plugin wanting to reject this pod but not report anything to users.
> we're already using the messages from the statuses returned from Filter etc.

I agree that it would be more consistent with the actual plugin's implementation, but if we want to have opt-out behavior, I'd consider having a new field.

> Do we need to allow plugins not to report the messages? That would just hide the reason the pod is stuck from users.

From the In-Tree Plugin Changes section:

> - SchedulingGates: Will always return an empty message, as its status will continue to be set by the kube-apiserver.
> - DefaultPreemption: Will always return an empty message, as its state is transient and the `Unschedulable` status from the `PostFilter` stage is more descriptive.
> - DynamicResources: Will be updated to return a descriptive message when a `ResourceClaim` is missing. It can also return a `nil` result to avoid reporting transient delays, for example when waiting for a `ResourceClaim` to become available in the scheduler's cache.

Allowing plugins not to report the message can reduce the traffic to the kube-apiserver when such an operation is irrelevant/redundant.

> if framework.Status has `code == Unschedulable` with `reasons == nil`, then we regard it as the plugin wanting to reject this pod but not report anything to users.

I am not in favor of this approach. If we do this, it will affect logging (we would like to log why a plugin rejected the pod, but might not want to display that to the user immediately) and make it harder to manage for plugin developers: the interface would be indirect, so they would have to rely on and remember that an empty reason has its own behavior. Introducing a direct field in PreEnqueueResult would be more explicit.
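To make the comparison concrete, here is a compact sketch of the two shapes under discussion, using simplified stand-in types; the real framework.Status is richer than this, and `PreEnqueueResult` is still hypothetical:

```go
package main

import "fmt"

// Status is a simplified stand-in for framework.Status: a code plus reasons.
type Status struct {
	Code    string
	Reasons []string
}

// PreEnqueueResult (hypothetical) carries an explicit, optional user-facing message.
type PreEnqueueResult struct {
	Message string // "" => reject, but don't patch the Pod status
}

func main() {
	// Option A (explicit field): the Pod status stays silent, but the rejection
	// reason is still present in the Status and can be logged.
	resA := &PreEnqueueResult{Message: ""}
	stA := &Status{Code: "Unschedulable", Reasons: []string{"waiting for async preemption"}}
	fmt.Println(resA.Message == "", stA.Reasons)

	// Option B (overloaded semantics): Unschedulable with nil reasons means
	// "reject but don't report". It is indirect, and there is no reason left to log.
	stB := &Status{Code: "Unschedulable", Reasons: nil}
	fmt.Println(stB.Reasons == nil)
}
```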
I have the impression that all the reasons not to report come down to saving kube-apiserver traffic, except for SchedulingGates, which I may not understand well.
Delaying the message can reduce the traffic generated by transient issues. I'm only worried that a delayed message may overwrite some other message that was set in the meantime by someone else. But all message changes initiated by the kube-scheduler should cancel any previously pending (waiting) one and replace it with a new one or clear it.
I see you updated the KEP with good arguments against the delay. Some alternatives are to:
- make the delay optional (but I'm not sure it's worth implementing),
- update the workload object status instead of the blocked pods (once the generic workload object is defined; it's in the design phase now). The lack of a delay still generates noise for transient issues, although on just one object,
- detect that objects have been waiting on PreEnqueue for a longer period of time. It's like double-checking whether the condition still holds before applying a status. Still sounds complicated.
> From the In-Tree Plugin Changes section:
> ...
> Allowing plugins not to report the message can reduce the traffic to the kube-apiserver when such an operation is irrelevant/redundant.

Hmm, even for the default preemption / dynamic resources, I feel like it's better to report something to indicate that the pods are not being retried on purpose, since today that is not visible at all. As Dominik mentioned, the delay by the async API call mechanism can mitigate the concern of too many apiserver calls. And if we implement the idea you mentioned in another thread of mine, I don't think there would be too many API calls triggered by those PreEnqueue plugins (assuming the plugins return the same simple message at every rejection).
Good point. I will add a larger section about the delay, and then we will be able to make a decision on that.
(e.g., by returning an empty message described above).

- **Pros:**
  - Reduces API server load and potential "noise" for plugins that reject pods for very short,
To mitigate it, can we just skip the error reporting if it's identical to the existing one?
E.g., if the pod keeps being rejected by the preemption PreEnqueue plugin, it will get a message like "This pod is not schedulable because waiting for the preemption to be completed" the first time, but the preemption PreEnqueue plugin will (i.e. should) return the same common message afterwards, and then the scheduling framework doesn't update the pod with the message every time PreEnqueue rejects the same pod for the same reason.
We have that already explained in the "Preventing redundant API calls" section of Design Details:

> To prevent API server flooding when a pod is rejected for the same reason repeatedly, message caching will be introduced.
> - A new field will be added to the `PodInfo` struct within the queue, for example, `lastPreEnqueueRejectionMessage`.
> - Before dispatching a status patch, the scheduler will compare the new rejection message with the cached `lastPreEnqueueRejectionMessage`.
> - The asynchronous call to the API server will only be dispatched if the status is new or different from the cached one. (...)
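A minimal sketch of that deduplication check, assuming the `lastPreEnqueueRejectionMessage` field described above; the surrounding types are simplified placeholders rather than the scheduler's real `PodInfo`:

```go
package main

import "fmt"

// PodInfo is a simplified placeholder for the queue's PodInfo struct.
type PodInfo struct {
	Name                           string
	lastPreEnqueueRejectionMessage string
}

// maybeDispatchStatusPatch dispatches a status patch only when the new message
// is non-empty and differs from the cached one, as described in the KEP.
func maybeDispatchStatusPatch(pi *PodInfo, newMessage string) {
	if newMessage == "" || newMessage == pi.lastPreEnqueueRejectionMessage {
		return // opt-out or unchanged: skip the API call
	}
	pi.lastPreEnqueueRejectionMessage = newMessage
	// In the real scheduler this would go through the asynchronous API dispatcher.
	fmt.Printf("patch %s: %s\n", pi.Name, newMessage)
}

func main() {
	pi := &PodInfo{Name: "pod-a"}
	maybeDispatchStatusPatch(pi, "waiting for preemption to complete")
	maybeDispatchStatusPatch(pi, "waiting for preemption to complete") // deduplicated, no call
}
```

The same check naturally covers the opt-out case, since an empty message is never dispatched.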
  - Introduces a new value to the API, which must be documented and maintained.

- **Alternatives Considered:**
  - PodReasonUnschedulable (`Unschedulable`): Reusing this reason would be somewhat incorrect,
I prefer not to add a new one; I'd prefer to use the existing condition. I feel like adding a new condition is rather confusing for users, especially those who are not very familiar with the scheduler.
Also, keeping the same condition would help other external components (e.g., CA), since they won't need to support a new condition?

> as the Pod has not failed a scheduling attempt (i.e., it was never evaluated against nodes).

Today, it can already happen that the scheduling cycle doesn't evaluate nodes but rejects a pod at PreFilter.
> Also, keeping the same condition would help other external components (e.g., CA), since they won't need to support a new condition

Or break them, if they rely on the fact that the pod was processed and rejected by the filters.
I think I was wrong about the CA use case, and the PreEnqueue rejection shouldn't be processed by the CA, because such rejections are not related to the nodes (capacity etc.), but to the pod itself. That can't be fixed by the CA's node provisioning. So, using the `Unschedulable` status would force the CA to process such impossible-to-fix pods or try to filter them out somehow.

> Today, it can already happen that the scheduling cycle doesn't evaluate nodes but rejects a pod at PreFilter.

And such behavior could eventually be migrated to PreEnqueue.

> I feel like adding a new condition is rather confusing for users, especially those who are not very familiar with the scheduler.

The PreEnqueue rejection is different from the `Unschedulable` rejection, as it shows that the pod is missing something, so I think having the new reason is okay.
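For illustration only, this is roughly what the condition could look like if a dedicated reason such as `NonReadyForScheduling` (a name mentioned earlier in this thread; the final string is not decided) were set on the existing `PodScheduled` condition, assuming the k8s.io/api and k8s.io/apimachinery modules:

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// Sketch only: the reason and message below are hypothetical examples.
	cond := v1.PodCondition{
		Type:               v1.PodScheduled,
		Status:             v1.ConditionFalse,
		Reason:             "NonReadyForScheduling",
		Message:            "waiting for a ResourceClaim to be created",
		LastTransitionTime: metav1.Now(),
	}
	fmt.Printf("%s=%s reason=%s: %s\n", cond.Type, cond.Status, cond.Reason, cond.Message)
}
```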
> I think I was wrong about the CA use case, and the PreEnqueue rejection shouldn't be processed by the CA

This is a good point, and I'm convinced.

I've updated the KEP with the alternatives mentioned in the comments.
@@ -194,6 +199,14 @@ In such cases where the condition is expected to resolve in seconds, populating
- Mandatory reporting for all rejections: This would make the scheduler's behavior uniform.
  However, it might create significant API churn for plugins that reject pods frequently and for short durations,
  negatively impacting performance and UX.
- Mandatory reporting with a delay (cooldown period): This approach attempts to reduce API churn by waiting for a brief,
  configurable delay before reporting a rejection. While better than immediate mandatory reporting, this has some flaws:
who should configure the delay?
That's a good question. We could have a few options:
- A hardcoded delay per API call, e.g., delay only the PreEnqueue status patch call by a certain time.
- A configurable delay through the scheduler's config - might be hard to implement cleanly.
- A per-plugin delay, passed via PreEnqueueResult - this would move the responsibility to the plugin developers, who might not know what the correct delay should be anyway.
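Whichever option is chosen, a cancellable delayed update could look roughly like this sketch; all names are illustrative, and this is not a proposed implementation:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// delayedReporter holds pending status updates and only dispatches those that
// survive the cooldown; rejections that clear quickly produce no API traffic.
type delayedReporter struct {
	mu      sync.Mutex
	pending map[string]*time.Timer
}

func (r *delayedReporter) schedule(pod, message string, delay time.Duration) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if t, ok := r.pending[pod]; ok {
		t.Stop() // a newer message replaces the pending one
	}
	r.pending[pod] = time.AfterFunc(delay, func() {
		fmt.Printf("patch %s: %s\n", pod, message) // only prolonged rejections are reported
	})
}

func (r *delayedReporter) cancel(pod string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if t, ok := r.pending[pod]; ok {
		t.Stop() // the rejection cleared before the cooldown elapsed: no noise
		delete(r.pending, pod)
	}
}

func main() {
	r := &delayedReporter{pending: map[string]*time.Timer{}}
	r.schedule("pod-a", "blocked on PreEnqueue by plugin DynamicResources", 2*time.Second)
	r.cancel("pod-a") // transient rejection resolved: nothing is sent
	r.schedule("pod-b", "blocked on PreEnqueue by plugin SchedulingGates", 1*time.Second)
	time.Sleep(2 * time.Second) // pod-b's rejection persisted, so its patch is dispatched
}
```

The trade-off raised above still applies: the delay value itself has to come from somewhere (hardcoded, config, or per-plugin).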
the scheduler could generate a generic message (e.g., "Pod is blocked on PreEnqueue by plugin: Plugin B").
This has two flaws:
- While it identifies the blocking plugin, it doesn't explain why the Pod was blocked, which is the essential information for debugging.
  A generic message isn't a significant improvement over a stale (but potentially actionable) message.
In my opinion, stale != actionable 🙂 It was actionable, but now it no longer is, because it's stale.
A stale message is still actionable (someone can try to fix the issue or just wait for it to resolve), compared to a useless (only slightly informative) message that is not actionable at all.

I added a section about the delay to the KEP. It might still be lacking some points, so please tell me if I missed anything. FYI, I'll respond to the new comments next Friday.