@@ -167,6 +167,10 @@ ResourceClaim.
167
167
- Enable users to decide whether they want to keep running a workload in a degraded
168
168
mode while a device is unhealthy or prefer to get pods rescheduled.
169
169
170
+ - Publish information about devices ("device health") such that control plane
171
+ components or admins can decide how to react, without immediately affecting
172
+ scheduling or workloads.
173
+
170
174
### Non-Goals
171
175
172
176
- Not part of the plan for alpha: developing a kubectl command for managing device taints.
@@ -181,9 +185,17 @@ ResourceClaim.
181
185
A driver itself can detect problems which may or may not be tolerable for
182
186
workloads, like degraded performance due to overheating. Removing such devices
183
187
from the ResourceSlice would unconditionally prevent using them for new
184
- pods. Instead, publishing with a taint informs users about this degradation and
185
- leaves them the choice whether the device is still usable enough to run pods.
186
- It also automates stopping pods which don't tolerate such a degradation.
188
+ pods. Instead, publishing ResourceSlices with a taint reports the problem.
189
+ The taint includes sufficient information about the problem to enable decision making.
190
+
191
+ A control plane component or the admins react to that information. They may
192
+ publish a DeviceTaintRule which prevents using the degraded device for new pods
193
+ or even evicts all pods currently using it, so that the device can be replaced or reset
194
+ once it is idle.
195
+
196
+ Users can decide to tolerate less critical taints in their workload, at their
197
+ own risk. Admins scheduling maintenance pods need to tolerate their own taints
198
+ to get the pod scheduled.
187
199
188
200
#### External Health Monitoring
189
201
@@ -193,6 +205,22 @@ not supported by that DRA driver. When that component detects problems, it can
193
205
check its policy configuration and decide to take devices offline by creating
194
206
a DeviceTaintRule with a taint for affected devices.
195
207
208
+ #### Safe Pod Eviction
209
+
210
+ Selecting the wrong set of devices in a DeviceTaintRule can have potentially
211
+ disastrous consequences, including quickly evicting all workloads using any
212
+ device in the cluster instead of those using a single device. To avoid this, a
213
+ cluster admin can first create a DeviceTaintRule such that it has no immediate
214
+ effect. `kubectl describe` then includes information about the matched devices
215
+ and how many pods would be evicted if the effect was to evict. Then the admin
216
+ can edit the DeviceTaintRule to set the desired effect.
217
+
218
+ Once eviction starts, it happens at a low enough rate that admins have a chance
219
+ to delete the DeviceTaintRule before all pods are evicted if they made a
220
+ mistake after all. This rate is configurable to enable faster eviction, if
221
+ admins are sure that this is what they want. The DeviceTaintRule status
222
+ provides information about the progress.
223
+
196
224
### Risks and Mitigations
197
225
198
226
A device can be identified by its names (`<driver name>/<pool name>/<device
@@ -229,6 +257,20 @@ the purpose of this KEP:
229
257
- The ResourceClaim controller will remove such a completed pod from the claim's
230
258
`ReservedFor` and deallocate the claim once it has no consumers.
231
259
260
+ The semantics of the value associated with a taint key are defined by whoever
261
+ publishes taints with that key. DRA drivers should use the driver name as
262
+ domain in the key to avoid conflicts. To support tolerating taints by value,
263
+ values should not require parsing to extract information. It's better to use
264
+ different keys with simple values than one key with a complex value.
265
+
266
+ The taint value may or may not be sufficient to represent the state of a
267
+ device. Therefore the API also allows publishing structured information in JSON
268
+ format as a raw extension field, similar to the ResourceClaim device status.
269
+ For humans, a description field can be used to provide a summary or additional
270
+ explanations why the taint was added.
271
+
272
+ `Effect: None` can be used to publish taints which are merely informational.
273
+
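The advice to prefer several keys with simple values over one key with a complex value can be illustrated with a minimal sketch of toleration matching. The `Taint` and `Toleration` types below are simplified, hypothetical stand-ins rather than the API types of this KEP, the `Exists`/`Equal` operators are assumed to behave like node tolerations, and `gpu.example.com` is a made-up driver name.

```go
package main

import "fmt"

// Taint is a simplified, hypothetical stand-in for a device taint.
type Taint struct {
	Key    string
	Value  string
	Effect string
}

// Toleration is a simplified, hypothetical stand-in for a device toleration.
// The operator semantics are assumed to mirror node tolerations.
type Toleration struct {
	Key      string // empty matches any key when Operator is "Exists"
	Operator string // "Exists" or "Equal"
	Value    string // only compared when Operator is "Equal"
	Effect   string // empty matches any effect
}

// Tolerates reports whether the toleration covers the taint.
// Values are compared as opaque strings and never parsed.
func (t Toleration) Tolerates(taint Taint) bool {
	if t.Effect != "" && t.Effect != taint.Effect {
		return false
	}
	switch t.Operator {
	case "Exists":
		return t.Key == "" || t.Key == taint.Key
	case "Equal":
		return t.Key == taint.Key && t.Value == taint.Value
	default:
		return false
	}
}

func main() {
	// The driver name is used as the domain of the key to avoid conflicts.
	minor := Taint{Key: "gpu.example.com/overheated", Value: "minor", Effect: "NoSchedule"}
	severe := Taint{Key: "gpu.example.com/overheated", Value: "severe", Effect: "NoSchedule"}

	tol := Toleration{Key: "gpu.example.com/overheated", Operator: "Equal", Value: "minor"}
	fmt.Println(tol.Tolerates(minor), tol.Tolerates(severe))
}
```

Because the comparison is a plain string equality, a value such as `minor` can be tolerated directly, whereas a value that packs several facts into one string could only be tolerated as a whole.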
232
274
Taints are cumulative:
233
275
- Taints defined by an admin in a DeviceTaintRule get added to the
234
276
set of taints defined by the DRA driver in a ResourceSlice.
@@ -258,30 +300,85 @@ Device and node taints are applied independently. A node taint applies to all
258
300
pods on a node, whereas a device taint affects claim allocation and only those
259
301
pods using the claim.
260
302
303
+ ### kubectl describe DeviceTaintRule
304
+
305
+ An extension of `kubectl describe` for DeviceTaintRule uses the same code as
306
+ the eviction controller to calculate which devices are matched and which pods
307
+ would need to get evicted. The input for that code is the single
308
+ DeviceTaintRule which gets described and the complete set of ResourceSlices,
309
+ ResourceClaims and Pods.
310
+
311
+ Calculating the outcome on the client side was chosen because the alternative,
312
+ always updating the status of DeviceTaintRules on the server side, would cause
313
+ permanent churn regardless of whether someone wants the information or not.
314
+
315
+ The exact output will be decided during the implementation phase.
316
+ It can change at any time because the `kubectl describe` output does not fall
317
+ under any Kubernetes API stability guarantees.
318
+
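A rough sketch of the client-side calculation described above might look as follows. It is not the shared eviction controller code: all types and the `Preview` function are hypothetical simplifications, and tolerations are ignored for brevity, so the reported pods are an upper bound.

```go
package main

import (
	"fmt"
	"sort"
)

// DeviceID identifies a device by driver, pool and device name.
type DeviceID struct{ Driver, Pool, Device string }

// Selector is a hypothetical, simplified selector; empty fields match anything.
type Selector struct{ Driver, Pool, Device string }

// Claim is a simplified allocated ResourceClaim together with its consuming pods.
type Claim struct {
	Name      string
	Allocated []DeviceID
	Pods      []string
}

// Matches reports whether the selector matches the device.
func (s Selector) Matches(d DeviceID) bool {
	return (s.Driver == "" || s.Driver == d.Driver) &&
		(s.Pool == "" || s.Pool == d.Pool) &&
		(s.Device == "" || s.Device == d.Device)
}

// Preview returns the matched devices and the sorted, de-duplicated names of
// pods which use at least one matched device through a claim.
func Preview(sel Selector, devices []DeviceID, claims []Claim) ([]DeviceID, []string) {
	var matched []DeviceID
	for _, d := range devices {
		if sel.Matches(d) {
			matched = append(matched, d)
		}
	}

	seen := map[string]bool{}
	var pods []string
	for _, c := range claims {
		for _, d := range c.Allocated {
			if !sel.Matches(d) {
				continue
			}
			for _, p := range c.Pods {
				if !seen[p] {
					seen[p] = true
					pods = append(pods, p)
				}
			}
			break // one matched device is enough for this claim
		}
	}
	sort.Strings(pods)
	return matched, pods
}

func main() {
	devices := []DeviceID{
		{"gpu.example.com", "node-1", "gpu-0"},
		{"gpu.example.com", "node-1", "gpu-1"},
	}
	claims := []Claim{
		{Name: "claim-a", Allocated: devices[:1], Pods: []string{"pod-a"}},
		{Name: "claim-b", Allocated: devices[1:], Pods: []string{"pod-b"}},
	}
	matched, pods := Preview(Selector{Device: "gpu-0"}, devices, claims)
	fmt.Println(len(matched), pods)
}
```

Running the same function with an empty `Selector` shows why a preview is useful: it matches every device and lists every pod that uses one.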
261
319
### API
262
320
263
- The ResourceSlice content gets extended:
321
+ The ResourceSlice content gets extended. A natural place for a taint that
322
+ affects one specific device would be inside the `Device` struct. The problem
323
+ with that approach is that with 128 allowed devices per ResourceSlice the
324
+ maximum size of each taint would have to be very small to keep the worst-case
325
+ size of a ResourceSlice within the limits imposed by etcd and the API request
326
+ machinery.
327
+
328
+ Therefore a different approach is used where each ResourceSlice either provides
329
+ information about devices or about taints, but never both. Using the
330
+ ResourceSlice for taints instead of a different type has the advantage that the
331
+ existing mechanisms and code for publishing and consuming the information can
332
+ be reused. The same approach is likely to be used for other additional
333
+ ResourceSlice pool information, like mixin definitions. The downside is that at
334
+ least two ResourceSlices are necessary once taints get published by a DRA
335
+ driver.
264
336
265
337
``` Go
266
- type Device struct {
338
+ type ResourceSliceSpec struct {
267
339
...
268
340
269
- // If specified, these are the driver-defined taints.
341
+ // Devices lists some or all of the devices in this pool.
342
+ //
343
+ // Must not have more than 128 entries. Either Devices or Taints may be set, but not both.
344
+ //
345
+ // +optional
346
+ // +listType=atomic
347
+ // +oneOf=ResourceSliceContent
348
+ Devices []Device
349
+
350
+ // If specified, these are driver-defined taints.
270
351
//
271
- // The maximum number of taints is 4.
352
+ // The maximum number of taints is 32. Either Devices or Taints may be set, but not both.
272
353
//
273
354
// This is an alpha field and requires enabling the DRADeviceTaints
274
355
// feature gate.
275
356
//
276
357
// +optional
277
358
// +listType=atomic
278
359
// +featureGate=DRADeviceTaints
279
- Taints []DeviceTaint
360
+ // +oneOf=ResourceSliceContent
361
+ Taints []SliceDeviceTaint
362
+ }
363
+
364
+ // DeviceTaintsMaxLength is the maximum number of taints per ResourceSlice.
365
+ const DeviceTaintsMaxLength = 32
366
+
367
+ // SliceDeviceTaint defines one taint within a ResourceSlice.
368
+ type SliceDeviceTaint struct {
369
+ // Device is the name of the device in the pool that the ResourceSlice belongs to
370
+ // which is affected by the taint. Multiple taints may affect the same device.
371
+ Device string
372
+
373
+ DeviceTaint
280
374
}
375
+ ```
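The "either Devices or Taints, but not both" rule and the two size limits could be checked along these lines. This is only a sketch under assumptions: the types are trimmed down to the fields needed here, and `ValidateContent` and `devicesMaxLength` are hypothetical names, not the actual validation code.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	// devicesMaxLength is a hypothetical name for the 128 device limit.
	devicesMaxLength = 128
	// DeviceTaintsMaxLength is the maximum number of taints per ResourceSlice.
	DeviceTaintsMaxLength = 32
)

// Device and SliceDeviceTaint are reduced to a few fields for this sketch.
type Device struct{ Name string }

type SliceDeviceTaint struct {
	Device string
	Key    string
	Effect string
}

type ResourceSliceSpec struct {
	Devices []Device
	Taints  []SliceDeviceTaint
}

// ValidateContent enforces that a slice carries devices or taints, never both,
// and that neither list exceeds its limit.
func ValidateContent(spec ResourceSliceSpec) error {
	switch {
	case len(spec.Devices) > 0 && len(spec.Taints) > 0:
		return errors.New("either devices or taints may be set, but not both")
	case len(spec.Devices) > devicesMaxLength:
		return fmt.Errorf("too many devices: %d > %d", len(spec.Devices), devicesMaxLength)
	case len(spec.Taints) > DeviceTaintsMaxLength:
		return fmt.Errorf("too many taints: %d > %d", len(spec.Taints), DeviceTaintsMaxLength)
	}
	return nil
}

func main() {
	// A driver publishing taints needs a second slice in addition to the device slice.
	deviceSlice := ResourceSliceSpec{Devices: []Device{{Name: "gpu-0"}}}
	taintSlice := ResourceSliceSpec{Taints: []SliceDeviceTaint{
		{Device: "gpu-0", Key: "gpu.example.com/overheated", Effect: "None"},
	}}
	fmt.Println(ValidateContent(deviceSlice), ValidateContent(taintSlice))
}
```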
281
376
282
- // DeviceTaintsMaxLength is the maximum number of taints per device.
283
- const DeviceTaintsMaxLength = 4
377
+ DeviceTaint has all the fields of a v1.Taint, but the description is a bit
378
+ different. In particular, PreferNoSchedule is not valid. Other fields got added
379
+ to satisfy additional use cases like device health information.
284
380
381
+ ``` Go
285
382
// The device this taint is attached to has the "effect" on
286
383
// any claim which does not tolerate the taint and, through the claim,
287
384
// to pods using the claim.
@@ -300,7 +397,7 @@ type DeviceTaint struct {
300
397
301
398
// The effect of the taint on claims that do not tolerate the taint
302
399
// and through such claims on the pods using them.
303
- // Valid effects are NoSchedule and NoExecute. PreferNoSchedule as used for
400
+ // Valid effects are None, NoSchedule and NoExecute. PreferNoSchedule as used for
304
401
// nodes is not valid here.
305
402
//
306
403
// +required
@@ -311,6 +408,20 @@ type DeviceTaint struct {
311
408
// Implementing PreferNoSchedule would depend on a scoring solution for DRA.
312
409
// It might get added as part of that.
313
410
411
+ // Description is a human-readable explanation for the taint.
412
+ //
413
+ // The length must be less than or equal to 1024.
414
+ //
415
+ // +optional
416
+ Description *string
417
+
418
+ // Data contains arbitrary data specific to the taint key.
419
+ //
420
+ // The length of the raw data must be less than or equal to 10 Ki.
421
+ //
422
+ // +optional
423
+ Data *runtime.RawExtension
424
+
314
425
// TimeAdded represents the time at which the taint was added.
315
426
// Added automatically during create or update if not set.
316
427
//
@@ -324,10 +435,21 @@ type DeviceTaint struct {
324
435
// ignored during pod eviction in pkg/controller/tainteviction).
325
436
}
326
437
438
+ const (
439
+ // TaintDescriptionMaxLength is the maximum size of [DeviceTaint.Description].
440
+ TaintDescriptionMaxLength = 1024
441
+
442
+ // TaintDataMaxLength is the maximum size of [DeviceTaint.Data].
443
+ TaintDataMaxLength = 10 * 1024
444
+ )
445
+
327
446
// +enum
328
447
type DeviceTaintEffect string
329
448
330
449
const (
450
+ // No effect, the taint is purely informational.
451
+ DeviceTaintEffectNone DeviceTaintEffect = "None"
452
+
331
453
// Do not allow new pods to schedule which use a tainted device unless they tolerate the taint,
332
454
// but allow all pods submitted to Kubelet without going through the scheduler
333
455
// to start, and allow all already-running pods to continue running.
@@ -338,9 +460,6 @@ const (
338
460
)
339
461
```
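The three effects and the two size limits can be summarized in a small sketch. The helper names (`PreventsScheduling`, `EvictsPods`, `ValidateTaint`) are hypothetical, and measuring the description length in bytes is an assumption, since the text above does not say whether bytes or characters are counted.

```go
package main

import "fmt"

type DeviceTaintEffect string

const (
	DeviceTaintEffectNone       DeviceTaintEffect = "None"
	DeviceTaintEffectNoSchedule DeviceTaintEffect = "NoSchedule"
	DeviceTaintEffectNoExecute  DeviceTaintEffect = "NoExecute"

	TaintDescriptionMaxLength = 1024
	TaintDataMaxLength        = 10 * 1024
)

// PreventsScheduling reports whether new pods without a matching toleration
// are kept off the device. NoExecute implies this as well.
func PreventsScheduling(e DeviceTaintEffect) bool {
	return e == DeviceTaintEffectNoSchedule || e == DeviceTaintEffectNoExecute
}

// EvictsPods reports whether pods already using the device get evicted.
func EvictsPods(e DeviceTaintEffect) bool {
	return e == DeviceTaintEffectNoExecute
}

// ValidateTaint checks the effect and the size limits.
// Assumption: lengths are measured in bytes.
func ValidateTaint(effect DeviceTaintEffect, description string, data []byte) error {
	switch effect {
	case DeviceTaintEffectNone, DeviceTaintEffectNoSchedule, DeviceTaintEffectNoExecute:
	default:
		return fmt.Errorf("unsupported effect %q", effect)
	}
	if len(description) > TaintDescriptionMaxLength {
		return fmt.Errorf("description too long: %d > %d", len(description), TaintDescriptionMaxLength)
	}
	if len(data) > TaintDataMaxLength {
		return fmt.Errorf("data too long: %d > %d", len(data), TaintDataMaxLength)
	}
	return nil
}

func main() {
	for _, e := range []DeviceTaintEffect{
		DeviceTaintEffectNone, DeviceTaintEffectNoSchedule, DeviceTaintEffectNoExecute,
	} {
		fmt.Printf("%s: preventsScheduling=%t evictsPods=%t\n", e, PreventsScheduling(e), EvictsPods(e))
	}
	fmt.Println(ValidateTaint("PreferNoSchedule", "", nil))
}
```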
340
462
341
- Taint has the exact same fields as a v1.Taint, but the description is a bit
342
- different. In particular, PreferNoSchedule is not valid.
343
-
344
463
Tolerations get added to a DeviceRequest:
345
464
346
465
``` Go
@@ -447,9 +566,15 @@ would be repetitive work. Instead, a
447
566
reacts to informer events for ResourceSlice and DeviceTaintRule and
448
567
maintains a set of updated ResourceSlices which also contain the taints
449
568
set via a DeviceTaintRule. The tracker provides the API of an informer
450
- and thus can be used as a drop-in replacement for a ResourceSlice
569
+ and thus can be used as a replacement for a ResourceSlice
451
570
informer.
452
571
572
+ It uses the type from `k8s.io/dynamic-resource-allocation/api` to represent
573
+ ResourceSlices. The difference compared to the `k8s.io/resource/v1` API is that
574
+ taints are stored in the device struct together with information about where
575
+ they came from. The eviction controller uses this to apply the DeviceTaintRule
576
+ rate limit and to update the DeviceTaintRuleStatus.
577
+
453
578
``` Go
454
579
// DeviceTaintRule adds one taint to all devices which match the selector.
455
580
// This has the same effect as if the taint was specified directly
@@ -465,24 +590,30 @@ type DeviceTaintRule struct {
465
590
// Changing the spec automatically increments the metadata.generation number.
466
591
Spec DeviceTaintRuleSpec
467
592
468
- // ^^^
469
- // A spec gets added because adding a status seems likely.
470
- // Such a status could provide feedback on applying the
471
- // eviction and/or statistics (number of matching devices,
472
- // affected allocated claims, pods remaining to be evicted,
473
- // etc.).
593
+ // Status provides information about an ongoing pod eviction.
594
+ Status DeviceTaintRuleStatus
474
595
}
475
596
476
597
// DeviceTaintRuleSpec specifies the selector and one taint.
477
598
type DeviceTaintRuleSpec struct {
478
599
// DeviceSelector defines which device(s) the taint is applied to.
479
600
// All selector criteria must be satisfied for a device to
480
601
// match. The empty selector matches all devices. Without
481
- // a selector, no devices are matches.
602
+ // a selector, no devices are matched.
482
603
//
483
604
// +optional
484
605
DeviceSelector *DeviceTaintSelector
485
606
607
+ // EvictionRate controls how quickly Pods get evicted if that is
608
+ // the effect of the taint. If multiple taints cause eviction
609
+ // of the same set of Pods, then the lowest rate defined in
610
+ // any of those taints applies.
611
+ //
612
+ // The default is 100 Pods/s.
613
+ //
614
+ // +optional
615
+ EvictionRate *int64
616
+
486
617
// The taint that gets applied to matching devices.
487
618
//
488
619
// +required
@@ -535,6 +666,24 @@ type DeviceTaintSelector struct {
535
666
// +listType=atomic
536
667
Selectors []DeviceSelector
537
668
}
669
+
670
+ // DeviceTaintRuleStatus provides information about an ongoing pod eviction.
671
+ type DeviceTaintRuleStatus struct {
672
+ // PodsPendingEviction counts the number of Pods which still need to be evicted.
673
+ // Because taints with the NoExecute effect also prevent scheduling new pods,
674
+ // this number should eventually reach zero.
675
+ //
676
+ // The count gets updated periodically, so it is not guaranteed to be 100%
677
+ // accurate.
678
+ PodsPendingEviction int64
679
+
680
+ // PodsEvicted counts the number of Pods which were evicted because of the taint.
681
+ //
682
+ // This gets updated periodically, so it is not guaranteed to be 100%
683
+ // accurate. The actual count may be higher if the controller evicted
684
+ // some Pods and then gets restarted before updating this field.
685
+ PodsEvicted int64
686
+ }
538
687
```
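The rule that the lowest `EvictionRate` wins when several taints evict the same set of pods can be sketched as below. `EffectiveEvictionRate` is a hypothetical helper, and treating an unset rate as the default of 100 Pods/s before taking the minimum is an assumption about how defaulting interacts with the comparison.

```go
package main

import "fmt"

// DefaultEvictionRate is the documented default of 100 Pods/s.
const DefaultEvictionRate int64 = 100

// EffectiveEvictionRate returns the lowest rate among the given rules.
// A nil entry stands for a rule without an explicit EvictionRate and is
// assumed to count as the default.
func EffectiveEvictionRate(rates []*int64) int64 {
	effective := int64(-1) // -1 means "no rate seen yet"
	for _, r := range rates {
		rate := DefaultEvictionRate
		if r != nil {
			rate = *r
		}
		if effective < 0 || rate < effective {
			effective = rate
		}
	}
	if effective < 0 {
		return DefaultEvictionRate
	}
	return effective
}

func main() {
	slow := int64(5)
	fast := int64(200)
	// One rule limits eviction to 5 Pods/s, one allows 200, one uses the default.
	fmt.Println(EffectiveEvictionRate([]*int64{&slow, &fast, nil}))
}
```

Picking the minimum keeps the most cautious admin setting in force, which matches the intent of giving admins time to delete a mistaken DeviceTaintRule.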
539
688
540
689
### Test Plan