Workload API
FEATURE STATE:
Kubernetes v1.35 [alpha] (disabled by default)
The Workload API resource allows you to describe the scheduling requirements and structure of a multi-Pod application.
While workload controllers provide runtime behavior for workloads,
the Workload API provides scheduling constraints for the "true" workloads, such as Jobs.
What is a Workload?
The Workload API resource is part of the scheduling.k8s.io/v1alpha1 API group.
Before you can use this API, your cluster must have that API group enabled,
as well as the GenericWorkload feature gate.
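As a minimal sketch of meeting both prerequisites on a local test cluster, the following configuration assumes you create the cluster with kind, which this page does not otherwise cover. The featureGates and runtimeConfig fields belong to kind's cluster configuration format. On other cluster types, the equivalent settings are typically the --feature-gates and --runtime-config command line arguments of the control plane components.

```yaml
# Assumption: a kind (Kubernetes in Docker) test cluster; not a production setup.
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  # Enables the alpha feature gate named on this page.
  GenericWorkload: true
runtimeConfig:
  # Enables the alpha API group version that serves the Workload resource.
  "scheduling.k8s.io/v1alpha1": "true"
```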
This resource acts as a structured, machine-readable definition of the scheduling requirements
of a multi-Pod application. While user-facing workloads like Jobs
define what to run, the Workload resource determines how a group of Pods should be scheduled
and how its placement should be managed throughout its lifecycle.
API structure
A Workload allows you to define a group of Pods and apply a scheduling policy to them.
It consists of two sections: a list of pod groups and a reference to a controller.
Pod groups
The podGroups list defines the distinct components of your workload.
For example, a machine learning job might have a driver group and a worker group.
Each entry in podGroups must have:
- A unique name that can be used in the Pod's Workload reference.
- A scheduling policy (basic or gang).
apiVersion: scheduling.k8s.io/v1alpha1
kind: Workload
metadata:
  name: training-job-workload
  namespace: some-ns
spec:
  controllerRef:
    apiGroup: batch
    kind: Job
    name: training-job
  podGroups:
  - name: workers
    policy:
      gang:
        # The gang is schedulable only if 4 pods can run at once
        minCount: 4
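The pod group name is what a Pod uses to attach itself to the Workload. The sketch below shows what such a Pod's Workload reference might look like. The workloadRef field and its name and podGroup subfields reflect the v1.35 alpha Pod API as best understood here, so check the API reference for your cluster version. The Pod name, container name, and image are placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: worker-0            # placeholder Pod name
  namespace: some-ns        # must match the Workload's namespace
spec:
  # Assumed field names from the v1.35 alpha API; verify before relying on them.
  workloadRef:
    name: training-job-workload   # the Workload object
    podGroup: workers             # an entry in the Workload's podGroups list
  containers:
  - name: worker            # placeholder container
    image: registry.k8s.io/pause:3.9
```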
Referencing a workload controlling object
The controllerRef field links the Workload back to the specific high-level object defining the application,
such as a Job or a custom resource. This is useful for observability and tooling.
This data is not used to schedule or manage the Workload.
1 - Pod Group Policies
FEATURE STATE:
Kubernetes v1.35 [alpha] (disabled by default)
Every pod group defined in a Workload
must declare a scheduling policy. This policy dictates how the scheduler treats the collection of Pods.
Policy types
The API currently supports two policy types: basic and gang.
You must specify exactly one policy for each group.
Basic policy
The basic policy instructs the scheduler to evaluate all Pods on a best-effort basis.
Unlike the gang policy, a PodGroup using the basic policy is considered feasible
regardless of how many of its Pods are currently schedulable.
The primary reason to use the basic policy is to organize the Pods within your Workload
for better observability and management, while still evaluating them together within a single,
atomic PodGroup scheduling cycle.
This policy can be used for groups of a Workload that do not require simultaneous startup
but logically belong to the application, or to open the way for future group constraints
that do not imply "all-or-nothing" placement.
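The basic policy takes no parameters. As a sketch, assuming it is expressed as an empty object under policy in the same way the gang policy is expressed as an object, a pod group entry would select it like this:

```yaml
policy:
  # No parameters: the group is feasible regardless of how many
  # of its Pods are currently schedulable.
  basic: {}
```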
Gang policy
The gang policy enforces "all-or-nothing" scheduling. This is essential for tightly-coupled workloads
where partial startup results in deadlocks or wasted resources.
This can be used for Jobs
or any other batch process where all workers must run concurrently to make progress.
The gang policy requires a minCount parameter:
policy:
  gang:
    # The number of Pods that must be schedulable simultaneously
    # for the group to be admitted.
    minCount: 4
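Because each pod group declares its own policy, a single Workload can mix the two types. The following sketch models the machine learning example of a driver group and a worker group. The driver group name and the choice of the basic policy for it are illustrative assumptions, and the basic: {} form assumes the policy is written as an empty object.

```yaml
apiVersion: scheduling.k8s.io/v1alpha1
kind: Workload
metadata:
  name: training-job-workload
  namespace: some-ns
spec:
  controllerRef:
    apiGroup: batch
    kind: Job
    name: training-job
  podGroups:
  # Illustrative driver group: scheduled on a best-effort basis.
  - name: driver
    policy:
      basic: {}
  # Worker group: admitted only if 4 Pods are schedulable at once.
  - name: workers
    policy:
      gang:
        minCount: 4
```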
2 - Topology-Aware Workload Scheduling
FEATURE STATE:
Kubernetes v1.36 [alpha] (disabled by default)
Topology-Aware Scheduling (TAS) is a feature of the Workload API that optimizes the placement of
pods within the cluster.
TAS ensures that all pods within a PodGroup are co-located into a specific topology domain,
such as a single server rack or zone. This minimizes inter-pod communication latency and prevents
workload fragmentation across the cluster infrastructure.
Topology-aware scheduling with gang scheduling policy
When applied to PodGroups with gang scheduling policy, TAS simulates the potential assignment
(placement) of the full group of pods at once. It guarantees that at least the specified
minCount pods can fit together into the same topology domain before committing resources.
If no feasible placement is found, the entire PodGroup becomes unschedulable.
This is the recommended approach for workloads like distributed AI and ML training that strictly
require proximity to minimize inter-pod communication latency.
If new pods are added to the PodGroup where some pods are already scheduled (for example, if pods
are recreated), the scheduler will force all new incoming pods to land on the exact same topology
domain where the existing pods currently reside. If that specific domain lacks sufficient capacity
for the new pods, they remain pending, even if that leaves fewer than minCount pods
scheduled at that point.
Note:
As of Kubernetes v1.36, Topology-Aware Scheduling does not trigger workload or pod preemption. If no
feasible placement can be found without triggering preemption, the PodGroup becomes unschedulable.
Topology-aware scheduling with basic scheduling policy
Using TAS with the basic scheduling policy can produce inconsistent behavior. The scheduler may
observe only a subset of pods when entering the PodGroup scheduling cycle, so placement
feasibility is evaluated only for the observed pods rather than for the entire PodGroup. To partially
mitigate this limitation, you can use scheduling gates to hold off PodGroup scheduling until all
pods within the PodGroup are in the scheduling queue.
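As a sketch of the scheduling gate mitigation, a Pod can be created with an entry in spec.schedulingGates, which keeps it out of scheduling until the gate is removed. The gate name, Pod name, and image below are placeholders, and the sketch assumes that you or a controller of your own remove the gate once every pod of the PodGroup has been created. How the Pod is associated with its PodGroup is omitted here.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: worker-0                        # placeholder Pod name
spec:
  schedulingGates:
  # Placeholder gate name. The Pod is not considered for scheduling
  # while any gate is present; remove it when the whole group exists.
  - name: example.com/podgroup-complete
  containers:
  - name: worker                        # placeholder container
    image: registry.k8s.io/pause:3.9
```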
If no feasible placement is found for the entire PodGroup, the scheduler may schedule only a subset
of the pods; the pods that are scheduled are still guaranteed to meet the scheduling constraints.
If new pods are added to a PodGroup where some pods are already scheduled, the scheduler acts
the same as with the gang policy: it forces the new pods into the same domain, and if that domain
has insufficient capacity, the new pods remain pending.
API configuration: scheduling constraints
Every PodGroup (or PodGroupTemplate) may optionally declare the schedulingConstraints field,
which is interpreted by the placement-based PodGroup scheduling algorithm.
If constraints are defined in PodGroupTemplate, they will be copied to referencing PodGroups.
As of Kubernetes v1.36, the API supports topology constraints.
Note:
As of Kubernetes v1.36, you can specify only a single topology constraint in each PodGroup.
Topology constraint
To define a topology constraint for a PodGroup, set a key that corresponds to
a Kubernetes node label representing the target topology domain (for example, a rack or a zone).
The scheduler strictly enforces that all pods within the PodGroup are placed onto nodes that share
the exact same value for this specified label.
Here is an example of a PodGroup configured with a topology constraint:
apiVersion: scheduling.k8s.io/v1alpha2
kind: PodGroup
metadata:
  name: example-podgroup
spec:
  schedulingPolicy:
    gang:
      minCount: 4
  schedulingConstraints:
    topology:
    - key: topology.example.com/rack
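For the constraint to match anything, nodes need to carry the label named by the key. The sketch below shows the relevant metadata of two nodes that share a rack and would therefore form one topology domain. The node names and the rack-1 value are hypothetical, and in practice such labels are usually applied to existing nodes (for example, with kubectl label) rather than by creating Node objects from a manifest.

```yaml
# Hypothetical nodes: both carry the same value for the constraint key,
# so pods of the PodGroup may be co-located across these two nodes.
apiVersion: v1
kind: Node
metadata:
  name: node-a
  labels:
    topology.example.com/rack: rack-1
---
apiVersion: v1
kind: Node
metadata:
  name: node-b
  labels:
    topology.example.com/rack: rack-1
```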