Kubernetes 1.35 Native Gang Scheduling: The Eve of Scheduling Ecosystem Unification

By Shengxu, filed under Kubernetes AI

2026-01-21


Kubernetes 1.35 introduces a native Workload API and Gang Scheduling support, a change significant enough to be called a “kernel-level refactoring” of cloud-native AI infrastructure. To grasp the significance of this upgrade, we need to look not only at what it brings but also at what it aims to replace (or merge with).

Before v1.35, the community addressed the “resource deadlock” pain point of AI training tasks with a complex “third-party scheduler zoo.” This article starts from the native primitives, takes stock of the existing ecosystem options, and then looks at where production architectures are heading.

1. Origin and Conflict: The “Misfit” of Atomic Scheduling in the AI Era

In Kubernetes’ classic design, a Pod is the smallest atomic scheduling unit. The scheduler uses a “greedy algorithm”: it processes the Pods in its queue one by one and binds each Pod as soon as a node satisfies its requirements. This mechanism worked well in the microservices era, but it runs into a semantic conflict in distributed AI training scenarios:

Microservice Assumption: Pods are independent; starting only some of them still provides partial service.
AI Assumption: Training tasks (e.g., PyTorch DDP) are tightly coupled; either all workers start or none of them can make progress (All-or-Nothing).

This conflict between “atomicity” and “wholeness” leads directly to resource deadlock: the Pods of one job that were bound greedily hold their resources while waiting for the rest of the job to be placed, and the resources they are waiting for are held by the partially placed Pods of another job.
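The deadlock can be reproduced with a toy model. The sketch below is only an illustration, not scheduler code: it assumes a cluster reduced to a single pool of 6 GPUs, two hypothetical jobs (job-a, job-b) that each need four one-GPU workers, and Pods from both jobs arriving interleaved in the queue.

```python
from collections import deque


def greedy_schedule(free_gpus, pending):
    """Bind pods one at a time, in queue order, while a GPU is free."""
    placed = {}
    queue = deque(pending)
    while queue and free_gpus > 0:
        job = queue.popleft()
        placed[job] = placed.get(job, 0) + 1
        free_gpus -= 1
    return placed


def gang_schedule(free_gpus, jobs):
    """Admit a job only if all of its pods fit at once."""
    placed = {}
    for job, size in jobs.items():
        if size <= free_gpus:
            placed[job] = size
            free_gpus -= size
    return placed


jobs = {"job-a": 4, "job-b": 4}      # workers required per job (hypothetical)
pending = ["job-a", "job-b"] * 4     # pods of both jobs arrive interleaved

greedy = greedy_schedule(6, pending)
runnable_greedy = [j for j, n in greedy.items() if n == jobs[j]]
gang = gang_schedule(6, jobs)

print(greedy)           # each job holds 3 GPUs and waits for a 4th
print(runnable_greedy)  # no job can start
print(gang)             # job-a runs in full; job-b waits without holding GPUs
```

In the greedy run all 6 GPUs are occupied and nothing trains; in the all-or-nothing run one job trains and 2 GPUs stay free for other work.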

2. The Ecosystem’s Contenders: How Did We Cope Before v1.35?

During the long period when native functionality was absent, the industry developed three main alternatives for running AI tasks on K8s. Understanding them is key to seeing where v1.35 fits in.

2.1 Heavyweight Replacements: Volcano and YuniKorn

This is the most radical approach: replacing or bypassing the default scheduler outright.

Volcano: A CNCF project originating from Huawei. It completely abandons K8s’ default scheduling logic, introducing concepts like PodGroup, Queue, and Command. It not only supports Gang Scheduling but also complex multi-tenant queue management (e.g., “Department A borrowing quota from Department B”).
Apache YuniKorn: Originating from Cloudera, it carries strong Hadoop YARN DNA. Its killer feature is Hierarchical Queues, making it ideal for big data/AI hybrid scenarios requiring fine-grained budget management.
Pain Points: Extremely high operational costs. You need to maintain two schedulers (the default one for web services, Volcano for AI), and because each scheduler keeps its own view of cluster resources, race conditions over the same nodes are common.

2.2 Lightweight Plugins: Scheduler Plugins (Coscheduling)

This is a plugin-based extension using the Kubernetes Scheduling Framework.

Mechanism: The Coscheduling plugin hooks into extension points of the default scheduler (notably PreFilter and Permit) and implements a simple “wait until everyone is ready” logic.
Pain Points: Limited functionality. It only solves the “grouping” problem but lacks enterprise-grade features like queue management and priority preemption.
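The “wait until everyone is ready” idea can be sketched as a small barrier. This is a simplified model and not the plugin's actual code: the class name PermitBarrier, the min_member and timeout_s parameters, and the rule that one timed-out Pod rejects the whole group are assumptions made for illustration.

```python
class PermitBarrier:
    """Hypothetical gate that holds pods until min_member of them have arrived."""

    def __init__(self, min_member, timeout_s):
        self.min_member = min_member
        self.timeout_s = timeout_s
        self.waiting = {}  # pod name -> arrival time in seconds

    def permit(self, pod, now):
        """Register an arrival; return the pods it releases ([] means keep waiting)."""
        self.waiting.setdefault(pod, now)
        if len(self.waiting) < self.min_member:
            return []
        released = sorted(self.waiting)
        self.waiting.clear()
        return released

    def reject_expired(self, now):
        """If any pod has waited longer than the timeout, reject the whole group."""
        if not any(now - t > self.timeout_s for t in self.waiting.values()):
            return []
        rejected = sorted(self.waiting)
        self.waiting.clear()
        return rejected


gate = PermitBarrier(min_member=3, timeout_s=60)
arrivals = [gate.permit(f"worker-{i}", now=i) for i in range(3)]
print(arrivals)   # the third arrival releases all three workers

stuck = PermitBarrier(min_member=3, timeout_s=60)
stuck.permit("worker-0", now=0)
stuck.permit("worker-1", now=10)
rejected = stuck.reject_expired(now=61)
print(rejected)   # worker-0 timed out, so both waiting workers are rejected
```

Note what the barrier does not contain: there is no notion of queues, quotas, or which group deserves to go first, which is the limitation described above.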

2.3 The “Newcomer”: Kueue (Kubernetes Native Job Queuing)

Kueue is not a scheduler but a job queue controller.

Mechanism: It operates above the scheduler. It intercepts Jobs and keeps them suspended, releasing (unsuspending) a Job so that its Pods reach the scheduler only once the required quota is available.
Pain Points: Before v1.35, although Kueue could control quotas, once released, the underlying default scheduler could still cause deadlock due to fragmentation. Therefore, Kueue often needed to be used in conjunction with the Coscheduling plugin.
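The gap between quota admission and actual placement can be shown with a toy model. The sketch below is not Kueue's implementation: QuotaQueue, Job, and pods_that_fit are hypothetical names, the quota is a single GPU count, and the admission order (submission order, skipping jobs that do not fit) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    gpus: int               # total GPUs requested by the job
    suspended: bool = True  # jobs start suspended until admitted


class QuotaQueue:
    """Hypothetical quota gate: unsuspends whole jobs while quota remains."""

    def __init__(self, quota):
        self.quota = quota
        self.used = 0
        self.pending = []

    def submit(self, job):
        self.pending.append(job)

    def admit(self):
        admitted, still_pending = [], []
        for job in self.pending:
            if self.used + job.gpus <= self.quota:
                job.suspended = False      # release the job to the scheduler
                self.used += job.gpus
                admitted.append(job.name)
            else:
                still_pending.append(job)
        self.pending = still_pending
        return admitted


def pods_that_fit(node_free, pod_gpus, pod_count):
    """How many pods of pod_gpus each can be placed on the given nodes."""
    return min(pod_count, sum(free // pod_gpus for free in node_free))


queue = QuotaQueue(quota=8)
queue.submit(Job("train", gpus=8))   # 2 pods x 4 GPUs
queue.submit(Job("eval", gpus=4))
admitted = queue.admit()
print(admitted)                      # only "train" fits the 8-GPU quota

# Quota says 8 GPUs are available, but they are spread across nodes.
fragmented = pods_that_fit([2, 2, 2, 2], pod_gpus=4, pod_count=2)
partial = pods_that_fit([4, 2, 2], pod_gpus=4, pod_count=2)
print(fragmented, partial)
```

The quota check passes in both cases, yet zero or only one of the two 4-GPU Pods can actually be placed. The second case is the dangerous one: a scheduler without gang semantics binds the one Pod that fits and leaves it holding GPUs.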

3. Kernel-Level Refactoring: The “Dimensionality Reduction” of v1.35’s Workload API

Kubernetes 1.35 essentially absorbs the lessons of the solutions above and moves their core capability down into the scheduler itself.

3.1 Elevating the Scheduling Perspective

The new scheduling.k8s.io/v1alpha1 API elevates the scheduling perspective from a single Pod to a Workload (job group). This effectively tells the scheduler: “Don’t just look at this tree (Pod), look at the whole forest (Workload).”

3.2 Fundamental State Machine Change

With GangScheduling enabled, the scheduling loop gains a critical step: Pods in a gang are held at a WaitOnPermit gate. The overall flow resembles a two-phase commit:

Pre-check: Pod groups that have not yet reached the minimum count (minCount) are held back at the queue stage.
Transactional Binding: The scheduler attempts to place all Pods in memory; only when the entire group has a place does it proceed to actual node binding (Bind).
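The two steps can be sketched as follows. This is a minimal model under stated assumptions, not the kube-scheduler implementation: nodes are reduced to a free-GPU count, placement is first-fit, and the function names pre_enqueue and try_place_group are hypothetical.

```python
def pre_enqueue(created_pods, min_count):
    """Step 1: keep the group out of the active queue until enough pods exist."""
    return len(created_pods) >= min_count


def try_place_group(node_free, pods, min_count):
    """Step 2: place pods on a copy of the node state; commit only on success.

    node_free: node name -> free GPUs, pods: pod name -> GPUs requested.
    Returns (assignment, new_node_free), or (None, node_free) on rejection.
    """
    trial = dict(node_free)              # in-memory copy, nothing is bound yet
    assignment = {}
    for pod, gpus in pods.items():
        for node, free in trial.items():
            if free >= gpus:
                trial[node] = free - gpus
                assignment[pod] = node
                break
    if len(assignment) < min_count:
        return None, node_free           # reject the group, state untouched
    return assignment, trial             # "Bind": the trial state becomes real


pods = {f"w{i}": 1 for i in range(4)}    # four one-GPU workers

ready = pre_enqueue(["w0", "w1"], min_count=4)
rejected, unchanged = try_place_group({"n1": 2, "n2": 1}, pods, min_count=4)
bound, after = try_place_group({"n1": 2, "n2": 2}, pods, min_count=4)

print(ready)                 # two pods created so far: keep waiting
print(rejected, unchanged)   # only 3 of 4 fit: nothing is bound
print(bound, after)
```

The key property is that a failed attempt leaves the node state exactly as it was, so a gang that cannot fit never holds resources that another job could use.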

The implication: the historical mission of the Coscheduling plugin is coming to an end, as its logic has been absorbed into the Kubernetes core.

4. Realistic Production Assessment: Where is the Mainstream Architecture Heading?

If v1.35 solves the “can it run” problem, production environments care about “does it run well.” Current production practices are transitioning from “heavyweight schedulers” to a “native combination model.”

4.1 Current Production Pain Points (Why not just Volcano?)

Although Volcano is powerful, in large-scale production environments, operations teams increasingly reject the “multi-scheduler architecture”:

Upgrade Difficulties: When K8s upgrades, Volcano often lags behind in adaptation.
Resource Fragmentation: It’s hard to mix Web services and AI training on the same node pool (Volcano even has its own node isolation mechanism).

4.2 Future Mainstream Architecture: Native + Kueue

As native Gang Scheduling matures from v1.35 onward, a clear layered architecture is forming:

Kubernetes Scheduling Evolution

Layer	Component	Responsibility	Evolution Trend
Policy Layer	Kueue	Decides “who can run.” Manages department quotas, borrowing logic, job priorities.	Becoming the unified entry point for AI tasks, taking over Volcano’s queue functionality.
Mechanism Layer	Kube-Scheduler (v1.35+)	Decides “where to run.” Uses native Gang Scheduling to prevent deadlock and performs the actual node binding.	Core functionality enhanced; replaces the Coscheduling plugin and reduces the need for a third-party scheduler just for gang scheduling.

4.3 Why Can’t We Throw Away Volcano Yet?

We must be clear-eyed: the Workload API and Gang Scheduling in v1.35 are still Alpha features and have obvious “bare-bones” characteristics:

Rigid Configuration: The hardcoded 5-minute timeout cannot be tuned for the slow startup of jobs that load large models (LLMs).
Lack of Defragmentation: The native scheduler lacks re-scheduling capabilities and cannot actively move small tasks to free up large resource blocks for Gang tasks.
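To make the defragmentation gap concrete, the sketch below shows the kind of move a re-scheduling component would have to plan. It is a hypothetical illustration, not an existing Kubernetes or Volcano API: the function names, the one-move-only search, and the node and task names are all assumptions.

```python
def can_fit(node_free, gpus):
    """True if some single node has at least `gpus` free."""
    return any(free >= gpus for free in node_free.values())


def plan_move(node_free, tasks, gpus_needed):
    """Find one task whose relocation opens a block of gpus_needed on its node.

    tasks: task name -> (node, GPUs used). Returns (task, src, dst) or None.
    """
    for task, (src, size) in tasks.items():
        if node_free[src] + size < gpus_needed:
            continue                     # moving this task frees too little
        for dst, free in node_free.items():
            if dst != src and free >= size:
                return task, src, dst
    return None


def apply_move(node_free, tasks, move):
    """Return the free-GPU map after relocating the task in `move`."""
    task, src, dst = move
    size = tasks[task][1]
    updated = dict(node_free)
    updated[src] += size
    updated[dst] -= size
    return updated


# Two 4-GPU nodes, each half occupied by a small task.
node_free = {"n1": 2, "n2": 2}
tasks = {"t1": ("n1", 2), "t2": ("n2", 2)}

before = can_fit(node_free, 4)
move = plan_move(node_free, tasks, gpus_needed=4)
after_free = apply_move(node_free, tasks, move)
after = can_fit(after_free, 4)

print(before)             # 4 GPUs are free in total, but no node has 4
print(move, after_free)   # relocating t1 to n2 empties n1
print(after)
```

Without a step like plan_move, a gang task that needs a full node keeps waiting here, and under the fixed timeout described above it is eventually rejected even though the cluster has enough GPUs in total.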

Bottom line: For teams that rely heavily on advanced Volcano/YuniKorn features (e.g., topology awareness, re-scheduling, complex borrowing strategies), the ROI of migration is currently low. For most newly built AI platforms, however, “Kueue + Kubernetes Native Scheduler (v1.35+)” is likely to become the default choice over the next two years, combining the stability of native K8s with the queue management capabilities these platforms need.

Conclusion

Kubernetes 1.35’s native Gang Scheduling is not about eliminating all third-party schedulers, but about reclaiming territory. It brings the common requirement of “group scheduling” back into the Kubernetes core, pushing projects like Volcano and YuniKorn towards higher-end “specialized scheduling” (e.g., fine-grained GPU topology, cross-cluster federation scheduling).

For platform engineers, this means future architectures will be simpler: maintain one less component, gain one more layer of native assurance.

References:

Kubernetes v1.35: Introducing Workload Aware Scheduling
