Kubernetes 1.35 Native Gang Scheduling: The Eve of Scheduling Ecosystem Unification
Kubernetes 1.35 introduces native Workload API and Gang Scheduling support, widely regarded as a “kernel-level refactoring” of cloud-native AI infrastructure. To truly grasp the significance of this upgrade, we need to look not only at what it brings but also at what it aims to replace (or merge with).
Before v1.35, to address the “resource deadlock” pain point of AI training tasks, the community had actually evolved a complex “third-party scheduler zoo.” This article starts from the native primitives, takes stock of existing ecosystem options, and reveals the architectural evolution direction in production environments.
1. Origin and Conflict: The “Misfit” of Atomic Scheduling in the AI Era
In Kubernetes’ classic design, a Pod is the smallest atomic scheduling unit. The scheduler uses a “greedy algorithm,” processing Pods in the queue sequentially and binding them immediately if the current node meets the requirements. This mechanism works perfectly in the microservices era, but encounters semantic conflicts in AI distributed training scenarios:
- Microservice Assumption: Pods are independent; partially starting them can still provide partial service.
- AI Assumption: Training tasks (e.g., PyTorch DDP) are tightly coupled topological structures, requiring All-or-Nothing.
This conflict between “atomicity” and “wholeness” directly leads to resource deadlock: the Pods of one job that were bound greedily hold their resources while waiting for the rest of the job to fit, and the resources they are waiting for are held by another job that is itself only partially placed. Neither job can start, and neither gives anything back.
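The deadlock is easy to reproduce in a toy model. The sketch below is not scheduler code; it assumes a hypothetical cluster with 6 free GPUs, two jobs that each need 4 single-GPU Pods, and Pods from both jobs interleaved in the queue. The function names are made up for illustration.

```python
def greedy_schedule(free_gpus, jobs):
    """Bind Pods one at a time, alternating between jobs, while any GPU is free."""
    placed = {name: 0 for name in jobs}
    progress = True
    while progress and free_gpus > 0:
        progress = False
        for name, needed in jobs.items():
            if placed[name] < needed and free_gpus > 0:
                placed[name] += 1
                free_gpus -= 1
                progress = True
    return placed, free_gpus


def gang_schedule(free_gpus, jobs):
    """Admit a job only if all of its Pods fit at once (all-or-nothing)."""
    placed = {name: 0 for name in jobs}
    for name, needed in jobs.items():
        if needed <= free_gpus:
            placed[name] = needed
            free_gpus -= needed
    return placed, free_gpus


jobs = {"job-a": 4, "job-b": 4}   # each job needs 4 Pods, 1 GPU per Pod
print(greedy_schedule(6, jobs))   # ({'job-a': 3, 'job-b': 3}, 0)
print(gang_schedule(6, jobs))     # ({'job-a': 4, 'job-b': 0}, 2)
```

With greedy binding, both jobs end up holding 3 of the 4 Pods they need and no GPU is left, so neither can ever start. With all-or-nothing admission, job-a runs and job-b waits without occupying anything.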
2. The Ecosystem’s Contenders: How Did We Cope Before v1.35?
During the long period when native functionality was absent, the industry developed three main alternative solutions to run AI tasks on K8s. Understanding them is key to grasping v1.35’s entry point.
2.1 Heavyweight Replacements: Volcano and YuniKorn
This is the most radical approach: replacing or bypassing the default scheduler outright.
- Volcano: A CNCF project originating from Huawei. It completely abandons K8s’ default scheduling logic, introducing concepts like PodGroup, Queue, and Command. It not only supports Gang Scheduling but also complex multi-tenant queue management (e.g., “Department A borrowing quota from Department B”).
- Apache YuniKorn: Originating from Cloudera, it carries strong Hadoop YARN DNA. Its killer feature is Hierarchical Queues, making it ideal for big data/AI hybrid scenarios requiring fine-grained budget management.
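- Pain Points: Extremely high operational costs. You need to maintain two schedulers (the default one for Web services, Volcano for AI), and because each scheduler keeps its own view of cluster resources, race conditions between them are common when both place Pods on the same nodes.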
2.2 Lightweight Plugins: Scheduler Plugins (Coscheduling)
This is a plugin-based extension using the Kubernetes Scheduling Framework.
- Mechanism: Installing the Coscheduling plugin hooks into extension points of the default scheduler such as PreFilter and Permit, implementing a simple “wait until everyone is ready” logic.
- Pain Points: Limited functionality. It only solves the “grouping” problem but lacks enterprise-grade features like queue management and priority preemption.
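The “wait until everyone is ready” logic can be pictured as a barrier with a timeout. The following is a minimal sketch of that idea only, not the plugin's actual code: the class `PermitBarrier`, its parameters `min_member` and `timeout_s`, and the return values are all hypothetical names chosen for this illustration.

```python
import time


class PermitBarrier:
    """Holds the Pods of one group at a Permit-like gate until enough have arrived."""

    def __init__(self, min_member, timeout_s):
        self.min_member = min_member
        self.timeout_s = timeout_s
        self.waiting = []          # Pods that reserved a node but are not bound yet
        self.first_arrival = None

    def permit(self, pod, now=None):
        """Return ('allow', pods_to_bind), ('wait', []) or ('reject', pods_to_release)."""
        now = time.monotonic() if now is None else now
        if self.first_arrival is None:
            self.first_arrival = now
        if now - self.first_arrival > self.timeout_s:
            # The group waited too long: everyone gives their reserved node back.
            released, self.waiting = self.waiting, []
            self.first_arrival = None
            return "reject", released + [pod]
        self.waiting.append(pod)
        if len(self.waiting) >= self.min_member:
            ready, self.waiting = self.waiting, []
            self.first_arrival = None
            return "allow", ready
        return "wait", []


gate = PermitBarrier(min_member=3, timeout_s=60)
print(gate.permit("worker-0", now=0))   # ('wait', [])
print(gate.permit("worker-1", now=1))   # ('wait', [])
print(gate.permit("worker-2", now=2))   # ('allow', ['worker-0', 'worker-1', 'worker-2'])
```

The timeout branch matters as much as the happy path: without it, a group that can never be completed would hold its reserved nodes forever.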
2.3 The “Newcomer”: Kueue (Kubernetes Native Job Queuing)
Kueue is not a scheduler but a job queue controller.
- Mechanism: It operates above the scheduler. It intercepts Jobs, keeps them suspended, and only unsuspends a Job (so that its Pods are created and reach the scheduler) when the cluster quota is met.
- Pain Points: Before v1.35, although Kueue could control quotas, once released, the underlying default scheduler could still cause deadlock due to fragmentation. Therefore, Kueue often needed to be used in conjunction with the Coscheduling plugin.
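Why a quota check alone is not enough can be shown with two small functions. This is a simplified sketch, not Kueue's logic: the job shape, the per-node free GPU list, and both function names are assumptions made for the example.

```python
def quota_admits(job, free_quota_gpus):
    """Queue-level check: does the job's total request fit the remaining quota?"""
    return job["pods"] * job["gpus_per_pod"] <= free_quota_gpus


def placeable(job, free_gpus_per_node):
    """Node-level check: can every Pod find a node with enough free GPUs?"""
    free = list(free_gpus_per_node)
    for _ in range(job["pods"]):
        free.sort(reverse=True)                 # try the emptiest node first
        if not free or free[0] < job["gpus_per_pod"]:
            return False
        free[0] -= job["gpus_per_pod"]
    return True


job = {"pods": 2, "gpus_per_pod": 4}
nodes = [2, 2, 2, 2]                   # 8 free GPUs, but scattered 2 per node
print(quota_admits(job, sum(nodes)))   # True
print(placeable(job, nodes))           # False
```

The quota layer sees 8 free GPUs and releases the job, yet no single node can host a 4-GPU Pod. Everything after that point depends on the scheduler underneath.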
3. Kernel-Level Refactoring: The “Dimensionality Reduction” of v1.35’s Workload API
Kubernetes 1.35 essentially absorbs the experience of the above solutions and moves their core capabilities down into the scheduler itself.
3.1 Elevating the Scheduling Perspective
The new scheduling.k8s.io/v1alpha1 API elevates the scheduling perspective from a single Pod to a Workload (job group). This effectively tells the scheduler: “Don’t just look at this tree (Pod), look at the whole forest (Workload).”
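To make the “forest” concrete, here is roughly what such an object could look like, built as a plain Python dict. Only the API group/version and `minCount` come from this article; the remaining field names (`controllerRef`, `podGroups`, `policy.gang`, `workloadRef`) are written from memory of the alpha API and should be treated as assumptions to verify against the official reference. The helper names are hypothetical.

```python
import json


def gang_workload(name, job_name, group, min_count):
    """Build a Workload manifest as a dict (field names are assumptions, verify first)."""
    return {
        "apiVersion": "scheduling.k8s.io/v1alpha1",
        "kind": "Workload",
        "metadata": {"name": name},
        "spec": {
            # Assumed: a reference back to the owning batch Job.
            "controllerRef": {"apiGroup": "batch", "kind": "Job", "name": job_name},
            # One pod group that is only schedulable as a gang of min_count Pods.
            "podGroups": [
                {"name": group, "policy": {"gang": {"minCount": min_count}}}
            ],
        },
    }


def pod_workload_ref(workload_name, group):
    """Assumed Pod spec fragment that links a Pod to its Workload and pod group."""
    return {"workloadRef": {"name": workload_name, "podGroup": group}}


print(json.dumps(gang_workload("train-wl", "train", "workers", 4), indent=2))
print(json.dumps(pod_workload_ref("train-wl", "workers")))
```

The point of the shape is that the group size lives on a separate object the scheduler can read, instead of being inferred from labels or annotations on individual Pods.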
3.2 Fundamental State Machine Change
After enabling GangScheduling, the scheduling loop relies on a critical WaitOnPermit phase, in which Pods that already have a candidate node are held back before binding. This is similar in spirit to a two-phase commit protocol:
- Pre-check: Intercepts task groups that don’t meet the minimum count (minCount) at the queue stage.
- Transactional Binding: Attempts to place all Pods in memory; only when the entire group has a place does it proceed to actual node binding (Bind).
This signals that the historical mission of the Coscheduling plugin is coming to an end, as its logic has been absorbed into the K8s kernel.
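The pre-check and transactional binding steps described in this section can be sketched in a few lines. This is a conceptual model under simplifying assumptions (GPUs as the only resource, first-fit placement, a hypothetical `schedule_gang` function), not the kube-scheduler implementation.

```python
def schedule_gang(pods, min_count, free_gpus_per_node):
    """Return {pod: node} if the whole gang fits, else None without side effects."""
    # Phase 1 (pre-check): too few Pods exist yet, so do not even try.
    if len(pods) < min_count:
        return None

    # Phase 2 (tentative placement): work on a copy, i.e. in memory only.
    tentative = dict(free_gpus_per_node)
    assignment = {}
    for pod, gpus in pods.items():
        node = next((n for n, free in tentative.items() if free >= gpus), None)
        if node is None:
            return None            # one Pod has no place: drop the whole attempt
        tentative[node] -= gpus
        assignment[pod] = node

    # Commit (bind): only now is the real cluster state changed.
    free_gpus_per_node.update(tentative)
    return assignment


cluster = {"node-1": 4, "node-2": 4}
gang = {"w0": 4, "w1": 4, "w2": 4}
print(schedule_gang(gang, 3, cluster), cluster)
# None {'node-1': 4, 'node-2': 4}
del gang["w2"]
print(schedule_gang(gang, 2, cluster), cluster)
# {'w0': 'node-1', 'w1': 'node-2'} {'node-1': 0, 'node-2': 0}
```

In the failed attempt the cluster dict is untouched, which is the whole point: a gang that does not fit leaves no partial reservation behind.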
4. Realistic Production Assessment: Where is the Mainstream Architecture Heading?
If v1.35 solves the “can it run” problem, production environments care about “does it run well.” Current production practices are transitioning from “heavyweight schedulers” to a “native combination model.”
4.1 Current Production Pain Points (Why not just Volcano?)
Although Volcano is powerful, in large-scale production environments, operations teams increasingly reject the “multi-scheduler architecture”:
- Upgrade Difficulties: When K8s upgrades, Volcano often lags behind in adaptation.
- Resource Fragmentation: It’s hard to mix Web services and AI training on the same node pool (Volcano even has its own node isolation mechanism).
4.2 Future Mainstream Architecture: Native + Kueue
With the maturation of native Gang Scheduling in v1.35, a clear layered architecture is forming:
| Layer | Component | Responsibility | Evolution Trend |
|---|---|---|---|
| Policy Layer | Kueue | Decides “who can run.” Manages department quotas, borrowing logic, job priorities. | Becoming the unified entry point for AI tasks, taking over Volcano’s queue functionality. |
| Mechanism Layer | Kube-Scheduler (v1.35+) | Decides “where to run.” Uses native Gang Scheduling to prevent deadlock, executes specific node binding. | Kernel functionality enhanced, replacing the Coscheduling plugin, eliminating the need for third-party schedulers. |
4.3 Why Can’t We Throw Away Volcano Yet?
We must be clear-eyed: native Gang Scheduling in v1.35 is currently in Alpha and has obvious “bare-bones” characteristics:
- Rigid Configuration: The hardcoded 5-minute timeout logic cannot adapt to the slow start requirements of loading large models (LLMs).
- Lack of Defragmentation: The native scheduler lacks re-scheduling capabilities and cannot actively move small tasks to free up large resource blocks for Gang tasks.
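The defragmentation gap is easiest to see with numbers. The sketch below assumes a hypothetical pool of 8-GPU nodes, small tasks that each use exactly 1 GPU, and a gang whose Pods each need a completely empty node; `empty_nodes` and `consolidate` are illustrative helpers, not an existing API.

```python
def empty_nodes(used_per_node):
    """Count nodes with nothing running, i.e. able to host a whole-node gang Pod."""
    return sum(1 for used in used_per_node if used == 0)


def consolidate(used_per_node, node_size):
    """Repack 1-GPU tasks onto as few nodes as possible (the effect of re-scheduling)."""
    total = sum(used_per_node)
    full, rest = divmod(total, node_size)
    packed = [node_size] * full + ([rest] if rest else [])
    return packed + [0] * (len(used_per_node) - len(packed))


used = [1, 1, 1, 1]                      # four 8-GPU nodes, one small task on each
print(empty_nodes(used))                 # 0
print(consolidate(used, 8))              # [4, 0, 0, 0]
print(empty_nodes(consolidate(used, 8))) # 3
```

28 of 32 GPUs are free in both layouts, but only the repacked one can admit the gang. A scheduler that never moves running Pods is stuck with the first layout.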
Conclusion: For teams deeply reliant on advanced Volcano/YuniKorn features (e.g., topology awareness, re-scheduling, complex borrowing strategies), the ROI of migration is currently low. However, for most newly built AI platforms, “Kueue + Kubernetes Native Scheduler (v1.35+)” will be the gold standard for the next two years, combining the stability of native K8s with the necessary queue management capabilities.
Conclusion
Kubernetes 1.35’s native Gang Scheduling is not about eliminating all third-party schedulers, but about reclaiming territory. It brings the common requirement of “group scheduling” back into the kernel, forcing projects like Volcano and YuniKorn to transition towards higher-end “specialized scheduling” (e.g., fine-grained GPU topology, cross-cluster federation scheduling).
For platform engineers, this means future architectures will be simpler: maintain one less component, gain one more layer of native assurance.