<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>AI Infrastructure - Tag - Shengxu · Cloud Architecture &amp; DevOps</title><link>https://shengxu.pages.dev/en/tags/ai-infrastructure/</link><description>Cloud architecture &amp; DevOps notes by Shengxu: Kubernetes, Cilium, observability, LLM infra, AI agents.</description><generator>Hugo 0.153.2 &amp; FixIt v0.4.0-alpha.3-20251225101113-8ffb9a95</generator><language>en</language><lastBuildDate>Wed, 21 Jan 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://shengxu.pages.dev/en/tags/ai-infrastructure/index.xml" rel="self" type="application/rss+xml"/><item><title>Kubernetes 1.35 Native Gang Scheduling: The Eve of Scheduling Ecosystem Unification</title><link>https://shengxu.pages.dev/en/posts/kubernetes-1-35-native-gang-scheduling/</link><pubDate>Wed, 21 Jan 2026 00:00:00 +0000</pubDate><guid>https://shengxu.pages.dev/en/posts/kubernetes-1-35-native-gang-scheduling/</guid><category domain="https://shengxu.pages.dev/en/categories/kubernetes/">Kubernetes</category><category domain="https://shengxu.pages.dev/en/categories/ai/">AI</category><description>&lt;p&gt;Kubernetes 1.35 introduces a native Workload API and Gang Scheduling support, widely regarded as a &amp;ldquo;kernel-level refactoring&amp;rdquo; of cloud-native AI infrastructure. To grasp the significance of this upgrade, we need to look not only at what it brings but also at what it aims to replace (or merge with).&lt;/p&gt;
&lt;p&gt;Before v1.35, to address the &amp;ldquo;resource deadlock&amp;rdquo; pain point of AI training tasks, the community had evolved a complex &amp;ldquo;third-party scheduler zoo.&amp;rdquo; This article starts from the native primitives, takes stock of the existing ecosystem options, and outlines where production architectures are heading.&lt;/p&gt;</description></item><item><title>Dragonfly: Image and Model Distribution Infrastructure for the Cloud-Native Era</title><link>https://shengxu.pages.dev/en/posts/dragonfly-cloud-native-p2p-distribution/</link><pubDate>Thu, 15 Jan 2026 10:00:00 +0800</pubDate><guid>https://shengxu.pages.dev/en/posts/dragonfly-cloud-native-p2p-distribution/</guid><category domain="https://shengxu.pages.dev/en/categories/kubernetes/">Kubernetes</category><category domain="https://shengxu.pages.dev/en/categories/ai/">AI</category><description>&lt;p&gt;In 2026, as AI and cloud-native infrastructure continue to evolve, image and model distribution is shifting from a &amp;ldquo;peripheral optimization point&amp;rdquo; to a critical factor in platform efficiency. Traditional approaches that rely on a centralized registry plus CDN often run into both speed and cost problems when many nodes concurrently pull large images or models. Against this backdrop, Dragonfly has grown into a CNCF Graduated project and is used in production by companies such as Ant Group, Alibaba, Datadog, DiDi, and Kuaishou to distribute container images and AI models efficiently.&lt;/p&gt;</description></item></channel></rss>