Kubernetes - Category - Shengxu · Cloud Architecture & DevOps

From Azure SRE Agent to HolmesGPT: AIOps Practices in Multi-Cloud Kubernetes Environments

Fri, 17 Apr 2026 19:40:00 +0800

In the multi-cloud Kubernetes era, the pain point for SREs is no longer just “too many alerts,” but rather investigation chains that are too long, context that is too scattered, and troubleshooting costs across clouds that are too high. What truly drains people isn’t glancing at a chart, but constantly switching between multiple cloud platforms, logging systems, deployment records, and ticketing systems.

This is why AI SRE Agents are starting to deliver real value. Their goal isn’t to be a better conversational Copilot, but to proactively take over the highly repetitive first half of the work—“checking logs, finding correlations, guessing root causes, and giving suggestions”—once an alert is triggered.

Cilium 2026 (Continued): How the Unified Data Plane Is Reshaping Kubernetes Platform Architecture

Sat, 21 Mar 2026 14:31:56 +0800

In the previous article on Cilium, we explored the real reasons behind the 2026 migration wave: it’s no longer just “a faster CNI,” but rather a reorganization of Kubernetes networking, security, observability, and multi-cluster capabilities into a more unified infrastructure foundation, while clarifying its division of labor and boundaries with Istio.

If the previous article answered “What exactly can Cilium bring us?”, this one goes further, focusing on its core evolution: the Unified Dataplane.

Before Discussing LLM Security, Is Your Kubernetes Foundation Up to Standard?

Sat, 14 Mar 2026 10:00:00 +0800

The explosion of Large Language Models (LLMs) and AI Agents has not only revolutionized business models but also introduced new application-layer security challenges such as prompt injection and data poisoning. While everyone’s attention is drawn to these cutting-edge vulnerabilities, let’s first pause and ask ourselves a fundamental question: Before diving into these complex AI security issues, is the cloud-native foundation that supports all our business workloads even up to par?

What Cilium Can Really Bring Us in 2026

Sun, 08 Mar 2026 10:30:00 +0800

What Meaningful Changes It Actually Brings, and How to Divide Work with Istio

By 2026, many teams discussing Cilium are no longer asking “Is it worth trying?” but rather “When should we migrate?”

Kubernetes Complexity: Starting from a Job Interview Question

Sat, 24 Jan 2026 12:47:00 +0800

I recently went through a job interview where the interviewer posed a seemingly routine question: “In your opinion, when should you use Kubernetes, and when is it unnecessary and just adds complexity?”

I answered it fairly smoothly at the time, but the question lingered in my mind long afterward. What made it so “sharp” was that it stepped beyond the technical details of “how to use K8s” and cut straight to the core trade-off in architecture design: Are we introducing a tech stack to solve a real business pain point, or just to satisfy the team’s “anxiety about being cutting-edge”?

OWASP LLM Top 10 Security in Practice

Fri, 23 Jan 2026 10:00:00 +0800

Yesterday I had the privilege of attending a talk by Sergey Saburov from Acronis on “Agentic Engineering & LLM Security.” Sergey provided an in-depth analysis of security threats facing modern LLM applications, along with numerous real-world case studies aligned with the OWASP LLM Top 10 framework.

I’ve organized and summarized the content based on the latest OWASP LLM Top 10 v2.0 (2025) official standard. I’ve corrected some terminology discrepancies from the original talk (e.g., LLM06, LLM10) and compiled Python PoC (Proof of Concept) and defense scripts tailored for Kubernetes platform engineers, hoping this serves as a reference for building secure AI systems.

Helm 4 Deep Dive: More Than a Version Bump – A New Beginning for the Kubernetes-Native Era

Thu, 22 Jan 2026 10:00:00 +0800

In the infrastructure world, some version updates are “icing on the cake,” while others are “transformative.” If Helm 3 freed us from the nightmare of Tiller, then Helm 4, officially released in November 2025, marks the coming-of-age moment when Helm truly understood and embraced Kubernetes’ declarative philosophy.

Now that the community has had two months to validate the release and the official documentation has been refined, this article clarifies the easily misunderstood technical details based on Helm 4’s actual release state.

Kubernetes 1.35 Native Gang Scheduling: The Eve of Scheduling Ecosystem Unification

Wed, 21 Jan 2026 00:00:00 +0000

Kubernetes 1.35 introduces native Workload API and Gang Scheduling support, widely regarded as a “kernel-level refactoring” of cloud-native AI infrastructure. To truly grasp the significance of this upgrade, we need to look not only at what it brings but also at what it aims to replace (or merge with).

Before v1.35, to address the “resource deadlock” pain point of AI training tasks, the community had actually evolved a complex “third-party scheduler zoo.” This article starts from the native primitives, takes stock of existing ecosystem options, and reveals the architectural evolution direction in production environments.
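The “resource deadlock” mentioned above can be illustrated with a toy model. The sketch below is not the Kubernetes scheduler or the Workload API; it assumes every pod needs exactly one GPU, and the job names and function names are hypothetical. It only contrasts pod-by-pod placement with all-or-nothing (gang) admission.

```python
def schedule_pod_by_pod(free_gpus: int, jobs: dict[str, int]) -> dict[str, int]:
    """Place one pod (1 GPU) at a time, round-robin across jobs.

    Returns the number of GPUs each job ends up holding.
    """
    held = {name: 0 for name in jobs}
    progressed = True
    while free_gpus > 0 and progressed:
        progressed = False
        for name, need in jobs.items():
            if free_gpus > 0 and held[name] < need:
                held[name] += 1
                free_gpus -= 1
                progressed = True
    return held


def schedule_gang(free_gpus: int, jobs: dict[str, int]) -> dict[str, int]:
    """Admit a job only if all of its pods fit at once; otherwise place none."""
    held = {}
    for name, need in jobs.items():
        if need <= free_gpus:
            held[name] = need
            free_gpus -= need
        else:
            held[name] = 0  # the whole gang waits and holds nothing
    return held


def runnable(held: dict[str, int], jobs: dict[str, int]) -> list[str]:
    """A training job can start only when every one of its pods is placed."""
    return [name for name, need in jobs.items() if held[name] == need]


# Hypothetical workload: two training jobs of 4 pods each on a 6-GPU cluster.
jobs = {"train-a": 4, "train-b": 4}
partial = schedule_pod_by_pod(6, jobs)
gang = schedule_gang(6, jobs)
print(partial, runnable(partial, jobs))
print(gang, runnable(gang, jobs))
```

In the pod-by-pod case each job holds three GPUs and neither can start, which is the deadlock. With gang admission one job runs to completion while the other waits without holding resources.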

Dragonfly: Image and Model Distribution Infrastructure for the Cloud-Native Era

Thu, 15 Jan 2026 10:00:00 +0800

In 2026, as AI and cloud-native infrastructure continue to evolve, image and model distribution is shifting from a “peripheral optimization point” to a critical factor affecting platform efficiency. Traditional approaches relying on centralized Registry + CDN often face dual challenges of speed and cost when dealing with scenarios involving large-scale concurrent nodes and large-volume images or models. Against this backdrop, Dragonfly has grown into a CNCF Graduated project and is adopted in production environments by companies such as Ant Group, Alibaba, Datadog, DiDi, and Kuaishou to support efficient distribution of containers and AI models.

Farewell to iptables: The Nftables Revolution in Kubernetes Network Data Plane

Fri, 09 Jan 2026 14:00:00 +0800

In the networking world of Kubernetes, kube-proxy has long played the role of “gatekeeper,” responsible for distributing Service traffic to backend Pods. However, for years, we’ve endured the performance pain of iptables mode or been forced to migrate to the more complex IPVS mode.

Fast forward to 2026: nftables mode reached General Availability (GA) in Kubernetes 1.33 in April 2025, so it is no longer an experimental option but the “new standard” for production environments. In fact, with the release of v1.35 at the end of 2025, the once-reliable ipvs mode was officially marked as Deprecated. This marks a complete “return to fundamentals” for the Linux kernel network stack in the cloud-native era.

Kubernetes 1.34/1.35 Certificate Revolution: From Manual Hell to Zero-Trust Heaven

Sat, 03 Jan 2026 19:00:00 +0800

I recently upgraded to 1.35 and found the certificate management changes nothing short of revolutionary, especially for self-managed K8s users, where operational overhead has been cut in half.

In the past, certificate issues were the “silent killer” behind security incidents: expired certificates causing outages, token leaks, and manual rotation consuming 30% of ops time. Versions 1.34/1.35 introduce native automated mTLS, so zero trust is no longer exclusive to Istio. Today, let’s dive into these new features and compare self-managed K8s with cloud K8s in a hands-on scenario.

Kubernetes v1.33–v1.35 Deep Dive: From Native Sidecar to AI Compute Foundation

Fri, 02 Jan 2026 09:50:00 +0800

Timeline Overview

v1.33 (Octarine): Released April 2025, Native Sidecar GA, security features enabled by default.
v1.34 (Of Wind & Will): Released August 2025, DRA GA, marking the native era of AI/GPU scheduling.
v1.35 (Timbernetes): Released December 2025, In-Place Pod Resize GA, zero-disruption elasticity becomes reality.

1. v1.33 “Octarine”: Sidecar Graduation and Default Security

The keywords for v1.33 are “Native Sidecar” and “Security Enabled by Default.” This release transforms long-standing experimental capabilities into dependable infrastructure for daily engineering.

IngressNightmare (CVE-2025-1974): Vulnerability Deep Dive and Gateway API Migration Guide

Sat, 27 Dec 2025 10:00:00 +0800

The “IngressNightmare” vulnerability disclosed in Ingress-NGINX in March 2025 has once again thrust the controller into the spotlight, serving as a stark warning for clusters still relying on traditional Ingress.

Below is a technical review focused on engineering practice, covering the vulnerability recap, risk analysis, short-term fixes, how to leverage this as an opportunity to migrate to Gateway API, and a comparison of pros and cons before and after migration.

Vulnerability Brief: IngressNightmare (CVE‑2025‑1974)

Severity: In March 2025, researchers disclosed a set of high-severity vulnerabilities in the Ingress-NGINX controller, collectively known as “IngressNightmare.” Among them, CVE‑2025‑1974 has a CVSS score of 9.8, rated as “Critical” by the official team and multiple security vendors, affecting a vast number of Kubernetes clusters.
Root Cause: The core issue lies in the Validating Admission Webhook. When validating an Ingress object, the controller generates an NGINX configuration based on the object and its annotations, then uses nginx -t for validation. During this process, insufficient filtering of annotations and configuration fragments allows attackers to inject arbitrary NGINX directives, ultimately leading to Remote Code Execution (RCE) on the controller Pod.
Low Attack Barrier: An attacker only needs access to the admission webhook within the Pod network (many clusters even expose it to the public internet) to trigger the vulnerability via unauthenticated requests. This is an unauthenticated RCE, highly susceptible to mass exploitation by worms or automated attack tools.
Vulnerability Chain: The same disclosure includes several other high-severity injection vulnerabilities (e.g., CVE‑2025‑24514, CVE‑2025‑1097, CVE‑2025‑1098), collectively forming the IngressNightmare vulnerability chain, with an attack surface far exceeding a single CVE.
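The root cause described above is a template injection problem, and a toy model makes it concrete. The sketch below is not the real ingress-nginx template or its fix: the template, the allow-list pattern, the function names, and the `injected_directive` placeholder are all assumptions made for illustration. It shows how a value interpolated verbatim can close one directive and open another, and how a strict allow-list rejects that value before rendering.

```python
import re

# Toy config template, NOT the real ingress-nginx template.
TEMPLATE = "location / {{\n    proxy_set_header X-Auth-Url {auth_url};\n}}\n"

# Hypothetical allow-list: URL-like characters only, so no whitespace,
# semicolons, braces, or quotes can reach the rendered config.
SAFE_VALUE = re.compile(r"^[A-Za-z0-9._~:/?&=%-]+$")


def render_unsafe(auth_url: str) -> str:
    """Interpolate the annotation value verbatim, as a vulnerable renderer would."""
    return TEMPLATE.format(auth_url=auth_url)


def render_checked(auth_url: str) -> str:
    """Reject values that could break out of the directive, then render."""
    if not SAFE_VALUE.fullmatch(auth_url):
        raise ValueError("annotation value contains disallowed characters")
    return TEMPLATE.format(auth_url=auth_url)


def count_directives(config: str) -> int:
    """Count ';'-terminated directives in the rendered config."""
    return config.count(";")


benign = "http://auth.internal/check"
# The ';' ends the intended directive; the newline starts an attacker-chosen one.
malicious = "http://x;\n    injected_directive on"

print(render_unsafe(benign))
print(render_unsafe(malicious))
```

The benign value renders a single directive, while the malicious value renders two. In the real vulnerability the injected directives are ones that NGINX acts on while `nginx -t` validates the generated configuration.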

Risk and Impact: From NGINX to Full Cluster Takeover

Sensitive Information Leakage: Once RCE is achieved within the ingress-nginx controller container, attackers can read all Kubernetes Secrets mounted to that Pod and use its service account credentials. Crucially, the ingress-nginx controller typically runs with a highly privileged ClusterRole, because it must read Secrets from all namespaces in the cluster to obtain TLS certificates. This means the blast radius of RCE is not limited to the controller’s own Namespace: every certificate and credential stored as a Secret in the cluster can leak.
Traffic Hijacking and Tampering: The controller usually has read and write permissions for Ingress resources in the cluster. Combined with RCE, attackers can further tamper with routing, transparently forwarding user traffic to attacker-controlled backends for man-in-the-middle attacks or data theft.
“One Hole to Rule the Cloud”: Practical tests by multiple security vendors show that in clusters with loose default network policies, an attacker only needs code execution in any Pod to reach the admission webhook laterally and escalate to cluster-level control.
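To check whether a controller’s role carries the cluster-wide Secret access described above, its RBAC rules can be scanned offline. The sketch below is a minimal example that assumes the rules have already been exported as plain dictionaries using the Kubernetes PolicyRule field names (`apiGroups`, `resources`, `verbs`, `resourceNames`). The sample rules and function names are hypothetical and are not copied from any ingress-nginx manifest. A ClusterRole rule only applies cluster-wide when it is bound through a ClusterRoleBinding, which this sketch does not verify.

```python
from typing import Iterable

# Verbs that let a subject read Secret contents or enumerate them.
READ_VERBS = {"get", "list", "watch", "*"}


def grants_secret_read(rule: dict) -> bool:
    """Return True if the rule allows reading Secrets without a name restriction."""
    in_core_group = any(g in ("", "*") for g in rule.get("apiGroups", []))
    on_secrets = any(r in ("secrets", "*") for r in rule.get("resources", []))
    can_read = any(v in READ_VERBS for v in rule.get("verbs", []))
    unrestricted = not rule.get("resourceNames")  # empty or missing means all names
    return in_core_group and on_secrets and can_read and unrestricted


def risky_rules(rules: Iterable[dict]) -> list[dict]:
    """Filter the rules that expose Secrets broadly."""
    return [rule for rule in rules if grants_secret_read(rule)]


# Hypothetical rules for illustration only.
example_cluster_role_rules = [
    {"apiGroups": [""], "resources": ["configmaps", "endpoints", "secrets"],
     "verbs": ["list", "watch"]},
    {"apiGroups": ["networking.k8s.io"], "resources": ["ingresses"],
     "verbs": ["get", "list", "watch"]},
    {"apiGroups": [""], "resources": ["secrets"], "verbs": ["get"],
     "resourceNames": ["tls-cert"]},
]

for rule in risky_rules(example_cluster_role_rules):
    print("broad Secret access:", rule)
```

Only the first sample rule is flagged: the second does not touch Secrets, and the third is restricted to a single named Secret.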

Short-Term Remediation: Patch First, Rebuild Later

Before discussing Gateway API migration, all clusters still running ingress-nginx need to take two immediate actions: