From Azure SRE Agent to HolmesGPT: AIOps Practices in Multi-Cloud Kubernetes Environments

Fri, 17 Apr 2026 19:40:00 +0800

In the multi-cloud Kubernetes era, the pain point for SREs is no longer just “too many alerts,” but rather investigation chains that are too long, context that is too scattered, and troubleshooting costs across clouds that are too high. What truly drains people isn’t glancing at a chart, but constantly switching between multiple cloud platforms, logging systems, deployment records, and ticketing systems.

This is why AI SRE Agents are starting to deliver real value. Their goal isn’t to be a better conversational Copilot, but to proactively take over the highly repetitive first half of the work—“checking logs, finding correlations, guessing root causes, and giving suggestions”—once an alert is triggered.

Hands-On · Building a Memory-Enabled AI Writing Partner (Part 4): Observability (Metrics + Logs + Trace + Cost)

Thu, 05 Feb 2026 16:00:00 +0800

In the previous post, we discussed the security of RAG systems and prompt injection protection. Today, let’s dive into another engineering deep-water zone: Observability.

When a system evolves from “it works” to “it works reliably long-term,” you will inevitably encounter three types of problems:

Slow: Is retrieval slow? Is the LLM slow? Or is some Agent stuck in a retry loop?
Expensive: Is token consumption being silently drained by a specific chain? Why doesn’t this month’s API bill add up?
Strange: Intermittent bugs that can’t be reproduced, leaving you to fix code based on “gut feeling.”

At this stage, I chose to build a complete Metrics + Logs system, rather than just sprinkling in a few print statements.

Grafana - Tag - Shengxu · Cloud Architecture & DevOps

From Azure SRE Agent to HolmesGPT: AIOps Practices in Multi-Cloud Kubernetes Environments

Hands-On · Building a Memory-Enabled AI Writing Partner (Part 4): Observability (Metrics + Logs + Trace + Cost)