<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Grafana - Tag - Shengxu · Cloud Architecture &amp; DevOps</title><link>https://shengxu.pages.dev/en/tags/grafana/</link><description>Cloud architecture &amp; DevOps notes by Shengxu: Kubernetes, Cilium, observability, LLM infra, AI agents.</description><generator>Hugo 0.153.2 &amp; FixIt v0.4.0-alpha.3-20251225101113-8ffb9a95</generator><language>en</language><lastBuildDate>Fri, 17 Apr 2026 19:40:00 +0800</lastBuildDate><atom:link href="https://shengxu.pages.dev/en/tags/grafana/index.xml" rel="self" type="application/rss+xml"/><item><title>From Azure SRE Agent to HolmesGPT: AIOps Practices in Multi-Cloud Kubernetes Environments</title><link>https://shengxu.pages.dev/en/posts/azure-sre-agent-to-holmesgpt/</link><pubDate>Fri, 17 Apr 2026 19:40:00 +0800</pubDate><guid>https://shengxu.pages.dev/en/posts/azure-sre-agent-to-holmesgpt/</guid><category domain="https://shengxu.pages.dev/en/categories/ai/">AI</category><category domain="https://shengxu.pages.dev/en/categories/kubernetes/">Kubernetes</category><category domain="https://shengxu.pages.dev/en/categories/devops/">DevOps</category><category domain="https://shengxu.pages.dev/en/categories/observability/">Observability</category><description>&lt;p&gt;In the multi-cloud Kubernetes era, the pain point for SREs is no longer just &amp;ldquo;too many alerts,&amp;rdquo; but rather investigation chains that are too long, context that is too scattered, and troubleshooting costs across clouds that are too high. What truly drains people isn&amp;rsquo;t glancing at a chart, but constantly switching between multiple cloud platforms, logging systems, deployment records, and ticketing systems.&lt;/p&gt;
&lt;p&gt;This is why AI SRE Agents are starting to deliver real value. Their goal isn&amp;rsquo;t to be a better conversational Copilot, but to proactively take over the highly repetitive first half of the work—&amp;ldquo;checking logs, finding correlations, guessing root causes, and giving suggestions&amp;rdquo;—once an alert is triggered.&lt;/p&gt;</description></item><item><title>Hands-On · Building a Memory-Enabled AI Writing Partner (Part 4): Observability (Metrics + Logs + Trace + Cost)</title><link>https://shengxu.pages.dev/en/posts/fantasy-novel-agent-observability/</link><pubDate>Thu, 05 Feb 2026 16:00:00 +0800</pubDate><guid>https://shengxu.pages.dev/en/posts/fantasy-novel-agent-observability/</guid><category domain="https://shengxu.pages.dev/en/categories/ai/">AI</category><category domain="https://shengxu.pages.dev/en/categories/devops/">DevOps</category><category domain="https://shengxu.pages.dev/en/categories/observability/">Observability</category><description>&lt;p&gt;In the previous post, we discussed the security of RAG systems and prompt injection protection. Today, let&amp;rsquo;s dive into another engineering deep-water zone: &lt;strong&gt;Observability&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;When a system evolves from &amp;ldquo;it works&amp;rdquo; to &amp;ldquo;it works reliably long-term,&amp;rdquo; you will inevitably encounter three types of problems:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Slow&lt;/strong&gt;: Is retrieval slow? Is the LLM slow? Or is some Agent stuck in a retry loop?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Expensive&lt;/strong&gt;: Is token consumption being silently drained by a specific chain? Why doesn&amp;rsquo;t this month&amp;rsquo;s API bill add up?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Strange&lt;/strong&gt;: Intermittent bugs that can&amp;rsquo;t be reproduced, leaving you to fix code based on &amp;ldquo;gut feeling.&amp;rdquo;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;At this stage, I chose to build a complete &lt;strong&gt;Metrics + Logs&lt;/strong&gt; system, rather than just sprinkling in a few &lt;code&gt;print&lt;/code&gt; statements.&lt;/p&gt;</description></item></channel></rss>