<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Observability on Euijun's Personal Blog</title><link>https://elbanic.github.io/tags/observability/</link><description>Recent content in Observability on Euijun's Personal Blog</description><generator>Hugo -- gohugo.io</generator><language>en-us</language><lastBuildDate>Mon, 15 Jun 2026 20:00:00 +0900</lastBuildDate><atom:link href="https://elbanic.github.io/tags/observability/index.xml" rel="self" type="application/rss+xml"/><item><title>「Agentic AI Dev Note」 Observing Agent Quality with Evaluation</title><link>https://elbanic.github.io/posts/observing-agent-quality-with-evaluation/</link><pubDate>Mon, 15 Jun 2026 20:00:00 +0900</pubDate><guid>https://elbanic.github.io/posts/observing-agent-quality-with-evaluation/</guid><description>&lt;p&gt;Across &lt;a href="https://elbanic.github.io/posts/develop-and-operating-agents-in-a-distributed-environment/"&gt;Part 1&lt;/a&gt; and &lt;a href="https://elbanic.github.io/posts/managing-agent-memory-in-a-distributed-systems/"&gt;Part 2&lt;/a&gt; of this 「Agentic AI Dev Note」 series, we made our agent operable in the cloud. Now it&amp;rsquo;s time to check whether that agent actually behaves the way we expect. There are three ways to measure an agent&amp;rsquo;s quality.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Code quality&lt;/strong&gt;: Most of it can be verified with unit, integration, and e2e tests. Because the same input always produces the same output, the checks can be exact.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;System quality&lt;/strong&gt;: Measured through runtime error rates, latency, dependency failures, and the like, via alarms and monitoring. Canary tests can monitor system quality too.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;LLM response quality&lt;/strong&gt;: Measures whether the LLM produced the expected answer to the user&amp;rsquo;s input.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;But how do you guarantee the response quality of a non-deterministic LLM, where the same input can yield a different output every time? For the first two, the same input gives the same output, so you can &lt;em&gt;prove&lt;/em&gt; that the behavior is &amp;ldquo;correct.&amp;rdquo; An LLM response, however, cannot be proven correct. And if you can&amp;rsquo;t prove it, the only option left is to &lt;em&gt;observe&lt;/em&gt; it: watch the trend of its behavior rather than check any single correct answer. This post focuses on that.&lt;/p&gt;</description></item></channel></rss>