「Agentic AI Dev Note」 Observing Agent Quality with Evaluation

Mon, 15 Jun 2026 20:00:00 +0900

Across Part 1 and Part 2 of this 「Agentic AI Dev Note」 series, we made our agent operable in the cloud. Now it’s time to check whether that agent actually behaves the way we expect. An agent’s quality can be measured along three dimensions.

Code quality: Most of it can be verified with unit/integration/e2e tests. The same input produces exactly the same output, so it can be verified precisely.
System quality: Measured through runtime error rates, latency, dependency failures, and the like, via alarms and monitoring. Canary tests can monitor system quality too.
LLM response quality: Measures whether the LLM produced the expected answer to the user’s input.

But how do you guarantee the response quality of a non-deterministic LLM, where the same input can yield a different output every time? For the first two, the same input gives the same output, so you can prove the result is “correct.” LLM responses offer no such proof. And if you can’t prove correctness, the only option left is to observe it: watch how behavior trends across many runs instead of checking any single answer against one correct output. This post focuses on that.
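To make "observe the trend" concrete, here is a minimal Python sketch, not the implementation used in this series. Everything in it is hypothetical: `fake_agent` stands in for a real, non-deterministic LLM call, `score` is a toy substring check standing in for a real evaluator, and `pass_rate` and `rolling_trend` are illustrative helper names.

```python
import random
from statistics import mean


def fake_agent(prompt: str, rng: random.Random) -> str:
    """Hypothetical stand-in for an LLM call: the same prompt can return different text."""
    return rng.choice(["Seoul", "The capital is Seoul.", "Seoul", "Busan"])


def score(response: str, expected: str) -> float:
    """Toy evaluator (an assumption): 1.0 if the expected answer appears, else 0.0."""
    return 1.0 if expected.lower() in response.lower() else 0.0


def pass_rate(prompt: str, expected: str, runs: int, rng: random.Random) -> float:
    """Run the same prompt many times and report the fraction of passing responses."""
    return mean(score(fake_agent(prompt, rng), expected) for _ in range(runs))


def rolling_trend(rates: list[float], window: int) -> list[float]:
    """Smooth a series of pass rates with a simple moving average."""
    return [
        mean(rates[i - window + 1 : i + 1])
        for i in range(window - 1, len(rates))
    ]


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only so the sketch is repeatable
    # One pass rate per evaluation round (for example, one round per deployment).
    history = [
        pass_rate("What is the capital of South Korea?", "Seoul", 50, rng)
        for _ in range(5)
    ]
    print("pass rates:", history)
    print("trend:", rolling_trend(history, window=3))
```

A unit test would assert one exact output. Here no single response is asserted. The quantity being watched is the pass rate over many runs and how it moves between rounds, so a regression shows up as a falling trend, not as one failing case.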

Observability on Euijun's Personal Blog
