LLM Observability: A Practical Look at Tracing, Logging, and Evaluation Loops

Picture a restaurant kitchen. The dishes look great and customers keep ordering, but if you can't see what's happening behind the pass, you'll never understand why one day everyone leaves their plate half-eaten. An LLM-based product is exactly like this: the model is a black box, and what goes on inside it usually stays in the dark. Observability means opening a window into that kitchen. In this post we'll talk, in plain language, about tracing, logging, evaluation loops, and how to protect quality and catch regressions in production.

Why is observability different from classic software?

In traditional software, a function usually returns the same output for the same input. When something breaks, you see a stack trace, find the line, and fix it. With LLMs, things are slipperier: the same question can get one answer today and an entirely different one tomorrow. The question "is it crashing or not?" is replaced by "is it good or bad?" And that's hard to measure, because "good" is subjective.

This is exactly why LLM observability needs three layers together: what happened (tracing), how it happened (logging), and was it good (evaluation). If one is missing, you're either blind or unable to interpret what you see.

Classic monitoring asks "is the system up?" LLM observability asks "the system is up, but is what it's saying correct and useful?" The second question is far harder to answer, and it's also the one that really matters.

Tracing: the journey of a request

A user's question rarely reaches the model in a single step. First, documents are retrieved, then a prompt is built, perhaps a tool is called, the model generates a reply, and finally a verification step runs. Tracing means recording every link in this chain end to end, just like a parcel being scanned at every stop from the warehouse to your door.

A good trace answers these questions: Which documents were retrieved? What exact prompt went to the model? Which tool ran, and for how long? Where did token consumption spike? Which step caused the latency? When an answer turns out poor, only the trace lets you tell whether the fault lies in retrieval or in the model itself.

  • Span: A single step in the chain (for example "document retrieval" or "model call"). It carries start, end, and input/output information.
  • Trace: The tree formed by all the spans of one user request. It is the full story of the request.
  • Metadata: Context tags such as model name, version, temperature, and user ID. They are pure gold for filtering later.

Tip: Add version tags (such as the prompt version and model name) to every trace. When quality drops one day, the answer to "after which change?" is hidden in those tags.
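
To make the span, trace, and metadata vocabulary concrete, here is a minimal sketch of how they could be modeled as plain Python data structures. The class names, field names, and the example values (`found=4`, `tokens_in=812`) are illustrative assumptions, not the schema of any particular tracing library.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Span:
    """One step in the chain, e.g. "retrieval" or "model_call"."""
    name: str
    parent_id: Optional[str] = None
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    start: float = field(default_factory=time.perf_counter)
    end: Optional[float] = None
    attributes: Dict[str, Any] = field(default_factory=dict)

    def finish(self, **attributes: Any) -> None:
        """Attach input/output details and record the end time."""
        self.attributes.update(attributes)
        self.end = time.perf_counter()

    @property
    def duration_s(self) -> Optional[float]:
        return None if self.end is None else self.end - self.start


@dataclass
class Trace:
    """All spans of one user request, plus metadata used for filtering."""
    name: str
    metadata: Dict[str, Any] = field(default_factory=dict)
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    spans: List[Span] = field(default_factory=list)

    def start_span(self, name: str, parent: Optional[Span] = None) -> Span:
        span = Span(name=name, parent_id=parent.span_id if parent else None)
        self.spans.append(span)
        return span

    def children(self, parent: Optional[Span]) -> List[Span]:
        """Spans directly under `parent`; pass None to get the root spans."""
        parent_id = parent.span_id if parent else None
        return [s for s in self.spans if s.parent_id == parent_id]


# Version tags live in the trace metadata so they can be filtered on later.
trace = Trace("question-answer", metadata={"prompt_version": "v3", "model": "model-x"})
root = trace.start_span("request")
retrieval = trace.start_span("retrieval", parent=root)
retrieval.finish(found=4)
model_call = trace.start_span("model_call", parent=root)
model_call.finish(tokens_in=812, tokens_out=96)
root.finish()
```

The parent/child link is what turns a flat list of steps into the tree described above: `trace.children(root)` returns the retrieval and model-call spans in the order they ran.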

Logging: what, and how much, to record?

Logging is tracing's rawer, more flexible cousin. While a trace is a structured tree, a log is usually a free-form record that says "right now I did this." In the LLM world, the most valuable logs include: the exact prompt sent, the model's raw response, the token counts used, latency, error messages, and any user feedback (thumbs up/down).

But there are two traps here. The first is privacy: prompts and responses can contain personal or sensitive data, so mask or anonymize them before you store them, and keep retention periods short. The second is noise: if you record everything, finding what matters becomes a needle-in-a-haystack search. Decide upfront which fields you actually need.
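
Both traps can be handled at the point where the log record is built. The sketch below masks e-mail addresses and phone-like numbers and keeps only an allow-list of fields. The regular expressions are deliberately simple assumptions and will miss many kinds of personal data, and the field names in `LOGGED_FIELDS` are hypothetical; treat it as a starting point, not a complete anonymizer.

```python
import re

# Deliberately simple patterns; real PII detection needs far more than this.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

# Allow-list decided upfront: anything not named here is never stored.
LOGGED_FIELDS = ("prompt", "response", "tokens_in", "tokens_out", "latency_ms", "feedback")


def mask(text):
    """Replace e-mail addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def build_log_record(event):
    """Keep only allow-listed fields and mask the free-text ones."""
    record = {key: event[key] for key in LOGGED_FIELDS if key in event}
    for key in ("prompt", "response"):
        if key in record:
            record[key] = mask(record[key])
    return record


record = build_log_record({
    "prompt": "Mail ayse@example.com or call +90 555 123 45 67 today",
    "response": "Done.",
    "tokens_in": 18,
    "debug_dump": "large internal state that nobody will ever read",
})
```

Here `record` keeps the masked prompt, the response, and the token count, while `debug_dump` is dropped because it is not on the allow-list.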

A good log is a letter you write to your future self. If three months from now a complaint comes in and you can reproduce the prompt and response from that moment, you logged it right.

Evaluation loops

Tracing and logging tell you "what happened"; evaluation measures "was it good." An evaluation loop means testing the model's output against a quality benchmark and repeating that regularly. There are three common approaches:

  1. Reference-based evaluation: If you have an expected "golden answer," you compare the model's output against it. Ideal for tasks with clear answers, like classification or extraction.
  2. LLM-as-judge: When there is no single correct answer for open-ended responses, you use another model as a "referee" to score the output against specific criteria. Fast and scalable, but remember the judge can have biases too.
  3. Human evaluation: The most reliable but most expensive method. It's usually run on a sample, often to calibrate the LLM-as-judge.
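
The first approach is the easiest to automate. As a minimal sketch, reference-based evaluation for a classification-style task can be as simple as a normalized exact-match accuracy. The normalization rule (lowercasing and collapsing whitespace) and the example labels are assumptions; extraction or free-text tasks usually need a looser comparison.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial differences don't count as errors."""
    return " ".join(text.lower().split())


def exact_match_accuracy(predictions, references):
    """Fraction of predictions that equal their golden answer after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one example is required")
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


accuracy = exact_match_accuracy(
    ["Refund", "  shipping ", "billing"],   # model outputs
    ["refund", "shipping", "account"],      # golden answers
)
```

In this toy run two of the three outputs match their golden answers, so `accuracy` comes out at two thirds.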

The key word here is loop. Evaluation isn't a one-off exam; you re-run the same test set on every prompt change and every model update. Building a fixed test set (a golden set) is the LLM equivalent of regression tests in software.

Tip: Start small. A golden set of 30-50 well-chosen examples is worth far more than a giant set that tries to be perfect but never gets built. Grow the set over time with real failures from production.
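
A golden set does not need special tooling to get started. The sketch below stores cases as small records, runs them through a generation function and a scoring function, and then grows the set with a failure found in production. `fake_generate`, its canned answers, and the `exact` scorer are stand-ins invented for this example; in practice they would be your real pipeline and whichever evaluation method fits the task.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class GoldenCase:
    case_id: str
    question: str
    expected: str


def run_golden_set(cases, generate, score):
    """Run every case through the system and return {case_id: score}."""
    return {c.case_id: score(generate(c.question), c.expected) for c in cases}


# Stand-in for the real pipeline: canned answers, one of them wrong.
CANNED = {
    "Capital of France?": "Paris",
    "2 + 2?": "4",
    "Capital of Australia?": "Sydney",
}


def fake_generate(question):
    return CANNED.get(question, "")


def exact(answer, expected):
    return 1.0 if answer.strip().lower() == expected.strip().lower() else 0.0


golden_set = [
    GoldenCase("g1", "Capital of France?", "Paris"),
    GoldenCase("g2", "2 + 2?", "4"),
]
first_run = run_golden_set(golden_set, fake_generate, exact)

# A bad answer seen in production becomes a permanent test case.
golden_set.append(GoldenCase("g3", "Capital of Australia?", "Canberra"))
second_run = run_golden_set(golden_set, fake_generate, exact)
second_mean = mean(second_run.values())
```

The first run passes every case; once the production failure is added, the second run exposes it and the mean score drops, which is exactly the signal the loop is meant to produce.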

Quality and regression tracking in production

A model that works well in the lab can behave differently in production when it meets real, messy inputs. Quality tracking in production is a continuous pulse check. You watch signals like: user feedback, how LLM-as-judge scores trend over time, latency and cost curves, "empty/irrelevant answer" rates, and how often safety filters trigger.
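
One simple way to watch how judge scores trend over time is a rolling mean over the most recent answers. The sketch below is an in-memory illustration under assumed parameters: the window size and minimum sample count are arbitrary, and the 0.6 threshold simply mirrors the one used in the tracing example in this post. A production setup would normally compute this in a metrics system rather than in application code.

```python
from collections import deque
from statistics import mean


class RollingScoreMonitor:
    """Track the mean of the most recent scores and flag when it drops too low."""

    def __init__(self, window=50, threshold=0.6, min_samples=10):
        self.scores = deque(maxlen=window)   # old scores fall out automatically
        self.threshold = threshold
        self.min_samples = min_samples

    def add(self, score):
        """Record a score; return True when the rolling mean is below the threshold."""
        self.scores.append(score)
        if len(self.scores) < self.min_samples:
            return False                     # too little data to judge a trend
        return mean(self.scores) < self.threshold


monitor = RollingScoreMonitor(window=5, threshold=0.6, min_samples=3)
flags = [monitor.add(s) for s in [0.9, 0.8, 0.9, 0.2, 0.1, 0.1]]
```

With these inputs the monitor stays quiet through the first low score and only flags once the low scores pull the five-answer mean under 0.6, which keeps a single bad answer from paging anyone.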

A regression is when something that used to work breaks after a change. With LLMs this can be sneaky, because the change may not have come from you: the provider can silently update the model, a prompt tweak can have unexpected side effects, or the way users phrase questions can shift over time (data drift). So run your golden set automatically at regular intervals and set an alert if the score drops below a threshold.

  • Offline eval: Before shipping, with a fixed test set. It answers "did this change break anything?"
  • Online monitoring: Live, with real traffic. It answers "what is actually happening right now?"
  • Feedback: Add the bad examples caught in production back to the golden set and close the loop.

A solid rule: never ship a prompt or model change without running an evaluation. Just as code isn't shipped before a regression test passes, a prompt shouldn't ship before an eval passes.
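
That rule can be enforced mechanically with a release gate that compares the candidate's golden-set scores against the current baseline. The sketch below is one possible shape for such a check; the function name, the per-case score dictionaries, and the 0.02 tolerance are assumptions chosen for illustration.

```python
from statistics import mean


def release_gate(baseline, candidate, max_mean_drop=0.02):
    """Compare per-case scores; return (passed, list of regressed case ids)."""
    missing = sorted(set(baseline) - set(candidate))
    if missing:
        raise ValueError(f"candidate run is missing cases: {missing}")
    # Cases that scored lower than before, even if the mean still looks fine.
    regressed = sorted(cid for cid, old in baseline.items() if candidate[cid] < old)
    drop = mean(baseline.values()) - mean(candidate[cid] for cid in baseline)
    return drop <= max_mean_drop, regressed


baseline = {"g1": 1.0, "g2": 1.0, "g3": 0.5}
candidate = {"g1": 1.0, "g2": 0.0, "g3": 1.0}
passed, regressed = release_gate(baseline, candidate)
```

Returning the regressed case IDs alongside the pass/fail decision matters: a change can improve one case and break another, and the mean alone would hide which answer got worse. In a CI job, a failed gate would typically end the run with a non-zero exit code.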

A small tracing example

Below is a pseudo-code sketch of tracing a retrieval-augmented generation (RAG) chain. The goal is to record each step as a separate span so that when an answer turns out poor, you can see where it went wrong:

# Trace a user request end to end
with trace("question-answer", user_id=uid, prompt_version="v3") as t:

    with t.span("retrieval") as s:
        docs = retrieve(query)              # fetch documents
        s.log(found=len(docs), query=query)

    with t.span("model_call") as s:
        prompt = build_prompt(query, docs)  # build the prompt
        answer = model.generate(prompt)
        s.log(tokens_in=count(prompt),
              tokens_out=count(answer),
              model="model-x")

    with t.span("eval") as s:
        score = judge(query, answer)        # LLM-as-judge score
        s.log(score=score)
        if score < 0.6:                     # below threshold
            alert("Low quality answer", trace_id=t.id)

The key detail is the threshold check in the final span: when the score falls below a certain level, an alert fires automatically and is stored with the relevant trace ID. That way you can later open the bad answer and inspect its whole journey step by step; it becomes clear whether the fault was in retrieval, the prompt, or the model.
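
For readers who want to run something, here is a minimal, self-contained sketch of the `trace(...)` and `t.span(...)` helpers the pseudo-code relies on, built on Python's `contextlib`. Everything in it is an assumption made for illustration: the in-memory `SINK` and `ALERTS` lists stand in for a real trace store and alerting channel, and the retrieval and judge steps are replaced with hard-coded values.

```python
import time
import uuid
from contextlib import contextmanager

SINK = []     # finished traces; stand-in for a real trace store
ALERTS = []   # fired alerts; stand-in for a real alerting channel


class SpanRecorder:
    def __init__(self, name):
        self.name = name
        self.fields = {}
        self.duration_s = None

    def log(self, **fields):
        self.fields.update(fields)


class TraceRecorder:
    def __init__(self, name, metadata):
        self.id = uuid.uuid4().hex
        self.name = name
        self.metadata = metadata
        self.spans = []

    @contextmanager
    def span(self, name):
        recorder = SpanRecorder(name)
        self.spans.append(recorder)
        started = time.perf_counter()
        try:
            yield recorder
        finally:
            # Duration is recorded even if the step raises.
            recorder.duration_s = time.perf_counter() - started


@contextmanager
def trace(name, **metadata):
    current = TraceRecorder(name, metadata)
    try:
        yield current
    finally:
        SINK.append(current)


def alert(message, trace_id):
    ALERTS.append({"message": message, "trace_id": trace_id})


with trace("question-answer", user_id="u-1", prompt_version="v3") as t:
    with t.span("retrieval") as s:
        docs = ["doc-a", "doc-b"]        # stand-in for retrieve(query)
        s.log(found=len(docs))
    with t.span("eval") as s:
        score = 0.4                      # stand-in for judge(query, answer)
        s.log(score=score)
        if score < 0.6:
            alert("Low quality answer", trace_id=t.id)
```

After this runs, `ALERTS[0]["trace_id"]` matches `SINK[0].id`, so the alert leads straight back to the stored trace and its spans.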

Key takeaways

  • LLM observability has three layers: tracing (what happened), logging (how it happened), and evaluation (was it good).
  • Tracing splits a request into spans so you can tell whether a fault is in retrieval or in the model.
  • When logging, protect privacy and avoid drowning in noise; the goal is to reproduce the event three months later.
  • Evaluation is a loop: start with a small golden set, re-run it on every change, and grow it with production failures.
  • Regressions arrive silently; put offline eval at the gate before release, and keep listening to live traffic with online monitoring.

What's the difference between tracing and logging?

Logging is free-form, scattered records that say "right now I did this." Tracing structures all the steps of one request as a connected tree. Tracing gives you the whole picture and the relationships between steps, while logs give you the raw detail of individual events; the two are strongest together.

Is LLM-as-judge reliable?

It's very valuable because it's fast and scalable, but it isn't flawless; the judge can have biases too. Best practice is to calibrate the LLM-as-judge with a small human evaluation and treat it as an early-warning system rather than the sole decision-maker.
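
A rough way to calibrate is to have humans label a small sample as acceptable or not, then measure how often the judge's pass/fail decision agrees with them. The sketch below assumes the judge returns a score between 0 and 1 and that 0.6 is the pass threshold; both are illustrative choices, and a plain agreement rate is only a first check, not a full reliability analysis.

```python
def judge_agreement(judge_scores, human_pass, threshold=0.6):
    """Return (agreement rate, indices where judge and human disagree)."""
    if len(judge_scores) != len(human_pass):
        raise ValueError("judge_scores and human_pass must have the same length")
    if not judge_scores:
        raise ValueError("at least one labeled sample is required")
    disagreements = [
        i
        for i, (score, ok) in enumerate(zip(judge_scores, human_pass))
        if (score >= threshold) != ok
    ]
    rate = 1 - len(disagreements) / len(judge_scores)
    return rate, disagreements


rate, disagreements = judge_agreement(
    [0.9, 0.7, 0.3, 0.65],          # judge scores for four sampled answers
    [True, True, False, False],     # human verdicts for the same answers
)
```

In this sample the judge agrees with the humans on three of four answers, and the returned index points at the one worth reading by hand to see whether the judge prompt or the threshold needs adjusting.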

How do I catch regressions early in production?

Run a fixed golden test set automatically at regular intervals and set an alert when the score drops below a threshold. Also watch your model provider's version changes and any drift in user questions; regressions often come not from you, but from the environment changing.


In short, LLM observability isn't a luxury; it's the foundational infrastructure of every AI product taken into production. When you set up tracing, logging, and evaluation loops together, you turn your model from a black box into a transparent system you can manage with confidence. If you're planning to build observable, measurable, and trustworthy AI solutions, the EcoFluxion team would be glad to design this structure with you.

İsmail Tarık Şenkal

EcoFluxion Teknoloji A.Ş. · Co-Founder

A developer and entrepreneur working on Turkish-focused AI products — the name behind EcoFluxion and İçtiHub.
