Evaluating LLM Workflows: A Field Guide

June 2026

I spent the last several months building an eval system for an LLM workflow. Examples below use a fictional product I'll call Beacon, a workflow that researches competitors and produces a structured weekly brief. The real lessons are the same regardless of what the workflow does.

Most of what I learned is at odds with how people usually talk about LLM evals.

Here are the pieces I think transfer.

The thing nobody tells you about LLM judges

LLM judges are unreliable for anything that isn't genuinely semantic.

They can't count. They're bad at format rules and regex-style matching. On simple binary questions they contradict themselves: sometimes the reasoning walks through why an output should pass, and then the score field comes back passed=false. Run the same prompt twice and you can get different answers. This is well documented at this point, but the practical implication isn't always obvious: stop asking judges questions that aren't actually judgment questions.

The cheapest, most reliable check is the one that doesn't involve a model at all.

Three paths, in order of trust

Every criterion I want to check on workflow output goes through one of three paths. The order matters. Each path is cheaper and more honest than the next.

Programmatic verifiers. Deterministic Python over the workflow's output object. "Exactly 3 takeaways?" Count them. "Every takeaway starts with an emoji followed by a bolded phrase?" Regex. "The summary section has between 80 and 200 words?" Count. "Competitor list is non-empty?" len(competitors) > 0. These never flake, never disagree with themselves, and cost nothing.

Numeric LLM evals. A 0 to 10 score from Haiku 4.5, with a passing threshold around 7. I use this for genuinely fuzzy dimensions like "completeness of the brief" or "analytical depth." The number is noisy run-to-run, but the trend across changes is informative.

Binary LLM evals. Pass/fail per judge for criteria that are actually semantic but still have a clear yes-or-no answer. "The brief names at least three competitors with specific revenue or headcount figures." "The conclusion synthesizes across the prior sections instead of copy-pasting from them." A person would answer yes or no here, but the answer requires reading and judging.
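For the numeric path, here is a minimal sketch of how a threshold and a trend comparison might be wired up. The scores are made up, the function names are hypothetical, and comparing mean scores is just one simple way to read a trend through run-to-run noise.

```python
from statistics import mean

PASS_THRESHOLD = 7  # "around 7" in the text; tune per dimension


def passes(score, threshold=PASS_THRESHOLD):
    """A single 0-10 judge score passes if it meets the threshold."""
    return score >= threshold


def trend(baseline_scores, candidate_scores):
    """Mean score after a change minus mean score before it."""
    return mean(candidate_scores) - mean(baseline_scores)


# Made-up scores for one fuzzy dimension across four runs each.
baseline = [6, 7, 6, 8]   # before a prompt change
candidate = [7, 8, 8, 7]  # after the change

print([passes(s) for s in candidate])
print(trend(baseline, candidate))
```

A single score near the threshold says little on its own. The difference between the two means is the part worth watching.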

The rule of thumb when adding a new criterion: if you can phrase it as "count X," "match regex Y," "field Z non-empty," or "section word count in range [a, b]," it's programmatic. It is not a judge question. Don't put it through a model.
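As a rough sketch, the four programmatic examples above could look something like this. The shape of the brief dict, its field names, and the check names are assumptions for illustration, not Beacon's actual schema, and the emoji regex only covers the common emoji blocks.

```python
import re

# Hypothetical output object; field names are assumed for illustration.
brief = {
    "takeaways": [
        "\U0001F680 **Pricing shift:** Acme cut its entry tier.",
        "\U0001F4C9 **Hiring slowdown:** Globex paused recruiting.",
        "\U0001F91D **New partnership:** Initech signed a reseller deal.",
    ],
    "summary": " ".join(["word"] * 120),
    "competitors": ["Acme", "Globex", "Initech"],
}

# Approximate emoji match (common emoji blocks only), optional variation
# selector, then a space and a **bolded phrase**.
TAKEAWAY_RE = re.compile(
    r"^[\U0001F300-\U0001FAFF\u2600-\u27BF]\uFE0F? \*\*[^*]+\*\*"
)


def check_takeaway_count(brief, expected=3):
    return len(brief["takeaways"]) == expected


def check_takeaway_format(brief):
    return all(TAKEAWAY_RE.match(t) for t in brief["takeaways"])


def check_summary_word_range(brief, lo=80, hi=200):
    return lo <= len(brief["summary"].split()) <= hi


def check_competitors_non_empty(brief):
    return len(brief["competitors"]) > 0


CHECKS = {
    "takeaway_count": check_takeaway_count,
    "takeaway_format": check_takeaway_format,
    "summary_word_range": check_summary_word_range,
    "competitors_non_empty": check_competitors_non_empty,
}


def run_verifiers(brief):
    """Run every deterministic check and return {check name: passed}."""
    return {name: bool(fn(brief)) for name, fn in CHECKS.items()}


print(run_verifiers(brief))
```

Each check is a plain function over the output object, so adding a criterion means adding one function and one dictionary entry.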

Cross-model judges, and why disagreement is the point

For binary criteria I run two judges from different model families: Haiku and GPT-4o-mini. Same criterion, two independent verdicts.

The point isn't to average them. It's to surface disagreement as signal. Every criterion lands in one of three buckets:

Both pass. Almost certainly fine.
Both fail. Strong signal of a real defect. The two reasons should describe the same problem. If they do, you're looking at something to fix in the prompt.
Disagreement. Usually means the criterion text is ambiguous, or one judge is simply wrong. Either way, it's a flag to read the actual output and decide.

Disagreement detection is a first-class output of my eval runner. Disagreements get written into the per-run summary alongside everything else, so I can sort by them when reviewing.
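The bucketing itself is tiny. A minimal sketch, assuming each criterion has one boolean verdict per judge; the criterion names and verdicts below are invented for illustration.

```python
from collections import Counter


def bucket(primary_passed, secondary_passed):
    """Classify one criterion from two independent judge verdicts."""
    if primary_passed and secondary_passed:
        return "both_pass"
    if not primary_passed and not secondary_passed:
        return "both_fail"
    return "disagreement"


# Hypothetical verdicts: criterion -> (first judge, second judge).
verdicts = {
    "names_three_competitors_with_figures": (True, True),
    "conclusion_synthesizes": (False, False),
    "tone_is_neutral": (True, False),
}

buckets = {name: bucket(a, b) for name, (a, b) in verdicts.items()}
counts = Counter(buckets.values())
disagreements = sorted(n for n, b in buckets.items() if b == "disagreement")

print(buckets)
print(dict(counts))
print(disagreements)
```

Writing the bucket name next to each criterion in the summary is what makes it possible to sort or filter by disagreement later.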

The asymmetric blocking rule

This one feels weird the first time you see it. Only Haiku's binary results gate CI. GPT-4o-mini's results are recorded as signal and never block a build.

The reasoning: the second judge is there to expose ambiguity, not to vote. If both judges had to pass for the build to go green, every disagreement would block the pipeline. And most disagreements are ambiguity in the criterion text rather than a real defect. You'd spend every day rewriting prompts to please two different models with two different reading styles.

So one judge is the strict bar. The other is the second opinion you read when you're trying to figure out whether the bar itself is calibrated. In production runs I disable cross-model entirely. It's a local iteration tool, not a CI gate.
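In code, the asymmetry might look like the sketch below. The result dict shape, the judge labels, and the function names are assumptions for illustration; the point is only that the second judge's verdicts never enter the gating decision.

```python
GATING_JUDGE = "haiku"  # label for the judge that gates CI; any other judge is advisory


def build_is_green(results, gating_judge=GATING_JUDGE):
    """True unless the gating judge failed at least one criterion."""
    return all(r["passed"] for r in results if r["judge"] == gating_judge)


def advisory_flags(results, gating_judge=GATING_JUDGE):
    """Criteria a non-gating judge failed: recorded for review, never blocking."""
    return [
        r["criterion"]
        for r in results
        if r["judge"] != gating_judge and not r["passed"]
    ]


# Hypothetical results for two criteria, each seen by both judges.
results = [
    {"criterion": "names_three_competitors", "judge": "haiku", "passed": True},
    {"criterion": "names_three_competitors", "judge": "gpt-4o-mini", "passed": False},
    {"criterion": "conclusion_synthesizes", "judge": "haiku", "passed": True},
    {"criterion": "conclusion_synthesizes", "judge": "gpt-4o-mini", "passed": True},
]

print(build_is_green(results))
print(advisory_flags(results))
```

Here the build stays green even though the second judge failed a criterion. That failure shows up only as a flag to read.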

Errors are not failures

This sounds obvious. It wasn't.

When the OpenAI or Anthropic API hiccups and a judge returns no result, that is not the workflow failing. It's the eval failing. Treating the two the same way makes your dashboards lie to you.

So every outcome carries an error field. Outages are tracked separately from real fails. The summary JSON for each run also gets stamped with the git SHA, branch, judge model IDs, and environment config. A result is always traceable to the exact code and judge versions that produced it. When something looks weird, you can go back and ask: did the workflow regress, or did I change my judge model?
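A minimal sketch of what such a record and stamp could look like. The Outcome fields, the summary keys, and the helper names are my assumptions here, not a fixed schema, and the model IDs are left as placeholders. The git helper falls back to "unknown" when git is unavailable or the directory is not a repository.

```python
import json
import subprocess
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class Outcome:
    criterion: str
    judge: str
    passed: Optional[bool] = None  # None when the judge never returned a verdict
    error: Optional[str] = None    # set when the eval itself failed (API outage, timeout)


def git(*args):
    """Run a git command and return its output, or "unknown" if it cannot run."""
    try:
        done = subprocess.run(
            ["git", *args], capture_output=True, text=True, check=True
        )
        return done.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def build_summary(outcomes, judge_models, environment):
    """Count errors separately from real fails and stamp the run with its provenance."""
    return {
        "git_sha": git("rev-parse", "HEAD"),
        "branch": git("rev-parse", "--abbrev-ref", "HEAD"),
        "judge_models": judge_models,
        "environment": environment,
        "passed": sum(1 for o in outcomes if o.error is None and o.passed is True),
        "failed": sum(1 for o in outcomes if o.error is None and o.passed is False),
        "errored": sum(1 for o in outcomes if o.error is not None),
        "outcomes": [asdict(o) for o in outcomes],
    }


outcomes = [
    Outcome("names_three_competitors", "judge_a", passed=True),
    Outcome("conclusion_synthesizes", "judge_a", passed=False),
    Outcome("conclusion_synthesizes", "judge_b", error="timeout calling judge API"),
]
summary = build_summary(
    outcomes,
    judge_models={"judge_a": "<model id>", "judge_b": "<model id>"},
    environment={"temperature": 0},
)
print(json.dumps(summary, indent=2))
```

The errored count never feeds into the failed count, so a dashboard built on this summary can show an outage without showing a regression.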

What to do when an eval fails

This is the part most people get wrong, and the part that took me longest to internalize.

When automated evals flag a failure, do not assume the workflow is broken.

The loop I run is roughly:

Read the latest summary JSON for the workflow.
Trust programmatic results as ground truth for what they cover. If a programmatic check passes, ignore any LLM complaint in that same territory.
Bucket the binary criteria: both-pass, both-fail, disagreement.
For both-fail and disagreement criteria, dump the actual workflow output the judges saw and read it yourself.
Compare your read against the judge's claim. Is the workflow wrong, or is the judge wrong?
Decide an action, in priority order:
- Judge error or ambiguous criterion: fix the eval. Reword the criterion, or move it to a programmatic verifier.
- Real workflow defect: fix the prompt in the workflow's instructions.
- Subjective dimension: keep it as baseline signal. Don't gate on it.
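The mechanical half of that loop can be sketched as below. The human read in the middle is the part that can't be automated. The summary layout, the covered_by field linking a binary criterion to a programmatic check, and the finding labels are all hypothetical.

```python
def bucket(primary_passed, secondary_passed):
    if primary_passed and secondary_passed:
        return "both_pass"
    if not primary_passed and not secondary_passed:
        return "both_fail"
    return "disagreement"


def review_queue(summary):
    """Criteria whose raw workflow output a person should read."""
    programmatic = summary["programmatic"]
    queue = []
    for item in summary["binary"]:
        covered_by = item.get("covered_by")
        if covered_by is not None and programmatic.get(covered_by) is True:
            continue  # a passing programmatic check overrides judge complaints in its territory
        b = bucket(item["primary"], item["secondary"])
        if b != "both_pass":
            queue.append((item["criterion"], b))
    return queue


# After reading the output, a person picks a finding; the action follows from it.
ACTIONS = {
    "judge_error": "fix the eval: reword the criterion or move it to a programmatic verifier",
    "ambiguous_criterion": "fix the eval: reword the criterion or move it to a programmatic verifier",
    "workflow_defect": "fix the prompt in the workflow's instructions",
    "subjective": "keep as baseline signal; do not gate on it",
}

# Hypothetical summary for one run.
summary = {
    "programmatic": {"takeaway_count": True},
    "binary": [
        {"criterion": "has_three_takeaways", "primary": False, "secondary": True,
         "covered_by": "takeaway_count"},
        {"criterion": "conclusion_synthesizes", "primary": False, "secondary": False},
        {"criterion": "names_three_competitors", "primary": True, "secondary": True},
        {"criterion": "tone_is_neutral", "primary": True, "secondary": False},
    ],
}

for criterion, b in review_queue(summary):
    print(criterion, b)
```

Note that the queue produces reading assignments, not verdicts. Nothing in it decides whether the workflow or the judge is wrong.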

The principle I keep coming back to: fix the measurement before you fix the thing being measured. Verifier additions and criterion rewrites improve the signal you'd use to judge any future prompt change, so they come first. If you start tuning your prompt against a flaky eval, you're chasing the wrong gradient.

There's a corollary I had to learn the hard way: don't touch the prompt unless someone asks. Most people building evals want to see an honest baseline before they decide whether the eval or the workflow is at fault. If you've quietly fixed three prompts on the way to "reviewing the evals," you've poisoned the baseline.

How it actually evolved

The system didn't arrive fully formed. It grew through a series of small refinements, and each one solved a problem the previous version had created.

The first version was just numeric scores. That caught regressions in fuzzy dimensions but missed clear yes/no defects.

Adding binary criteria caught the clear defects, but a single judge's pass/fail flipped between runs and CI started flaking.

Adding a second judge from a different family fixed the flake by making disagreement a separate signal. But only after I stopped treating both judges as voters and let just one of them gate the build.

Then programmatic verifiers got carved out from the LLM layer entirely, which collapsed about a third of the criteria into checks that never lie.

Then a skill got written down so the methodology was repeatable instead of living in my head.

Then judge outages started getting tracked separately from real failures, because too many "failures" turned out to be the API having a bad afternoon.

The most interesting refinement was the last one. I had a roughly 540-line Python script that generated an HTML report from each run. It worked. I deleted it. Letting the agent read the summary JSON and write a report tailored to whatever that specific run surfaced was more useful than any fixed template. The codified part is what colors and sections to use. The writing is per-run.

What I'd tell someone starting today

If you're building evals for an LLM workflow and you want to skip the longest detours:

Start with programmatic verifiers. Every criterion you can express as code is one criterion that will never lie to you.
Don't ask judges to count or match formats. They're bad at it and they'll mask real problems.
If you use LLM judges, fan out across two families and treat disagreement as a separate category from "fail." Then pick one to gate on.
Stamp every run with the code version and judge versions. Future-you will thank present-you.
Treat outages as outages. They are not failures.
When an eval fails, read the actual output before you touch the prompt. The eval is wrong more often than the workflow is.
Write down the methodology somewhere your team (and your agents) can find it. The system is only as good as the discipline of using it.

The one-line version: route every check to the cheapest mechanism that can answer it honestly, use cross-model disagreement to find ambiguity rather than to vote, treat judge outages as outages, and always read the real output before trusting a judge that says something is broken.