Continuous improvement from production traces

Use this page when you have an agent in production emitting traces and you want your test set — and your agent — to keep up with how people actually use it. If you haven’t pushed an agent to Veris yet, start at Quickstart; to iterate by hand on a fixed set, see the development loop.

A continuous-improvement loop composes three things Veris already gives you:

  • Scenarios from traces — turn a window of real production traffic into a fresh scenario set.
  • Simulation and grading — run your agent against that set and score it.
  • Reports — get ranked failures plus concrete, machine-readable fix suggestions for your agent’s own files.

Run them on a schedule and you have a loop that keeps discovering new failure modes and proposing fixes. Everything around those features — when it runs, and what you do with the fixes — is yours to decide.

The loop

  1. Rebuild the environment from your agent source.
  2. Generate a fresh scenario set from the last N days of production traces.
  3. Simulate and grade the rebuilt agent against the set.
  4. Report on the run to turn its failures into fix suggestions.
  5. Act on the fixes — however fits your team (see below).

Each cycle’s rebuild (step 1) picks up whatever you applied last cycle, so the loop re-measures its own changes against fresh, trace-grounded scenarios.

rebuild env → scenarios from traces → simulate + grade → report → apply fixes
    ▲                                                                  │
    └────────────────── next run's rebuild re-measures ◄───────────────┘

The Veris commands

Each step is a single veris command — the same regardless of how you schedule them or what you do with the output:

# 1. rebuild the env from your agent source, so a fix you applied takes effect
veris env push --no-snapshot --env-id <ENV_ID>

# 2. generate a set from a trace query: paste the curl for your tracing provider
#    (keys are read from it, never stored); compute a rolling time window in the query
printf '%s' "$TRACE_CURL" | veris scenarios create \
  --from-langfuse - --env-id <ENV_ID> --num 30 --prompt "what to focus on"

# 3. wait for generation, then simulate + grade
#    (auto-selects the set's grader; exits non-zero if the run fails)
veris scenarios status <SET_ID> --watch
veris run --scenario-set-id <SET_ID> --env-id <ENV_ID>

# 4. report the run and export its fix suggestions as JSON
veris reports create <RUN_ID>
veris reports get <REPORT_ID> --format json -o fixes.json

scenarios create prints the scenset_… id, veris run the run_… id, and reports create the rpt_… id — capture each to feed the next command. For building and filtering the trace query, see Generating scenarios from traces.
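
One way to capture those ids in a script is to pull the first matching token out of each command's output. The sketch below is a hypothetical helper, not part of the veris CLI. It assumes each id appears in the output as the prefix, an underscore, then letters, digits, dashes, or underscores; adjust the pattern if your output differs.

```shell
# Hypothetical helper: read CLI output on stdin and print the first id that
# starts with the given prefix (for example scenset, run, or rpt).
# Assumption: ids consist of letters, digits, dashes, and underscores.
extract_id() {
  prefix="$1"
  grep -oE "${prefix}_[A-Za-z0-9_-]+" | head -n 1
}

# Usage sketch (commented out so the block has no side effects):
# SET_ID=$(printf '%s' "$TRACE_CURL" | veris scenarios create \
#   --from-langfuse - --env-id "$ENV_ID" --num 30 --prompt "what to focus on" \
#   | extract_id scenset)
# [ -n "$SET_ID" ] || { echo "no scenario set id found" >&2; exit 1; }
```

Checking that the captured value is non-empty before the next step keeps a failed or reworded command from silently feeding an empty id forward.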

Acting on the report

reports get --format json gives you the report’s fix suggestions. Each is routed to one of your agent’s own files:

  • route: system_prompt, skill, or tool_schema
  • confidence and target_path
  • diff — a unified diff you can apply to your agent source

How much of this you automate is your call — the same suggestions drive anything from a fully hands-off loop to manual curation:

  • Fully automated — apply the diffs and rebuild every cycle with no human in the loop; the next run is the regression check.
  • Reviewed change — open the diffs as a pull or merge request and have someone approve before they land.
  • Manual — read the report and hand-apply only the fixes you agree with.

More automation closes the loop faster; more review catches a fix that helps one case but regresses another. Either way the next run, on a fresh trace-grounded set, re-grades the change — so the loop self-corrects even when it runs fully unattended. Track the headline pass rate across runs: a healthy loop trends up as substantive fixes land.
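
Tracking the pass rate can be as simple as appending one line per run to a history file. The helpers below are a minimal sketch with hypothetical names. They assume you supply the pass rate yourself, since how you read it from a run is not covered here, and that the history lives in a local CSV file.

```shell
# Hypothetical helper: append one run's headline pass rate to a CSV history.
# Arguments: history file, run id, pass rate (for example 0.72).
record_pass_rate() {
  history_file="$1"
  run_id="$2"
  pass_rate="$3"
  # Write the header only when the file does not exist yet.
  [ -f "$history_file" ] || printf 'date,run_id,pass_rate\n' > "$history_file"
  printf '%s,%s,%s\n' "$(date -u +%Y-%m-%d)" "$run_id" "$pass_rate" >> "$history_file"
}

# Hypothetical helper: print the latest pass rate minus the first recorded one.
# Prints nothing when the file holds only the header.
pass_rate_change() {
  awk -F, 'NR == 2 { first = $3 }
           NR > 1  { last = $3 }
           END     { if (NR > 1) printf "%.2f\n", last - first }' "$1"
}
```

A positive change over many cycles is the upward trend described above; a single cycle's movement says little, because each run is graded on a different regenerated set.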

For one example of consuming the report, the cookbook’s crm-analyst-agent/improve/ingest_report.py reads the exported JSON, filters to the agent-fixable routes, runs git apply on each diff against the agent source, and writes a PR body. Copy it or adapt it to your repo.
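
If you would rather stay in the shell, the same filtering can be sketched with jq. This is not the cookbook's implementation. It assumes jq is installed, that fixes.json is a top-level JSON array of suggestion objects, and that the field names match the list above (route, target_path, diff); the function names are hypothetical, and the jq paths need adjusting if your export nests the suggestions differently.

```shell
# Hypothetical helper: print "route<TAB>target_path" for each suggestion whose
# route is one of the three agent-file routes.
# Assumption: the export is a top-level JSON array of suggestion objects.
list_fixes() {
  jq -r '.[]
    | select(.route == "system_prompt" or .route == "skill" or .route == "tool_schema")
    | [.route, .target_path] | @tsv' "$1"
}

# Hypothetical helper: print the diff of the Nth matching suggestion (0-based).
fix_diff() {
  jq -r --argjson n "$2" '[.[]
    | select(.route == "system_prompt" or .route == "skill" or .route == "tool_schema")
    ][$n].diff' "$1"
}

# Usage sketch (commented out so the block has no side effects):
# list_fixes fixes.json
# fix_diff fixes.json 0 | git apply --check -   # dry run: does it apply cleanly?
# fix_diff fixes.json 0 | git apply -           # apply to the working tree
```

Running git apply --check first lets a script skip a diff that no longer matches the source instead of stopping halfway through a batch.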

Running it on a schedule

Everything above is platform-agnostic. To run the loop on a timer you add a scheduler (cron, GitHub Actions, GitLab CI, a cloud scheduler), a secret store for VERIS_API_KEY and your tracing keys (veris login "$VERIS_API_KEY" is the only auth step), and — if you apply fixes through review — your version control. Veris isn’t opinionated about any of them.
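
Whatever scheduler you pick, a scheduled job benefits from failing fast when a secret is missing. The preflight below is a minimal sketch; require_env is a hypothetical helper, and ENV_ID is an assumed variable name standing in for the <ENV_ID> placeholder used earlier.

```shell
# Hypothetical preflight: return non-zero if any named environment variable
# is unset or empty, naming the first one that is missing.
require_env() {
  for name in "$@"; do
    # Indirect lookup of the variable called $name; empty when unset.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required environment variable: $name" >&2
      return 1
    fi
  done
}

# Usage sketch at the top of a scheduled job (commented out, no side effects):
# require_env VERIS_API_KEY TRACE_CURL ENV_ID || exit 1
# veris login "$VERIS_API_KEY"
```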

For a complete, runnable version, see the crm-analyst agent in the Veris cookbook. It wires the loop into one concrete stack (GitHub Actions, git, and a draft-PR review step) as an illustration of the pattern, not a requirement; swap in your own scheduler, version control, and automation level.

What to watch

Scope the query deliberately. Filter to your agent’s gradable turns and a window with real traffic — a quiet window generates few or no scenarios. The filter is where the signal is; see the Langfuse query guide.
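
A rolling window is easiest to keep honest when the start time is computed at run time rather than pasted in. The sketch below prints a UTC timestamp N days in the past; window_start is a hypothetical helper. The query URL placeholder and the fromTimestamp parameter name in the usage comment are assumptions about your tracing provider, so check its query guide before relying on them.

```shell
# Hypothetical helper: print the UTC start of a rolling N-day window
# as an ISO 8601 timestamp.
window_start() {
  days="$1"
  # Try GNU date first; fall back to the BSD/macOS form if -d is unsupported.
  date -u -d "${days} days ago" +%Y-%m-%dT%H:%M:%SZ 2>/dev/null ||
    date -u -v-"${days}"d +%Y-%m-%dT%H:%M:%SZ
}

# Usage sketch (commented out; URL and parameter name are assumptions,
# and your provider's auth options still need to be added to the curl):
# FROM=$(window_start 7)
# TRACE_CURL="curl -s '<YOUR_TRACE_QUERY_URL>?fromTimestamp=${FROM}'"
```

If a cycle generates few or no scenarios, widening the number of days passed to the helper is the first thing to try.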

This is deliberate regeneration — the thing CI/CD gating and the development loop tell you not to do between iterations. Track the score over time, but keep a separate pinned set for hard regression gating; comparing across freshly-regenerated sets is apples-to-oranges. Use this loop to discover what the pinned set is missing.

See also