Continuous improvement from production traces
Use this page when you have an agent in production emitting traces and you want your test set — and your agent — to keep up with how people actually use it. If you haven’t pushed an agent to Veris yet, start at Quickstart; to iterate by hand on a fixed set, see the development loop.
A continuous-improvement loop composes three things Veris already gives you:
- Scenarios from traces — turn a window of real production traffic into a fresh scenario set.
- Simulation and grading — run your agent against that set and score it.
- Reports — get ranked failures plus concrete, machine-readable fix suggestions for your agent’s own files.
Run them on a schedule and you have a loop that keeps discovering new failure modes and proposing fixes. Everything around those features — when it runs, and what you do with the fixes — is yours to decide.
The loop
1. Rebuild the environment from your agent source.
2. Generate a fresh scenario set from the last N days of production traces.
3. Simulate and grade the rebuilt agent against the set.
4. Report on the run to turn its failures into fix suggestions.
5. Act on the fixes in whatever way fits your team (see below).
Each cycle’s rebuild (step 1) picks up whatever you applied last cycle, so the loop re-measures its own changes against fresh, trace-grounded scenarios.
rebuild env → scenarios from traces → simulate + grade → report → apply fixes
     ▲                                                                 │
     └────────────────── next run's rebuild re-measures ◄──────────────┘
The Veris commands
Each step is a single veris command — the same regardless of how you schedule them or what you do with the output:
# 1. rebuild the env from your agent source, so a fix you applied takes effect
veris env push --no-snapshot --env-id <ENV_ID>
# 2. generate a set from a trace query — paste the curl for your tracing provider
# (keys are read from it, never stored); compute a rolling time window in the query
printf '%s' "$TRACE_CURL" | veris scenarios create \
--from-langfuse - --env-id <ENV_ID> --num 30 --prompt "what to focus on"
# 3. wait for generation, then simulate + grade
# (auto-selects the set's grader; exits non-zero if the run fails)
veris scenarios status <SET_ID> --watch
veris run --scenario-set-id <SET_ID> --env-id <ENV_ID>
# 4. report the run and export its fix suggestions as JSON
veris reports create <RUN_ID>
veris reports get <REPORT_ID> --format json -o fixes.json
scenarios create prints the scenset_… id, veris run the run_… id, and reports create the rpt_… id. Capture each to feed the next command. For building and filtering the trace query, see Generating scenarios from traces.
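One way to capture each id and feed it to the next command is sketched below. It assumes each id appears somewhere in the command's stdout as its prefix followed by letters, digits, `_`, or `-`; the exact output format is not documented here, so adjust the pattern to what your CLI version prints. `extract_id` and `run_cycle` are hypothetical helper names, not part of Veris.

```shell
#!/bin/sh
# Sketch only: chains the four steps by scraping each id from stdout.
# Assumption: ids look like "<prefix>_<letters, digits, _ or ->".

# extract_id PREFIX: print the first token on stdin that starts with "PREFIX_".
extract_id() {
  grep -Eo "${1}_[A-Za-z0-9_-]+" | head -n 1
}

# run_cycle ENV_ID: one pass of the loop. Expects TRACE_CURL in the environment.
run_cycle() {
  env_id=$1

  # 1. rebuild so last cycle's fixes take effect
  veris env push --no-snapshot --env-id "$env_id" || return 1

  # 2. generate the set and capture its scenset_ id
  set_id=$(printf '%s' "$TRACE_CURL" | veris scenarios create \
    --from-langfuse - --env-id "$env_id" --num 30 --prompt "what to focus on" \
    | extract_id scenset)
  [ -n "$set_id" ] || { echo "no scenario set id found" >&2; return 1; }

  # 3. wait for generation, then simulate + grade and capture the run_ id
  veris scenarios status "$set_id" --watch || return 1
  run_out=$(veris run --scenario-set-id "$set_id" --env-id "$env_id") || return 1
  run_id=$(printf '%s\n' "$run_out" | extract_id run)
  [ -n "$run_id" ] || { echo "no run id found" >&2; return 1; }

  # 4. report and export the fix suggestions
  report_id=$(veris reports create "$run_id" | extract_id rpt)
  [ -n "$report_id" ] || { echo "no report id found" >&2; return 1; }
  veris reports get "$report_id" --format json -o fixes.json
}
```

A scheduled job would call `run_cycle <ENV_ID>` after logging in. Stopping at the first empty id keeps a failed step from passing a blank argument to the next one.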
Acting on the report
reports get --format json gives you the report’s fix suggestions. Each is routed to one of your agent’s own files:
- route: system_prompt, skill, or tool_schema
- confidence and target_path
- diff: a unified diff you can apply to your agent source
How much of this you automate is your call — the same suggestions drive anything from a fully hands-off loop to manual curation:
- Fully automated — apply the diffs and rebuild every cycle with no human in the loop; the next run is the regression check.
- Reviewed change — open the diffs as a pull or merge request and have someone approve before they land.
- Manual — read the report and hand-apply only the fixes you agree with.
More automation closes the loop faster; more review catches a fix that helps one case but regresses another. Either way the next run, on a fresh trace-grounded set, re-grades the change — so the loop self-corrects even when it runs fully unattended. Track the headline pass rate across runs: a healthy loop trends up as substantive fixes land.
For one example of consuming the report, the cookbook’s crm-analyst-agent/improve/ingest_report.py reads the exported JSON, filters to the agent-fixable routes, runs git apply on each diff against the agent source, and writes a PR body. Copy it or adapt it to your repo.
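If you would rather apply diffs by hand or from your own script, the sketch below shows one cautious way to apply a single suggestion. It assumes you have already saved that suggestion's diff field to a file; extracting it from fixes.json is left to the cookbook script, since the exact JSON layout is not described here. `apply_fix` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: apply one unified diff to the working tree, skipping it if it
# does not apply cleanly. Run from the root of the agent source.

# apply_fix DIFF_FILE: dry-run with --check first, then apply for real.
apply_fix() {
  diff_file=$1
  if git apply --check "$diff_file" 2>/dev/null; then
    git apply "$diff_file"
  else
    echo "skipped (does not apply cleanly): $diff_file" >&2
    return 1
  fi
}
```

The `--check` pass means a stale or conflicting diff leaves the tree untouched, which matters most in the fully automated mode where nobody reviews the result before the next rebuild.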
Running it on a schedule
Everything above is platform-agnostic. To run the loop on a timer you add a scheduler (cron, GitHub Actions, GitLab CI, a cloud scheduler), a secret store for VERIS_API_KEY and your tracing keys (veris login "$VERIS_API_KEY" is the only auth step), and — if you apply fixes through review — your version control. Veris isn’t opinionated about any of them.
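As an illustration, a scheduled entry point might look like the sketch below. It assumes your secret store exposes VERIS_API_KEY and TRACE_CURL as environment variables; `require_env` and `scheduled_entry` are hypothetical names, and the crontab path is a placeholder.

```shell
#!/bin/sh
# Sketch only: fail fast on missing secrets, log in, then run the loop.

# require_env NAME...: fail if any named variable is unset or empty.
require_env() {
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required secret: $name" >&2
      return 1
    fi
  done
}

# scheduled_entry: what the cron job or CI step calls.
scheduled_entry() {
  require_env VERIS_API_KEY TRACE_CURL || return 1
  veris login "$VERIS_API_KEY" || return 1
  # ...then the four Veris commands from the section above.
}

# Example crontab line (placeholder path), every Monday at 06:00:
# 0 6 * * 1 /path/to/improve.sh >> /path/to/veris-loop.log 2>&1
```

Checking the secrets before logging in gives a clear error in the scheduler's log instead of an authentication failure partway through a cycle.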
For a complete, runnable version, see the crm-analyst agent in the Veris cookbook. It wires the loop into one concrete stack (GitHub Actions, git, and a draft-PR review step) as an illustration of the pattern, not a requirement; swap in your own scheduler, version control, and automation level.
What to watch
Scope the query deliberately. Filter to your agent’s gradable turns and a window with real traffic — a quiet window generates few or no scenarios. The filter is where the signal is; see the Langfuse query guide.
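A rolling window can be computed at run time rather than hard-coded. The sketch below builds the TRACE_CURL string for the last N days. The endpoint path and the fromTimestamp/toTimestamp parameter names are assumptions based on Langfuse's public traces API, and the host default and key variable names are placeholders; confirm all of them against the Langfuse query guide before relying on this. `iso_days_ago` and `build_trace_curl` are hypothetical helper names.

```shell
#!/bin/sh
# Sketch only: build a curl command string covering the last N days.

# iso_days_ago N: UTC timestamp N days before now, ISO 8601.
iso_days_ago() {
  # Try GNU date first, then fall back to the BSD/macOS form.
  date -u -d "$1 days ago" +%Y-%m-%dT%H:%M:%SZ 2>/dev/null ||
    date -u -v-"$1"d +%Y-%m-%dT%H:%M:%SZ
}

# build_trace_curl DAYS: print the curl command as text (it is not executed here).
# Assumed endpoint and parameter names; adjust to your tracing provider.
build_trace_curl() {
  from=$(iso_days_ago "$1")
  to=$(date -u +%Y-%m-%dT%H:%M:%SZ)
  printf 'curl -s -u "%s:%s" "%s/api/public/traces?fromTimestamp=%s&toTimestamp=%s"' \
    "$LANGFUSE_PUBLIC_KEY" "$LANGFUSE_SECRET_KEY" \
    "${LANGFUSE_HOST:-https://cloud.langfuse.com}" "$from" "$to"
}
```

For example, `TRACE_CURL=$(build_trace_curl 7)` yields a seven-day window ending now. Any filters for your agent's gradable turns would be added to the same query string.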
This is deliberate regeneration — the thing CI/CD gating and the development loop tell you not to do between iterations. Track the score over time, but keep a separate pinned set for hard regression gating; comparing across freshly-regenerated sets is apples-to-oranges. Use this loop to discover what the pinned set is missing.
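To track the score over time, a small append-only log is often enough. The sketch below assumes you obtain each run's pass rate yourself (how to read it from the run or report output is not covered here) and pass it in as a decimal. `log_pass_rate` and `pass_rate_delta` are hypothetical helper names.

```shell
#!/bin/sh
# Sketch only: keep a CSV of pass rates and show the change since the last run.

# log_pass_rate FILE RUN_ID RATE: append one "date,run_id,rate" row.
log_pass_rate() {
  printf '%s,%s,%s\n' "$(date -u +%Y-%m-%d)" "$2" "$3" >> "$1"
}

# pass_rate_delta FILE: print the latest rate minus the previous one.
# Fails (exit 1, no output) when the file has fewer than two rows.
pass_rate_delta() {
  awk -F, 'NR > 1 { prev = cur } { cur = $3 }
           END { if (NR < 2) exit 1; printf "%.2f\n", cur - prev }' "$1"
}
```

Because each cycle grades a freshly generated set, treat the delta as a trend signal only and keep hard pass/fail gating on the pinned set.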
See also
- Generating scenarios from traces — build and filter the trace query
- Development loop — iterate by hand on a fixed set
- CI/CD regression gating — gate PRs on a pinned set
- CI/CD configuration — auth and example configs for GitHub Actions, GitLab CI, and others
- CLI commands
- crm-analyst cookbook example — the complete runnable loop