Evaluations
What is an Evaluation?
An evaluation is the process of running graders against completed simulation transcripts. Each grader analyzes the transcript and produces scores and findings for specific categories.
Evaluation Workflow
- Complete a simulation run
- Select a grader to evaluate with
- The grader processes each simulation transcript
- Per-simulation scores and findings are produced
- Results are available in the CLI and Console
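The steps above can be sketched as a shell session using only the commands documented on this page; the run, grader, and evaluation-run IDs are placeholders, so substitute your own:

```shell
# Steps 1–2: starting from a completed simulation run, pick a grader
# (list the graders available to you)
veris eval list

# Step 3: grade each transcript in the completed run (IDs are illustrative)
veris evaluation-runs create --run-id run_abc123 --grader-id grd_xyz789

# Steps 4–5: watch progress until per-simulation scores and findings appear
veris evaluation-runs status EVAL_RUN_ID --run-id run_abc123 --watch
```

Once the evaluation run completes, the same results are also visible in the Console.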
Running an Evaluation
Interactive mode
```shell
veris evaluation-runs create
```
The CLI prompts you to select a completed run and a grader.
With flags
```shell
veris evaluation-runs create \
  --run-id run_abc123 \
  --grader-id grd_xyz789
```
What Graders Check
Graders analyze transcripts for specific failure modes:
| Category | What It Checks |
|---|---|
| Hallucination | Did the agent fabricate information not present in the data? Did it overpromise or invent details? |
| Tool execution | Did the agent call the right APIs with correct parameters? Did it verify the results? |
| Communication | Was the agent clear, polite, and helpful? Did it ask for clarification when needed? |
| Procedural correctness | Did the agent follow the correct sequence of operations? Did it skip steps? |
| Objective completion | Were the user’s stated goals achieved? Did the agent address all requests? |
Viewing Results
From the CLI
```shell
# List evaluation runs for a completed run
veris evaluation-runs list --run-id RUN_ID

# Check evaluation status (polls every 5 seconds)
veris evaluation-runs status EVAL_RUN_ID --run-id RUN_ID --watch
```
From the Console
Navigate to Evaluations to see evaluation runs grouped by their parent simulation run. Click an evaluation run to view per-simulation scores and detailed findings.
You can run multiple graders against the same simulation run. This is useful for evaluating different aspects of agent behavior — e.g., one grader for hallucination detection and another for communication quality.
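For example, you might grade one run twice with different graders; the second grader ID below is hypothetical, so discover real IDs with `veris eval list` first:

```shell
# Grade the same completed run with two graders, one evaluation run each
veris evaluation-runs create --run-id run_abc123 --grader-id grd_xyz789   # e.g. hallucination detection
veris evaluation-runs create --run-id run_abc123 --grader-id grd_comm456  # e.g. communication quality (hypothetical ID)

# Both evaluation runs are listed under the same parent simulation run
veris evaluation-runs list --run-id run_abc123
```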
From Evaluation to Report
Evaluation results give you per-simulation scores. To identify patterns across all simulations and get actionable recommendations, generate a report:
```shell
veris reports create --eval-run-id EVAL_RUN_ID
```
Using the Console
Navigate to Evaluations to see evaluation runs grouped by their parent simulation run. Each evaluation shows the grader used, status, progress bar, and creation date.
- Click Evaluate on a completed run to trigger grading
- Select the run — the grader is auto-resolved from the run’s scenario set
- Click an evaluation run to see per-simulation scores and detailed findings
CLI Commands
```shell
# Run graders against simulations
veris evaluation-runs create [--run-id ID] [--grader-id ID]

# List evaluation runs
veris evaluation-runs list --run-id RUN_ID

# Check evaluation status
veris evaluation-runs status EVAL_RUN_ID --run-id RUN_ID [--watch]

# List available graders
veris eval list [--env-id ID]
```