
Evaluations

What is an Evaluation?

An evaluation is the process of running graders against completed simulation transcripts. Each grader analyzes the transcript and produces scores and findings for specific categories.

Evaluation Workflow

  1. Complete a simulation run
  2. Select a grader to evaluate with
  3. The grader processes each simulation transcript
  4. Per-simulation scores and findings are produced
  5. Results are available in the CLI and Console
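
The workflow above can be sketched as a small data model: a grader is applied to every transcript in a run, producing per-simulation findings. This is an illustrative sketch only, not the Veris API; the `Finding`, `EvaluationRun`, and `evaluate` names are assumptions made for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Finding:
    category: str     # e.g. "hallucination", "communication"
    score: float      # assumed convention: 0.0-1.0, higher is better
    note: str = ""

@dataclass
class EvaluationRun:
    run_id: str
    grader_id: str
    # sim_id -> findings for that simulation's transcript
    results: dict[str, list[Finding]] = field(default_factory=dict)

def evaluate(run_id: str, grader_id: str,
             transcripts: dict[str, str], grade) -> EvaluationRun:
    """Apply one grader (the `grade` callable) to every transcript in a run."""
    ev = EvaluationRun(run_id, grader_id)
    for sim_id, transcript in transcripts.items():
        ev.results[sim_id] = grade(transcript)  # per-simulation findings
    return ev
```

The key point is the shape of the output: one set of scores and findings per simulation transcript, which is what the CLI and Console surface in later steps.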

Running an Evaluation

Interactive mode

veris evaluation-runs create

The CLI prompts you to select a completed run and a grader.

With flags

veris evaluation-runs create \
  --run-id run_abc123 \
  --grader-id grd_xyz789

What Graders Check

Graders analyze transcripts for specific failure modes:

  • Hallucination: Did the agent fabricate information not present in the data? Did it overpromise or invent details?
  • Tool execution: Did the agent call the right APIs with correct parameters? Did it verify the results?
  • Communication: Was the agent clear, polite, and helpful? Did it ask for clarification when needed?
  • Procedural correctness: Did the agent follow the correct sequence of operations? Did it skip steps?
  • Objective completion: Were the user’s stated goals achieved? Did the agent address all requests?
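
To make one of these categories concrete, here is a deliberately simplified check. A real grader is typically far more sophisticated (often an LLM-based judge reading the full transcript); this toy keyword heuristic only illustrates the idea of scoring a transcript against a category, and the function name and scoring convention are assumptions for the example.

```python
import re

# Category names follow the table above.
CATEGORIES = [
    "hallucination",
    "tool_execution",
    "communication",
    "procedural_correctness",
    "objective_completion",
]

def check_objective_completion(transcript: str, goals: list[str]) -> float:
    """Toy check: fraction of stated goals reported as finished in the transcript."""
    done = sum(
        1 for g in goals
        if re.search(rf"\b{re.escape(g)}\b.*\b(done|completed|booked)\b",
                     transcript, re.IGNORECASE)
    )
    return done / len(goals) if goals else 1.0
```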

Viewing Results

From the CLI

# List evaluation runs for a completed run
veris evaluation-runs list --run-id RUN_ID

# Check evaluation status (polls every 5 seconds)
veris evaluation-runs status EVAL_RUN_ID --run-id RUN_ID --watch
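
The `--watch` flag follows the common poll-until-terminal pattern. The sketch below shows that pattern in isolation; `fetch_status` stands in for whatever call returns the evaluation's current status, and the terminal state names are assumptions, not the CLI's actual values.

```python
import time

def watch_status(fetch_status, interval: float = 5.0, max_polls: int = 120):
    """Poll until the evaluation reaches a terminal state, then return it."""
    terminal = {"completed", "failed"}  # assumed terminal states
    for _ in range(max_polls):
        status = fetch_status()
        if status in terminal:
            return status
        time.sleep(interval)  # the CLI polls every 5 seconds
    raise TimeoutError("evaluation did not finish within the polling budget")
```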

From the Console

Navigate to Evaluations to see evaluation runs grouped by their parent simulation run. Click an evaluation run to view per-simulation scores and detailed findings.

You can run multiple graders against the same simulation run. This is useful for evaluating different aspects of agent behavior — e.g., one grader for hallucination detection and another for communication quality.
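
Because each grader covers different categories, their per-simulation results compose naturally. A minimal sketch of that merge, assuming each evaluation is a `sim_id -> {category: score}` mapping (an assumed shape, not the actual API response):

```python
def merge_evaluations(evals: list[dict[str, dict[str, float]]]) -> dict[str, dict[str, float]]:
    """Combine per-simulation category scores from several graders.

    Later graders add categories for a simulation but do not
    overwrite scores already set by an earlier grader.
    """
    merged: dict[str, dict[str, float]] = {}
    for ev in evals:
        for sim_id, scores in ev.items():
            merged.setdefault(sim_id, {})
            for category, score in scores.items():
                merged[sim_id].setdefault(category, score)
    return merged
```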

From Evaluation to Report

Evaluation results give you per-simulation scores. To identify patterns across all simulations and get actionable recommendations, generate a report:

veris reports create --eval-run-id EVAL_RUN_ID
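
The step from per-simulation scores to a report is essentially aggregation across simulations. This is a minimal sketch of that idea, not the actual report pipeline; it just averages each category over all simulations in a run so weak categories stand out.

```python
from statistics import mean

def summarize(per_sim_scores: dict[str, dict[str, float]]) -> dict[str, float]:
    """Average each grading category across all simulations in a run."""
    by_category: dict[str, list[float]] = {}
    for scores in per_sim_scores.values():
        for category, score in scores.items():
            by_category.setdefault(category, []).append(score)
    return {category: mean(vals) for category, vals in by_category.items()}
```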

Using the Console

Navigate to Evaluations to see evaluation runs grouped by their parent simulation run. Each evaluation shows the grader used, status, progress bar, and creation date.

  • Click Evaluate on a completed run to trigger grading
  • Select the run — the grader is auto-resolved from the run’s scenario set
  • Click an evaluation run to see per-simulation scores and detailed findings

CLI Commands

# Run graders against simulations
veris evaluation-runs create [--run-id ID] [--grader-id ID]

# List evaluation runs
veris evaluation-runs list --run-id RUN_ID

# Check evaluation status
veris evaluation-runs status EVAL_RUN_ID --run-id RUN_ID [--watch]

# List available graders
veris eval list [--env-id ID]