Training
Veris simulation data and environments can be used directly for model training. The same scenarios, transcripts, and grading results that you use for evaluation become training data — and the simulation environment itself becomes the RL training ground.
Training is currently in beta and is available in the Console under Training.
Why Train with Veris Data?
Agent behavior is shaped by the model behind it. While prompt engineering and tool configuration go a long way, fine-tuning the model on your specific task domain produces agents that are more reliable, more efficient, and better at following your workflows.
Veris provides both ingredients for training:
- Data — Simulation transcripts with grading labels for supervised fine-tuning
- Environment — The simulation sandbox as a live reward environment for reinforcement learning
Supervised Fine-Tuning (SFT)
SFT trains the model on high-quality examples of correct agent behavior.
How It Works
- Run simulations and evaluations across your scenario sets
- Filter for high-scoring transcripts (the agent behaved correctly)
- Convert transcripts into training examples (input/output pairs)
- Fine-tune a base model on these examples
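The filter-and-convert steps above can be sketched as follows. The transcript fields (`score`, `turns`) and the input/output pair format used here are illustrative assumptions, not the exact Veris export schema:

```python
import json

# Hypothetical transcript shape -- the actual Veris export schema may differ.
transcripts = [
    {"score": 0.95, "turns": [
        {"role": "user", "content": "Cancel my order #1234."},
        {"role": "assistant", "content": "Your order #1234 has been cancelled."},
    ]},
    {"score": 0.40, "turns": [
        {"role": "user", "content": "Where is my refund?"},
        {"role": "assistant", "content": "It shipped yesterday."},
    ]},
]

SCORE_THRESHOLD = 0.9  # keep only transcripts where the agent behaved correctly

def to_training_examples(transcripts, threshold=SCORE_THRESHOLD):
    """Turn each assistant turn of a high-scoring transcript into an
    (input, output) pair: the conversation so far -> the agent's reply."""
    examples = []
    for t in transcripts:
        if t["score"] < threshold:
            continue
        history = []
        for turn in t["turns"]:
            if turn["role"] == "assistant":
                examples.append({"input": list(history), "output": turn["content"]})
            history.append(turn)
    return examples

examples = to_training_examples(transcripts)
# One JSON object per line -- the common fine-tuning data format.
jsonl = "\n".join(json.dumps(ex) for ex in examples)
```

Only the first transcript survives the threshold here, producing a single pair whose input is the conversation up to the assistant's turn.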
Supported Base Models
| Provider | Models |
|---|---|
| DeepSeek | DeepSeek-V3, DeepSeek-R1 |
| Qwen | Qwen 2.5 (7B, 14B, 32B, 72B) |
| Llama | Llama 3.1 (8B, 70B), Llama 3.3 70B |
| Mistral | Mistral Large, Mistral Small |
SFT Parameters
| Parameter | Default | Description |
|---|---|---|
| Epochs | 3 | Number of training passes |
| Learning rate | 2e-5 | Step size for optimization |
| Batch size | 4 | Samples per training step |
| Max sequence length | 4096 | Maximum token length |
| Warmup ratio | 0.1 | Fraction of steps for LR warmup |
| LoRA rank | 16 | Rank of LoRA adapters |
| LoRA alpha | 32 | Scaling factor for LoRA |
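The defaults above can be expressed as a plain config for reference. The field names here are illustrative, not the Console's API; note that LoRA alpha defaults to 2x the rank, a common rule of thumb:

```python
# Sketch of the default SFT hyperparameters from the table above.
# Field names are illustrative, not the Console's actual schema.
sft_config = {
    "epochs": 3,
    "learning_rate": 2e-5,
    "batch_size": 4,
    "max_sequence_length": 4096,
    "warmup_ratio": 0.1,
    "lora_rank": 16,
    "lora_alpha": 32,
}

# Effective LoRA adapter scaling is alpha / rank.
effective_scale = sft_config["lora_alpha"] / sft_config["lora_rank"]

# Warmup ratio is a fraction of total optimizer steps; e.g. with
# 1,000 total steps, the learning rate ramps up over the first 100.
warmup_steps = int(sft_config["warmup_ratio"] * 1000)
```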
Reinforcement Learning (GRPO)
GRPO (Group Relative Policy Optimization) uses the Veris simulation environment as a live training ground. The agent interacts with mock services and simulated users, and graders provide reward signals.
How It Works
- The model generates multiple completions for each scenario
- Each completion is executed in the simulation environment
- Graders and assertions score the outcomes
- The model is updated to favor higher-scoring behaviors
This is the same loop as evaluation, but instead of just measuring performance, the results actively improve the model.
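The "group relative" part of GRPO refers to how each completion is scored: its reward is baselined against the other completions generated for the same scenario (group mean, normalized by the group's standard deviation). A minimal sketch of that step:

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO's core idea: each completion's advantage is its reward
    relative to the other completions for the same prompt
    (group-mean baseline, normalized by the group's std deviation)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Num generations = 4 completions for one scenario, scored by graders:
rewards = [0.9, 0.2, 0.7, 0.2]
advantages = group_relative_advantages(rewards)
# Completions above the group mean get positive advantages and are
# reinforced; those below get negative advantages and are discouraged.
```

Because the baseline comes from the group itself, no separate value network is needed, which is what makes this loop practical to run directly against the simulation environment.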
Reward Models
Reward signals come from:
- Grader scores — hallucination, tool execution, communication, procedural correctness
- Assertion pass rates — did the agent achieve the defined success criteria?
- Custom reward models — optional LLM-based reward models for domain-specific scoring
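These signals are ultimately blended into a single scalar reward per completion. A hypothetical combination is sketched below; the weights, grader names, and field shapes are illustrative, not Veris defaults:

```python
# Illustrative weights -- not Veris defaults.
GRADER_WEIGHT = 0.6
ASSERTION_WEIGHT = 0.4

def combined_reward(grader_scores, assertions_passed, assertions_total):
    """Average the per-dimension grader scores (each in [0, 1]) and
    blend with the assertion pass rate."""
    grader_avg = sum(grader_scores.values()) / len(grader_scores)
    pass_rate = assertions_passed / assertions_total
    return GRADER_WEIGHT * grader_avg + ASSERTION_WEIGHT * pass_rate

reward = combined_reward(
    grader_scores={
        "hallucination": 1.0,   # no hallucinations detected
        "tool_execution": 0.75,
        "communication": 0.9,
        "procedural": 0.85,
    },
    assertions_passed=3,
    assertions_total=4,
)
```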
GRPO Parameters
| Parameter | Default | Description |
|---|---|---|
| Epochs | 1 | Number of training passes |
| Learning rate | 5e-7 | Step size (lower than SFT) |
| Batch size | 4 | Samples per training step |
| Num generations | 4 | Completions per prompt |
| Max prompt length | 2048 | Maximum prompt tokens |
| Max completion length | 2048 | Maximum completion tokens |
| Temperature | 0.7 | Sampling temperature |
| Beta | 0.04 | KL penalty coefficient |
| LoRA rank | 16 | Rank of LoRA adapters |
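The Beta parameter controls a KL penalty that keeps the trained policy from drifting too far from the reference model. GRPO-style objectives commonly use the unbiased per-token KL estimator `exp(ref - logp) - (ref - logp) - 1`; a sketch of that term, using the table's default Beta:

```python
import math

def kl_estimate(logp, ref_logp):
    """Per-token KL estimate commonly used in GRPO-style objectives:
    exp(ref - logp) - (ref - logp) - 1, which is always >= 0 and is
    zero exactly when the policy matches the reference."""
    diff = ref_logp - logp
    return math.exp(diff) - diff - 1

BETA = 0.04  # default KL penalty coefficient from the table above
penalty = BETA * kl_estimate(logp=-1.2, ref_logp=-1.0)
```

A larger Beta pulls the model harder toward the reference, trading reward improvement for stability.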
Using the Console
Navigate to Training in the sidebar.
Creating a Training Run
- Click New Training Run
- Select a base model from the supported list
- Choose the training method — SFT or GRPO
- For GRPO, optionally select a reward model
- Configure parameters (or use defaults with Advanced toggle)
- Select the simulation data to train on
- Click Start Training
Monitoring Progress
Active training runs show a progress bar, current epoch, loss metrics, and estimated time remaining. Completed runs show final metrics and a download link for the trained model weights.
From Evaluation to Training
RL path
The simulation environment is the live training ground. Graders and assertions provide reward signals after each completion, and the model is updated via GRPO to favor higher-scoring behaviors. Repeat until convergence.
SFT path
- Run simulations and evaluations across your scenario sets
- Filter for high-scoring transcripts where the agent behaved correctly
- Fine-tune a base model on these examples
- Deploy the improved model and re-evaluate to measure progress