Training
Veris simulation data and environments can be used directly for model training. The same scenarios, transcripts, and grading results that you use for evaluation become training data — and the simulation environment itself becomes the RL training ground.
Training is currently in beta and is available in the Console under Training.
Why Train with Veris Data?
Agent behavior is shaped by the model behind it. While prompt engineering and tool configuration go a long way, fine-tuning the model on your specific task domain produces agents that are more reliable, more efficient, and better at following your workflows.
Veris provides both ingredients for training:
- Data — Simulation transcripts with grading labels for supervised fine-tuning
- Environment — The simulation sandbox as a live reward environment for reinforcement learning
Supervised Fine-Tuning (SFT)
SFT trains the model on high-quality examples of correct agent behavior.
How It Works
- Run simulations and evaluations across your scenario sets
- Filter for high-scoring transcripts (the agent behaved correctly)
- Convert transcripts into training examples (input/output pairs)
- Fine-tune a base model on these examples
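The filter-and-convert steps above can be sketched as follows. The transcript fields (`score`, `turns`) and the input/output pair format used here are illustrative assumptions, not the exact Veris export schema:

```python
import json

# Hypothetical transcript shape -- the actual Veris export schema may differ.
transcripts = [
    {"score": 0.95, "turns": [
        {"role": "user", "content": "Cancel my order #1234."},
        {"role": "assistant", "content": "Your order #1234 has been cancelled."},
    ]},
    {"score": 0.40, "turns": [
        {"role": "user", "content": "Where is my refund?"},
        {"role": "assistant", "content": "It shipped yesterday."},
    ]},
]

SCORE_THRESHOLD = 0.9  # keep only transcripts where the agent behaved correctly

def to_training_examples(transcripts, threshold=SCORE_THRESHOLD):
    """Turn each assistant turn of a high-scoring transcript into an
    (input, output) pair: the conversation so far -> the agent's reply."""
    examples = []
    for t in transcripts:
        if t["score"] < threshold:
            continue
        history = []
        for turn in t["turns"]:
            if turn["role"] == "assistant":
                examples.append({"input": list(history), "output": turn["content"]})
            history.append(turn)
    return examples

examples = to_training_examples(transcripts)
# One JSON object per line -- the common fine-tuning data format.
jsonl = "\n".join(json.dumps(ex) for ex in examples)
```

Only the first transcript survives the threshold here, producing a single pair whose input is the conversation up to the assistant's turn.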
Supported Base Models
| Provider | Models |
|---|---|
| DeepSeek | DeepSeek-V3, DeepSeek-R1 |
| Qwen | Qwen 2.5 (7B, 14B, 32B, 72B) |
| Llama | Llama 3.1 (8B, 70B), Llama 3.3 70B |
| Mistral | Mistral Large, Mistral Small |
SFT Parameters
| Parameter | Default | Description |
|---|---|---|
| Epochs | 3 | Number of training passes |
| Learning rate | 2e-5 | Step size for optimization |
| Batch size | 4 | Samples per training step |
| Max sequence length | 4096 | Maximum token length |
| Warmup ratio | 0.1 | Fraction of steps for LR warmup |
| LoRA rank | 16 | Rank of LoRA adapters |
| LoRA alpha | 32 | Scaling factor for LoRA |
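The defaults above can be expressed as a plain config for reference. The field names here are illustrative, not the Console's API; note that LoRA alpha defaults to 2x the rank, a common rule of thumb:

```python
# Sketch of the default SFT hyperparameters from the table above.
# Field names are illustrative, not the Console's actual schema.
sft_config = {
    "epochs": 3,
    "learning_rate": 2e-5,
    "batch_size": 4,
    "max_sequence_length": 4096,
    "warmup_ratio": 0.1,
    "lora_rank": 16,
    "lora_alpha": 32,
}

# Effective LoRA adapter scaling is alpha / rank.
effective_scale = sft_config["lora_alpha"] / sft_config["lora_rank"]

# Warmup ratio is a fraction of total optimizer steps; e.g. with
# 1,000 total steps, the learning rate ramps up over the first 100.
warmup_steps = int(sft_config["warmup_ratio"] * 1000)
```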
Reinforcement Learning (GRPO)
GRPO (Group Relative Policy Optimization) uses the Veris simulation environment as a live training ground. The agent interacts with mock services and simulated users, and graders provide reward signals.
How It Works
- The model generates multiple completions for each scenario
- Each completion is executed in the simulation environment
- Graders and assertions score the outcomes
- The model is updated to favor higher-scoring behaviors
This is the same loop as evaluation, but instead of just measuring performance, the results actively improve the model.
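The "group relative" part of GRPO refers to how each completion is scored: its reward is baselined against the other completions generated for the same scenario (group mean, normalized by the group's standard deviation). A minimal sketch of that step:

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO's core idea: each completion's advantage is its reward
    relative to the other completions for the same prompt
    (group-mean baseline, normalized by the group's std deviation)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Num generations = 4 completions for one scenario, scored by graders:
rewards = [0.9, 0.2, 0.7, 0.2]
advantages = group_relative_advantages(rewards)
# Completions above the group mean get positive advantages and are
# reinforced; those below get negative advantages and are discouraged.
```

Because the baseline comes from the group itself, no separate value network is needed, which is what makes this loop practical to run directly against the simulation environment.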
Reward Models
Reward signals come from:
- Grader scores — hallucination, tool execution, communication, procedural correctness
- Assertion pass rates — did the agent achieve the defined success criteria?
- Custom reward models — optional LLM-based reward models for domain-specific scoring
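These signals are ultimately blended into a single scalar reward per completion. A hypothetical combination is sketched below; the weights, grader names, and field shapes are illustrative, not Veris defaults:

```python
# Illustrative weights -- not Veris defaults.
GRADER_WEIGHT = 0.6
ASSERTION_WEIGHT = 0.4

def combined_reward(grader_scores, assertions_passed, assertions_total):
    """Average the per-dimension grader scores (each in [0, 1]) and
    blend with the assertion pass rate."""
    grader_avg = sum(grader_scores.values()) / len(grader_scores)
    pass_rate = assertions_passed / assertions_total
    return GRADER_WEIGHT * grader_avg + ASSERTION_WEIGHT * pass_rate

reward = combined_reward(
    grader_scores={
        "hallucination": 1.0,   # no hallucinations detected
        "tool_execution": 0.75,
        "communication": 0.9,
        "procedural": 0.85,
    },
    assertions_passed=3,
    assertions_total=4,
)
```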
GRPO Parameters
| Parameter | Default | Description |
|---|---|---|
| Epochs | 1 | Number of training passes |
| Learning rate | 5e-7 | Step size (lower than SFT) |
| Batch size | 4 | Samples per training step |
| Num generations | 4 | Completions per prompt |
| Max prompt length | 2048 | Maximum prompt tokens |
| Max completion length | 2048 | Maximum completion tokens |
| Temperature | 0.7 | Sampling temperature |
| Beta | 0.04 | KL penalty coefficient |
| LoRA rank | 16 | Rank of LoRA adapters |
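The Beta parameter controls a KL penalty that keeps the trained policy from drifting too far from the reference model. GRPO-style objectives commonly use the unbiased per-token KL estimator `exp(ref - logp) - (ref - logp) - 1`; a sketch of that term, using the table's default Beta:

```python
import math

def kl_estimate(logp, ref_logp):
    """Per-token KL estimate commonly used in GRPO-style objectives:
    exp(ref - logp) - (ref - logp) - 1, which is always >= 0 and is
    zero exactly when the policy matches the reference."""
    diff = ref_logp - logp
    return math.exp(diff) - diff - 1

BETA = 0.04  # default KL penalty coefficient from the table above
penalty = BETA * kl_estimate(logp=-1.2, ref_logp=-1.0)
```

A larger Beta pulls the model harder toward the reference, trading reward improvement for stability.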
Using the Console
Navigate to Training in the sidebar.
Creating a Training Run
- Click New Training Run
- Select a base model from the supported list
- Choose the training method — SFT or GRPO
- For GRPO, optionally select a reward model
- Configure parameters (or use defaults with Advanced toggle)
- Select the simulation data to train on
- Click Start Training
Monitoring Progress
Active training runs show a progress bar, current epoch, loss metrics, and estimated time remaining. Completed runs show final metrics and a download link for the trained model weights.
From Evaluation to Training
RL path
The simulation environment is the live training ground. Graders and assertions provide reward signals after each completion, and the model is updated via GRPO to favor higher-scoring behaviors. Repeat until convergence.
SFT path
- Run simulations and evaluations across your scenario sets
- Filter for high-scoring transcripts where the agent behaved correctly
- Fine-tune a base model on these examples
- Deploy the improved model and re-evaluate to measure progress