Why the environment library compounds into a defensible dataset.
Three artifacts a lab buys, all generated by training/ from the judged/graded
trajectories — see the benchmark for the eval, and Compare
for the matched panels these are derived from:
| Artifact | What it is | Count | File |
|---|---|---|---|
| Benchmark / eval | Model × workflow competence, LLM-judge scored on the expert rubric | 11 envs | training/runs/leaderboard.json |
| SFT dataset | Messages-format demonstrations from trajectories at the expert bar (≥4/5) | 20 | training/datasets/sft.jsonl |
| Preference dataset (RLHF/DPO) | chosen vs rejected pairs on matched queries | 26 | training/datasets/preferences.jsonl |
Each environment emits a structured record per graded trajectory. That record is the unit a lab buys: a real-workflow problem, an agent's full reasoning, and an expert's dimension-level scores + verdict. Aggregated, these become (a) evals — does the agent meet the expert bar on this workflow — and (b) reward signal — preference/quality data to RLHF the agent.
{
"run_id": "competitive_landscape__",
"grader": "expert id",
"scores": {
"scope": 4,
"design_deltas": 5,
"competitive_read": 4,
"whitespace": 3,
"faithfulness": 5
},
"verdict": "acceptable",
"comments": {
"whitespace": "missed the 1L checkpoint-combo gap"
}
}
Verdict mix: strong: 1 acceptable: 1 flawed: 1
| Run | Grader | Avg score (1–5) | Verdict |
|---|---|---|---|
ae_causality_20260606_SAMPLE_01 sample | SAMPLE-expert-PV-MD | 4.6 | strong |
competitive_landscape_20260606_SAMPLE_01 sample | SAMPLE-expert-onc-CI | 4.4 | acceptable |
fto_patent_20260606_SAMPLE_01 sample | SAMPLE-expert-IP-atty | 4.6 | flawed |