The data flywheel

Why the environment library compounds into a defensible dataset.

environments

trajectories generated

SFT examples

preference pairs (RLHF)

Training-data products built from graded trajectories

Three artifacts a lab buys, all generated by training/ from the judged/graded trajectories — see the benchmark for the eval, and Compare for the matched panels these are derived from:

Artifact	What it is	Count	File
Benchmark / eval	Model × workflow competence, LLM-judge scored on the expert rubric	11 envs	`training/runs/leaderboard.json`
SFT dataset	Messages-format demonstrations from trajectories at the expert bar (≥4/5)	20	`training/datasets/sft.jsonl`
Preference dataset (RLHF/DPO)	chosen vs rejected pairs on matched queries	26	`training/datasets/preferences.jsonl`

One labeled unit = one graded trajectory

Each environment emits a structured record per graded trajectory. That record is the unit a lab buys: a real-workflow problem, an agent's full reasoning, and an expert's dimension-level scores + verdict. Aggregated, these become (a) evals — does the agent meet the expert bar on this workflow — and (b) reward signal — preference/quality data to RLHF the agent.

{
  "run_id": "competitive_landscape__",
  "grader": "expert id",
  "scores": {
    "scope": 4,
    "design_deltas": 5,
    "competitive_read": 4,
    "whitespace": 3,
    "faithfulness": 5
  },
  "verdict": "acceptable",
  "comments": {
    "whitespace": "missed the 1L checkpoint-combo gap"
  }
}

Collected grades

Verdict mix: strong: 1 acceptable: 1 flawed: 1

Run	Grader	Avg score (1–5)	Verdict
`ae_causality_20260606_SAMPLE_01` sample	SAMPLE-expert-PV-MD	4.6	strong
`competitive_landscape_20260606_SAMPLE_01` sample	SAMPLE-expert-onc-CI	4.4	acceptable
`fto_patent_20260606_SAMPLE_01` sample	SAMPLE-expert-IP-atty	4.6	flawed

Samples are illustrative of the export schema. Real grades from vetted experts replace them and seed the proprietary corpus — the moat a pure labor marketplace lacks.