← home

The data flywheel

Why the environment library compounds into a defensible dataset.

11
environments
29
trajectories generated
20
SFT examples
26
preference pairs (RLHF)

Training-data products built from graded trajectories

Three artifacts a lab buys, all generated by training/ from the judged/graded trajectories — see the benchmark for the eval, and Compare for the matched panels these are derived from:

ArtifactWhat it isCountFile
Benchmark / evalModel × workflow competence, LLM-judge scored on the expert rubric 11 envstraining/runs/leaderboard.json
SFT datasetMessages-format demonstrations from trajectories at the expert bar (≥4/5) 20training/datasets/sft.jsonl
Preference dataset (RLHF/DPO)chosen vs rejected pairs on matched queries 26training/datasets/preferences.jsonl

One labeled unit = one graded trajectory

Each environment emits a structured record per graded trajectory. That record is the unit a lab buys: a real-workflow problem, an agent's full reasoning, and an expert's dimension-level scores + verdict. Aggregated, these become (a) evals — does the agent meet the expert bar on this workflow — and (b) reward signal — preference/quality data to RLHF the agent.

{
  "run_id": "competitive_landscape__",
  "grader": "expert id",
  "scores": {
    "scope": 4,
    "design_deltas": 5,
    "competitive_read": 4,
    "whitespace": 3,
    "faithfulness": 5
  },
  "verdict": "acceptable",
  "comments": {
    "whitespace": "missed the 1L checkpoint-combo gap"
  }
}

Collected grades

Verdict mix: strong: 1 acceptable: 1 flawed: 1

RunGraderAvg score (1–5)Verdict
ae_causality_20260606_SAMPLE_01 sampleSAMPLE-expert-PV-MD4.6strong
competitive_landscape_20260606_SAMPLE_01 sampleSAMPLE-expert-onc-CI4.4acceptable
fto_patent_20260606_SAMPLE_01 sampleSAMPLE-expert-IP-atty4.6flawed
Samples are illustrative of the export schema. Real grades from vetted experts replace them and seed the proprietary corpus — the moat a pure labor marketplace lacks.