Use Case
Auto Research
Can models participate in the loops that drive scientific and engineering progress?
The Problem
Where the research pipeline breaks down
Generating plausible code is easy. Beating a real baseline under a fixed budget — on a real dataset, with a real grader — is where agents fall apart.
Code that runs cleanly but fails to beat the random baseline on the actual metric
Solutions that overfit one data pool or model and collapse when the seed, source mix, or checkpoint changes
Training runs that silently blow the GPU, memory, or wall-clock budget and return no usable result
How It Works
From plausible code to a metric that actually moves
BakeLens audits each attempt against baseline and reference scores. Proof delivers verified task environments and expert solution traces.
BakeLens audits the auto-research pipeline
Trace every stage: task understanding, plan, code, training run, evaluation, retry
Score each attempt against the task's baseline and reference solution, not just whether the code ran (see the scoring sketch after this list)
Surface where agents waste compute, pick the wrong heuristic, or hit silent budget limits
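A minimal sketch of what attempt-level scoring can look like, assuming a higher-is-better metric. The field names (`baseline_score`, `reference_score`, `normalized_lift`) are illustrative assumptions, not BakeLens's actual schema:

```python
from dataclasses import dataclass

@dataclass
class Attempt:
    """One agent attempt on a graded task (hypothetical schema)."""
    task_id: str
    metric: float            # graded metric from the task's grader
    baseline_score: float    # task baseline (e.g., random selection)
    reference_score: float   # expert reference solution

def normalized_lift(a: Attempt) -> float:
    """0.0 = matches the baseline, 1.0 = matches the expert reference.
    Negative values mean the attempt lost to the baseline."""
    span = a.reference_score - a.baseline_score
    if span <= 0:
        raise ValueError("reference must beat baseline for a valid task")
    return (a.metric - a.baseline_score) / span

attempt = Attempt("data_select_ifeval", metric=0.41,
                  baseline_score=0.33, reference_score=0.52)
print(f"beat baseline: {attempt.metric > attempt.baseline_score}")  # True
print(f"normalized lift: {normalized_lift(attempt):.2f}")           # ~0.42
```

Gating on metric lift rather than exit codes is the difference between "the code ran" and "the metric moved."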
Proof delivers verified research environments
Reproducible Docker task environments with fixed training pipelines, datasets, and graders (sketched below)
Baseline and reference solutions written by working ML researchers, with metric scores attached
Step-by-step expert traces showing how a researcher reasons from task spec to a metric-moving solution
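As an illustration, a pinned task spec can carry everything needed to reproduce and grade a run. Every field name below is a hypothetical assumption, not Proof's actual format:

```python
# Illustrative spec for a verified task environment; names are hypothetical.
TASK_SPEC = {
    "task_id": "data_select_ifeval",
    "image": "tasks/data-select-ifeval:pinned",  # pinned Docker image
    "seed": 1234,                                # fixed training-pipeline seed
    "dataset": {"name": "ifeval_pool", "sha256": "<pinned digest>"},
    "budget": {"gpus": 1, "gpu_hours": 4.0, "max_ram_gb": 64},
    "grader": "grade.py",                        # deterministic, runs in-sandbox
    "baseline_score": 0.33,                      # e.g., random selection
    "reference_score": 0.52,                     # expert reference solution
}

def within_budget(used_gpu_hours: float, used_ram_gb: float) -> bool:
    """Budget check the harness can run before accepting a result,
    so over-budget runs fail loudly instead of returning nothing."""
    b = TASK_SPEC["budget"]
    return used_gpu_hours <= b["gpu_hours"] and used_ram_gb <= b["max_ram_gb"]
```

Pinning the image, seed, and dataset digest is what makes scores comparable across attempts; the explicit budget check turns silent overruns into graded failures.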
What You Get
Deliverables
Verified Task Environments
Sandboxed Docker tasks with fixed pipelines, fixed seeds, hardware budgets, and a graded metric — built like AutoLab's data_select_ifeval
Expert Solution Traces
Reference solutions from ML researchers who beat the baseline, with the reasoning, code diffs, and ablations behind each decision
Auto Research Eval Suite
Held-out tasks that measure agents on real metric improvements — not whether the code compiles, but whether it actually wins
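A sketch of that pass criterion, assuming a hypothetical harness (not the real eval suite) in which each task exposes baseline and reference scores:

```python
def grade_run(ran_cleanly: bool, metric: float,
              baseline: float, reference: float) -> dict:
    """Hypothetical eval verdict: executing cleanly is necessary
    but never sufficient; the attempt must move the graded metric."""
    return {
        "executed": ran_cleanly,
        "beat_baseline": ran_cleanly and metric > baseline,
        "frac_of_reference_lift": (
            (metric - baseline) / (reference - baseline)
            if ran_cleanly else 0.0
        ),
    }

# Clean execution that loses to the baseline still fails the eval.
print(grade_run(True, metric=0.31, baseline=0.33, reference=0.52))
# {'executed': True, 'beat_baseline': False, 'frac_of_reference_lift': -0.105...}
```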
Explore More
Agent Reliability
Agents fail where it matters: planning, tools, ambiguity. Diagnose and fix long-horizon failures before production.
Coding Models
Repo-level coding ≠ solving LeetCode. Expert data for real-world debugging, testing, and integration.
STEM Reasoning
PhD-level reasoning requires proof, not patterns. Verified expert annotations across bio, chem, math, med, physics.
Built for AI Operating Beyond Benchmarks
Diagnosis, evaluation, expert data, and environments for production deployment.