Research
Benchmarks & Evaluation Hub
Open evaluation datasets and benchmarks from Bake AI research. Used by frontier labs to evaluate coding agents, tool-use agents, STEM reasoning, and more.
KodCode
Coding
A diverse, challenging, and verifiable synthetic dataset for training and evaluating coding agents. Covers repo-level tasks beyond single-function benchmarks.
TOUCAN
Tool-Use Agents
1.5M tool-agentic data points synthesized from real-world MCP environments. Tests multi-turn tool interactions in realistic settings.
ChemOrch
STEM / Chemistry
A chemical intelligence benchmark that equips LLMs with synthetic instructions for chemistry reasoning. Tests multi-step scientific reasoning chains.
PersonaMem
Personalization
A benchmark for dynamic user profiling and personalized responses at scale. Measures how well agents learn and adapt to individual user preferences.
VisualSphinx
Multimodal
Large-scale synthetic vision logic puzzles for reinforcement learning. Tests multimodal reasoning where visual understanding and logic intersect.
These benchmarks are part of our ongoing research. See all publications and open-source contributions.
View All Research
Need custom evaluation?
We build domain-specific evaluation frameworks tailored to your agent's deployment context, from task design to failure taxonomy to regression suites.
Talk to Us