Benchmarks & Evaluation Hub

Open evaluation datasets and benchmarks from Bake AI research. Frontier labs use them to evaluate coding agents, tool-use agents, STEM reasoning, and more.

KodCode

Coding

A diverse, challenging, and verifiable synthetic dataset for training and evaluating coding agents. It covers repo-level tasks that go beyond single-function benchmarks.

Key Metrics

Pass@1, Execution Accuracy
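
As a rough illustration of how a Pass@1 score can be computed, the sketch below uses the commonly cited unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n is the number of sampled solutions for a task and c is how many of them pass the tests. Whether KodCode's evaluation harness computes the metric exactly this way is an assumption, and the function names `pass_at_k` and `mean_pass_at_k` are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int = 1) -> float:
    """Estimate pass@k for one task from n samples, c of which passed."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # If fewer than k samples failed, every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset contains at least one pass.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int = 1) -> float:
    """Average the per-task estimates over (n, c) pairs, one pair per task."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical outcomes for three tasks: (samples drawn, samples passing).
    outcomes = [(10, 3), (4, 0), (5, 5)]
    print(f"pass@1 = {mean_pass_at_k(outcomes, k=1):.3f}")
```

For k = 1 the estimator reduces to c / n per task, so the reported score is the average fraction of passing samples across tasks.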

TOUCAN

Tool-Use Agents

1.5M tool-agentic data points synthesized from real-world MCP environments. It tests multi-turn tool interactions in realistic settings.

Key Metrics

Tool Call Accuracy, Task Completion

ChemOrch

STEM / Chemistry

A chemical intelligence benchmark that provides LLMs with synthetic instructions for chemistry reasoning. It tests multi-step scientific reasoning chains.

Key Metrics

Reasoning Accuracy

PersonaMem

Personalization

A benchmark for dynamic user profiling and personalized responses at scale. It measures how well agents learn and adapt to individual user preferences.

Key Metrics

Profile Accuracy, Response Quality

VisualSphinx

Multimodal

Large-scale synthetic vision logic puzzles for reinforcement learning. It tests multimodal reasoning where visual understanding and logic intersect.

Key Metrics

Puzzle Accuracy

These benchmarks are part of our ongoing research. See all publications and open-source contributions.

Need custom evaluation?

We build domain-specific evaluation frameworks tailored to your agent's deployment context, from task design to failure taxonomy to regression suites.