Public Benchmark ///
Art & Aesthetics Benchmark
How well do AI models perceive and evaluate visual aesthetics? We tested 15 leading models across 2,400 expert-curated test cases spanning 7 aesthetic dimensions.
Last updated February 2026
Leaderboard
Model Rankings
Vendors, in listed order: OpenAI, Anthropic, Anthropic, OpenAI, Meta, Mistral, Alibaba, OpenAI, Anthropic, Meta, Mistral, Meta.
Visual Comparison
Score Distribution
Overall score vs. floor score (lowest category) for each model.
Sorted by overall score.
Evaluation Dimensions
7 Aesthetic Dimensions
Each model is evaluated across seven independent dimensions of aesthetic understanding, scored by domain experts.
Composition
Balance, visual weight, figure-ground relationships, leading lines, and spatial arrangement.
Color Harmony
Color relationships, temperature, saturation balance, and chromatic coherence.
Style Recognition
Artistic movements, cross-cultural visual traditions, and period identification.
Emotional Resonance
Mood, atmosphere, psychological response, and affective interpretation.
Design Principles
Hierarchy, contrast, alignment, typography, and visual communication.
Cultural Context
Cultural symbolism, historical awareness, and contextual interpretation.
Critique Depth
Structured critique, analytical reasoning, and evaluative articulation.
Protocol
Methodology
Rigorous, transparent, and reproducible. Every aspect of our evaluation protocol is designed for scientific credibility.
Peer-reviewed methodology
Open evaluation protocol
Quarterly updates
Evaluation Protocol
Each model receives the same 2,400 test cases, each an image paired with an aesthetic evaluation prompt. Responses are scored against expert consensus on a 0–100 scale in each of the 7 dimensions.
Expert Panel
24 evaluators with backgrounds in fine art, art history, graphic design, and cultural studies. Each test case is independently scored by 3 experts; final scores use the median to reduce outlier effects.
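The median step can be sketched as follows. This is a minimal illustration rather than the benchmark's actual scoring code: the function name `consensus_score` and the sample ratings are hypothetical, and it assumes each expert submits a single number on the 0–100 scale.

```python
from statistics import median


def consensus_score(expert_scores):
    """Return the median of the three independent expert scores for one test case.

    Assumption: each test case has exactly three scores on a 0-100 scale.
    """
    if len(expert_scores) != 3:
        raise ValueError("expected exactly 3 expert scores per test case")
    return median(expert_scores)


# Hypothetical ratings: one expert scores far below the other two.
print(consensus_score([72, 75, 20]))  # 72
```

With three scores, the median is the middle value, so a single outlying rating (here, 20) cannot pull the final score the way it would pull a mean.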
Scoring
Per-dimension scores are averaged to produce the overall score. Each score reflects agreement with expert consensus: a model scoring 80 on a dimension agrees with expert judgment 80% of the time on that dimension.
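As a rough sketch of the aggregation, the snippet below computes an overall score and the floor score (the lowest category, as used in the Score Distribution chart) from seven per-dimension scores. It assumes an unweighted arithmetic mean, which the page does not state explicitly. The function name `summarize` and the example scores are hypothetical and are not benchmark results.

```python
from statistics import mean

# The seven dimensions named on this page.
DIMENSIONS = (
    "Composition",
    "Color Harmony",
    "Style Recognition",
    "Emotional Resonance",
    "Design Principles",
    "Cultural Context",
    "Critique Depth",
)


def summarize(dimension_scores):
    """Return (overall, floor_dimension, floor_score) for one model.

    Assumption: the overall score is the unweighted mean of the seven
    per-dimension scores; the floor is the lowest of those scores.
    """
    missing = [d for d in DIMENSIONS if d not in dimension_scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    overall = mean(dimension_scores[d] for d in DIMENSIONS)
    floor_dimension = min(DIMENSIONS, key=lambda d: dimension_scores[d])
    return overall, floor_dimension, dimension_scores[floor_dimension]


# Hypothetical per-dimension scores for an imaginary model.
example = dict(zip(DIMENSIONS, (80, 70, 90, 60, 75, 85, 65)))
overall, floor_dimension, floor_score = summarize(example)
print(overall, floor_dimension, floor_score)
```

Reporting the floor alongside the mean shows how far a model's weakest dimension sits below its overall score, which an average alone would hide.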
Updates
The benchmark is updated quarterly as new models are released. Historical scores are preserved for longitudinal comparison. Methodology revisions are documented publicly.
Get involved ///
Submit your model for evaluation
We evaluate new models on a rolling basis. Submit your model and receive a full breakdown across all 7 aesthetic dimensions within two weeks.
Submit a Model →