Public Benchmark

Art & Aesthetics Benchmark

How well do AI models perceive and evaluate visual aesthetics? We tested 15 leading models across 2,400 expert-curated test cases spanning 7 aesthetic dimensions.

15 Models Tested
2,400 Test Cases
24 Expert Evaluators

Last updated February 2026

Leaderboard

Model Rankings

Rank  Model              Organization  Score
01    GPT-4o             OpenAI        82.4
02    Claude 3.5 Sonnet  Anthropic     81.1
03    Gemini 1.5 Pro     Google        79.8
04    Claude 3 Opus      Anthropic     78.3
05    GPT-4 Turbo        OpenAI        77.6
06    Gemini 1.5 Flash   Google        74.2
07    Llama 3.1 405B     Meta          72.8
08    Mistral Large 2    Mistral       71.5
09    Qwen2-VL 72B       Alibaba       70.1
10    GPT-4o mini        OpenAI        68.4
11    Claude 3 Haiku     Anthropic     66.7
12    Llama 3.1 70B      Meta          64.3
13    Gemini 1.0 Pro     Google        62.1
14    Mistral Medium     Mistral       59.8
15    Llama 3.1 8B       Meta          54.2

Visual Comparison

Score Distribution

Overall score vs. floor score (lowest category) for each model, sorted by overall score.

Model              Overall  Floor
GPT-4o             82.4     79
Claude 3.5 Sonnet  81.1     78
Gemini 1.5 Pro     79.8     77
Claude 3 Opus      78.3     76
GPT-4 Turbo        77.6     75
Gemini 1.5 Flash   74.2     72
Llama 3.1 405B     72.8     70
Mistral Large 2    71.5     69
Qwen2-VL 72B       70.1     68
GPT-4o mini        68.4     66
Claude 3 Haiku     66.7     64
Llama 3.1 70B      64.3     62
Gemini 1.0 Pro     62.1     60
Mistral Medium     59.8     57
Llama 3.1 8B       54.2     52

Evaluation Dimensions

7 Aesthetic Dimensions

Each model is evaluated across seven independent dimensions of aesthetic understanding, scored by domain experts.

01 Composition

Balance, visual weight, figure-ground relationships, leading lines, and spatial arrangement.

02 Color Harmony

Color relationships, temperature, saturation balance, and chromatic coherence.

03 Style Recognition

Artistic movements, cross-cultural visual traditions, and period identification.

04 Emotional Resonance

Mood, atmosphere, psychological response, and affective interpretation.

05 Design Principles

Hierarchy, contrast, alignment, typography, and visual communication.

06 Cultural Context

Cultural symbolism, historical awareness, and contextual interpretation.

07 Critique Depth

Structured critique, analytical reasoning, and evaluative articulation.

Protocol

Methodology

Rigorous, transparent, and reproducible. Every aspect of our evaluation protocol is designed for scientific credibility.

Peer-reviewed methodology
Open evaluation protocol
Quarterly updates

01 Evaluation Protocol

Each model receives the same 2,400 test cases: images paired with aesthetic evaluation prompts. Responses are scored against expert consensus on a 0–100 scale across all 7 dimensions.

02 Expert Panel

24 evaluators with backgrounds in fine art, art history, graphic design, and cultural studies. Each test case is independently scored by 3 experts; final scores use the median to reduce outlier effects.
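
As a rough illustration of why the median is used, the sketch below combines three expert scores for a single test case. The function name and the example scores are hypothetical and are not taken from the benchmark's own tooling; the only detail drawn from the protocol is that each test case has 3 independent scores and the median is kept.

```python
from statistics import median


def consensus_score(expert_scores):
    """Return the median of the independent expert scores for one test case.

    Assumes exactly 3 scores per test case, as the protocol describes.
    """
    if len(expert_scores) != 3:
        raise ValueError("expected exactly 3 expert scores per test case")
    return median(expert_scores)


# Hypothetical scores: two experts roughly agree, one is an outlier.
scores = [72, 75, 20]

print(consensus_score(scores))              # 72
print(round(sum(scores) / len(scores), 1))  # 55.7 (mean, for comparison)
```

With one outlying score, the median stays at 72 while the mean drops to 55.7, which is the outlier effect the panel design is meant to dampen.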

03 Scoring

Per-dimension scores are averaged to produce the overall score. Each score reflects agreement with expert consensus: a model scoring 80 on a dimension agrees with expert judgment 80% of the time on that dimension.
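
A minimal sketch of the aggregation, assuming an unweighted mean over the seven dimensions (the text says only "averaged") and taking the floor score to be the lowest per-dimension score, as in the Score Distribution chart. The function name and the example scores are hypothetical and do not correspond to any model on the leaderboard.

```python
from statistics import mean

# The seven dimensions named in the benchmark.
DIMENSIONS = [
    "Composition",
    "Color Harmony",
    "Style Recognition",
    "Emotional Resonance",
    "Design Principles",
    "Cultural Context",
    "Critique Depth",
]


def summarize(dimension_scores):
    """Return (overall, floor_dimension, floor_score) for one model.

    overall is the unweighted mean of the per-dimension scores (an
    assumption), rounded to one decimal; the floor is the lowest
    per-dimension score.
    """
    missing = set(DIMENSIONS) - set(dimension_scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    values = [dimension_scores[d] for d in DIMENSIONS]
    overall = round(mean(values), 1)
    floor_dimension = min(DIMENSIONS, key=lambda d: dimension_scores[d])
    return overall, floor_dimension, dimension_scores[floor_dimension]


# Hypothetical per-dimension scores for an illustrative model.
example = {
    "Composition": 84,
    "Color Harmony": 83,
    "Style Recognition": 82,
    "Emotional Resonance": 80,
    "Design Principles": 85,
    "Cultural Context": 76,
    "Critique Depth": 81,
}

print(summarize(example))  # (81.6, 'Cultural Context', 76)
```

Reporting the floor alongside the overall score shows how much a single weak dimension is hidden by the average.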

04 Updates

The benchmark is updated quarterly as new models are released. Historical scores are preserved for longitudinal comparison. Methodology revisions are documented publicly.

Get involved

Submit your model for evaluation

We evaluate new models on a rolling basis. Submit your model and receive a full breakdown across all 7 aesthetic dimensions within two weeks.
