Public Benchmark ///

Art & Aesthetics Benchmark

How well do AI models perceive and evaluate visual aesthetics? We tested 15 leading models across 2,400 expert-curated test cases spanning 7 aesthetic dimensions.

15 Models Tested

2,400 Test Cases

24 Expert Evaluators

Last updated February 2026

Leaderboard

Model Rankings

#	MODEL	VENDOR	SCORE	COMP.	COLOR	STYLE	EMOT.	DESIGN	CULT.	CRIT.
01	GPT-4o	OpenAI	82.4	85	84	81	80	83	79	85
02	Claude 3.5 Sonnet	Anthropic	81.1	83	82	79	82	81	78	83
03	Gemini 1.5 Pro	Google	79.8	82	80	78	79	80	77	82
04	Claude 3 Opus	Anthropic	78.3	80	79	77	78	79	76	79
05	GPT-4 Turbo	OpenAI	77.6	79	78	76	77	78	75	80
06	Gemini 1.5 Flash	Google	74.2	76	75	73	74	75	72	74
07	Llama 3.1 405B	Meta	72.8	74	73	72	71	74	70	75
08	Mistral Large 2	Mistral	71.5	73	72	71	70	73	69	73
09	Qwen2-VL 72B	Alibaba	70.1	72	71	69	69	71	68	71
10	GPT-4o mini	OpenAI	68.4	70	69	67	68	69	66	70
11	Claude 3 Haiku	Anthropic	66.7	68	67	66	66	68	64	68
12	Llama 3.1 70B	Meta	64.3	66	65	63	64	65	62	66
13	Gemini 1.0 Pro	Google	62.1	64	63	61	62	63	60	62
14	Mistral Medium	Mistral	59.8	62	60	58	59	61	57	61
15	Llama 3.1 8B	Meta	54.2	56	55	53	53	55	52	56

Score Distribution

Overall score vs. floor score (lowest category) for each model, sorted by overall score.

MODEL	OVERALL	FLOOR
GPT-4o	82.4	79
Claude 3.5 Sonnet	81.1	78
Gemini 1.5 Pro	79.8	77
Claude 3 Opus	78.3	76
GPT-4 Turbo	77.6	75
Gemini 1.5 Flash	74.2	72
Llama 3.1 405B	72.8	70
Mistral Large 2	71.5	69
Qwen2-VL 72B	70.1	68
GPT-4o mini	68.4	66
Claude 3 Haiku	66.7	64
Llama 3.1 70B	64.3	62
Gemini 1.0 Pro	62.1	60
Mistral Medium	59.8	57
Llama 3.1 8B	54.2	52
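The floor score is described as each model's lowest category score. A minimal sketch of that lookup, using the GPT-4o row from the leaderboard; the function name `floor_score` and the dictionary layout are illustrative assumptions, not part of the published protocol:

```python
def floor_score(dimension_scores):
    """Return (dimension, score) for the lowest-scoring dimension."""
    if not dimension_scores:
        raise ValueError("no dimension scores given")
    # min() over the keys, compared by their score values
    weakest = min(dimension_scores, key=dimension_scores.get)
    return weakest, dimension_scores[weakest]


# GPT-4o row from the leaderboard, keyed by the table's column labels
gpt4o = {
    "Comp.": 85, "Color": 84, "Style": 81, "Emot.": 80,
    "Design": 83, "Cult.": 79, "Crit.": 85,
}

print(floor_score(gpt4o))  # ('Cult.', 79)
```

In the published rankings, the floor value for every model matches its Cult. (Cultural Context) column.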

Evaluation Dimensions

7 Aesthetic Dimensions

Each model is evaluated across seven independent dimensions of aesthetic understanding, scored by domain experts.

Composition

Balance, visual weight, figure-ground relationships, leading lines, and spatial arrangement.

Color Harmony

Color relationships, temperature, saturation balance, and chromatic coherence.

Style Recognition

Artistic movements, cross-cultural visual traditions, and period identification.

Emotional Resonance

Mood, atmosphere, psychological response, and affective interpretation.

Design Principles

Hierarchy, contrast, alignment, typography, and visual communication.

Cultural Context

Cultural symbolism, historical awareness, and contextual interpretation.

Critique Depth

Structured critique, analytical reasoning, and evaluative articulation.

More dimensions coming soon.

Protocol

Methodology

Rigorous, transparent, and reproducible. Every aspect of our evaluation protocol is designed for scientific credibility.

Peer-reviewed methodology
Open evaluation protocol
Quarterly updates

Evaluation Protocol

Each model receives the same 2,400 test cases — images paired with aesthetic evaluation prompts. Responses are scored against expert consensus on a 0–100 scale across all 7 dimensions.

Expert Panel

24 evaluators with backgrounds in fine art, art history, graphic design, and cultural studies. Each test case is independently scored by 3 experts; final scores use the median to reduce outlier effects.
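To illustrate why a median of three ratings dampens outliers, here is a small sketch. It assumes the median is taken per test case over the three independent expert scores; the function name `consensus_score` and the sample ratings are hypothetical:

```python
from statistics import mean, median


def consensus_score(expert_scores):
    """Median of the three independent expert scores for one test case."""
    if len(expert_scores) != 3:
        raise ValueError("expected exactly 3 expert scores")
    return median(expert_scores)


# Hypothetical ratings: two experts roughly agree, one is an outlier
ratings = [82, 78, 40]

print(consensus_score(ratings))  # 78
print(round(mean(ratings), 1))   # 66.7, pulled down by the outlier
```

With three raters, the median is simply the middle value, so a single extreme rating cannot shift the final score.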

Scoring

Per-dimension scores are averaged to produce the overall score. Scores reflect agreement with expert consensus — a model scoring 80 agrees with expert judgment 80% of the time on that dimension.
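As a worked example of the averaging step, the sketch below recomputes GPT-4o's overall score from its seven leaderboard dimension scores. It assumes an equal-weight mean rounded to one decimal place; the text says only that scores are "averaged", and the names `DIMENSIONS` and `overall_score` are illustrative:

```python
from statistics import mean

# Column labels as they appear in the leaderboard table
DIMENSIONS = ("Comp.", "Color", "Style", "Emot.", "Design", "Cult.", "Crit.")


def overall_score(dimension_scores):
    """Equal-weight mean of the seven dimension scores, to one decimal."""
    missing = [d for d in DIMENSIONS if d not in dimension_scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    return round(mean(dimension_scores[d] for d in DIMENSIONS), 1)


gpt4o = {
    "Comp.": 85, "Color": 84, "Style": 81, "Emot.": 80,
    "Design": 83, "Cult.": 79, "Crit.": 85,
}

print(overall_score(gpt4o))  # 82.4  (577 / 7 = 82.43)
```

Some rows recompute to within 0.1 of the published figure rather than exactly (Gemini 1.5 Pro gives 79.7 against a published 79.8), presumably because the displayed dimension scores are themselves rounded.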

Updates

The benchmark is updated quarterly as new models are released. Historical scores are preserved for longitudinal comparison. Methodology revisions are documented publicly.

Get involved ///

Submit your model for evaluation

We evaluate new models on a rolling basis. Submit your model and receive a full breakdown across all 7 aesthetic dimensions within two weeks.
