Can AI Build a Game?

580 retro arcade games. 9 AI models. Zero human intervention.

Oscar Craven · April 2026 · source on GitHub

We asked nine AI models to build complete, playable HTML5 arcade games — Pong, Pac-Man, Tetris, Donkey Kong, and six others — from a text specification alone. No human touched the code. No libraries. No frameworks. Just a prompt in, a game out.

The result is the largest public benchmark of AI-generated creative software: 580 builds across two rounds of testing, evaluated by three independent AI judges and human playtesting. Every game is playable in your browser.

1. Executive Summary

The headline: The best AI models can build a playable retro game in a single pass. The worst produce broken output that no amount of budget can fix. The gap between them is enormous — and price doesn't predict quality.

What "quality" means here: Every score in this report is a consensus AI-judge rating of the generated code — three frontier models reading each build statically across five dimensions. No human playtesting validates these scores (yet). When the paper says Gemini Flash delivers "90% of Sonnet's quality," that means 90% by AI-judge consensus on static code, not human experience of playing the game.

Eight key findings from 580 builds:

Sonnet and GPT-5.4 are tied at the top (8.62 vs 8.58 V1 consensus) — the choice between them is pricing, not quality.
Gemini Flash is good value for volume work. 7.74 quality at ~5× cheaper output tokens than Sonnet ($3 vs $15 / M) and ~6× cheaper input ($0.50 vs $3 / M).
Opus is ~1.7× Sonnet's per-token rate ($25/M vs $15/M output via PaleBlueDot — same ratio on input). Per-token gap is small. The reason Opus dominates the bill is call length and context depth, not headline rate. Opus and Sonnet together drove ~$274 of the ~$377 total builder API spend across V1 + V2 (~73%).
o3-mini is dead last (5.23) — but it's also the oldest (Jan 2025) and smallest model tested, so we can't isolate whether architecture, age, or size is responsible.
Game complexity is the primary quality driver. Snake averages 8.72; Donkey Kong averages 6.09. A 2.63-point gap that no model choice closes.
Detailed specs don't help — and may hurt. V2 3-judge scoring confirms: control (game name only) averaged 6.86 vs detailed specs at 6.70. Most builders scored worse with specs.
V2 reshuffles the builder rankings. Opus rises to #1 (7.77) under strict session isolation; several smaller models drop. Rankings depend heavily on evaluation methodology.
AI judges disagree — and the panel matters. V1's Gemini Flash inflated scores by ~1.8 points. V2's Sonnet/Gemini Pro/GPT-5.4 panel shows tighter agreement (MAD ~0.3) but the two panels can't be directly compared.

2. The Experiment

What we built

Each build is a single self-contained HTML file: HTML5 Canvas, JavaScript, zero external dependencies. No CDNs, no images, no sound files. The game must be playable by opening the file in a browser — nothing else.

We chose ten classic arcade games, ordered roughly by implementation complexity:

Game	Complexity	Key Challenge
Pong	Low	Paddle-angle physics, AI opponent
Snake	Low	Collision detection, tail growth
Breakout	Low-Med	Ball-brick collisions, level progression
Space Invaders	Medium	Formation movement, bullet patterns
Tetris	Medium	Piece rotation, line clearing, drop preview
Asteroids	Medium	Inertial physics, screen wrapping, splitting
Galaga	Hard	Swooping formations, ship capture mechanic
Frogger	Hard	Multi-lane timing, log-riding physics
Pac-Man	Hard	Ghost AI (4 distinct behaviours), maze rendering
Donkey Kong	Very Hard	Platform physics, barrel AI, level structure

Two rounds

V1 (182 builds, 8 runs) used a template-based specification: every planner filled the same 12-section form before handing off to a builder model. This revealed strong builder-model rankings but its planner comparison was flawed — it measured how models fill a template, not planning quality.

V2 (400 builds, 5 runs) fixed this with unconstrained specifications. Five planners (Opus, GPT-5.4, Gemini Pro, GLM-5, and a no-spec control) were told: "Write a spec however you think will best serve the developer. No format is required." Eight builder models then built each game from each planner's spec — or, in the control condition, from the game's name alone.

How we scored

Every build was evaluated by three independent AI judges (GPT-5.4, Opus, and Gemini Flash in V1; GPT-5.4, Gemini Pro, and Sonnet in V2) across five dimensions:

Functionality (30% weight) — core mechanics, scoring, win/lose states
Keyboard Controls (20%) — responsiveness, correct mappings, no stuck inputs
Visual Fidelity (20%) — resemblance to the original, sprites, animations, UI
Playability (20%) — is it fun? difficulty curve, fairness, engagement
Error-Free (10%) — no console errors, crashes, freezes, rendering glitches

Consensus score = arithmetic mean of all three judges.

3. Results at a Glance

V1 Leaderboard — Builder Model Rankings

Rank	Model	Consensus	GPT-5.4 Judge	Opus Judge	Min	Max	n	Tier
1	Sonnet	8.62	8.01	8.03	7.33	9.27	18	Premium
2	GPT-5.4	8.58	7.98	8.04	7.53	9.20	20	Premium
3	Opus	8.22	7.55	7.42	6.17	9.07	26	Premium
4	GPT-5.4 Mini	7.86	7.25	6.92	5.37	9.13	20	Budget
5	Gemini Flash	7.74	6.92	6.89	4.47	9.03	20	Budget
6	GPT-5.4 Nano	7.53	6.74	6.40	2.57	9.07	20	Budget
7	Haiku	7.09	6.09	5.96	4.93	8.60	20	Budget
8	Gemini Pro	6.78	5.87	6.18	1.93	9.13	20	Mid
9	o3-mini	5.23	4.88	3.84	3.73	7.00	18	Mid

Note: Gemini Flash judge inflates all consensus scores by ~1.8 points. For model comparison, the GPT-5.4/Opus judge columns are the cleaner signal. Full calibration analysis in Section 8.

Model Performance — Consensus Score (V1, 182 builds)

Game Difficulty Ranking

Average Score by Game — Hardest to Easiest

Snake (8.72) and Pong (8.46) are near-guaranteed successes for any model. Donkey Kong (6.09) and Pac-Man (6.13) separate the capable from the rest. The minimum scores tell the real story: Pac-Man's floor of 0.57 and Donkey Kong's 1.27 indicate models producing completely non-functional implementations.

Exhibit · Play it yourself

Complexity drives quality — same judges, different games

The best Snake build scored 9.27. The best Donkey Kong build scored 8.00 — and that's the ceiling, not the average (6.09). Click into either game and try to play. Snake feels complete. Donkey Kong, even at its best, feels fragile. The training data for Snake is dense; for Donkey Kong it's thin.

Snake · Sonnet (V1) 9.27 ↗ open

Donkey Kong · GPT-5.4 (V1) 8.00 ↗ open

4. What We Learned

Finding 1: The premium tier is a statistical tie

Sonnet (8.62) and GPT-5.4 (8.58) are separated by 0.04 points — well within noise. GPT-5.4 has lower published token pricing (~30% less). The choice between them is ecosystem and pricing, not quality. Both consistently produce playable, visually reasonable games with functioning controls.

Finding 2: Opus costs more and V1/V2 disagree on its quality

Opus is ~1.7× Sonnet's per-token rate at PaleBlueDot gateway rates — same ratio on input and output ($25/M vs $15/M output; $5/M vs $3/M input). Per-token gap is modest. Where Opus actually dominates the bill is call length and context depth: across V1 + V2 builds, Opus drew $145 of the ~$377 builder API spend — the most of any single model. V1 placed Opus third (8.22, behind Sonnet at 8.62 and GPT-5.4 at 8.58); V2 placed it first (7.77) once session isolation was enforced. The two rounds don't agree, and without human playtesting we can't say which ranking is closer to real-world quality.

One data point worth noting: as a judge, Opus rated its own builds at 7.42 — the second-lowest self-assessment in the V1 panel. That suggests quality awareness rather than grade inflation.

Finding 3: Budget models are shockingly capable

Gemini Flash scores 7.74 — within 0.88 points of the top — at the lowest token rates in the test. Per-token, Flash is ~5× cheaper than Sonnet on output ($3 vs $15 / M) and ~8× cheaper than Opus ($3 vs $25 / M). For volume generation where small quality drops are acceptable, Flash is the rational starting point.

Finding 4: o3-mini is old and small — not a fair architecture test

o3-mini scored 5.23 — dead last by 1.55 points, with the worst score on every single dimension. Its keyboard controls score of 4.33 means basic input handling is broken in most builds.

Rather than a verdict on reasoning architectures in general, this is probably a fair verdict on this specific model. o3-mini was released on 31 January 2025 — roughly 15 months before these builds ran in March–April 2026. Every other model in the lineup is from a newer generation: Sonnet 4.6, Opus 4.7, GPT-5.4, GLM-5, and current Gemini Pro/Flash were all released later. o3-mini is also the smallest and cheapest model OpenAI offered in that generation. Older and lighter is enough to explain last place; we can't separate architecture from age and size with a single data point.

Finding 5: Visual fidelity is universally weak

Across all nine models, Visual Fidelity is the lowest-scoring dimension. Even Sonnet (best overall) scores 8.06 on visuals vs 9.11 on controls. AI models systematically deprioritise CSS/canvas aesthetics in favour of functional correctness — the games work, but they don't look like the originals.

Finding 6: Your spec might not matter

V2's most counterintuitive result — now confirmed with full 3-judge QA: The control condition — where builders received only the game's name and build rules, with no specification at all — scored higher than every detailed planner spec.

Opus wrote specifications averaging 29,551 bytes per game. GLM-5 wrote 16,205 bytes. Gemini Pro wrote 6,696 bytes. The control condition spec was 90 bytes: just the game name and "CONTROL CONDITION: No planner spec."

Planner	Avg Spec Size	Avg Score (3-Judge)	Median	Min	Max	n
Control (no spec)	90 bytes	6.86	7.30	2.33	9.00	80
Gemini Pro	6,696 bytes	6.82	7.17	1.05	9.03	80
GPT-5.4	13,384 bytes	6.81	7.33	0.50	9.17	80
GLM-5	16,205 bytes	6.78	7.18	1.03	8.97	80
Opus	29,551 bytes	6.39	6.60	1.43	8.90	77

The most detailed planner (Opus, 29K bytes) produced the lowest average quality — 0.47 points below the control condition. The control's minimum score (2.33) is also the highest floor of any planner, suggesting that no-spec builds fail less catastrophically.

At the builder level, specs actively harm most models. Haiku loses 0.44. Sonnet loses 0.28. Only GPT-5.4 meaningfully benefits (+0.74). Opus shows zero effect either way. The per-builder table below has the full picture.

Exhibit · Play it yourself

Same builder, same game — a detailed spec crashed it

Haiku built Snake twice. Once given Opus's detailed planner spec (rules, collision logic, tuning parameters), it shipped something judges could barely rate — 1.43/10. Given only "# Snake — CONTROL CONDITION: No planner spec" it produced a proper playable Snake — 7.93/10. Smaller models can choke on instructions they can't fully internalise.

Snake · Haiku + Opus spec 1.43 ↗ open

Snake · Haiku + Control (no spec) 7.93 ↗ open

Finding 7: V2 builder rankings differ from V1

With V2's isolated-context, unconstrained-spec methodology and fresh 3-judge scoring panel (Sonnet, Gemini Pro, GPT-5.4 replacing V1's GPT-5.4, Opus, Gemini Flash), the builder rankings shifted:

Rank	V2 Builder	V2 Score	V1 Equivalent	V1 Score
1	Opus	7.77	Opus	8.22
2	Gemini Pro	7.45	Gemini Pro	6.78
3	GPT-5.4	7.43	GPT-5.4	8.58
4	Sonnet	7.40	Sonnet	8.62
5	GLM-5	6.63	new in V2	—
6	Gemini Flash	6.50	Gemini Flash	7.74
7	Haiku	5.65	Haiku	7.09
8	GPT-5.4 Mini	5.07	GPT-5.4 Mini	7.86

V2 Builder Rankings (3-Judge Static QA)

Opus rises from 3rd to 1st. Sonnet and GPT-5.4 drop from 1st-2nd to 3rd-4th. This likely reflects V2's stricter judges (no Gemini Flash inflation) and isolated context (removing Opus's V1 accumulated-context penalty). The V2 judge panel doesn't include the lenient Gemini Flash judge, which deflates all V2 scores relative to V1.

Finding 8: AI judges don't agree — and Flash was miscalibrated, not biased

Gemini Flash as a V1 judge rated nearly everything 9-10, with a total range of just 2.85 points across all models. GPT-5.4 showed a 3.13-point range; Opus showed 4.20 points. Gemini Flash rated o3-mini (the worst builder) at 6.97 — a score that would place it above Haiku by Gemini's measure, despite GPT-5.4 and Opus scoring it 4.88 and 3.84 respectively.

Importantly, this isn't vendor favouritism. Flash inflated all models roughly uniformly: it rated Sonnet builds +1.80 above the GPT-5.4/Opus consensus, Haiku builds +3.21, its own Gemini Flash builds +2.52, and Gemini Pro builds +2.27. Flash was no more generous to its own vendor's output than anyone else's — it simply lacks the discrimination to score meaningfully. The fail mode is poor calibration, not partiality.

V2 replaced Gemini Flash with Gemini Pro as a judge. The result: V2 scores are 1-2 points lower than V1 across the board, confirming that Gemini Flash's inflation was systematically elevating V1 consensus scores.

The practical lesson: never use a single AI judge for evaluation. Use at least two judges that agree on relative ordering, and be aware that changing your judge panel will shift absolute scores even if the underlying quality is unchanged.

Exhibit · Three judges, three different realities

The same build, scored by three AI judges

Haiku built Tetris. GPT-5.4 the judge gave it 4.4. Opus gave it 6.0. Gemini Flash gave it 9.8. That's a spread of 5.4 points — same code, same rubric. Play it yourself and decide who was closest to the truth. This pattern repeats across hundreds of builds.

Tetris · Haiku (V1 run1) GPT 4.4 · Opus 6.0 · Flash 9.8 ↗ open

5. The Spec Question — Does Planning Help?

V1 found a small planner effect: Opus-generated specs averaged 8.27 vs GPT-generated specs at 8.15 (delta +0.12). But V1's planner comparison was methodologically flawed — both planners filled the same template, so we measured template-filling skill, not planning quality.

V2 removed the template entirely. The planner prompt was:

"You are writing a game specification for a developer who will implement [GAME NAME]
as a single self-contained HTML5 Canvas file with no external dependencies.

The developer is skilled but has never played this game. Write them a spec that
gives them everything they need to build a complete, playable version.

No format is required. Write it however you think will best serve the developer."

The result was dramatic structural diversity. For Pong alone:

Opus (312 lines): Design rationale, pixel-level coordinates, physics formulas, acceptance criteria. Reads like an engineering brief: "The ball must accelerate over rallies to build tension. Where you strike the ball on your paddle must control the return angle — this is the only skill expression in the game. Get these two things right and it's satisfying. Get them wrong and it's a screensaver."
GPT-5.4 (213 lines): Pragmatic, well-structured, game-specific detail. Assumes developer competence.
Gemini Pro (86 lines): Functional essentials only. Covers mechanics, skips rationale.
GLM-5 (293 lines): Verbose but less organised. Detailed descriptions, weaker structure.
Control (3 lines): "# Pong\n\nCONTROL CONDITION: No planner spec."

The qualitative difference is striking. The quantitative result, confirmed with full 3-judge static QA, is that it didn't matter — and in many cases, detailed specs actively hurt.

Spec benefit by builder model

The effect varies dramatically by builder. Most models scored worse when given specs:

Builder	Control Avg	With-Spec Avg	Benefit	Verdict
GPT-5.4	6.84	7.58	+0.74	Meaningfully helped
Gemini Pro	7.38	7.47	+0.09	Negligible
Gemini Flash	6.49	6.50	+0.01	No effect
Opus	7.77	7.77	0.00	No effect
GLM-5	6.84	6.58	-0.26	Hurt
Sonnet	7.62	7.34	-0.28	Hurt
Haiku	6.00	5.56	-0.44	Hurt
GPT-5.4 Mini	5.94	4.86	-1.09	Seriously hurt

A pattern worth noting: smaller models tend to do worse when given detailed specs than larger ones. Spec complexity should probably be calibrated to builder capability — simpler models may perform better with simpler prompts. For well-known games, model training data already includes sufficient implementation knowledge, so a detailed spec adds signal that's redundant with what the model already knows, and for weaker models the extra complexity becomes noise.

V2 Planner Scores — Control vs Detailed Specs

6. The Money — Cost Per Build

V1 Score vs Audited Cost Per Build (PaleBlueDot per-call billing)

Model	V1 Score	V2 Score	V1 $/build	V2 $/build	Tier
Sonnet	8.62	7.40	$3.30	$1.39	Premium
GPT-5.4	8.58	7.43	$0.47	$0.25	Premium
Opus	8.22	7.77	$2.60	$1.59	Premium
Gemini Pro	6.78	7.45	$0.34	$0.67	Mid
Gemini Flash	7.74	6.50	$0.04	$0.11	Budget
Haiku	7.09	5.65	$0.56	$0.34	Budget
GLM-5	—	6.63	—	$0.10	Budget
GPT-5.4 Mini	7.86	5.07	—*	$0.01	Budget
o3-mini	5.23	—	$0.07	—	Mid

Per-build costs are audited — derived from PaleBlueDot per-call billing divided by completed-build counts per model per round. They cover the builder-call API spend only, not orchestration overhead (which roughly doubled the bill — see below). * GPT-5.4 Mini's V1 builds went via a different code path (runs_7_8_build_results.json) and didn't surface in the per-game-per-model audit; per-build figure not directly computable from the same source.

At the PaleBlueDot gateway rates we paid: Flash is ~5× cheaper than Sonnet on output ($3 vs $15 / M tokens) and ~8× cheaper than Opus ($3 vs $25 / M). Opus is ~1.7× pricier per output token than Sonnet ($25 vs $15 / M; $5 vs $3 input). Without human playtesting to validate whether AI-judge score differences map to real-world quality, we've stopped short of a score-per-dollar headline.

Per-builder spend by version

These are the actual builder-model API costs per round, audited from PaleBlueDot per-call billing. They cover the build-only portion (orchestration, judging, rebuilds are separate lines below).

Builder	V1 ($)	V2 ($)	What changed
Opus	67.57	77.81	Top spend both rounds — long calls, deep context
Sonnet	59.33	69.27	Close second — high call frequency
Gemini Pro	6.81	32.64	~5× jump — V2 added Pro as a planner role
Haiku	11.20	16.80	Modest growth, consistent budget tier
GPT-5.4	9.43	12.57	Stable across rounds
Gemini Flash	0.89	5.54	Tiny in V1 (judge only); used as builder in V2
GLM-5	—	4.78	New in V2 only
GPT-5 mini	—	0.61	New in V2 only
o3-mini	1.32	—	V1 only
Build-only total	$157	$220	Per-game breakdown below

Why V2's per-builder spend is higher: V2 used a different planner-builder mix (5 planner roles instead of 2 fixed templates), longer prompts (free-form specs vs 12-section template), more builders per game (8 vs 9, but with deeper per-call context), and isolated context that reset on each call (no cross-game caching).

Per-game build cost (V1 + V2 combined)

Game	V1 ($)	V2 ($)	Total ($)	Notes
Pong	20.38	28.14	48.52	Most expensive overall — Sonnet-heavy in V2 ($14.43 alone)
Pac-Man	25.44	21.21	46.65	V1 high from Opus rebuild attempts ($17.10 in single cell)
Tetris	16.41	23.89	40.30	Opus-heavy V2 ($11.21)
Galaga	14.64	22.96	37.60	Sonnet + Opus dominant V2
Space Invaders	14.23	24.22	38.45	Opus highest single cell ($10.11) V2
Asteroids	13.23	21.32	34.55	Steady distribution
Donkey Kong	11.57	20.85	32.42	Cheaper than expected given complexity — most calls hit ceilings fast
Snake	11.24	20.47	31.71	Quick to converge — short calls
Breakout	18.22	18.99	37.21	Most-balanced V1/V2 — well-known mechanics
Frogger	11.21	17.96	29.17	Cheapest game overall

The complexity-vs-quality finding (Section 3) doesn't map cleanly onto cost: Donkey Kong was actually one of the cheapest games to attempt, despite being the lowest-scoring. Most builders churned out a quick failure rather than a long, expensive attempt. Pong and Pac-Man cost more because builders kept trying — long sessions, multiple iterations, more context accumulation.

What it actually cost — fully audited

A token-level audit of every PaleBlueDot call from 30 March – 15 April separates the games work from concurrent activity on the same API key (cron jobs, daily research agents, side projects). Headline:

Slice	Cost (USD)	What it covers
V1 build	$332	8 models × 10 games, template specs, parallel dispatch
V2 build	$399	5 planner roles × 8 builders × 10 games, free-form specs, isolated context
V2 inline judging (bench-sonnet)	$202	Long-running judging session — single biggest line item
V2 static 3-judge QA	$50	Sonnet + Gemini Pro + GPT-5.4 panel, 1.5 hours
V1 rebuilds (Pac-Man + Donkey Kong)	$40	3 rebuild events, attributed to V2 phase
Window edges + cleanup	$56	Spillover spend at the boundaries of build / QA windows
Games total (audited)	$1,138	Everything attributable to V1+V2 work
Concurrent orchestration noise	$660	Cron, daily agents, side activities sharing the API key
Grand total during the period	$1,798	Full PaleBlueDot spend, 30 Mar – 15 Apr

The games-attributable spend is ~$1,138. Another ~$660 of unrelated automation (cron jobs, daily research agents, side projects) ran on the same API key during the period, bringing the gross PaleBlueDot bill to ~$1,798. The two are separated above because the wider number doesn't measure the cost of building or evaluating the games.

V2 cost 2.4× more than V1 — $806 vs $332 — driven by three real differences in scope, not by wall-clock duration (V2 ran across more days largely because of my availability, not because the work itself was 2× heavier):

More planner-builder combinations. V1 used 2 planner templates × ~9 builders. V2 used 5 planner roles (Opus / GPT-5.4 / Gemini Pro / GLM-5 / Control) × 8 builders. More cells × more isolated context = more API calls.
A long inline-judging session as part of V2. V2 added a bench-sonnet inline judge that ran across V2's whole build window ($202). V1 didn't have an equivalent — it relied on Playwright + brief inline review (~$10).
A second QA pass. V2 added the static 3-judge multijudge ($50) on top of inline judging. V1 had nothing equivalent.

Richer methodology costs sharply more. The cost ratio (2.4×) is closer to the methodology delta than the time delta would suggest.

What we'd do differently to spend less

Three findings worth flagging up front, separated from the model-quality narrative:

Orchestration was 2/3 of the budget. The actual builder + judge model calls (the bench-* sub-agents that did the work) cost ~$364. The orchestration tax around them — main / ross / harvey / dwight dispatching, polling, summarising, retrying, re-prompting — was ~$753. For every $1 spent on actual game-building or judging, $1.83 went to coordination overhead. The pipeline is roughly 60% coordination, 35% builders, 5% judges. The agent framework cost more than the work it managed.
V1 was 2.3× more orchestration-intensive per hour than V2 ($19/h vs $8/h) because V1 fired 8 model variants in parallel with ad-hoc main → bench delegation, full prompt re-sends every round. V2 ran via tighter pipeline scripts (build_v2.py, qa_browser_v2.py) with less orchestrator chatter — but ran much longer total, so V2's absolute orchestration spend ended up higher anyway.
OpenClaw re-sends accumulated conversation context on every turn. That's load-bearing for agent coherence but expensive — token cost grows quadratically with conversation length, not linearly. For high-volume jobs, batched stateless calls or aggressive between-turn context trimming would have been dramatically cheaper.
OAuth / subscription tiers vs raw PAYG. All this spend went through the PaleBlueDot gateway at pay-as-you-go rates. Most providers offer subscription, committed-spend, or OAuth flows that materially reduce effective rates for sustained usage. At this scale that alone could have cut the bill by a meaningful fraction.

Caveat: the audit relies on cross-referencing per-call CSV billing with bench-* JSONL timestamps. PaleBlueDot doesn't tag calls with project/agent metadata, so attribution between "games work" and "concurrent automation" is inferred from time windows, not direct labelling. The breakdown is directional, not exact to the cent.

7. Methodology

V1 Protocol (182 builds)

Runs 1-2: 9 builder models × 10 games. Specs generated by Opus (run 1) and GPT-5.4 (run 2) using a mandatory 12-section template. Accumulated context — builders saw prior game code.
Runs 3-4: Opus only, isolated context — testing the Opus Paradox. Fresh session per build.
Runs 5-6: Gemini Flash and Gemini Pro as builders with Opus/GPT specs.
Runs 7-8: GPT-5.4 Mini and GPT-5.4 Nano as builders. Only runs with actual token/cost data.
QA: 3 judges (GPT-5.4, Opus, Gemini Flash) × 5 dimensions. Consensus = mean of all three.

V2 Protocol (400 builds)

5 planner runs: Opus, GPT-5.4, Gemini Pro, GLM-5, and a no-spec control.
8 builder models × 10 games per run = 400 builds.
Unconstrained specs: Planners given a brief prompt only. No template. They decide structure, depth, and format.
Strict isolation: Every build = fresh session. No context accumulation. Memory flush between games.
DNF handling: up to 3 attempts per build, then mark as DNF. Artefact verified on disk (not model self-reporting).
QA: 3-judge static scoring (Sonnet, Gemini Pro, GPT-5.4) across the same 5 dimensions as V1. 397 of 400 builds scored (3 DNFs).

Build Environment

All builds executed via the OpenClaw agent framework, which wraps API calls to each model provider. V1 builds ran fully in parallel across models and games. V2 builds ran in a mix of parallel and sequential batches depending on provider rate limits. Hard timeout: 15 minutes per build attempt. Output: single index.html file per build.

8. Judge Calibration

Three judges independently scored every V1 build. Their agreement — or lack thereof — is critical to interpreting the results.

Judge Scores by Builder Model — Showing Gemini Flash Inflation

Builder	GPT-5.4 Judge	Opus Judge	Gemini Judge	GPT/Opus Avg	Gemini Inflation
Sonnet	8.01	8.03	9.82	8.02	+1.80
GPT-5.4	7.98	8.04	9.72	8.01	+1.71
Opus	7.55	7.42	9.68	7.49	+2.19
GPT-5.4 Mini	7.25	6.92	9.43	7.09	+2.34
Gemini Flash	6.92	6.89	9.43	6.91	+2.52
GPT-5.4 Nano	6.74	6.40	9.46	6.57	+2.89
Haiku	6.09	5.96	9.24	6.03	+3.21
Gemini Pro	5.87	6.18	8.30	6.03	+2.27
o3-mini	4.88	3.84	6.97	4.36	+2.61

Gemini Flash judge inflates scores by an average of +2.39 points over the GPT-5.4/Opus consensus. The inflation is worse for lower-quality builds — Haiku builds get +3.21 points of inflation vs Sonnet builds at +1.80 — meaning Gemini Flash actively compresses the quality scale, making bad builds look mediocre and mediocre builds look excellent.

9. Related Work — Where This Benchmark Fits

AI code generation benchmarks have proliferated, but they cluster around a narrow task type: given a specification (docstring, issue, or test), produce a code snippet that passes predefined tests. Our benchmark asks a fundamentally different question: given a description, can you produce complete interactive software that a human wants to use?

Benchmark	Task Type	Output	Evaluation	Creative?
SWE-bench	Fix real GitHub issues	Code patches	Unit tests (pass/fail)	No
HumanEval / MBPP	Write functions from docstrings	Code snippets	Unit tests (pass@k)	No
BigCodeBench	Multi-library function tasks	Functions	Unit tests	No
LiveCodeBench	Competitive programming	Algorithm solutions	Judge tests (pass@1)	No
Aider Polyglot	Code editing across 6 languages	Edited files	Exercism tests	No
DevBench	Full development lifecycle	Complete codebases	Acceptance tests	Partial
This Benchmark	Build complete games from spec	Playable applications	Multi-judge + human play	Yes

Key benchmarks in context

SWE-bench is the current gold standard for AI coding evaluation. It asks models to resolve real GitHub issues in Python repos (Django, scikit-learn, etc.), graded against the original PR's unit tests. Top scores have risen from Claude 2's 1.96% (Oct 2023) to ~77-81% for current frontier models. But SWE-bench tests maintenance — patching existing code — not creation. Our benchmark has no existing codebase, no reference implementation, and no unit tests.

HumanEval (164 problems) and MBPP (974 problems) test single-function generation from docstrings. They're now largely saturated — frontier models solve 90%+ — and heavily contaminated in training data. EvalPlus showed that expanding test cases 80x dropped pass rates by up to 28.9%, suggesting the originals were too lenient.

Aider's Polyglot benchmark (225 Exercism problems across 6 languages) is notable for its cost transparency: DeepSeek V3.2 costs $1.30 total vs o3-pro's $146.32 — a 112x cost difference for a 14% score gap. Our benchmark shows similar token-pricing disparities across tiers.

DevBench comes closest to our scope — it tests the full development lifecycle across 22 repositories. But GPT-4-Turbo achieved only 7.1% on implementation acceptance tests, suggesting its task formulation is too rigid. Our approach — judging the output holistically by playability rather than against predefined test suites — may better capture what "working software" actually means.

Vibe coding in the wild

Andrej Karpathy coined "vibe coding" in February 2025: "you fully give in to the vibes, embrace exponentials, and forget that the code even exists." It became Collins Dictionary's Word of the Year. Our benchmark is a controlled, systematic study of exactly this: 580 builds where AI models received a specification (or just a game name) and produced complete software without human intervention.

Real-world vibe coding data is mixed. Y Combinator reported 25% of their Winter 2025 batch had codebases ~95% AI-generated. But the METR randomised controlled trial (July 2025) found experienced developers were 19% slower with AI tools, despite believing they were faster. CodeRabbit's analysis found AI co-authored code had 1.7x more major issues and 2.74x higher security vulnerabilities.

Our benchmark contributes hard empirical data to this debate: for well-known creative tasks (retro games), vibe coding produces functional output 99% of the time (397/400 V2 builds completed) with an average quality of 7.5-8.6/10 depending on model. The quality ceiling is high (9.27 for Snake by Sonnet) but the floor is low (0.57 for Pac-Man by GPT-5.4 Nano). Complexity, not model choice, is the primary determinant of success.

What's different here

No existing benchmark tests whether AI can produce complete, creative, interactive software from a specification. The gap our benchmark fills is the space between "can you fix this bug?" (SWE-bench) and "can you build something a human would play and enjoy?" That gap turns out to be enormous — and the models that dominate patch-based benchmarks don't necessarily lead on creative generation.

10. Limitations and Future Work

Per-build cost figures are estimates. Only runs 7-8 have actual token counts. All other per-build costs use published API prices × assumed token counts (~3K input + ~4K output). Actual metered costs were not tracked per-build at the time. A retrospective token-level audit (Section 6 — "What it actually cost") reconstructed the real spend: ~$1,138 for the games work, with another $660 of concurrent unrelated automation running on the same API key inflating the gross figure to $1,798 across the period.
Gemini Flash judge inflation is baked into V1 consensus. All V1 headline scores are arithmetically averaged across 3 judges including Gemini Flash. The GPT-5.4/Opus average column is the cleaner metric.
V1 and V2 judge panels differ. V1 used GPT-5.4, Opus, and Gemini Flash. V2 used Sonnet, Gemini Pro, and GPT-5.4. Scores across rounds are not directly comparable due to different judge calibration.
Game selection bias. All 10 games are well-known retro arcade titles likely present in AI training data. Results may not generalise to novel game designs or complex mechanics outside training distribution.
Single-file constraint. Requiring everything in one HTML file caps complexity. Real game development involves multiple files, build tools, and assets. This benchmark measures zero-dependency creative synthesis, not software engineering.
Model versions are point-in-time. All benchmarks were run in March-April 2026. Model capabilities change with updates. These rankings are valid for the specific model versions tested.
No human ground truth yet. Automated judges score code statically. A human playtesting component is planned but not yet completed.

Future work

Human playtesting validation (60-game sample, 5-dimension rubric matching automated QA)
Test novel game designs (not in training data) to measure true generative capability
Multi-file builds with dependencies — how do models handle real project structure?
Iterative refinement: what happens when models can see and fix their own bugs?

11. Practical Recommendations

Based on 580 builds, here's when to use what:

Use Case	Recommended Model	Why
Maximum quality, cost no object	Sonnet, GPT-5.4, or Opus	V1: Sonnet/GPT-5.4 tied. V2: Opus tops builder ranking (7.77). Depends on pipeline.
High volume generation	Gemini Flash	~5× cheaper output tokens than Sonnet via PaleBlueDot. 0.88-point V1 quality drop.
Budget with decent quality	Haiku or GPT-5.4 Nano	Haiku 7.09 (V1). Test smaller models in your pipeline before committing — V2 showed several dropped significantly under stricter conditions.
Exploring / prototyping	GPT-5.4 Nano or Gemini Flash	Cheapest tokens, acceptable output for iteration.
Accumulated context pipelines	Opus	Only model that measurably benefits from seeing prior work (+0.95 points in V1).

And what to avoid:

Opus is expensive but context-dependent. V1 showed Opus losing to Sonnet at 5x the cost, but V2 puts Opus at #1 builder. Evaluate in your own pipeline before ruling it out.
Don't use o3-mini for creative coding. It finishes last by a wide margin — though it's also older and smaller than everything else tested, so the verdict is on this model, not on reasoning architectures generally.
Don't trust a single AI judge. Use at least two that agree on relative ordering.
Don't over-engineer your spec for well-known tasks. If the model already knows Pac-Man, a 300-line spec may hurt more than help.