Can AI Build a Game?

580 retro arcade games. 9 AI models. Zero human intervention.

We asked nine AI models to build complete, playable HTML5 arcade games — Pong, Pac-Man, Tetris, Donkey Kong, and six others — from a text specification alone. No human touched the code. No libraries. No frameworks. Just a prompt in, a game out.

The result is the largest public benchmark of AI-generated creative software: 580 builds across two rounds of testing, each scored by a panel of three independent AI judges. Every game is playable in your browser.

1. Executive Summary

The headline: The best AI models can build a playable retro game in a single pass. The worst produce broken output that no amount of budget can fix. The gap between them is enormous — and price doesn't predict quality.

What "quality" means here: Every score in this report is a consensus AI-judge rating of the generated code — three frontier models reading each build statically across five dimensions. No human playtesting validates these scores (yet). When this report says Gemini Flash delivers "90% of Sonnet's quality," that means 90% by AI-judge consensus on static code, not human experience of playing the game.

Eight key findings from 580 builds:

  1. Sonnet and GPT-5.4 are tied at the top (8.62 vs 8.58 V1 consensus) — the choice between them is pricing, not quality.
  2. Gemini Flash is good value for volume work. 7.74 quality at ~5× cheaper output tokens than Sonnet ($3 vs $15 / M) and ~6× cheaper input ($0.50 vs $3 / M).
  3. Opus is ~1.7× Sonnet's per-token rate ($25/M vs $15/M output via PaleBlueDot — same ratio on input). Per-token gap is small. The reason Opus dominates the bill is call length and context depth, not headline rate. Opus and Sonnet together drove ~$274 of the ~$377 total builder API spend across V1 + V2 (~73%).
  4. o3-mini is dead last (5.23) — but it's also the oldest (Jan 2025) and smallest model tested, so we can't isolate whether architecture, age, or size is responsible.
  5. Game complexity is the primary quality driver. Snake averages 8.72; Donkey Kong averages 6.09. A 2.63-point gap that no model choice closes.
  6. Detailed specs don't help — and may hurt. V2 3-judge scoring confirms: control (game name only) averaged 6.86 vs detailed specs at 6.70. Most builders scored worse with specs.
  7. V2 reshuffles the builder rankings. Opus rises to #1 (7.77) under strict session isolation; several smaller models drop. Rankings depend heavily on evaluation methodology.
  8. AI judges disagree — and the panel matters. V1's Gemini Flash inflated scores by ~1.8 points. V2's Sonnet/Gemini Pro/GPT-5.4 panel shows tighter agreement (MAD ~0.3) but the two panels can't be directly compared.

2. The Experiment

What we built

Each build is a single self-contained HTML file: HTML5 Canvas, JavaScript, zero external dependencies. No CDNs, no images, no sound files. The game must be playable by opening the file in a browser — nothing else.
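To make the constraint concrete, here is a minimal sketch of the deliverable format: one file, Canvas rendering, a requestAnimationFrame loop, no external assets. It is illustrative only, not one of the benchmark's actual builds.

```html
<!DOCTYPE html>
<html>
<head><meta charset="utf-8"><title>Pong (skeleton)</title></head>
<body style="margin:0;background:#000">
<canvas id="game" width="800" height="600"></canvas>
<script>
// Everything lives in this one file: no CDNs, no images, no sound files.
const ctx = document.getElementById('game').getContext('2d');
let ball = { x: 400, y: 300, vx: 4, vy: 3 };

function update() {
  ball.x += ball.vx;
  ball.y += ball.vy;
  if (ball.y < 0 || ball.y > 600) ball.vy *= -1; // bounce off top/bottom
}

function draw() {
  ctx.fillStyle = '#000';
  ctx.fillRect(0, 0, 800, 600);
  ctx.fillStyle = '#fff';
  ctx.fillRect(ball.x - 5, ball.y - 5, 10, 10); // square "ball", retro style
}

function loop() {
  update();            // a real build adds paddles, input handling, scoring
  draw();
  requestAnimationFrame(loop);
}
loop();
</script>
</body>
</html>
```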

We chose ten classic arcade games, ordered roughly by implementation complexity:

| Game | Complexity | Key Challenge |
|---|---|---|
| Pong | Low | Paddle-angle physics, AI opponent |
| Snake | Low | Collision detection, tail growth |
| Breakout | Low-Med | Ball-brick collisions, level progression |
| Space Invaders | Medium | Formation movement, bullet patterns |
| Tetris | Medium | Piece rotation, line clearing, drop preview |
| Asteroids | Medium | Inertial physics, screen wrapping, splitting |
| Galaga | Hard | Swooping formations, ship capture mechanic |
| Frogger | Hard | Multi-lane timing, log-riding physics |
| Pac-Man | Hard | Ghost AI (4 distinct behaviours), maze rendering |
| Donkey Kong | Very Hard | Platform physics, barrel AI, level structure |

Two rounds

V1 (182 builds, 8 runs) used a template-based specification: every planner filled the same 12-section form before handing off to a builder model. This produced a clear builder-model ranking, but its planner comparison was flawed — it measured how well models fill a template, not how well they plan.

V2 (400 builds, 5 runs) fixed this with unconstrained specifications. Five planners (Opus, GPT-5.4, Gemini Pro, GLM-5, and a no-spec control) were told: "Write a spec however you think will best serve the developer. No format is required." Eight builder models then built each game from each planner's spec — or, in the control condition, from the game's name alone.

How we scored

Every build was evaluated by three independent AI judges (GPT-5.4, Opus, and Gemini Flash in V1; GPT-5.4, Gemini Pro, and Sonnet in V2) across five dimensions:

Consensus score = arithmetic mean of all three judges.

3. Results at a Glance

V1 Leaderboard — Builder Model Rankings

| Rank | Model | Consensus | GPT-5.4 Judge | Opus Judge | Min | Max | n | Tier |
|---|---|---|---|---|---|---|---|---|
| 1 | Sonnet | 8.62 | 8.01 | 8.03 | 7.33 | 9.27 | 18 | Premium |
| 2 | GPT-5.4 | 8.58 | 7.98 | 8.04 | 7.53 | 9.20 | 20 | Premium |
| 3 | Opus | 8.22 | 7.55 | 7.42 | 6.17 | 9.07 | 26 | Premium |
| 4 | GPT-5.4 Mini | 7.86 | 7.25 | 6.92 | 5.37 | 9.13 | 20 | Budget |
| 5 | Gemini Flash | 7.74 | 6.92 | 6.89 | 4.47 | 9.03 | 20 | Budget |
| 6 | GPT-5.4 Nano | 7.53 | 6.74 | 6.40 | 2.57 | 9.07 | 20 | Budget |
| 7 | Haiku | 7.09 | 6.09 | 5.96 | 4.93 | 8.60 | 20 | Budget |
| 8 | Gemini Pro | 6.78 | 5.87 | 6.18 | 1.93 | 9.13 | 20 | Mid |
| 9 | o3-mini | 5.23 | 4.88 | 3.84 | 3.73 | 7.00 | 18 | Mid |

Note: Gemini Flash judge inflates all consensus scores by ~1.8 points. For model comparison, the GPT-5.4/Opus judge columns are the cleaner signal. Full calibration analysis in Section 8.

Model Performance — Consensus Score (V1, 182 builds)

Game Difficulty Ranking

Average Score by Game — Hardest to Easiest

Snake (8.72) and Pong (8.46) are near-guaranteed successes for any model. Donkey Kong (6.09) and Pac-Man (6.13) separate the capable from the rest. The minimum scores tell the real story: Pac-Man's floor of 0.57 and Donkey Kong's 1.27 indicate models producing completely non-functional implementations.

Exhibit · Play it yourself
Complexity drives quality — same judges, different games
The best Snake build scored 9.27. The best Donkey Kong build scored 8.00 — and that's the ceiling, not the average (6.09). Click into either game and try to play. Snake feels complete. Donkey Kong, even at its best, feels fragile. The training data for Snake is dense; for Donkey Kong it's thin.
Snake · Sonnet (V1) 9.27 ↗ open
Donkey Kong · GPT-5.4 (V1) 8.00 ↗ open

4. What We Learned

Finding 1: The premium tier is a statistical tie

Sonnet (8.62) and GPT-5.4 (8.58) are separated by 0.04 points — well within noise. GPT-5.4 has lower published token pricing (~30% less). The choice between them is ecosystem and pricing, not quality. Both consistently produce playable, visually reasonable games with functioning controls.

Finding 2: Opus costs more and V1/V2 disagree on its quality

Opus is ~1.7× Sonnet's per-token rate at PaleBlueDot gateway rates — same ratio on input and output ($25/M vs $15/M output; $5/M vs $3/M input). Per-token gap is modest. Where Opus actually dominates the bill is call length and context depth: across V1 + V2 builds, Opus drew $145 of the ~$377 builder API spend — the most of any single model. V1 placed Opus third (8.22, behind Sonnet at 8.62 and GPT-5.4 at 8.58); V2 placed it first (7.77) once session isolation was enforced. The two rounds don't agree, and without human playtesting we can't say which ranking is closer to real-world quality.

One data point worth noting: as a judge, Opus rated its own builds at 7.42 — the second-lowest self-assessment in the V1 panel. That suggests quality awareness rather than grade inflation.

Finding 3: Budget models are shockingly capable

Gemini Flash scores 7.74 — within 0.88 points of the top — at the lowest token rates in the test. Per-token, Flash is ~5× cheaper than Sonnet on output ($3 vs $15 / M) and ~8× cheaper than Opus ($3 vs $25 / M). For volume generation where small quality drops are acceptable, Flash is the rational starting point.

Finding 4: o3-mini is old and small — not a fair architecture test

o3-mini scored 5.23 — dead last by 1.55 points, with the worst score on every single dimension. Its keyboard controls score of 4.33 means basic input handling is broken in most builds.

Rather than a verdict on reasoning architectures in general, this is probably a fair verdict on this specific model. o3-mini was released on 31 January 2025 — roughly 15 months before these builds ran in March–April 2026. Every other model in the lineup is from a newer generation: Sonnet 4.6, Opus 4.7, GPT-5.4, GLM-5, and current Gemini Pro/Flash were all released later. o3-mini is also the smallest and cheapest model OpenAI offered in that generation. Older and lighter is enough to explain last place; we can't separate architecture from age and size with a single data point.

Finding 5: Visual fidelity is universally weak

Across all nine models, Visual Fidelity is the lowest-scoring dimension. Even Sonnet (best overall) scores 8.06 on visuals vs 9.11 on controls. AI models systematically deprioritise CSS/canvas aesthetics in favour of functional correctness — the games work, but they don't look like the originals.

Finding 6: Your spec might not matter

V2's most counterintuitive result — now confirmed with full 3-judge QA: The control condition — where builders received only the game's name and build rules, with no specification at all — scored higher than every detailed planner spec.

Opus wrote specifications averaging 29,551 bytes per game. GLM-5 wrote 16,205 bytes. Gemini Pro wrote 6,696 bytes. The control condition spec was 90 bytes: just the game name and "CONTROL CONDITION: No planner spec."

| Planner | Avg Spec Size | Avg Score (3-Judge) | Median | Min | Max | n |
|---|---|---|---|---|---|---|
| Control (no spec) | 90 bytes | 6.86 | 7.30 | 2.33 | 9.00 | 80 |
| Gemini Pro | 6,696 bytes | 6.82 | 7.17 | 1.05 | 9.03 | 80 |
| GPT-5.4 | 13,384 bytes | 6.81 | 7.33 | 0.50 | 9.17 | 80 |
| GLM-5 | 16,205 bytes | 6.78 | 7.18 | 1.03 | 8.97 | 80 |
| Opus | 29,551 bytes | 6.39 | 6.60 | 1.43 | 8.90 | 77 |

The most detailed planner (Opus, 29K bytes) produced the lowest average quality — 0.47 points below the control condition. The control's minimum score (2.33) is also the highest floor of any planner, suggesting that no-spec builds fail less catastrophically.

At the builder level, specs actively harm most models. Haiku loses 0.44. Sonnet loses 0.28. Only GPT-5.4 meaningfully benefits (+0.74). Opus shows zero effect either way. The per-builder table below has the full picture.

Exhibit · Play it yourself
Same builder, same game — a detailed spec crashed it
Haiku built Snake twice. Once given Opus's detailed planner spec (rules, collision logic, tuning parameters), it shipped something judges could barely rate — 1.43/10. Given only "# Snake — CONTROL CONDITION: No planner spec" it produced a proper playable Snake — 7.93/10. Smaller models can choke on instructions they can't fully internalise.
Snake · Haiku + Opus spec 1.43 ↗ open
Snake · Haiku + Control (no spec) 7.93 ↗ open

Finding 7: V2 builder rankings differ from V1

With V2's isolated-context, unconstrained-spec methodology and fresh 3-judge scoring panel (Sonnet, Gemini Pro, GPT-5.4 replacing V1's GPT-5.4, Opus, Gemini Flash), the builder rankings shifted:

| Rank | V2 Builder | V2 Score | V1 Equivalent | V1 Score |
|---|---|---|---|---|
| 1 | Opus | 7.77 | Opus | 8.22 |
| 2 | Gemini Pro | 7.45 | Gemini Pro | 6.78 |
| 3 | GPT-5.4 | 7.43 | GPT-5.4 | 8.58 |
| 4 | Sonnet | 7.40 | Sonnet | 8.62 |
| 5 | GLM-5 | 6.63 | new in V2 | — |
| 6 | Gemini Flash | 6.50 | Gemini Flash | 7.74 |
| 7 | Haiku | 5.65 | Haiku | 7.09 |
| 8 | GPT-5.4 Mini | 5.07 | GPT-5.4 Mini | 7.86 |

V2 Builder Rankings (3-Judge Static QA)

Opus rises from 3rd to 1st. Sonnet and GPT-5.4 drop from 1st-2nd to 3rd-4th. This likely reflects V2's stricter judges (no Gemini Flash inflation) and isolated context (removing Opus's V1 accumulated-context penalty). Because the lenient Gemini Flash judge is absent from the V2 panel, absolute V2 scores also sit lower than V1's across the board.

Finding 8: AI judges don't agree — and Flash was miscalibrated, not biased

Gemini Flash as a V1 judge rated nearly everything 9-10, with a total range of just 2.85 points across all models. GPT-5.4 showed a 3.13-point range; Opus showed 4.20 points. Gemini Flash rated o3-mini (the worst builder) at 6.97 — a score that would place it above Haiku by Gemini's measure, despite GPT-5.4 and Opus scoring it 4.88 and 3.84 respectively.

Importantly, this isn't vendor favouritism. Flash inflated all models roughly uniformly: it rated Sonnet builds +1.80 above the GPT-5.4/Opus consensus, Haiku builds +3.21, its own Gemini Flash builds +2.52, and Gemini Pro builds +2.27. Flash was no more generous to its own vendor's output than anyone else's — it simply lacks the discrimination to score meaningfully. The failure mode is poor calibration, not partiality.

V2 replaced Gemini Flash with Gemini Pro as a judge. The result: V2 scores are 1-2 points lower than V1 across the board, confirming that Gemini Flash's inflation was systematically elevating V1 consensus scores.

The practical lesson: never use a single AI judge for evaluation. Use at least two judges that agree on relative ordering, and be aware that changing your judge panel will shift absolute scores even if the underlying quality is unchanged.

Exhibit · Three judges, three different realities
The same build, scored by three AI judges
Haiku built Tetris. GPT-5.4 the judge gave it 4.4. Opus gave it 6.0. Gemini Flash gave it 9.8. That's a spread of 5.4 points — same code, same rubric. Play it yourself and decide who was closest to the truth. This pattern repeats across hundreds of builds.
Tetris · Haiku (V1 run1) GPT 4.4 · Opus 6.0 · Flash 9.8 ↗ open
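In practice that means aggregating a panel and flagging disagreement rather than trusting any single number. A minimal sketch of that aggregation, using the Tetris scores from the exhibit above (judge names and the dispute threshold are illustrative assumptions, not the benchmark's actual pipeline):

```javascript
// Panel aggregation: consensus is the mean, but a wide spread means the
// judges disagree and the consensus should be treated with suspicion.
function consensus(scores) {
  const values = Object.values(scores);
  const mean = values.reduce((a, b) => a + b, 0) / values.length;
  const spread = Math.max(...values) - Math.min(...values);
  return {
    mean: +mean.toFixed(2),
    spread: +spread.toFixed(2),
    disputed: spread > 2.0, // threshold is an assumption, tune per rubric
  };
}

console.log(consensus({ "gpt-5.4": 4.4, "opus": 6.0, "gemini-flash": 9.8 }));
// -> { mean: 6.73, spread: 5.4, disputed: true }
```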

5. The Spec Question — Does Planning Help?

V1 found a small planner effect: Opus-generated specs averaged 8.27 vs GPT-generated specs at 8.15 (delta +0.12). But V1's planner comparison was methodologically flawed — both planners filled the same template, so we measured template-filling skill, not planning quality.

V2 removed the template entirely. The planner prompt was:

"You are writing a game specification for a developer who will implement [GAME NAME]
as a single self-contained HTML5 Canvas file with no external dependencies.

The developer is skilled but has never played this game. Write them a spec that
gives them everything they need to build a complete, playable version.

No format is required. Write it however you think will best serve the developer."

The result was dramatic structural diversity. For Pong alone:

The qualitative difference is striking. The quantitative result, confirmed with full 3-judge static QA, is that it didn't matter — and in many cases, detailed specs actively hurt.

Spec benefit by builder model

The effect varies dramatically by builder. Most models scored worse when given specs:

| Builder | Control Avg | With-Spec Avg | Benefit | Verdict |
|---|---|---|---|---|
| GPT-5.4 | 6.84 | 7.58 | +0.74 | Meaningfully helped |
| Gemini Pro | 7.38 | 7.47 | +0.09 | Negligible |
| Gemini Flash | 6.49 | 6.50 | +0.01 | No effect |
| Opus | 7.77 | 7.77 | 0.00 | No effect |
| GLM-5 | 6.84 | 6.58 | -0.26 | Hurt |
| Sonnet | 7.62 | 7.34 | -0.28 | Hurt |
| Haiku | 6.00 | 5.56 | -0.44 | Hurt |
| GPT-5.4 Mini | 5.94 | 4.86 | -1.09 | Seriously hurt |

A pattern worth noting: smaller models tend to do worse when given detailed specs than larger ones. Spec complexity should probably be calibrated to builder capability — simpler models may perform better with simpler prompts. For well-known games, model training data already includes sufficient implementation knowledge, so a detailed spec adds signal that's redundant with what the model already knows, and for weaker models the extra complexity becomes noise.

V2 Planner Scores — Control vs Detailed Specs

6. The Money — Cost Per Build

V1 Score vs Audited Cost Per Build (PaleBlueDot per-call billing)
| Model | V1 Score | V2 Score | V1 $/build | V2 $/build | Tier |
|---|---|---|---|---|---|
| Sonnet | 8.62 | 7.40 | $3.30 | $1.39 | Premium |
| GPT-5.4 | 8.58 | 7.43 | $0.47 | $0.25 | Premium |
| Opus | 8.22 | 7.77 | $2.60 | $1.59 | Premium |
| Gemini Pro | 6.78 | 7.45 | $0.34 | $0.67 | Mid |
| Gemini Flash | 7.74 | 6.50 | $0.04 | $0.11 | Budget |
| Haiku | 7.09 | 5.65 | $0.56 | $0.34 | Budget |
| GLM-5 | — | 6.63 | — | $0.10 | Budget |
| GPT-5.4 Mini | 7.86 | 5.07 | —* | $0.01 | Budget |
| o3-mini | 5.23 | — | $0.07 | — | Mid |

Per-build costs are audited — derived from PaleBlueDot per-call billing divided by completed-build counts per model per round. They cover the builder-call API spend only, not orchestration overhead (which roughly doubled the bill — see below). * GPT-5.4 Mini's V1 builds went via a different code path (runs_7_8_build_results.json) and didn't surface in the per-game-per-model audit; per-build figure not directly computable from the same source.

At the PaleBlueDot gateway rates we paid: Flash is ~5× cheaper than Sonnet on output ($3 vs $15 / M tokens) and ~8× cheaper than Opus ($3 vs $25 / M). Opus is ~1.7× pricier per output token than Sonnet ($25 vs $15 / M; $5 vs $3 input). Without human playtesting to validate whether AI-judge score differences map to real-world quality, we've stopped short of a score-per-dollar headline.
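For intuition, the per-token arithmetic looks like this. The rates are the gateway rates quoted above; the token counts are hypothetical, purely to show the calculation, not measured from the benchmark.

```javascript
// Cost of one builder call in USD, given gateway rates in $ per million tokens.
const rates = {
  sonnet: { input: 3.0, output: 15.0 },
  opus:   { input: 5.0, output: 25.0 },
  flash:  { input: 0.5, output: 3.0  },
};

function callCost(model, inputTokens, outputTokens) {
  const r = rates[model];
  return (inputTokens / 1e6) * r.input + (outputTokens / 1e6) * r.output;
}

// Hypothetical build: a 10k-token spec in, 40k tokens of HTML/JS out.
console.log(callCost('sonnet', 10_000, 40_000)); // ≈ $0.63
console.log(callCost('flash',  10_000, 40_000)); // ≈ $0.13
```

The audited per-build figures in the table are higher than a single-call estimate like this because builds often span multiple long calls with accumulated context, which is also why Opus and Sonnet dominate the bill despite modest per-token premiums.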

Per-builder spend by version

These are the actual builder-model API costs per round, audited from PaleBlueDot per-call billing. They cover the build-only portion (orchestration, judging, rebuilds are separate lines below).

| Builder | V1 ($) | V2 ($) | What changed |
|---|---|---|---|
| Opus | 67.57 | 77.81 | Top spend both rounds — long calls, deep context |
| Sonnet | 59.33 | 69.27 | Close second — high call frequency |
| Gemini Pro | 6.81 | 32.64 | ~5× jump — V2 added Pro as a planner role |
| Haiku | 11.20 | 16.80 | Modest growth, consistent budget tier |
| GPT-5.4 | 9.43 | 12.57 | Stable across rounds |
| Gemini Flash | 0.89 | 5.54 | Tiny in V1; used more heavily as a builder in V2 |
| GLM-5 | — | 4.78 | New in V2 only |
| GPT-5.4 Mini | — | 0.61 | V1 builds billed via a separate code path (see note above) |
| o3-mini | 1.32 | — | V1 only |
| Build-only total | $157 | $220 | Per-game breakdown below |

Why V2's per-builder spend is higher: V2 used a different planner-builder mix (5 planner roles instead of 2 fixed templates), longer prompts (free-form specs vs the 12-section template), deeper per-call context despite fewer builders per game (8 vs 9), and isolated context that reset on each call (no cross-game caching).

Per-game build cost (V1 + V2 combined)

| Game | V1 ($) | V2 ($) | Total ($) | Notes |
|---|---|---|---|---|
| Pong | 20.38 | 28.14 | 48.52 | Most expensive overall — Sonnet-heavy in V2 ($14.43 alone) |
| Pac-Man | 25.44 | 21.21 | 46.65 | V1 high from Opus rebuild attempts ($17.10 in a single cell) |
| Tetris | 16.41 | 23.89 | 40.30 | Opus-heavy V2 ($11.21) |
| Galaga | 14.64 | 22.96 | 37.60 | Sonnet + Opus dominant in V2 |
| Space Invaders | 14.23 | 24.22 | 38.45 | Opus highest single cell ($10.11) in V2 |
| Asteroids | 13.23 | 21.32 | 34.55 | Steady distribution |
| Donkey Kong | 11.57 | 20.85 | 32.42 | Cheaper than expected given complexity — most calls hit ceilings fast |
| Snake | 11.24 | 20.47 | 31.71 | Quick to converge — short calls |
| Breakout | 18.22 | 18.99 | 37.21 | Most-balanced V1/V2 — well-known mechanics |
| Frogger | 11.21 | 17.96 | 29.17 | Cheapest game overall |

The complexity-vs-quality finding (Section 3) doesn't map cleanly onto cost: Donkey Kong was actually one of the cheapest games to attempt, despite being the lowest-scoring. Most builders churned out a quick failure rather than a long, expensive attempt. Pong and Pac-Man cost more because builders kept trying — long sessions, multiple iterations, more context accumulation.

What it actually cost — fully audited

A token-level audit of every PaleBlueDot call from 30 March – 15 April separates the games work from concurrent activity on the same API key (cron jobs, daily research agents, side projects). Headline:

| Slice | Cost (USD) | What it covers |
|---|---|---|
| V1 build | $332 | 8 models × 10 games, template specs, parallel dispatch |
| V2 build | $399 | 5 planner roles × 8 builders × 10 games, free-form specs, isolated context |
| V2 inline judging (bench-sonnet) | $202 | Long-running judging session — single biggest line item |
| V2 static 3-judge QA | $50 | Sonnet + Gemini Pro + GPT-5.4 panel, 1.5 hours |
| V1 rebuilds (Pac-Man + Donkey Kong) | $40 | 3 rebuild events, attributed to V2 phase |
| Window edges + cleanup | $56 | Spillover spend at the boundaries of build / QA windows |
| Games total (audited) | $1,138 | Everything attributable to V1+V2 work |
| Concurrent orchestration noise | $660 | Cron, daily agents, side activities sharing the API key |
| Grand total during the period | $1,798 | Full PaleBlueDot spend, 30 Mar – 15 Apr |

The games-attributable spend is ~$1,138. Another ~$660 of unrelated automation (cron jobs, daily research agents, side projects) ran on the same API key during the period, bringing the gross PaleBlueDot bill to ~$1,798. The two are separated above because the wider number doesn't measure the cost of building or evaluating the games.

V2 cost 2.4× more than V1 — $806 vs $332 — driven by three real differences in scope, not by wall-clock duration (V2 ran across more days largely because of my availability, not because the work itself was 2× heavier):

Richer methodology costs sharply more: the 2.4× cost ratio tracks the difference in methodology between the rounds, not the difference in elapsed time.

What we'd do differently to spend less

Three findings worth flagging up front, separated from the model-quality narrative:

Caveat: the audit relies on cross-referencing per-call CSV billing with bench-* JSONL timestamps. PaleBlueDot doesn't tag calls with project/agent metadata, so attribution between "games work" and "concurrent automation" is inferred from time windows, not direct labelling. The breakdown is directional, not exact to the cent.
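Mechanically, that attribution is a time-window join between the billing export and the benchmark session logs. A rough sketch of the idea (column names and the window format are assumptions, not PaleBlueDot's actual export schema):

```javascript
// Split billed calls into "games work" vs "other automation" by whether their
// timestamp falls inside a benchmark session window derived from bench-* logs.
function attribute(billingRows, gameWindows) {
  const inWindow = (t) => gameWindows.some(({ start, end }) => t >= start && t <= end);

  let games = 0, other = 0;
  for (const row of billingRows) {
    const t = Date.parse(row.timestamp);    // assumed column name
    if (inWindow(t)) games += row.cost_usd; // assumed column name
    else other += row.cost_usd;
  }
  return { games, other };
}
```

Anything inside a window counts as games spend; everything else lands in the "concurrent noise" bucket, which is exactly why the split is directional rather than exact.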

7. Methodology

V1 Protocol (182 builds)

V2 Protocol (400 builds)

Build Environment

All builds executed via the OpenClaw agent framework, which wraps API calls to each model provider. V1 builds ran fully in parallel across models and games. V2 builds ran in a mix of parallel and sequential batches depending on provider rate limits. Hard timeout: 15 minutes per build attempt. Output: single index.html file per build.
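A rough sketch of what enforcing that contract looks like (the builder call is a hypothetical stand-in, not OpenClaw's real API):

```javascript
// Illustrative dispatch wrapper. `builderCall` is whatever function sends the
// spec to a builder model and resolves with the generated HTML; it is assumed here.
import { writeFile, mkdir } from 'node:fs/promises';

const TIMEOUT_MS = 15 * 60 * 1000; // hard 15-minute ceiling per build attempt

async function runBuild(builderCall, model, game, spec) {
  const timeout = new Promise((_, reject) =>
    setTimeout(() => reject(new Error('build timed out')), TIMEOUT_MS));
  try {
    const html = await Promise.race([builderCall(model, game, spec), timeout]);
    const dir = `builds/${model}/${game}`;
    await mkdir(dir, { recursive: true });
    await writeFile(`${dir}/index.html`, html); // one self-contained file per build
    return { model, game, status: 'ok' };
  } catch (err) {
    return { model, game, status: 'failed', error: err.message };
  }
}
```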

8. Judge Calibration

Three judges independently scored every V1 build. Their agreement — or lack thereof — is critical to interpreting the results.

Judge Scores by Builder Model — Showing Gemini Flash Inflation
| Builder | GPT-5.4 Judge | Opus Judge | Gemini Judge | GPT/Opus Avg | Gemini Inflation |
|---|---|---|---|---|---|
| Sonnet | 8.01 | 8.03 | 9.82 | 8.02 | +1.80 |
| GPT-5.4 | 7.98 | 8.04 | 9.72 | 8.01 | +1.71 |
| Opus | 7.55 | 7.42 | 9.68 | 7.49 | +2.19 |
| GPT-5.4 Mini | 7.25 | 6.92 | 9.43 | 7.09 | +2.34 |
| Gemini Flash | 6.92 | 6.89 | 9.43 | 6.91 | +2.52 |
| GPT-5.4 Nano | 6.74 | 6.40 | 9.46 | 6.57 | +2.89 |
| Haiku | 6.09 | 5.96 | 9.24 | 6.03 | +3.21 |
| Gemini Pro | 5.87 | 6.18 | 8.30 | 6.03 | +2.27 |
| o3-mini | 4.88 | 3.84 | 6.97 | 4.36 | +2.61 |

Gemini Flash judge inflates scores by an average of +2.39 points over the GPT-5.4/Opus consensus. The inflation is worse for lower-quality builds — Haiku builds get +3.21 points of inflation vs Sonnet builds at +1.80 — meaning Gemini Flash actively compresses the quality scale, making bad builds look mediocre and mediocre builds look excellent.
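The inflation column is simply the Gemini judge's score minus the GPT-5.4/Opus average for the same builds; recomputing a few rows from the table above confirms the figures:

```javascript
// Inflation = Gemini Flash's score minus the mean of the other two judges.
const rows = [
  { builder: 'Sonnet',  gpt: 8.01, opus: 8.03, gemini: 9.82 },
  { builder: 'Haiku',   gpt: 6.09, opus: 5.96, gemini: 9.24 },
  { builder: 'o3-mini', gpt: 4.88, opus: 3.84, gemini: 6.97 },
];

for (const r of rows) {
  const baseline = (r.gpt + r.opus) / 2;
  console.log(r.builder, (r.gemini - baseline).toFixed(2));
}
// Sonnet ≈ +1.80, Haiku ≈ +3.21, o3-mini ≈ +2.61 (matching the table)
```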

10. Limitations and Future Work

Future work

11. Practical Recommendations

Based on 580 builds, here's when to use what:

| Use Case | Recommended Model | Why |
|---|---|---|
| Maximum quality, cost no object | Sonnet, GPT-5.4, or Opus | V1: Sonnet/GPT-5.4 tied. V2: Opus tops builder ranking (7.77). Depends on pipeline. |
| High volume generation | Gemini Flash | ~5× cheaper output tokens than Sonnet via PaleBlueDot. 0.88-point V1 quality drop. |
| Budget with decent quality | Haiku or GPT-5.4 Nano | Haiku 7.09 (V1). Test smaller models in your pipeline before committing — V2 showed several dropped significantly under stricter conditions. |
| Exploring / prototyping | GPT-5.4 Nano or Gemini Flash | Cheapest tokens, acceptable output for iteration. |
| Accumulated context pipelines | Opus | Only model that measurably benefits from seeing prior work (+0.95 points in V1). |

And what to avoid: