580 retro arcade games. 9 AI models. Zero human intervention.
We asked nine AI models to build complete, playable HTML5 arcade games — Pong, Pac-Man, Tetris, Donkey Kong, and six others — from a text specification alone. No human touched the code. No libraries. No frameworks. Just a prompt in, a game out.
The result is the largest public benchmark of AI-generated creative software: 580 builds across two rounds of testing, evaluated by three independent AI judges and human playtesting. Every game is playable in your browser.
Eight key findings from 580 builds:
Each build is a single self-contained HTML file: HTML5 Canvas, JavaScript, zero external dependencies. No CDNs, no images, no sound files. The game must be playable by opening the file in a browser — nothing else.
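That constraint is easy to verify mechanically. A minimal sketch of such a check — not the harness's actual validation code, and the regex is deliberately rough:

```python
import re
import sys

# Tags whose src/href would pull in an external resource and break the
# "single self-contained file" rule. data: URIs and fragment links pass.
EXTERNAL = re.compile(
    r"<(?:script|link|img|audio|iframe)\b[^>]*\b(?:src|href)\s*=\s*['\"](?!data:|#)",
    re.IGNORECASE,
)

def is_self_contained(path: str) -> bool:
    """True if the HTML file references no external scripts, styles, or media."""
    html = open(path, encoding="utf-8").read()
    return EXTERNAL.search(html) is None

if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "index.html"
    status = "self-contained" if is_self_contained(path) else "has external references"
    print(f"{path}: {status}")
```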
We chose ten classic arcade games, ordered roughly by implementation complexity:
| Game | Complexity | Key Challenge |
|---|---|---|
| Pong | Low | Paddle-angle physics, AI opponent |
| Snake | Low | Collision detection, tail growth |
| Breakout | Low-Med | Ball-brick collisions, level progression |
| Space Invaders | Medium | Formation movement, bullet patterns |
| Tetris | Medium | Piece rotation, line clearing, drop preview |
| Asteroids | Medium | Inertial physics, screen wrapping, splitting |
| Galaga | Hard | Swooping formations, ship capture mechanic |
| Frogger | Hard | Multi-lane timing, log-riding physics |
| Pac-Man | Hard | Ghost AI (4 distinct behaviours), maze rendering |
| Donkey Kong | Very Hard | Platform physics, barrel AI, level structure |
V1 (182 builds, 8 runs) used a template-based specification: every planner filled the same 12-section form before handing off to a builder model. This revealed strong builder-model rankings but its planner comparison was flawed — it measured how models fill a template, not planning quality.
V2 (400 builds, 5 runs) fixed this with unconstrained specifications. Five planners (Opus, GPT-5.4, Gemini Pro, GLM-5, and a no-spec control) were told: "Write a spec however you think will best serve the developer. No format is required." Eight builder models then built each game from each planner's spec — or, in the control condition, from the game's name alone.
Every build was evaluated by three independent AI judges (GPT-5.4, Opus, and Gemini Flash in V1; GPT-5.4, Gemini Pro, and Sonnet in V2) across five dimensions:
Consensus score = arithmetic mean of all three judges.
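Concretely, the scoring is two nested means. A sketch with hypothetical dimension labels and scores — the real five dimension names, and any per-judge aggregation details beyond a plain mean, are assumptions here:

```python
from statistics import mean

def consensus(judge_scores: dict[str, dict[str, float]]) -> float:
    """Consensus = arithmetic mean of the judges' overall scores.

    Each judge's overall score is taken as the mean of its five dimension
    scores (0-10) -- an assumption about the per-judge aggregation.
    """
    return mean(mean(dims.values()) for dims in judge_scores.values())

# Hypothetical scores for one V1 build (dimension names are placeholders):
build = {
    "gpt-5.4":      {"gameplay": 8, "controls": 9, "visuals": 7, "completeness": 8, "code": 8},
    "opus":         {"gameplay": 8, "controls": 8, "visuals": 7, "completeness": 7, "code": 8},
    "gemini-flash": {"gameplay": 9, "controls": 9, "visuals": 9, "completeness": 9, "code": 9},
}
print(round(consensus(build), 2))  # 8.2
```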
| Rank | Model | Consensus | GPT-5.4 Judge | Opus Judge | Min | Max | n | Tier |
|---|---|---|---|---|---|---|---|---|
| 1 | Sonnet | 8.62 | 8.01 | 8.03 | 7.33 | 9.27 | 18 | Premium |
| 2 | GPT-5.4 | 8.58 | 7.98 | 8.04 | 7.53 | 9.20 | 20 | Premium |
| 3 | Opus | 8.22 | 7.55 | 7.42 | 6.17 | 9.07 | 26 | Premium |
| 4 | GPT-5.4 Mini | 7.86 | 7.25 | 6.92 | 5.37 | 9.13 | 20 | Budget |
| 5 | Gemini Flash | 7.74 | 6.92 | 6.89 | 4.47 | 9.03 | 20 | Budget |
| 6 | GPT-5.4 Nano | 7.53 | 6.74 | 6.40 | 2.57 | 9.07 | 20 | Budget |
| 7 | Haiku | 7.09 | 6.09 | 5.96 | 4.93 | 8.60 | 20 | Budget |
| 8 | Gemini Pro | 6.78 | 5.87 | 6.18 | 1.93 | 9.13 | 20 | Mid |
| 9 | o3-mini | 5.23 | 4.88 | 3.84 | 3.73 | 7.00 | 18 | Mid |
Note: the Gemini Flash judge rates builds well above the other two judges (+2.39 points on average over the GPT-5.4/Opus consensus), which inflates every consensus score. For model comparison, the GPT-5.4/Opus judge columns are the cleaner signal. Full calibration analysis in Section 8.
Snake (8.72) and Pong (8.46) are near-guaranteed successes for any model. Donkey Kong (6.09) and Pac-Man (6.13) separate the capable from the rest. The minimum scores tell the real story: Pac-Man's floor of 0.57 and Donkey Kong's 1.27 indicate models producing completely non-functional implementations.
Sonnet (8.62) and GPT-5.4 (8.58) are separated by 0.04 points — well within noise. GPT-5.4 has lower published token pricing (~30% less). The choice between them is ecosystem and pricing, not quality. Both consistently produce playable, visually reasonable games with functioning controls.
Opus is ~1.7× Sonnet's per-token rate at PaleBlueDot gateway rates — same ratio on input and output ($25/M vs $15/M output; $5/M vs $3/M input). Per-token gap is modest. Where Opus actually dominates the bill is call length and context depth: across V1 + V2 builds, Opus drew $145 of the ~$377 builder API spend — the most of any single model. V1 placed Opus third (8.22, behind Sonnet at 8.62 and GPT-5.4 at 8.58); V2 placed it first (7.77) once session isolation was enforced. The two rounds don't agree, and without human playtesting we can't say which ranking is closer to real-world quality.
One data point worth noting: as a judge, Opus rated its own builds at 7.42 — the second-lowest self-assessment in the V1 panel. That suggests quality awareness rather than grade inflation.
Gemini Flash scores 7.74 — within 0.88 points of the top — at the lowest token rates in the test. Per-token, Flash is ~5× cheaper than Sonnet on output ($3 vs $15 / M) and ~8× cheaper than Opus ($3 vs $25 / M). For volume generation where small quality drops are acceptable, Flash is the rational starting point.
o3-mini scored 5.23 — dead last by 1.55 points, with the worst score on every single dimension. Its keyboard controls score of 4.33 means basic input handling is broken in most builds.
Rather than a verdict on reasoning architectures in general, this is probably a fair verdict on this specific model. o3-mini was released on 31 January 2025 — roughly 15 months before these builds ran in March–April 2026. Every other model in the lineup is from a newer generation: Sonnet 4.6, Opus 4.7, GPT-5.4, GLM-5, and current Gemini Pro/Flash were all released later. o3-mini is also the smallest and cheapest model OpenAI offered in that generation. Older and lighter is enough to explain last place; we can't separate architecture from age and size with a single data point.
Across all nine models, Visual Fidelity is the lowest-scoring dimension. Even Sonnet (best overall) scores 8.06 on visuals vs 9.11 on controls. AI models systematically deprioritise CSS/canvas aesthetics in favour of functional correctness — the games work, but they don't look like the originals.
Opus wrote specifications averaging 29,551 bytes per game. GLM-5 wrote 16,205 bytes. Gemini Pro wrote 6,696 bytes. The control condition spec was 90 bytes: just the game name and "CONTROL CONDITION: No planner spec."
| Planner | Avg Spec Size | Avg Score (3-Judge) | Median | Min | Max | n |
|---|---|---|---|---|---|---|
| Control (no spec) | 90 bytes | 6.86 | 7.30 | 2.33 | 9.00 | 80 |
| Gemini Pro | 6,696 bytes | 6.82 | 7.17 | 1.05 | 9.03 | 80 |
| GPT-5.4 | 13,384 bytes | 6.81 | 7.33 | 0.50 | 9.17 | 80 |
| GLM-5 | 16,205 bytes | 6.78 | 7.18 | 1.03 | 8.97 | 80 |
| Opus | 29,551 bytes | 6.39 | 6.60 | 1.43 | 8.90 | 77 |
The most detailed planner (Opus, 29K bytes) produced the lowest average quality — 0.47 points below the control condition. The control's minimum score (2.33) is also the highest floor of any planner, suggesting that no-spec builds fail less catastrophically.
At the builder level, specs actively harm most models. Haiku loses 0.44. Sonnet loses 0.28. Only GPT-5.4 meaningfully benefits (+0.74). Opus shows zero effect either way. The per-builder table below has the full picture.
With V2's isolated-context, unconstrained-spec methodology and fresh 3-judge scoring panel (Sonnet, Gemini Pro, GPT-5.4 replacing V1's GPT-5.4, Opus, Gemini Flash), the builder rankings shifted:
| Rank | V2 Builder | V2 Score | V1 Equivalent | V1 Score |
|---|---|---|---|---|
| 1 | Opus | 7.77 | Opus | 8.22 |
| 2 | Gemini Pro | 7.45 | Gemini Pro | 6.78 |
| 3 | GPT-5.4 | 7.43 | GPT-5.4 | 8.58 |
| 4 | Sonnet | 7.40 | Sonnet | 8.62 |
| 5 | GLM-5 | 6.63 | new in V2 | — |
| 6 | Gemini Flash | 6.50 | Gemini Flash | 7.74 |
| 7 | Haiku | 5.65 | Haiku | 7.09 |
| 8 | GPT-5.4 Mini | 5.07 | GPT-5.4 Mini | 7.86 |
Opus rises from 3rd to 1st. Sonnet and GPT-5.4 drop from 1st-2nd to 3rd-4th. This likely reflects two changes at once: the stricter V2 judge panel — dropping the lenient Gemini Flash judge pulls every V2 score down relative to V1 — and isolated context, which removed the accumulated-context penalty Opus carried in V1.
Gemini Flash as a V1 judge rated nearly everything 9-10, with a total range of just 2.85 points across all models; GPT-5.4 showed a 3.13-point range, Opus 4.20. Gemini Flash rated o3-mini (the worst builder) at 6.97 — higher than the 6.03 the GPT-5.4/Opus pair gave Haiku — despite GPT-5.4 and Opus scoring o3-mini at 4.88 and 3.84 respectively.
Importantly, this isn't vendor favouritism. Flash inflated every model's builds: Sonnet +1.80 above the GPT-5.4/Opus consensus, Haiku +3.21, its own Gemini Flash builds +2.52, Gemini Pro +2.27. Flash was no more generous to its own vendor's output than to anyone else's — it simply lacks the discrimination to score meaningfully. The failure mode is poor calibration, not partiality.
V2 replaced Gemini Flash with Gemini Pro on the judge panel. The result: most builders score 1-2 points lower in V2, confirming that Gemini Flash's leniency had been systematically elevating V1 consensus scores.
The practical lesson: never use a single AI judge for evaluation. Use at least two judges that agree on relative ordering, and be aware that changing your judge panel will shift absolute scores even if the underlying quality is unchanged.
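One concrete way to check "agree on relative ordering" is a rank correlation over per-builder judge averages. A self-contained sketch, using the V1 per-judge columns from the Section 8 table:

```python
def ranks(values):
    """Ranks, 1 = highest. Ties get adjacent ranks -- close enough for a sketch."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    out = [0] * len(values)
    for r, i in enumerate(order, start=1):
        out[i] = r
    return out

def spearman(a, b):
    """Spearman rank correlation: do two judges order the builders the same way?"""
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Per-builder V1 averages, Sonnet ... o3-mini (Section 8 table):
gpt_judge    = [8.01, 7.98, 7.55, 7.25, 6.92, 6.74, 6.09, 5.87, 4.88]
opus_judge   = [8.03, 8.04, 7.42, 6.92, 6.89, 6.40, 5.96, 6.18, 3.84]
gemini_judge = [9.82, 9.72, 9.68, 9.43, 9.43, 9.46, 9.24, 8.30, 6.97]

print(round(spearman(gpt_judge, opus_judge), 3))    # ~0.97: near-identical ordering
print(round(spearman(gpt_judge, gemini_judge), 3))  # ~0.95: ordering mostly survives; the scale doesn't
```

Ordering agreement is necessary but not sufficient: Flash largely preserves the ordering here while compressing the scale, which is exactly the calibration failure documented in Section 8.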
V1 found a small planner effect: Opus-generated specs averaged 8.27 vs GPT-generated specs at 8.15 (delta +0.12). But V1's planner comparison was methodologically flawed — both planners filled the same template, so we measured template-filling skill, not planning quality.
V2 removed the template entirely. The planner prompt was:
"You are writing a game specification for a developer who will implement [GAME NAME] as a single self-contained HTML5 Canvas file with no external dependencies. The developer is skilled but has never played this game. Write them a spec that gives them everything they need to build a complete, playable version. No format is required. Write it however you think will best serve the developer."
The result was dramatic structural diversity. For Pong alone, the five inputs ranged from tens of kilobytes of structured specification down to the 90-byte control spec, reproduced here in full:

`# Pong\n\nCONTROL CONDITION: No planner spec.`

The qualitative difference is striking. The quantitative result, confirmed with full 3-judge static QA, is that it didn't matter — and in many cases, detailed specs actively hurt.
The effect varies dramatically by builder. Most models scored worse when given specs:
| Builder | Control Avg | With-Spec Avg | Benefit | Verdict |
|---|---|---|---|---|
| GPT-5.4 | 6.84 | 7.58 | +0.74 | Meaningfully helped |
| Gemini Pro | 7.38 | 7.47 | +0.09 | Negligible |
| Gemini Flash | 6.49 | 6.50 | +0.01 | No effect |
| Opus | 7.77 | 7.77 | 0.00 | No effect |
| GLM-5 | 6.84 | 6.58 | -0.26 | Hurt |
| Sonnet | 7.62 | 7.34 | -0.28 | Hurt |
| Haiku | 6.00 | 5.56 | -0.44 | Hurt |
| GPT-5.4 Mini | 5.94 | 4.86 | -1.09 | Seriously hurt |
A pattern worth noting: the smaller the model, the more a detailed spec tends to hurt. Spec complexity should probably be calibrated to builder capability — simpler models may perform better with simpler prompts. For well-known games, model training data already includes sufficient implementation knowledge, so a detailed spec adds signal that's redundant with what the model already knows; for weaker models, the extra complexity becomes noise.
| Model | V1 Score | V2 Score | V1 $/build | V2 $/build | Tier |
|---|---|---|---|---|---|
| Sonnet | 8.62 | 7.40 | $3.30 | $1.39 | Premium |
| GPT-5.4 | 8.58 | 7.43 | $0.47 | $0.25 | Premium |
| Opus | 8.22 | 7.77 | $2.60 | $1.59 | Premium |
| Gemini Pro | 6.78 | 7.45 | $0.34 | $0.67 | Mid |
| Gemini Flash | 7.74 | 6.50 | $0.04 | $0.11 | Budget |
| Haiku | 7.09 | 5.65 | $0.56 | $0.34 | Budget |
| GLM-5 | — | 6.63 | — | $0.10 | Budget |
| GPT-5.4 Mini | 7.86 | 5.07 | —* | $0.01 | Budget |
| o3-mini | 5.23 | — | $0.07 | — | Mid |
Per-build costs are audited — derived from PaleBlueDot per-call billing divided by completed-build counts per model per round. They cover the builder-call API spend only, not orchestration overhead (which roughly doubled the bill — see below). * GPT-5.4 Mini's V1 builds went via a different code path (runs_7_8_build_results.json) and didn't surface in the per-game-per-model audit; per-build figure not directly computable from the same source.
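The arithmetic behind those figures is a single group-by over the billing export. A sketch — the column names ('model', 'cost_usd') are placeholders, not PaleBlueDot's documented schema:

```python
import csv
from collections import defaultdict

def per_build_cost(billing_csv: str, completed_builds: dict[str, int]) -> dict[str, float]:
    """Sum per-call spend by builder model, then divide by completed builds."""
    spend = defaultdict(float)
    with open(billing_csv, newline="") as f:
        for row in csv.DictReader(f):
            spend[row["model"]] += float(row["cost_usd"])
    return {model: spend[model] / n for model, n in completed_builds.items() if n}

# e.g. Opus in V2: $77.81 of build-phase spend over its completed V2 builds -> the $1.59/build above.
```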
At the PaleBlueDot gateway rates we paid: Flash is ~5× cheaper than Sonnet on output ($3 vs $15 / M tokens) and ~8× cheaper than Opus ($3 vs $25 / M). Opus is ~1.7× pricier per output token than Sonnet ($25 vs $15 / M; $5 vs $3 input). Without human playtesting to validate whether AI-judge score differences map to real-world quality, we've stopped short of a score-per-dollar headline.
These are the actual builder-model API costs per round, audited from PaleBlueDot per-call billing. They cover the build-only portion (orchestration, judging, rebuilds are separate lines below).
| Builder | V1 ($) | V2 ($) | What changed |
|---|---|---|---|
| Opus | 67.57 | 77.81 | Top spend both rounds — long calls, deep context |
| Sonnet | 59.33 | 69.27 | Close second — high call frequency |
| Gemini Pro | 6.81 | 32.64 | ~5× jump — V2 added Pro as a planner role |
| Haiku | 11.20 | 16.80 | Modest growth, consistent budget tier |
| GPT-5.4 | 9.43 | 12.57 | Stable across rounds |
| Gemini Flash | 0.89 | 5.54 | Cheapest builder both rounds (~$0.04/build in V1, ~$0.11 in V2) |
| GLM-5 | — | 4.78 | New in V2 only |
| GPT-5.4 Mini | — | 0.61 | V1 spend went via a separate code path; not in this audit (see footnote above) |
| o3-mini | 1.32 | — | V1 only |
| Build-only total | $157 | $220 | Per-game breakdown below |
Why V2's per-builder spend is higher: V2 used a different planner-builder mix (5 planner roles instead of 2 fixed templates), longer prompts (free-form specs vs the 12-section template), more builds per game (40 in V2 vs ~18 in V1, even though the builder roster shrank from nine to eight) with deeper per-call context, and isolated context that reset on each call (no cross-game caching).
| Game | V1 ($) | V2 ($) | Total ($) | Notes |
|---|---|---|---|---|
| Pong | 20.38 | 28.14 | 48.52 | Most expensive overall — Sonnet-heavy in V2 ($14.43 alone) |
| Pac-Man | 25.44 | 21.21 | 46.65 | V1 high from Opus rebuild attempts ($17.10 in single cell) |
| Tetris | 16.41 | 23.89 | 40.30 | Opus-heavy V2 ($11.21) |
| Galaga | 14.64 | 22.96 | 37.60 | Sonnet + Opus dominant V2 |
| Space Invaders | 14.23 | 24.22 | 38.45 | Opus highest single cell ($10.11) V2 |
| Asteroids | 13.23 | 21.32 | 34.55 | Steady distribution |
| Donkey Kong | 11.57 | 20.85 | 32.42 | Cheaper than expected given complexity — most calls hit ceilings fast |
| Snake | 11.24 | 20.47 | 31.71 | Quick to converge — short calls |
| Breakout | 18.22 | 18.99 | 37.21 | Most-balanced V1/V2 — well-known mechanics |
| Frogger | 11.21 | 17.96 | 29.17 | Cheapest game overall |
The complexity-vs-quality finding (Section 3) doesn't map cleanly onto cost: Donkey Kong was actually one of the cheapest games to attempt, despite being the lowest-scoring. Most builders churned out a quick failure rather than a long, expensive attempt. Pong and Pac-Man cost more because builders kept trying — long sessions, multiple iterations, more context accumulation.
A token-level audit of every PaleBlueDot call from 30 March – 15 April separates the games work from concurrent activity on the same API key (cron jobs, daily research agents, side projects). Headline:
| Slice | Cost (USD) | What it covers |
|---|---|---|
| V1 build | $332 | 8 models × 10 games, template specs, parallel dispatch |
| V2 build | $399 | 5 planner roles × 8 builders × 10 games, free-form specs, isolated context |
| V2 inline judging (bench-sonnet) | $202 | Long-running judging session — single biggest line item |
| V2 static 3-judge QA | $50 | Sonnet + Gemini Pro + GPT-5.4 panel, 1.5 hours |
| V1 rebuilds (Pac-Man + Donkey Kong) | $40 | 3 rebuild events, attributed to V2 phase |
| Window edges + cleanup | $56 | Spillover spend at the boundaries of build / QA windows |
| Games total (audited) | $1,138 | Everything attributable to V1+V2 work |
| Concurrent orchestration noise | $660 | Cron, daily agents, side activities sharing the API key |
| Grand total during the period | $1,798 | Full PaleBlueDot spend, 30 Mar – 15 Apr |
The games-attributable spend is ~$1,138. Another ~$660 of unrelated automation (cron jobs, daily research agents, side projects) ran on the same API key during the period, bringing the gross PaleBlueDot bill to ~$1,798. The two are separated above because the wider number doesn't measure the cost of building or evaluating the games.
V2 cost 2.4× as much as V1 — $806 vs $332 — driven by three real differences in scope (five planner roles instead of two fixed templates, longer free-form specs, and isolated per-call context), not by wall-clock duration: V2 ran across more days largely because of my availability, not because the work itself was heavier.

Richer methodology costs sharply more: the 2.4× cost ratio tracks the methodology delta, not the time delta.
Three findings worth flagging up front, separated from the model-quality narrative:
- V2's builds and QA ran through dedicated scripts (build_v2.py, qa_browser_v2.py) with less orchestrator chatter — but ran much longer in total, so V2's absolute orchestration spend ended up higher anyway.

Caveat: the audit relies on cross-referencing per-call CSV billing with bench-* JSONL timestamps. PaleBlueDot doesn't tag calls with project/agent metadata, so attribution between "games work" and "concurrent automation" is inferred from time windows, not direct labelling. The breakdown is directional, not exact to the cent.
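The attribution mechanic itself is simple: derive session windows from the bench-* JSONL timestamps, then bucket each billing row by whether its timestamp falls inside any window. A minimal sketch — the field names ('timestamp', 'cost_usd') and the window padding are assumptions about the log and CSV formats, not their documented schema:

```python
import csv
import glob
import json
from datetime import datetime, timedelta

def session_windows(jsonl_glob: str, pad_minutes: int = 5):
    """One (start, end) window per bench-* session log, padded at both ends."""
    windows = []
    for path in glob.glob(jsonl_glob):
        stamps = [datetime.fromisoformat(json.loads(line)["timestamp"])
                  for line in open(path) if line.strip()]
        if stamps:
            pad = timedelta(minutes=pad_minutes)
            windows.append((min(stamps) - pad, max(stamps) + pad))
    return windows

def attribute(billing_csv: str, windows) -> tuple[float, float]:
    """Split total spend into (games_work, concurrent_noise) by time window."""
    games = noise = 0.0
    with open(billing_csv, newline="") as f:
        for row in csv.DictReader(f):
            t = datetime.fromisoformat(row["timestamp"])
            cost = float(row["cost_usd"])
            if any(start <= t <= end for start, end in windows):
                games += cost
            else:
                noise += cost
    return games, noise

# e.g. attribute("billing.csv", session_windows("bench-*.jsonl"))
```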
All builds executed via the OpenClaw agent framework, which wraps API calls to each model provider. V1 builds ran fully in parallel across models and games. V2 builds ran in a mix of parallel and sequential batches depending on provider rate limits. Hard timeout: 15 minutes per build attempt. Output: single index.html file per build.
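For reference, the dispatch pattern — stripped of OpenClaw specifics — looks roughly like the sketch below. call_builder is a placeholder for the actual agent call and the directory layout is illustrative; this is not OpenClaw's API.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path

BUILD_TIMEOUT_S = 15 * 60  # hard cap per build attempt

def call_builder(model: str, game: str, spec: str, timeout_s: int = BUILD_TIMEOUT_S) -> str:
    """Placeholder for the agent call (OpenClaw in our runs); returns the finished HTML
    or raises if the attempt exceeds timeout_s or the provider errors out."""
    raise NotImplementedError

def dispatch(jobs, out_dir: str = "builds", workers: int = 8) -> None:
    """jobs: iterable of (model, game, spec). Writes <out_dir>/<model>/<game>/index.html."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(call_builder, m, g, s): (m, g) for m, g, s in jobs}
        for fut in as_completed(futures):
            model, game = futures[fut]
            try:
                html = fut.result()
            except Exception as exc:  # timeouts and provider errors count as failed builds
                print(f"{model}/{game}: failed ({exc})")
                continue
            out = Path(out_dir, model, game, "index.html")
            out.parent.mkdir(parents=True, exist_ok=True)
            out.write_text(html, encoding="utf-8")
```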
Three judges independently scored every V1 build. Their agreement — or lack thereof — is critical to interpreting the results.
| Builder | GPT-5.4 Judge | Opus Judge | Gemini Judge | GPT/Opus Avg | Gemini Inflation |
|---|---|---|---|---|---|
| Sonnet | 8.01 | 8.03 | 9.82 | 8.02 | +1.80 |
| GPT-5.4 | 7.98 | 8.04 | 9.72 | 8.01 | +1.71 |
| Opus | 7.55 | 7.42 | 9.68 | 7.49 | +2.19 |
| GPT-5.4 Mini | 7.25 | 6.92 | 9.43 | 7.09 | +2.34 |
| Gemini Flash | 6.92 | 6.89 | 9.43 | 6.91 | +2.52 |
| GPT-5.4 Nano | 6.74 | 6.40 | 9.46 | 6.57 | +2.89 |
| Haiku | 6.09 | 5.96 | 9.24 | 6.03 | +3.21 |
| Gemini Pro | 5.87 | 6.18 | 8.30 | 6.03 | +2.27 |
| o3-mini | 4.88 | 3.84 | 6.97 | 4.36 | +2.61 |
Gemini Flash judge inflates scores by an average of +2.39 points over the GPT-5.4/Opus consensus. The inflation is worse for lower-quality builds — Haiku builds get +3.21 points of inflation vs Sonnet builds at +1.80 — meaning Gemini Flash actively compresses the quality scale, making bad builds look mediocre and mediocre builds look excellent.
Based on 580 builds, here's when to use what:
| Use Case | Recommended Model | Why |
|---|---|---|
| Maximum quality, cost no object | Sonnet, GPT-5.4, or Opus | V1: Sonnet/GPT-5.4 tied. V2: Opus tops builder ranking (7.77). Depends on pipeline. |
| High volume generation | Gemini Flash | ~5× cheaper output tokens than Sonnet via PaleBlueDot. 0.88-point V1 quality drop. |
| Budget with decent quality | Haiku or GPT-5.4 Nano | Haiku 7.09 (V1). Test smaller models in your pipeline before committing — V2 showed several dropped significantly under stricter conditions. |
| Exploring / prototyping | GPT-5.4 Nano or Gemini Flash | Cheapest tokens, acceptable output for iteration. |
| Accumulated context pipelines | Opus | Only model that measurably benefits from seeing prior work (+0.95 points in V1). |
And what to avoid: