This guide is based on real data: 580 retro arcade games built by 9 AI models with zero human intervention, scored by automated judges and human playtesting. Every recommendation traces back to measured results, not vibes. Whether you're trying AI coding for the first time or optimising an existing workflow, this is the unvarnished version — what works, what doesn't, and what costs more than it should.
Ch. I
What AI coding actually is
Before you read anything else — this was all one-shot
Every number in this guide comes from a single-pass build: one prompt in, one result out, no iteration, no human review, no retry. That is deliberately unrepresentative of how people actually code with AI. In a real iterative workflow — write, run, fix, refine — the frontier models (Sonnet, GPT-5.4, Opus) almost certainly pull further ahead than this snapshot suggests. Their strength is adapting to feedback across turns, which a one-shot cannot measure.
Opus is the cleanest example: in V1's accumulated-context runs — where each new build saw the previous one's code — Opus scored roughly a full point higher than when sessions were isolated. The more information and iteration it had, the better it got. Treat every number here as the lower bound for premium models. In real use, the gap between frontier and budget widens, not narrows.
AI coding means giving a language model a description of what you want and getting code back. That's it. The model doesn't "understand" your project. It predicts the most likely sequence of tokens that follows your prompt, trained on billions of lines of code it's seen before.
This has two implications:
It's brilliant at common patterns. If thousands of developers have built something similar, the model has seen the solution. Pong, CRUD apps, REST APIs — these are well-trodden paths.
It's unreliable for novel problems. Unique business logic, unusual architectures, or anything outside the training distribution will get a plausible-looking answer that may be subtly wrong.
In our benchmark, models scored 8.72/10 on Snake (simple, well-known) but 6.09/10 on Donkey Kong (complex, more implementation-specific). The complexity of the game mattered far more than which model you used.
Myth
"AI can code anything if you describe it well enough."
Reality
AI can code well-known things reliably. Novel things are a gamble regardless of prompt quality.
A note on "quality" in this guide
Unless stated otherwise, quality means the consensus score from three AI judges reading each build's source code. It measures whether the code runs, whether mechanics look right, and whether controls are wired up. It does not measure readability, maintainability, or whether a human enjoys using the software. Where human taste diverges meaningfully from AI-judge scores, we flag it. Treat every number as directional, not absolute.
Ch. II
Choosing your model
There's no single "best" model. There's the right model for your budget and quality bar.
What are you optimising for?
Max quality: Use Sonnet, GPT-5.4, or Opus. V1 had Sonnet/GPT-5.4 nearly tied (8.62 vs 8.58). V2 put Opus at #1 (7.77) once context isolation was enforced. Any of the three is defensible for production work. Test your actual pipeline before locking in.
Volume / cost: Use Gemini Flash. Consensus score ~90% of Sonnet's, token rates ~5× cheaper on output and ~6× on input via PaleBlueDot. The honest caveat: that 90% is AI-judges-reading-code quality. They don't measure readability, code taste, or maintainability. For throwaway generation, Flash is compelling. For code you plan to read, extend, or trust long-term, the Sonnet gap may be larger than the score suggests.
Free-form: Gemini Pro is strong when the pipeline gives it free-form specs and isolated context. V2 ranked it #2 (7.45) — behind Opus, ahead of Sonnet and GPT-5.4. Its V1 "bad" reputation (6.78) was largely a scoring artefact, not a real quality gap. Worth a look for planner roles or isolated single-pass generation.
Budget: Haiku is the safest budget pick in our data (7.09 V1 consensus). GPT-5.4 Mini looked good in V1 (7.86) but collapsed to 5.07 under V2's stricter evaluation — the budget class is not interchangeable. Treat any budget choice as provisional until you've tested your own task.
What to avoid
Assuming price predicts quality. Opus is ~1.7× Sonnet's per-token rate at PaleBlueDot gateway rates ($25/M vs $15/M output, $5/M vs $3/M input) — modest on paper. Opus scored lower in V1 (8.22 vs 8.62) but topped V2's builder ranking (7.77). Across V1 + V2 builds, Opus drew $145 of the $377 builder API spend — the highest of any single builder. Per-call cost (driven by call length + context depth) is a much bigger factor than per-token rate. Model rankings also shift with evaluation methodology; always test in your own setup.
o3-mini for creative coding. Scored 5.23 — dead last by 1.55 points. Honest caveat: o3-mini was released in January 2025, making it over 15 months old at benchmark time, and it's a lightweight reasoning model by design (the smallest of the o-series). Its poor showing is as much about age and weight class as architecture — a genuinely useful snapshot of how fast the frontier moves. With hindsight we should have tested o5-mini instead (I thought we had, but the pipeline picked up o3-mini); if you want a reasoning-model comparison for coding, start there.
Extrapolating from models we barely tested. GPT-5.4 Nano appeared only in two V1 runs and was not included in V2. We don't have enough evidence to recommend or dismiss it — just don't treat the limited data as strong evidence.
Assuming Gemini Pro is bad because V1 said so. It isn't. V1's 6.78 consensus was depressed by the Flash judge inflating other models more than Pro. Clean V1 scores put Pro at 7.15, same as Flash. V2 (cleaner methodology) put Pro at 7.45 — #2.
| Model | V1 consensus | V2 | Tier | When to reach for it |
|---|---|---|---|---|
| Sonnet | 8.62 | 7.40 | Premium | Production code you'll read and extend |
| GPT-5.4 | 8.58 | 7.43 | Premium | Production; ~17% cheaper input tokens than Sonnet, same output rate |
| Opus | 8.22 | 7.77 | Premium+ | Complex tasks, isolated single-pass pipelines |
| Gemini Pro | 6.78 * | 7.45 | Mid | Free-form specs, planner roles, isolated context |
| Gemini Flash | 7.74 | 6.50 | Budget | Volume generation, throwaway code, prototyping |
| Haiku | 7.09 | 5.65 | Budget | Lightweight tasks; check your use case first |
* V1 Gemini Pro's 6.78 consensus reflects Flash-judge inflation affecting other models more. By the calibrated GPT+Opus judge average, Pro was 7.15 — competitive with Flash.
GPT-5.4 Mini, GPT-5.4 Nano, and o3-mini are omitted from this table because V1/V2 coverage is incomplete or their performance was too inconsistent to recommend.
Where to start — which subscriptions to buy
If you're new to AI coding and want to set yourself up properly without overpaying, the data points in a clear direction.
If you can buy one subscription: Anthropic. Sonnet covers production work; Haiku covers cheap routine tasks; Opus covers the hardest cases when you need it. Anthropic appears in 7 of the 10 best builder/planner pairings in our V2 data — it's the load-bearing vendor in this space. You give up Gemini Flash's cheap volume and Gemini Pro's free-form planning niche, but you stay in the top quality tier for everything you'd realistically build.
If you can buy two: Anthropic + Google. Anthropic for premium one-shot quality (Opus, Sonnet, Haiku), Google for the cheapest tokens (Flash) and the strongest free-form planner (Gemini Pro, V2 #2 builder). This combo covers 4 of the top 6 V2 builders plus the cheapest token tier — your $/build floor drops from ~$0.34 (Haiku) to ~$0.11 (Flash) for volume work.
Practical pipeline pairings (planner × builder)
If you're building agent systems where one model writes a spec and another model executes it, pair carefully. The wrong combination can make a strong builder produce broken output.
The pairings that work follow two patterns: a concise spec going into the best builder you can afford, or a spec-neutral builder that absorbs whatever any planner hands it.
GLM-5 → Gemini Flash (#10 in the V2 pairings): an aggregator-API-only budget pair (no consumer subscription needed for either), ~$0.10–0.11 / build.
Pairings to avoid
Opus → Haiku, Mini, or Nano. Small builders choke on long specs. V2 Haiku Snake scored 7.93 with a light spec, 1.43 with Opus's 12 KB spec. Same model, same game. Spec complexity must be calibrated to builder capability.
Opus → Sonnet for well-known patterns. Sonnet loses 0.28 points with detailed specs. Use Gemini Pro as the planner instead — lighter spec, similar quality output.
GPT-5.4 builder with no detailed spec. GPT-5.4 builders gain +0.74 from a detailed plan — they're the only model that strongly needs one. Without a structured spec, GPT-5.4 underperforms its tier.
Haiku or GLM-5 as planners with any premium builder. Neither produces specs that improve premium builder output, and you're paying twice.
The single most useful rule: Anthropic should be your first subscription. Whatever your second is — Google for cheap volume, OpenAI for spec-heavy pipelines, or none at all — Anthropic is the vendor most consistently producing top builders and top planners across both rounds of testing. If you only buy one, buy this. Add a second when you've identified a specific gap (cheap volume, vendor diversity, OAuth pricing) Anthropic doesn't fill.
Ch. III
The spec matters (or doesn't)
We ran a direct experiment: five planners wrote specifications for the same 10 games, ranging from 29,000-byte engineering blueprints (Opus) to 90 bytes (just the game name). Eight builder models built from each spec.
The result, confirmed by 3-judge static QA: Builders given only the game name averaged 6.86/10 — higher than the 6.70 average from detailed specifications. Most builders scored worse with specs; GPT-5.4 Mini lost over a full point.
What this means for you
If you're asking an AI to build something it already knows (a to-do app, a login form, a common game), a brief description is probably enough. Save your specification effort for the parts that are genuinely novel or specific to your needs.
This doesn't mean specs are always useless. It means the model's training data already includes the "spec" for well-known tasks. When you write a detailed specification for Pac-Man, you're largely restating what the model already knows — and sometimes constraining it in ways that make the output worse.
When specs DO help
Novel or domain-specific tasks — anything the model hasn't seen a thousand times
Non-obvious requirements — specific colour schemes, accessibility constraints, business rules
Integration points — how this code connects to your existing system
Constraints the model wouldn't assume — no external dependencies, must run offline, maximum 50KB
When specs hurt
Over-specifying implementation details for well-known patterns — the model's default approach may be better than yours
Contradictory requirements — long specs are more likely to contain contradictions that confuse the model
Template-driven specs — forcing a model to fill in 12 sections when 3 would do
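To make the lists above concrete, here's a hypothetical brief for a well-known pattern: it leaves the familiar mechanics to the model's training data and spends its words only on the constraints the model wouldn't assume. Everything in it (the game choice, the limits) is illustrative, not a prompt from the benchmark.

```python
# Hypothetical brief for a well-known pattern: state only the non-obvious
# constraints and leave the familiar mechanics (how Breakout plays) alone.
prompt = (
    "Build Breakout as a single self-contained HTML file.\n"
    "Constraints:\n"
    "- No external dependencies; must run offline from file://\n"
    "- Keep the file under 50 KB\n"
    "- Controls: left/right arrow keys plus touch drag\n"
    "- Colour scheme: dark background, high-contrast bricks\n"
)
```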
Ch. IV
Vibe coding: how far can zero-input go?
"Vibe coding" — coined by Andrej Karpathy — means giving an AI a loose description and letting it build whatever it interprets. No detailed spec, no iteration, no code review. Just vibes.
Our benchmark is essentially a 580-build vibe coding experiment. Every game was built in a single pass with no human intervention. The results:
Simple games (Snake, Pong) — vibe coding works brilliantly. Average scores above 8.5. Most builds are playable and recognisable.
Medium games (Tetris, Breakout) — usually playable but with notable issues: missing features, visual shortcuts, occasional bugs. Scores 7.5–8.5.
Complex games (Pac-Man, Donkey Kong) — hit or miss. Some impressive implementations, some complete failures. Scores 6.0–7.0 with high variance.
The vibe coding trap
It feels magical when it works on simple tasks — which tricks you into thinking it'll scale to complex ones. It doesn't. The 580-build evidence shows quality drops sharply with complexity, and no amount of model quality closes the gap. Even Sonnet (best overall) scored 6.09 on Donkey Kong.
The vibe coding spectrum
| Level | You do | AI does | Good for |
|---|---|---|---|
| Full vibe | One sentence | Everything | Prototypes, known patterns |
| Guided vibe | Requirements list | Architecture + code | Features with clear scope |
| Pair coding | Architecture + review | Implementation | Production features |
| Assisted | Everything except typing | Autocomplete + suggestions | Complex/novel systems |
The right level depends on what you're building and how much you can afford to get wrong. Vibe coding a prototype? Go for it. Vibe coding your payment system? Please don't.
Ch. V
The money traps
AI coding costs are deceptive. The per-token prices look small until you multiply by volume, iterations, and the invisible cost of debugging bad output.
Trap 1 — Premium by default
Most people start with the "best" model and never test cheaper alternatives. In our benchmark, Gemini Flash scored ~90% of Sonnet's AI-judge consensus (7.74 vs 8.62 in V1) at PaleBlueDot rates that are ~5× cheaper on output ($3 vs $15 / M tokens) and ~6× cheaper on input ($0.50 vs $3 / M). For volume generation where small quality drops are acceptable, the math favours Flash heavily.
The honest caveat: AI judges read the code and score whether it runs, looks right, and handles controls. They don't meaningfully assess readability, maintainability, or code taste. Human evaluators often find a wider gap than the 10% suggests — Flash's output is more likely to be oddly structured, sparsely commented, or built on shortcuts that work but aren't good code. If you're generating once and discarding, Flash is obvious value. If the output will be read, extended, or live in production, the premium tier often buys more than the score says.
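To make the multiplication concrete, here's a back-of-envelope sketch using the gateway rates quoted above. The per-build token counts are assumptions for illustration, not measured values from the benchmark.

```python
# Back-of-envelope cost per build at the PaleBlueDot gateway rates quoted above.
# Token counts per build are illustrative assumptions, not benchmark measurements.
RATES = {                      # (input $/M tokens, output $/M tokens)
    "sonnet":       (3.00, 15.00),
    "gemini-flash": (0.50,  3.00),
}

def cost_per_build(model, input_tokens, output_tokens):
    in_rate, out_rate = RATES[model]
    return (input_tokens / 1e6) * in_rate + (output_tokens / 1e6) * out_rate

# Example: ~10K prompt tokens in, ~20K generated tokens out (assumed)
for model in RATES:
    print(model, round(cost_per_build(model, 10_000, 20_000), 3))
# sonnet 0.33  /  gemini-flash 0.065 -- the gap compounds with every build
```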
Trap 2 — Paying for reasoning you don't need
Reasoning models think step-by-step before generating. That's brilliant for maths and logic puzzles, expensive for creative work that wants whole-system generation. In our benchmark, the only reasoning model in the panel — o3-mini — scored dead last (5.23/10).
The honest context: o3-mini was released January 2025, making it over 15 months old at benchmark time, and it's a lightweight reasoning model by design. Its last-place finish is as much about age and weight class as it is about "reasoning vs creative coding" — you can read this as a snapshot of how fast the frontier has moved in 15 months. In hindsight we should have used o5-mini (I genuinely thought we had; turns out the pipeline picked up o3-mini and the mistake only surfaced in analysis). If you're comparing a reasoning model for code generation today, start with o5-mini or similar — not o3-mini.
The narrow takeaway: don't pay premium for o3-mini when a modern budget model will outperform it. The broader question — whether reasoning architectures genuinely hurt creative coding — needs more than one old data point to answer.
Trap 3 — Iteration without evaluation
The most expensive habit: re-prompting a model repeatedly without checking if the output is actually getting better. Each iteration costs tokens. If you're not measuring quality between iterations, you're paying for random walks.
The cost-smart approach
Start with a cheap model (Gemini Flash or GPT-5.4 Nano). If the output isn't good enough, try a mid-tier model (Haiku). Only escalate to premium (Sonnet, GPT-5.4, or Opus) when you've confirmed the task actually needs it. Most tasks don't.
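A minimal sketch of that escalation ladder is below; `generate` and `good_enough` are placeholders for your own pipeline and evaluation, and the model order mirrors the tiers above. The point is that the quality check happens between attempts, so you only pay premium rates when the cheaper tiers have demonstrably failed.

```python
# Minimal sketch of the cheap-first escalation ladder described above.
# `generate` and `good_enough` are placeholders for your own pipeline:
# evaluate between attempts instead of blindly re-prompting.
ESCALATION = ["gemini-flash", "haiku", "sonnet"]   # cheap -> mid -> premium

def build_with_escalation(prompt, generate, good_enough):
    for model in ESCALATION:
        output = generate(model, prompt)    # one single-pass build
        if good_enough(output):             # run it, test it, score it
            return model, output
    return ESCALATION[-1], output           # best effort from the premium tier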
Ch. VI
Checking the output — why you can't trust AI self-reports
When we asked AI judges to evaluate AI-built games, they disagreed significantly. V1's Gemini Flash judge rated nearly everything 9–10, inflating scores by ~1.8 points. V2 replaced it with Gemini Pro — and the entire leaderboard shifted as a result. The judge panel you choose changes your scores even if the underlying quality is identical.
The lesson generalises beyond our benchmark: AI models are unreliable evaluators of their own output. They're biased toward marking things as "correct" and they can't detect the kinds of bugs that only show up when you actually use the software.
What automated testing catches
Syntax errors and crashes
Missing elements (no canvas, no game loop)
Static analysis issues
Basic functionality checks (does the score change? does the game start?)
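The automated side of that list can be a very small script. Here's a minimal sketch in the spirit of the free Playwright checks used in V1 (see Chapter IX): it assumes the Python Playwright package and a Chromium install, and the file path is a placeholder.

```python
# Minimal smoke test for a single-file HTML game. Assumes
# `pip install playwright && playwright install chromium`; the path is a placeholder.
from pathlib import Path
from playwright.sync_api import sync_playwright

def smoke_test(html_path: str) -> dict:
    errors = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.on("pageerror", lambda e: errors.append(str(e)))  # uncaught JS exceptions
        page.goto(Path(html_path).resolve().as_uri())
        page.wait_for_timeout(2000)                             # let the game loop start
        has_canvas = page.locator("canvas").count() > 0         # is anything being drawn?
        browser.close()
    return {"has_canvas": has_canvas, "js_errors": errors}

print(smoke_test("builds/snake.html"))   # placeholder path
```

It will catch crashes and missing canvases; it will never tell you whether the ghost AI actually chases you.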
What only a human catches
The ghost AI in Pac-Man doesn't actually chase you
Tetris pieces rotate but clip through the floor
The game is technically playable but not fun
The scoring works but the difficulty never increases
It looks like Pac-Man drawn by a five-year-old
If you're not testing it yourself, you're not testing it
Automated QA is necessary but not sufficient. Run the code. Use the feature. Click the buttons. The five minutes you spend testing will save hours of debugging later.
Ch. VII
When to intervene
Zero-intervention AI coding (our benchmark methodology) produced average scores of 7.5/10 across 580 builds. That's "functional but flawed" territory. The question isn't whether human intervention helps — it obviously does — but when the return on your time is highest.
High-value interventions
Reviewing the architecture before coding starts. If the model's planned approach is wrong, fixing it early saves everything downstream.
Testing the output. 5 minutes of manual testing catches bugs that 3 AI judges missed.
Specifying non-obvious constraints. "Must work offline", "max 50KB", "needs to support screen readers" — things the model won't assume.
Low-value interventions
Micro-managing implementation. If the model knows how to build Pong, telling it which collision algorithm to use probably makes things worse.
Writing exhaustive specs for well-known patterns. Our data shows 29K-byte specs performed no better than 3-line prompts for known games.
Iterating on the prompt instead of the code. If the output is 80% right, it's faster to fix the code than to craft the perfect prompt that generates perfect code.
Ch. VIII
Tools of the trade
AI coding tools fall into a few categories:
| Category | Examples | Best for |
|---|---|---|
| Autocomplete | GitHub Copilot, Cursor Tab | Speeding up known patterns. You're still driving. |
| Chat-based coding | Claude Code, Cursor Chat, Aider | Generating features, refactoring, explaining code |
| Agent frameworks | OpenClaw, Devin, SWE-agent | Autonomous multi-step tasks, bulk generation |
| Benchmarks / evaluation | SWE-bench, HumanEval, this project | Measuring model quality, choosing models |
This benchmark used OpenClaw as the agent framework — it handles session management, model routing, and artefact verification. The games themselves were built entirely by the AI models; OpenClaw just orchestrated the process.
Ch. IX
The mistakes I made
Running 580 builds taught me things the hard way. Grouped into five categories where they actually bit me — data, workflow, QA, model choice, and infrastructure. The infrastructure category was added late after a retrospective token audit revealed where the money really went.
01 Data & cost tracking
I under-budgeted for QA.
A retrospective token audit broke the spend down honestly. The games-attributable work cost ~$1,138; another ~$660 of unrelated automation (cron jobs, daily research agents, side projects) shared the API key during the period for a gross PaleBlueDot total of ~$1,798. Within the games slice, V2 QA alone was $252 — a 41-hour bench-sonnet inline-judging session ($202) plus the 3-judge static QA panel ($50) — vs ~$10 of V1 QA that leaned on free Playwright checks. V2 QA was 25× more expensive than V1 QA. I did not see that coming, and it's by far the largest single chunk of avoidable spend.
I didn't record token counts from the start.
Runs 1–6 have no token data. I can only estimate costs for those builds. Always instrument your pipeline for cost tracking from day one — retrofitting it is painful and often impossible.
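The instrumentation doesn't need to be clever. A minimal sketch: append one JSON line per API call. The field names in `usage` vary by provider, so treat them as assumptions to adapt, not a fixed schema.

```python
# Minimal per-call cost log, appended as JSONL from the very first build.
# Field names in `usage` differ by provider -- adapt to what your gateway returns.
import json, time

def log_usage(path, model, usage, rate_in, rate_out):
    record = {
        "ts": time.time(),
        "model": model,
        "input_tokens": usage["input_tokens"],
        "output_tokens": usage["output_tokens"],
        "cost_usd": usage["input_tokens"] / 1e6 * rate_in
                    + usage["output_tokens"] / 1e6 * rate_out,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```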
02 Workflow & methodology
I accumulated context between builds.
In V1, builders saw their own prior code between games. This caused context exhaustion (DNFs on complex games) and made comparisons unfair — Opus was penalised most. V2 isolated every session.
I used a template for specs.
V1's 12-section spec template meant both planners filled the same form. I was testing template-filling, not planning. V2's unconstrained specs revealed real differences between how each model interprets a loose brief.
I ran builds in parallel.
Parallel execution caused cascading timeouts in V1. V2 used a mix of parallel and sequential batches depending on provider rate limits — slower but reliable.
I over-specified simple games.
Writing 300-line specs for Pong was a waste. The model already knows Pong. The spec just added noise and occasionally confused the builder into worse output than a one-line prompt.
03 Quality assurance
I trusted model self-reports.
Early builds were marked "complete" because the model said so. Many were broken. Always verify the artefact (does the file exist? does it open? does the canvas render?), not the model's claim about it.
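The check doesn't have to be elaborate. A minimal sketch, with illustrative thresholds rather than the benchmark's actual verification code:

```python
# Cheapest possible artefact check: never take the model's word that a build
# is "complete". The size threshold is illustrative.
from pathlib import Path

def artefact_exists(path: str) -> bool:
    p = Path(path)
    if not p.is_file() or p.stat().st_size < 1_000:     # missing or suspiciously small
        return False
    html = p.read_text(errors="ignore").lower()
    return "<canvas" in html and "<script" in html       # has a drawing surface and code
```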
I used Gemini Flash as a judge.
It rates nearly everything 9–10. It inflated every consensus score in V1 by roughly two points. A 3-judge panel is fine, but one of my judges was effectively noise — a cheerful yes-machine masquerading as calibration.
I assumed automated QA was sufficient.
V2's browser-bug scoring gave 88% of builds a perfect score. That's not discrimination — that's a broken metric. Human playtesting catches the things static analysis and AI judges miss: fun, fairness, whether the ghost AI actually chases you.
04 Model selection
I assumed more expensive = better.
Per PaleBlueDot rates: Opus is ~1.7× Sonnet's per-token price ($25/M vs $15/M output, same ratio on input). Flash is ~5× cheaper than Sonnet on output, ~6× on input. The bigger cost story isn't per-token rate — it's how many calls each model handles and how long they run. Opus drew $145 of the $377 builder API spend across V1 + V2 — the most of any single builder — because individual Opus calls run longer and accumulate more context. Published price tier doesn't predict quality, and per-token rate doesn't predict total spend.
05 Infrastructure & cost-saving
I let the agent framework eat 2/3 of my budget.
The actual builder + judge model API calls (the bench-* sub-agents that did the work) cost ~$364. The orchestration tax around them — main / ross / harvey / dwight dispatching, polling, summarising, retrying — was ~$753. For every $1 spent on actual game-building or judging, $1.83 went to coordination overhead. The pipeline is roughly 60% coordination, 35% builders, 5% judges. If you're using a framework like OpenClaw, every conversational turn re-sends accumulated context, so token cost grows quadratically with conversation length, not linearly. Trim context aggressively between turns, batch stateless calls where you can, or use leaner pipeline scripts — V2's build_v2.py ran at $8/h orchestration vs V1's ad-hoc parallel-dispatch at $19/h. Same work, ~2.3× cheaper to coordinate.
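To see why trimming matters, here's a toy calculation of cumulative input tokens across a conversation; the per-turn figure and the trim window are illustrative, not measured from the benchmark.

```python
# Why conversational orchestration gets expensive: if every turn re-sends the
# full history, cumulative input tokens grow roughly with turns^2.
# Numbers are illustrative only.
PER_TURN = 2_000            # new tokens added each turn (assumed)
KEEP_LAST = 5               # turns kept when trimming context

def cumulative_input(turns, trim=False):
    total = 0
    for t in range(1, turns + 1):
        history = min(t, KEEP_LAST) if trim else t   # how many turns get re-sent
        total += history * PER_TURN
    return total

print(cumulative_input(50))             # 2,550,000 tokens -- quadratic growth
print(cumulative_input(50, trim=True))  #   480,000 tokens -- roughly linear
```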
I paid raw PAYG rates instead of using subscription tiers.
Everything went through PaleBlueDot at gateway pay-as-you-go because that's how the pipeline was wired. Most providers offer subscription / committed-spend / OAuth flows that materially reduce effective rates for sustained usage. For a project at this scale ($1,798 across two weeks of intermittent runs), that alone could have cut the bill by a meaningful fraction. Lesson: for any benchmark or batch work that'll spend more than a few hundred dollars, audit your provider's billing tiers before you start, not after.
I let one API key bill for everything.
The same PaleBlueDot key billed for V1 builds, V2 builds, QA, V1 rebuilds, this paper's writing — and for unrelated cron jobs, daily research agents, side projects all running on the same OpenClaw instance. The retrospective audit had to back-correlate timestamps with bench-* JSONLs to separate "games work" from "everything else openclaw was doing in parallel." Separate keys per project would have made attribution trivial and saved me hours of CSV cross-referencing.
Ch. X
Where to go next
If this guide was useful, here's how to go deeper:
Read the full research paper — all 580 builds analysed in detail with interactive charts and playable game embeds. Read the paper →
Play the games — every build is a single HTML file you can open in a browser. See what AI-generated code looks and feels like. Open the library →
Try it yourself — pick a simple project (a to-do app, a calculator, a game) and build it with 3 different models. Compare the output. This teaches you more than any guide.
Track your costs — before you start, decide your budget. Log tokens and costs from the first build. You'll be surprised where the money goes.
Test everything — the single most impactful habit. Run the code. Click the buttons. Check the edge cases. AI output looks right more often than it is right.