
A Field Manual · No. 01 · April 2026

AI Coding Without the Waste

What 580 AI-built games taught us about prompts, models, money, and when to just let the machine cook.

580 builds · 9 models · 10 games · 3 judges · £1.3k all in · 0 humans mid-build
Contents
  I. What AI coding actually is
  II. Choosing your model
  III. The spec matters (or doesn't)
  IV. Vibe coding: how far can zero-input go?
  V. The money traps
  VI. Checking the output
  VII. When to intervene
  VIII. Tools of the trade
  IX. The mistakes I made
  X. Where to go next

This guide is based on real data: 580 retro arcade games built by 9 AI models with zero human intervention, scored by automated judges and human playtesting. Every recommendation traces back to measured results, not vibes. Whether you're trying AI coding for the first time or optimising an existing workflow, this is the unvarnished version — what works, what doesn't, and what costs more than it should.

Ch. I

What AI coding actually is

Before you read anything else — this was all one-shot
Every number in this guide comes from a single-pass build: one prompt in, one result out, no iteration, no human review, no retry. That is deliberately unrepresentative of how people actually code with AI. In a real iterative workflow — write, run, fix, refine — the frontier models (Sonnet, GPT-5.4, Opus) almost certainly pull further ahead than this snapshot suggests. Their strength is adapting to feedback across turns, which a one-shot cannot measure.

Opus is the cleanest example: in V1's accumulated-context runs — where each new build saw the previous one's code — Opus scored roughly a full point higher than when sessions were isolated. The more information and iteration it had, the better it got. Treat every number here as the lower bound for premium models. In real use, the gap between frontier and budget widens, not narrows.

AI coding means giving a language model a description of what you want and getting code back. That's it. The model doesn't "understand" your project. It predicts the most likely sequence of tokens that follows your prompt, trained on billions of lines of code it's seen before.

This has two implications: tasks well represented in the training data come out reliably, and tasks that aren't remain a gamble no matter how carefully you prompt.

In our benchmark, models scored 8.72/10 on Snake (simple, well-known) but 6.09/10 on Donkey Kong (complex, more implementation-specific). The complexity of the game mattered far more than which model you used.

Myth: "AI can code anything if you describe it well enough."

Reality: AI can code well-known things reliably. Novel things are a gamble regardless of prompt quality.
A note on "quality" in this guide
Unless stated otherwise, quality means the consensus score from three AI judges reading each build's source code. It measures whether the code runs, whether mechanics look right, and whether controls are wired up. It does not measure readability, maintainability, or whether a human enjoys using the software. Where human taste diverges meaningfully from AI-judge scores, we flag it. Treat every number as directional, not absolute.
Ch. II

Choosing your model

There's no single "best" model. There's the right model for your budget and quality bar.

What are you optimising for?
Max quality: Use Sonnet, GPT-5.4, or Opus. V1 had Sonnet/GPT-5.4 nearly tied (8.62 vs 8.58). V2 put Opus at #1 (7.77) once context isolation was enforced. Any of the three is defensible for production work. Test your actual pipeline before locking in.

Volume / cost: Use Gemini Flash. Consensus score ~90% of Sonnet's, token rates ~5× cheaper on output and ~6× on input via PaleBlueDot. The honest caveat: that 90% is AI-judges-reading-code quality. They don't measure readability, code taste, or maintainability. For throwaway generation, Flash is compelling. For code you plan to read, extend, or trust long-term, the Sonnet gap may be larger than the score suggests.

Free-form: Gemini Pro is strong when the pipeline gives it free-form specs and isolated context. V2 ranked it #2 (7.45) — behind Opus, ahead of Sonnet and GPT-5.4. Its V1 "bad" reputation (6.78) was largely a scoring artefact, not a real quality gap. Worth a look for planner roles or isolated single-pass generation.

Budget: Haiku is the safest budget pick in our data (7.09 V1 consensus). GPT-5.4 Mini looked good in V1 (7.86) but collapsed to 5.07 under V2's stricter evaluation — the budget class is not interchangeable. Treat any budget choice as provisional until you've tested your own task.

What to avoid

GPT-5.4 Mini, GPT-5.4 Nano, and o3-mini. Mini collapsed from 7.86 in V1 to 5.07 under V2's stricter evaluation, o3-mini finished dead last (5.23/10, with caveats covered in Chapter V), and none of the three performed consistently enough to recommend.

| Model | V1 cons. | V2 | Tier | When to reach for it |
|---|---|---|---|---|
| Sonnet | 8.62 | 7.40 | Premium | Production code you'll read and extend |
| GPT-5.4 | 8.58 | 7.43 | Premium | Production; ~17% cheaper input tokens than Sonnet, same output |
| Opus | 8.22 | 7.77 | Premium+ | Complex tasks, isolated single-pass pipelines |
| Gemini Pro | 6.78 * | 7.45 | Mid | Free-form specs, planner roles, isolated context |
| Gemini Flash | 7.74 | 6.50 | Budget | Volume generation, throwaway code, prototyping |
| Haiku | 7.09 | 5.65 | Budget | Lightweight tasks; check your use case first |

* V1 Gemini Pro's 6.78 consensus reflects Flash-judge inflation affecting other models more. By the calibrated GPT+Opus judge average, Pro was 7.15 — competitive with Flash. GPT-5.4 Mini, GPT-5.4 Nano, and o3-mini are omitted from this table because V1/V2 coverage is incomplete or their performance was too inconsistent to recommend.

Where to start — which subscriptions to buy

If you're new to AI coding and want to set yourself up properly without overpaying, the data points in a clear direction.

If you can buy one subscription
Anthropic. Sonnet covers production work; Haiku covers cheap routine tasks; Opus covers the hardest cases when you need it. Anthropic appears in 7 of the 10 best builder/planner pairings in our V2 data — it's the load-bearing vendor in this space. You give up Gemini Flash's cheap volume and Gemini Pro's free-form planning niche, but you stay in the top quality tier for everything you'd realistically build.
If you can buy two
Anthropic + Google. Anthropic for premium one-shot quality (Opus, Sonnet, Haiku), Google for the cheapest tokens (Flash) and the strongest free-form planner (Gemini Pro, V2 #2 builder). This combo covers 4 of the top 6 V2 builders plus the cheapest token tier — your $/build floor drops from ~$0.34 (Haiku) to ~$0.11 (Flash) for volume work.

Practical pipeline pairings (planner × builder)

If you're building agent systems where one model writes a spec and another model executes it, pair carefully. The wrong combination can make a strong builder produce broken output.

| # | Planner → Builder | Sweet spot |
|---|---|---|
| 1 | Opus → Sonnet | Novel / complex tasks. Maximum-depth spec, reliable executor. |
| 2 | Opus → GPT-5.4 | Cross-vendor production. GPT-5.4 builders gain +0.74 with detailed specs — only model with a strong positive spec benefit. |
| 3 | Opus → Opus | Hardest tasks, cost no object. V2 #1 builder + deepest planner. |
| 4 | GPT-5.4 → GPT-5.4 | OpenAI-only pipeline. The only same-model pair where specs strongly help. |
| 5 | Gemini Pro → Sonnet | Best Anthropic + Google pair. Pro writes the lightest non-trivial spec, Sonnet executes anything. |
| 6 | GPT-5.4 → Sonnet | Mixed-vendor org. Concise OpenAI specs, Anthropic execution. |
| 7 | Gemini Pro → Opus | Lighter spec into V2 #1 builder. Saves planner cost without hurting build. |
| 8 | Gemini Pro → Gemini Flash | Same-vendor budget pair. Free-form planning + cheap execution. ~10× cheaper end-to-end than Anthropic premium. |
| 9 | GPT-5.4 → Opus | Concise spec into best builder. Spec-neutral builder absorbs any planner. |
| 10 | GLM-5 → Gemini Flash | Aggregator-API-only budget pair (no consumer subscription needed for either). ~$0.10–0.11 / build. |

Pairings to avoid

Detailed planners feeding spec-sensitive budget builders. GPT-5.4 Mini lost over a full point when handed detailed specs, so a maximum-depth planner like Opus actively hurts it. Give budget builders the lightest brief that covers the genuinely novel parts.

The single most useful rule
Anthropic should be your first subscription. Whatever your second is — Google for cheap volume, OpenAI for spec-heavy pipelines, or none at all — Anthropic is the vendor most consistently producing top builders and top planners across both rounds of testing. If you only buy one, buy this. Add a second when you've identified a specific gap (cheap volume, vendor diversity, OAuth pricing) Anthropic doesn't fill.
Ch. III

The spec matters (or doesn't)

We ran a direct experiment: five planners wrote specifications for the same 10 games, ranging from 29,000-byte engineering blueprints (Opus) to 90 bytes (just the game name). Eight builder models built from each spec.

The result, confirmed by 3-judge static QA: Builders given only the game name averaged 6.86/10 — higher than the 6.70 average from detailed specifications. Most builders scored worse with specs; GPT-5.4 Mini lost over a full point.

What this means for you
If you're asking an AI to build something it already knows (a to-do app, a login form, a common game), a brief description is probably enough. Save your specification effort for the parts that are genuinely novel or specific to your needs.

This doesn't mean specs are always useless. It means the model's training data already includes the "spec" for well-known tasks. When you write a detailed specification for Pac-Man, you're largely restating what the model already knows — and sometimes constraining it in ways that make the output worse.

When specs DO help

Genuinely novel mechanics or requirements the model can't have seen in training, and GPT-5.4 builders specifically, which were the only models with a strong positive spec benefit (+0.74 with detailed specs).

When specs hurt

Well-known tasks. A detailed Pac-Man spec restates what the model already knows and can constrain it into worse output; GPT-5.4 Mini lost over a full point when given specs.

Ch. IV

Vibe coding: how far can zero-input go?

"Vibe coding" — coined by Andrej Karpathy — means giving an AI a loose description and letting it build whatever it interprets. No detailed spec, no iteration, no code review. Just vibes.

Our benchmark is essentially a 580-build vibe coding experiment. Every game was built in a single pass with no human intervention. The results: an average of 7.5/10, functional but flawed, with quality falling off sharply as game complexity rose.

The vibe coding trap
It feels magical when it works on simple tasks — which tricks you into thinking it'll scale to complex ones. It doesn't. The 580-build evidence shows quality drops sharply with complexity, and no amount of model quality closes the gap. Even Sonnet (best overall) scored 6.09 on Donkey Kong.

The vibe coding spectrum

| Level | You do | AI does | Good for |
|---|---|---|---|
| Full vibe | One sentence | Everything | Prototypes, known patterns |
| Guided vibe | Requirements list | Architecture + code | Features with clear scope |
| Pair coding | Architecture + review | Implementation | Production features |
| Assisted | Everything except typing | Autocomplete + suggestions | Complex/novel systems |

The right level depends on what you're building and how much you can afford to get wrong. Vibe coding a prototype? Go for it. Vibe coding your payment system? Please don't.

Ch. V

The money traps

AI coding costs are deceptive. The per-token prices look small until you multiply by volume, iterations, and the invisible cost of debugging bad output.

Trap 1 — Premium by default

Most people start with the "best" model and never test cheaper alternatives. In our benchmark, Gemini Flash scored ~90% of Sonnet's AI-judge consensus (7.74 vs 8.62 in V1) at PaleBlueDot rates that are ~5× cheaper on output ($3 vs $15 / M tokens) and ~6× cheaper on input ($0.50 vs $3 / M). For volume generation where small quality drops are acceptable, the math favours Flash heavily.

The honest caveat: AI judges read the code and score whether it runs, looks right, and handles controls. They don't meaningfully assess readability, maintainability, or code taste. Human evaluators often find a wider gap than the 10% suggests — Flash's output is more likely to be oddly structured, sparsely commented, or built on shortcuts that work but aren't good code. If you're generating once and discarding, Flash is obvious value. If the output will be read, extended, or live in production, the premium tier often buys more than the score says.
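To make the arithmetic concrete, here is a sketch of per-build cost at the rates quoted above. The 20k-in / 60k-out token counts are illustrative assumptions, not benchmark averages:

```python
# Per-million-token PaleBlueDot rates quoted in this chapter ($/M tokens).
SONNET = {"input": 3.00, "output": 15.00}
FLASH = {"input": 0.50, "output": 3.00}

def build_cost(rates, input_tokens, output_tokens):
    """Dollar cost of one build at the given per-million-token rates."""
    return (input_tokens / 1e6) * rates["input"] + (output_tokens / 1e6) * rates["output"]

# Illustrative build: 20k prompt tokens in, 60k code tokens out.
sonnet_cost = build_cost(SONNET, 20_000, 60_000)  # 0.02*3 + 0.06*15 = $0.96
flash_cost = build_cost(FLASH, 20_000, 60_000)    # 0.02*0.5 + 0.06*3 = $0.19
print(f"Sonnet ${sonnet_cost:.2f} vs Flash ${flash_cost:.2f}, "
      f"{sonnet_cost / flash_cost:.1f}x cheaper per build")
```

At any realistic input/output mix the ratio lands near the 5–6× headline rates, which is why the per-build floor moves so much for volume work.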

Trap 2 — Paying for reasoning you don't need

Reasoning models think step-by-step before generating. That's brilliant for maths and logic puzzles, expensive for creative work that wants whole-system generation. In our benchmark, the only reasoning model in the panel — o3-mini — scored dead last (5.23/10).

The honest context: o3-mini was released January 2025, making it over 15 months old at benchmark time, and it's a lightweight reasoning model by design. Its last-place finish is as much about age and weight class as it is about "reasoning vs creative coding" — you can read this as a snapshot of how fast the frontier has moved in 15 months. In hindsight we should have used o5-mini (I genuinely thought we had; turns out the pipeline picked up o3-mini and the mistake only surfaced in analysis). If you're comparing a reasoning model for code generation today, start with o5-mini or similar — not o3-mini.

The narrow takeaway: don't pay premium for o3-mini when a modern budget model will outperform it. The broader question — whether reasoning architectures genuinely hurt creative coding — needs more than one old data point to answer.

Trap 3 — Iteration without evaluation

The most expensive habit: re-prompting a model repeatedly without checking if the output is actually getting better. Each iteration costs tokens. If you're not measuring quality between iterations, you're paying for random walks.

The cost-smart approach
Start with a cheap model (Gemini Flash or GPT-5.4 Nano). If the output isn't good enough, try a mid-tier model (Haiku). Only escalate to premium (Sonnet, GPT-5.4, or Opus) when you've confirmed the task actually needs it. Most tasks don't.
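That ladder is easy to wire up. A minimal sketch, where `generate` and `quality_score` stand in for your own model call and evaluation harness:

```python
# Cheapest-first escalation. The model names and threshold are placeholders;
# plug in your own client and evaluation harness.
LADDER = ["gemini-flash", "haiku", "sonnet"]

def build_with_escalation(prompt, generate, quality_score, threshold=7.0):
    """Try models cheapest-first; stop at the first output that clears the bar."""
    for model in LADDER:
        code = generate(model, prompt)
        score = quality_score(code)
        if score >= threshold:
            return model, code, score
    # Nothing cleared the bar: return the last (most capable) attempt anyway.
    return model, code, score
```

The point is the `quality_score` call between rungs: without it you are iterating blind, which is exactly Trap 3.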
Ch. VI

Checking the output — why you can't trust AI self-reports

When we asked AI judges to evaluate AI-built games, they disagreed significantly. V1's Gemini Flash judge rated nearly everything 9–10, inflating scores by ~1.8 points. V2 replaced it with Gemini Pro — and the entire leaderboard shifted as a result. The judge panel you choose changes your scores even if the underlying quality is identical.

The lesson generalises beyond our benchmark: AI models are unreliable evaluators of their own output. They're biased toward marking things as "correct" and they can't detect the kinds of bugs that only show up when you actually use the software.
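A toy illustration of why panel composition matters: with made-up per-judge scores, one inflated judge is enough to flip a two-build leaderboard.

```python
from statistics import mean

# Invented per-judge scores for two builds; not benchmark data.
scores = {
    "build_a": {"gpt_judge": 7.0, "opus_judge": 7.0, "flash_judge": 9.9},
    "build_b": {"gpt_judge": 7.5, "opus_judge": 7.5, "flash_judge": 8.0},
}

def consensus(build, exclude=()):
    """Mean score across the panel, optionally dropping named judges."""
    return mean(s for judge, s in scores[build].items() if judge not in exclude)

# With the inflated judge, build_a "wins"; drop it and build_b wins.
for build in scores:
    print(build, round(consensus(build), 2),
          round(consensus(build, exclude=("flash_judge",)), 2))
```

This is the V1-to-V2 leaderboard shift in miniature: the underlying builds didn't change, the panel did.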

What automated testing catches

Whether the artefact exists and opens, whether the code parses and runs, and whether mechanics and controls look correctly wired when a judge reads the source.

What only a human catches

Whether the game is fun, whether the difficulty is fair, and whether the ghost AI actually chases you. These bugs only show up when you play.

If you're not testing it yourself, you're not testing it
Automated QA is necessary but not sufficient. Run the code. Use the feature. Click the buttons. The five minutes you spend testing will save hours of debugging later.
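For single-file HTML5 builds like this benchmark's, a basic artefact check might look like the sketch below. The 500-byte floor and the canvas check are assumptions suited to canvas games, not universal rules:

```python
from pathlib import Path

def artefact_smoke_test(path):
    """Minimal checks to run before trusting a 'build complete' claim:
    the file must exist, be non-trivial, and actually wire up a canvas."""
    p = Path(path)
    if not p.is_file():
        return False, "file missing"
    html = p.read_text(errors="replace")
    if len(html) < 500:
        return False, "suspiciously small artefact"
    if "<canvas" not in html.lower():
        return False, "no canvas element, nothing will render"
    return True, "passed basic checks (still needs a human playtest)"
```

This catches the "model said done, file is empty" class of failure for pennies; it does not replace opening the game and playing it.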
Ch. VII

When to intervene

Zero-intervention AI coding (our benchmark methodology) produced average scores of 7.5/10 across 580 builds. That's "functional but flawed" territory. The question isn't whether human intervention helps — it obviously does — but when the return on your time is highest.

High-value interventions

Running and playtesting the output yourself (five minutes of testing saves hours of debugging), verifying that artefacts actually exist and render rather than trusting "complete" claims, and writing spec detail only for the genuinely novel parts of a task.

Low-value interventions

Writing detailed specs for patterns the model already knows, and re-prompting repeatedly without measuring whether quality is actually improving between iterations.

Ch. VIII

Tools of the trade

AI coding tools fall into a few categories:

| Category | Examples | Best for |
|---|---|---|
| Autocomplete | GitHub Copilot, Cursor Tab | Speeding up known patterns. You're still driving. |
| Chat-based coding | Claude Code, Cursor Chat, Aider | Generating features, refactoring, explaining code |
| Agent frameworks | OpenClaw, Devin, SWE-agent | Autonomous multi-step tasks, bulk generation |
| Benchmarks / evaluation | SWE-bench, HumanEval, this project | Measuring model quality, choosing models |

This benchmark used OpenClaw as the agent framework — it handles session management, model routing, and artefact verification. The games themselves were built entirely by the AI models; OpenClaw just orchestrated the process.

Ch. IX

The mistakes I made

Running 580 builds taught me things the hard way. The mistakes fall into five categories, grouped by where they actually bit me: data, workflow, QA, model choice, and infrastructure. The infrastructure category was added late, after a retrospective token audit revealed where the money really went.

01 Data & cost tracking

  1. I under-budgeted for QA.

    A retrospective token audit broke the spend down honestly. The games-attributable work cost ~$1,138; another ~$660 of unrelated automation (cron jobs, daily research agents, side projects) shared the API key during the period for a gross PaleBlueDot total of ~$1,798. Within the games slice, V2 QA alone was $252 — a 41-hour bench-sonnet inline-judging session ($202) plus the 3-judge static QA panel ($50) — vs ~$10 of V1 QA that leaned on free Playwright checks. V2 QA was 25× more expensive than V1 QA. I did not see that coming, and it's by far the largest single chunk of avoidable spend.

  2. I didn't record token counts from the start.

    Runs 1–6 have no token data. I can only estimate costs for those builds. Always instrument your pipeline for cost tracking from day one — retrofitting it is painful and often impossible.
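A day-one instrumentation sketch: one JSONL record per API call, with a project field so attribution stays trivial. The names and fields are illustrative, not the actual pipeline's:

```python
import json
import time
from pathlib import Path

LOG = Path("token_log.jsonl")

def log_usage(model, input_tokens, output_tokens, project="games-bench"):
    """Append one usage record per API call. Retrofitting this later is
    painful or impossible; writing it from call one keeps costs auditable."""
    record = {
        "ts": time.time(),
        "project": project,  # separate projects make cost attribution trivial
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
    }
    with LOG.open("a") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

Call it immediately after every provider response; a dozen lines of logging would have saved the hours of CSV cross-referencing described in the infrastructure section.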

02 Workflow & methodology

  1. I accumulated context between builds.

    In V1, builders saw their own prior code between games. This caused context exhaustion (DNFs on complex games) and made comparisons unfair — Opus was penalised most. V2 isolated every session.

  2. I used a template for specs.

    V1's 12-section spec template meant both planners filled the same form. I was testing template-filling, not planning. V2's unconstrained specs revealed real differences between how each model interprets a loose brief.

  3. I ran builds in parallel.

    Parallel execution caused timeout cascading in V1. V2 used a mix of parallel and sequential batches depending on provider rate limits — slower but reliable.

  4. I over-specified simple games.

    Writing 300-line specs for Pong was a waste. The model already knows Pong. The spec just added noise and occasionally confused the builder into worse output than a one-line prompt.
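The fix for point 3, parallel across providers but throttled within each, can be sketched with a semaphore per provider. The concurrency caps below are placeholders, not measured limits:

```python
import concurrent.futures as cf
import threading

# Assumed per-provider concurrency caps; tune these to your real rate limits.
LIMITS = {"anthropic": 2, "google": 4, "openai": 2}
_sems = {p: threading.Semaphore(n) for p, n in LIMITS.items()}

def run_build(provider, task, build_fn):
    """Run one build without exceeding that provider's concurrency cap."""
    with _sems[provider]:
        return build_fn(provider, task)

def run_batch(jobs, build_fn, max_workers=8):
    """jobs: list of (provider, task) pairs. Parallel across providers,
    throttled within each; unbounded parallelism is what cascaded in V1."""
    with cf.ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_build, p, t, build_fn) for p, t in jobs]
        return [f.result() for f in futures]
```

Slower than firing everything at once, but it trades a little wall-clock time for not losing whole batches to timeout cascades.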

03 Quality assurance

  1. I trusted model self-reports.

    Early builds were marked "complete" because the model said so. Many were broken. Always verify the artefact (does the file exist? does it open? does the canvas render?), not the model's claim about it.

  2. I used Gemini Flash as a judge.

It rates nearly everything 9–10 and inflated V1 consensus scores by roughly 1.8 points. A 3-judge panel is fine, but one of my judges was effectively noise — a cheerful yes-machine masquerading as calibration.

  3. I assumed automated QA was sufficient.

    V2's browser-bug scoring gave 88% of builds a perfect score. That's not discrimination — that's a broken metric. Human playtesting catches the things static analysis and AI judges miss: fun, fairness, whether the ghost AI actually chases you.

04 Model selection

  1. I assumed more expensive = better.

    Per PaleBlueDot rates: Opus is ~1.7× Sonnet's per-token price ($25/M vs $15/M output, same ratio on input). Flash is ~5× cheaper than Sonnet on output, ~6× on input. The bigger cost story isn't per-token rate — it's how many calls each model handles and how long they run. Opus drew $145 of the $377 builder API spend across V1 + V2 — the most of any single builder — because individual Opus calls run longer and accumulate more context. Published price tier doesn't predict quality, and per-token rate doesn't predict total spend.

05 Infrastructure & cost-saving

  1. I let the agent framework eat 2/3 of my budget.

    The actual builder + judge model API calls (the bench-* sub-agents that did the work) cost ~$364. The orchestration tax around them — main / ross / harvey / dwight dispatching, polling, summarising, retrying — was ~$753. For every $1 spent on actual game-building or judging, $1.83 went to coordination overhead. The pipeline is roughly 60% coordination, 35% builders, 5% judges. If you're using a framework like OpenClaw, every conversational turn re-sends accumulated context, so token cost grows quadratically with conversation length, not linearly. Trim context aggressively between turns, batch stateless calls where you can, or use leaner pipeline scripts — V2's build_v2.py ran at $8/h orchestration vs V1's ad-hoc parallel-dispatch at $19/h. Same work, ~2.3× cheaper to coordinate.

  2. I paid raw PAYG rates instead of using subscription tiers.

    Everything went through PaleBlueDot at gateway pay-as-you-go because that's how the pipeline was wired. Most providers offer subscription / committed-spend / OAuth flows that materially reduce effective rates for sustained usage. For a project at this scale ($1,798 across two weeks of intermittent runs), that alone could have cut the bill by a meaningful fraction. Lesson: for any benchmark or batch work that'll spend more than a few hundred dollars, audit your provider's billing tiers before you start, not after.

  3. I let one API key bill for everything.

    The same PaleBlueDot key billed for V1 builds, V2 builds, QA, V1 rebuilds, this paper's writing — and for unrelated cron jobs, daily research agents, side projects all running on the same OpenClaw instance. The retrospective audit had to back-correlate timestamps with bench-* JSONLs to separate "games work" from "everything else openclaw was doing in parallel." Separate keys per project would have made attribution trivial and saved me hours of CSV cross-referencing.
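The quadratic context growth from point 1 is easy to check with a toy model: if every turn re-sends all prior context, turn k bills roughly k times the per-turn token count.

```python
def conversation_tokens(turns, tokens_per_turn):
    """Total input tokens billed when each turn re-sends all prior context.
    Turn k carries k * tokens_per_turn, so the sum grows quadratically."""
    return sum(k * tokens_per_turn for k in range(1, turns + 1))

short = conversation_tokens(10, 2_000)  # 110,000 tokens
long_ = conversation_tokens(40, 2_000)  # 1,640,000 tokens
# 4x the turns costs ~15x the tokens, which is why trimming context
# between turns and batching stateless calls pays off so dramatically.
```

This is a simplification (it ignores output tokens and any caching discounts), but it captures why the orchestration tax dwarfed the builder spend.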

Ch. X

Where to go next

If this guide was useful, here's how to go deeper: