AI Code Evaluation · a playable field guide

Lessons

Eight practical takeaways from 580+ AI-generated retro games, 690 scored builds, and a few months of suffering. Every number below is live from the dataset; click "See evidence" to verify in the other tabs.

The AI Iron Triangle
You cannot maximise all three. When speed and cost are optimised, quality often becomes appeasement — the model gives you what you want to hear, not what is true. Most of these lessons are symptoms of the triangle.
⚡ Speed 💵 Cost ✓ Quality (truth, not just vibes)

Library

Every build is here. Click a card to play the game, see all three AI judge scores, and optionally add your own rating. Filter by version, planner, builder, or game.


Plans & Prompts

What each planner was asked for, and what they wrote. V1 planners were given the same 12-section template — same input, different fills. V2 planners were given one free-form prompt — same ask, wildly different interpretations.

V1 Template prompt

Both V1 planners (Opus and GPT-5.4) were instructed to produce a spec following a strict 12-section structure. This isolates how each model fills a fixed template.

View the 12 sections & V1 prompt
  1. Overview
  2. Canvas & Rendering
  3. Game Objects
  4. Controls
  5. Game Rules & Logic
  6. Collision Detection
  7. Scoring & State
  8. UI
  9. Audio
  10. Implementation Notes
  11. Acceptance Criteria
  12. Build Task Checklist

Note: the exact V1 prompt string wasn't saved to disk — the 12-section structure was imposed by the orchestrator logic rather than stored in a reusable file. Both V1 planners produced specs following this structure.
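
Because the exact prompt was not saved, one way to confirm that a spec follows the fixed structure is to check it for the 12 section titles listed above. The following is a minimal sketch, not the orchestrator's actual logic: the function name `check_spec_sections` is hypothetical, and the heading format (optional `#` marks followed by "N. Title" on its own line) is an assumption about how the specs were written.

```python
import re

# The 12 section titles, copied from the list above.
V1_SECTIONS = [
    "Overview",
    "Canvas & Rendering",
    "Game Objects",
    "Controls",
    "Game Rules & Logic",
    "Collision Detection",
    "Scoring & State",
    "UI",
    "Audio",
    "Implementation Notes",
    "Acceptance Criteria",
    "Build Task Checklist",
]


def check_spec_sections(spec_text):
    """Return (missing_titles, in_order) for a spec.

    Hypothetical helper: assumes each heading sits on its own line as
    "N. Title", optionally prefixed with markdown '#' characters.
    """
    positions = []
    missing = []
    for number, title in enumerate(V1_SECTIONS, start=1):
        pattern = rf"^[ \t]*#*[ \t]*{number}\.[ \t]+{re.escape(title)}[ \t]*$"
        match = re.search(pattern, spec_text, flags=re.MULTILINE | re.IGNORECASE)
        if match is None:
            missing.append(title)
        else:
            positions.append(match.start())
    # Headings that were found must appear in ascending document order.
    in_order = positions == sorted(positions)
    return missing, in_order
```

A spec passes when `missing` is empty and `in_order` is true. If the real specs use a different heading style, only the pattern needs to change.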

V2 Free-form prompt

V2 removed the template entirely. All 5 V2 planners (Opus, GPT-5.4, Gemini Pro, GLM-5, Control) received the same brief prompt and decided for themselves what to produce.

View the V2 planner prompt

QA Data

Every judge's score, every dimension, every build. Click a column header to sort. Click a row to jump to the game in the Library tab. The judge prompt and rubric are viewable below so you can replicate.

View judge prompt & rubric
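
For readers replicating the table offline, the per-judge, per-dimension rows can be collapsed into one number per build and sorted, much like clicking a column header. This is only a sketch under stated assumptions: the record fields (`build_id`, `judge`, `dimension`, `score`), the numeric scale, and the unweighted mean are all hypothetical, and the actual rubric may weight dimensions differently.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class JudgeScore:
    # Hypothetical row shape: one score from one judge on one rubric dimension.
    build_id: str
    judge: str
    dimension: str
    score: float


def mean_score_per_build(rows):
    """Unweighted mean of every score recorded for each build (an assumption)."""
    by_build = {}
    for row in rows:
        by_build.setdefault(row.build_id, []).append(row.score)
    return {build_id: mean(scores) for build_id, scores in by_build.items()}


def sort_builds(rows, descending=True):
    """Return (build_id, mean_score) pairs ordered by mean score."""
    means = mean_score_per_build(rows)
    return sorted(means.items(), key=lambda item: item[1], reverse=descending)
```

Passing `descending=False` gives the lowest-scoring builds first, which is often the more useful view when hunting for failure cases.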

The Paper

Academic companion — 580-build benchmark, methodology, findings, limitations.