AI Code Evaluation — A Playable Field Guide

Lessons

Eight practical takeaways from 580+ AI-generated retro games, 690 scored builds, and a few months of suffering. Every number below is pulled live from the dataset. Click "See evidence" to verify it in the other tabs.

Library

Every build is here. Click a card to play the game, see all three AI judge scores, and optionally add your own rating. Filter by version, planner, builder, or game.

Plans & Prompts

What each planner was asked for, and what they wrote. V1 planners were given the same 12-section template — same input, different fills. V2 planners were given one free-form prompt — same ask, wildly different interpretations.

V1 Template prompt

Both V1 planners (Opus and GPT-5.4) were instructed to produce a spec following a strict 12-section structure. This isolates how each model fills a fixed template.

The 12 sections:

Overview
Canvas & Rendering
Game Objects
Controls
Game Rules & Logic
Collision Detection
Scoring & State
UI
Audio
Implementation Notes
Acceptance Criteria
Build Task Checklist

Note: the exact V1 prompt string wasn't saved to disk — the 12-section structure was imposed by the orchestrator logic rather than stored in a reusable file. Both V1 planners produced specs following this structure.
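Since the template was enforced by orchestrator logic rather than a saved prompt, one way to confirm that a given V1 spec follows the structure is to check it for the 12 section titles listed above. The following is a minimal sketch, not the orchestrator's actual code: the function name `missing_sections` is hypothetical, and it assumes each section appears in the spec as its own heading line, optionally with markdown `#` marks or a leading number.

```python
import re

# The 12 section titles of the V1 template, in order.
V1_SECTIONS = [
    "Overview",
    "Canvas & Rendering",
    "Game Objects",
    "Controls",
    "Game Rules & Logic",
    "Collision Detection",
    "Scoring & State",
    "UI",
    "Audio",
    "Implementation Notes",
    "Acceptance Criteria",
    "Build Task Checklist",
]


def missing_sections(spec_text: str) -> list[str]:
    """Return the template sections that have no heading line in the spec.

    Assumption: a heading is a line that equals the section title once
    leading '#' marks, whitespace, numbering such as '3.' or '3)', and a
    trailing colon are stripped. Matching is case-insensitive.
    """
    found = set()
    for line in spec_text.splitlines():
        heading = re.sub(r"^[#\s]*(\d+[.)]\s*)?", "", line)
        found.add(heading.strip().rstrip(":").lower())
    return [title for title in V1_SECTIONS if title.lower() not in found]
```

An empty result means every section has a heading. A non-empty result lists what the planner left out, in template order.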

V2 Free-form prompt

V2 removed the template entirely. All 5 V2 planners (Opus, GPT-5.4, Gemini Pro, GLM-5, Control) received the same brief prompt and decided for themselves what to produce.

QA Data

Every judge's score, every dimension, every build. Click a column header to sort. Click a row to jump to the game in the Library tab. The judge prompt and rubric are viewable below so you can replicate.

View judge prompt & rubric
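When replicating the scoring, a quick way to see how much the judges agree on a build is to compute the mean of their scores and the gap between the highest and lowest. The sketch below is illustrative only: the judge names, the score values, and the definition of spread as max minus min are assumptions, not taken from the dataset or the rubric.

```python
from statistics import mean


def summarize_scores(scores: dict[str, float]) -> dict[str, float]:
    """Summarize one build's judge scores.

    Assumption: 'spread' means the highest score minus the lowest.
    """
    values = list(scores.values())
    return {"mean": mean(values), "spread": max(values) - min(values)}


def builds_with_min_spread(
    builds: dict[str, dict[str, float]], min_spread: float
) -> list[str]:
    """Return the names of builds whose judges disagree by at least min_spread."""
    return [
        name
        for name, scores in builds.items()
        if summarize_scores(scores)["spread"] >= min_spread
    ]


# Hypothetical scores for two builds, three judges each.
example_builds = {
    "build_a": {"judge_1": 7, "judge_2": 9, "judge_3": 8},
    "build_b": {"judge_1": 3, "judge_2": 9, "judge_3": 6},
}

print(summarize_scores(example_builds["build_a"]))   # {'mean': 8, 'spread': 2}
print(builds_with_min_spread(example_builds, 5))     # ['build_b']
```

Builds with a large spread are the ones most worth playing yourself, since the judges did not converge on them.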
