# Harness Design for Long-Running Application Development

**Executive Summary (Layer 3):** Anthropic Labs developed a **GAN-inspired multi-agent harness** (generator plus a separate evaluator) that enabled Claude to complete complex 6-hour full-stack development tasks and produce "museum quality" frontend designs—work that solo agents failed at in 20 minutes. The harness cost 20× more ($200 vs. $9) but produced functional output where solo attempts broke.

**Key Insight (Layer 2):** Two failure modes doom long-running AI tasks: **context degradation** (models lose coherence as windows fill) and **self-evaluation bias** (agents grade their own work too positively). The harness solves both through **context resets** (handing state to fresh agents) and **separate evaluator agents** with structured scoring rubrics. Creative breakthroughs emerged only after 5–15 iterations—suggesting quality requires iteration depth, not just model capability.

**Context (Layer 1):** Traditional single-agent approaches work for short tasks but fail as complexity and duration increase. Anthropic's experiments compared solo Claude runs against multi-agent harnesses in two domains: subjective frontend design (aesthetic quality) and verifiable full-stack development (a retro game maker, a DAW). The harness used the Claude Agent SDK with Playwright MCP for live page evaluation.

---

## Why Naive Implementations Fail

### 1. Context Degradation & "Context Anxiety"

Models lose coherence as context windows fill. **Context anxiety**: models prematurely wrap up work as they approach perceived context limits.

**Solution:** Context resets (clearing the window entirely and handing off state to a fresh agent) rather than compaction (summarizing history in place).

> "Claude Sonnet 4.5 exhibited context anxiety strongly enough that compaction alone wasn't sufficient... context resets became essential."

Trade-off: resets provide a clean slate but add orchestration complexity, token overhead, and latency.

### 2. Self-Evaluation Bias

Agents reliably skew positive when grading their own work, praising mediocre output. The bias is most pronounced in subjective tasks (design) but persists even in verifiable ones.

**Solution:** Separate the generator from the evaluator. While evaluators are still generous initially, tuning a standalone evaluator to be skeptical is "far more tractable than making a generator critical of its own work."

---

## Frontend Design: Making Subjective Quality Gradable

### The Four Criteria

Evaluators and generators use these weighted criteria to avoid "AI slop" (generic purple gradients, template layouts):

| Criterion | Focus | Weight |
|-----------|-------|--------|
| **Design Quality** | Coherent whole: colors, typography, and layout create a distinct mood/identity | High |
| **Originality** | Evidence of custom decisions vs. stock components/AI patterns | High |
| **Craft** | Technical competence: hierarchy, spacing, contrast ratios | Low (model handles by default) |
| **Functionality** | Usability independent of aesthetics | Low |

> "'Is this design beautiful?' is hard to answer consistently, but 'does this follow our principles for good design?' gives Claude something concrete to grade against."

### Process

- **5–15 iterations** using the Claude Agent SDK
- Evaluator uses **Playwright MCP** to navigate the live page, take screenshots, and critique before scoring
- Generator makes strategic decisions: refine if trending well, pivot aesthetic if failing
- Calibration via few-shot examples with detailed score breakdowns

**Notable Result:** A Dutch art museum website transformed on iteration 10 from the expected dark-themed landing page into a **3D spatial experience** (a CSS-perspective room with a checkered floor, wall-hung artwork, and doorway navigation)—a creative leap not seen in single-pass generation.

---

## Full-Stack Architecture: Three-Agent System

### Agent Personas

**1. Planner**
- Expands a 1–4 sentence prompt into a full product spec (16 features in the game maker example)
- Stays high-level on technical implementation to avoid cascading errors
- Tasked with weaving AI features into specs

**2. Generator**
- Works in **sprints** (one feature at a time)
- Stack: React, Vite, FastAPI, SQLite/PostgreSQL
- Self-evaluates at sprint end; uses git for version control

**3. Evaluator (QA)**
- Uses **Playwright MCP** to click through the running app
- Grades against sprint-contract criteria with **hard thresholds** (fail → feedback loop)
- Catches bugs like:
  - a `fillRectangle` function that exists but isn't triggered properly
  - route ordering causing FastAPI to match `'reorder'` as a `frame_id`
  - logic errors requiring `selection || (selectedEntityId && activeLayer === 'entity')`

### Sprint Contracts

Before coding, the generator and evaluator negotiate a contract defining "done" for that sprint—bridging the gap between high-level user stories and testable implementation. The agents communicate via files.

### Comparison: Solo vs. Harness

**Prompt:** "Create a 2D retro game maker with features including a level editor, sprite editor, entity behaviors, and a playable test mode."

| Approach | Duration | Cost | Result |
|----------|----------|------|--------|
| **Solo agent** | 20 minutes | $9 | Broken, incomplete |
| **Harness** | 6 hours | $200 | Functional retro game maker |

---

## Key Results & Evolution

| Metric | Frontend Design | Full-Stack |
|--------|-----------------|------------|
| **Iterations** | 5–15 | Feature-by-feature sprints |
| **Duration** | ~4 hours | ~6 hours |
| **Creative leaps** | Yes (3D spatial transformation) | N/A (verifiable output) |
| **Cost** | Higher than solo | $200 vs. $9 |
| **Success rate** | Museum-quality output | Functional complex apps |

**Model Evolution:** As models improved (Opus 4.5 → 4.6), harness complexity could be reduced while maintaining capability—suggesting harness sophistication is inversely related to base model capability.
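The sprint loop described above—generate one feature, grade it against the negotiated contract, and loop on failure until a hard threshold is cleared—can be sketched as a minimal orchestration skeleton. This is a hypothetical illustration, not Anthropic's implementation: the names (`run_sprint`, `SprintContract`, `PASS_THRESHOLD`) are invented, and the `generate`/`evaluate` callables stand in for separate agent sessions that would, in the real harness, start with fresh context and drive the running app through tools like Playwright MCP.

```python
from dataclasses import dataclass

PASS_THRESHOLD = 8  # hard gate: evaluator score required to close a sprint (assumed scale 0-10)


@dataclass
class SprintContract:
    feature: str
    criteria: list[str]  # testable "done" conditions negotiated before coding


@dataclass
class SprintResult:
    feature: str
    iterations: int
    score: int


def run_sprint(contract, generate, evaluate, max_iterations=5):
    """One sprint: generate, grade against the contract, loop on failure.

    `generate` and `evaluate` are deliberately *separate* callables,
    mirroring the generator/evaluator split: the evaluator never grades
    work it produced itself, and its feedback feeds the next attempt.
    """
    feedback = None
    score = 0
    for i in range(1, max_iterations + 1):
        artifact = generate(contract, feedback)          # generator acts on prior QA feedback
        score, feedback = evaluate(contract, artifact)   # independent QA pass
        if score >= PASS_THRESHOLD:                      # hard threshold, not self-assessment
            return SprintResult(contract.feature, i, score)
    return SprintResult(contract.feature, max_iterations, score)
```

The design choice worth noting is that the loop's exit condition belongs to the evaluator alone; the generator can self-evaluate at sprint end, but only the external score closes the sprint, which is what makes the hard threshold meaningful.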
---

## Cross-Domain Connections

- **[[AI Agent Harness Engineering]]** — General principles of harness over model capability
- **[[Harness Over Model - The New AI Product Moat]]** — Harness design as competitive advantage
- **Multi-agent patterns** — The generator–evaluator separation mirrors GAN architecture
- **Quality iteration** — Creative breakthroughs require depth (5–15 iterations), not just prompting skill

**Discoverability Score:** High — official Anthropic engineering blog, detailed implementation guidance, production-tested patterns

---

## Source

- **Article:** [Harness Design for Long-Running Application Development](https://www.anthropic.com/engineering/harness-design-long-running-apps)
- **Author:** Prithvi Rajasekaran, Anthropic Labs
- **Published:** March 24, 2026