# Harness Design for Long-Running Application Development
**Executive Summary (Layer 3):** Anthropic Labs developed a **GAN-inspired multi-agent harness** (generator plus separate evaluator) that enabled Claude to complete complex 6-hour full-stack development tasks and produce "museum quality" frontend designs, work that solo agents failed at within 20 minutes. The harness cost roughly 20× more ($200 vs. $9) but produced functional output instead of broken attempts.
**Key Insight (Layer 2):** Two failure modes doom long-running AI tasks: **context degradation** (models lose coherence as windows fill) and **self-evaluation bias** (agents grade their own work positively). The harness solves both through **context resets** (handing state to fresh agents) and **separate evaluator agents** with structured scoring rubrics. Creative breakthroughs emerged after 5–15 iterations—suggesting quality requires iteration depth, not just model capability.
**Context (Layer 1):** Traditional single-agent approaches work for short tasks but fail as complexity and duration increase. Anthropic's experiments compared solo Claude runs against multi-agent harnesses on two domains: subjective frontend design (aesthetic quality) and verifiable full-stack development (retro game maker, DAW). The harness used the Claude Agent SDK with Playwright MCP for live page evaluation.
---
## Why Naive Implementations Fail
### 1. Context Degradation & "Context Anxiety"
Models lose coherence as context windows fill. **Context anxiety**: models prematurely wrap up work as they approach perceived context limits.
**Solution:** Context resets (clearing window entirely, handing off state to fresh agent) vs. compaction (summarizing history in place).
> "Claude Sonnet 4.5 exhibited context anxiety strongly enough that compaction alone wasn't sufficient... context resets became essential."
Trade-off: Resets provide clean slate but add orchestration complexity, token overhead, and latency.
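The reset-vs-compaction distinction can be sketched in a few lines of control flow. This is an illustrative stub, not the Claude Agent SDK API; `summarize`, `maybe_reset`, and the token threshold are hypothetical names standing in for real model calls and limits:

```python
# Hypothetical sketch of a context-reset handoff (not the actual SDK API).
# Assumption: `summarize` stands in for a real model call that writes a handoff note.

RESET_THRESHOLD = 150_000  # tokens; illustrative limit, not a documented value


def summarize(history: list[str]) -> str:
    """Stub: compress the full history into a handoff note for a fresh agent."""
    return f"Handoff note covering {len(history)} prior turns."


def maybe_reset(history: list[str], token_count: int) -> list[str]:
    """Reset (vs. compaction): replace the entire window with a fresh context
    seeded only by the handoff note, instead of summarizing history in place."""
    if token_count < RESET_THRESHOLD:
        return history
    return [summarize(history)]  # the fresh agent starts from the note alone
```

Compaction would instead rewrite `history` in place with rolling summaries; a reset discards the window entirely, which is what made it effective against context anxiety.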
### 2. Self-Evaluation Bias
Agents reliably skew positive when grading their own work, praising mediocre output. This is pronounced in subjective tasks (design) but persists even in verifiable tasks.
**Solution:** Separate generator from evaluator. While evaluators are still generous initially, tuning a standalone evaluator to be skeptical is "far more tractable than making a generator critical of its own work."
---
## Frontend Design: Making Subjective Quality Gradable
### The Four Criteria
Evaluators and generators use these weighted criteria to avoid "AI slop" (generic purple gradients, template layouts):
| Criterion | Focus | Weight |
|-----------|-------|--------|
| **Design Quality** | Coherent whole: colors, typography, layout create distinct mood/identity | High |
| **Originality** | Evidence of custom decisions vs. stock components/AI patterns | High |
| **Craft** | Technical competence: hierarchy, spacing, contrast ratios | Low (model handles by default) |
| **Functionality** | Usability independent of aesthetics | Low |
> "'Is this design beautiful?' is hard to answer consistently, but 'does this follow our principles for good design?' gives Claude something concrete to grade against."
### Process
- **5–15 iterations** using Claude Agent SDK
- Evaluator uses **Playwright MCP** to navigate live page, screenshot, and critique before scoring
- Generator makes strategic decisions: refine if trending well, pivot aesthetic if failing
- Calibration via few-shot examples with detailed score breakdowns
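The rubric above might be combined into a single gradable number as follows. The criteria come from the table, but the specific weights (0.35/0.35/0.15/0.15, reflecting the High/Low split) and the 0–10 per-criterion scale are assumptions for illustration:

```python
# Illustrative weighted-rubric scorer; weights and scale are assumed, not from the source.

RUBRIC = {
    "design_quality": 0.35,  # High weight
    "originality": 0.35,     # High weight
    "craft": 0.15,           # Low weight (model handles by default)
    "functionality": 0.15,   # Low weight
}


def weighted_score(scores: dict[str, float]) -> float:
    """Combine per-criterion scores (0-10) into one weighted total."""
    return sum(RUBRIC[name] * scores[name] for name in RUBRIC)
```

Grading against explicit, weighted criteria is what turns "is this beautiful?" into something the evaluator can score consistently.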
**Notable Result:** Dutch art museum website transformed on iteration 10 from expected dark-themed landing page to **3D spatial experience** (CSS perspective room with checkered floor, wall-hung artwork, doorway navigation)—a creative leap not seen in single-pass generation.
---
## Full-Stack Architecture: Three-Agent System
### Agent Personas
**1. Planner**
- Expands 1–4 sentence prompt into full product spec (16 features in game maker example)
- Stays high-level on technical implementation to avoid cascading errors
- Tasked with weaving AI features into specs
**2. Generator**
- Works in **sprints** (one feature at a time)
- Stack: React, Vite, FastAPI, SQLite/PostgreSQL
- Self-evaluates at sprint end, uses git for version control
**3. Evaluator (QA)**
- Uses **Playwright MCP** to click through running app
- Grades against sprint contract criteria with **hard thresholds** (fail → feedback loop)
- Catches bugs like:
- `fillRectangle` function exists but isn't triggered properly
  - Route ordering causing FastAPI to match `reorder` as a `frame_id` path parameter
- Logic errors requiring `selection || (selectedEntityId && activeLayer === 'entity')`
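The route-ordering bug is a classic pitfall: FastAPI matches routes in declaration order, so a parameterized route declared first swallows a later static path like `/frames/reorder`. A framework-agnostic sketch of the same failure (the `first_match` helper and route strings are illustrative, not FastAPI code):

```python
# Pure-Python sketch of the route-ordering pitfall. FastAPI resolves routes in
# declaration order, so /frames/{frame_id} declared first captures "/frames/reorder".
import re


def first_match(routes: list[str], path: str) -> str:
    """Return the first route pattern whose regex matches the path."""
    for pattern in routes:
        regex = "^" + re.sub(r"\{\w+\}", r"[^/]+", pattern) + "$"
        if re.match(regex, path):
            return pattern
    return "404"


# Buggy order: the parameterized route swallows the static one.
buggy = ["/frames/{frame_id}", "/frames/reorder"]
# Fixed order: declare the static route first.
fixed = ["/frames/reorder", "/frames/{frame_id}"]
```

This is exactly the class of bug a Playwright-driven evaluator catches by clicking through the running app rather than reading the code.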
### Sprint Contracts
Before coding, the generator and evaluator negotiate a contract defining "done" for that sprint, bridging the gap between high-level user stories and testable implementation. The two agents communicate via files.
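A sprint contract might look like a small JSON file such as the following. The source only says the agents communicate via files, so this schema (`done_criteria`, `pass_threshold`, and the field names) is entirely hypothetical:

```python
# Hypothetical sprint-contract file format; schema assumed for illustration only.
import json
import os
import tempfile

contract = {
    "sprint": 3,
    "feature": "sprite editor",
    "done_criteria": [
        "user can draw pixels on a 16x16 grid",
        "fill tool triggers on canvas click",
        "sprite persists across reload",
    ],
    "pass_threshold": 1.0,  # hard threshold: every criterion must pass
}

# Generator writes the negotiated contract; evaluator later reads it back to grade.
path = os.path.join(tempfile.mkdtemp(), "sprint_contract.json")
with open(path, "w") as f:
    json.dump(contract, f, indent=2)

with open(path) as f:
    loaded = json.load(f)
```

Writing the contract to disk before coding gives both agents a shared, inspectable definition of "done" that survives context resets.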
### Comparison: Solo vs. Harness
**Prompt:** "Create a 2D retro game maker with features including a level editor, sprite editor, entity behaviors, and a playable test mode."
| Harness | Duration | Cost | Result |
|---------|----------|------|--------|
| **Solo agent** | 20 minutes | $9 | Broken, incomplete |
| **Harness** | 6 hours | $200 | Functional retro game maker |
---
## Key Results & Evolution
| Metric | Frontend Design | Full-Stack |
|--------|-----------------|------------|
| **Iterations** | 5–15 | Feature-by-feature sprints |
| **Duration** | ~4 hours | ~6 hours |
| **Creative leaps** | Yes (3D spatial transformation) | N/A (verifiable output) |
| **Cost** | Higher than solo | $200 vs $9 |
| **Success rate** | Museum-quality output | Functional complex apps |
**Model Evolution:** As models improved (Opus 4.5 → 4.6), harness complexity could be reduced while maintaining capability—suggesting harness sophistication is inversely related to base model capability.
---
## Cross-Domain Connections
- **[[AI Agent Harness Engineering]]** — General principles of harness over model capability
- **[[Harness Over Model - The New AI Product Moat]]** — Harness design as competitive advantage
- **Multi-agent patterns** — Generator-evaluator separation mirrors GAN architecture
- **Quality iteration** — Creative breakthroughs require depth (5–15 iterations), not just prompting skill
**Discoverability Score:** High — Official Anthropic engineering blog, detailed implementation guidance, production-tested patterns
---
## Source
- **Article:** [Harness Design for Long-Running Application Development](https://www.anthropic.com/engineering/harness-design-long-running-apps)
- **Author:** Prithvi Rajasekaran, Anthropic Labs
- **Published:** March 24, 2026