## L3: Executive Summary
A developer demonstrates that a 9B-parameter model (Qwen 3.5 9B, Q4_K_M) running on an RTX 3060 (12GB VRAM) via llama.cpp can build a complete space shooter game (13 files, 3,263 lines) from a single prompt. The agent then iterated through four phases autonomously, fixing bugs, including a browser caching issue the developer never pointed out. Key enablers: llama.cpp flags tuned for a 128K context, the Hermes Agent tool-calling harness with model-specific parsers, and precise prompt engineering rather than raw model size.
## L2: Key Insights
- **Hardware**: RTX 3060 (12GB VRAM) — $250 GPU
- **Model**: Qwen 3.5 9B quantized to Q4_K_M
- **Performance**: 50 tok/s baseline, 30-45 tok/s with agent workload overhead
- **Context**: 128K tokens at 8.2GB VRAM usage (≈4GB headroom; see the sizing sketch after this list)
- **Agent Framework**: Hermes Agent (31 tools: file ops, terminal, code execution)
- **Output**: Space shooter game (Octopus Invaders) — 13 files, 3,263 lines
- **Autonomous iteration**: 4 phases, 6 prompts, zero handwritten code
- **Surprising capability**: Model discovered and fixed browser caching issue without prompting
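
The 8.2GB figure is plausible from first principles. Here is a back-of-envelope KV-cache calculator; the architecture values (layer count, KV heads, head dimension, weight size) are placeholder assumptions, since the post does not list Qwen 3.5 9B internals, and the q4_0 KV quantization matches the config below:

```python
# Back-of-envelope VRAM estimate for a 128K context with a q4_0 KV cache.
# Architecture numbers below are PLACEHOLDERS, not published Qwen 3.5 9B specs.
N_LAYERS = 36                  # assumption: typical depth for a ~9B dense model
N_KV_HEADS = 4                 # assumption: GQA with a small KV-head count
HEAD_DIM = 128                 # assumption
CTX = 131_072                  # -c 131072
Q4_0_BYTES_PER_ELEM = 18 / 32  # q4_0 packs 32 elements into 18 bytes (~4.5 bits/elem)

# K and V each store n_kv_heads * head_dim values per token, per layer.
kv_bytes_per_token = 2 * N_KV_HEADS * HEAD_DIM * Q4_0_BYTES_PER_ELEM * N_LAYERS
kv_cache_gib = kv_bytes_per_token * CTX / 2**30

weights_gib = 5.7              # assumption: rough size of 9B weights at Q4_K_M

print(f"KV cache: {kv_cache_gib:.2f} GiB")                 # ~2.5 GiB
print(f"Total:    {kv_cache_gib + weights_gib:.2f} GiB")   # ~8.2 GiB ballpark
```

With default fp16 KV (2 bytes per element) the same cache would be roughly 3.5× larger, which is why the q4_0 KV flags are what make 128K fit alongside the weights on a 12GB card.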
### llama.cpp Optimized Config
```bash
./llama-server -m Qwen3.5-9B-Q4_K_M.gguf \
-ngl 99 \
-c 131072 \
-np 1 \
-fa on \
--cache-type-k q4_0 \
--cache-type-v q4_0 \
--host 0.0.0.0
```
**Flag explanations:**
- `-ngl 99`: All layers on GPU, no CPU offload
- `-c 131072`: 128K context window
- `-np 1`: Single parallel slot (saves 190MB VRAM)
- `-fa on`: Flash Attention (maintains speed at long context)
- `--cache-type-k/v q4_0`: KV cache quantization (enables 128K+ on 12GB)
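
Once the server is up, a quick smoke test against llama-server's OpenAI-compatible endpoint confirms the config loaded (port 8080 is llama.cpp's default; adjust if `--port` was set):

```python
# Minimal smoke test for the llama-server OpenAI-compatible API.
import json
import urllib.request

req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",  # default llama-server port
    data=json.dumps({
        "messages": [{"role": "user", "content": "Say hello in five words."}],
        "max_tokens": 32,
    }).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)

print(body["choices"][0]["message"]["content"])
```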
### Why Hermes Agent Works on Small Models
- 11 model-specific tool call parsers (Qwen, DeepSeek, Llama, Mistral, GLM, Kimi, etc.)
- Handles malformed JSON and partial completions
- Compensates for rough edges in 9B-model output that would cause failures in stricter frameworks (see the sketch below)
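
Hermes Agent's actual parsers aren't shown in the post, but the tolerant-parsing idea is easy to illustrate. A minimal sketch (my own, not the framework's code): try strict JSON first, then fall back to extracting the first balanced `{...}` span and stripping trailing commas, which covers two common small-model failure modes:

```python
import json
import re

def parse_tool_call(text: str):
    """Best-effort extraction of a JSON tool call from raw model output.

    Illustrative only: real harnesses like Hermes Agent ship per-model
    parsers; this shows the general recovery strategy, not their code.
    """
    # 1. Happy path: the whole completion is valid JSON.
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass

    # 2. Fallback: find the first balanced {...} span (naive about braces
    #    inside strings, which is acceptable for a sketch).
    start = text.find("{")
    if start == -1:
        return None
    depth = 0
    for i, ch in enumerate(text[start:], start):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                candidate = text[start : i + 1]
                # 3. Repair a common small-model slip: trailing commas.
                candidate = re.sub(r",\s*([}\]])", r"\1", candidate)
                try:
                    return json.loads(candidate)
                except json.JSONDecodeError:
                    return None
    return None  # unbalanced braces: likely a truncated completion

# Example: fenced output with a trailing comma still parses.
raw = 'Sure! ```json\n{"tool": "write_file", "args": {"path": "game.js",}}\n```'
print(parse_tool_call(raw))
```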
## L1: Context
Sudo su is a developer experimenting with local LLM capabilities. This post documents a systematic experiment challenging the assumption that small models are "not good enough" for complex agentic tasks. The open-sourced experiment includes prompts, configs, and every iteration.
### Iteration Breakdown
| Phase | Description |
|-------|-------------|
| Initial | Blank screen — 11 bugs identified |
| Fix 1 | Model fixed all 11 bugs from diagnostic list — game working |
| Phase 0 | Regression fix: variable scope mismatch |
| Phase 1 | Homepage redesign: unified dual start systems |
| Phases 2-4 | Gameplay polish: background optimization, level progression, bullet sizing |
| Autonomous | Model detected browser caching issue, diagnosed root cause, implemented version parameters on script/stylesheet tags |
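
The fix in the last row is the standard cache-busting pattern: stamp a version query parameter onto script/stylesheet URLs so the browser treats each deploy as a new resource. A hypothetical reconstruction of that pattern (the post doesn't include the model's actual diff):

```python
import re

def bust_cache(html: str, version: str) -> str:
    """Append ?v=<version> to script/stylesheet URLs in an HTML page.

    Sketch of the cache-busting pattern only; not the model's actual code.
    """
    # Match src="..." / href="..." values ending in .js or .css.
    pattern = re.compile(r'((?:src|href)=")([^"?]+\.(?:js|css))(")')
    return pattern.sub(
        lambda m: f"{m.group(1)}{m.group(2)}?v={version}{m.group(3)}", html
    )

page = '<script src="game.js"></script>\n<link rel="stylesheet" href="style.css">'
print(bust_cache(page, "4"))
# <script src="game.js?v=4"></script>
# <link rel="stylesheet" href="style.css?v=4">
```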
## Cross-Domain Connections
- [[Local LLM Optimization]] — llama.cpp flags, KV cache quantization, Flash Attention
- [[Agent Frameworks]] — Hermes Agent, tool calling, model-specific parsers
- [[Game Development with AI]] — End-to-end game generation from prompts
- [[Small Model Capabilities]] — 9B parameter models for production tasks
- [[Qwen Models]] — Qwen 3.5 architecture and quantization
## Raw Content
> 12GB of VRAM runs more intelligence than you think in 2026.
>
> i know because i tested it. one RTX 3060 with 12 gigs of VRAM. one 9 billion parameter model. zero handwritten code.
>
> the model wrote a full space shooter across 13 files, 3,263 lines, from a single prompt. then it iterated on its own work across 4 phases and 6 prompts, finding and fixing bugs i never pointed out.
>
> this is not a tutorial. this is what actually happened when i stopped asking whether small models were good enough and started measuring.
>
> [Experiment details and llama.cpp config flags documented above]
**Source**: https://x.com/sudoingx/status/2035000411342659979
## Discoverability Score
- **Primary topic**: AI/ML / Local LLM Development
- **Secondary topics**: Game Development, Agent Frameworks, GPU Optimization
- **Relevance**: High for Q2 2026; challenges assumptions about model size requirements
- **Source credibility**: High (documented experiment with reproducible config)