## L3: Executive Summary

Developer demonstrates that a 9B-parameter model (Qwen 3.5 9B, Q4_K_M) running on an RTX 3060 (12GB VRAM) via llama.cpp can build a complete space shooter game — 13 files and 3,263 lines — from a single prompt. The agent iterated through 4 phases autonomously, fixing bugs along the way, including a browser caching issue the developer never mentioned. Key enablers: llama.cpp flags optimized for a 128K context, the Hermes Agent tool-calling harness with model-specific parsers, and precise prompt engineering over raw model size.

## L2: Key Insights

- **Hardware**: RTX 3060 (12GB VRAM) — $250 GPU
- **Model**: Qwen 3.5 9B quantized to Q4_K_M
- **Performance**: 50 tok/s baseline; 30-45 tok/s with agent workload overhead
- **Context**: 128K tokens at 8.2GB VRAM usage (4GB headroom)
- **Agent framework**: Hermes Agent (31 tools: file ops, terminal, code execution)
- **Output**: space shooter game (Octopus Invaders) — 13 files, 3,263 lines
- **Autonomous iteration**: 4 phases, 6 prompts, zero handwritten code
- **Surprising capability**: the model discovered and fixed a browser caching issue without prompting

### llama.cpp Optimized Config

```bash
./llama-server -m Qwen3.5-9B-Q4_K_M.gguf \
  -ngl 99 \
  -c 131072 \
  -np 1 \
  -fa on \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  --host 0.0.0.0
```

**Flag explanations:**

- `-ngl 99`: all layers on GPU, no CPU offload
- `-c 131072`: 128K context window
- `-np 1`: single parallel slot (saves 190MB VRAM)
- `-fa on`: Flash Attention (maintains speed at long context)
- `--cache-type-k/v q4_0`: KV-cache quantization (enables 128K+ on 12GB)

### Why Hermes Agent Works on Small Models

- 11 model-specific tool-call parsers (Qwen, DeepSeek, Llama, Mistral, GLM, Kimi, etc.)
- Handles malformed JSON and partial completions
- Compensates for rough edges in 9B-model output that would fail on other frameworks

## L1: Context

Sudo su is a developer experimenting with local LLM capabilities.
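The `--cache-type-k/v q4_0` flags are what make a 128K context plausible on 12GB: llama.cpp stores q4_0 data in 18-byte blocks of 32 elements (4.5 bits/element) instead of 2 bytes/element for the default f16 cache. A back-of-the-envelope sketch of the difference, using *assumed, illustrative* architecture numbers (36 layers, 8 KV heads, head dim 128 — not the published Qwen 3.5 9B config):

```python
# KV-cache sizing sketch. The architecture constants are ASSUMED for
# illustration, not the real Qwen 3.5 9B specs.
N_LAYERS = 36
N_KV_HEADS = 8       # grouped-query attention: fewer KV heads than query heads
HEAD_DIM = 128
N_CTX = 131072       # 128K tokens, matching -c 131072

def kv_cache_bytes(bytes_per_elem: float) -> float:
    # 2x for K and V; one entry per layer x KV head x head dim x token
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * N_CTX * bytes_per_elem

F16 = 2.0        # default f16 cache: 2 bytes/element
Q4_0 = 18 / 32   # q4_0 block: 18 bytes per 32 elements = 0.5625 bytes/element

gib = 1024 ** 3
print(f"f16  KV cache: {kv_cache_bytes(F16) / gib:.1f} GiB")   # 18.0 GiB — exceeds 12GB on its own
print(f"q4_0 KV cache: {kv_cache_bytes(Q4_0) / gib:.1f} GiB")  # 5.1 GiB — leaves room for weights
```

The exact figures depend on the model's real layer/head counts, but the ratio holds: an f16 cache at this context length would not fit alongside the quantized weights, while q4_0 cuts cache memory by roughly 3.6x.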
This post documents a systematic experiment challenging the assumption that small models are "not good enough" for complex agentic tasks. The open-sourced experiment includes prompts, configs, and every iteration.

### Iteration Breakdown

| Phase | Description |
|-------|-------------|
| Initial | Blank screen — 11 bugs identified |
| Fix 1 | Model fixed all 11 bugs from the diagnostic list — game working |
| Phase 0 | Regression fix: variable scope mismatch |
| Phase 1 | Homepage redesign: unified dual start systems |
| Phases 2-4 | Gameplay polish: background optimization, level progression, bullet sizing |
| Autonomous | Model detected a browser caching issue, diagnosed the root cause, and implemented version parameters on script/stylesheet tags |

## Cross-Domain Connections

- [[Local LLM Optimization]] — llama.cpp flags, KV-cache quantization, Flash Attention
- [[Agent Frameworks]] — Hermes Agent, tool calling, model-specific parsers
- [[Game Development with AI]] — end-to-end game generation from prompts
- [[Small Model Capabilities]] — 9B-parameter models for production tasks
- [[Qwen Models]] — Qwen 3.5 architecture and quantization

## Raw Content

> 12GB of VRAM runs more intelligence than you think in 2026.
>
> i know because i tested it. one RTX 3060 with 12 gigs of VRAM. one 9 billion parameter model. zero handwritten code.
>
> the model wrote a full space shooter across 13 files, 3,263 lines, from a single prompt. then it iterated on its own work across 4 phases and 6 prompts, finding and fixing bugs i never pointed out.
>
> this is not a tutorial. this is what actually happened when i stopped asking whether small models were good enough and started measuring.
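The autonomous fix in the last row is classic cache busting: appending a version query parameter to asset URLs so the browser treats each release as a new URL instead of serving a stale cached copy. A minimal sketch of the technique (the function, regex, and version string are hypothetical illustrations, not the model's actual code):

```python
import re

VERSION = "1.0.3"  # hypothetical version string; bump on every deploy

def bust_cache(html: str, version: str = VERSION) -> str:
    """Append ?v=<version> to .js and .css references in src/href attributes.

    URLs that already carry a query string are left untouched, since the
    character class excludes '?'.
    """
    return re.sub(
        r'((?:src|href)=")([^"?]+\.(?:js|css))(")',
        rf'\g<1>\g<2>?v={version}\g<3>',
        html,
    )

page = '<script src="game.js"></script><link rel="stylesheet" href="style.css">'
print(bust_cache(page))
# <script src="game.js?v=1.0.3"></script><link rel="stylesheet" href="style.css?v=1.0.3">
```

Because the URL changes whenever the version does, the browser's cache key changes too — which is exactly why the model's edits to the script/stylesheet tags made its earlier fixes finally show up in the page.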
>
> [Experiment details and llama.cpp config flags documented above]

**Source**: https://x.com/sudoingx/status/2035000411342659979

## Discoverability Score

- **Primary topic**: AI/ML / Local LLM Development
- **Secondary topics**: Game Development, Agent Frameworks, GPU Optimization
- **Relevance**: High for Q2 2026; challenges assumptions about model size requirements
- **Source credibility**: High (documented experiment with reproducible config)