April 30, 2026

I built Karpathy's LLM knowledge layer into my AI operator (live)

This week's video: live build of the Karpathy/Holmberg knowledge layer wired into Emma, plus why I've been running her on a local Qwen model since picking up a DGX Spark.

This week's video

Three weeks ago, Karpathy posted a thread about using LLMs to compile a personal knowledge base. Shortly after, Shann Holmberg followed with a longer piece that named the pattern (the "AI knowledge layer") and shipped a working framework. Both made the rounds and sparked a wave of "I built one" videos, but none applied the pattern to an actual production agent. This week I did: live, on camera, in Claude Code's plan mode. I integrated the two-layer pattern (raw clippings feeding a compiled wiki) into Emma, my AI operator, alongside her existing vault, CRM, and skills. Plan mode handled the architectural decisions on screen, then auto mode handled execution as I walked through the entire compile skill end to end.

Building Karpathy's LLM Knowledge Base Into My AI Operator (Live)

Watch the video →

The point of the video isn't the wiki itself. It's the process of working through the problem with Claude in plan mode. I loaded up deep context first, including Karpathy's original X post and gist, then spent most of the time guiding the coding agent through the architecture.
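To make the shape concrete, here's a minimal sketch of a compile step over the two layers. The paths, prompt, and compile_topic function are illustrative assumptions for this newsletter, not Emma's actual skill; `complete` stands in for whatever model call you have on hand (local Qwen, hosted, anything).

```python
from pathlib import Path
from typing import Callable

CLIPPINGS = Path("knowledge/clippings")  # layer 1: raw, append-only captures
WIKI = Path("knowledge/wiki")            # layer 2: compiled pages the LLM may rewrite

def compile_topic(topic: str, complete: Callable[[str], str]) -> Path:
    """Fold every raw clipping for a topic into one compiled wiki page."""
    raw = "\n\n---\n\n".join(
        p.read_text() for p in sorted(CLIPPINGS.glob(f"{topic}*.md"))
    )
    page = WIKI / f"{topic}.md"
    existing = page.read_text() if page.exists() else ""

    prompt = (
        f"You maintain a personal wiki page on '{topic}'.\n\n"
        f"Current page:\n{existing}\n\n"
        f"New raw clippings:\n{raw}\n\n"
        "Rewrite the page so it stays concise, deduplicated, and current."
    )
    WIKI.mkdir(parents=True, exist_ok=True)
    page.write_text(complete(prompt))
    return page
```

The split is the important part: layer one is capture you never edit, layer two is something the model is allowed to rewrite wholesale on every compile.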

Resources mentioned:
- Karpathy's LLM knowledge base thread
- Shann Holmberg: The AI Knowledge Layer
- Obsidian Web Clipper


Why this one runs on a local model

The build in the video runs against Qwen 3.6-27B on a DGX Spark, not a hosted model. That's a deliberate shift. Emma has been on local Qwen for about a week and a half now, since the Spark arrived.

The forcing function was cost realism. Anthropic has been tightening how Max plans can be used outside Claude Code, which makes "what can I run locally" a real cost question again rather than a hobby one. A lot of an agent's work is cheap, high-volume, and low-stakes: classification, extraction, routing, the compile step in this video. Pushing that off the frontier onto a local model costs little beyond latency and nothing per token. Reserve the frontier for what actually needs it.
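As a sketch of what that routing looks like in practice (the task kinds, model names, and tiers below are made up for illustration, not Emma's actual routing table):

```python
from dataclasses import dataclass

LOCAL_MODEL = "local-qwen"          # runs on the Spark: zero per-token cost
FRONTIER_MODEL = "hosted-frontier"  # reserved for work that actually needs it

# Illustrative categories only.
CHEAP_TASK_KINDS = {"classify", "extract", "route", "compile"}

@dataclass
class Task:
    kind: str            # e.g. "classify", "draft_reply", "plan"
    stakes: str = "low"  # "low" or "high"

def pick_model(task: Task) -> str:
    """Send high-volume, low-stakes work to the local model; keep the frontier
    for anything high-stakes or outside the cheap categories."""
    if task.stakes == "low" and task.kind in CHEAP_TASK_KINDS:
        return LOCAL_MODEL
    return FRONTIER_MODEL

assert pick_model(Task("classify")) == LOCAL_MODEL
assert pick_model(Task("plan", stakes="high")) == FRONTIER_MODEL
```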

Two threads from Ahmad Osman got me thinking clearly about the hardware side. Both are linked below.


Open source updates

doc-sentinel plugin shipped to claude-code-workflows. It's the automated companion to doc-audit: where doc-audit runs on demand, doc-sentinel watches via post-commit and stop hooks, scans for drift between docs and code, and dispatches a drift-resolver agent when it finds something. The pattern: don't ask the human to remember to audit; let the harness fire it. Documentation rot compounds silently, and silent compounding is exactly what hooks are good at catching.
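For a rough idea of what the drift scan amounts to, here's a simplified sketch, not the plugin's actual implementation: list the files the last commit touched, then flag any doc that mentions one of them.

```python
import subprocess
from pathlib import Path

def changed_files(rev: str = "HEAD") -> set[str]:
    """Files touched by the given commit -- what a post-commit hook would see."""
    out = subprocess.run(
        ["git", "diff-tree", "--no-commit-id", "--name-only", "-r", rev],
        capture_output=True, text=True, check=True,
    )
    return set(out.stdout.split())

def stale_docs(doc_dir: str = "docs") -> list[Path]:
    """Any doc that mentions a file changed in the last commit is a drift candidate."""
    touched = changed_files()
    return [
        doc for doc in Path(doc_dir).rglob("*.md")
        if any(path in doc.read_text() for path in touched)
    ]

if __name__ == "__main__":
    for doc in stale_docs():
        # This is the point where the real plugin dispatches a drift-resolver agent.
        print(f"possible drift: {doc}")
```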

creatorsignal-api skill shipped to agent-skills. Programmatic access to CreatorSignal (idea submission, validation polling, webhook management, quota). If you're using the API directly or wiring it into your own pipeline, this gives an agent the surface it needs.


Curated links

Ahmad Osman on memory bandwidth for local AI hardware. The mental model that finally clicked for me on local hardware. Capacity decides whether the model fits. Bandwidth decides whether the box feels alive or like it's grinding through wet cement at 3 tokens per second. A 32GB RTX 5090 will outrun a much larger unified-memory machine on most workloads, while a Mac Studio M3 Ultra or DGX Spark earns its keep when the model simply will not fit on a normal GPU. If you're shopping local hardware, read this before you read any spec sheet.
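The back-of-napkin version of why bandwidth dominates, as I read it: a dense model has to stream roughly its full weights through memory for every generated token, so bandwidth divided by model size is a hard ceiling on single-stream decode speed. The bandwidth numbers below are approximate and only for illustration.

```python
def decode_tokens_per_sec(params_b: float, bits: float, bandwidth_gb_s: float) -> float:
    """Rough ceiling on single-stream decode speed for a dense model: every
    generated token streams the full weights through memory, so speed is
    bounded by bandwidth / model size (ignoring KV cache and overhead)."""
    model_gb = params_b * bits / 8  # billions of params -> GB at this precision
    return bandwidth_gb_s / model_gb

# A 27B dense model at 4-bit is roughly 13.5 GB of weights.
print(f"{decode_tokens_per_sec(27, 4, 1800):.0f} tok/s ceiling at ~1.8 TB/s (big-GPU class)")
print(f"{decode_tokens_per_sec(27, 4, 270):.0f} tok/s ceiling at ~270 GB/s (unified-memory class)")
```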

Ahmad Osman on GPU memory math for LLMs. The companion thread. One formula (VRAM ≈ params × effective bits ÷ 8) explains FP16, FP8, INT8, GPTQ, AWQ, NF4, and every GGUF variant you'll touch. If quantization has felt like a wall of acronyms, this is the back-of-the-napkin version that actually generalizes.
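Worked out, the formula is one line of arithmetic. The ~4.5 "effective bits" for 4-bit quants (to account for scales and zero points) is my rough assumption, not a figure from the thread.

```python
def vram_gb(params_b: float, effective_bits: float) -> float:
    """Back-of-napkin rule from the thread: VRAM ≈ params × effective bits ÷ 8.
    With params in billions, the answer comes out in GB (weights only;
    KV cache and runtime overhead come on top)."""
    return params_b * effective_bits / 8

for bits, label in [(16, "FP16"), (8, "FP8 / INT8"), (4.5, "~4-bit GPTQ/AWQ/GGUF")]:
    print(f"27B @ {label}: ~{vram_gb(27, bits):.1f} GB")
# 27B @ FP16: ~54.0 GB
# 27B @ FP8 / INT8: ~27.0 GB
# 27B @ ~4-bit GPTQ/AWQ/GGUF: ~15.2 GB
```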

Both of these threads informed what I bought, what I'm running on it, and how I think about which workloads to keep local versus push to the frontier. If you're interested in running local LLMs but aren't sure where to start, both are great reads.


If you're building agents and want help figuring out where the architecture should sit (frontier model, local model, knowledge layer, governance, evals), let's talk. 30 minutes, concrete roadmap, no slides.

One more thing: beyond this newsletter, Practical AI is also a small, invite-only Slack community I've been building for people putting AI into real work. Same calm, production-grounded tone. If that sounds like a good fit, request an invite →. I'll share more in the coming weeks.

Damian
