Signal AI Wednesday: What is LLM Inference?
Your weekly dose of AI demystified.
Ever wonder what happens when you type a prompt into ChatGPT and it spits back a response? That process is called inference — and it’s the engine behind every AI conversation you’ve ever had.
Training vs. Inference: The Two Phases
Think of an LLM like a student who’s gone through years of studying:
- Training = studying for exams. The model reads trillions of words, learns patterns, and adjusts its “brain” (technically: its parameters) to understand language.
- Inference = taking the exam. The trained model uses what it learned to generate answers, predictions, and text.
Training happens once (or periodically). Inference happens every single time you interact with the model.
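If you like seeing ideas in code, here’s a minimal sketch of that split. The “model” below is just a hypothetical lookup table of made-up next-word probabilities standing in for billions of learned parameters — the point is that inference only *reads* what training produced, one word at a time:

```python
# A toy "trained model": for each word, made-up probabilities for the
# next word. A real LLM learns billions of parameters during training;
# inference just uses them — it doesn't change them.
model = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "dog": {"ran": 0.8, "sat": 0.2},
    "sat": {"down": 1.0},
    "ran": {"away": 1.0},
}

def generate(prompt, steps=3):
    """Inference: repeatedly ask the trained model for the next word."""
    words = prompt.split()
    for _ in range(steps):
        options = model.get(words[-1])
        if options is None:
            break  # the toy model has nothing to say after this word
        # Pick the most likely next word (the simplest decoding strategy).
        words.append(max(options, key=options.get))
    return " ".join(words)

print(generate("the"))  # "the cat sat down"
```

Every chatbot reply you’ve ever read was built this way: one next-token prediction at a time, looped until the answer is done.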
Why Should You Care?
Here’s the thing: over a model’s lifetime, most of the compute cost comes from inference, not training.
Training a model like GPT-4 reportedly cost tens of millions of dollars. But every API call, every chatbot message, every generated paragraph — that’s inference. Companies optimize heavily here because inference at scale is expensive.
When you hear about “AI inference chips” or “inference optimization” — this is exactly what they’re talking about: making responses faster, cheaper, and more efficient.
The Quick Version
| Term | What it means |
|---|---|
| Inference | Running a trained model to generate output |
| Token | The smallest unit the model processes (think: word fragments) |
| Sampling | The model choosing the next token (deterministic vs. probabilistic) |
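To make that last row concrete, here’s a tiny illustration of deterministic vs. probabilistic sampling. The probabilities are invented for the example; real models produce a distribution like this over their entire vocabulary at every step:

```python
import random

# Made-up next-token probabilities for illustration.
probs = {"blue": 0.6, "grey": 0.3, "green": 0.1}

def greedy(probs):
    """Deterministic: always pick the single most likely token."""
    return max(probs, key=probs.get)

def sample(probs, rng=random):
    """Probabilistic: pick a token in proportion to its probability,
    so the same prompt can produce different answers each run."""
    return rng.choices(list(probs), weights=list(probs.values()), k=1)[0]

print(greedy(probs))  # always "blue"
print(sample(probs))  # usually "blue", sometimes "grey" or "green"
```

That dash of randomness is why ChatGPT can give you a slightly different answer to the same question twice — a topic we’ll pick up next week.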
Next week: We’ll dive deeper into what actually happens during inference — tokens, probabilities, and why AI sometimes gives weird answers.
Questions? Just reply to this email.
— Adam