Signal AI Wednesday: What is LLM Inference?
Your weekly dose of AI demystified.
Ever wonder what happens when you type a prompt into ChatGPT and it spits back a response? That process is called inference — and it’s the engine behind every AI conversation you’ve ever had.
Training vs. Inference: The Two Phases
Think of an LLM like a student who’s gone through years of studying:
- Training = studying for exams. The model reads trillions of words, learns patterns, and adjusts its “brain” (technically: its parameters) to understand language.
- Inference = taking the exam. The trained model uses what it learned to generate answers, predictions, and text.
Training happens once (or periodically). Inference happens every single time you interact with the model.
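If you like seeing ideas in code, here’s a minimal sketch of that split. The “model” below is just a hypothetical lookup table of made-up next-word probabilities standing in for billions of learned parameters — the point is that inference only *reads* what training produced, one word at a time:

```python
# A toy "trained model": for each word, made-up probabilities for the
# next word. A real LLM learns billions of parameters during training;
# inference just uses them — it doesn't change them.
model = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "dog": {"ran": 0.8, "sat": 0.2},
    "sat": {"down": 1.0},
    "ran": {"away": 1.0},
}

def generate(prompt, steps=3):
    """Inference: repeatedly ask the trained model for the next word."""
    words = prompt.split()
    for _ in range(steps):
        options = model.get(words[-1])
        if options is None:
            break  # the toy model has nothing to say after this word
        # Pick the most likely next word (the simplest decoding strategy).
        words.append(max(options, key=options.get))
    return " ".join(words)

print(generate("the"))  # "the cat sat down"
```

Every chatbot reply you’ve ever read was built this way: one next-token prediction at a time, looped until the answer is done.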
Why Should You Care?
Here’s the thing: over a model’s lifetime, most of the compute cost comes from inference, not training.
Training a model like GPT-4 reportedly cost tens of millions of dollars. But every API call, every chatbot message, every generated paragraph — that’s inference. Companies optimize heavily here because inference at scale is expensive.
When you hear about “AI inference chips” or “inference optimization” — this is exactly what they’re talking about: making responses faster, cheaper, and more efficient.
The Quick Version
| Term | What it means |
|---|---|
| Inference | Running a trained model to generate output |
| Token | The smallest unit the model processes (think: word fragments) |
| Sampling | The model choosing the next token (deterministic vs. probabilistic) |
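To make that last row concrete, here’s a tiny illustration of deterministic vs. probabilistic sampling. The probabilities are invented for the example; real models produce a distribution like this over their entire vocabulary at every step:

```python
import random

# Made-up next-token probabilities for illustration.
probs = {"blue": 0.6, "grey": 0.3, "green": 0.1}

def greedy(probs):
    """Deterministic: always pick the single most likely token."""
    return max(probs, key=probs.get)

def sample(probs, rng=random):
    """Probabilistic: pick a token in proportion to its probability,
    so the same prompt can produce different answers each run."""
    return rng.choices(list(probs), weights=list(probs.values()), k=1)[0]

print(greedy(probs))  # always "blue"
print(sample(probs))  # usually "blue", sometimes "grey" or "green"
```

That dash of randomness is why ChatGPT can give you a slightly different answer to the same question twice — a topic we’ll pick up next week.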
Next week: We’ll dive deeper into what actually happens during inference — tokens, probabilities, and why AI sometimes gives weird answers.
Questions? Just reply to this email.
— Adam