Signal AI Wednesday: LLM Inference — The Engine Behind AI Responses (April 2026)
Note: This is an extensive deep-dive (5x normal length) covering everything you need to know about LLM inference.
Introduction
Every time you type a prompt into ChatGPT, Claude, or any other LLM and receive a response, a fascinating and computationally intensive process unfolds behind the scenes. This process is called inference — and understanding it is crucial for anyone working with or building upon AI systems.
In our last installment, we mentioned we'd cover inference, and today we're diving deep. We'll explore:
- What exactly inference is
- How it works step-by-step
- Why it's different from training
- What "parameters" actually mean
- The computational challenges
- Real-world examples and analogies
Let's get into it.
What is LLM Inference?
Inference is the process of using a trained model to generate outputs based on new input. In the context of Large Language Models (LLMs), inference means taking a trained neural network (with weights already learned during training) and using it to predict the next token in a sequence given some input text.
Think of it this way: training is like studying for an exam — the model learns patterns, facts, and relationships from massive amounts of text. Inference is like taking the exam — you're applying what you learned to answer new questions.
The Core Task: Next Token Prediction
At its heart, an LLM performs next-token prediction. Given a sequence of tokens (words, subwords, or characters), the model predicts the most likely token to follow.
Input: "The capital of France is"
Model predicts: "Paris"
Input: "The capital of France is Paris,"
Model predicts: "and"
Input: "The capital of France is Paris, and its"
Model predicts: "population"
This process repeats, token by token, until:
- A maximum length is reached
- A special end-of-sequence token is generated
- Some other stopping criterion is met
The result is the coherent text you see as the model's "response."
Training vs. Inference: Two Sides of the Coin
Understanding the difference between training and inference is essential. They sound similar but are fundamentally different operations.
Training: Learning from Data
| Aspect | Training |
|---|---|
| Goal | Learn patterns, weights, and representations from data |
| Direction | Forward pass + backward pass (backpropagation) |
| Data | Massive datasets (trillions of tokens) |
| Duration | Weeks to months on thousands of GPUs |
| Compute | Massive (e.g., GPT-4 reportedly trained on ~25,000 A100 GPUs for months) |
| Frequency | One-time (or occasional fine-tuning) |
| Weight updates | Weights change and improve over time |
During training:
1. The model sees billions of text examples
2. For each example, it makes a prediction
3. It compares the prediction to the actual next token
4. It calculates loss (how wrong it was)
5. It adjusts weights slightly to reduce error
6. Repeat trillions of times
Inference: Applying What Was Learned
| Aspect | Inference |
|---|---|
| Goal | Generate outputs for new inputs using learned weights |
| Direction | Forward pass only (no backpropagation) |
| Data | Your prompt (typically a few hundred to a few thousand tokens) |
| Duration | Milliseconds to seconds per response |
| Compute | Significant but far less than training |
| Frequency | Every time a user makes a request |
| Weight updates | Weights remain frozen |
During inference:
1. Your prompt is converted to tokens
2. The model processes the sequence through its layers
3. It outputs probability distributions for the next token
4. A sampling strategy selects the next token
5. The token is appended and the process repeats
Key Differences Summarized
- Training modifies weights; inference uses them
- Training requires massive datasets; inference needs only your prompt
- Training is extremely expensive (millions $); inference is cheap (cents per request)
- Training happens rarely; inference happens constantly
Analogy: Training is like going through years of school. Inference is like taking a test. You only train once (or occasionally retrain), but you "infer" every time you answer a question.
How LLM Inference Works: A Step-by-Step Journey
Let's trace exactly what happens when you send a prompt to an LLM.
Step 1: Tokenization
Your text input is first converted into tokens — numerical representations of text chunks.
Input: "The cat sat on the mat"
Tokens: [464, 3746, 6313, 322, 264, 6890]
Tokenizers split text into subword units. Common approaches include:
- Byte Pair Encoding (BPE) — used by GPT, GPT-2, GPT-3
- WordPiece — used by BERT
- SentencePiece — used by T5, Llama
Why subwords? It balances:
- Vocabulary size (you can't have a token for every word)
- Meaning preservation ("unhappiness" → "un" + "happi" + "ness")
- Handling rare or unseen words
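To make subword splitting concrete, here's a toy greedy longest-match tokenizer with a made-up six-entry vocabulary. This is purely illustrative: real BPE tokenizers learn their merge rules from data, and the vocabulary and IDs below are invented.

```python
# Toy greedy longest-match subword tokenizer (made-up vocabulary and IDs;
# real BPE tokenizers learn their merges from massive corpora).
VOCAB = {"un": 0, "happi": 1, "ness": 2, "the": 3, "cat": 4, "s": 5}

def tokenize(text):
    """Split each word into the longest matching vocab pieces, left to right."""
    ids = []
    for word in text.lower().split():
        i = 0
        while i < len(word):
            # Try the longest remaining substring first.
            for j in range(len(word), i, -1):
                piece = word[i:j]
                if piece in VOCAB:
                    ids.append(VOCAB[piece])
                    i = j
                    break
            else:
                raise ValueError(f"no vocab piece matches {word[i:]!r}")
    return ids

print(tokenize("unhappiness"))  # [0, 1, 2] -> "un" + "happi" + "ness"
```

Note how "unhappiness" never needs its own vocabulary entry: three reusable pieces cover it, which is exactly the vocabulary-size vs. meaning trade-off described above.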
Step 2: Embedding
Each token is converted into a vector (list of numbers) — typically 1,024 to 8,192 dimensions depending on model size.
Token "cat" → [0.12, -0.34, 0.78, ..., 0.02] (1024 dimensions)
These embeddings capture semantic meaning. Similar words have similar vectors.
Step 3: Positional Encoding
Since the model processes tokens in parallel, it needs to know order. Positional encodings are added to embeddings to represent position in sequence:
Position 1: sin/cos waves at different frequencies
Position 2: sin/cos waves shifted
...
This tells the model "cat" at position 2 is different from "cat" at position 5.
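A minimal sketch of the sinusoidal scheme from the original Transformer paper (many newer models use learned or rotary position embeddings instead, but the idea of injecting position information is the same):

```python
import math

def positional_encoding(position, d_model):
    """Sinusoidal positional encoding: even dimensions use sin, odd use cos,
    with wavelengths growing geometrically so each position gets a unique
    pattern the model can learn to read."""
    pe = []
    for i in range(d_model):
        angle = position / (10000 ** ((2 * (i // 2)) / d_model))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

print(positional_encoding(0, 4))  # [0.0, 1.0, 0.0, 1.0]
```

These vectors are simply added to the token embeddings, so "cat" at position 2 and "cat" at position 5 enter the first layer as different vectors.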
Step 4: The Transformer Layers
Here's where the magic happens. The input flows through multiple Transformer layers (12 to 96+ layers depending on model):
Inside Each Layer:
a) Multi-Head Self-Attention
- Each token "looks at" every other token
- It determines how much each token should "attend to" every other token
- This captures relationships, context, and dependencies
Visualizing attention (described):
For sentence: "The animal didn't cross the street because it was too tired"
Attention weights for "it":
The → 0.02
animal → 0.85 ← HIGH attention! (it refers to animal)
didn't → 0.01
cross → 0.01
...
tired → 0.08
Multiple attention heads (8 to 64+) each learn different types of relationships:
- Syntactic (subject-verb)
- Semantic (cat → animal)
- Contextual (this word depends on that word)
b) Feed-Forward Network (FFN)
- After attention, each token passes through a feed-forward network
- This processes and transforms the information
- As with the attention sub-block, the output goes through a residual connection and layer normalization
Mathematical Representation:
For each layer l:
# Self-Attention
Q = X × W_Q # Query
K = X × W_K # Key
V = X × W_V # Value
Attention(Q, K, V) = softmax(Q × K^T / √d_k) × V
# Feed-Forward
FFN(X) = GELU(X × W_1 + b_1) × W_2 + b_2
Where W_Q, W_K, W_V, W_1, W_2 are learned weight matrices.
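The attention equation above, sketched in plain Python for a single head with no batching. The tiny 2×2 inputs are purely illustrative; real implementations use tensor libraries, but the math is the same.

```python
import math

def softmax(xs):
    m = max(xs)                          # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def attention(Q, K, V):
    """Single-head scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)                                        # token-token scores
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = [softmax(row) for row in scaled]                     # each row sums to 1
    return matmul(weights, V)                                      # weighted mix of V rows

# Two tokens, d_k = 2: each output row is a weighted average of V's rows.
out = attention([[1.0, 0.0], [0.0, 1.0]],
                [[1.0, 0.0], [0.0, 1.0]],
                [[1.0, 0.0], [0.0, 1.0]])
```

Because each softmax row sums to 1, every output position is a convex mixture of the value vectors, which is exactly the "how much should this token attend to each other token" intuition from above.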
Step 5: Output Projection
After all transformer layers, we get representations for each position. The final layer projects to vocabulary size:
Final hidden state → [scores for each token in vocabulary]
Step 6: Converting Scores to Probabilities
Raw scores (logits) are converted to probabilities via softmax:
P(token_i) = exp(score_i) / Σ exp(score_j)
Now we have probabilities for every possible next token.
Step 7: Sampling Strategy
How do we pick the next token from these probabilities? Several approaches:
Greedy Selection: Always pick the highest-probability token.
- Pros: Deterministic, reproducible
- Cons: Can be repetitive, get stuck in loops
Temperature Sampling: Divide logits by temperature before softmax:
- Temperature = 1.0: distribution unchanged
- Temperature < 1.0: more focused, confident (less random)
- Temperature > 1.0: more creative, diverse (more random)
# Low temperature (0.1)
P(cat) = 0.95, P(dog) = 0.03, P(rock) = 0.02
# High temperature (2.0)
P(cat) = 0.35, P(dog) = 0.30, P(rock) = 0.15
Top-K Sampling: Limit to top K most likely tokens, then sample.
Top-P (Nucleus) Sampling: Limit to smallest set of tokens whose cumulative probability exceeds P (e.g., 0.9).
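A sketch combining the three strategies in plain Python. The `{token: logit}` dictionary is a toy stand-in for a vocabulary-sized tensor, and real samplers are vectorized, but the control flow is the same.

```python
import math
import random

def sample_next(logits, temperature=1.0, top_k=None, top_p=None):
    """Pick a next token from raw logits using temperature, top-k, and top-p."""
    # Temperature: divide logits before the softmax.
    items = [(tok, score / temperature) for tok, score in logits.items()]
    # Softmax (subtracting the max for numerical stability).
    m = max(s for _, s in items)
    probs = [(tok, math.exp(s - m)) for tok, s in items]
    total = sum(p for _, p in probs)
    probs = sorted(((tok, p / total) for tok, p in probs), key=lambda kv: -kv[1])
    # Top-k: keep only the k most likely tokens.
    if top_k is not None:
        probs = probs[:top_k]
    # Top-p (nucleus): smallest prefix whose cumulative probability >= p.
    if top_p is not None:
        kept, cum = [], 0.0
        for tok, p in probs:
            kept.append((tok, p))
            cum += p
            if cum >= top_p:
                break
        probs = kept
    # Renormalize what survived and sample from it.
    total = sum(p for _, p in probs)
    r = random.random() * total
    for tok, p in probs:
        r -= p
        if r <= 0:
            return tok
    return probs[-1][0]

print(sample_next({"cat": 2.0, "dog": 1.0, "rock": -1.0}, temperature=0.7, top_k=2))
```

Greedy selection is the degenerate case `top_k=1`, and in practice APIs expose exactly these knobs as `temperature`, `top_k`/`top_p` style parameters.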
Step 8: Autoregressive Generation
The selected token is appended to the input, and we repeat the entire process:
Generation Process:
Iteration 1:
Input: [USER_PROMPT]
Output: token_1
Input: [USER_PROMPT, token_1]
Iteration 2:
Input: [USER_PROMPT, token_1]
Output: token_2
Input: [USER_PROMPT, token_1, token_2]
...
Iteration N:
Input: [USER_PROMPT, token_1, ..., token_{N-1}]
Output: [EOS] (end-of-sequence)
STOP
This is why LLMs are called autoregressive — they predict the future based on the past, then use that prediction to predict the next step.
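The loop above can be sketched in a few lines. Here `next_token` is a hypothetical stand-in for the full pipeline (tokenize, forward pass, sample); the canned answers are invented purely so the loop has something to do.

```python
# `next_token` is a toy stand-in for a full forward pass plus sampling.
def next_token(tokens):
    canned = {
        ("The", "capital", "of", "France", "is"): "Paris",
        ("The", "capital", "of", "France", "is", "Paris"): "<eos>",
    }
    return canned.get(tuple(tokens), "<eos>")

def generate(prompt_tokens, max_new_tokens=16, eos="<eos>"):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):   # stopping criterion: maximum length
        tok = next_token(tokens)
        if tok == eos:                # stopping criterion: end-of-sequence token
            break
        tokens.append(tok)            # append the prediction and feed it back in
    return tokens

print(generate(["The", "capital", "of", "France", "is"]))
# ['The', 'capital', 'of', 'France', 'is', 'Paris']
```

The key structural point: every generated token re-enters the input, which is why generation cost grows with output length.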
Parameters Explained: What Do Those Numbers Mean?
When you hear "GPT-3 has 175 billion parameters" or "Llama 3 has 8 billion parameters," what does that actually mean?
What is a Parameter?
A parameter is a value in the model that was learned during training. Think of it as a "knob" or "weight" that the model adjusts to minimize prediction error.
More parameters generally mean: - More capacity to learn complex patterns - Better ability to capture nuances - Larger model file size - More compute required for inference
Where Are Parameters Located?
In a typical Transformer-based LLM, parameters exist in:
1. Attention weight matrices (Q, K, V projections)
   - For each attention head: W_Q, W_K, W_V matrices
   - Output projection matrix
2. Feed-Forward Network weights
   - Two matrices (up-projection and down-projection)
3. Layer Normalization weights
   - Scale and bias terms
4. Embedding matrices
   - Token embeddings
   - Position embeddings
Counting Parameters: A Simplified Example
For a model with: - Vocabulary size: 50,000 tokens - Hidden dimension: 4,096 - Number of layers: 32 - Attention heads: 32
Per-layer parameters: - Q, K, V projections: 3 × (4096 × 4096) = 50M - Output projection: 4096 × 4096 = 17M - FFN (up): 4096 × 16,384 = 67M - FFN (down): 16,384 × 4096 = 67M - LayerNorm: ~8,000
Total per layer: ~201M parameters
All layers: 32 × 201M = 6.4B
Add embeddings: 50,000 × 4096 × 2 (input + output) = ~400M
Total: ~6.8B parameters
(Real models have additional components, hence "8B" rather than "6.8B")
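The back-of-envelope arithmetic above as a quick calculator (biases and other small terms ignored, as in the worked example):

```python
def count_params(vocab=50_000, d_model=4_096, n_layers=32, d_ffn=16_384):
    """Back-of-envelope Transformer parameter count."""
    attn = 4 * d_model * d_model        # Q, K, V and output projections
    ffn = 2 * d_model * d_ffn           # up- and down-projection
    layer_norm = 4 * d_model            # two LayerNorms, scale + bias each
    per_layer = attn + ffn + layer_norm
    embeddings = 2 * vocab * d_model    # input + (untied) output embeddings
    return n_layers * per_layer + embeddings

print(f"{count_params() / 1e9:.2f}B parameters")  # 6.85B parameters
```

Swapping in a model's published dimensions gives a decent first estimate of its size, though real architectures add pieces (biases, shared embeddings, GQA) that shift the total.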
Parameter Scaling: A Brief History
| Model | Year | Parameters | Notable |
|---|---|---|---|
| GPT-1 | 2018 | 117M | First GPT |
| GPT-2 | 2019 | 1.5B | Zero-shot learning |
| GPT-3 | 2020 | 175B | Few-shot learning |
| GPT-4 | 2023 | ~1.7T (estimated, mixture of experts) | Multimodal |
| Llama 2 | 2023 | 7B, 13B, 70B | Open weights |
| Llama 3 | 2024 | 8B, 70B, 405B | Open weights |
| Claude 3 | 2024 | Unknown (estimated >100B) | Strong reasoning |
Do More Parameters Always Mean Better?
Not necessarily. Key considerations:
- Quality matters more than quantity — A well-trained 8B model can outperform a poorly-trained 70B model
- Efficiency improvements — Quantization, distillation, and architecture improvements can make smaller models competitive
- Diminishing returns — The leap from 7B to 70B is significant; the leap from 70B to 700B is less so
- Training data quality — A model trained on higher-quality data can achieve more with fewer parameters
The Computational Reality of Inference
Inference is computationally intensive. Let's break down why.
Memory Requirements
Model weights: A 70B parameter model in FP16 (16-bit float) requires:
70B × 2 bytes = 140 GB of GPU memory
That's more memory than most single GPUs offer. Solutions:
- Tensor parallelism: Split weights across multiple GPUs
- Quantization: Use INT8 or INT4 (4x or 8x less memory)
- CPU offloading: Keep some weights in CPU RAM
KV Cache: During autoregressive generation, the model caches key and value vectors to avoid recomputation:
KV cache memory ≈ 2 × (layers) × (heads) × (head_dim) × (context_length) × (batch_size) × 2 bytes
For a 70B model with 80K context, this can be tens of GBs.
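Plugging a hypothetical 70B-class configuration into the formula: 80 layers and, since most recent large models use grouped-query attention, 8 KV heads of dimension 128 (these numbers are illustrative, real configs vary).

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len,
                   batch=1, dtype_bytes=2):
    """KV cache size per the formula above: 2 tensors (K and V) per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * batch * dtype_bytes

# Hypothetical 70B-class config with grouped-query attention (8 KV heads).
gb = kv_cache_bytes(80, 8, 128, 80_000) / 1e9
print(f"{gb:.0f} GB")  # 26 GB
```

Without grouped-query attention (64 full KV heads) the same context would need 8x more, which is why KV-head reduction matters so much for long contexts.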
FLOPs: Floating Point Operations
For each token generated, a model performs approximately:
FLOPs per token ≈ 2 × parameters
For a 70B model: ~140 GFLOPs per token.
Generating 1,000 tokens ≈ 140 TFLOPs.
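A quick sanity check on the arithmetic:

```python
params = 70e9                      # 70B parameters
flops_per_token = 2 * params       # ~2 FLOPs per parameter per generated token
total = flops_per_token * 1_000    # a 1,000-token response
print(f"{flops_per_token / 1e9:.0f} GFLOPs/token, {total / 1e12:.0f} TFLOPs total")
# 140 GFLOPs/token, 140 TFLOPs total
```

Note the units: per token it's gigaFLOPs, and only the full response reaches teraFLOPs; this is why per-token latency on modern GPUs is milliseconds, not seconds.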
Latency Factors
What determines how fast an LLM responds?
- Model size — More parameters = more compute per token
- Batch size — Larger batches increase throughput but also increase latency
- Context length — Longer contexts require more memory and computation
- Hardware — GPU speed, memory bandwidth, interconnect
- Quantization — INT4 is faster than FP16 but may reduce quality
Illustrative Throughput Examples (70B model, rough figures; note that FP16 weights at 140 GB exceed any single consumer GPU, so the FP16 row assumes multi-GPU or offloading)
| Configuration | Tokens/Second |
|---|---|
| FP16, batch 1 | ~5-10 t/s |
| INT8, batch 1 | ~15-25 t/s |
| INT4, batch 1 | ~30-40 t/s |
| INT4, batch 8 | ~80-120 t/s |
Inference Optimization Techniques
To make inference practical, several techniques are employed:
1. Quantization
Reduce precision of weights from FP32/FP16 to INT8 or INT4:
Original FP32 weight: 3.14159265 (32 bits)
INT8: round(weight / scale) (8 bits, 256 possible levels)
INT4: round(weight / scale) (4 bits, 16 possible levels)
(where "scale" maps the tensor's value range onto the integer range)
Types: - Post-training quantization (PTQ): Quantize after training - Quantization-aware training (QAT): Train with quantization in mind
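A minimal sketch of symmetric post-training quantization to INT8, showing the per-tensor scale and the rounding error it introduces. (Real quantizers typically use per-channel or group-wise scales and calibration data; this is the bare idea.)

```python
def quantize_int8(weights):
    """Symmetric PTQ sketch: map floats to int8 via one per-tensor scale,
    then dequantize to inspect the reconstruction error."""
    scale = max(abs(w) for w in weights) / 127
    q = [round(w / scale) for w in weights]   # integers in -127..127
    dq = [v * scale for v in q]               # approximate reconstruction
    return q, dq, scale

w = [3.14159265, -1.5, 0.02]
q, dq, scale = quantize_int8(w)
```

The reconstruction error per weight is at most half the scale, which is why quantization trades a small, bounded precision loss for a 2-4x memory reduction.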
2. KV Cache Optimization
The KV cache grows as generation proceeds. Optimizations include: - PagedAttention (vLLM): Non-contiguous memory allocation - Streaming cache: Evict older KV pairs strategically
3. Speculative Execution
Use a smaller "draft" model to predict multiple tokens, verify with the main model:
Draft model generates: [A, B, C, D]
Main model validates: [A] ✓, [B] ✓, [C] ✓, [D] ✗
Keep: [A, B, C] (3 tokens from one verification pass, instead of 1)
Up to 3x speedup possible.
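The acceptance logic can be sketched as below. `draft` and `main` are hypothetical toy models (they just emit letters by position), and note that in a real system the main model verifies all drafted tokens in one batched forward pass rather than one call per token.

```python
def speculative_step(tokens, draft, main, k=4):
    """Draft k tokens cheaply, keep the longest prefix the main model agrees
    with, plus one token from the main model itself."""
    proposal = list(tokens)
    drafted = []
    for _ in range(k):                # cheap draft-model passes
        t = draft(proposal)
        drafted.append(t)
        proposal.append(t)
    accepted, ctx = [], list(tokens)
    for t in drafted:                 # verification against the main model
        if main(ctx) != t:
            break
        accepted.append(t)
        ctx.append(t)
    accepted.append(main(ctx))        # main model supplies the next token anyway
    return tokens + accepted

# Toy models emitting letters by position; the draft diverges at step 3.
draft = lambda ctx: "abx"[len(ctx)]
main = lambda ctx: "abcde"[len(ctx)]
print(speculative_step([], draft, main, k=3))  # ['a', 'b', 'c']
```

The output is guaranteed to match what the main model would have produced on its own; the draft model only changes how many tokens each expensive pass yields.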
4. Prompt Caching
If prompts share common prefixes (system messages, context), cache the KV for that prefix:
First request:
System: "You are a helpful assistant..."
User: "What is Python?"
(computes KV for system)
Second request:
System: "You are a helpful assistant..."
User: "What is Java?"
(reuses cached KV for system!)
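The idea reduces to keying a cache on the shared prefix. Here `compute_kv` is a hypothetical stand-in for actually running the model's layers, and the instrumented `fake_kv` just records how often it gets called.

```python
# `compute_kv` stands in for the expensive forward pass over a text span.
cache = {}

def kv_for_prompt(system, user, compute_kv):
    if system not in cache:
        cache[system] = compute_kv(system)   # expensive: full forward pass
    prefix_kv = cache[system]                # cheap: reused on later requests
    return prefix_kv + compute_kv(user)      # only the new suffix is computed

calls = []
fake_kv = lambda text: (calls.append(text) or [f"kv({text})"])
kv_for_prompt("You are a helpful assistant.", "What is Python?", fake_kv)
kv_for_prompt("You are a helpful assistant.", "What is Java?", fake_kv)
print(calls.count("You are a helpful assistant."))  # 1 (computed only once)
```

Production systems (and several provider APIs) implement this at the KV-cache level, which is why long, shared system prompts can be nearly free after the first request.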
5. Continuous Batching
Instead of waiting for a full request to complete, batch new requests as soon as any completes:
Traditional batching:
Wait for all 8 requests to finish → process next 8
Continuous batching:
Request 3 finishes → immediately add Request 9
(better GPU utilization)
A Note on Training vs Inference Hardware
Training and inference have different hardware requirements:
| Aspect | Training | Inference |
|---|---|---|
| Peak memory | Very high (gradients + activations) | Lower (just forward pass) |
| Batch size | Can be very large | Often small (1-8) |
| Precision | FP32/BF16 required | INT8/INT4 often sufficient |
| Interconnect | Critical (data parallel) | Less critical |
| Throughput | Important | Latency often more important |
This is why training requires expensive data center GPUs (A100, H100) while inference can run on consumer GPUs or even CPUs with quantization.
Historical Context: The Evolution of Inference
Understanding inference today requires understanding how we got here.
2017: The Transformer Revolution
"Attention Is All You Need" (Vaswani et al., 2017) introduced the Transformer architecture. This paper changed everything:
- Replaced RNNs/LSTMs with attention mechanisms
- Enabled parallel processing of sequences
- Scaled dramatically with data and compute
2018-2019: GPT and GPT-2
OpenAI's Generative Pre-trained Transformer showed that: - Pre-training on large corpora + fine-tuning worked - Larger models exhibited emergent capabilities - Zero-shot and few-shot learning emerged
2020: GPT-3 and the Inference Era
GPT-3 (175B parameters) demonstrated that: - Scale unlocks new capabilities - Inference became the primary use case - API access enabled widespread application building
2022-2023: The ChatGPT Explosion
ChatGPT brought LLM inference to the masses: - User-friendly interface - Reinforcement learning from human feedback (RLHF) improved outputs - Inference became the bottleneck (not model training)
2024-Present: Efficiency Era
With models reaching diminishing returns on scale, focus shifted to: - Making inference cheaper and faster - Open-source models (Llama, Mistral, Qwen) - Specialized inference hardware - Edge deployment
Real-World Examples: Inference in Action
Let's put this all together with concrete examples.
Example 1: Simple Question Answering
User: "What is the capital of Japan?"
Inference process:
1. Tokenize → [2103, 318, 263, 16259, 5755, 30]
2. Embed → vectors
3. Process through 32 layers of transformers
4. Output: logits over 50k vocabulary
5. Sample next token: "Tokyo"
6. Append "Tokyo", repeat
7. Generate EOS token → STOP
Output: "Tokyo"
Total time: ~500ms on a good GPU
Example 2: Code Generation
User: "Write a Python function to compute Fibonacci numbers"
Inference process:
1. Input processed through model
2. Model attends to similar code in training data
3. Token-by-token generation:
- "def" → "fibonacci"
- "fibonacci" → "(n):"
- "(n):" → "\n"
- ... (generates full function)
Example 3: Long Context
When you provide a 50-page document and ask questions:
User: Based on the document, what was the revenue in Q3?
The model:
1. Processes the entire 50-page context (tens of thousands of tokens)
2. Attention identifies relevant sections
3. Extracts and synthesizes answer
This requires significant KV cache memory but demonstrates the power of attention over long contexts.
Why This Matters for You
Understanding inference helps you:
- Debug issues — Know why responses are slow or poor quality
- Optimize prompts — Understand how context length and prompt structure affect output
- Choose models — Select right model for your use case (size, quantization, speed)
- Build better products — Design around inference costs and limitations
- Stay informed — Follow the rapidly evolving AI landscape
Coming Next Week
Now that we've covered inference, our logical next topic is LLM Training — how models actually learn. We'll cover:
- Pre-training vs fine-tuning
- The training process step-by-step
- Loss functions and optimization
- RLHF and DPO
- How to train your own model
Stay tuned!
References & Further Reading
- "Attention Is All You Need" — Vaswani et al., 2017
- "Language Models are Few-Shot Learners" — Brown et al., 2020 (GPT-3 paper)
- "LLaMA: Open and Efficient Foundation Language Models" — Touvron et al., 2023
- vLLM Documentation — Efficient inference
- "Efficient Memory Management for Large Language Model Serving with PagedAttention" — Kwon et al., 2023
This is Signal AI Wednesday — simplifying AI, one concept at a time.
Have questions about this deep-dive? Reply to this email.