Signal AI Wednesday: LLM Inference — The Engine Behind AI Responses (April 2026)
Note: This is an extensive deep-dive (5x normal length) covering everything you need to know about LLM inference.
Introduction
Every time you type a prompt into ChatGPT, Claude, or any other LLM and receive a response, a fascinating and computationally intensive process unfolds behind the scenes. This process is called inference — and understanding it is crucial for anyone working with or building upon AI systems.
In our last installment, we mentioned we'd cover inference, and today we're diving deep. We'll explore:
- What exactly inference is
- How it works step-by-step
- Why it's different from training
- What "parameters" actually mean
- The computational challenges
- Real-world examples and analogies
Let's get into it.
What is LLM Inference?
Inference is the process of using a trained model to generate outputs based on new input. In the context of Large Language Models (LLMs), inference means taking a trained neural network (with weights already learned during training) and using it to predict the next token in a sequence given some input text.
Think of it this way: training is like studying for an exam — the model learns patterns, facts, and relationships from massive amounts of text. Inference is like taking the exam — you're applying what you learned to answer new questions.
The Core Task: Next Token Prediction
At its heart, an LLM performs next-token prediction. Given a sequence of tokens (words, subwords, or characters), the model predicts the most likely token to follow.
Input: "The capital of France is"
Model predicts: "Paris"
Input: "The capital of France is Paris,"
Model predicts: "and"
Input: "The capital of France is Paris, and its"
Model predicts: "population"
This process repeats, token by token, until:
- A maximum length is reached
- A special end-of-sequence token is generated
- Some other stopping criterion is met
The result is the coherent text you see as the model's "response."
Training vs. Inference: Two Sides of the Coin
Understanding the difference between training and inference is essential. They sound similar but are fundamentally different operations.
Training: Learning from Data
| Aspect | Training |
|---|---|
| Goal | Learn patterns, weights, and representations from data |
| Direction | Forward pass + backward pass (backpropagation) |
| Data | Massive datasets (trillions of tokens) |
| Duration | Weeks to months on thousands of GPUs |
| Compute | Massive (e.g., GPT-4 reportedly trained on ~25,000 A100 GPUs for months) |
| Frequency | One-time (or occasional fine-tuning) |
| Weight updates | Weights change and improve over time |
During training:
1. The model sees billions of text examples
2. For each example, it makes a prediction
3. It compares the prediction to the actual next token
4. It calculates loss (how wrong it was)
5. It adjusts weights slightly to reduce error
6. Repeat trillions of times
Inference: Applying What Was Learned
| Aspect | Inference |
|---|---|
| Goal | Generate outputs for new inputs using learned weights |
| Direction | Forward pass only (no backpropagation) |
| Data | Your prompt (typically a few hundred to a few thousand tokens) |
| Duration | Milliseconds to seconds per response |
| Compute | Significant but far less than training |
| Frequency | Every time a user makes a request |
| Weight updates | Weights remain frozen |
During inference:
1. Your prompt is converted to tokens
2. The model processes the sequence through its layers
3. It outputs probability distributions for the next token
4. A sampling strategy selects the next token
5. The token is appended and the process repeats
Key Differences Summarized
- Training modifies weights; inference uses them
- Training requires massive datasets; inference needs only your prompt
- Training is extremely expensive (millions $); inference is cheap (cents per request)
- Training happens rarely; inference happens constantly
Analogy: Training is like going through years of school. Inference is like taking a test. You only train once (or occasionally retrain), but you "infer" every time you answer a question.
How LLM Inference Works: A Step-by-Step Journey
Let's trace exactly what happens when you send a prompt to an LLM.
Step 1: Tokenization
Your text input is first converted into tokens — numerical representations of text chunks.
Input: "The cat sat on the mat"
Tokens: [464, 3746, 6313, 322, 264, 6890]
Tokenizers split text into subword units. Common approaches include:
- Byte Pair Encoding (BPE) — used by GPT, GPT-2, GPT-3
- WordPiece — used by BERT
- SentencePiece — used by T5, Llama
Why subwords? It balances:
- Vocabulary size (you can't have a token for every word)
- Meaning preservation ("unhappiness" → "un" + "happi" + "ness")
- Handling rare or unseen words
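To make subword splitting concrete, here's a toy greedy longest-match tokenizer with a made-up six-entry vocabulary. This is purely illustrative: real BPE tokenizers learn their merge rules from data, and the vocabulary and IDs below are invented.

```python
# Toy greedy longest-match subword tokenizer (made-up vocabulary and IDs;
# real BPE tokenizers learn their merges from massive corpora).
VOCAB = {"un": 0, "happi": 1, "ness": 2, "the": 3, "cat": 4, "s": 5}

def tokenize(text):
    """Split each word into the longest matching vocab pieces, left to right."""
    ids = []
    for word in text.lower().split():
        i = 0
        while i < len(word):
            # Try the longest remaining substring first.
            for j in range(len(word), i, -1):
                piece = word[i:j]
                if piece in VOCAB:
                    ids.append(VOCAB[piece])
                    i = j
                    break
            else:
                raise ValueError(f"no vocab piece matches {word[i:]!r}")
    return ids

print(tokenize("unhappiness"))  # [0, 1, 2] -> "un" + "happi" + "ness"
```

Note how "unhappiness" never needs its own vocabulary entry: three reusable pieces cover it, which is exactly the vocabulary-size vs. meaning trade-off described above.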
Step 2: Embedding
Each token is converted into a vector (list of numbers) — typically 1,024 to 8,192 dimensions depending on model size.
Token "cat" → [0.12, -0.34, 0.78, ..., 0.02] (1024 dimensions)
These embeddings capture semantic meaning. Similar words have similar vectors.
Step 3: Positional Encoding
Since the model processes tokens in parallel, it needs to know order. Positional encodings are added to embeddings to represent position in sequence:
Position 1: sin/cos waves at different frequencies
Position 2: sin/cos waves shifted
...
This tells the model "cat" at position 2 is different from "cat" at position 5.
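A minimal sketch of the sinusoidal scheme from the original Transformer paper (many newer models use learned or rotary position embeddings instead, but the idea of injecting position information is the same):

```python
import math

def positional_encoding(position, d_model):
    """Sinusoidal positional encoding: even dimensions use sin, odd use cos,
    with wavelengths growing geometrically so each position gets a unique
    pattern the model can learn to read."""
    pe = []
    for i in range(d_model):
        angle = position / (10000 ** ((2 * (i // 2)) / d_model))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

print(positional_encoding(0, 4))  # [0.0, 1.0, 0.0, 1.0]
```

These vectors are simply added to the token embeddings, so "cat" at position 2 and "cat" at position 5 enter the first layer as different vectors.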
Step 4: The Transformer Layers
Here's where the magic happens. The input flows through multiple Transformer layers (12 to 96+ layers depending on model):
Inside Each Layer:
a) Multi-Head Self-Attention
- Each token "looks at" every other token
- It determines how much each token should "attend to" every other token
- This captures relationships, context, and dependencies
Visualizing attention (described):
For sentence: "The animal didn't cross the street because it was too tired"
Attention weights for "it":
The → 0.02
animal → 0.85 ← HIGH attention! (it refers to animal)
didn't → 0.01
cross → 0.01
...
tired → 0.08
Multiple attention heads (8 to 64+) each learn different types of relationships:
- Syntactic (subject-verb)
- Semantic (cat → animal)
- Contextual (this word depends on that word)
b) Feed-Forward Network (FFN)
- After attention, each token passes through a feed-forward network
- This processes and transforms the information
- As with the attention sub-block, the output goes through a residual connection and layer normalization
Mathematical Representation:
For each layer l:
# Self-Attention
Q = X × W_Q # Query
K = X × W_K # Key
V = X × W_V # Value
Attention(Q, K, V) = softmax(Q × K^T / √d_k) × V
# Feed-Forward
FFN(X) = GELU(X × W_1 + b_1) × W_2 + b_2
Where W_Q, W_K, W_V, W_1, W_2 are learned weight matrices.
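The attention equation above, sketched in plain Python for a single head with no batching. The tiny 2×2 inputs are purely illustrative; real implementations use tensor libraries, but the math is the same.

```python
import math

def softmax(xs):
    m = max(xs)                          # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def attention(Q, K, V):
    """Single-head scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)                                        # token-token scores
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = [softmax(row) for row in scaled]                     # each row sums to 1
    return matmul(weights, V)                                      # weighted mix of V rows

# Two tokens, d_k = 2: each output row is a weighted average of V's rows.
out = attention([[1.0, 0.0], [0.0, 1.0]],
                [[1.0, 0.0], [0.0, 1.0]],
                [[1.0, 0.0], [0.0, 1.0]])
```

Because each softmax row sums to 1, every output position is a convex mixture of the value vectors, which is exactly the "how much should this token attend to each other token" intuition from above.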
Step 5: Output Projection
After all transformer layers, we get representations for each position. The final layer projects to vocabulary size:
Final hidden state → [scores for each token in vocabulary]
Step 6: Converting Scores to Probabilities
Raw scores (logits) are converted to probabilities via softmax:
P(token_i) = exp(score_i) / Σ exp(score_j)
Now we have probabilities for every possible next token.
Step 7: Sampling Strategy
How do we pick the next token from these probabilities? Several approaches:
Greedy Selection: Always pick the highest-probability token.
- Pros: Deterministic, reproducible
- Cons: Can be repetitive, get stuck in loops
Temperature Sampling: Divide logits by temperature before softmax:
- Temperature = 1.0: distribution unchanged
- Temperature < 1.0: more focused, confident (less random)
- Temperature > 1.0: more creative, diverse (more random)
# Low temperature (0.1)
P(cat) = 0.95, P(dog) = 0.03, P(rock) = 0.02
# High temperature (2.0)
P(cat) = 0.35, P(dog) = 0.30, P(rock) = 0.15
Top-K Sampling: Limit to top K most likely tokens, then sample.
Top-P (Nucleus) Sampling: Limit to smallest set of tokens whose cumulative probability exceeds P (e.g., 0.9).
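A sketch combining the three strategies in plain Python. The `{token: logit}` dictionary is a toy stand-in for a vocabulary-sized tensor, and real samplers are vectorized, but the control flow is the same.

```python
import math
import random

def sample_next(logits, temperature=1.0, top_k=None, top_p=None):
    """Pick a next token from raw logits using temperature, top-k, and top-p."""
    # Temperature: divide logits before the softmax.
    items = [(tok, score / temperature) for tok, score in logits.items()]
    # Softmax (subtracting the max for numerical stability).
    m = max(s for _, s in items)
    probs = [(tok, math.exp(s - m)) for tok, s in items]
    total = sum(p for _, p in probs)
    probs = sorted(((tok, p / total) for tok, p in probs), key=lambda kv: -kv[1])
    # Top-k: keep only the k most likely tokens.
    if top_k is not None:
        probs = probs[:top_k]
    # Top-p (nucleus): smallest prefix whose cumulative probability >= p.
    if top_p is not None:
        kept, cum = [], 0.0
        for tok, p in probs:
            kept.append((tok, p))
            cum += p
            if cum >= top_p:
                break
        probs = kept
    # Renormalize what survived and sample from it.
    total = sum(p for _, p in probs)
    r = random.random() * total
    for tok, p in probs:
        r -= p
        if r <= 0:
            return tok
    return probs[-1][0]

print(sample_next({"cat": 2.0, "dog": 1.0, "rock": -1.0}, temperature=0.7, top_k=2))
```

Greedy selection is the degenerate case `top_k=1`, and in practice APIs expose exactly these knobs as `temperature`, `top_k`/`top_p` style parameters.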
Step 8: Autoregressive Generation
The selected token is appended to the input, and we repeat the entire process:
Generation Process:
Iteration 1:
Input: [USER_PROMPT]
Output: token_1
Input: [USER_PROMPT, token_1]
Iteration 2:
Input: [USER_PROMPT, token_1]
Output: token_2
Input: [USER_PROMPT, token_1, token_2]
...
Iteration N:
Input: [USER_PROMPT, token_1, ..., token_{N-1}]
Output: [EOS] (end-of-sequence)
STOP
This is why LLMs are called autoregressive — they predict the future based on the past, then use that prediction to predict the next step.
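The loop above can be sketched in a few lines. Here `next_token` is a hypothetical stand-in for the full pipeline (tokenize, forward pass, sample); the canned answers are invented purely so the loop has something to do.

```python
# `next_token` is a toy stand-in for a full forward pass plus sampling.
def next_token(tokens):
    canned = {
        ("The", "capital", "of", "France", "is"): "Paris",
        ("The", "capital", "of", "France", "is", "Paris"): "<eos>",
    }
    return canned.get(tuple(tokens), "<eos>")

def generate(prompt_tokens, max_new_tokens=16, eos="<eos>"):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):   # stopping criterion: maximum length
        tok = next_token(tokens)
        if tok == eos:                # stopping criterion: end-of-sequence token
            break
        tokens.append(tok)            # append the prediction and feed it back in
    return tokens

print(generate(["The", "capital", "of", "France", "is"]))
# ['The', 'capital', 'of', 'France', 'is', 'Paris']
```

The key structural point: every generated token re-enters the input, which is why generation cost grows with output length.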
Parameters Explained: What Do Those Numbers Mean?
When you hear "GPT-3 has 175 billion parameters" or "Llama 3 has 8 billion parameters," what does that actually mean?
What is a Parameter?
A parameter is a value in the model that was learned during training. Think of it as a "knob" or "weight" that the model adjusts to minimize prediction error.
More parameters generally mean: - More capacity to learn complex patterns - Better ability to capture nuances - Larger model file size - More compute required for inference
Where Are Parameters Located?
In a typical Transformer-based LLM, parameters exist in:
1. Attention weight matrices (Q, K, V projections)
   - For each attention head: W_Q, W_K, W_V matrices
   - Output projection matrix
2. Feed-Forward Network weights
   - Two matrices (up-projection and down-projection)
3. Layer Normalization weights
   - Scale and bias terms
4. Embedding matrices
   - Token embeddings
   - Position embeddings
Counting Parameters: A Simplified Example
For a model with: - Vocabulary size: 50,000 tokens - Hidden dimension: 4,096 - Number of layers: 32 - Attention heads: 32
Per-layer parameters: - Q, K, V projections: 3 × (4096 × 4096) = 50M - Output projection: 4096 × 4096 = 17M - FFN (up): 4096 × 16,384 = 67M - FFN (down): 16,384 × 4096 = 67M - LayerNorm: ~8,000
Total per layer: ~201M parameters
All layers: 32 × 201M = 6.4B
Add embeddings: 50,000 × 4096 × 2 (input + output) = ~400M
Total: ~6.8B parameters
(Real models have additional components, hence "8B" rather than "6.8B")
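The back-of-envelope arithmetic above as a quick calculator (biases and other small terms ignored, as in the worked example):

```python
def count_params(vocab=50_000, d_model=4_096, n_layers=32, d_ffn=16_384):
    """Back-of-envelope Transformer parameter count."""
    attn = 4 * d_model * d_model        # Q, K, V and output projections
    ffn = 2 * d_model * d_ffn           # up- and down-projection
    layer_norm = 4 * d_model            # two LayerNorms, scale + bias each
    per_layer = attn + ffn + layer_norm
    embeddings = 2 * vocab * d_model    # input + (untied) output embeddings
    return n_layers * per_layer + embeddings

print(f"{count_params() / 1e9:.2f}B parameters")  # 6.85B parameters
```

Swapping in a model's published dimensions gives a decent first estimate of its size, though real architectures add pieces (biases, shared embeddings, GQA) that shift the total.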
Parameter Scaling: A Brief History
| Model | Year | Parameters | Notable |
|---|---|---|---|
| GPT-1 | 2018 | 117M | First GPT |
| GPT-2 | 2019 | 1.5B | Zero-shot learning |
| GPT-3 | 2020 | 175B | Few-shot learning |
| GPT-4 | 2023 | ~1.7T (estimated, mixture of experts) | Multimodal |
| Llama 2 | 2023 | 7B, 13B, 70B | Open weights |
| Llama 3 | 2024 | 8B, 70B, 405B | Open weights |
| Claude 3 | 2024 | Unknown (estimated >100B) | Strong reasoning |
Do More Parameters Always Mean Better?
Not necessarily. Key considerations:
- Quality matters more than quantity — A well-trained 8B model can outperform a poorly-trained 70B model
- Efficiency improvements — Quantization, distillation, and architecture improvements can make smaller models competitive
- Diminishing returns — The leap from 7B to 70B is significant; the leap from 70B to 700B is less so
- Training data quality — A model trained on higher-quality data can achieve more with fewer parameters
The Computational Reality of Inference
Inference is computationally intensive. Let's break down why.
Memory Requirements
Model weights: A 70B parameter model in FP16 (16-bit float) requires:
70B × 2 bytes = 140 GB of GPU memory
That's more memory than most single GPUs offer. Solutions:
- Tensor parallelism: Split weights across multiple GPUs
- Quantization: Use INT8 or INT4 (4x or 8x less memory)
- CPU offloading: Keep some weights in CPU RAM
KV Cache: During autoregressive generation, the model caches key and value vectors to avoid recomputation:
KV cache memory ≈ 2 × (layers) × (heads) × (head_dim) × (context_length) × (batch_size) × 2 bytes
For a 70B model with 80K context, this can be tens of GBs.
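Plugging a hypothetical 70B-class configuration into the formula: 80 layers and, since most recent large models use grouped-query attention, 8 KV heads of dimension 128 (these numbers are illustrative, real configs vary).

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len,
                   batch=1, dtype_bytes=2):
    """KV cache size per the formula above: 2 tensors (K and V) per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * batch * dtype_bytes

# Hypothetical 70B-class config with grouped-query attention (8 KV heads).
gb = kv_cache_bytes(80, 8, 128, 80_000) / 1e9
print(f"{gb:.0f} GB")  # 26 GB
```

Without grouped-query attention (64 full KV heads) the same context would need 8x more, which is why KV-head reduction matters so much for long contexts.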
FLOPs: Floating Point Operations
For each token generated, a model performs approximately:
FLOPs per token ≈ 2 × parameters
For a 70B model: ~140 GFLOPs per token.
Generating 1,000 tokens ≈ 140 TFLOPs.
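A quick sanity check on the arithmetic:

```python
params = 70e9                      # 70B parameters
flops_per_token = 2 * params       # ~2 FLOPs per parameter per generated token
total = flops_per_token * 1_000    # a 1,000-token response
print(f"{flops_per_token / 1e9:.0f} GFLOPs/token, {total / 1e12:.0f} TFLOPs total")
# 140 GFLOPs/token, 140 TFLOPs total
```

Note the units: per token it's gigaFLOPs, and only the full response reaches teraFLOPs; this is why per-token latency on modern GPUs is milliseconds, not seconds.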
Latency Factors
What determines how fast an LLM responds?
- Model size — More parameters = more compute per token
- Batch size — Larger batches increase throughput but also increase latency
- Context length — Longer contexts require more memory and computation
- Hardware — GPU speed, memory bandwidth, interconnect
- Quantization — INT4 is faster than FP16 but may reduce quality
Illustrative Throughput Examples (70B model, rough figures; note that FP16 weights at 140 GB exceed any single consumer GPU, so the FP16 row assumes multi-GPU or offloading)
| Configuration | Tokens/Second |
|---|---|
| FP16, batch 1 | ~5-10 t/s |
| INT8, batch 1 | ~15-25 t/s |
| INT4, batch 1 | ~30-40 t/s |
| INT4, batch 8 | ~80-120 t/s |
Inference Optimization Techniques
To make inference practical, several techniques are employed:
1. Quantization
Reduce precision of weights from FP32/FP16 to INT8 or INT4:
Original FP32 weight: 3.14159265 (32 bits)
INT8: round(weight / scale) (8 bits, 256 possible levels)
INT4: round(weight / scale) (4 bits, 16 possible levels)
(where "scale" maps the tensor's value range onto the integer range)
Types: - Post-training quantization (PTQ): Quantize after training - Quantization-aware training (QAT): Train with quantization in mind
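A minimal sketch of symmetric post-training quantization to INT8, showing the per-tensor scale and the rounding error it introduces. (Real quantizers typically use per-channel or group-wise scales and calibration data; this is the bare idea.)

```python
def quantize_int8(weights):
    """Symmetric PTQ sketch: map floats to int8 via one per-tensor scale,
    then dequantize to inspect the reconstruction error."""
    scale = max(abs(w) for w in weights) / 127
    q = [round(w / scale) for w in weights]   # integers in -127..127
    dq = [v * scale for v in q]               # approximate reconstruction
    return q, dq, scale

w = [3.14159265, -1.5, 0.02]
q, dq, scale = quantize_int8(w)
```

The reconstruction error per weight is at most half the scale, which is why quantization trades a small, bounded precision loss for a 2-4x memory reduction.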
2. KV Cache Optimization
The KV cache grows as generation proceeds. Optimizations include: - PagedAttention (vLLM): Non-contiguous memory allocation - Streaming cache: Evict older KV pairs strategically
3. Speculative Execution
Use a smaller "draft" model to predict multiple tokens, verify with the main model:
Draft model generates: [A, B, C, D]
Main model validates: [A] ✓, [B] ✓, [C] ✓, [D] ✗
Keep: [A, B, C] (3 tokens from one verification pass, instead of 1)
Up to 3x speedup possible.
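The acceptance logic can be sketched as below. `draft` and `main` are hypothetical toy models (they just emit letters by position), and note that in a real system the main model verifies all drafted tokens in one batched forward pass rather than one call per token.

```python
def speculative_step(tokens, draft, main, k=4):
    """Draft k tokens cheaply, keep the longest prefix the main model agrees
    with, plus one token from the main model itself."""
    proposal = list(tokens)
    drafted = []
    for _ in range(k):                # cheap draft-model passes
        t = draft(proposal)
        drafted.append(t)
        proposal.append(t)
    accepted, ctx = [], list(tokens)
    for t in drafted:                 # verification against the main model
        if main(ctx) != t:
            break
        accepted.append(t)
        ctx.append(t)
    accepted.append(main(ctx))        # main model supplies the next token anyway
    return tokens + accepted

# Toy models emitting letters by position; the draft diverges at step 3.
draft = lambda ctx: "abx"[len(ctx)]
main = lambda ctx: "abcde"[len(ctx)]
print(speculative_step([], draft, main, k=3))  # ['a', 'b', 'c']
```

The output is guaranteed to match what the main model would have produced on its own; the draft model only changes how many tokens each expensive pass yields.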
4. Prompt Caching
If prompts share common prefixes (system messages, context), cache the KV for that prefix:
First request:
System: "You are a helpful assistant..."
User: "What is Python?"
(computes KV for system)
Second request:
System: "You are a helpful assistant..."
User: "What is Java?"
(reuses cached KV for system!)
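The idea reduces to keying a cache on the shared prefix. Here `compute_kv` is a hypothetical stand-in for actually running the model's layers, and the instrumented `fake_kv` just records how often it gets called.

```python
# `compute_kv` stands in for the expensive forward pass over a text span.
cache = {}

def kv_for_prompt(system, user, compute_kv):
    if system not in cache:
        cache[system] = compute_kv(system)   # expensive: full forward pass
    prefix_kv = cache[system]                # cheap: reused on later requests
    return prefix_kv + compute_kv(user)      # only the new suffix is computed

calls = []
fake_kv = lambda text: (calls.append(text) or [f"kv({text})"])
kv_for_prompt("You are a helpful assistant.", "What is Python?", fake_kv)
kv_for_prompt("You are a helpful assistant.", "What is Java?", fake_kv)
print(calls.count("You are a helpful assistant."))  # 1 (computed only once)
```

Production systems (and several provider APIs) implement this at the KV-cache level, which is why long, shared system prompts can be nearly free after the first request.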
5. Continuous Batching
Instead of waiting for a full request to complete, batch new requests as soon as any completes:
Traditional batching:
Wait for all 8 requests to finish → process next 8
Continuous batching:
Request 3 finishes → immediately add Request 9
(better GPU utilization)
A Note on Training vs Inference Hardware
Training and inference have different hardware requirements:
| Aspect | Training | Inference |
|---|---|---|
| Peak memory | Very high (gradients + activations) | Lower (just forward pass) |
| Batch size | Can be very large | Often small (1-8) |
| Precision | FP32/BF16 required | INT8/INT4 often sufficient |
| Interconnect | Critical (data parallel) | Less critical |
| Throughput | Important | Latency often more important |
This is why training requires expensive data center GPUs (A100, H100) while inference can run on consumer GPUs or even CPUs with quantization.
Historical Context: The Evolution of Inference
Understanding inference today requires understanding how we got here.
2017: The Transformer Revolution
"Attention Is All You Need" (Vaswani et al., 2017) introduced the Transformer architecture. This paper changed everything:
- Replaced RNNs/LSTMs with attention mechanisms
- Enabled parallel processing of sequences
- Scaled dramatically with data and compute
2018-2019: GPT and GPT-2
OpenAI's Generative Pre-trained Transformer showed that: - Pre-training on large corpora + fine-tuning worked - Larger models exhibited emergent capabilities - Zero-shot and few-shot learning emerged
2020: GPT-3 and the Inference Era
GPT-3 (175B parameters) demonstrated that: - Scale unlocks new capabilities - Inference became the primary use case - API access enabled widespread application building
2022-2023: The ChatGPT Explosion
ChatGPT brought LLM inference to the masses: - User-friendly interface - Reinforcement learning from human feedback (RLHF) improved outputs - Inference became the bottleneck (not model training)
2024-Present: Efficiency Era
With models reaching diminishing returns on scale, focus shifted to: - Making inference cheaper and faster - Open-source models (Llama, Mistral, Qwen) - Specialized inference hardware - Edge deployment
Real-World Examples: Inference in Action
Let's put this all together with concrete examples.
Example 1: Simple Question Answering
User: "What is the capital of Japan?"
Inference process:
1. Tokenize → [2103, 318, 263, 16259, 5755, 30]
2. Embed → vectors
3. Process through 32 layers of transformers
4. Output: logits over 50k vocabulary
5. Sample next token: "Tokyo"
6. Append "Tokyo", repeat
7. Generate EOS token → STOP
Output: "Tokyo"
Total time: ~500ms on a good GPU
Example 2: Code Generation
User: "Write a Python function to compute Fibonacci numbers"
Inference process:
1. Input processed through model
2. Model attends to similar code in training data
3. Token-by-token generation:
- "def" → "fibonacci"
- "fibonacci" → "(n):"
- "(n):" → "\n"
- ... (generates full function)
Example 3: Long Context
When you provide a 50-page document and ask questions:
User: Based on the document, what was the revenue in Q3?
The model:
1. Processes the entire 50-page context (tens of thousands of tokens)
2. Attention identifies relevant sections
3. Extracts and synthesizes answer
This requires significant KV cache memory but demonstrates the power of attention over long contexts.
Why This Matters for You
Understanding inference helps you:
- Debug issues — Know why responses are slow or poor quality
- Optimize prompts — Understand how context length and prompt structure affect output
- Choose models — Select right model for your use case (size, quantization, speed)
- Build better products — Design around inference costs and limitations
- Stay informed — Follow the rapidly evolving AI landscape
Coming Next Week
Now that we've covered inference, our logical next topic is LLM Training — how models actually learn. We'll cover:
- Pre-training vs fine-tuning
- The training process step-by-step
- Loss functions and optimization
- RLHF and DPO
- How to train your own model
Stay tuned!
References & Further Reading
- "Attention Is All You Need" — Vaswani et al., 2017
- "Language Models are Few-Shot Learners" — Brown et al., 2020 (GPT-3 paper)
- "LLaMA: Open and Efficient Foundation Language Models" — Touvron et al., 2023
- vLLM Documentation — Efficient inference
- "Efficient Memory Management for Large Language Model Serving with PagedAttention" — Kwon et al., 2023
This is Signal AI Wednesday — simplifying AI, one concept at a time.
Have questions about this deep-dive? Reply to this email.