## Overview

The Transformer architecture, introduced in "Attention Is All You Need" (Vaswani et al., 2017), revolutionized NLP and now underpins virtually every modern LLM.
## Key Components

```
Input → Embedding → Positional Encoding → Transformer Blocks → Output
                                                  ↓
                                  [Multi-Head Attention + FFN] × N
```
## Self-Attention

Scaled dot-product attention is the core mechanism: it lets every token attend to every other token in the sequence.

```python
import torch
import torch.nn.functional as F

def attention(Q, K, V, mask=None):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
    if mask is not None:
        # Block disallowed positions with a large negative value before softmax
        scores = scores.masked_fill(mask == 0, -1e9)
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, V)
```
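As a quick usage check (shapes here are arbitrary), the function returns one weighted combination of the value vectors per query position:

```python
# Toy shapes: batch of 2, sequence length 5, head dimension 64
Q = K = V = torch.randn(2, 5, 64)
out = attention(Q, K, V)
print(out.shape)  # torch.Size([2, 5, 64])
```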
## Multi-Head Attention

Rather than a single attention operation, several heads run in parallel, each with its own learned projections; their outputs are concatenated and projected back:

```python
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.d_k = d_model // num_heads  # d_model must be divisible by num_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, x):
        batch_size = x.size(0)
        # Project, then split d_model into (num_heads, d_k) and move heads to dim 1
        Q = self.W_q(x).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(x).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(x).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        attn_output = attention(Q, K, V)
        # Merge heads back into a single d_model-sized vector per position
        attn_output = attn_output.transpose(1, 2).contiguous().view(batch_size, -1, self.num_heads * self.d_k)
        return self.W_o(attn_output)
```
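A brief usage sketch (the dimensions are illustrative): with d_model = 512 and 8 heads, each head works in a 64-dimensional subspace, and the output keeps the input shape:

```python
mha = MultiHeadAttention(d_model=512, num_heads=8)
x = torch.randn(2, 10, 512)   # (batch, seq_len, d_model)
print(mha(x).shape)           # torch.Size([2, 10, 512])
```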
## Positional Encoding

Attention itself is permutation-invariant, so Transformers have no inherent notion of token order. Sinusoidal positional encodings inject that information:

```python
import math

def positional_encoding(seq_len, d_model):
    position = torch.arange(seq_len).unsqueeze(1)
    # Frequencies form a geometric progression controlled by d_model
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
    return pe
```
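In practice the encoding is simply added to the token embeddings before the first block. A small sketch with illustrative sizes (vocabulary of 32000, d_model of 512):

```python
vocab_size, seq_len, d_model = 32000, 10, 512    # illustrative values
embed = nn.Embedding(vocab_size, d_model)
tokens = torch.randint(0, vocab_size, (2, seq_len))        # (batch, seq_len)
x = embed(tokens) + positional_encoding(seq_len, d_model)  # PE broadcasts over the batch
```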
## Feed-Forward Network

A position-wise feed-forward network is applied after attention in each block; it expands to a hidden size d_ff (typically 4 × d_model) and projects back:

```python
class FFN(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)
        self.activation = nn.GELU()

    def forward(self, x):
        # Expand, apply the nonlinearity, project back to d_model
        return self.linear2(self.activation(self.linear1(x)))
```
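Putting the pieces together: each block wraps multi-head attention and the FFN in residual connections with layer normalization. Below is a minimal sketch built from the classes above, written in the pre-norm style common in modern models (the original paper normalizes after each residual instead):

```python
class TransformerBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff):
        super().__init__()
        self.attn = MultiHeadAttention(d_model, num_heads)
        self.ffn = FFN(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        x = x + self.attn(self.norm1(x))  # attention sublayer + residual
        x = x + self.ffn(self.norm2(x))   # feed-forward sublayer + residual
        return x
```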
## Encoder vs Decoder

| Type | Attention | Use Case | Examples |
|---|---|---|---|
| Encoder | Bidirectional | Classification | BERT |
| Decoder | Causal (masked) | Generation | GPT |
| Encoder-Decoder | Both | Translation | T5, BART |
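The "Attention" column comes down to the mask passed into attention: a decoder applies a lower-triangular (causal) mask so each position can only attend to itself and earlier positions. A minimal illustration using the `attention` function defined earlier:

```python
seq_len = 5
causal_mask = torch.tril(torch.ones(seq_len, seq_len))  # 1 = allowed, 0 = blocked

Q = K = V = torch.randn(1, seq_len, 64)
bidirectional = attention(Q, K, V)              # encoder-style: full attention
causal = attention(Q, K, V, mask=causal_mask)   # decoder-style: no looking ahead
```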
## Modern Improvements

- RoPE: Rotary positional embeddings that encode relative position by rotating query/key vectors (Llama)
- GQA: Grouped-query attention, where several query heads share one key/value head (Llama 2)
- Flash Attention: Exact but memory-efficient attention that avoids materializing the full attention matrix
- RMSNorm: Simpler normalization that rescales by the root mean square, without mean subtraction (see the sketch after this list)
- SwiGLU: A gated (SiLU-based) activation for the feed-forward network
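As one concrete example, RMSNorm drops LayerNorm's mean subtraction and bias and rescales by the root mean square alone. A minimal sketch (the eps value is illustrative):

```python
class RMSNorm(nn.Module):
    def __init__(self, d_model, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d_model))
        self.eps = eps

    def forward(self, x):
        # Scale by the reciprocal root mean square over the feature dimension
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)
```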