Overview

The Transformer architecture, introduced in “Attention Is All You Need” (Vaswani et al., 2017), revolutionized NLP and now underpins essentially all modern LLMs.

Key Components

Input → Embedding → Positional Encoding → Transformer Blocks → Output
                              [Multi-Head Attention + FFN] × N

Self-Attention

The core mechanism that lets each token attend to every other token in the sequence: each output is a weighted sum of value vectors, with weights derived from query–key similarity.

import torch
import torch.nn.functional as F

def attention(Q, K, V, mask=None):
    """Scaled dot-product attention."""
    d_k = Q.size(-1)
    # Similarity scores, scaled by sqrt(d_k) to keep softmax gradients stable
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
    
    # Positions where mask == 0 are blocked from being attended to
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    
    # Normalize scores into attention weights, then mix the value vectors
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, V)
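
As a quick sanity check, the function can be called on random tensors; the shapes below (batch 2, sequence length 5, d_k 64) are arbitrary illustrative choices, not values from the paper:

Q = K = V = torch.randn(2, 5, 64)   # (batch, seq_len, d_k)
out = attention(Q, K, V)
print(out.shape)                    # torch.Size([2, 5, 64])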

Multi-Head Attention

Run several attention heads in parallel, each with its own learned projections, then concatenate the heads and apply an output projection:

import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        
        # Separate learned projections for queries, keys, values, and the output
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
    
    def forward(self, x, mask=None):
        batch_size = x.size(0)
        
        # Project, then split d_model into (num_heads, d_k) and move the head
        # dimension forward: (batch, num_heads, seq_len, d_k)
        Q = self.W_q(x).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(x).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(x).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        
        # Each head attends independently; an optional mask broadcasts across heads
        attn_output = attention(Q, K, V, mask)
        # Re-merge the heads: back to (batch, seq_len, d_model)
        attn_output = attn_output.transpose(1, 2).contiguous().view(batch_size, -1, self.num_heads * self.d_k)
        
        return self.W_o(attn_output)
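
A minimal usage sketch; d_model = 512 and 8 heads are illustrative values, not requirements:

mha = MultiHeadAttention(d_model=512, num_heads=8)
x = torch.randn(2, 10, 512)   # (batch, seq_len, d_model)
out = mha(x)                  # same shape: (2, 10, 512)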

Positional Encoding

Attention is order-invariant, so Transformers have no inherent notion of token position. Sinusoidal positional encodings add that information:

import math

def positional_encoding(seq_len, d_model):
    # Positions 0..seq_len-1 as a column vector
    position = torch.arange(seq_len).unsqueeze(1)
    # Geometric progression of frequencies across the embedding dimensions
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
    
    return pe
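
The encoding is simply added to the token embeddings before the first block. A short sketch; the vocabulary size, dimensions, and embedding layer are illustrative assumptions:

vocab_size, d_model, seq_len = 1000, 512, 10
emb = nn.Embedding(vocab_size, d_model)
tokens = torch.randint(0, vocab_size, (2, seq_len))          # (batch, seq_len)
x = emb(tokens) + positional_encoding(seq_len, d_model)      # encoding broadcasts over the batch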

Feed-Forward Network

A position-wise two-layer network applied after attention in each block; d_ff is typically around 4× d_model:

class FFN(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)
        self.activation = nn.GELU()
    
    def forward(self, x):
        return self.linear2(self.activation(self.linear1(x)))
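
Putting the pieces together, a single Transformer block combines multi-head attention and the FFN with residual connections and layer normalization. The sketch below uses the pre-norm arrangement common in modern models; it is an illustrative assembly, not taken verbatim from the original paper (which used post-norm and dropout, omitted here):

class TransformerBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff):
        super().__init__()
        self.attn = MultiHeadAttention(d_model, num_heads)
        self.ffn = FFN(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
    
    def forward(self, x, mask=None):
        # Pre-norm residual connections around each sublayer
        x = x + self.attn(self.norm1(x), mask)
        x = x + self.ffn(self.norm2(x))
        return x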

Encoder vs Decoder

Type            | Attention       | Use Case       | Examples
Encoder         | Bidirectional   | Classification | BERT
Decoder         | Causal (masked) | Generation     | GPT
Encoder-Decoder | Both            | Translation    | T5, BART
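
For decoder-style (causal) attention, a lower-triangular mask keeps each token from attending to later positions. A minimal sketch using the attention function above; the shapes are illustrative:

seq_len = 5
# 1s on and below the diagonal: position i may attend only to positions <= i
causal_mask = torch.tril(torch.ones(seq_len, seq_len))
Q = K = V = torch.randn(1, seq_len, 64)
out = attention(Q, K, V, mask=causal_mask)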

Modern Improvements

  • RoPE: Rotary positional embeddings (used in Llama and many recent models)
  • GQA: Grouped-query attention, which shares key/value heads to shrink the KV cache (Llama 2)
  • Flash Attention: IO-aware exact attention that avoids materializing the full attention matrix
  • RMSNorm: A simpler, cheaper alternative to LayerNorm (a minimal sketch follows this list)
  • SwiGLU: A gated activation for the feed-forward block
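
As one example from the list above, RMSNorm drops LayerNorm's mean-centering and bias and normalizes by the root mean square alone. A minimal sketch, not tied to any particular model's reference implementation:

class RMSNorm(nn.Module):
    def __init__(self, d_model, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(d_model))
    
    def forward(self, x):
        # Scale by the reciprocal root mean square over the last dimension;
        # no mean subtraction and no bias, unlike LayerNorm
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)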

Key Resources