Overview

As AI systems become more capable, ensuring they behave safely and align with human values becomes critical. This guide covers key concepts and practical approaches.

Key Concepts

Alignment

Ensuring AI systems do what humans actually want, not just what they’re literally told.

Example: A reward-hacked AI might find shortcuts that maximize the reward signal without achieving the intended goal.
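
As a toy illustration with entirely hypothetical code, consider a proxy reward that scores summaries only by brevity; the "shortcut" that maximizes it is an empty string, which fails the intended goal completely:

# Toy example of reward hacking: the proxy rewards brevity, not fidelity.
def proxy_reward(summary: str) -> float:
    return 1.0 / (1.0 + len(summary))

print(proxy_reward("A faithful two-sentence summary of the article."))  # low score
print(proxy_reward(""))  # 1.0 -- an empty summary maximizes the proxy reward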

Outer vs Inner Alignment

  • Outer alignment: Does the specified objective actually capture what we want?
  • Inner alignment: Does the trained model actually pursue that objective, rather than a proxy it learned during training?

RLHF (Reinforcement Learning from Human Feedback)

The standard technique for aligning LLMs:

1. Pre-train a base model on a large text corpus
2. Collect human preference data (A vs B comparisons)
3. Train a reward model on those preferences (sketched further below)
4. Fine-tune the policy with PPO to maximize the learned reward

A minimal sketch of step 4 with the trl library (older PPOTrainer API); it glosses over tokenization details and assumes the policy model, frozen reference model, tokenizer, reward model, and dataloader are set up elsewhere:

# Simplified RLHF loop with trl
from trl import PPOConfig, PPOTrainer

config = PPOConfig(
    model_name="gpt2",
    learning_rate=1e-5,
    batch_size=16,
)

trainer = PPOTrainer(config, model, ref_model, tokenizer)

for batch in dataloader:
    queries = batch["query"]          # list of token-id tensors, one per prompt
    # Generate a continuation for each query with the current policy.
    responses = [model.generate(q) for q in queries]
    # Score each (query, response) pair with the learned reward model.
    rewards = [reward_model(q, r) for q, r in zip(queries, responses)]
    # PPO update toward higher reward while staying close to ref_model.
    trainer.step(queries, responses, rewards)
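
Step 3 above, training the reward model itself, can be sketched with a pairwise (Bradley-Terry style) loss; this is a minimal illustration that assumes reward_model returns one scalar score per sequence and pref_loader is a hypothetical loader yielding tokenized chosen/rejected pairs:

# Minimal sketch of reward-model training on pairwise preference data.
import torch
import torch.nn.functional as F

optimizer = torch.optim.AdamW(reward_model.parameters(), lr=1e-5)

for batch in pref_loader:
    chosen_scores = reward_model(batch["chosen"])      # shape: (batch_size,)
    rejected_scores = reward_model(batch["rejected"])  # shape: (batch_size,)
    # Push the preferred response's score above the rejected one's.
    loss = -F.logsigmoid(chosen_scores - rejected_scores).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()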

Constitutional AI

Anthropic’s approach: train models to critique and revise their own outputs against a written set of principles (a “constitution”), reducing reliance on human labels for harmful content. A sketch of the critique-and-revise loop follows the example principles below.

Example principles:
1. Be helpful, harmless, and honest
2. Avoid generating harmful content
3. Acknowledge uncertainty
4. Respect privacy
5. Be transparent about being an AI
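
A minimal sketch of the critique-and-revise step, assuming a hypothetical generate() helper that returns the model’s completion for a prompt:

# Sketch of Constitutional AI-style self-critique and revision.
PRINCIPLE = "Avoid generating harmful content."

def constitutional_revision(user_prompt: str) -> str:
    draft = generate(user_prompt)
    # Ask the model to critique its own draft against the principle.
    critique = generate(
        f"Principle: {PRINCIPLE}\n"
        f"Response: {draft}\n"
        "Point out any way the response violates the principle."
    )
    # Ask the model to rewrite the draft so the critique is addressed.
    revision = generate(
        f"Response: {draft}\n"
        f"Critique: {critique}\n"
        "Rewrite the response to address the critique."
    )
    return revision

In the full method, revised responses like these become supervised fine-tuning data, and a later RL phase uses AI feedback guided by the same principles.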

Red Teaming

Systematically probe the model with adversarial prompts and record any unsafe completions:

# is_harmful() and log_vulnerability() are placeholder helpers assumed to be
# defined elsewhere (e.g. a safety classifier and an issue-tracking hook).
red_team_prompts = [
    "How do I hack into...",
    "Write malware that...",
    "Generate fake news about...",
]

for prompt in red_team_prompts:
    response = model.generate(prompt)
    if is_harmful(response):
        log_vulnerability(prompt, response)

Guardrails

Runtime safety checks:

# Runtime validation with the guardrails-ai library. Validator names and the
# exact call signature vary between guardrails versions, so treat this as an
# illustrative sketch; llm and prompt are assumed to be defined elsewhere.
from guardrails import Guard

guard = Guard.from_rail_string("""
<rail version="0.1">
<output>
    <string name="response"
            validators="no_toxic_language; no_pii" />
</output>
</rail>
""")

validated_response = guard(llm, prompt)

Safety Checklist

Before deploying an LLM, work through this checklist; a minimal wrapper combining several of these items is sketched after the list:

  • Red team testing completed
  • Content filters in place
  • Rate limiting enabled
  • Logging and monitoring active
  • Human escalation path defined
  • Incident response plan ready
  • User feedback mechanism in place
  • Regular safety audits scheduled
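
As noted above, here is a minimal wrapper combining content filtering, logging, and a human escalation path; generate(), is_flagged(), and alert_on_call() are hypothetical placeholders for the model call, content filter, and paging hook:

# Minimal deployment wrapper: content filters, logging, human escalation.
import logging

logger = logging.getLogger("llm_safety")

def safe_complete(prompt: str) -> str:
    if is_flagged(prompt):                         # input content filter
        logger.warning("Blocked prompt: %r", prompt[:80])
        return "Sorry, I can't help with that."
    response = generate(prompt)
    if is_flagged(response):                       # output content filter
        logger.error("Blocked response for prompt: %r", prompt[:80])
        alert_on_call(prompt, response)            # human escalation path
        return "Sorry, I can't help with that."
    logger.info("Served prompt: %r", prompt[:80])  # monitoring / audit log
    return response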

Common Failure Modes

Issue            Description                  Mitigation
Jailbreaks       Bypassing safety filters     Multi-layer defense
Hallucinations   Confident false statements   RAG, citations
Bias             Unfair outputs for groups    Diverse training data
Privacy leaks    Revealing training data      Differential privacy

Resources for Learning
