Ai-Safety

Overview As AI systems become more capable, ensuring they behave safely and align with human values becomes critical. This guide covers key concepts and practical approaches. Key Concepts Alignment Ensuring AI systems do what humans actually want, not just what they’re literally told. Example: A reward-hacked AI might find shortcuts that maximize the reward signal without achieving the intended goal. Outer vs Inner Alignment Outer alignment: Is the objective function correct? Inner alignment: Does the model optimize for that objective? RLHF (Reinforcement Learning from Human Feedback) The standard technique for aligning LLMs: ...