## Overview
Evaluating LLMs is challenging because quality is subjective. This guide covers automated metrics, benchmarks, and human evaluation approaches.
## Automated Metrics

### Perplexity

Perplexity measures how well a model predicts a sample of text: it is the exponential of the average per-token cross-entropy loss, so lower is better.
```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

def calculate_perplexity(text):
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels returns the average cross-entropy loss over the tokens
        outputs = model(**inputs, labels=inputs["input_ids"])
    # Perplexity is the exponential of that average loss
    return torch.exp(outputs.loss).item()

perplexity = calculate_perplexity("The quick brown fox jumps over the lazy dog.")
```
### BLEU Score

Measures n-gram overlap between a candidate and one or more reference texts; commonly used for machine translation. Scores range from 0 to 1, higher is better.
```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "sat", "on", "the", "mat"]]
candidate = ["the", "cat", "is", "on", "the", "mat"]
# Smoothing avoids a hard 0 when short sentences share no higher-order n-grams
score = sentence_bleu(reference, candidate, smoothing_function=SmoothingFunction().method1)  # 0.0 - 1.0
```
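For scoring a whole test set, `nltk.translate.bleu_score.corpus_bleu` aggregates n-gram counts across all sentence pairs rather than averaging per-sentence scores, which is how BLEU is usually reported.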
### ROUGE Score

Measures n-gram recall against a reference (the `rouge_score` package also reports precision and F1). Commonly used for summarization.
```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'])
scores = scorer.score(
    "The cat sat on the mat.",      # reference (target) text
    "A cat was sitting on a mat.",  # candidate (prediction) text
)
```
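Each value in `scores` is a tuple with `precision`, `recall`, and `fmeasure` fields, so the ROUGE-1 F1, for example, is available as `scores['rouge1'].fmeasure`.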
## LLM-as-Judge

Use a stronger LLM to score outputs against an explicit rubric:
```python
evaluation_prompt = """
Rate the following response on a scale of 1-5 for:
- Accuracy: Is the information correct?
- Helpfulness: Does it answer the question?
- Clarity: Is it well-written?
Question: {question}
Response: {response}
Provide ratings as JSON: {"accuracy": X, "helpfulness": X, "clarity": X}
"""
```
## Benchmark Suites

| Benchmark | Size | Use Case |
|---|---|---|
| MMLU | 57 subjects | Knowledge |
| HumanEval | 164 problems | Coding |
| GSM8K | 8.5K problems | Math |
| TruthfulQA | 817 questions | Factuality |
| MT-Bench | 80 questions | Chat quality |
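To make the table concrete, here is a hedged sketch of how a benchmark score is typically computed, using GSM8K exact-match accuracy as the example. `generate_answer` is a hypothetical stand-in for whatever model call you are evaluating, and the regex-based answer extraction is one simple convention, not an official harness:

```python
import re
from datasets import load_dataset  # Hugging Face `datasets` library

def generate_answer(question: str) -> str:
    # Hypothetical placeholder: call the model under evaluation here.
    return ""

def final_number(text: str) -> str:
    # GSM8K reference answers end with a line like "#### 72";
    # take the last number so free-form model outputs can still match.
    numbers = re.findall(r"-?\d[\d,]*\.?\d*", text)
    return numbers[-1].replace(",", "") if numbers else ""

dataset = load_dataset("gsm8k", "main", split="test")
correct = sum(
    final_number(generate_answer(ex["question"])) == final_number(ex["answer"])
    for ex in dataset
)
accuracy = correct / len(dataset)
```

In practice, suites such as EleutherAI's lm-evaluation-harness package this bookkeeping for many benchmarks at once.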
## Human Evaluation

### A/B Testing

1. Show evaluators two responses (A and B).
2. Ask: "Which response is better?"
3. Calculate the win rate for each model, as in the sketch below.
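A small sketch of the win-rate calculation, assuming each judgment is recorded as "A", "B", or "tie", with ties counted as half a win for each side:

```python
from collections import Counter

def win_rate(judgments: list[str]) -> dict[str, float]:
    # judgments holds one entry per pairwise comparison: "A", "B", or "tie"
    counts = Counter(judgments)
    total = len(judgments)
    return {
        "A": (counts["A"] + 0.5 * counts["tie"]) / total,
        "B": (counts["B"] + 0.5 * counts["tie"]) / total,
    }

print(win_rate(["A", "A", "B", "tie", "A"]))  # {'A': 0.7, 'B': 0.3}
```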
### Likert Scale

Evaluators rate each response on a fixed scale:

- 1 - Very poor
- 2 - Poor
- 3 - Acceptable
- 4 - Good
- 5 - Excellent
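A short sketch of aggregating Likert ratings per criterion; the criterion names and the list-of-dicts layout are illustrative assumptions:

```python
from statistics import mean

# Each dict holds one evaluator's ratings for one response (illustrative fields)
ratings = [
    {"accuracy": 4, "helpfulness": 5, "clarity": 3},
    {"accuracy": 3, "helpfulness": 4, "clarity": 4},
    {"accuracy": 5, "helpfulness": 4, "clarity": 4},
]

summary = {criterion: mean(r[criterion] for r in ratings) for criterion in ratings[0]}
print(summary)  # approx: accuracy 4.0, helpfulness 4.33, clarity 3.67
```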
### Elo Rating

Run pairwise comparisons between models and update each model's Elo rating after every comparison, just as chess ratings are updated after each game.
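A minimal sketch of the standard Elo update; the K-factor of 32 and the starting rating of 1000 are conventional choices, not prescribed values:

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    # Probability that A beats B under the Elo model
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32):
    # score_a is 1.0 if A won, 0.0 if A lost, 0.5 for a tie
    expected_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1 - score_a) - (1 - expected_a))
    return new_a, new_b

# Model A (rated 1000) beats model B (rated 1000): both ratings move by 16 points
print(update_elo(1000, 1000, 1.0))  # (1016.0, 984.0)
```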
## Evaluation Checklist
- Factual accuracy
- Relevance to query
- Coherence and fluency
- Harmlessness
- Following instructions
- Appropriate length