## Overview
Evaluating LLMs is challenging because quality is subjective. This guide covers automated metrics, benchmarks, and human evaluation approaches.
## Automated Metrics

### Perplexity

Perplexity measures how well a model predicts a sample of text: it is the exponential of the average per-token cross-entropy loss, so lower is better.
```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

def calculate_perplexity(text):
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels returns the average cross-entropy loss over the tokens
        outputs = model(**inputs, labels=inputs["input_ids"])
    # Perplexity is the exponential of that average loss
    return torch.exp(outputs.loss).item()

perplexity = calculate_perplexity("The quick brown fox jumps over the lazy dog.")
```
### BLEU Score

Measures n-gram overlap between a candidate and one or more reference texts; commonly used for machine translation. Scores range from 0 to 1, higher is better.
```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "sat", "on", "the", "mat"]]
candidate = ["the", "cat", "is", "on", "the", "mat"]
# Smoothing avoids a hard 0 when short sentences share no higher-order n-grams
score = sentence_bleu(reference, candidate, smoothing_function=SmoothingFunction().method1)  # 0.0 - 1.0
```
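For scoring a whole test set, `nltk.translate.bleu_score.corpus_bleu` aggregates n-gram counts across all sentence pairs rather than averaging per-sentence scores, which is how BLEU is usually reported.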
### ROUGE Score

Measures n-gram recall against a reference (the `rouge_score` package also reports precision and F1). Commonly used for summarization.
```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'])
scores = scorer.score(
    "The cat sat on the mat.",      # reference (target) text
    "A cat was sitting on a mat.",  # candidate (prediction) text
)
```
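Each value in `scores` is a tuple with `precision`, `recall`, and `fmeasure` fields, so the ROUGE-1 F1, for example, is available as `scores['rouge1'].fmeasure`.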
## LLM-as-Judge

Use a stronger LLM to score outputs against an explicit rubric:
```python
evaluation_prompt = """
Rate the following response on a scale of 1-5 for:
- Accuracy: Is the information correct?
- Helpfulness: Does it answer the question?
- Clarity: Is it well-written?
Question: {question}
Response: {response}
Provide ratings as JSON: {"accuracy": X, "helpfulness": X, "clarity": X}
"""
```
## Benchmark Suites

| Benchmark | Size | Use Case |
|---|---|---|
| MMLU | 57 subjects | Knowledge |
| HumanEval | 164 problems | Coding |
| GSM8K | 8.5K problems | Math |
| TruthfulQA | 817 questions | Factuality |
| MT-Bench | 80 questions | Chat quality |
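To make the table concrete, here is a hedged sketch of how a benchmark score is typically computed, using GSM8K exact-match accuracy as the example. `generate_answer` is a hypothetical stand-in for whatever model call you are evaluating, and the regex-based answer extraction is one simple convention, not an official harness:

```python
import re
from datasets import load_dataset  # Hugging Face `datasets` library

def generate_answer(question: str) -> str:
    # Hypothetical placeholder: call the model under evaluation here.
    return ""

def final_number(text: str) -> str:
    # GSM8K reference answers end with a line like "#### 72";
    # take the last number so free-form model outputs can still match.
    numbers = re.findall(r"-?\d[\d,]*\.?\d*", text)
    return numbers[-1].replace(",", "") if numbers else ""

dataset = load_dataset("gsm8k", "main", split="test")
correct = sum(
    final_number(generate_answer(ex["question"])) == final_number(ex["answer"])
    for ex in dataset
)
accuracy = correct / len(dataset)
```

In practice, suites such as EleutherAI's lm-evaluation-harness package this bookkeeping for many benchmarks at once.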
## Human Evaluation

### A/B Testing

1. Show evaluators two responses (A and B).
2. Ask: "Which response is better?"
3. Calculate the win rate for each model, as in the sketch below.
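A small sketch of the win-rate calculation, assuming each judgment is recorded as "A", "B", or "tie", with ties counted as half a win for each side:

```python
from collections import Counter

def win_rate(judgments: list[str]) -> dict[str, float]:
    # judgments holds one entry per pairwise comparison: "A", "B", or "tie"
    counts = Counter(judgments)
    total = len(judgments)
    return {
        "A": (counts["A"] + 0.5 * counts["tie"]) / total,
        "B": (counts["B"] + 0.5 * counts["tie"]) / total,
    }

print(win_rate(["A", "A", "B", "tie", "A"]))  # {'A': 0.7, 'B': 0.3}
```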
### Likert Scale

Evaluators rate each response on a fixed scale:

- 1 - Very poor
- 2 - Poor
- 3 - Acceptable
- 4 - Good
- 5 - Excellent
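A short sketch of aggregating Likert ratings per criterion; the criterion names and the list-of-dicts layout are illustrative assumptions:

```python
from statistics import mean

# Each dict holds one evaluator's ratings for one response (illustrative fields)
ratings = [
    {"accuracy": 4, "helpfulness": 5, "clarity": 3},
    {"accuracy": 3, "helpfulness": 4, "clarity": 4},
    {"accuracy": 5, "helpfulness": 4, "clarity": 4},
]

summary = {criterion: mean(r[criterion] for r in ratings) for criterion in ratings[0]}
print(summary)  # approx: accuracy 4.0, helpfulness 4.33, clarity 3.67
```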
### Elo Rating

Run pairwise comparisons between models and update each model's Elo rating after every comparison, just as chess ratings are updated after each game.
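A minimal sketch of the standard Elo update; the K-factor of 32 and the starting rating of 1000 are conventional choices, not prescribed values:

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    # Probability that A beats B under the Elo model
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32):
    # score_a is 1.0 if A won, 0.0 if A lost, 0.5 for a tie
    expected_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1 - score_a) - (1 - expected_a))
    return new_a, new_b

# Model A (rated 1000) beats model B (rated 1000): both ratings move by 16 points
print(update_elo(1000, 1000, 1.0))  # (1016.0, 984.0)
```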
## Evaluation Checklist
- Factual accuracy
- Relevance to query
- Coherence and fluency
- Harmlessness
- Following instructions
- Appropriate length