📊 LLM Benchmarks

Compare Large Language Models across standardized benchmarks.

Last updated: Nov 12, 2025


📚 Understanding Benchmarks

ARC-Challenge

Category: Science Reasoning | Metric: Accuracy (%)

Science exam questions requiring reasoning beyond simple fact retrieval. The Challenge set contains only questions that retrieval-based and word co-occurrence methods answer incorrectly.

Excellent: >90% | Good: 75-90% | Average: 60-75%
BBH (BIG-Bench Hard)

Category: Multi-Task Reasoning | Metric: Average Accuracy (%)

23 challenging BIG-Bench tasks on which prior language models fell short of average human-rater performance. Tests diverse reasoning abilities.

Excellent: >70% | Good: 55-70% | Average: 40-55%
DROP (Discrete Reasoning Over Paragraphs)

Category: Reading Comprehension | Metric: F1 Score

Reading comprehension requiring discrete reasoning (counting, sorting, addition, subtraction) over text passages.

Excellent: >80% | Good: 65-80% | Average: 50-65%
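DROP's F1 is computed over answer tokens rather than whole strings, so partial credit is possible. Below is a minimal sketch of token-overlap F1; the official DROP scorer additionally normalizes answers and handles numbers and multi-span answers specially.

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1: harmonic mean of precision and recall
    over the bags of tokens in the predicted and gold answers."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)  # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, predicting "three touchdowns" against the gold answer "three" yields precision 0.5 and recall 1.0, for an F1 of about 0.67 rather than a flat zero.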
GSM8K (Grade School Math)

Category: Mathematical Reasoning | Metric: Accuracy (%)

8,500 grade-school math word problems requiring multi-step reasoning. Tests arithmetic and basic mathematical reasoning.

Excellent: >90% | Good: 70-90% | Average: 50-70%
HellaSwag

Category: Common Sense Reasoning | Metric: Accuracy (%)

Tests commonsense reasoning about everyday situations. The model must predict the most likely continuation of a scenario.

Excellent: >85% | Good: 70-85% | Average: 50-70%
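HellaSwag is commonly evaluated by scoring each candidate ending with the model's length-normalized log-likelihood and picking the highest. A sketch of that selection step, where `score_fn` is a stand-in for a real model's log-likelihood call:

```python
from typing import Callable

def pick_ending(
    context: str,
    endings: list[str],
    score_fn: Callable[[str, str], float],
) -> int:
    """Return the index of the ending with the highest log-likelihood
    per token (normalized by word count so longer endings aren't penalized)."""
    normalized = [
        score_fn(context, e) / max(len(e.split()), 1) for e in endings
    ]
    return max(range(len(endings)), key=lambda i: normalized[i])
```

Accuracy is then the fraction of examples where the selected index matches the labeled ending.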
HumanEval

Category: Code Generation | Metric: Pass@1 (%)

Measures code generation ability on 164 hand-written Python programming problems. Generated code is executed against each problem's unit tests.

Excellent: >80% | Good: 60-80% | Average: 40-60%
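Pass@1 is the probability that a single generated sample passes the tests. In practice it is estimated from n ≥ k samples per problem using the unbiased combinatorial estimator from the HumanEval paper (Chen et al., 2021):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples
    drawn (without replacement) from n generations passes, given
    that c of the n generations are correct.
    pass@k = 1 - C(n-c, k) / C(n, k)
    """
    if n - c < k:
        return 1.0  # fewer failures than draws: some draw must pass
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With k = 1 this reduces to c/n, the plain fraction of passing samples, but generating n > k samples per problem gives a lower-variance estimate.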
MMLU (Massive Multitask Language Understanding)

Category: Knowledge & Reasoning | Metric: Accuracy (%)

Tests a model's world knowledge across 57 subjects including STEM, humanities, social sciences, and more. Requires both factual knowledge and reasoning ability.

Excellent: >85% | Good: 70-85% | Average: 50-70%
MT-Bench (Multi-Turn)

Category: Conversational AI | Metric: Score (1-10)

Tests multi-turn conversational ability across 8 categories. Uses GPT-4 as judge to evaluate response quality.

Excellent: >8.5 | Good: 7.5-8.5 | Average: 6.0-7.5
TruthfulQA

Category: Truthfulness & Safety | Metric: Accuracy (%)

Tests whether models generate truthful answers or reproduce common misconceptions. Critical for trustworthy AI.

Excellent: >70% | Good: 60-70% | Average: 50-60%
WinoGrande

Category: Common Sense Reasoning | Metric: Accuracy (%)

Tests common sense reasoning through Winograd-style pronoun resolution. Requires understanding context and world knowledge.

Excellent: >85% | Good: 75-85% | Average: 60-75%