Compare Large Language Models across standardized benchmarks.
Last updated: Nov 12, 2025

ARC-Challenge
Accuracy (%)
Science exam questions requiring reasoning beyond simple fact retrieval. The Challenge split contains only questions that both a retrieval-based and a word co-occurrence baseline answer incorrectly.
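Scoring for multiple-choice benchmarks like this one is plain accuracy. A minimal sketch, where `ask_model` and the dataset field names ("question", "choices", "answer_key") are hypothetical placeholders for whatever harness is actually used:

```python
# Minimal sketch of multiple-choice accuracy scoring. `ask_model` and the
# field names below are hypothetical placeholders, not a specific API.

def multiple_choice_accuracy(questions, ask_model) -> float:
    """Percentage of questions where the model picks the labeled answer."""
    correct = sum(
        1
        for q in questions
        if ask_model(q["question"], q["choices"]) == q["answer_key"]
    )
    return 100 * correct / len(questions)
```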
BIG-Bench Hard (BBH)
Average Accuracy (%)
A suite of 23 challenging tasks from BIG-Bench on which earlier models struggled. Tests diverse reasoning abilities.
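The aggregate reported here is, by the usual convention, an unweighted (macro) mean over the per-task accuracies; a sketch, where `task_accuracies` is assumed to map each of the 23 task names to an accuracy already in percent:

```python
def bbh_average(task_accuracies: dict[str, float]) -> float:
    """Unweighted (macro) mean over per-task accuracies, so that small
    and large tasks count equally toward the aggregate."""
    return sum(task_accuracies.values()) / len(task_accuracies)
```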
DROP
F1 Score
Reading comprehension requiring discrete reasoning (counting, sorting, addition, subtraction) over text passages.
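Because answers are spans, numbers, or dates rather than letter choices, grading uses a bag-of-words F1 between the predicted and gold answers. A simplified sketch; the official DROP evaluator adds answer normalization (casing, articles, punctuation, number handling) that is omitted here:

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Bag-of-words F1 between a predicted and a gold answer string."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    # Overlap counts each shared token up to its multiplicity in both bags.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```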
GSM8K
Accuracy (%)
8,500 grade school math word problems requiring multi-step reasoning. Tests arithmetic and basic mathematical reasoning.
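Reference solutions end with the answer marked as '#### <number>', and grading is typically exact match on the final numeric value in the model's output. A sketch of one common extraction convention (harnesses differ in the details):

```python
import re

def final_number(text: str):
    """Last numeric token in a string, thousands separators removed."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return matches[-1] if matches else None

def gsm8k_correct(model_output: str, reference_solution: str) -> bool:
    # GSM8K reference solutions end with '#### <answer>'.
    gold = reference_solution.split("####")[-1].strip().replace(",", "")
    return final_number(model_output) == gold
```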
HellaSwag
Accuracy (%)
Tests commonsense reasoning about everyday situations. The model must pick the most plausible continuation of a scenario.
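Evaluation typically compares the model's likelihood of each candidate ending rather than asking it to generate text. A sketch, assuming a hypothetical `logprob(context, continuation)` callable; dividing by continuation length gives the commonly reported length-normalized (acc_norm) variant:

```python
def pick_ending(context: str, endings: list[str], logprob) -> int:
    """Index of the ending with the highest length-normalized
    log-likelihood under the model. `logprob` is a hypothetical
    callable returning the total log-probability of a continuation
    given the context."""
    scores = [logprob(context, e) / max(len(e), 1) for e in endings]
    return scores.index(max(scores))
```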
HumanEval
Pass@1 (%)
Measures code generation ability on 164 programming problems. The model generates code, which is then run against unit tests.
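Pass@1 is normally computed with the unbiased pass@k estimator introduced alongside HumanEval: draw n samples per problem, count the c that pass all tests, and estimate the chance that at least one of k randomly chosen samples passes:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn from n (of which c pass the tests) is a pass."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The benchmark score is this value averaged over all 164 problems; with n = k = 1 it reduces to the raw pass rate.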
MMLU
Accuracy (%)
Tests world knowledge across 57 subjects, including STEM, the humanities, and the social sciences. Requires both factual knowledge and reasoning ability.
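Questions are four-way multiple choice. Prompt templates and few-shot setups vary across papers and harnesses, but most look roughly like this sketch:

```python
CHOICE_KEYS = ("A", "B", "C", "D")

def format_prompt(question: str, choices: list[str]) -> str:
    """A plain MMLU-style multiple-choice prompt; the graded prediction
    is whichever answer key the model produces after 'Answer:'."""
    lines = [question]
    lines += [f"{key}. {text}" for key, text in zip(CHOICE_KEYS, choices)]
    lines.append("Answer:")
    return "\n".join(lines)
```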
MT-Bench
Score (1-10)
Tests multi-turn conversational ability across 8 categories. Uses GPT-4 as a judge to evaluate response quality.
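A sketch of the judging step, with `judge` as a hypothetical callable that sends a prompt to the judge model and returns its text. The prompt below paraphrases the reference setup, which has the judge wrap its rating in double brackets so it can be parsed reliably:

```python
import re

# Paraphrase of an LLM-as-judge grading prompt, not the exact template.
JUDGE_TEMPLATE = (
    "Rate the assistant's answer to the user question below on a scale "
    "of 1 to 10, and present your rating as 'Rating: [[rating]]'.\n"
    "Question: {question}\nAnswer: {answer}"
)

def judge_score(question: str, answer: str, judge) -> float | None:
    """Ask an LLM judge to rate an answer, then parse out the rating."""
    verdict = judge(JUDGE_TEMPLATE.format(question=question, answer=answer))
    match = re.search(r"\[\[(\d+(?:\.\d+)?)\]\]", verdict)
    return float(match.group(1)) if match else None
```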
TruthfulQA
Accuracy (%)
Tests whether models generate truthful answers or reproduce common misconceptions. Critical for trustworthy AI.
WinoGrande
Accuracy (%)
Tests commonsense reasoning through pronoun resolution. Requires understanding context and world knowledge.
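Each item is a sentence with a '_' blank and two candidate fillers. One simple scoring convention fills the blank both ways and keeps whichever completed sentence the model finds more likely, again assuming a hypothetical `logprob(text)` callable:

```python
def resolve(sentence: str, options: list[str], logprob) -> int:
    """Index of the option whose filled-in sentence the model assigns
    the higher log-probability. `logprob` is a hypothetical callable
    scoring a complete piece of text."""
    scores = [logprob(sentence.replace("_", opt)) for opt in options]
    return scores.index(max(scores))
```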