Benchmarks — Glossary

Standardized tests used to measure and compare AI models — like report cards that show how well a model performs on specific tasks.

What are benchmarks?

Benchmarks are fixed sets of questions, puzzles, or tasks that many AI models run, so their results can be compared on a common yardstick. Headlines like "tops the leaderboard" or "beats GPT on MMLU" refer to benchmark scores.
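As a rough illustration of the idea above, the sketch below scores two stand-in "models" on the same fixed question set and ranks them by accuracy. Everything in it is hypothetical and invented for illustration: the four-question benchmark, the `model_a` and `model_b` functions, and the exact-match scoring rule. Real benchmarks are far larger and often use more careful scoring.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy benchmark: a fixed list of (question, expected answer) pairs.
BENCHMARK: List[Tuple[str, str]] = [
    ("2 + 2", "4"),
    ("Capital of France", "Paris"),
    ("Opposite of hot", "cold"),
    ("3 * 5", "15"),
]


def model_a(question: str) -> str:
    """Stand-in for a real model: answers from a lookup table (one answer is wrong)."""
    answers = {
        "2 + 2": "4",
        "Capital of France": "Paris",
        "Opposite of hot": "cold",
        "3 * 5": "16",
    }
    return answers.get(question, "")


def model_b(question: str) -> str:
    """A second stand-in model that only knows two of the four answers."""
    answers = {"2 + 2": "4", "Capital of France": "Paris"}
    return answers.get(question, "")


def accuracy(model: Callable[[str], str], benchmark: List[Tuple[str, str]]) -> float:
    """Fraction of questions answered correctly (case-insensitive exact match)."""
    correct = sum(
        1
        for question, expected in benchmark
        if model(question).strip().lower() == expected.lower()
    )
    return correct / len(benchmark)


# Every model runs the same fixed questions, so the scores are directly comparable.
models: Dict[str, Callable[[str], str]] = {"model_a": model_a, "model_b": model_b}
scores = {name: accuracy(model, BENCHMARK) for name, model in models.items()}

# A "leaderboard" is the scores sorted from best to worst.
leaderboard = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for name, score in leaderboard:
    print(f"{name}: {score:.0%}")
```

Because both models see identical questions and identical scoring, the ranking reflects differences between the models rather than differences in the test.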

Different benchmarks measure different skills, such as factual knowledge, reasoning and math, coding, and instruction-following. Well-known examples include MMLU (general knowledge), HumanEval (code writing), and GSM8K (grade-school math). Shared tests create a common language for progress and give researchers concrete targets to push against.

What are the limits of benchmarks?

Benchmarks have real limits: models can be optimized to ace a specific benchmark without improving the underlying skill, and a high score doesn't guarantee helpful, safe, or reliable behavior in real-world use. They are a starting point for comparison, not proof of quality.