Benchmarks

Standardized tests used to measure and compare AI models — like report cards that show how well a model performs on specific tasks.

Benchmarks are fixed sets of questions, puzzles, or tasks that many AI models are run on, so their results can be compared against a common yardstick. Headlines like "tops the leaderboard" or "beats GPT on MMLU" refer to benchmark scores.
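To make the idea concrete, here is a minimal sketch of how a benchmark score could be computed. Everything in it is hypothetical: the four-question BENCHMARK list, the score function, and the two stand-in "models" (plain lookup tables) are invented for illustration. Real benchmarks contain hundreds or thousands of items and often use more forgiving answer matching than the exact string comparison assumed here.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical mini-benchmark: a fixed list of (question, expected answer) pairs.
BENCHMARK: List[Tuple[str, str]] = [
    ("2 + 2", "4"),
    ("Capital of France", "Paris"),
    ("Opposite of hot", "cold"),
    ("3 * 3", "9"),
]


def score(model: Callable[[str], str],
          benchmark: List[Tuple[str, str]] = BENCHMARK) -> float:
    """Return the fraction of questions answered exactly right (accuracy)."""
    correct = sum(1 for question, expected in benchmark
                  if model(question).strip() == expected)
    return correct / len(benchmark)


def make_lookup_model(answers: Dict[str, str]) -> Callable[[str], str]:
    """Build a stand-in 'model' that answers from a fixed table (assumption)."""
    return lambda question: answers.get(question, "")


# Two illustrative models that know different subsets of the answers.
model_a = make_lookup_model(
    {"2 + 2": "4", "Capital of France": "Paris", "3 * 3": "9"})
model_b = make_lookup_model(
    {"2 + 2": "4", "Opposite of hot": "cold", "3 * 3": "6"})

# Because both models face the same questions, their scores are comparable.
leaderboard = sorted(
    [("model_a", score(model_a)), ("model_b", score(model_b))],
    key=lambda entry: entry[1],
    reverse=True,
)

for name, accuracy in leaderboard:
    print(f"{name}: {accuracy:.0%}")
```

The key design point is that the question set is fixed and shared: the ranking only means something because every model is scored on exactly the same items with the same rule.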

Different benchmarks measure different skills, such as factual knowledge, reasoning and math, coding, and instruction-following. Well-known examples include MMLU (general knowledge), HumanEval (code writing), and GSM8K (grade-school math). Shared tests create a common language for progress and give researchers concrete targets to push against.

Their limits matter: models can be optimized to ace a specific benchmark without improving the underlying skill, and a high score doesn't guarantee helpful, safe, or reliable behavior in real-world use. Benchmarks are a starting point for comparison, not proof of quality.