When a new AI model launches, you'll often see headlines like "beats GPT on MMLU" or "tops the leaderboard." Those claims come from benchmarks — fixed sets of questions, puzzles, or tasks that many models are tested on so their results can be compared. Think of them like standardized tests for AI: everyone takes the same exam, and the scores tell you something about relative performance.
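To make the "same exam" idea concrete, here's a minimal Python sketch of how benchmark scoring works. The two model functions are hypothetical placeholders standing in for real model calls; the point is that both answer the identical question set, so their accuracy scores can be compared directly.

```python
# Minimal sketch of benchmark scoring: every model answers the same fixed
# question set, and the resulting accuracy makes the models comparable.
# The two "models" below are hypothetical stand-ins, not real APIs.

QUESTIONS = [
    {"prompt": "What is 7 * 8?", "answer": "56"},
    {"prompt": "Which planet is largest?", "answer": "Jupiter"},
    {"prompt": "What gas do plants absorb?", "answer": "Carbon dioxide"},
]

def model_a(prompt: str) -> str:
    # Stand-in for a real model call (e.g. an API request).
    answers = {
        "What is 7 * 8?": "56",
        "Which planet is largest?": "Jupiter",
    }
    return answers.get(prompt, "I don't know")

def model_b(prompt: str) -> str:
    answers = {
        "What is 7 * 8?": "56",
        "Which planet is largest?": "Jupiter",
        "What gas do plants absorb?": "Carbon dioxide",
    }
    return answers.get(prompt, "I don't know")

def score(model, questions) -> float:
    """Fraction of questions answered exactly correctly."""
    correct = sum(model(q["prompt"]) == q["answer"] for q in questions)
    return correct / len(questions)

for name, model in [("model_a", model_a), ("model_b", model_b)]:
    print(f"{name}: {score(model, QUESTIONS):.0%}")
# Both models took the same "exam", so the two scores are directly comparable.
```

Real benchmark harnesses add a lot on top of this (prompt formatting, answer extraction, thousands of questions), but the core loop is the same: fixed questions in, a comparable score out.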
Benchmarks cover different skills. Some test factual knowledge (history, science, law). Others test reasoning (logic puzzles, math), coding ability, or how well a model follows instructions. Popular examples include MMLU (general knowledge), HumanEval (code writing), and GSM8K (grade-school math). Each gives a snapshot of capability in that area — useful, but not the whole picture.
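To give a feel for what the items in these benchmarks look like, here's a simplified sketch of an MMLU-style multiple-choice question and a GSM8K-style math problem. The field names and grading rule are illustrative, not the exact formats or schemas the real datasets use.

```python
# Simplified examples of benchmark items. Field names are illustrative,
# not the actual dataset schemas.

mmlu_style_item = {
    "question": "Which organ produces insulin?",
    "choices": {"A": "Liver", "B": "Pancreas", "C": "Kidney", "D": "Spleen"},
    "answer": "B",   # scored by whether the model picks the right letter
}

gsm8k_style_item = {
    "question": "Sara has 3 boxes with 12 apples in each. How many apples "
                "does she have?",
    "answer": "36",  # scored by whether the final number matches
}

def grade_multiple_choice(model_letter: str, item: dict) -> bool:
    # Case-insensitive comparison of the chosen letter against the key.
    return model_letter.strip().upper() == item["answer"]

print(grade_multiple_choice("b", mmlu_style_item))  # True
```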
Why benchmarks matter: they create a shared language for progress. Without them, every company would test differently, and claims would be hard to verify. Benchmarks also drive research: improving on a benchmark becomes a concrete goal, which pushes the field forward.
The catch: benchmarks have limits. Models can be optimized for specific benchmarks, trained or tuned to ace the test without getting better at the underlying skill; if test questions leak into a model's training data (a problem known as contamination), scores can inflate without any real gain. And a high score on math or coding doesn't guarantee the model will be helpful, safe, or reliable in real-world use. Benchmarks are a starting point for comparison, not a guarantee of quality.