Mechanistic Interpretability — Glossary

The practice of reverse-engineering AI models to understand how they actually arrive at their answers, rather than treating them as black boxes.

What is mechanistic interpretability?

Mechanistic interpretability is the practice of reverse-engineering AI models: mapping the internal pathways — the specific components and connections — a model uses to arrive at its answers, instead of treating it as a black box that turns inputs into outputs.

A notable result came in 2025, when Anthropic's circuit-tracing technique showed that, asked "what is the capital of the state containing Dallas," Claude first identifies Texas internally and then derives Austin — evidence that models form intermediate steps rather than merely pattern-matching words.

Why does mechanistic interpretability matter?

The practical value: detecting hidden flaws, predicting failure modes, and verifying that models behave as intended. Skeptics question whether the methods scale to the largest models, but the goal stands — AI systems that can be inspected, debugged, and trusted.