Mechanistic interpretability is the practice of reverse-engineering AI models: mapping the internal pathways — the specific components and connections — a model uses to arrive at its answers, instead of treating it as a black box that turns inputs into outputs.
A notable result came in 2025, when Anthropic's circuit-tracing technique showed that, asked "what is the capital of the state containing Dallas," Claude first identifies Texas internally and then derives Austin — evidence that models form intermediate steps rather than merely pattern-matching words.
The practical value: detecting hidden flaws, predicting failure modes, and verifying that models behave as intended. Skeptics question whether the methods scale to the largest models, but the goal stands — AI systems that can be inspected, debugged, and trusted.