Linear Probes and Mechanistic Interpretability

Yet recent findings in mechanistic interpretability (MI), the emerging field endeavoring to reverse-engineer the internal computations of these models, render this picture increasingly untenable. Progress in the field promises to provide greater assurance over AI system behavior and to shed light on exciting scientific questions about the nature of intelligence: if the claim of truly reverse-engineering models holds, then the mech interp toolkit should give us grounded and true beliefs about models.

In this project, we extend the investigations presented by Kenneth Li et al. in their ICLR 2023 paper Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task; to investigate whether the model encodes a reasoning tree, they train linear probes on its internal activations. More broadly, we look at methods for identifying the computational mechanisms that lead to a model's outputs, i.e. at mechanistic interpretability. We first outline the use of probing in revealing internal structures within LLMs, then introduce mechanistic interpretability proper, discussing the superposition hypothesis and the role of sparse autoencoders in disentangling polysemantic internal representations of LLMs into more interpretable, monosemantic features.

Linear probes reveal how semantic content evolves across network depths, providing actionable insights for model interpretability and performance, and good probing performance hints at the presence of the probed property, which could in principle even inform the final label decision in the network's last layers. In experiments by Neel Nanda's team, linear probes (simple correlations) detected harmful prompts better than sophisticated methods: they work by noticing that the model is using a direction correlated with a concept like "this prompt is harmful". The most straightforward subspaces to target are those defined by linear probes, although recent work on the non-linear representation dilemma asks whether causal abstraction is enough for mechanistic interpretability. A parallel body of work has sought to distinguish between reasoning and memorization through mechanistic interpretability. The first technical deep dive covers post hoc explanation methods, data-centric explanation techniques, and mechanistic interpretability approaches, and presents a unified framework demonstrating that these methods share fundamental techniques such as perturbations, gradients, and local linear approximations.
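To make the idea of a probe-derived concept direction concrete, here is a minimal sketch, not the setup used in any of the experiments cited above: it assumes a small stand-in model loaded through TransformerLens, a hypothetical layer index, and a toy list of labeled prompts, with logistic regression as the probe. The probe's weight vector then serves as a candidate "harmful prompt" direction.

```python
# Minimal sketch: train a logistic-regression probe on residual-stream
# activations to recover a direction correlated with a concept.
# Model choice, layer index, and the labeled prompts are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from transformer_lens import HookedTransformer
import transformer_lens.utils as utils

model = HookedTransformer.from_pretrained("gpt2")  # assumed small stand-in model
LAYER = 6  # hypothetical layer to probe

prompts = ["How do I bake bread?", "Explain photosynthesis simply."]    # label 0
prompts += ["Describe how to pick a lock.", "Write a phishing email."]  # label 1
labels = np.array([0, 0, 1, 1])  # toy labels; a real dataset would be far larger

# Collect the residual-stream activation at the final token of each prompt.
feats = []
for text in prompts:
    _, cache = model.run_with_cache(text)
    resid = cache[utils.get_act_name("resid_post", LAYER)]  # [batch, seq, d_model]
    feats.append(resid[0, -1].detach().cpu().numpy())
X = np.stack(feats)

probe = LogisticRegression(max_iter=1000).fit(X, labels)
direction = probe.coef_[0] / np.linalg.norm(probe.coef_[0])  # unit concept direction
print("train accuracy:", probe.score(X, labels))
```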
Mechanistic interpretability [14], [16] attempts to discover specific circuits within models; many of these studies [15], [17] have been conducted on GPT-2, which is large enough to be interesting but smaller than some of the more recent LLMs. The approach aims to map the key features and the pathways between them across an entire model, and it ultimately seeks what philosophers call mechanistic explanations: accounts of how phenomena arise from organized causal interactions among parts [Machamer et al., 2000, Bechtel and Abrahamsen, 2005, Craver, 2007, Craver et al., 2024]. Topics in this area range from classical interpretability techniques such as decision trees, saliency maps, feature inversion, and linear probes, through foundational concepts like features, QK and OV circuits, and induction heads, to advanced techniques involving sparse autoencoders, alignment strategies, and AI safety methodologies. In both mechanistic interpretability and disentangled representation learning, methods typically rely on labeled concept sets, manual labeling of visualizations, or computationally intensive searches over disentangled representations for human-interpretable concepts.

Linear probes are often preferred because their simplicity ensures that high accuracy reflects the quality of the model's representations rather than the complexity of the probe itself. Non-linear probes have been alleged to exploit their own capacity, so that probe performance could reflect the probe's own capabilities more than actual characteristics of the representation; this is why a linear probe is entrusted with the task. Work on interpretability illusions in the generalization of simplified models likewise shows that interpretability methods based on simplified models (e.g. linear probes) can be prone to generalization illusions. Probes can nonetheless carry causal weight: recent work uses linear probes to identify the subspaces responsible for storing previous-token information in Llama-2-7b and Llama-3-8b, and shows that these subspaces are causally implicated in induction by using them to "edit" previous-token information and trigger random token copying in new contexts.

The practical stakes are real. To unmask their villain, the OpenAI team used in-house mechanistic interpretability tools to compare the internal workings of models with and without the bad training, and one threat model ([T3], undermining oversight) makes the point sharply: if the AI is being used in internal protocols (e.g. scoring behavioral evals, automated interpretability, or deployment-time monitoring), it could strategically undermine the protocol in order to cause an unsafe model (potentially a version of itself) to be deployed.
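The linear-versus-nonlinear trade-off can be made concrete with a small sketch of my own (not taken from any of the cited papers): train a logistic-regression probe and an MLP probe on the same hypothetical activation matrix and compare held-out accuracy. If only the MLP succeeds, the "knowledge" may live in the probe rather than in the representation.

```python
# Sketch: compare a linear probe with a nonlinear (MLP) probe on the same data.
# X is a hypothetical [n_examples, d_model] activation matrix from some layer;
# y is a hypothetical binary concept label.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 512))             # stand-in for cached activations
y = (X[:, :8].sum(axis=1) > 0).astype(int)   # stand-in concept label

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

linear_probe = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
mlp_probe = MLPClassifier(hidden_layer_sizes=(256,), max_iter=500,
                          random_state=0).fit(X_tr, y_tr)

print("linear probe accuracy:", linear_probe.score(X_te, y_te))
print("MLP probe accuracy:   ", mlp_probe.score(X_te, y_te))
# A large gap in favor of the MLP suggests the concept is not linearly encoded,
# and that probe capacity (not the representation) is driving accuracy.
```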
Mechanistic interpretability decodes neural networks by mapping internal causal mechanisms into human-understandable processes, enabling validation and targeted intervention. A common goal is to decompose the activations of neural networks into features: interpretable properties of the input computed by the model. A comprehensive review frames mechanistic interpretability as reverse engineering the computational mechanisms and representations learned by neural networks into human-understandable algorithms and concepts, providing a granular, causal understanding, with particular attention to its relevance for AI safety. Zou et al. [13] divide the field of model transparency into mechanistic [14]–[18] versus representational approaches. To verify the integrity of AI systems, you want reproducible outcomes, and specifically to work as much as you can with integers; supporting automated checks (cosine alignment, adversarial fragment tests, and causal probes) ran on the same cloud TPU pod used for training.

Of course, SAEs were created for interpretability research, and some of the most interesting applications of SAE probes are in providing interpretability insight. Simon et al. focus on demonstrating the feasibility of SAEs on pLMs and perform quantitative interpretability comparisons between ESM and SAE; similar experiments using sparse autoencoders in place of linear probes, however, yielded significantly poorer performance. This project compares two complementary approaches to mechanistic interpretability in large language models: linear probes ("the lie detector"), which test whether concepts are linearly encoded in activation space, and causal abstraction, which tests whether models use the expected causal or logical structures internally. Through mechanistic interpretability techniques, we uncover the neural mechanisms underlying deception, employing logit lens analysis, causal interventions, and contrastive activation steering to identify and control deceptive behavior. We built the probes using simple training data (from the RepE paper) and simple techniques (logistic regression).

Figure 2: Coefficient probe across complexities. Probing R² and MSE for the coefficient of the linear relationship, plotted across increasing probe complexities (adding hidden layers to an MLP). The probe R² decreases and the MSE increases with increasing probe complexity, giving evidence of a linear encoding of the coefficients within the activations.
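A sweep of this kind can be reproduced in outline as follows; this is a sketch under assumed data, using scikit-learn regressors of increasing depth as the probes and a synthetic activation matrix standing in for cached model activations.

```python
# Sketch: probe-complexity sweep. Fit probes of increasing depth to predict a
# scalar target (e.g. a coefficient) from activations, and compare R^2 / MSE.
# The data here is synthetic; in practice X would be cached model activations.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 256))                 # stand-in activations
y = X @ rng.normal(size=256) * 0.1               # linearly encoded target
y += rng.normal(scale=0.1, size=3000)            # observation noise

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

probes = {
    "linear":        LinearRegression(),
    "mlp (1 layer)": MLPRegressor(hidden_layer_sizes=(64,), max_iter=2000, random_state=0),
    "mlp (2 layers)": MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0),
}
for name, probe in probes.items():
    pred = probe.fit(X_tr, y_tr).predict(X_te)
    print(f"{name}: R2={r2_score(y_te, pred):.3f}  MSE={mean_squared_error(y_te, pred):.4f}")
# If R^2 drops (and MSE rises) as probe depth grows, the simplest linear probe
# already captures the signal, consistent with a linear encoding.
```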
Understanding AI systems' inner workings is critical for ensuring value alignment and safety, and mechanistic interpretability aims to understand the computational mechanisms underlying neural networks' capabilities in order to accomplish concrete scientific and engineering goals. It represents one of three threads of interpretability research, each with distinct but sometimes overlapping motivations, which roughly reflect the changing aims of interpretability work over time. The first thread aims to build AI systems that are interpretable by design: it focuses on constructing models that are transparent from the outset, often using inherently interpretable architectures such as decision trees, linear models, or additive models, alongside classical attribution techniques that explain sensitivity to inputs and training data. Two recurring diagnostics are linear probes, which train simple linear models on internal representations to determine what information is encoded at each layer, and behavioral probes, which examine how altering specific inputs affects the model's behavior. Early probing work plotted the predictive performance of linear probes against training epochs on a held-out validation set for ImageNet10, for three types of networks: randomized, 0.4-memorizing, and generalizing.

We test these probes in more complicated and realistic environments where Llama-3.3-70B responds deceptively, and we can also test the setting where we have imbalanced classes in the training data but balanced classes in the test set (probe accuracy is reported for gemma-2-9b-it and Llama-3.1-8B-instruct). When sparse autoencoder features are used as probes, we calculate their precision and recall on the first-letter identification tasks (as a proxy for monosemanticity and interpretability) and find that they significantly underperform linear probes. The approach that comes closest to our work is Stolfo et al. (2023), who present a mechanistic interpretation of arithmetic reasoning by investigating the information flow in the model given simple mathematical questions.
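The imbalanced-train, balanced-test setting can be simulated in a few lines. This sketch uses synthetic activations and scikit-learn's class weighting, and is only meant to show the shape of the evaluation, not our actual training data or class ratios.

```python
# Sketch: train a probe on class-imbalanced data, evaluate on a balanced test set.
# Activations, labels, and the 900/100 ratio are synthetic stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score

rng = np.random.default_rng(0)
d_model = 128
direction = rng.normal(size=d_model)

def sample(n, label):
    # Activations weakly shifted along a "deception" direction for label 1.
    return rng.normal(size=(n, d_model)) + (1.5 * direction if label else 0)

# Imbalanced training set: 900 honest vs 100 deceptive examples.
X_tr = np.vstack([sample(900, 0), sample(100, 1)])
y_tr = np.array([0] * 900 + [1] * 100)

# Balanced test set: 200 of each class.
X_te = np.vstack([sample(200, 0), sample(200, 1)])
y_te = np.array([0] * 200 + [1] * 200)

probe = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)
print("balanced test accuracy:", balanced_accuracy_score(y_te, probe.predict(X_te)))
```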
Linear probes also support intervention. One recipe is to use linear probes to find attention heads that correspond to a desired attribute and then shift those heads' activations during inference along directions determined by the probes; further activation manipulation is possible with contrastive steering vectors (Turner et al. 2023; Rimsky et al. 2023). Linear probes themselves are simple, independently trained linear classifiers added to intermediate layers to gauge the linear separability of features. Linear and nonlinear probes study how information about labels or latent variables is decodable from intermediate representations, and recent mechanistic-interpretability work examines circuits, features, and invariants inside large models for safety and reliability [17]. Where exactly such information lives is itself contested; the linear representation hypothesis offers a "resolution" to this problem, and one recent study contributes to mechanistic interpretability by identifying a meaningful confidence direction within LLM activations, corroborating recent work with sparse autoencoders.

Mechanistic interpretability dissects neural network internals via causal, observational, and interventional methods to yield human-understandable insights. Purely surface-level analyses lack the sensitivity to underlying structure associated with genuine understanding, and the shift from surface-level analysis to a focus on the internal mechanics of deep neural networks characterizes the transition towards mechanistic interpretability which, as an approach to inner interpretability, aims to completely specify a network's computation through reverse engineering and a precise understanding of its parts. Human annotators scored randomly sampled activations for interpretability without knowing the training parameters; their judgments formed the basis of the 70% interpretability metric. We provide a mechanistic explanation for the correlation between response uncertainty and probe performance using feature attribution, demonstrating that increased response uncertainty leads to relevance signals distributed across a greater number of features, thus hindering probe model performance. These results highlight the effectiveness of simple linear probes as valuable tools for interpretability and control.
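As a sketch of the steering-vector flavor of that recipe (my own illustration, with an assumed model, layer, and direction rather than any published configuration), one can add a probe-derived direction to the residual stream with a forward hook and compare the resulting next-token predictions:

```python
# Sketch: steer a model by adding a probe-derived direction to the residual
# stream during the forward pass. Model, layer, and the direction are
# illustrative assumptions; a real direction would come from a trained probe
# or a contrastive (positive minus negative) activation difference.
import torch
from transformer_lens import HookedTransformer
import transformer_lens.utils as utils

model = HookedTransformer.from_pretrained("gpt2")
LAYER, SCALE = 6, 8.0                         # hypothetical layer and strength
direction = torch.randn(model.cfg.d_model)    # stand-in for a probe weight vector
direction = direction / direction.norm()

def steer(resid, hook):
    # Add the scaled direction to every token position's residual stream.
    return resid + SCALE * direction.to(resid.device)

tokens = model.to_tokens("The weather today is")
baseline_logits = model(tokens)
steered_logits = model.run_with_hooks(
    tokens,
    fwd_hooks=[(utils.get_act_name("resid_post", LAYER), steer)],
)
# Compare next-token predictions with and without steering.
print("baseline:", model.tokenizer.decode(int(baseline_logits[0, -1].argmax())))
print("steered :", model.tokenizer.decode(int(steered_logits[0, -1].argmax())))
```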
Probing classifiers are one tool researchers can use to pursue this kind of understanding: for example, simple probes have shown language models to contain information about simple syntactic features like part-of-speech tags, and more complex probes have shown models to contain entire parse trees of sentences. We use linear probes on four downstream tasks to extract interpretable features, with the goal of enabling scientific discovery. Mechanistic interpretability (often abbreviated as mech interp, mechinterp, or MI) is a subfield of research within explainable artificial intelligence that aims to understand the internal workings of neural networks by analyzing the mechanisms present in their computations; a motivating belief behind work on grokking is that mech interp should be a valuable tool for the science of deep learning, e.g. for understanding the mechanisms underlying neural network generalization, and related work explores look-ahead planning in large language models through a mechanistic lens (Men et al., 2024). One piece of background worth calling out is linear algebra: it is really important for mechanistic interpretability and fairly important elsewhere, where "important" means being able to reason fluently about what an equation of matrices means rather than knowing a long list of theorems.

Neel Nanda released a TransformerLens version of Othello-GPT (Colab, repo notebook), boosting mechanistic interpretability research on it; based on his work, a tool was made to inspect each MLP neuron in Othello-GPT, e.g. to see the differing activation of neuron 255 in layer 3 and neuron 250 in layer 8. Can you tell when an LLM is lying from the activations, and are simple methods good enough? We recently published a paper investigating whether linear probes detect when Llama is deceptive. Probes can also serve as optimization targets: with SSR, the loss is simply the probe's loss function, and SSR handles the rest, finding a perturbation to reroute your activations (rerouting activations using probes, again, on Llama 3.2 1B).
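The following sketch shows the general shape of such a probe-guided perturbation search. It is not the SSR implementation, just a gradient-descent loop under assumed components: a TransformerLens model, a hypothetical layer index, and a toy linear probe standing in for a trained one.

```python
# Sketch: optimize an activation perturbation so that a linear probe's output
# flips toward a target class. Model, layer, probe weights, and hyperparameters
# are illustrative stand-ins, not the configuration of any published method.
import torch
import torch.nn.functional as F
from transformer_lens import HookedTransformer
import transformer_lens.utils as utils

model = HookedTransformer.from_pretrained("gpt2")
for p in model.parameters():
    p.requires_grad_(False)  # only the perturbation is trainable
LAYER = 6
tokens = model.to_tokens("Please summarize this document.")

probe_w = torch.randn(model.cfg.d_model)   # stand-in for a trained probe's weights
probe_b = torch.zeros(1)
delta = torch.zeros(model.cfg.d_model, requires_grad=True)  # perturbation to learn
opt = torch.optim.Adam([delta], lr=1e-2)
target = torch.ones(1)                     # class we want the probe to predict

captured = {}
def perturb(resid, hook):
    resid = resid + delta                  # add perturbation at every position
    captured["last"] = resid[:, -1, :]     # keep last-token activation for the loss
    return resid

for step in range(100):
    opt.zero_grad()
    model.run_with_hooks(
        tokens, return_type=None,
        fwd_hooks=[(utils.get_act_name("resid_post", LAYER), perturb)])
    logit = captured["last"] @ probe_w + probe_b        # probe's raw score
    loss = F.binary_cross_entropy_with_logits(logit, target)
    loss = loss + 1e-3 * delta.norm()                   # keep the perturbation small
    loss.backward()
    opt.step()

print("final probe probability:", torch.sigmoid(logit).item())
```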
