Senior Applied Scientist, Microsoft (MSAI)
Posted June 06, 2026
In 2023, I made a deliberate bet on where to spend my side-quest research time: mechanistic interpretability of large language models. The thesis was simple, even if it doesn’t sound like a hot take: the shortest path to a better product is understanding why the model behaves the way it does – not making it bigger, not adding another loss term, not chasing another evaluation suite. I have been an advocate of interpretability even before November 2022. Explaining (using tools like LIME, SHAP, etc.) why machine learning models were predicting a specific outcome helped me in:
When you genuinely understand what’s happening inside the model, you stop being surprised by it, you spot the gaps that benchmarks hide, and you ship features that hold up in production rather than fall apart at the user-experience layer.
This post is a stake in the ground on what that bet looked like in practice – what I worked on, what surprised me, and where I think the field needs to go next, especially as I gear up to welcome a new Intern this year to uncover some production model behaviors.
“Mechanistic interpretability” is a broadening label, that enables deciphering model “circuits” using small toy transformers. What I am/was particularly fasinated with, is the tooling probing real production-scale language models to understand which internal representations get used when, especially when the model has both parametric memory and retrieved context to draw from.
This is also where it parts ways with the older, model-agnostic interpretability stack — methods like LIME and SHAP that estimate which inputs mattered for a single prediction by perturbing a black-box model from the outside. Those methods are useful as a sanity check, but they are post-hoc, correlational, and silent on how the model actually arrived at the output; they tell you a story about the prediction without giving you a handle on the computation behind it. For an LLM-plus-retriever system where the failure mode is “the model deferred to context it shouldn’t have,” LIME-style importance scores describe the symptom but don’t tell you which sub-network to instrument or steer. Mechanistic interpretability opens the model, identifies the specific representations and circuits responsible, and — crucially — gives you a causal lever you can pull at inference time. That shift from explanation-as-description to explanation-as-intervention is what makes it product-shaping rather than just diagnostic.
The question this paper asks is direct: when a language model is given retrieved context plus a query, does it actually use the retrieved context, or does it fall back on what it learned during pre-training? People had hand-waved about this for years, usually arguing one side or the other based on prompt-level tricks. We wanted a mechanistic answer.
We probed activations layer-by-layer and found that production-class LLMs lean heavily on retrieved context, they actively minimize their use of parametric memory when context is available. That sounds like unambiguous good news for RAG. It isn’t. The same property that makes RAG work is the property that makes prompt injection work. The model is wired to defer to whatever you put in its context window; that’s a feature for retrieval and a vulnerability for adversarial inputs. Once we understood the mechanism, it stopped being a surprise that prompt-injection defenses are so hard to get right by patching at the API layer alone.
The precursor to the RAG line. Before I was probing models to understand RAG behavior, I was probing them to understand which layers actually need to move during task adaptation. The finding, that you can get most of the benefit of full fine-tuning by touching a small, principled subset of layers, was itself a small piece of interpretability. It told us layer-level locality matters, and that prepared the ground for the activation-patching work that came later.
The learnings from the research are broadly applicable:
Context: I was born into a Bengali family that lived in several states across western and southern India. Bengali for those who don’t know, is a gender-neutral language and doesn’t assign pronouns with objects and verbs. This is in sharp-constrast with Konkani and Hindi (which I had to take up in school as second and third language) and Marathi, which do have genders assigned to verbs and pronouns influence sentence formation to a very large degree. My Bengali knowledge often would overfit my “actual” intelligence to learn gender-dependent languages like Konkani, Hindi, and Marathi, and till date I do not know the protocol on how these languages assign genders to non-living objects, like chair, tables, cars, etc. My abysmally bad non-English scores in exams are a testament to my struggles. Today, if I have to code-switch and speak a non-English language, I generally wing the genders for inanimate objects.
I have observed the best language models today misgender (maybe not as frequently as I do :P), and this behavior is visible even with grounding context. This got me thinking and I wanted to probe whether multilingual LLMs have a cross-lingual gender-resolution circuit: when a model is given a source-language passage and asked to translate it into German (which forces a pronoun choice between Er and Sie), does it actually extract gender from the source-language grammar, or does it default to something else? Four source languages give a clean ladder of how much gender information the source provides:
The dataset is small but deliberate: five female + five male names per language, written in source-language script. The choice to use native names instead of a single shared roster was a direct response to v1 (see appendix below) — v1 embedded European first names (Maria, Carlos, Lisa, …) inside Bengali/Hindi/Marathi sentences, and the script-mixing itself leaked gender signal in a way we couldn’t separate from the source-language grammar. Native names in native script force the model to engage the actual Bengali/Hindi/Marathi name distribution it would see in a real product setting.
| Language | Female names (with romanisation) | Male names (with romanisation) |
|---|---|---|
| Bengali | বদিশা (Bidisha), রাজশ্রী (Rajashree), দেবলিনা (Deboleena), নীলাঞ্জনা (Nilanjana), ঐন্দ্রিলা (Oindrilla) | উৎপল (Utpal), বিমল (Bimal), দেবদিত্য (Debaditya), শায়ন (Shayan), অরবিন্দ (Aurobindo) |
| Hindi | रेशमी (Reshmi), प्रिया (Priya), आरती (Aarti), अंजली (Anjali), काजल (Kajal) | अर्जुन (Arjun), विक्रम (Vikram), राहुल (Rahul), अभिषेक (Abhishek), अमित (Amit) |
| Marathi | साई (Sai), मधुरा (Madhura), अश्विनी (Ashwini), वैष्णवी (Vaishnavi), ओजस्वी (Ojasvi) | साकेत (Saket), अजिंक्य (Ajinkya), चैतन्य (Chaitanya), अमोल (Amol), अनिकेत (Aniket) |
| English | Emily, Sarah, Anna, Lisa, Maria | Alex, Michael, Carlos, Felix, Pedro |
All four name lists were cross-checked with a native or near-native reader of each script, one early Bengali name (ইমন / Iman) was a homograph for an unrelated Bengali word and had to be replaced. (Most Bengali names are of my family members. Also fun-fact if not Reshmi (Hindi language centered), I was to be named Nilanjana, which is deeply rooted in Bengali cultural context -so this was a fun experiment also calibrating how LLMs using my first name and potential first name could misgender me.) That kind of cultural validation is invisible labor that’s easy to skip; in this experiment, skipping it was the difference between a sub-chance probe accuracy and a clean signal.
I ran two models on this: Gemma-2-2B-it and Aya-23-8B (Cohere’s purpose-built multilingual model). For each, I measured two separate things:
The separation between encoding and routing is the whole point. A model can encode gender perfectly and still refuse to use it at generation time — and that gap is invisible to standard translation benchmarks but recoverable by mechanistic analysis.

Gemma encodes gender exactly along the predicted ladder: English saturates at layer 0 (explicit pronoun cue), Marathi rises sharply around layer 8 to near-perfect by layer 12, Hindi follows a similar arc, and Bengali sits stable at 0.6–0.7 across most of the network — above chance but not perfect, exactly as you’d expect from a name-only signal.

Aya encodes faster: English saturates by layer 6, Marathi by layer 9, Hindi (noisily) by layer 22. This is the multilingual-training dividend showing up mechanistically — Aya has dedicated grammar-extraction circuits that fire earlier in the forward pass than Gemma’s. Bengali is the outlier; I’ll come back to it.
This is where the story gets interesting. Here’s the side-by-side comparison of probe peak (encoding) vs. translation accuracy (routing):

| Language | Gemma probe | Gemma routing | Aya probe | Aya routing |
|---|---|---|---|---|
| English | 1.00 | 1.00 | 1.00 | 1.00 |
| Marathi | 1.00 | 1.00 | 1.00 | 0.50 |
| Hindi | 1.00 | 1.00 | 1.00 | 0.80 |
| Bengali | 0.70 | 0.70 | 0.30 | 0.60 |
Gemma’s encoding and routing are matched everywhere. What the model knows, it uses. The Bengali ceiling at 0.70 is real but explainable — only the name token carries gender information, and Gemma’s grasp of distinctively-Bengali names is partial.
Aya has a large, language-dependent encoding-to-routing gap. Most strikingly: Aya’s Marathi probe accuracy is 1.00 — the model has perfectly encoded the subject’s gender from the Marathi grammatical cues, but Aya’s Marathi translation accuracy is 0.50, chance. The model knows the gender and doesn’t use it. Hindi shows the same pattern at lower magnitude (gap 0.2). English shows no gap. The clearer the grammatical signal, the worse Aya’s routing relative to its encoding.
The translation box plots make the failure mode visible:

For Gemma every (language × gender) pair separates cleanly: F-boxes positive, M-boxes negative, modest magnitudes.

For Aya the F-boxes for Marathi and Bengali sit in negative territory — the model is predicting Er (masculine) for female-subject Marathi and Bengali sentences despite the probe saying it has the right gender information internally. English is clean, with massive magnitudes; the rest of the languages have a quiet masculine default.
Aya’s behavior reads as a masculine default for non-English source contexts. The model’s German generation distribution is anchored to a strong masculine prior — Er as the unmarked third-person — and only English source context produces a signal strong enough to override it. Indic-language sources, however grammatically rich, get partially or wholly overridden.
This is exactly the kind of bias mechanistic analysis is the right tool for. It doesn’t show up cleanly on translation benchmarks (BLEU-style metrics tolerate pronoun substitutions, especially when the surrounding sentence is otherwise correct). It doesn’t show up on safety evals (it’s not a refusal, not a toxic output). What it is: a language-dependent generation bias that is invisible to standard metrics but recoverable by reading the residual stream. If a downstream product summarizes Marathi or Bengali source documents in German, it will systematically default to masculine pronouns, and no aggregate benchmark will catch it.
The mechanistic finding tells you exactly where to intervene: not in the encoding (it’s fine), not in the source-language data mixture (the model evidently learned the grammar), but in the late-stage generation circuit that decides which pronoun to commit to. That’s a much sharper diagnosis than “Aya has a translation bias” — it tells you which subsystem to audit and which not to bother with.
Three directions I expect to spend time on:
The cross-lingual probing experiment above is v2. The first pass (v1) used a different framing and produced almost no signal, but the way it failed turned out to be the most useful part of the project — it told us what kind of measurement actually works for this question. Worth including here because the negative results are themselves a finding.
V1 used continuation framing: present a source-language passage followed by a German question — “Was hat diese Person gestern gemacht?” and measure the model’s next-token logit difference for Er vs. Sie. The dataset used European first names (Maria, Carlos, Lisa, Pedro, …) embedded inside Indic-language sentences, on the theory that name-controlled minimal pairs would let us isolate the source-language grammatical signal.
Two things went wrong simultaneously: the prompt contained a feminine grammatical anchor, and the dataset contained a script-mixing confound. The result was that neither model showed a clean cross-lingual signal under v1’s measurement, and we couldn’t tell which part of the setup was responsible.

Gemma v1 logit lens. All four source-language curves lie on top of each other across all 26 layers. The pronoun decision crystallizes only at layers 22–23 of 26, and the curves for Bengali (no grammatical signal), Hindi (verb-gender), Marathi (both pronoun and verb gender), and English (explicit pronoun) are within noise of each other. Read literally, this says Gemma has no cross-lingual gender-resolution circuit. The v2 probe later showed this was wrong — Gemma encodes gender from grammar quite cleanly — so what v1 was actually measuring was something narrower. The narrowest interpretation: Gemma was reading the name token, ignoring the surrounding Indic context, and the European names dominated whatever signal there was.

Aya v1 logit lens. Here we do get differentiation — English peaks at ~78, Hindi at ~40, Bengali and Marathi at ~33 — but the absolute magnitudes are tangled up with Aya’s preference for Sie driven by the German prompt’s “diese Person” (grammatically feminine). The relative ordering is suggestive, but you can’t tell from this alone whether Aya was extracting gender from the source or was being pushed by an anaphor in the prompt that we put there ourselves.
The probe figures in the main section don’t have the “diese Person” anchor (the model is run on just the source sentence, with no German prompt at all), and they use a forced-choice scoring template that avoids sentence-starter priors. With those two fixes the language ladder that v1 said wasn’t there became immediately visible. The lesson: measurement choice can hide a clean circuit just as effectively as the absence of one. If you have a prior on what the model should be doing and the measurement says it isn’t, doubt the measurement first.
V1 also included cross-lingual activation patching: take the residual stream from a clean English example, splice it into the corresponding (layer, position) of a corrupt Bengali example, and see whether the German pronoun choice gets restored. The hope was to identify a localized circuit responsible for gender attribution.


Neither heatmap shows a clean restoration hotspot. Gemma has scattered blue (mild restoration) concentrated in early layers at late prompt positions, and Aya is mostly red (patching makes things worse). I include them here as a methodological warning rather than a finding: cross-prompt activation patching across different scripts and different prompt lengths is fundamentally noisy, and the noise can easily masquerade as either evidence-for or evidence-against a circuit. The Bengali prompt and the English prompt have different token counts (Devanagari and Bengali script tokenize longer than Latin for similar content), and position p=25 in one prompt is not the semantic equivalent of p=25 in the other. Patching there isn’t a controlled intervention — it’s a perturbation whose semantics depend on which arbitrary tokens happened to align by index.
The right way to use activation patching for a question like this is minimal pairs within a single language: e.g., for Hindi, patch between a sentence with masculine verb-form था and one with feminine थी, holding every other token (including the name and the position of the verb) fixed. That gives you a controlled causal intervention. v1’s cross-script patching didn’t, and the heatmaps reflect that.
— Reshmi