Cross-Lingual Factual Recall via Reinforcement Learning

Research · 2025 · with Jonathan von Rad, George Burgess et al.

Python · Mechanistic Interpretability · LLMs · NLP · GRPO

Large language models are trained predominantly on English text, and as a result they often know a fact in English but fail to express it correctly in other languages. This project investigates that inconsistency systematically. We built PolyFact, a dataset of 100K Wikidata-grounded facts across 12 typologically diverse languages, and used it to benchmark three post-training approaches on two 7B-parameter models: continual pretraining, supervised fine-tuning, and reinforcement learning via GRPO. Reinforcement learning consistently outperformed the alternatives: it improved factual consistency on the trained languages and also generalised to held-out ones, which suggests that it builds shared internal representations rather than memorising training labels at the surface level.

My contribution was the mechanistic interpretability analysis. Using LAPE (Language Activation Probability Entropy), a technique that scores each neuron's language specialisation by the Shannon entropy of its activation probabilities across languages, I identified which neurons in the MLP layers of the model were acting as language-specific routing components. The key finding was that GRPO training substantially reduces this language specialisation: neurons that previously activated selectively for particular languages become more general-purpose, and language processing shifts toward deeper layers of the network. This reorganisation of internal representations helps the model retrieve the same underlying fact regardless of the query language, offering a mechanistic explanation for the observed cross-lingual consistency gains.
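
The entropy scoring described above can be illustrated with a minimal sketch. It is not the project's actual analysis code: the function names, the "activation > 0" firing criterion, the top-k selection rule, and the toy numbers are all assumptions made for illustration. The idea is that a neuron whose firing probability is concentrated in one language gets an entropy near zero, while a neuron that fires equally often in every language gets the maximum entropy.

```python
import numpy as np

def activation_probability(activations):
    """Fraction of tokens on which each neuron fires for one language.

    activations: array of shape (n_tokens, n_neurons).
    Assumption: a neuron counts as "firing" when its activation is > 0.
    """
    return (np.asarray(activations) > 0).mean(axis=0)

def lape_scores(act_prob, eps=1e-12):
    """Entropy of each neuron's activation probabilities across languages.

    act_prob: array of shape (n_languages, n_neurons).
    Low entropy means the neuron fires mostly for a few languages.
    Neurons that never fire are given +inf so they are never selected.
    """
    p = np.asarray(act_prob, dtype=np.float64)
    totals = p.sum(axis=0)
    # Normalise each neuron's column into a distribution over languages.
    dist = p / np.maximum(totals, eps)
    entropy = -(dist * np.log(dist + eps)).sum(axis=0)
    return np.where(totals > 0, entropy, np.inf)

def most_language_specific(act_prob, k):
    """Indices of the k neurons with the lowest entropy (hypothetical rule)."""
    return np.argsort(lape_scores(act_prob), kind="stable")[:k]

# Toy data (invented): rows are 3 languages, columns are 3 neurons.
toy_act_prob = np.array([
    [0.9, 0.5, 0.0],  # neuron 0 fires only here; neuron 2 never fires
    [0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0],
])
scores = lape_scores(toy_act_prob)
specific = most_language_specific(toy_act_prob, k=1)
```

In this toy setup, neuron 0 is the one flagged as language-specific, neuron 1 scores the maximum entropy of log(3), and the dead neuron 2 is excluded. Comparing such scores before and after GRPO training, layer by layer, is one way the reported drop in specialisation could be quantified.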