Investigating What LLMs Are Really Doing with Adejumobi Joshua
"Does the model mean what it says, or has it just learned to look like it does?"
What students do
Over five weeks, students contribute to two connected investigations that share one question: are AI models actually doing what they appear to be doing, or have they learned to look like it?
The first investigation tests whether published methods for reducing bias in language models actually work the way the papers claim. Students reproduce the original results on a standard bias benchmark, then test whether those results hold up when the evaluation framing is changed. If the reductions disappear when the model can't tell it's being tested, that's a strong signal the method is teaching compliance rather than reasoning.
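As a rough illustration of the comparison described above, the sketch below scores the same benchmark items under two framings and bootstraps the gap between them. It is a minimal sketch, not the program's actual pipeline: the function names, the 0/1 encoding (1 means the model picked the stereotyped answer), and the toy data are all assumptions made for illustration.

```python
import random

def stereotype_rate(choices):
    """Fraction of items on which the model picked the stereotyped answer."""
    return sum(choices) / len(choices)

def paired_bootstrap_gap(standard, reframed, n_resamples=2000, seed=0):
    """Approximate 95% interval for rate(reframed) - rate(standard).

    Items are resampled in pairs because both framings use the same items.
    """
    assert len(standard) == len(reframed)
    rng = random.Random(seed)
    n = len(standard)
    gaps = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        gaps.append(sum(reframed[i] - standard[i] for i in idx) / n)
    gaps.sort()
    return gaps[int(0.025 * n_resamples)], gaps[int(0.975 * n_resamples) - 1]

# Made-up placeholder data, NOT real results: one entry per benchmark item.
standard = [1] * 10 + [0] * 90   # standard benchmark framing
reframed = [1] * 30 + [0] * 70   # same items, evaluation framing changed

observed_gap = stereotype_rate(reframed) - stereotype_rate(standard)
low, high = paired_bootstrap_gap(standard, reframed)
print(f"gap = {observed_gap:.2f}, approx. 95% interval = [{low:.2f}, {high:.2f}]")
```

If the interval for the gap sits clearly above zero, the reduction measured under the standard framing did not carry over to the reframed items, which is the pattern the investigation is looking for.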
The second investigation looks inside the model. Students take an open language model, give it harmful and ambiguous requests, and extract its internal activations: the patterns that arise inside the model as it processes each prompt. They then train simple classifiers on those internal patterns to ask: when the model refuses, do its internals look like other refusals? When it complies, do its internals look like other compliances? Or are there cases where the model says one thing while its internals say another?
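One simple way to ask these questions is a linear probe. The sketch below uses a difference-of-means probe, one of several possible "simple classifiers", and flags examples where the probe's reading of the activations disagrees with the model's observed behavior. It is a sketch under stated assumptions: the activations here are synthetic random vectors, the function names are hypothetical, and extracting real activations from a model is not shown.

```python
import random

def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def fit_mean_difference_probe(acts, labels):
    """Fit a linear probe; labels are 1 = refused, 0 = complied.

    Returns (direction, threshold): the direction points from the
    compliance mean to the refusal mean, and the threshold is the
    projection of the midpoint between the two means.
    """
    refuse = [a for a, y in zip(acts, labels) if y == 1]
    comply = [a for a, y in zip(acts, labels) if y == 0]
    mu_r, mu_c = mean_vector(refuse), mean_vector(comply)
    direction = [r - c for r, c in zip(mu_r, mu_c)]
    midpoint = [(r + c) / 2 for r, c in zip(mu_r, mu_c)]
    threshold = sum(d * m for d, m in zip(direction, midpoint))
    return direction, threshold

def probe_predict(direction, threshold, act):
    """Return 1 if the activation looks like a refusal, else 0."""
    score = sum(d * x for d, x in zip(direction, act))
    return 1 if score > threshold else 0

def find_mismatches(direction, threshold, acts, behaviors):
    """Indices where the probe's prediction disagrees with what the model did."""
    return [i for i, (a, b) in enumerate(zip(acts, behaviors))
            if probe_predict(direction, threshold, a) != b]

# Synthetic stand-in for real activations (assumption): two Gaussian clusters.
rng = random.Random(0)
DIM = 8
train_acts = ([[rng.gauss(1.0, 0.5) for _ in range(DIM)] for _ in range(100)]
              + [[rng.gauss(-1.0, 0.5) for _ in range(DIM)] for _ in range(100)])
train_labels = [1] * 100 + [0] * 100
direction, threshold = fit_mean_difference_probe(train_acts, train_labels)

# Held-out examples: the third "looks like" a refusal internally but complied.
held_out_acts = [[1.0] * DIM, [-1.0] * DIM, [1.0] * DIM]
held_out_behaviors = [1, 0, 0]
mismatches = find_mismatches(direction, threshold, held_out_acts, held_out_behaviors)
print("mismatched examples:", mismatches)
```

Mismatches should be read on held-out examples, as here, since a probe can trivially fit the data it was trained on. A flagged example is a candidate for closer inspection, not proof that the model "says one thing while its internals say another".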
Adejumobi's framing: benchmarks alone aren't enough. Real evaluation requires both measuring what the model says and looking at what it represents internally. Together, the two investigations model that approach.
The investigations share a tooling backbone, and students contribute to both. They leave with working pipelines, real findings, and a written report. Last summer, Adejumobi's students presented research at the Women in Machine Learning workshop at NeurIPS, and one of those papers has been cited. This summer's projects are positioned to produce similarly strong outputs.
About the mentor
Adejumobi Joshua leads AI Evaluation research at SeqHub, where her work has taken her across the defining questions in the field: how models handle bias, whether they reason or perform, and how safety and alignment get measured and gamed. Last summer, one of her students presented at NeurIPS, and that paper has since been cited. This summer, students look inside the model, not just at its output.
