# LMR-BENCH: Can LLM Agents Reproduce NLP Research Code? (EMNLP 2025)

The LMR-BENCH benchmark, introduced by researchers at the University of Texas at Dallas at EMNLP 2025, evaluates whether LLM agents can reproduce core implementations from NLP research papers by filling in masked code stubs using the paper's description and surrounding codebase context. The benchmark includes 28 tasks from 23 recent NLP papers and scores agents on both functional correctness (via unit tests) and implementation fidelity (using GPT-4o as a judge). Key findings show that model architecture tasks are the hardest due to multi-file reasoning requirements, while the surrounding codebase context often provides stronger signals for correct implementation than the paper's equations alone.

A research team from the University of Texas at Dallas published LMR-BENCH at EMNLP 2025, asking a specific question: can LLM agents reproduce the core implementation from an NLP research paper when given the paper, a partially masked codebase, and explicit instructions? This is harder than it sounds, and the benchmark design is smart enough to be worth understanding in detail.

Sources: arXiv 2506.17335 (https://arxiv.org/abs/2506.17335) | ACL Anthology (https://aclanthology.org/2025.emnlp-main.314/) | GitHub (https://github.com/du-nlp-lab/LMR-Bench)

## What the benchmark actually tests

LMR-BENCH contains 28 reproduction tasks drawn from 23 NLP papers published in ACL, EMNLP, NAACL, and AAAI over the past five years. Each task follows the same structure:

- **Paper**: the full PDF
- **Masked repository**: a real codebase from the paper, but with one or more critical functions replaced by `TODO: implement` stubs
- **Implementation instruction**: a natural language description of what the masked function should do, including cross-file dependencies and design intent

The agent's job is to generate patch code that fills the stubs correctly.

This tests something distinct from "can an LLM write a function from a docstring." The function body has to match what the paper describes, use the surrounding codebase's conventions, and pass unit tests against the paper's reference implementation.
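
As a rough illustration of the task structure described above, the sketch below models the three inputs and locates stubbed functions in a Python source string. It is not taken from the LMR-Bench repository: `ReproductionTask`, `find_masked_functions`, and the file paths are hypothetical names chosen for this example, and the only detail borrowed from the article is the `TODO: implement` marker.

```python
import ast
from dataclasses import dataclass

STUB_MARKER = "TODO: implement"  # stub text quoted in the article


@dataclass
class ReproductionTask:
    """Hypothetical container for the three inputs of one task."""
    paper_pdf: str      # path to the paper PDF
    masked_repo: str    # path to the masked repository
    instruction: str    # natural-language implementation instruction


def find_masked_functions(source: str) -> list[str]:
    """Return names of functions whose source span contains the stub marker.

    Note: a nested stub is also reported under its enclosing function, and a
    marker comment placed after a function's last statement falls outside the
    span that `ast` records.
    """
    lines = source.splitlines()
    masked = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            span = lines[node.lineno - 1 : node.end_lineno]
            if any(STUB_MARKER in line for line in span):
                masked.append(node.name)
    return masked


example = '''
def helper(x):
    return x + 1

def apply_rotary_emb(xq, xk, freqs_cis):
    """Apply rotary embeddings to query and key tensors."""
    # TODO: implement
    raise NotImplementedError
'''

task = ReproductionTask("paper.pdf", "masked_repo/", "Apply rotary embeddings.")
print(task.instruction, "->", find_masked_functions(example))
```

A scan like this only tells an agent where the gaps are. Filling them still depends on reading the paper and the surrounding files.
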

## The nine task categories

| Category | What gets masked |
|---|---|
| Tokenization | Custom tokenizer logic |
| Attention mechanism | Scaled dot-product or custom attention |
| Positional encoding | RoPE, ALiBi, learned variants |
| Loss function | Custom training objectives |
| Data preprocessing | Dataset-specific transforms |
| Model architecture | Layer definitions, custom blocks |
| Training procedure | Optimizer steps, gradient modifications |
| Decoding strategy | Beam search variants, constrained decoding |
| Evaluation metric | BLEU variants, task-specific metrics |

The hardest category is **model architecture**: reproducing a custom layer requires reading across multiple files to understand tensor shapes, class inheritance, and forward pass conventions, which is exactly the kind of multi-file reasoning that current LLMs struggle with.

The easiest is **evaluation metric**: formulas are usually self-contained, well-documented in the paper, and don't require deep codebase knowledge.

## How masking works in practice

Here's what a masked task looks like (synthetic example based on the paper's methodology):

```python
import torch

# Original in the paper's codebase: rotary_embedding.py
# (reshape_for_broadcast is a helper defined elsewhere in the repository)
def apply_rotary_emb(xq, xk, freqs_cis):
    """Apply rotary embeddings to query and key tensors."""
    # View the last dimension as (real, imaginary) pairs
    xq_ = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
    xk_ = torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))
    freqs_cis = reshape_for_broadcast(freqs_cis, xq_)
    # Rotate by complex multiplication, then flatten back to real values
    xq_out = torch.view_as_real(xq_ * freqs_cis).flatten(3)
    xk_out = torch.view_as_real(xk_ * freqs_cis).flatten(3)
    return xq_out.type_as(xq), xk_out.type_as(xk)


# Masked version (what the agent receives):
def apply_rotary_emb(xq, xk, freqs_cis):
    """Apply rotary embeddings to query and key tensors."""
    # TODO: implement
    # Instruction: Apply rotary position embeddings to xq and xk.
    # Use torch.view_as_complex for complex number representation.
    # freqs_cis shape must be broadcast-compatible with xq.
    # Return float tensors matching input dtype.
    raise NotImplementedError
```

The `info.json` for this task would also specify which files the agent should read (the `reshape_for_broadcast` definition lives in `utils.py`, for example).

## Dual evaluation: unit tests + LLM-as-judge

LMR-BENCH scores agents on two axes:

**Axis 1: Functional correctness (unit tests).** Numerical equivalence against the reference implementation. The agent's patch must produce the same tensor outputs as the original function.

**Axis 2: Implementation fidelity (LLM-as-judge).** GPT-4o reads the paper's algorithm description and the agent's code, then scores whether the implementation actually follows the described method, even if it passes unit tests through an equivalent but differently structured approach.

This dual axis matters because:

- A function can pass unit tests but use a different algorithm (a memorized shortcut)
- A function can fail unit tests due to floating-point differences but be conceptually correct

The two axes tell you different things about the agent's reasoning.

## What the results show

The paper doesn't release a full leaderboard in the public arXiv version, but the findings indicate:

- o3-mini (high compute) was the best-performing model tested
- Pass@1 rates ranged roughly from 20% to 60% across task categories
- Multi-file reasoning was the single biggest differentiator: models that could trace function calls across 3+ files significantly outperformed those that stayed in the target file
- Simply giving the model the paper PDF without the masked code resulted in worse performance than giving both: the code context matters more than the paper text for reproduction tasks

The last point is counterintuitive. You'd expect the paper's equations to be the key signal. But the surrounding codebase (tensor shapes, variable naming, utility functions) constrains the solution space more tightly than the abstract algorithm description.
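
To make the two evaluation axes more concrete, here is a small pure-Python sketch of why a numerical check alone cannot tell how a function was implemented. It rotates (x, y) pairs in two structurally different ways, once by complex multiplication (the style of the rotary-embedding example) and once with explicit sines and cosines, and compares the outputs within a tolerance. This is a simplified stand-in, not the benchmark's test code: the function names and tolerance values are assumptions, and the real tasks compare tensors rather than Python lists.

```python
import cmath
import math


def rotate_complex(pairs, angle):
    """Rotate each (x, y) pair by `angle` via complex multiplication."""
    rotation = cmath.exp(1j * angle)
    rotated = []
    for x, y in pairs:
        z = complex(x, y) * rotation
        rotated.append((z.real, z.imag))
    return rotated


def rotate_trig(pairs, angle):
    """Same rotation written as an explicit 2x2 rotation matrix."""
    c, s = math.cos(angle), math.sin(angle)
    return [(x * c - y * s, x * s + y * c) for x, y in pairs]


def outputs_match(a, b, rel_tol=1e-9, abs_tol=1e-12):
    """Tolerance-based comparison, in the spirit of a numerical unit test."""
    return len(a) == len(b) and all(
        math.isclose(u, v, rel_tol=rel_tol, abs_tol=abs_tol)
        for pair_a, pair_b in zip(a, b)
        for u, v in zip(pair_a, pair_b)
    )


pairs = [(1.0, 0.0), (0.5, -2.0), (3.0, 4.0)]
reference = rotate_complex(pairs, 0.3)
candidate = rotate_trig(pairs, 0.3)
print(outputs_match(reference, candidate))
```

A tolerance-based test accepts both versions, so only a reader of the code (human or LLM judge) can say which one follows the method as the paper describes it. Conversely, a test that demanded bit-exact equality could reject a conceptually correct implementation over rounding differences.
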

## Why this benchmark matters for developers

If you're building an AI-assisted research coding tool (or evaluating whether an agent can help you implement a paper), LMR-BENCH is the most realistic evaluation framework available. The alternatives:

- **HumanEval / MBPP**: function-level, no paper context, no cross-file reasoning
- **SWE-bench**: bug fixing in large codebases, a different skill set from paper reproduction
- **APPS**: competitive programming, not research implementation

LMR-BENCH specifically targets the "I read a paper, now implement it" workflow, which is what most ML engineers actually do.

## Running the benchmark yourself

The benchmark repo requires Python ≥ 3.12 and supports any LLM backend through its evaluation harness:

```bash
git clone https://github.com/du-nlp-lab/LMR-Bench
cd LMR-Bench
pip install -r requirements.txt

# Run a single task with Claude
python evaluate.py \
  --task benchmark/rotary_emb_task/ \
  --model claude-opus-4-7-20251001 \
  --api-key "$ANTHROPIC_API_KEY"
```

The evaluation harness handles sending the paper and masked code to the model, collecting the patch, running unit tests, and recording fidelity scores.

## What to expect if you run it

Based on the paper's findings, expect:

- **Evaluation metric tasks**: 50–60% pass@1 with a capable model
- **Model architecture tasks**: 20–30% pass@1, sometimes lower
- **Most failures**: not a wrong algorithm but wrong tensor handling, typically shape mismatches from not reading the surrounding code carefully enough

If you're using this to evaluate your own agent, the architecture tasks are the most informative discriminator between models.

## The broader picture

LMR-BENCH reveals a gap that matters: LLMs can explain papers well and can write code well, but the intersection (implement exactly what this paper describes, in this codebase, with these constraints) is still hard. The benchmark gives that gap a number.
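
For readers who want to tally that number per category from their own runs, the following sketch computes pass@1 from a list of (category, passed) outcomes, assuming exactly one attempt per task. The record format, the function name `pass_at_1_by_category`, and the sample outcomes are all made up for illustration. They are not the harness's output format and not results from the paper.

```python
from collections import defaultdict


def pass_at_1_by_category(results):
    """Compute pass@1 per category from (category, passed) pairs.

    Assumes one attempt per task, so pass@1 is simply the pass fraction.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        passes[category] += int(passed)
    return {category: passes[category] / totals[category] for category in totals}


# Made-up outcomes for illustration only; not results from the paper.
results = [
    ("evaluation_metric", True),
    ("evaluation_metric", False),
    ("model_architecture", False),
    ("model_architecture", False),
    ("model_architecture", True),
    ("model_architecture", False),
]

rates = pass_at_1_by_category(results)
for category, rate in sorted(rates.items()):
    print(f"{category}: {rate:.0%}")
```

With only 28 tasks spread over nine categories, each category holds a handful of tasks, so per-category rates computed this way move in large steps and are best read as rough indicators.
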

For the AI research community, this is also a forcing function: if you want your paper to be reproducible by an LLM agent, write clearer implementation instructions and reduce cross-file dependencies in your codebase.

Paper: Shuo Yan et al., "LMR-BENCH: Evaluating LLM Agent's Ability on Reproducing Language Modeling Research," EMNLP 2025. arXiv:2506.17335. All results cited from the published paper. PoC evidence in `data/lab-runs/lmr-bench-llm-reproduce-nlp-research-code-paper-poc-2026.md`.