{"slug": "lmr-bench-can-llm-agents-reproduce-nlp-research-code-emnlp-2025", "title": "LMR-BENCH: Can LLM Agents Reproduce NLP Research Code? (EMNLP 2025)", "summary": "The LMR-BENCH benchmark, introduced by researchers at the University of Texas at Dallas at EMNLP 2025, evaluates whether LLM agents can reproduce core implementations from NLP research papers by filling in masked code stubs using the paper's description and surrounding codebase context. The benchmark includes 28 tasks from 23 recent NLP papers and scores agents on both functional correctness (via unit tests) and implementation fidelity (using GPT-4o as a judge). Key findings show that model architecture tasks are the hardest due to multi-file reasoning requirements, while the surrounding codebase context often provides stronger signals for correct implementation than the paper's equations alone.", "body_md": "A research team from the University of Texas at Dallas published LMR-BENCH at EMNLP 2025, asking a specific question: can LLM agents reproduce the core implementation from an NLP research paper when given the paper, a partially masked codebase, and explicit instructions?\n\nThis is harder than it sounds. And the benchmark design is smart enough to be worth understanding in detail.\n\nSources: [arXiv 2506.17335](https://arxiv.org/abs/2506.17335) | [ACL Anthology](https://aclanthology.org/2025.emnlp-main.314/) | [GitHub](https://github.com/du-nlp-lab/LMR-Bench)\n\n## What the benchmark actually tests\n\nLMR-BENCH contains 28 reproduction tasks drawn from 23 NLP papers published in ACL, EMNLP, NAACL, and AAAI over the past five years. 
Each task follows the same structure:\n\n- **Paper**: the full PDF\n- **Masked repository**: a real codebase from the paper, but with one or more critical functions replaced by `# TODO: implement` stubs\n- **Implementation instruction**: a natural language description of what the masked function should do, including cross-file dependencies and design intent\n\nThe agent's job is to generate patch code that fills the stubs correctly.\n\nThis tests something distinct from \"can an LLM write a function from a docstring.\" The function body has to match what the paper describes, use the surrounding codebase's conventions, and pass unit tests against the paper's reference implementation.\n\n## The nine task categories\n\n| Category | What gets masked |\n|---|---|\n| Tokenization | Custom tokenizer logic |\n| Attention mechanism | Scaled dot-product or custom attention |\n| Positional encoding | RoPE, ALiBi, learned variants |\n| Loss function | Custom training objectives |\n| Data preprocessing | Dataset-specific transforms |\n| Model architecture | Layer definitions, custom blocks |\n| Training procedure | Optimizer steps, gradient modifications |\n| Decoding strategy | Beam search variants, constrained decoding |\n| Evaluation metric | BLEU variants, task-specific metrics |\n\nThe hardest category is **model architecture**: reproducing a custom layer requires reading across multiple files to understand tensor shapes, class inheritance, and forward pass conventions, exactly the kind of multi-file reasoning that current LLMs struggle with.\n\nThe easiest is **evaluation metric**: formulas are usually self-contained, well documented in the paper, and don't require deep codebase knowledge.\n\n## How masking works in practice\n\nHere's what a masked task looks like (synthetic example based on paper methodology):\n\n```python\n# Original in paper's codebase: rotary_embedding.py\nimport torch\nfrom utils import reshape_for_broadcast  # helper defined in utils.py\n\ndef apply_rotary_emb(xq, xk, freqs_cis):\n    \"\"\"Apply rotary embeddings to query and 
key tensors.\"\"\"\n    xq_ = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))\n    xk_ = torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))\n    freqs_cis = reshape_for_broadcast(freqs_cis, xq_)\n    xq_out = torch.view_as_real(xq_ * freqs_cis).flatten(3)\n    xk_out = torch.view_as_real(xk_ * freqs_cis).flatten(3)\n    return xq_out.type_as(xq), xk_out.type_as(xk)\n\n# Masked version (what the agent receives):\ndef apply_rotary_emb(xq, xk, freqs_cis):\n    \"\"\"Apply rotary embeddings to query and key tensors.\"\"\"\n    # TODO: implement\n    # Instruction: Apply rotary position embeddings to xq and xk.\n    # Use torch.view_as_complex for complex number representation.\n    # freqs_cis shape must be broadcast-compatible with xq_.\n    # Return float tensors matching input dtype.\n    raise NotImplementedError\n```\n\nThe `info.json` for this task would also specify which files the agent should read (the `reshape_for_broadcast` definition lives in `utils.py`, for example).\n\n## Dual evaluation: unit tests + LLM-as-judge\n\nLMR-BENCH scores agents on two axes:\n\n**Axis 1: Functional correctness (unit tests)**\n\nNumerical equivalence against the reference implementation. 
The agent's patch must produce the same tensor outputs as the original function.\n\n**Axis 2: Implementation fidelity (LLM-as-judge)**\n\nGPT-4o reads the paper's algorithm description and the agent's code, then scores whether the implementation actually follows the described method, even if it passes unit tests through an equivalent but differently structured approach.\n\nThis dual axis matters because:\n\n- A function can pass unit tests but use a different algorithm (memorized shortcut)\n- A function can fail unit tests due to floating-point differences but be conceptually correct\n\nThe two axes tell you different things about the agent's reasoning.\n\n## What the results show\n\nThe paper doesn't release a full leaderboard in the public arXiv version, but the findings indicate:\n\n- **o3-mini (high compute)** was the best-performing model tested\n- Pass@1 rates ranged roughly from 20% to 60% across task categories\n- Multi-file reasoning was the single biggest differentiator: models that could trace function calls across 3+ files significantly outperformed those that stayed in the target file\n- Simply giving the model the paper PDF without the masked code resulted in worse performance than giving both: the code context matters more than the paper text for reproduction tasks\n\nThe last point is counterintuitive. You'd expect the paper's equations to be the key signal. But the surrounding codebase (tensor shapes, variable naming, utility functions) constrains the solution space more tightly than the abstract algorithm description.\n\n## Why this benchmark matters for developers\n\nIf you're building an AI-assisted research coding tool (or evaluating whether an agent can help you implement a paper), LMR-BENCH is the most realistic evaluation framework available. 
The alternatives:\n\n- **HumanEval / MBPP**: function-level, no paper context, no cross-file reasoning\n- **SWE-bench**: bug fixing in large codebases, different skill set from paper reproduction\n- **APPS**: competitive programming, not research implementation\n\nLMR-BENCH specifically targets the \"I read a paper, now implement it\" workflow, which is what most ML engineers actually do.\n\n## Running the benchmark yourself\n\nThe benchmark repo requires Python ≥ 3.12 and supports any LLM backend through its evaluation harness:\n\n```\ngit clone https://github.com/du-nlp-lab/LMR-Bench\ncd LMR-Bench\npip install -r requirements.txt\n\n# Run a single task with Claude\npython evaluate.py \\\n  --task benchmark/rotary_emb_task/ \\\n  --model claude-opus-4-7-20251001 \\\n  --api-key $ANTHROPIC_API_KEY\n```\n\nThe evaluation harness handles: sending the paper + masked code to the model, collecting the patch, running unit tests, and recording fidelity scores.\n\n## What to expect if you run it\n\nBased on the paper's findings, expect:\n\n- **Evaluation metric tasks**: 50–60% pass@1 with a capable model\n- **Model architecture tasks**: 20–30% pass@1, sometimes lower\n- **Most failures**: not a wrong algorithm but wrong tensor handling, typically shape mismatches from not reading the surrounding code carefully enough\n\nIf you're using this to evaluate your own agent, the architecture tasks are the most informative discriminator between models.\n\n## The broader picture\n\nLMR-BENCH reveals a gap that matters: LLMs can explain papers well and can write code well, but the intersection (implement exactly what this paper describes, in this codebase, with these constraints) is still hard. 
The benchmark gives that gap a number.\n\nFor the AI research community, this is also a forcing function: if you want your paper to be reproducible by an LLM agent, write clearer implementation instructions and reduce cross-file dependencies in your codebase.\n\n*Paper: Shuo Yan et al., \"LMR-BENCH: Evaluating LLM Agent's Ability on Reproducing Language Modeling Research,\" EMNLP 2025. arXiv:2506.17335. All results cited from the published paper. PoC evidence in data/lab-runs/lmr-bench-llm-reproduce-nlp-research-code-paper-poc-2026.md.*", "url": "https://wpnews.pro/news/lmr-bench-can-llm-agents-reproduce-nlp-research-code-emnlp-2025", "canonical_source": "https://dev.to/jangwook_kim_e31e7291ad98/lmr-bench-can-llm-agents-reproduce-nlp-research-code-emnlp-2025-2j2c", "published_at": "2026-05-22 12:16:53+00:00", "updated_at": "2026-05-22 12:37:33.534800+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "large-language-models", "research", "open-source"], "entities": ["University of Texas at Dallas", "EMNLP 2025", "ACL", "NAACL", "AAAI", "arXiv", "ACL Anthology", "GitHub"], "alternates": {"html": "https://wpnews.pro/news/lmr-bench-can-llm-agents-reproduce-nlp-research-code-emnlp-2025", "markdown": "https://wpnews.pro/news/lmr-bench-can-llm-agents-reproduce-nlp-research-code-emnlp-2025.md", "text": "https://wpnews.pro/news/lmr-bench-can-llm-agents-reproduce-nlp-research-code-emnlp-2025.txt", "jsonld": "https://wpnews.pro/news/lmr-bench-can-llm-agents-reproduce-nlp-research-code-emnlp-2025.jsonld"}}