LLM-Based Scientific Peer Review: Methods, Benchmarks, and Reliability Challenges

wpnews.pro

[ARTICLE · art-38769] src=arxiv.org ↗ pub=2026-06-25T04:00Z topic=large-language-models verified=true sentiment=· neutral

A new survey analyzes the use of large language models (LLMs) for scientific peer review, focusing on critique generation and score prediction. The study identifies reliability, robustness, and security risks such as prompt injection and data poisoning, and calls for developing trustworthy AI-assisted evaluation systems.

Published Jun 25, 2026 · 1 min read

arXiv:2606.25057v1

Abstract: The rapid growth of scientific submissions has pushed traditional peer review toward its scalability limits, motivating the exploration of large language models (LLMs) as intelligent automated evaluation assistants. Although recent studies show that LLMs can generate fluent critiques and approximate reviewer scores, their reliability, robustness, and security as decision-support systems remain insufficiently understood. This survey offers a systems-level analysis of LLM-based scientific peer review, focusing on two core evaluative functions: critique generation and score prediction. We present a structured taxonomy of modeling approaches (including prompt-based, supervised, retrieval-augmented, and alignment-optimized approaches), and synthesize empirical findings across existing benchmarks. We analyze dataset constraints, evaluation shortcomings, and domain concentration biases that limit current assessment practices. Beyond performance metrics, we identify emerging robustness risks, including prompt injection, data poisoning, retrieval vulnerabilities, and reward hacking, which expose automated review pipelines to strategic manipulation. From a data mining perspective, we outline key open challenges in modeling subjective disagreement and cross-domain generalization. By reframing automated peer review as a high-stakes, multi-objective decision problem, this survey provides a roadmap for developing robust, transparent, and trustworthy AI-assisted scientific evaluation systems.

source & further reading

arxiv.org — original article
