{"slug": "does-your-paper-really-suck", "title": "Does Your Paper Really Suck?", "summary": "Oded Rechavi of QED Science claims that papers not in the top 1% of their QED score 'suck', but a review of the company's white paper finds insufficient evidence that the AI-generated score accurately measures scientific quality. The QED score, produced by prompting LLMs to evaluate originality and validity, is compared to journal rank in three validation studies, but methodological flaws and lack of transparency undermine the claims of superior accuracy and reduced bias.", "body_md": "# Does your paper really suck?\n\nOded Rechavi, at QED Science, believes that if your paper is not in the\ntop 1% of their QED score then it\n[\"sucks\"](https://x.com/qedScience/status/2070119903693078850?s=20). But what is this QED score and what is its purpose? Does it really\nmeasure scientific quality? If a paper is not in the top 1%, does it really\nsuck?\n\nThese are important questions because scientists are increasingly overwhelmed by the volume of new work posted on preprint servers and published in journals. As a result, traditional quality signals used for triaging papers, such as journal, conference venue, and institution, are becoming less reliable. AI further compounds this problem by making it easy to produce plausible scientific writing at scale. Papers are longer, figures are denser, and the existence of a paper is no longer sufficient evidence that it represents substantial scientific work.\n\nIn response, companies like QED Science are building AI tools to help scientists identify quality work. QED uses Large Language Models (LLMs) to review scientific papers and provide AI feedback. 
Many scientists report that the feedback is useful and often resembles comments received during human peer review.\n\nQED recently released a\n[white paper](https://www.qedscience.com/blog/qed-score-a-validated-ai-based-quality-metric)\nthat goes one step further and describes the \"QED Score\", a single\nnumber that is intended to measure a paper's quality. The QED score is\ngenerated by prompting a collection of LLMs to review a paper for\n\"originality\" and \"validity\". The resulting evaluations are combined\ninto a single score, the QED score. In their white paper, the authors\nclaim that the QED score is a \"more accurate, faster, and less biased\nestimate of paper quality than journal rank.\" The authors present three\nvalidation studies, all of which compare the QED score against the\n[SCImago Journal Rank (SJR)](https://en.wikipedia.org/wiki/SCImago_Journal_Rank), a journal-level metric based on citation data. The first study\ncompares QED and SJR against a corpus of expert-assigned labels\n(\"Limited\", \"Satisfactory\", and \"Strong\"). The second compares QED\nscores for 2,879 bioRxiv preprints with the SJR of the journals in which\nthose papers were eventually published. The third asks experts to choose\nbetween pairs of papers where QED and SJR disagree most strongly.\n\nIn this review, I evaluate the evidence supporting the QED score as a measure of scientific quality. While QED clearly provides a much faster review than traditional peer review, I find that the evidence presented does not support the authors' claims that the QED score is a more accurate or less biased measure of scientific quality.\n\n## Case study 1 is methodologically opaque and does not effectively demonstrate that the QED score measures quality\n\nIn case study 1, the authors obtain a curated dataset of 975 published papers labelled \"Limited\", \"Satisfactory\", or \"Strong\" by a panel of expert reviewers whose identities are not disclosed. 
Each paper received a label based on validity and originality, the same criteria used to generate the QED score. The authors then asked whether the QED or the SJR score better predicted these labels. QED achieved an AUC of 0.863 versus SJR's 0.804 for distinguishing \"Limited\" from \"Satisfactory + Strong\" papers, and 0.782 versus 0.774 for distinguishing \"Strong\" from \"Satisfactory + Limited\" papers.\n\nThese values cannot be meaningfully interpreted without the underlying data and methodology. The paper does not report the distribution of labels, whether the expert reviewers who generated the benchmark labels were blinded to journal, author, or institutional identity, nor do they provide any data or code to reproduce the analysis. The authors also provide no guarantee that these papers were excluded from the training data of the LLMs used to evaluate them. Therefore, case study 1 does not establish that the QED score accurately measures scientific quality.\n\n## Case study 2 provides inconsistent evidence that the QED score measures quality\n\nThe second case study compares QED scores for 2,879 bioRxiv preprints with the SJR score of the journals where those preprints were eventually published. Across all fields, the authors report a Spearman correlation of 0.63. Within individual fields, however, the correlations ranged from 0.78 (Genetics) to 0.39 (Systems Biology).\n\nThe authors describe the overall agreement as \"substantial\", but explain weaker agreement in some fields by arguing that the SJR score is a noisy proxy for quality. This argument is internally inconsistent. If the SJR score is a reasonable proxy for scientific quality, then the weaker agreement across fields suggests that the QED score is a weak proxy for quality. If the SJR score is a noisy proxy for scientific quality, then agreement with the SJR score cannot be used to validate the QED score. 
Either way, by the authors' own admission, this analysis does not establish the QED score as an accurate measure of quality.\n\n## Case study 3 contains several uncontrolled and unexplained sources of variation that may bias the QED score's validation\n\nThe third study asks 15 domain experts to compare papers where the QED and SJR scores disagree most strongly. For each paper the authors subtract log(SJR + 1) from the QED score, compute pairwise contradictions, and retain the 100 strongest disagreements. Only 70 of these pairs were reported with \"confident\" expert judgments; the remaining 30 were discarded. Among the retained pairs, experts preferred the higher QED-scored paper roughly three times as often as the higher SJR-scored paper.\n\nThis experiment introduces several uncontrolled and unexplained sources of variation. First, the QED score is a paper-level metric assigned to a preprint, whereas the SJR score is a journal-level metric assigned after peer review. Second, comparisons are made between two different papers where expert preference may depend on writing style, topic, or familiarity with the field rather than scientific quality. Finally, the authors do not explain how \"confidence\" was defined or why 30% of comparisons were excluded. Consequently, case study 3 does not provide sufficient evidence for the superiority of the QED score.\n\n## The QED score exhibits geographical bias\n\nThe QED score is not just an internal metric. QED publicly released rankings of the top 1% of bioRxiv preprints, and this public release reveals substantial geographic bias against African and South American scientists. (As a side note, although the white paper states that QED scored 57,455 bioRxiv preprints, the publicly accessible website contains 53,938 domain-assigned preprints: 571 in the top 1% and 53,367 in the remaining 99%. The discrepancy is not explained.)\n\nThe QED website assigns papers geographic regions based on author affiliations. 
A paper may belong to multiple regions (e.g. North America, Europe, Asia, Australia, South America, Africa), meaning a single author is sufficient for a paper to be classified as African. Filtering papers by geographical region on the QED website produces a striking result: only three papers in the top 1% are classified as African. Yet none is led primarily by African institutions.\n\nThe first paper in the top 1%,\n[TENM4 is an essential transduction component for touch](https://www.biorxiv.org/content/10.1101/2024.10.10.617546v2), has 20 authors with primary affiliations in Germany; it is classified\nas African because one author has a secondary affiliation in Egypt. The\nsecond paper,\n[Memory Regulatory T Cells Reprogram into Protective Tfh-like\nEffectors in Recurrent Malaria](https://www.biorxiv.org/content/10.1101/2025.10.15.682462v1), has ten authors, only one of whom has an African affiliation. The\nthird,\n[Modular and redundant genomic architecture underlies combinatorial\nmechanism of speciation and adaptive radiation](https://www.biorxiv.org/content/10.1101/2025.07.07.663194v2), has eleven authors, again with only one African-affiliated author. In\nother words, the top 1% contains no paper led primarily by African\ninstitutions.\n\nIn contrast,\n[Inflammatory Biomarkers of Asymptomatic and Symptomatic\nTuberculosis](https://www.biorxiv.org/content/10.1101/2025.10.26.684319v1)\naddresses a disease that disproportionately affects sub-Saharan Africa\nand includes 28 authors with primary African affiliations and only six\nwith primary European or North American affiliations. Despite being far\nmore representative of African science, it was ranked in the bottom 99%.\n\nTaking this a step further, the regional classifications exhibit significant biases. Using the regional classifications reported by QED, African classifications (3 vs. 933; p = 0.004) and South American classifications (11 vs. 
2,204; p = 0.00055) are significantly underrepresented among papers in the top 1% relative to the remaining 99%.\n\n*Note: Ran Blekhman published\na complementary analysis\ndemonstrating that the QED top 1% reproduces familiar institutional\nbiases.*\n\n## An important sanity check\n\nAs an informal experiment, I submitted the QED white paper to QED itself. The system assigned it a QED score of 46 and identified several methodological concerns. While this is not a formal validation - the system was not designed to evaluate methodological white papers - it is an interesting observation that QED itself identified some, but not all, of the methodological concerns discussed throughout this review. I've included a link to the review report generated by QED here:\n\n[QED review report for the QED Score white paper (PDF)](qed-score-review-report.pdf)\n\n## How do we effectively triage papers?\n\nThe rapid growth of scientific publishing is a real problem. AI has lowered the cost of producing convincing scientific writing, making it increasingly difficult to identify work worth reading. My concerns with QED are not with its use of AI, but rather that the evidence presented does not justify the claims about the score. We scientists need better systems for organizing, evaluating, and consuming scientific literature.\n\nI believe AI can be part of that solution. Many researchers, myself\nincluded, have found LLMs useful for\n[indexing scientific papers and providing structured feedback](https://www.biorxiv.org/content/10.64898/2026.01.30.702911v1). But assigning every paper a single number is a much stronger claim\nthan generating useful feedback. Compressing years of scientific work\ninto a single number inevitably discards too much information that\nscientists care about. 
Such a score should not be treated as a measure\nof scientific quality without transparent methodology and rigorous\nindependent validation.\n\nAs this critique explains, the QED score has not been rigorously validated. The white paper's three case studies do not demonstrate that it is a more accurate or less biased measure of scientific quality, and the released top 1% rankings exhibit significant biases. Moreover, the authors explicitly acknowledge that \"precision of ranking within the top 1% has not been independently validated,\" despite presenting the top 1% as the central output of the system. By the standards the authors apply to other scientific papers, the QED white paper apparently \"sucks.\" Should scientists trust a score whose own validation does not meet the standards it claims to enforce?", "url": "https://wpnews.pro/news/does-your-paper-really-suck", "canonical_source": "https://www.sina.bio/posts/does-your-paper-really-suck.html", "published_at": "2026-06-28 15:32:44+00:00", "updated_at": "2026-06-28 15:35:17.022352+00:00", "lang": "en", "topics": ["artificial-intelligence", "large-language-models", "ai-tools", "ai-research"], "entities": ["QED Science", "Oded Rechavi", "SCImago Journal Rank", "bioRxiv"], "alternates": {"html": "https://wpnews.pro/news/does-your-paper-really-suck", "markdown": "https://wpnews.pro/news/does-your-paper-really-suck.md", "text": "https://wpnews.pro/news/does-your-paper-really-suck.txt", "jsonld": "https://wpnews.pro/news/does-your-paper-really-suck.jsonld"}}