

SWE-bench

mentions: 36 · type: Organization

// recent coverage 36 mentions

06:27

2026-07-09

letsdatascience.com

artificial-intelligence

AI Benchmark Scores Overstate Model Performance

A PlainEnglish article warns that AI benchmark scores such as MMLU, HumanEval, and HellaSwag can overstate production readiness when leaderboard numbers are treated as proof of model quality. The comm…

02:22

2026-07-09

github.com

ai-tools

I was tired of all false claims, so I built Atelier, honest 30% saving on CC.

Atelier, a 30-second install tool that sits underneath Claude Code to reduce token waste, claims 30% savings on AI coding costs by providing better search, shorter file reads, compact command output, …

22:03

2026-07-07

sourcefeed.dev

large-language-models

Claude Opus 4.7: Engineering the Agentic Loop

Anthropic released Claude Opus 4.7 on April 16, 2026, a targeted upgrade focused on making agentic coding loops and multi-step tool use production-ready. The model achieves 87.6% on SWE-bench Verified…

22:01

2026-07-07

dev.to

large-language-models

GLM 5.2 vs Claude Fable 5: agentic coding at a fraction of the cost

Z.ai's open-weight GLM 5.2 completed an agentic coding task in 17 minutes at $2.76, while Anthropic's Claude Fable 5 finished in 9 minutes at over $10, with comparable output quality. The open model c…

19:39

2026-07-07

dev.to

large-language-models

Claude Opus 4.7: Benchmarks, xhigh Effort Level, and What Changed

Anthropic released Claude Opus 4.7 on April 16, 2026, as its most capable public model, featuring a 13% improvement on coding benchmarks, a new xhigh effort level for agentic tasks, and high-resolutio…

14:09

2026-07-07

byteiota.com

large-language-models

Claude Sonnet 5 Migration: Three Breaking API Changes

Claude Sonnet 5 launched June 30 as the default model across Claude Code, Free, and Pro plans, but developers face three breaking API changes: sampling parameters (temperature, top_p, top_k) now retur…

13:06

2026-07-02

sourcefeed.dev

artificial-intelligence

Beyond Bug Fixing: The Rise of Senior-Level AI Coding Benchmarks

New AI coding benchmarks, including Snorkel AI's Senior SWE-Bench and Scale AI's SWE-Bench Pro, are replacing older evaluations like HumanEval and original SWE-bench to test senior-level engineering s…

14:31

2026-07-01

devblogs.microsoft.com

artificial-intelligence

What AI benchmarks are not telling you

Public AI benchmarks like SWE-bench measure performance on popular open-source repositories but fail to predict how models will perform on proprietary codebases, team-specific conventions, and real-wo…

15:17

2026-06-30

byteiota.com

large-language-models

Gemini 2.5 Pro Deep Think: What the Benchmarks Mean

Google's Gemini 2.5 Pro with Deep Think reasoning mode topped coding and reasoning benchmarks this week, scoring 82.4% on GPQA Diamond and 94.1% on HumanEval+, but the mode multiplies token costs by r…

17:16

2026-06-29

dev.to

large-language-models

RAG for codebases is hard. Trusting the answer is harder.

A developer argues that retrieval-augmented generation (RAG) for codebases improves context but not verifiability, citing that 30% of failed SWE-agent runs still claimed success. They introduce 'truth…

23:20

2026-06-25

letsdatascience.com

ai-tools

GitHub evaluates Copilot agentic harness performance

GitHub announced two harness-level improvements to Copilot agentic sessions—prompt caching achieving 94% cache hit rates and deferred tool loading—plus a new Auto model selection feature using its HyD…

12:30

2026-06-24

andrewjesson.com

ai-agents

When Does Data Help Automated Context Engineering?

In four of seven tested applications, Claude Code improves other AI agents without training data as well as it does with data. Data helps only where Claude Code's prior knowledge of the task run…

03:27

2026-06-22

github.com

large-language-models

Show HN: I applied Lyapunov stability theory to detect when LLM agents spiral

A developer released state-harness, an open-source Python library that uses Lyapunov stability theory to detect and classify failure patterns in multi-turn LLM agents without extra LLM calls. The tool…

00:00

2026-06-22

epics.tech

ai-policy

The Outcry Grows, but the Capital Keeps Flowing

Institutional backlash against AI is mounting across public sector, media, and education, with New York City council members urging a classroom AI pause and a German newsroom caught using AI to write …

06:51

2026-06-19

unsiloed.ai

developer-tools

Claude Code and Codex as one pipeline

A technical guide argues that developers should run both Claude Code and OpenAI Codex as a single pipeline rather than choosing one, based on two months of testing on large codebases. Benchmarks show …

02:35

2026-06-19

deepswe.datacurve.ai

ai-research

DeepSWE v1.1

DeepSWE v1.1 updates the benchmark for long-horizon engineering tasks with isolated verification and structured test reports, making results more reproducible and harder to game. Pass rates remain clo…

17:37

2026-06-18

mroczek.dev

developer-tools

The Token Compression Illusion: Why I'm Skeptical of RTK

RTK, a tool that compresses terminal output for LLM agents, claims to cut token usage by 60-90% but faces skepticism due to misleading savings metrics, silent failure risks, lack of accuracy benchmark…

14:00

2026-06-18

github.com

developer-tools

clawmark: open-source CLAUDE.md A/B Testing CLI tool

Clawmark, an open-source Rust CLI tool, enables A/B testing of CLAUDE.md files by evaluating two variants against five SWE-bench Lite tasks using Claude and Docker. The tool generates a comparison rep…

22:16

2026-06-16

github.com

artificial-intelligence

Synaxi-predict: I'm trying to predict token cost before it happens

Synaxi-predict, a new tool from Synaxi, predicts the token cost, turn count, and pass rate of a Claude Code task before execution, enabling users to select the optimal model and avoid wasted tokens. T…

15:31

2026-06-16

pub.towardsai.net

large-language-models

TAI #209: Claude Fable 5 Arrived, Then the US Government Took It Offline

Anthropic released Claude Fable 5 on June 9, but the U.S. Commerce Department issued an export-control directive barring foreign nationals from accessing it, forcing Anthropic to take both Fable 5 and…


// co-occurs with top 8 entities

Anthropic (15), Claude Code (9), GitHub (8), HumanEval (6), Claude Fable 5 (5), Claude Opus 4.8 (4), Claude Opus 4.7 (3), OpenAI (3)