HumanEval

mentions: 23 · type: Organization

// recent coverage 23 mentions

06:27

2026-07-09

letsdatascience.com

artificial-intelligence

AI Benchmark Scores Overstate Model Performance

A PlainEnglish article warns that AI benchmark scores such as MMLU, HumanEval, and HellaSwag can overstate production readiness when leaderboard numbers are treated as proof of model quality. The comm…

01:11

2026-07-08

byteiota.com

large-language-models

NVIDIA Nemotron TwoTower: Run LLMs 2.42x Faster Now

NVIDIA open-sourced Nemotron-Labs-TwoTower, a diffusion language model that generates text 2.42x faster than its autoregressive counterpart without retraining original weights. The model achieves 98.7…

04:00

2026-07-07

arxiv.org

large-language-models

TACG: Trajectory-Aware Commit Gating for Diffusion Language Model Decoding

Researchers propose Trajectory-Aware Commit Gating (TACG), a training-free decoder for diffusion language models that uses trajectory-aware signals to decide when to commit tokens, improving accuracy …

13:06

2026-07-02

sourcefeed.dev

artificial-intelligence

Beyond Bug Fixing: The Rise of Senior-Level AI Coding Benchmarks

New AI coding benchmarks, including Snorkel AI's Senior SWE-Bench and Scale AI's SWE-Bench Pro, are replacing older evaluations like HumanEval and original SWE-bench to test senior-level engineering s…

05:00

2026-07-01

dev.to

machine-learning

RL-driven data mixing boosts evaluation scores

A reinforcement learning-driven data scheduler, AC-ODM, boosts MMLU performance by 27.5% relative and HumanEval pass@1 by 2.23× on a Pythia-1B model with only a 0.4% per-step wall-clock increase and 2…

15:17

2026-06-30

byteiota.com

large-language-models

Gemini 2.5 Pro Deep Think: What the Benchmarks Mean

Google's Gemini 2.5 Pro with Deep Think reasoning mode topped coding and reasoning benchmarks this week, scoring 82.4% on GPQA Diamond and 94.1% on HumanEval+, but the mode multiplies token costs by r…

05:00

2026-06-29

dev.to

artificial-intelligence

AI/ML Research Digest — Jun 27, 2026

Recent AI research introduces RL-driven agentic optimization using dense token-level supervision and progress advantage signals to stabilize training. PhysiFormer injects 3D geometric reasoning into d…

04:00

2026-06-26

arxiv.org

large-language-models

Dynamic-dLLM: Dynamic Cache-Budget and Adaptive Parallel Decoding for Training-Free Acceleration of Diffusion LLM

Researchers proposed Dynamic-dLLM, a training-free framework to accelerate Diffusion Large Language Models (dLLMs) by dynamically allocating cache budgets and calibrating decoding thresholds. The meth…

04:54

2026-06-24

dev.to

developer-tools

"Prove your AI-written code — or get the exact input that breaks it"

A developer released ishvacerto, an open-source Python tool that verifies AI-generated code by running it against doctests, user tests, or a reference implementation, returning a counterexample if the…

23:17

2026-06-18

blog.r-lopes.com

ai-safety

The Line Vibe Coding Can't Cross

Vibe coding—prompting an AI agent and shipping unread output—introduces a measurable defect tax that makes it unsuitable for mission-critical systems, with 45% of AI-generated code containing security…

00:00

2026-06-18

mindstudio.ai

artificial-intelligence

How to Compare AI Models Side by Side: Build Your Own Personal Model Leaderboard

A new guide explains how to build a personal AI model leaderboard by running blind comparisons and tracking results over time, arguing that public benchmarks are insufficient for task-specific perform…

16:07

2026-06-17

danlevy.net

large-language-models

LLM benchmarks are answering someone else's question

LLM benchmarks like MMLU and HumanEval are irrelevant for most businesses building AI products, as they measure generic performance rather than specific system tasks. Teams should instead build custom…

15:03

2026-06-17

dev.to

large-language-models

Claude 3.5 Sonnet Isn't Just an Upgrade. It's a New Baseline.

Anthropic released Claude 3.5 Sonnet, a new AI model that outperforms the previous top-tier Claude 3 Opus in intelligence, speed, and cost. The model achieves a 64% solve rate on internal agentic codi…

04:59

2026-06-17

dev.to

large-language-models

Kog hits 3K t/s on MI300X, no kernel switches — test it now

Kog AI achieved over 3,000 output tokens per second per request for an FP16 2B model on a single 8× MI300X node using a monokernel that eliminates per-token kernel launches. The technique collapses th…

09:35

2026-06-14

dev.to

large-language-models

Running Chinese LLMs at Scale: A Cloud Architect's Notes

A cloud architect evaluated four Chinese LLM families—DeepSeek, Qwen, Kimi, and GLM—in a multi-region production pipeline serving thousands of requests per second via Global API's unified endpoint. De…

20:09

2026-06-04

mendral.com

ai-agents

How we know if our agent is right

Mendral, an AI DevOps agent developer, cannot provide a single accuracy metric for its CI failure diagnosis agent despite processing 36,564 investigations across 5.7 million CI jobs and 14.4 billion l…

04:00

2026-06-03

arxiv.org

large-language-models

Fast-dLLM++: Fréchet Profile Decoding for Faster Diffusion LLM Inference

Researchers have developed Fast-dLLM++, a training-free extension to diffusion large language models that accelerates inference by selecting parallel token commit sets based on the full sorted confide…

00:00

2026-06-03

mindstudio.ai

ai-agents

How to Use AI Agents to Build and Test LLM Benchmarks: Lessons from Claude Opus 4.8

Anthropic's Claude Opus 4.8 autonomously built a full economic simulation benchmark from scratch, designing test scenarios, generating ground-truth outcomes, and writing evaluation rubrics without hum…

17:11

2026-05-30

github.com

ai-agents

The future will be millions agents running task everyday?

A new benchmark comparing agent runtime performance across C++, Python, TypeScript, and Rust found that C++ achieved a peak memory footprint of approximately 93 MiB while running 100 concurrent coding…

00:00

2026-05-28

mindstudio.ai

large-language-models

Claude Opus 4.7 vs GPT 5.5 on the DeepSuite Benchmark: Real-World Coding Results

Claude Opus 4.7 outperformed GPT 5.5 on repository understanding and multi-file generation tasks in the DeepSuite benchmark, a new coding evaluation designed to mirror real-world developer workflows. …

// co-occurs with top 8 entities

MMLU (8) · SWE-bench (6) · MBPP (5) · GSM8K (4) · GitHub (3) · OpenAI (3) · DeepSuite (2) · MATH (2)