
grep -l @braintrust /news/*.json | wc -l → 29

Braintrust

mentions: 29, type: Organization

// recent coverage 29 mentions

20:06

2026-07-25

dev.to

artificial-intelligence

I built a tool to prove my multi-agent harness was worth it. It told me it wasn't.

A developer built a tool to measure whether multi-agent scaffolding improves coding task performance, only to find that adding a planner, two drafters, and a judge made results worse (80% vs. 95%) at …

00:00

2026-07-24

chaliy.name

ai-tools

You Do Not Need a Server for Evals

Evals for AI projects like coding agents and sandboxed bash do not require a dedicated server or platform, according to developer Everruns. Datasets, runners, and results can be stored and versioned d…

19:27

2026-07-22

dev.to

artificial-intelligence

An LLM judge is a biased instrument, not a measurement

A developer found that an LLM judge gave opposite results for the same eval run on consecutive days due to position bias, one of three systematic biases documented in the 2023 paper "Judging LLM-as-a-…

18:34

2026-07-20

news.ycombinator.com

artificial-intelligence

Ask HN: How do I reliably eval my AI models

A Hacker News user asks how to reliably evaluate AI models, mentioning Braintrust and Arize as potential tools, and inquires about sourcing experts to create gold datasets.…

12:00

2026-07-14

machinelearningmastery.com

large-language-models

LLM Evaluation Frameworks Compared: How to Actually Measure What Your Model Does

A new comparison of LLM evaluation frameworks RAGAS, DeepEval, and Promptfoo reveals that the LLM-as-a-judge mechanism they all rely on has measurable biases—position bias, self-preference bias, and v…

22:40

2026-07-13

dev.to

ai-agents

Your AI agent says "done." Who checks that from outside the agent?

A developer at nokaze highlights a failure mode in AI agents where they report completion without actually finishing the task, citing research showing 45-48% of failures on tau2-bench were confidently…

16:03

2026-07-13

dev.to

artificial-intelligence

The Evaluation Debt You Don't Know You Have: Why Agent Evals Fail in Production

A developer warns that 38% of AI teams cite 'evaluation debt' as their primary blocker, where offline agent evals fail to catch production failures because they measure past data rather than shifting …

09:31

2026-07-12

startupfortune.com

ai-agents

How to Evaluate AI Agents Before You Ship Them to Real Users

Most founders shipping AI agents lack systematic evaluation methods, leading to public failures like Chevrolet's chatbot agreeing to sell a car for $1 and McDonald's AI drive-thru adding bacon to ice …

21:19

2026-07-07

rightmodeler.com

large-language-models

Get recommended a cheaper model with this skill

Rightmodeler replays real agent traces through cheaper models, judges outputs against shipped versions, and shows cost-saving opportunities with evidence. The tool supports traces from Claude Code, Co…

17:26

2026-07-07

dev.to

large-language-models

When an LLM answer is wrong, the trace is where you look. Some tools make that easy.

A developer evaluates six LLM observability tools, including Helicone, LangSmith, Langfuse, Future AGI, and Braintrust, for debugging hallucinated answers in production. The key requirement is the ability to quic…

00:00

2026-07-02

pydantic.dev

ai-agents

Observability tools agents want

Pydantic Logfire and other observability platforms are shipping MCP servers, CLIs, and SDKs that allow AI agents to directly inspect traces, logs, prompts, evals, and dashboards, shifting the focus fr…

18:05

2026-07-01

newsletter.port.io

ai-agents

How to build a context lake that saves you 80% on token costs

Port's experiment found that routing AI agents through a structured context lake instead of direct-to-MCPs cut token costs by 58%, and adding a skill file brought savings to 80%. The context lake pre-…

00:00

2026-06-30

1password.com

ai-safety

Braintrust's Ankur Goyal: Code review doesn't cover prompts

Braintrust CEO Ankur Goyal warned that prompt changes in AI agents often bypass security review, creating a gap where behavior-shaping updates can alter what agents do, what data they access, and whic…

21:37

2026-06-26

dev.to

large-language-models

The Langfuse migration that cost us a sprint: how I now budget LLM observability

A developer running reliability for a small team shipping LLM features compared six Langfuse alternatives—Helicone, Arize Phoenix, LangSmith, Braintrust, Laminar, and Future AGI traceAI—focusing on bo…

00:00

2026-06-26

vercel.com

ai-agents

Trace and debug eve agent sessions with Vercel Observability

Vercel launched Agent Runs, a new observability feature for eve agent sessions that provides curated views of triggers, duration, token usage, and step-level details without OpenTelemetry setup. The t…

17:51

2026-06-25

dev.to

large-language-models

I checked six LLM-as-judge tools against human labels. The scoreboard was the wrong thing to read.

A developer evaluated six LLM-as-judge tools—DeepEval, Confident AI, Evidently, Braintrust, Promptfoo, and Future AGI—and found that none of them prioritize validating judge outputs against human labe…

22:56

2026-06-18

dev.to

large-language-models

LLM observability tools are blind to the voice layer. Here is what I checked 6 of them for.

A developer evaluated six LLM observability tools—Langfuse, Helicone, Arize Phoenix, LangSmith, Braintrust, and Laminar—for their ability to monitor the audio layer in voice agents, not just LLM calls…

04:40

2026-06-16

discuss.huggingface.co

large-language-models

Metrics for Text Generation from T5 Model

A user training a T5 model asked for alternative metrics to Exact Match for evaluating text generation. Community members suggested ROUGE-1, ROUGE-2, and BLEU, and recommended Braintrust for running e…

12:04

2026-06-15

lennysnewsletter.com

ai-agents

How Braintrust uses AI agents, evals, and CI to ship better software | Ankur Goyal

Ankur Goyal, founder and CEO of Braintrust, explained how AI agents can run exhaustive benchmarks and perform deep technical work like database optimization, and argued that evals are essential for sh…

09:25

2026-06-13

dev.to

ai-agents

AI Agent Architecture: Why Process-Level Resilience Beats Proxy Gateways

A developer argues that embedded SDKs for AI agent reliability outperform proxy gateways by eliminating network latency and operational overhead. The comparison shows embedded SDKs add ~0ms latency ve…


// co-occurs with top 8 entities

LangSmith (15), Langfuse (9), Helicone (6), OpenTelemetry (5), Arize (4), Arize Phoenix (4), DeepEval (4), Cursor (3)

// topics top 6 topics

ai agents (18), ai tools (17), large language models (16), developer tools (16), ai infrastructure (14), ai products (9)