An eval is just a test that returns a value

Source: dev.to · Published: 2026-06-25T10:41Z · Topic: developer-tools

A developer created apte, an async-first test framework for Python that unifies unit tests and LLM evals in the same engine with shared fixtures. The framework uses typed dependencies with Annotated and Use, supports native async and parallelism, and runs pytest suites up to 1.73x faster. It also provides eval-specific features like scoring, aggregation, and markdown artifacts.

apte is an async-first test framework for Python where your unit tests and your LLM evals run in the same engine, with the same fixtures.

The whole design follows from one idea. An eval is just a test that returns a value. A unit test asserts and passes or fails. An eval returns an output, scores it, records the score. Same machinery underneath, same fixtures feeding both. There is no real reason they should live in separate tools with separate CLIs and no shared setup.

It started with pytest's fixtures, not with evals.

I like pytest, but fixtures are resolved by name, with no types and no Ctrl+Click. You write def test_x(db): and nothing tells you where db came from. The IDE can't follow it, a new reader can't either, and a typo in the name surfaces at runtime.

I wanted dependencies declared in the signature, the way FastAPI does it:

def test_user(db: Annotated[Database, Use(database)]): ...

Use(database) points at the fixture. Ctrl+Click jumps to it, the type is right there, and a wrong name is an error at collection time instead of a confusing failure later.

Then LLM features landed in my work and split my tooling in two. pytest for the code, a separate eval library for the model parts. Two config systems, two CLIs, and the expensive setup my tests already had (a client, a database, a pipeline) was not reachable from the eval side. That separation never made sense, so apte runs both.

A deterministic test and an LLM eval, same file, same fixtures:

from typing import Annotated
from apte import ApteSession, Use, ForEach, From
from apte.evals import EvalCase, EvalSuite
from apte.evals.evaluators import contains_keywords

session = ApteSession()

# Agent is the application's own class (not part of apte); import or define it above.
def make_agent() -> Agent:
    return Agent(model="your-model")

@session.test()
async def test_parsing(agent: Annotated[Agent, Use(make_agent)]):
    assert agent.parse("2+2") == 4

chatbot = EvalSuite("chatbot")
session.add_suite(chatbot)

cases = ForEach([
    EvalCase(name="capital_fr", inputs="Capital of France?", expected="Paris"),
])

@chatbot.eval(evaluators=[contains_keywords(keywords=["paris"])])
async def eval_answer(
    case: Annotated[EvalCase, From(cases)],
    agent: Annotated[Agent, Use(make_agent)],
) -> str:
    return await agent.ask(case.inputs)

Run each side from the command line:

apte run app:session      # tests, pass or fail
apte eval app:session     # evals, scored, recorded to .apte/history.jsonl

The same make_agent feeds the test and the eval. Evals return scores, get aggregated (mean, p50, p95), and each case writes a markdown artifact for when something regresses. Beyond simple checks like keyword matching, you can score with an LLM judge and typed metrics. The history file compares a run against the previous one.
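For intuition about what "scored and aggregated" means, the sketch below scores outputs with a keyword check and summarizes them as mean, p50, and p95. It is an illustration under assumptions, not apte's code: keyword_score, percentile, and aggregate are hypothetical helpers, the fraction-of-keywords score is a guess at one reasonable convention, and the nearest-rank percentile method may differ from what apte uses.

```python
import math
from statistics import mean


def keyword_score(output: str, keywords: list[str]) -> float:
    """Fraction of keywords found in the output, case-insensitively (assumed convention)."""
    text = output.lower()
    hits = sum(1 for kw in keywords if kw.lower() in text)
    return hits / len(keywords)


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the value at rank ceil(pct% of n) in sorted order."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def aggregate(scores: list[float]) -> dict[str, float]:
    """Summarize per-case scores into the three statistics named in the text."""
    return {
        "mean": mean(scores),
        "p50": percentile(scores, 50),
        "p95": percentile(scores, 95),
    }


# Made-up model outputs for the "Capital of France?" case.
outputs = [
    "The capital of France is Paris.",
    "Paris",
    "I am not sure.",
    "It is Lyon.",
]
scores = [keyword_score(o, ["paris"]) for o in outputs]
summary = aggregate(scores)
print(summary)
```

Two of the four made-up outputs mention the keyword, so the mean is 0.5. Comparing a summary like this against the one from the previous run is what turns a single score into a regression signal.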

The rest comes from being its own runner: scoped fixtures (session, suite, test), tag propagation (tag a fixture, every test that uses it inherits the tag), native async with no plugin, parallelism with -n, and a clear split between a failed assertion (your bug) and a fixture error (your infra).

apte is faster than pytest on real suites, for two reasons. Async tests run as coroutines on a single event loop instead of pytest-asyncio's loop-per-test, and parallelism is built in.

Ported 1:1 and run with -n8, httpx's full suite (1285 tests) runs about 1.6x faster than pytest, and starlette's (505 tests, ~94% sync) about 1.73x, both passing the same tests as the originals. Single-threaded the two are close; the gap opens with parallelism and grows with how much each test waits on I/O. On a synthetic I/O-bound benchmark it goes much higher, but the real-suite numbers are the ones to trust. The async comparison is against the standard pytest plus pytest-asyncio setup, which is what most people run. The full benchmark, with methodology and a one-command repro, lives in the repo.
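Why waiting on I/O matters can be shown with plain asyncio. The sketch below is not a benchmark of apte or pytest-asyncio and says nothing about how apte schedules tests internally; fake_io_test and both runner functions are hypothetical. It only contrasts starting a fresh event loop for each coroutine with letting several coroutines overlap their waits on one loop.

```python
import asyncio
import time


async def fake_io_test(delay: float = 0.05) -> bool:
    """Stand-in for a test that mostly waits on I/O."""
    await asyncio.sleep(delay)
    return True


def run_loop_per_test(count: int) -> float:
    """Create one event loop per test and run the tests one at a time."""
    start = time.perf_counter()
    for _ in range(count):
        asyncio.run(fake_io_test())
    return time.perf_counter() - start


def run_shared_loop(count: int) -> float:
    """Run all tests as coroutines on a single loop so their waits overlap."""
    async def main() -> list[bool]:
        return await asyncio.gather(*(fake_io_test() for _ in range(count)))

    start = time.perf_counter()
    results = asyncio.run(main())
    assert all(results)
    return time.perf_counter() - start


sequential = run_loop_per_test(8)
shared = run_shared_loop(8)
print(f"loop per test: {sequential:.2f}s, shared loop: {shared:.2f}s")
```

With eight simulated tests that each wait 50 ms, the one-at-a-time version takes at least the sum of the waits, while the shared loop takes roughly one wait. Real suites do far more than sleep, which is why the measured gains above are much smaller than a toy like this suggests.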

For evals none of this matters much. An LLM call dwarfs framework overhead. It just means apte does not ask you to trade speed for the eval features.

apte is its own runner, not a pytest plugin. That is the central tradeoff. You get explicit DI, native async and built-in evals, but you leave pytest's plugin ecosystem and fifteen years of community behind. If implicit name-based fixtures work for you, there is no reason to switch.

It is also young and small. One main author, a few months old, an API that still moves. The test engine is the solid part. The eval layer works but is thinner than dedicated tools like DeepEval or Opik: keyword and judge-based scoring with history are there, but things like trajectory or tool-use evaluation are not there yet. Treat it as a credible foundation to build on rather than a finished product, and budget for reading the source if it becomes load-bearing.

Use it if you want explicit, typed dependencies, native async without a plugin, and your evals next to your tests, and you are fine being an early user. Skip it if you have a large pytest suite and a team that lean on the ecosystem, or you need a recognized standard.

pip install apte

Repo: https://github.com/renaudcepre/apte

Docs: https://renaudcepre.github.io/apte/

It is alpha, which is when feedback is worth the most, especially on the eval API.

The name: protest was taken on PyPI, so it became apte, for Async Python Tests and Evals.


Read original on dev.to → dev.to/renaudcepre/an-eval-is-just-a-test-that-r…
