[ARTICLE · art-22081] src=dev.to pub= topic=large-language-models verified=true sentiment=· neutral

I spent hours writing unit tests – so I made an LLM do it (and learned what not to do)

A developer built an automated test generation system using an LLM after spending hours writing repetitive unit tests for 30+ similar validation functions. The engineer created a Python script that feeds function source code into an LLM API, requests comprehensive pytest test cases covering normal, edge, and error scenarios, then validates the output with `ast.parse()` before writing it to a test file. The approach successfully generated tests for a `validate_email` function, producing cases for valid emails, plus-addressing, missing @ symbols, and empty strings.

4 min read · published Jun 5, 2026

About a month ago I hit that point in a project where the business logic was solid, the API endpoints were clean, but the test file was a pathetic stub. I had 30+ similar validation functions, each one a slight variation on “does this field exist?”, “is it the right type?”, “does it pass this custom rule?”. The manual approach would mean copying the same `assert` pattern dozens of times, changing only the function name and the test input. My brain started melting just thinking about it.

I’m a big believer in testing, but I’m also a big believer in not doing boring work twice. So I started looking for ways to automate test generation.

My first instinct was to write a Python generator that parsed the function signatures and spat out basic asserts. Something like:

def generate_test(func_name, params):
    lines = [f"def test_{func_name}():"]
    for p in params:
        lines.append(f"    assert {func_name}({p}) is not None")
    return "\n".join(lines)

This worked only for the most trivial cases. As soon as the functions had side effects, required fixtures, or needed specific edge-case values, the template became a nightmare of conditionals. Plus, what about the negative tests – the inputs that should raise errors? My generator didn’t know anything about the domain logic.

Next I tried a rule‑based approach with regular expressions. I wrote about 200 lines of heuristics to infer parameter types from docstrings. It sort of worked for one function, then broke completely on the next. I felt like I was rebuilding a tiny compiler for a language nobody uses.

I had a hunch that an LLM could do better if I gave it the right context. The idea was simple: feed the function source (plus docstring) into a language model, ask it to produce `pytest` test functions, and then validate the output before writing it to a file.

Here’s the core loop I ended up with:

import ast

import requests

BASE_URL = "https://ai.interwestinfo.com/v1"  # Example: LLM API
API_KEY = "your-key"

def generate_tests(source_code: str, max_retries=2):
    prompt = f"""You are an expert Python tester. Given the function below, write comprehensive pytest test functions covering:
- Normal cases
- Edge cases (empty, None, large values)
- Error cases (wrong types, out-of-range)

Do NOT use external libraries beyond pytest. Return ONLY valid Python code (no explanations).

Function:

{source_code}
"""

    for attempt in range(max_retries):
        response = requests.post(
            f"{BASE_URL}/chat/completions",
            headers={"Authorization": f"Bearer {API_KEY}"},
            json={
                "model": "gpt-4o-mini",  # Or whatever model you prefer
                "messages": [{"role": "user", "content": prompt}],
                "temperature": 0.3,
            },
            timeout=60,  # Without a timeout, a stalled request blocks forever
        )
        response.raise_for_status()
        content = response.json()["choices"][0]["message"]["content"]

        try:
            ast.parse(content)  # Raises SyntaxError on fences, prose, or unbalanced brackets
            return content
        except SyntaxError:
            if attempt == max_retries - 1:
                raise  # Out of retries: surface the parse error to the caller
    # Only reachable when max_retries < 1
    raise ValueError("max_retries must be at least 1")

The validation step is crucial. LLMs love adding markdown fences, stray prose, or unbalanced brackets. By parsing the output with `ast.parse`, I catch those before I write bad code to my test file. (Ordinary `#` comments are valid Python, so they pass through untouched.)

I pointed this at a `validate_email` function with three lines of logic. The LLM returned:

import pytest
from validation import validate_email

def test_valid_email():
    assert validate_email("user@example.com") is True

def test_valid_email_with_plus():
    assert validate_email("user+tag@example.com") is True

def test_no_at_symbol():
    assert validate_email("userexample.com") is False

def test_empty_string():
    assert validate_email("") is False

def test_none_input():
    with pytest.raises(TypeError):
        validate_email(None)

Not bad: it even guessed I wanted a `TypeError` for `None` (which my function did raise). I ran the tests and they passed. Success.
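The post does not show `validate_email` itself, so here is a purely hypothetical stand-in that is consistent with the five generated tests. The regex and the error message are assumptions made for illustration, not the original implementation.

```python
import re

# Assumed rule: something@something.something, with no spaces or extra "@".
_EMAIL = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def validate_email(value):
    """Hypothetical stand-in: True for a plausible address, False otherwise."""
    if not isinstance(value, str):
        raise TypeError("email must be a string")
    return _EMAIL.fullmatch(value) is not None
```

Saved as `validation.py`, a function like this would make the generated test file importable and all five cases pass.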

But it wasn’t all roses. For a complex function that involved a database query, the LLM generated tests that mocked things incorrectly. It assumed the function would call `db.fetch()` when in reality it used an async ORM. The generated tests were syntactically valid but semantically wrong.
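To make the mismatch concrete, here is a rough sketch of what a hand-written test for an async data-access function can look like. The function `fetch_user_email` and the `repo.get` method are hypothetical, since the post does not show the real code; the mock comes from the standard library's `unittest.mock.AsyncMock`, which returns awaitables the way an async ORM call would.

```python
import asyncio
from unittest.mock import AsyncMock


async def fetch_user_email(repo, user_id):
    """Hypothetical function under test: awaits an async repository call."""
    user = await repo.get(user_id)
    return user["email"] if user else None


def test_fetch_user_email_awaits_repo():
    repo = AsyncMock()  # attribute calls such as repo.get(...) are awaitable
    repo.get.return_value = {"email": "user@example.com"}

    assert asyncio.run(fetch_user_email(repo, 42)) == "user@example.com"
    repo.get.assert_awaited_once_with(42)


def test_fetch_user_email_missing_user():
    repo = AsyncMock()
    repo.get.return_value = None

    assert asyncio.run(fetch_user_email(repo, 7)) is None
```

A plain synchronous mock in place of `AsyncMock` would fail at the `await`, which is the kind of detail a model cannot infer unless the prompt tells it how the data layer works.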

Use LLMs for boilerplate, not for domain-specific logic. If your function requires deep knowledge of your database schema or business rules, the generated tests will be too generic. You’re better off hand‑writing those or providing a schema context in the prompt.

Prompt engineering matters more than the model. Adding "Do NOT include imports that don't exist in your project." and "Use pytest.raises for exceptions." dramatically improved the output quality.
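One way to keep such rules easy to tweak is to build the prompt from a list rather than a single hard-coded string. This is only a sketch: `build_prompt`, `DEFAULT_RULES`, and the optional `context` parameter (for schema notes or business rules) are hypothetical names, and the wording is adapted from the prompt shown earlier.

```python
DEFAULT_RULES = [
    "Do NOT use external libraries beyond pytest.",
    "Do NOT include imports that don't exist in your project.",
    "Use pytest.raises for exceptions.",
    "Return ONLY valid Python code (no explanations).",
]


def build_prompt(source_code, rules=DEFAULT_RULES, context=""):
    """Assemble a test-generation prompt from rules, optional context, and source."""
    parts = [
        "You are an expert Python tester. Given the function below, "
        "write comprehensive pytest test functions.",
        "Rules:",
        *[f"- {rule}" for rule in rules],
    ]
    if context:
        # Optional domain hints, e.g. a table schema or a note about the ORM.
        parts += ["Project context:", context]
    parts += ["Function:", source_code]
    return "\n".join(parts)
```

Calling `build_prompt(src, context="All DB access goes through an async ORM")` would then carry the domain hint along with the rules.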

Always validate the output. I parse the response with `ast.parse` and also run a quick `pytest --collect-only` on the generated file to catch any syntax or import errors before the full test run.

Temperature 0.2 – 0.4 is the sweet spot. Too high and it invents random test cases; too low and it repeats the same pattern ad nauseam.

For my validation functions, this approach saved about 20 minutes per function. Over 30 functions, that’s 10 hours I got back. The generated tests aren’t perfect – I still review every file – but they catch the obvious stuff, which is where many bugs hide.

If I take this further, I’d write a small CLI tool that takes a list of function names (or reads a module) and generates a test file for each, then opens a diff viewer so I can accept/reject chunks. That’s the next weekend project.
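Two building blocks for such a tool can be sketched with the standard library alone: pulling each top-level function's source out of a module, and producing a diff to review before anything is written. Both helper names here are hypothetical, and the actual CLI wiring and interactive accept/reject step are left out.

```python
import ast
import difflib


def function_sources(module_source: str) -> dict:
    """Map each top-level function name to its source text."""
    tree = ast.parse(module_source)
    return {
        node.name: ast.get_source_segment(module_source, node)
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }


def preview_diff(old: str, new: str, filename: str) -> str:
    """Unified diff between the existing test file and the generated one."""
    return "".join(
        difflib.unified_diff(
            old.splitlines(keepends=True),
            new.splitlines(keepends=True),
            fromfile=f"a/{filename}",
            tofile=f"b/{filename}",
        )
    )
```

Each value from `function_sources` could be passed to `generate_tests`, with `preview_diff` showing what would change in the matching test file.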

Now I’m curious: How do you handle the boring parts of testing? Do you use any code generation, or do you just accept the grind? Let me know in the comments – I’d love to steal your ideas.
