# I spent hours writing unit tests – so I made an LLM do it (and learned what not to do)

> Source: <https://dev.to/__c1b9e06dc90a7e0a676b/i-spent-hours-writing-unit-tests-so-i-made-an-llm-do-it-and-learned-what-not-to-do-4d16>
> Published: 2026-06-05 01:01:23+00:00

About a month ago I hit that point in a project where the business logic was solid, the API endpoints were clean, but the test file was a pathetic stub. I had 30+ similar validation functions – each one a slight variation on “does this field exist?”, “is it the right type?”, “does it pass this custom rule?”. The manual approach would mean copying the same `assert` pattern dozens of times, changing only the function name and the test input. My brain started melting just thinking about it.

I’m a big believer in testing, but I’m also a big believer in not doing boring work twice. So I started looking for ways to automate test generation.

My first instinct was to write a Python generator that parsed the function signatures and spat out basic asserts. Something like:

``` python
def generate_test(func_name, params):
    lines = [f"def test_{func_name}():"]
    for p in params:
        lines.append(f"    assert {func_name}({p}) is not None")
    return "\n".join(lines)
```

This worked only for the most trivial cases. As soon as the functions had side effects, required fixtures, or needed specific edge-case values, the template became a nightmare of conditionals. Plus, what about the *negative* tests – the inputs that should raise errors? My generator didn’t know anything about the domain logic.

Next I tried a rule‑based approach with regular expressions. I wrote about 200 lines of heuristics to infer parameter types from docstrings. It sort of worked for one function, then broke completely on the next. I felt like I was rebuilding a tiny compiler for a language nobody uses.

I had a hunch that an LLM could do better if I gave it the right context. The idea was simple: feed the function source (plus docstring) into a language model, ask it to produce `pytest` test functions, and then validate the output before writing it to a file.

Here’s the core loop I ended up with:

```` python
import ast
import requests

# For demo purposes – replace with your own endpoint
BASE_URL = "https://ai.interwestinfo.com/v1"  # Example: LLM API
API_KEY = "your-key"

def generate_tests(source_code: str, max_retries=2):
    # max_retries is the total number of attempts, not the number of extra ones
    prompt = f"""You are an expert Python tester. Given the function below, write comprehensive pytest test functions covering:
- Normal cases
- Edge cases (empty, None, large values)
- Error cases (wrong types, out-of-range)

Do NOT use external libraries beyond pytest. Return ONLY valid Python code (no explanations).

Function:
```python
{source_code}
```
"""

    for attempt in range(max_retries):
        response = requests.post(
            f"{BASE_URL}/chat/completions",
            headers={"Authorization": f"Bearer {API_KEY}"},
            json={
                "model": "gpt-4o-mini",  # Or whatever model you prefer
                "messages": [{"role": "user", "content": prompt}],
                "temperature": 0.3,
            },
            timeout=60,  # Don't hang forever on a stalled request
        )
        response.raise_for_status()
        content = response.json()["choices"][0]["message"]["content"]

        # Validate that the output is parseable Python
        try:
            ast.parse(content)
            return content
        except SyntaxError:
            # Re-raise on the last attempt, otherwise ask the model again
            if attempt == max_retries - 1:
                raise
            continue
    # Only reachable when max_retries is 0 or negative
    raise ValueError("max_retries must be at least 1")
````

The validation step is crucial. LLMs love adding markdown fences, random comments, or incomplete brackets. By parsing the output with `ast.parse`, I catch those before I write bad code to my test file.
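The loop above simply retries when the reply does not parse. A cheaper option, sketched here as an assumption and not as something my script does, is to strip a wrapping markdown fence first and only then parse. `strip_code_fence` and `is_valid_python` are hypothetical helper names.

```python
import ast
import re

FENCE = "`" * 3  # three backticks, built this way so the snippet itself stays fence-safe

# Matches a reply wrapped entirely in one fence, with an optional
# language tag (such as "python") after the opening fence.
_FENCE_RE = re.compile(
    r"^\s*" + FENCE + r"[\w+-]*[ \t]*\n(.*?)\n?" + FENCE + r"\s*$",
    re.DOTALL,
)


def strip_code_fence(text: str) -> str:
    """Return the code inside a wrapping fence, or the stripped text if there is none."""
    match = _FENCE_RE.match(text)
    return match.group(1) if match else text.strip()


def is_valid_python(text: str) -> bool:
    """True when the text parses as Python source."""
    try:
        ast.parse(text)
        return True
    except SyntaxError:
        return False


reply = FENCE + "python\nimport pytest\n\ndef test_x():\n    assert True\n" + FENCE
cleaned = strip_code_fence(reply)
print(is_valid_python(reply), is_valid_python(cleaned))
```

This only handles a reply that is one fenced block from start to finish. Prose before or after the fence still fails the parse, which is what you want: that reply should be retried.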

I pointed this at a `validate_email` function with three lines of logic. The LLM returned:

``` python
import pytest
from validation import validate_email

def test_valid_email():
    assert validate_email("user@example.com") is True

def test_valid_email_with_plus():
    assert validate_email("user+tag@example.com") is True

def test_no_at_symbol():
    assert validate_email("userexample.com") is False

def test_empty_string():
    assert validate_email("") is False

def test_none_input():
    with pytest.raises(TypeError):
        validate_email(None)
```

Not bad – it even guessed I wanted a `TypeError` for `None` (which my function did raise). I ran the tests and they passed. Success.

But it wasn’t all roses. For a complex function that involved a database query, the LLM generated tests that mocked things incorrectly. It assumed the function would call `db.fetch()` when in reality it used an async ORM. The generated tests were syntactically valid but semantically wrong.
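For contrast, this is roughly what a hand-written test for an async data-access function can look like. Everything here is hypothetical: `get_user_email` and `repo.get_by_id` stand in for my real function and ORM call, which I'm not showing. The mock uses `unittest.mock.AsyncMock` from the standard library, so the mocked method can be awaited.

```python
import asyncio
from unittest.mock import AsyncMock


# Hypothetical function under test: it awaits a repository method
# instead of calling a synchronous db.fetch().
async def get_user_email(repo, user_id: int) -> str:
    user = await repo.get_by_id(user_id)
    if user is None:
        raise LookupError(f"user {user_id} not found")
    return user["email"].lower()


def test_get_user_email_normalizes_case():
    repo = AsyncMock()
    repo.get_by_id.return_value = {"email": "User@Example.com"}

    result = asyncio.run(get_user_email(repo, 42))

    assert result == "user@example.com"
    # Verifies the method was awaited, not merely called.
    repo.get_by_id.assert_awaited_once_with(42)


def test_get_user_email_missing_user():
    repo = AsyncMock()
    repo.get_by_id.return_value = None

    try:
        asyncio.run(get_user_email(repo, 7))
    except LookupError:
        pass
    else:
        raise AssertionError("expected LookupError for a missing user")
```

A plain `Mock` here would return a non-awaitable object and the test would fail with a `TypeError`. That's the kind of detail the model had no way to infer from the function source alone.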

**Use LLMs for boilerplate, not for domain-specific logic.** If your function requires deep knowledge of your database schema or business rules, the generated tests will be too generic. You’re better off hand‑writing those or providing a schema context in the prompt.

**Prompt engineering matters more than the model.** Adding `"Do NOT include imports that don't exist in your project."` and `"Use pytest.raises for exceptions."` dramatically improved the output quality.
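One way to keep those rules in one place, and to slot in optional schema context for the domain-heavy functions, is a small prompt builder. This is a sketch under my own assumptions: `build_prompt`, `BASE_RULES` and the `project_context` parameter are illustrative names, not part of the script above.

```python
# Rules quoted from the prompts discussed in this post.
BASE_RULES = [
    "Do NOT use external libraries beyond pytest.",
    "Do NOT include imports that don't exist in your project.",
    "Use pytest.raises for exceptions.",
    "Return ONLY valid Python code (no explanations).",
]


def build_prompt(source_code: str, project_context: str = "") -> str:
    """Assemble the prompt from the rules, optional context, and the function source."""
    rules = "\n".join(f"- {rule}" for rule in BASE_RULES)
    parts = [
        "You are an expert Python tester. "
        "Write pytest test functions for the function below.",
        "Rules:\n" + rules,
    ]
    if project_context:
        # For example a table definition or a short note on business rules.
        parts.append("Project context:\n" + project_context.strip())
    parts.append("Function:\n" + source_code.strip())
    return "\n\n".join(parts)


prompt = build_prompt(
    "def f(x):\n    return x\n",
    project_context="Table users(id INTEGER, email TEXT)",
)
print(prompt)
```

Keeping the rules in a list makes it easy to add one, rerun the generator, and compare the output.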

**Always validate the output.** I parse the response with `ast.parse` and also run a quick `pytest --collect-only` on the generated file to catch any syntax or import errors before the full test run.

**Temperature 0.2 – 0.4 is the sweet spot.** Too high and it invents random test cases; too low and it repeats the same pattern ad nauseam.

For my validation functions, this approach saved about 20 minutes per function. Over 30 functions, that’s 10 hours I got back. The generated tests aren’t perfect – I still review every file – but they catch the obvious stuff, which is where many bugs hide.

As a next step, I’d write a small CLI tool that takes a list of function names (or reads a module) and generates a test file for each, then opens a diff viewer so I can accept or reject chunks. That’s the next weekend project.
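I haven't built it yet, but the two building blocks would probably look something like this: pull each top-level function out of a module with `ast`, and produce a unified diff between the current test file and the generated one with `difflib`. `extract_functions` and `diff_against_existing` are hypothetical names.

```python
import ast
import difflib


def extract_functions(module_source: str) -> dict[str, str]:
    """Map each top-level function name to its source text (decorators not included)."""
    tree = ast.parse(module_source)
    return {
        node.name: ast.get_source_segment(module_source, node)
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }


def diff_against_existing(existing: str, generated: str, path: str) -> str:
    """Unified diff between the current test file and the generated one."""
    return "".join(
        difflib.unified_diff(
            existing.splitlines(keepends=True),
            generated.splitlines(keepends=True),
            fromfile=f"{path} (current)",
            tofile=f"{path} (generated)",
        )
    )


module_source = (
    "import re\n\n"
    "def is_positive(n):\n    return n > 0\n\n"
    "LIMIT = 10\n\n"
    "async def fetch(x):\n    return x\n"
)
functions = extract_functions(module_source)
print(list(functions))
print(diff_against_existing("a\n", "a\nb\n", "tests/test_x.py"))
```

Each extracted source string could be passed to `generate_tests`, and the diff shown for review before anything is written to disk.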

Now I’m curious: **How do you handle the boring parts of testing? Do you use any code generation, or do you just accept the grind?** Let me know in the comments – I’d love to steal your ideas.