{"slug": "i-spent-hours-writing-unit-tests-so-i-made-an-llm-do-it-and-learned-what-not-to", "title": "I spent hours writing unit tests – so I made an LLM do it (and learned what not to do)", "summary": "A developer built an automated test generation system using an LLM after spending hours writing repetitive unit tests for 30+ similar validation functions. The engineer created a Python script that feeds function source code into an LLM API, requests comprehensive pytest test cases covering normal, edge, and error scenarios, then validates the output with `ast.parse()` before writing it to a test file. The approach successfully generated tests for a `validate_email` function, producing cases for valid emails, plus-addressing, missing @ symbols, and empty strings.", "body_md": "About a month ago I hit that point in a project where the business logic was solid, the API endpoints were clean, but the test file was a pathetic stub. I had 30+ similar validation functions – each one a slight variation on “does this field exist?”, “is it the right type?”, “does it pass this custom rule?”. The manual approach would mean copying the same `assert` pattern dozens of times, changing only the function name and the test input. My brain started melting just thinking about it.\n\nI’m a big believer in testing, but I’m also a big believer in not doing boring work twice. So I started looking for ways to automate test generation.\n\nMy first instinct was to write a Python generator that parsed the function signatures and spat out basic asserts. Something like:\n\n``` python\ndef generate_test(func_name, params):\n    lines = [f\"def test_{func_name}():\"]\n    for p in params:\n        lines.append(f\"    assert {func_name}({p}) is not None\")\n    return \"\\n\".join(lines)\n```\n\nThis worked only for the most trivial cases. As soon as the functions had side effects, required fixtures, or needed specific edge-case values, the template became a nightmare of conditionals. 
Plus, what about the *negative* tests – the inputs that should raise errors? My generator didn’t know anything about the domain logic.\n\nNext I tried a rule‑based approach with regular expressions. I wrote about 200 lines of heuristics to infer parameter types from docstrings. It sort of worked for one function, then broke completely on the next. I felt like I was rebuilding a tiny compiler for a language nobody uses.\n\nI had a hunch that an LLM could do better if I gave it the right context. The idea was simple: feed the function source (plus docstring) into a language model, ask it to produce `pytest` test functions, and then validate the output before writing it to a file.\n\nHere’s the core loop I ended up with:\n\n``` python\nimport ast\nimport requests\n\n# For demo purposes – replace with your own endpoint\nBASE_URL = \"https://ai.interwestinfo.com/v1\"  # Example: LLM API\nAPI_KEY = \"your-key\"\n\ndef generate_tests(source_code: str, max_retries=2):\n    prompt = f\"\"\"You are an expert Python tester. Given the function below, write comprehensive pytest test functions covering:\n- Normal cases\n- Edge cases (empty, None, large values)\n- Error cases (wrong types, out-of-range)\n\nDo NOT use external libraries beyond pytest. 
Return ONLY valid Python code (no explanations).\n\nFunction:\n{'`' * 3}python\n{source_code}\n{'`' * 3}\n\"\"\"\n\n    for attempt in range(max_retries):\n        response = requests.post(\n            f\"{BASE_URL}/chat/completions\",\n            headers={\"Authorization\": f\"Bearer {API_KEY}\"},\n            json={\n                \"model\": \"gpt-4o-mini\",  # Or whatever model you prefer\n                \"messages\": [{\"role\": \"user\", \"content\": prompt}],\n                \"temperature\": 0.3,\n            },\n            timeout=60,  # Don't hang forever on a stalled connection\n        )\n        response.raise_for_status()\n        content = response.json()[\"choices\"][0][\"message\"][\"content\"]\n\n        # Validate that the output is parseable Python\n        try:\n            ast.parse(content)\n            return content\n        except SyntaxError:\n            if attempt == max_retries - 1:\n                raise\n            continue\n    return content  # Fallback (shouldn't happen)\n```\n\nThe validation step is crucial. LLMs love adding markdown fences, stray commentary, or incomplete brackets. By parsing the output with `ast.parse` I catch those before I write bad code to my test file.\n\nI pointed this at a `validate_email` function with three lines of logic. The LLM returned:\n\n``` python\nimport pytest\nfrom validation import validate_email\n\ndef test_valid_email():\n    assert validate_email(\"user@example.com\") is True\n\ndef test_valid_email_with_plus():\n    assert validate_email(\"user+tag@example.com\") is True\n\ndef test_no_at_symbol():\n    assert validate_email(\"userexample.com\") is False\n\ndef test_empty_string():\n    assert validate_email(\"\") is False\n\ndef test_none_input():\n    with pytest.raises(TypeError):\n        validate_email(None)\n```\n\nNot bad – it even guessed I wanted a `TypeError` for `None` (which my function did raise). I ran the tests and they passed. Success.\n\nBut it wasn’t all roses. 
For a complex function that involved a database query, the LLM generated tests that mocked things incorrectly. It assumed the function would call `db.fetch()` when in reality it used an async ORM. The generated tests were syntactically valid but semantically wrong.\n\n**Use LLMs for boilerplate, not for domain-specific logic.** If your function requires deep knowledge of your database schema or business rules, the generated tests will be too generic. You’re better off hand‑writing those or providing a schema context in the prompt.\n\n**Prompt engineering matters more than the model.** Adding `\"Do NOT include imports that don't exist in your project.\"` and `\"Use pytest.raises for exceptions.\"` dramatically improved the output quality.\n\n**Always validate the output.** I parse the response with `ast.parse` and also run a quick `pytest --collect-only` on the generated file to catch any syntax or import errors before the full test run.\n\n**Temperature 0.2 – 0.4 is the sweet spot.** Too high and it invents random test cases; too low and it repeats the same pattern ad nauseam.\n\nFor my validation functions, this approach saved about 20 minutes per function. Over 30 functions, that’s 10 hours I got back. The generated tests aren’t perfect – I still review every file – but they catch the obvious stuff, which is where many bugs hide.\n\nNext, I’d write a small CLI tool that takes a list of function names (or reads a module) and generates a test file for each, then opens a diff viewer so I can accept/reject chunks. That’s the next weekend project.\n\nNow I’m curious: **How do you handle the boring parts of testing? 
Do you use any code generation, or do you just accept the grind?** Let me know in the comments – I’d love to steal your ideas.", "url": "https://wpnews.pro/news/i-spent-hours-writing-unit-tests-so-i-made-an-llm-do-it-and-learned-what-not-to", "canonical_source": "https://dev.to/__c1b9e06dc90a7e0a676b/i-spent-hours-writing-unit-tests-so-i-made-an-llm-do-it-and-learned-what-not-to-do-4d16", "published_at": "2026-06-05 01:01:23+00:00", "updated_at": "2026-06-05 01:41:40.064057+00:00", "lang": "en", "topics": ["large-language-models", "generative-ai", "ai-tools", "ai-products", "artificial-intelligence"], "entities": [], "alternates": {"html": "https://wpnews.pro/news/i-spent-hours-writing-unit-tests-so-i-made-an-llm-do-it-and-learned-what-not-to", "markdown": "https://wpnews.pro/news/i-spent-hours-writing-unit-tests-so-i-made-an-llm-do-it-and-learned-what-not-to.md", "text": "https://wpnews.pro/news/i-spent-hours-writing-unit-tests-so-i-made-an-llm-do-it-and-learned-what-not-to.txt", "jsonld": "https://wpnews.pro/news/i-spent-hours-writing-unit-tests-so-i-made-an-llm-do-it-and-learned-what-not-to.jsonld"}}