{"slug": "show-hn-promptloop-create-run-and-improve-prompt-evals-from-the-terminal", "title": "Show HN: Promptloop – create, run, and improve prompt evals from the terminal", "summary": "Promptloop, a new interactive CLI tool built on LangChain's deepagents framework, enables developers to create, run, and improve prompt evaluations entirely from the terminal. The tool saves methodology, test cases, reports, prompt history, and chat checkpoints under a `.evals/` directory in the target project, supporting metrics including latency, JSON schema validation, fuzzy matching, and LLM judge scoring. Promptloop allows users to register prompts, add test cases, run evaluations, generate failure analysis reports, and approve prompt diffs without leaving the command line.", "body_md": "An interactive CLI agent for the full prompt-eval loop: create test cases, run evals, generate reports, and approve prompt diffs without leaving your terminal.\n\nBuilt on LangChain [deepagents](https://github.com/langchain-ai/deepagents).\n\nAgent harnesses are getting better, but prompts still shape what they do. 
promptloop turns a prompt and eval intent into a repeatable loop.\n\nIt saves the methodology, test cases, reports, prompt history, and chat checkpoints under `.evals/` in the target project.\n\n```\n.evals/\n  prompts/        # registered prompts + version history\n  test_cases/     # per-prompt test suites\n  eval_configs/   # methodology (metrics, models, judges)\n  results/        # eval runs and reports\n  chat.db         # SQLite checkpoint of conversation threads\n```\n\nExample metrics:\n\n- `latency`: response time\n- `json_schema`: validates structured output\n- `fuzzy_match`: compares text similarity\n- `llm_judge`: scores output with a judge prompt\n\nSuppose your project has a prompt at `prompts/summarize.md`:\n\n```\nSummarize the user's note in three bullets.\nReturn JSON.\n```\n\nStart promptloop and describe the behavior you want to test:\n\n```bash\n$ uv run promptloop --project-dir ~/work/notes-app\n\npromptloop> Evaluate the prompt at prompts/summarize.md.\n\nRegistered prompt 'summarize' (v1)\nSource: /Users/me/work/notes-app/prompts/summarize.md\n\npromptloop> Add a test case where the note includes action items, dates, and unrelated chatter.\n\nAdded test case 'tc_action_items' for prompt 'summarize'\n(metrics: json_schema, llm_judge).\n\npromptloop> Run the eval.\n\nRun complete - ID: run_20260529_091214_a3f2\nResults: 2 passed / 1 failed / 3 total\nAvg latency: 1840ms\nMax concurrency: 3\n\n  passed [tc_basic_summary] anthropic:claude-sonnet-4-6\n    json_schema: valid JSON matching schema | llm_judge: 0.86\n  failed [tc_action_items] anthropic:claude-sonnet-4-6\n    json_schema: schema mismatch: 'action_items' is a required property\n  passed [tc_noise] anthropic:claude-sonnet-4-6\n    json_schema: valid JSON matching schema | llm_judge: 0.82\n```\n\nAsk for a fix, and promptloop proposes a diff instead of editing blindly:\n\n```\npromptloop> Propose a prompt change for the failing action-items case.\n\nProposed changes to 
'summarize' from v1:\n--- summarize (current)\n+++ summarize (proposed)\n@@\n-Summarize the user's note in three bullets.\n-Return JSON.\n+Summarize the user's note in three bullets.\n+If the note contains follow-up tasks, extract them into an action_items array.\n+Each action item should include a task, owner if mentioned, and due_date if mentioned.\n+\n+Return only valid JSON with this shape:\n+{\n+  \"summary\": [\"...\", \"...\", \"...\"],\n+  \"action_items\": [\n+    {\"task\": \"...\", \"owner\": \"...\", \"due_date\": \"...\"}\n+  ]\n+}\n```\n\nIt also generates a report you can inspect before approving the change:\n\n```\n# Prompt Eval Report: summarize\n\n**Run:** run_20260529_091214_a3f2\n**Models:** anthropic:claude-sonnet-4-6\n**Pass rate:** 67% (2/3)\n**Avg latency:** 1840ms\n\n## Failure Analysis\n\nThe action-items case failed because the prompt only requested \"three bullets\"\nand \"JSON\"; it did not define a required JSON shape or explain how to handle\ndates, owners, and follow-up tasks.\n\n## Recommendations\n\n1. Add an explicit `action_items` field to the schema.\n2. Tell the model to preserve due dates and owners when present.\n3. Require JSON-only output so downstream parsing is stable.\n```\n\nTo install and run promptloop from source:\n\n```\ngit clone <this repo>\ncd promptloop\nuv sync\nuv run promptloop --project-dir /path/to/your/project\n```\n\nYou'll get an interactive chat. Try things like:\n\n- *\"Evaluate the prompt at `src/prompts/summarize.txt`\"*\n- *\"Add three more test cases for edge cases\"*\n- *\"Re-run with `openai:gpt-4o-mini` and compare to the last run\"*\n- *\"Propose a fix for the failing JSON schema cases\"*\n\n| Command | Description |\n|---|---|\n| `/help` | Show help |\n| `/clear` | Start a new conversation thread |\n| `/threads` | List saved threads |\n| `/thread <id>` | Switch to a thread in-session |\n| `/quit` | Exit |\n\nResume past sessions with `promptloop --thread <id>`. 
Press **Esc** to interrupt a streaming response.\n\nThe agent has a small set of typed tools on top of deepagents' filesystem access:\n\n- `register_prompt`, `propose_prompt_changes`, `apply_prompt_changes`, `show_prompt_history`\n- `add_test_case`, `infer_json_schema`, `save_eval_config`\n- `run_eval`, `list_eval_runs`\n- `generate_report`, `read_report`, `compare_runs`\n\nFor more detail on the agent runtime behind this project, see [The Harness Behind Deep Agent](https://github.com/Bella3202019/promptloop/blob/main/docs/The_Harness_Behind_Deep_Agent.md).\n\nEarly / experimental. Feedback and issues welcome.", "url": "https://wpnews.pro/news/show-hn-promptloop-create-run-and-improve-prompt-evals-from-the-terminal", "canonical_source": "https://github.com/Bella3202019/promptloop", "published_at": "2026-05-29 16:06:46+00:00", "updated_at": "2026-05-29 16:18:37.117863+00:00", "lang": "en", "topics": ["ai-tools", "ai-agents", "large-language-models", "mlops", "ai-products"], "entities": ["Promptloop", "LangChain", "deepagents"], "alternates": {"html": "https://wpnews.pro/news/show-hn-promptloop-create-run-and-improve-prompt-evals-from-the-terminal", "markdown": "https://wpnews.pro/news/show-hn-promptloop-create-run-and-improve-prompt-evals-from-the-terminal.md", "text": "https://wpnews.pro/news/show-hn-promptloop-create-run-and-improve-prompt-evals-from-the-terminal.txt", "jsonld": "https://wpnews.pro/news/show-hn-promptloop-create-run-and-improve-prompt-evals-from-the-terminal.jsonld"}}