Show HN: Promptloop – create, run, and improve prompt evals from the terminal

Promptloop, a new interactive CLI tool built on LangChain's deepagents framework, enables developers to create, run, and improve prompt evaluations entirely from the terminal. The tool saves methodology, test cases, reports, prompt history, and chat checkpoints under a `.evals/` directory in the target project, supporting metrics including latency, JSON schema validation, fuzzy matching, and LLM judge scoring. Promptloop allows users to register prompts, add test cases, run evaluations, generate failure analysis reports, and approve prompt diffs without leaving the command line.

An interactive CLI agent for the full prompt-eval loop: create test cases, run evals, generate reports, and approve prompt diffs without leaving your terminal. Built on [LangChain deepagents](https://github.com/langchain-ai/deepagents).

Agent harnesses are getting better, but prompts still shape what they do. promptloop turns a prompt and eval intent into a repeatable loop: it saves the methodology, test cases, reports, prompt history, and chat checkpoints under `.evals/` in the target project.

```
.evals/
  prompts/        registered prompts + version history
  test_cases/     per-prompt test suites
  eval_configs/   methodology (metrics, models, judges)
  results/        eval runs and reports
  chat.db         SQLite checkpoint of conversation threads
```

Example metrics:

- `latency`: response time
- `json_schema`: validates structured output
- `fuzzy_match`: compares text similarity
- `llm_judge`: scores output with a judge prompt

Suppose your project has a prompt at `prompts/summarize.md`:

```
Summarize the user's note in three bullets.
Return JSON.
```

Start promptloop and describe the behavior you want to test:

```bash
$ uv run promptloop --project-dir ~/work/notes-app

promptloop> Evaluate the prompt at prompts/summarize.md.
Registered prompt 'summarize' (v1)
Source: /Users/me/work/notes-app/prompts/summarize.md

promptloop> Add a test case where the note includes action items, dates, and unrelated chatter.
Added test case 'tc_action_items' for prompt 'summarize' (metrics: json_schema, llm_judge).

promptloop> Run the eval.
Run complete - ID: run_20260529_091214_a3f2
Results: 2 passed / 1 failed / 3 total
Avg latency: 1840ms
Max concurrency: 3

passed  tc_basic_summary  anthropic:claude-sonnet-4-6
        json_schema: valid JSON matching schema | llm_judge: 0.86
failed  tc_action_items   anthropic:claude-sonnet-4-6
        json_schema: schema mismatch: 'action_items' is a required property
passed  tc_noise          anthropic:claude-sonnet-4-6
        json_schema: valid JSON matching schema | llm_judge: 0.82
```

Ask for a fix, and promptloop proposes a diff instead of editing blindly:

```diff
promptloop> Propose a prompt change for the failing action-items case.
Proposed changes to 'summarize' (from v1):

--- summarize (current)
+++ summarize (proposed)
@@
-Summarize the user's note in three bullets.
-Return JSON.
+Summarize the user's note in three bullets.
+If the note contains follow-up tasks, extract them into an action_items array.
+Each action item should include a task, owner if mentioned, and due date if mentioned.
+
+Return only valid JSON with this shape:
+{
+  "summary": ["...", "...", "..."],
+  "action_items": [
+    {"task": "...", "owner": "...", "due_date": "..."}
+  ]
+}
```

It also generates a report you can inspect before approving the change:

```
Prompt Eval Report: summarize

Run: run_20260529_091214_a3f2
Models: anthropic:claude-sonnet-4-6
Pass rate: 67% (2/3)
Avg latency: 1840ms

Failure Analysis
The action-items case failed because the prompt only requested "three bullets" and "JSON"; it did not define a required JSON shape or explain how to handle dates, owners, and follow-up tasks.

Recommendations
1. Add an explicit action_items field to the schema.
2. Tell the model to preserve due dates and owners when present.
3. Require JSON-only output so downstream parsing is stable.
```

```bash
git clone <this repo>
cd promptloop
uv sync
uv run promptloop --project-dir /path/to/your/project
```

You'll get an interactive chat. Try things like:

- "Evaluate the prompt at `src/prompts/summarize.txt`"
- "Add three more test cases for edge cases"
- "Re-run with openai:gpt-4o-mini and compare to the last run"
- "Propose a fix for the failing JSON schema cases"

| Command | Description |
|---|---|
| `/help` | Show help |
| `/clear` | Start a new conversation thread |
| `/threads` | List saved threads |
| `/thread <id>` | Switch to a thread in-session |
| `/quit` | Exit |

Resume past sessions with `promptloop --thread <id>`. Press `Esc` to interrupt a streaming response.

The agent has a small set of typed tools on top of deepagents' filesystem access:

- `register_prompt`, `propose_prompt_changes`, `apply_prompt_changes`, `show_prompt_history`
- `add_test_case`, `infer_json_schema`, `save_eval_config`
- `run_eval`, `list_eval_runs`
- `generate_report`, `read_report`, `compare_runs`

For more detail on the agent runtime behind this project, see "The Harness Behind Deep Agent" (`/Bella3202019/promptloop/blob/main/docs/The Harness Behind Deep Agent.md`).

Early / experimental. Feedback and issues welcome.
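For readers wondering what the simpler metrics in the example run might boil down to, here is a rough Python sketch using only the standard library. It is not promptloop's actual implementation: the function names (`fuzzy_match`, `missing_required_keys`, `pass_rate`), the 0.8 threshold, and the use of `difflib` are all assumptions made for illustration, and a real schema metric would more likely use a full JSON Schema validator than a required-key check.

```python
import json
from difflib import SequenceMatcher


def fuzzy_match(expected: str, actual: str, threshold: float = 0.8) -> tuple[bool, float]:
    """Hypothetical text-similarity metric: pass if the ratio meets the threshold."""
    score = SequenceMatcher(None, expected.lower(), actual.lower()).ratio()
    return score >= threshold, score


def missing_required_keys(raw_output: str, required: list[str]) -> list[str]:
    """Hypothetical stand-in for a schema check.

    Returns the required top-level keys that are absent from the model output.
    Raises ValueError if the output is not a JSON object at all.
    """
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        raise ValueError(f"invalid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    return [key for key in required if key not in data]


def pass_rate(results: list[bool]) -> str:
    """Format per-test pass/fail flags the way the example report shows them."""
    passed = sum(results)
    total = len(results)
    pct = round(100 * passed / total) if total else 0
    return f"{pct}% ({passed}/{total})"
```

Under these assumptions, `missing_required_keys('{"summary": ["a", "b", "c"]}', ["summary", "action_items"])` returns `["action_items"]`, which mirrors the failing case in the example run, and `pass_rate([True, False, True])` returns `"67% (2/3)"`, matching the pass-rate line in the example report.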