
Show HN: Promptloop – create, run, and improve prompt evals from the terminal

Promptloop, a new interactive CLI tool built on LangChain's deepagents framework, enables developers to create, run, and improve prompt evaluations entirely from the terminal. The tool saves methodology, test cases, reports, prompt history, and chat checkpoints under a `.evals/` directory in the target project, supporting metrics including latency, JSON schema validation, fuzzy matching, and LLM judge scoring. Promptloop allows users to register prompts, add test cases, run evaluations, generate failure analysis reports, and approve prompt diffs without leaving the command line.

3 min read · published May 29, 2026

An interactive CLI agent for the full prompt-eval loop: create test cases, run evals, generate reports, and approve prompt diffs without leaving your terminal.

Built on LangChain deepagents.

Agent harnesses are getting better, but prompts still shape what they do. promptloop turns a prompt and an eval intent into a repeatable loop: create test cases, run the eval, review the report, and approve a prompt diff.

It saves the methodology, test cases, reports, prompt history, and chat checkpoints under .evals/ in the target project:

.evals/
  prompts/        # registered prompts + version history
  test_cases/     # per-prompt test suites
  eval_configs/   # methodology (metrics, models, judges)
  results/        # eval runs and reports
  chat.db         # SQLite checkpoint of conversation threads
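To check whether a project already has this layout, something like the following could work. It is a minimal sketch that only looks at the four subdirectories named above; `missing_eval_dirs` and `scaffold_evals` are hypothetical helpers, not part of promptloop, and promptloop presumably creates these directories itself.

```python
from pathlib import Path
import tempfile

# Subdirectories named in the layout above. chat.db is a file, so it is not listed here.
EVALS_SUBDIRS = ("prompts", "test_cases", "eval_configs", "results")


def missing_eval_dirs(project_dir: Path) -> list[str]:
    """Return the expected .evals/ subdirectories that do not exist yet."""
    root = project_dir / ".evals"
    return [name for name in EVALS_SUBDIRS if not (root / name).is_dir()]


def scaffold_evals(project_dir: Path) -> Path:
    """Create the .evals/ subdirectories (hypothetical helper for illustration)."""
    root = project_dir / ".evals"
    for name in EVALS_SUBDIRS:
        (root / name).mkdir(parents=True, exist_ok=True)
    return root


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        project = Path(tmp)
        print(missing_eval_dirs(project))  # all four are missing in a fresh directory
        scaffold_evals(project)
        print(missing_eval_dirs(project))  # nothing is missing after scaffolding
```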

Example metrics:

latency: response time
json_schema: validates structured output
fuzzy_match: compares text similarity
llm_judge: scores output with a judge prompt
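To make the first three metrics concrete, here is a rough standard-library sketch. It is not promptloop's implementation: `fuzzy_match`, `json_required_keys`, and `timed` are hypothetical names, the JSON check only tests required top-level keys rather than a full JSON Schema, and llm_judge is left out because it needs a model call.

```python
import json
import time
from difflib import SequenceMatcher


def fuzzy_match(expected: str, actual: str) -> float:
    """Similarity ratio in [0, 1] from difflib; 1.0 means the strings are identical."""
    return SequenceMatcher(None, expected, actual).ratio()


def json_required_keys(output: str, required: list[str]) -> tuple[bool, str]:
    """Small stand-in for schema validation: parse the output and check required keys."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON: {exc.msg}"
    if not isinstance(data, dict):
        return False, "top-level value is not an object"
    missing = [key for key in required if key not in data]
    if missing:
        return False, f"missing required properties: {missing}"
    return True, "valid JSON with all required properties"


def timed(fn, *args):
    """Call fn(*args) and return (result, latency in milliseconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, (time.perf_counter() - start) * 1000.0


if __name__ == "__main__":
    output = '{"summary": ["a", "b", "c"]}'
    print(json_required_keys(output, ["summary", "action_items"]))
    print(fuzzy_match("three short bullets", "three short bullets"))
```

A real `json_schema` metric would more likely delegate to a full validator; the key check here is only meant to show the pass/fail shape of the result.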

Suppose your project has a prompt at prompts/summarize.md:

Summarize the user's note in three bullets.
Return JSON.

Start promptloop and describe the behavior you want to test:

$ uv run promptloop --project-dir ~/work/notes-app

promptloop> Evaluate the prompt at prompts/summarize.md.

Registered prompt 'summarize' (v1)
Source: /Users/me/work/notes-app/prompts/summarize.md

promptloop> Add a test case where the note includes action items, dates, and unrelated chatter.

Added test case 'tc_action_items' for prompt 'summarize'
(metrics: json_schema, llm_judge).

promptloop> Run the eval.

Run complete - ID: run_20260529_091214_a3f2
Results: 2 passed / 1 failed / 3 total
Avg latency: 1840ms
Max concurrency: 3

  passed [tc_basic_summary] anthropic:claude-sonnet-4-6
    json_schema: valid JSON matching schema | llm_judge: 0.86
  failed [tc_action_items] anthropic:claude-sonnet-4-6
    json_schema: schema mismatch: 'action_items' is a required property
  passed [tc_noise] anthropic:claude-sonnet-4-6
    json_schema: valid JSON matching schema | llm_judge: 0.82
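The summary lines in this output (pass/fail counts, average latency) are simple aggregates over per-case results. A minimal sketch of that aggregation follows, assuming a hypothetical `CaseResult` record. The per-case latencies are made up for illustration, since the transcript reports only the 1840ms average, and none of this is promptloop's actual code.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    case_id: str
    passed: bool
    latency_ms: float  # hypothetical per-case value, not taken from the transcript


def summarize(results: list[CaseResult]) -> dict:
    """Aggregate per-case results into the counts shown in a run summary."""
    total = len(results)
    passed = sum(r.passed for r in results)
    return {
        "passed": passed,
        "failed": total - passed,
        "total": total,
        "pass_rate_pct": round(100 * passed / total) if total else 0,
        "avg_latency_ms": round(sum(r.latency_ms for r in results) / total) if total else 0,
    }


# Illustrative data: case IDs and pass/fail match the transcript, latencies are invented.
results = [
    CaseResult("tc_basic_summary", True, 1700.0),
    CaseResult("tc_action_items", False, 2020.0),
    CaseResult("tc_noise", True, 1800.0),
]
print(summarize(results))
# {'passed': 2, 'failed': 1, 'total': 3, 'pass_rate_pct': 67, 'avg_latency_ms': 1840}
```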

Ask for a fix, and promptloop proposes a diff instead of editing blindly:

promptloop> Propose a prompt change for the failing action-items case.

Proposed changes to 'summarize' from v1:
--- summarize (current)
+++ summarize (proposed)
@@
-Summarize the user's note in three bullets.
-Return JSON.
+Summarize the user's note in three bullets.
+If the note contains follow-up tasks, extract them into an action_items array.
+Each action item should include a task, owner if mentioned, and due_date if mentioned.
+
+Return only valid JSON with this shape:
+{
+  "summary": ["...", "...", "..."],
+  "action_items": [
+    {"task": "...", "owner": "...", "due_date": "..."}
+  ]
+}
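A diff like this can be produced with Python's standard `difflib` module. The sketch below is an assumption about how such output might be generated, not promptloop's code; `prompt_diff` is a hypothetical helper and the proposed prompt is shortened for brevity.

```python
import difflib


def prompt_diff(current: str, proposed: str, name: str) -> str:
    """Return a unified diff between two prompt versions as a single string."""
    lines = difflib.unified_diff(
        current.splitlines(),
        proposed.splitlines(),
        fromfile=f"{name} (current)",
        tofile=f"{name} (proposed)",
        lineterm="",  # inputs have no trailing newlines, so headers should not either
    )
    return "\n".join(lines)


current = "Summarize the user's note in three bullets.\nReturn JSON.\n"
proposed = (
    "Summarize the user's note in three bullets.\n"
    "If the note contains follow-up tasks, extract them into an action_items array.\n"
    "Return only valid JSON.\n"
)

print(prompt_diff(current, proposed, "summarize"))
```

Unlike the transcript above, `difflib` keeps the unchanged first line as context rather than showing it as removed and re-added. Keeping the diff as a separate artifact means the prompt file changes only after a human has reviewed it.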

It also generates a report you can inspect before approving the change:


**Run:** run_20260529_091214_a3f2
**Models:** anthropic:claude-sonnet-4-6
**Pass rate:** 67% (2/3)
**Avg latency:** 1840ms

## Failure Analysis

The action-items case failed because the prompt only requested "three bullets"
and "JSON"; it did not define a required JSON shape or explain how to handle
dates, owners, and follow-up tasks.

## Recommendations

1. Add an explicit `action_items` field to the schema.
2. Tell the model to preserve due dates and owners when present.
3. Require JSON-only output so downstream parsing is stable.

Install and run:

git clone <this repo>
cd promptloop
uv sync
uv run promptloop --project-dir /path/to/your/project

You'll get an interactive chat. Try things like:

"Evaluate the prompt at src/prompts/summarize.txt"
"Add three more test cases for edge cases"
"Re-run with openai:gpt-4o-mini and compare to the last run"
"Propose a fix for the failing JSON schema cases"

Command         Description
/help           Show help
/clear          Start a new conversation thread
/threads        List saved threads
/thread <id>    Switch to a thread in-session
/quit           Exit

Resume past sessions with promptloop --thread <id>. Press Esc to interrupt a streaming response.

The agent has a small set of typed tools on top of deepagents' filesystem access:

register_prompt, propose_prompt_changes, apply_prompt_changes, show_prompt_history
add_test_case, infer_json_schema, save_eval_config
run_eval, list_eval_runs
generate_report, read_report, compare_runs
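As an illustration of what a tool like infer_json_schema might do, the sketch below derives a minimal JSON Schema-style dictionary from one example output. This is a guess at the idea, not the project's implementation: `infer_schema` is a hypothetical function, the example payload is invented, and arrays are assumed to be homogeneous.

```python
import json


def infer_schema(value):
    """Infer a minimal JSON Schema-style dict from a single example value."""
    if isinstance(value, bool):  # checked before int because bool subclasses int
        return {"type": "boolean"}
    if isinstance(value, int):
        return {"type": "integer"}
    if isinstance(value, float):
        return {"type": "number"}
    if isinstance(value, str):
        return {"type": "string"}
    if value is None:
        return {"type": "null"}
    if isinstance(value, list):
        schema = {"type": "array"}
        if value:
            schema["items"] = infer_schema(value[0])  # assumes homogeneous arrays
        return schema
    if isinstance(value, dict):
        return {
            "type": "object",
            "properties": {key: infer_schema(val) for key, val in value.items()},
            "required": sorted(value),  # every key seen in the example is required
        }
    raise TypeError(f"unsupported JSON value: {type(value).__name__}")


# Invented example output in the shape the proposed prompt asks for.
example = json.loads(
    '{"summary": ["a", "b", "c"],'
    ' "action_items": [{"task": "send notes", "owner": "Ana", "due_date": "2026-06-01"}]}'
)
print(json.dumps(infer_schema(example), indent=2))
```

Marking every observed key as required is a strict default; a schema inferred from one example usually needs a human pass before it is used as a test oracle.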

For more detail on the agent runtime behind this project, see The Harness Behind Deep Agent.

Early / experimental. Feedback and issues welcome.
