Multi-model Ollama comparison, benchmarking, and evaluation — in your terminal.
Zero dependencies. One file. Standard library only.
wget https://raw.githubusercontent.com/whonixnetworks/prompter/main/prompter.py
chmod +x prompter.py
python3 prompter.py
Python 3.7+. Ollama running locally. That is it.
Prompter is a terminal-based multi-model evaluation tool for Ollama. Run the same prompt through multiple models simultaneously and watch responses stream in side by side. Go beyond simple comparison with structured evaluation modes: self-review loops, multi-model debate panels, and adversarial interrogation.
Results are saved as collapsible markdown files with full stats — timing, token counts, tool call traces, and response text.
Ollama is excellent for running one model at a time. Prompter fills the gaps when you need to evaluate, compare, or stress-test local models.
| Ollama alone | Prompter |
|---|---|
| One model per session | Stream multiple models side by side |
| Single chat thread | Structured debates (Council, Tribunal) |
| Manual quality checks | Automated 20-test benchmark battery |
| No built-in evaluation | Self-review loops with similarity scoring |
| Ad-hoc comparison | Side-by-side streaming with tool orchestration |
- Compare models live. Stream responses from multiple models simultaneously. Spot differences in reasoning, accuracy, and tone without switching terminals.
- Structured evaluation modes. Council mode brings together Domain Expert, Skeptic, and Devil's Advocate personas to debate a question. Tribunal mode puts claims under adversarial scrutiny with an arbiter. These aren't things you can do in a single chat session.
- Capability benchmarking. Run 20 standardised tests across every model: counting, maths, web search, URL fetching, file reading, shell commands, Python execution, RSS parsing, embeddings, and more. Per-model markdown reports with pass/fail summaries.
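As a rough illustration of the live comparison idea, the sketch below streams one prompt through several models at once using only the standard library. It assumes Ollama's `/api/generate` endpoint with `stream` enabled, which returns one JSON object per line. The function names (`parse_stream`, `stream_model`, `compare`) and the `sink` callback are hypothetical and are not taken from `prompter.py`.

```python
import json
import threading
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default Ollama endpoint


def parse_stream(lines):
    """Yield text fragments from newline-delimited JSON until a chunk says done."""
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)  # accepts both str and bytes
        yield chunk.get("response", "")
        if chunk.get("done"):
            break


def stream_model(model, prompt, sink):
    """Stream one model's answer, handing each fragment to sink(model, text)."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": True}).encode()
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        for text in parse_stream(resp):
            sink(model, text)


def compare(models, prompt, sink):
    """Run the same prompt through every model concurrently (one thread each)."""
    threads = [
        threading.Thread(target=stream_model, args=(m, prompt, sink))
        for m in models
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Calling `compare(["model-a", "model-b"], "Why is the sky blue?", lambda m, t: print(m, t))` would interleave fragments from both models as they arrive. The model names here are placeholders.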
Prompter is not just a side-by-side viewer. It has four evaluation modes plus a benchmark mode, each designed for a different kind of question.
Default -- One prompt, multiple models, live streaming. The simplest mode. Good for comparing how different models handle the same question.
Ralph -- Self-review loop. A model answers, then critiques and improves its own answer across multiple rounds. Useful for seeing how much a model can refine its own reasoning.
Council -- Multi-model debate. Three personas (Domain Expert, Skeptic, Devil's Advocate) each give their view on a question, then critique each other. A synthesis verdict follows. Good for complex decisions and risk assessments.
Tribunal -- Adversarial interrogation. One model defends a position while another challenges it across multiple rounds. An arbiter rules on each challenge. Useful for stress-testing claims and spotting weak reasoning.
Benchmark -- Runs 20 standardised tests against each selected model: counting, maths, web search, URL fetching, file reading, shell commands, Python execution, RSS parsing, embeddings, and more. Results are saved as per-model markdown files with pass/fail summaries.
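The self-review loop behind Ralph mode can be sketched in a few lines. This is a minimal sketch under stated assumptions, not Prompter's implementation: `ask` is a hypothetical callable that sends text to a model and returns its reply, the critique wording is invented, and using `difflib` similarity as a stopping rule is an assumption about how similarity scoring might be applied.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Ratio in [0, 1]; 1.0 means the two texts are identical."""
    return SequenceMatcher(None, a, b).ratio()


def self_review(ask, prompt, max_rounds=3, threshold=0.95):
    """Answer, then repeatedly critique and revise until the answer stabilises.

    ask(text) -> str is a hypothetical wrapper around a model call.
    Returns the final answer and the list of every draft produced.
    """
    answer = ask(prompt)
    history = [answer]
    for _ in range(max_rounds):
        revised = ask(
            "Question:\n" + prompt
            + "\n\nYour previous answer:\n" + answer
            + "\n\nCritique this answer, then write an improved version."
        )
        score = similarity(answer, revised)
        history.append(revised)
        answer = revised
        if score >= threshold:
            break  # the revision barely changed, so stop early
    return answer, history
```

Keeping every draft in `history` makes it possible to show how much the answer moved between rounds, which is the point of the mode.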
Here is what real output from Prompter looks like.
Tribunal mode: fact-checking a historical claim
Three models run a tribunal session on the prompt: "Explain how John F. Kennedy was president in 1995."
Opening Defence (model 1) immediately flags the factual error, provides a corrected timeline, and includes a reference table.
Opening Prosecution (model 2) states the same correction more directly: "This is an incorrect statement... JFK served from 1961 to 1963. Therefore, it is impossible for him to have been president in 1995."
Round 1 -- Prosecution challenges with exact dates and notes Bill Clinton as the correct president for 1995. Defence concedes: "You are absolutely correct -- and thank you for the clear, well-structured correction."
Round 2 -- Prosecution deepens the challenge with additional context on the 32-year gap and the origin of the confusion. Defence further strengthens the concession with a detailed breakdown table and sources.
Arbiter verdict: JFK was not president in 1995. He served from 1961 to 1963. Bill Clinton was president from 1993 to 2001, which includes 1995.
Tribunal mode caught the deliberate error in the prompt and produced a well-structured correction with sources, timelines, and clear concessions.
Benchmark mode: testing a model across 20 capability tests
A benchmark run against `nemotron-3-nano:4b` produced this summary:
| Test | Category | Result | Tokens | Time |
|---|---|---|---|---|
| Count r's in "strawberry" | Basic Counting | Pass | 321 | 8.62s |
| Count e's in "Tennessee" | Basic Counting | Fail | 120 | 1.54s |
| Square root of 1764 | Calculator | Pass | 165 | 2.32s |
| 2 to the power of 32 | Calculator | Fail | 126 | 1.59s |
| Current PM of Australia | Web Search | Fail | 2297 | 10.21s |
| Fetch example.com | URL Fetch | Pass | 1692 | 3.21s |
| Read /etc/hostname | File Read | Pass | 1770 | 6.56s |
| Shell: pwd | Shell Command | Pass | 1496 | 2.12s |
| Shell: date | Shell Command | Pass | 1549 | 2.47s |
| Python: Fibonacci | Python Exec | Pass | 1734 | 4.60s |
| Diff comparison | Diff | Pass | 1978 | 4.90s |
| Token counting | Token Counter | Pass | 4320 | 17.95s |
| Embeddings similarity | Embeddings | Pass | 1984 | 9.95s |
| Hacker News RSS feed | RSS Feed | Pass | 2449 | 13.01s |
| Summarise example.com | Summarise URL | Pass | 1905 | 9.39s |
Result: 14 out of 20 tests passed. The table above shows 15 of the 20 tests.
Failures clustered around counting edge cases, basic arithmetic, and web search (the model answered without calling the search tool). The full markdown file includes each test in a collapsible section with the exact response, tool calls, and timing data.
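For the deterministic tests in the table, the expected answers can be computed directly, which makes it easy to check a pass or fail by hand. The snippet below is a sketch only: the `grade` helper and its substring-matching rule are assumptions for illustration and say nothing about how Prompter actually scores a response.

```python
import math

# Ground truth for the four deterministic tests shown in the table.
EXPECTED = {
    "Count r's in \"strawberry\"": "strawberry".count("r"),
    "Count e's in \"Tennessee\"": "Tennessee".count("e"),
    "Square root of 1764": int(math.sqrt(1764)),
    "2 to the power of 32": 2 ** 32,
}


def grade(test, response):
    """Hypothetical pass rule: the expected number appears in the response.

    Thousands separators are stripped first. Substring matching is a
    simplification; it would also accept "142" when "42" is expected.
    """
    return str(EXPECTED[test]) in response.replace(",", "")
```

A stricter grader would extract the final number from the response and compare it exactly.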
Prompter can give models access to external tools during evaluation. All tools are opt-in and enabled per session.
- Web Search via SearXNG or DuckDuckGo
- URL Fetch to read and summarise web pages
- File Reader to access files within a configurable root
- Shell Commands via an open-terminal sandbox
- Python Execution in the same sandbox
- Calculator for precise arithmetic
- Diff for comparing text blocks
- Token Counter using Ollama's tokenisation
- Embeddings for similarity scoring
- RSS Feed parsing
Tools must be running and accessible at startup to appear as options.
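One way to decide which tools to offer is to probe each backing service before showing the menu. The sketch below assumes a plain HTTP request is enough to tell whether a service is up. Both `is_reachable` and `available_tools` are hypothetical helpers, and the real startup check may work differently.

```python
import urllib.error
import urllib.request


def is_reachable(url, timeout=2.0):
    """Return True if something answers HTTP at url, even with an error status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        return True  # the server responded, just not with a 2xx status
    except (urllib.error.URLError, OSError, ValueError):
        return False  # refused, timed out, unresolvable, or malformed URL


def available_tools(endpoints, probe=is_reachable):
    """Keep only the tools whose backing service passes the probe."""
    return [name for name, url in endpoints.items() if probe(url)]
```

Passing the probe in as a parameter keeps the filtering logic testable without any network access.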
All configuration is done via environment variables.
| Variable | Default | Description |
|---|---|---|
| `OLLAMA_URL` | `http://localhost:11434` | Ollama API endpoint |
| `SEARXNG_URL` | `http://localhost:8580` | SearXNG instance for web search |
| `OPEN_TERMINAL_URL` | `http://localhost:8000` | Sandbox for shell and Python |
| `OPEN_TERMINAL_API_KEY` | (none) | API key for the sandbox |
| `OUTPUT_DIR` | current directory | Where `responses/` is written |
| `FILE_READER_ROOT` | home directory | Filesystem root for file reader |
SearXNG requires `json` to be added to `search.formats` in `settings.yml`.
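The variables above map naturally onto a small loader with fallbacks. This is a sketch of the pattern, assuming the documented defaults. `load_config` and `searxng_search_url` are hypothetical names, and the query URL relies on SearXNG's `/search` endpoint accepting `q` and `format=json`, which is why the JSON format has to be enabled.

```python
import os
from pathlib import Path
from urllib.parse import urlencode


def load_config(env=None):
    """Read settings from the environment, falling back to documented defaults."""
    env = os.environ if env is None else env
    return {
        "OLLAMA_URL": env.get("OLLAMA_URL", "http://localhost:11434"),
        "SEARXNG_URL": env.get("SEARXNG_URL", "http://localhost:8580"),
        "OPEN_TERMINAL_URL": env.get("OPEN_TERMINAL_URL", "http://localhost:8000"),
        "OPEN_TERMINAL_API_KEY": env.get("OPEN_TERMINAL_API_KEY"),  # None if unset
        "OUTPUT_DIR": env.get("OUTPUT_DIR", os.getcwd()),
        "FILE_READER_ROOT": env.get("FILE_READER_ROOT", str(Path.home())),
    }


def searxng_search_url(config, query):
    """Build a SearXNG query URL that asks for a JSON response."""
    base = config["SEARXNG_URL"].rstrip("/")
    return base + "/search?" + urlencode({"q": query, "format": "json"})
```

To point a session at a different Ollama host, set the variable on the command line, for example `OLLAMA_URL=http://192.168.1.20:11434 python3 prompter.py` (the address is a placeholder).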
These work on every screen.
| Key | Action |
|---|---|
| `up` / `down` or `j` / `k` | Navigate |
| `space` | Toggle selection |
| `enter` | Confirm or proceed |
| `q` | Quit |
| `b` | Go back |
| `A` | Select all |
During streaming, `p` pauses the auto-advance countdown between models. `enter` or `n` skips it.
Results are saved under `responses/` in your output directory, organised by mode:
responses/
    default/      # Side-by-side comparisons
    ralph/        # Self-review loops
    council/      # Multi-model debates
    bully/        # Tribunal sessions
    benchmarks/   # Per-model benchmark files
Each file includes the prompt, model stats (timing, tokens, speed), tool call details when tools were used, and the full response text. All sections use collapsible markdown blocks so the files are easy to navigate in any markdown viewer.
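A collapsible section in markdown is usually written with HTML `<details>` and `<summary>` tags, which most viewers render as a fold. The sketch below assumes that is the mechanism used here. The `collapsible` and `render_report` helpers and the result field names are hypothetical, and the real file layout may differ.

```python
import html


def collapsible(summary, body):
    """Wrap body in a <details> block with an escaped one-line summary."""
    return (
        "<details>\n"
        f"<summary>{html.escape(summary)}</summary>\n\n"
        f"{body}\n\n"
        "</details>\n"
    )


def render_report(prompt, results):
    """Build a report: the prompt first, then one folded section per model."""
    parts = [f"# Prompt\n\n{prompt}\n"]
    for r in results:
        stats = f"{r['model']}: {r['tokens']} tokens in {r['seconds']:.2f}s"
        parts.append(collapsible(stats, r["response"]))
    return "\n".join(parts)
```

The blank lines around the body matter: without them, many markdown renderers treat the content inside `<details>` as raw HTML and skip markdown formatting.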
- Python 3.7 or later
- Ollama running and accessible
- Zero pip dependencies (standard library only)
Optional for extended features:
- A SearXNG instance for web search
- An open-terminal sandbox for shell and Python execution
- Developers running local LLMs who want to pick the right model for a task
- Teams evaluating open-weight models before deployment
- Researchers comparing model behaviour across prompts and tools
- Anyone with Ollama who's tired of copying prompts between terminal tabs
Prompter is part of the local LLM evaluation ecosystem. Compare with:
- Ollama: the runtime Prompter builds on
- Open WebUI: browser-based chat
- promptfoo: CLI prompt testing framework
- whichllm: model ranking by benchmarks
Prompter sits between ad-hoc `ollama run` and heavyweight eval frameworks: fast, terminal-first comparison with zero setup.
Found a bug or want to add a feature? PRs welcome.
git clone https://github.com/whonixnetworks/prompter.git
cd prompter
python3 scripts/bundle.py
python3 -m unittest discover tests
See AGENTS.md for development conventions.
MIT License. See LICENSE for details.
Built by whonixnetworks · prompter.whonix.net