Prompter – Compare and benchmark Ollama models side-by-side in your terminal


Multi-model Ollama comparison, benchmarking, and evaluation — in your terminal.

Zero dependencies. One file. Standard library only.

wget https://raw.githubusercontent.com/whonixnetworks/prompter/main/prompter.py
chmod +x prompter.py
python3 prompter.py

Python 3.7+. Ollama running locally. That is it.

Prompter is a terminal-based multi-model evaluation tool for Ollama. Run the same prompt through multiple models simultaneously and watch responses stream in side by side. Go beyond simple comparison with structured evaluation modes: self-review loops, multi-model debate panels, and adversarial interrogation.

Results are saved as collapsible markdown files with full stats — timing, token counts, tool call traces, and response text.

Ollama is excellent for running one model at a time. Prompter fills the gaps when you need to evaluate, compare, or stress-test local models.

Ollama alone	Prompter
One model per session	Stream multiple models side by side
Single chat thread	Structured debates (Council, Tribunal)
Manual quality checks	Automated 20-test benchmark battery
No built-in evaluation	Self-review loops with similarity scoring
Ad-hoc comparison	Side-by-side streaming with tool orchestration

- Compare models live. Stream responses from multiple models simultaneously. Spot differences in reasoning, accuracy, and tone without switching terminals.
- Structured evaluation modes. Council mode brings together Domain Expert, Skeptic, and Devil's Advocate personas to debate a question. Tribunal mode puts claims under adversarial scrutiny with an arbiter. These aren't things you can do in a single chat session.
- Capability benchmarking. Run 20 standardised tests across every model: counting, maths, web search, URL fetching, file reading, shell commands, Python execution, RSS parsing, embeddings, and more. Per-model markdown reports with pass/fail summaries.

Prompter is not just a side-by-side viewer. It has four evaluation modes and a benchmark mode, each designed for a different kind of question.

Default -- One prompt, multiple models, live streaming. The simplest mode. Good for comparing how different models handle the same question.
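As a rough illustration of the idea, the sketch below sends one prompt to several models in parallel threads. It assumes Ollama's `/api/generate` endpoint with `stream` enabled, which returns one JSON object per line. The helper names (`collect_stream`, `stream_model`, `compare`) are hypothetical, and this is not Prompter's actual implementation.

```python
import json
import threading
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default Ollama endpoint


def collect_stream(lines):
    """Join the 'response' chunks of a line-delimited JSON stream until 'done'."""
    parts = []
    for raw in lines:
        if not raw.strip():
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)


def stream_model(model, prompt, results):
    """Send one prompt to one model and store the full reply in results."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": True}).encode()
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        results[model] = collect_stream(resp)


def compare(models, prompt):
    """Run the same prompt against every model, one thread per model."""
    results = {}
    threads = [
        threading.Thread(target=stream_model, args=(m, prompt, results))
        for m in models
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

With Ollama running, `compare(["model-a", "model-b"], "Why is the sky blue?")` would return a dict keyed by model name; the model names here are placeholders.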

Ralph -- Self-review loop. A model answers, then critiques and improves its own answer across multiple rounds. Useful for seeing how much a model can refine its own reasoning.
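A self-review loop of this kind could be sketched as follows. The `generate` callable is a hypothetical stand-in for a model call, and using `difflib.SequenceMatcher` as the similarity score is an assumption; the source does not say which metric Prompter uses.

```python
from difflib import SequenceMatcher


def self_review(generate, prompt, max_rounds=3, threshold=0.95):
    """Ask for an answer, then repeatedly ask the model to critique and improve it.

    `generate` is any callable mapping a prompt string to a reply string.
    The loop stops early once two consecutive answers are nearly identical.
    Returns a list of (answer, similarity_to_previous) tuples.
    """
    answer = generate(prompt)
    history = [(answer, None)]  # the first answer has nothing to compare against
    for _ in range(max_rounds):
        review_prompt = (
            f"Question: {prompt}\n\nYour previous answer:\n{answer}\n\n"
            "Critique this answer, then write an improved version."
        )
        revised = generate(review_prompt)
        score = SequenceMatcher(None, answer, revised).ratio()
        history.append((revised, score))
        answer = revised
        if score >= threshold:
            break  # the model has stopped changing its answer
    return history
```

A similarity close to 1.0 between rounds suggests the model has converged; a low score means the revision changed the answer substantially.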

Council -- Multi-model debate. Three personas (Domain Expert, Skeptic, Devil's Advocate) each give their view on a question, then critique each other. A synthesis verdict follows. Good for complex decisions and risk assessments.
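The three phases described above (views, cross-critique, synthesis) might be orchestrated roughly like this. The persona instructions and the `generate(system, prompt)` callable are placeholders invented for illustration, not Prompter's real prompts or API.

```python
# Placeholder persona instructions; the real wording is not given in the source.
PERSONAS = {
    "Domain Expert": "You are a domain expert. Give a well-grounded answer.",
    "Skeptic": "You are a skeptic. Question assumptions and weak evidence.",
    "Devil's Advocate": "You are a devil's advocate. Argue the opposing case.",
}


def run_council(generate, question):
    """Collect persona views, cross-critiques, then a synthesis verdict.

    `generate(system, prompt)` is a hypothetical callable returning a reply string.
    """
    # Phase 1: each persona answers the question independently.
    views = {name: generate(system, question) for name, system in PERSONAS.items()}

    # Phase 2: each persona critiques the other two views.
    critiques = {}
    for name, system in PERSONAS.items():
        others = "\n\n".join(f"{n}: {v}" for n, v in views.items() if n != name)
        critiques[name] = generate(
            system, f"Question: {question}\n\nCritique these views:\n{others}"
        )

    # Phase 3: a neutral pass over the whole transcript produces the verdict.
    transcript = "\n\n".join(
        f"{n}\nView: {views[n]}\nCritique: {critiques[n]}" for n in PERSONAS
    )
    verdict = generate(
        "You are a neutral moderator.",
        f"Question: {question}\n\n{transcript}\n\nWrite a synthesis verdict.",
    )
    return {"views": views, "critiques": critiques, "verdict": verdict}
```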

Tribunal -- Adversarial interrogation. One model defends a position while another challenges it across multiple rounds. An arbiter rules on each challenge. Useful for stress-testing claims and spotting weak reasoning.
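A minimal sketch of the round structure, assuming three hypothetical callables (`defend`, `prosecute`, `arbitrate`) that each map a prompt string to a reply. The prompt wording is illustrative only.

```python
def run_tribunal(defend, prosecute, arbitrate, claim, rounds=2):
    """Alternate challenge and defence, with an arbiter ruling after each round.

    Returns a list of (label, text) tuples in the order they were produced.
    """
    defence = defend(f"Defend this position: {claim}")
    log = [("defence", defence)]
    for n in range(1, rounds + 1):
        # The challenger sees the latest defence and attacks it.
        challenge = prosecute(f"Claim: {claim}\nDefence: {defence}\nChallenge it.")
        # The defender responds, and may concede.
        defence = defend(f"Claim: {claim}\nChallenge: {challenge}\nRespond.")
        # The arbiter rules on this specific challenge.
        ruling = arbitrate(
            f"Claim: {claim}\nChallenge: {challenge}\nResponse: {defence}\n"
            "Rule on this challenge."
        )
        log += [
            (f"challenge {n}", challenge),
            (f"defence {n}", defence),
            (f"ruling {n}", ruling),
        ]
    return log
```

Passing a different model behind each callable keeps the defender, challenger, and arbiter independent of one another.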

Benchmark -- Runs 20 standardised tests against each selected model: counting, maths, web search, URL fetching, file reading, shell commands, Python execution, RSS parsing, embeddings, and more. Results are saved as per-model markdown files with pass/fail summaries.

Here is what real output from Prompter looks like.

Tribunal mode: fact-checking a historical claim

Three models run a tribunal session on the prompt: "Explain how John F. Kennedy was president in 1995."

Opening Defence (model 1) immediately flags the factual error, provides a corrected timeline, and includes a reference table.

Opening Prosecution (model 2) states the same correction more directly: "This is an incorrect statement... JFK served from 1961 to 1963. Therefore, it is impossible for him to have been president in 1995."

Round 1 -- Prosecution challenges with exact dates and notes Bill Clinton as the correct president for 1995. Defence concedes: "You are absolutely correct -- and thank you for the clear, well-structured correction."

Round 2 -- Prosecution deepens the challenge with additional context on the 32-year gap and the origin of the confusion. Defence further strengthens the concession with a detailed breakdown table and sources.

Arbiter verdict: JFK was not president in 1995. He served from 1961 to 1963. Bill Clinton was president from 1993 to 2001, which includes 1995.

Tribunal mode caught the deliberate error in the prompt and produced a well-structured correction with sources, timelines, and clear concessions.

Benchmark mode: testing a model across 20 capability tests

A benchmark run against `nemotron-3-nano:4b` produced this summary (15 of the 20 tests are listed):

Test	Category	Result	Tokens	Time
Count r's in "strawberry"	Basic Counting	Pass	321	8.62s
Count e's in "Tennessee"	Basic Counting	Fail	120	1.54s
Square root of 1764	Calculator	Pass	165	2.32s
2 to the power of 32	Calculator	Fail	126	1.59s
Current PM of Australia	Web Search	Fail	2297	10.21s
Fetch example.com	URL Fetch	Pass	1692	3.21s
Read /etc/hostname	File Read	Pass	1770	6.56s
Shell: pwd	Shell Command	Pass	1496	2.12s
Shell: date	Shell Command	Pass	1549	2.47s
Python: Fibonacci	Python Exec	Pass	1734	4.60s
Diff comparison	Diff	Pass	1978	4.90s
Token counting	Token Counter	Pass	4320	17.95s
Embeddings similarity	Embeddings	Pass	1984	9.95s
Hacker News RSS feed	RSS Feed	Pass	2449	13.01s
Summarise example.com	Summarise URL	Pass	1905	9.39s

Result: 14 out of 20 tests passed.

Failures clustered around counting edge cases, basic arithmetic, and web search (the model answered without calling the search tool). The full markdown file includes each test in a collapsible section with the exact response, tool calls, and timing data.
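The deterministic tests in the table have answers that are easy to verify independently. The sketch below computes four of them and shows one possible way to grade a free-text response. The `grade` and `summarise` helpers are hypothetical; how Prompter actually scores a test is not described in the source.

```python
import math
import re


def expected_answers():
    """Ground-truth values for four of the deterministic tests in the table."""
    return {
        "r in strawberry": "strawberry".count("r"),
        "e in Tennessee": "Tennessee".lower().count("e"),
        "sqrt 1764": int(math.sqrt(1764)),
        "2 ** 32": 2 ** 32,
    }


def grade(response, expected):
    """Pass if the expected integer appears in the response as a whole number.

    Thousands separators are stripped, so '4,294,967,296' matches 4294967296.
    """
    numbers = [m.replace(",", "") for m in re.findall(r"\d[\d,]*", response)]
    return str(expected) in numbers


def summarise(results):
    """results maps test name to bool; returns an 'N out of M' summary line."""
    passed = sum(results.values())
    return f"Result: {passed} out of {len(results)} tests passed."
```

The expected values are 3, 4, 42, and 4294967296 respectively, so a reply such as "There are 2 e's in Tennessee" fails under this check.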

Prompter can give models access to external tools during evaluation. All tools are opt-in and enabled per session.

- Web Search via SearXNG or DuckDuckGo
- URL Fetch to read and summarise web pages
- File Reader to access files within a configurable root
- Shell Commands via an open-terminal sandbox
- Python Execution in the same sandbox
- Calculator for precise arithmetic
- Diff for comparing text blocks
- Token Counter using Ollama's tokenisation
- Embeddings for similarity scoring
- RSS Feed parsing

Tools must be running and accessible at startup to appear as options.

All configuration is done via environment variables.

Variable	Default	Description
`OLLAMA_URL`	`http://localhost:11434`	Ollama API endpoint
`SEARXNG_URL`	`http://localhost:8580`	SearXNG instance for web search
`OPEN_TERMINAL_URL`	`http://localhost:8000`	Sandbox for shell and Python
`OPEN_TERMINAL_API_KEY`	(none)	API key for the sandbox
`OUTPUT_DIR`	current directory	Where `responses/` is written
`FILE_READER_ROOT`	home directory	Filesystem root for file reader
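Reading these variables with their documented defaults could look like the sketch below. The `load_config` function and its dictionary keys are hypothetical names chosen for illustration; only the variable names and defaults come from the table.

```python
import os
from pathlib import Path


def load_config(env=None):
    """Resolve settings from environment variables, using the documented defaults."""
    env = os.environ if env is None else env
    return {
        "ollama_url": env.get("OLLAMA_URL", "http://localhost:11434"),
        "searxng_url": env.get("SEARXNG_URL", "http://localhost:8580"),
        "open_terminal_url": env.get("OPEN_TERMINAL_URL", "http://localhost:8000"),
        "open_terminal_api_key": env.get("OPEN_TERMINAL_API_KEY"),  # None if unset
        "output_dir": Path(env.get("OUTPUT_DIR", os.getcwd())),
        "file_reader_root": Path(env.get("FILE_READER_ROOT", str(Path.home()))),
    }
```

Accepting an optional mapping instead of always reading `os.environ` makes the defaults easy to check without touching the real environment.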

SearXNG requires `json` to be added to `search.formats` in `settings.yml`.
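The JSON format matters because a script has to parse the search results rather than scrape HTML. A minimal sketch of such a query, assuming SearXNG's `/search` endpoint with the `format=json` parameter; the function names are hypothetical and this is not taken from Prompter's source.

```python
import json
import urllib.parse
import urllib.request


def build_search_url(base_url, query):
    """Build a SearXNG search URL that asks for JSON output."""
    params = urllib.parse.urlencode({"q": query, "format": "json"})
    return f"{base_url.rstrip('/')}/search?{params}"


def parse_results(payload, limit=5):
    """Extract (title, url) pairs from a decoded SearXNG JSON response."""
    return [
        (r.get("title", ""), r.get("url", ""))
        for r in payload.get("results", [])[:limit]
    ]


def search(base_url, query):
    """Run the query against a live instance and return the parsed results.

    urlopen raises urllib.error.HTTPError if the instance rejects the request,
    which is what typically happens when the JSON format is not enabled.
    """
    with urllib.request.urlopen(build_search_url(base_url, query), timeout=10) as resp:
        return parse_results(json.load(resp))
```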

These keys work on every screen.

Key	Action
`up` / `down` or `j` / `k`	Navigate
`space`	Toggle selection
`enter`	Confirm or proceed
`q`	Quit
`b`	Go back
`A`	Select all

During streaming, `p` pauses the auto-advance countdown between models. `enter` or `n` skips it.

Results are saved under `responses/` in your output directory, organised by mode:

responses/
  default/     # Side-by-side comparisons
  ralph/       # Self-review loops
  council/     # Multi-model debates
  bully/       # Tribunal sessions
benchmarks/    # Per-model benchmark files

Each file includes the prompt, model stats (timing, tokens, speed), tool call details when tools were used, and the full response text. All sections use collapsible markdown blocks so the files are easy to navigate in any markdown viewer.
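Collapsible sections in markdown are usually written as HTML `<details>` blocks, and the sketch below assumes that is the mechanism used here. The function names and the exact layout are illustrative; the real files may be structured differently.

```python
from html import escape


def collapsible(summary, body):
    """Wrap body in a <details> block, which most markdown viewers can fold."""
    return f"<details>\n<summary>{escape(summary)}</summary>\n\n{body}\n\n</details>\n"


def render_result(prompt, model, seconds, tokens, response):
    """Render one model's result: prompt, stats, and response as folded sections."""
    speed = tokens / seconds if seconds else 0.0
    stats = f"- Time: {seconds:.2f}s\n- Tokens: {tokens}\n- Speed: {speed:.1f} tokens/s"
    return "\n".join([
        f"# {model}\n",
        collapsible("Prompt", prompt),
        collapsible("Stats", stats),
        collapsible("Response", response),
    ])
```

The blank lines inside each `<details>` block let the markdown in the body render normally once the section is expanded.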

Python 3.7 or later
Ollama running and accessible
Zero pip dependencies (standard library only)

Optional for extended features:

A SearXNG instance for web search
An open-terminal sandbox for shell and Python execution

- Developers running local LLMs who want to pick the right model for a task
- Teams evaluating open-weight models before deployment
- Researchers comparing model behaviour across prompts and tools
- Anyone with Ollama who's tired of copying prompts between terminal tabs

Prompter is part of the local LLM evaluation ecosystem. Compare with:

- Ollama: the runtime Prompter builds on
- Open WebUI: browser-based chat
- promptfoo: CLI prompt testing framework
- whichllm: model ranking by benchmarks

Prompter sits between ad-hoc `ollama run` and heavyweight eval frameworks: fast, terminal-first comparison with zero setup.

Found a bug or want to add a feature? PRs welcome.

git clone https://github.com/whonixnetworks/prompter.git
cd prompter
python3 scripts/bundle.py
python3 -m unittest discover tests

See AGENTS.md for development conventions.

MIT License. See LICENSE for details.

Built by whonixnetworks · prompter.whonix.net
