
Prompter – Compare and benchmark Ollama models side-by-side in your terminal

Prompter, a new terminal-based tool for Ollama, enables users to compare and benchmark multiple AI models side-by-side using a single Python file. The tool offers structured evaluation modes including self-review loops, multi-model debates, and adversarial interrogation, with results saved as markdown files containing timing, token counts, and response data. Prompter addresses the limitation of Ollama's single-session model by allowing simultaneous streaming, automated benchmarking across 20 standardized tests, and structured persona-based debates for complex decision-making.

7 min read · published May 26, 2026

Multi-model Ollama comparison, benchmarking, and evaluation — in your terminal.

Zero dependencies. One file. Standard library only.

wget https://raw.githubusercontent.com/whonixnetworks/prompter/main/prompter.py
chmod +x prompter.py
python3 prompter.py

Python 3.7+. Ollama running locally. That is it.

Prompter is a terminal-based multi-model evaluation tool for Ollama. Run the same prompt through multiple models simultaneously and watch responses stream in side by side. Go beyond simple comparison with structured evaluation modes: self-review loops, multi-model debate panels, and adversarial interrogation.

Results are saved as collapsible markdown files with full stats — timing, token counts, tool call traces, and response text.

Ollama is excellent for running one model at a time. Prompter fills the gaps when you need to evaluate, compare, or stress-test local models.

| Ollama alone | Prompter |
|---|---|
| One model per session | Stream multiple models side by side |
| Single chat thread | Structured debates (Council, Tribunal) |
| Manual quality checks | Automated 20-test benchmark battery |
| No built-in evaluation | Self-review loops with similarity scoring |
| Ad-hoc comparison | Side-by-side streaming with tool orchestration |

  • Compare models live. Stream responses from multiple models simultaneously. Spot differences in reasoning, accuracy, and tone without switching terminals.
  • Structured evaluation modes. Council mode brings together Domain Expert, Skeptic, and Devil's Advocate personas to debate a question. Tribunal mode puts claims under adversarial scrutiny with an arbiter. These aren't things you can do in a single chat session.
  • Capability benchmarking. Run 20 standardised tests across every model: counting, maths, web search, URL fetching, file reading, shell commands, Python execution, RSS parsing, embeddings, and more. Per-model markdown reports with pass/fail summaries.

Prompter is not just a side-by-side viewer. It has five modes (Default, Ralph, Council, Tribunal, and Benchmark), each designed for a different kind of question.

Default -- One prompt, multiple models, live streaming. The simplest mode. Good for comparing how different models handle the same question.
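A minimal sketch of how one prompt might be fanned out to several models, assuming Ollama's `/api/generate` endpoint with `"stream": true`, which returns newline-delimited JSON chunks. The helper names `collect_stream`, `stream_model`, and `compare` are hypothetical and not taken from prompter.py. For brevity this version collects each response instead of rendering it live.

```python
import json
import threading
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default Ollama endpoint


def collect_stream(lines):
    """Fold Ollama NDJSON chunks into (full_text, stats)."""
    text, stats = [], {}
    for raw in lines:
        if isinstance(raw, bytes):
            raw = raw.decode("utf-8")
        raw = raw.strip()
        if not raw:
            continue
        chunk = json.loads(raw)
        text.append(chunk.get("response", ""))
        if chunk.get("done"):
            # The final chunk carries the run statistics.
            stats = {
                "tokens": chunk.get("eval_count", 0),
                "seconds": chunk.get("total_duration", 0) / 1e9,  # ns -> s
            }
    return "".join(text), stats


def stream_model(model, prompt, results):
    """Send one prompt to one model and store its (text, stats)."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": True}).encode()
    req = urllib.request.Request(
        OLLAMA_URL + "/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        results[model] = collect_stream(resp)


def compare(models, prompt):
    """Run the same prompt against every model, one thread per model."""
    results = {}
    threads = [
        threading.Thread(target=stream_model, args=(m, prompt, results))
        for m in models
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Calling `compare(["model-a", "model-b"], "Explain DNS in one paragraph")` with two locally pulled models (the names here are placeholders) would return a dictionary keyed by model name.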

Ralph -- Self-review loop. A model answers, then critiques and improves its own answer across multiple rounds. Useful for seeing how much a model can refine its own reasoning.
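The self-review idea can be sketched as a loop that stops once successive answers stop changing. This is an illustration only: the source does not say how Prompter scores similarity, so `difflib.SequenceMatcher` is an assumption, and `ask`, `self_review`, `max_rounds`, and `stop_at` are hypothetical names.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0..1 ratio of how alike two answers are."""
    return SequenceMatcher(None, a, b).ratio()


def self_review(ask, prompt, max_rounds=3, stop_at=0.95):
    """ask(text) -> str is any callable that queries a model.

    Returns a list of (answer, similarity_to_previous) tuples; the first
    entry has no previous answer, so its score is None.
    """
    answer = ask(prompt)
    history = [(answer, None)]
    for _ in range(max_rounds):
        revised = ask(
            "Question:\n" + prompt
            + "\n\nYour previous answer:\n" + answer
            + "\n\nCritique this answer, then write an improved version."
        )
        score = similarity(answer, revised)
        history.append((revised, score))
        answer = revised
        if score >= stop_at:
            break  # the answer has stabilised; further rounds add little
    return history
```

A high similarity score between rounds suggests the model has converged, which is one possible stopping rule among several.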

Council -- Multi-model debate. Three personas (Domain Expert, Skeptic, Devil's Advocate) each give their view on a question, then critique each other. A synthesis verdict follows. Good for complex decisions and risk assessments.
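The three-step flow described above (opening views, cross-critique, synthesis) might be orchestrated roughly as follows. The persona briefs, the prompt wording, and the `council` function are invented for illustration; only the three persona names come from the source.

```python
PERSONAS = {
    "Domain Expert": "You are a domain expert. Give the most accurate answer you can.",
    "Skeptic": "You are a skeptic. Question assumptions and ask for evidence.",
    "Devil's Advocate": "You are a devil's advocate. Argue the opposing case.",
}


def council(ask, question):
    """ask(text) -> str is any callable that queries a model."""
    # Step 1: each persona answers independently.
    opening = {
        name: ask(brief + "\n\nQuestion: " + question)
        for name, brief in PERSONAS.items()
    }
    # Step 2: each persona critiques the other two views.
    critiques = {}
    for name, brief in PERSONAS.items():
        others = "\n\n".join(
            other + ": " + view for other, view in opening.items() if other != name
        )
        critiques[name] = ask(brief + "\n\nCritique these views:\n\n" + others)
    # Step 3: a single synthesis call over the whole transcript.
    transcript = "\n\n".join(
        name + " said: " + opening[name] + "\nCritique: " + critiques[name]
        for name in PERSONAS
    )
    verdict = ask("Synthesise a verdict for: " + question + "\n\n" + transcript)
    return {"opening": opening, "critiques": critiques, "verdict": verdict}
```

With three personas this makes seven model calls per question: three openings, three critiques, and one verdict.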

Tribunal -- Adversarial interrogation. One model defends a position while another challenges it across multiple rounds. An arbiter rules on each challenge. Useful for stress-testing claims and spotting weak reasoning.
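As a rough sketch of the round structure, assuming each role is a separate callable that wraps a model: the `tribunal` function, its parameters, and the prompt wording are hypothetical and do not reflect Prompter's actual prompts.

```python
def tribunal(defend, challenge, arbitrate, claim, rounds=2):
    """Each of defend/challenge/arbitrate is a callable: text -> str."""
    defence = defend("Defend or correct this claim: " + claim)
    log = []
    for n in range(1, rounds + 1):
        # The prosecution attacks the current state of the defence.
        attack = challenge(
            "Claim: " + claim + "\nDefence: " + defence
            + "\nChallenge the weakest point."
        )
        # The defence responds, or concedes if the challenge holds.
        defence = defend(
            "Claim: " + claim + "\nChallenge: " + attack + "\nRespond or concede."
        )
        # The arbiter rules on this specific challenge.
        ruling = arbitrate(
            "Challenge: " + attack + "\nResponse: " + defence
            + "\nRule on this challenge."
        )
        log.append(
            {"round": n, "challenge": attack, "response": defence, "ruling": ruling}
        )
    return log
```

Keeping the three roles as independent callables means the same model or three different models can fill them.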

Benchmark -- Runs 20 standardised tests against each selected model: counting, maths, web search, URL fetching, file reading, shell commands, Python execution, RSS parsing, embeddings, and more. Results are saved as per-model markdown files with pass/fail summaries.

Here is what real output from Prompter looks like.

Tribunal mode: fact-checking a historical claim

Three models run a tribunal session on the prompt: "Explain how John F. Kennedy was president in 1995."

Opening Defence (model 1) immediately flags the factual error, provides a corrected timeline, and includes a reference table.

Opening Prosecution (model 2) states the same correction more directly: "This is an incorrect statement... JFK served from 1961 to 1963. Therefore, it is impossible for him to have been president in 1995."

Round 1 -- Prosecution challenges with exact dates and notes Bill Clinton as the correct president for 1995. Defence concedes: "You are absolutely correct -- and thank you for the clear, well-structured correction."

Round 2 -- Prosecution deepens the challenge with additional context on the 32-year gap and the origin of the confusion. Defence further strengthens the concession with a detailed breakdown table and sources.

Arbiter verdict: JFK was not president in 1995. He served from 1961 to 1963. Bill Clinton was president from 1993 to 2001, which includes 1995.

Tribunal mode caught the deliberate error in the prompt and produced a well-structured correction with sources, timelines, and clear concessions.

Benchmark mode: testing a model across 20 capability tests

A benchmark run against nemotron-3-nano:4b produced this summary (15 of the 20 tests are listed here):

| Test | Category | Result | Tokens | Time |
|---|---|---|---|---|
| Count r's in "strawberry" | Basic Counting | Pass | 321 | 8.62s |
| Count e's in "Tennessee" | Basic Counting | Fail | 120 | 1.54s |
| Square root of 1764 | Calculator | Pass | 165 | 2.32s |
| 2 to the power of 32 | Calculator | Fail | 126 | 1.59s |
| Current PM of Australia | Web Search | Fail | 2297 | 10.21s |
| Fetch example.com | URL Fetch | Pass | 1692 | 3.21s |
| Read /etc/hostname | File Read | Pass | 1770 | 6.56s |
| Shell: pwd | Shell Command | Pass | 1496 | 2.12s |
| Shell: date | Shell Command | Pass | 1549 | 2.47s |
| Python: Fibonacci | Python Exec | Pass | 1734 | 4.60s |
| Diff comparison | Diff | Pass | 1978 | 4.90s |
| Token counting | Token Counter | Pass | 4320 | 17.95s |
| Embeddings similarity | Embeddings | Pass | 1984 | 9.95s |
| Hacker News RSS feed | RSS Feed | Pass | 2449 | 13.01s |
| Summarise example.com | Summarise URL | Pass | 1905 | 9.39s |

Result: 14 out of 20 tests passed.

Failures clustered around counting edge cases, basic arithmetic, and web search (the model answered without calling the search tool). The full markdown file includes each test in a collapsible section with the exact response, tool calls, and timing data.
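The counting and calculator tests have answers that can be computed directly, which makes them easy to grade automatically. The sketch below shows one way such a check could work. The `grade` rule (the expected number must appear somewhere in the response) is an assumption, not Prompter's documented pass criterion, and `EXPECTED`, `grade`, and `summarise` are hypothetical names.

```python
import math
import re

# Ground truth for the four deterministic tests, computed rather than typed.
EXPECTED = {
    "strawberry_r": str("strawberry".count("r")),
    "tennessee_e": str("Tennessee".count("e")),
    "sqrt_1764": str(int(math.sqrt(1764))),
    "two_pow_32": str(2 ** 32),
}


def grade(response, expected):
    """Pass if the expected number appears as a whole number in the response.

    Thousands separators are stripped first, so "4,294,967,296" matches.
    This is deliberately naive: it would also merge "1,2" into "12".
    """
    numbers = re.findall(r"\d+", response.replace(",", ""))
    return expected in numbers


def summarise(results):
    """results maps test name -> bool."""
    passed = sum(1 for ok in results.values() if ok)
    return "%d out of %d tests passed" % (passed, len(results))
```

For example, `grade("The answer is 4,294,967,296.", EXPECTED["two_pow_32"])` passes, while a response that says "Tennessee" has two e's fails.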

Prompter can give models access to external tools during evaluation. All tools are opt-in and enabled per session.

  • Web Search via SearXNG or DuckDuckGo
  • URL Fetch to read and summarise web pages
  • File Reader to access files within a configurable root
  • Shell Commands via an open-terminal sandbox
  • Python Execution in the same sandbox
  • Calculator for precise arithmetic
  • Diff for comparing text blocks
  • Token Counter using Ollama's tokenisation
  • Embeddings for similarity scoring
  • RSS Feed parsing

Tools must be running and accessible at startup to appear as options.
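A file reader restricted to a configurable root needs some form of path containment check. The following is one common way to do it with the standard library, offered as an illustration and not as Prompter's actual implementation. `resolve_inside` is a hypothetical helper.

```python
import os


def resolve_inside(root, requested):
    """Return the absolute path of `requested` if it stays under `root`.

    Symlinks and ".." segments are resolved before the comparison, so
    "../../etc/passwd" and absolute paths outside the root are rejected.
    """
    root = os.path.realpath(root)
    target = os.path.realpath(os.path.join(root, requested))
    if os.path.commonpath([root, target]) != root:
        raise PermissionError("path escapes the file reader root: " + requested)
    return target
```

`os.path.commonpath` is used instead of a string prefix test so that a root of `/srv/data` does not accidentally admit `/srv/database`.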

All configuration is done via environment variables.

| Variable | Default | Description |
|---|---|---|
| OLLAMA_URL | http://localhost:11434 | Ollama API endpoint |
| SEARXNG_URL | http://localhost:8580 | SearXNG instance for web search |
| OPEN_TERMINAL_URL | http://localhost:8000 | Sandbox for shell and Python |
| OPEN_TERMINAL_API_KEY | (none) | API key for the sandbox |
| OUTPUT_DIR | current directory | Where responses/ is written |
| FILE_READER_ROOT | home directory | Filesystem root for file reader |
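Reading these variables with their documented defaults could look like the sketch below. The variable names and default values come from the table; the `load_config` function and its `env` parameter are hypothetical.

```python
import os


def load_config(env=None):
    """Build a settings dict from environment variables.

    `env` defaults to os.environ; passing a plain dict makes testing easy.
    """
    env = os.environ if env is None else env
    return {
        "OLLAMA_URL": env.get("OLLAMA_URL", "http://localhost:11434"),
        "SEARXNG_URL": env.get("SEARXNG_URL", "http://localhost:8580"),
        "OPEN_TERMINAL_URL": env.get("OPEN_TERMINAL_URL", "http://localhost:8000"),
        "OPEN_TERMINAL_API_KEY": env.get("OPEN_TERMINAL_API_KEY"),  # no default
        "OUTPUT_DIR": env.get("OUTPUT_DIR", os.getcwd()),
        "FILE_READER_ROOT": env.get("FILE_READER_ROOT", os.path.expanduser("~")),
    }
```

In a shell, a one-off override would then be `OLLAMA_URL=http://gpu-box:11434 python3 prompter.py`, where `gpu-box` is a placeholder host name.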

SearXNG requires json to be added to search.formats in settings.yml.
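The JSON format matters because a client has to request `format=json` from SearXNG's `/search` endpoint, and instances that have not enabled that format will typically refuse the request. A small sketch of building such a query URL follows. `searxng_query_url` is a hypothetical helper, not a function from prompter.py.

```python
from urllib.parse import urlencode


def searxng_query_url(base_url, query):
    """Build a SearXNG search URL that asks for a JSON response."""
    params = urlencode({"q": query, "format": "json"})
    return base_url.rstrip("/") + "/search?" + params
```

For example, `searxng_query_url("http://localhost:8580", "current PM of Australia")` yields a URL that can be fetched with `urllib.request.urlopen` and parsed with `json.load`.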

These work on every screen.

| Key | Action |
|---|---|
| up / down or j / k | Navigate |
| space | Toggle selection |
| enter | Confirm or proceed |
| q | Quit |
| b | Go back |
| A | Select all |

During streaming, p pauses the auto-advance countdown between models; enter or n skips it.

Results are saved under responses/ in your output directory, organised by mode:

responses/
  default/     # Side-by-side comparisons
  ralph/       # Self-review loops
  council/     # Multi-model debates
  bully/       # Tribunal sessions
benchmarks/    # Per-model benchmark files

Each file includes the prompt, model stats (timing, tokens, speed), tool call details when tools were used, and the full response text. All sections use collapsible markdown blocks so the files are easy to navigate in any markdown viewer.
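Collapsible sections in markdown are usually written with HTML `<details>` and `<summary>` tags. The sketch below shows how one result could be rendered that way. The exact layout of Prompter's files is not given in the source, so the field order, the tokens-per-second calculation, and the names `details_block` and `render_result` are assumptions.

```python
def details_block(summary, body):
    """Wrap body text in a collapsible <details> section."""
    return (
        "<details>\n<summary>" + summary + "</summary>\n\n"
        + body.strip() + "\n\n</details>\n"
    )


def render_result(prompt, model, tokens, seconds, response):
    """Render one model's result as markdown with collapsible sections."""
    speed = tokens / seconds if seconds else 0.0
    stats = "%d tokens in %.2fs (%.1f tok/s)" % (tokens, seconds, speed)
    return (
        "# " + model + "\n\n"
        + details_block("Prompt", prompt)
        + "\n"
        + details_block("Response: " + stats, response)
    )
```

The blank lines around the body matter: most markdown renderers only process markdown inside `<details>` when it is separated from the tags by an empty line.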

  • Python 3.7 or later
  • Ollama running and accessible
  • Zero pip dependencies (standard library only)

Optional for extended features:

  • A SearXNG instance for web search
  • An open-terminal sandbox for shell and Python execution

  • Developers running local LLMs who want to pick the right model for a task
  • Teams evaluating open-weight models before deployment
  • Researchers comparing model behaviour across prompts and tools
  • Anyone with Ollama who's tired of copying prompts between terminal tabs

Prompter is part of the local LLM evaluation ecosystem. Compare with:

  • Ollama: the runtime Prompter builds on
  • Open WebUI: browser-based chat
  • promptfoo: CLI prompt testing framework
  • whichllm: model ranking by benchmarks

Prompter sits between ad-hoc ollama run and heavyweight eval frameworks: fast, terminal-first comparison with zero setup.

Found a bug or want to add a feature? PRs welcome.

git clone https://github.com/whonixnetworks/prompter.git
cd prompter
python3 scripts/bundle.py
python3 -m unittest discover tests

See AGENTS.md for development conventions.

MIT License. See LICENSE for details.

Built by whonixnetworks · prompter.whonix.net
