Prompter – Compare and benchmark Ollama models side-by-side in your terminal

Prompter, a new terminal-based tool for Ollama, enables users to compare and benchmark multiple AI models side-by-side using a single Python file. The tool offers structured evaluation modes including self-review loops, multi-model debates, and adversarial interrogation, with results saved as markdown files containing timing, token counts, and response data. Prompter addresses the limitation of Ollama's single-session model by allowing simultaneous streaming, automated benchmarking across 20 standardized tests, and structured persona-based debates for complex decision-making.

Multi-model Ollama comparison, benchmarking, and evaluation, in your terminal. Zero dependencies. One file. Standard library only.

    wget https://raw.githubusercontent.com/whonixnetworks/prompter/main/prompter.py
    chmod +x prompter.py
    python3 prompter.py

Python 3.7+. Ollama running locally. That is it.

Prompter is a terminal-based multi-model evaluation tool for Ollama. Run the same prompt through multiple models simultaneously and watch responses stream in side by side. Go beyond simple comparison with structured evaluation modes: self-review loops, multi-model debate panels, and adversarial interrogation. Results are saved as collapsible markdown files with full stats: timing, token counts, tool call traces, and response text.

Ollama is excellent for running one model at a time. Prompter fills the gaps when you need to evaluate, compare, or stress-test local models.

| Ollama alone | Prompter |
|---|---|
| One model per session | Stream multiple models side by side |
| Single chat thread | Structured debates (Council, Tribunal) |
| Manual quality checks | Automated 20-test benchmark battery |
| No built-in evaluation | Self-review loops with similarity scoring |
| Ad-hoc comparison | Side-by-side streaming with tool orchestration |

- Compare models live. Stream responses from multiple models simultaneously. Spot differences in reasoning, accuracy, and tone without switching terminals.
- Structured evaluation modes. Council mode brings together Domain Expert, Skeptic, and Devil's Advocate personas to debate a question. Tribunal mode puts claims under adversarial scrutiny with an arbiter. These aren't things you can do in a single chat session.
- Capability benchmarking. Run 20 standardised tests across every model: counting, maths, web search, URL fetching, file reading, shell commands, Python execution, RSS parsing, embeddings, and more. Per-model markdown reports with pass/fail summaries.

Prompter is not just a side-by-side viewer.
It has five modes, each designed for a different kind of question.

- Default: One prompt, multiple models, live streaming. The simplest mode. Good for comparing how different models handle the same question.
- Ralph: Self-review loop. A model answers, then critiques and improves its own answer across multiple rounds. Useful for seeing how much a model can refine its own reasoning.
- Council: Multi-model debate. Three personas (Domain Expert, Skeptic, Devil's Advocate) each give their view on a question, then critique each other. A synthesis verdict follows. Good for complex decisions and risk assessments.
- Tribunal: Adversarial interrogation. One model defends a position while another challenges it across multiple rounds. An arbiter rules on each challenge. Useful for stress-testing claims and spotting weak reasoning.
- Benchmark: Runs 20 standardised tests against each selected model: counting, maths, web search, URL fetching, file reading, shell commands, Python execution, RSS parsing, embeddings, and more. Results are saved as per-model markdown files with pass/fail summaries.

Here is what real output from Prompter looks like.

Tribunal mode: fact-checking a historical claim

Three models run a tribunal session on the prompt: "Explain how John F. Kennedy was president in 1995."

Opening Defence (model 1) immediately flags the factual error, provides a corrected timeline, and includes a reference table.

Opening Prosecution (model 2) states the same correction more directly: "This is an incorrect statement... JFK served from 1961 to 1963. Therefore, it is impossible for him to have been president in 1995."

Round 1: Prosecution challenges with exact dates and notes Bill Clinton as the correct president for 1995. Defence concedes: "You are absolutely correct -- and thank you for the clear, well-structured correction."

Round 2: Prosecution deepens the challenge with additional context on the 32-year gap and the origin of the confusion.
Defence further strengthens the concession with a detailed breakdown table and sources.

Arbiter verdict: JFK was not president in 1995. He served from 1961 to 1963. Bill Clinton was president from 1993 to 2001, which includes 1995.

Tribunal mode caught the deliberate error in the prompt and produced a well-structured correction with sources, timelines, and clear concessions.

Benchmark mode: testing a model across 20 capability tests

A benchmark run against `nemotron-3-nano:4b` produced this summary (15 of the 20 tests are listed here):

| Test | Category | Result | Tokens | Time |
|---|---|---|---|---|
| Count r's in "strawberry" | Basic Counting | Pass | 321 | 8.62s |
| Count e's in "Tennessee" | Basic Counting | Fail | 120 | 1.54s |
| Square root of 1764 | Calculator | Pass | 165 | 2.32s |
| 2 to the power of 32 | Calculator | Fail | 126 | 1.59s |
| Current PM of Australia | Web Search | Fail | 2297 | 10.21s |
| Fetch example.com | URL Fetch | Pass | 1692 | 3.21s |
| Read /etc/hostname | File Read | Pass | 1770 | 6.56s |
| Shell: pwd | Shell Command | Pass | 1496 | 2.12s |
| Shell: date | Shell Command | Pass | 1549 | 2.47s |
| Python: Fibonacci | Python Exec | Pass | 1734 | 4.60s |
| Diff comparison | Diff | Pass | 1978 | 4.90s |
| Token counting | Token Counter | Pass | 4320 | 17.95s |
| Embeddings similarity | Embeddings | Pass | 1984 | 9.95s |
| Hacker News RSS feed | RSS Feed | Pass | 2449 | 13.01s |
| Summarise example.com | Summarise URL | Pass | 1905 | 9.39s |

Result: 14 out of 20 tests passed. Failures clustered around counting edge cases, basic arithmetic, and web search (the model answered without calling the search tool). The full markdown file includes each test in a collapsible section with the exact response, tool calls, and timing data.

Prompter can give models access to external tools during evaluation. All tools are opt-in and enabled per session.
- Web Search via SearXNG or DuckDuckGo
- URL Fetch to read and summarise web pages
- File Reader to access files within a configurable root
- Shell Commands via an open-terminal sandbox
- Python Execution in the same sandbox
- Calculator for precise arithmetic
- Diff for comparing text blocks
- Token Counter using Ollama's tokenisation
- Embeddings for similarity scoring
- RSS Feed parsing

Tools must be running and accessible at startup to appear as options.

All configuration is done via environment variables.

| Variable | Default | Description |
|---|---|---|
| `OLLAMA_URL` | `http://localhost:11434` | Ollama API endpoint |
| `SEARXNG_URL` | `http://localhost:8580` | SearXNG instance for web search |
| `OPEN_TERMINAL_URL` | `http://localhost:8000` | Sandbox for shell and Python |
| `OPEN_TERMINAL_API_KEY` | none | API key for the sandbox |
| `OUTPUT_DIR` | current directory | Where `responses/` is written |
| `FILE_READER_ROOT` | home directory | Filesystem root for file reader |

SearXNG requires `json` to be added to `search.formats` in `settings.yml`.

These keys work on every screen.

| Key | Action |
|---|---|
| `up` / `down` or `j` / `k` | Navigate |
| `space` | Toggle selection |
| `enter` | Confirm or proceed |
| `q` | Quit |
| `b` | Go back |
| `A` | Select all |

During streaming, `p` pauses the auto-advance countdown between models. `enter` or `n` skips it.

Results are saved under `responses/` in your output directory, organised by mode:

    responses/
      default/      Side-by-side comparisons
      ralph/        Self-review loops
      council/      Multi-model debates
      bully/        Tribunal sessions
      benchmarks/   Per-model benchmark files

Each file includes the prompt, model stats (timing, tokens, speed), tool call details when tools were used, and the full response text. All sections use collapsible markdown blocks so the files are easy to navigate in any markdown viewer.
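To make the output layout above concrete, here is a minimal sketch of how a result path and a collapsible section could be built. It is not taken from Prompter's source: `result_dir`, `collapsible`, and `MODE_DIRS` are hypothetical names, the `OUTPUT_DIR` spelling is assumed from the configuration table, and the use of an HTML `<details>` element for collapsible blocks is an assumption about how the files are written.

```python
import os
from pathlib import Path

# Folder names copied from the layout above; "bully" holds Tribunal sessions.
MODE_DIRS = {
    "default": "default",
    "ralph": "ralph",
    "council": "council",
    "tribunal": "bully",
    "benchmark": "benchmarks",
}


def result_dir(mode, env=None):
    """Hypothetical helper: resolve <OUTPUT_DIR>/responses/<mode folder>."""
    env = os.environ if env is None else env
    # Fall back to the current directory, the documented default for OUTPUT_DIR.
    base = Path(env.get("OUTPUT_DIR") or os.getcwd())
    return base / "responses" / MODE_DIRS[mode]


def collapsible(title, body):
    """Wrap text in an HTML <details> element (assumed markup, not confirmed)."""
    return f"<details>\n<summary>{title}</summary>\n\n{body}\n\n</details>\n"


if __name__ == "__main__":
    # Show where a Tribunal result would land and what one section looks like.
    print(result_dir("tribunal", {"OUTPUT_DIR": "/tmp/prompter"}))
    print(collapsible("Prompt", "Explain how John F. Kennedy was president in 1995."))
```

Passing the environment as a plain dictionary keeps the helper easy to test without touching the real process environment. The blank lines inside the `<details>` element are there so that markdown in the body still renders in viewers that support the element.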
Requirements:

- Python 3.7 or later
- Ollama running and accessible
- Zero pip dependencies (standard library only)

Optional for extended features:

- A SearXNG instance for web search
- An open-terminal sandbox for shell and Python execution

Who it is for:

- Developers running local LLMs who want to pick the right model for a task
- Teams evaluating open-weight models before deployment
- Researchers comparing model behaviour across prompts and tools
- Anyone with Ollama who's tired of copying prompts between terminal tabs

Prompter is part of the local LLM evaluation ecosystem. Compare with:

- [Ollama](https://github.com/ollama/ollama): the runtime Prompter builds on
- [Open WebUI](https://github.com/open-webui/open-webui): browser-based chat
- [promptfoo](https://github.com/promptfoo/promptfoo): CLI prompt testing framework
- [whichllm](https://github.com/Andyyyy64/whichllm): model ranking by benchmarks

Prompter sits between ad-hoc `ollama run` and heavyweight eval frameworks: fast terminal-first comparison with zero setup.

Found a bug or want to add a feature? PRs welcome.

    git clone https://github.com/whonixnetworks/prompter.git
    cd prompter

Source modules are under `src/prompter/`. Build the single-file distribution with:

    python3 scripts/bundle.py

Run tests:

    python3 -m unittest discover tests

See [AGENTS.md](/whonixnetworks/prompter/blob/main/AGENTS.md) for development conventions.

MIT License. See [LICENSE](/whonixnetworks/prompter/blob/main/LICENSE) for details.

Built by whonixnetworks · prompter.whonix.net