Show HN: Fast CPU summarize, eli5, fact-check or translate any text

wpnews.pro

Summarize, explain, fact-check, or translate any text, URL, or file. No GPU. No cloud. One command.

fftext s https://en.wikipedia.org/wiki/Llama.cpp

Three bullet points, streamed to your terminal, generated on your CPU. No API key. No round-trip to anyone's server.

⚡ Fast on CPU. Powered by a quantized 0.8B Qwen3.5 (Q4_K_M GGUF, ~500 MB) running throughllama.cpp

. Streams tokens as they're generated so you see the answer build, not a spinner. No CUDA. No Metal-only tricks. Plain old cores. - 🌐 Files, URLs, or raw strings. Point it at a.txt

, paste an article URL, or just type the text inline. URLs get fetched, run throughreadability-lxml

for main-content extraction, and stripped to clean prose before the model sees them. - 📴 Offline after first run. The model downloads once to your Hugging Face cache and stays there. Your text never leaves your machine (except forcheck

, which needs the web — see below). - 🪶 Lean deps.llama-cpp-python

,requests

,beautifulsoup4

,readability-lxml

,lxml

. That's it. No PyTorch, no LangChain, no cloud SDKs. - 🧠 Four tasks, four prompts, one binary. Summarize, explain like I'm five, fact-check against the live web, or translate into any language or register you can describe. Each task is a separate, focused prompt — not one mega-prompt trying to do everything. - 🗣 Translate into anything you can describe.--lang "Castilian Spanish"

,--lang "casual Japanese"

,--lang "Moroccan Darija"

— whatever string you pass goes straight into the prompt. You drive the register and dialect. - 🔍 Fact-check with citations.fftext check

extracts claims, ranks them, web-searches each one (Mojeek and Startpage, rotated), and labels them SUPPORTED, REFUTED, CONFLICTING, or INSUFFICIENT — with a source URL per claim. CPU-only, no API key, no Google.

pip install .

fftext s notes.txt                                       # summarize a file
fftext e https://en.wikipedia.org/wiki/Photosynthesis    # ELI5 a URL
fftext c "The Eiffel Tower was built in 1822."           # fact-check a string
fftext t --lang "French" "How are you today?"            # translate

First run downloads ~500 MB of model weights. Every run after is offline (except check

, which searches the web).

Subcommand	Alias(es)	What it does
`summarize`
`s`
Three short bullet points. Concrete and specific, no preamble.
`explain`
`e` , `eli5`
Plain-language explanation, 4–6 sentences, like to a curious kid.
`check`
`c`
Extract claims → web-search each → label SUPPORTED / REFUTED / CONFLICTING / INSUFFICIENT.
`translate`
`t`
Translate into any language/register you describe via `--lang` .

Every task accepts the same three input shapes — file, URL, or raw string — resolved in that order.

fftext s notes.txt
fftext s https://example.com/post
fftext s "Paste a long block of text right here on the command line."

fftext e paper.pdf.txt
fftext eli5 https://en.wikipedia.org/wiki/Quantum_entanglement

fftext c article.txt
fftext c "The Roman Empire fell in 476 AD."
fftext c --debug article.txt          # show ranking, queries, snippets, raw verdicts

fftext t hello.txt                                            # defaults to English
fftext t --lang "Castilian Spanish" hello.txt
fftext t --lang "casual Japanese" "How are you today?"
fftext t --lang "polite Brazilian Portuguese" letter.txt
fftext t -l "Moroccan Darija in Latin script" "Where is the train station?"

<input>

for any subcommand is resolved in this order:

Starts with→ fetched withhttp://

orhttps://

requests

, parsed withreadability-lxml

to isolate the main article body, then stripped to plain text with paragraph breaks preserved. Falls back to a light tag-strip if readability can't find an article (common on docs pages and indexes).Looks like an existing file path→ read as UTF-8 (errors replaced).** Anything else**→ treated literally as a string.

Long inputs are head-and-tail clipped to ~10,000 characters (~2,500 tokens) so prompt + generation + chat template fit comfortably in the 4,096-token context. You'll see a [note: input clipped...]

line on stderr when that happens. The clip keeps the start and end of the document, which preserves intros and conclusions — what summaries and explanations care about most.

Streamed to stdout as it's generated. Notes and timing info go to stderr, so you can pipe just the answer:

fftext s long-doc.txt > summary.txt
fftext t --lang French letter.txt | tee letter.fr.txt
- The author argues that small local models are now good enough for routine text tasks.
- Speed gains come from quantization and streaming, not better hardware.
- The main remaining gap is multilingual quality below 7B parameters.
A neural network is like a giant calculator that learns by example. You show it lots
of pictures of cats and dogs, and it slowly figures out which patterns mean "cat" and
which mean "dog." Each time it gets one wrong, it nudges its internal numbers a tiny
bit so it'll do better next time. After millions of nudges, it gets pretty good.

One line per claim, with a verdict label and the top supporting URL:

SUPPORTED     The Eiffel Tower was completed in 1889.  [https://en.wikipedia.org/wiki/Eiffel_Tower]
REFUTED       It was built by Thomas Edison.  [https://www.britannica.com/biography/Gustave-Eiffel]
INSUFFICIENT  It is currently the tallest structure in Paris.  [-]

Run with -v

for timings and --debug

to see ranked claims, generated search queries, raw snippets, and the model's reasoning before each verdict.

The translation, and nothing else. No "Here's the translation:" preamble, no original text echoed back, no transliteration unless the target language genuinely calls for it. Paragraph breaks and markdown formatting are preserved.

Flag	Description
`-v` , `--verbose`
Print timing info to stderr (token rate, per-stage timings on `check` ).
`-d` , `--debug`
`check` only. Dump claims, queries, snippets, verdicts, and dropped reasons.
`-l` , `--lang`
`translate` only. Target language description. Default: English.
`-h` , `--help`
Show usage and exit.

Flags can appear anywhere on the command line. The subcommand has to come first.

One LLM call, streamed. The whole trick is keeping the system prompt short — a 0.8B model gets confused by long instructions and burns tokens echoing them back. Each task has its own tight system prompt (3–4 lines) and a sane max_tokens

cap so the model doesn't ramble. Sampling is temperature=0.3, top_p=0.9, repeat_penalty=1.1

— faithful, not creative.

Per run:

Extract claims. LLM emits a JSON array of factual statements (names, numbers, dates, roles, events). Robust parser tolerates trailing commas, smart quotes, missing brackets, and falls through to a numbered-list scrape if the JSON is hopeless. Deduped against normalized lowercase + whitespace.Rank. LLM picks the top three most fact-checkable claims out of up to twelve. Each surviving claim costs ~4 more LLM calls, so ranking 9→3 saves ~24 calls.Rewrite as keyword queries. One LLM call per claim turns"James Talarico is a Presbyterian seminarian."

into"James Talarico" Presbyterian seminarian

. Real search engines weight rare tokens; sending whole sentences with stopwords tanks recall. Heuristic stopword-strip fallback if the rewrite looks suspicious.Search. Mojeek and Startpage, rotated by claim index, with fallback to the other on empty. Jittered sleeps and a generic desktop UA. Sanitized queries to avoid tripping WAFs on$

, backticks, pipes, etc. Eight-thread pool, ~8s timeout per request.Summarize evidence. LLM compresses each snippet into one sentence about the claim. Irrelevant snippets are dropped here, not at the judge stage.Synthesize. LLM lays out what supports, what contradicts, and what's missing — short and structured.Evaluate. Deterministic shortcuts handle the obvious cases (no support → REFUTED; nothing either way → INSUFFICIENT). Genuinely mixed evidence goes to one more LLM call with<think>

reasoning enabled, picking one of four labels.

Per-claim total: about four LLM calls and one search round-trip. The ranker keeps the bill from exploding on long inputs.

Threads. Detected fromos.cpu_count()

and halved —os.cpu_count()

returns logical cores, and oversubscribing hyperthreads runs slower than just using the physical ones. Override withQWEN_THREADS=N

if you know your physical core count and want to skip the heuristic.Context. Fixed at 4,096 tokens. Per-token generation cost scales withfilledcontext, not the cap, so the cap itself is nearly free — what costs you is filling it via bigger inputs. The 10,000-character clip keeps that under control.Streaming. Matters more on CPU than on GPU. Total latency is what it is, but perceived latency drops a lot when the first token arrives in under a second.C-level log silencing.llama.cpp

prints warnings via Cprintf

that bypass Python'sverbose=False

. fftext installs a null log callback to kill then_ctx_seq < n_ctx_train

nag and friends. Trade-off: real C-level errors get swallowed too, but Python-level exceptions still propagate fine.

The default model is unsloth/Qwen3.5-0.8B-GGUF

(Qwen3.5-0.8B-Q4_K_M.gguf

, ~500 MB), downloaded on first run via huggingface-hub

to your standard HF cache:

macOS / Linux—~/.cache/huggingface/hub/

Windows—%USERPROFILE%\.cache\huggingface\hub\

To use a different GGUF model, edit load_model()

in llm.py

and swap the repo_id

/ filename

. Anything llama.cpp

≥ the bundled version can load (Qwen, Llama, Mistral, Gemma, Phi, etc.) should work, but the prompt templates and stop sequences are tuned for the Qwen3.5 chat format — your mileage on other families will vary.

Before fftext had subcommands it was a small wrapper around llama-cpp-python

for testing. Those modes still work:

python main.py                            # canned demo prompt
python main.py "your prompt here"         # one-shot
python main.py -i                         # interactive chat (Ctrl-C to quit)

Mostly useful for sanity-checking the model load and sampling parameters when you change something.

0.8B is small. It's good enough for the four tasks above, and it's fast enough to actually be useful on a laptop. But it's not GPT-4. Long, complex documents get clipped, and the model occasionally hallucinates on edge-case claims.check

exists precisely because the model can't be trusted as a one-shot oracle — let it propose, let the web dispose.Mojeek and Startpage rotate, with jittered sleeps and a desktop UA, but if both go down or both start serving captchas you'll see empty results andcheck

depends on scraping.INSUFFICIENT

verdicts. Run with--debug

to confirm whether you're being blocked vs. just hitting a thin topic.Translation works best between major languages. A 0.8B model handles English ↔ French, Spanish, German, Italian, Portuguese, and Chinese well; smaller languages and complex register requests degrade more.URL parsing is best-effort.readability-lxml

is strong on articles, weaker on docs pages, listings, and SPAs. The fallback tag-strip catches the rest. If you get garbage out of a particular URL, save the page as text first and pass the file.

Apache-2.0 for this project. The Qwen3.5 model is distributed under its own license — see the model card. Powered by llama.cpp via llama-cpp-python, with URL parsing courtesy of readability-lxml.

source & further reading

github.com — original article

Show HN: Fast CPU summarize, eli5, fact-check or translate any text

Run your AI side-project on zahid.host