Most AI résumé tools have the same flaw: they hallucinate. Ask them to tailor your résumé for a job requiring "Rust experience" and they'll happily invent a Rust project you never worked on. It reads great — until the technical interview.
I wanted the opposite. So I built Citevault: a local-first résumé tailoring tool where every claim is either grounded in your own evidence, or refused and flagged as a gap.
No fabrication. No API keys. Runs entirely on your laptop. (Model weights are pulled from Hugging Face once on first boot; after that, no outbound connections.)
Every bullet in your résumé starts as a claim. Citevault processes each one through a pipeline in which the verifier labels the claim `SUPPORTS`, `PARTIAL`, `UNCLEAR`, or `CONTRADICTS`:

- `SUPPORTS` → the claim is verified and cited
- `PARTIAL` → rewritten to match only what the evidence actually says
- `UNCLEAR` → a rewrite is attempted, and if it still can't be grounded, refused and gap-reported
- `CONTRADICTS` → refused immediately and gap-reported

The result is a résumé where every bullet has a `[^sp-...]` footnote traceable back to a specific span in your source material.
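The verdict-to-action mapping above can be sketched in a few lines of Python. This is a minimal illustration, not Citevault's actual code: `Verdict`, `Outcome`, `route`, and the `rewrite`/`verify` callables are hypothetical names, and the sketch assumes that a rewritten `UNCLEAR` claim only counts as grounded if it re-verifies as `SUPPORTS`.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Verdict(Enum):
    SUPPORTS = "SUPPORTS"
    PARTIAL = "PARTIAL"
    UNCLEAR = "UNCLEAR"
    CONTRADICTS = "CONTRADICTS"


@dataclass
class Outcome:
    text: Optional[str]  # bullet text to emit, or None if the claim is refused
    gap: bool            # True when the claim goes into the gap report


def route(claim: str, verdict: Verdict,
          rewrite: Callable[[str], str],
          verify: Callable[[str], Verdict]) -> Outcome:
    """Map a verifier verdict to one of the four actions described above."""
    if verdict is Verdict.SUPPORTS:
        return Outcome(text=claim, gap=False)
    if verdict is Verdict.PARTIAL:
        # Keep only what the evidence actually says.
        return Outcome(text=rewrite(claim), gap=False)
    if verdict is Verdict.UNCLEAR:
        candidate = rewrite(claim)
        # Assumption: only a SUPPORTS verdict on the rewrite counts as grounded.
        if verify(candidate) is Verdict.SUPPORTS:
            return Outcome(text=candidate, gap=False)
        return Outcome(text=None, gap=True)
    # CONTRADICTS: refuse immediately and report the gap.
    return Outcome(text=None, gap=True)
```

The point of the shape is that there is no code path where an ungrounded claim reaches the output: every branch either returns grounded text or a refusal.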
Toggle "Compare with naive AI" before starting a tailoring run. Citevault runs its grounded pipeline and a second single-pass run — same model, same evidence, same task description, no verification loop. The only difference is the grounded pipeline checks every claim against its source before including it.
The diff is striking: the naive output contains placeholders such as `[Candidate Name]` and invented achievements that never appeared in the evidence.

| Component | Role |
|---|---|
| Gemma 4 E4B (`gemma4:e4b`) via Ollama | Claim drafting, verification, cover letter composition |
| BGE-small-en-v1.5 | Dense embeddings for semantic retrieval |
| BGE cross-encoder | Re-ranking retrieved candidates |
| BM25 + SQLite FTS5 | Keyword retrieval (hybrid RAG) |
| sqlite-vec | Vector store, no external database required |

Gemma 4 E4B was chosen specifically for this role: it is instruction-tuned well enough to return consistent structured JSON verdicts, small enough to run on CPU without a GPU, and open-weight so no API key or data exposure is involved. The `e4b` tag is the Q4_K_M quantised build, the best size/quality tradeoff for local inference via Ollama.
The entire stack runs on CPU. Measured on a 4-core/8-thread laptop with 32 GB RAM and no discrete GPU: 3–8 tokens/second generation speed, 20–30 minutes per tailoring run; add another 10–20 minutes if naive comparison is enabled. Slower than a cloud API, but zero cost, zero data exposure, and no dependency on an upstream service staying alive.
Structured generation is the hard part. Getting Gemma 4 to consistently return structured JSON verdicts from the verifier took more prompt iteration than anything else. The final verifier prompt is tightly constrained: it gives the model a specific rubric, a strict output format, and a worked example. It still occasionally returns malformed output — those claims are logged and omitted from the output rather than silently passed through.
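To make the "log and omit" behaviour concrete, here is a minimal sketch of defensive verdict parsing. It is an illustration under stated assumptions, not the project's real parser: the `verdict` field name, the `parse_verdict` and `keep_verified` helpers, and the logger name are all hypothetical.

```python
import json
import logging
from typing import Optional

logger = logging.getLogger("verifier")

# A tuple rather than a set, so unhashable JSON values (lists, dicts)
# can be compared without raising TypeError.
ALLOWED = ("SUPPORTS", "PARTIAL", "UNCLEAR", "CONTRADICTS")


def parse_verdict(raw: str) -> Optional[dict]:
    """Return the validated verdict object, or None if the output is malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        logger.warning("verifier returned non-JSON output: %r", raw[:80])
        return None
    # Assumption: the verdict label lives under a "verdict" key.
    if not isinstance(data, dict) or data.get("verdict") not in ALLOWED:
        logger.warning("verifier returned an unexpected shape: %r", data)
        return None
    return data


def keep_verified(raw_outputs: dict[str, str]) -> dict[str, dict]:
    """Keep claims whose verifier output parsed cleanly; omit the rest."""
    parsed = {claim: parse_verdict(raw) for claim, raw in raw_outputs.items()}
    return {claim: v for claim, v in parsed.items() if v is not None}
```

Failing closed like this costs a few bullets on a bad run, but it means a parsing error can never turn into an unverified claim in the résumé.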
Hybrid RAG matters. Pure dense search misses exact keyword matches. Pure BM25 misses semantic similarity. On the five-case golden eval set, the hybrid combination improved the first-pass grounding rate by about 15 percentage points over either retrieval strategy alone, enough to tip borderline claims from UNCLEAR to SUPPORTS.
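One common way to merge a dense ranking with a BM25 ranking is reciprocal rank fusion (RRF). The post does not say which fusion method Citevault uses, so treat the following as a generic sketch of the idea, not the project's implementation; the function name, the `k = 60` constant, and the span IDs are illustrative assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into a single ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes more for items it ranks higher.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense = ["span-3", "span-1", "span-7"]  # hypothetical dense-search results
bm25 = ["span-1", "span-9", "span-3"]   # hypothetical BM25 results
fused = reciprocal_rank_fusion([dense, bm25])
print(fused)  # ['span-1', 'span-3', 'span-9', 'span-7']
```

Spans that both retrievers agree on rise to the top, while a span found by only one retriever still survives into the candidate set that the cross-encoder re-ranks.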
Eval-driven development pays off. I built a golden evaluation set of five synthetic candidates and ran the pipeline against it after every significant change. The final first-pass grounding rate is 98.2% — but more importantly, I caught two regressions that looked fine in manual testing.
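A regression check of this kind can be very small. The sketch below assumes that "first-pass grounding rate" means the share of claims whose first verdict is `SUPPORTS`, which is my reading and not a definition taken from the repo; the function names, the tolerance, and the example values are hypothetical.

```python
def grounding_rate(first_pass_verdicts: list[str]) -> float:
    """Share of claims whose first-pass verdict was SUPPORTS (assumed definition)."""
    if not first_pass_verdicts:
        raise ValueError("no verdicts to score")
    grounded = sum(v == "SUPPORTS" for v in first_pass_verdicts)
    return grounded / len(first_pass_verdicts)


def find_regressions(baseline: dict[str, float],
                     current: dict[str, float],
                     tolerance: float = 0.01) -> list[str]:
    """Return eval cases whose rate dropped by more than `tolerance`."""
    return sorted(
        case for case, base in baseline.items()
        # A case missing from the current run counts as a regression.
        if current.get(case, 0.0) < base - tolerance
    )


# Hypothetical numbers, for illustration only.
baseline = {"candidate-a": 0.90, "candidate-b": 1.00}
current = {"candidate-a": 0.95, "candidate-b": 0.80}
print(find_regressions(baseline, current))  # ['candidate-b']
```

Comparing per case rather than on the aggregate is what surfaces regressions that an overall average, or a quick manual check, would hide.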
Local-first is a real constraint, not a marketing line. Your career data is sensitive. Résumés contain salary history, reasons for leaving, private project details. I didn't want to be a data controller. Building local-first forced specific architectural decisions — no cloud storage, no async job queue, no third-party embedding API.
docker compose up -d ollama
docker compose exec ollama ollama pull gemma4:e4b
docker compose up -d
Upload your evidence, paste a job posting, and watch the grounding happen in real time via SSE stream.
Heads up: this runs on CPU. On a 4-core laptop without a GPU, expect 20–30 minutes per tailoring run. With naive comparison enabled, add another 10–20 minutes for the second pass. It is slow by cloud-API standards, but fully offline and costs nothing after the first model pull.
The best test: pick a role where you have a genuine skill gap — that is where the gap report is most useful.
The full architecture (hexagonal layout, RAG pipeline, Docker Compose stack) is documented in docs/architecture.md in the repo.
The code is on GitHub: **github.com/jaberoma/citevault**. It is MIT licensed, requires no account, and runs on any laptop with Docker.
Citevault's contract is simple: every claim in your résumé either links to a source span in your own evidence, or it does not appear. No exceptions.