Skills vs. MCP vs. prompts: which agent setup works best? A new benchmark from the Agent Voyager Project found that a simple "double-check your work" instruction added to an AI agent's prompt boosted its accuracy from 82% to 95% on a PDF-to-webpage conversion task. The winning setup, which used a four-step plan plus a self-verification step, achieved perfect pass rates on all 10 test pages at a cost of $0.33 per run. The test compared prompt-only, skills-based, and MCP-based agent configurations using the same Claude Haiku model, isolating the impact of setup design on performance. View full AVP JSON. , Welcome to the Captain's Log, where we break down the voyages our agents undertook each week. For this inaugural run, we set out to test how different agent setups skills vs MCPs vs prompts compare. The task: read a PDF page and rebuild it as a webpage from ParseBench https://huggingface.co/datasets/llamaindex/ParseBench , a public LlamaIndex benchmark . Every setup uses the same model claude-haiku-4-5 , so any differences come from how the agent is set up, not the model itself. How we score it: each page gets a structural-fidelity score from 0 to 100% column headers, row count, cell content, merged-cell topology, compared against the reference HTML . We report two numbers per run: accuracy , the average of those per-page scores, and pass rate , the share of pages that cleared the “good-enough” threshold. A run can have a high pass rate but a middling accuracy if it gets most pages over the line with rough answers, and vice versa. All of this is made possible by the Agent Voyager Project AVP /set-sail , a free, open, and platform-agnostic standard that records every step an agent takes in a format anyone can read. step-by-step 95% accuracy · $0.33 /run · 2.5 turns · varied prompt + self-check Our winner used a clear four-step plan download the PDF, read it, rebuild as HTML, hand back , plus one extra line at the end of the prompt that no other setup had: double-check your work. This alone boosted average structural fidelity to 95% and got all 10 pages over the pass threshold. Each agent voyage maps out as a constellation in the sky. This one sailed under clear skies and calm seas . ␃WPNCODE0␃ Step-by-step cleared the bar on all 10 of 10 pages and averaged 95% structural fidelity, the highest on the board, for $0.33 a page. The double-check actually paid for itself because the agent didn't need any retries, so this setup ended up cheaper than the runs that skipped it, all from one extra line in the prompt. The prompt was a four-step plan download, read, rebuild as HTML, hand back plus one extra sentence: re-read the original and verify before submitting. Your job: render one PDF page as semantic HTML, optimized for table reconstruction. Workflow: 1. Download the PDF locally with curl silent, fail-on-error : curl -fsSL -o /tmp/page.pdf '