{"slug": "skills-vs-mcp-vs-prompts-which-agent-setup-works-best", "title": "Skills vs. MCP vs. prompts: which agent setup works best?", "summary": "A new benchmark from the Agent Voyager Project found that a simple \"double-check your work\" instruction added to an AI agent's prompt boosted its accuracy from 82% to 95% on a PDF-to-webpage conversion task. The winning setup, which used a four-step plan plus a self-verification step, cleared the pass threshold on all 10 test pages at a cost of $0.33 per run. The test compared prompt-only, skills-based, and MCP-based agent configurations using the same Claude Haiku model, isolating the impact of setup design on performance.", "body_md": "Welcome to the Captain's Log, where we break down the voyages our agents undertook each week. For this inaugural run, we set out to test how different agent setups (skills vs MCPs vs prompts) compare.\n\n**The task:** read a PDF page and rebuild it as a webpage (from [ParseBench](https://huggingface.co/datasets/llamaindex/ParseBench), a public LlamaIndex benchmark). Every setup uses the same model (`claude-haiku-4-5`), so any differences come from how the agent is set up, not the model itself.\n\n**How we score it:** each page gets a structural-fidelity score from 0 to 100% (column headers, row count, cell content, and merged-cell topology, each compared against the reference HTML). We report two numbers per run: *accuracy*, the average of those per-page scores, and *pass rate*, the share of pages that cleared the “good-enough” threshold. 
A run can have a high pass rate but a middling accuracy if it gets most pages over the line with rough answers, and vice versa.\n\nAll of this is made possible by the [Agent Voyager Project (AVP)](/set-sail), a free, open, and platform-agnostic standard that records every step an agent takes in a format anyone can read.\n\n`step-by-step`\n\n**95%** accuracy · **$0.33**/run · **2.5** turns · varied *prompt + self-check*\n\nOur winner used a clear four-step plan (download the PDF, read it, rebuild as HTML, hand back), plus one extra line at the end of the prompt that no other setup had: double-check your work. This alone boosted average structural fidelity to 95% and got all 10 pages over the pass threshold.\n\nEach agent voyage maps out as a constellation in the sky. This one sailed under **clear skies and calm seas**.\n\nStep-by-step cleared the bar on all **10 of 10** pages and averaged **95%** structural fidelity, the highest on the board, for **$0.33** a page. The double-check paid for itself because the agent didn't need any retries, so this setup ended up cheaper than baseline ($0.35 a page), all from one extra line in the prompt.\n\n**The prompt** was a four-step plan (download, read, rebuild as HTML, hand back) plus one extra sentence: re-read the original and verify before submitting.\n\n```\nYour job: render one PDF page as semantic HTML, optimized for table reconstruction.\n\nWorkflow:\n  1. Download the PDF locally with curl (silent, fail-on-error):\n     curl -fsSL -o /tmp/page.pdf '<url>'\n  2. Inspect the PDF with the Read tool. Claude Code's Read tool handles PDF natively, pass the file path directly.\n  3. Identify the page's primary content (table / chart / form / paragraph). For this benchmark, expect category=table.\n  4. 
Translate to HTML:\n       - Tables: <table><thead><tr><th>…</th></tr></thead>\n                 <tbody><tr><td>…</td></tr></tbody></table>\n                 Honor merged cells with colspan/rowspan.\n       - Multi-line cell text: separate with <br>.\n       - Preserve verbatim text content; do not paraphrase.\n  5. Self-check: is every column header in your output a column header in the source? Every row in the source a row in your output? Re-read the PDF if unsure.\n  6. Reply with the final HTML alone. No preamble, no fences, no trailing commentary.\n```\n\n`baseline`\n\n**82%** accuracy · **$0.35**/run · **2.7** turns · varied *prompt only*\n\nFor the control run, the agent got a fairly straightforward prompt asking it to download the page, read it, rebuild it as HTML, and hand it back, without any examples or extra tools to work with.\n\nA **fair-skies, light-breeze** voyage.\n\nBaseline cleared the bar on **9 of 10** pages and averaged **82%** structural fidelity, for **$0.35** a page.\n\n**The prompt** was four steps: download the PDF, read it, rebuild as HTML, hand back only the HTML.\n\n```\nYou are parsing one page of a PDF document into HTML markdown for a benchmark. Follow these steps exactly:\n\n1. Run: curl -fsSL -o /tmp/page.pdf '<url>'\n2. Use your Read tool to open /tmp/page.pdf.\n3. Reproduce the page's structure as HTML. For tables, produce a single <table>...</table> element with <tr>, <th>, <td>, and colspan/rowspan as needed.\n4. Output ONLY the final HTML.\n```\n\n`with-anthropic-pdf-skill`\n\n**81%** accuracy · **$0.31**/run · **2.8** turns · varied *+ anthropic-pdf skill*\n\nSame instructions as baseline, plus one of Anthropic's bundled *Skills* handed over at startup. A Skill is a topic cheat sheet (in this case, for handling PDFs) that the agent reads before starting. 
It's Anthropic's recommended way to teach an agent about a subject without standing up a separate server.\n\nAnother **fair-skies, light-breeze** run.\n\nThe agent with the PDF Skill cleared **8 of 10** pages and averaged **81%** fidelity, for **$0.31** a page. On accuracy that's effectively a tie with baseline (82%); on pass rate it's one page worse, which at n=10 is inside the noise. Net: the Skill didn't help, but it didn't hurt either, which is worth noting given how often Skills get pitched as the fix for a struggling prompt.\n\n**The prompt** was exactly the baseline prompt. The only change is the cheat sheet alongside it.\n\n```\nYou are parsing one page of a PDF document into HTML markdown for a benchmark. Follow these steps exactly:\n\n1. Run: curl -fsSL -o /tmp/page.pdf '<url>'\n2. Use your Read tool to open /tmp/page.pdf.\n3. Reproduce the page's structure as HTML. For tables, produce a single <table>...</table> element with <tr>, <th>, <td>, and colspan/rowspan as needed.\n4. Output ONLY the final HTML.\n```\n\n`with-pdfplumber-mcp`\n\n**70%** accuracy · **$0.20**/run · **2.2** turns · varied *+ pdf-tools MCP*\n\nA one-sentence prompt (*“parse the PDF at <url> into HTML and output the HTML only”*) paired with a specialty external tool for pulling tables out of PDFs (a server called `pdf-tools`, built on the `pdfplumber` Python library). The idea: skip the step-by-step instructions and let a dedicated tool do the work.\n\n**Choppy seas, overcast skies** on this one.\n\nThe agent with the MCP server was the cheapest run on the board at **$0.20** per page. **9 of 10** pages squeaked past the threshold (same pass rate as baseline), but average fidelity dropped to **70%** because the specialty tool stumbles on complex layouts and the one-sentence prompt didn't give the agent enough guidance to catch the misses. Most pages got through, but the answers were noticeably sloppier than baseline.\n\n**The prompt** was a single sentence. 
No workflow, no examples. The external tool is doing the actual work.\n\n```\nParse the PDF at <url> into HTML and output the HTML only.\n```\n\n`terse-prompt`\n\n**68%** accuracy · **$0.59**/run · **3.4** turns · varied *prompt: one sentence*\n\nOne sentence, nothing else: *“parse the PDF at <url> into HTML and output the HTML only.”* The agent had to figure out the rest on its own.\n\nAnd figure it out it tried, in **heavy seas and stormy skies**.\n\nTerse-prompt cleared **7 of 10** pages and averaged **68%** fidelity, at **$0.59** a page, about 1.7 times baseline's $0.35. Without real instructions the agent wandered, retried, and second-guessed itself, which is what improvising tends to cost you.\n\n**The prompt** was one sentence. No steps, no structure, no examples.\n\n```\nParse the PDF at <url> into HTML and output the HTML only.\n```\n\n`few-shot`\n\n**67%** accuracy · **$0.82**/run · **3.4** turns · varied *+ worked example*\n\nBaseline instructions plus a worked example: a 2-row, 3-column “Revenue by region” table with the exact expected answer. The textbook prompt move: show, don't just tell.\n\nThis one **capsized in a heavy storm**; the voyage didn't make it back to port.\n\nFew-shot backfired badly. **8 of 10** pages technically cleared the threshold, but average fidelity landed at **67%** (the worst on the board) and each page cost **$0.82** (the most expensive run by far), because the agent latched onto the example's shape and got thrown by pages that didn't match it. 
When an example is too specific, it stops being a hint and turns into a cage.\n\n**The prompt** was baseline instructions plus one fully worked-out table example.\n\n```\nParse the PDF at <url> (category=table) into HTML.\n\nExample input (a 2-row, 3-column table titled \"Revenue by region\"):\n\n    Region    | Q1   | Q2\n    ----------|------|------\n    Americas  | $1.2M| $1.5M\n    EMEA      | $0.9M| $1.1M\n\nExample expected output (and nothing else):\n\n    <table>\n      <thead><tr><th>Region</th><th>Q1</th><th>Q2</th></tr></thead>\n      <tbody>\n        <tr><td>Americas</td><td>$1.2M</td><td>$1.5M</td></tr>\n        <tr><td>EMEA</td><td>$0.9M</td><td>$1.1M</td></tr>\n      </tbody>\n    </table>\n\nNow do the same for the PDF above. Output the HTML only.\n```\n\nThe setup that won didn't add anything fancy; it just added one sentence asking the agent to look at the original again before submitting, and that was enough to top the board. Because the agent stopped shipping bad answers, it also ended up costing less than most of the runs that skipped the double-check.\n\nThe Skill and the specialty external tool (the two upgrades most often pitched as the fix for a struggling agent) didn't really help. The Skill tied the no-Skill version, and the external tool made runs cheaper but noticeably worse, so neither was the silver bullet on this task.\n\nThe two clever prompt tweaks both lost ground: improvising cost about 1.7 times as much as baseline for worse answers, and the worked example boxed the agent in by giving it a shape it tried to copy onto pages that didn't match.\n\nThe takeaway is that the cheap, boring move (asking the agent to double-check its own work) beat every fancier setup we tried, which is worth keeping in mind before reaching for a new tool.\n\nOne honest caveat: this is 10 pages per setup, one snapshot, one task. A one- or two-page swing on either axis is inside the noise band, so read these as directional signals, not settled findings. 
The pattern we'd bet on across more runs is the shape of the board (self-check at the top, worked-example at the bottom), not the exact gaps between adjacent rows.", "url": "https://wpnews.pro/news/skills-vs-mcp-vs-prompts-which-agent-setup-works-best", "canonical_source": "https://www.agentvoyagerproject.com/captains-log/1", "published_at": "2026-05-26 14:29:08+00:00", "updated_at": "2026-05-26 14:38:43.337484+00:00", "lang": "en", "topics": ["ai-agents", "large-language-models", "ai-tools", "ai-research", "ai-products"], "entities": ["Claude Haiku", "LlamaIndex", "ParseBench", "Agent Voyager Project", "AVP"], "alternates": {"html": "https://wpnews.pro/news/skills-vs-mcp-vs-prompts-which-agent-setup-works-best", "markdown": "https://wpnews.pro/news/skills-vs-mcp-vs-prompts-which-agent-setup-works-best.md", "text": "https://wpnews.pro/news/skills-vs-mcp-vs-prompts-which-agent-setup-works-best.txt", "jsonld": "https://wpnews.pro/news/skills-vs-mcp-vs-prompts-which-agent-setup-works-best.jsonld"}}