Skills vs. MCP vs. prompts: which agent setup works best?

A new benchmark from the Agent Voyager Project found that adding a simple "double-check your work" instruction to an AI agent's prompt raised its accuracy from 82% to 95% on a PDF-to-webpage conversion task. The winning setup, a four-step plan plus a self-verification step, cleared the pass threshold on all 10 test pages at a cost of $0.33 per run. The test compared prompt-only, skills-based, and MCP-based agent configurations using the same Claude Haiku model, isolating the impact of setup design on performance.

Welcome to the Captain's Log, where we break down the voyages our agents undertook each week. For this inaugural run, we set out to test how different agent setups (skills vs. MCPs vs. prompts) compare. The task: read a PDF page from ParseBench (https://huggingface.co/datasets/llamaindex/ParseBench), a public LlamaIndex benchmark, and rebuild it as a webpage. Every setup uses the same model (claude-haiku-4-5), so any differences come from how the agent is set up, not the model itself.

How we score it: each page gets a structural-fidelity score from 0 to 100% (column headers, row count, cell content, merged-cell topology, compared against the reference HTML). We report two numbers per run: accuracy, the average of those per-page scores, and pass rate, the share of pages that cleared the “good-enough” threshold. A run can have a high pass rate but a middling accuracy if it gets most pages over the line with rough answers, and vice versa.

All of this is made possible by the Agent Voyager Project (AVP), a free, open, and platform-agnostic standard that records every step an agent takes in a format anyone can read.

step-by-step
95% accuracy · $0.33/run · 2.5 turns · varied: prompt + self-check

Our winner used a clear four-step plan (download the PDF, read it, rebuild as HTML, hand back), plus one extra line at the end of the prompt that no other setup had: double-check your work. This alone boosted average structural fidelity to 95% and got all 10 pages over the pass threshold.

Each agent voyage maps out as a constellation in the sky. This one sailed under clear skies and calm seas.

Step-by-step cleared the bar on all 10 of 10 pages and averaged 95% structural fidelity, the highest on the board, for $0.33 a page. The double-check actually paid for itself because the agent didn't need any retries, so this setup ended up cheaper than the baseline run that skipped it ($0.35), all from one extra line in the prompt.
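As an aside on the two numbers reported for every run, here is a minimal sketch of how accuracy and pass rate can diverge. The threshold value and the per-page scores are made up for illustration; the article does not state the actual cutoff, and `summarize` is a hypothetical helper, not part of the benchmark.

```python
from statistics import mean

# Hypothetical pass threshold: the article only calls it "good-enough"
# and does not give the real cutoff.
PASS_THRESHOLD = 0.60


def summarize(page_scores, threshold=PASS_THRESHOLD):
    """Return (accuracy, pass_rate) for one run's per-page fidelity scores."""
    accuracy = mean(page_scores)  # average of the per-page scores
    passed = sum(score >= threshold for score in page_scores)
    return accuracy, passed / len(page_scores)


# Illustrative scores only, not benchmark data.
rough_but_passing = [0.65] * 10            # every page barely clears the bar
sharp_but_uneven = [1.0] * 7 + [0.3] * 3   # seven perfect pages, three misses

print(summarize(rough_but_passing))
print(summarize(sharp_but_uneven))
```

With these invented inputs, the first run passes every page while averaging lower than the second, which passes fewer pages. That is the "high pass rate, middling accuracy" case, and its reverse.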

The prompt was a four-step plan (download, read, rebuild as HTML, hand back) plus one extra sentence: re-read the original and verify before submitting.

    Your job: render one PDF page as semantic HTML, optimized for table reconstruction.

    Workflow:
    1. Download the PDF locally with curl (silent, fail-on-error):
       curl -fsSL -o /tmp/page.pdf '<url>'
    2. Inspect the PDF with the Read tool. Claude Code's Read tool handles PDF natively; pass the file path directly.
    3. Identify the page's primary content (table / chart / form / paragraph). For this benchmark, expect category=table.
    4. Translate to HTML:
       - Tables: <table><thead><tr><th>…</th></tr></thead><tbody><tr><td>…</td></tr></tbody></table>
         Honor merged cells with colspan/rowspan.
       - Multi-line cell text: separate with <br>.
       - Preserve verbatim text content; do not paraphrase.
    5. Self-check: is every column header in your output a column header in the source? Every row in the source a row in your output? Re-read the PDF if unsure.
    6. Reply with the final HTML alone. No preamble, no fences, no trailing commentary.

baseline
82% accuracy · $0.35/run · 2.7 turns · varied: prompt only

For the control run, the agent got a fairly straightforward prompt asking it to download the page, read it, rebuild it as HTML, and hand it back, without any examples or extra tools to work with.

A fair-skies, light-breeze voyage. Baseline cleared the bar on 9 of 10 pages and averaged 82% structural fidelity, for $0.35 a page.

The prompt was four steps: download the PDF, read it, rebuild as HTML, hand back only the HTML.

    You are parsing one page of a PDF document into HTML markdown for a benchmark. Follow these steps exactly:
    1. Run: curl -fsSL -o /tmp/page.pdf '<url>'
    2. Use your Read tool to open /tmp/page.pdf.
    3. Reproduce the page's structure as HTML. For tables, produce a single <table>...</table> element with <tr>, <th>, <td>, and colspan/rowspan as needed.
    4. Output ONLY the final HTML.
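
The self-check in the step-by-step prompt asks two structural questions: are all output column headers present in the source, and are all source rows present in the output? When a reference is available, the same questions can be asked mechanically. The following is a minimal sketch using Python's standard `html.parser`; `TableShape`, `table_shape`, and `self_check` are hypothetical names, and this is not the benchmark's scorer or anything the agent actually ran.

```python
from html.parser import HTMLParser


class TableShape(HTMLParser):
    """Collect <th> text and count rows containing at least one <td>.

    Simplification: every <th> is treated as a column header, and
    nested tables are not distinguished from the outer one.
    """

    def __init__(self):
        super().__init__()
        self.headers = []        # text of each <th>, in document order
        self.row_count = 0       # rows that contain at least one <td>
        self._in_th = False
        self._row_has_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row_has_td = False
        elif tag == "th":
            self._in_th = True
            self.headers.append("")
        elif tag == "td" and not self._row_has_td:
            self._row_has_td = True
            self.row_count += 1

    def handle_endtag(self, tag):
        if tag == "th":
            self._in_th = False

    def handle_data(self, data):
        if self._in_th:
            self.headers[-1] += data


def table_shape(html):
    """Return (header_texts, body_row_count) for an HTML table string."""
    parser = TableShape()
    parser.feed(html)
    parser.close()
    return [h.strip() for h in parser.headers], parser.row_count


def self_check(output_html, source_headers, source_rows):
    """Return a list of problems; an empty list means the check passed."""
    headers, rows = table_shape(output_html)
    problems = []
    extra = [h for h in headers if h not in source_headers]
    if extra:
        problems.append(f"headers not in source: {extra}")
    if rows < source_rows:
        problems.append(f"missing rows: expected at least {source_rows}, got {rows}")
    return problems


SAMPLE_HTML = (
    "<table><thead><tr><th>Region</th><th>Q1</th><th>Q2</th></tr></thead>"
    "<tbody><tr><td>Americas</td><td>$1.2M</td><td>$1.5M</td></tr>"
    "<tr><td>EMEA</td><td>$0.9M</td><td>$1.1M</td></tr></tbody></table>"
)
print(table_shape(SAMPLE_HTML))
print(self_check(SAMPLE_HTML, ["Region", "Q1", "Q2"], 2))
```

The sample table mirrors the small "Revenue by region" example quoted later in this post. A check like this covers only headers and row count; cell content and merged-cell topology, which the scoring also considers, are left out of the sketch.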

with-anthropic-pdf-skill
81% accuracy · $0.31/run · 2.8 turns · varied: + anthropic-pdf skill

Same instructions as baseline, plus one of Anthropic's bundled Skills handed over at startup. A Skill is a topic cheat sheet (in this case, for handling PDFs) that the agent reads before starting. It's Anthropic's recommended way to teach an agent about a subject without standing up a separate server.

Another fair-skies, light-breeze run. The agent with the PDF Skill cleared 8 of 10 pages and averaged 81% fidelity, for $0.31 a page. On accuracy that's effectively a tie with baseline (82%); on pass rate it's one page worse, which at n=10 is inside the noise. Net: the Skill didn't help, but it didn't hurt either, which is worth noting given how often Skills get pitched as the fix for a struggling prompt.

The prompt was exactly the baseline prompt. The only change is the cheat sheet alongside it.

    You are parsing one page of a PDF document into HTML markdown for a benchmark. Follow these steps exactly:
    1. Run: curl -fsSL -o /tmp/page.pdf '<url>'
    2. Use your Read tool to open /tmp/page.pdf.
    3. Reproduce the page's structure as HTML. For tables, produce a single <table>...</table> element with <tr>, <th>, <td>, and colspan/rowspan as needed.
    4. Output ONLY the final HTML.

with-pdfplumber-mcp
70% accuracy · $0.20/run · 2.2 turns · varied: + pdf-tools MCP

A one-sentence prompt (“parse the PDF at <url> into HTML and output the HTML only”) paired with a specialty external tool for pulling tables out of PDFs (a server called pdf-tools, built on the pdfplumber Python library). The idea: skip the step-by-step instructions and let a dedicated tool do the work.

Choppy seas, overcast skies on this one. The agent with the MCP server was the cheapest run on the board at $0.20 per page.
Nine of 10 pages squeaked past the threshold (the same pass rate as baseline), but average fidelity dropped to 70% because the specialty tool stumbles on complex layouts and the one-sentence prompt didn't give the agent enough guidance to catch the misses. Most pages got through, but the answers were noticeably sloppier than baseline.

The prompt was a single sentence. No workflow, no examples. The external tool is doing the actual work.

    Parse the PDF at <url> into HTML and output the HTML only.

terse-prompt
68% accuracy · $0.59/run · 3.4 turns · varied: prompt (one sentence)

One sentence, nothing else: “parse the PDF at <url> into HTML and output the HTML only.” The agent had to figure out the rest on its own.

And figure it out it tried, in heavy seas and stormy skies. Terse-prompt cleared 7 of 10 pages and averaged 68% fidelity, at $0.59 a page, roughly 1.7 times baseline's $0.35. Without real instructions the agent wandered, retried, and second-guessed itself, which is what improvising tends to cost you.

The prompt was one sentence. No steps, no structure, no examples.

    Parse the PDF at <url> into HTML and output the HTML only.

few-shot
67% accuracy · $0.82/run · 3.4 turns · varied: + worked example

Baseline instructions plus a worked example: a 2-row, 3-column “Revenue by region” table with the exact expected answer. The textbook prompt move: show, don't just tell.

This one capsized in a heavy storm; the voyage didn't make it back to port. Few-shot backfired badly. 8 of 10 pages technically cleared the threshold, but average fidelity landed at 67% (the worst on the board) and each page cost $0.82 (the most expensive run by far), because the agent latched onto the example's shape and got thrown by pages that didn't match it. When an example is too specific it stops being a hint and turns into a cage.

The prompt was baseline instructions plus one fully worked-out table example.

    Parse the PDF at <url> (category=table) into HTML.

    Example input (a 2-row, 3-column table titled "Revenue by region"):

    Region    | Q1   | Q2
    ----------|------|------
    Americas  | $1.2M| $1.5M
    EMEA      | $0.9M| $1.1M

    Example expected output (and nothing else):

    <table>
    <thead><tr><th>Region</th><th>Q1</th><th>Q2</th></tr></thead>
    <tbody>
    <tr><td>Americas</td><td>$1.2M</td><td>$1.5M</td></tr>
    <tr><td>EMEA</td><td>$0.9M</td><td>$1.1M</td></tr>
    </tbody>
    </table>

    Now do the same for the PDF above. Output the HTML only.

The setup that won didn't add anything fancy; it just added one sentence asking the agent to look at the original again before submitting, and that was enough to top the board. Because the agent stopped shipping bad answers, it also ended up costing less than most of the runs that skipped the double-check.

The Skill and the specialty external tool, the two upgrades most often pitched as the fix for a struggling agent, didn't really help. The Skill tied the no-Skill version, and the external tool made runs cheaper but noticeably worse, so neither was the silver bullet on this task.

The two clever prompt tweaks both lost ground: improvising cost roughly 1.7 times as much as baseline for worse answers, and the worked example boxed the agent in by giving it a shape it tried to copy onto pages that didn't match.

The takeaway is that the cheap, boring move (asking the agent to double-check its own work) beat every fancier setup we tried, which is worth keeping in mind before reaching for a new tool.

One honest caveat: this is 10 pages per setup, one snapshot, one task. A one- or two-page swing on either axis is inside the noise band, so read these as directional signals, not settled findings. The pattern we'd bet on across more runs is the shape of the board (self-check at the top, worked-example at the bottom), not the exact gaps between adjacent rows.
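
For readers who want the whole board in one place, the sketch below collects the figures reported above and prints them sorted by accuracy, with each run's cost expressed as a multiple of baseline. The numbers are the ones quoted in this post; the `BOARD` layout and the `cost_vs_baseline` helper are illustrative choices, not part of the AVP format.

```python
# (setup, accuracy %, pages passed out of 10, cost per run in USD),
# copied from the per-setup results reported in this post.
BOARD = [
    ("step-by-step", 95, 10, 0.33),
    ("baseline", 82, 9, 0.35),
    ("with-anthropic-pdf-skill", 81, 8, 0.31),
    ("with-pdfplumber-mcp", 70, 9, 0.20),
    ("terse-prompt", 68, 7, 0.59),
    ("few-shot", 67, 8, 0.82),
]

BASELINE_COST = next(cost for name, _, _, cost in BOARD if name == "baseline")


def cost_vs_baseline(cost):
    """Cost of a run as a multiple of the baseline run's cost."""
    return cost / BASELINE_COST


# Highest accuracy first.
ranked = sorted(BOARD, key=lambda row: row[1], reverse=True)

for name, accuracy, passed, cost in ranked:
    print(
        f"{name:<26} {accuracy:>3}%  {passed:>2}/10  "
        f"${cost:.2f}  {cost_vs_baseline(cost):.2f}x baseline"
    )
```

Given the 10-page sample, the ordering at the extremes is the part worth reading from this output; the small gaps between neighbouring rows are within the noise described above.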