A web scraping and content extraction toolkit for TypeScript/Bun.
Pass a URL, get clean markdown. That's the core. But LinkLoom also handles the cases that break simple scrapers: JavaScript-heavy pages rendered through a stealth browser, PDFs parsed into structured text, iframes pulled from nested frames, HTML tables converted to markdown tables, links extracted and classified. It exposes a library API, a CLI, and an MCP server, so you can use it from code, from the terminal, or from an AI client like Claude Desktop or Cursor.
The full list: URL-to-markdown conversion, HTML-to-markdown via Readability + Turndown, PDF-to-markdown via pdf.js, headless browser rendering through Camoufox (stealth Firefox on Playwright), iframe extraction with configurable wait strategies, link extraction and classification, table scraping, text embeddings via OpenAI or Gemini, a CLI for every feature, and an MCP server for AI tool-use workflows.
Built with Bun, Camoufox, JSDOM, Readability, Turndown, and pdf.js-extract. Optional embedding support through LangChain.
import { convertLinkToMarkdown } from "linkloom";
const markdown = await convertLinkToMarkdown("https://example.com");
That's it. One import, one call. The function auto-detects whether the URL points to an HTML page or a PDF and routes it to the right converter. You get back a string of clean markdown: no boilerplate, no configuration objects, no setup ceremony.
The CLI equivalent:
bunx @boris.barac/linkloom scrape https://example.com
Same result, different interface. Pipe it, redirect it, or pass -o output.md to write to a file.
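How the auto-detection works is not spelled out here. One plausible approach is to trust the response's Content-Type header when there is one and fall back to the URL's file extension otherwise. The sketch below illustrates that general technique only; detectKind is a hypothetical name, not part of LinkLoom's API, and the real routing logic may differ.

```typescript
// Hypothetical helper, not LinkLoom's implementation.
type ContentKind = "pdf" | "html";

function detectKind(url: string, contentType?: string | null): ContentKind {
  // Prefer the server-declared media type when one is available.
  const mediaType = (contentType ?? "").trim().toLowerCase();
  if (mediaType.startsWith("application/pdf")) return "pdf";
  if (mediaType.startsWith("text/html")) return "html";

  // Otherwise fall back to the path extension, ignoring query string and hash.
  const path = new URL(url).pathname.toLowerCase();
  return path.endsWith(".pdf") ? "pdf" : "html";
}

console.log(detectKind("https://example.com/paper.pdf?download=1")); // pdf
console.log(detectKind("https://example.com/report", "application/pdf")); // pdf
console.log(detectKind("https://example.com/")); // html
```

Checking the header first matters because many PDF download URLs carry no .pdf extension at all.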
But plenty of pages don't hand you their content on the first request. They render everything with JavaScript: SPAs, dashboards, dynamically loaded articles. A simple fetch returns an empty shell. LinkLoom handles this through headless browser rendering via Camoufox, a stealth Firefox build on Playwright that avoids bot detection.
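import { renderers } from "linkloom";

const url = "https://example.com";
// Start the headless browser.
const browser = await renderers.puppeterRendered.initialize();
const result = await renderers.puppeterRendered.renderPage(browser, url, {
  timeout: 15000,
  waitUntil: "networkidle", // wait for the network to settle
  viewport: { width: 1920, height: 1080 },
  frames: { enabled: true, timeout: 5000 }, // iframes get their own timeout
});
// Always close the browser when you are done with it.
await browser.close();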
The renderPage function loads the URL in a real browser, waits for the network to settle (or for a specific event), and returns the rendered HTML. The frames option tells it to also extract content from nested iframes, with its own timeout, because iframes load on their own schedule and you don't want one slow frame to block everything.
The CLI version:
bunx @boris.barac/linkloom render https://example.com --wait-until networkidle --timeout 15000
Add --selector "table.stats" to extract only a specific element instead of the full page. Useful when you know exactly what you're after.
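The separate iframe timeout is worth a closer look. A common way to keep one slow frame from blocking the rest is to race each frame's load against its own timer. The sketch below shows that pattern with simulated frames; withTimeout, loadFrame, and collectFrames are hypothetical names for illustration, and nothing here is taken from LinkLoom's internals.

```typescript
// Resolves with the task's value, or with `fallback` if the task has not
// settled within `ms` milliseconds.
function withTimeout<T>(task: Promise<T>, ms: number, fallback: T): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<T>((resolve) => {
    timer = setTimeout(() => resolve(fallback), ms);
  });
  // Clear the timer once either side wins so it does not linger.
  return Promise.race([task, timeout]).finally(() => clearTimeout(timer));
}

// Stand-in for reading one iframe's HTML; `delayMs` simulates its load time.
function loadFrame(name: string, delayMs: number): Promise<string> {
  return new Promise((resolve) => {
    setTimeout(() => resolve(`<p>${name}</p>`), delayMs);
  });
}

async function collectFrames(): Promise<string[]> {
  const frames = [loadFrame("fast", 10), loadFrame("slow", 500)];
  // Each frame gets its own 100 ms budget; a slow one yields "" instead of stalling.
  return Promise.all(frames.map((frame) => withTimeout(frame, 100, "")));
}

collectFrames().then((frames) => console.log(frames));
```

The fast frame's content comes back intact, and the slow frame is replaced by the empty fallback after roughly 100 ms rather than holding up the whole result.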
Then there are PDFs. Research papers, technical reports, product documentation: a surprising amount of the web's useful content lives in PDFs, not HTML pages. The same convertLinkToMarkdown call handles both, but you can also convert PDFs directly:
import { pdfConverter } from "linkloom";
import { readFile } from "node:fs/promises";
const buffer = await readFile("document.pdf");
const markdown = await pdfConverter.convertPdfToMarkdown(buffer);
const text = await pdfConverter.convertPdfToText(buffer);
Two output modes: convertPdfToMarkdown preserves structure (headings, lists, formatting), while convertPdfToText strips everything down to plain text. Pick whichever fits your pipeline.
The CLI:
bunx @boris.barac/linkloom pdf document.pdf -o output.md
Under the hood it uses pdf.js-extract to parse the binary, so there's no external dependency on system tools like pdftotext. It works out of the box.
Content conversion is half the job. The other half is pulling structured data out of pages: links, tables, the things that aren't prose.
Link extraction finds and classifies URLs from plain text or HTML. Feed it a string and it returns every link, tagged as a PDF or a regular page:
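import { linkExtraction } from "linkloom";
// Find and classify URLs in plain text.
const links = linkExtraction.extractLinks("check https://example.com/doc.pdf");
// htmlContent is a placeholder for HTML you have already fetched or rendered.
const htmlContent = '<a href="https://example.com/report.pdf">Report</a>';
const pdfLinks = await linkExtraction.extractDownloadLinksFromHtml(htmlContent);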
extractLinks works on raw text: it finds URLs and classifies them. extractDownloadLinksFromHtml parses an HTML document and pulls out links that point to downloadable files (PDFs, mostly). Useful when you're crawling a page and want to know which links lead to documents worth converting.
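To make the "find, then classify" idea concrete, here is a minimal sketch of how raw-text link extraction could work: a regular expression finds candidate URLs, trailing punctuation is trimmed, and the path extension decides the tag. The function name, the ClassifiedLink shape, and the "pdf" / "page" labels are all assumptions made for illustration; LinkLoom's actual return type may look different.

```typescript
// Hypothetical types and names; not LinkLoom's real return shape.
type LinkKind = "pdf" | "page";

interface ClassifiedLink {
  url: string;
  kind: LinkKind;
}

function extractLinksSketch(text: string): ClassifiedLink[] {
  // Match http(s) URLs up to the next whitespace or common delimiter.
  const matches = text.match(/https?:\/\/[^\s<>"')\]]+/g) ?? [];
  const seen = new Set<string>();
  const links: ClassifiedLink[] = [];

  for (const raw of matches) {
    // Drop trailing sentence punctuation that is rarely part of the URL.
    const url = raw.replace(/[.,;:!?]+$/, "");
    if (seen.has(url)) continue;
    seen.add(url);

    let pathname: string;
    try {
      pathname = new URL(url).pathname;
    } catch {
      continue; // skip anything that does not parse as a URL
    }
    links.push({ url, kind: pathname.toLowerCase().endsWith(".pdf") ? "pdf" : "page" });
  }
  return links;
}

console.log(
  extractLinksSketch("check https://example.com/doc.pdf, then https://example.com/about."),
);
```

Classifying by extension is cheap but imperfect, since a PDF served from an extensionless URL would be tagged as a page. A stricter pipeline could confirm the type with a HEAD request.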
Table extraction renders a page in the headless browser and pulls out HTML tables as structured data:
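import { tableExtraction, renderers } from "linkloom";

const url = "https://example.com/data";
const browser = await renderers.puppeterRendered.initialize();
// The third argument is a CSS selector for the tables to extract.
const data = await tableExtraction.extractTableData(browser, url, "table");
// Convert the extracted data into a markdown table string.
const md = tableExtraction.tableDataToMarkdownTable(data);
await browser.close();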
The third argument is a CSS selector: pass "table" for all tables, or "table.stats" for a specific one. The output is a markdown table string, ready to drop into a document.
The CLI shortcuts:
bunx @boris.barac/linkloom links https://example.com
bunx @boris.barac/linkloom tables https://example.com/data --selector "table.stats"
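The conversion from extracted cells to a markdown table is simple enough to sketch. The version below assumes the table arrives as an array of rows with the header first, which is a guess about the data shape rather than LinkLoom's documented format; rowsToMarkdownTable is a hypothetical name.

```typescript
// Hypothetical stand-in for a rows-to-markdown step; the first row is the header.
function rowsToMarkdownTable(rows: string[][]): string {
  if (rows.length === 0) return "";
  // Pad every row to the widest row so the table stays rectangular.
  const width = Math.max(...rows.map((row) => row.length));

  const formatRow = (row: string[]): string => {
    const cells = Array.from({ length: width }, (_, i) =>
      // Escape pipes and collapse newlines so a cell cannot break the table.
      (row[i] ?? "").replace(/\|/g, "\\|").replace(/\s*\n\s*/g, " ").trim(),
    );
    return `| ${cells.join(" | ")} |`;
  };

  const header = rows[0] ?? [];
  const body = rows.slice(1);
  const separator = `| ${Array.from({ length: width }, () => "---").join(" | ")} |`;
  return [formatRow(header), separator, ...body.map(formatRow)].join("\n");
}

console.log(
  rowsToMarkdownTable([
    ["Team", "Wins"],
    ["Lions", "12"],
    ["Bears", "9"],
  ]),
);
```

Escaping pipe characters is the detail that is easiest to forget: an unescaped "|" inside a cell silently adds a column.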
All of this is also available as an MCP server. If you use Claude Desktop, Cursor, or any MCP-compatible client, you can expose LinkLoom's tools without writing code; the AI calls them directly.
Six tools: scrape, html_to_markdown, pdf_to_markdown, render_page, extract_links, extract_tables. Same capabilities as the library and CLI, but surfaced as tool calls an AI agent can use autonomously.
Configuration is a few lines of JSON. For Claude Desktop, edit ~/Library/Application Support/Claude/claude_desktop_config.json:
{
  "mcpServers": {
    "linkloom": {
      "command": "bun",
      "args": ["x", "@boris.barac/linkloom", "mcp"]
    }
  }
}
For Cursor, add the same block to .cursor/mcp.json in your project or ~/.cursor/mcp.json globally. For any other MCP client, point it at bun x @boris.barac/linkloom mcp and it works.
The server communicates over stdio. It reads JSON-RPC from stdin and writes responses to stdout. You don't run it directly; MCP clients spawn it as a child process. If you want to test it interactively, there's the MCP Inspector:
bunx @modelcontextprotocol/inspector bunx @boris.barac/linkloom mcp
That opens a web UI where you can browse the available tools, call them with custom parameters, and inspect the JSON-RPC messages going back and forth.
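For a sense of what travels over stdin, here is a sketch of the JSON-RPC 2.0 request an MCP client sends to invoke a tool. The tools/call method and the name / arguments parameters come from the MCP specification; the url argument for scrape is an assumption, since the tools' input schemas are not listed here. toolCall and frame are hypothetical helper names.

```typescript
interface JsonRpcRequest {
  jsonrpc: "2.0";
  id: number;
  method: string;
  params?: Record<string, unknown>;
}

let nextId = 1;

// Build an MCP "tools/call" request. The argument names passed in `args`
// (such as `url`) are assumptions, not LinkLoom's documented schema.
function toolCall(name: string, args: Record<string, unknown>): JsonRpcRequest {
  return {
    jsonrpc: "2.0",
    id: nextId++,
    method: "tools/call",
    params: { name, arguments: args },
  };
}

// The stdio transport carries one JSON message per line.
function frame(message: JsonRpcRequest): string {
  return JSON.stringify(message) + "\n";
}

const line = frame(toolCall("scrape", { url: "https://example.com" }));
console.log(line);
```

A real client performs the initialize handshake before calling any tool, which is one reason to let an MCP client or the Inspector drive the server rather than writing these lines by hand.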
Install it with Bun:
bun add @boris.barac/linkloom
Or skip the install and use it directly:
bunx @boris.barac/linkloom scrape https://example.com
No API keys needed for the core scraping pipeline. Only the optional text embedding feature requires an OpenAI or Gemini key.