LinkLoom - AIWebReader

LinkLoom, a new TypeScript/Bun toolkit, provides a unified API, CLI, and MCP server for converting web pages, PDFs, and iframes into clean markdown. The tool handles JavaScript-heavy pages through a stealth headless browser, extracts content from nested iframes, and converts PDFs to structured text or plain text without external dependencies.

A web scraping and content extraction toolkit for TypeScript/Bun. Pass a URL, get clean markdown. That's the core.

But LinkLoom also handles the cases that break simple scrapers: JavaScript-heavy pages rendered through a stealth browser, PDFs parsed into structured text, content pulled from nested iframes, HTML tables converted to markdown tables, links extracted and classified. It exposes a library API, a CLI, and an MCP server, so you can use it from code, from the terminal, or from an AI client like Claude Desktop or Cursor.

The full list:

- URL-to-markdown conversion
- HTML-to-markdown via Readability + Turndown
- PDF-to-markdown via pdf.js
- Headless browser rendering through Camoufox (stealth Firefox on Playwright)
- Iframe extraction with configurable wait strategies
- Link extraction and classification
- Table scraping
- Text embeddings via OpenAI or Gemini
- A CLI for every feature
- An MCP server for AI tool-use workflows

Built with Bun, Camoufox, JSDOM, Readability, Turndown, and pdf.js-extract. Optional embedding support through LangChain.

```js
import { convertLinkToMarkdown } from "linkloom";

const markdown = await convertLinkToMarkdown("https://example.com");
```

That's it. One import, one call. The function auto-detects whether the URL points to an HTML page or a PDF and routes it to the right converter. You get back a string of clean markdown: no boilerplate, no configuration objects, no setup ceremony.

The CLI equivalent:

```bash
bunx @boris.barac/linkloom scrape https://example.com
```

Same result, different interface. Pipe it, redirect it, or pass `-o output.md` to write to a file.

But plenty of pages don't hand you their content on the first request. They render everything with JavaScript: SPAs, dashboards, dynamically loaded articles. A simple fetch returns an empty shell. LinkLoom handles this through headless browser rendering via Camoufox, a stealth Firefox build on Playwright that avoids bot detection.
```js
import { renderers } from "linkloom";

const url = "https://example.com";

// Start the headless browser once and reuse it across pages.
const browser = await renderers.puppeterRendered.initialize();

const result = await renderers.puppeterRendered.renderPage(browser, url, {
  timeout: 15000,
  waitUntil: "networkidle",
  viewport: { width: 1920, height: 1080 },
  frames: { enabled: true, timeout: 5000 },
});

await browser.close();
```

The `renderPage` function loads the URL in a real browser, waits for the network to settle (or for a specific event), and returns the rendered HTML. The `frames` option tells it to also extract content from nested iframes, with its own timeout, because iframes load on their own schedule and you don't want one slow frame to block everything.

The CLI version:

```bash
bunx @boris.barac/linkloom render https://example.com --wait-until networkidle --timeout 15000
```

Add `--selector "table.stats"` to extract only a specific element instead of the full page. Useful when you know exactly what you're after.

Then there are PDFs. Research papers, technical reports, product documentation: a surprising amount of the web's useful content lives in PDFs, not HTML pages. The same `convertLinkToMarkdown` call handles both, but you can also convert PDFs directly:

```js
import { pdfConverter } from "linkloom";
import { readFile } from "node:fs/promises";

const buffer = await readFile("document.pdf");

const markdown = await pdfConverter.convertPdfToMarkdown(buffer);
const text = await pdfConverter.convertPdfToText(buffer);
```

Two output modes: `convertPdfToMarkdown` preserves structure (headings, lists, formatting), while `convertPdfToText` strips everything down to plain text. Pick whichever fits your pipeline.

The CLI:

```bash
bunx @boris.barac/linkloom pdf document.pdf -o output.md
```

Under the hood it uses pdf.js-extract to parse the binary, so there's no external dependency on system tools like `pdftotext`. It works out of the box.

Content conversion is half the job. The other half is pulling structured data out of pages: links, tables, the things that aren't prose.
Link extraction finds and classifies URLs from plain text or HTML. Feed it a string and it returns every link, tagged as a PDF or a regular page:

```js
import { linkExtraction } from "linkloom";

// Works on raw text: finds URLs and classifies each one.
const links = linkExtraction.extractLinks("check https://example.com/doc.pdf");

// htmlContent is an HTML string you have already fetched or rendered.
const pdfLinks = await linkExtraction.extractDownloadLinksFromHtml(htmlContent);
```

`extractLinks` works on raw text: it finds URLs and classifies them. `extractDownloadLinksFromHtml` parses an HTML document and pulls out links that point to downloadable files (PDFs, mostly). Useful when you're crawling a page and want to know which links lead to documents worth converting.

Table extraction renders a page in the headless browser and pulls out HTML tables as structured data:

```js
import { tableExtraction, renderers } from "linkloom";

const url = "https://example.com/data";

const browser = await renderers.puppeterRendered.initialize();

const data = await tableExtraction.extractTableData(browser, url, "table");
const md = tableExtraction.tableDataToMarkdownTable(data);

await browser.close();
```

The third argument to `extractTableData` is a CSS selector: pass `"table"` for all tables, or `"table.stats"` for a specific one. `tableDataToMarkdownTable` then turns the extracted data into a markdown table string, ready to drop into a document.

The CLI shortcuts:

```bash
bunx @boris.barac/linkloom links https://example.com
bunx @boris.barac/linkloom tables https://example.com/data --selector "table.stats"
```

All of this is also available as an MCP server. If you use Claude Desktop, Cursor, or any MCP-compatible client, you can expose LinkLoom's tools without writing code; the AI calls them directly.

Six tools: `scrape`, `html_to_markdown`, `pdf_to_markdown`, `render_page`, `extract_links`, `extract_tables`. Same capabilities as the library and CLI, but surfaced as tool calls an AI agent can use autonomously.

Configuration is a few lines of JSON.
For Claude Desktop, edit `~/Library/Application Support/Claude/claude_desktop_config.json`:

```json
{
  "mcpServers": {
    "linkloom": {
      "command": "bun",
      "args": ["x", "@boris.barac/linkloom", "mcp"]
    }
  }
}
```

For Cursor, add the same block to `.cursor/mcp.json` in your project or `~/.cursor/mcp.json` globally. For any other MCP client, point it at `bun x @boris.barac/linkloom mcp` and it works.

The server communicates over stdio. It reads JSON-RPC from stdin and writes responses to stdout. You don't run it directly; MCP clients spawn it as a child process. If you want to test it interactively, there's the MCP Inspector:

```bash
bunx @modelcontextprotocol/inspector bunx @boris.barac/linkloom mcp
```

That opens a web UI where you can browse the available tools, call them with custom parameters, and inspect the JSON-RPC messages going back and forth.

To install:

```bash
bun add @boris.barac/linkloom
```

Or skip the install and use it directly:

```bash
bunx @boris.barac/linkloom scrape https://example.com
```

No API keys needed for the core scraping pipeline. Only the optional text embedding feature requires an OpenAI or Gemini key.
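
For readers curious what the stdio exchange mentioned earlier looks like on the wire, here is a minimal sketch of the framing an MCP client might use when calling one of the tools. It is not LinkLoom's code. The helper names (`buildToolCall`, `frame`, `parseFrames`) are hypothetical, the `url` argument name for the `scrape` tool is an assumption, and the initialization handshake that a real client performs before any tool call is left out.

```typescript
// Hypothetical sketch of MCP-style JSON-RPC 2.0 framing over stdio.
// Not taken from LinkLoom; the "url" argument name is an assumption.

interface JsonRpcRequest {
  jsonrpc: "2.0";
  id: number;
  method: string;
  params?: Record<string, unknown>;
}

// Build a "tools/call" request for a named tool with its arguments.
function buildToolCall(
  id: number,
  tool: string,
  args: Record<string, unknown>,
): JsonRpcRequest {
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name: tool, arguments: args },
  };
}

// The stdio transport carries one JSON message per line. JSON.stringify
// without indentation never emits a raw newline, so appending "\n" is enough.
function frame(message: JsonRpcRequest): string {
  return JSON.stringify(message) + "\n";
}

// Split a chunk read from a pipe into parsed messages, skipping blank lines.
function parseFrames(chunk: string): unknown[] {
  return chunk
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line));
}

const request = buildToolCall(1, "scrape", { url: "https://example.com" });
const wire = frame(request);

// This string is what a client would write to the server's stdin.
console.log(wire);
```

In practice the MCP client library handles this for you; the sketch is only meant to make the messages shown in the MCP Inspector easier to read.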