Linkloom - AIWebReader

wpnews.pro

A web scraping and content extraction toolkit for TypeScript/Bun.

Pass a URL, get clean markdown. That's the core. But LinkLoom also handles the cases that break simple scrapers: JavaScript-heavy pages rendered through a stealth browser, PDFs parsed into structured text, iframes pulled from nested frames, HTML tables converted to markdown tables, links extracted and classified. It exposes a library API, a CLI, and an MCP server — so you can use it from code, from the terminal, or from an AI client like Claude Desktop or Cursor.

The full list: URL-to-markdown conversion, HTML-to-markdown via Readability + Turndown, PDF-to-markdown via pdf.js, headless browser rendering through Camoufox (stealth Firefox on Playwright), iframe extraction with configurable wait strategies, link extraction and classification, table scraping, text embeddings via OpenAI or Gemini, a CLI for every feature, and an MCP server for AI tool-use workflows.

Built with Bun, Camoufox, JSDOM, Readability, Turndown, and pdf.js-extract. Optional embedding support through LangChain.

import { convertLinkToMarkdown } from "linkloom";

const markdown = await convertLinkToMarkdown("https://example.com");

That's it. One import, one call. The function auto-detects whether the URL points to an HTML page or a PDF and routes it to the right converter. You get back a string of clean markdown — no boilerplate, no configuration objects, no setup ceremony.
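To make the routing step concrete, here is a minimal sketch of how an HTML-versus-PDF decision could work. It is not LinkLoom's actual detection logic: the function name `detectContentKind`, the preference for the `Content-Type` header, and the fallback to the `.pdf` extension are all assumptions made for illustration.

```typescript
// Hypothetical sketch: NOT LinkLoom's real detection code.
type ContentKind = "pdf" | "html";

// Decide which converter a response should be routed to.
// Prefers the Content-Type header; falls back to the URL's file extension.
function detectContentKind(url: string, contentType: string | null): ContentKind {
  // Drop parameters such as "; charset=utf-8" before comparing.
  const mime = ((contentType ?? "").split(";")[0] ?? "").trim().toLowerCase();
  if (mime === "application/pdf") return "pdf";
  if (mime === "text/html" || mime === "application/xhtml+xml") return "html";

  // Header missing or generic (e.g. application/octet-stream):
  // strip the query string and fragment, then check the extension.
  const path = (url.split(/[?#]/)[0] ?? "").toLowerCase();
  return path.endsWith(".pdf") ? "pdf" : "html";
}
```

Checking the header first matters because many PDFs are served from URLs with no `.pdf` suffix, and some `.pdf` URLs actually return an HTML viewer page.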

The CLI equivalent:

bunx @boris.barac/linkloom scrape https://example.com

Same result, different interface. Pipe it, redirect it, or pass -o output.md to write to a file.

But plenty of pages don't hand you their content on the first request. They render everything with JavaScript — SPAs, dashboards, dynamically loaded articles. A simple fetch returns an empty shell. LinkLoom handles this through headless browser rendering via Camoufox, a stealth Firefox build on Playwright that avoids bot detection.

import { renderers } from "linkloom";

const browser = await renderers.puppeterRendered.initialize();
const result = await renderers.puppeterRendered.renderPage(browser, url, {
  timeout: 15000,
  waitUntil: "networkidle",
  viewport: { width: 1920, height: 1080 },
  frames: { enabled: true, timeout: 5000 },
});
await browser.close();

The renderPage function loads the URL in a real browser, waits for the network to settle (or for a specific event), and returns the rendered HTML. The frames option tells it to also extract content from nested iframes, with its own timeout, because iframes load on their own schedule and you don't want one slow frame to block everything.
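The idea of a separate per-frame timeout can be sketched as follows. This is an illustration of the general technique, not LinkLoom's internals: `withTimeout`, `collectFrames`, and the notion of a "loader" function per frame are hypothetical names introduced here.

```typescript
// Hypothetical sketch of a per-frame timeout; not LinkLoom's real code.

// Rejects if `task` has not settled within `ms` milliseconds.
function withTimeout<T>(task: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
    task.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (error) => { clearTimeout(timer); reject(error); },
    );
  });
}

// Load all frames in parallel. Frames that fail or exceed the timeout
// are skipped, so one slow frame cannot hold up the others.
async function collectFrames(
  loaders: Array<() => Promise<string>>,
  timeoutMs: number,
): Promise<string[]> {
  const results = await Promise.all(
    loaders.map((load) => withTimeout(load(), timeoutMs).catch(() => null)),
  );
  return results.filter((content): content is string => content !== null);
}
```

With this shape, the total wait is bounded by the frame timeout rather than by the slowest frame, which is the behaviour the frames option describes.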

The CLI version:

bunx @boris.barac/linkloom render https://example.com --wait-until networkidle --timeout 15000

Add --selector "table.stats" to extract only a specific element instead of the full page. Useful when you know exactly what you're after.

Then there are PDFs. Research papers, technical reports, product documentation: a surprising amount of the web's useful content lives in PDFs, not HTML pages. The same convertLinkToMarkdown call handles both, but you can also convert PDFs directly:

import { pdfConverter } from "linkloom";
import { readFile } from "node:fs/promises";

const buffer = await readFile("document.pdf");
const markdown = await pdfConverter.convertPdfToMarkdown(buffer);
const text = await pdfConverter.convertPdfToText(buffer);

Two output modes: convertPdfToMarkdown preserves structure (headings, lists, formatting), while convertPdfToText strips everything down to plain text. Pick whichever fits your pipeline.

The CLI:

bunx @boris.barac/linkloom pdf document.pdf -o output.md

Under the hood it uses pdf.js-extract to parse the binary, so there's no external dependency on system tools like pdftotext. It works out of the box.

Content conversion is half the job. The other half is pulling structured data out of pages — links, tables, the things that aren't prose.

Link extraction finds and classifies URLs from plain text or HTML. Feed it a string and it returns every link, tagged as a PDF or a regular page:

import { linkExtraction } from "linkloom";

const links = linkExtraction.extractLinks("check https://example.com/doc.pdf");
const pdfLinks = await linkExtraction.extractDownloadLinksFromHtml(htmlContent);

extractLinks works on raw text: it finds URLs and classifies them. extractDownloadLinksFromHtml parses an HTML document and pulls out links that point to downloadable files (PDFs, mostly). Useful when you're crawling a page and want to know which links lead to documents worth converting.
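As a rough illustration of finding and classifying URLs in raw text, the sketch below uses a regular expression and an extension check. The function `findAndClassifyLinks`, the `ClassifiedLink` shape, and the matching rules are assumptions; LinkLoom's real matcher and return type may well differ.

```typescript
// Hypothetical sketch; LinkLoom's real matcher and result shape may differ.
type LinkKind = "pdf" | "page";

interface ClassifiedLink {
  url: string;
  kind: LinkKind;
}

function findAndClassifyLinks(text: string): ClassifiedLink[] {
  // Match http(s) URLs up to the next whitespace or common delimiter.
  const matches = text.match(/https?:\/\/[^\s<>"')\]]+/g) ?? [];
  const seen = new Set<string>();
  const links: ClassifiedLink[] = [];

  for (const raw of matches) {
    // Drop sentence punctuation that trails the URL.
    const url = raw.replace(/[.,;:!?]+$/, "");
    if (seen.has(url)) continue; // report each URL once
    seen.add(url);

    // Ignore the query string and fragment when checking the extension.
    const path = (url.split(/[?#]/)[0] ?? "").toLowerCase();
    links.push({ url, kind: path.endsWith(".pdf") ? "pdf" : "page" });
  }
  return links;
}
```

An extension check is only a heuristic; a crawler that needs certainty would confirm the type from the response headers before converting.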

Table extraction renders a page in the headless browser and pulls out HTML tables as structured data:

import { tableExtraction, renderers } from "linkloom";

const browser = await renderers.puppeterRendered.initialize();
const data = await tableExtraction.extractTableData(browser, url, "table");
const md = tableExtraction.tableDataToMarkdownTable(data);
await browser.close();

The third argument is a CSS selector: pass "table" for all tables, or "table.stats" for a specific one. The output is a markdown table string, ready to drop into a document.
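For a sense of what the table-to-markdown step involves, here is a small standalone sketch. It assumes the extracted table is an array of rows with the header row first; the article does not specify the shape returned by extractTableData, so `rowsToMarkdownTable` and its input type are hypothetical.

```typescript
// Hypothetical sketch; assumes rows of cell strings with the header row first.
function rowsToMarkdownTable(rows: string[][]): string {
  if (rows.length === 0) return "";
  const width = Math.max(...rows.map((row) => row.length));

  // Escape pipes and collapse line breaks so each cell stays on one line.
  const cell = (value: string | undefined): string =>
    (value ?? "").replace(/\|/g, "\\|").replace(/\s*\n\s*/g, " ").trim();

  // Pad short rows with empty cells so every line has the same column count.
  const line = (row: string[]): string =>
    "| " + Array.from({ length: width }, (_, i) => cell(row[i])).join(" | ") + " |";

  const header = rows[0] ?? [];
  const separator = "| " + Array.from({ length: width }, () => "---").join(" | ") + " |";
  return [line(header), separator, ...rows.slice(1).map(line)].join("\n");
}
```

Escaping the pipe character is the detail that is easiest to miss: an unescaped pipe inside a cell silently shifts every column after it.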

The CLI shortcuts:

bunx @boris.barac/linkloom links https://example.com
bunx @boris.barac/linkloom tables https://example.com/data --selector "table.stats"

All of this is also available as an MCP server. If you use Claude Desktop, Cursor, or any MCP-compatible client, you can expose LinkLoom's tools without writing code — the AI calls them directly.

Six tools: scrape, html_to_markdown, pdf_to_markdown, render_page, extract_links, extract_tables. Same capabilities as the library and CLI, but surfaced as tool calls an AI agent can use autonomously.

Configuration is a few lines of JSON. For Claude Desktop on macOS, edit ~/Library/Application Support/Claude/claude_desktop_config.json:

{
  "mcpServers": {
    "linkloom": {
      "command": "bun",
      "args": ["x", "@boris.barac/linkloom", "mcp"]
    }
  }
}

For Cursor, add the same block to .cursor/mcp.json in your project or ~/.cursor/mcp.json globally. For any other MCP client, point it at bun x @boris.barac/linkloom mcp and it works.

The server communicates over stdio. It reads JSON-RPC from stdin and writes responses to stdout. You don't run it directly; MCP clients spawn it as a child process. If you want to test it interactively, there's the MCP Inspector:

bunx @modelcontextprotocol/inspector bunx @boris.barac/linkloom mcp

That opens a web UI where you can browse the available tools, call them with custom parameters, and inspect the JSON-RPC messages going back and forth.
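The messages you will see in the Inspector follow the JSON-RPC 2.0 format that MCP uses. As a sketch, a client's request to call the scrape tool could be built like this. The `tools/call` method and the `name`/`arguments` parameters come from the MCP specification; the `url` argument name is an assumption, since the article does not show the tool's input schema, and `buildToolCall` and `frameForStdio` are hypothetical helpers.

```typescript
// Sketch of a message an MCP client would write to the server's stdin.
// Assumption: the scrape tool takes a "url" argument; the real schema may differ.
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

function buildToolCall(
  id: number,
  tool: string,
  args: Record<string, unknown>,
): ToolCallRequest {
  return { jsonrpc: "2.0", id, method: "tools/call", params: { name: tool, arguments: args } };
}

// The stdio transport carries one JSON message per line.
function frameForStdio(message: ToolCallRequest): string {
  return JSON.stringify(message) + "\n";
}

const wire = frameForStdio(buildToolCall(1, "scrape", { url: "https://example.com" }));
```

A real session starts with an initialize handshake before any tool call, which is one reason to let an MCP client or the Inspector drive the server rather than writing to stdin by hand.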

To install it as a dependency:

bun add @boris.barac/linkloom

Or skip the install and use it directly:

bunx @boris.barac/linkloom scrape https://example.com

No API keys needed for the core scraping pipeline. Only the optional text embedding feature requires an OpenAI or Gemini key.

Source: dev.to (original article)
