# Linkloom - AIWebReader

> Source: <https://dev.to/boris9027/linkloom-aiwebreader-7gl>
> Published: 2026-06-12 16:24:17+00:00

A web scraping and content extraction toolkit for TypeScript/Bun.

Pass a URL, get clean markdown. That's the core. But LinkLoom also handles the cases that break simple scrapers: JavaScript-heavy pages rendered through a stealth browser, PDFs parsed into structured text, iframes pulled from nested frames, HTML tables converted to markdown tables, links extracted and classified. It exposes a library API, a CLI, and an MCP server — so you can use it from code, from the terminal, or from an AI client like Claude Desktop or Cursor.

The full list: URL-to-markdown conversion, HTML-to-markdown via Readability + Turndown, PDF-to-markdown via pdf.js-extract, headless browser rendering through Camoufox (stealth Firefox on Playwright), iframe extraction with configurable wait strategies, link extraction and classification, table scraping, text embeddings via OpenAI or Gemini, a CLI for every feature, and an MCP server for AI tool-use workflows.

Built with Bun, Camoufox, JSDOM, Readability, Turndown, and pdf.js-extract. Optional embedding support through LangChain.

``` js
import { convertLinkToMarkdown } from "linkloom";

const markdown = await convertLinkToMarkdown("https://example.com");
```

That's it. One import, one call. The function auto-detects whether the URL points to an HTML page or a PDF and routes it to the right converter. You get back a string of clean markdown — no boilerplate, no configuration objects, no setup ceremony.
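
How the auto-detection works internally isn't spelled out here, but the general technique is easy to sketch. The helper below is hypothetical (`detectKind` is not part of LinkLoom's API) and assumes three signals: the `Content-Type` header, the `%PDF-` bytes that open a PDF file, and, only when no header is available, the URL's file extension.

```typescript
// Hypothetical helper, not LinkLoom's actual implementation.
type Kind = "pdf" | "html";

function detectKind(url: string, contentType: string | null, head: Uint8Array): Kind {
  // 1. Trust an explicit PDF content type.
  if (contentType?.toLowerCase().includes("application/pdf")) return "pdf";

  // 2. PDF files begin with the ASCII bytes "%PDF-".
  const magic = [0x25, 0x50, 0x44, 0x46, 0x2d];
  if (head.length >= magic.length && magic.every((b, i) => head[i] === b)) return "pdf";

  // 3. With no header at all, fall back to the path extension (query ignored).
  if (!contentType) {
    try {
      if (new URL(url).pathname.toLowerCase().endsWith(".pdf")) return "pdf";
    } catch {
      // Not a parseable URL; fall through and treat it as HTML.
    }
  }
  return "html";
}
```

Checking the leading bytes matters because some servers label PDFs as `application/octet-stream`, so the header alone can mislead.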

The CLI equivalent:

```
bunx @boris.barac/linkloom scrape https://example.com
```

Same result, different interface. Pipe it, redirect it, or pass `-o output.md` to write to a file.

But plenty of pages don't hand you their content on the first request. They render everything with JavaScript — SPAs, dashboards, dynamically loaded articles. A simple fetch returns an empty shell. LinkLoom handles this through headless browser rendering via Camoufox, a stealth Firefox build on Playwright that avoids bot detection.

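``` js
import { renderers } from "linkloom";

const url = "https://example.com"; // page to render

// Start the headless browser, render the page, then shut the browser down.
const browser = await renderers.puppeterRendered.initialize();
const result = await renderers.puppeterRendered.renderPage(browser, url, {
  timeout: 15000, // page-load budget in ms
  waitUntil: "networkidle", // wait for network activity to settle
  viewport: { width: 1920, height: 1080 },
  frames: { enabled: true, timeout: 5000 }, // also extract iframes, with a separate timeout
});
await browser.close();
```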

The `renderPage` function loads the URL in a real browser, waits for the network to settle (or for a specific event), and returns the rendered HTML. The `frames` option tells it to also extract content from nested iframes, with its own timeout, because iframes load on their own schedule and you don't want one slow frame to block everything.
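
The reasoning behind a separate frame timeout can be sketched without a browser. The code below illustrates the general pattern and is not LinkLoom's implementation: `withTimeout` and `collectFrames` are hypothetical names, and each "frame" is modelled as a function that returns a promise of its HTML.

```typescript
// Hypothetical sketch of a per-frame timeout; not LinkLoom's internals.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
  });
  // Whichever settles first wins; the timer is cleared either way.
  return Promise.race([work, timeout]).finally(() => clearTimeout(timer));
}

async function collectFrames(
  loaders: Array<() => Promise<string>>,
  frameTimeoutMs: number,
): Promise<string[]> {
  // Start every frame at once and give each its own deadline.
  const settled = await Promise.allSettled(
    loaders.map((load) => withTimeout(load(), frameTimeoutMs)),
  );
  // Keep the frames that finished in time; drop the slow or failed ones.
  return settled
    .filter((r): r is PromiseFulfilledResult<string> => r.status === "fulfilled")
    .map((r) => r.value);
}
```

Because the frames are awaited together with `Promise.allSettled`, the total wait is bounded by the frame timeout and does not grow with the number of slow frames.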

The CLI version:

```
bunx @boris.barac/linkloom render https://example.com --wait-until networkidle --timeout 15000
```

Add `--selector "table.stats"` to extract only a specific element instead of the full page. Useful when you know exactly what you're after.

Then there are PDFs. Research papers, technical reports, product documentation: a surprising amount of the web's useful content lives in PDFs, not HTML pages. The same `convertLinkToMarkdown` call handles both, but you can also convert PDFs directly:

``` js
import { pdfConverter } from "linkloom";
import { readFile } from "node:fs/promises";

const buffer = await readFile("document.pdf");
const markdown = await pdfConverter.convertPdfToMarkdown(buffer);
const text = await pdfConverter.convertPdfToText(buffer);
```

Two output modes: `convertPdfToMarkdown` preserves structure (headings, lists, formatting), while `convertPdfToText` strips everything down to plain text. Pick whichever fits your pipeline.

The CLI:

```
bunx @boris.barac/linkloom pdf document.pdf -o output.md
```

Under the hood it uses pdf.js-extract to parse the binary, so there's no external dependency on system tools like `pdftotext`. It works out of the box.

Content conversion is half the job. The other half is pulling structured data out of pages — links, tables, the things that aren't prose.

**Link extraction** finds and classifies URLs from plain text or HTML. Feed it a string and it returns every link, tagged as a PDF or a regular page:

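``` js
import { linkExtraction } from "linkloom";

// Plain text in, classified links out.
const links = linkExtraction.extractLinks("check https://example.com/doc.pdf");

// htmlContent can be any HTML string; here it is fetched for illustration.
const htmlContent = await (await fetch("https://example.com")).text();
const pdfLinks = await linkExtraction.extractDownloadLinksFromHtml(htmlContent);
```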

`extractLinks` works on raw text: it finds URLs and classifies them. `extractDownloadLinksFromHtml` parses an HTML document and pulls out links that point to downloadable files (PDFs, mostly). Useful when you're crawling a page and want to know which links lead to documents worth converting.
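
To make the classification step concrete, here is a minimal sketch of the idea on plain text. It is not LinkLoom's code: the regex, the `classifyLinks` name, and the `{ url, type }` result shape are assumptions chosen for illustration, and the real `extractLinks` may return something different.

```typescript
// Illustrative only: a simplified stand-in for link extraction + classification.
type LinkType = "pdf" | "page";
interface ClassifiedLink {
  url: string;
  type: LinkType;
}

function classifyLinks(text: string): ClassifiedLink[] {
  // Grab http(s) URLs up to the next whitespace or obvious delimiter.
  const candidates = text.match(/https?:\/\/[^\s<>"')\]]+/g) ?? [];
  const seen = new Set<string>();
  const links: ClassifiedLink[] = [];

  for (const raw of candidates) {
    // Trim sentence punctuation that tends to trail a URL in prose.
    const url = raw.replace(/[.,;:!?]+$/, "");
    if (seen.has(url)) continue; // de-duplicate
    seen.add(url);

    let pathname: string;
    try {
      pathname = new URL(url).pathname;
    } catch {
      continue; // skip anything that does not parse as a URL
    }
    // Classify on the path only, so "?download=1" does not hide a ".pdf".
    links.push({ url, type: pathname.toLowerCase().endsWith(".pdf") ? "pdf" : "page" });
  }
  return links;
}
```

Extension matching is only a heuristic: a PDF served from an extensionless URL would be tagged as a page here, which is one reason to confirm the type when the link is actually fetched.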

**Table extraction** renders a page in the headless browser and pulls out HTML tables as structured data:

``` js
import { tableExtraction, renderers } from "linkloom";

const url = "https://example.com/data"; // page containing the tables

const browser = await renderers.puppeterRendered.initialize();
// The third argument is the CSS selector for the tables to extract.
const data = await tableExtraction.extractTableData(browser, url, "table");
const md = tableExtraction.tableDataToMarkdownTable(data);
await browser.close();
```

The third argument is a CSS selector: pass `"table"` for all tables, or `"table.stats"` for a specific one. The output is a markdown table string, ready to drop into a document.
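
The conversion step itself is simple enough to sketch. The function below assumes the table arrives as an array of rows of cell strings with the header row first; that shape and the name `rowsToMarkdown` are assumptions, and the data returned by LinkLoom's `extractTableData` may be structured differently.

```typescript
// Hypothetical sketch: rows of cell text -> markdown table string.
function rowsToMarkdown(rows: string[][]): string {
  if (rows.length === 0) return "";
  // Ragged rows are padded out to the widest row.
  const width = Math.max(...rows.map((r) => r.length));

  const formatRow = (cells: string[]): string => {
    const padded = Array.from({ length: width }, (_, i) => cells[i] ?? "");
    // Escape pipes and flatten newlines so a cell cannot break the row.
    const safe = padded.map((c) =>
      c.replace(/\|/g, "\\|").replace(/\s*\n\s*/g, " ").trim(),
    );
    return `| ${safe.join(" | ")} |`;
  };

  const header = rows[0] ?? [];
  const body = rows.slice(1);
  const separator = `| ${Array.from({ length: width }, () => "---").join(" | ")} |`;
  return [formatRow(header), separator, ...body.map(formatRow)].join("\n");
}
```

Escaping `|` inside cells is the detail most easily missed: without it, a cell such as `a|b` silently adds a column.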

The CLI shortcuts:

```
bunx @boris.barac/linkloom links https://example.com
bunx @boris.barac/linkloom tables https://example.com/data --selector "table.stats"
```

All of this is also available as an MCP server. If you use Claude Desktop, Cursor, or any MCP-compatible client, you can expose LinkLoom's tools without writing code — the AI calls them directly.

Six tools: `scrape`, `html_to_markdown`, `pdf_to_markdown`, `render_page`, `extract_links`, `extract_tables`. Same capabilities as the library and CLI, but surfaced as tool calls an AI agent can use autonomously.

Configuration is a few lines of JSON. For Claude Desktop on macOS, edit `~/Library/Application Support/Claude/claude_desktop_config.json`:

```
{
  "mcpServers": {
    "linkloom": {
      "command": "bun",
      "args": ["x", "@boris.barac/linkloom", "mcp"]
    }
  }
}
```

For Cursor, add the same block to `.cursor/mcp.json` in your project or `~/.cursor/mcp.json` globally. For any other MCP client, point it at `bun x @boris.barac/linkloom mcp` and it works.

The server communicates over stdio. It reads JSON-RPC from stdin and writes responses to stdout. You don't run it directly; MCP clients spawn it as a child process. If you want to test it interactively, there's the MCP Inspector:

```
bunx @modelcontextprotocol/inspector bunx @boris.barac/linkloom mcp
```

That opens a web UI where you can browse the available tools, call them with custom parameters, and inspect the JSON-RPC messages going back and forth.
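
For orientation, the messages you see there follow the JSON-RPC 2.0 shape that MCP uses for tool calls. The sketch below builds one such request by hand. The `tools/call` method and the `name`/`arguments` fields come from the MCP specification; the `url` argument for `scrape` is an assumption about LinkLoom's tool schema, so check the schema the Inspector lists before relying on it.

```typescript
// Sketch of an MCP tool-call request as it would travel over stdin.
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

function buildToolCall(
  id: number,
  name: string,
  args: Record<string, unknown>,
): ToolCallRequest {
  return { jsonrpc: "2.0", id, method: "tools/call", params: { name, arguments: args } };
}

// The stdio transport sends one JSON message per line.
function frame(message: ToolCallRequest): string {
  return JSON.stringify(message) + "\n";
}

// "url" is an assumed argument name for the scrape tool.
const wire = frame(buildToolCall(1, "scrape", { url: "https://example.com" }));
```

A real session begins with an `initialize` exchange before any tool call, which MCP clients and the Inspector handle for you.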

To use LinkLoom as a library, install it as a dependency:

```
bun add @boris.barac/linkloom
```

Or skip the install and use it directly:

```
bunx @boris.barac/linkloom scrape https://example.com
```

No API keys needed for the core scraping pipeline. Only the optional text embedding feature requires an OpenAI or Gemini key.
