When scraping orchestration is the wrong abstraction for LLM workflows

wpnews.pro

A lot of LLM workflows start with the same small problem: the model needs fresh data from a web page. Then the integration grows sideways. You add a scraper, a queue, a dataset store, polling logic, retries, and a parser. By the end, the code that moves data around is larger than the code that uses the data.

This is not because scraping platforms are bad. It is because they solve a broader problem than many LLM apps actually have.

Platforms like Apify are built around actors: reusable scraping or automation jobs with inputs, runs, logs, datasets, scheduling, and platform-managed execution. That model makes sense when you run recurring jobs across many targets, chain multiple scraping tasks, or need shared actors across a team.

For example, a batch pipeline might look like this:

schedule -> run actor -> wait for completion -> read dataset -> normalize rows -> store results -> trigger downstream job

That is useful if you are refreshing competitor pricing every night or maintaining a long-lived dataset.

An LLM tool call usually looks different:

prompt -> fetch one page -> extract fields -> pass JSON back to the model

If you use a full actor lifecycle for that second case, you pay for concepts you may not need: actor discovery, input schemas, run state, dataset retrieval, and actor-specific output formats. The failure modes also spread out. A run can succeed while the dataset is empty. A page can render differently and produce partial data. A parser can return HTML where your downstream tool expects JSON.
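The last two failure modes can be caught with a small check before anything reaches the model. The following is a minimal sketch, not a provider API: `classifyPayload` and its `requiredFields` parameter are hypothetical names, and treating partial data as `INVALID_OUTPUT` is one possible choice.

```typescript
type Classified =
  | { ok: true; data: Record<string, unknown> }
  | { ok: false; error: "EMPTY_RESULT" | "INVALID_OUTPUT"; message: string };

// Hypothetical helper: classify a raw provider payload before the model sees it.
function classifyPayload(raw: string, requiredFields: string[]): Classified {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    // Typical case: the provider returned an HTML page instead of JSON.
    return { ok: false, error: "INVALID_OUTPUT", message: "Payload is not valid JSON" };
  }

  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return { ok: false, error: "INVALID_OUTPUT", message: "Payload is not a JSON object" };
  }

  const data = parsed as Record<string, unknown>;

  // The call "succeeded" but produced no fields at all.
  if (Object.keys(data).length === 0) {
    return { ok: false, error: "EMPTY_RESULT", message: "No structured fields returned" };
  }

  // Partial data: some expected fields are missing or blank.
  const missing = requiredFields.filter(
    (field) => data[field] === undefined || data[field] === null || data[field] === ""
  );
  if (missing.length > 0) {
    return { ok: false, error: "INVALID_OUTPUT", message: `Missing fields: ${missing.join(", ")}` };
  }

  return { ok: true, data };
}
```

Whether partial data should be an error or a warning depends on the product; the point is that the decision is made in one place.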

That is where the abstraction matters more than the vendor.

For most agentic workflows, the cleanest internal interface is not “run scraper X.” It is “given this target and extraction intent, return typed data or a typed error.”

Something like this:

type ExtractRequest = {
  url: string;
  schema: Record<string, string>;
};

type ExtractResult =
  | {
      ok: true;
      data: Record<string, unknown>;
      sourceUrl: string;
    }
  | {
      ok: false;
      error: "AUTH" | "TIMEOUT" | "BLOCKED" | "EMPTY_RESULT" | "INVALID_OUTPUT";
      message: string;
    };

Then hide the provider behind an adapter:

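// Adapter: the only place that knows the provider's HTTP contract.
async function extractPage(req: ExtractRequest): Promise<ExtractResult> {
  const res = await fetch(process.env.EXTRACT_API_URL!, {
    method: "POST",
    headers: {
      "Authorization": `Bearer ${process.env.EXTRACT_API_KEY}`,
      "Content-Type": "application/json"
    },
    body: JSON.stringify(req)
  });

  if (res.status === 401) {
    return { ok: false, error: "AUTH", message: "Invalid API key" };
  }

  if (res.status === 408 || res.status === 504) {
    return { ok: false, error: "TIMEOUT", message: "Extraction timed out" };
  }

  // Any other non-2xx status is reported as BLOCKED in this sketch.
  if (!res.ok) {
    return { ok: false, error: "BLOCKED", message: await res.text() };
  }

  // A 2xx response can still carry a non-JSON body, such as an HTML page.
  let body: any;
  try {
    body = await res.json();
  } catch {
    return { ok: false, error: "INVALID_OUTPUT", message: "Response body was not valid JSON" };
  }

  const data = body?.data;

  // Reject payloads where data is present but is not a plain object.
  if (data != null && (typeof data !== "object" || Array.isArray(data))) {
    return { ok: false, error: "INVALID_OUTPUT", message: "Expected data to be an object" };
  }

  if (!data || Object.keys(data).length === 0) {
    return { ok: false, error: "EMPTY_RESULT", message: "No structured fields returned" };
  }

  return {
    ok: true,
    data,
    sourceUrl: req.url
  };
}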

The important part is not the exact provider. The important part is that your LLM application receives a predictable result. The model should not need to know whether the data came from a browser automation run, a marketplace actor, a custom scraper, or a REST extraction endpoint.
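As an illustration of that predictability, the tool layer can serialize every result into one envelope regardless of which backend produced it. This is a sketch under assumptions: `toToolMessage` and the `status`/`code` field names are hypothetical and not part of any provider's API.

```typescript
type ExtractResult =
  | { ok: true; data: Record<string, unknown>; sourceUrl: string }
  | {
      ok: false;
      error: "AUTH" | "TIMEOUT" | "BLOCKED" | "EMPTY_RESULT" | "INVALID_OUTPUT";
      message: string;
    };

// Hypothetical helper: turn a typed result into the string handed back to the model.
// The envelope has the same shape for every provider behind the adapter.
function toToolMessage(result: ExtractResult): string {
  if (result.ok) {
    return JSON.stringify({ status: "ok", source: result.sourceUrl, data: result.data });
  }
  return JSON.stringify({ status: "error", code: result.error, message: result.message });
}
```

Swapping the provider then changes the adapter only; the prompt and the tool description the model sees stay the same.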

Wire by Anakin uses this direct extraction shape: submit a task over REST, poll the job, and get structured JSON back for the next tool call.

A lot of scraping APIs are async because pages take time to load, render, and extract. That is fine. The mistake is letting async job management leak through your whole application.

Keep polling inside the adapter:

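async function pollJob(jobUrl: string, apiKey: string, timeoutMs = 30000) {
  const started = Date.now();

  // Poll once per second until the job finishes or the deadline passes.
  while (Date.now() - started < timeoutMs) {
    const res = await fetch(jobUrl, {
      headers: { "X-API-Key": apiKey }
    });

    // Fail fast on HTTP errors instead of parsing an error page as JSON.
    if (!res.ok) {
      throw new Error(`POLL_HTTP_${res.status}`);
    }

    const body = await res.json();

    if (body.status === "completed") return body.data;

    if (body.status === "failed") {
      throw new Error(body.error_code ?? "EXTRACTION_FAILED");
    }

    // Still pending: wait before the next attempt.
    await new Promise(resolve => setTimeout(resolve, 1000));
  }

  throw new Error("EXTRACTION_TIMEOUT");
}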

Your agent code should call extractPage() and receive data or an error. It should not manage run IDs, dataset IDs, actor logs, or retry policy unless those concepts matter to the product.
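One way to keep retry policy out of agent code is to decide it inside the adapter from the typed error alone. The sketch below assumes that `TIMEOUT` and `BLOCKED` are worth retrying and the other errors are not; `shouldRetry` and `maxAttempts` are hypothetical names, and the right policy depends on the provider.

```typescript
type ExtractError = "AUTH" | "TIMEOUT" | "BLOCKED" | "EMPTY_RESULT" | "INVALID_OUTPUT";

// Hypothetical policy: `attempt` is the 1-based number of the attempt that just failed.
function shouldRetry(error: ExtractError, attempt: number, maxAttempts = 3): boolean {
  switch (error) {
    case "TIMEOUT":
    case "BLOCKED":
      // Assumed to be transient: retry until the attempt budget is used up.
      return attempt < maxAttempts;
    case "AUTH":
    case "EMPTY_RESULT":
    case "INVALID_OUTPUT":
      // Assumed to be deterministic: the same request would fail the same way.
      return false;
    default: {
      // Compile-time exhaustiveness check: adding a new error code
      // without handling it here becomes a type error.
      const unreachable: never = error;
      return unreachable;
    }
  }
}
```

The adapter can loop on this decision internally, so the agent still sees a single call that returns data or a final error.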

A direct extraction API is not always better. If you need scheduled scraping, dataset versioning, large proxy pools, or reusable workflows that non-LLM systems consume, an actor-based platform can be the better fit.

Apify’s model works well when you want a managed scraping pipeline rather than a single extraction call. The actor marketplace also helps when a specific site already has a maintained actor and your team is comfortable with that platform’s run and dataset model.

The tradeoff is coupling. Your application starts to understand provider-specific concepts: actors, runs, datasets, schemas, and logs. That may be acceptable for a data engineering pipeline. It is often unnecessary for a chat agent or RAG ingestion path that just needs structured fields.
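If an actor-style backend is still the right fit, the coupling can be contained at the adapter boundary by translating run state into your own result type. The sketch below uses a hypothetical `ActorStyleRun` shape for illustration; it is not the run or dataset format of any real platform, and mapping a failed run to `BLOCKED` is an assumption.

```typescript
type ExtractResult =
  | { ok: true; data: Record<string, unknown>; sourceUrl: string }
  | {
      ok: false;
      error: "AUTH" | "TIMEOUT" | "BLOCKED" | "EMPTY_RESULT" | "INVALID_OUTPUT";
      message: string;
    };

// Hypothetical, simplified view of a finished run and its dataset rows.
type ActorStyleRun = {
  status: "succeeded" | "failed" | "timed_out";
  items: Record<string, unknown>[];
  errorMessage?: string;
};

// Translate provider concepts into the application's own contract.
function fromActorRun(run: ActorStyleRun, sourceUrl: string): ExtractResult {
  if (run.status === "timed_out") {
    return { ok: false, error: "TIMEOUT", message: "Run timed out" };
  }
  if (run.status === "failed") {
    return { ok: false, error: "BLOCKED", message: run.errorMessage ?? "Run failed" };
  }

  // A run can succeed while the dataset is empty, so check the rows too.
  const first = run.items[0];
  if (!first || Object.keys(first).length === 0) {
    return { ok: false, error: "EMPTY_RESULT", message: "Run succeeded but returned no fields" };
  }

  return { ok: true, data: first, sourceUrl };
}
```

With this in place, runs and datasets stay inside one file, and the rest of the application only sees `ExtractResult`.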

Wire is aimed at the lower-overhead case, where the integration contract is HTTP in, structured JSON out, without adding a provider SDK to every runtime.

Before choosing a scraping tool, write the interface your application wants. Not the provider API. Your API.

Ask these questions:

If the workflow is prompt-driven and short-lived, keep the extraction layer small and typed. If the workflow is recurring, shared, and operationally complex, orchestration may be worth the extra surface area.

A good next step is to implement a provider-neutral extractPage() interface, run it against three real URLs your app depends on, and log every failure as one of your own error types. That will tell you quickly whether you need an extraction API or a full scraping platform.
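A small tally is enough to read the results of that experiment. This sketch assumes you have already collected one result per URL; `Outcome` and `summarizeOutcomes` are hypothetical names.

```typescript
type ExtractResult =
  | { ok: true; data: Record<string, unknown>; sourceUrl: string }
  | {
      ok: false;
      error: "AUTH" | "TIMEOUT" | "BLOCKED" | "EMPTY_RESULT" | "INVALID_OUTPUT";
      message: string;
    };

type Outcome = { url: string; result: ExtractResult };

// Count successes and each of your own error types across a test run.
function summarizeOutcomes(outcomes: Outcome[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const { result } of outcomes) {
    const key = result.ok ? "OK" : result.error;
    counts[key] = (counts[key] ?? 0) + 1;
  }
  return counts;
}
```

If most failures land in `EMPTY_RESULT` or `INVALID_OUTPUT`, the problem is likely in extraction quality; if they land in `TIMEOUT` or `BLOCKED`, it is more likely an infrastructure question.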

Source: dev.to (original article)
