How to make your AI Agent 111x cheaper and 2.5x faster at data aggregation

Source: dev.to · Published: 2026-05-26T06:08Z · Topic: large-language-models

A developer benchmarked Google's Gemini 3.5 Flash model on a data aggregation task involving 208 user objects and 72,286 tokens of JSON, finding the bare model delivered mathematically precise results but consumed 72,361 tokens and took 38.89 seconds. Running the same task through the Neonia MCP platform, which used a Wasm-based JQ filter tool to delegate data processing instead of loading raw data into the model's context window, reduced token consumption to 650 tokens (111x cheaper) and execution time to 15.54 seconds (2.5x faster).

Google recently released Gemini 3.5 Flash, an incredibly fast new model. As someone building infrastructure for autonomous agents, I decided to stress-test it on a real-world data aggregation task to see how it handles large context loads.

The Benchmark Task

The challenge was simple, but computationally heavy. I fed the model a massive JSON array containing 208 user objects and gave it the following prompt:

"Extract the users, find those who are over 30 years old and have green eyes, and calculate the exact mathematical average of their weight."

First, I ran the bare model, forcing it to read the entire JSON payload and do the maths internally.

I have to give credit where it is due: Gemini 3.5 Flash parsed 72,286 tokens of raw JSON without a single mistake in the fractions, which is an impressive achievement. It returned a mathematically precise answer with no hallucinations: 78.44684210526316.

But the cost of this achievement was astronomical:

Execution time: 38.89s
Total consumption: 72,361 tokens for a single request!

Next, I ran the exact same task through the Neonia MCP (Model Context Protocol) platform.

Instead of clogging the agent's context window with raw data, the Neonia infrastructure delegated the heavy lifting to a Wasm-based JQ filter tool. The model didn't have to read the data; it just orchestrated the query.
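
One way to picture this kind of delegation is a backend that keeps the payload on its side and hands the agent only a short handle. The sketch below is a toy in-memory stand-in, not Neonia's implementation: `ResourceStore`, the `resource://` handle format, and both method names are hypothetical, and the payload is simplified to a list of numbers.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory stand-in for a tool backend that keeps payloads
/// on its own side and hands the agent only a short handle.
struct ResourceStore {
    next_id: usize,
    payloads: HashMap<String, Vec<f64>>,
}

impl ResourceStore {
    fn new() -> Self {
        ResourceStore { next_id: 0, payloads: HashMap::new() }
    }

    /// Stores the payload and returns a short URI-like handle (format invented here).
    fn store(&mut self, payload: Vec<f64>) -> String {
        let uri = format!("resource://{}", self.next_id);
        self.next_id += 1;
        self.payloads.insert(uri.clone(), payload);
        uri
    }

    /// Runs the aggregation next to the data; only the small result is returned.
    fn average(&self, uri: &str) -> Option<f64> {
        let values = self.payloads.get(uri)?;
        if values.is_empty() {
            return None;
        }
        Some(values.iter().sum::<f64>() / values.len() as f64)
    }
}

fn main() {
    let mut store = ResourceStore::new();
    let uri = store.store(vec![70.0, 80.0, 90.0]);
    // The agent's context would carry only `uri` and the result, never the payload.
    println!("{uri} -> {:?}", store.average(&uri));
}
```

The point of the design is that the text the model has to read (a handle and a single number) stays small regardless of how large the stored payload is.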

The results for the exact same request:

Execution time: 15.54s (2.5x faster ⚡)
Total consumption: 650 tokens (111x cheaper)
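
The two multipliers follow directly from the reported figures. As a quick check, here is a minimal sketch that recomputes them; the `ratio` helper is just an illustrative name, and the inputs are the numbers quoted above.

```rust
/// Ratio of a baseline measurement to an optimised one.
fn ratio(baseline: f64, optimised: f64) -> f64 {
    baseline / optimised
}

fn main() {
    // Figures reported in the article for the two runs.
    let token_ratio = ratio(72_361.0, 650.0);
    let time_ratio = ratio(38.89, 15.54);

    println!("tokens: {:.1}x fewer", token_ratio); // prints "tokens: 111.3x fewer"
    println!("time: {:.2}x faster", time_ratio); // prints "time: 2.50x faster"
}
```

Note that "111x cheaper" here refers to token consumption; the actual price difference depends on how the provider bills input and output tokens.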

The MCP server usage code (Rust):

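    // Imports assumed for recent `rmcp` and `rig` releases; module paths may differ by version.
    use std::env;
    use rig::client::{CompletionClient, ProviderClient};
    use rig::completion::Prompt;
    use rig::providers::gemini;
    use rmcp::model::{ClientCapabilities, ClientInfo, Implementation};
    use rmcp::ServiceExt;

    // Request only the two Neonia tools this task needs.
    let url = "https://mcp.neonia.io/mcp?tools=neonia_web_json_fetch,neonia_data_jq_filter";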
    let neonia_key = env::var("NEONIA_API_KEY").unwrap_or_default();

    // Configure the rmcp transport
    let mut config = rmcp::transport::streamable_http_client::StreamableHttpClientTransportConfig::with_uri(url);
    config.auth_header = Some(format!("Bearer {}", neonia_key));

    let transport = rmcp::transport::StreamableHttpClientTransport::from_config(config);

    let capabilities = ClientCapabilities::default();
    let impl_info = Implementation::new("rig-agent", "0.1.0");
    let client_info = ClientInfo::new(capabilities, impl_info);

    let client = client_info.serve(transport).await?;

    // Fetch Tools
    let tools = client.list_tools(Default::default()).await?.tools;

    // Initialize Rig's Native Gemini Client
    let gemini_client = gemini::Client::from_env().expect("Failed to initialize Gemini client");

    let agent = gemini_client.agent("gemini-3.5-flash")
        .preamble(
            "You are an autonomous agent equipped with Neonia's data processing tools.\n\
            ## Usage Guidelines\n\
            1. When asked to process JSON from a URL, ALWAYS use `neonia_web_json_fetch` first. It will securely store the file and return a `resource_uri` and a TypeScript schema of the data.\n\
            2. Once you have the `resource_uri` and the schema, use `neonia_data_jq_filter` to extract exactly what you need. Pass the `resource_uri` to the `resource_uri` parameter, and formulate a mathematically precise `jq_query` based on the schema.\n\
            3. Write correct JQ queries. Example JQ for math average: `.users | map(select(.age > 30 and .eyeColor == \"Green\")) | map(.weight) | if length > 0 then add / length else 0 end`.\n\
            4. Self-Correction: If a tool returns an error, evaluate your parameters, fix them, and try again."
        )
        .default_max_turns(5)
        .rmcp_tools(tools, client.peer().to_owned())
        .build();

    let target_url = "https://dummyjson.com/users?limit=0";
    let prompt_text = format!("Fetch the JSON from {}, then find all users who are strictly older than 30 ('age' > 30) AND whose 'eyeColor' is exactly 'Green'. Calculate the EXACT mathematical average of their 'weight'. Return ONLY the final average number, without any explanations.", target_url);

    let response = agent.prompt(&prompt_text).await?;
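
The JQ expression in guideline 3 of the preamble can be mirrored in plain Rust to make the filter-then-average logic explicit. This is a minimal sketch, not what the Wasm tool actually runs: the `User` struct, the `sample_users` data, and the function name are hypothetical, and only the field meanings (`age`, `eyeColor`, `weight`) come from the article.

```rust
/// Only the fields the query touches; names follow the JQ example.
#[derive(Debug, Clone)]
struct User {
    age: u32,
    eye_color: String,
    weight: f64,
}

/// Average weight of users strictly older than 30 with green eyes.
/// Returns 0.0 when nobody matches, mirroring
/// `if length > 0 then add / length else 0 end`.
fn average_weight_green_over_30(users: &[User]) -> f64 {
    let weights: Vec<f64> = users
        .iter()
        .filter(|u| u.age > 30 && u.eye_color == "Green")
        .map(|u| u.weight)
        .collect();
    if weights.is_empty() {
        0.0
    } else {
        weights.iter().sum::<f64>() / weights.len() as f64
    }
}

/// Made-up sample data for illustration; not taken from the benchmark payload.
fn sample_users() -> Vec<User> {
    vec![
        User { age: 35, eye_color: "Green".to_string(), weight: 70.0 },
        User { age: 42, eye_color: "Green".to_string(), weight: 80.0 },
        // Excluded: 30 is not strictly greater than 30.
        User { age: 30, eye_color: "Green".to_string(), weight: 95.0 },
        // Excluded: eye colour does not match.
        User { age: 50, eye_color: "Blue".to_string(), weight: 60.0 },
    ]
}

fn main() {
    let avg = average_weight_green_over_30(&sample_users());
    println!("{:.2}", avg); // prints "75.00"
}
```

The empty-set guard matters for the same reason it does in the JQ version: dividing by a length of zero would otherwise yield NaN instead of a usable number.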

The Takeaway for AI Engineering

Frontier models are becoming phenomenally fast and smart with every new release. But the future of the industry is not about endlessly inflating the context window just because we can. Context bloat is a latency tax.

The Neonia MCP architecture shows that breaking the linear dependency between data volume and the cost of an LLM call is what makes agents scale. By using a cohesive platform as a specialised backend for your autonomous agents, you can process megabytes of JSON for a roughly constant budget of under 1,000 tokens (650 in this benchmark). This is true enterprise-grade optimisation.
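
To make the "linear dependency" concrete, here is a deliberately simplified token model built from the benchmark figures. It rests on two assumptions that the article does not verify: that in-context cost is payload tokens plus a fixed overhead (72,361 - 72,286 = 75 tokens here), and that the delegated run stays flat at 650 tokens as the payload grows. Both function names are hypothetical.

```rust
/// Tokens spent when the whole payload is placed in the prompt:
/// grows linearly with payload size.
fn in_context_tokens(payload_tokens: u64, overhead_tokens: u64) -> u64 {
    payload_tokens + overhead_tokens
}

/// Tokens spent when the agent only orchestrates a query.
/// Modelled as flat (an assumption): the payload size is ignored.
fn delegated_tokens(_payload_tokens: u64, orchestration_tokens: u64) -> u64 {
    orchestration_tokens
}

fn main() {
    // 72,361 total minus 72,286 payload tokens leaves 75 tokens of overhead.
    let overhead = 72_361 - 72_286;
    for scale in [1u64, 2, 10] {
        let payload = 72_286 * scale;
        println!(
            "payload x{scale}: in-context = {}, delegated = {}",
            in_context_tokens(payload, overhead),
            delegated_tokens(payload, 650)
        );
    }
}
```

In practice the delegated cost would likely drift with schema complexity and the number of tool turns, so the flat line is best read as an idealisation rather than a measurement.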

I’m curious to hear from other engineers building agentic workflows: What are your thoughts on smart delegation of compute versus relying entirely on native LLM context windows? Let’s discuss in the comments.


Read original on dev.to → dev.to/olex_green/how-to-make-your-ai-agent-111x…
