Agentjacking: How AI Coding Agents Get Hijacked Through Their Own Tool Pipeline

wpnews.pro

Your AI coding agent can read files, run shell commands, and call external APIs. That's also the exact description of an arbitrary code execution primitive — and attackers have figured that out.

A recent report from The Hacker News details "Agentjacking," a class of attack that hijacks AI-powered coding agents by manipulating their tool-execution pipeline. The agent isn't compromised at the model level — it's compromised through the tools it trusts. The agent reads something malicious, reasons its way into executing it, and your environment is owned before a human ever sees a diff.

This is the agentic security problem in its clearest form: the attack surface isn't the LLM, it's the autonomy.

Modern coding agents — the kind that can scaffold a project, run tests, and push a PR — operate through a tool-use loop. They receive instructions, call tools (read a file, execute a command, query an API), observe the results, and decide what to do next. That observation-action loop is exactly what makes them useful.

It's also exactly what makes them exploitable.

This class of attack targets this loop. By injecting malicious content into something the agent will observe — a file it reads, a web page it fetches, a dependency's README, a crafted tool response — the attacker can hijack the agent's next action. The agent, following its own reasoning, then executes code or commands the attacker specified. The agent isn't fooled into thinking it's doing something benign. It is doing something benign — from its perspective. The malicious payload is framed as a legitimate instruction.

The core exploit chain looks like this:

The autonomy that makes coding agents productive — their ability to take multi-step action without human approval on each step — removes the human checkpoint that would otherwise catch this.

The naive defense is sandboxing the agent's execution environment. That's necessary but not sufficient — sandboxing limits blast radius but doesn't prevent the agent from being directed to exfiltrate data, call external services, or corrupt its own outputs before a human reviews them.

Prompt injection filters applied only at the user input layer also miss this entirely. The hijack doesn't require a malicious user prompt. The injection arrives in a tool result — content the agent reads from its environment. Most application-level defenses have no visibility into what tool results contain. They're watching the front door while the attacker walks in through the window.

Standard LLM guardrails (system prompt instructions like "don't execute untrusted code") are also insufficient because the agent has already been manipulated into trusting the malicious content by the time it acts on it. You can't instruct your way out of prompt injection.

Sentinel is specifically built for this problem. The transparent agentic proxy sits between your agent and Anthropic (or whichever model you're using), and it scans tool results before they return to the agent. That's the exact interception point Agentjacking exploits.

Every tool result runs through Sentinel's three-layer detection pipeline:

Layer 1 — Normalization: Before any pattern matching, Sentinel strips invisible characters, Unicode tag blocks (U+E0000), bidi override characters, and resolves homoglyphs. These techniques are commonly used to hide injected instructions inside what appears to be normal text.

Layer 2 — Fast-path regex: Sentinel runs our library of high-confidence patterns against the normalized content. Tool/function abuse patterns are in this set — phrases designed to redirect an agent's next action are caught here with near-zero latency, before the content reaches any vector model.

Layer 3 — Vector similarity: If fast-path doesn't produce a definitive verdict, Sentinel computes a semantic embedding and compares it against our library of attack signature embeddings using cosine similarity. In strict

mode, the flag threshold drops to 0.25 — meaning semantically adjacent injection attempts that don't match the exact regex patterns still surface.

If a tool result scores above the block threshold (> 0.82 cosine similarity), Sentinel substitutes the blocked content with an inert placeholder. The Anthropic SDK receives a normal-format response. The agent never sees the payload.

Layer 4 — Secret detection (Teams & Enterprise): Even if a tool result's threat score doesn't trigger a block, Layer 4 runs independently and redacts any API keys, tokens, or credentials that appear in the content. If the injected payload was trying to read and exfiltrate a .env

file, the secrets get redacted before the agent can relay them anywhere.

Here's how you'd wire a Claude Code–style agent through Sentinel's transparent proxy (illustrative setup — swap in your actual model and key):

import anthropic

client = anthropic.Anthropic(
    api_key="sk_live_...",   # Your Sentinel API key
    base_url="https://sentinel.ircnet.us/v1",
)

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=4096,
    messages=[{"role": "user", "content": "Refactor the auth module and run the tests"}],
)

And here's what Sentinel returns when it catches an injected tool result (illustrative response shape based on Sentinel's API):

{
  "request_id": "f7e2d1...",
  "security": {
    "action_taken": "blocked",
    "threat_score": 0.91,
    "secret_hits": 0,
    "secret_types": []
  },
  "safe_payload": null
}

"action_taken": "blocked"

with "safe_payload": null

means the proxy substituted the malicious tool result with an inert placeholder before the agent saw it. threat_score: 0.91

put this well above the 0.82 block threshold. The agent's loop continues — it just doesn't get handed a loaded gun.

For teams using Open Claw agents on Clawhub, the sentinel-proxy

skill ships a PostToolUse

hook that wires this up automatically:

openclaw skills install sentinel-proxy

The hook covers the PostToolUse

interception point — which is exactly the vector Agentjacking exploits.

Stop trusting tool results. Your agent does, by default — and that's the vulnerability.

If you're running any coding agent that has access to a shell, a filesystem, or external network resources, route its tool results through a content scanner before they return to the model. That doesn't have to be Sentinel, but it has to be something at that specific interception point. Filters on user input don't cover it. Sandboxing doesn't cover it. The injection arrives in the data the agent reads, not in what the user typed.

For a free-tier start with no credit card required, Sentinel's Starter plan covers 100 requests/month and lets you validate the integration before you commit:

👉 sentinel-proxy.skyblue-soft.com

The attack surface for coding agents is the tool loop. That's where the defense has to be.

source & further reading

dev.to — original article Building a Legal Document Analyzer in typescript with NodeJS RAG - Query Transformation and Expansion Building a WhatsApp AI Agent with Gemini Using Gemini as Your Copilot

Agentjacking: How AI Coding Agents Get Hijacked Through Their Own Tool Pipeline

Run your AI side-project on zahid.host