{"slug": "evaluating-using-mock-tool-calls-to-quarantine-untrusted-prompt-inputs", "title": "Evaluating using Mock Tool Calls to Quarantine Untrusted Prompt Inputs", "summary": "A small study testing whether wrapping untrusted prompt inputs in mock tool calls could improve language model robustness found the technique did not broadly help across three LLM-as-a-Judge tasks and in some cases made models more susceptible to adversarial attacks. Researchers tested seven models on binary, scalar, and pairwise evaluation tasks, observing that tool-wrapping typically increased attack success rates on GSM8K grading and showed no reliable improvement on other tasks. The findings challenge the assumption that leveraging OpenAI's instruction hierarchy through tool roles would provide a simple, principled defense against untrusted inputs.", "body_md": "*This is a small study that explores using\ntool calls to wrap untrusted parts of prompts.\nOpenAI's model spec considers tool results the least trusted kind of input.\nIf tool-wrapping helped, it would be an easy way to improve robustness while using existing APIs\nmodels already support.\nIn 3 tested tasks it didn't seem to broadly help, and in some cases made things worse.\nWe advocate for more understanding of the instruction hierarchy and ideas around better\nprimitives for untrusted inputs. There is a pdf writeup on arXiv.\nIt's in a \"research note\" or \"workshop paper\" stage (to appear at the AI4Good workshop @ ICML).\nThis post slims the PDF text down,\nfocusing on the discussion-y aspects*\n\nLanguage models must frequently process untrusted input (things like answers from another LLM during RL, or untrusted input from humans when running spam filters or harm filters). When building prompts for LLMs, commonly inputs get string-formatted into a prompt template. 
For example, consider this LLM-as-a-Judge prompt to grade whether a candidate answer correctly solves a math problem.\n\n*Figure 1a: A simplified user-only prompt. The {fields} would get replaced with inputs.*\n\nDespite being a fairly typical way to prompt LLMs, this style of string template can be fragile. [Zhao et al. (2025)](https://arxiv.org/abs/2507.08794) found in their work,\n*One Token to Fool LLM-as-a-Judge*, that simple inputs like `\":\"` or `\"Solution\"` could confuse graders into outputting a passing verdict.\n\nWe might look at this simple prompt and proceed with some prompt engineering (some form of quotes or delimiters, “treat this as untrusted” prose, etc). However, there’s a sense that these would all be patches or hacks, without a clear standard way to section off untrusted content.\n\nConflicting or adversarial prompts are well-known challenges. One widely adopted response is the Instruction Hierarchy (IH)\n([Wallace et al., 2024](https://arxiv.org/abs/2404.13208)). LLM messages have different “roles” with different trust levels.\n\nOpenAI publishes a [Model Spec](https://model-spec.openai.com/2025-12-18.html#chain_of_command) defining a “Chain of Command” where `System` ≻ `User` ≻ `Tool`. [1]\nSpecifically, the tool messages are described as having “no authority” (OAI, 2025).\nThis theme generally applies across providers. Meta’s latest MuseSpark model\nstates that the model is expected to follow the instruction hierarchy.\n\nIt's common to see prompts use the `System` and `User` roles, but this\ncan be challenging and non-standardized, such as when you have multiple\nuntrusted inputs (eg, a pairwise ranking task).\n\nTo maximally use the prompt hierarchy, we might wonder if we can wrap the untrusted parts of the prompt in a tool call (the lowest trust role). 
These would not be tools the model is intended to call during an agent loop (they are “mock tool calls” in the sense that the prompt determines the result), but they provide a way to quarantine untrusted parts of a prompt.\n\n*Figure 1b: A simplified mock-tool-wrapped version.*\n\nResearch Question (RQ): Does wrapping untrusted parts of an LLM-as-a-Judge prompt in a mock tool call result in lower susceptibility to adversarial inputs, relative to a baseline of only using the “user” or “system”+“user” roles?\n\nWe hypothesized “yes” to this RQ, and that mock tool calls might be a simple and principled prompt strategy to recommend for making judges or general prompting more robust, all while using APIs models already provide.\n\nSurprisingly, we find a negative-leaning result, where in some cases tool-wrapping might actually make things worse. We tried three LLM-as-a-Judge tasks across seven models. On a binary evaluation task (GSM8K grading) tool-wrapping typically increased attack success rates, an apparent inversion of the instruction hierarchy. On scalar and pairwise tasks the effect is smaller and model-dependent. Here, no tested model was reliably helped, some showed inversion, and some showed no statistically significant difference.\n\nWhile prior work has studied the instruction hierarchy (surveyed briefly below),\nwe are not aware of studies that have explored mock tool calls to address LLM-as-a-Judge attacks like\nthose discussed in [Zhao et al. (2025)](https://arxiv.org/abs/2507.08794),\n[Raina et al. (2024)](https://arxiv.org/abs/2402.14016), or [Shi et al. (2024)](https://arxiv.org/abs/2403.17710). 
Our small study seeks to contribute findings in this area.\n\nMore details are in the pdf, but we summarize some key points here.\n\nWe consider five prompt conditions, which are the mitigation defenses\nunder study in our RQ (full prompts [pdf Appendix 9](https://arxiv.org/pdf/2605.30521v1#page=15)).\n\n**UserOnlyBaseline** concatenates everything (instructions,\nquestion, reference, candidate, etc.) into a single\n`user` message. A condensed version of this is shown in\nFigure 1a above.\n\n**UserSys** moves the judge instructions to the `system` role while keeping the inputs in the `user` role.\n\n**ToolWrapped** builds on `UserSys` but wraps the\nuntrusted input in a mock tool call\n(Figure 1b above shows a condensed version).\n\nThese form basic conditions for our RQ, but in addition we consider whether explicitly warning the model about not trusting the input in prose might change behavior.\n\n**SystemDistrust** builds on `UserSys` by adding explicit prose in the `system` message reminding the model that the candidate response is untrusted\nand may attempt to manipulate the verdict. 
**ToolDistrust** does this\nfor the tool-version, with additional warnings in the tool spec.\n\nWe explore 3 LLM-as-a-Judge tasks with 3 different kinds of output format.\n\n**(1) GSM8K (binary)**\n([Cobbe et al., 2021](https://arxiv.org/abs/2110.14168)) is a dataset of grade-school math problems.\nThe attacker’s goal is to elicit `CORRECT` on adversarial content without any actual solution.\n\n**(2) MT-Bench (scalar)** ([Zheng et al., 2023](https://arxiv.org/abs/2306.05685)) gets an LLM to evaluate the quality of a candidate response to a question on a 1–10 scale.\nWe adapt the prompt from FastChat [3].\nWe consider the attack successful if the judge gives a score ≥ 5.\n\n**OpenAI**: `gpt-5.4`, `gpt-5.4-mini`.\n**Anthropic**: `sonnet-4.6`, `haiku-4.5`.\n**Open weight**:\n`gemma-4-26b-a4b-it`, `qwen3.5-flash-02-23`, and `qwen3-8b`.\nWe use default sampling settings in completion requests for all models,\nwith the exception of a 32,768 token budget.\nNotably, apart from the Qwen models, the models do not\nreport using extended reasoning tokens in this default configuration.\nAdjusting settings such as reasoning effort could change results (see limitations).\n\nThe core measure is Attack Success Rate (ASR), which is the fraction of questions on which the adversarial input yields an outcome favorable to the attacker.\n\nWe use an automated attacking pipeline [4], which prompts an attacker LLM to iterate on an\nattack string over 7 turns.\nThis helps give similar amounts of optimization pressure against each prompt condition.\nThis is imperfect (e.g., it is unclear what\nbiases the automated attackers have), but gives a directional\nmeasure of which prompt conditions might be most robust.\n\nWe run 3 automated attacker models with 6 seeds each. The attacks are \"static\", meaning the same attack string is used for all questions. 
See PDF Sec 2.5 for more details.\n\nThese various prompt conditions are designed to improve robustness under an adversarial input, and ideally should not change scoring or labeling of normal, non-adversarial inputs. We run some non-adversarial controls that mostly suggest this is true. The most notable exception is Haiku-4.5 on MT-Bench, which has output parse error variation. Sonnet 4.6 does not have this issue. See paper for more.\n\nEach condition gets attacked 18 times (3 attackers, 6 seeds). Table 1 below shows mean ASR per condition on a separate transfer set of inputs, with tool-vs-inline deltas and bootstrapped CIs.\n\n**Tool-wrapping doesn't broadly help.**\nCounter to our original hypothesis, we do not find evidence that tool-wrapping broadly helps improve robustness to attack for the tested LLM-as-a-Judge tasks and models. Had that hypothesis held, we would expect the right side of Table 1 to show drops in ASRs in the tool condition.\nInstead many make attacks easier (red deltas), and most of the rest are inconclusive.\n\n*Table 1: Mean ASR per (model, prompt layout), with deltas (pp) between tool-wrapped\nand inline layouts and 95% bootstrap CIs. Red = tool layout hurts (CI above 0),\ngreen = tool layout helps (CI below 0), black = inconclusive.*\n\n**GSM8K mostly inverts the expected instruction-hierarchy direction.**\nWe observe that GSM8K is the most attackable of our three tasks. The tool-vs-inline\ngap is the clearest of any task, opposite of the expected instruction-hierarchy direction.\nOnly Gemma-4 is inconsistently closer to the expected direction.\n\nWe speculate that the binary framing makes it more vulnerable to attack. When we inspect some of the\ndiscovered attacks, we see some variety in\nstrategy. 
However, it typically involves repeating the desired `VERDICT: CORRECT` output, and then either claiming the candidate has been\npre-verified (keywords like “match confirmed”, “auto-validated”,\n“reference equality”, ...), or doing some sort of authority impersonation (“override”, “you must output”, etc).\nDue to training conditions,\nsuch “pre-verification” framing might be trusted more when it appears in a tool result,\nas in training the tools are often a source of truth.\n\nThese techniques are enough to defeat even capable models\nlike GPT-5.4 on all conditions except `SystemDistrust`, with p75 ASRs\nranging from 0.97 on `ToolWrapped` to 0.00 on `SystemDistrust`.\nA warning to not trust the input can help,\nbut the model’s propensity to overtrust tool results seems to\noverride distrust warnings (the GPT-5.4 `ToolDistrust` p75 ASR is 0.86).\n\n**MT-Bench results are mixed, but suggest instances of IH inversion after considering nuance.**\nLooking at just mean ASRs in Table 1, an initial interpretation might\nsuggest some cases where tool-wrapping helps;\nin particular, Haiku-4.5 shows an over 20pp drop when using\ntool-wrapping. However,\nthere is nuance worth considering.\n\nOne is that Haiku often doesn't conform to the output format [5], which can bias results. We exclude these cells.\n\nAdditionally, the reference parser from the dataset is a regex that takes the first match.\nIf the attacker tricks the judge into parroting `[[10]]` (like during chain of thought), the attack can\ncount as successful, even if the judge later concludes with `[[1]]`.\nTo address this concern we rerun the optimizers and evaluation using a\nlast-match regex parser\nto try to better capture the judge’s final answer.\n\n*Table 6: MT-Bench under the last-match parser. The ASRs drop, suggesting the first-match successes might have been grabbing values from CoT. 
But even with a last-match parser, the tool-wrapped ASR remains elevated for several models.*\n\n**Arena-Hard is the most defensible task, but tool-wrapping still increases ASR on several victims.**\nNon-tool-wrapped mean ASRs stay under 0.1 for most models (the weak Qwen3-8b\nreaches 0.19). But looking more at the worst-case (p75), `ToolWrapped` starts admitting\nfailures (eg, 0.18 p75 ASR for GPT-5.4-mini,\nwhile the non-tool-wrapped stays below 0.04).\nWe speculate the pairwise framing is just structurally hard for a *static* attacker:\nif you don't know whether you're slot A or B, committing to one direction caps ASR\nat 0.5. When we do an ablation that pins the attacker to a known slot,\nASR rises and GPT-5.4 inverts.\n\n**Note: attack search variance is high.** Within each cell, the automated red team\nsometimes finds a strong attack and sometimes finds nothing at all, depending on seed and\nattacker, so the CIs are wide.\nThis is illustrated by the spread\nof points in Figure 2. Our focus is on\ncapturing how easy or hard it is to attack\na given condition, so even if there is seed variance, we accept the means as directionally informative.\n\n*Figure 2: Swarm plot for GPT-5.4-mini. Each marker represents one (attacker, seed) 7-turn attack search. Other models shown in Appendix Figure 4.*\n\n**Distrust prose effects are task-model dependent.**\nThe extra distrust prose was a simple intervention to try to reduce\nthe attackability of both the baselines and the tool-wrapped conditions.\nIn some cases it improves robustness, as seen by lower ASRs in some rows’ distrust\nversions, but it is not consistent (Appendix Table [10](https://arxiv.org/pdf/2605.30521v1#page=16)).\n\n**Limited Task Coverage** We explore only 3 tasks. More comprehensive work could explore\nmore tasks. 
It would be good to understand if the GSM8K results (which had the highest ASR and largest deltas)\napply to other binary tasks.\n\n**Non-optimal Attack Discovery**\nOur attack search finds attacks under a roughly fixed compute budget.\nThese attacks are reasonable, but likely\nfar from optimal, and the process induces noise that can obscure true trends.\nA more skilled attacker would likely achieve higher ASR\non every condition, possibly revealing different trends or possibly\nno trend at all (if all attacks saturate the metrics).\n\n**Static Only Attacks** Our attacks are static, in that\nthe attacker must find one string that works for every question, similar to [Zhao et al. (2025)](https://arxiv.org/abs/2507.08794).\nA dynamic case (eg, an adversary finds a suffix to a relatively\nweak answer (like during RL) that causes it to appear much better than its true quality) might\nhave different trends. Static attacks are likely\nthe easiest scenario for the blue team.\n\n**Default Inference Settings**\nWe evaluate models under their default completion API settings.\nChanging inference settings, in particular reasoning effort, could change ASR\nand trends.\nAs a small diagnostic ([Appendix 6](https://arxiv.org/pdf/2605.30521v1#page=11)), we\nreplayed the GSM8K GPT-5.4-mini branches through the OpenAI Responses API with\n“medium” reasoning. This reduces ASR and removes signs of an IH inversion,\nthough attacks do not disappear (`ToolWrapped` ASR of 0.27).\nA more complete investigation, with full attack optimization matched to each\ninference setting rather than attack reuse, is future work. That added reasoning helps could be encouraging,\nbut ideally we want to avoid IH inversion on the default (and likely common) settings.\n\n**Large Prompt-engineering Space**\nIt’s well known that LLMs can be sensitive to slight\nvariations in the prompt ([Sclar et al., 2024](https://openreview.net/forum?id=RIu5lyNXjT)). 
Variations\nmight give different trends.\n\nOne interesting direction to possibly help tool-wrapping is to more directly match the tools and trajectory used in production agent harnesses. Our tool calls are mock “read_candidate_response”-style tools, but perhaps emulating a full trajectory of Claude Code reading candidates through its set of tools like file reading or MCP might help. This seems valuable to understand as we increasingly move past the “LLM prompting era” and into an “agent harness for everything” era. While it would be interesting if such settings helped, ideally the models would show instruction-hierarchy generalization and robustness where this careful domain matching is not necessary.\n\n**LLM-as-a-Judge robustness.**\nPrior work shows that judge prompts are surprisingly easy to attack.\n[Raina et al. (2024)](https://arxiv.org/abs/2402.14016) demonstrate\nuniversal adversarial suffixes that inflate scalar judge ratings\nacross questions. [Shi et al. (2024)](https://arxiv.org/abs/2403.17710) formulate\noptimization-based prompt injection against pairwise judges, with\nhigh ASRs against open-weight models. [Zhao et al. (2025)](https://arxiv.org/abs/2507.08794)\nshow an extreme version where single-token strings such as\n`\":\"` or `\"Solution\"` can fool reasoning judges into\nemitting passing verdicts on empty answers.\n[Li et al. (2025)](https://arxiv.org/abs/2506.09443) survey and evaluate many attacks and\ndefenses across judge prompts and tasks.\n\n**Instruction hierarchy: training and evaluation.**\n[Wallace et al. (2024)](https://arxiv.org/abs/2404.13208) introduced the instruction hierarchy as a\ntraining objective, and several follow-up papers ask whether models\nactually follow it. IHEval ([Zhang et al., 2025](https://arxiv.org/abs/2502.08745)) measures conflict\nresolution across roles on synthetic tasks, and\nIH-Challenge ([Guo et al., 2026](https://arxiv.org/abs/2603.10521)) provides a larger frontier-targeted\ntraining set. 
[Zhang et al. (2026)](https://arxiv.org/abs/2604.09443) extend the framing to multi-tier\nscenarios in agentic settings.\nThese works share our framing of tool messages as least-trusted,\nbut are a bit more focused on instruction conflicts.\nOur results suggest traditional IH training might not always generalize\nto this judge input wrapping setting.\n\n**Role-boundary and tool-calling weaknesses.**\nA separate line of work attacks the chat-template machinery itself.\n[Chang et al. (2026)](https://arxiv.org/abs/2509.22830) and [Jiang et al. (2025)](https://arxiv.org/abs/2406.12935) show that\ninjecting role-marker tokens into user inputs can hijack the\nconversation structure, and [Zhou et al. (2024)](https://aclanthology.org/2024.findings-emnlp.692/) extend this\nwith special-token injection for jailbreaks. Closer to our concern,\n[Wu et al. (2024)](https://arxiv.org/abs/2407.17915) document that exposing function-calling APIs,\neven benign ones, opens new jailbreak surface, with tool-call traces\nbecoming a vector for unsafe outputs.\nComplementing these attack-side results, [Ye et al. (2026)](https://arxiv.org/abs/2603.12277)\nprobe how models internally represent “who is speaking” and find\nthat role perception is driven by the style of the text rather than\nby the role tag enclosing it. In their experiments, user-style\ncontent wrapped in tool tags retains 76–88% “Userness” across\nfour frontier-class models, with “Toolness” staying under 20%.\nThese results help predict the asymmetry we observe. 
The tool channel\nis not consistently treated as less-trusted in practice, and in some\nregimes may be treated as more authoritative.\n\n**Quarantining-style defenses.**\nSeveral proposals share the conceptual move of structurally\nseparating untrusted data from trusted instructions, in response\nto direct and indirect prompt-injection\nthreats ([Toyer et al., 2024](https://arxiv.org/abs/2311.01011); [Greshake et al., 2023](https://arxiv.org/abs/2302.12173)).\nSpotlighting ([Hines et al., 2024](https://arxiv.org/abs/2403.14720)) marks untrusted text via\nformatting transformations such as datamarking, encoding, and\ndelimiters, without requiring fine-tuning.\nStruQ ([Chen et al., 2025](https://arxiv.org/abs/2402.06363)) and\nSecAlign ([Chen et al., 2025](https://arxiv.org/abs/2410.05451)) instead fine-tune models to\nrespect structured data-vs-instruction boundaries.\nCaMeL ([Debenedetti et al., 2025](https://arxiv.org/abs/2503.18813)) enforces a stricter,\ndataflow-level separation by extracting the trusted control flow\nfrom untrusted data flow before tool calls execute.\nThe closest prior experiment to ours is a brief instruction-hierarchy\nablation in SecAlign ([Chen et al., 2025](https://arxiv.org/abs/2410.05451), §4.2), which puts\nthe data part inside the output of a “dummy tool function” and\nthe intended instruction in the user role. That ablation evaluates\none model (GPT-4o-mini) under a fairly simple optimization-free\nattack\n[6]\n. 
It reports 1% ASR and\ndoes not cover judge tasks or optimization-based attacks.\nMock-tool wrapping shares Spotlighting’s status as a deployment-time\nprompt rearrangement, but exploits the role primitives already\nexposed by major LLM APIs rather than introducing custom in-line\ndelimiters or encodings.\n\nAdversarial inputs are a problem, but one should\nconsider the “bitter lesson” ([Sutton, 2019](http://www.incompleteideas.net/IncIdeas/BitterLesson.html); [Halevy et al., 2009](https://doi.org/10.1109/MIS.2009.36))\nof whether just more data and training\nwill solve this without other effort.\nThis is possible. Some directional evidence here is that, under the simple `UserOnlyBaseline`, GPT-5.4 improves over GPT-5.4-mini and weak models like Qwen3-8b. This\ntrend, using either training or inference scaling [2:1], will likely continue and might resolve all issues.\n\nHowever, without established ways to denote untrusted parts of a prompt there are still potential problems where even a near-oracle language comprehender has room for confusion. This makes training and evaluation more difficult. 
As we want to push from “90% reliable”, to “99% reliable”, to “99.9% reliable”, and beyond, it might be beneficial to rely not only on improved language comprehension.\n\nAdditionally, currently some of OpenAI’s and others’ approaches to alignment via\nincreasing training compute rely on Spec-based training ([Wolfe, 2026](https://openai.com/index/our-approach-to-the-model-spec/); [Guan et al., 2024](https://arxiv.org/abs/2412.16339)).\nIf the instruction hierarchy is an incomplete concept in the spec,\nor we are failing to match the spec and observing IH inversion,\nit is indicative of a larger problem where we want to make\nsure increasing compute is behaving as expected.\n\nWe observed that simple interventions like `SystemDistrust`\n\ncan be effective for some models and tasks.\nThere’s a wide space of techniques we did not explore, and\nit seems possible that enough prompt engineering\ncould mitigate most attacks for a given model and task.\n\nHowever, there is a sense that this level of prompt engineering shouldn’t be needed for every AI engineer to tune for their specific task. 
Working towards a standardized approach when one has untrusted parts of prompts seems worthwhile.\n\nThe concept of the\ninstruction hierarchy is fairly overloaded.\nAs mentioned above, we might speculate that part\nof the reason for some of the IH inversion we see is that,\nin actual training traces, tools are usually a source of truth\n(even though models aren’t *supposed*\nto trust tools, typically the top 3 web search results or the results\nof a Python script are actually more authoritative about what’s true in the world\nthan the user themselves or the model’s pretraining knowledge).\nThus, better ways of indicating\nlevels and kinds of trust might be beneficial\n(or better post-training on natural language indicators of trust, and\nmore diverse data where the tool is adversarial).\n\nThe instruction hierarchy and natural language prompt engineering are the main ways currently available for sectioning off untrusted content. Our work adds to evidence that more effort might be needed here.\n\nOne idea might be to better consider the difference between\n“executable” and “non-executable” tool responses.\nIn some cases we expect instructions in tool results to be “executable” (eg, when a user\nrequests their coding agent to “complete the todos in the file `proposal.md`”), while in others, like\nin our tool-wrapping experiment, the results are expected to be “non-executable”.\n\nA potentially interesting direction is to support explicit parameter expansion primitives.\nPython 3.14 recently introduced [t-strings](https://peps.python.org/pep-0750/),\ndesigned for cases like sanitization when avoiding SQL or HTML injection.\nIf we had standardized model support for quarantining untrusted content,\nwe could imagine plugging into such PL features for\neasy-to-use best practices around prompt injection.\nWhile simple tool-wrapping does not currently appear to be a technique one can reliably hook into here,\nwith more training and design work, 
automatic tool-wrapping or other special\ntokens or systems (like [Chen et al. (2025)](https://arxiv.org/abs/2402.06363) or [Zhang et al. (2026)](https://arxiv.org/abs/2604.09443)) seem possible.\n\n*Figure 3: Sketch of t-string-based prompt construction. Finding a usable standard like in the HTML injection analogy might be useful.*\n\nUsing mock tool calls to wrap untrusted parts of a prompt seems like it would be a useful direction. It uses a primitive that most model providers have converged on, and tries to take advantage of an important part of the OpenAI Model Spec. However, at least in this small study with a few tasks, we did not see evidence that this gives clear benefits with current models, and in fact it could make things worse. It seems likely that “conversation role” is overloaded, in that it communicates both the source of the information and the trust level of the information (when in reality systems often operate with fairly high-trust tools). This is not to say tool-wrapping can’t work, either with adjusted instruction hierarchy training, or with additional exploration (like different models, prompts, tasks, attacks, etc). As more critical systems face adversarial inputs or operate through agent tools, the need for clarity on the instruction hierarchy will increase.\n\n*This work is at a \"workshop paper\" / \"research note\" stage, so I am open to feedback.\nI hope to encourage closer looks at the instruction hierarchy as a concept, and consideration of cases of inversion.\nI'm also looking for discussion on what primitives LLM APIs are missing for improving safety and robustness, and how we might get practical agreement or standardization there. Comment here or email (david [at] far.ai) if you have thoughts.*\n\nOpenAI also supports a `Developer` role, which sits in between `System` and `User`, but this role appears less widely adopted. 
[↩︎](https://www.lesswrong.com/feed.xml#fnref-jBQpr3E5jgHyuYLCm-1)\n\ninference scaling might be broad agentic functionality which in some cases can help validate parts of prompts that seem low trust. [↩︎](https://www.lesswrong.com/feed.xml#fnref-jBQpr3E5jgHyuYLCm-2) [↩︎](https://www.lesswrong.com/feed.xml#fnref-jBQpr3E5jgHyuYLCm-2:1)\n\nFastChat is a 39k+ star [GitHub repo](https://github.com/lm-sys/fastchat) from [Zheng et al. (2023)](https://arxiv.org/abs/2306.05685). [↩︎](https://www.lesswrong.com/feed.xml#fnref-jBQpr3E5jgHyuYLCm-3)\n\nLoosely based on PAIR from [Chao et al., 2023](https://arxiv.org/abs/2310.08419). [↩︎](https://www.lesswrong.com/feed.xml#fnref-jBQpr3E5jgHyuYLCm-4)\n\nHaiku often writes `[[rating: N]]` instead of `[[N]]`. It has different parse rates depending on prompt condition (32%-72%).\nWhen responses don’t parse it can pull down the ASR. [↩︎](https://www.lesswrong.com/feed.xml#fnref-jBQpr3E5jgHyuYLCm-5)\n\nAn “Ignore” attack prepends adversarial strings like\n“Ignore previous instructions …”. 
[↩︎](https://www.lesswrong.com/feed.xml#fnref-jBQpr3E5jgHyuYLCm-6)", "url": "https://wpnews.pro/news/evaluating-using-mock-tool-calls-to-quarantine-untrusted-prompt-inputs", "canonical_source": "https://www.lesswrong.com/posts/GPMZaRmxxNb8o2aGo/evaluating-using-mock-tool-calls-to-quarantine-untrusted-2", "published_at": "2026-06-05 22:43:28+00:00", "updated_at": "2026-06-05 22:52:43.132512+00:00", "lang": "en", "topics": ["large-language-models", "ai-safety", "ai-research", "natural-language-processing", "ai-ethics"], "entities": ["OpenAI", "Zhao et al.", "ICML", "arXiv"], "alternates": {"html": "https://wpnews.pro/news/evaluating-using-mock-tool-calls-to-quarantine-untrusted-prompt-inputs", "markdown": "https://wpnews.pro/news/evaluating-using-mock-tool-calls-to-quarantine-untrusted-prompt-inputs.md", "text": "https://wpnews.pro/news/evaluating-using-mock-tool-calls-to-quarantine-untrusted-prompt-inputs.txt", "jsonld": "https://wpnews.pro/news/evaluating-using-mock-tool-calls-to-quarantine-untrusted-prompt-inputs.jsonld"}}