ToolSense: A Diagnostic Framework for Auditing Parametric Tool Knowledge in LLMs

wpnews.pro

[ARTICLE · art-24792] src=arxiv.org pub=2026-06-12T04:00Z topic=large-language-models verified=true sentiment=· neutral

Researchers have released ToolSense, an open-source diagnostic framework that takes a tool catalog as input and automatically generates three benchmarks for auditing whether large language models actually understand the tools encoded in their parameters. Applied to ToolBench's roughly 47,000 tools, it revealed a knowledge-retrieval dissociation: several parametric model configurations dropped by about 50-64 percentage points on realistic queries relative to standard ToolBench benchmarks, and some scored near random on factual probes despite strong retrieval performance. The findings point to a limitation of current tool-retrieval evaluation: verbose, fully specified queries combined with constrained decoding can produce high scores without showing that the model understands its tools.

1 min read · published Jun 12, 2026

arXiv:2606.12451v1. Abstract: Large language models deployed as agents over large tool catalogs face a critical tool-retrieval bottleneck. Embedding-based retrieval relies on compact encoders that may under-capture specialized tool semantics. Parametric tool retrieval addresses this by encoding each tool as a virtual token appended to the LLM vocabulary and fine-tuning in two stages (memorization, then retrieval SFT) so that the LLM itself acts as the retriever, achieving strong performance on standard ToolBench retrieval benchmarks. Yet these benchmarks use verbose, fully-specified queries, and their evaluation applies constrained decoding that restricts outputs to valid token paths; neither reveals whether the model actually understands its tools. We introduce ToolSense, an open-source LLM-powered diagnostic framework that takes any tool catalog as input and automatically generates three benchmarks: a Realistic Retrieval Benchmark (RRB) with queries at three ambiguity tiers, an MCQ probing benchmark, and a QA probing benchmark. Applying ToolSense to ToolBench (~47k tools) and evaluating five parametric model training configurations reveals a knowledge-retrieval dissociation: on RRB queries, several configurations collapse by ~50-64 percentage points compared to fully-specified ToolBench benchmarks, falling below the embedding-model baseline. Additionally, despite strong retrieval performance, some models score near-random on factual probes, further evidence of this dissociation. We open-source the ToolSense framework and the ToolBench diagnostic benchmarks at https://github.com/SAP/toolsense.
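The abstract gives no implementation details, so the following is only a toy sketch of why constrained decoding can hide gaps in tool knowledge. It assumes a single decoding step, a hand-written score table, and made-up virtual tool tokens; none of these names or numbers come from ToolSense or ToolBench. Restricting the choice to valid tool tokens always yields some tool, even when the model's preferred output is not a tool at all.

```python
from typing import Dict, Optional, Set


def pick_token(scores: Dict[str, float], allowed: Optional[Set[str]] = None) -> str:
    """Return the highest-scoring token, optionally restricted to `allowed`."""
    if allowed is None:
        candidates = scores
    else:
        candidates = {tok: s for tok, s in scores.items() if tok in allowed}
    if not candidates:
        raise ValueError("no candidate tokens to choose from")
    return max(candidates, key=lambda tok: candidates[tok])


# Hypothetical next-token scores for one query (illustrative values only).
scores = {
    "the": 2.1,
    "weather": 1.7,
    "<tool_weather_api>": 0.4,   # hypothetical virtual tool token
    "<tool_stock_quotes>": 0.3,  # hypothetical virtual tool token
}
tool_tokens = {"<tool_weather_api>", "<tool_stock_quotes>"}

unconstrained = pick_token(scores)            # free choice over the whole vocabulary
constrained = pick_token(scores, tool_tokens)  # choice limited to valid tool tokens

print(unconstrained)
print(constrained)
```

In this toy setting the constrained pick is a valid tool even though the tool tokens carry little of the model's preference, which is the kind of effect that probing without constraints is meant to expose.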

source & further reading

arxiv.org — original article

Read original on arxiv.org → arxiv.org/abs/2606.12451
