BoxAgnts Tool System (6) — Multi-Provider Adaptation and the Agent Query Loop

wpnews.pro

BoxAgnts' tool system, from the bottom-level WASM sandbox to the top-level Tool trait, has solved "how tools run safely." But tools ultimately need to be called by AI models — which introduces two engineering problems: the complete incompatibility of API formats across AI vendors, and the interleaved orchestration of conversation flow and tool execution. These two problems are solved by the Provider abstraction layer and the Agent query loop, respectively.

Different types of AI model APIs differ significantly in request format, response format, and error handling.

Let's start with the request side. Anthropic splits roles into user and assistant, with the system prompt as an independent top-level system field; OpenAI treats the system prompt as a role: "system" message; Google Gemini places system_instruction at the top level of the request body but with yet another format. If the upper-layer Agent loop had to handle these differences directly, the code would become a giant match provider_id { ... } branch.

BoxAgnts' solution introduces three layers of abstraction:

Layer 1: ProviderRequest / ProviderResponse Unified Data Model

// provider_types.rs
pub struct ProviderRequest {
    pub messages: Vec<ApiMessage>,
    pub system: Option<String>,
    pub tools: Vec<ApiToolDefinition>,
    pub max_tokens: u32,
    pub temperature: Option<f32>,
}

pub struct ProviderResponse {
    pub content: Vec<ContentBlock>,
    pub usage: UsageInfo,
    pub stop_reason: String,
}

The Agent loop only deals with these two structures, never needing to know whether the user has configured Anthropic or OpenAI.

Layer 2: LlmProvider trait

pub trait LlmProvider: Send + Sync {
    fn id(&self) -> &ProviderId;
    async fn create_message_stream(
        &self, request: ProviderRequest
    ) -> Result<Pin<Box<dyn Stream<Item = Result<StreamEvent, ProviderError>> + Send>>>;
    async fn list_models(&self) -> Result<Vec<ModelInfo>>;
}

create_message_stream returns a Pin<Box<dyn Stream>>, the standard idiom in Rust's async ecosystem for unifying multiple stream types (analogous to Java's Stream<T> or Python's AsyncIterator). Each Provider implementation internally handles its own HTTP request construction, authentication, and SSE parsing, exposing a unified StreamEvent externally.

Layer 3: Transformer (Message Format Conversion)

Transformers handle the "last mile" of eliminating vendor format differences:

// transformers/anthropic.rs
pub fn to_anthropic_request(req: &ProviderRequest) -> AnthropicMessagesRequest { ... }

// transformers/openai_chat.rs
pub fn to_openai_request(req: &ProviderRequest) -> OpenAIChatRequest { ... }
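To make the conversion concrete, here is a minimal sketch of the one difference described earlier: where the system prompt ends up. The struct fields below are simplified stand-ins, not BoxAgnts' actual definitions; only the two function names come from the snippet above.

```rust
// Simplified stand-ins for the article's types; only the fields needed here.
#[derive(Debug, Clone, PartialEq)]
pub struct ApiMessage {
    pub role: String,
    pub content: String,
}

pub struct ProviderRequest {
    pub messages: Vec<ApiMessage>,
    pub system: Option<String>,
    pub max_tokens: u32,
}

// Hypothetical vendor shapes, reduced to the fields that differ.
pub struct AnthropicMessagesRequest {
    pub system: Option<String>,    // top-level field
    pub messages: Vec<ApiMessage>, // user / assistant turns only
    pub max_tokens: u32,
}

pub struct OpenAIChatRequest {
    pub messages: Vec<ApiMessage>, // system prompt travels as the first message
    pub max_tokens: u32,
}

// Anthropic: the system prompt stays outside the message list.
pub fn to_anthropic_request(req: &ProviderRequest) -> AnthropicMessagesRequest {
    AnthropicMessagesRequest {
        system: req.system.clone(),
        messages: req.messages.clone(),
        max_tokens: req.max_tokens,
    }
}

// OpenAI: the system prompt is prepended as a role = "system" message.
pub fn to_openai_request(req: &ProviderRequest) -> OpenAIChatRequest {
    let mut messages = Vec::with_capacity(req.messages.len() + 1);
    if let Some(system) = &req.system {
        messages.push(ApiMessage {
            role: "system".to_string(),
            content: system.clone(),
        });
    }
    messages.extend(req.messages.iter().cloned());
    OpenAIChatRequest { messages, max_tokens: req.max_tokens }
}
```

Both functions borrow the request and return a fresh value, so the same ProviderRequest can be converted for several vendors without side effects.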

Transformers are pure functions: unified format in, vendor format out. Adding a new Provider only requires implementing a new Transformer and the corresponding LlmProvider implementation. The shared ProviderRegistry looks up implementations by Provider ID:

pub struct ProviderRegistry {
    providers: HashMap<ProviderId, Arc<dyn LlmProvider>>,
    default_provider_id: ProviderId,
}
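A lookup with fallback to the default provider might look like the sketch below. The trait is reduced to id(), ProviderId is assumed to be a plain String, and the register / resolve helpers and StubProvider are hypothetical names introduced for illustration.

```rust
use std::collections::HashMap;
use std::sync::Arc;

pub type ProviderId = String; // assumption: the real ProviderId may be a newtype

// Reduced trait: only the part the registry needs.
pub trait LlmProvider: Send + Sync {
    fn id(&self) -> &ProviderId;
}

pub struct ProviderRegistry {
    providers: HashMap<ProviderId, Arc<dyn LlmProvider>>,
    default_provider_id: ProviderId,
}

impl ProviderRegistry {
    pub fn new(default_provider_id: ProviderId) -> Self {
        Self { providers: HashMap::new(), default_provider_id }
    }

    // Hypothetical helper: store a provider under its own id.
    pub fn register(&mut self, provider: Arc<dyn LlmProvider>) {
        self.providers.insert(provider.id().clone(), provider);
    }

    // Hypothetical helper: look up by id, or use the default when no id is given.
    pub fn resolve(&self, id: Option<&str>) -> Option<Arc<dyn LlmProvider>> {
        let key = id.unwrap_or(self.default_provider_id.as_str());
        self.providers.get(key).cloned()
    }
}

// A stub provider used only to exercise the registry.
pub struct StubProvider(pub ProviderId);

impl LlmProvider for StubProvider {
    fn id(&self) -> &ProviderId {
        &self.0
    }
}
```

Returning Arc<dyn LlmProvider> lets the Agent loop hold a provider across await points without borrowing the registry.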

All Providers' streaming interactions rely on SSE (Server-Sent Events). But each vendor's SSE event granularity and semantics differ:

Anthropic: content_block_start / content_block_delta / content_block_stop form a three-level event hierarchy; a single ContentBlock spans multiple SSE messages from start to stop
OpenAI: choices[0].delta is a flat delta with no explicit block start/stop

BoxAgnts' stream_parser module digests all these differences and exposes a unified StreamEvent enum:

pub enum StreamEvent {
    TextDelta { text: String },
    ToolUseStart { id: String, name: String },
    ToolUseDelta { id: String, json: String },
    ToolUseEnd { id: String },
    ThinkingDelta { text: String },
    UsageUpdate { input_tokens: u32, output_tokens: u32 },
    MessageStop,
}

Each Provider's stream parser internally is a finite state machine. Taking Anthropic as an example:

Wait for message_start
  │
  ├── message_start ──► extract model, initial usage
  │
  ├── content_block_start
  │     │ type = "text"        → create TextBlock state
  │     │ type = "tool_use"    → create ToolUseBlock state, emit ToolUseStart
  │     │ type = "thinking"    → create ThinkingBlock state
  │
  ├── content_block_delta
  │     │ text_delta           → append to current TextBlock, emit TextDelta
  │     │ input_json_delta     → concatenate JSON fragment to ToolUseBlock, emit ToolUseDelta
  │     │ thinking_delta       → append to ThinkingBlock, emit ThinkingDelta
  │
  ├── content_block_stop
  │     │ corresponding tool_use block → emit ToolUseEnd
  │
  └── message_stop ──► emit MessageStop, accumulate final usage
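The transitions in the diagram can be sketched as a small step function. This is an illustration under assumptions, not the stream_parser module itself: the AnthropicEvent, BlockStart, and Delta enums stand for already-decoded SSE payloads, AnthropicParser is a hypothetical name, and message_start / usage handling is left out to keep it short.

```rust
use std::collections::HashMap;

// Already-decoded Anthropic SSE payloads, reduced to the fields used here.
pub enum AnthropicEvent {
    ContentBlockStart { index: usize, block: BlockStart },
    ContentBlockDelta { index: usize, delta: Delta },
    ContentBlockStop { index: usize },
    MessageStop,
}

pub enum BlockStart {
    Text,
    ToolUse { id: String, name: String },
    Thinking,
}

pub enum Delta {
    Text(String),
    InputJson(String),
    Thinking(String),
}

#[derive(Debug, PartialEq)]
pub enum StreamEvent {
    TextDelta { text: String },
    ToolUseStart { id: String, name: String },
    ToolUseDelta { id: String, json: String },
    ToolUseEnd { id: String },
    ThinkingDelta { text: String },
    MessageStop,
}

// Hypothetical parser state: which block indices are open tool_use blocks.
#[derive(Default)]
pub struct AnthropicParser {
    open_tools: HashMap<usize, String>, // block index -> tool_use id
}

impl AnthropicParser {
    // Feed one vendor event, get zero or one unified event back.
    pub fn step(&mut self, ev: AnthropicEvent) -> Option<StreamEvent> {
        match ev {
            AnthropicEvent::ContentBlockStart { index, block } => match block {
                BlockStart::ToolUse { id, name } => {
                    self.open_tools.insert(index, id.clone());
                    Some(StreamEvent::ToolUseStart { id, name })
                }
                BlockStart::Text | BlockStart::Thinking => None,
            },
            AnthropicEvent::ContentBlockDelta { index, delta } => match delta {
                Delta::Text(text) => Some(StreamEvent::TextDelta { text }),
                Delta::Thinking(text) => Some(StreamEvent::ThinkingDelta { text }),
                // A JSON fragment is only forwarded inside an open tool_use block.
                Delta::InputJson(json) => self
                    .open_tools
                    .get(&index)
                    .map(|id| StreamEvent::ToolUseDelta { id: id.clone(), json }),
            },
            // Only tool_use blocks emit an end event; text and thinking blocks close silently.
            AnthropicEvent::ContentBlockStop { index } => self
                .open_tools
                .remove(&index)
                .map(|id| StreamEvent::ToolUseEnd { id }),
            AnthropicEvent::MessageStop => Some(StreamEvent::MessageStop),
        }
    }
}
```

Keying the open tool blocks by index is what ties an input_json_delta, which carries only the block index, back to the tool_use id announced at content_block_start.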

StreamAccumulator maintains the state of all ContentBlocks in the current message:

pub struct StreamAccumulator {
    text_blocks: Vec<TextBlock>,
    tool_use_blocks: HashMap<String, ToolUseBlock>,
    thinking_block: Option<String>,
    usage: UsageInfo,
}

When MessageStop arrives, finish() assembles all accumulated blocks into a complete Message, returning stop_reason and final UsageInfo.
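As a rough sketch of the accumulate-then-assemble pattern: the version below tracks only text and tool_use blocks, keeps the tool input as a raw JSON string, and has finish() return just the content blocks rather than the full (Message, UsageInfo, stop_reason) result. Field names and the tool_order vector are assumptions made for the example.

```rust
use std::collections::HashMap;

pub enum StreamEvent {
    TextDelta { text: String },
    ToolUseStart { id: String, name: String },
    ToolUseDelta { id: String, json: String },
    ToolUseEnd { id: String },
    MessageStop,
}

#[derive(Debug, PartialEq)]
pub enum ContentBlock {
    Text(String),
    // `input_json` stays a raw string here; a real implementation would parse it.
    ToolUse { id: String, name: String, input_json: String },
}

#[derive(Default)]
pub struct StreamAccumulator {
    text: String,
    tool_order: Vec<String>,                  // order in which tools were started
    tools: HashMap<String, (String, String)>, // id -> (name, accumulated JSON)
}

impl StreamAccumulator {
    pub fn on_event(&mut self, evt: &StreamEvent) {
        match evt {
            StreamEvent::TextDelta { text } => self.text.push_str(text),
            StreamEvent::ToolUseStart { id, name } => {
                self.tool_order.push(id.clone());
                self.tools.insert(id.clone(), (name.clone(), String::new()));
            }
            StreamEvent::ToolUseDelta { id, json } => {
                // Fragments are concatenated; the JSON is only complete at ToolUseEnd.
                if let Some((_, buf)) = self.tools.get_mut(id) {
                    buf.push_str(json);
                }
            }
            StreamEvent::ToolUseEnd { .. } | StreamEvent::MessageStop => {}
        }
    }

    // Assemble the accumulated pieces into ordered content blocks.
    pub fn finish(mut self) -> Vec<ContentBlock> {
        let mut blocks = Vec::new();
        if !self.text.is_empty() {
            blocks.push(ContentBlock::Text(std::mem::take(&mut self.text)));
        }
        for id in std::mem::take(&mut self.tool_order) {
            if let Some((name, input_json)) = self.tools.remove(&id) {
                blocks.push(ContentBlock::ToolUse { id, name, input_json });
            }
        }
        blocks
    }
}
```

The separate tool_order vector exists because a HashMap does not preserve insertion order, and tool calls should be dispatched in the order the model issued them.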

The stream parser has converted SSE events into a structured Message. Next, query::run_query_loop() hands this Message to the tool system.

Core flow:

loop {
    // 1. Send message history + system prompt + tool list to the AI model
    let request = CreateMessageRequest::builder(model, max_tokens)
        .messages(messages.clone()) // clone: `messages` is extended and resent on later rounds
        .tools(all_tools_as_definitions(tools))
        .build();

    // 2. Initiate streaming request, parse SSE events
    let mut rx = client.create_message_stream(request).await?;
    let mut acc = StreamAccumulator::new();

    while let Some(evt) = rx.recv().await {
        acc.on_event(&evt);
        match evt {
            StreamEvent::ToolUseStart { .. } | StreamEvent::ToolUseDelta { .. } => {
                // Send to frontend in real time (via WebSocket) so users can see what tools the model is using
            }
            StreamEvent::MessageStop => break,
            _ => {}
        }
    }

    // 3. Assemble the completed Message, check stop_reason
    let (msg, usage, stop_reason) = acc.finish();

    match stop_reason.as_str() {
        "end_turn" => return QueryOutcome::EndTurn { message: msg, usage },
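        "tool_use" => {
            // Record the assistant turn first: each tool result must follow the tool_use block it answers
            messages.push(msg.clone());
            // 4. For each tool_use ContentBlock, call the corresponding tool
            for block in msg.content.iter() {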
                if let ContentBlock::ToolUse { name, input, .. } = block {
                    let tool = find_tool(tools, name);
                    let result = tool.execute(input, &ctx).await;
                    messages.push(result_to_message(result));
                }
            }
            // Return to loop top, continue to next round
        }
        "max_tokens" => {
            // 5. MaxTokens recovery: inject a hint message so the model can continue
            messages.push(UserMessage("Output token limit hit. Resume directly."));
            max_tokens_count += 1;
            if max_tokens_count > 3 { return MaxTokens { ... }; }
        }
        _ => return Error(...),
    }

    turn += 1;
    if turn >= config.max_turns { break; }
}

Several details worth noting:

Tool list injection strategy. Each API call round sends the complete tool list (all tools' name, description, and input_schema) as the tools field to the AI model. This incurs a fixed token overhead: the more tools, the higher the per-round "tool description tokens." When tools exceed 20, this overhead becomes significant (potentially several thousand tokens/round). BoxAgnts' current strategy is full injection; future consideration includes tool selection and grouping mechanisms (similar to Anthropic's tool_choice).

MaxTokens recovery. If a model exhausts its output token limit mid-response, it hasn't truly "failed" — it just hasn't finished speaking. BoxAgnts automatically injects a recovery message ("Output token limit hit. Resume directly...") to let the model continue. This loop executes at most 3 times — if after 3 attempts max_tokens is still hit, the task is genuinely too long; the system gives up and returns partial results.

Cancellation mechanism. CancellationToken is borrowed from the tokio ecosystem. When the user clicks the "Stop" button in the frontend, the WebSocket handler cancels the corresponding token, and run_query_loop returns QueryOutcome::Cancelled at its next check.
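The "next check" behaviour can be illustrated without any async runtime. The sketch below swaps in a std-only AtomicBool flag as a stand-in for tokio-util's CancellationToken (which offers cancel() and is_cancelled() as well, plus an awaitable cancelled()); CancelFlag, run_turns, and the turns_completed field are hypothetical names for the example.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

// Std-only stand-in for a cancellation token: clones share one flag.
#[derive(Clone, Default)]
pub struct CancelFlag(Arc<AtomicBool>);

impl CancelFlag {
    pub fn cancel(&self) {
        self.0.store(true, Ordering::SeqCst);
    }

    pub fn is_cancelled(&self) -> bool {
        self.0.load(Ordering::SeqCst)
    }
}

#[derive(Debug, PartialEq)]
pub enum QueryOutcome {
    Cancelled { turns_completed: u32 },
    MaxTurns,
}

// Hypothetical loop skeleton: `on_turn` stands in for one model round.
pub fn run_turns(
    token: &CancelFlag,
    max_turns: u32,
    mut on_turn: impl FnMut(u32),
) -> QueryOutcome {
    for turn in 0..max_turns {
        // Check at the top of every round, before spending tokens on a new request.
        if token.is_cancelled() {
            return QueryOutcome::Cancelled { turns_completed: turn };
        }
        on_turn(turn);
    }
    QueryOutcome::MaxTurns
}
```

Because the flag is only polled between rounds, a round already in flight runs to completion here; an async implementation could also race the stream against the token to stop mid-response.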

Cost tracking. After each API call round, CostTracker adds that round's cost at the current model's pricing (input and output tokens are priced separately, and prices differ by model). If cumulative cost exceeds budget_limit_usd, QueryOutcome::BudgetExceeded is returned. Cost information is pushed in real time to the frontend Dashboard via WebSocket.
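A minimal sketch of the accumulation and budget check, under stated assumptions: prices are expressed in USD per million tokens, the ModelPricing struct and method names are invented for the example, and because the standard library has no AtomicF64, the running total is stored as an f64 bit pattern inside an AtomicU64.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical per-model pricing, in USD per million tokens.
pub struct ModelPricing {
    pub input_per_mtok: f64,
    pub output_per_mtok: f64,
}

// std has no AtomicF64, so this sketch stores the f64 bit pattern in an AtomicU64.
pub struct CostTracker {
    total_usd_bits: AtomicU64,
    budget_limit_usd: f64,
}

impl CostTracker {
    pub fn new(budget_limit_usd: f64) -> Self {
        Self {
            total_usd_bits: AtomicU64::new(0.0_f64.to_bits()),
            budget_limit_usd,
        }
    }

    // Add one round's cost and return the new running total.
    pub fn record(&self, pricing: &ModelPricing, input_tokens: u32, output_tokens: u32) -> f64 {
        let cost = f64::from(input_tokens) * pricing.input_per_mtok / 1e6
            + f64::from(output_tokens) * pricing.output_per_mtok / 1e6;
        // Compare-and-swap loop: safe when several tasks record costs concurrently.
        let prev = self
            .total_usd_bits
            .fetch_update(Ordering::SeqCst, Ordering::SeqCst, |bits| {
                Some((f64::from_bits(bits) + cost).to_bits())
            })
            .expect("closure always returns Some");
        f64::from_bits(prev) + cost
    }

    pub fn total_usd(&self) -> f64 {
        f64::from_bits(self.total_usd_bits.load(Ordering::SeqCst))
    }

    pub fn budget_exceeded(&self) -> bool {
        self.total_usd() > self.budget_limit_usd
    }
}
```

With example prices of 3 USD (input) and 15 USD (output) per million tokens, a round using 1,000,000 input and 100,000 output tokens costs 3.00 + 1.50 = 4.50 USD, which stays under a 5 USD budget until the next round pushes the total over.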

AI API calls have several typical failure modes:

Error Type	Typical HTTP Code	Strategy
Rate Limit	429	Exponential backoff retry, respect Retry-After header
Overloaded	529	Exponential backoff retry, optional fallback model
Auth Failure	401/403	No retry, return error immediately
Bad Request	400	No retry (retrying parameter errors is pointless)
Server Error	500+	Limited retry (max 3 times)
Network Timeout	—	Limited retry

Exponential backoff uses intervals of 1s → 2s → 4s → 8s, computed by doubling a Duration on each attempt. For 529 (Overloaded), model switching is additionally supported: if the user has configured a fallback model (e.g., claude-sonnet-4-5 is overloaded and calls switch to claude-haiku-4-5), subsequent calls automatically use the fallback.
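The table and backoff schedule can be condensed into a single decision function. This is a sketch, not BoxAgnts' retry code: the RetryDecision enum, the function names, and the cap of four backoff retries for 429/529 (one per interval in the 1s to 8s schedule) are assumptions; the cap of three for server errors and timeouts follows the table.

```rust
use std::time::Duration;

#[derive(Debug, PartialEq)]
pub enum RetryDecision {
    // Retry after the given delay.
    RetryAfter(Duration),
    // Give up and surface the error.
    Fail,
}

const MAX_BACKOFF_RETRIES: u32 = 4; // assumption: one retry per interval in 1s, 2s, 4s, 8s
const MAX_LIMITED_RETRIES: u32 = 3; // "max 3 times" from the table

// 1s -> 2s -> 4s -> 8s: base delay doubled per attempt (attempt is 0-based).
pub fn backoff_delay(attempt: u32) -> Duration {
    Duration::from_secs(1) * 2u32.saturating_pow(attempt)
}

// `status` is None for network timeouts; `retry_after` is the parsed
// Retry-After header, if the server sent one.
pub fn decide(status: Option<u16>, attempt: u32, retry_after: Option<Duration>) -> RetryDecision {
    match status {
        // Bad request and auth failures: retrying cannot help.
        Some(400) | Some(401) | Some(403) => RetryDecision::Fail,
        // Rate limit: prefer the server's own hint over the local schedule.
        Some(429) if attempt < MAX_BACKOFF_RETRIES => {
            RetryDecision::RetryAfter(retry_after.unwrap_or_else(|| backoff_delay(attempt)))
        }
        // Overloaded: plain exponential backoff (fallback-model switching not shown).
        Some(529) if attempt < MAX_BACKOFF_RETRIES => {
            RetryDecision::RetryAfter(backoff_delay(attempt))
        }
        // Network timeout: limited retry.
        None if attempt < MAX_LIMITED_RETRIES => RetryDecision::RetryAfter(backoff_delay(attempt)),
        // Other server errors: limited retry.
        Some(code) if code >= 500 && code != 529 && attempt < MAX_LIMITED_RETRIES => {
            RetryDecision::RetryAfter(backoff_delay(attempt))
        }
        _ => RetryDecision::Fail,
    }
}
```

Keeping the decision separate from the HTTP call makes the policy easy to unit test and leaves the sleeping to whichever runtime the caller uses.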

The steps for adding a new Provider are clear:

1. Add a module under providers/ and implement the LlmProvider trait
2. Register it in registry.rs's provider_from_key()
3. Register its models in model_registry.rs

The openai_compat_providers module is a shortcut: for services using the OpenAI API format (DeepSeek, OpenCode, various domestic models), only API base URL and API key configuration is needed, and no Provider code needs to be written. These services share the same OpenAI-compatible SSE parser and Request builder; only the configuration differs.

// Configuration example
"deepseek": {
    "provider_id": "deepseek",
    "api_base": "https://api.deepseek.com/v1",
    "api_key": "sk-...",
    "provider_type": "openai_compat"
}
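To show why configuration alone is enough, here is a rough sketch of what a compatible provider could derive from such an entry. The CompatProviderConfig struct and its methods are hypothetical; the endpoint path and bearer-token header follow the OpenAI chat completions convention that these services mirror.

```rust
// Mirrors the configuration entry above; field names follow its keys.
#[derive(Debug, Clone)]
pub struct CompatProviderConfig {
    pub provider_id: String,
    pub api_base: String,
    pub api_key: String,
}

impl CompatProviderConfig {
    // OpenAI-style chat endpoint: `<api_base>/chat/completions`.
    pub fn chat_completions_url(&self) -> String {
        format!("{}/chat/completions", self.api_base.trim_end_matches('/'))
    }

    // OpenAI-compatible services authenticate with a bearer token header.
    pub fn auth_header(&self) -> (String, String) {
        ("Authorization".to_string(), format!("Bearer {}", self.api_key))
    }
}
```

Everything else (request body, SSE parsing) is shared, so two services differ only in the values these two methods return.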

The Provider abstraction and Agent query loop constitute BoxAgnts' tool system "engine":

Provider abstraction solves the problem of integrating 12 AI APIs through three-layer decoupling (ProviderRequest/Response unified data model → LlmProvider trait → Transformer format conversion). Adding a new Provider requires only implementing the trait + registration; the shared SSE parser and Request builder further reduce integration costs through the openai_compat module.

Agent query loop achieves interleaved orchestration of conversation and tool execution through a closed loop of SSE state machine parsing, ToolUse detection, tool dispatch, and result feedback. MaxTokens automatic recovery (max 3 attempts) and exponential backoff retry strategy ensure reliability for long tasks.

The common feature of these two layers is dependency inversion — the Agent loop doesn't depend on a specific AI vendor, and the Provider implementation doesn't depend on specific conversation orchestration logic. All coupling is decoupled through trait interfaces.

Cost tracking (CostTracker + AtomicF64) and the cancellation mechanism (CancellationToken) provide necessary operational observability and user control for production environments.

Source: dev.to (original article)
