AI Coding Agents Need Runtime Telemetry Before Commit Telemetry

A new arXiv paper scanning over 180 million Git repositories found that AI coding agents are heavily used in open source, but single-signal observability is weak. The study revealed a 30x recall gap between multi-method detection and bot-account lookup for Claude Code commits. The paper argues that runtime telemetry, not just commit telemetry, is essential for monitoring agent execution and preventing unsafe behavior.

A new arXiv paper published on June 23, 2026 scanned more than 180 million Git repositories to detect traces of AI coding agents in open source. The authors combined multiple signals: configuration-file scanning, commit-message analysis, author-identity matching, and bot-signature lookup.

The most useful result for developers is the visibility gap. In one snapshot, multi-method detection found 850,157 Claude Code commits. Bot-account lookup found only 28,154. That is 3.3% of the multi-method total, or roughly a 30x relative recall gap. The paper also reports more than 320,000 commit-attributed agent commits per month across snapshots from December 2024 to April 2026.

The immediate takeaway: AI coding agents are being used heavily.

The engineering takeaway: single-signal observability is weak.

Commit telemetry is too late

A commit is the end of an agent run. It does not tell you enough about the run itself. A commit may not show:

- how many model calls happened
- how many retries happened
- whether prompts repeated
- whether tools failed
- whether the model price was known
- whether the run exceeded budget
- whether the agent made progress
- whether fallback models were used
- whether the agent stopped safely

If you only inspect the repository after the fact, you are observing the artifact. You are not observing the execution. For agent systems, execution is where many failures happen.

Agents are loops

A coding agent is usually some version of this:

    // Naive agent loop: call the model, run the tool, repeat until done.
    while (!task.done) {
      const response = await model.call(task.context);
      const action = parseAction(response);
      const result = await runTool(action);
      task = updateTask(task, result);
    }

This is useful. It is also incomplete. There is no budget. No max-step limit. No retry control. No prompt-loop detection. No known-pricing check. No no-progress stop.

A safer runtime shape puts a decision before the provider call.

    // Ask the guard for a decision before spending anything on the provider.
    const decision = guard.beforeCall({
      runId: task.id,
      model: task.model,
      prompt: task.currentPrompt,
      stepCount: task.steps.length,
      retryCount: task.retryCount,
      previousPrompts: task.previousPrompts,
      budgetRemaining: task.budgetRemaining,
      progressState: task.progress,
    });

    // Stop with a structured reason instead of making the call.
    if (!decision.allowed) {
      return {
        status: "stopped",
        reason: decision.reason,
        error: decision.error,
      };
    }

    const response = await model.call(task.context);

The important part is not the exact API. The important part is timing. The check happens before the provider call. That means the runtime can stop unsafe execution before more cost is created.

What to log before the call

A useful agent runtime should log decision inputs, not only final outputs. For each provider call, consider recording:

    type AgentCallDecision = {
      runId: string;
      model: string;
      modelPriceKnown: boolean;
      stepCount: number;
      maxSteps: number;
      retryCount: number;
      budgetRemaining: number;
      estimatedNextCallCost: number;
      promptSimilarityScore?: number;
      progressScore?: number;
      allowed: boolean;
      stopReason?: string;
    };

This gives you data that a commit cannot provide. You can now ask:

- Which tasks hit max steps?
- Which runs stopped because pricing was unknown?
- Which prompts repeated?
- Which models caused budget pressure?
- Which agent workflows produced commits only after many failed attempts?
- Which agents consumed budget without progress?

That is runtime telemetry.

Guardrails to implement first

Agents should not run forever.

    if (stepCount >= maxSteps) {
      return {
        allowed: false,
        reason: "max steps exceeded",
      };
    }

This is basic. It is also one of the highest-value controls.

If the runtime cannot price the model, it cannot enforce a budget.

    if (!pricingCatalog[model]) {
      return {
        allowed: false,
        reason: "unknown model pricing",
      };
    }

Do not guess. Fail closed.

Budgets should exist at the task level, not only at the account level.

    if (estimatedNextCallCost > budgetRemaining) {
      return {
        allowed: false,
        reason: "budget exceeded",
      };
    }

A small refactor and a multi-hour migration should not share the same ceiling.

Retries are normal. Retry storms are not.

    if (retryCount > maxRetries && recentErrorsAreSimilar(errors)) {
      return {
        allowed: false,
        reason: "retry storm detected",
      };
    }

The goal is not to ban retries. The goal is to stop blind repetition.

If the current prompt is almost the same as previous failed prompts, the agent may be stuck.

    if (similarToRecentPrompt(currentPrompt, previousPrompts)) {
      return {
        allowed: false,
        reason: "similar prompt loop",
      };
    }

Even a simple similarity check can catch obvious waste.

A run can be active and still not moving. Track progress signals:

- tests passing
- errors decreasing
- files changing meaningfully
- checklist items completing
- user-defined success criteria improving

If those signals do not change after several steps, stop.

Why this matters now

GitHub has already said Copilot moved to usage-based billing on June 1, 2026, with usage calculated from token consumption including input, output, and cached tokens. GitHub also described Copilot as moving from an in-editor assistant into an agentic platform capable of long, multi-step coding sessions across repositories.

That means agent runtime behavior increasingly has direct cost impact. A loop is no longer just a UX problem. It is a billing problem. A retry storm is not just noisy. It is spend. A prompt loop is not just inefficient. It is measurable waste.

Where AI CostGuard fits

AI CostGuard is the local-first TypeScript / Node.js runtime safety layer I’m building for this problem. It focuses on stopping agent failures before provider calls execute:

- retry storms
- prompt loops
- max-step explosions
- no-progress runs
- budget overruns
- unknown model pricing
- runaway agent behavior

The key design question is simple:

Should this next provider call be allowed?

If the answer is no, the runtime should return a structured stop reason before the call happens.

Takeaway

The new arXiv paper shows that even detecting AI coding-agent activity in repositories requires multiple signals. That lesson applies directly to runtime engineering.

Do not wait for the commit. Do not wait for the dashboard. Do not wait for the invoice. Instrument the loop.

Add one pre-call decision log to your agent runtime before adding another dashboard.

https://github.com/salimassili62-afk/ai-costguard
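
As a closing illustration, here is a minimal sketch of how the pre-call checks from the guardrails section might be combined into one ordered decision function, including the two checks the article describes only in prose: prompt similarity and the no-progress stop. Everything in it is an assumption made for illustration, not the AI CostGuard API: the names `GuardInput`, `Decision`, `jaccard`, `stalled`, and `beforeCall`, the flat per-call price table, the word-set Jaccard measure, and the default thresholds of 0.9 and 3 steps. The retry-storm check is left out to keep the sketch short.

```typescript
// Hypothetical shapes for illustration only; not the AI CostGuard API.
type GuardInput = {
  stepCount: number;
  maxSteps: number;
  model: string;
  pricePerCall: Record<string, number>; // assumed flat per-call price table
  budgetRemaining: number;
  currentPrompt: string;
  previousPrompts: string[];
  progressHistory: number[]; // one reading per step, e.g. passing-test count
};

type Decision = { allowed: boolean; reason?: string };

// Jaccard similarity over lowercase word sets: |A ∩ B| / |A ∪ B|.
function jaccard(a: string, b: string): number {
  const setA = new Set(a.toLowerCase().split(/\s+/).filter(Boolean));
  const setB = new Set(b.toLowerCase().split(/\s+/).filter(Boolean));
  if (setA.size === 0 && setB.size === 0) return 1;
  let shared = 0;
  setA.forEach((word) => {
    if (setB.has(word)) shared++;
  });
  return shared / (setA.size + setB.size - shared);
}

// True when none of the last `window` readings rose above the reading before them.
function stalled(history: number[], window: number): boolean {
  if (history.length <= window) return false;
  const baseline = history[history.length - window - 1];
  return history.slice(-window).every((value) => value <= baseline);
}

// Runs the checks in a fixed order and returns the first stop reason found.
function beforeCall(input: GuardInput, similarityLimit = 0.9, stallWindow = 3): Decision {
  if (input.stepCount >= input.maxSteps) {
    return { allowed: false, reason: "max steps exceeded" };
  }
  const price = input.pricePerCall[input.model];
  if (price === undefined) {
    return { allowed: false, reason: "unknown model pricing" }; // fail closed
  }
  if (price > input.budgetRemaining) {
    return { allowed: false, reason: "budget exceeded" };
  }
  const looping = input.previousPrompts.some(
    (previous) => jaccard(previous, input.currentPrompt) >= similarityLimit,
  );
  if (looping) {
    return { allowed: false, reason: "similar prompt loop" };
  }
  if (stalled(input.progressHistory, stallWindow)) {
    return { allowed: false, reason: "no progress" };
  }
  return { allowed: true };
}
```

The ordering is a design choice rather than a requirement: the cheapest checks (step count, price lookup) run first, and the similarity scan, which touches every previous prompt, runs only if those pass. Logging the returned `Decision` alongside its inputs on every call would produce the kind of pre-call record the article recommends.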