My LLM API Calls Were Failing Silently. Here's the Logging Setup I Wish I Had Earlier

A developer built a structured logging system for LLM API calls after encountering silent failures in production. The setup captures latency, token usage, retries, fallbacks, and error types to distinguish between simple outages and subtle degradations. The implementation uses the OpenAI SDK with custom event logging for both successful and failed requests.

The first few LLM API bugs I hit in production were easy to notice. The request failed. The user saw an error. I opened the logs, found the stack trace, fixed the obvious thing, and moved on.

The harder bugs were quieter. The API still returned a response, but it was slower than usual. A fallback model kicked in without anyone noticing. Token usage crept up over a few days. A retry made the request succeed, but doubled the latency. Streaming worked most of the time, except when it didn't.

Nothing looked "down." The app just started feeling worse.

That was when I realized my LLM logging was too thin. I was logging errors, but not enough context to understand behavior.

For a typical REST API call, I might log:

That is useful, but LLM calls have a few extra dimensions. A successful LLM request can still be a problem if:

If all I log is `status: 200`, I miss almost everything that matters.

This is the basic shape I try to capture now:

```
{
  "event": "llm_request",
  "request_id": "req_123",
  "provider": "tokenbay",
  "model": "gpt-4.1-mini",
  "operation": "chat_completion",
  "status": "success",
  "latency_ms": 1842,
  "input_tokens": 812,
  "output_tokens": 244,
  "estimated_cost_usd": 0.0019,
  "retry_count": 0,
  "fallback_from": null,
  "fallback_to": null,
  "streaming": false,
  "error_type": null,
  "error_message": null
}
```

For failed requests:

```
{
  "event": "llm_request",
  "request_id": "req_124",
  "provider": "tokenbay",
  "model": "some-model",
  "operation": "chat_completion",
  "status": "error",
  "latency_ms": 5000,
  "input_tokens": null,
  "output_tokens": null,
  "estimated_cost_usd": null,
  "retry_count": 2,
  "fallback_from": "some-model",
  "fallback_to": "backup-model",
  "streaming": false,
  "error_type": "rate_limit",
  "error_message": "Rate limit exceeded"
}
```

The exact fields depend on your app, but the categories matter more than the names. I want to know:

That is the difference between "the AI feature feels slow today" and "requests to model X are retrying twice after 429s, then falling back to model Y."
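
The retry and fallback fields are only trustworthy if the code that retries is also the code that fills them in. Here is a rough sketch of that idea, under stated assumptions: `callWithRetryAndFallback`, `callModel`, `fallbackModel`, `maxRetries`, and `baseDelayMs` are hypothetical names, not part of any SDK, and treating 429, 503, and 504 as the retryable statuses is my assumption rather than a rule.

```javascript
// Sketch only: `callModel(model)` is a hypothetical async function that makes
// one request and throws an error carrying a numeric `status` on failure.
const RETRYABLE_STATUSES = new Set([429, 503, 504]); // assumption, tune per provider

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function callWithRetryAndFallback({
  callModel,
  model,
  fallbackModel = null,
  maxRetries = 2,
  baseDelayMs = 200
}) {
  const startedAt = Date.now();
  const event = {
    event: "llm_request",
    model,
    status: "success",
    retry_count: 0,
    fallback_from: null,
    fallback_to: null,
    latency_ms: null,
    error_message: null
  };

  // Try one model, retrying retryable failures with exponential backoff.
  const attempt = async (targetModel, retriesAllowed) => {
    for (let i = 0; ; i += 1) {
      try {
        return await callModel(targetModel);
      } catch (error) {
        const retryable = RETRYABLE_STATUSES.has(error?.status);
        if (!retryable || i >= retriesAllowed) throw error;
        event.retry_count += 1;
        await sleep(baseDelayMs * 2 ** i);
      }
    }
  };

  try {
    let response;
    try {
      response = await attempt(model, maxRetries);
    } catch (error) {
      const canFallBack = fallbackModel && RETRYABLE_STATUSES.has(error?.status);
      if (!canFallBack) throw error;
      // Record the switch so a "successful" request still shows the degradation.
      event.fallback_from = model;
      event.fallback_to = fallbackModel;
      response = await attempt(fallbackModel, 0);
    }
    return { response, event };
  } catch (error) {
    event.status = "error";
    event.error_message = error?.message ?? "Unknown error";
    throw error;
  } finally {
    // Total latency includes every retry, backoff sleep, and fallback attempt.
    event.latency_ms = Date.now() - startedAt;
    console.log(JSON.stringify(event));
  }
}
```

The point of the shape is that a request which succeeds on the third attempt, or on the backup model, still produces one log line that says so.
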

Here is a simple version using the OpenAI SDK. It works with OpenAI directly, or with any OpenAI-compatible endpoint by changing `baseURL`.

Install:

```
npm install openai
```

Create `llm-client.js`:

```js
import OpenAI from "openai";
import crypto from "node:crypto";

const client = new OpenAI({
  apiKey: process.env.LLM_API_KEY,
  baseURL: process.env.LLM_BASE_URL || "https://api.openai.com/v1"
});

// Monotonic clock in milliseconds, so durations are not affected by clock changes.
function nowMs() {
  return Number(process.hrtime.bigint() / 1000000n);
}

// Short fingerprint of the prompt, used to group requests without storing content.
function promptHash(messages) {
  const text = JSON.stringify(messages);
  return crypto.createHash("sha256").update(text).digest("hex").slice(0, 16);
}

// Normalize provider-specific failures into a small set of categories.
function classifyError(error) {
  const status = error?.status;

  if (status === 400) return "invalid_request";
  if (status === 401 || status === 403) return "auth_or_permission";
  if (status === 413) return "request_too_large";
  if (status === 429) return "rate_limit";
  if (status === 503) return "service_unavailable";
  if (status === 504) return "upstream_timeout";
  if (status >= 500) return "provider_5xx";

  // No usable status code: fall back to matching on the message text.
  const message = String(error?.message || "").toLowerCase();
  if (message.includes("context length")) return "context_length";
  if (message.includes("timeout")) return "timeout";
  if (message.includes("content filter")) return "content_filter";

  return "unknown";
}

function logLLMEvent(event) {
  console.log(JSON.stringify(event));
}

export async function createLoggedChatCompletion({
  requestId,
  provider = "default",
  model,
  messages,
  temperature = 0.2,
  maxTokens = 500,
  streaming = false
}) {
  const startedAt = nowMs();

  const baseEvent = {
    event: "llm_request",
    request_id: requestId,
    provider,
    model,
    operation: "chat_completion",
    prompt_hash: promptHash(messages),
    streaming,
    retry_count: 0,
    fallback_from: null,
    fallback_to: null
  };

  try {
    const response = await client.chat.completions.create({
      model,
      messages,
      temperature,
      max_tokens: maxTokens,
      stream: streaming
    });

    const latencyMs = nowMs() - startedAt;

    if (streaming) {
      // For streams this is only the time until the stream opened, and token
      // usage is not known yet, so those fields stay null.
      logLLMEvent({
        ...baseEvent,
        status: "success",
        latency_ms: latencyMs,
        input_tokens: null,
        output_tokens: null,
        estimated_cost_usd: null,
        error_type: null,
        error_message: null
      });

      return response;
    }

    logLLMEvent({
      ...baseEvent,
      status: "success",
      latency_ms: latencyMs,
      input_tokens: response.usage?.prompt_tokens ?? null,
      output_tokens: response.usage?.completion_tokens ?? null,
      estimated_cost_usd: null,
      error_type: null,
      error_message: null
    });

    return response;
  } catch (error) {
    const latencyMs = nowMs() - startedAt;

    logLLMEvent({
      ...baseEvent,
      status: "error",
      latency_ms: latencyMs,
      input_tokens: null,
      output_tokens: null,
      estimated_cost_usd: null,
      error_type: classifyError(error),
      error_message: error?.message || "Unknown error"
    });

    throw error;
  }
}
```

Use it like this, in `app.js`:

```js
import crypto from "node:crypto";
import { createLoggedChatCompletion } from "./llm-client.js";

const response = await createLoggedChatCompletion({
  requestId: crypto.randomUUID(),
  provider: "openai-compatible",
  model: "gpt-4.1-mini",
  messages: [
    { role: "user", content: "Explain retries and exponential backoff in one paragraph." }
  ]
});

console.log(response.choices[0].message.content);
```

Run it (both files use ES module syntax, so set `"type": "module"` in `package.json` or use the `.mjs` extension):

```
LLM_API_KEY="your-api-key" node app.js
```

If you use TokenBay, the OpenAI-compatible base URL is:

```
LLM_API_KEY="your-tokenbay-api-key" \
LLM_BASE_URL="https://api.tokenbay.com/v1" \
node app.js
```

Same SDK shape. Different base URL.

This part matters. It is tempting to log the full prompt because it makes debugging easier. I try not to do that by default. Prompts can contain:

Instead, I usually log a hash of the prompt and a few safe metadata fields:

```
{
  "prompt_hash": "a3f9c01de81b7a22",
  "message_count": 4,
  "has_system_prompt": true,
  "input_chars": 3821
}
```

That lets me group repeated failures without storing the actual content.

For local development, raw prompt logging can be useful. For production, I want it behind a very explicit flag, with retention rules and access control.

Provider-side usage logs are useful.
For example, TokenBay's Usage Logs page can show request-level details such as time, model, token count, and cost. That is helpful, especially when you are using multiple models through one OpenAI-compatible API.

But provider logs usually do not know your application context. They do not know that this request came from your support reply generator, or that the user had already waited through two failed attempts, or that the answer was discarded before being shown.

That is why I still keep app-side logs. The provider can tell me what happened at the API layer. My app logs tell me why it mattered.

Some fields looked boring at first, but ended up being the most useful.

`model`

This sounds obvious until you have multiple models in production. If your app can use GPT, Claude, Gemini, Qwen, DeepSeek, GLM, or smaller fallback models, you need to know which one actually handled the request. Not which one the product team thinks is configured. The actual model.

`provider`

This matters when using multiple vendors or an OpenAI-compatible API gateway. The same model name can behave differently depending on the provider, gateway, account limits, or routing setup. If latency spikes, I want to know whether it is model-specific or provider-specific.

`latency_ms`

Average latency is not enough. I usually want p50, p95, and p99 by model and operation. A chatbot can feel fine at p50 and awful at p95.

`retry_count`

Retries are sneaky. They make reliability look better while quietly increasing latency and cost. If a request succeeds after two retries, the user may not see an error, but the system still degraded.

`fallback_from` and `fallback_to`

Fallback is great until it hides the original problem. If model A fails and model B saves the request, that is useful. But if it happens 30 percent of the time, I need to know. Otherwise I might think model A is working fine.

`input_tokens` and `output_tokens`

Token usage explains a lot of cost surprises.
When a bill jumps, the cause is often not "the provider got expensive." It is more likely:

You cannot see that from request count alone.

`error_type`

Raw error messages are messy. One provider says `rate limit exceeded`. Another says `Too many requests`. Another gives you a 429 with a different body.

I normalize errors into categories:

```js
const errorTypes = [
  "auth_or_permission",
  "invalid_request",
  "rate_limit",
  "request_too_large",
  "context_length",
  "content_filter",
  "provider_5xx",
  "service_unavailable",
  "upstream_timeout",
  "stream_interrupted",
  "unknown"
];
```

This makes dashboards and alerts much easier.

The worst failures are not always exceptions. These are the ones I try to catch with logs and metrics:

A provider starts returning intermittent 429s, 503s, or 504s. Your retry logic hides it. The app still works, but latency doubles and costs rise.

Watch:

Fallback should be the backup plan. If fallback becomes normal, you may have a provider issue, a bad timeout setting, or a model that is no longer suitable.

Watch:

This is when prompts slowly get larger over time. Maybe you added more retrieved documents. Maybe the system prompt grew. Maybe conversation history is not being trimmed. Nothing breaks immediately. The bill just gets heavier.

Watch:

Streaming can fail differently from normal responses. Sometimes the first tokens arrive, then the stream stops. If you only log the initial request success, you miss the failure.

Watch:

This happens when config changes, environment variables drift, or a gateway route points somewhere unexpected. The app asks for one model, but production traffic goes somewhere else.

Watch:

After a few rounds, my log event usually grows into something like this:

```
{
  "event": "llm_request",
  "timestamp": "2026-06-26T08:30:00.000Z",
  "request_id": "req_abc",
  "user_id_hash": "user_91ab",
  "environment": "production",
  "feature": "support_reply_generator",
  "provider": "tokenbay",
  "model": "gpt-4.1-mini",
  "operation": "chat_completion",
  "streaming": false,
  "status": "success",
  "latency_ms": 1842,
  "retry_count": 1,
  "fallback_from": null,
  "fallback_to": null,
  "input_tokens": 812,
  "output_tokens": 244,
  "estimated_cost_usd": 0.0019,
  "prompt_hash": "a3f9c01de81b7a22",
  "error_type": null
}
```

This is not fancy observability. It is just enough structure to answer practical questions. Which feature got slower? Which model is causing errors? Did fallback save us or hide a bigger issue? Did the cost increase because of traffic, tokens, retries, or model choice?

Disclosure: I work on [TokenBay](https://www.tokenbay.com/?utm_source=devto&utm_medium=community_content&utm_campaign=week1_free_content), so I am biased here.

One reason I care about this logging shape is that TokenBay is built around using multiple AI models through one OpenAI-compatible API. That makes it convenient to switch between models, but it also makes observability more important.

TokenBay can show usage details at the API layer. I still want my own application logs because my app knows things the API layer cannot always know:

The more flexible your model setup becomes, the more important boring logs become.

For every production LLM call, I want enough information to debug four questions:

If my logs cannot answer those, I am probably flying blind.

The annoying part is that you usually do not notice this on day one. You notice it later, when something is already weird and your only log line says:

```
LLM request completed
```

Ask me how I know.
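
For the "which model got slower" and "is fallback becoming normal" questions, the structured events can be rolled up with very little code. This is a minimal sketch, assuming events shaped like the ones in this post have already been parsed into an array. `percentile` and `summarizeByModel` are hypothetical helper names, and the nearest-rank percentile method is my choice for simplicity, not something a metrics backend is guaranteed to match.

```javascript
// Nearest-rank percentile over an ascending-sorted array of numbers.
function percentile(sorted, p) {
  if (sorted.length === 0) return null;
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.min(sorted.length, Math.max(1, rank)) - 1];
}

// Group parsed log events by model and compute latency percentiles plus
// the share of requests that errored, retried, or fell back.
function summarizeByModel(events) {
  const groups = new Map();
  for (const e of events) {
    if (!groups.has(e.model)) groups.set(e.model, []);
    groups.get(e.model).push(e);
  }

  const summary = {};
  for (const [model, group] of groups) {
    const latencies = group
      .map((e) => e.latency_ms)
      .filter((v) => typeof v === "number")
      .sort((a, b) => a - b);
    const share = (predicate) => group.filter(predicate).length / group.length;

    summary[model] = {
      requests: group.length,
      p50_ms: percentile(latencies, 50),
      p95_ms: percentile(latencies, 95),
      p99_ms: percentile(latencies, 99),
      error_rate: share((e) => e.status === "error"),
      retried_rate: share((e) => (e.retry_count ?? 0) > 0),
      fallback_rate: share((e) => e.fallback_to != null)
    };
  }
  return summary;
}
```

In practice a log platform or metrics store would do this aggregation. The sketch only shows that the fields are enough to compute it.
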