{"slug": "my-llm-api-calls-were-failing-silently-here-s-the-logging-setup-i-wish-i-had", "title": "My LLM API Calls Were Failing Silently. Here's the Logging Setup I Wish I Had Earlier", "summary": "A developer built a structured logging system for LLM API calls after encountering silent failures in production. The setup captures latency, token usage, retries, fallbacks, and error types to distinguish between simple outages and subtle degradations. The implementation uses the OpenAI SDK with custom event logging for both successful and failed requests.", "body_md": "The first few LLM API bugs I hit in production were easy to notice.\n\nThe request failed. The user saw an error. I opened the logs, found the stack trace, fixed the obvious thing, and moved on.\n\nThe harder bugs were quieter.\n\nThe API still returned a response, but it was slower than usual. A fallback model kicked in without anyone noticing. Token usage crept up over a few days. A retry made the request succeed, but doubled the latency. 
Streaming worked most of the time, except when it didn't.\n\nNothing looked \"down.\" The app just started feeling worse.\n\nThat was when I realized my LLM logging was too thin.\n\nI was logging errors, but not enough context to understand behavior.\n\nFor a typical REST API call, I might log:\n\nThat is useful, but LLM calls have a few extra dimensions.\n\nA successful LLM request can still be a problem if:\n\nIf all I log is `status: 200`, I miss almost everything that matters.\n\nThis is the basic shape I try to capture now:\n\n```\n{\n  \"event\": \"llm_request\",\n  \"request_id\": \"req_123\",\n  \"provider\": \"tokenbay\",\n  \"model\": \"gpt-4.1-mini\",\n  \"operation\": \"chat_completion\",\n  \"status\": \"success\",\n  \"latency_ms\": 1842,\n  \"input_tokens\": 812,\n  \"output_tokens\": 244,\n  \"estimated_cost_usd\": 0.0019,\n  \"retry_count\": 0,\n  \"fallback_from\": null,\n  \"fallback_to\": null,\n  \"streaming\": false,\n  \"error_type\": null,\n  \"error_message\": null\n}\n```\n\nFor failed requests:\n\n```\n{\n  \"event\": \"llm_request\",\n  \"request_id\": \"req_124\",\n  \"provider\": \"tokenbay\",\n  \"model\": \"some-model\",\n  \"operation\": \"chat_completion\",\n  \"status\": \"error\",\n  \"latency_ms\": 5000,\n  \"input_tokens\": null,\n  \"output_tokens\": null,\n  \"estimated_cost_usd\": null,\n  \"retry_count\": 2,\n  \"fallback_from\": \"some-model\",\n  \"fallback_to\": \"backup-model\",\n  \"streaming\": false,\n  \"error_type\": \"rate_limit\",\n  \"error_message\": \"Rate limit exceeded\"\n}\n```\n\nThe exact fields depend on your app, but the categories matter more than the names.\n\nI want to know:\n\nThat is the difference between \"the AI feature feels slow today\" and \"requests to model X are retrying twice after 429s, then falling back to model Y.\"\n\nHere is a simple version using the OpenAI SDK.\n\nIt works with OpenAI directly, or with any OpenAI-compatible endpoint by changing 
`baseURL`.\n\nInstall:\n\n```\nnpm install openai\n```\n\nCreate `llm-client.js`:\n\n``` js\nimport OpenAI from \"openai\";\nimport crypto from \"node:crypto\";\n\nconst client = new OpenAI({\n  apiKey: process.env.LLM_API_KEY,\n  baseURL: process.env.LLM_BASE_URL || \"https://api.openai.com/v1\"\n});\n\nfunction nowMs() {\n  return Number(process.hrtime.bigint() / 1000000n);\n}\n\nfunction promptHash(messages) {\n  const text = JSON.stringify(messages);\n  return crypto.createHash(\"sha256\").update(text).digest(\"hex\").slice(0, 16);\n}\n\nfunction classifyError(error) {\n  const status = error?.status;\n\n  if (status === 400) return \"invalid_request\";\n  if (status === 401 || status === 403) return \"auth_or_permission\";\n  if (status === 413) return \"request_too_large\";\n  if (status === 429) return \"rate_limit\";\n  if (status === 503) return \"service_unavailable\";\n  if (status === 504) return \"upstream_timeout\";\n  if (status >= 500) return \"provider_5xx\";\n\n  const message = String(error?.message || \"\").toLowerCase();\n\n  if (message.includes(\"context length\")) return \"context_length\";\n  if (message.includes(\"timeout\")) return \"timeout\";\n  if (message.includes(\"content filter\")) return \"content_filter\";\n\n  return \"unknown\";\n}\n\nfunction logLLMEvent(event) {\n  console.log(JSON.stringify(event));\n}\n\nexport async function createLoggedChatCompletion({\n  requestId,\n  provider = \"default\",\n  model,\n  messages,\n  temperature = 0.2,\n  maxTokens = 500,\n  streaming = false\n}) {\n  const startedAt = nowMs();\n\n  const baseEvent = {\n    event: \"llm_request\",\n    request_id: requestId,\n    provider,\n    model,\n    operation: \"chat_completion\",\n    prompt_hash: promptHash(messages),\n    streaming,\n    retry_count: 0,\n    fallback_from: null,\n    fallback_to: null\n  };\n\n  try {\n    const response = await client.chat.completions.create({\n      model,\n      messages,\n      temperature,\n    
  max_tokens: maxTokens,\n      stream: streaming\n    });\n\n    const latencyMs = nowMs() - startedAt;\n\n    if (streaming) {\n      logLLMEvent({\n        ...baseEvent,\n        status: \"success\",\n        latency_ms: latencyMs,\n        input_tokens: null,\n        output_tokens: null,\n        estimated_cost_usd: null,\n        error_type: null,\n        error_message: null\n      });\n\n      return response;\n    }\n\n    logLLMEvent({\n      ...baseEvent,\n      status: \"success\",\n      latency_ms: latencyMs,\n      input_tokens: response.usage?.prompt_tokens ?? null,\n      output_tokens: response.usage?.completion_tokens ?? null,\n      estimated_cost_usd: null,\n      error_type: null,\n      error_message: null\n    });\n\n    return response;\n  } catch (error) {\n    const latencyMs = nowMs() - startedAt;\n\n    logLLMEvent({\n      ...baseEvent,\n      status: \"error\",\n      latency_ms: latencyMs,\n      input_tokens: null,\n      output_tokens: null,\n      estimated_cost_usd: null,\n      error_type: classifyError(error),\n      error_message: error?.message || \"Unknown error\"\n    });\n\n    throw error;\n  }\n}\n```\n\nUse it like this:\n\n``` js\nimport crypto from \"node:crypto\";\nimport { createLoggedChatCompletion } from \"./llm-client.js\";\n\nconst response = await createLoggedChatCompletion({\n  requestId: crypto.randomUUID(),\n  provider: \"openai-compatible\",\n  model: \"gpt-4.1-mini\",\n  messages: [\n    {\n      role: \"user\",\n      content: \"Explain retries and exponential backoff in one paragraph.\"\n    }\n  ]\n});\n\nconsole.log(response.choices[0].message.content);\n```\n\nRun it:\n\n```\nLLM_API_KEY=\"your-api-key\" node app.js\n```\n\nIf you use TokenBay, the OpenAI-compatible base URL is:\n\n```\nLLM_API_KEY=\"your-tokenbay-api-key\" \\\nLLM_BASE_URL=\"https://api.tokenbay.com/v1\" \\\nnode app.js\n```\n\nSame SDK shape. 
Different base URL.\n\nThis part matters.\n\nIt is tempting to log the full prompt because it makes debugging easier. I try not to do that by default.\n\nPrompts can contain:\n\nInstead, I usually log a hash of the prompt and a few safe metadata fields:\n\n```\n{\n  \"prompt_hash\": \"a3f9c01de81b7a22\",\n  \"message_count\": 4,\n  \"has_system_prompt\": true,\n  \"input_chars\": 3821\n}\n```\n\nThat lets me group repeated failures without storing the actual content.\n\nFor local development, raw prompt logging can be useful. For production, I want it behind a very explicit flag, with retention rules and access control.\n\nProvider-side usage logs are useful.\n\nFor example, TokenBay's Usage Logs page can show request-level details such as time, model, token count, and cost.\n\nThat is helpful, especially when you are using multiple models through one OpenAI-compatible API.\n\nBut provider logs usually do not know your application context.\n\nThey do not know that this request came from your support reply generator, or that the user had already waited through two failed attempts, or that the answer was discarded before being shown.\n\nThat is why I still keep app-side logs.\n\nThe provider can tell me what happened at the API layer.\n\nMy app logs tell me why it mattered.\n\nSome fields looked boring at first, but ended up being the most useful.\n\n`model`\n\nThis sounds obvious until you have multiple models in production.\n\nIf your app can use GPT, Claude, Gemini, Qwen, DeepSeek, GLM, or smaller fallback models, you need to know which one actually handled the request.\n\nNot which one the product team thinks is configured.\n\nThe actual model.\n\n`provider`\n\nThis matters when using multiple vendors or an OpenAI-compatible API gateway.\n\nThe same model name can behave differently depending on the provider, gateway, account limits, or routing setup.\n\nIf latency spikes, I want to know whether it is model-specific or 
provider-specific.\n\n`latency_ms`\n\nAverage latency is not enough.\n\nI usually want p50, p95, and p99 by model and operation.\n\nA chatbot can feel fine at p50 and awful at p95.\n\n`retry_count`\n\nRetries are sneaky.\n\nThey make reliability look better while quietly increasing latency and cost.\n\nIf a request succeeds after two retries, the user may not see an error, but the system still degraded.\n\n`fallback_from` and `fallback_to`\n\nFallback is great until it hides the original problem.\n\nIf model A fails and model B saves the request, that is useful. But if it happens 30 percent of the time, I need to know.\n\nOtherwise I might think model A is working fine.\n\n`input_tokens` and `output_tokens`\n\nToken usage explains a lot of cost surprises.\n\nWhen a bill jumps, the cause is often not \"the provider got expensive.\" It is more likely:\n\nYou cannot see that from request count alone.\n\n`error_type`\n\nRaw error messages are messy.\n\nOne provider says `rate_limit_exceeded`. Another says `Too many requests`. 
Another gives you a 429 with a different body.\n\nI normalize errors into categories:\n\n``` js\nconst errorTypes = [\n  \"auth_or_permission\",\n  \"invalid_request\",\n  \"rate_limit\",\n  \"request_too_large\",\n  \"context_length\",\n  \"content_filter\",\n  \"provider_5xx\",\n  \"service_unavailable\",\n  \"upstream_timeout\",\n  \"stream_interrupted\",\n  \"unknown\"\n];\n```\n\nThis makes dashboards and alerts much easier.\n\nThe worst failures are not always exceptions.\n\nThese are the ones I try to catch with logs and metrics:\n\nA provider starts returning intermittent 429s, 503s, or 504s.\n\nYour retry logic hides it.\n\nThe app still works, but latency doubles and costs rise.\n\nWatch:\n\nFallback should be the backup plan.\n\nIf fallback becomes normal, you may have a provider issue, a bad timeout setting, or a model that is no longer suitable.\n\nWatch:\n\nThis is when prompts slowly get larger over time.\n\nMaybe you added more retrieved documents. Maybe the system prompt grew. Maybe conversation history is not being trimmed.\n\nNothing breaks immediately. The bill just gets heavier.\n\nWatch:\n\nStreaming can fail differently from normal responses.\n\nSometimes the first tokens arrive, then the stream stops. 
If you only log the initial request success, you miss the failure.\n\nWatch:\n\nThis happens when config changes, environment variables drift, or a gateway route points somewhere unexpected.\n\nThe app asks for one model, but production traffic goes somewhere else.\n\nWatch:\n\nAfter a few rounds, my log event usually grows into something like this:\n\n```\n{\n  \"event\": \"llm_request\",\n  \"timestamp\": \"2026-06-26T08:30:00.000Z\",\n  \"request_id\": \"req_abc\",\n  \"user_id_hash\": \"user_91ab\",\n  \"environment\": \"production\",\n  \"feature\": \"support_reply_generator\",\n  \"provider\": \"tokenbay\",\n  \"model\": \"gpt-4.1-mini\",\n  \"operation\": \"chat_completion\",\n  \"streaming\": false,\n  \"status\": \"success\",\n  \"latency_ms\": 1842,\n  \"retry_count\": 1,\n  \"fallback_from\": null,\n  \"fallback_to\": null,\n  \"input_tokens\": 812,\n  \"output_tokens\": 244,\n  \"estimated_cost_usd\": 0.0019,\n  \"prompt_hash\": \"a3f9c01de81b7a22\",\n  \"error_type\": null\n}\n```\n\nThis is not fancy observability.\n\nIt is just enough structure to answer practical questions.\n\nWhich feature got slower?\n\nWhich model is causing errors?\n\nDid fallback save us or hide a bigger issue?\n\nDid the cost increase because of traffic, tokens, retries, or model choice?\n\nDisclosure: I work on [TokenBay](https://www.tokenbay.com/?utm_source=devto&utm_medium=community_content&utm_campaign=week1_free_content), so I am biased here.\n\nOne reason I care about this logging shape is that TokenBay is built around using multiple AI models through one OpenAI-compatible API.\n\nThat makes it convenient to switch between models, but it also makes observability more important.\n\nTokenBay can show usage details at the API layer. 
I still want my own application logs because my app knows things the API layer cannot always know:\n\nThe more flexible your model setup becomes, the more important boring logs become.\n\nFor every production LLM call, I want enough information to debug four questions:\n\nIf my logs cannot answer those, I am probably flying blind.\n\nThe annoying part is that you usually do not notice this on day one.\n\nYou notice it later, when something is already weird and your only log line says:\n\n```\nLLM request completed\n```\n\nAsk me how I know.", "url": "https://wpnews.pro/news/my-llm-api-calls-were-failing-silently-here-s-the-logging-setup-i-wish-i-had", "canonical_source": "https://dev.to/plasma_01/my-llm-api-calls-were-failing-silently-heres-the-logging-setup-i-wish-i-had-earlier-507o", "published_at": "2026-06-26 05:53:05+00:00", "updated_at": "2026-06-26 06:03:59.261105+00:00", "lang": "en", "topics": ["large-language-models", "developer-tools", "mlops"], "entities": ["OpenAI", "TokenBay"], "alternates": {"html": "https://wpnews.pro/news/my-llm-api-calls-were-failing-silently-here-s-the-logging-setup-i-wish-i-had", "markdown": "https://wpnews.pro/news/my-llm-api-calls-were-failing-silently-here-s-the-logging-setup-i-wish-i-had.md", "text": "https://wpnews.pro/news/my-llm-api-calls-were-failing-silently-here-s-the-logging-setup-i-wish-i-had.txt", "jsonld": "https://wpnews.pro/news/my-llm-api-calls-were-failing-silently-here-s-the-logging-setup-i-wish-i-had.jsonld"}}