# The Complete Guide to AI Application Observability: Monitoring AI in Production in 2026

In 2026, AI applications are widely deployed in production, but they bring their own challenges: unstable model output, high latency, and unpredictable costs, none of which traditional application performance monitoring (APM) was designed to handle. This article covers the core methods of AI application observability: logging, metrics (token consumption, latency, cost), tracing, and evaluation, with Python examples for tracking latency, tracking token usage, and classifying AI-specific errors.

## Preface

In 2026, AI applications are widely used in production. They differ from conventional services in three ways: model output is unstable, latency is high, and cost is hard to predict.

Traditional application monitoring (APM) does not cover these needs. This article introduces the core methods of AI application observability.

## What Is AI Observability

### Traditional Monitoring vs. AI Monitoring

| Dimension | Traditional monitoring | AI monitoring |
|------|---------|---------|
| Latency | HTTP request duration | API call + model inference time |
| Error rate | 4xx/5xx status codes | Refusals, hallucinations, malformed output |
| Cost | Fixed cloud resources | Fluctuating token consumption |
| Quality | Precisely measurable | Requires separate evaluation |

### The Four Pillars of AI Observability

```
AI observability
├── Logging (AI request logs)
├── Metrics (token consumption, latency, cost)
├── Tracing (AI call-chain tracing)
└── Evaluation (output quality evaluation)
```

## Core Metrics

### 1. Latency Metrics

```python
import asyncio
import time
from functools import wraps


class AILatencyTracker:
    def __init__(self):
        self.latencies = []

    def track(self, func):
        """Decorator that records latency for sync and async functions."""
        @wraps(func)
        async def async_wrapper(*args, **kwargs):
            start = time.time()
            try:
                result = await func(*args, **kwargs)
                self.record("success", time.time() - start)
                return result
            except Exception:
                self.record("error", time.time() - start)
                raise  # record the failure, then let the caller handle it

        @wraps(func)
        def sync_wrapper(*args, **kwargs):
            start = time.time()
            try:
                result = func(*args, **kwargs)
                self.record("success", time.time() - start)
                return result
            except Exception:
                self.record("error", time.time() - start)
                raise

        if asyncio.iscoroutinefunction(func):
            return async_wrapper
        return sync_wrapper

    def record(self, status: str, latency: float):
        self.latencies.append({
            "timestamp": time.time(),
            "status": status,
            "latency_ms": latency * 1000,  # seconds -> milliseconds
        })

    def get_stats(self) -> dict:
        """Return count, average, and P50/P95/P99 latency."""
        if not self.latencies:
            return {}

        latencies = sorted(l["latency_ms"] for l in self.latencies)
        return {
            "count": len(latencies),
            "avg_ms": sum(latencies) / len(latencies),
            "p50_ms": latencies[len(latencies) // 2],
            "p95_ms": latencies[int(len(latencies) * 0.95)],
            "p99_ms": latencies[int(len(latencies) * 0.99)],
        }
```

### 2. Token Consumption Metrics

```python
import time


class TokenTracker:
    def __init__(self):
        self.records = []
        self.total_input_tokens = 0
        self.total_output_tokens = 0

    def record(self, model: str, input_tokens: int, output_tokens: int, cost: float):
        """Record token consumption for one request."""
        self.total_input_tokens += input_tokens
        self.total_output_tokens += output_tokens

        self.records.append({
            "timestamp": time.time(),
            "model": model,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "total_tokens": input_tokens + output_tokens,
            "cost": cost,
        })

    def get_daily_cost(self) -> dict:
        """Return cost over the last 24 hours (a rolling window, not the calendar day)."""
        cutoff = time.time() - 86400  # 24 hours ago
        recent = [r for r in self.records if r["timestamp"] > cutoff]

        total_cost = sum(r["cost"] for r in recent)
        total_tokens = sum(r["total_tokens"] for r in recent)

        return {
            "cost_today": total_cost,
            "tokens_today": total_tokens,
            "avg_cost_per_request": total_cost / len(recent) if recent else 0,
        }

    def get_model_breakdown(self) -> dict:
        """Aggregate cost, tokens, and request count per model."""
        breakdown = {}
        for r in self.records:
            model = r["model"]
            if model not in breakdown:
                breakdown[model] = {"cost": 0, "tokens": 0, "count": 0}
            breakdown[model]["cost"] += r["cost"]
            breakdown[model]["tokens"] += r["total_tokens"]
            breakdown[model]["count"] += 1
        return breakdown
```

### 3. Error Classification

```python
class AIErrorClassifier:
    ERROR_TYPES = {
        "rate_limit": {"retry": True, "severity": "medium"},
        "auth_error": {"retry": False, "severity": "high"},
        "model_error": {"retry": True, "severity": "medium"},
        "timeout": {"retry": True, "severity": "low"},
        "invalid_request": {"retry": False, "severity": "high"},
        "content_filtered": {"retry": False, "severity": "medium"},
    }

    @classmethod
    def classify(cls, error: Exception) -> dict:
        """Classify an error by matching substrings in its message.

        This is a heuristic: the first matching branch wins, and short
        substrings such as "auth" or "content" can match unrelated messages.
        Prefer the provider's typed exceptions or status codes when available.
        """
        error_str = str(error).lower()

        if "429" in error_str or "rate limit" in error_str:
            return {"type": "rate_limit", **cls.ERROR_TYPES["rate_limit"]}
        elif "401" in error_str or "auth" in error_str:
            return {"type": "auth_error", **cls.ERROR_TYPES["auth_error"]}
        elif "500" in error_str or "internal" in error_str:
            return {"type": "model_error", **cls.ERROR_TYPES["model_error"]}
        elif "timeout" in error_str:
            return {"type": "timeout", **cls.ERROR_TYPES["timeout"]}
        elif "400" in error_str or "invalid" in error_str:
            return {"type": "invalid_request", **cls.ERROR_TYPES["invalid_request"]}
        elif "filtered" in error_str or "content" in error_str:
            return {"type": "content_filtered", **cls.ERROR_TYPES["content_filtered"]}

        return {"type": "unknown", "retry": False, "severity": "high"}

    @classmethod
    def should_retry(cls, error: Exception) -> bool:
        """Return True if the error is worth retrying."""
        classification = cls.classify(error)
        return classification.get("retry", False)
```

## Logging

### Structured AI Logs

```python
import json
import logging
from datetime import datetime, timezone
from typing import Optional


class AILogger:
    def __init__(self, log_file: str = "ai_logs.jsonl"):
        self.log_file = log_file
        self.logger = logging.getLogger("ai")
        self.logger.setLevel(logging.INFO)

        # One JSON object per line; UTF-8 because entries are written with ensure_ascii=False
        handler = logging.FileHandler(log_file, encoding="utf-8")
        handler.setFormatter(logging.Formatter('%(message)s'))
        self.logger.addHandler(handler)

    def log_request(
        self,
        request_id: str,
        model: str,
        prompt: str,
        response: Optional[str] = None,
        latency_ms: Optional[float] = None,
        tokens_used: Optional[int] = None,
        cost: Optional[float] = None,
        error: Optional[str] = None,
    ):
        """Log one AI request (lengths only, not the prompt or response text)."""
        log_entry = {
            # Timezone-aware UTC timestamp; datetime.utcnow() is deprecated since Python 3.12
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "type": "ai_request",
            "request_id": request_id,
            "model": model,
            "prompt_length": len(prompt),
            "response_length": len(response) if response else None,
            "latency_ms": latency_ms,
            "tokens_used": tokens_used,
            "cost": cost,
            "error": error,
            "success": error is None,
        }
        self.logger.info(json.dumps(log_entry, ensure_ascii=False))

    def log_evaluation(self, request_id: str, quality_score: float, categories: dict):
        """Log a quality evaluation result."""
        log_entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "type": "quality_evaluation",
            "request_id": request_id,
            "quality_score": quality_score,
            "categories": categories,
        }
        self.logger.info(json.dumps(log_entry, ensure_ascii=False))


# Usage
ai_logger = AILogger("ai_production_logs.jsonl")
ai_logger.log_request(
    request_id="req_001",
    model="gpt-5.4",
    prompt="Explain what machine learning is",
    response="Machine learning is...",
    latency_ms=250,
    tokens_used=1500,
)
```

### Log Analysis Queries

```python
import json
import time
from datetime import datetime


class LogAnalyzer:
    def __init__(self, log_file: str):
        self.log_file = log_file

    def load_logs(self, limit: int = None):
        logs = []
        with open(self.log_file, 'r', encoding='utf-8') as f:
            for i, line in enumerate(f):
                if limit and i >= limit:
                    break
                logs.append(json.loads(line))
        return logs

    def get_error_rate(self, hours: int = 24) -> float:
        """Compute the error rate over the last `hours` hours."""
        cutoff = time.time() - hours * 3600
        # Only request entries carry a "success" field; skip evaluation entries
        recent = [
            l for l in self.load_logs()
            if l.get("type") == "ai_request"
            and datetime.fromisoformat(l["timestamp"]).timestamp() > cutoff
        ]

        if not recent:
            return 0.0

        errors = sum(1 for l in recent if not l.get("success", True))
        return errors / len(recent)

    def get_expensive_requests(self, top_n: int = 10) -> list:
        """Return the most expensive requests."""
        logs = self.load_logs()
        sorted_logs = sorted(
            [l for l in logs if l.get("cost")],
            key=lambda x: x.get("cost", 0),
            reverse=True,
        )
        return sorted_logs[:top_n]

    def get_slow_requests(self, threshold_ms: float = 5000) -> list:
        """Return requests slower than the threshold."""
        logs = self.load_logs()
        # latency_ms may be logged as null, so fall back to 0 before comparing
        return [l for l in logs if (l.get("latency_ms") or 0) > threshold_ms]
```

## Tracing

### LangChain + OpenTelemetry

```python
import time

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Set up tracing
provider = TracerProvider()
processor = BatchSpanProcessor(ConsoleSpanExporter())
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)


class AIServiceWithTracing:
    def __init__(self):
        # OpenAI and VectorDB stand in for your own LLM and vector-store clients
        self.llm = OpenAI()
        self.vector_db = VectorDB()

    async def process_request(self, user_input: str, user_id: str):
        # A `with` block keeps the span open for the whole coroutine on any SDK version
        with tracer.start_as_current_span("ai_request") as span:
            span.set_attribute("user_id", user_id)
            span.set_attribute("input_length", len(user_input))

            # 1. Retrieve relevant documents
            with tracer.start_as_current_span("retrieve_context") as retrieve_span:
                docs = self.vector_db.search(user_input)
                retrieve_span.set_attribute("docs_retrieved", len(docs))

            # 2. Call the LLM
            with tracer.start_as_current_span("llm_call") as llm_span:
                start = time.time()
                try:
                    response = self.llm.generate(user_input, docs)
                    llm_span.set_attribute("model", "gpt-5.4")
                    llm_span.set_attribute("latency_ms", (time.time() - start) * 1000)
                    llm_span.set_attribute("response_length", len(response))
                    llm_span.set_attribute("success", True)
                    return response
                except Exception as e:
                    llm_span.set_attribute("success", False)
                    llm_span.set_attribute("error", str(e))
                    raise
```

## Output Quality Evaluation

### Automated Quality Evaluation

```python
import json


class AIOutputEvaluator:
    def __init__(self):
        self.llm = OpenAI()  # stand-in for your own LLM client

    def evaluate(self, prompt: str, response: str) -> dict:
        """Score an output with an LLM judge."""
        evaluation_prompt = f"""
Evaluate the quality of the following AI output:

User input: {prompt}
AI output: {response}

Dimensions (1-5 points each):
1. Relevance: is the output relevant to the question
2. Accuracy: is the information correct
3. Completeness: does it fully answer the question
4. Clarity: is it clear and easy to read
5. Safety: does it contain inappropriate content

Return JSON in this format:
{{
    "relevance": 4,
    "accuracy": 5,
    "completeness": 4,
    "clarity": 5,
    "safety": 5,
    "overall_score": 4.6,
    "issues": ["issue 1", "issue 2"],
    "suggestions": ["suggestion 1", "suggestion 2"]
}}
"""
        result = self.llm.generate(evaluation_prompt)
        try:
            return json.loads(result)
        except json.JSONDecodeError:
            return {"error": "Failed to parse evaluation", "raw": result}

    def batch_evaluate(self, requests: list) -> list:
        """Evaluate a batch of requests."""
        results = []
        for req in requests:
            evaluation = self.evaluate(req["prompt"], req["response"])
            results.append({"request_id": req["id"], **evaluation})
        return results

    def detect_hallucination(self, response: str, context: str) -> dict:
        """Check whether a response contains facts not supported by the context."""
        detection_prompt = f"""
Check whether the following answer contains hallucinations (fabricated information):

Context/background: {context}
AI answer: {response}

Check:
1. Are there specific facts (names, dates, numbers) that need verification
2. Do these facts appear in the context
3. Is there any obviously fabricated content

Return JSON in this format:
{{
    "has_hallucination": true/false,
    "confidence": 0.85,
    "risky_content": ["specific suspicious content"],
    "reason": "reason for the judgment"
}}
"""
        result = self.llm.generate(detection_prompt)
        try:
            return json.loads(result)
        except json.JSONDecodeError:
            # Unparseable judge output: report "no hallucination" with zero confidence
            return {"has_hallucination": False, "confidence": 0}
```

## Prometheus Monitoring

### Exporting Metrics

```python
import time

# Assumes a FastAPI app; the middleware and route decorators below follow its API
from fastapi import FastAPI, Request, Response
from prometheus_client import (
    CONTENT_TYPE_LATEST, Counter, Gauge, Histogram, generate_latest,
)

app = FastAPI()

# Metric definitions
REQUEST_COUNT = Counter(
    'ai_requests_total',
    'Total AI requests',
    ['model', 'status'],
)

# Default buckets top out at 10 s; add larger buckets if you alert on P95 > 10 s
REQUEST_LATENCY = Histogram(
    'ai_request_latency_seconds',
    'AI request latency',
    ['model'],
)

TOKEN_USAGE = Counter(
    'ai_tokens_used_total',
    'Total tokens used',
    ['model', 'type'],  # type: input/output
)

COST_USAGE = Counter(
    'ai_cost_total',
    'Total API cost',
    ['model'],
)

ACTIVE_REQUESTS = Gauge(
    'ai_active_requests',
    'Number of active requests',
    ['model'],
)


@app.middleware("http")
async def track_requests(request: Request, call_next):
    model = request.headers.get("X-Model", "unknown")
    ACTIVE_REQUESTS.labels(model=model).inc()

    start = time.time()
    try:
        response = await call_next(request)
    finally:
        # Decrement even if the handler raises, so the gauge does not drift upward
        ACTIVE_REQUESTS.labels(model=model).dec()
    latency = time.time() - start

    REQUEST_COUNT.labels(model=model, status=response.status_code).inc()
    REQUEST_LATENCY.labels(model=model).observe(latency)

    return response


@app.get("/metrics")
def metrics():
    return Response(content=generate_latest(), media_type=CONTENT_TYPE_LATEST)
```

## Alerting

### Key Alert Rules

```yaml
# Prometheus alerting rules file (referenced from rule_files in prometheus.yml)
groups:
  - name: ai_application
    rules:
      - alert: HighAIErrorRate
        # The middleware above labels requests with the HTTP status code,
        # so server errors are matched as 5xx rather than status="error"
        expr: |
          sum(rate(ai_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(ai_requests_total[5m])) > 0.05
        labels:
          severity: critical
        annotations:
          summary: "AI request error rate above 5%"

      - alert: HighAILatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(ai_request_latency_seconds_bucket[5m])) by (le)
          ) > 10
        labels:
          severity: warning
        annotations:
          summary: "AI request P95 latency above 10 seconds"

      - alert: HighAICost
        expr: |
          increase(ai_cost_total[1h]) > 100
        labels:
          severity: warning
        annotations:
          summary: "AI API cost grew by more than $100 in one hour"

      - alert: AIRateLimit
        expr: |
          increase(ai_requests_total{status="429"}[5m]) > 10
        labels:
          severity: warning
        annotations:
          summary: "AI API rate limiting is occurring frequently"
```

## Grafana Dashboard

### Key Panels

```
┌─────────────────────────────────────────────────────────────┐
│                   AI Application Dashboard                  │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐          │
│  │  Requests   │  │ Error Rate  │  │ Avg Latency │          │
│  │   12,345    │  │    2.3%     │  │    1.2s     │          │
│  └─────────────┘  └─────────────┘  └─────────────┘          │
│                                                             │
│  ┌─────────────────────────────────────────────────────┐    │
│  │  Token Usage Over Time                              │    │
│  │  ████████████████░░░░░░░░░░░░░░░░░░                 │    │
│  └─────────────────────────────────────────────────────┘    │
│                                                             │
│  ┌─────────────────────────────────────────────────────┐    │
│  │  Cost by Model                                      │    │
│  │  GPT-5.4: $45.2  67%                                │    │
│  │  Claude:  $22.1  33%                                │    │
│  └─────────────────────────────────────────────────────┘    │
│                                                             │
│  ┌─────────────────────────────────────────────────────┐    │
│  │  Quality Score Distribution                         │    │
│  │  ██████████████████████████░░░░░░░░░░░░             │    │
│  └─────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────┘
```

## Best Practices

### 1. Data Sampling

```python
class SamplingLogger:
    """Sampled logging, to keep storage costs down."""

    SAMPLE_RATE = 0.1  # 10% sampling

    def __init__(self):
        self.full_logger = AILogger()
        self.sample_count = 0

    def should_log(self) -> bool:
        """Return True for every Nth call, where N = 1 / SAMPLE_RATE."""
        self.sample_count += 1
        every_n = max(1, round(1 / self.SAMPLE_RATE))
        return self.sample_count % every_n == 0

    def log(self, entry: dict):
        # entry holds the keyword arguments of AILogger.log_request
        if self.should_log():
            self.full_logger.log_request(**entry)
```

### 2. Cost Alerts

```python
class CostAlert:
    def __init__(self, threshold_daily: float = 100, token_tracker: TokenTracker = None):
        self.threshold_daily = threshold_daily
        # Pass in the tracker that records production traffic; a fresh one has no data
        self.token_tracker = token_tracker or TokenTracker()

    def check_and_alert(self):
        """Check the rolling 24-hour cost and return an alert payload."""
        daily = self.token_tracker.get_daily_cost()

        if daily["cost_today"] > self.threshold_daily:
            return {
                "alert": True,
                "message": f"Today's AI cost ${daily['cost_today']:.2f} exceeds the threshold ${self.threshold_daily}",
                "action": "review_recent_requests",
            }

        return {"alert": False}
```

## Summary

AI application observability is a requirement for production:

- **Latency tracking**: P50/P95/P99 latency metrics
- **Token consumption**: cost analysis by model and over time
- **Error classification**: separating retryable from non-retryable errors
- **Quality evaluation**: automated output scoring and hallucination detection
- **Alerting**: alerts on error rate, latency, and cost

Without observability, there is no production governance for AI applications.

*This article is part of the AI engineering series.*
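The summary lists P50/P95/P99 latency as a core metric. As a minimal sketch of one way to compute those percentiles from raw samples, the helper below uses the nearest-rank method: the P-th percentile is the smallest sample with at least P% of all samples at or below it. The names `percentile` and `latency_summary` and the sample values are illustrative assumptions, not part of the article's tracker classes.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers.

    Hypothetical helper for illustration; pct is a number in (0, 100].
    """
    if not values:
        raise ValueError("percentile() needs at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(values)
    # 1-based rank of the smallest value covering pct% of the samples
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def latency_summary(latencies_ms):
    """Summarize latency samples (milliseconds) as count plus P50/P95/P99."""
    return {
        "count": len(latencies_ms),
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
        "p99_ms": percentile(latencies_ms, 99),
    }


# Made-up samples: nine fast requests and one 2-second outlier
samples_ms = [120, 80, 300, 95, 2000, 150, 110, 90, 105, 130]
print(latency_summary(samples_ms))
# {'count': 10, 'p50_ms': 110, 'p95_ms': 2000, 'p99_ms': 2000}
```

With only ten samples, P95 and P99 both land on the single outlier, which is a reminder that tail percentiles are only meaningful once enough requests have been recorded.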