I scanned Langfuse. It observes its own LLM calls through its own platform.

Langfuse, an open-source LLM observability platform, uses its own product to trace its internal large language model calls. A code scan revealed that the platform's internal LangChain calls flow through the same `processEventBatch` ingestion pipeline as customer traces, creating a self-observing system. The architecture includes a web app and a worker with 24 queue processors, and ships a Model Context Protocol server for prompt management.

Post 3 of "Scanning Open Source." So far: Dub hides a fraud engine (https://dev.to/link). Inbox Zero has prompt injection defense (https://dev.to/link). The pattern: every project is architecturally bigger than its tagline.

Today: Langfuse (https://langfuse.com), an open source LLM observability platform. YC W23. 8K+ stars.

```bash
$ npx anatomia-cli scan .

langfuse   web-app
TypeScript · Next.js · Prisma → PostgreSQL
65 models · 7 packages

Stack
─────
Language    TypeScript
Framework   Next.js
Database    Prisma → PostgreSQL 65 models
Auth        NextAuth
AI          LangChain
Payments    Stripe
Testing     Vitest, Playwright, Testing Library
UI          shadcn/ui Tailwind
Services    AWS S3 · Nodemailer · Sentry · PostHog · tRPC +6 more
Deploy      Docker · GitHub Actions
Workspace   Turborepo pnpm

Surfaces
────────
web      Next.js · Vitest
worker   TypeScript · Vitest

⚠ ~75 of 93 API route files may lack input validation
```

5 seconds. Two surfaces: a web app and a worker.

The validation warning is worth context: Langfuse uses tRPC extensively, where validation happens via `.input` schemas in the router layer. The scanner checks file-level imports and may not detect middleware-based validation.

Here's what I found when I pulled threads.

This is the finding that made me stop and reread the code.

Langfuse uses LangChain internally to power features like the playground (where users test prompts against different models) and LLM-as-judge evaluations. The scan detected `AI: LangChain`, but the interesting part isn't that they use LangChain. It's HOW they trace those calls.

In `getInternalTracingHandler.ts`, Langfuse creates a callback handler using `langfuse-langchain`, their own open source LangChain integration package. Every internal LLM call flows through `processEventBatch`, the same ingestion pipeline that handles customer traces.

The observability tool is observing itself.

This isn't debugging. It's architectural dogfooding. The team's own LLM usage generates production traces through the same pipeline their customers use.
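To make the shape of that pattern concrete, here's a minimal sketch. It is not Langfuse's code: `IngestionPipeline`, `InternalTracingHandler`, `fakeCompletion`, and `tracedCompletion` are names I made up for illustration, and the event shape is an assumption. The real handler comes from `langfuse-langchain` and the real entry point is `processEventBatch`, neither of which is reproduced here.

```typescript
// Hypothetical event shape; Langfuse's real ingestion events are richer.
type IngestionEvent = {
  id: string;
  type: "generation-create";
  projectId: string;
  body: { model: string; input: string; output: string };
};

// Stand-in for the shared ingestion entry point. Customer traces and
// internal traces would both arrive here in the self-tracing pattern.
class IngestionPipeline {
  readonly received: IngestionEvent[] = [];

  processBatch(events: IngestionEvent[]): number {
    this.received.push(...events);
    return events.length;
  }
}

// Callback-style handler: buffers events while calls run, then flushes
// them into the same pipeline that external clients would use.
class InternalTracingHandler {
  private pipeline: IngestionPipeline;
  private projectId: string;
  private buffer: IngestionEvent[] = [];
  private counter = 0;

  constructor(pipeline: IngestionPipeline, projectId: string) {
    this.pipeline = pipeline;
    this.projectId = projectId;
  }

  onLLMEnd(model: string, input: string, output: string): void {
    this.counter += 1;
    this.buffer.push({
      id: `evt-${this.counter}`,
      type: "generation-create",
      projectId: this.projectId,
      body: { model, input, output },
    });
  }

  // Sends buffered events to the pipeline and returns how many were sent.
  flush(): number {
    const sent = this.pipeline.processBatch(this.buffer);
    this.buffer = [];
    return sent;
  }
}

// Fake model call so the sketch runs without network access.
function fakeCompletion(prompt: string): string {
  return `echo: ${prompt}`;
}

// Wraps a completion so every internal call also emits a trace event.
function tracedCompletion(
  handler: InternalTracingHandler,
  model: string,
  prompt: string,
): string {
  const output = fakeCompletion(prompt);
  handler.onLLMEnd(model, prompt, output);
  return output;
}
```

Calling `tracedCompletion(handler, "some-model", "hello")` and then `handler.flush()` leaves one event in `pipeline.received`, tagged with whatever project ID the handler was constructed with. The point of the design is that nothing about the pipeline knows or cares that the event came from inside the house.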
If the tracing breaks, they'd notice on their own dashboard before any customer reports it.

The scan detected LangChain as the AI SDK. When I traced the imports in `fetchLLMCompletion.ts`, six providers are wired up:

```js
import { ChatAnthropic } from "@langchain/anthropic";
import { ChatVertexAI } from "@langchain/google-vertexai";
import { ChatBedrockConverse } from "@langchain/aws";
import { ChatGoogleGenerativeAI } from "@langchain/google-genai";
import { ChatOpenAI, AzureChatOpenAI } from "@langchain/openai";
```

Anthropic, Google Vertex, AWS Bedrock, Google Generative AI, OpenAI, and Azure OpenAI, all through LangChain as a unified interface. This powers the playground, where users can test prompts across different models, and the evaluation system, where LLMs judge other LLMs' outputs.

The scan detected two surfaces: `web` and `worker`. The worker has 253 source files and 24 separate queue processors: ingestion, evaluations, experiments, batch exports, data retention, integrations (PostHog, Mixpanel), OpenTelemetry ingestion, and more.

Langfuse processes traces asynchronously: the web app accepts data; the worker processes, aggregates, evaluates, and routes it. The separation means trace ingestion never blocks the dashboard.

26 TypeScript files in `web/src/features/mcp/`. Langfuse ships a Model Context Protocol server: you can manage prompts and query observation data directly from Claude Code or any MCP-compatible tool. Create a prompt, version it, label it, without leaving your editor.

If you use Langfuse for prompt management AND Claude Code for development, this closes the loop between the two.

The model count alone isn't the story.
It's what the models ARE:

- Core tracing: traces, observations, sessions, media attachments
- Evaluation: eval templates, job configurations, job executions, score configs
- Human review: annotation queues, queue items, queue assignments
- Prompt management: prompts, prompt dependencies, protected labels, LLM schemas, LLM tools
- Automation: automations, triggers, actions, automation executions, monitors
- Integrations: PostHog, Mixpanel, Slack, blob storage, each with its own model

The annotation queue system is worth noting. It's a human-in-the-loop review workflow: assign traces to reviewers, score them against configurable criteria, track completion. That's the bridge between "the AI said this" and "a human confirmed this was correct." Most observability tools stop at dashboards. Langfuse has a structured process for human judgment on AI output.

The self-tracing pattern is the thread that ties everything together. Langfuse runs LLM calls for the playground and evaluations. Those calls flow through their own ingestion pipeline, processed by their own worker queues, visible on their own dashboard. If you're evaluating Langfuse as an observability platform, the fact that they trust their own product with their own AI workload is the strongest signal in the codebase.

The annotation queue system is the second finding worth noting: a human-in-the-loop review workflow where you assign traces to reviewers, score them against configurable criteria, and track completion. Most observability tools stop at dashboards. Langfuse has structured the bridge between "the AI said this" and "a human confirmed this was correct."

Post 3 of "Scanning Open Source." Tomorrow: Formbricks.

`npx anatomia-cli scan .` · GitHub: https://github.com/anatomia-dev/anatomia