Post 3 of "Scanning Open Source." So far: Dub hides a fraud engine. Inbox Zero has prompt injection defense. The pattern: every project is architecturally bigger than its tagline.
Today: Langfuse, an open source LLM observability platform. YC W23. 8K+ stars.
$ npx anatomia-cli scan .
langfuse web-app
TypeScript · Next.js · Prisma → PostgreSQL (65 models) · 7 packages
Stack
─────
Language TypeScript
Framework Next.js
Database Prisma → PostgreSQL (65 models)
Auth NextAuth
AI LangChain
Payments Stripe
Testing Vitest, Playwright, Testing Library
UI shadcn/ui (Tailwind)
Services AWS S3 · Nodemailer · Sentry · PostHog · tRPC (+6 more)
Deploy Docker · GitHub Actions
Workspace Turborepo (pnpm)
Surfaces
────────
web Next.js · Vitest
worker TypeScript · Vitest
⚠ ~75 of 93 API route files may lack input validation
5 seconds. Two surfaces: a web app and a worker. The validation warning needs context: Langfuse uses tRPC extensively, where validation happens via .input() schemas in the router layer. The scanner checks file-level imports and may not detect middleware-based validation. Here's what I found when I pulled threads.
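To see why a file-level check can over-report here, consider a minimal sketch of that kind of heuristic. This is not anatomia's actual implementation: VALIDATION_LIBS, importsValidationLib, and both sample files are hypothetical. The route file only mounts a router defined elsewhere, so a look at its own imports finds no validation library, even if the procedures behind it declare .input() schemas.

```typescript
// Hypothetical heuristic: a file "has validation" only if it directly
// imports a known validation library. Not the scanner's real logic.
const VALIDATION_LIBS = ["zod", "yup", "joi", "valibot"];

function importsValidationLib(source: string): boolean {
  // Capture the package name in `from "pkg"` or `from 'pkg'`.
  const importPattern = /from\s+["']([^"']+)["']/g;
  let match: RegExpExecArray | null;
  while ((match = importPattern.exec(source)) !== null) {
    const pkg = match[1];
    if (pkg !== undefined && VALIDATION_LIBS.indexOf(pkg) !== -1) {
      return true;
    }
  }
  return false;
}

// Hypothetical route file: it only forwards requests to a router.
const routeFile = `
import { appRouter } from "@/server/api/root";
export default createHandler({ router: appRouter });
`;

// Hypothetical router file: the schema sits next to the procedure.
const routerFile = `
import { z } from "zod";
export const scoresRouter = router({
  byId: procedure.input(z.object({ id: z.string() })).query(handler),
});
`;

console.log(importsValidationLib(routeFile));  // false: would be flagged
console.log(importsValidationLib(routerFile)); // true
```

Under this assumption, the route file gets flagged while the validation actually lives one import away, which is the gap the warning text hedges with "may".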
This is the finding that made me stop and reread the code.
Langfuse uses LangChain internally to power features like the playground (where users test prompts against different models) and LLM-as-judge evaluations. The scan detected AI: LangChain, but the interesting part isn't that they use LangChain. It's HOW they trace those calls.
In getInternalTracingHandler.ts, Langfuse creates a callback handler using langfuse-langchain, their own open source LangChain integration package. Every internal LLM call flows through processEventBatch, the same ingestion pipeline that handles customer traces. The observability tool is observing itself.
This isn't debugging. It's architectural dogfooding. The team's own LLM usage generates production traces through the same pipeline their customers use. If tracing broke, they would likely see it on their own dashboard before any customer reported it.
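The shape of that pattern can be sketched in a few lines. This is an illustration under assumptions, not Langfuse's code: only the name processEventBatch comes from the codebase, while its signature, the TraceEventSketch shape, and withInternalTracing are hypothetical stand-ins.

```typescript
// Hypothetical event shape; the real ingestion schema is not shown here.
interface TraceEventSketch {
  source: "customer" | "internal";
  name: string;
  output: string;
}

// Stand-in for the shared ingestion entry point. The name is from the
// codebase; this signature and in-memory store are assumptions.
const ingested: TraceEventSketch[] = [];
function processEventBatch(batch: TraceEventSketch[]): void {
  ingested.push(...batch);
}

// Hypothetical internal wrapper: runs an LLM call and reports it through
// the same function that customer traffic goes through.
function withInternalTracing(name: string, call: () => string): string {
  const output = call();
  processEventBatch([{ source: "internal", name, output }]);
  return output;
}

// A customer trace and an internal playground call share one pipeline.
processEventBatch([
  { source: "customer", name: "chat-completion", output: "hi" },
]);
const reply = withInternalTracing("playground-run", () => "stubbed model reply");

console.log(reply, ingested.map((e) => e.source));
```

The point of the sketch is that there is no second code path for internal calls: anything that breaks ingestion for customers breaks it for the team's own traces too.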
The scan detected LangChain as the AI SDK. When I traced the imports in fetchLLMCompletion.ts, I found six providers wired up:
import { ChatAnthropic } from "@langchain/anthropic";
import { ChatVertexAI } from "@langchain/google-vertexai";
import { ChatBedrockConverse } from "@langchain/aws";
import { ChatGoogleGenerativeAI } from "@langchain/google-genai";
import { ChatOpenAI, AzureChatOpenAI } from "@langchain/openai";
Anthropic, Google Vertex, AWS Bedrock, Google Generative AI, OpenAI, and Azure OpenAI, all through LangChain as a unified interface. This powers the playground where users can test prompts across different models and the evaluation system where LLMs judge other LLMs' outputs.
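What "unified interface" buys you is easiest to see in a stripped-down sketch. Everything here is hypothetical: ChatModelSketch, ProviderId, stubModel, and complete are illustration names, and the stubs stand in for the real LangChain classes imported above, which expose a much larger API.

```typescript
// Hypothetical common interface; the real chat-model classes do far more.
interface ChatModelSketch {
  invoke(prompt: string): string;
}

type ProviderId =
  | "anthropic"
  | "google-vertex"
  | "bedrock"
  | "google-genai"
  | "openai"
  | "azure-openai";

// Stub factory standing in for constructors such as ChatAnthropic or ChatOpenAI.
function stubModel(provider: ProviderId): ChatModelSketch {
  return { invoke: (prompt) => `[${provider}] ${prompt}` };
}

// One registry entry per provider; callers never branch on the provider.
const PROVIDERS: Record<ProviderId, () => ChatModelSketch> = {
  "anthropic": () => stubModel("anthropic"),
  "google-vertex": () => stubModel("google-vertex"),
  "bedrock": () => stubModel("bedrock"),
  "google-genai": () => stubModel("google-genai"),
  "openai": () => stubModel("openai"),
  "azure-openai": () => stubModel("azure-openai"),
};

function complete(provider: ProviderId, prompt: string): string {
  return PROVIDERS[provider]().invoke(prompt);
}

console.log(complete("openai", "Say hello")); // [openai] Say hello
```

With this shape, a playground or an evaluator only needs a provider id and a prompt, and adding a seventh provider is one more registry entry.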
The scan detected two surfaces: web and worker. The worker has 253 source files and 24 separate queue processors: ingestion, evaluations, experiments, batch exports, data retention, integrations (PostHog, Mixpanel), OpenTelemetry ingestion, and more. Langfuse processes traces asynchronously: the web app accepts data; the worker processes, aggregates, evaluates, and routes it. The separation means trace ingestion never blocks the dashboard.
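A toy version of that split makes the "never blocks" claim concrete. This is a sketch, not the real worker: the in-memory array stands in for whatever queue backend Langfuse actually uses, and QueueName, acceptFromWeb, processors, and drainWorker are hypothetical names covering three of the 24 processors.

```typescript
// Hypothetical queue names; the real worker has 24 processors.
type QueueName = "ingestion" | "evaluation" | "batch-export";

interface QueueJobSketch {
  queue: QueueName;
  payload: string;
}

// In-memory stand-in for the real queue backend (an assumption).
const pendingJobs: QueueJobSketch[] = [];
const processedLog: string[] = [];

// Web side: accept the request, enqueue it, and return immediately.
function acceptFromWeb(queue: QueueName, payload: string): { accepted: true } {
  pendingJobs.push({ queue, payload });
  return { accepted: true };
}

// Worker side: one processor per queue.
const processors: Record<QueueName, (payload: string) => void> = {
  "ingestion": (payload) => { processedLog.push(`ingested ${payload}`); },
  "evaluation": (payload) => { processedLog.push(`evaluated ${payload}`); },
  "batch-export": (payload) => { processedLog.push(`exported ${payload}`); },
};

// Pull jobs off the queue in order and route each to its processor.
function drainWorker(): number {
  let handled = 0;
  let job = pendingJobs.shift();
  while (job !== undefined) {
    processors[job.queue](job.payload);
    handled += 1;
    job = pendingJobs.shift();
  }
  return handled;
}

acceptFromWeb("ingestion", "trace-1");
acceptFromWeb("evaluation", "trace-1");
console.log(processedLog.length); // 0: accepted, but nothing processed yet

const handledCount = drainWorker();
console.log(handledCount, processedLog);
```

The web-side function does no processing at all, so its latency is independent of how slow an evaluation or export happens to be.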
26 TypeScript files in web/src/features/mcp/. Langfuse ships a Model Context Protocol server: you can manage prompts and query observation data directly from Claude Code or any MCP-compatible tool. Create a prompt, version it, label it, without leaving your editor. If you use Langfuse for prompt management AND Claude Code for development, this closes the loop between the two.
The model count alone isn't the story. It's what the models ARE:
Core tracing: traces, observations, sessions, media attachments
Evaluation: eval templates, job configurations, job executions, score configs
Human review: annotation queues, queue items, queue assignments
Prompt management: prompts, prompt dependencies, protected labels, LLM schemas, LLM tools
Automation: automations, triggers, actions, automation executions, monitors
Integrations: PostHog, Mixpanel, Slack, blob storage, each with its own model
The annotation queue system is worth noting. It's a human-in-the-loop review workflow: assign traces to reviewers, score them against configurable criteria, track completion. That's the bridge between "the AI said this" and "a human confirmed this was correct." Most observability tools stop at dashboards. Langfuse has a structured process for human judgment on AI output.
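The assign, score, and track-completion workflow can be sketched as plain data plus three functions. The shapes below are hypothetical and only loosely follow the model names listed above; QueueItemSketch, SCORE_RANGE, and the function names are not taken from the Prisma schema.

```typescript
// Hypothetical queue item; not the actual Prisma model.
interface QueueItemSketch {
  traceId: string;
  reviewer: string | null;
  score: number | null;
}

// Stand-in for a configurable score criterion (an assumption).
const SCORE_RANGE = { min: 0, max: 1 };

function assignReviewer(item: QueueItemSketch, reviewer: string): QueueItemSketch {
  return { ...item, reviewer };
}

function recordScore(item: QueueItemSketch, score: number): QueueItemSketch {
  if (item.reviewer === null) {
    throw new Error("item has no reviewer assigned");
  }
  if (score < SCORE_RANGE.min || score > SCORE_RANGE.max) {
    throw new Error("score outside configured range");
  }
  return { ...item, score };
}

// Share of items that a human has scored.
function completionRate(items: QueueItemSketch[]): number {
  if (items.length === 0) return 0;
  const done = items.filter((item) => item.score !== null).length;
  return done / items.length;
}

const firstItem: QueueItemSketch = { traceId: "trace-1", reviewer: null, score: null };
const secondItem: QueueItemSketch = { traceId: "trace-2", reviewer: null, score: null };

const reviewedFirst = recordScore(assignReviewer(firstItem, "alice"), 1);
console.log(completionRate([reviewedFirst, secondItem])); // 0.5
```

Even in this reduced form, the score is tied to a named reviewer and a configured range, which is what separates a review workflow from a free-text comment on a trace.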
The self-tracing pattern is the thread that ties everything together. Langfuse runs LLM calls for the playground and evaluations. Those calls flow through their own ingestion pipeline, processed by their own worker queues, visible on their own dashboard. If you're evaluating Langfuse as an observability platform, the fact that they trust their own product with their own AI workload is the strongest signal in the codebase.
The annotation queue system is the second finding worth noting. Where most observability tools stop at dashboards, Langfuse has structured the bridge between "the AI said this" and "a human confirmed this was correct."
Post 3 of "Scanning Open Source." Tomorrow: Formbricks.
npx anatomia-cli scan .
GitHub