{"slug": "llm-observability-with-langsmith-part-2-eval-gates-prompt-versioning-choosing", "title": "LLM Observability with LangSmith — Part 2: Eval Gates, Prompt Versioning & Choosing Your Stack", "summary": "Meera, a GenAI engineer at AcmeAI, addresses prompt regression risks by implementing evaluation gates with LangSmith, building a routing benchmark dataset and correctness evaluator to catch silent failures before deployment. The article also compares LangSmith vs. Langfuse and surveys the 2026 LLM observability tooling landscape.", "body_md": "📚 This is Part 2 of a two-part series. In Part 1, Meera — a GenAI engineer at AcmeAI — shipped a LangGraph support agent and got stopped by three questions from Sanjay, the head of risk. We answered two of them: with LANGSMITH_TRACING=true and a project name, every conversation became a **replayable trace** (question #1 — the Air Canada defense), and a custom BaseCallbackHandler gave compliance a **tamper-evident, PII-redacted, vendor-independent audit log** running alongside LangSmith (question #2 — the two-lane pattern). We also proved the @traceable decorator traces *any* Python function, LangChain or not. If you haven't read it, start there — this part reuses the agent built in Part 1.\n\nOne question is still open, and it’s the sneakiest of the three:\n\n“If somebody tweaks a prompt next quarter, will we catch the regression *before* it ships?”\n\nRight now the answer is no. Let’s fix that — and then zoom all the way out to the industry playbook, the LangSmith-vs-Langfuse decision, and the rest of the 2026 tooling field.\n\nThe failure Sanjay is describing is silent. Someone “improves” the routing prompt — adds one clarifying sentence — and refund questions quietly start routing to *General*, where the responder can’t see billing documents. No error, no alert, just gradually worse answers discovered weeks later through complaints.\n\nCode has had the antidote for decades: regression tests. 
LangSmith brings the same discipline to LLM behavior with three nouns:\n\nMeera builds a routing benchmark, including the trap cases that bite real routers:\n\n``` python\nfrom langsmith import Client\nfrom langsmith.evaluation import evaluate\n\nclient = Client()\nDATASET_NAME = \"acmeai-routing-eval\"\n\nexamples = [\n    {\"inputs\": {\"customer_query\": \"Do you ship to Singapore?\"},\n     \"outputs\": {\"expected_category\": \"General\"}},\n    {\"inputs\": {\"customer_query\": \"My GPU appliance throws CUDA errors after the latest firmware\"},\n     \"outputs\": {\"expected_category\": \"Technical\"}},\n    {\"inputs\": {\"customer_query\": \"Why was my Visa charged twice this month?\"},\n     \"outputs\": {\"expected_category\": \"Billing\"}},\n    {\"inputs\": {\"customer_query\": \"How do I update my saved payment method?\"},\n     \"outputs\": {\"expected_category\": \"Billing\"}},\n    {\"inputs\": {\"customer_query\": \"I want a refund right now, your product is unusable\"},\n     \"outputs\": {\"expected_category\": \"Billing\"}},   # angry, but still Billing - sentiment handles escalation\n    {\"inputs\": {\"customer_query\": \"Does your SDK support Python 3.12?\"},\n     \"outputs\": {\"expected_category\": \"Technical\"}},\n    {\"inputs\": {\"customer_query\": \"What are your weekend support hours?\"},\n     \"outputs\": {\"expected_category\": \"General\"}},\n]\n\n# Idempotent: create the dataset only if it doesn't already exist\ntry:\n    ds = client.read_dataset(dataset_name=DATASET_NAME)\nexcept Exception:\n    ds = client.create_dataset(\n        dataset_name=DATASET_NAME,\n        description=\"Routing-accuracy benchmark for the support router agent.\",\n    )\n    client.create_examples(\n        dataset_id=ds.id,\n        examples=[{\"inputs\": e[\"inputs\"], \"outputs\": e[\"outputs\"]} for e in examples],\n    )\n```\n\nNote that fifth example: *“I want a refund right now, your product is unusable.”* It encodes a real design decision from Part 1 — the **router** should still say Billing; the 
**sentiment** check is what triggers escalation. Once that’s in the dataset, nobody can accidentally “fix” it away.\n\nNext, the target under test and the evaluator. The target wraps just the classification node — unit-testing one decision, not the whole pipeline:\n\n``` python\ndef routing_target(inputs: dict) -> dict:\n    state = categorize_inquiry({\"customer_query\": inputs[\"customer_query\"]})\n    return {\"predicted_category\": state[\"query_category\"]}\n\n\ndef correctness_evaluator(run, example) -> dict:\n    pred = (run.outputs or {}).get(\"predicted_category\")\n    expected = (example.outputs or {}).get(\"expected_category\")\n    return {\n        \"key\": \"routing_correctness\",\n        \"score\": 1.0 if pred == expected else 0.0,\n        \"comment\": f\"pred={pred} expected={expected}\",\n    }\n```\n\nAnd the experiment — one function call:\n\n```\nresults = evaluate(\n    routing_target,\n    data=DATASET_NAME,\n    evaluators=[correctness_evaluator],\n    experiment_prefix=\"router-baseline\",\n    metadata={\"agent\": \"router-v1\"},\n)\n\nView the evaluation results for experiment: 'router-baseline-c7e21a4f' at:\nhttps://smith.langchain.com/o/.../datasets/.../compare?selectedSessions=...\n\n7it [00:11,  1.58s/it]\n```\n\nFollow that link and the dashboard shows the per-example breakdown. Meera’s baseline comes back at **6/7 (0.86)** — and the table tells her exactly which example bled:\n\nThere it is — the exact trap the dataset was built to catch. When the customer is venting *about the product*, the model reads “unusable product” as a general complaint and misses that the actionable intent is a refund. In production this is invisible: the customer still gets *an* answer, just one generated without access to the refund-policy documents. In an experiment, it’s a red cell with a comment string. That’s the whole pitch for evals in one table row.\n\nThe loop above is now Meera’s development workflow. The top row is mechanical: fixtures in, scores out, gate at the end. 
The two feedback arrows are where value compounds. The solid one — *re-run until green* — is the inner cycle she’s about to do (fix the prompt, re-run, compare experiments side by side). The dashed one, from “production incident” back into the dataset, is the long game: every bad answer your Part 1 tracing catches in the wild gets distilled into a new example, which means **the test suite grows in exactly the directions your system actually fails.** Six months in, that dataset is the team’s institutional memory of every way the agent has ever embarrassed them.\n\nOne scope note: this evaluator is an exact-match check, which works because structured output constrains the labels. For free-text answers you’d add an **LLM-as-judge** evaluator (a strong model grading faithfulness against a rubric), and LangSmith’s **online evaluators** can score samples of *live production traffic* continuously, so drift (Scenario 4 from Part 1) shows up on a dashboard instead of in a complaint.\n\nNow — about that failing prompt. Fixing it properly raises a bigger question.\n\nHere’s the uncomfortable question: **where do your prompts actually live?** If the honest answer is “in f-strings, scattered across the codebase, edited by whoever and deployed whenever” — that’s exactly how silent regressions are born, and exactly what a tribunal will subpoena, as Air Canada learned in Part 1.\n\n[LangSmith’s Hub](https://docs.langchain.com/langsmith/manage-prompts) treats prompts as versioned, deployable artifacts. 
Every push creates an **immutable commit** — old versions are never overwritten and stay pullable by hash, forever — and [tags](https://changelog.langchain.com/announcements/prompt-tags-in-langsmith-for-version-control) like `production` or `staging` are movable pointers to commits, exactly like git branches.\n\nMeera lifts the routing prompt out of the code and pushes it:\n\n``` python\nfrom langchain_core.prompts import ChatPromptTemplate\nfrom langsmith import Client\n\nclient = Client()\nPROMPT_NAME = \"acmeai-router-categorization\"\n\nrouting_prompt = ChatPromptTemplate.from_messages([\n    (\"system\", \"You are a customer support agent for an AI products and hardware company. \"\n               \"Classify the customer query into exactly one of: Technical, Billing, General. \"\n               \"Return only the category name.\"),\n    (\"human\", \"{customer_query}\"),\n])\n\nurl = client.push_prompt(PROMPT_NAME, object=routing_prompt)\nprint(f\"Pushed → {url}\")\n\nPushed → https://smith.langchain.com/prompts/acmeai-router-categorization/...\n```\n\nThat’s commit #1. Now the fix for the failing eval case — a v2 with one extra routing rule, pushed as a *new commit* of the same prompt:\n\n```\nrouting_prompt_v2 = ChatPromptTemplate.from_messages([\n    (\"system\", \"You are a customer support agent for an AI products and hardware company. \"\n               \"Classify the customer query into exactly one of: Technical, Billing, General. \"\n               \"Rule: complaints about charges, refunds, or payments are ALWAYS Billing, \"\n               \"even when the customer is angry or insulting the product. \"\n               \"Return only the category name.\"),\n    (\"human\", \"{customer_query}\"),\n])\n\nclient.push_prompt(PROMPT_NAME, object=routing_prompt_v2)   # commit #2\n```\n\nDoes v2 actually fix the regression without breaking anything else? That’s not a matter of opinion anymore — it’s an experiment. 
The new target pulls the prompt from the Hub (note: the app code no longer contains prompt text at all):\n\n``` python\ndef routing_target_v2(inputs: dict) -> dict:\n    prompt = client.pull_prompt(PROMPT_NAME)            # latest commit\n    chain = prompt | llm.with_structured_output(QueryCategory)\n    result = chain.invoke({\"customer_query\": inputs[\"customer_query\"]})\n    return {\"predicted_category\": result.categorized_topic}\n\n\nresults_v2 = evaluate(\n    routing_target_v2,\n    data=DATASET_NAME,\n    evaluators=[correctness_evaluator],\n    experiment_prefix=\"router-v2-refund-rule\",\n)\n\nView the evaluation results for experiment: 'router-v2-refund-rule-9b3d51e0' at:\nhttps://smith.langchain.com/o/.../datasets/.../compare?selectedSessions=...\n\n7it [00:10,  1.49s/it]\n```\n\nThe dashboard’s compare view puts both experiments side by side, and it’s worth looking at closely, because this view is the cultural artifact that changes how teams argue about prompts:\n\nrouter-baseline at 0.86, router-v2-refund-rule at **1.00** — the amber refund row flips from red to green, and the six other rows hold steady. Both halves of that sentence matter: the flip proves the fix worked, and the steady rows prove the new rule didn't quietly break anything else (the failure mode of every \"small prompt tweak\" ever made). *That* — not \"it looks better to me\" — is the evidence that earns v2 the production tag. In the Hub UI, moving the :production tag onto commit #2 is one click, and the production app picks it up on its next pull:\n\n```\nprompt = client.pull_prompt(f\"{PROMPT_NAME}:production\")\n# or pin an exact commit hash for absolute reproducibility:\n# prompt = client.pull_prompt(f\"{PROMPT_NAME}:abc123de\")\n```\n\nThe diagram is the whole operating model on one page. The rail across the top is what the Hub stores: immutable commits v1…v4, with :production and :staging as re-pinnable flags — promotion is moving a flag, not shipping a build. 
The pipeline underneath is what Meera just did manually, automated: a prompt edit becomes a commit, the commit triggers the eval suite in CI, a regression blocks the PR with the failing examples attached, and a pass moves the flag. Two operational footnotes are baked in: pulled prompts are **cached** (expect a few minutes of TTL after a re-tag before every running instance converges — pin commit hashes where you need determinism), and the rollback story is the killer feature. A bad prompt in production is fixed by moving the flag *backwards*: no build, no release train, no 2 a.m. deploy.\n\nThe cultural shift lands quietly but permanently:\n\n**Prompts now ship through the same gate as code.** That’s the sentence that finally makes Sanjay smile — question #3, closed.\n\nEverything so far used a support bot, but look at the four moves again — *trace everything, keep your own audit log, gate changes with evals, version your prompts* — and notice that nothing about them is support-specific. Here’s how the same stack earns its keep elsewhere.\n\n**🏥 Healthcare — the symptom-triage assistant.** A telehealth platform runs an intake bot that asks about symptoms and suggests urgency levels. Traces let clinical reviewers replay *exactly* why the bot said “routine appointment” instead of “urgent care” — which retrieval surfaced, which guideline was quoted. The custom callback is non-negotiable here: PHI must be scrubbed in-process before any trace leaves the network (HIPAA), and the JSONL log feeds the clinical-governance board. The eval dataset is a library of physician-written vignettes — “crushing chest pain radiating to left arm” must score urgent=1.0 on every model version, forever. A failed experiment blocks release like a failed unit test.\n\n**🛒 E-commerce — the shopping copilot.** A retailer’s product-Q&A agent answers “will these boots survive a Norwegian winter?” from spec sheets and reviews. 
Tracing exposes the classic silent killer: the retriever returning the *men’s* boot specs for a *women’s* boot question. Cost telemetry per trace reveals that 4% of conversations consume 40% of spend (users pasting entire return-policy PDFs) — Part 1’s Scenario 3, found and fixed in a week. Before Black Friday, the team re-runs a 500-example eval suite against holiday prompt variants, and merchandising A/Bs a “warmer” tone by moving a Hub tag — zero engineering deploys.\n\n**⚖️ Legal — the contract-review copilot.** A firm’s associates use an agent that flags risky clauses in NDAs. Privilege means traces can’t leave the building — so they self-host (or run callback-only logging) with the exact same code. The eval dataset is partner-annotated contracts, and the evaluator checks clause-level recall: missing an uncapped-liability clause is a career-limiting false negative. Prompt commits matter for a subtler reason: when a client asks *“under what instructions did the AI review my contract in April?”*, the firm produces the exact prompt version, by hash.\n\n**🎓 EdTech — the AI tutor.** A math-tutoring app serves students aged 10–16. Online evaluators continuously score live traces for age-appropriateness and “did the tutor *explain* rather than hand over the answer.” The audit log doubles as a safety record for school districts. The Hub holds per-grade prompt variants (tutor-prompt:grade6, tutor-prompt:grade10) — pedagogy teams iterate on scaffolding without touching the codebase.\n\nA quick map for everyone else:\n\nDifferent stakes, same four moves. 
The infrastructure doesn’t care whether the disaster is a misrouted refund, a missed liability clause, or a tutor handing a 12-year-old the answer key.\n\nThe question Meera gets most often from other teams: *“should we use LangSmith or Langfuse?”* It deserves a real answer, not a shrug — they’re the two most common finalists, and they genuinely optimize for different things.\n\n**Langfuse** is the open-source counterweight: MIT-licensed core, self-hosting as a first-class citizen (one Docker Compose for Postgres + ClickHouse + the server), an SDK rebuilt around OpenTelemetry, and transparent unit-based pricing for its cloud. **LangSmith** is the vertically integrated, managed platform: the deepest LangChain/LangGraph integration on the market, plus the production toppings — alerts, online evaluators, automation rules, agent deployment — that Langfuse mostly leaves to you.\n\nHere’s the matrix, distilled from [Langfuse’s own comparison](https://langfuse.com/faq/all/langsmith-alternative), [LangChain’s counter-comparison](https://www.langchain.com/resources/langsmith-vs-langfuse), and independent 2026 write-ups ([ZenML](https://www.zenml.io/blog/langfuse-vs-langsmith), [TECHSY](https://techsy.io/en/blog/langfuse-vs-langsmith)) — pricing figures are mid-2026 cloud list prices, so verify before you budget:\n\nAnd the decision rules, compressed:\n\nLangSmith and Langfuse aren’t the only players — LLM observability has become a [multi-billion-dollar category](https://www.firecrawl.dev/blog/best-llm-observability-tools), and a few others deserve a look depending on your shape:\n\nThe tree compresses this whole section into three questions, asked in the order that actually matters. **Sovereignty first:** if traces can’t leave your network, you’re self-hosting, and the realistic shortlist is Langfuse, Phoenix, or MLflow — or writing the enterprise check for self-hosted LangSmith. 
**Framework second:** deep LangChain/LangGraph investment makes LangSmith’s zero-config integration genuinely hard to beat. **Philosophy third:** managed suite (LangSmith, Braintrust, Weave, Datadog) versus open-source-first (Langfuse, Phoenix, MLflow, Opik). And don’t skim past the footnote at the bottom of the tree — the audit-callback lane from Part 1 belongs in *every* outcome box, because it’s the one component you’ll never migrate, never license, and never lose to an acquisition. (Ask a Helicone user.)\n\nA practical privacy ladder, from lightest to heaviest control — each rung buys more sovereignty and costs more ops effort, so climb only as high as your data requires:\n\nMeera’s launch checklist, distilled from the whole series:\n\nAnd if you keep exactly one artifact from this series, make it this poster — the whole playbook, both parts, on one page:\n\nAnd the story ends where stories like this should: Sanjay signs off, the agent ships, and three weeks later a customer claims the bot promised them a free GPU. Meera pulls the trace, reads the actual conversation, checks which prompt commit was live that day, and replies in four minutes flat.\n\nThat’s the whole point.\n\nIf this two-parter saved you a debugging weekend — or surfaced a gap in your team’s LLM stack — I’d genuinely like to hear about it. 
I’m **Prashant Sahu**, and I train and consult on GenAI engineering: LLM observability and evaluation, RAG systems, and multi-agent architectures, including the 10-day (70-hour) corporate curriculum this series grew out of.\n\n🔗 **Connect with me on LinkedIn:** [linkedin.com/in/prashantksahu](https://www.linkedin.com/in/prashantksahu/) — say hi, share your observability war stories, or just tell me which part of this series you’d like a deeper dive on.\n\n➕ **Follow me here on Medium** for the next articles in this series — Langfuse hands-on, the 3-layer agent-evaluation hierarchy, and PII redaction with fairness guardrails are all in the pipeline.\n\n*Missed the beginning? **Read Part 1 here** — observability fundamentals, zero-config tracing, tracing any Python function, and the audit-grade callback.*\n\n[LLM Observability with LangSmith — Part 2: Eval Gates, Prompt Versioning & Choosing Your Stack](https://pub.towardsai.net/llm-observability-with-langsmith-part-2-eval-gates-prompt-versioning-choosing-your-stack-e607473320b5) was originally published in [Towards AI](https://pub.towardsai.net) on Medium, where people are continuing the conversation by highlighting and responding to this story.", "url": "https://wpnews.pro/news/llm-observability-with-langsmith-part-2-eval-gates-prompt-versioning-choosing", "canonical_source": "https://pub.towardsai.net/llm-observability-with-langsmith-part-2-eval-gates-prompt-versioning-choosing-your-stack-e607473320b5?source=rss----98111c9905da---4", "published_at": "2026-06-13 22:01:01+00:00", "updated_at": "2026-06-13 22:31:34.546984+00:00", "lang": "en", "topics": ["artificial-intelligence", "large-language-models", "ai-agents", "mlops", "developer-tools"], "entities": ["LangSmith", "Langfuse", "AcmeAI", "Meera", "Sanjay", "LangGraph", "BaseCallbackHandler", "Air Canada"], "alternates": {"html": "https://wpnews.pro/news/llm-observability-with-langsmith-part-2-eval-gates-prompt-versioning-choosing", "markdown": 
"https://wpnews.pro/news/llm-observability-with-langsmith-part-2-eval-gates-prompt-versioning-choosing.md", "text": "https://wpnews.pro/news/llm-observability-with-langsmith-part-2-eval-gates-prompt-versioning-choosing.txt", "jsonld": "https://wpnews.pro/news/llm-observability-with-langsmith-part-2-eval-gates-prompt-versioning-choosing.jsonld"}}