LLM Observability with LangSmith — Part 2: Eval Gates, Prompt Versioning & Choosing Your Stack


📚 This is Part 2 of a two-part series. In Part 1, Meera, a GenAI engineer at AcmeAI, shipped a LangGraph support agent and got stopped by three questions from Sanjay, the head of risk. We answered two of them: with LANGSMITH_TRACING=true and a project name, every conversation became a replayable trace (question #1, the Air Canada defense), and a custom BaseCallbackHandler gave compliance a tamper-evident, PII-redacted, vendor-independent audit log running alongside LangSmith (question #2, the two-lane pattern). We also proved the @traceable decorator traces any Python function, LangChain or not. If you haven't read it, start there: this part reuses the agent built in Part 1.

One question is still open, and it’s the sneakiest of the three:

“If somebody tweaks a prompt next quarter, will we catch the regression before it ships?”

Right now the answer is no. Let’s fix that — and then zoom all the way out to the industry playbook, the LangSmith-vs-Langfuse decision, and the rest of the 2026 tooling field.

The failure Sanjay is describing is silent. Someone “improves” the routing prompt — adds one clarifying sentence — and refund questions quietly start routing to General, where the responder can’t see billing documents. No error, no alert, just gradually worse answers discovered weeks later through complaints.

Code has had the antidote for decades: regression tests. LangSmith brings the same discipline to LLM behavior with three nouns: datasets, evaluators, and experiments.

Meera builds a routing benchmark, including the trap cases that bite real routers:

from langsmith import Client
from langsmith.evaluation import evaluate

client = Client()
DATASET_NAME = "acmeai-routing-eval"

examples = [
    {"inputs": {"customer_query": "Do you ship to Singapore?"},
     "outputs": {"expected_category": "General"}},
    {"inputs": {"customer_query": "My GPU appliance throws CUDA errors after the latest firmware"},
     "outputs": {"expected_category": "Technical"}},
    {"inputs": {"customer_query": "Why was my Visa charged twice this month?"},
     "outputs": {"expected_category": "Billing"}},
    {"inputs": {"customer_query": "How do I update my saved payment method?"},
     "outputs": {"expected_category": "Billing"}},
    {"inputs": {"customer_query": "I want a refund right now, your product is unusable"},
     "outputs": {"expected_category": "Billing"}},   # angry, but still Billing - sentiment handles escalation
    {"inputs": {"customer_query": "Does your SDK support Python 3.12?"},
     "outputs": {"expected_category": "Technical"}},
    {"inputs": {"customer_query": "What are your weekend support hours?"},
     "outputs": {"expected_category": "General"}},
]

# Idempotent: create the dataset only if it doesn't already exist
try:
    ds = client.read_dataset(dataset_name=DATASET_NAME)
except Exception:
    ds = client.create_dataset(
        dataset_name=DATASET_NAME,
        description="Routing-accuracy benchmark for the support router agent.",
    )
    client.create_examples(
        dataset_id=ds.id,
        examples=[{"inputs": e["inputs"], "outputs": e["outputs"]} for e in examples],
    )

Note that fifth example: “I want a refund right now, your product is unusable.” It encodes a real design decision from Part 1 — the router should still say Billing; the sentiment check is what triggers escalation. Once that’s in the dataset, nobody can accidentally “fix” it away.

Next, the target under test and the evaluator. The target wraps just the classification node — unit-testing one decision, not the whole pipeline:

# categorize_inquiry is the classification node of the agent built in Part 1.
def routing_target(inputs: dict) -> dict:
    state = categorize_inquiry({"customer_query": inputs["customer_query"]})
    return {"predicted_category": state["query_category"]}


def correctness_evaluator(run, example) -> dict:
    # Exact match between the predicted label and the dataset's expected label.
    pred = (run.outputs or {}).get("predicted_category")
    expected = (example.outputs or {}).get("expected_category")
    return {
        "key": "routing_correctness",
        "score": 1.0 if pred == expected else 0.0,
        "comment": f"pred={pred} expected={expected}",
    }

And the experiment — one function call:

results = evaluate(
    routing_target,
    data=DATASET_NAME,
    evaluators=[correctness_evaluator],
    experiment_prefix="router-baseline",
    metadata={"agent": "router-v1"},
)

View the evaluation results for experiment: 'router-baseline-c7e21a4f' at:
https://smith.langchain.com/o/.../datasets/.../compare?selectedSessions=...
7it [00:11,  1.58s/it]

Follow that link and the dashboard shows the per-example breakdown. Meera’s baseline comes back at 6/7 (0.86), and the table tells her exactly which example bled:

There it is — the exact trap the dataset was built to catch. When the customer is venting about the product, the model reads “unusable product” as a general complaint and misses that the actionable intent is a refund. In production this is invisible: the customer still gets an answer, just one generated without access to the refund-policy documents. In an experiment, it’s a red cell with a comment string. That’s the whole pitch for evals in one table row.

The loop above is now Meera’s development workflow. The top row is mechanical: fixtures in, scores out, gate at the end. The two feedback arrows are where value compounds. The solid one — re-run until green — is the inner cycle she’s about to do (fix the prompt, re-run, compare experiments side by side). The dashed one, from “production incident” back into the dataset, is the long game: every bad answer your Part 1 tracing catches in the wild gets distilled into a new example, which means the test suite grows in exactly the directions your system actually fails. Six months in, that dataset is the team’s institutional memory of every way the agent has ever embarrassed them.
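The dashed arrow, production incident back into the dataset, can be sketched as a small helper. This is a minimal sketch under stated assumptions: the function names `incident_to_example` and `new_examples` are hypothetical, and it assumes a human has already reviewed the bad trace and decided what the correct category should have been. The example shape matches the dataset built earlier.

```python
ALLOWED_CATEGORIES = {"Technical", "Billing", "General"}


def incident_to_example(customer_query: str, correct_category: str) -> dict:
    """Turn a reviewed production failure into a dataset example (hypothetical helper)."""
    if correct_category not in ALLOWED_CATEGORIES:
        raise ValueError(f"unknown category: {correct_category!r}")
    return {
        "inputs": {"customer_query": customer_query.strip()},
        "outputs": {"expected_category": correct_category},
    }


def new_examples(candidates: list, existing: list) -> list:
    """Drop candidates whose query is already in the dataset (case-insensitive)."""
    seen = {e["inputs"]["customer_query"].lower() for e in existing}
    fresh = []
    for candidate in candidates:
        key = candidate["inputs"]["customer_query"].lower()
        if key not in seen:
            seen.add(key)
            fresh.append(candidate)
    return fresh


existing = [incident_to_example("I want a refund right now, your product is unusable", "Billing")]
candidates = [
    incident_to_example("  i want a refund right now, your product is unusable ", "Billing"),
    incident_to_example("Your invoice shows the wrong VAT rate", "Billing"),
]
fresh = new_examples(candidates, existing)

# Upload step (needs LangSmith credentials), reusing the call from the dataset setup:
# client.create_examples(dataset_id=ds.id, examples=fresh)
```

The de-duplication is deliberately crude (lowercased exact match). It keeps the suite from filling up with copies of the same complaint, but near-duplicates would still need a human eye.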

One scope note: this evaluator is an exact-match check, which works because structured output constrains the labels. For free-text answers you’d add an LLM-as-judge evaluator (a strong model grading faithfulness against a rubric), and LangSmith’s online evaluators can score samples of live production traffic continuously, so drift (Scenario 4 from Part 1) shows up on a dashboard instead of in a complaint.
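For the free-text case, here is a minimal sketch of what an LLM-as-judge evaluator could look like, keeping the same (run, example) signature as the exact-match evaluator. Everything specific here is an assumption for illustration: the rubric wording, the `answer` and `reference` output keys, and the `make_judge_evaluator` factory are hypothetical, and the judge is a plain callable so the sketch runs offline with a stub instead of a real model.

```python
from types import SimpleNamespace
from typing import Callable

# Hypothetical rubric; a real one would be longer and tuned on labeled examples.
RUBRIC = (
    "Reply PASS if every claim in the answer is supported by the reference, "
    "otherwise reply FAIL."
)


def make_judge_evaluator(judge: Callable[[str], str], key: str = "faithfulness"):
    """Build a (run, example) evaluator around any text-in, text-out judge."""

    def evaluator(run, example) -> dict:
        answer = (run.outputs or {}).get("answer", "")
        reference = (example.outputs or {}).get("reference", "")
        prompt = f"{RUBRIC}\n\nReference:\n{reference}\n\nAnswer:\n{answer}\n\nVerdict:"
        verdict = judge(prompt).strip().upper()
        return {
            "key": key,
            "score": 1.0 if verdict.startswith("PASS") else 0.0,
            "comment": f"judge verdict: {verdict}",
        }

    return evaluator


def stub_judge(prompt: str) -> str:
    """Stand-in for a model call: passes only answers that mention the 30-day window."""
    answer_part = prompt.split("Answer:\n", 1)[1]
    return "pass" if "30 days" in answer_part else "fail"


faithfulness = make_judge_evaluator(stub_judge)
example = SimpleNamespace(outputs={"reference": "Refund window: 30 days from purchase."})
good_run = SimpleNamespace(outputs={"answer": "Refunds are available within 30 days."})
bad_run = SimpleNamespace(outputs={"answer": "Refunds are available for a full year."})

good_result = faithfulness(good_run, example)
bad_result = faithfulness(bad_run, example)
```

Passing the judge in as a callable keeps the grading logic testable without network access. In a real setup the stub would be replaced by a call to a strong model, and the returned evaluator would go into the `evaluators=[...]` list alongside any exact-match checks.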

Now — about that failing prompt. Fixing it properly raises a bigger question.

Here’s the uncomfortable question: where do your prompts actually live? If the honest answer is “in f-strings, scattered across the codebase, edited by whoever and deployed whenever” — that’s exactly how silent regressions are born, and exactly what a tribunal will subpoena, as Air Canada learned in Part 1.

LangSmith’s Hub treats prompts as versioned, deployable artifacts. Every push creates an immutable commit — old versions are never overwritten and stay pullable by hash, forever — and tags like production or staging are movable pointers to commits, exactly like git branches.
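The commit-and-tag idea is easy to see in a toy model. The sketch below is not how the Hub is implemented; it is an in-memory illustration with a hypothetical `PromptRegistry` class, and it assumes commit hashes are derived from the prompt text (the real hashing scheme is not described here). What it shows is the contract: commits are written once, tags are pointers that can be moved.

```python
import hashlib


class PromptRegistry:
    """Toy in-memory model: immutable commits plus movable tags (illustration only)."""

    def __init__(self):
        self._commits = {}  # commit hash -> prompt text, written once
        self._tags = {}     # tag name -> commit hash, freely re-pointed

    def push(self, text: str) -> str:
        commit = hashlib.sha256(text.encode("utf-8")).hexdigest()[:8]
        self._commits.setdefault(commit, text)  # an existing commit is never overwritten
        self._tags["latest"] = commit
        return commit

    def tag(self, name: str, commit: str) -> None:
        if commit not in self._commits:
            raise KeyError(f"unknown commit {commit}")
        self._tags[name] = commit

    def pull(self, ref: str) -> str:
        # A ref is either a tag name or a raw commit hash.
        commit = self._tags.get(ref, ref)
        return self._commits[commit]


registry = PromptRegistry()
v1 = registry.push("Classify the query into Technical, Billing, or General.")
v2 = registry.push("Classify the query. Refund complaints are ALWAYS Billing.")

registry.tag("production", v1)
before = registry.pull("production")
registry.tag("production", v2)   # promotion: move the pointer
after = registry.pull("production")
registry.tag("production", v1)   # rollback: move it back
rolled_back = registry.pull("production")
```

Because the old commit is still there under its hash, promotion and rollback are the same operation pointed in different directions.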

Meera lifts the routing prompt out of the code and pushes it:

from langchain_core.prompts import ChatPromptTemplate
from langsmith import Client

client = Client()
PROMPT_NAME = "acmeai-router-categorization"

routing_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a customer support agent for an AI products and hardware company. "
               "Classify the customer query into exactly one of: Technical, Billing, General. "
               "Return only the category name."),
    ("human", "{customer_query}"),
])

url = client.push_prompt(PROMPT_NAME, object=routing_prompt)
print(f"Pushed → {url}")

Pushed → https://smith.langchain.com/prompts/acmeai-router-categorization/...

That’s commit #1. Now the fix for the failing eval case — a v2 with one extra routing rule, pushed as a new commit of the same prompt:

routing_prompt_v2 = ChatPromptTemplate.from_messages([
    ("system", "You are a customer support agent for an AI products and hardware company. "
               "Classify the customer query into exactly one of: Technical, Billing, General. "
               "Rule: complaints about charges, refunds, or payments are ALWAYS Billing, "
               "even when the customer is angry or insulting the product. "
               "Return only the category name."),
    ("human", "{customer_query}"),
])

client.push_prompt(PROMPT_NAME, object=routing_prompt_v2)   # commit #2

Does v2 actually fix the regression without breaking anything else? That’s not a matter of opinion anymore — it’s an experiment. The new target pulls the prompt from the Hub (note: the app code no longer contains prompt text at all):

# llm and QueryCategory are assumed to be defined alongside the Part 1 agent.
def routing_target_v2(inputs: dict) -> dict:
    prompt = client.pull_prompt(PROMPT_NAME)            # latest commit
    chain = prompt | llm.with_structured_output(QueryCategory)
    result = chain.invoke({"customer_query": inputs["customer_query"]})
    return {"predicted_category": result.categorized_topic}


results_v2 = evaluate(
    routing_target_v2,
    data=DATASET_NAME,
    evaluators=[correctness_evaluator],
    experiment_prefix="router-v2-refund-rule",
)

View the evaluation results for experiment: 'router-v2-refund-rule-9b3d51e0' at:
https://smith.langchain.com/o/.../datasets/.../compare?selectedSessions=...
7it [00:10,  1.49s/it]

The dashboard’s compare view puts both experiments side by side, and it’s worth looking at closely, because this view is the cultural artifact that changes how teams argue about prompts:

router-baseline at 0.86, router-v2-refund-rule at 1.00 — the amber refund row flips from red to green, and the six other rows hold steady. Both halves of that sentence matter: the flip proves the fix worked, and the steady rows prove the new rule didn't quietly break anything else (the failure mode of every "small prompt tweak" ever made). That — not "it looks better to me" — is the evidence that earns v2 the production tag. In the Hub UI, moving the :production tag onto commit #2 is one click, and the production app picks it up on its next pull:

prompt = client.pull_prompt(f"{PROMPT_NAME}:production")
# or pin an exact commit hash for absolute reproducibility:
# prompt = client.pull_prompt(f"{PROMPT_NAME}:abc123de")

The diagram is the whole operating model on one page. The rail across the top is what the Hub stores: immutable commits v1…v4, with :production and :staging as re-pinnable flags — promotion is moving a flag, not shipping a build. The pipeline underneath is what Meera just did manually, automated: a prompt edit becomes a commit, the commit triggers the eval suite in CI, a regression blocks the PR with the failing examples attached, and a pass moves the flag. Two operational footnotes are baked in: pulled prompts are cached (expect a few minutes of TTL after a re-tag before every running instance converges — pin commit hashes where you need determinism), and the rollback story is the killer feature. A bad prompt in production is fixed by moving the flag backwards: no build, no release train, no 2 a.m. deploy.
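The "regression blocks the PR" step boils down to a comparison and an exit code. Here is a minimal sketch, assuming the per-example scores of the baseline and candidate experiments have already been exported into plain dicts keyed by an example name. The `gate` function, the example keys, and the default threshold are all hypothetical; how the scores get out of LangSmith is left out on purpose.

```python
def gate(baseline: dict, candidate: dict, min_mean: float = 1.0):
    """Compare per-example scores and return (exit_code, report_lines)."""
    report = []
    for example_id in sorted(baseline):
        if example_id not in candidate:
            report.append(f"MISSING {example_id}")
        elif candidate[example_id] < baseline[example_id]:
            report.append(
                f"REGRESSION {example_id}: {baseline[example_id]} -> {candidate[example_id]}"
            )
    mean = sum(candidate.values()) / len(candidate) if candidate else 0.0
    if mean < min_mean:
        report.append(f"MEAN {mean:.2f} is below the {min_mean:.2f} threshold")
    return (1 if report else 0), report


baseline = {"ship-singapore": 1.0, "cuda-firmware": 1.0, "angry-refund": 0.0}
fixed = {"ship-singapore": 1.0, "cuda-firmware": 1.0, "angry-refund": 1.0}
broken = {"ship-singapore": 1.0, "cuda-firmware": 0.0, "angry-refund": 1.0}

fixed_code, fixed_report = gate(baseline, fixed)
broken_code, broken_report = gate(baseline, broken)

# In a CI job the last step would be something like: sys.exit(broken_code)
```

The per-example check matters more than the mean. The `broken` candidate fixes the refund case and breaks the CUDA case, so its average is no worse than the baseline's 0.67, and only the row-by-row comparison names the example that regressed.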

The cultural shift lands quietly but permanently:

Prompts now ship through the same gate as code. That’s the sentence that finally makes Sanjay smile — question #3, closed.

Everything so far used a support bot, but look at the four moves again — trace everything, keep your own audit log, gate changes with evals, version your prompts — and notice that nothing about them is support-specific. Here’s how the same stack earns its keep elsewhere.

🏥 Healthcare — the symptom-triage assistant. A telehealth platform runs an intake bot that asks about symptoms and suggests urgency levels. Traces let clinical reviewers replay exactly why the bot said “routine appointment” instead of “urgent care” — which retrieval surfaced, which guideline was quoted. The custom callback is non-negotiable here: PHI must be scrubbed in-process before any trace leaves the network (HIPAA), and the JSONL log feeds the clinical-governance board. The eval dataset is a library of physician-written vignettes — “crushing chest pain radiating to left arm” must score urgent=1.0 on every model version, forever. A failed experiment blocks release like a failed unit test.

🛒 E-commerce — the shopping copilot. A retailer’s product-Q&A agent answers “will these boots survive a Norwegian winter?” from spec sheets and reviews. Tracing exposes the classic silent killer: the retriever returning the men’s boot specs for a women’s boot question. Cost telemetry per trace reveals that 4% of conversations consume 40% of spend (users pasting entire return-policy PDFs) — Part 1’s Scenario 3, found and fixed in a week. Before Black Friday, the team re-runs a 500-example eval suite against holiday prompt variants, and merchandising A/Bs a “warmer” tone by moving a Hub tag — zero engineering deploys.
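The "4% of conversations consume 40% of spend" finding is a simple concentration calculation once per-trace costs are in hand. A minimal sketch, assuming the costs have been exported as a plain list of numbers; the `spend_share_of_top` helper and the sample figures are made up for illustration, chosen so the arithmetic is easy to check by hand.

```python
def spend_share_of_top(costs: list, top_percent: int) -> float:
    """Fraction of total spend consumed by the most expensive top_percent of traces."""
    if not 0 < top_percent <= 100:
        raise ValueError("top_percent must be between 1 and 100")
    if not costs:
        return 0.0
    # Integer ceiling of len(costs) * top_percent / 100, with at least one trace.
    k = max(1, (len(costs) * top_percent + 99) // 100)
    ranked = sorted(costs, reverse=True)
    total = sum(ranked)
    return sum(ranked[:k]) / total if total else 0.0


# 100 conversations: 4 expensive ones at 16 units each, 96 cheap ones at 1 unit each.
costs = [16.0] * 4 + [1.0] * 96
share = spend_share_of_top(costs, top_percent=4)  # 64 / 160 of total spend
```

A number like this is the trigger for opening the expensive traces and seeing what users actually pasted in.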

⚖️ Legal — the contract-review copilot. A firm’s associates use an agent that flags risky clauses in NDAs. Privilege means traces can’t leave the building — so they self-host (or run callback-only logging) with the exact same code. The eval dataset is partner-annotated contracts, and the evaluator checks clause-level recall: missing an uncapped-liability clause is a career-limiting false negative. Prompt commits matter for a subtler reason: when a client asks “under what instructions did the AI review my contract in April?”, the firm produces the exact prompt version, by hash.
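Clause-level recall fits the same evaluator shape as the routing check: recall is the share of annotated risky clauses that the agent actually flagged. The sketch below is an assumption-laden illustration, with hypothetical output keys (`flagged_clauses`, `risky_clauses`) and clause labels, and it treats a contract with no annotated risky clauses as a perfect score.

```python
from types import SimpleNamespace


def clause_recall_evaluator(run, example) -> dict:
    """Recall = |flagged AND annotated risky| / |annotated risky|."""
    flagged = set((run.outputs or {}).get("flagged_clauses", []))
    risky = set((example.outputs or {}).get("risky_clauses", []))
    # Assumption: nothing to find counts as full recall.
    score = len(flagged & risky) / len(risky) if risky else 1.0
    missed = sorted(risky - flagged)
    return {"key": "clause_recall", "score": score, "comment": f"missed={missed}"}


example = SimpleNamespace(outputs={"risky_clauses": ["uncapped_liability", "non_compete"]})
run = SimpleNamespace(outputs={"flagged_clauses": ["non_compete", "governing_law"]})
result = clause_recall_evaluator(run, example)
```

Extra flags such as `governing_law` do not lower the score, which is the point of measuring recall here: the costly error is the missed clause, and the comment string names it.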

🎓 EdTech — the AI tutor. A math-tutoring app serves students aged 10–16. Online evaluators continuously score live traces for age-appropriateness and “did the tutor explain rather than hand over the answer.” The audit log doubles as a safety record for school districts. The Hub holds per-grade prompt variants (tutor-prompt:grade6, tutor-prompt:grade10) — pedagogy teams iterate on scaffolding without touching the codebase.

A quick map for everyone else:

Different stakes, same four moves. The infrastructure doesn’t care whether the disaster is a misrouted refund, a missed liability clause, or a tutor handing a 12-year-old the answer key.

The question Meera gets most often from other teams: “should we use LangSmith or Langfuse?” It deserves a real answer, not a shrug — they’re the two most common finalists, and they genuinely optimize for different things.

Langfuse is the open-source counterweight: MIT-licensed core, self-hosting as a first-class citizen (one Docker Compose for Postgres + ClickHouse + the server), an SDK rebuilt around OpenTelemetry, and transparent unit-based pricing for its cloud. LangSmith is the vertically integrated, managed platform: the deepest LangChain/LangGraph integration on the market, plus the production toppings — alerts, online evaluators, automation rules, agent deployment — that Langfuse mostly leaves to you.

Here’s the matrix, distilled from Langfuse’s own comparison, LangChain’s counter-comparison, and independent 2026 write-ups (ZenML, TECHSY) — pricing figures are mid-2026 cloud list prices, so verify before you budget:

And the decision rules, compressed:

LangSmith and Langfuse aren’t the only players — LLM observability has become a multi-billion-dollar category, and a few others deserve a look depending on your shape:

The tree compresses this whole section into three questions, asked in the order that actually matters. Sovereignty first: if traces can’t leave your network, you’re self-hosting, and the realistic shortlist is Langfuse, Phoenix, or MLflow — or writing the enterprise check for self-hosted LangSmith. Framework second: deep LangChain/LangGraph investment makes LangSmith’s zero-config integration genuinely hard to beat. Philosophy third: managed suite (LangSmith, Braintrust, Weave, Datadog) versus open-source-first (Langfuse, Phoenix, MLflow, Opik). And don’t skim past the footnote at the bottom of the tree — the audit-callback lane from Part 1 belongs in every outcome box, because it’s the one component you’ll never migrate, never license, and never lose to an acquisition. (Ask a Helicone user.)
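The three questions, in that order, can be written down as a small function. This is only a simplification of the paragraph above, with a hypothetical `recommend_stack` name and boolean inputs; the shortlists are the ones named in the text, and real decisions will involve budget and team factors that no three-flag function captures.

```python
def recommend_stack(
    traces_must_stay_on_network: bool,
    deep_langchain_investment: bool,
    prefers_managed_suite: bool,
) -> list:
    """Walk the three questions in order: sovereignty, framework, philosophy."""
    if traces_must_stay_on_network:
        shortlist = ["Langfuse", "Phoenix", "MLflow", "LangSmith (self-hosted, enterprise)"]
    elif deep_langchain_investment:
        shortlist = ["LangSmith"]
    elif prefers_managed_suite:
        shortlist = ["LangSmith", "Braintrust", "Weave", "Datadog"]
    else:
        shortlist = ["Langfuse", "Phoenix", "MLflow", "Opik"]
    # The audit-callback lane from Part 1 belongs in every outcome.
    return shortlist + ["audit callback (Part 1)"]


law_firm = recommend_stack(True, True, True)
langgraph_team = recommend_stack(False, True, False)
open_source_first = recommend_stack(False, False, False)
```

The ordering does real work: the law firm is heavily invested in LangChain and would like a managed suite, yet the sovereignty answer decides the shortlist before either of those is asked.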

A practical privacy ladder, from lightest to heaviest control — each rung buys more sovereignty and costs more ops effort, so climb only as high as your data requires:

Meera’s launch checklist, distilled from the whole series:

And if you keep exactly one artifact from this series, make it this poster — the whole playbook, both parts, on one page:

And the story ends where stories like this should: Sanjay signs off, the agent ships, and three weeks later a customer claims the bot promised them a free GPU. Meera pulls the trace, reads the actual conversation, checks which prompt commit was live that day, and replies in four minutes flat.

That’s the whole point.

If this two-parter saved you a debugging weekend — or surfaced a gap in your team’s LLM stack — I’d genuinely like to hear about it. I’m Prashant Sahu, and I train and consult on GenAI engineering: LLM observability and evaluation, RAG systems, and multi-agent architectures, including the 10-day (70-hour) corporate curriculum this series grew out of.

🔗 **Connect with me on LinkedIn:** linkedin.com/in/prashantksahu. Say hi, share your observability war stories, or just tell me which part of this series you’d like a deeper dive on.

➕ Follow me here on Medium for the next articles in this series — Langfuse hands-on, the 3-layer agent-evaluation hierarchy, and PII redaction with fairness guardrails are all in the pipeline.

Missed the beginning? Read Part 1 here — observability fundamentals, zero-config tracing, tracing any Python function, and the audit-grade callback.

LLM Observability with LangSmith — Part 2: Eval Gates, Prompt Versioning & Choosing Your Stack was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

source & further reading

pub.towardsai.net: original article
