LLM Observability with LangSmith - Part 1: Tracing Everything & Building Audit-Grade Callbacks


Meera ships the demo on a Friday.

She’s a GenAI engineer at AcmeAI, a company that sells AI models and the hardware to run them. Her latest build is a customer-support agent: a LangGraph workflow that reads an incoming query, classifies it as Technical, Billing, or General, checks the customer’s sentiment, retrieves answers from a knowledge base, and replies. If the customer sounds furious, it skips the robot answer entirely and escalates to a human.

The demo lands. Leadership loves it. And then Sanjay, the head of risk, asks three questions that stop the launch cold:

“Can we replay any past customer interaction?” If a customer claims the bot promised them a free GPU, can we pull up exactly what happened?
“Do we have a tamper-evident audit log?” Every LLM call, every retrieval, every error — somewhere we control, not just a vendor dashboard?
“If somebody tweaks a prompt next quarter, will we catch the regression before it ships?” Or will we find out from angry customers?

Meera realizes something every LLM engineer eventually learns: building the agent is the easy half. Operating it is the hard half.

📚 This is Part 1 of a two-part series. In this part, we cover what observability and traceability actually mean (and why LLM apps break every assumption your monitoring stack was built on), what LangSmith is and everything it can do, zero-config tracing for a real LangGraph agent, tracing any Python function, and building a compliance-grade audit callback. That covers Sanjay’s questions #1 and #2.

Part 2 (yet to be published) takes on question #3: eval datasets and CI regression gates, prompt versioning with the Hub, the same playbook across six industries, and an honest LangSmith-vs-Langfuse decision matrix plus the wider 2026 tool landscape.

Grab a coffee. Let’s go.

Let’s define the words before we sling the tools, because they get used loosely.

**Observability** is a property of a system, not a product you buy: it’s the degree to which you can understand what’s happening inside the system purely from what it emits (its logs, metrics, and traces). A highly observable system lets you ask questions you didn’t think of in advance (“why are answers about refunds suddenly 30% longer for German users?”) and get answers without shipping new code.

Traceability is narrower and sharper: the ability to follow one specific request end to end — every hop, every transformation, every sub-call, in order, with timing. If observability is the security-camera system for the whole building, a trace is the complete CCTV cut of one visitor’s walk through it.

Monitoring, for completeness, is the dashboards-and-alerts layer you bolt on top: predefined checks for failures you already anticipated. Monitoring catches known unknowns. Observability is what saves you when the failure is one nobody predicted — which, with LLMs, is most of them.

Classic web apps had this figured out: APM tools, structured logs, distributed tracing. So why do LLM apps need their own version of the discipline? Because they violate the core assumption all of that tooling was built on — that failures are loud.

The left half of that picture is the world your current tooling grew up in. When a traditional app breaks, it breaks theatrically: an exception fires, the response is a 500, the error-rate graph spikes, someone gets paged, and the stack trace points at the crime scene. The right half is the world you live in now. An LLM failure arrives wearing a 200 OK. It’s fluent, confident, grammatically lovely, and wrong. No exception, no spike, no page. Your logs swear everything is fine, and the first detector to fire is a customer, weeks later. The three boxes along the bottom are the answer this series builds, piece by piece: traces, evaluations, and versioned prompts.

If that still sounds like nice-to-have engineering hygiene, walk through five very real scenarios.

Scenario 1: The chatbot that invented a policy (real, and it went to a tribunal). In 2022, a passenger asked Air Canada’s website chatbot about bereavement fares. The bot confidently told him he could book a full-price ticket and claim the discount retroactively within 90 days — a policy that did not exist; the airline’s actual policy page said the opposite. He flew, applied for the refund, and was rejected. In February 2024, a British Columbia tribunal ordered Air Canada to pay CA$812.02, explicitly rejecting the airline’s argument that the chatbot was “a separate legal entity responsible for its own actions.” The ruling set the tone for every deployment since: your bot’s words are your company’s words. Now ask yourself — if a customer made that claim against your bot, could you produce the exact conversation, the documents the bot retrieved, and the prompt version that was live that day? Without tracing, the honest answer is no. You’d be litigating against a ghost.

Scenario 2: The silent regression. A teammate “improves” a routing prompt — adds one clarifying sentence. Refund questions quietly start routing to General, where the responder can’t see billing documents. No error. No alert. Accuracy degrades for three weeks until a pattern emerges in complaints. The fix takes five minutes; finding it takes days — unless an eval suite had flagged it before merge. (This scenario is the entire plot of Part 2.)

Scenario 3: The invisible money leak. Every LLM call has a price tag, which makes cost a first-class observability metric in a way traditional apps never needed. One team discovered that 4% of conversations were consuming 40% of their token spend — users pasting entire PDFs into the chat. That’s invisible in a monthly bill (“the OpenAI line item went up”) and obvious in five minutes of trace telemetry grouped by conversation.
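A minimal sketch of that grouping, using only the standard library. The records and the field names (`conversation_id`, `total_tokens`) are assumptions made up for illustration, not an export format LangSmith defines; the point is only that once per-call token counts carry a conversation key, the concentration is easy to surface.

```python
from collections import defaultdict

# Hypothetical flattened trace records: one row per LLM call.
calls = [
    {"conversation_id": "c1", "total_tokens": 300},
    {"conversation_id": "c1", "total_tokens": 200},
    {"conversation_id": "c2", "total_tokens": 450},
    {"conversation_id": "c3", "total_tokens": 9000},  # someone pasted a PDF
    {"conversation_id": "c4", "total_tokens": 550},
]


def tokens_by_conversation(rows):
    """Sum token usage per conversation."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["conversation_id"]] += row["total_tokens"]
    return dict(totals)


def top_share(totals, top_n=1):
    """Fraction of all tokens consumed by the top_n heaviest conversations."""
    ranked = sorted(totals.values(), reverse=True)
    grand_total = sum(ranked)
    return sum(ranked[:top_n]) / grand_total if grand_total else 0.0


totals = tokens_by_conversation(calls)
heaviest = max(totals, key=totals.get)
print(heaviest, f"{top_share(totals):.0%} of tokens")
```

With these made-up rows, a single conversation accounts for most of the spend, which is the kind of skew a monthly invoice hides.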

Scenario 4: Drift you didn’t deploy. Model providers update models. Sometimes behavior shifts subtly — formats change, refusal rates move, a model gets terser. Your code didn’t change, your prompts didn’t change, and yet Tuesday’s system is not Monday’s system. Without baseline evals re-run on a schedule, you discover drift the way you discover everything else: from users.

Scenario 5: The 3 a.m. incident. Something went wrong with a customer interaction and it’s escalating — legal is asking questions. Your observability SaaS is having an outage, or your compliance team was never allowed to ship data there in the first place. What do you hand the auditor? If the answer isn’t “our own append-only log, on our own disk, with every event timestamped and correlated,” you have a governance gap, not just a tooling gap.

Five scenarios, one conclusion: an LLM app without observability isn’t a product — it’s a liability with a chat interface. Traces answer “what happened?”, evaluations answer “is it still good?”, and versioned prompts answer “what changed, and can we undo it?”

Now let’s meet the tool Meera reaches for.

LangSmith is the observability, evaluation, and prompt-engineering platform built by the LangChain team. It entered closed beta in mid-2023, reached general availability in February 2024 alongside LangChain’s $25M Series A led by Sequoia Capital, and has since grown from “a debugger for LangChain” into what the company now positions as a full agent engineering platform.

A few facts worth knowing before you commit to it:

This part of the series lives in the first row of that table; Part 2 lives in the second and third. Let’s build.

**Setup:** You need two API keys: one for your LLM provider (OpenAI here, but anything works) and one free LangSmith key from smith.langchain.com.

pip install "langchain>=1.0" "langchain-core>=1.0" "langchain-openai>=1.0" \
    "langgraph>=1.0" langchain-chroma "langsmith>=0.4" python-dotenv

import os
from dotenv import load_dotenv

load_dotenv()  # expects OPENAI_API_KEY and LANGSMITH_API_KEY in .env

# The entire tracing setup. Yes, really.
os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_PROJECT"] = "acmeai-support-router"

# Optional: pin the endpoint (default is US; use eu.api.smith.langchain.com for EU residency)
os.environ.setdefault("LANGSMITH_ENDPOINT", "https://api.smith.langchain.com")

That’s the magic trick, and it’s worth pausing on: you haven’t imported LangSmith anywhere. With these environment variables set, every LangChain and LangGraph operation from this point on — every model call, every retriever hit, every graph node — reports itself to your LangSmith project automatically. The tracer rides LangChain’s internal callback system and uploads in batches, off the hot path, so your latency doesn’t pay for it.
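Because the whole setup lives in environment variables, a missing one fails quietly: the agent still answers, it just isn't traced. A small preflight check can make that loud. This is a sketch only; `tracing_preflight` is a hypothetical helper, not a LangSmith API, and treating `LANGSMITH_PROJECT` as expected is a team-policy assumption (so traces don't land in a default project), not something LangSmith requires.

```python
import os

# Variable names are the ones used in this article's setup.
EXPECTED_WHEN_TRACING = ("LANGSMITH_API_KEY", "LANGSMITH_PROJECT")


def tracing_preflight(env=os.environ):
    """Return a list of human-readable problems; an empty list means OK."""
    if env.get("LANGSMITH_TRACING", "").lower() != "true":
        return ["LANGSMITH_TRACING is not 'true': runs will not be traced"]
    return [f"{name} is not set" for name in EXPECTED_WHEN_TRACING if not env.get(name)]


# Example with an explicit dict instead of the real environment.
problems = tracing_preflight(
    {"LANGSMITH_TRACING": "true", "LANGSMITH_PROJECT": "acmeai-support-router"}
)
print(problems)
```

Calling it at startup and refusing to boot on a non-empty list is one option; logging a warning is another.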

Now let’s build the thing worth observing.

Meera’s agent answers from a small product knowledge base. In production this would be your real docs; here, twelve documents keep the story self-contained:

from langchain_core.documents import Document

knowledge_base = [
    # -- technical --
    {"text": "Our pre-trained models include vision (CLIP-style), speech (Whisper-style), and text (Llama-3 fine-tunes). They ship with example notebooks.",
     "metadata": {"category": "technical"}},
    {"text": "On-prem deployment is supported via the AcmeAI Edge appliance - Kubernetes-based, runs Llama 3 70B on 2x H100.",
     "metadata": {"category": "technical"}},
    {"text": "Hardware troubleshooting: if the GPU light blinks red, run acmectl diagnose - gpu; common cause is a loose NVLink bridge.",
     "metadata": {"category": "technical"}},
    {"text": "AcmeAI SDK supports Python 3.10+, Node 20+, and Java 17. The REST API is OpenAPI 3.1 compliant.",
     "metadata": {"category": "technical"}},
    # -- billing --
    {"text": "We accept Visa, Mastercard, Amex, ACH bank transfer, and wire. Crypto is not supported.",
     "metadata": {"category": "billing"}},
    {"text": "Invoices are emailed on the 1st of each month. To download past invoices, log in and visit Account → Billing → Invoices.",
     "metadata": {"category": "billing"}},
    {"text": "You can update your billing info under Account → Billing → Payment Methods. Changes take effect immediately.",
     "metadata": {"category": "billing"}},
    {"text": "Refunds are processed within 7 business days. We refund pro-rata on cancellation within 30 days of purchase.",
     "metadata": {"category": "billing"}},
    # -- general --
    {"text": "Our refund policy: full refund within 30 days, pro-rata thereafter. Contact billing@acmeai.example.",
     "metadata": {"category": "general"}},
    {"text": "Standard shipping is 3–5 business days within the US. International shipping is 7–14 business days; duties not included.",
     "metadata": {"category": "general"}},
    {"text": "Working hours: Mon–Fri 8am–8pm Eastern. Weekend support is available for Enterprise customers only.",
     "metadata": {"category": "general"}},
    {"text": "You can reach support at support@acmeai.example or +1-555-ACME-HELP. Average first response: under 4 hours.",
     "metadata": {"category": "general"}},
]

docs = [Document(page_content=d["text"], metadata=d["metadata"]) for d in knowledge_base]

Embed it into a Chroma vector store with cosine similarity:

from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

kbase_db = Chroma.from_documents(
    documents=docs,
    collection_name="knowledge_base",
    embedding=embeddings,
    collection_metadata={"hnsw:space": "cosine"},  # default is euclidean - be explicit
    persist_directory="./knowledge_base",
)

retriever = kbase_db.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"k": 3, "score_threshold": 0.2},
)

LangGraph agents pass a typed state dictionary between nodes:

from typing import TypedDict, Literal
from pydantic import BaseModel
from langchain_openai import ChatOpenAI


class CustomerSupportState(TypedDict):
    customer_query: str
    query_category: str
    query_sentiment: str
    final_response: str


class QueryCategory(BaseModel):
    categorized_topic: Literal["Technical", "Billing", "General"]


class QuerySentiment(BaseModel):
    sentiment: Literal["Positive", "Neutral", "Negative"]


llm = ChatOpenAI(model="gpt-5-mini")  # swap for any chat model you like

The two Pydantic models matter more than they look. Paired with with_structured_output, the LLM cannot reply “I think this is probably a billing question 😊” — it must return one of the three allowed labels. Routers need guarantees, not vibes.
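The guarantee comes from validating the reply against a closed set of labels. As a rough illustration of that idea without Pydantic or a model call, here is a standard-library sketch; `parse_category` is a hypothetical helper written for this example, not part of LangChain, and it only mimics the "one of three labels or an error" contract.

```python
from typing import Literal, get_args

Category = Literal["Technical", "Billing", "General"]
ALLOWED = set(get_args(Category))  # {"Technical", "Billing", "General"}


def parse_category(raw: str) -> str:
    """Accept only an exact allowed label; reject anything chatty."""
    label = raw.strip()
    if label not in ALLOWED:
        raise ValueError(f"not an allowed category: {raw!r}")
    return label


print(parse_category(" Billing "))

try:
    parse_category("I think this is probably a billing question")
except ValueError as exc:
    print("rejected:", exc)
```

The router downstream can then branch on the label without defensive string matching, because anything outside the set never reaches it.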

def categorize_inquiry(state: CustomerSupportState) -> CustomerSupportState:
    """Classify the query into Technical / Billing / General."""
    prompt = f"""Act as a customer support agent for an AI products and hardware company.
    Read the customer query and pick the best category: 'Technical', 'Billing', or 'General'.
    - Technical: AI models, hardware, software, SDK issues
    - Billing: payments, invoices, refunds, purchases
    - General: policies, contact info, shipping, everything else
    Query:
    {state["customer_query"]}
    """
    result = llm.with_structured_output(QueryCategory).invoke(prompt)
    return {"query_category": result.categorized_topic}


def analyze_inquiry_sentiment(state: CustomerSupportState) -> CustomerSupportState:
    """Classify sentiment as Positive / Neutral / Negative."""
    prompt = f"""Act as a customer support agent. Read the customer query below and
    classify its sentiment as exactly one of: 'Positive', 'Neutral', or 'Negative'.
    Query:
    {state["customer_query"]}
    """
    result = llm.with_structured_output(QuerySentiment).invoke(prompt)
    return {"query_sentiment": result.sentiment}

Sanity-check the sentiment node before wiring anything — same question, two emotional registers:

analyze_inquiry_sentiment({"customer_query": "what is your refund policy?"})
# {'query_sentiment': 'Neutral'}

analyze_inquiry_sentiment({"customer_query": "what is your refund policy? I am fed up with this product and want my money back"})
# {'query_sentiment': 'Negative'}

Same topic, opposite routing destinies — the first will get a polite RAG answer about refund windows; the second is heading straight to a human. That’s the whole escalation design in two lines of output.

Each responder filters the vector store to its own category using a metadata filter — the billing node physically cannot retrieve technical docs:

from langchain_core.prompts import ChatPromptTemplate

RESPONSE_TEMPLATE = ChatPromptTemplate.from_template(
    """Craft a clear and helpful {category} support response for the customer query below.
    Ground your answer in the provided knowledge base information.
    If the knowledge base does not contain the answer, say exactly:
    "Apologies, I was not able to answer your question, please reach out to +1-555-ACME-HELP"
    Customer Query:
    {customer_query}
    Relevant Knowledge Base Information:
    {retrieved_content}
    """
)


def make_category_responder(category: str):
    """Build a RAG responder node scoped to one KB category via metadata filter."""
    # One retriever per category, built once. Setting the filter on the shared
    # `retriever` at call time would mutate global state and could leak a filter
    # between concurrent requests.
    scoped_retriever = kbase_db.as_retriever(
        search_type="similarity_score_threshold",
        search_kwargs={"k": 3, "score_threshold": 0.2, "filter": {"category": category}},
    )
    chain = RESPONSE_TEMPLATE | llm

    def responder(state: CustomerSupportState) -> CustomerSupportState:
        hits = scoped_retriever.invoke(state["customer_query"])
        retrieved = "\n\n".join(d.page_content for d in hits)
        reply = chain.invoke({
            "category": category,
            "customer_query": state["customer_query"],
            "retrieved_content": retrieved,
        }).content
        return {"final_response": reply}

    return responder


generate_technical_response = make_category_responder("technical")
generate_billing_response = make_category_responder("billing")
generate_general_response = make_category_responder("general")


def escalate_to_human_agent(state: CustomerSupportState) -> CustomerSupportState:
    """Negative sentiment? No robot. A human will call."""
    return {"final_response": "We're really sorry! Someone from our team will reach out to you shortly."}
from langgraph.graph import StateGraph, END
from langgraph.checkpoint.memory import MemorySaver


def determine_route(state: CustomerSupportState) -> str:
    # Sentiment is checked first, so an angry customer is escalated whatever the category.
    if state["query_sentiment"] == "Negative":
        return "escalate_to_human_agent"
    elif state["query_category"] == "Technical":
        return "generate_technical_response"
    elif state["query_category"] == "Billing":
        return "generate_billing_response"
    return "generate_general_response"


graph = StateGraph(CustomerSupportState)
graph.add_node("categorize_inquiry", categorize_inquiry)
graph.add_node("analyze_inquiry_sentiment", analyze_inquiry_sentiment)
graph.add_node("generate_technical_response", generate_technical_response)
graph.add_node("generate_billing_response", generate_billing_response)
graph.add_node("generate_general_response", generate_general_response)
graph.add_node("escalate_to_human_agent", escalate_to_human_agent)

graph.set_entry_point("categorize_inquiry")
graph.add_edge("categorize_inquiry", "analyze_inquiry_sentiment")
graph.add_conditional_edges("analyze_inquiry_sentiment", determine_route, [
    "generate_technical_response", "generate_billing_response",
    "generate_general_response", "escalate_to_human_agent",
])

for terminal in ["generate_technical_response", "generate_billing_response",
                 "generate_general_response", "escalate_to_human_agent"]:
    graph.add_edge(terminal, END)

agent = graph.compile(checkpointer=MemorySaver())

If you’re in a notebook, LangGraph will draw itself — agent.get_graph().draw_mermaid_png() — and what it draws is this topology:

The two indigo nodes at the top are LLM classifiers writing into the shared state; the diamond is plain Python reading that state — auditable logic, no model involved; the three teal terminals are the RAG responders, each fenced into its own slice of the knowledge base by that metadata filter; and the red terminal is the empathy hatch, where angry customers bypass the robot entirely. Keep this picture in mind, because in about thirty seconds every shape on it is going to reappear as a span in a trace tree.
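Since the diamond is plain Python, it can be pinned down with a table of cases and no model in the loop. The sketch below restates the article's routing function so it runs on its own (typed as a plain `dict` here rather than the `CustomerSupportState` TypedDict); the case table is an illustrative assumption about which combinations are worth locking in.

```python
def determine_route(state: dict) -> str:
    # Same branching as the article's router: sentiment wins over category.
    if state["query_sentiment"] == "Negative":
        return "escalate_to_human_agent"
    if state["query_category"] == "Technical":
        return "generate_technical_response"
    if state["query_category"] == "Billing":
        return "generate_billing_response"
    return "generate_general_response"


# (state, expected destination) pairs covering each branch.
CASES = [
    ({"query_category": "Technical", "query_sentiment": "Neutral"}, "generate_technical_response"),
    ({"query_category": "Billing", "query_sentiment": "Positive"}, "generate_billing_response"),
    ({"query_category": "General", "query_sentiment": "Neutral"}, "generate_general_response"),
    # Negative sentiment overrides every category.
    ({"query_category": "Billing", "query_sentiment": "Negative"}, "escalate_to_human_agent"),
    ({"query_category": "Technical", "query_sentiment": "Negative"}, "escalate_to_human_agent"),
]

for state, expected in CASES:
    assert determine_route(state) == expected, (state, expected)

print(f"{len(CASES)} routing cases passed")
```

A table like this is cheap to run on every commit, and it documents the escalation rule in a form a reviewer outside the team can read.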

def ask(query: str, session_id: str = "demo") -> str:
    final = None
    for event in agent.stream(
        {"customer_query": query},
        {"configurable": {"thread_id": session_id}},
        stream_mode="values",
    ):
        final = event  # with stream_mode="values", the last event is the final state
    return final["final_response"]


print(ask("Do you support pre-trained vision models?"))   # → Technical path
print(ask("How do I download my last invoice?"))           # → Billing path
print(ask("Can you tell me about your shipping policy?"))  # → General path

The billing answer comes back grounded in exactly the documents we seeded:

You can download past invoices by logging in and going to Account → Billing → Invoices. Invoices are also emailed on the 1st of each month, so check the inbox associated with your account. If you don't see an invoice you expected, contact support@acmeai.example and we'll resend it.

Pleasant enough. But the real payoff is on the other screen: open smith.langchain.com, click into the acmeai-support-router project, and three traces are waiting. Click the invoice one and you get the full waterfall — which looks like this:

Time flows left to right; each bar is a run, LangSmith’s unit of work, and the indentation is the call hierarchy. Three things jump out the first time you see your own agent like this. First, the final generation call eats 1.25 of the total 2.95 seconds — so when someone says “the bot feels slow,” this chart settles the optimize-retrieval-or-optimize-generation argument in five seconds flat (it’s generation, and the two classifier calls in front of it are the next suspects). Second, that little amber sliver: 140 ms for the Chroma retrieval, and clicking it shows the exact three documents it returned — which is precisely the evidence you need when the bot confidently cites the wrong spec. Third, quietly doing the bookkeeping: every bar carries token counts in and out (LangSmith turns those into cost per trace, per user, per day), plus a run_id and parent_run_id linking each run to its parent. File those two IDs away — they become important in §5.
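To make the `run_id` / `parent_run_id` linkage concrete, here is a standard-library sketch that rebuilds a tree from flat run records and picks out the slowest leaf. The record shape and the IDs are assumptions for illustration. Only the 2.95 s total, the 1.25 s generation call, and the 140 ms retrieval come from the waterfall described above; the other durations are made up.

```python
from collections import defaultdict

# Illustrative flat run records (hypothetical shape, durations in milliseconds).
runs = [
    {"run_id": "r0", "parent_run_id": None, "name": "LangGraph", "ms": 2950},
    {"run_id": "r1", "parent_run_id": "r0", "name": "categorize_inquiry", "ms": 800},
    {"run_id": "r2", "parent_run_id": "r0", "name": "analyze_inquiry_sentiment", "ms": 700},
    {"run_id": "r3", "parent_run_id": "r0", "name": "generate_billing_response", "ms": 1400},
    {"run_id": "r4", "parent_run_id": "r3", "name": "retriever", "ms": 140},
    {"run_id": "r5", "parent_run_id": "r3", "name": "ChatOpenAI", "ms": 1250},
]


def build_children(records):
    """Index records by their parent so the tree can be walked top-down."""
    children = defaultdict(list)
    for rec in records:
        children[rec["parent_run_id"]].append(rec)
    return children


def render(children, parent_id=None, depth=0):
    """Return indented lines, one per run, in parent-before-child order."""
    lines = []
    for rec in children.get(parent_id, []):
        lines.append(f"{'  ' * depth}{rec['name']} ({rec['ms']} ms)")
        lines.extend(render(children, rec["run_id"], depth + 1))
    return lines


children = build_children(runs)
tree = render(children)
print("\n".join(tree))

# Leaf runs have no children of their own: that is where time is actually spent.
leaves = [r for r in runs if r["run_id"] not in children]
slowest = max(leaves, key=lambda r: r["ms"])
print("slowest leaf:", slowest["name"])
```

The same two IDs are what the audit handler later in this part writes to disk, which is why its records can be joined back to a trace.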

And that’s Sanjay’s question #1 answered, with a click instead of a forensic project. When a customer claims “your bot told me the appliance supports water cooling” — or invents a bereavement-fare policy — support pulls the trace and reads exactly what the retriever returned and what the model said.

A question Meera gets asked constantly, so let’s answer it head-on: **yes, LangSmith traces arbitrary Python, no LangChain required.** The @traceable decorator turns any function into a run, and nested decorated calls assemble into the same parent-child tree automatically.

Here’s a fully standalone example — raw OpenAI SDK, a fake database call, plain Python orchestration. Not a LangChain import in sight:

import os
from langsmith import traceable
from langsmith.wrappers import wrap_openai
from openai import OpenAI

os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_PROJECT"] = "acmeai-standalone-demo"

# wrap_openai instruments the raw OpenAI client: every .create() becomes an LLM run
oai = wrap_openai(OpenAI())


@traceable(name="crm_lookup", run_type="tool")
def fetch_customer_tier(customer_id: str) -> str:
    # pretend this hits your CRM / database
    return "enterprise" if customer_id.startswith("ENT") else "standard"


@traceable(name="ticket_summarizer")
def summarize_ticket(ticket_text: str) -> str:
    response = oai.chat.completions.create(
        model="gpt-5-mini",
        messages=[{"role": "user", "content": f"Summarize this support ticket in one line: {ticket_text}"}],
    )
    return response.choices[0].message.content


@traceable(name="handle_ticket", tags=["support", "v2"], metadata={"team": "acmeai-support"})
def handle_ticket(customer_id: str, ticket_text: str) -> dict:
    tier = fetch_customer_tier(customer_id)   # child run #1 (tool)
    summary = summarize_ticket(ticket_text)   # child run #2 -> contains the LLM run
    return {"tier": tier, "summary": summary, "priority": "P1" if tier == "enterprise" else "P3"}


handle_ticket("ENT-00451", "The Edge appliance reboots whenever we run the vision pipeline at batch size 64.")

Open the project and the tree reads exactly like the code: *handle_ticket* as the parent, *crm_lookup* and *ticket_summarizer* nested inside it, and the wrapped OpenAI call inside *ticket_summarizer*, with token counts captured even though LangChain was never involved. Three details worth knowing:

This matters strategically: your observability isn’t welded to your framework choice. If you rip out LangChain next year, the tracing survives.

Meera shows Sanjay the dashboard. He’s impressed — for about a minute. Then he leans in:

“This is their dashboard, on their servers. If their cloud is down during an incident, what do we show the regulator? And does customer data leave our network before we’ve scrubbed it?”

Fair. LangSmith is brilliant for debugging. But audit and compliance teams want guarantees a SaaS dashboard alone can’t give:

The primitive that solves this is LangChain’s **BaseCallbackHandler**, the same machinery LangSmith itself rides on: lifecycle hooks that fire synchronously, in your process, on every LLM start/end, tool start/end, and error. Subclass it, and you decide what gets persisted, where, and in what shape.

Meera writes hers to emit JSON Lines — one JSON object per line, append-only. It’s the dullest format in computing, and that’s the point: grep reads it, jq reads it, pandas reads it, Splunk ingests it.
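Before the handler itself, a standard-library sketch of the format on its own: append one JSON object per line, then read the file back line by line. The helper names and the temporary file are hypothetical, chosen only for this illustration.

```python
import json
import tempfile
from pathlib import Path


def append_event(path: Path, event: dict) -> None:
    """Append one event as a single line; existing lines are never rewritten."""
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")


def read_events(path: Path) -> list[dict]:
    """Parse the file line by line, skipping blank lines."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# A throwaway file so the sketch does not touch a real audit log.
log_path = Path(tempfile.mkdtemp()) / "demo_audit.jsonl"
append_event(log_path, {"event": "llm_start", "run_id": "demo-1"})
append_event(log_path, {"event": "llm_end", "run_id": "demo-1", "total_tokens": 196})

events = read_events(log_path)
print(len(events), events[-1]["event"])
```

Because every record is a complete line, a partially written final line after a crash damages at most one event, and every line before it still parses.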

import json
import re
import time
from datetime import datetime, timezone
from pathlib import Path
from typing import Any
from uuid import UUID

from langchain_core.callbacks import BaseCallbackHandler

AUDIT_LOG_PATH = Path("./audit.jsonl")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")


def redact(text: str) -> str:
    """Minimal demo redaction. In production, use a real PII engine
    (Microsoft Presidio, Amazon Comprehend); emails alone won't cut it."""
    return EMAIL_RE.sub("[EMAIL_REDACTED]", text)


class JsonLinesAuditHandler(BaseCallbackHandler):
    """Append-only audit log of every LLM call, tool call, and error.

    One JSON object per line: grep-, jq-, pandas-, and Splunk-friendly.
    Designed for environments where an external SaaS alone can't be
    the system of record.
    """

    def __init__(self, log_path: Path = AUDIT_LOG_PATH) -> None:
        self.log_path = log_path
        # Start times keyed by run_id, so latency can be computed on the matching end event.
        self._llm_starts: dict[UUID, float] = {}
        self._tool_starts: dict[UUID, float] = {}

    def _emit(self, event: dict[str, Any]) -> None:
        event["ts"] = datetime.now(timezone.utc).isoformat()
        with self.log_path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(event, default=str) + "\n")

    def on_llm_start(self, serialized, prompts, *, run_id, parent_run_id=None, **kwargs):
        self._llm_starts[run_id] = time.perf_counter()
        self._emit({
            "event": "llm_start",
            "run_id": str(run_id),
            "parent_run_id": str(parent_run_id) if parent_run_id else None,
            "model": (serialized or {}).get("id", ["unknown"])[-1],
            "prompt_chars": sum(len(redact(p)) for p in prompts),
        })

    def on_llm_end(self, response, *, run_id, parent_run_id=None, **kwargs):
        latency_ms = (time.perf_counter() - self._llm_starts.pop(run_id, time.perf_counter())) * 1000
        usage = (response.llm_output or {}).get("token_usage") or {}
        self._emit({
            "event": "llm_end",
            "run_id": str(run_id),
            "parent_run_id": str(parent_run_id) if parent_run_id else None,
            "latency_ms": round(latency_ms, 1),
            "prompt_tokens": usage.get("prompt_tokens"),
            "completion_tokens": usage.get("completion_tokens"),
            "total_tokens": usage.get("total_tokens"),
        })

    def on_llm_error(self, error, *, run_id, parent_run_id=None, **kwargs):
        self._emit({
            "event": "llm_error",
            "run_id": str(run_id),
            "parent_run_id": str(parent_run_id) if parent_run_id else None,
            "error_type": type(error).__name__,
            "error_msg": str(error)[:500],
        })

    def on_tool_start(self, serialized, input_str, *, run_id, parent_run_id=None, **kwargs):
        self._tool_starts[run_id] = time.perf_counter()
        self._emit({
            "event": "tool_start",
            "run_id": str(run_id),
            "parent_run_id": str(parent_run_id) if parent_run_id else None,
            "tool": (serialized or {}).get("name"),
            "input_chars": len(input_str),
        })

    def on_tool_end(self, output, *, run_id, parent_run_id=None, **kwargs):
        latency_ms = (time.perf_counter() - self._tool_starts.pop(run_id, time.perf_counter())) * 1000
        self._emit({
            "event": "tool_end",
            "run_id": str(run_id),
            "parent_run_id": str(parent_run_id) if parent_run_id else None,
            "latency_ms": round(latency_ms, 1),
            "output_chars": len(str(output)),
        })

    def on_chain_error(self, error, *, run_id, parent_run_id=None, **kwargs):
        self._emit({
            "event": "chain_error",
            "run_id": str(run_id),
            "parent_run_id": str(parent_run_id) if parent_run_id else None,
            "error_type": type(error).__name__,
            "error_msg": str(error)[:500],
        })


audit_handler = JsonLinesAuditHandler()

A few deliberate choices worth noticing:

It also helps to see **when** each of those overridden hooks actually fires. For one billing query through the router, the sequence and what each hook leaves behind in the file looks like this:

Notice the greyed rows: on_chain_start and on_chain_end fire too, for the graph and for every node, but our handler deliberately lets them pass; LLM and error events are the audit-worthy moments. The amber row is an invitation: if your compliance story needs retrieval evidence (“which documents informed this answer?”), on_retriever_start / on_retriever_end are sitting there waiting for the same treatment. And the red strip at the bottom is the part auditors care about most: failures don’t vanish, they write a row with the error type, because **the absence of a record is itself a finding** in most audit frameworks.

Attaching the handler costs one config key — and LangSmith keeps tracing alongside it. They’re independent layers:

queries = [
    ("audit-001", "Do you support pre-trained vision models?"),
    ("audit-002", "How do I download my last invoice?"),
    ("audit-003", "I am furious - your hardware bricked itself overnight, refund NOW."),
]

for session_id, query in queries:
    events = agent.stream(
        {"customer_query": query},
        config={
            "configurable": {"thread_id": session_id},
            "callbacks": [audit_handler],  # ← audit log + LangSmith both fire
            "metadata": {"app": "support-router", "session_id": session_id},
        },
        stream_mode="values",
    )
    final = None
    for ev in events:
        final = ev
    print(f"{session_id}: category={final['query_category']} sentiment={final['query_sentiment']}")

audit-001: category=Technical sentiment=Neutral
audit-002: category=Billing sentiment=Neutral
audit-003: category=Billing sentiment=Negative

Read that third line carefully: it’s the design working as intended. The router still classified the furious message as Billing (it is about a refund), but the Negative sentiment overrode the route and the customer got a human, not a robot. Meanwhile, **audit.jsonl** quietly collected the paper trail. Eight LLM calls happened across those three queries (three each for the first two, and two for the escalated one, whose generation step never ran), and each produced a start and an end event, for 16 events in total:

import json
import pandas as pd

with AUDIT_LOG_PATH.open(encoding="utf-8") as f:
    events = [json.loads(line) for line in f if line.strip()]

df = pd.DataFrame(events)
print("Total events:", len(events))
print("Total LLM tokens:", int(df["total_tokens"].dropna().sum()))
print("Avg LLM latency (ms):", round(df.loc[df["event"] == "llm_end", "latency_ms"].mean(), 1))

Total events: 16
Total LLM tokens: 1732
Avg LLM latency (ms): 894.6

And this is what a single line of the file looks like — what your SOC team greps at 3 a.m. when an incident lands:

{"event": "llm_end", "run_id": "9f2c...", "parent_run_id": "b41a...", "latency_ms": 842.3, "prompt_tokens": 187, "completion_tokens": 9, "total_tokens": 196, "ts": "2026-06-12T07:14:55.103+00:00"}
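One way to put those lines to work is to pair events by `run_id`: a start with a matching end is a completed call, a start with an error is a failure, and a start with neither deserves a closer look. The sketch below uses only the standard library; the event names match what the handler writes, while the sample IDs, values, and the error type are made up for illustration.

```python
import json

# Sample lines in the shape the handler writes (values are hypothetical).
SAMPLE_LINES = [
    '{"event": "llm_start", "run_id": "a1", "parent_run_id": "p1"}',
    '{"event": "llm_end", "run_id": "a1", "parent_run_id": "p1", "latency_ms": 842.3, "total_tokens": 196}',
    '{"event": "llm_start", "run_id": "b2", "parent_run_id": "p1"}',
    '{"event": "llm_error", "run_id": "b2", "parent_run_id": "p1", "error_type": "RateLimitError"}',
    '{"event": "llm_start", "run_id": "c3", "parent_run_id": "p2"}',
]


def summarize_runs(lines):
    """Group events by run_id and classify each run as ok, error, or unfinished."""
    seen = {}
    for line in lines:
        event = json.loads(line)
        seen.setdefault(event["run_id"], set()).add(event["event"])

    status = {}
    for run_id, kinds in seen.items():
        if "llm_error" in kinds:
            status[run_id] = "error"
        elif "llm_end" in kinds:
            status[run_id] = "ok"
        else:
            status[run_id] = "unfinished"  # started, but no end or error was recorded
    return status


status = summarize_runs(SAMPLE_LINES)
print(status)
```

An "unfinished" run may simply be in flight, or it may mark a process that died mid-call; either way it is a row an incident reviewer would want surfaced.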

Picture the full event stream flowing down two lanes from the same source. Down the engineer’s lane: the LangSmith tracer batching events to the cloud, feeding the dashboards, the replay UI, and (in Part 2) the experiments. Down the auditor’s lane: our handler, running synchronously in-process (and that placement is the entire compliance argument, because *redact()* runs before any byte leaves the machine), then the append-only file, then the SIEM with its retention policy. The two lanes share nothing but the callback events and those *run_id*s. LangSmith down? The auditor’s record is intact. Disk hiccup? LangSmith still has the traces. That’s Sanjay’s question #2 answered, and it earns the rule Meera writes on the team wiki:

Run both, always. LangSmith for the humans debugging at their desks. The custom callback for the auditor’s chain-of-custody. They’re complementary layers, not alternatives.
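Sanjay's wording was "tamper-evident", and an append-only file alone does not detect an edit made after the fact. One common way to get there is hash chaining, where each record carries the hash of the one before it. The sketch below is not part of the handler above; the function names and the `prev_hash` / `hash` fields are assumptions for illustration, using only the standard library.

```python
import hashlib
import json

GENESIS = "0" * 64  # arbitrary starting value for the first record


def _digest(body: dict) -> str:
    """SHA-256 over a canonical (sorted-key) JSON encoding of the record body."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def chain_event(event: dict, prev_hash: str) -> dict:
    """Return a copy of the event carrying prev_hash and its own hash."""
    body = {**event, "prev_hash": prev_hash}
    return {**body, "hash": _digest(body)}


def verify_chain(records: list[dict]) -> bool:
    """Recompute every hash; an edited, reordered, or mid-chain-deleted record fails."""
    prev_hash = GENESIS
    for rec in records:
        body = {k: v for k, v in rec.items() if k != "hash"}
        if rec.get("prev_hash") != prev_hash or rec.get("hash") != _digest(body):
            return False
        prev_hash = rec["hash"]
    return True


log, prev = [], GENESIS
for ev in ({"event": "llm_start", "run_id": "a1"},
           {"event": "llm_end", "run_id": "a1", "total_tokens": 196}):
    rec = chain_event(ev, prev)
    log.append(rec)
    prev = rec["hash"]

# Simulate an after-the-fact edit on a copy of the log.
tampered = [dict(r) for r in log]
tampered[0]["run_id"] = "edited-later"

print(verify_chain(log), verify_chain(tampered))
```

A chain like this does not notice records trimmed from the tail unless the latest hash is also stored somewhere the writer cannot change, so treat it as one layer alongside file permissions and SIEM retention rather than a complete answer.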

Take stock of what Meera has after one day of work.

Two environment variables bought her a flight recorder: every conversation with the support agent is now a replayable trace with per-step prompts, retrieved documents, token costs, and latencies — that’s the Air Canada defense, question #1. Eighty lines of BaseCallbackHandler bought her an institution-grade audit lane: append-only, PII-redacted before egress, vendor-independent, SIEM-ready — question #2. And the @traceable decorator means none of this is hostage to LangChain: the day the team rewrites the agent in a different framework, the observability comes along.

But Sanjay’s third question is still open — and it’s the one that bites teams after launch: “if somebody tweaks a prompt next quarter, will we catch the regression before it ships?” Right now, the honest answer is still no. A well-meaning edit to the routing prompt tomorrow would sail straight into production, and Meera would learn about it from the complaints queue.

**In Part 2 (yet to be published)** we close that gap and then zoom out: we turn LangSmith into a regression-test framework with datasets, evaluators, and experiments; put prompts under real version control with the Hub (immutable commits, movable **:production** tags, CI-gated promotion, and an instant-rollback story your release manager will love); tour how the exact same four moves play out across six industries; and finish with an honest LangSmith-vs-Langfuse decision matrix plus the wider 2026 tool landscape.

If this saved you a future debugging weekend, follow me here on Medium so Part 2 lands in your feed. I’d genuinely love to hear your observability war stories in the comments.

You can follow me and connect with me on LinkedIn as well https://www.linkedin.com/in/prashantksahu

LLM Observability with LangSmith -Part 1: Tracing Everything & Building Audit-Grade Callbacks was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

