DProvenanceKit: Execution Provenance for AI Systems

DProvenanceKit, a Python port of the Swift library for execution provenance in AI systems, has been released. It provides reasoning observability and regression testing by turning each execution into a queryable, diffable trace, enabling detection of reasoning drift between runs. The tool ships with CI gating, anomaly detection rules, and a hosted visualizer, requiring zero third-party dependencies.

Reasoning observability and regression testing for AI systems: a Python port of the Swift DProvenanceKit.

When an agent's reasoning drifts between runs, DProvenanceKit turns each execution into a queryable, diffable trace so you can see what changed and why, not just what happened.

Run → Record → Query → Diff → Detect regressions → Gate in CI

It's not just the library. It ships the surfaces that make reasoning regressions actionable:

- Gate in CI: a server-less `dprovenancekit gate` CLI, plus a drop-in [GitHub Action](https://github.com/Therealdk8890/DProvenanceKitPython/blob/main/action/README.md) and [GitLab CI template](https://github.com/Therealdk8890/DProvenanceKitPython/blob/main/gitlab/README.md) that fail a PR/MR when an agent's reasoning drifts from a golden baseline, and comment the diff.
- Out-of-the-box anomaly rules: Tool Drop and Looping detection with a JSON rule registry, runnable locally or on every PR.
- A hosted visualizer: a web dashboard (single-run span tree, JSON payload inspector, side-by-side semantic diff, shareable HTML reports) backed by a regression-gate API and multi-tenant control plane. Available as a separate commercial service.

See it all in one runnable script: [`python examples/end_to_end_demo.py`](https://github.com/Therealdk8890/DProvenanceKitPython/blob/main/examples/end_to_end_demo.py).

This is a faithful, dependency-free port of the Swift library to Python. It keeps the same architecture and guarantees: synchronous non-blocking recording, priority-aware backpressure, one query language over two backends held at parity, structural diffing, formally-modeled semantic alignment, and by-tier drop accounting so load-shedding is never silent. The original Swift package is unchanged; this is a parallel implementation.

The Swift library targets Apple-platform and on-device AI.
This port brings the same reasoning-layer observability to Python codebases (agent frameworks, LLM workflows, tool-using models) with zero third-party dependencies. It uses only the standard library: `sqlite3`, `contextvars`, `threading`, `json`, `hashlib`, `uuid`, `urllib`.

From PyPI (released builds):

```bash
pip install dprovenancekit
pip install "dprovenancekit[langchain]"       # + LangChain adapter
pip install "dprovenancekit[openai-agents]"   # + OpenAI Agents adapter
```

From a checkout (development):

```bash
pip install -e ".[dev]"
```

Requires Python 3.9+; the core has zero third-party dependencies. Releasing is documented in [RELEASING.md](https://github.com/Therealdk8890/DProvenanceKitPython/blob/main/RELEASING.md).

Want the whole arc in one runnable script (record → query → gate → detect anomalies → diff → report, then hand the same runs to the CLI)? Run the command below. The steps that follow build it up piece by piece.

```bash
python examples/end_to_end_demo.py
```

An event is any frozen dataclass that subclasses `TraceableEvent` and exposes a stable `type_identifier` and a `priority`:

```python
from dataclasses import dataclass
from dprovenancekit import TraceableEvent, TracePriority

@dataclass(frozen=True)
class MyAIDecision(TraceableEvent):
    kind: str  # "promptGenerated" | "documentEvaluated" | "conflictDetected" | "finalDecisionMade"
    token_count: int = 0
    document_id: str = ""
    score: float = 0.0
    reason: str = ""
    approved: bool = False

    @property
    def type_identifier(self) -> str:
        return self.kind

    @property
    def priority(self) -> TracePriority:
        if self.kind == "finalDecisionMade":
            return TracePriority.CRITICAL
        if self.kind == "conflictDetected":
            return TracePriority.DIAGNOSTIC
        return TracePriority.TELEMETRY
```

`record(...)` is synchronous and never blocks: it touches only an in-memory buffer. Ambient run / engine / span context propagates through `contextvars`, so nested scopes attribute events correctly with no plumbing.
Record events inside a run:

```python
from dprovenancekit import DProvenanceKit, InMemoryTraceStore

kit = DProvenanceKit(MyAIDecision)
store = InMemoryTraceStore()

with kit.run(context_id="demo_case", store=store):
    kit.record(MyAIDecision(kind="documentEvaluated", document_id="DocA", score=0.95))
    kit.record(MyAIDecision(kind="conflictDetected", reason="timeline inconsistency"))
    kit.record(MyAIDecision(kind="finalDecisionMade", approved=False))
```

Query the recorded runs:

```python
from dprovenancekit import TraceQueryDSL

# Find runs where a conflict was reported but no document was ever evaluated.
suspicious = store.query_runs(
    TraceQueryDSL()
    .requiring_step("conflictDetected")
    .missing_step("documentEvaluated")
)
```

The same DSL compiles to SQL for `SQLiteTraceStore` and is evaluated in memory for `InMemoryTraceStore`; the two backends are held in lockstep by a parity test suite.

Diff two runs structurally:

```python
from dprovenancekit import TraceDiffEngine

diff = TraceDiffEngine().diff(base=run_a, comparison=run_b)
print(diff.changes)  # structural steps that appeared, disappeared, or moved
```

`TraceAlignmentEngine` decides whether two executions are behaviorally equivalent within a formally-defined semantic model, even when payloads vary slightly:

```python
from dprovenancekit import (
    AlignmentConfiguration,
    AlignmentProfile,
    AnyEquivalenceEvaluator,
    TraceAlignmentEngine,
)

config = AlignmentConfiguration(
    profile=AlignmentProfile.strict_audit_v1,
    equivalence_evaluator=AnyEquivalenceEvaluator(
        evaluator_identifier="MyAIDecision_Semantic",
        evaluator=lambda a, b: 1.0 if a == b else 0.0,
    ),
)
result = TraceAlignmentEngine(config).align(base=run_a, comparison=run_b)
print(result.regression_risk.level)
```

Define an anomaly rule as a named query:

```python
from dprovenancekit import AnomalyDetector, AnomalyRule, TraceQueryDSL

class UnverifiedConflictRule(AnomalyRule):
    @property
    def name(self):
        return "unverified_conflict"

    @property
    def anomaly_query(self):
        return (
            TraceQueryDSL()
            .requiring_step("conflictDetected")
            .missing_step("documentEvaluated")
        )

    def describe(self, run):
        return "Conflict detected with no supporting evaluation"

anomalies = AnomalyDetector(store).detect_anomalies([UnverifiedConflictRule()])
```

Or drop in ready-made rules from the built-in library instead of writing your own:

```python
from dprovenancekit import AnomalyDetector, LoopingRule, ToolDropRule

anomalies = AnomalyDetector(store).detect_anomalies([
    ToolDropRule("safety_check"),              # never ran a required step
    LoopingRule("web_search", max_repeats=5),  # stuck repeating the same tool call
])
```

Run the regression gate in CI with no server: point it at a local SQLite trace database and a golden/candidate run id. Exit code is `0` (pass), `1` (regression), or `2` (usage error):

```bash
dprovenancekit gate --db traces.sqlite --golden "$GOLDEN_RUN_ID" --candidate "$CANDIDATE_RUN_ID"
dprovenancekit gate --db traces.sqlite --golden "$G" --candidate "$C" --max-level low --json
```

Gate across separate databases (a restored baseline vs. this PR's run), resolving the golden run id from the baseline instead of hardcoding it:

```bash
GOLDEN=$(dprovenancekit runs --db baseline.sqlite --context my-agent --latest --format id)
dprovenancekit gate --golden-db baseline.sqlite --golden "$GOLDEN" \
  --candidate-db candidate.sqlite --candidate "$CANDIDATE_RUN_ID"
```

Prebuilt CI integrations wrap this and comment the diff on the PR/MR: a [GitHub Action](https://github.com/Therealdk8890/DProvenanceKitPython/blob/main/action/README.md) and a [GitLab CI template](https://github.com/Therealdk8890/DProvenanceKitPython/blob/main/gitlab/README.md).

The library ships the same validation corpus as the Swift version.
The headless CLI runs it through the real benchmark runner:

```bash
dprovenancekit evaluate    # precision/recall/F1 over the standard + adversarial corpora
dprovenancekit diagnose    # causal ranking of failure modes
dprovenancekit stability   # determinism boundary: isolated vs perturbed F1 variance
```

Both corpora score Precision 1.000 / Recall 1.000 / F1 1.000: 8 standard scenarios (reordering, semantic evolution, noise injection, branch collapse, …) and 5 adversarial robustness traps (dependency inversion, partial truncation, semantic substitution, …), matching the Swift implementation case-for-case.

| Component | Module |
|---|---|
| Event model, priority tiers, drop accounting | `event`, `priority`, `drop_stats` |
| Recording API + ambient context | `kit`, `context` |
| Stores (in-memory, WAL SQLite, raw read, cloud) | `store`, `sqlite_store`, `raw_store`, `cloud_store` |
| Priority-aware write buffer | `write_buffer` |
| Query DSL + two backends (AST eval + SQL compiler) | `query` |
| Live querying + anomaly detection + rule library | `live_engine`, `anomaly`, `rules` |
| Structural diff + span-aware snapshot diff | `diff`, `snapshot_diff` |
| Deterministic replay | `replay` |
| Semantic alignment engine + evidence + verification | `alignment`, `verification` |
| Benchmark harness, failure diagnoser, corpus | `benchmark`, `corpus` |
| Pure view models for a trace viewer | `viewmodel` |
| Framework adapters (LangChain / LangGraph) | `integrations.langchain` |
| Framework adapters (OpenAI Agents SDK) | `integrations.openai_agents` |
| Regression-gate test helper | `testing` |
| Shareable HTML regression report | `report` |
| Framework-agnostic instrumentation decorators | `instrument` |
| Headless CLI (`gate`, `anomalies`, `runs`, `evaluate`) | `cli` |

The SwiftUI `DProvenanceUI` target is intentionally not ported (it is Apple-platform UI); its pure value-model layer (`SpanViewModel`, flattening) is ported in `viewmodel`.

Keeping the Swift and Python SDKs behaviorally equivalent is enforced, not hoped for.
[`conformance/`](https://github.com/Therealdk8890/DProvenanceKitPython/blob/main/conformance) holds Trace Specification v1: a language-neutral contract plus frozen golden vectors that pin the run fingerprint, the alignment profile hash, canonical payload encoding, query semantics, and alignment verdicts.

```bash
python -m pytest tests/test_conformance.py   # the Python SDK's claim of conformance
python conformance/generate_vectors.py       # intentionally re-freeze the contract
```

The committed `conformance/vectors/*.json` are the contract: any SDK (Swift today, Rust or TypeScript later) proves equivalence by reproducing the same files. See [`conformance/TRACE_SPEC_v1.md`](https://github.com/Therealdk8890/DProvenanceKitPython/blob/main/conformance/TRACE_SPEC_v1.md).

Framework adapters live in `dprovenancekit.integrations` and are the only parts of the package with third-party dependencies. The core stays pure standard library, and nothing imports an adapter unless you do.

```bash
pip install "dprovenancekit[langchain]"
```

```python
from dprovenancekit import SQLiteTraceStore
from dprovenancekit.integrations.langchain import DProvenanceTracer, LangChainTraceEvent

store = SQLiteTraceStore(LangChainTraceEvent, "traces.sqlite")
tracer = DProvenanceTracer(store)

with tracer.trace(context_id="customer-42") as cb:
    answer = chain.invoke(question, config={"callbacks": [cb]})
```

The run is now recorded: query it, diff it against a known-good run, or compare run fingerprints to detect when the agent took a different path.

[`DProvenanceCallbackHandler`](https://github.com/Therealdk8890/DProvenanceKitPython/blob/main/src/dprovenancekit/integrations/langchain.py) translates LangChain's callback stream into a trace: each `on_llm_start` / `on_tool_start` / `on_retriever_start` / `on_chain_start` and its completion becomes a typed event in execution order, LangChain's `run_id` / `parent_run_id` become the trace's span tree, the active model/tool/retriever becomes the engine, and by default lifecycle provenance edges are emitted (`DERIVED_FROM` start→completion, `INFORMED` parent→child).
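
As a rough illustration of how a flat stream of `run_id` / `parent_run_id` pairs can be folded into a span tree, here is a minimal standard-library sketch. The function `build_span_tree`, the tuple layout, and the sample ids are assumptions made for this example; the adapter's actual data structures may differ.

```python
def build_span_tree(spans):
    """Fold (span_id, parent_id, name) tuples into nested dicts.

    Illustrative only: parent_id is None for a root span, and children keep
    the order in which their spans arrived.
    """
    names = {}
    children = {}  # parent id -> list of child ids, in arrival order
    for span_id, parent_id, name in spans:
        names[span_id] = name
        children.setdefault(parent_id, []).append(span_id)

    def node(span_id):
        return {
            "name": names[span_id],
            "children": [node(child) for child in children.get(span_id, [])],
        }

    return [node(root) for root in children.get(None, [])]


# Hypothetical callback stream: a chain that calls an LLM, then a tool
# which in turn triggers a retriever.
spans = [
    ("1", None, "chain"),
    ("2", "1", "llm"),
    ("3", "1", "tool"),
    ("4", "3", "retriever"),
]
tree = build_span_tree(spans)
```

Keying children by parent id means the tree can be assembled in one pass, as long as each span's parent id is known when the span is seen.
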
Because events flow through the same recording path as hand-written ones, the whole toolkit applies: a run's fingerprint is the structural identity of the agent's execution path, so two runs that diverge (a tool called in a different order, a retrieval step skipped) produce different fingerprints, which makes for a cheap regression signal. Options: `capture_payloads` (prompt/completion/IO previews), `link_lifecycle_edges`, `record_chains` (LCEL/LangGraph chain noise).

```bash
pip install "dprovenancekit[openai-agents]"
```

```python
from dprovenancekit import SQLiteTraceStore
from dprovenancekit.integrations.openai_agents import register, OpenAIAgentsTraceEvent

store = SQLiteTraceStore(OpenAIAgentsTraceEvent, "traces.sqlite")
register(store)  # registers a global tracing processor

# ... run your agents normally; each run is recorded ...
```

[`DProvenanceTracingProcessor`](https://github.com/Therealdk8890/DProvenanceKitPython/blob/main/src/dprovenancekit/integrations/openai_agents.py) implements the SDK's `TracingProcessor`: each agent run becomes a trace-run (`context_id` = the trace name), and every span start/end becomes a typed event (`agent.start`, `generation.end`, `function.start`, `guardrail.error`, …) in execution order. The span's `span_id` / `parent_id` become the span tree, the active agent/tool/model becomes the engine, errors and triggered guardrails are recorded at `CRITICAL`, and lifecycle provenance edges are emitted (same `DERIVED_FROM` / `INFORMED` model). One registered processor captures every run; the same fingerprint/diff/align tooling then applies.

`dprovenancekit.testing` turns "did my agent regress?" into one assertion you can drop into any test or CI step. Give it a golden run (known-good) and a candidate run (what your current code produced); it aligns them and fails with a readable diagnostic if the candidate diverged.
```python
from dprovenancekit.testing import assert_no_regression

assert_no_regression(golden=golden_run, candidate=candidate_run)
```

Strict by default: any removed, added, or changed (ambiguous) step fails, and a removed or reordered `CRITICAL` step is additionally a HIGH-severity regression (reordering a critical step can invert a dependency). Loosen with `max_regression_level` (gate only on severity) or `allow_divergent_steps` (tolerate benign per-step changes), or pass a custom evaluator to define what "equivalent" means (e.g. ignore volatile fields like token counts). `RegressionGate(...).check(...)` returns a `RegressionReport` (no raise) for richer assertions. Detecting reordered steps requires a span-aware profile (`AlignmentProfile.developer_debug_v1`); the default linear profile treats a pure reorder as still-matching. The gate complements `AlignmentSnapshotValidator` (an exact output-hash snapshot): it works on two runs and reasons about regression severity.

[`examples/regression_testing.py`](https://github.com/Therealdk8890/DProvenanceKitPython/blob/main/examples/regression_testing.py) is the end-to-end story in ~150 readable lines: record a golden run of a fact-checking agent (retrieve → verify → decide), then catch a later run that skips its verification step, via both the fast fingerprint check and the detailed alignment verdict (which flags the dropped `claimVerified` step as a HIGH regression).

```bash
python examples/regression_testing.py
```

It self-asserts its verdicts, so it doubles as an executable test of the headline use case.

Not using a framework? Instrument a hand-written agent loop directly. There is no event type to define and there are zero dependencies; the layer ships in core as `dprovenancekit.instrument`:

```python
from dprovenancekit import InMemoryTraceStore, traced, traced_run, record_event

@traced
def search(query): ...

@traced
def answer(question, sources): ...

store = InMemoryTraceStore()
with traced_run(store, context_id="ticket-42"):
    sources = search(question)
    record_event("plan.chosen", {"strategy": "rag"})
    reply = answer(question, sources)
```

`@traced` records a `"<name>.start"` / `".end"` / `".error"` event pair per call in its own span (the function name is the engine), nests calls in the span tree, and emits the same `DERIVED_FROM` / `INFORMED` provenance edges as the framework adapters. `record_event(...)` drops an ad-hoc event (a decision, a chosen branch). Plain functions, `async def`, generators, and async generators are all supported (for a generator, start/end bracket the full iteration). Instrumentation never changes behavior: capture is failure-proof and exceptions pass through unchanged. Outside a `traced_run` the decorators are transparent, so instrumented code is safe to call untraced. The trace it produces is identical in shape to the adapter-produced ones, so fingerprint / diff / align / the regression gate all apply.
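
The fast fingerprint check can be pictured with a small standard-library sketch: hash the ordered sequence of step identifiers, so any skipped or reordered step changes the digest. This is only an illustration of the idea; the real run fingerprint is pinned by Trace Specification v1 and its exact inputs and encoding are not reproduced here. The function name `fingerprint` and the step names are assumptions.

```python
import hashlib
import json


def fingerprint(steps):
    """Hash an ordered list of step identifiers (illustrative, not the spec).

    Only step identity and order go into the digest, so payload differences
    would not change it, but a skipped or reordered step would.
    """
    canonical = json.dumps(list(steps), separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


golden = ["retrieve", "verify", "decide"]
same_path = ["retrieve", "verify", "decide"]
skipped = ["retrieve", "decide"]               # verification step dropped
reordered = ["verify", "retrieve", "decide"]   # same steps, different order

unchanged = fingerprint(golden) == fingerprint(same_path)
caught_skip = fingerprint(golden) != fingerprint(skipped)
caught_reorder = fingerprint(golden) != fingerprint(reordered)
```

A mismatch only says that the path changed; a structural diff or an alignment verdict is still needed to say how it changed and how severe the change is.
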
```bash
python -m pytest
```

168 tests: 80 ported from the Swift suite (query parity, write-buffer backpressure, SQLite stress + drop accounting, alignment, replay, snapshot diff, explainability fidelity, benchmark scoring, cloud chaos, …), 28 cross-language conformance checks against the frozen Trace Specification v1 vectors, 14 LangChain integration tests, 16 OpenAI Agents SDK integration tests, 16 instrumentation-layer tests, 13 regression-gate tests, and the regression-testing example run as a self-asserting test. The real-framework tests run only when `langchain-core` / `openai-agents` are installed; otherwise they are skipped.

Distributed under the Apache License 2.0. See [LICENSE](https://github.com/Therealdk8890/DProvenanceKitPython/blob/main/LICENSE).