{"slug": "dprovenancekit-execution-provenance-for-ai-systems", "title": "DProvenanceKit: Execution Provenance for AI Systems", "summary": "DProvenanceKit, a Python port of the Swift library for execution provenance in AI systems, has been released. It provides reasoning observability and regression testing by turning each execution into a queryable, diffable trace, enabling detection of reasoning drift between runs. The tool ships with CI gating, anomaly detection rules, and a hosted visualizer, requiring zero third-party dependencies.", "body_md": "**Reasoning observability and regression testing for AI systems — a Python port of the Swift DProvenanceKit.**\n\nWhen an agent's reasoning drifts between runs, DProvenanceKit turns each execution into a queryable, diffable trace so you can see *what changed and why* — not just *what happened*.\n\nRun → Record → Query → Diff → Detect regressions → Gate in CI\n\n**It's not just the library** — it ships the surfaces that make reasoning regressions actionable:\n\n- **Gate in CI** — a server-less `dprovenancekit gate` CLI, plus a drop-in [GitHub Action](/Therealdk8890/DProvenanceKitPython/blob/main/action/README.md) and [GitLab CI template](/Therealdk8890/DProvenanceKitPython/blob/main/gitlab/README.md) that fail a PR/MR when an agent's reasoning drifts from a golden baseline, and comment the diff.\n- **Out-of-the-box anomaly rules** — Tool Drop and Looping detection with a JSON rule registry, runnable locally or on every PR.\n- **A hosted visualizer** — a web dashboard (single-run span tree, JSON payload inspector, side-by-side semantic diff, shareable HTML reports) backed by a regression-gate API and multi-tenant control plane. 
Available as a separate commercial service.\n\nSee it all in one runnable script: [`python examples/end_to_end_demo.py`](/Therealdk8890/DProvenanceKitPython/blob/main/examples/end_to_end_demo.py).\n\nThis is a faithful, dependency-free port of the Swift library to Python. It keeps the same architecture and guarantees — synchronous non-blocking recording, priority-aware backpressure, one query language over two backends held at parity, structural diffing, formally-modeled semantic alignment, and by-tier drop accounting so load-shedding is never silent.\n\nThe original Swift package is unchanged; this is a parallel implementation.\n\nThe Swift library targets Apple-platform and on-device AI. This port brings the same reasoning-layer observability to Python codebases — agent frameworks, LLM workflows, tool-using models — with **zero third-party dependencies** (it uses only the standard library: `sqlite3`, `contextvars`, `threading`, `json`, `hashlib`, `uuid`, `urllib`).\n\nFrom PyPI (released builds):\n\n```\npip install dprovenancekit\npip install \"dprovenancekit[langchain]\"        # + LangChain adapter\npip install \"dprovenancekit[openai-agents]\"    # + OpenAI Agents adapter\n```\n\nFrom a checkout (development):\n\n```\npip install -e \".[dev]\"\n```\n\nRequires Python 3.9+; the core has **zero third-party dependencies**. Releasing is documented in [RELEASING.md](/Therealdk8890/DProvenanceKitPython/blob/main/RELEASING.md).\n\nWant the whole arc in one runnable script — record → query → gate → detect anomalies → diff → report, then hand the same runs to the CLI? Run `python examples/end_to_end_demo.py`. 
The steps below build it up piece by piece.\n\nAn event type is any frozen dataclass that subclasses `TraceableEvent`, exposing a stable `type_identifier` and a `priority`:\n\n``` python\nfrom dataclasses import dataclass\nfrom dprovenancekit import TraceableEvent, TracePriority\n\n@dataclass(frozen=True)\nclass MyAIDecision(TraceableEvent):\n    kind: str           # \"promptGenerated\" | \"documentEvaluated\" | \"conflictDetected\" | \"finalDecisionMade\"\n    token_count: int = 0\n    document_id: str = \"\"\n    score: float = 0.0\n    reason: str = \"\"\n    approved: bool = False\n\n    @property\n    def type_identifier(self) -> str:\n        return self.kind\n\n    @property\n    def priority(self) -> TracePriority:\n        if self.kind == \"finalDecisionMade\":\n            return TracePriority.CRITICAL\n        if self.kind == \"conflictDetected\":\n            return TracePriority.DIAGNOSTIC\n        return TracePriority.TELEMETRY\n```\n\n`record(...)` is synchronous and never blocks — it touches only an in-memory buffer. Ambient run / engine / span context propagates through `contextvars`, so nested scopes attribute events correctly with no plumbing.\n\n``` python\nfrom dprovenancekit import DProvenanceKit, InMemoryTraceStore\n\nkit = DProvenanceKit(MyAIDecision)\nstore = InMemoryTraceStore()\n\nwith kit.run(context_id=\"demo_case\", store=store):\n    kit.record(MyAIDecision(kind=\"documentEvaluated\", document_id=\"DocA\", score=0.95))\n    kit.record(MyAIDecision(kind=\"conflictDetected\", reason=\"timeline_inconsistency\"))\n    kit.record(MyAIDecision(kind=\"finalDecisionMade\", approved=False))\n```\n\n``` python\nfrom dprovenancekit import TraceQueryDSL\n\nsuspicious = store.query_runs(\n    TraceQueryDSL()\n        .requiring_step(\"conflictDetected\")\n        .missing_step(\"documentEvaluated\")\n)\n```\n\nThis query finds runs where a conflict was reported but no document was ever evaluated. 
The same DSL compiles to SQL for `SQLiteTraceStore` and is evaluated in memory for `InMemoryTraceStore` — the two backends are held in lockstep by a parity test suite.\n\n``` python\nfrom dprovenancekit import TraceDiffEngine\n\ndiff = TraceDiffEngine().diff(base=run_a, comparison=run_b)\nprint(diff.changes)   # structural steps that appeared, disappeared, or moved\n```\n\n`TraceAlignmentEngine` decides whether two executions are behaviorally equivalent within a formally-defined semantic model, even when payloads vary slightly:\n\n``` python\nfrom dprovenancekit import (\n    AlignmentConfiguration, AlignmentProfile, AnyEquivalenceEvaluator, TraceAlignmentEngine,\n)\n\nconfig = AlignmentConfiguration(\n    profile=AlignmentProfile.strict_audit_v1,\n    equivalence_evaluator=AnyEquivalenceEvaluator(\n        evaluator_identifier=\"MyAIDecision_Semantic\",\n        evaluator=lambda a, b: 1.0 if a == b else 0.0,\n    ),\n)\nresult = TraceAlignmentEngine(config).align(base=run_a, comparison=run_b)\nprint(result.regression_risk.level)\n```\n\n``` python\nfrom dprovenancekit import AnomalyDetector, AnomalyRule, TraceQueryDSL\n\nclass UnverifiedConflictRule(AnomalyRule):\n    @property\n    def name(self): return \"unverified_conflict\"\n    @property\n    def anomaly_query(self):\n        return TraceQueryDSL().requiring_step(\"conflictDetected\").missing_step(\"documentEvaluated\")\n    def describe(self, run): return \"Conflict detected with no supporting evaluation\"\n\nanomalies = AnomalyDetector(store).detect_anomalies([UnverifiedConflictRule()])\n```\n\nOr drop in ready-made rules from the built-in library instead of writing your own:\n\n``` python\nfrom dprovenancekit import AnomalyDetector, LoopingRule, ToolDropRule\n\nanomalies = AnomalyDetector(store).detect_anomalies([\n    ToolDropRule(\"safety_check\"),              # never ran a required step\n    LoopingRule(\"web_search\", max_repeats=5),  # stuck repeating the same tool call\n])\n```\n\nRun the regression 
gate in CI with no server — point it at a local SQLite trace database and a golden and a candidate run ID. The exit code is `0` (pass), `1` (regression), or `2` (usage error):\n\n```\ndprovenancekit gate --db traces.sqlite --golden \"$GOLDEN_RUN_ID\" --candidate \"$CANDIDATE_RUN_ID\"\ndprovenancekit gate --db traces.sqlite --golden \"$G\" --candidate \"$C\" --max-level low --json\n\n# Gate across separate databases (a restored baseline vs. this PR's run), resolving\n# the golden run id from the baseline instead of hardcoding it:\nGOLDEN=$(dprovenancekit runs --db baseline.sqlite --context my-agent --latest --format id)\ndprovenancekit gate --golden-db baseline.sqlite --golden \"$GOLDEN\" \\\n                    --candidate-db candidate.sqlite --candidate \"$CANDIDATE_RUN_ID\"\n```\n\nPrebuilt CI integrations wrap this and comment the diff on the PR/MR: a [GitHub Action](/Therealdk8890/DProvenanceKitPython/blob/main/action/README.md) and a [GitLab CI template](/Therealdk8890/DProvenanceKitPython/blob/main/gitlab/README.md).\n\nThe library ships the same validation corpus as the Swift version. 
The headless CLI runs it through the real benchmark runner:\n\n```\ndprovenancekit evaluate     # precision/recall/F1 over the standard + adversarial corpora\ndprovenancekit diagnose     # causal ranking of failure modes\ndprovenancekit stability    # determinism boundary: isolated vs perturbed F1 variance\n```\n\nBoth corpora score **Precision 1.000 / Recall 1.000 / F1 1.000** — 8 standard scenarios (reordering, semantic evolution, noise injection, branch collapse, …) and 5 adversarial robustness traps (dependency inversion, partial truncation, semantic substitution, …) — matching the Swift implementation case-for-case.\n\n| Component | Module |\n|---|---|\n| Event model, priority tiers, drop accounting | `event`, `priority`, `drop_stats` |\n| Recording API + ambient context | `kit`, `context` |\n| Stores (in-memory, WAL SQLite, raw read, cloud) | `store`, `sqlite_store`, `raw_store`, `cloud_store` |\n| Priority-aware write buffer | `write_buffer` |\n| Query DSL + two backends (AST eval + SQL compiler) | `query` |\n| Live querying + anomaly detection + rule library | `live_engine`, `anomaly`, `rules` |\n| Structural diff + span-aware snapshot diff | `diff`, `snapshot_diff` |\n| Deterministic replay | `replay` |\n| Semantic alignment engine + evidence + verification | `alignment_*`, `verification` |\n| Benchmark harness, failure diagnoser, corpus | `benchmark`, `corpus` |\n| Pure view models for a trace viewer | `viewmodel` |\n| Framework adapters (LangChain / LangGraph) | `integrations.langchain` |\n| Framework adapters (OpenAI Agents SDK) | `integrations.openai_agents` |\n| Regression-gate test helper | `testing` |\n| Shareable HTML regression report | `report` |\n| Framework-agnostic instrumentation (decorators) | `instrument` |\n| Headless CLI — `gate`, `anomalies`, `runs`, `evaluate` | `cli` |\n\nThe SwiftUI `DProvenanceUI` target is intentionally **not** ported (it is Apple-platform UI); its pure value-model layer (`SpanViewModel`, 
flattening) is ported in `viewmodel`.\n\nKeeping the Swift and Python SDKs behaviorally equivalent is enforced, not hoped for. [`conformance/`](/Therealdk8890/DProvenanceKitPython/blob/main/conformance) holds **Trace Specification v1** — a language-neutral contract plus frozen golden vectors that pin the run fingerprint, the alignment profile hash, canonical payload encoding, query semantics, and alignment verdicts.\n\n```\npython -m pytest tests/test_conformance.py   # the Python SDK's claim of conformance\npython conformance/generate_vectors.py        # intentionally re-freeze the contract\n```\n\nThe committed `conformance/vectors/*.json` are the contract: any SDK — Swift today, Rust or TypeScript later — proves equivalence by reproducing the same files. See [`conformance/TRACE_SPEC_v1.md`](/Therealdk8890/DProvenanceKitPython/blob/main/conformance/TRACE_SPEC_v1.md).\n\nFramework adapters live in `dprovenancekit.integrations` and are the only parts of the package with third-party dependencies — the core stays pure standard library, and nothing imports an adapter unless you do.\n\n```\npip install \"dprovenancekit[langchain]\"\n```\n\n``` python\nfrom dprovenancekit import SQLiteTraceStore\nfrom dprovenancekit.integrations.langchain import DProvenanceTracer, LangChainTraceEvent\n\nstore = SQLiteTraceStore(LangChainTraceEvent, \"traces.sqlite\")\ntracer = DProvenanceTracer(store)\n\nwith tracer.trace(context_id=\"customer-42\") as cb:\n    answer = chain.invoke(question, config={\"callbacks\": [cb]})\n\n# The run is now recorded — query it, diff it against a known-good run, or\n# compare run fingerprints to detect when the agent took a different path.\n```\n\n[`DProvenanceCallbackHandler`](/Therealdk8890/DProvenanceKitPython/blob/main/src/dprovenancekit/integrations/langchain.py) translates LangChain's callback stream into a trace: each `on_llm_start` / `on_tool_start` / `on_retriever_start` / `on_chain_start` (and its completion) becomes a typed event in 
execution order, LangChain's `run_id`/`parent_run_id` become the trace's **span tree**, the active model/tool/retriever becomes the **engine**, and (by default) lifecycle **provenance edges** are emitted (`DERIVED_FROM` start→completion, `INFORMED` parent→child). Because events flow through the same recording path as hand-written ones, the whole toolkit applies: a run's **fingerprint** is the structural identity of the agent's execution path, so two runs that diverge (a tool called in a different order, a retrieval step skipped) produce different fingerprints — a cheap regression signal. Options: `capture_payloads` (prompt/completion/IO previews), `link_lifecycle` (edges), `record_chains` (LCEL/LangGraph chain noise).\n\n```\npip install \"dprovenancekit[openai-agents]\"\n```\n\n``` python\nfrom dprovenancekit import SQLiteTraceStore\nfrom dprovenancekit.integrations.openai_agents import register, OpenAIAgentsTraceEvent\n\nstore = SQLiteTraceStore(OpenAIAgentsTraceEvent, \"traces.sqlite\")\nregister(store)   # registers a global tracing processor\n\n# ... run your agents normally; each run is recorded ...\n```\n\n[`DProvenanceTracingProcessor`](/Therealdk8890/DProvenanceKitPython/blob/main/src/dprovenancekit/integrations/openai_agents.py) implements the SDK's `TracingProcessor`: each agent run becomes a trace-run (`context_id` = the trace name), and every span start/end becomes a typed event — `agent.start`, `generation.end`, `function.start`, `guardrail.error`, … — in execution order. The span's `span_id`/`parent_id` become the **span tree**, the active agent/tool/model becomes the **engine**, errors and triggered guardrails are recorded at `CRITICAL`, and lifecycle **provenance edges** are emitted (same `DERIVED_FROM`/`INFORMED` model). 
One registered processor captures every run; the same `fingerprint`/diff/align tooling then applies.\n\n`dprovenancekit.testing` turns \"did my agent regress?\" into one assertion you can drop into any test or CI step. Give it a *golden* run (known-good) and a *candidate* run (what your current code produced); it aligns them and fails with a readable diagnostic if the candidate diverged.\n\n``` python\nfrom dprovenancekit.testing import assert_no_regression\n\nassert_no_regression(golden=golden_run, candidate=candidate_run)\n```\n\nStrict by default — any removed, added, or changed (ambiguous) step fails, and a removed *or reordered* CRITICAL step is additionally a HIGH-severity regression (reordering a critical step can invert a dependency). Loosen with `max_regression_level` (gate only on severity) or `allow_divergent_steps` (tolerate benign per-step changes), or pass a custom `evaluator` to define what \"equivalent\" means (e.g. ignore volatile fields like token counts). `RegressionGate(...).check(...)` returns a `RegressionReport` (no raise) for richer assertions. Detecting *reordered* steps requires a span-aware profile (`AlignmentProfile.developer_debug_v1`); the default linear profile treats a pure reorder as still-matching. 
The gate complements `AlignmentSnapshotValidator` (an exact output-hash snapshot): it works on two runs and reasons about regression severity.\n\n[`examples/regression_testing.py`](/Therealdk8890/DProvenanceKitPython/blob/main/examples/regression_testing.py) is the end-to-end story in ~150 readable lines: record a **golden** run of a fact-checking agent (retrieve → verify → decide), then catch a later run that skips its verification step — via both the fast **fingerprint** check and the detailed **alignment** verdict (which flags the dropped `claimVerified` step as a HIGH regression).\n\n```\npython examples/regression_testing.py\n```\n\nIt self-asserts its verdicts, so it doubles as an executable test of the headline use case.\n\nNot using a framework? Instrument a hand-written agent loop directly — no event type to define, zero dependencies (ships in core as `dprovenancekit.instrument`):\n\n``` python\nfrom dprovenancekit import InMemoryTraceStore, traced, traced_run, record_event\n\n@traced\ndef search(query): ...\n\n@traced\ndef answer(question, sources): ...\n\nquestion = \"How do I reset my password?\"  # illustrative input\nstore = InMemoryTraceStore()\nwith traced_run(store, context_id=\"ticket-42\"):\n    sources = search(question)\n    record_event(\"plan.chosen\", {\"strategy\": \"rag\"})\n    reply = answer(question, sources)\n```\n\n`@traced` records a `\"<name>.start\"` / `\".end\"` / `\".error\"` event pair per call in its own **span** (the function name is the **engine**), nests calls in the span tree, and emits the same `DERIVED_FROM` / `INFORMED` provenance edges as the framework adapters. `record_event(...)` drops an ad-hoc event (a decision, a chosen branch). Plain functions, `async def`, generators, and async generators are all supported (for a generator, start/end bracket the full iteration). Instrumentation never changes behavior — capture is failure-proof and exceptions pass through unchanged. 
Outside a `traced_run` the decorators are transparent, so instrumented code is safe to call untraced. The trace it produces is identical in shape to the adapter-produced ones, so fingerprint / diff / align / the regression gate all apply.\n\n```\npython -m pytest\n```\n\n168 tests: 80 ported from the Swift suite (query parity, write-buffer backpressure, SQLite stress + drop accounting, alignment, replay, snapshot diff, explainability fidelity, benchmark scoring, cloud chaos, …), 28 cross-language conformance checks against the frozen Trace Specification v1 vectors, 14 LangChain integration tests, 16 OpenAI Agents SDK integration tests, 16 instrumentation-layer tests, 13 regression-gate tests, and the regression-testing example run as a self-asserting test. (The real-framework tests run only when `langchain-core` / `openai-agents` are installed and are skipped otherwise.)\n\nDistributed under the **Apache License 2.0**. 
See [LICENSE](/Therealdk8890/DProvenanceKitPython/blob/main/LICENSE).", "url": "https://wpnews.pro/news/dprovenancekit-execution-provenance-for-ai-systems", "canonical_source": "https://github.com/Therealdk8890/DProvenanceKitPython", "published_at": "2026-07-01 06:35:10+00:00", "updated_at": "2026-07-01 06:49:41.013567+00:00", "lang": "en", "topics": ["ai-tools", "ai-agents", "developer-tools", "ai-safety", "machine-learning"], "entities": ["DProvenanceKit", "GitHub", "PyPI", "LangChain", "OpenAI Agents", "GitLab", "Swift", "Python"], "alternates": {"html": "https://wpnews.pro/news/dprovenancekit-execution-provenance-for-ai-systems", "markdown": "https://wpnews.pro/news/dprovenancekit-execution-provenance-for-ai-systems.md", "text": "https://wpnews.pro/news/dprovenancekit-execution-provenance-for-ai-systems.txt", "jsonld": "https://wpnews.pro/news/dprovenancekit-execution-provenance-for-ai-systems.jsonld"}}