
DProvenanceKit: Execution Provenance for AI Systems

DProvenanceKit, a Python port of the Swift library for execution provenance in AI systems, has been released. It provides reasoning observability and regression testing by turning each execution into a queryable, diffable trace, enabling detection of reasoning drift between runs. The tool ships with CI gating, anomaly detection rules, and a hosted visualizer; the core library has zero third-party dependencies.

11 min read · published Jul 1, 2026

Reasoning observability and regression testing for AI systems: a Python port of the Swift DProvenanceKit.

When an agent's reasoning drifts between runs, DProvenanceKit turns each execution into a queryable, diffable trace so you can see what changed and why, not just what happened.

Run → Record → Query → Diff → Detect regressions → Gate in CI

It's not just the library. It also ships the surfaces that make reasoning regressions actionable:

- Gate in CI: a server-less dprovenancekit gate CLI, plus a drop-in GitHub Action and GitLab CI template that fail a PR/MR when an agent's reasoning drifts from a golden baseline, and comment the diff.
- Out-of-the-box anomaly rules: Tool Drop and Looping detection with a JSON rule registry, runnable locally or on every PR.
- A hosted visualizer: a web dashboard (single-run span tree, JSON payload inspector, side-by-side semantic diff, shareable HTML reports) backed by a regression-gate API and multi-tenant control plane. Available as a separate commercial service.

See it all in one runnable script: python examples/end_to_end_demo.py.

This is a faithful, dependency-free port of the Swift library to Python. It keeps the same architecture and guarantees: synchronous non-blocking recording, priority-aware backpressure, one query language over two backends held at parity, structural diffing, formally-modeled semantic alignment, and by-tier drop accounting so load-shedding is never silent.

The original Swift package is unchanged; this is a parallel implementation.

The Swift library targets Apple-platform and on-device AI. This port brings the same reasoning-layer observability to Python codebases (agent frameworks, LLM workflows, tool-using models) with zero third-party dependencies. It uses only the standard library: sqlite3, contextvars, threading, json, hashlib, uuid, urllib.

From PyPI (released builds):

pip install dprovenancekit
pip install "dprovenancekit[langchain]"        # + LangChain adapter
pip install "dprovenancekit[openai-agents]"    # + OpenAI Agents adapter

From a checkout (development):

pip install -e ".[dev]"

Requires Python 3.9+; the core has zero third-party dependencies. Releasing is documented in RELEASING.md.

Want the whole arc in one runnable script (record → query → gate → detect anomalies → diff → report, then hand the same runs to the CLI)? Run python examples/end_to_end_demo.py. The steps below build it up piece by piece.

An event type is any frozen dataclass that subclasses TraceableEvent and exposes a stable type_identifier and a priority:

from dataclasses import dataclass
from dprovenancekit import TraceableEvent, TracePriority

@dataclass(frozen=True)
class MyAIDecision(TraceableEvent):
    kind: str           # "promptGenerated" | "documentEvaluated" | "conflictDetected" | "finalDecisionMade"
    token_count: int = 0
    document_id: str = ""
    score: float = 0.0
    reason: str = ""
    approved: bool = False

    @property
    def type_identifier(self) -> str:
        return self.kind

    @property
    def priority(self) -> TracePriority:
        if self.kind == "finalDecisionMade":
            return TracePriority.CRITICAL
        if self.kind == "conflictDetected":
            return TracePriority.DIAGNOSTIC
        return TracePriority.TELEMETRY
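
To make the role of the priority tiers concrete, here is a minimal, stdlib-only sketch of a bounded buffer that sheds low-priority events first and counts every drop by tier. It is not the library's write buffer: the Tier enum, the SheddingBuffer class, and the assumption that TELEMETRY is shed before DIAGNOSTIC and DIAGNOSTIC before CRITICAL are all illustrative.

```python
from collections import Counter, deque
from enum import IntEnum


class Tier(IntEnum):
    # Stand-in for TracePriority; the relative ordering is an assumption.
    TELEMETRY = 0
    DIAGNOSTIC = 1
    CRITICAL = 2


class SheddingBuffer:
    """Bounded buffer that evicts the lowest tier first and counts every drop."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.queues = {tier: deque() for tier in Tier}
        self.dropped = Counter()  # drops per tier, so shedding is never silent

    def __len__(self) -> int:
        return sum(len(q) for q in self.queues.values())

    def push(self, tier: Tier, event: str) -> bool:
        """Append without blocking; return False if the new event was shed."""
        if len(self) >= self.capacity:
            # Look for the lowest non-empty tier strictly below the incoming one.
            victim = next((t for t in Tier if t < tier and self.queues[t]), None)
            if victim is None:
                self.dropped[tier] += 1
                return False
            self.queues[victim].popleft()
            self.dropped[victim] += 1
        self.queues[tier].append(event)
        return True


buf = SheddingBuffer(capacity=2)
buf.push(Tier.TELEMETRY, "promptGenerated")
buf.push(Tier.TELEMETRY, "documentEvaluated")
buf.push(Tier.CRITICAL, "finalDecisionMade")  # evicts the oldest telemetry event
buf.push(Tier.TELEMETRY, "promptGenerated")   # full, nothing lower to evict: shed
print(dict(buf.dropped))
```

The point of the per-tier counter is that a reader of the trace can tell how much was shed and from which tier, rather than seeing a silently thinner trace.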

record(...) is synchronous and never blocks: it touches only an in-memory buffer. Ambient run / engine / span context propagates through contextvars, so nested scopes attribute events correctly with no plumbing.

from dprovenancekit import DProvenanceKit, InMemoryTraceStore

kit = DProvenanceKit(MyAIDecision)
store = InMemoryTraceStore()

with kit.run(context_id="demo_case", store=store):
    kit.record(MyAIDecision(kind="documentEvaluated", document_id="DocA", score=0.95))
    kit.record(MyAIDecision(kind="conflictDetected", reason="timeline_inconsistency"))
    kit.record(MyAIDecision(kind="finalDecisionMade", approved=False))
from dprovenancekit import TraceQueryDSL

suspicious = store.query_runs(
    TraceQueryDSL()
        .requiring_step("conflictDetected")
        .missing_step("documentEvaluated")
)

Find runs where a conflict was reported but no document was ever evaluated. The same DSL compiles to SQL for SQLiteTraceStore and is evaluated in memory for InMemoryTraceStore. The two backends are held in lockstep by a parity test suite.
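
As a rough illustration of what backend parity means, the sketch below evaluates the same "requires X, missing Y" predicate once over plain Python data and once as SQL, then compares the answers. The (run_id, step) table layout and both helper functions are assumptions made for this example; the library's actual schema and compiler are not shown in this article.

```python
import sqlite3

# Toy data: run id -> ordered step identifiers (hypothetical runs).
runs = {
    "run-1": ["documentEvaluated", "conflictDetected", "finalDecisionMade"],
    "run-2": ["conflictDetected", "finalDecisionMade"],
    "run-3": ["documentEvaluated", "finalDecisionMade"],
}


def query_in_memory(runs, requiring, missing):
    """Evaluate the predicate directly against Python data."""
    return sorted(
        run_id for run_id, steps in runs.items()
        if requiring in steps and missing not in steps
    )


def query_sql(runs, requiring, missing):
    """Evaluate the same predicate as SQL over an assumed (run_id, step) table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (run_id TEXT, step TEXT)")
    conn.executemany(
        "INSERT INTO events VALUES (?, ?)",
        [(run_id, step) for run_id, steps in runs.items() for step in steps],
    )
    rows = conn.execute(
        """
        SELECT DISTINCT e.run_id FROM events AS e
        WHERE EXISTS (SELECT 1 FROM events WHERE run_id = e.run_id AND step = ?)
          AND NOT EXISTS (SELECT 1 FROM events WHERE run_id = e.run_id AND step = ?)
        ORDER BY e.run_id
        """,
        (requiring, missing),
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]


in_memory = query_in_memory(runs, "conflictDetected", "documentEvaluated")
via_sql = query_sql(runs, "conflictDetected", "documentEvaluated")
print(in_memory, via_sql)
```

A parity suite in this spirit would run many such queries against both paths and fail on any disagreement.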

from dprovenancekit import TraceDiffEngine

diff = TraceDiffEngine().diff(base=run_a, comparison=run_b)
print(diff.changes)   # structural steps that appeared, disappeared, or moved
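
For intuition about what a structural diff reports, here is a small sketch built on the standard difflib module. It is a stand-in technique, not TraceDiffEngine: step_changes is a hypothetical helper, it works on bare step names, and it reports a moved step as one removal plus one addition rather than as a move.

```python
from difflib import SequenceMatcher


def step_changes(base, comparison):
    """List steps that appear only in one of two ordered step sequences."""
    changes = []
    matcher = SequenceMatcher(a=base, b=comparison, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            changes += [("removed", step) for step in base[i1:i2]]
        if tag in ("insert", "replace"):
            changes += [("added", step) for step in comparison[j1:j2]]
    return changes


golden = ["retrieve", "verify", "decide"]
candidate = ["retrieve", "decide", "log"]
changes = step_changes(golden, candidate)
print(changes)
```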

TraceAlignmentEngine decides whether two executions are behaviorally equivalent within a formally-defined semantic model, even when payloads vary slightly:

from dprovenancekit import (
    AlignmentConfiguration, AlignmentProfile, AnyEquivalenceEvaluator, TraceAlignmentEngine,
)

config = AlignmentConfiguration(
    profile=AlignmentProfile.strict_audit_v1,
    equivalence_evaluator=AnyEquivalenceEvaluator(
        evaluator_identifier="MyAIDecision_Semantic",
        evaluator=lambda a, b: 1.0 if a == b else 0.0,
    ),
)
result = TraceAlignmentEngine(config).align(base=run_a, comparison=run_b)
print(result.regression_risk.level)
from dprovenancekit import AnomalyDetector, AnomalyRule, TraceQueryDSL

class UnverifiedConflictRule(AnomalyRule):
    @property
    def name(self): return "unverified_conflict"
    @property
    def anomaly_query(self):
        return TraceQueryDSL().requiring_step("conflictDetected").missing_step("documentEvaluated")
    def describe(self, run): return "Conflict detected with no supporting evaluation"

anomalies = AnomalyDetector(store).detect_anomalies([UnverifiedConflictRule()])

Or drop in ready-made rules from the built-in library instead of writing your own:

from dprovenancekit import AnomalyDetector, LoopingRule, ToolDropRule

anomalies = AnomalyDetector(store).detect_anomalies([
    ToolDropRule("safety_check"),              # never ran a required step
    LoopingRule("web_search", max_repeats=5),  # stuck repeating the same tool call
])
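
The two built-in rule names describe checks that are easy to picture on a plain list of step names. The sketch below is one possible reading, not the library's rule implementation: the helper functions are hypothetical, and treating "looping" as more than max_repeats consecutive calls is an assumption.

```python
from itertools import groupby


def tool_dropped(steps, required):
    """True when a required step never ran."""
    return required not in steps


def longest_streak(steps, target):
    """Length of the longest run of consecutive occurrences of target."""
    return max(
        (sum(1 for _ in group) for step, group in groupby(steps) if step == target),
        default=0,
    )


def is_looping(steps, target, max_repeats):
    # Assumption: "looping" means more than max_repeats consecutive calls.
    return longest_streak(steps, target) > max_repeats


steps = ["plan"] + ["web_search"] * 6 + ["answer"]
print(tool_dropped(steps, "safety_check"), is_looping(steps, "web_search", 5))
```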

Run the regression gate in CI with no server: point it at a local SQLite trace database and a golden/candidate run id. Exit code is 0 (pass), 1 (regression), or 2 (usage error):

dprovenancekit gate --db traces.sqlite --golden "$GOLDEN_RUN_ID" --candidate "$CANDIDATE_RUN_ID"
dprovenancekit gate --db traces.sqlite --golden "$G" --candidate "$C" --max-level low --json

GOLDEN=$(dprovenancekit runs --db baseline.sqlite --context my-agent --latest --format id)
dprovenancekit gate --golden-db baseline.sqlite --golden "$GOLDEN" \
                    --candidate-db candidate.sqlite --candidate "$CANDIDATE_RUN_ID"
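
If the gate is driven from a Python build script rather than a shell step, it helps to keep the three exit codes distinct so a misconfigured pipeline is not reported as a regression. The wrapper below is a hypothetical sketch: gate_command, interpret, and run_gate are not part of the package, and only the flags shown above are used.

```python
import subprocess

# Exit-code meanings as documented above.
GATE_OUTCOMES = {0: "pass", 1: "regression", 2: "usage error"}


def gate_command(db, golden, candidate):
    """Build the argv for the single-database form of the gate shown above."""
    return ["dprovenancekit", "gate", "--db", db,
            "--golden", golden, "--candidate", candidate]


def interpret(returncode):
    """Map a gate exit code to its documented meaning."""
    return GATE_OUTCOMES.get(returncode, "unexpected exit code")


def run_gate(db, golden, candidate):
    """Run the gate and return its outcome; requires the CLI on PATH."""
    completed = subprocess.run(gate_command(db, golden, candidate), check=False)
    return interpret(completed.returncode)
```

Calling run_gate("traces.sqlite", golden_id, candidate_id) would then return one of the three documented outcomes, assuming the CLI is installed.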

Prebuilt CI integrations wrap this and comment the diff on the PR/MR: a GitHub Action and a GitLab CI template.

The library ships the same validation corpus as the Swift version. The headless CLI runs it through the real benchmark runner:

dprovenancekit evaluate     # precision/recall/F1 over the standard + adversarial corpora
dprovenancekit diagnose     # causal ranking of failure modes
dprovenancekit stability    # determinism boundary: isolated vs perturbed F1 variance

Both corpora score Precision 1.000 / Recall 1.000 / F1 1.000 across 8 standard scenarios (reordering, semantic evolution, noise injection, branch collapse, …) and 5 adversarial robustness traps (dependency inversion, partial truncation, semantic substitution, …), matching the Swift implementation case-for-case.

| Component | Module |
| --- | --- |
| Event model, priority tiers, drop accounting | event, priority, drop_stats |
| Recording API + ambient context | kit, context |
| Stores (in-memory, WAL SQLite, raw read, cloud) | store, sqlite_store, raw_store, cloud_store |
| Priority-aware write buffer | write_buffer |
| Query DSL + two backends (AST eval + SQL compiler) | query |
| Live querying + anomaly detection + rule library | live_engine, anomaly, rules |
| Structural diff + span-aware snapshot diff | diff, snapshot_diff |
| Deterministic replay | replay |
| Semantic alignment engine + evidence + verification | alignment_*, verification |
| Benchmark harness, failure diagnoser, corpus | benchmark, corpus |
| Pure view models for a trace viewer | viewmodel |
| Framework adapters (LangChain / LangGraph) | integrations.langchain |
| Framework adapters (OpenAI Agents SDK) | integrations.openai_agents |
| Regression-gate test helper | testing |
| Shareable HTML regression report | report |
| Framework-agnostic instrumentation (decorators) | instrument |
| Headless CLI (gate, anomalies, runs, evaluate) | cli |

The SwiftUI DProvenanceUI target is intentionally not ported (it is Apple-platform UI); its pure value-model layer (SpanViewModel, flattening) is ported in viewmodel.

Keeping the Swift and Python SDKs behaviorally equivalent is enforced, not hoped for. conformance/ holds Trace Specification v1: a language-neutral contract plus frozen golden vectors that pin the run fingerprint, the alignment profile hash, canonical payload encoding, query semantics, and alignment verdicts.

python -m pytest tests/test_conformance.py   # the Python SDK's claim of conformance
python conformance/generate_vectors.py        # intentionally re-freeze the contract

The committed conformance/vectors/*.json are the contract: any SDK (Swift today, Rust or TypeScript later) proves equivalence by reproducing the same files. See conformance/TRACE_SPEC_v1.md.
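
Golden vectors only work across languages if every SDK serializes a payload to the same bytes before hashing. The sketch below shows one common way to get there with the standard library (sorted keys, fixed separators, UTF-8). It is an illustration of the idea only: the actual canonical encoding is whatever TRACE_SPEC_v1.md defines, and canonical_bytes and digest are hypothetical helpers.

```python
import hashlib
import json


def canonical_bytes(payload):
    """Encode a JSON-compatible payload with sorted keys and no extra whitespace."""
    return json.dumps(
        payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def digest(payload):
    """SHA-256 of the canonical encoding, as a hex string."""
    return hashlib.sha256(canonical_bytes(payload)).hexdigest()


# Same content, different key order: the canonical form is identical.
a = {"kind": "documentEvaluated", "score": 0.95, "document_id": "DocA"}
b = {"document_id": "DocA", "score": 0.95, "kind": "documentEvaluated"}
print(digest(a) == digest(b))
```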

Framework adapters live in dprovenancekit.integrations and are the only parts of the package with third-party dependencies. The core stays pure standard library, and nothing imports an adapter unless you do.

pip install "dprovenancekit[langchain]"
from dprovenancekit import SQLiteTraceStore
from dprovenancekit.integrations.langchain import DProvenanceTracer, LangChainTraceEvent

store = SQLiteTraceStore(LangChainTraceEvent, "traces.sqlite")
tracer = DProvenanceTracer(store)

with tracer.trace(context_id="customer-42") as cb:
    answer = chain.invoke(question, config={"callbacks": [cb]})

DProvenanceCallbackHandler translates LangChain's callback stream into a trace: each on_llm_start / on_tool_start / on_retriever_start / on_chain_start (and its completion) becomes a typed event in execution order, LangChain's run_id/parent_run_id become the trace's span tree, the active model/tool/retriever becomes the engine, and (by default) lifecycle provenance edges are emitted (DERIVED_FROM start→completion, INFORMED parent→child). Because events flow through the same recording path as hand-written ones, the whole toolkit applies: a run's fingerprint is the structural identity of the agent's execution path, so two runs that diverge (a tool called in a different order, a retrieval step skipped) produce different fingerprints, which makes for a cheap regression signal. Options: capture_payloads (prompt/completion/IO previews), link_lifecycle (edges), record_chains (LCEL/LangGraph chain noise).
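
To see why a fingerprint is a cheap regression signal, consider hashing nothing but the ordered list of step identifiers. The sketch below does exactly that with hashlib; the real fingerprint format is pinned by Trace Specification v1 and is not reproduced here, so path_fingerprint and its 16-character truncation are illustrative assumptions.

```python
import hashlib


def path_fingerprint(steps):
    """Hash an ordered sequence of step identifiers into a short hex digest."""
    hasher = hashlib.sha256()
    for step in steps:
        # Length-prefix each step so ("ab", "c") and ("a", "bc") hash differently.
        encoded = step.encode("utf-8")
        hasher.update(len(encoded).to_bytes(4, "big"))
        hasher.update(encoded)
    return hasher.hexdigest()[:16]


baseline = path_fingerprint(["on_chain_start", "on_retriever_start", "on_llm_start"])
same_path = path_fingerprint(["on_chain_start", "on_retriever_start", "on_llm_start"])
skipped = path_fingerprint(["on_chain_start", "on_llm_start"])
reordered = path_fingerprint(["on_chain_start", "on_llm_start", "on_retriever_start"])
print(baseline == same_path, baseline == skipped, baseline == reordered)
```

Comparing two short strings answers "did the path change at all?"; the diff and alignment tooling is then only needed for the runs where it did.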

pip install "dprovenancekit[openai-agents]"
from dprovenancekit import SQLiteTraceStore
from dprovenancekit.integrations.openai_agents import register, OpenAIAgentsTraceEvent

store = SQLiteTraceStore(OpenAIAgentsTraceEvent, "traces.sqlite")
register(store)   # registers a global tracing processor

DProvenanceTracingProcessor implements the SDK's TracingProcessor: each agent run becomes a trace-run (context_id = the trace name), and every span start/end becomes a typed event (agent.start, generation.end, function.start, guardrail.error, …) in execution order. The span's span_id/parent_id become the span tree, the active agent/tool/model becomes the engine, errors and triggered guardrails are recorded at CRITICAL, and lifecycle provenance edges are emitted (same DERIVED_FROM/INFORMED model). One registered processor captures every run; the same fingerprint/diff/align tooling then applies.

dprovenancekit.testing turns "did my agent regress?" into one assertion you can drop into any test or CI step. Give it a golden run (known-good) and a candidate run (what your current code produced); it aligns them and fails with a readable diagnostic if the candidate diverged.

from dprovenancekit.testing import assert_no_regression

assert_no_regression(golden=golden_run, candidate=candidate_run)

Strict by default: any removed, added, or changed (ambiguous) step fails, and a removed or reordered CRITICAL step is additionally a HIGH-severity regression (reordering a critical step can invert a dependency). Loosen with max_regression_level (gate only on severity) or allow_divergent_steps (tolerate benign per-step changes), or pass a custom evaluator to define what "equivalent" means (e.g. ignore volatile fields like token counts). RegressionGate(...).check(...) returns a RegressionReport (no raise) for richer assertions. Detecting reordered steps requires a span-aware profile (AlignmentProfile.developer_debug_v1); the default linear profile treats a pure reorder as still-matching. Complements AlignmentSnapshotValidator (an exact output-hash snapshot): the gate works on two runs and reasons about regression severity.
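
A custom evaluator that ignores volatile fields can be as small as a field filter followed by an equality check. The sketch below follows the two-argument, float-returning shape of the evaluator lambda shown earlier; the Decision dataclass is a standalone stand-in rather than a TraceableEvent, and how the function is passed to the gate is left out because that signature is not shown in this article.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Decision:
    # Standalone stand-in for an event type; not the library's TraceableEvent.
    kind: str
    token_count: int = 0
    approved: bool = False


# Fields whose values are expected to vary between otherwise identical runs.
VOLATILE_FIELDS = {"token_count"}


def stable_view(event):
    """Return the event's fields with volatile ones removed."""
    return {k: v for k, v in asdict(event).items() if k not in VOLATILE_FIELDS}


def evaluator(a, b):
    """Score 1.0 when two events agree on every non-volatile field, else 0.0."""
    return 1.0 if stable_view(a) == stable_view(b) else 0.0


golden_step = Decision("finalDecisionMade", token_count=120, approved=True)
noisy_step = Decision("finalDecisionMade", token_count=98, approved=True)
flipped_step = Decision("finalDecisionMade", token_count=120, approved=False)
print(evaluator(golden_step, noisy_step), evaluator(golden_step, flipped_step))
```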

examples/regression_testing.py is the end-to-end story in ~150 readable lines: record a golden run of a fact-checking agent (retrieve → verify → decide), then catch a later run that skips its verification step, via both the fast fingerprint check and the detailed alignment verdict (which flags the dropped claimVerified step as a HIGH regression).

python examples/regression_testing.py

It self-asserts its verdicts, so it doubles as an executable test of the headline use case.

Not using a framework? Instrument a hand-written agent loop directly, with no event type to define and zero dependencies (ships in core as dprovenancekit.instrument):

from dprovenancekit import InMemoryTraceStore, traced, traced_run, record_event

@traced
def search(query): ...

@traced
def answer(question, sources): ...

store = InMemoryTraceStore()
with traced_run(store, context_id="ticket-42"):
    sources = search(question)
    record_event("plan.chosen", {"strategy": "rag"})
    reply = answer(question, sources)

@traced records a "<name>.start" / ".end" / ".error" event pair per call in its own span (the function name is the engine), nests calls in the span tree, and emits the same DERIVED_FROM / INFORMED provenance edges as the framework adapters. record_event(...) drops an ad-hoc event (a decision, a chosen branch). Plain functions, async def, generators, and async generators are all supported (for a generator, start/end bracket the full iteration). Instrumentation never changes behavior: capture is failure-proof and exceptions pass through unchanged. Outside a traced_run the decorators are transparent, so instrumented code is safe to call untraced. The trace it produces is identical in shape to the adapter-produced ones, so fingerprint / diff / align / the regression gate all apply.


python -m pytest

168 tests: 80 ported from the Swift suite (query parity, write-buffer backpressure, SQLite stress + drop accounting, alignment, replay, snapshot diff, explainability fidelity, benchmark scoring, cloud chaos, …), 28 cross-language conformance checks against the frozen Trace Specification v1 vectors, 14 LangChain integration tests, 16 OpenAI Agents SDK integration tests, 16 instrumentation-layer tests, 13 regression-gate tests, and the regression-testing example run as a self-asserting test. (The real-framework tests run only when langchain-core / openai-agents are installed and are skipped otherwise.)

Distributed under the Apache License 2.0. See LICENSE.
