# DProvenanceKit: Execution Provenance for AI Systems

> Source: <https://github.com/Therealdk8890/DProvenanceKitPython>
> Published: 2026-07-01 06:35:10+00:00

[
](https://github.com/Therealdk8890/DProvenanceKitPython/actions/workflows/ci.yml)[
](https://pypi.org/project/dprovenancekit/)**Reasoning observability and regression testing for AI systems — a Python port of the Swift DProvenanceKit.**

When an agent's reasoning drifts between runs, DProvenanceKit turns each execution into a queryable, diffable trace so you can see *what changed and why* — not just *what happened*.

Run → Record → Query → Diff → Detect regressions → Gate in CI

**It's not just the library** — it ships the surfaces that make reasoning regressions actionable:

**Gate in CI**— a server-less`dprovenancekit gate`

CLI, plus a drop-in[GitHub Action](/Therealdk8890/DProvenanceKitPython/blob/main/action/README.md)and[GitLab CI template](/Therealdk8890/DProvenanceKitPython/blob/main/gitlab/README.md)that fail a PR/MR when an agent's reasoning drifts from a golden baseline, and comment the diff.**Out-of-the-box anomaly rules**— Tool Drop and Looping detection with a JSON rule registry, runnable locally or on every PR.** A hosted visualizer**— a web dashboard (single-run span tree, JSON payload inspector, side-by-side semantic diff, shareable HTML reports) backed by a regression-gate API and multi-tenant control plane. Available as a separate commercial service.

See it all in one runnable script: [ python examples/end_to_end_demo.py](/Therealdk8890/DProvenanceKitPython/blob/main/examples/end_to_end_demo.py).

This is a faithful, dependency-free port of the Swift library to Python. It keeps the same architecture and guarantees — synchronous non-blocking recording, priority-aware backpressure, one query language over two backends held at parity, structural diffing, formally-modeled semantic alignment, and by-tier drop accounting so load-shedding is never silent.

The original Swift package is unchanged; this is a parallel implementation.

The Swift library targets Apple-platform and on-device AI. This port brings the same reasoning-layer observability to Python codebases — agent frameworks, LLM workflows, tool-using models — with **zero third-party dependencies** (it uses only the standard library: `sqlite3`

, `contextvars`

, `threading`

, `json`

, `hashlib`

, `uuid`

, `urllib`

).

From PyPI (released builds):

```
pip install dprovenancekit
pip install "dprovenancekit[langchain]"        # + LangChain adapter
pip install "dprovenancekit[openai-agents]"    # + OpenAI Agents adapter
```

From a checkout (development):

```
pip install -e ".[dev]"
```

Requires Python 3.9+; the core has **zero third-party dependencies**. Releasing is documented
in [RELEASING.md](/Therealdk8890/DProvenanceKitPython/blob/main/RELEASING.md).

Want the whole arc in one runnable script — record → query → gate → detect anomalies → diff → report, then hand the same runs to the CLI? Run

[. The steps below build it up piece by piece.]`python examples/end_to_end_demo.py`

Any frozen dataclass that subclasses `TraceableEvent`

, exposing a stable `type_identifier`

and a `priority`

:

``` python
from dataclasses import dataclass
from dprovenancekit import TraceableEvent, TracePriority

@dataclass(frozen=True)
class MyAIDecision(TraceableEvent):
    kind: str           # "promptGenerated" | "documentEvaluated" | "conflictDetected" | "finalDecisionMade"
    token_count: int = 0
    document_id: str = ""
    score: float = 0.0
    reason: str = ""
    approved: bool = False

    @property
    def type_identifier(self) -> str:
        return self.kind

    @property
    def priority(self) -> TracePriority:
        if self.kind == "finalDecisionMade":
            return TracePriority.CRITICAL
        if self.kind == "conflictDetected":
            return TracePriority.DIAGNOSTIC
        return TracePriority.TELEMETRY
```

`record(...)`

is synchronous and never blocks — it touches only an in-memory buffer. Ambient run / engine / span context propagates through `contextvars`

, so nested scopes attribute events correctly with no plumbing.

``` python
from dprovenancekit import DProvenanceKit, InMemoryTraceStore

kit = DProvenanceKit(MyAIDecision)
store = InMemoryTraceStore()

with kit.run(context_id="demo_case", store=store):
    kit.record(MyAIDecision(kind="documentEvaluated", document_id="DocA", score=0.95))
    kit.record(MyAIDecision(kind="conflictDetected", reason="timeline_inconsistency"))
    kit.record(MyAIDecision(kind="finalDecisionMade", approved=False))
python
from dprovenancekit import TraceQueryDSL

suspicious = store.query_runs(
    TraceQueryDSL()
        .requiring_step("conflictDetected")
        .missing_step("documentEvaluated")
)
```

Find runs where a conflict was reported but no document was ever evaluated. The same DSL compiles to SQL for `SQLiteTraceStore`

and is evaluated in memory for `InMemoryTraceStore`

— the two backends are held in lockstep by a parity test suite.

``` python
from dprovenancekit import TraceDiffEngine

diff = TraceDiffEngine().diff(base=run_a, comparison=run_b)
print(diff.changes)   # structural steps that appeared, disappeared, or moved
```

`TraceAlignmentEngine`

decides whether two executions are behaviorally equivalent within a formally-defined semantic model, even when payloads vary slightly:

``` python
from dprovenancekit import (
    AlignmentConfiguration, AlignmentProfile, AnyEquivalenceEvaluator, TraceAlignmentEngine,
)

config = AlignmentConfiguration(
    profile=AlignmentProfile.strict_audit_v1,
    equivalence_evaluator=AnyEquivalenceEvaluator(
        evaluator_identifier="MyAIDecision_Semantic",
        evaluator=lambda a, b: 1.0 if a == b else 0.0,
    ),
)
result = TraceAlignmentEngine(config).align(base=run_a, comparison=run_b)
print(result.regression_risk.level)
python
from dprovenancekit import AnomalyDetector, AnomalyRule, TraceQueryDSL

class UnverifiedConflictRule(AnomalyRule):
    @property
    def name(self): return "unverified_conflict"
    @property
    def anomaly_query(self):
        return TraceQueryDSL().requiring_step("conflictDetected").missing_step("documentEvaluated")
    def describe(self, run): return "Conflict detected with no supporting evaluation"

anomalies = AnomalyDetector(store).detect_anomalies([UnverifiedConflictRule()])
```

Or drop in ready-made rules from the built-in library instead of writing your own:

``` python
from dprovenancekit import AnomalyDetector, LoopingRule, ToolDropRule

anomalies = AnomalyDetector(store).detect_anomalies([
    ToolDropRule("safety_check"),              # never ran a required step
    LoopingRule("web_search", max_repeats=5),  # stuck repeating the same tool call
])
```

Run the regression gate in CI with no server — point it at a local SQLite trace database
and a golden/candidate run id. Exit code is `0`

(pass), `1`

(regression), or `2`

(usage error):

```
dprovenancekit gate --db traces.sqlite --golden "$GOLDEN_RUN_ID" --candidate "$CANDIDATE_RUN_ID"
dprovenancekit gate --db traces.sqlite --golden "$G" --candidate "$C" --max-level low --json

# Gate across separate databases (a restored baseline vs. this PR's run), resolving
# the golden run id from the baseline instead of hardcoding it:
GOLDEN=$(dprovenancekit runs --db baseline.sqlite --context my-agent --latest --format id)
dprovenancekit gate --golden-db baseline.sqlite --golden "$GOLDEN" \
                    --candidate-db candidate.sqlite --candidate "$CANDIDATE_RUN_ID"
```

Prebuilt CI integrations wrap this and comment the diff on the PR/MR:
a [GitHub Action](/Therealdk8890/DProvenanceKitPython/blob/main/action/README.md) and a [GitLab CI template](/Therealdk8890/DProvenanceKitPython/blob/main/gitlab/README.md).

The library ships the same validation corpus as the Swift version. The headless CLI runs it through the real benchmark runner:

```
dprovenancekit evaluate     # precision/recall/F1 over the standard + adversarial corpora
dprovenancekit diagnose     # causal ranking of failure modes
dprovenancekit stability    # determinism boundary: isolated vs perturbed F1 variance
```

Both corpora score **Precision 1.000 / Recall 1.000 / F1 1.000** — 8 standard scenarios (reordering, semantic evolution, noise injection, branch collapse, …) and 5 adversarial robustness traps (dependency inversion, partial truncation, semantic substitution, …) — matching the Swift implementation case-for-case.

| Component | Module |
|---|---|
| Event model, priority tiers, drop accounting | `event` , `priority` , `drop_stats` |
| Recording API + ambient context | `kit` , `context` |
| Stores (in-memory, WAL SQLite, raw read, cloud) | `store` , `sqlite_store` , `raw_store` , `cloud_store` |
| Priority-aware write buffer | `write_buffer` |
| Query DSL + two backends (AST eval + SQL compiler) | `query` |
| Live querying + anomaly detection + rule library | `live_engine` , `anomaly` , `rules` |
| Structural diff + span-aware snapshot diff | `diff` , `snapshot_diff` |
| Deterministic replay | `replay` |
| Semantic alignment engine + evidence + verification | `alignment_*` , `verification` |
| Benchmark harness, failure diagnoser, corpus | `benchmark` , `corpus` |
| Pure view models for a trace viewer | `viewmodel` |
| Framework adapters (LangChain / LangGraph) | `integrations.langchain` |
| Framework adapters (OpenAI Agents SDK) | `integrations.openai_agents` |
| Regression-gate test helper | `testing` |
| Shareable HTML regression report | `report` |
| Framework-agnostic instrumentation (decorators) | `instrument` |
Headless CLI — `gate` , `anomalies` , `runs` , `evaluate` |
`cli` |

The SwiftUI `DProvenanceUI`

target is intentionally **not** ported (it is Apple-platform UI); its pure value-model layer (`SpanViewModel`

, flattening) is ported in `viewmodel`

.

Keeping the Swift and Python SDKs behaviorally equivalent is enforced, not hoped for. [ conformance/](/Therealdk8890/DProvenanceKitPython/blob/main/conformance) holds

**Trace Specification v1**— a language-neutral contract plus frozen golden vectors that pin the run fingerprint, the alignment profile hash, canonical payload encoding, query semantics, and alignment verdicts.

```
python -m pytest tests/test_conformance.py   # the Python SDK's claim of conformance
python conformance/generate_vectors.py        # intentionally re-freeze the contract
```

The committed `conformance/vectors/*.json`

are the contract: any SDK — Swift today, Rust or TypeScript later — proves equivalence by reproducing the same files. See [ conformance/TRACE_SPEC_v1.md](/Therealdk8890/DProvenanceKitPython/blob/main/conformance/TRACE_SPEC_v1.md).

Framework adapters live in `dprovenancekit.integrations`

and are the only parts of the package with third-party dependencies — the core stays pure standard library, and nothing imports an adapter unless you do.

```
pip install dprovenancekit[langchain]
python
from dprovenancekit import SQLiteTraceStore
from dprovenancekit.integrations.langchain import DProvenanceTracer, LangChainTraceEvent

store = SQLiteTraceStore(LangChainTraceEvent, "traces.sqlite")
tracer = DProvenanceTracer(store)

with tracer.trace(context_id="customer-42") as cb:
    answer = chain.invoke(question, config={"callbacks": [cb]})

# The run is now recorded — query it, diff it against a known-good run, or
# compare run fingerprints to detect when the agent took a different path.
```

[ DProvenanceCallbackHandler](/Therealdk8890/DProvenanceKitPython/blob/main/src/dprovenancekit/integrations/langchain.py) translates LangChain's callback stream into a trace: each

`on_llm_start`

/ `on_tool_start`

/ `on_retriever_start`

/ `on_chain_start`

(and its completion) becomes a typed event in execution order, LangChain's `run_id`

/`parent_run_id`

become the trace's **span tree**, the active model/tool/retriever becomes the

**engine**, and (by default) lifecycle

**provenance edges** are emitted (

`DERIVED_FROM`

start→completion, `INFORMED`

parent→child). Because events flow through the same recording path as hand-written ones, the whole toolkit applies: a run's **fingerprint** is the structural identity of the agent's execution path, so two runs that diverge (a tool called in a different order, a retrieval step skipped) produce different fingerprints — a cheap regression signal. Options:

`capture_payloads`

(prompt/completion/IO previews), `link_lifecycle`

(edges), `record_chains`

(LCEL/LangGraph chain noise).

```
pip install dprovenancekit[openai-agents]
python
from dprovenancekit import SQLiteTraceStore
from dprovenancekit.integrations.openai_agents import register, OpenAIAgentsTraceEvent

store = SQLiteTraceStore(OpenAIAgentsTraceEvent, "traces.sqlite")
register(store)   # registers a global tracing processor

# ... run your agents normally; each run is recorded ...
```

[ DProvenanceTracingProcessor](/Therealdk8890/DProvenanceKitPython/blob/main/src/dprovenancekit/integrations/openai_agents.py) implements the SDK's

`TracingProcessor`

: each agent run becomes a trace-run (`context_id`

= the trace name), and every span start/end becomes a typed event — `agent.start`

, `generation.end`

, `function.start`

, `guardrail.error`

, … — in execution order. The span's `span_id`

/`parent_id`

become the **span tree**, the active agent/tool/model becomes the

**engine**, errors and triggered guardrails are recorded at

`CRITICAL`

, and lifecycle **provenance edges** are emitted (same

`DERIVED_FROM`

/`INFORMED`

model). One registered processor captures every run; the same `fingerprint`

/diff/align tooling then applies.`dprovenancekit.testing`

turns "did my agent regress?" into one assertion you can drop into any test or CI step. Give it a *golden* run (known-good) and a *candidate* run (what your current code produced); it aligns them and fails with a readable diagnostic if the candidate diverged.

``` python
from dprovenancekit.testing import assert_no_regression

assert_no_regression(golden=golden_run, candidate=candidate_run)
```

Strict by default — any removed, added, or changed (ambiguous) step fails, and a removed *or reordered* CRITICAL step is additionally a HIGH-severity regression (reordering a critical step can invert a dependency). Loosen with `max_regression_level`

(gate only on severity) or `allow_divergent_steps`

(tolerate benign per-step changes), or pass a custom `evaluator`

to define what "equivalent" means (e.g. ignore volatile fields like token counts). `RegressionGate(...).check(...)`

returns a `RegressionReport`

(no raise) for richer assertions. Detecting *reordered* steps requires a span-aware profile (`AlignmentProfile.developer_debug_v1`

); the default linear profile treats a pure reorder as still-matching. Complements `AlignmentSnapshotValidator`

(an exact output-hash snapshot): the gate works on two runs and reasons about regression severity.

[ examples/regression_testing.py](/Therealdk8890/DProvenanceKitPython/blob/main/examples/regression_testing.py) is the end-to-end story in ~150 readable lines: record a

**golden** run of a fact-checking agent (retrieve → verify → decide), then catch a later run that skips its verification step — via both the fast

**fingerprint** check and the detailed

**alignment** verdict (which flags the dropped

`claimVerified`

step as a HIGH regression).

```
python examples/regression_testing.py
```

It self-asserts its verdicts, so it doubles as an executable test of the headline use case.

Not using a framework? Instrument a hand-written agent loop directly — no event type to define, zero dependencies (ships in core as `dprovenancekit.instrument`

):

``` python
from dprovenancekit import InMemoryTraceStore, traced, traced_run, record_event

@traced
def search(query): ...

@traced
def answer(question, sources): ...

store = InMemoryTraceStore()
with traced_run(store, context_id="ticket-42"):
    sources = search(question)
    record_event("plan.chosen", {"strategy": "rag"})
    reply = answer(question, sources)
```

`@traced`

records a `"<name>.start"`

/ `".end"`

/ `".error"`

event pair per call in its own **span** (the function name is the **engine**), nests calls in the span tree, and emits the same `DERIVED_FROM`

/ `INFORMED`

provenance edges as the framework adapters. `record_event(...)`

drops an ad-hoc event (a decision, a chosen branch). Plain functions, `async def`

, generators, and async generators are all supported (for a generator, start/end bracket the full iteration). Instrumentation never changes behavior — capture is failure-proof and exceptions pass through unchanged. Outside a `traced_run`

the decorators are transparent, so instrumented code is safe to call untraced. The trace it produces is identical in shape to the adapter-produced ones, so fingerprint / diff / align / the regression gate all apply.

[ examples/regression_testing.py](/Therealdk8890/DProvenanceKitPython/blob/main/examples/regression_testing.py) is the end-to-end story in ~150 readable lines: record a

**golden** run of a fact-checking agent, then catch a later run that skips its verification step — via both the fast

**fingerprint** check and the detailed

**alignment** verdict (which flags the dropped

`claimVerified`

step as a HIGH regression).

```
python examples/regression_testing.py
```

It self-asserts its verdicts, so it doubles as an executable test of the headline use case.

```
python -m pytest
```

168 tests: 80 ported from the Swift suite (query parity, write-buffer backpressure, SQLite stress + drop accounting, alignment, replay, snapshot diff, explainability fidelity, benchmark scoring, cloud chaos, …), 28 cross-language conformance checks against the frozen Trace Specification v1 vectors, 14 LangChain integration tests, 16 OpenAI Agents SDK integration tests, 16 instrumentation-layer tests, 13 regression-gate tests, and the regression-testing example run as a self-asserting test. (The real-framework tests run only when `langchain-core`

/ `openai-agents`

are installed, otherwise skipped.)

Distributed under the **Apache License 2.0**. See [LICENSE](/Therealdk8890/DProvenanceKitPython/blob/main/LICENSE).
