The contract is the interface: agent-driven Steampipe→Stave in one command

The article describes Stave, a cloud-security tool that replaces the traditional collector-based onboarding model with a "contract" approach. Instead of shipping a collector that must be configured for each customer's environment, Stave provides JSON schemas and Steampipe-to-Stave column mappings that allow any agent to ingest data by satisfying the contract. The system uses a declarative YAML format with four operation types (field, static, extract, computed) to transform data, and ships 3,957 controls across 109 asset types with auto-generated JSON schemas.

Consider a typical cloud-security tool's onboarding flow. A customer installs the tool. The tool's collector tries to authenticate to AWS and fails because the role isn't there yet. The customer follows three pages of setup docs, the role gets created, the collector authenticates, the collector runs, and the collector finds nothing because the tool only knows about S3 and IAM and the customer's workload is on EKS. End of week one.

We don't ship a collector. Stave (https://github.com/sufield/stave) evaluates obs.v0.1 JSON snapshots, whatever produces them. That decision sounds extreme until you've watched the same "the collector doesn't see our environment" conversation play out three times.

So instead of a collector, Stave ships a contract: per-asset JSON Schemas, per-asset Steampipe→Stave column mappings, and one command, stave contract show, that emits everything an agent needs to author its own ingest. The customer's preferred source (Steampipe, AWS Config, Terraform state, an internal inventory API) plugs in by satisfying the contract.

This post walks through the steps that close the pipeline.

What the customer sees

    $ stave contract show --asset-type aws_s3_bucket
    Contract: aws_s3_bucket
    Schema:   schemas/observation/v1/asset-types/aws_s3_bucket.schema.json
    Controls: 102 | Chains: 15

    Property paths (catalog reads these — sorted by chain unlock, then control unlock):

    PATH                                          CONTROLS  CHAINS  SEVERITY  NOTE
    ────                                          ────────  ──────  ────────  ────
    storage.kind                                  91        15      critical
    storage.tags.data-classification              14        2       critical  intent
    storage.access.public_read                    8         2       critical
    storage.controls.public_access_fully_blocked  3         1       critical
    ...

    Steampipe mapping: contracts/steampipe/aws_s3_bucket.yaml

That output names everything the customer's ingest agent needs:

- The schema: the JSON Schema the agent's output must satisfy
- The property paths: the fields the catalog actually reads on this asset type, ranked by how many controls and chains they unlock
- The mapping: a ready-to-run YAML telling the agent which Steampipe column maps to which Stave property path

For the 17 most catalog-impactful asset types, the mapping is committed. For the rest, the customer's agent has the schema and can author its own mapping.

The YAML mapping format

The Steampipe→Stave mapping is one ordered list of operations per asset type. Four operation kinds cover every transform shape:

- field: direct column → property mapping with optional coerce/default
- static: a fixed value (e.g. properties.storage.kind: bucket)
- extract: pull a nested JSON value from a JSON-shaped column
- computed: derive from already-set property paths (all / any reduction)

Operations run in YAML order; later ops can read paths written by earlier ones.

The first mapping we wrote, contracts/steampipe/aws_s3_bucket.yaml, replaced a Python function with a declarative file. The loader changes are 100 lines; the resulting observation is byte-identical to what the imperative function produced.
    operations:
      - kind: static
        path: properties.storage.kind
        value: bucket

      - kind: field
        path: properties.storage.tags
        column: tags
        default: {}
        type: dict

      - kind: extract
        path: properties.storage.encryption.algorithm
        column: server_side_encryption_configuration
        json_path: "Rules.0.ApplyServerSideEncryptionByDefault.SSEAlgorithm"
        key_variants:
          Rules: rules
          SSEAlgorithm: sse_algorithm
        default: "none"

      - kind: computed
        path: properties.storage.controls.public_access_fully_blocked
        op: all
        inputs:
          - properties.storage.controls.public_access_block.block_public_acls
          - properties.storage.controls.public_access_block.block_public_policy
          - properties.storage.controls.public_access_block.ignore_public_acls
          - properties.storage.controls.public_access_block.restrict_public_buckets

The format is the contract. Any agent in any language can parse the YAML and produce conforming observations.

Per-asset JSON Schemas

The catalog ships 3,957 controls; together they declare 109 distinct asset types as applicable. To validate that a mapping's target paths are real, we needed a JSON Schema per asset type. Hand-authoring 109 schemas is a Tuesday lost. The schema generator already existed (it walks every control's predicate AST and infers the property paths and types), but it defaulted to the top-3 most-used types.

    go run ./internal/tools/genassetschemas/... -top 200
    make sync-schemas

Output: 109 per-asset schemas under schemas/observation/v1/asset-types/. Every level is additionalProperties: true, because the schemas are discoverability artifacts, not restrictive gates. A schema that lists one property (security_hub.enabled on aws_securityhub_account, for example) tells an agent "this asset type matters to the catalog; here is the one property to populate." Thin schemas are still useful.
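To make the operation semantics from the mapping format concrete, here is a minimal Python sketch of an interpreter for the four operation kinds. It is not the reference loader from the repository: the helper names (set_path, get_path, extract, apply_operations), the handling of key_variants as an alternate spelling for a path segment, and the sample row are all assumptions made for illustration. The mapping is given as a Python literal rather than parsed YAML so the sketch needs no third-party packages.

```python
from typing import Any


def set_path(obs: dict, dotted: str, value: Any) -> None:
    """Write value at a dotted path, creating intermediate dicts as needed."""
    *parents, leaf = dotted.split(".")
    node = obs
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value


def get_path(obs: dict, dotted: str, default: Any = None) -> Any:
    """Read a dotted path; return default if any segment is missing."""
    node: Any = obs
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


def extract(value: Any, json_path: str, key_variants: dict, default: Any) -> Any:
    """Walk a dotted JSON path; numeric segments index into lists.

    Assumption: key_variants maps a path segment to one alternate spelling.
    """
    node = value
    for seg in json_path.split("."):
        if isinstance(node, list) and seg.isdigit():
            if int(seg) >= len(node):
                return default
            node = node[int(seg)]
        elif isinstance(node, dict) and seg in node:
            node = node[seg]
        elif isinstance(node, dict) and key_variants.get(seg) in node:
            node = node[key_variants[seg]]
        else:
            return default
    return node


def apply_operations(row: dict, operations: list) -> dict:
    """Run operations in order; later ops can read paths set by earlier ones."""
    obs: dict = {}
    for op in operations:
        kind = op["kind"]
        if kind == "static":
            value = op["value"]
        elif kind == "field":
            value = row.get(op["column"])
            if value is None:
                value = op.get("default")
        elif kind == "extract":
            value = extract(row.get(op["column"]), op["json_path"],
                            op.get("key_variants", {}), op.get("default"))
        elif kind == "computed":
            reducer = {"all": all, "any": any}[op["op"]]
            value = reducer(bool(get_path(obs, p, False)) for p in op["inputs"])
        else:
            raise ValueError(f"unknown operation kind: {kind}")
        set_path(obs, op["path"], value)
    return obs


PAB = "properties.storage.controls.public_access_block"
operations = [
    {"kind": "static", "path": "properties.storage.kind", "value": "bucket"},
    {"kind": "field", "path": "properties.storage.tags", "column": "tags", "default": {}},
    {"kind": "extract", "path": "properties.storage.encryption.algorithm",
     "column": "server_side_encryption_configuration",
     "json_path": "Rules.0.ApplyServerSideEncryptionByDefault.SSEAlgorithm",
     "key_variants": {"Rules": "rules", "SSEAlgorithm": "sse_algorithm"},
     "default": "none"},
    # Hypothetical field ops feeding the computed reduction below.
    {"kind": "field", "path": f"{PAB}.block_public_acls", "column": "block_public_acls"},
    {"kind": "field", "path": f"{PAB}.block_public_policy", "column": "block_public_policy"},
    {"kind": "computed", "path": "properties.storage.controls.public_access_fully_blocked",
     "op": "all", "inputs": [f"{PAB}.block_public_acls", f"{PAB}.block_public_policy"]},
]

# Hypothetical Steampipe row, shaped for this example only.
row = {
    "tags": None,
    "server_side_encryption_configuration": {
        "rules": [{"ApplyServerSideEncryptionByDefault": {"sse_algorithm": "aws:kms"}}]
    },
    "block_public_acls": True,
    "block_public_policy": False,
}

obs = apply_operations(row, operations)
```

Because the computed operation only reads paths already written to the observation, the ordering rule from the mapping format falls out naturally: move the computed entry above its inputs and it would see nothing and reduce to False.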
Ten hand-authored mappings

The next 10 asset types by control coverage (aws_iam_role, aws_lambda_function, aws_cognito_user_pool, aws_cloudtrail_trail, aws_kms_key, aws_ec2_instance, aws_sqs_queue, aws_iam_user, aws_opensearch_domain, aws_stepfunctions_state_machine) got hand-authored mappings. They served two purposes: actual coverage for the most-asked-for types, and a ground-truth corpus to validate Iter 5's auto-generator against.

Every mapping carries a derived_properties: block listing the catalog-read properties that cannot come from a single Steampipe column. Example from aws_iam_role.yaml:

    derived_properties:
      - path: properties.identity.role.cross_account_trust_without_external_id
        source: "Parse trust policy — detect external Account in Principal without sts:ExternalId condition"
      - path: properties.identity.permission_categories.has_incompatible_categories
        source: Policy analysis against controldata/taxonomy/permission_categories.yaml
      - path: properties.identity.access_advisor.available
        source: iam:GenerateServiceLastAccessedDetails + iam:GetServiceLastAccessedDetails (separate API call per role)

That block is the agent's TODO list. Silently producing an observation without those derived properties is the failure mode the derived_properties: section prevents: Stave's controls don't see the property, the catalog finds nothing wrong, and the breach happens anyway.

The contract show command

The three sources (schema, predicate index, mapping file) already existed. Joining them required three separate file reads. The new command joins them once:

    stave contract show --asset-type aws_iam_role --format json
    {
      "asset_type": "aws_iam_role",
      "has_schema": true,
      "schema_path": "schemas/observation/v1/asset-types/aws_iam_role.schema.json",
      "controls_count": 198,
      "chains_count": 38,
      "property_paths": [
        {
          "path": "properties.identity.kind",
          "controls_count": 196,
          "chains_count": 35,
          "max_severity": "critical",
          "is_intent_property": false
        },
        ...
      ],
      "steampipe_mapping": "contracts/steampipe/aws_iam_role.yaml"
    }

Or:

    stave contract show --list
    Asset types with controls: 109 (schema: 109, steampipe mapping: 17)

    TYPE                 SCHEMA  CONTROLS  CHAINS  MAPPING
    ────                 ──────  ────────  ──────  ───────
    aws_iam_role         yes     198       38      steampipe
    aws_s3_bucket        yes     102       15      steampipe
    aws_lambda_function  yes     169       12      steampipe
    aws_bedrock_agent    yes     24        5       -
    ...

The implementation reuses everything already in the codebase: compose.LoadControlsFrom, compose.LoadChainDefinitions, predindex.Build (the same index the stave gaps command uses), and a 50-line helper in internal/contracts/schema/load.go to access the embedded per-asset schemas. The command is ~330 lines; nothing is new data, it is a projection over existing data.

Auto-generator

The remaining ~98 asset types could be hand-authored or auto-generated. We tried auto. The generator joins the cached Steampipe column catalog with each per-asset schema's property paths, applies a four-rule matching priority (per-asset overrides, schema-path lookup with multi-token scoring, tags convention, fallback to properties.<ns>.<col>), and emits a YAML in the same operations-list format Iter 1 established.

    make gen-steampipe-mappings            # generate, skip existing
    make gen-steampipe-mappings-validate   # measure accuracy

Validation runs the generator against the 11 hand-authored YAMLs (Iter 1 + Iter 3) and compares the auto-generated (column, path) tuples against the ground truth:

    Overall: 149/177 = 84% accuracy across 17 type(s)

84%, past the 80% target. The remaining 16% are the multi-target JSON-path extracts the brief flagged as inherently manual (one column → two property paths is not something a name-similarity heuristic can synthesise). Auto-generated YAMLs carry auto_generated: true + review_required: N + unmatched_paths: ... so the reviewer's surface is bounded.

The detailed story of the heuristic, and how it went from 8% accuracy on the first pass to 84% on the fourth, is its own post.
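The validation step in the auto-generator section compares (column, path) tuples against hand-authored ground truth. A minimal sketch of that comparison follows, assuming a tuple counts as correct only when both the column and the path match exactly, and that only operations carrying a column key take part. The function names and the toy mappings are hypothetical and are not taken from scripts/gen-steampipe-mappings.py.

```python
def mapping_pairs(operations: list) -> set:
    """Collect (column, path) tuples from operations that read a column."""
    return {(op["column"], op["path"]) for op in operations if "column" in op}


def accuracy(generated: dict, truth: dict) -> tuple:
    """Return (matched, total) over every asset type in the ground truth.

    Both arguments map asset type -> operations list.
    """
    matched = total = 0
    for asset_type, truth_ops in truth.items():
        expected = mapping_pairs(truth_ops)
        produced = mapping_pairs(generated.get(asset_type, []))
        matched += len(expected & produced)
        total += len(expected)
    return matched, total


# Toy data for illustration only; not real mapping content.
truth = {
    "aws_s3_bucket": [
        {"kind": "static", "path": "properties.storage.kind", "value": "bucket"},
        {"kind": "field", "column": "tags", "path": "properties.storage.tags"},
        {"kind": "field", "column": "region", "path": "properties.storage.region"},
        {"kind": "field", "column": "name", "path": "properties.storage.name"},
    ]
}
generated = {
    "aws_s3_bucket": [
        {"kind": "field", "column": "tags", "path": "properties.storage.tags"},
        {"kind": "field", "column": "region", "path": "properties.storage.region"},
        {"kind": "field", "column": "name", "path": "properties.storage.bucket_name"},
    ]
}

matched, total = accuracy(generated, truth)
percent = round(100 * matched / total)
```

With the toy data, two of three ground-truth tuples match. Applying the same rounding to the figures reported in the post, 149 of 177 gives 84%.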
The point here is what's committed: 17 total mappings (11 hand-authored, 6 auto-generated), every one of them an artifact a customer's agent can read in any language.

Why the contract sits where it does

The architecture choice that makes this work: extractors are client-owned. Stave does not ship a collector. The contracts/steampipe/ directory contains instructions, not code. An agent reads the schema and the mapping; the agent produces the observation; Stave evaluates the observation. The collector boundary is a file, not a process.

This decision has been in our architecture docs since the project started, but until now there was no single command that surfaced the contract to an agent. An agent that wanted to author a Steampipe ingest for a new asset type had to:

- Find the per-asset schema (one of several embedded directories)
- Decide what property paths to populate (no canonical list; derive from controls)
- Map Steampipe columns to those paths (no template; invent it)

Now the agent runs one command and gets all three. It runs make gen-steampipe-mappings and gets a starting-point YAML it can refine. Three manual lookups become two commands.

What stayed out of Stave

Nothing in the Stave Go binary changed across the five iterations except the new cmd/contract/ directory (one file, ~330 LOC). The agent infrastructure is:

- examples/agents/stave_transform.py: reference loader (Python)
- contracts/steampipe/*.yaml: 17 mappings committed
- scripts/gen-steampipe-mappings.py: auto-generator (Python, ~280 LOC)
- scripts/steampipe-columns.json: cached column catalog (refreshable from a live Steampipe install)

The deterministic policy engine is unchanged. The contract evolves; the engine doesn't.

The generic pipeline shape

Replace Steampipe with any external data source (AWS Config, Terraform state, your internal inventory, Salesforce, OpenAPI specs) and the pipeline shape is the same:

1. Define the canonical target contract. For Stave it's obs.v0.1 JSON with per-asset-type sub-schemas. For your tool, it's whatever shape your engine reads.
2. Author one mapping per source per asset type. YAML is fine. An operations list with field/static/extract/computed semantics covers most transform shapes.
3. Ship a discovery command. One CLI that joins the schema, the path list, and the mapping into a single agent-readable output. The agent stops needing your team's docs.
4. Auto-generate the boring half. Most column→path mappings are name-similarity. The exceptions are rare enough to hand-author. Use the hand-authored set as a ground-truth corpus to measure your generator's accuracy.
5. Mark uncertainty explicitly: review_required, unmatched_paths, derived_properties:. Silent gaps are worse than loud ones.

Five points, one functioning pipeline. The customer who needed three pages of collector setup now needs make gen-steampipe-mappings and an agent that can read a YAML.
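One way an agent could act on the "mark uncertainty explicitly" point is to compare its finished observation against the mapping's derived_properties: block and report anything still missing, rather than emitting the observation silently. The sketch below assumes each entry has a path key, as in the aws_iam_role.yaml excerpt. The helper names, the abbreviated source strings, and the sample observation values are hypothetical.

```python
def has_path(obs: dict, dotted: str) -> bool:
    """True if every segment of the dotted path exists in the observation."""
    node = obs
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return False
        node = node[key]
    return True


def missing_derived(obs: dict, derived_properties: list) -> list:
    """List derived-property paths the agent has not populated yet."""
    return [entry["path"] for entry in derived_properties
            if not has_path(obs, entry["path"])]


# Entries mirror the aws_iam_role.yaml excerpt; source strings are shortened.
derived_properties = [
    {"path": "properties.identity.role.cross_account_trust_without_external_id",
     "source": "Parse trust policy"},
    {"path": "properties.identity.access_advisor.available",
     "source": "Separate API call per role"},
]

# Hypothetical partial observation: one derived property set, one not.
observation = {
    "properties": {
        "identity": {
            "kind": "role",
            "role": {"cross_account_trust_without_external_id": False},
        }
    }
}

gaps = missing_derived(observation, derived_properties)
```

A property set to False counts as populated here, since the check is about presence rather than truthiness. An agent could log the returned paths or refuse to emit the observation until the list is empty.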