The Minimum Viable Ontology: Building an Operating-Layer Knowledge Graph You Can Actually Trust

Andrea Volpini argues that ontologies are becoming the control plane for AI agents, while Sergey Vasiliev warns that scaling unvalidated meaning creates ontology debt. A 2025 systematic review confirms the field overinvests in generation and underinvests in validation. This article proposes a pipeline for SaaS that pairs LLM generation with formal validation to build trustworthy production ontologies.

Andrea Volpini says the ontology is becoming the control plane for AI agents. Sergey Vasiliev warns we’re scaling unvalidated meaning into it. Both are right, and here’s the pipeline that resolves the tension, scoped to SaaS so there’s nothing to hand-wave. Yes, I’m still calling it an ontology.

The argument has moved, and it’s worth saying where it moved to before adding to it. For most of their history, ontologies lived in the back office: reference models for search, data integration, and semantic publishing. Andrea Volpini’s recent WordLift article names the shift cleanly: in the agentic web, the ontology stops being a back-office artifact and becomes the control plane (https://wordlift.io/blog/en/ontologies-for-the-agentic-web/), the layer agents use to retrieve, remember, validate, plan, and act without losing meaning.

I think he’s right. It’s also the most consequential thing anyone has said about why this work suddenly matters, because it changes the cost of being wrong. The moment an ontology moves into the operating layer, a bad concept stops being a documentation problem and becomes an infrastructure problem.

Sergey Vasiliev’s “Ontology Trap” (https://sergeyvasiliev.substack.com/p/the-ontology-trap-when-ai-scales) names that danger precisely: producing structure is not the same as verifying it, and a language model can generate plausible-but-unchecked meaning far faster than any organization can validate it. You don’t get intelligence at scale. You get confident, well-written error wired into the systems that now depend on it (duplication, false equivalence, vague relations, missing provenance, drift), hardening into what he calls ontology debt.

And the research backs the worry with a map.
A 2025 systematic review of LLMs in ontology engineering (https://www.semantic-web-journal.net/system/files/swj3864.pdf; Li, Garijo, and Poveda-Villalón; 36 papers across 49 task-level studies) found the work clusters hard at the generative front of the lifecycle: implementation and requirements account for the overwhelming majority, while ontology evaluation and maintenance are barely studied at all. Maintenance shows up in a single study in the entire corpus, and only a handful of approaches keep a human meaningfully in the loop. The field, in other words, has gotten very good at making ontologies and has badly underinvested in checking and maintaining them. Its own headline recommendation is hybrid neuro-symbolic workflows that pair LLM generation with formal validation, plus provenance and human oversight. That is the gap, stated by the people who counted.

This article is my working answer to all three, scoped deliberately to one domain, software-as-a-service, so there is nothing to wave away. The thesis is simple and a little stubborn: the generative flexibility Volpini wants and the validation discipline Vasiliev demands are not in tension. They are the two halves of one pipeline, where generation is cheap and validation is the gate.

A warning to the people who own the word: I’m going to call the artifact at the center of this an ontology, and mean something a formal-ontology purist will object to. It is machine-generated from buyer questions, stored across three engines at once, and gated by Python rather than proved by a reasoner. I’ll make my peace offering to the purists shortly, and it is a real one. But the position up front is this: an ontology you can trust in production beats both a formally perfect one nobody finished and a pile of generated JSON nobody can check.

Circumscription isn’t a cop-out here; it’s the argument. SaaS has a naturally bounded, repeatable entity model.
Almost everything a buyer needs to know about a B2B product collapses into a small set of types (Product, Feature, Plan, Price, Integration, UseCase, Limitation) and a small set of relations: a product has features, has plans, integrates with other tools; a plan includes features and has a price; a feature supports a use case and has limitations. Roughly two dozen predicates cover the working surface of an entire category.

That’s exactly the condition under which a minimum viable ontology is genuinely minimal and genuinely viable. You don’t need a foundational upper ontology or a description-logic reasoner running in production to capture “Plan Pro includes SSO and costs $40/seat.” You need a tight schema, faithful extraction, and provenance on every claim. SaaS is where operational ontology earns its keep without the heavy formal machinery, which makes it the honest place to show the method working, and the honest place to show validation actually holding.

Here’s the first thing that annoys the purists, and the first thing I’d defend to the death. The schema is grown bottom-up from the questions buyers actually ask, not designed top-down by a modeller deciding what the domain “is.” The input is a set of dealbreaker questions: the things a buyer needs answered before they’ll commit. Does it support SAML SSO on the mid-tier plan? What’s the API rate limit? Does it integrate with the tool we already pay for?

The semantic-web tradition has a name for these: competency questions. Both Volpini and the systematic review land on the same point from different directions: competency questions are where LLMs are genuinely, reliably useful, and they should double as regression tests for the ontology, not just requirements. That’s exactly the role they play here: the dealbreaker questions seed the schema and then become the bar the finished graph has to answer.

Why invert it?
Because an ontology designed in the abstract encodes what an expert thinks matters; an ontology grown from real questions encodes what gets used. A top-down model will lovingly axiomatize distinctions no buyer ever asks about and quietly omit the one comparison that closes deals. Intent-derived structure is more faithful to the knowledge’s actual job. It is, admittedly, less principled. I’ll take faithful over principled every time I’m building something that has to answer real questions.

The generation pipeline turns those questions into a Minimum Viable Ontology (MVO), a JSON-Schema artifact, in three phases.

Scoping. A model reads the question corpus and proposes five to seven knowledge domains, run through a validation-and-repair loop that enforces a domain count in range, snake_case naming, and mutual exclusivity. Overlapping domains get sent back with stronger constraints, up to a few retries. The output is a clean, non-overlapping partition of the space the questions actually cover.

Dual-model generation. Here’s a deliberate choice worth dwelling on, because there’s now hard data behind it. The schema isn’t drafted once; two independent models each produce a full draft. This hedges against single-model idiosyncrasy: the failure where one model’s quirks or blind spots get baked into your ground truth and you never notice because there’s nothing to compare against. How much does that matter? A 2025 Frontiers study on LLM-driven medical ontology mapping (https://pmc.ncbi.nlm.nih.gov/articles/PMC12061982/; Mavridis et al.) ran six systems against the same SNOMED CT task and got F1 scores ranging from the high 20s to the mid 90s: the same task, swinging by sixty-plus points purely on model choice. If which model you pick can move your results that far, betting your ground truth on one model is the failure mode the data is screaming about. Two drafts give you a disagreement signal.

Synthesis.
The two drafts are merged and gated by a semantic-similarity compliance check plus hard structural rules. Every object is closed (additionalProperties: false). Every string field carries a description. And, the part that matters, every leaf node must carry a _meta block with a rationale and source references.

```
// fragment of a generated MVO (illustrative)
"sso": {
  "type": "string",
  "description": "Single sign-on protocols supported (e.g. SAML, OIDC).",
  "_meta": {
    "rationale": "Recurring dealbreaker in security-led evaluations.",
    "source_references": ["acme-pm.example/security", "acme-pm.example/enterprise"]
  }
}
```

Provenance isn’t bolted on after extraction. It’s a structural requirement of the schema itself. A field that can’t say why it exists and where it’s grounded doesn’t get in. Both Volpini and Vasiliev put provenance-on-every-fact near the top of their design lists; here it’s a precondition of generation, not a later annotation pass.

I promised the purists a peace offering, and it’s real. The MVO serializes to OWL. The same schema can be emitted as Turtle and run through an actual description-logic reasoner, HermiT, for a consistency check. So when a semantic-web purist says “that’s not an ontology, it has no formal semantics,” the honest answer is: it has exactly as much formal semantics as you’d like, on demand. We can go fully formal; we’ve made an engineering decision not to run production on the formal layer, because for this job the reasoner buys little and costs flexibility and speed.

This is precisely the split Volpini argues for and the systematic review recommends: keep OWL for what OWL is good at, conceptual structure and consistency, and push validation of usable, actionable data into a deterministic layer. OWL defines what the world means; the validation layer defines what valid data looks like. Formal semantics as a checkable export, not a runtime dependency. Keep the rigor available; don’t pay for it where it doesn’t earn out.

A schema with no instances is a diagram.
The second pipeline fills it.

The naive approach hands a model some scraped text and asks for JSON. You get JSON, and a different shape every run, invented field names, no types, no evidence. It looks like knowledge and behaves like a guess. This is the exact mechanism behind Vasiliev’s failure modes: free-form generation invents Customer, Client, and Account Holder as three concepts, collapses a supplier into a manufacturer because the words rhyme, and dumps everything into a RELATED_TO edge that means nothing.

The alternative is schema-guided extraction, and for the typed-triple layer I lean on OneKE, a dockerized, schema-guided, multi-agent extraction framework (https://arxiv.org/abs/2412.20005; Luo et al., WWW Companion ’25; OpenSPG) whose Schema, Extraction, and Reflection agents are driven by a schema rather than a freeform prompt. In my pipeline the MVO is adapted into OneKE’s schema format (the SaaS types become entity types, the relations become constrained predicates) and extraction returns typed subject–predicate–object triples with evidence spans, not loose JSON.

The difference is the entire game, and it’s measurable. In the Mavridis study, schema-guided prompting with vector-retrieval grounding and an expert-validated reference beat a conventional ontology-matching baseline by more than forty points of precision. Constraint plus grounding isn’t an aesthetic preference; it’s the difference the numbers keep showing.

A free-form extractor reads “Acme PM includes email automation for marketing teams” and gives you {“feature”: “email automation”}: untyped, predicate-less, unverifiable, renamed next Tuesday. The schema-guided extractor returns a triple whose predicate (Product hasFeature) is drawn from a fixed, schema-defined set, whose subject and object are typed, and which carries the exact character span of its evidence. One of these you can query in SPARQL or Cypher and validate in code. The other you can only hope about.
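To make the contrast concrete, here is a minimal sketch of what a typed triple with an evidence span might look like as a data structure. The `Triple` class, its field names, and the confidence value are hypothetical illustrations, not OneKE’s actual output format; only the example sentence and the Product/Feature/hasFeature vocabulary come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """A typed subject-predicate-object claim plus the span that grounds it.

    Hypothetical shape for illustration; not OneKE's real output schema.
    """
    subject: str
    subject_type: str
    predicate: str
    object: str
    object_type: str
    evidence_start: int  # character offsets into the source text
    evidence_end: int
    confidence: float

    def evidence(self, source: str) -> str:
        """Return the exact slice of the source text cited as evidence."""
        return source[self.evidence_start:self.evidence_end]


source = "Acme PM includes email automation for marketing teams"

triple = Triple(
    subject="Acme PM",
    subject_type="Product",
    predicate="hasFeature",
    object="email automation",
    object_type="Feature",
    evidence_start=0,
    evidence_end=source.index(" for"),  # span stops before "for marketing teams"
    confidence=0.93,                    # made-up value for the example
)

span = triple.evidence(source)
print(span)
```

Because the span is stored as offsets rather than a paraphrase, any later check can re-slice the source and confirm that the subject and object really occur in the cited text.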
Schema-guided extraction makes the output well-shaped. It does not make it true. Between the model and the graph sits a validator that is pure, boring, deterministic code, and it’s the most important component in the system, because it’s where Vasiliev’s “AI proposes, systems validate, humans decide, the graph records” stops being a slogan and becomes running software.

The governing rule across everything I build: LLMs classify, Python calculates. The model is allowed to propose a triple. It is not allowed to decide whether the triple is admissible. Every extracted triple is checked against four constraints before it enters the graph:

```python
# `spans` (the evidence check) and THRESHOLD are defined elsewhere in the pipeline.
def admit(triple, schema):
    return all([
        triple.subject_type == schema.domain_of(triple.predicate),  # domain
        triple.object_type == schema.range_of(triple.predicate),    # range
        spans(triple.subject, triple.object, triple.evidence),      # evidence
        triple.confidence >= THRESHOLD,                             # confidence
    ])
```

Domain and range checks reject triples whose types don’t fit the predicate; that’s false equivalence and vague-relationship debt caught at the door. The evidence check requires that the subject and object text actually appear in the cited span: no span, no entry. That single rule is what catches the most dangerous failure, a confident, well-formed, completely fabricated fact, because a fabrication rarely survives the evidence check. The confidence gate drops the marginal. Failing triples are filtered, counted, and logged, not quietly waved through.

This is not a vibe; it’s a measured bar. The quality targets I hold the typed-triple layer to are explicit: predicate validity above 95% (triples obey domain/range), evidence quality above 90% (subject and object genuinely present in the span), extraction stability above 85% across repeated runs, recall above 80%, precision above 85%.
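As a worked illustration of the four checks, the sketch below is a self-contained version of such a gate that also counts why triples were rejected. Everything in it is an assumption made for the example: the `SCHEMA` slice, the `Triple` fields, the case-insensitive substring evidence test, and the `THRESHOLD` value of 0.8 are hypothetical, not the pipeline’s actual settings.

```python
from collections import Counter
from dataclasses import dataclass

THRESHOLD = 0.8  # hypothetical confidence cut-off

# predicate -> (domain type, range type); a tiny hypothetical slice of a SaaS schema
SCHEMA = {
    "hasFeature": ("Product", "Feature"),
    "includes": ("Plan", "Feature"),
    "hasPrice": ("Plan", "Price"),
}


@dataclass(frozen=True)
class Triple:
    subject: str
    subject_type: str
    predicate: str
    object: str
    object_type: str
    evidence: str  # the cited text span
    confidence: float


def failures(t: Triple) -> list[str]:
    """Return the names of the checks this triple fails (empty list means admit)."""
    if t.predicate not in SCHEMA:
        return ["predicate"]
    domain, range_ = SCHEMA[t.predicate]
    failed = []
    if t.subject_type != domain:
        failed.append("domain")
    if t.object_type != range_:
        failed.append("range")
    text = t.evidence.lower()
    # No span, no entry: subject and object must both appear in the cited text.
    if t.subject.lower() not in text or t.object.lower() not in text:
        failed.append("evidence")
    if t.confidence < THRESHOLD:
        failed.append("confidence")
    return failed


def gate(triples):
    """Split candidates into admitted triples and a count of rejection reasons."""
    admitted, rejected = [], Counter()
    for t in triples:
        failed = failures(t)
        if failed:
            rejected.update(failed)
        else:
            admitted.append(t)
    return admitted, rejected


sentence = "Plan Pro includes SSO and costs $40/seat."
candidates = [
    Triple("Plan Pro", "Plan", "includes", "SSO", "Feature", sentence, 0.95),
    # Well-formed and confident, but "audit logs" never appears in the span.
    Triple("Plan Pro", "Plan", "includes", "audit logs", "Feature", sentence, 0.97),
    # Wrong subject type for the predicate: `includes` expects a Plan.
    Triple("Acme PM", "Product", "includes", "SSO", "Feature", "Acme PM includes SSO.", 0.90),
]

admitted, rejected = gate(candidates)
print(len(admitted), dict(rejected))
```

Returning the list of failed checks, rather than a bare boolean, is what makes the "filtered, counted, and logged" part cheap: the same counters can feed rates such as predicate validity or evidence quality.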
The validator is the line between a draft and a fact, and it is exactly the external validation mechanism the systematic review says the field is missing, sitting in the evaluation and maintenance stages it says nobody studies.

Map the whole pipeline onto Vasiliev’s operating model and it lines up beat for beat: OneKE proposes; the typed-triple layer stages; the deterministic validator validates; a human reviews the contradicted and the low-confidence and decides; provenance records the decision. Generation is one stage of six, and it’s the only one a model is trusted to run alone.

One more point both Volpini and Vasiliev raise, and the one most systems get dangerously wrong: open-world versus closed-world. If a product has no allergen relationship, does that mean it has no allergens, or that nobody loaded the data? For an agent about to act, the difference between unknown, false, and not allowed changes what it does. A model should never silently decide which of those a missing fact means.

So the system doesn’t. A feature we haven’t seen evidence for is recorded as unobserved, not absent: open-world by default. The only negative signal treated as externally defensible is contradicted: a claim the corpus actively disproves, not merely one it fails to mention. Unsupported-but-not-contradicted stays internal and diagnostic; it never becomes a public “this product can’t do X.” Absence of evidence is not evidence of absence, and the graph is built to keep those two apart, which is the closed-world discipline Vasiliev asks for, applied only where it’s earned.

The output isn’t a static graph you generate once and admire. Extracted facts are linked, bidirectionally, to the exact text chunks they came from: structured facts in a graph store, source chunks in a vector store, joined by a deterministic content-addressed identifier so the link survives re-ingestion.
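One way a deterministic content-addressed identifier could be derived is sketched below. The text above does not say how the identifier is computed, so the choices here (SHA-256, Unicode and whitespace normalization, truncation to 16 hex characters, and the dictionary stand-ins for the two stores) are all assumptions for illustration.

```python
import hashlib
import unicodedata


def chunk_id(text: str) -> str:
    """Derive an identifier from the chunk's content alone.

    Assumed normalization: Unicode NFC plus collapsed whitespace, so that
    cosmetic differences between ingestion runs do not change the ID.
    """
    normalized = unicodedata.normalize("NFC", text)
    normalized = " ".join(normalized.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]


chunk = "Plan Pro includes SSO and costs $40/seat."
cid = chunk_id(chunk)

# Stand-ins for the two stores; a real system would use a vector DB and a graph DB.
vector_store = {cid: {"text": chunk}}
fact = {"s": "Plan Pro", "p": "includes", "o": "SSO", "chunk_id": cid}
facts_by_chunk = {cid: [fact]}  # reverse index: chunk -> facts it supports

# Fact -> passage, and passage -> facts, both resolved through the same key.
grounding = vector_store[fact["chunk_id"]]["text"]
supported = facts_by_chunk[cid]

# Re-ingesting the same passage with different whitespace yields the same ID.
reingested_id = chunk_id("  Plan Pro   includes SSO and costs $40/seat.\n")
print(cid == reingested_id)
```

Because the key depends only on content, neither store needs to know about the other’s internal IDs, and a re-scrape of an unchanged page reattaches to the existing facts instead of orphaning them.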
Ask the graph a question and you can walk straight to the passage that grounds the answer; read a passage and you can see which facts it supports.

Facts are also versioned and timestamped. When a vendor changes a price, the old value isn’t overwritten. It’s superseded, with history retained, point-in-time queries supported, and cross-source conflicts surfaced rather than silently resolved. This is the part the systematic review found almost nobody works on: maintenance, the single most underrepresented phase in the entire literature. An ontology that can’t tell you when something was true, or that two sources disagree, isn’t grounding anything; it’s asserting. Versioned, provenance-linked facts are what let the structure behave like evidence instead of opinion, and what let it survive contact with a domain that changes every quarter.

Which brings the build back to the oldest argument in this space, now concrete. The same governed knowledge (the MVO, the validated triples, the linked facts) is projected into whichever paradigm the question demands. Need formal reasoning or a consistency proof? RDF/OWL, queryable in SPARQL, checkable under HermiT. Need to traverse relationships: what integrates with what, which plans include which features? A property graph, queried in Cypher. Need fuzzy semantic retrieval over the source text? Vectors, in a hybrid store, which is also the lexical/retrieval layer Volpini argues the agent actually enters through, since an agent starts from a question, not a SPARQL query. Same truth, three shapes.

This is the payoff of refusing to pick a side. The RDF-versus-property-graph war assumes you must choose one representation and live with its weaknesses. You don’t. You choose a single governed source of truth (schema-constrained, validated, provenance-bearing) and project it into whatever the consuming system needs. The hybrid isn’t a compromise between camps.
It’s the recognition that representation is a serving decision, not a religious one.

Strip away the implementation and the claim is small and stubborn. Volpini is right that the ontology is becoming the operating layer for agents. Vasiliev is right that scaling unvalidated meaning into that layer is how you get confident error as infrastructure. These are not opposing views; they’re the same requirement seen from two sides. The agentic web needs the ontology in the operating layer and cannot afford ungoverned generated meaning there. The resolution isn’t slower generation or smaller ambition. It’s a gate.

The pipeline I’ve described is what the gate looks like in one domain: a schema grown from real questions, drafted by two models and synthesized, filled by schema-guided extraction, admitted only by deterministic validation, linked to its evidence, versioned over time, honest about what it hasn’t seen, and projectable into RDF, a property graph, or vectors as the job demands. Every one of those choices trades a little formal purity for operational trust. On purpose.

So, to the purists, and to Andrea and Sergey, who I hope show up in the comments: tell me where this breaks. Where does skipping the reasoner at runtime actually cost me? Where does question-derived scoping miss something a principled domain model would have caught? Where does the validator let something through that it shouldn’t? I’ve made the trade deliberately, and I’d rather argue it with you than around you.

The reasoner export is right there. It passes. Now let’s talk about whether generation, on its own, should ever be allowed to write to the graph.

The Minimum Viable Ontology: Building an Operating-Layer Knowledge Graph You Can Actually Trust (https://pub.towardsai.net/the-minimum-viable-ontology-building-an-operating-layer-knowledge-graph-you-can-actually-trust-379cc51a5eec) was originally published in Towards AI (https://pub.towardsai.net) on Medium.