# GENPROHYFRA: Generative-Programmatic Hybrid Framework | An AI design philosophy

> Source: <https://gist.github.com/cryobiochem/5460e7770d63f2df26b80f3141dce694>
> Published: 2026-05-12 11:31:08+00:00

**Topics:** AI Economy · Skilltree Architecture · GPH Framework · Token Rationing · The Future of Enterprise AI

Hello world.

This piece is a direct challenge to the status quo of AI adoption at all costs. From personal experience, I'm seeing the industry spiral into careless packaging of business solutions as AI products that are fully automated, fully generative, fully unintelligible, and full of slop. There is little to no consideration for agentic architecture, and no consideration at all for designing economically efficient pipelines. Subagents are packaged as independent on-call specialists, regulated by a single orchestrator bloated with instructional workflows, and this framework — which seems to me unsustainable at scale — is wrapped into an AI product. Programmatic code is being fully abandoned because AI can generate anything. But we should ask ourselves: if AI can generate anything,

should it?

Personally, I reject black-box AI systems and fully generative frameworks for deterministic problem scenarios. I am an advocate for fully controllable products, where performance is tied to in-house pipeline checkpoints rather than full dependence on the performance of an AI model provider. I argue that companies should take ownership of the performance of their AI products through precision design of hybrid pipelines with conditional checkpoints. This is why I conceptualised GENPROHYFRA, and I apply this design philosophy daily in my own work and that of my team to minimise token costs.

Companies are stretching themselves thin to adopt AI at all costs in order to remain competitive. The next phase will be AI economic rationing, and retrofitting all those careless implementations will be a nightmare.

I hope you connect with my views, and I am open to your feedback in improving this philosophy further. Thank you for reading.

Bruno

May 6th, 2026

**1. Don't generate what can be computed.** `If code answers the question, don't ask an LLM.`

**2. Tier your models.** `Every task has a minimum viable model, using Opus to reformat JSON is wasteful.`

**3. Route, don't broadcast.** `Give each node only the context needed to do its job, not a shared message board.`

**4. Make branching explicit.** `Adopt true decision tree architectures instead of prose instruction.`

**5. Observe before you optimise.** `Can't save on what you can't measure, monitor token consumption on meta-runs.`

**6. Don't delegate to AI what AI doesn't need to decide.** `The orchestrator is not a god.`

**7. Own your pipeline's economics.** `AI providers will not optimise on your behalf, retrofitting is a nightmare.`

**8. Design for a thousand agents.** `If the architecture that works for your pilot will collapse at scale, don't.`

**9. Invest tokens once.** `Build an excellent hatchet for every tree, not a good one for each.`

**10. Don't tokenmax.** `The measure of an effective pipeline is in how many tokens it didn't need to use.`

- [Author's Note](#authors-note)
- [The 10 Commandments](#the-10-commandments)
- [Abstract](#abstract)
- [Section 1: Introduction](#1-introduction)
- [Section 2: The Problem: Token Overconsumption in Modern AI Architectures](#2-the-problem-token-overconsumption-in-modern-ai-architectures)
- [Section 3: GENPROHYFRA: The Framework](#3-genprohyfra-the-framework)
- [Section 4: Skilltree: The Tool](#4-skilltree-the-tool)
  - [4.0 Core Intuition: Two Ways to Build a Thinking Machine](#40-core-intuition-two-ways-to-build-a-thinking-machine)
  - [4.1 Architecture Overview](#41-architecture-overview)
  - [4.2 The Pipeline Schema](#42-the-pipeline-schema)
  - [4.3 The Tokenconomics Dashboard](#43-the-tokenconomics-dashboard)
  - [4.4 Brownfield Support and the Interpret Command](#44-brownfield-support-and-the-interpret-command)
  - [4.5 Intent Routing: Structural Branch Elimination](#45-intent-routing-structural-branch-elimination)
  - [4.6 Pipeline Execution Preview](#46-pipeline-execution-preview)
  - [4.7 The Agent Instruction Quality Standard](#47-the-agent-instruction-quality-standard)
  - [4.8 Feedback and Friction Observability](#48-feedback-and-friction-observability)
- [Section 5: Architectural Comparison](#5-architectural-comparison)
- [Section 6: Token Economy: Quantitative Analysis](#section-6--token-economy-quantitative-analysis)
- [Section 7: Case Studies](#7-case-studies)
- [Section 8: Status Quo: The Agent Skills Standard](#8-status-quo-the-agent-skills-standard)
- [Section 9: Discussion on Limitations](#9-discussion-on-limitations)
- [Section 10: Towards an AI Economy-first Enterprise](#10-towards-an-ai-economy-first-enterprise)
- [Section 11: Conclusion](#11-conclusion)
- [References](#references)
- [License](#license)

The rapid enterprise adoption of generative AI — first through GitHub Copilot in late 2025, then the Claude Code boom of 2026 — has exposed a critical economic fault-line: organisations are consuming AI tokens far faster than their budgets allow. AI is a genuine accelerator, but the way it is currently deployed is fundamentally wasteful. The dominant pattern orchestrates LLM subagents to compartmentalise tasks, equips each with a set of MCP tools, and ships them as one-size-fits-all solutions that lean almost entirely on expensive generative inference. The prevailing instinct has become *use AI for everything you* **can**, *not everything you* **should**.

This white paper introduces **tokenconomics**: the discipline of designing AI systems to maximise value per token consumed. Its architectural expression is the **Gen**erative–**Pro**grammatic **Hy**brid **Fra**mework (**GENPROHYFRA**), a five-pillar standard built on one radical idea: **AI should operate at minimum necessity**. Generative faculties are inserted only where they are genuinely valuable; reliable deterministic code does the rest, and the boundary between the two is made explicit.

Putting this into practice requires rethinking how subagents are built, called, connected, and governed. Markdown files are excellent for natural-language instructions, but a flat collection of `.md` files invoked by an `.md` orchestrator is a primitive, loop-prone architecture that will not survive the scale ahead: hundreds or thousands of agents per codebase, each driving defined process pipelines. To demonstrate the alternative, we built **Skilltree**, a VS Code extension that restructures pipeline design from connected Markdown files into a YAML graph-based decision tree, with a visual IDE that Copilot Chat lacks and full reliance on embedded GitHub Copilot LLMs for enterprise safety and privacy (no external front-end, no outbound API calls). More than a boilerplate agent designer, Skilltree is the proving ground for the GENPROHYFRA philosophy and the problem of AI token economy.

The argument is simple: the measure of an effective AI system is not how many tokens it uses, but how few it needs to achieve the same result. *Tokenconomics* is the opposite of *tokenmaxxing*, and the organisations that internalise that distinction earliest will hold a compounding architectural and financial advantage in the AI-intensive decade ahead.

In April 2026, Reuters reported that Uber had exhausted its entire annual budget for Claude Code in four months [9]. When a company burns twelve months of AI spend in a third of the time, the lesson is not that AI is too expensive — it is that nobody designed the pipelines to be economically efficient.

When I started writing the first draft of this paper in late April 2026, my next sentence was: "**I expect this to become a recurring headline.**" Today, Uber is no longer an isolated case; it is the most visible point in a wave of 2026 retrenchments that all trace back to the same root cause:

- **Uber** burned its entire 2026 AI coding budget in four months.
- **Microsoft** cancelled most of its internal Claude Code licenses.
- **GitHub** began moving Copilot off flat-rate pricing toward per-token billing.
- **Cursor** scrapped its "unlimited" plan after usage bills spiked.
- **Klarna** started rehiring humans after AI support quality dropped.
- **Duolingo** pulled AI back out of employee performance reviews.

The pattern is consistent. AI output is not linear: it takes several tries, repeated inputs, and sustained effort to extract a great result — and every one of those tries is heavy token-consuming work. Costs stack up fast precisely because the *usage* is undisciplined, not because the technology is unaffordable. One independent developer reportedly burned through $1.3M of tokens on a single project; multiply that by an engineering team and the invoice will be a horror movie. So, executives who first jumped on AI transformation are now scrutinising the bill line by line.

This pattern is what makes the concept of *tokenconomics* urgent, and it needs to arrive before every engineering team is running thousands of expensive pipeline executions a day. The organisations that establish cost-aware architecture ahead of mass adoption will lock in structural efficiencies that compound; those that spend mindlessly will face the far harder problem of retrofitting efficiency onto entrenched systems. AI is not going away, but the unlimited-AI fantasy is, and the future belongs to teams that use AI *where it makes sense*.

> **Key Observation:** Yes, AI providers can optimize token consumption on their end, and this has been the case with the natural evolution of AI technology. But passively relying on this creates a **critical dependency** that teams should try to avoid by implementing best practices on their own end.

The dominant pattern for AI agent deployment in 2026 is a **linear hierarchical chain**: a central orchestrator (typically an `AGENTS.md` or equivalent routing file) delegates tasks to a flat list of specialised subagents in an `agents/` folder. Each agent runs on the same model, processes the full shared context, and hands results back to the orchestrator in strict sequence (*see LangChain*).

This does not scale in principle. Consider a company like Google Cloud or a large bank that, instead of 30 subagents, has 5,000. A single `AGENTS.md` controlling 5,000 flat `.md` files is not an architecture. The current mitigation is to empower the orchestrator with workflow instructions, an agent roster, task tiers, and intent-based subgroups, creating dedicated switchable pipelines (*see LangGraph*). This works for now, but at scale it *deepens* the problem: **bloating the orchestrator with instructional routing produces a pseudo-linear workflow, not a true decision tree.**

To understand the right abstraction, let's use mathematics as an analogy. You solve a system of two equations by solving for one variable, then substituting it in the other equation to solve for the other variable. But scale to a hundred equations and substitution collapses — you need a higher-dimensional construct: matrix algebra, which sustains indefinite complexity, working identically for a 2×2 or an *n*×*n* system. As detailed below, one of GENPROHYFRA's pillars is exactly this: a true decision-tree subagent schema as a structural abstraction that does not degrade as agent counts grow.

There is a compounding failure mode hiding in the flat pattern. When orchestrators are themselves LLMs, you get an AI controlling an AI controlling an AI, with no deterministic exit conditions — and you quickly find yourself in orchestration loops, context drift, and runaway delegation. Need to orchestrate the orchestrators? You add a super-orchestrator, and the regression continues indefinitely. This is a textbook *argument from poor design* (the *God Architect* fallacy): flat hierarchies have no natural boundary on where delegation ends.

Three fundamental inefficiencies characterise the current state of AI agent architectures in companies:

- **Degenerative linear topology:** No matter how many branching instructions are written in prose, subagents degenerate into sequential pipelines. There is no *explicit conditional switching* based on each subagent's performance or output.
- **Blind model homogeneity:** Every step uses the same model tier, even when task complexity varies wildly. Why use Claude Opus to format documentation? A cheaper model, or no model at all, will do.
- **AI-bloated checkpoints:** Validation and other purely deterministic tasks are routinely handed to generative models when a programmatic validator (e.g. `pytest`) gives the same answer at zero cost. Asking an LLM whether the tests pass is overkill.

GENPROHYFRA, described in the sections that follow, addresses all three. It is not a replacement for generative AI; it is a discipline for deploying generative AI at *minimum necessity*, substituting deterministic alternatives wherever they suffice.

To see where tokens are wasted, take an illustrative enterprise pipeline: a code-review workflow. In a naïve implementation (by naïve I mean built with no regard for economic efficiency), a developer submits a pull request and the following sequence runs:

1. An orchestrator agent reads the entire diff and project context: ~**40,000 tokens**.
2. A technical reviewer re-reads the same diff plus architecture docs: another ~**45,000 tokens**.
3. A practical reviewer re-reads the diff: ~**35,000 tokens**.
4. A UX-impact reviewer re-reads the diff plus UX guidelines: ~**40,000 tokens**.
5. A verdict synthesiser aggregates all three reviews: ~**20,000 tokens**.
6. A quality-gate evaluator decides whether the review passes: ~**15,000 tokens**.

Now ask yourself which of these genuinely required a frontier model:

- Step 6 certainly did not — a pass/fail check is deterministic.
- Step 4 did not need to run at all if the diff touched no front-end files.
- The orchestrator in step 1 did not need an LLM to decide which reviewers to invoke; given a solid deterministic way-of-working, that choice follows directly from the diff's file list.

**Verdict:** three of six steps could have been rationed away or stripped of AI cost entirely. The diagram below sketches what that optimisation looks like.

*Figure 1: A naïve six-step code-review pipeline where every step invokes a frontier model. Three of the six steps could be eliminated, made programmatic, or use cheaper models. See the companion diagram for the GENPROHYFRA-optimised version.*
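The arithmetic behind that verdict can be made concrete. The sketch below sums the approximate token counts quoted for the six steps and subtracts the three steps identified as avoidable. It assumes, optimistically, that those steps drop to zero generative tokens (in practice the UX review would still run whenever front-end files change); the step names are illustrative labels, not identifiers from any tool.

```python
# Approximate token counts quoted for the naive six-step review pipeline.
NAIVE_STEPS = {
    "orchestrator": 40_000,
    "technical_reviewer": 45_000,
    "practical_reviewer": 35_000,
    "ux_reviewer": 40_000,
    "verdict_synthesiser": 20_000,
    "quality_gate": 15_000,
}

# Steps the text identifies as avoidable: routing from the file list (step 1),
# a UX review on a diff with no front-end files (step 4), a pass/fail check (step 6).
AVOIDABLE = {"orchestrator", "ux_reviewer", "quality_gate"}


def summarise(steps: dict[str, int], avoidable: set[str]) -> dict[str, float]:
    """Compare the naive total with the total once avoidable steps cost nothing."""
    naive = sum(steps.values())
    rationed = sum(tokens for name, tokens in steps.items() if name not in avoidable)
    return {
        "naive_tokens": naive,
        "rationed_tokens": rationed,
        "saved_tokens": naive - rationed,
        "saved_pct": round(100 * (naive - rationed) / naive, 1),
    }


summary = summarise(NAIVE_STEPS, AVOIDABLE)
print(summary)
```

Under these assumptions the pipeline falls from roughly 195,000 to 100,000 tokens per review, a saving of just under half.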

The current `AGENTS.md` pattern establishes a flat registry of agents beneath a single orchestrator. At scale, this topology has three foreseeable cost problems.

**1. It encourages over-delegation.** Because the orchestrator is an LLM, it tends to invoke or contextualise more agents than strictly necessary. LLMs are trained to be thorough, not economical.

**2. It establishes static agent fleets per task.** If the intent is feature development, the recruited fleet is typically "*architect -> dev -> tester -> reviewer -> PR agent*." But without explicit conditions on when or whether each runs, every agent is in scope for every task. The pipeline is static and cannot be modulated (*see the Alternative Splicing note below*).

**3. It lacks explicit routing logic.** Efficiency improvements then require editing prompts rather than relying on structure, which is a brittle non-deterministic approach.

And as Section 1.2 noted, the flat pattern also opens the door to **orchestration loops**: an LLM orchestrator that spawns sub-orchestrators which spawn workers, with no deterministic termination — a costly [ant mill](https://en.wikipedia.org/wiki/Ant_mill) of decision conflict.

Decision-tree architectures can capture all of this organically:

- Every branch has an explicit activation condition;
- Every condition has an explicit exit;
- Pipelines cascade unidirectionally;
- You can cap the number of retry cycles when conditions are unfulfilled;
- Routing is rule-based or democratic, not left to the discretion of conflicting LLMs.
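As a minimal sketch of the first and last points, the snippet below selects code reviewers from a diff's file list using explicit, rule-based activation conditions instead of an LLM orchestrator. The `Branch` class, the reviewer names, and the file-extension rules are all hypothetical, chosen only to mirror the code-review example earlier; retry caps are not shown here.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Callable

FRONTEND_SUFFIXES = {".tsx", ".jsx", ".css", ".html"}  # assumed convention


@dataclass(frozen=True)
class Branch:
    name: str
    # Explicit activation condition: a pure function of the changed file list.
    activates: Callable[[list[str]], bool]


def touches_frontend(files: list[str]) -> bool:
    return any(PurePosixPath(f).suffix in FRONTEND_SUFFIXES for f in files)


BRANCHES = [
    Branch("technical_reviewer", lambda files: len(files) > 0),
    Branch("practical_reviewer", lambda files: len(files) > 0),
    Branch("ux_reviewer", touches_frontend),
]


def route(files: list[str]) -> list[str]:
    """Return the branches whose condition holds; no model call is involved."""
    return [b.name for b in BRANCHES if b.activates(files)]


print(route(["api/handlers.py", "api/models.py"]))
print(route(["web/Button.tsx", "api/handlers.py"]))
```

The first call activates only the technical and practical reviewers; the second also activates the UX reviewer because a front-end file is present. Changing the routing means editing a rule, not a prompt.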

> **Drawing Inspiration from Genetics:** Our genes define who we are and what proteins we produce. The coding portions of genes are read in codons (sequences of 3 units). The classic textbook simplification is a 1:1 relationship: one gene produces one protein. However, the *same* gene sequence can produce *different* proteins when only certain coding segments (exons) are retained in the final transcript. This modulation is called **Alternative Splicing**, and it conserves storage (fewer unique genes needed) and energy (no need to express unnecessary segments). By analogy, a single agentic pipeline can be conditionally modulated to activate only the subagents a task needs (saving energy; i.e. tokens/money) without building bloated orchestrators with one workflow per task (saving storage; i.e. pipeline complexity).

Perhaps the most egregious token waste is using generative AI to validate outputs that could be validated deterministically. I have repeatedly seen these in production pipelines:

- **"Does this Python code pass its unit tests?"** Answerable by running `pytest` [15].
- **"Does this JSON response conform to the expected schema?"** Answerable by `jsonschema validate`.
- **"Does this TypeScript compile without errors?"** Answerable by `tsc --noEmit`.
- **"Is the output longer than 200 words?"** Answerable by `len(text.split())`.

These are deterministic, free solutions to problems simple enough that delegating them to AI has no justification. This anti-pattern has emerged from convenience and over-reliance on AI providing every solution at your fingertips, which breeds cognitive conformism. At the extreme, users accept any AI output without supervision because of **AI fatigue**. There is no argument under which spending generative tokens on a deterministic checkpoint is necessary.
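A minimal sketch of such checkpoints, using only the Python standard library. The function names are hypothetical, and the schema check is a deliberately simplified stand-in (required keys and types only) for a full JSON Schema validator.

```python
import json
import subprocess
import sys


def longer_than(text: str, min_words: int = 200) -> bool:
    """'Is the output longer than N words?' answered by counting."""
    return len(text.split()) > min_words


def has_required_fields(raw: str, required: dict[str, type]) -> bool:
    """Simplified shape check: a JSON object with required keys of the right type."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(isinstance(data.get(key), kind) for key, kind in required.items())


def command_passes(cmd: list[str]) -> bool:
    """Pass/fail from a process exit code, as with a test runner or a compiler."""
    return subprocess.run(cmd, capture_output=True).returncode == 0


print(longer_than("too short"))
print(has_required_fields('{"verdict": "approve", "score": 9}', {"verdict": str, "score": int}))
print(command_passes([sys.executable, "-c", "raise SystemExit(0)"]))
```

In a real pipeline, `command_passes` would wrap commands such as `pytest` or `tsc --noEmit`; the placeholder command here only keeps the example runnable anywhere.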

> **AI fatigue:** A phenomenon of mental exhaustion, brain fog, and decision fatigue that occurs when you rely too heavily on artificial intelligence for daily tasks. It arises because evaluating and checking AI-generated content can take more cognitive energy than doing the work yourself. It is related to decision overload and the productivity paradox.

There is a final dimension to the cost problem that gets almost no attention, though everyone knows it: **we are dependent on AI oligopolies**.

Programmatic code written 30 years ago still works today, all else being equal. AI-dependent pipelines have no such *temporal sovereignty*, and that has real consequences for products built on generative AI. If Anthropic dumbs down its models overnight, thousands of enterprise products degrade with them. Companies are hostage to third-party decisions on pricing, model updates, API changes, and discontinuation, so they must continually adapt to survive them.

This is not hypothetical. Training and serving frontier models requires capital in the hundreds of billions and infrastructure at global scale, concentrating meaningful AI capability in a handful of organisations — Google, Anthropic, Amazon, Microsoft, Meta — an effective **AI compute oligopoly**.

Unless a company can build its own models and data centres (*categorically impossible at that scale for virtually every organisation on Earth*), this dependency cannot be eliminated, **but it can be managed**. Moving every possible step to deterministic code and adopting efficient rerouting strategies are a first step towards insulating company pipelines from external over-dependence.

GENPROHYFRA (**Gen**erative **Pro**grammatic **Hy**brid **Fra**mework) is a design philosophy for AI agent pipelines built on one core principle: **AI at minimum necessity**. The goal is to shift developer thinking toward using Gen AI only where it is genuinely necessary, leveraging deterministic code where it already does the job for free. When the boundary between the two is made explicit, AI pipelines become more economical and auditable.

This is not a framework that bluntly asks you to discard AI, but one that promotes using AI *precisely*:

- Replace with programmatic code where AI is not needed;
- Where it is needed, use the right model and price tier for the right task, gated by proper activation conditionals.

The result is an AI pipeline that is cheaper, more predictable, more scalable, and more governable than anything a flat agent hierarchy can produce.

Before jumping into what composes GENPROHYFRA, you might wonder: won't AI companies just optimize the models so that I spend less? Perhaps, but waiting for them to do so is exactly the dependency this framework argues against. The road to decoupling from external AI oligopolies and their infrastructural decisions on model performance and pricing is to **take ownership** of those decisions internally, right now.

**The Five Pillars**

| Pillar | Name | Description |
|---|---|---|
| G-1 | Tier-First Model Selection | Route to the least powerful model that reliably works. |
| G-2 | Zero-Cost Programmatic Gates | Replace AI validators with deterministic shell commands. |
| G-3 | Hybrid Pipeline Embedding | Generative AI and programmatic nodes as equal DAG citizens. |
| G-4 | Context Rerouting | Efficiently route to cheap, context-switchable decision-tree branches. |
| G-5 | Token Economy Observability | Measure naïve vs. routed costs, model distribution, checkpoint performance. |

I shall now explain what these pillars amount to in the GENPROHYFRA philosophy and how they can later serve as design standards in a tool with practical application.

The first principle in building efficient pipelines is model selection. Not all AI tasks require frontier models, but most tools currently only allow a single model to run an orchestration. Tier-first model selection is a practice of assigning each pipeline node the least powerful — and least expensive — model that can reliably do the job.

The question at every node is not "what is the best model for this?" but "what is the *minimum* model that can do this reliably?" Writing documentation does not require Claude Opus, but development might. Classifying a file type does not require an LLM at all. When a pipeline defaults every node to a frontier model at user input, regardless of the existing orchestration, it is not being economical. Tier-first selection makes the cost consequence of every model assignment visible and deliberate. This is an approximate categorical separation:

| Tier | Class | Example models |
|---|---|---|
| Tier 0 | No LLM | pytest · ESLint · jsonschema · regex |
| Tier 1 | Free-Cheap | GPT-4o-mini · Gemini 2.5 Flash · Claude Haiku 4.5 |
| Tier 2 | Mid Tier | GPT-4o · GPT-4.1 · Gemini 2.5 Pro · Claude Sonnet 4 |
| Tier 3 | Frontier | GPT-5 · o3 · Claude Opus 4 · o1 |
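One way to make the minimum-model question mechanical is to record, per task type, the lowest tier known to work, and let the pipeline look it up. The sketch below is illustrative only: the task names and their tier assignments are assumptions, and the entry per tier is simply one example taken from the table above.

```python
from enum import IntEnum


class Tier(IntEnum):
    NO_LLM = 0
    CHEAP = 1
    MID = 2
    FRONTIER = 3


# One example per tier, taken from the tier table.
EXAMPLE_RUNNER = {
    Tier.NO_LLM: "deterministic tool (e.g. pytest)",
    Tier.CHEAP: "Claude Haiku 4.5",
    Tier.MID: "Claude Sonnet 4",
    Tier.FRONTIER: "Claude Opus 4",
}

# Hypothetical assignments: the minimum tier judged reliable for each task.
MIN_TIER = {
    "run_tests": Tier.NO_LLM,
    "classify_file_type": Tier.NO_LLM,
    "write_docs": Tier.CHEAP,
    "review_code": Tier.MID,
    "design_architecture": Tier.FRONTIER,
}


def select(task: str) -> tuple[Tier, str]:
    """Look up the minimum tier for a task; unknown tasks must be tiered explicitly."""
    if task not in MIN_TIER:
        raise KeyError(f"no tier assigned for task {task!r}; assign one at design time")
    tier = MIN_TIER[task]
    return tier, EXAMPLE_RUNNER[tier]


for task in ("run_tests", "write_docs", "design_architecture"):
    tier, runner = select(task)
    print(f"{task}: tier {int(tier)} -> {runner}")
```

Raising on an unknown task, rather than silently defaulting to the frontier tier, is a deliberate choice in this sketch: it forces the tier decision to be made at design time.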

The hybrid component I allude to in this framework is this: the most powerful cost-reduction mechanism available to pipeline designers is replacing AI validators with deterministic ones. A developer might reasonably use generative AI to *build* a testing framework — but wiring in an AI validation subagent that generatively re-comprehends the codebase to check whether tests pass is overkill and wasteful.

You can delegate programmatic workloads to shell commands (e.g. `pytest`), and most AI models already do this. However, it is treated as just another instruction down the pipeline. I propose that a programmatic gate should instead be a proper checkpoint conditional: it is associated with a performance score, and whether it passes or fails determines branch rerouting or re-iteration of the flow.

The primitive idea here is disarmingly simple — *if the answer can be computed, don't generate it*. This is exactly where the seam between generative and programmatic belongs. GENPROHYFRA calls these **rerouting conditionals**: explicit, programmable gates that decide whether a decision tree goes left or right. They pass or fail. Current AI implementations over-rely on generative outcomes, forgetting that the business processes they automate are, and have always been, inherently deterministic.

> **Principle G-2 in Practice:** In the GENPROHYFRA-compliant tool **Skilltree**, built in Section 4, the checkpoint conditional is tuned by `validation_cmd`, a pass/fail gate, while `score_threshold` is a performance quantifier that can disable further LLM calls. For example: "if test coverage > 80%, proceed to deployment." In a CI/CD pipeline, the integration runs the coverage tool, parses the numeric output, and routes accordingly. An LLM is invoked only when the decision genuinely needs language understanding or generation *from* the validation results, never to perform the validation itself.
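The coverage example can be sketched as follows. Only the names `validation_cmd` and `score_threshold` come from the description above; the `Gate` class, its other fields, the regex-based score parsing, and the branch names are hypothetical and are not Skilltree's actual schema. A stand-in command that prints a coverage figure replaces a real coverage tool so the example runs anywhere.

```python
import re
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class Gate:
    validation_cmd: list[str]   # command whose output contains a percentage score
    score_threshold: float      # score that must be exceeded to take the pass branch
    on_pass: str                # hypothetical branch names
    on_fail: str

    def evaluate(self) -> tuple[str, float]:
        result = subprocess.run(self.validation_cmd, capture_output=True, text=True)
        match = re.search(r"(\d+(?:\.\d+)?)\s*%", result.stdout)
        # A failed command or unparseable output is treated as a score of zero.
        score = float(match.group(1)) if result.returncode == 0 and match else 0.0
        branch = self.on_pass if score > self.score_threshold else self.on_fail
        return branch, score


# Stand-in for a real coverage tool: prints a fixed coverage figure.
fake_coverage = [sys.executable, "-c", "print('TOTAL coverage: 84%')"]

gate = Gate(fake_coverage, score_threshold=80, on_pass="deploy", on_fail="write_more_tests")
print(gate.evaluate())
```

No model is consulted at any point: the branch follows from a number parsed out of the command's output.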

Consequently, GENPROHYFRA treats generative and programmatic nodes as equal participants in a single pipeline graph. The designer composes both, weaving them together according to what each step actually requires in compute. This hybrid embedding pays off beyond raw cost because it promotes a philosophy of rationing AI inside company processes. Context enrichment can happen programmatically *before* expensive generative steps. For example, a (programmatic) file-diff parser runs first, extracts structured change metadata, and passes only the relevant subset to the (AI) code reviewer, shrinking the context that subagent must process. Then, (programmatic) post-processors can validate AI outputs before they reach (AI) downstream nodes, catching format violations that would otherwise trigger expensive retry loops.

In this view, agentic pipeline design draws back from a feverish recurring runtime to discretely defined processes, in genuine collaboration between generative intelligence and deterministic precision. You invest most of the cost in a one-time design effort, and the ongoing process runs as a hybrid.

A simple analogy for this is: rather than forging a disposable hatchet every time you want to fell a tree, you spend your time sharpening an excellent hatchet, and cut many trees at no further cost.
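The diff-parser example above can be sketched as a three-node chain: a programmatic pre-processor, a generative reviewer, and a programmatic post-processor. The reviewer here is a stub standing in for a real model call, and every function name and the expected output format (`APPROVE` or `REQUEST_CHANGES` on the first line) is an assumption made for illustration.

```python
from typing import Callable

DIFF = """\
diff --git a/api/handlers.py b/api/handlers.py
+def create_user(payload):
+    return save(payload)
diff --git a/README.md b/README.md
+Updated install notes.
"""


def extract_relevant(diff: str, suffixes: tuple[str, ...] = (".py",)) -> str:
    """Programmatic node: keep only the hunks for files the reviewer cares about."""
    kept, keep = [], False
    for line in diff.splitlines():
        if line.startswith("diff --git"):
            keep = line.endswith(suffixes)
        if keep:
            kept.append(line)
    return "\n".join(kept)


def stub_reviewer(context: str) -> str:
    """Generative node (stubbed): a real pipeline would call a model here."""
    return "APPROVE\nNo issues found in the changed handlers."


def valid_verdict(output: str) -> bool:
    """Programmatic node: check the format before anything downstream consumes it."""
    lines = output.splitlines()
    return bool(lines) and lines[0] in {"APPROVE", "REQUEST_CHANGES"}


def run(diff: str, reviewer: Callable[[str], str]) -> str:
    context = extract_relevant(diff)
    review = reviewer(context)
    if not valid_verdict(review):
        raise ValueError("reviewer output failed the format gate")
    return review


context = extract_relevant(DIFF)
print(len(DIFF), "->", len(context), "characters sent to the reviewer")
print(run(DIFF, stub_reviewer).splitlines()[0])
```

Only the middle node would consume tokens in a real deployment, and it sees a smaller context because the README hunk is filtered out before it runs.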

In a flat agent hierarchy, context is broadcast to subagents down the pipeline through a consolidation of outputs, tool results and orchestrator instructions. Scaling a pipeline risks bloating that broadcast, either in volume or in loss of relevance. Context rerouting is the discipline of ensuring that each pipeline node receives *only the context it needs to do its job*. By establishing clear intent or task profiling beforehand, you can modulate which branch should be activated, saving on compute, time and money.

A decision-tree structure enables this far more simply than prose instructions do. To see why, consider first how rerouting looks when it is defined through prose instructions:

**Architecture 1: flat file instruction**

| From | To | Condition | Mode |
|---|---|---|---|
| START | → `@ingest`, `@freshness-checker`, `@query` | Always | Entry point |
| `@ingest` | → `@eval-gate` | Always | Sequential |
| `@eval-gate` | ✓ `@library` / ✗ `@lint` | score >= 8 | Branch-exclusive: only one path executes |
| `@lint` | → `@eval-gate` | Always | Sequential |
| `@freshness-checker` | → EXIT | Always | Pipeline complete |
| `@library` | → EXIT | Always | Pipeline complete |
| `@query` | → EXIT | Always | Pipeline complete |

In this case, we have a simple LLM Wiki that stores individual notes, which become queryable in a knowledge base. The pipeline has three intent modules: to `ingest` information from notes into the wiki, to do a `freshness-check` of the most up-to-date information, and to `query` the wiki. If you want to ingest new information, it must pass a programmatic data-quality `eval-gate`. If it does, the notes are pushed into the `library`; otherwise they are refined through `lint` processes until they pass the validation gate.

Defining this sort of prose-instruction hierarchy gets extremely complex in production-level pipelines for two reasons:

- You get extremely bloated instructional files;
- You cannot skip branches that were never defined.

However, a decision-tree architecture that goes beyond flat `.md` file architectures allows you to define branch dependencies independently. Using the Skilltree tool covered in Section 4, the same example would look like this:

**Architecture 2: contextual decision tree**

Put simply, the orchestrator consumes this GENPROHYFRA hybrid lineage of agentic (purple) and programmatic (orange) nodes instead of an instructional `.md` file. Only the required branches are executed. It is easy to see which of the two architectures is more scalable.
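As a rough textual stand-in for such a tree, the sketch below encodes the wiki pipeline as a declarative structure and walks only the branch selected by the intent. It is not Skilltree's YAML schema: the node names follow the Architecture 1 table, but the dictionary layout, the `MAX_RETRIES` cap, and the toy scoring and linting functions are assumptions.

```python
# Declarative tree: each intent maps to its own branch; other branches never run.
TREE = {
    "ingest": ["ingest", "eval-gate"],
    "freshness-checker": ["freshness-checker"],
    "query": ["query"],
}
PASS_SCORE = 8    # from the routing table: score >= 8 goes to the library
MAX_RETRIES = 3   # assumed cap on lint cycles


def score(note: str) -> int:
    """Toy programmatic quality score: penalise a missing title and stray whitespace."""
    s = 10
    if not note.startswith("# "):
        s -= 3
    if note != note.strip():
        s -= 3
    return s


def lint(note: str) -> str:
    """Toy lint step: trim whitespace and add a title if one is missing."""
    note = note.strip()
    return note if note.startswith("# ") else "# Untitled\n" + note


def run(intent: str, note: str = "") -> list[str]:
    """Walk the branch for one intent and return the nodes visited, in order."""
    trace = []
    for node in TREE[intent]:
        trace.append(node)
        if node != "eval-gate":
            continue
        retries = 0
        while score(note) < PASS_SCORE and retries < MAX_RETRIES:
            note = lint(note)
            retries += 1
            trace += ["lint", "eval-gate"]
        trace.append("library" if score(note) >= PASS_SCORE else "EXIT (rejected)")
    return trace


print(run("query"))
print(run("ingest", "  some raw note  "))
```

A `query` intent touches a single node, while an `ingest` of an untidy note passes through one lint cycle before reaching the library. The retry cap gives the lint loop a deterministic exit even if a note never reaches the pass score.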

You cannot optimise what you cannot measure. The fifth pillar of GENPROHYFRA is full transparency on token consumption and savings through contextual rerouting. This massively improves auditability, beta-testing your pipeline executions and refinement over time. A **tokenconomics** dashboard can track baseline consumption (what an all-frontier-model pipeline *would* have spent), actual consumption (what the routed pipeline *did* spend), savings percentage, cost per step, and multi-model distribution across nodes.
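A minimal sketch of those metrics, computed from per-step usage records. The per-tier prices are invented placeholders (not real provider rates) and the record fields are hypothetical. The baseline here assumes every step had run on the frontier tier with the same token count, so steps that consumed no tokens contribute nothing to it; a fuller baseline would estimate what those steps would have cost as LLM calls.

```python
from collections import Counter

# Placeholder prices per 1,000 tokens by tier: illustrative numbers only.
PRICE_PER_1K = {0: 0.0, 1: 0.001, 2: 0.01, 3: 0.05}
FRONTIER = 3

# Hypothetical run log: (step, tier actually used, tokens consumed).
RUN = [
    ("route_by_file_list", 0, 0),
    ("technical_review", 3, 45_000),
    ("practical_review", 2, 35_000),
    ("synthesise_verdict", 1, 20_000),
    ("quality_gate", 0, 0),
]


def cost(tier: int, tokens: int) -> float:
    return PRICE_PER_1K[tier] * tokens / 1000


def dashboard(run: list[tuple[str, int, int]]) -> dict:
    """Summarise baseline vs. actual spend, savings, per-step cost, and tier mix."""
    per_step = {step: round(cost(tier, tokens), 4) for step, tier, tokens in run}
    actual = sum(per_step.values())
    baseline = sum(cost(FRONTIER, tokens) for _, _, tokens in run)
    return {
        "baseline_cost": round(baseline, 4),
        "actual_cost": round(actual, 4),
        "savings_pct": round(100 * (baseline - actual) / baseline, 1),
        "cost_per_step": per_step,
        "tier_distribution": dict(Counter(tier for _, tier, _ in run)),
    }


print(dashboard(RUN))
```

The same summary could be produced at design time from estimated token counts, which is what makes the cost of each model assignment visible before the pipeline ever runs.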

The principle is that economic consequences must be visible at design time, not discovered months later in a billing dashboard or at the year-end budget review. This observability rounds out the quintessential pillars of the GENPROHYFRA framework for the economically efficient design of hybrid AI pipelines.

I am often asked if the branched subagent architecture in Skilltree isn't just LangGraph. Yes and no. It sustains the same core idea of conditional routing in a graph structure, but prioritises economic efficiency.

| LangGraph | GENPROHYFRA |
|---|---|
| A graph execution engine to build stateful, cyclical LLM calls. | An architectural design philosophy, tool-agnostic. |
| Conditional transitory edges exist and define execution. | Programmatic nodes are materialized, as important as AI subagent nodes. |
| Multi-agent, but no multi-model tier assignment as a first-class primitive. | Model tier (0,1,2...) is a required per-node design decision. |
| Visual execution canvas, but no token economy observability. | Token Economy Observability is a core pillar (G-5) with live dashboard. |
| Graph defined in Python/JS code. | Declarative decision tree with visual IDE. No code required. |
| Excellent for complex stateful agent flows. | Can run on top of LangGraph, AutoGen, CrewAI, or any runtime. |

> **The One-Line Distinction**
>
> - LangGraph answers "*how do I connect AI agents in a graph?*"
> - GENPROHYFRA answers "*which nodes should not be AI agents at all?*"

Before jumping into the technical aspects of what **Skilltree** is and how it implements the design concepts of GENPROHYFRA, I want to give you a primer on my intuitive process when conceptualising this, and on what led me to the idea that flat `.md` file architectures need to evolve past what they are now.

I don't want to give the impression that building AI pipelines at minimum necessity is just a cost-cutting trick. Instead, all of this is a humble attempt to rethink how AI infrastructure should be harnessed for efficiency. With this in mind, the Skilltree tool that I created to test these principles will then read as an obvious consequence.

The brain is the most sophisticated natural compute system we know of, and it is ruthlessly economical: it does not route every signal through a single region. Blinking, balance, and heartbeat are handled by the brainstem and spinal cord. Riding a bike is coordinated by the basal ganglia. Recognising a face is a pattern-matching job in the visual cortex. Only genuinely novel problems (e.g. *what is the ethical implication of this decision?*) reach the very energy-consuming prefrontal cortex.

Now picture the opposite design: instead of one consolidated brain, thousands of fragments of brain matter spread across a table, each wired to a single central node by one neuron. The specialisations still exist, and a central orchestrator still coordinates them — but every signal must travel through that one chokepoint.

This is what a modern AI pipeline built from flat `.md` files actually looks like. The prevailing response by the AI community has been to formalise this architecture into a five-layer framework:

- A memory/orchestrating layer (`CLAUDE.md`, `AGENTS.md`)
- A knowledge layer of individual skills (subagents)
- A guardrail layer of hooks between subagents and toolsets (e.g. MCP)
- A delegation layer where the subagents live
- A distribution layer that packages the whole thing

The framing is that, and I quote:

"[...] memory and rules shape behavior, skills provide just-in-time knowledge, hooks enforce safe and consistent actions, subagents scale focused work, and plugins extend and share capability."

It is a reasonable hierarchy, but still, by design, just a pseudo-linear workflow. Nowhere in it is there any notion of AI rationing, token economy, or routing fashioned as a true decision tree.

So how do we port the brain's adaptive economy into an AI workflow? I'm sure you are aware of the brain's reward system: neural pathways that lead to good outcomes are reinforced over those that do not. That is how early ML models were trained, and today's models still do a version of this internally through reinforcement learning. But the space *in between* agents in an agentic pipeline remains almost unexplored. The proposal of this paper is that genuine workflow adaptability comes from validation gates that classify and correct performance, and that govern conditional routing between subagents.

| Pipeline | Routing behaviour |
|---|---|
| Naïve AI pipeline | Routes everything to an orchestrating model (e.g. a hyper-performant prefrontal cortex) and recruits every required entity at max performance. |
| Routed AI pipeline | Routes each task to a minimum viable cognitive layer, with performance-based conditional routing, without recruiting unnecessary entities. |

Having drawn the analogy for contextual rerouting, let's now run a thought experiment on instructional bloating. Imagine the orchestrator is a 200-page book, where each page is a subagent and each `.md` file is that subagent's set of instructions.

Suppose you want to accomplish task Z:

- A **fully inefficient** (naïve) pipeline reads all 200 pages, only to discover that task Z needed pages 5, 10, 15, 20, and 25.
- A **typical 2026** agentic pipeline indexes at the front that *"to complete task Z, you will need pages 5, 10, 15, 20, 25"*, and dutifully reads all five as context input to run the pipeline.
- A **GENPROHYFRA** pipeline also indexes the same necessary pages, but follows a hybrid approach that does not queue instructions. It starts at page 5, then page 10. Then, a programmatic checkpoint "asks" whether the result is *already good enough*: if so, it **stops the pipeline**; if not, it may loop around to page 5 again, or skip ahead to page 25 for final checks.

Unlike most AI pipelines, GENPROHYFRA-like frameworks tackle the fact that, for the same type of task, no two tasks are actually equal in complexity or outcome. A quick code implementation and a deep architectural code refactoring drastically differ in instructional payload and agentic involvement, but they are *still* both implementations nonetheless.

So, cost reduction comes from programmatically assessing each subagent's output at a checkpoint and deciding **dynamically** whether the remaining subagents, or only specific ones, are actually required, rather than running the full set by default because that is how the instructions outlined it.
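The book analogy can be sketched in a few lines of Python. This is only an illustration of the early-exit idea, not Skilltree's implementation: `run_gated`, the step names, and the toy gate are all hypothetical, and the "subagents" are plain functions standing in for model calls.

```python
from typing import Callable, List, Tuple

# Hypothetical types: a step turns the current draft into a new draft;
# a gate is a cheap, deterministic check on that draft.
Step = Callable[[str], str]
Gate = Callable[[str], bool]


def run_gated(steps: List[Tuple[str, Step]], gate: Gate, task: str) -> Tuple[str, List[str]]:
    """Run steps in order, stopping as soon as the programmatic gate passes."""
    draft, executed = task, []
    for name, step in steps:
        draft = step(draft)
        executed.append(name)
        if gate(draft):  # checkpoint: is the result already good enough?
            break        # the remaining steps are never invoked
    return draft, executed


# Toy stand-ins for subagents (pages 5, 10 and 15 of the book).
steps = [
    ("page-5", lambda d: d + " +impl"),
    ("page-10", lambda d: d + " +tests"),
    ("page-15", lambda d: d + " +docs"),
]
result, executed = run_gated(steps, gate=lambda d: "+tests" in d, task="task Z")
print(executed)  # ['page-5', 'page-10']
```

Page 15 is never read because the gate passed after page 10. In a real pipeline the gate would be something like a test run or a schema check, and the skipped steps are where the token savings come from.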

To put **GENPROHYFRA** into practice, a VS Code extension called **Skilltree** was built. Skilltree is a boilerplate agentic fleet designer and deployer. It turns a Natural Language description of a workflow into a production-ready multi-agent Copilot pipeline. Developers describe a desired pipeline in natural language and Skilltree generates a structured YAML graph — agents, the connections between them, and conditional routing rules — rendered as an interactive DAG in real time (something the built-in Copilot Chat cannot do).

You can embed codebases with agentic pipelines in less than 5 minutes:

- Build agentic pipelines from scratch (no need for codebase)
- Interpret existing AI codebases with command `interpret`
- Deploy hyperpersonalized agents to non-AI codebases with command `evolve`
- Improve existing agents + create more in AI codebases with command `evolve`

The choice of VS Code is deliberate: companies are mass-adopting GitHub Copilot Enterprise precisely because of its commitments to safety, IP protection, and not harvesting company code to train models. Building Skilltree as a VS Code extension also lets it connect directly to Copilot's LMs. No need to manage API keys, no proprietary code exposed to external services, no separate front-end to build, all without leaving the editor. The extension supports both Greenfield and Brownfield development.

The **`interpret`** command reverse-engineers any codebase with existing `.md` files into a DAG topology, so Brownfield teams can visualise and optimise their agent ecosystems without starting over. For Greenfield teams with no agentic adoption yet, **`evolve`** analyses the codebase, discusses pain points and goals with the user, and builds agents tailored to their needs.

This is what makes Skilltree an **A³S (Agents-as-a-Service)** tool: running `evolve` or `interpret` in any codebase analyses the project and generates boilerplate agent definitions tailored to its structure in **under five minutes**, with no manual authoring. Once the agents exist, the efficiency layer (decision-tree routing, programmatic gates, model-tier assignment) can be applied incrementally on top.

The cool thing: you can run the Skilltree tool on the Skilltree codebase. It evolves itself.

At the heart of Skilltree is a typed YAML schema that defines three pipeline archetypes. This is the first iteration on how to step away from flat file orchestrators into a higher dimensional DAG driver.

- **Graph:** The fundamental unit. A directed acyclic graph (DAG) of agent and evaluator nodes connected by static and conditional edges with optional `validation_cmd` gates. Supports `max_cycles` for controlled iteration loops. Graphs are the building block for all other types; they encode the decision tree topology that GENPROHYFRA requires.
- **Chain:** Sequential composition of multiple graph references with typed `input_map` declarations for data flow between stages. Each stage receives only the outputs it needs from previous stages. Chains are ideal for multi-phase workflows (e.g. "analyse → implement → test → deploy") where each phase is itself a full decision tree but the phases must execute in order. In other words, a chain is a workflow containing graphs as sub-workflows.
- **Composite:** Multi-subgraph architecture with standalone nodes and cross-edges using dot notation (`subgraph-id.node-id`). Designed for medallion patterns (Bronze/Silver/Gold data layers), multi-team workflows, and any system where context must flow across organisational or domain boundaries. A composite can model an entire enterprise's agent ecosystem as interconnected subgraphs rather than a single flat graph.

The schema's `ConditionalEdgeDef` is the key primitive for GENPROHYFRA: it supports natural-language conditions (evaluated by an LLM when no cheaper mechanism exists), `score_threshold` for numeric gates (e.g. "proceed if coverage > 80%"), and `validation_cmd` for deterministic shell-command gates (exit 0 = pass, non-zero = fail). These three condition types directly correspond to GENPROHYFRA's principle of minimum necessity: use the cheapest mechanism that can correctly evaluate the condition. The schema also supports `intent` routing on `__start__` edges for task-type classification, enabling entire branches to be skipped based on input categorisation.
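To make the "cheapest mechanism first" ordering concrete, here is a minimal Python sketch of how an edge could be evaluated. It is not Skilltree's actual evaluator: the function names, the dict-shaped edge, and the `llm_judge` callback are assumptions for illustration, and only the field names (`validation_cmd`, `score_threshold`, `condition`, `on_true`, `on_false`) come from the schema described above.

```python
import subprocess
from typing import Callable, Optional


def evaluate_edge(edge: dict, score: Optional[float] = None,
                  llm_judge: Optional[Callable[[str], bool]] = None) -> bool:
    """Evaluate a conditional edge with the cheapest mechanism it declares."""
    # Tier 0: deterministic shell command, exit code 0 means pass.
    if "validation_cmd" in edge:
        return subprocess.run(edge["validation_cmd"], shell=True).returncode == 0
    # Still token-free: numeric gate against a score computed elsewhere.
    if "score_threshold" in edge and score is not None:
        return score > edge["score_threshold"]
    # Last resort: natural-language condition judged by an LLM callback.
    if llm_judge is None:
        raise ValueError("edge needs an LLM judge but none was provided")
    return llm_judge(edge["condition"])


def next_node(edge: dict, **kwargs) -> str:
    """Return the node the pipeline should route to after this edge."""
    return edge["on_true"] if evaluate_edge(edge, **kwargs) else edge["on_false"]


coverage_edge = {
    "condition": "coverage above 80%",
    "score_threshold": 80,
    "on_true": "__end__",
    "on_false": "architect",
}
print(next_node(coverage_edge, score=91.5))  # __end__
```

The LLM is only consulted when neither a command nor a numeric threshold is declared, which is the point of the principle of minimum necessity.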

```
# Multi-tier graph with programmatic gate
pipeline_type: graph
nodes:
  - id: doc-sync
    role: "Documentation formatter"
    type: agent
    model: gpt-4o-mini        # Tier 1 - cheap
  - id: architect
    role: "System architecture reviewer"
    type: agent
    model: claude-opus-4      # Tier 3 - frontier
  - id: schema-validator
    type: evaluator
    # No model field - uses validation_cmd (Tier 0)
conditional_edges:
  - from: schema-validator
    condition: "output conforms to schema"
    validation_cmd: "python -m jsonschema -i out.json schema.json"
    on_true:  __end__
    on_false: architect
```

The Tokenconomics Dashboard is Skilltree's observability layer. It tracks pipeline execution in real time and displays four primary visualisations:

- **Metrics Row:** Step count (+ iterations), total tokens consumed, percentage saved vs. baseline, and pipeline cost with savings delta.
- **Model Distribution Pie:** A donut chart showing the proportion of pipeline nodes using each model tier, colour-coded by provider, making multi-model composition immediately visible.
- **Tokens Per Step Line Chart:** Temporal chart of output token consumption with hover tooltips showing node ID, step, token count, and wall-clock duration.
- **Token Economy Bar Chart:** Comparative visualisation of baseline ("Naive") vs. actual ("Routed") token consumption with a savings bracket showing absolute and percentage reduction.

The baseline computation is what makes the dashboard meaningful: it simulates what the pipeline
would have spent if every node used the default frontier model, giving engineers a counterfactual
against which to measure their architectural decisions. Every rerouting conditional that fires,
every tier-downgraded node, every programmatic gate that replaces an LLM evaluator — all of it
shows up as a **measurable saving** on this dashboard.
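As a rough illustration of the counterfactual, the savings figure reduces to a comparison of two sums. The sketch below assumes a hypothetical `NodeRun` record and made-up token counts; it is not the dashboard's actual data model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NodeRun:
    node_id: str
    tokens: int           # tokens actually consumed (0 if skipped or programmatic)
    baseline_tokens: int  # tokens the node would consume in the all-frontier baseline


def savings(runs: List[NodeRun]) -> dict:
    """Compare routed consumption with the naive baseline."""
    baseline = sum(r.baseline_tokens for r in runs)
    actual = sum(r.tokens for r in runs)
    saved = baseline - actual
    pct = round(100 * saved / baseline, 1) if baseline else 0.0
    return {"baseline": baseline, "actual": actual, "saved": saved, "saved_pct": pct}


# Illustrative numbers only.
runs = [
    NodeRun("intent-router", 200, 200),
    NodeRun("debugger", 12_000, 12_000),
    NodeRun("test-gate", 0, 8_000),        # LLM evaluator replaced by a shell gate
    NodeRun("feature-branch", 0, 30_000),  # branch skipped by routing
]
print(savings(runs))
# {'baseline': 50200, 'actual': 12200, 'saved': 38000, 'saved_pct': 75.7}
```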

One of Skilltree's most practically significant features is its ability to reverse-engineer existing `.md` ecosystems. The `interpret` command scans a workspace for all Markdown files that could be agent definitions, parses their frontmatter, and infers the pipeline topology from declared relationships.

This Brownfield support is critical for GENPROHYFRA adoption in enterprises with existing agent pipelines. Rather than requiring a complete rewrite, Skilltree allows teams to visualise their current architecture, identify inefficiencies — high-cost nodes, missing conditional routing, AI validators that could be replaced by programmatic gates — and apply GENPROHYFRA optimisations as targeted, incremental refactors. You do not have to rebuild to improve. See an extended list of functionalities below:

Perhaps the most elegant cost-saving concept in Skilltree, and the one that gave me the most joy to build, is **intent routing**: a structural mechanism at the `__start__` node that eliminates entire pipeline branches before they execute, massively saving on tokens. When a graph's entry point has two or more outgoing edges, each edge is annotated with an `intent` (a one-sentence user task profile) and an `intent_label` (a 1–2 keyword tag displayed on the DAG canvas). At runtime, a lightweight classifier matches the incoming task against these intent declarations and routes to exactly *one* branch. All other branches are entirely skipped.

This is not a probabilistic heuristic. It is an explicit, auditable fork that eliminates unnecessary execution paths on pipeline invocation. Current AI architectures already have intent routing, but they still deliver contextual payload to subagents, even if their output is "nothing to do here". This still consumes tokens.

In the simple LLM Wiki case presented earlier, the graph has four intent branches, but three-quarters of its nodes are never reached. The Tokenconomics Dashboard records each skipped branch as `branchesSkippedViaIntent`, making the savings visible and attributable. For enterprise pipelines that handle multiple task categories (bug fix, feature request, documentation update, refactoring), intent routing alone can reduce token consumption by 60–75% compared to current architectures.

The intent routing decision itself costs minimal tokens — it is a classification task, not a generative one, and runs on the cheapest available model. The ROI is immediate: spend 200 tokens to classify, save 50,000 tokens by skipping the wrong branches.
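The mechanism can be sketched as a single classification followed by a dictionary lookup. In the hedged Python example below, `route_by_intent`, the branch targets, and the keyword-based classifier are hypothetical stand-ins; in practice the classifier would be a call to the cheapest available model.

```python
from typing import Callable, Dict, List, Tuple


def route_by_intent(task: str, branches: Dict[str, str],
                    classify: Callable[[str, List[str]], str]) -> Tuple[str, List[str]]:
    """Pick exactly one branch at __start__; return its entry node and the skipped labels."""
    labels = list(branches)
    chosen = classify(task, labels)
    if chosen not in branches:
        raise ValueError(f"classifier returned unknown intent: {chosen!r}")
    skipped = [label for label in labels if label != chosen]
    return branches[chosen], skipped


def keyword_classifier(task: str, labels: List[str]) -> str:
    """Toy stand-in for the lightweight classifier model."""
    text = task.lower()
    hints = {
        "bug-fix": ("bug", "crash", "error"),
        "feature": ("add", "support"),
        "refactor": ("refactor", "clean"),
    }
    for label in labels:
        if any(word in text for word in hints.get(label, ())):
            return label
    return labels[0]  # fall back to the first declared branch


# Hypothetical entry nodes for each intent label.
branches = {"bug-fix": "debug-chain", "feature": "feature-chain", "refactor": "refactor-chain"}
entry, skipped = route_by_intent("Fix the crash on login", branches, keyword_classifier)
print(entry, skipped)  # debug-chain ['feature', 'refactor']
```

The skipped labels are what a counter such as `branchesSkippedViaIntent` would record: those branches receive no context payload at all.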

**Intent Routing at __start__**

| Branch | Status | Effect |
|---|---|---|
| `bug-fix` | SELECTED | route to debug chain |
| `feature` | SKIPPED | 0 tokens spent |
| `refactor` | SKIPPED | 0 tokens spent |

*2 of 3 branches eliminated, ~67% token reduction (assuming an even task load across branches) before any node executes.*

Skilltree's **`run`** command provides a pipeline execution **preview**: a meta-run that simulates how an agentic pipeline will route, branch, gate, and accumulate tokens, without substituting the user's actual Copilot interaction. The framework could be extended to execute actual user requests, but that was not the focus: I wanted to build a meta-run visualizer for the sole purpose of token economy observability (pillar G-5).

Think of it as a dry run for your agent architecture: you provide a representative task, and Skilltree executes the graph node-by-node, showing you each subagent's thinking, how conditional edges evaluate, which branches are skipped via intent routing, and where validation gates fire.

This is not a replacement for Copilot Chat or Claude Code executing your actual development tasks, but a way to answer the question: "what will happen under the hood when this pipeline runs for real?" before committing to production deployment. In my daily workflow, I use `run` as a meta-analysis tool: it reveals whether my validation gates are triggering correctly, whether the right model tiers are being invoked for the right tasks, and whether the pipeline's routing logic produces the branching behavior I designed for.

During execution preview, the extension tracks real metrics in real time:

- **Per-node token consumption** (input + output) with model attribution
- **Conditional edge evaluations**: whether they routed to `on_true` or `on_false`, and why
- **Intent routing decisions**: which branch was selected and which were skipped
- **Active node highlighting** on the DAG canvas: you can watch the graph "light up" as execution progresses

The Tokenconomics Dashboard updates live throughout, so cumulative cost and savings against baseline accumulate before your eyes as branches are skipped and programmatic gates replace AI evaluators. The practical consequence is that you can iterate on pipeline architecture — adding gates, downgrading model tiers, inserting programmatic nodes — and immediately see the economic impact of each change reflected in the dashboard. This closes the feedback loop between design decisions and measurable outcomes that GENPROHYFRA's fifth pillar (Token Economy Observability) promotes.

Future implementations may extend `run` from a preview into a live execution layer that directly orchestrates Copilot agent invocations. Today, however, its value is in making pipeline economics visible, measurable, and improvable before a single production token is spent.

**Pipeline Execution Preview (Dry Run)**

| State | Node | Model | Tokens |
|---|---|---|---|
| done | intent-router | Haiku 4.5 | ~200 |
| done | debugger | Sonnet 4 | ~12K |
| done | test-gate | validation_cmd | 0 |
| running | reviewer | Sonnet 4 | ~8K |
| pending | verifier | — | — |

*Running 4/5 nodes · ~20.2K tokens used | Baseline ~58K | 65% saved.*

Quality instructions are the difference between a useful subagent and a hallucinating liability. Skilltree enforces a structured **8-section agent instruction standard** for every generated persona, ensuring that pipeline nodes are not just topologically correct but operationally rigorous. Each agent's instructions must include:

- **Role & Mindset**: Who this agent is, what task focus it holds, and the failure mode it exists to prevent.
- **Purpose**: A 2–4 sentence explanation of why this agent exists in the pipeline.
- **Core Responsibilities**: 3–6 concrete areas of work, with specific sub-items: actual file paths, CLI commands, field names.
- **Boundaries**: Explicit "You may" (4–8 actions) and "You must NOT" (4–8 prohibitions with handoff instructions).
- **Workflow**: 4–7 sequential steps with concrete decision points.
- **Output Format**: The exact shape of the agent's output, with a fenced code block showing the template.
- **Hard Rules**: 4–7 absolute MUST/NEVER constraints, each one sentence, no hedging.
- **Example Invocations**: 2–3 realistic examples showing "User: [request]" → "Agent: [actions taken]".

This standard is enforced by the LLM system prompt during generation. Skilltree's AI will not produce agents that lack these sections. The depth scales with role complexity. The rationale is economic as much as qualitative. Vague instructions lead to over-generation: an agent uncertain of its boundaries will produce verbose, hedging output that consumes more tokens. Precise instructions with explicit boundaries produce focused, concise output. The agent knows exactly what to do, does it, and stops.

Beyond the Tokenconomics Dashboard's real-time metrics, Skilltree implements a continuous **feedback and friction observability** system. After every third successful pipeline compile or run, the extension automatically prompts the user with a feedback form: five Likert-scale dimensions (Output Quality, Efficiency, Token Savings, Ease of Use, Acceleration) plus a free-text comments field. You can also just use the `feedback` command.

In parallel, the extension passively logs **friction events**: errors, cancellations (Ctrl+C mid-stream), validation failures, and unexpected routing outcomes. These are timestamped and persisted alongside the cumulative statistics (total tokens consumed, baseline tokens, savings achieved, branches skipped, early exits taken, validation gates triggered) in a local JSON file at `~/.skilltree/database.json`, which can be connected to PostgreSQL for scalability.

The combination of active feedback (user-reported quality perception) and passive friction logging (system-detected pain points) creates a complete picture of pipeline health. The **activity report** command generates a Markdown document from this data, covering codebase snapshot, token economy statistics, routing efficiency metrics, and the friction log, which teams can use for retrospectives, budget justifications, or continuous improvement cycles. This operationalises GENPROHYFRA's fifth pillar: you cannot optimise what you do not measure, and you cannot measure what you do not persist.

Let's now take a macroscopic view of where GENPROHYFRA sits among current technologies: where it fits with existing trends and where it brings something new. The following table compares four architectural approaches across token cost, routing logic, and enterprise fit.

| Approach | Token Cost | Routing Logic | Enterprise Fit |
|---|---|---|---|
| Monolithic LLM | Very High (all tokens) | None | Poor - single point of failure |
| Linear Subagent Chain | High (all steps AI) | Sequential only | Moderate - brittle sequencing |
| Conditional Graph | Medium (branching) | Decision tree splits | Good - avoidable paths skipped |
| ✦ GENPROHYFRA | Minimised ($0 + tiered AI) | Validation-gated decision trees with branch skip | Unknown - designed for scale |

The following scenarios illustrate the token economy gains achievable through GENPROHYFRA principles, as estimated in the Skilltree Tokenconomics Dashboard. Pricing is based on May 2026 published rates. All figures are approximate and for illustrative purposes. The pattern they reveal, however, is consistent: the greatest savings come not from negotiating better model pricing, but from questioning whether a given step needs generative AI at all.

| Scenario | Naïve Tokens | Routed Tokens | Saving | $/1K runs |
|---|---|---|---|---|
| Doc sync (all Claude Sonnet) | 120K | 12K (GPT-4o-mini) | 90% | $1.80 to $0.07 |
| Code review (all Sonnet) | 300K | 175K (mixed tier) | 42% | $4.50 to $2.62 |
| Unit test validation (AI only) | 80K | 0 (pytest gate) | 100% | $1.20 to $0.00 |
| ETL quality gate loop (AI) | 500K | 210K (schema validator) | 58% | $7.50 to $3.15 |

*These figures are derived from schema-level token estimation (modelling node input/output sizes based on typical task payloads), not from empirical measurement across production workloads. The baseline assumes all nodes execute on the default frontier model with no conditional routing. Actual savings will vary by pipeline complexity, task distribution, and organisational context. The 73% figure represents the structural ceiling achievable through architecture alone; real-world deployment may realise 50–80% depending on how aggressively programmatic gates and tier-downgrades are applied.*

Across these four scenarios, the average token savings is approximately **73%**. At enterprise scale (let's assume, 500 pipeline executions per day) this represents a cost reduction from approximately 7,200€/day to 1,944€/day, or an approximate saving of **1.9M € per year** for a mid-sized engineering organisation.
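For readers who want to check the arithmetic, the figures above can be reproduced as follows. The inputs are the ones stated in the text; the per-run cost is derived from them rather than quoted, and the variable names are mine.

```python
# Saving column of the four scenarios in the table above.
scenario_savings_pct = [90, 42, 100, 58]
average_saving_pct = sum(scenario_savings_pct) / len(scenario_savings_pct)  # 72.5

runs_per_day = 500             # assumed enterprise load, as in the text
baseline_eur_per_day = 7200.0  # naive pipeline cost, as in the text
saving_rate = 0.73             # 72.5% rounded up to 73%, as in the text

baseline_eur_per_run = baseline_eur_per_day / runs_per_day             # derived: 14.4
routed_eur_per_day = round(baseline_eur_per_day * (1 - saving_rate), 2)
annual_saving_eur = round((baseline_eur_per_day - routed_eur_per_day) * 365)

print(average_saving_pct)  # 72.5
print(routed_eur_per_day)  # 1944.0
print(annual_saving_eur)   # 1918440
```

Note that the 73% is an unweighted average of the four scenario percentages, so the result depends on the assumption that the four scenario types contribute equally to daily spend.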

The economic case for GENPROHYFRA is strongest when applied early. Token consumption in AI-heavy organisations does not grow linearly with team size: each new developer adopting AI coding tools multiplies the number of daily pipeline executions, and each new use case adds new pipelines. Each pipeline added without a cost architecture creates waste that compounds indefinitely.

Right now, most enterprises are in the exploration phase: pilot projects, proof-of-concepts, individual teams adopting AI tools independently. Many of these pilots do not survive to production. But the ones that do will form the architectural foundation of how these organisations use AI for the next decade. The patterns established during exploration become the legacy systems of the adoption era, so this is exactly the moment to get the architecture right.

The 2050 Horizon

Anthropic's CEO Dario Amodei has projected that AI systems capable of performing most cognitive work will exist within this decade [10]. If that projection is even partially correct, the question of how enterprises structure their AI pipelines is not a minor engineering detail, but a fundamental determinant of which organisations can afford to operate in an AI-intensive economy. TOKENCONOMICS is not a short-term cost optimisation; it is what I envision to be a strategic philosophy for the coming decade.

There is a mindset shift happening in parallel that is worth naming explicitly. Early AI adoption rewarded token-maxing: the idea that if you are not using every available token, you are leaving capability on the table. TOKENCONOMICS is the polar opposite: the more you save, the more effective you are. A developer who achieves a result in 10,000 tokens is not less capable than one who used 200,000; they are more capable. They have tokens left for other tasks, other pipelines, other projects. Token economy is not a constraint to be managed. It is a skill to be developed, and ultimately a competitive advantage to be engineered.

To explore how the GENPROHYFRA philosophy and the Skilltree tool could be applied to real-world codebases and build upon them, we decided to explore two cases. One codebase relies on a linear flow with a flat architecture; the other already has some branching implemented. This provides a good sample of Greenfield and Brownfield examples, respectively. What follows is an analysis of what I found to be good in these repositories, and what could be improved or fine-tuned.

**Understand-Anything** is an open-source Claude Code plugin with 5,500+ GitHub stars that turns any codebase into an interactive knowledge graph. It is an excellent case study precisely because it is a well-built, genuinely useful AI pipeline that already has some cost thinking baked in, and yet, examined through a GENPROHYFRA lens, reveals exactly the patterns of waste the framework is designed to address.

The `/understand` command orchestrates a multi-agent pipeline that scans a codebase, extracts every file, function, class, and dependency relationship, and builds a knowledge graph saved to `.understand-anything/knowledge-graph.json`. Five specialised agents run in sequence:

| Agent | Role | Current Model |
|---|---|---|
| `project-scanner` | Discover files, detect languages and frameworks | Claude Sonnet |
| `file-analyzer` | Extract functions, classes, imports; produce graph nodes and edges | Claude Opus |
| `architecture-analyzer` | Identify architectural layers | Claude Opus |
| `tour-builder` | Generate guided learning tours | Claude Opus |
| `graph-reviewer` | Validate graph completeness and referential integrity | Claude Sonnet |

The project already shows awareness of model tiers: Sonnet for simpler tasks, Opus for complex ones. But the assignments are hardcoded in the agent definitions. There are no conditional edges. There is no mechanism to skip agents when their output already exists or isn't needed. The graph-reviewer runs a full LLM review every time, even when a structural JSON validation would suffice. File analyzers run on every file in the project, even unchanged ones (though incremental updates are supported for re-runs).

This is a thoughtful system, but it was designed for capability first and has not yet been designed for economy.

The first GENPROHYFRA move is always measurement, not optimisation. Before touching anything, I dry-ran the pipeline with the Tokenconomics Dashboard against a representative codebase. The audit revealed something like this:

| Agent | Est. Tokens (10k LOC project) | Tier | Could Be |
|---|---|---|---|
| `project-scanner` | ~8,000 | Sonnet | Tier 1 (Haiku/mini) |
| `file-analyzer` × N files | ~180,000 | Opus | Tier 2 (Sonnet) for most files |
| `architecture-analyzer` | ~25,000 | Opus | Tier 2 (Sonnet) |
| `tour-builder` | ~30,000 | Opus | Tier 2 (Sonnet) |
| `graph-reviewer` | ~15,000 | Sonnet | Tier 0 (JSON Schema validator) |
| Total | ~258,000 | | ~95,000 optimised |

The single most expensive node is `file-analyzer`, because extracting structured relationships from source code is genuinely complex work. But not all files are equal: a `config.json`, a `.env` file, or a `README.md` does not require Opus to extract its relationships. This is the key insight: the current system applies a uniform model tier to a non-uniform population of inputs.

The current pipeline is a linear chain: scanner → analyzers (parallel) → architecture → tours → reviewer. GENPROHYFRA could replace this with a conditional decision tree. Here is what an `evolve` command on Skilltree yielded:

```
pipeline:
  id: understand-anything-genprohyfra
  type: graph
  description: "GENPROHYFRA-optimised codebase knowledge graph pipeline"

  nodes:
    - id: project-scanner
      role: "Discover files, detect languages, classify file complexity"
      type: agent
      model: claude-haiku-4-5

    - id: cache-gate
      role: "Check if knowledge graph already exists and is fresh"
      type: evaluator

    - id: file-classifier
      role: "Classify each file as simple/complex based on line count and type"
      type: evaluator

    - id: file-analyzer-simple
      role: "Extract relationships from config files, small utilities, docs"
      type: agent
      model: gpt-4o-mini

    - id: file-analyzer-complex
      role: "Extract relationships from service classes, core logic, APIs"
      type: agent
      model: claude-sonnet-4

    - id: architecture-analyzer
      role: "Identify architectural layers across the full graph"
      type: agent
      model: claude-sonnet-4

    - id: tour-builder
      role: "Generate guided learning tours ordered by dependency"
      type: agent
      model: claude-sonnet-4

    - id: graph-schema-validator
      role: "Validate graph JSON structure and referential integrity"
      type: evaluator

    - id: graph-reviewer-llm
      role: "Deep LLM review — only if schema validation fails or --review flag set"
      type: agent
      model: claude-sonnet-4

  conditional_edges:
    - from: cache-gate
      condition: "Knowledge graph exists and no files changed since last run"
      validation_cmd: "node scripts/check-cache-freshness.js"
      on_true: __end__
      on_false: file-classifier

    - from: file-classifier
      condition: "File is simple (config, doc, utility under 100 lines)"
      validation_cmd: "node scripts/classify-file.js --threshold=100"
      on_true: file-analyzer-simple
      on_false: file-analyzer-complex

    - from: graph-schema-validator
      condition: "Graph JSON is valid and all node references resolve"
      validation_cmd: "node scripts/validate-graph-schema.js"
      on_true: __end__
      on_false: graph-reviewer-llm

    - from: graph-reviewer-llm
      condition: "Reviewer confirms graph is complete and accurate"
      on_true: __end__
      on_false: file-analyzer-complex
      max_cycles: 2
```

- **The cache gate** is the highest-leverage change. If a developer runs `/understand` twice in the same session without changing any files, the entire pipeline should return immediately from cache. The `check-cache-freshness.js` script compares file modification timestamps against the last graph build time. This is a deterministic operation that takes milliseconds. In the current system, re-running `/understand` on an unchanged codebase costs the full ~258,000 tokens every time.
- **The file classifier** splits the most expensive agent (`file-analyzer`) into two tiers based on a simple heuristic: file type and line count. Config files, markdown docs, small utilities under 100 lines, and infrastructure files go to `file-analyzer-simple` running on GPT-4o-mini. Core service logic, large classes, and complex modules go to `file-analyzer-complex` running on Sonnet. For a typical 10,000-line project where roughly 60% of files are simple, this alone cuts the file-analysis token cost by around 55%.
- **The model tier downgrade** for `architecture-analyzer` and `tour-builder` from Opus to Sonnet is justified by the nature of these tasks. Architecture layer identification is a classification problem against a known taxonomy (API, Service, Data, UI, Utility). Sonnet handles this reliably. Tour generation is creative but highly structured: it follows the dependency graph, which is already computed. Opus-level reasoning is not required for either. The quality difference in practice is negligible but the cost difference is roughly 5x per token.
- **The schema validator gate** replaces the default `graph-reviewer` behavior with a JSON Schema check. The graph's structure is fully specified: nodes must have IDs, edges must reference valid node IDs, required fields must be present. This is a deterministic validation. The LLM reviewer now only runs as a fallback when the schema check fails, or when the user explicitly passes `--review`. In the baseline, graph-reviewer runs every time and costs ~15,000 Sonnet tokens. In the optimised pipeline, it costs 0 tokens on success and ~15,000 only on failure, so you get the advantage of **potentially** not spending tokens.
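The cache gate described above can be sketched in a few lines. This is a Python illustration rather than the Node script named in the pipeline, and `is_cache_fresh` and its arguments are hypothetical names, not Skilltree's or Understand-Anything's actual API.

```python
from pathlib import Path


def is_cache_fresh(source_files, graph_path):
    """Return True when the cached graph is at least as new as every source file.

    Only modification timestamps are compared, so the check is
    deterministic and spends no tokens.
    """
    graph = Path(graph_path)
    if not graph.exists():
        return False  # nothing cached yet: the pipeline has to run
    built_at = graph.stat().st_mtime
    return all(Path(f).stat().st_mtime <= built_at for f in source_files)
```

A pipeline would call this before any agent node and return the cached graph when it yields `True`.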

Then I ran the GENPROHYFRA-optimised pipeline against a 10,000-line TypeScript project:

| Scenario | Baseline Tokens | Optimised Tokens | Saving |
|---|---|---|---|
| First run, mixed codebase | 258,000 | 95,000 | 63% |
| Re-run, no file changes | 258,000 | 0 (cache hit) | 100% |
| Re-run, 20% files changed | 258,000 | 22,000 | 91% |
| Large project (50k LOC) | 1,290,000 | 380,000 | 71% |

At 50 `/understand` runs per day across a development team, the baseline cost at Opus pricing is approximately 193€/day. The GENPROHYFRA-optimised pipeline costs approximately 57€/day, a saving of 136€/day, or roughly **49,000€/year** for a single tool in a mid-sized engineering organisation.

Understand-Anything is not a poorly designed system. It is a well-designed system that was built without a cost architecture. The GENPROHYFRA analysis does not change what the pipeline *does* or its performance, but it can change what the pipeline *costs* to produce that result.

This is the GENPROHYFRA philosophy applied to a real project:

- **Audit first.** Dry run the existing pipeline. Identify which agents consume the most tokens relative to the complexity of their task.
- **Add programmatic gates at the cheapest decision points first.** Cache freshness, schema validation, and file classification are all deterministic. They should avoid spending tokens.
- **Split homogeneous agents by input complexity.** `file-analyzer` does not need to be one agent with one model tier. It is a population of tasks with a wide complexity distribution. Route that distribution.
- **Downgrade tiers where quality is bounded.** Architecture classification and tour generation are bounded problems. Sonnet is adequate. Save Opus for tasks that are genuinely open-ended and high-stakes.
- **Make the LLM reviewer a fallback, not a default.** The graph-reviewer currently runs unconditionally. In a GENPROHYFRA pipeline, LLM evaluation is the exception invoked when deterministic validation fails, not the rule invoked every time.
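Splitting `file-analyzer` by input complexity could look roughly like the sketch below. The extension list is an assumption made for illustration; the text above only specifies "file type and line count" with a 100-line cut-off for small utilities.

```python
from pathlib import Path

# Assumed heuristic values, not taken from any real configuration.
SIMPLE_EXTENSIONS = {".md", ".json", ".yaml", ".yml", ".toml", ".ini"}
SMALL_FILE_MAX_LINES = 100


def classify_file(path, line_count):
    """Route a file to the cheap or the expensive analyzer tier."""
    if Path(path).suffix.lower() in SIMPLE_EXTENSIONS:
        return "file-analyzer-simple"   # config and docs, whatever their size
    if line_count < SMALL_FILE_MAX_LINES:
        return "file-analyzer-simple"   # small utilities
    return "file-analyzer-complex"      # core logic, large classes
```

The classifier itself never calls a model, so the routing decision is free and repeatable.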

Oh My Claude Code (OMC) is an open-source multi-agent orchestration system for Claude Code, described by its creator as "teams-first multi-agent orchestration." With 30,000+ GitHub stars, it is among the most widely adopted Claude Code enhancement projects in the community.

Fig. A: Skilltree `interpret` scanning Oh My Claude Code: 61 agent/role files discovered programmatically in milliseconds, before any AI token is spent. Model: GPT-4.1 (free tier).

Fig. B: The compiled Skilltree pipeline graph: 61 nodes, showing entry points, execution chains, and conditional branching. Tokenconomics Dashboard on completion: 29/29 steps, 82.5K tokens vs. 295K naïve baseline. *74% saved* (212,441 tokens). Cost: €0.2036 vs. €0.5860 naïve.

Its premise is appealing: rather than a single all-purpose Claude Code session, OMC provides a library of 61 specialised agents (`analyst`, `architect`, `planner`, `debugger`, `security-reviewer`, `qa-tester`, `explore`, `git-master`, and many more), each with a focused role, a defined model tier, and a set of instructions tuned for its specific task.

This is, in principle, exactly what a mature AI agent ecosystem should look like. Specialisation is the right instinct. The question GENPROHYFRA asks is: does the architecture behind those specialists route to them efficiently?

Analysing the OMC agent definitions and their declared relationships produces a graph with 61 nodes. The topology reveals two distinct structural layers: one that reflects genuine multi-agent orchestration, and one that does not.

**The genuine pipeline chains.** OMC defines several real sequential workflows. The primary code execution path runs: `planner → architect → executor → code-reviewer → verifier`, with `verifier` feeding back to `debug` on failure. A build-repair chain connects `debugger → build-fixer`. Testing runs through `test-engineer → qa-tester`. Documentation flows `writer → document-specialist`. A research path links `trace → tracer → deep-interview`. These chains show clear architectural thinking: a planning phase, an execution phase, a review phase, and a verification gate with a failure loop. Within these chains, the sequencing is deliberate and the roles are non-overlapping.

**The flat command layer.** Outside these chains, 28 nodes (including `autopilot`, `team`, `ultrawork`, `ultraqa`, `ralph`, `ralplan`, `omc-plan`, `wiki`, `remember`, `learner`, `skillify`, `deep-dive`, `ai-slop-cleaner`, and a dozen more) connect directly from `__start__` to `__end__`. They are not integrated into any pipeline. They are standalone slash commands: a user invokes one directly, it runs, it exits. There is no upstream routing logic that selects among them. The decision about which one to call is left entirely to the user or to the Claude Code LLM's interpretation of a natural-language request.

This is the central structural fact about OMC: it is two systems coexisting in one repository. A genuine multi-agent pipeline (the execution chains) and a curated command library (the flat `__start__ → __end__` nodes).

The first stage in GENPROHYFRA revamping is always about reducing AI tiering rather than stripping AI completely, so as to preserve the workflow. Overall, OMC's model assignments across its 61 agents break into three tiers:

| Tier | Agents | Count |
|---|---|---|
| Opus / claude-opus-4 | `analyst`, `architect`, `planner`, `code-reviewer`, `critic`, `code-simplifier`, `security-reviewer`, `quality-reviewer`, `harsh-critic`, `deep-executor` | 10 |
| Sonnet / claude-sonnet-4 | `debugger`, `designer`, `document-specialist`, `executor`, `git-master`, `qa-tester`, `scientist`, `test-engineer`, `tracer`, `verifier`, `build-fixer` | 11 |
| Haiku / unspecified | `explore`, `writer`, and the majority of flat command-layer agents | ~40 |

On paper, this is a reasonable distribution: Opus for high-stakes reasoning (architecture, code review, security), Sonnet for execution and verification, Haiku for search and writing. The tier assignments within each category are defensible. `explore` is search, so Haiku is correct. `security-reviewer` is adversarial reasoning, so Opus is correct.

**Inefficiency 1: No input-complexity gate before Opus agents.**

The `planner` and `architect` agents both run on Opus and sit at the top of the primary execution chain. Every code task, regardless of scope, passes through both. A single-line bug fix, a configuration change or a renamed variable all receive the same full `planner → architect` treatment as a multi-service architectural refactor.

There is no upstream classifier that asks: "Does this task actually require a planning phase? Does it require architectural review?" In the previous case study, the equivalent inefficiency was routing every file through the same Opus analysis agent. The fix there was a file-complexity classifier. The same logic applies here: a task-complexity classifier before the `planner` node would route simple tasks directly to `executor` (Sonnet), saving the full cost of two Opus invocations on the majority of day-to-day development work.

**Inefficiency 2: All evaluators are generative, with no deterministic fallbacks.**

OMC defines 8 evaluator-type nodes: `code-reviewer`, `critic`, `verifier`, `security-reviewer`, `quality-reviewer`, `harsh-critic`, `visual-verdict`, and `verify`. Every one of them is an LLM call. None have a programmatic pre-check that could short-circuit the evaluation on clearly passing inputs.

Consider `verifier`. Its role is described as "verification strategy specialist." In a GENPROHYFRA pipeline, the first thing a verifier does is run the test suite (e.g. `pytest`, `npm test` or `cargo test`, whichever is appropriate to the project). If the tests pass, the verification is complete. The LLM is not needed. Only when tests fail, or when no automated tests exist, does generative verification earn its cost. In OMC's current architecture, `verifier` runs as a full LLM call unconditionally. The same applies to `code-reviewer`: a static analysis pass (e.g. `eslint`, `ruff` or `clippy`) could pre-qualify trivially clean diffs before any token is spent on LLM review.
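A minimal sketch of such a test-first verifier follows, assuming the test command is supplied as an argument list. The function name `verify` and the 2,000-character cap on forwarded output are illustrative choices, not part of OMC or Skilltree; the node names are reused from the text.

```python
import subprocess


def verify(test_cmd):
    """Run the project's test command and pick the next pipeline node.

    Exit code 0 ends the pipeline; a failure, or the absence of any
    test command, falls back to the LLM verifier.
    """
    if not test_cmd:
        return "verifier-llm", "no automated tests configured"
    result = subprocess.run(test_cmd, capture_output=True, text=True)
    if result.returncode == 0:
        return "__end__", ""
    # Forward only the tail of the output to keep the fallback's context small.
    return "verifier-llm", (result.stdout + result.stderr)[-2000:]
```

For a Python project this might be called as `verify(["pytest", "-q"])`; the LLM only ever sees runs that actually failed.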

**Inefficiency 3: AI-first routing across 28 entry points.**

The 28 `__start__`-connected command-layer agents represent 28 possible entry points to the system. There is no deterministic mechanism that routes an incoming request to the right one. The routing is performed by the Claude Code LLM matching the user's natural-language request against the agent descriptions.

This means every session begins with an implicit LLM classification step that costs tokens, is non-deterministic, and is invisible in the pipeline accounting. The LLM must decide: is this an `autopilot` task or a `team` task? Is this `ralph` or `ralplan`? Is this `ultrawork` or `omc-plan`? These distinctions are not obvious from names alone, and the overlap between roles like `autopilot`, `deep-executor`, and `ultrawork` creates genuine ambiguity that a language model resolves probabilistically rather than deterministically. A single upstream request classifier, running on Haiku with a fast keyword-match pre-filter, would replace this implicit LLM routing with an explicit, auditable, zero-LLM-cost gate for the majority of clear-intent requests.
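The keyword-match pre-filter could be as small as the sketch below. The command names come from OMC, but the trigger phrases and the `route_request` helper are invented for illustration; a real table would be derived from the agent descriptions.

```python
import re

# Hypothetical trigger phrases per command.
KEYWORD_ROUTES = {
    "wiki": ("wiki",),
    "remember": ("remember", "note that"),
    "ai-slop-cleaner": ("slop", "clean up generated"),
}


def route_request(text):
    """Return a command on an unambiguous keyword hit, otherwise None.

    None means the request is handed to the cheap LLM classifier.
    """
    lowered = text.lower()
    hits = [
        command
        for command, phrases in KEYWORD_ROUTES.items()
        if any(re.search(r"\b" + re.escape(p) + r"\b", lowered) for p in phrases)
    ]
    return hits[0] if len(hits) == 1 else None
```

Requests that match exactly one command are routed at zero token cost; everything ambiguous still reaches a model, but a small one.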

The most impactful single change to OMC's architecture is an upstream complexity classifier that sits before the `planner` node and before the flat command layer's entry points. This is what Skilltree suggested with the `evolve` command:

```
pipeline:
  id: omc-genprohyfra
  type: graph

  nodes:
    - id: task-classifier
      role: "Classify task type and complexity: simple-code | complex-code | review | debug | doc | admin"
      type: evaluator
      model: claude-haiku-4-5
      validation_cmd: "node scripts/classify-task.js"

    - id: complexity-gate
      role: "Score task complexity: 1-10 based on scope, file count, architectural impact"
      type: evaluator
      model: claude-haiku-4-5

  conditional_edges:
    - from: task-classifier
      condition: "Task is a code modification request"
      validation_cmd: "node scripts/is-code-task.js"
      on_true: complexity-gate
      on_false: command-router

    - from: complexity-gate
      condition: "Complexity score >= 7 (multi-file, architectural, or cross-service change)"
      score_threshold: 7
      score_op: ">="
      on_true: planner
      on_false: executor

    - from: verifier
      condition: "Automated test suite passes"
      validation_cmd: "node scripts/run-tests.js"
      on_true: __end__
      on_false: verifier-llm

    - from: code-reviewer
      condition: "Static analysis score >= 8 (linting, type-check clean)"
      validation_cmd: "node scripts/static-analysis.js --threshold=8"
      on_true: __end__
      on_false: code-reviewer-llm
```

The key programmatic changes and their rationale are:

**The task-complexity gate** intercepts every code request before it reaches the `planner → architect` Opus chain. A simple keyword-and-heuristic script (e.g. file count in diff or presence of interface/schema changes) scores complexity from 1 to 10. Scores below 7 route directly to `executor`, skipping both Opus planning agents entirely. For a typical development session where the majority of tasks are single-file edits, small fixes, and targeted refactors, this gate fires on the majority of requests.
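One way to sketch that scoring heuristic is shown below. The signals follow the text (file count, interface/schema changes, cross-service scope), but the weights are assumptions chosen for illustration and would need tuning against real tasks.

```python
def score_complexity(files_changed, touches_interface, services_touched):
    """Score a task from 1 to 10 using cheap, deterministic signals."""
    score = 1
    score += min(files_changed - 1, 4)          # up to +4 for multi-file diffs
    score += 3 if touches_interface else 0      # interface or schema change
    score += 2 if services_touched > 1 else 0   # cross-service change
    return max(1, min(score, 10))


def next_node(score, threshold=7):
    """Mirror the complexity-gate edge: >= threshold goes to the planner."""
    return "planner" if score >= threshold else "executor"
```

A single-file fix scores 1 and goes straight to `executor`, while a six-file change that alters an interface across two services scores 10 and keeps the full Opus planning chain.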

**The static-analysis pre-filter on code-reviewer** runs `eslint`/`ruff`/`clippy` before invoking the LLM. If the diff is already clean, the LLM reviewer is not needed and the pipeline exits. The LLM review is reserved for diffs that fail static analysis or that require reasoning about logic and intent that static tools cannot assess.

**The test-execution gate on verifier** runs the project's automated test suite before any LLM call. Exit 0 means verified and the pipeline exits. Non-zero triggers the `verifier-llm` fallback for deeper analysis of what failed and why, with a properly budgeted context. This way we promote generative reasoning only when deterministic checks are insufficient.

Now let's model a development team running 50 OMC sessions per day, with a typical task distribution of 60% simple code changes, 25% complex code changes, 10% review/debug, and 5% administrative tasks:

| Scenario | Baseline Tokens (per session) | Optimised Tokens | Saving |
|---|---|---|---|
| Simple code task (60% of sessions) | ~35,000 (planner + architect + executor + reviewer) | ~8,000 (executor + static gate) | 77% |
| Complex code task (25% of sessions) | ~50,000 (full chain) | ~42,000 (full chain, review gated) | 16% |
| Review passes static analysis (subset) | ~15,000 (reviewer LLM) | ~500 (static check only) | 97% |
| Verifier, tests pass (subset) | ~10,000 (verifier LLM) | ~200 (test runner only) | 98% |
| Weighted average | ~38,000 | ~12,000 | ~68% |

At 50 sessions/day and Opus pricing (15€/MTok output), the baseline cost is approximately 28.50€/day. The GENPROHYFRA-optimised architecture costs approximately 9.10€/day: a saving of 19.40€/day, or roughly **7,100€/year** for a single team using OMC as their primary Claude Code workflow layer.
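The arithmetic behind these figures can be reproduced with a few lines, assuming every token is billed at the single flat rate quoted above. With the rounded table averages the optimised cost comes out at 9.00€ rather than 9.10€; the small gap presumably stems from rounding the weighted averages.

```python
def daily_cost_eur(tokens_per_session, sessions_per_day, eur_per_mtok):
    """Daily spend when every token is billed at one flat rate."""
    return tokens_per_session * sessions_per_day * eur_per_mtok / 1_000_000


baseline = daily_cost_eur(38_000, 50, 15.0)    # rounded weighted average
optimised = daily_cost_eur(12_000, 50, 15.0)   # rounded weighted average
yearly_saving = (baseline - optimised) * 365

print(f"baseline {baseline:.2f} EUR/day, optimised {optimised:.2f} EUR/day")
print(f"saving {yearly_saving:.0f} EUR/year")
```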

The gains on individual evaluator gates are even more striking. A `verifier` that runs the test suite first eliminates the LLM call on every passing run. In a mature codebase with a well-maintained test suite, that is the majority of runs.

This analysis reveals that OMC is representative of the current state-of-the-art in open-source multi-agent design. It is further along than most. The 61 specialised agents represent genuine investment in role design, and the execution chain (`planner → architect → executor → reviewer → verifier`) shows sound architectural thinking about the phases of development work. The model tier assignments, where explicitly made, are largely correct: Haiku for search, Sonnet for execution, Opus for reasoning. The failure loop from `verifier` back to `debug` is a sound bounded-retry design.

But what OMC reveals is a pattern that is widespread in the current generation of multi-agent systems: **specialisation without economically routed intelligence.** The system has built excellent specialists, but does not have a decision architecture that deploys those specialists economically. Using GENPROHYFRA principles, Skilltree proposed not replacing the specialists but adding a conditional layer above them: the complexity classifiers, the static pre-filters, the programmatic validation gates, and the tier-routing logic that ensures Opus is reserved for the tasks that genuinely need it.

Let's pivot to where GENPROHYFRA fits in the community. Right now, **Agent Skills** is considered to be the open standard (originally developed by Anthropic and now maintained as a community specification) that defines a portable, version-controlled format for giving AI agents specialised capabilities. At its core, a skill is a directory containing a `SKILL.md` file with a name, a description, and Markdown instructions, plus optional scripts, reference materials, and assets. Agents load skills progressively: at startup they read only the name and description of each available skill, activating the full instructions only when a task matches.

This progressive disclosure model — load only what you need, when you need it — is itself a form of token economy thinking, so it aligns with the GENPROHYFRA philosophy. But the standard, as currently specified, stops short of where the real problem lives.

Agent Skills defines *what* a skill can do and *how* it is discovered. It doesn't seem to define *when* a skill should run, *what conditions* should trigger or skip it, or *which model tier* it should use when it does run. And even if it has evolved to do so while I was writing this, I am absolutely sure it doesn't define *whether it needs to be AI at all*.

Those decisions are left entirely to the agent's discretion, which in practice means they are left to an LLM, making them probabilistic, opaque, and unauditable. And this is precisely the gap that I think will negatively compound at scale. An enterprise with fifty skills has fifty potential invocations on every task, each evaluated by an LLM against a natural-language description. Are we considering the economics of that?

In my view, the Agent Skills standard builds an excellent shelf, while GENPROHYFRA describes what material the shelves should be made of. So I believe the relationship between the two is symbiotic rather than competitive. Concretely, a GENPROHYFRA-extended Agent Skills system would add three things:

**1. Declarative activation conditions.** Rather than relying on an LLM to match a task description against a skill, a GENPROHYFRA-aware skill can declare explicit activation conditions in its frontmatter: file patterns, environment variables, output thresholds or shell-command gates. When the condition is deterministic, no LLM call is needed to decide whether the skill runs. The `validation_cmd` primitive from Skilltree's schema is a direct analogue. The skill below activates if and only if the current diff contains `.sql` files. Zero tokens are spent on the activation decision itself.

```
---
name: sql-migration-validator
description: Validates database migration files for correctness and safety.
activation_cmd: "git diff --name-only HEAD | grep -q '\\.sql$'"
model: gpt-4o-mini
---
```

**2. Explicit model tier assignment.** Currently, every skill, regardless of whether it formats markdown or architects a distributed system, is invoked by whatever model the host agent happens to be running. A GENPROHYFRA extension to the standard would add a `model` field to the frontmatter, making tier assignment a first-class, portable, version-controlled property of the skill itself rather than an accident of the runtime environment.

**3. Programmatic-first execution.** The Agent Skills specification already supports a `scripts/` directory for executable code and an `allowed-tools` field for pre-approved tool invocations. GENPROHYFRA would extend this by introducing a `programmatic_only` flag for skills whose execution is entirely deterministic and needs no generative inference. A skill that runs `eslint`, validates a JSON schema or checks test coverage is not an AI skill in any meaningful sense. Marking it as programmatic-only makes that explicit, prevents LLM inference from being invoked unnecessarily, and promotes cost-cutting.
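Taken together, the three proposed fields are enough for a host to plan a skill invocation without consulting a model. The sketch below assumes the frontmatter has already been parsed into a dict; `plan_skill`, its return values and the `host-default` placeholder are hypothetical, and `activation_cmd`, `model` and `programmatic_only` are the proposed extensions, not fields of the current Agent Skills specification.

```python
import subprocess


def plan_skill(skill):
    """Decide how to run a skill from its frontmatter alone."""
    cmd = skill.get("activation_cmd")
    if cmd is not None:
        # Exit code 0 means the deterministic activation condition holds.
        if subprocess.run(cmd, shell=True, capture_output=True).returncode != 0:
            return ("skip", None)
    if skill.get("programmatic_only", False):
        return ("script", None)  # run the skill's scripts, no inference
    # Fall back to the host's model only when the skill declares none.
    return ("llm", skill.get("model", "host-default"))
```

Only the last branch ever reaches a model, and when it does the tier is the one the skill author chose.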

The emergence of Agent Skills as an open standard backed by Anthropic and adopted across the ecosystem is significant. It means the industry is converging on a shared vocabulary for what agent capabilities look like. I think that this convergence is the prerequisite for a next step: converging on a shared vocabulary for how those capabilities should be *orchestrated, conditioned, and economised*.

GENPROHYFRA is a proposal for what that next layer of standardisation looks like. You have standardised skills, but now you need an architectural framework that sits above them, governing how skills are composed into pipelines, how those pipelines are routed through decision trees, and how the cost of every activation is measured and attributed.

Shortly after writing this initial draft, Birgitta Böckeler at Thoughtworks published *Harness Engineering for Coding Agent Users*, a mental model that has rapidly become the industry's shared vocabulary for how to govern AI coding agents. So I decided to compare the two. Its core thesis is that **Agent = Model + Harness**, where the harness is everything *except* the model itself. The article decomposes harnesses into two control types:

- **Guides (feedforward controls)**: anticipate the agent's behavior and steer it *before* it acts (instructions, skills, reference docs, code conventions, architectural specifications).
- **Sensors (feedback controls)**: observe *after* the agent acts and help it self-correct (linters, tests, code review agents, structural analysis, coverage metrics).

Each control can be **computational** (deterministic, fast, CPU-bound like tests, linters, type checkers) or **inferential** (semantic, expensive, GPU-bound like LLM-as-judge, AI code reviews). There are also three **regulation categories**: maintainability harness, architecture fitness harness, and behavior harness.

In retrospect, GENPROHYFRA brings design ontology in the form of a *harness architecture schema*.

| Harness Engineering Concept | GENPROHYFRA |
|---|---|
| Feedforward guide (inferential) | `.agent.md` persona instructions, `AGENTS.md` routing index |
| Feedforward guide (computational) | `validation_cmd` preconditions, `activation_cmd` gates |
| Feedback sensor (computational) | Pipeline validation gates (`on_true` / `on_false` edges) |
| Feedback sensor (inferential) | Feedback form (5 Likert dimensions), friction log |
| Harness template | Skilltree pipeline templates (YAML definitions per topology) |
| Steering loop | `feedback` → friction log → `evolve` iteration cycle |
| Variety reduction (Ashby's Law) | Intent routing: narrowing unbounded agent output space at `__start__` |
| Harnessability assessment | `interpret` command (reverse-engineers existing harness from filesystem) |
| Shift quality left | Pipeline preview (`run`): catch routing and cost issues before integration |

GENPROHYFRA was designed from the premise that unstructured agent autonomy does not scale, which coincidentally aligns with the same premise of harness engineering. But where harness engineering provides a *mental model* for reasoning about controls, GENPROHYFRA provides a *schema* for declaring them. Where harness engineering describes what *kinds* of guides and sensors exist, GENPROHYFRA's implementation through Skilltree lets you *author, version, visualise, and economise* them systematically.

- **Computational and Inferential Separation.** The distinction between computational and inferential controls, which Böckeler identifies as a key architectural choice, is exactly what GENPROHYFRA encodes via the `programmatic_only` flag and model tier system. A validation gate that runs `npm test` is a computational feedback sensor with zero token cost. A review node assigned to Claude Opus is an inferential feedback sensor with a known, budgeted cost. The pipeline graph makes this separation *visible and auditable* rather than implicit in code.
- **Harness Templates as Pipeline Templates.** Böckeler anticipates that organisations will develop "harness templates", which are bundles of guides and sensors for common service topologies. Skilltree already ships with exactly this: pipeline templates (e.g. `code-review.yaml`, `etl-pipeline.yaml`, `composite-edwh.yaml`) that encode a complete harness for a given workflow type. The `evolve` command takes this further by scanning a codebase to generate a *bespoke* harness template tailored to the team's actual stack, conventions, and pain points.
- **The Steering Loop is the Feedback System.** The article's "steering loop", where humans iterate on the harness whenever an issue recurs, maps directly to Skilltree's feedback and friction infrastructure. When a pipeline consistently produces suboptimal output at a particular node, the friction log captures the pattern. The next `evolve` invocation uses that friction data to propose harness improvements with tighter instructions, stricter gates and model tier reassignment. The loop is not just supported but instrumented and data-driven.
- **Where GENPROHYFRA Goes Further.** Harness engineering as described by Böckeler is still a *manual* practice. The human decides which guides and sensors to add, configures them across scattered files and CI pipelines and hopes they stay coherent as the system grows. GENPROHYFRA's contribution is towards making the harness *declarative, visual, and economically legible*:
  - A pipeline YAML file is essentially a harness declaration.
  - The DAG visualisation makes harness coherence inspectable.
  - The Tokenconomics Dashboard makes harness cost attributable.
  - The schema validator ensures guides and sensors don't contradict each other by catching malformed routing at design time rather than at runtime.

The question Böckeler leaves open is: "How do we keep a harness coherent as it grows, with guides and sensors in sync, not contradicting each other?" This is precisely the question that a typed, validated, graph-based schema answers. It's way easier to lint a centralized pipeline schema than huge collections of scattered `.md` files.
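As an illustration of how cheap such linting is, the sketch below checks that every conditional edge points at a declared node. It assumes the pipeline YAML has already been loaded into a dict with the `nodes` and `conditional_edges` keys used in the examples of this article; `lint_pipeline` is a hypothetical helper, not Skilltree's validator.

```python
TERMINALS = {"__start__", "__end__"}


def lint_pipeline(pipeline):
    """Return a list of routing errors found in a pipeline definition."""
    known = {node["id"] for node in pipeline.get("nodes", [])} | TERMINALS
    errors = []
    for edge in pipeline.get("conditional_edges", []):
        for key in ("from", "on_true", "on_false"):
            target = edge.get(key)
            if target is None:
                errors.append(f"edge from {edge.get('from')!r} is missing {key!r}")
            elif target not in known:
                errors.append(f"{key} references unknown node {target!r}")
    return errors
```

An empty list means the routing is at least well-formed; anything else is caught at design time, before a single token is spent.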

GENPROHYFRA is most applicable to production AI pipelines with repeated execution patterns: workflows that run dozens to thousands of times per day and where the execution path is at least partially predictable from context metadata. It is less applicable to highly exploratory or research-oriented pipelines where the appropriate model tier and routing logic are genuinely unknown in advance.

The adoption path I recommend is incremental. I often start with the A³S layer by using Skilltree to `evolve` or `interpret` a codebase and generate/visualise an existing agent ecosystem. Then I refine the graph structure and conditional routing. Then I iterate on token consumption using the Tokenconomics Dashboard for monitoring. While applying GENPROHYFRA as a progressive discipline, I've been able to improve my working systems.

A legitimate concern with model tier downgrades is quality degradation. GENPROHYFRA does not prescribe universal tier assignments, only the tooling and vocabulary for engineers to make explicit, auditable trade-off decisions, rather than defaulting to frontier models by inertia. Its feedback system also promotes and recommends certain refinements over time.

I always try to place programmatic quality gates downstream of tier-downgraded nodes, so that they can catch output failures and re-route to higher-tier models when needed. This way I build cheap architectures first (whereas most of the community builds expensive first), then I verify deterministically, and upgrade if necessary. This achieves acceptable quality at significantly reduced average cost, with the cost of each upgrade visible in the Tokenconomics Dashboard.
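The cheap-first, verify, then upgrade pattern can be sketched as a small loop. Both `call_model` and `passes_gate` are caller-supplied placeholders here: the first stands in for whatever client invokes a model tier, the second for a deterministic check such as a test run or schema validation.

```python
def run_with_escalation(task, tiers, call_model, passes_gate):
    """Try each tier from cheapest to most expensive until the gate passes.

    Returns the accepted output and the list of tiers that were tried,
    so the cost of every upgrade stays visible.
    """
    attempts = []
    for tier in tiers:
        output = call_model(tier, task)
        attempts.append(tier)
        if passes_gate(output):
            return output, attempts
    raise RuntimeError(f"no tier passed the gate: {attempts}")
```

The returned `attempts` list is what a cost dashboard would record: most runs should show a single cheap tier, with escalations as the visible exception.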

Several important problems in this first iteration of GENPROHYFRA remain open:

- The framework currently lacks a principled method for **automatically recommending model tier assignments** based on task type. The feedback system helps with this, but full automation would require a meta-model trained on task-outcome-cost data from real pipeline executions. This implies mass usage and data collection, which for now is not my priority, as I'd rather induce a positive mental shift in developers instead.
- The Tokenconomics Dashboard does not yet model the cost of **LLM context accumulation** across pipeline steps. This will be important to directly address how much context bloat we are actually suffering from nowadays.
- The relationship between **Curated Shared Team Instructions** (the Thoughtworks Tech Radar practice that AGENTS.md implements) and GENPROHYFRA's graph schema is underspecified because it requires increased adoption. Future work should define a migration roadmap from `.md`-based team instructions to graph pipeline definitions that preserves the human-readable properties of the former while potentiating higher dimensionality in the latter.

The current phase of enterprise AI adoption is characterised by experimentation and urgency. Companies have entered a period of revenue loss in order to adopt AI: not because it is immediately profitable, but because they fear the cost of not adopting. Pilot projects proliferate but most do not reach production or even maturity. Ultimately, organisations that do not figure out AI in the next one to two years risk being structurally disadvantaged relative to those that already have, and this dynamic creates the feverish conditions for exactly the kind of architectural debt that GENPROHYFRA is designed to prevent.

When the goal is to adopt AI at all costs, cost architecture is not the priority. But when the adoption wave settles, the question shifts from "are we using AI?" to "can we afford to keep using AI this way?" The Uber incident is an early preview of that question arriving ahead of schedule. So this is my view for the next few years:

- **Stage 1 · NOW: Recognition of Linear Chain Limitations.** Teams adopt conditional routing, multi-model pipelines, and hybrid patterns independently.
- **Stage 2 · NEAR-TERM: Standardisation into Organisational Frameworks.** Equivalents of GENPROHYFRA govern how AI pipelines are designed, reviewed and deployed.
- **Stage 3 · MEDIUM-TERM: AI Pipeline Observability as Enterprise Infrastructure.** Token economy becomes a first-class capability, analogous to application performance monitoring (APM).

The current `.md` file paradigm for agent configuration (`AGENTS.md`, individual `.md` files) has served as a productive first-generation standard. Its strength is human readability and integration with existing developer workflows. Its fundamental weakness is that it encodes pipeline topology *implicitly*, through prose descriptions of routing logic, rather than *explicitly*, through structured graph definitions.

Implicit topology cannot be analysed because you cannot compute the token cost of a prose-defined pipeline before running it. You cannot automatically detect that a node is over-specified for its task nor can you enforce architectural governance rules programmatically.

GENPROHYFRA's YAML-based graph schema is an attempt to kickstart the second generation of agent configuration. By making nodes, edges, conditional routing, model assignments, and validation commands explicit and machine-parseable, it enables everything that implicit topology prevents: automated cost estimation, visual IDE tooling, static analysis, and programmatic governance. The transition from `.md` to structured graph definitions is not a stylistic preference but a prerequisite for managing AI systems at enterprise scale.
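To make "automated cost estimation" concrete, here is a sketch that computes a worst-case token bound for one run of an explicit graph before anything executes. It assumes an acyclic graph (retry loops bounded by something like `max_cycles` would need unrolling first), and the per-node estimates in the usage example are invented numbers, not measurements.

```python
from functools import lru_cache


def worst_case_tokens(est_tokens, edges, start):
    """Upper bound on tokens for one run of an acyclic pipeline graph.

    est_tokens: node id -> estimated tokens (0 for programmatic gates)
    edges:      node id -> list of successor node ids
    """
    @lru_cache(maxsize=None)
    def cost(node):
        if node == "__end__":
            return 0
        successors = edges.get(node) or ["__end__"]
        return est_tokens.get(node, 0) + max(cost(nxt) for nxt in successors)

    return cost(start)


# Hypothetical estimates for a three-node fragment.
estimates = {"complexity-gate": 0, "planner": 12_000, "executor": 8_000}
graph = {
    "complexity-gate": ["planner", "executor"],
    "planner": ["executor"],
    "executor": ["__end__"],
}
print(worst_case_tokens(estimates, graph, "complexity-gate"))
```

Nothing comparable can be computed from a prose description of the same routing, which is the point the paragraph above makes.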

I anticipate that major agent platforms will adopt structured graph formats within 1-2 years, driven by enterprise demand for cost governance and pipeline observability. The schema in Skilltree is a concrete proposal for what that standard could look like.

One of the most significant near-term shifts will be the normalisation of multi-model pipelines, where organisations deploy different foundation models from different providers within the same pipeline. This is very similar to what happened at the onset of cloud technology, when enterprises demanded multi-cloud compatibility.

The question driving this shift is the same question that drives GENPROHYFRA broadly: "*why use an overkill model for this task?*". If Claude Opus is the right choice for complex architectural reasoning, and GPT-5 is perfectly adequate for documentation formatting, then a well-designed pipeline uses both.

As enterprise procurement teams begin negotiating multi-model contracts, reserving frontier-model capacity for high-value tasks and using commodity models for routine ones, GENPROHYFRA can provide the architectural guidelines to implement and measure those agreements in actual pipeline definitions.

The final frontier of enterprise AI economisation is the shift from AI-governed to programmatically-governed pipelines. But being the final frontier does not mean it should be addressed last.

Every routing decision that can be expressed as a deterministic rule *should be* expressed as a deterministic rule, not delegated to an LLM. For me, that is an obvious cost-reduction feature. But this shift matters beyond cost: it's about **reliability**. A financial services firm deploying AI for document processing cannot afford routing decisions that are subject to model temperature or context drift. GENPROHYFRA's approach to programmatic governance is not just an efficiency measure; it also provides a compliance architecture for regulated industries where determinism is a requirement.
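A minimal sketch of this principle, under assumed names: routing is tried against an ordered list of deterministic rules first, and a generative fallback is consulted only when no rule matches. The document fields, rule set, and target node names are hypothetical and chosen only to echo the document-processing example.

```python
# Ordered (predicate, target_node) pairs; the first match wins.
# All field names and targets are hypothetical.
RULES = [
    (lambda doc: doc.get("pages", 0) > 50, "chunked_extraction"),
    (lambda doc: doc.get("doc_type") == "invoice", "invoice_parser"),
    (lambda doc: doc.get("doc_type") == "contract", "contract_review"),
]

def route(doc, llm_fallback=None):
    """Route deterministically when possible; spend tokens only as a last resort."""
    for predicate, target in RULES:
        if predicate(doc):
            return target  # same input always yields the same route
    if llm_fallback is None:
        return "manual_review"  # deterministic exit when no fallback is configured
    return llm_fallback(doc)

if __name__ == "__main__":
    print(route({"doc_type": "invoice", "pages": 2}))
    print(route({"doc_type": "memo"}))
```

The rule list doubles as an audit artefact: a reviewer can read exactly why a document went where it did, which is not possible when the decision lives inside a model's sampled output.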

TOKENCONOMICS is the discipline that will separate scalable AI enterprises from those that cannot afford to scale. The architectural patterns of early adoption (flat agent hierarchies, homogeneous model tiers, AI-powered validation of deterministic questions, LLM orchestrators with no deterministic exit conditions) are adequate for experimentation but economically and structurally untenable at production scale. The Uber incident is a preview of the structural crises awaiting every organisation that delays cost-aware pipeline architecture. GENPROHYFRA is a concrete, implementable response: use AI only where it is necessary, make the boundary between generative and programmatic explicit, assign every node the minimum model tier it requires, and measure everything. Skilltree demonstrates that these principles are buildable, deployable and demonstrable today, and any codebase can adopt it in under 5 minutes.

In my opinion, the flat `.md` file is not the future of AI agent architecture. Explicit topology, conditional routing, hybrid execution and token economy are first-class engineering concerns that require a solution of higher dimensionality. The organisations that make this transition in 2026 will not merely save money but build the architectural robustness to operate in an AI-intensive economy on their own terms, at a cost they control.

- Anthropic (2024). "Claude 3.5 and Claude 3 Model Family." [https://www.anthropic.com/news/claude-3-5-sonnet](https://www.anthropic.com/news/claude-3-5-sonnet)
- OpenAI (2024). "GPT-4o System Card." [https://openai.com/index/gpt-4o-system-card/](https://openai.com/index/gpt-4o-system-card/)
- Google DeepMind (2024). "Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context." arXiv:2403.05530.
- LangChain (2024). "LangGraph: Building Stateful Multi-Actor Applications with LLMs." [https://blog.langchain.dev/langgraph/](https://blog.langchain.dev/langgraph/)
- Microsoft Research (2023). "AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation." arXiv:2308.08155.
- Creston, D., Liang, P. et al. (2024). "Measuring the Economic Value of AI Work." Stanford HAI Working Paper.
- GitHub (2024). "GitHub Copilot Enterprise: Agent File Architecture (AGENTS.md specification)." GitHub Docs.
- Anthropic (2025). "Introducing Claude Code." [https://www.anthropic.com/news/claude-code](https://www.anthropic.com/news/claude-code)
- Reuters (2026). "Uber’s AI Coding Tool Budget Exhausted in Four Months." Technology News, April 2026.
- Amodei, D. (2025). "Machines of Loving Grace." [https://darioamodei.com/machines-of-loving-grace](https://darioamodei.com/machines-of-loving-grace)
- Park, J. S. et al. (2023). "Generative Agents: Interactive Simulacra of Human Behavior." UIST 2023.
- Yao, S. et al. (2023). "ReAct: Synergizing Reasoning and Acting in Language Models." ICLR 2023.
- Zhao, W. X. et al. (2023). "A Survey of Large Language Models." arXiv:2303.18223.
- Skilltree Project (2025). "Agents-as-a-Service (A³S): Visual IDE for Agent Orchestration." [https://github.com/skilltree-main](https://github.com/skilltree-main)
- pytest Development Team (2024). "pytest: The Python Testing Framework." [https://docs.pytest.org](https://docs.pytest.org)
- Gartner (2024). "Hype Cycle for Artificial Intelligence, 2024." Gartner Research Report.
- McKinsey Global Institute (2024). "The economic potential of generative AI: The next productivity frontier."
- Wooldridge, M. (2009). "An Introduction to MultiAgent Systems." 2nd ed. Wiley.
- Anthropic (2025). "Agent Skills: An Open Standard for Portable AI Agent Capabilities." Anthropic Research.
- Skilltree Project (2026). "Skilltree v2.0: Pipeline Execution Preview and Intent Routing." Internal Release Notes.
- Böckeler, B. (2026). "Harness Engineering for Coding Agent Users." martinfowler.com, April 2026. [https://martinfowler.com/articles/harness-engineering.html](https://martinfowler.com/articles/harness-engineering.html)

This white paper and the **GENPROHYFRA** design philosophy are
licensed under Creative Commons Attribution 4.0 International (CC BY 4.0).
You are free to share and adapt this material for any purpose, provided you
give appropriate credit to the author. [https://creativecommons.org/licenses/by/4.0/](https://creativecommons.org/licenses/by/4.0/) | **Skilltree** — All rights reserved. © Bruno M. Guerreiro, 2026.
