Agents are workflows. SirenSpec is the workflow tool that admits it.

A developer has released SirenSpec, a YAML-first SDK that treats production AI "agents" as workflows: fixed sequences of LLM calls with branching logic. The tool lets an entire pipeline be defined in a single readable YAML file that can be validated before execution and tested in CI without token costs. SirenSpec emerged after documented incidents in which autonomous agents ran up thousands of dollars in runaway costs, highlighting the opacity of traditional agent frameworks that scatter orchestration logic across multiple files.

TL;DR: Most production "agents" are really just workflows: a fixed sequence of LLM calls with some branching. SirenSpec is a YAML-first SDK that treats them that way. A whole pipeline can live in one .yaml file that a teammate can read in 30 seconds, that you can validate before you run it, and that you can test in CI without spending a cent on tokens.

Two stories from the last year and a half. A developer's autonomous agent spent $47,000 on itself (https://dev.to/gabrielanhaia/the-agent-that-spent-47k-on-itself-an-autonomous-loop-postmortem-3313) in a runaway loop before anyone caught it. A different one burned $4,200 over a single weekend (https://medium.com/@sattyamjain96/the-agent-that-burned-4-200-in-63-hours-a-production-ai-postmortem-d38fd9586a85): 63 hours of uncapped inference while its owner was at a wedding.

Neither developer was careless. Both wrote code that looked fine. The bug wasn't in a specific line: it was that the shape of the system (what runs, in what order, and what it's allowed to do) was scattered across a state machine in three Python files. You couldn't look at it. You could only trace it after something went wrong.

That's what gets me about the way most agent frameworks are built. Runaway loops are a known failure mode.
But the deeper problem is that calling something an "agent" implies intelligence and autonomy, and that framing leads you to build something opaque by default. What both of those systems actually were, underneath the branding, was a sequence of LLM calls with some branching. A workflow. And workflows should be readable.

I'm Tristan, and I built SirenSpec (https://github.com/sirenspec/sirenspec) because most production AI workflows shouldn't need a framework at all. They need a spec.

Worth asking before we go further: is "agent" even the right word for what most of us ship? A few people arrived at the same answer recently. Anthropic's "Building Effective Agents" (https://www.anthropic.com/research/building-effective-agents) tells you to start with the simplest thing that works, usually a plain pipeline, and only reach for real agentic behavior when nothing simpler will do. Temporal's team was blunter: "agents are just workflows, really." (https://www.amplifypartners.com/blog-posts/agents-are-just-workflows-really) And an arXiv survey of multi-agent pain points (https://arxiv.org/html/2510.25423v2) found the top two developer frustrations were orchestration semantics and policy enforcement, exactly the things that vanish into code in most frameworks.

Strip the branding and a production "agent" is usually a sequence of LLM calls, some shared context, a few conditional branches, and rules about what each step is allowed to do. That's a workflow. And if it's a workflow, the definition should be the first thing you read, not something you reconstruct from a pile of StateGraph.add_node calls.

SirenSpec is a YAML-first agent orchestration SDK. You write the whole pipeline in one file: the agents (model plus system prompt), the nodes (which agent runs, and where its output goes), the edges (order, plus optional branching), and the guardrails (injection detection, PII redaction, output validation, cost caps).
Run it from the CLI and you get a JSON trace of every node, token, and decision.

    version: "0.1"
    env_file: .env

    agents:
      researcher:
        model: "openai:gpt-4o"
        system: "Summarize the following for a non-expert."
      writer:
        model: "anthropic:claude-3-5-sonnet-20241022"
        system: |
          Write a 200-word blog intro from this research:
          {{ research.output }}

    nodes:
      research:
        agent: researcher
        writes: working.research
      write:
        agent: writer
        writes: output.draft

    edges:
      - from: research
        to: write

    guardrails:
      - injection
      - name: length
        config:
          max_chars: 1000
      - name: cost_cap
        config:
          max_usd: 0.10

A two-agent pipeline, and also the complete answer to what it does, what it's allowed to do, and in what order. You don't need Python to read it. That matters more than it sounds, because the person who needs to read it usually isn't the person who wrote it.

    pip install sirenspec
    sirenspec init                # scaffolds a workflow.yaml
    sirenspec run workflow.yaml

Install, scaffold, run. No project setup, no boilerplate, and sirenspec validate will catch a broken workflow before it ever calls a model.

Here's the GitHub triage example from the cookbook (https://docs.sirenspec.dev/cookbook/github-issues-triage/README):

    version: "0.1"
    env_file: .env

    agents:
      classifier:
        model: "openai:gpt-4o-mini"
        system: |
          Classify this GitHub issue. Return JSON with:
          category (bug|feature|question|docs), priority (low|medium|high), needs_repro (bool).
          Issue: {{ inputs.message }}
        guardrails:
          - name: schema
            config:
              schema:
                type: object
                required: [category, priority, needs_repro]
      responder:
        model: "anthropic:claude-haiku-4-5-20251001"
        system: |
          Write a friendly triage response.
          Classification: {{ classify.output }}

    nodes:
      classify:
        agent: classifier
        writes: working.classification
      respond:
        agent: responder
        writes: output.response

    edges:
      - from: classify
        to: respond

    guardrails:
      - injection

The equivalent in Python code means wiring up functions, managing prompt strings separately, and threading context between calls manually.
You're past 50 lines before you've written a single system prompt.

This isn't a line-count contest. It's about whether the shape of the workflow survives without a Python interpreter running in your head. Hand github-triage.yaml to your PM, your ops lead, or whoever inherits the project after you leave, and they can see what runs, in what order, and what it's not allowed to do. "Shorter code" and "a non-engineer can read it" are different claims. SirenSpec is going for the second one.

sirenspec validate fails before you push

Before a single API call fires:

    sirenspec validate research-pipeline.yaml
    ✗ Node 'analyze' references undefined agent 'analyzr' — did you mean 'analyzer'?
    ✗ agents.verify.system: field required
    ✗ InterpolationError in '{{ missing_node.output }}': node not found

Each line is a real class of bug. A typo'd agent name gets caught at load by Pydantic instead of throwing a KeyError mid-run, which is a thing people hit in CrewAI (https://community.crewai.com/t/keyerror-when-parsing-config-agents-yaml-file-in-a-trivial-crew-configuration/5073). A node missing its system prompt surfaces here, not as a confusing provider error three steps in. And if node A's prompt references node B while B's references A, SirenSpec catches the cycle at load. LangGraph lets you build it and tells you at runtime.

sirenspec validate exits 0 or 1, makes no API calls, and costs nothing to run. The bugs other frameworks find in production, yours finds in CI.

    agents:
      classifier:
        model: "openai:gpt-4o-mini"
        system: "Classify this support ticket."
        guardrails:
          - injection              # prompt-injection detection
          - name: pii              # redact before the model sees it
            config:
              entities: [email, phone, ssn]
          - name: length
            config:
              max_chars: 2000

These sit on the agent, right next to the model and the prompt. Not a separate library, not middleware, not a plugin you bolt on later.
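For intuition only, here is a rough plain-Python sketch of what a redact-then-check step in front of a model call could look like. It is not SirenSpec's implementation: the function names (`redact_pii`, `check_length`, `apply_guardrails`), the regexes, and the placeholder format are all assumptions made up for this illustration, and real PII detection needs far more care than three patterns.

```python
import re

# Hypothetical patterns, for illustration only.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact_pii(text: str, entities: list[str]) -> str:
    """Replace each match of the requested entity types with a placeholder."""
    for entity in entities:
        text = PII_PATTERNS[entity].sub(f"[{entity.upper()}]", text)
    return text


def check_length(text: str, max_chars: int) -> None:
    """Raise if the text exceeds the configured character budget."""
    if len(text) > max_chars:
        raise ValueError(f"input is {len(text)} chars, limit is {max_chars}")


def apply_guardrails(text: str) -> str:
    """Run the checks in order, mirroring the entities and limit in the YAML."""
    cleaned = redact_pii(text, ["email", "phone", "ssn"])
    check_length(cleaned, max_chars=2000)
    return cleaned


if __name__ == "__main__":
    ticket = "Contact jane@example.com or 555-123-4567. SSN 123-45-6789."
    print(apply_guardrails(ticket))
```

The point of the sketch is the ordering: redaction runs before anything reaches the model, and a failed check stops the call instead of logging a warning after the fact.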
Cost caps live in the same place, in the guardrails list:

    guardrails:
      - name: cost_cap
        config:
          max_usd: 0.50

That one entry is the difference between the $47K story and a run that stops itself. It's optional; skip it for a low-stakes internal tool, but when you want it, it's one entry, and anyone can open the file and confirm it's there. You can't say that about a setting buried in a Python state machine.

sirenspec test records a real run once, then replays it. After that, CI runs against the recording: deterministic, instant, no tokens.

    # Record against the live API, once
    sirenspec test tests/triage_test.yaml --record --cassette cassettes/run.yaml

    # Replay in CI: no live calls
    sirenspec test tests/triage_test.yaml --mock --cassette cassettes/run.yaml

The closest comparison is Pydantic AI's TestModel, but that's a mock: you assert against synthetic output. A cassette is the real model's response, run through your real pipeline. So when a model update quietly changes what you get back, it shows up as a failing test in a PR, not as a strange trace in production three weeks later.
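If the cassette idea is new to you, the mechanism can be sketched in a few lines of Python. This is a hypothetical illustration of record/replay in general, not how SirenSpec stores or matches recordings: the `Cassette` class, the request key, and the JSON file format (chosen to stay within the standard library) are all assumptions.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def _key(model: str, prompt: str) -> str:
    """Stable key for a request, so replay can find the recorded response."""
    return hashlib.sha256(f"{model}\n{prompt}".encode()).hexdigest()


class Cassette:
    """Hypothetical record/replay wrapper around any call(model, prompt) -> str."""

    def __init__(self, path: Path, call: Callable[[str, str], str], record: bool):
        self.path = path
        self.call = call
        self.record = record
        # Record mode starts with an empty tape; replay mode loads the saved one.
        self.tape: dict[str, str] = {} if record else json.loads(path.read_text())

    def __call__(self, model: str, prompt: str) -> str:
        key = _key(model, prompt)
        if self.record:
            # Record mode: hit the real client and persist its response.
            self.tape[key] = self.call(model, prompt)
            self.path.write_text(json.dumps(self.tape, indent=2))
            return self.tape[key]
        # Replay mode: never call the client; an unrecorded request is an error.
        if key not in self.tape:
            raise KeyError(f"no recorded response for {model!r} request")
        return self.tape[key]
```

In record mode you would pass your real client function as `call`; in replay mode `call` is never invoked, which is what makes the CI run free and deterministic. Failing loudly on an unrecorded request is a deliberate choice in this sketch: a changed prompt should break the test rather than silently reach a live API.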
One command turns any workflow into a Mermaid flowchart:

    sirenspec render workflow.yaml --target mermaid

Here's the output for the email triage example (https://docs.sirenspec.dev/cookbook/email-triage/README), a workflow that fetches your latest unread Gmail, fans out to three classifiers in parallel (urgency, intent, sender reputation), then routes to whichever response agent fits:

    graph TD
      fetch_email["fetch_email\npython_tool"]
      triage["triage\nswrm"]
      urgency["urgency"]
      intent["intent"]
      sender["sender"]
      synthesis["synthesis"]
      draft_reply["draft_reply"]
      forward_note["forward_note"]
      archive_reason["archive_reason"]
      fetch_email --> triage
      triage --> urgency
      triage --> intent
      triage --> sender
      urgency --> synthesis
      intent --> synthesis
      sender --> synthesis
      synthesis -->|reply| draft_reply
      synthesis -->|forward| forward_note
      synthesis -->|archive| archive_reason

Paste it into any Mermaid renderer and you get a diagram of your pipeline without writing a single line of diagram code.

This matters more than it sounds, because your workflow's audience is no longer just you. Your PM wants to know what it does. Your manager wants to audit it. And increasingly, your AI coding tools need to understand it too. Mermaid is significantly more token-efficient than ASCII diagrams for LLMs (https://dev.to/darkmavis1980/why-mermaid-is-the-best-way-to-document-your-architecture-in-the-ai-era-2dgb), with less chance of misinterpretation. Drop a rendered diagram into your CLAUDE.md or project README and Codex, Claude Code, or whatever you're pairing with can orient itself in seconds.

If you're sizing this up, here's where it stops. No dynamic loops, no autonomous tool selection, no handoffs, no memory layer. You write the graph; the graph runs. Connectors, web browsing, and richer tool integrations are on the roadmap, but they're still in planning.

SirenSpec is for the script you've already written more than once: the one that calls OpenAI, retries on a 429, checks a JSON shape, counts tokens, and hopes.
That script, with a spec you can read, a validator, and tests around it.

| | SirenSpec | Big Agent Frameworks | Raw SDK |
|---|---|---|---|
| Readable by non-engineers | ✅ | ❌ | ❌ |
| Pre-run validation | ✅ | ❌ | ❌ |
| Guardrails built in | ✅ | DIY | DIY |
| CI tests via cassettes | ✅ | ❌ | ❌ |
| Dynamic agent loops | ❌ | ✅ | ✅ |
| Provider-agnostic | ✅ | Varies | ❌ |

Does it support loops?

Yes, via factory nodes. A factory iterates over a list and runs one agent instance per item, with configurable concurrency. The changelog annotator (https://docs.sirenspec.dev/cookbook/changelog-annotator/README) is a good example: one classifier per commit, then a release writer that aggregates them. Autonomous tool selection and open-ended handoffs are not supported.

Which providers?

OpenAI, Anthropic, and Ollama today. Gemini, Bedrock, and Groq are on the list.

Why YAML instead of Python?

Because the workflow is the thing you want to read, diff in a PR, and hand to someone who doesn't write Python. When the definition lives inside code, "what does this pipeline actually do?" stops having a quick answer.

How do I run the workflow in production?

Currently, SirenSpec ships a lightweight Python SDK (https://docs.sirenspec.dev/sdk) with the install. You can load your workflow into Python and execute it in a variety of ways.

A lot of "agents" in production are really just workflows with retries, branching, and memory layered on top. That realization is what led me to build SirenSpec. We're still early at v0.1.1, which makes this a fun stage to experiment in.

I'd love to hear from you:

How much of your company's "agent" stack is actually deterministic underneath?

If you're a non-technical founder, PM, or hobbyist vibecoder, when have you hit a wall building AI workflows or agents?

If any of that sounds familiar, I'd love to hear how your team is approaching it.
You can check out SirenSpec on GitHub (https://github.com/sirenspec/sirenspec) or browse the docs (https://docs.sirenspec.dev).