How to Write a Flutter Agent Skill That Actually Works: The 2026 Recipe

A developer has published a recipe for writing effective Flutter agent skills, synthesizing official guidance from Flutter, Anthropic, Google, and OpenAI into a single copy-pasteable format. The recipe emphasizes that a great skill is a tightly scoped `SKILL.md` file with a description engineered for discovery, ruthless conciseness, anti-patterns stated upfront, a checklist workflow, and a feedback loop. The format is an open standard that works across Claude Code, OpenAI Codex, Google Antigravity, Gemini CLI, and Cursor.

TL;DR: A great agent skill is not a pile of documentation. It is a tightly scoped `SKILL.md` with a description engineered for discovery, ruthless conciseness, anti-patterns stated up front, a checklist workflow, and a feedback loop. The format is an open standard that works across Claude Code, OpenAI Codex, Google Antigravity, Gemini CLI, and Cursor. This post synthesizes the official authoring guidance from Flutter, Anthropic, Google, and OpenAI into one recipe, hands you a complete copy-pasteable Flutter skill, and shows you how to actually evaluate it instead of guessing.

In my last article, I wrote about the official Dart and Flutter Agent Skills and why they stop your AI from writing 2022 Flutter. The most common reply I got was some version of the same question: "Cool. How do I write my own?"

So I went and read the actual playbooks. Not the hot takes, the primary sources: Flutter's skill docs and eval framework, Anthropic's skill authoring best practices, Google's Antigravity skill docs, and OpenAI's Codex skill guide. The good news is they agree on almost everything. The better news is that the gap between a skill that works and a skill that gets silently ignored comes down to a handful of decisions, and most people get them wrong.

Here is the recipe, Flutter-flavored.

AI agents are generalists. They average across years of Flutter code, much of it deprecated, and hand you the most statistically common answer instead of the currently correct one. The Flutter team named this the knowledge gap: the framework ships features faster than language models can update their training data. Skills exist to close that gap by handing the agent a task-specific, expert workflow.

But here is what nobody tells you. A poorly written skill does not just fail to help. It actively costs you. Every skill's metadata sits in the agent's context budget at all times. A vague skill that never triggers is dead weight.
A skill with a fuzzy description that triggers on the wrong tasks is worse, because now your agent is following the wrong playbook with full confidence.

The bar is not "wrote some Markdown." The bar is "the agent reliably finds it, trusts it, and follows it." Everything below is in service of that bar.

A skill is the simplest possible thing: a folder with one required file.

```
building-riverpod-async-screens/
├── SKILL.md        # Required: metadata + instructions
├── references/     # Optional: deep-dive docs loaded on demand
├── examples/       # Optional: reference implementations
├── scripts/        # Optional: scripts the agent runs, not reads
└── assets/         # Optional: templates, images
```

The `SKILL.md` itself is YAML frontmatter plus a Markdown body:

```markdown
---
name: building-riverpod-async-screens
description: "Build a Flutter screen that loads async data with Riverpod..."
---

# Building Riverpod Async Screens

(instructions go here)
```

The magic that makes this scale is progressive disclosure. At startup the agent loads only the lightweight metadata (name, description, path) of every skill. It reads the full `SKILL.md` only when a task matches, and it reads anything in `references/` or `examples/` only when the body points it there. If you write Flutter, you already know this pattern: it is deferred loading for the context window. OpenAI, Anthropic, and Google all describe the exact same mechanism.

This is the part that makes writing a skill worth your time. `SKILL.md` is an open standard published at agentskills.io (originated at Anthropic, since adopted across the ecosystem).
One skill works almost everywhere:

| Tool | Vendor | Where skills live |
|---|---|---|
| Claude Code | Anthropic | `.claude/skills/` (project), `~/.claude/skills/` (personal) |
| OpenAI Codex | OpenAI | `.codex/skills/` (project), `~/.codex/skills/` or `~/.agents/skills/` |
| Antigravity | Google | `.agents/skills/` (workspace), `~/.gemini/antigravity/skills/` (global) |
| Gemini CLI | Google | `SKILL.md` standard locations |
| Cursor / Copilot | Various | supported with manual placement |

The Flutter team's installer targets the cross-tool location directly:

```
npx skills add flutter/skills --skill '*' --agent universal
```

The `--agent universal` flag drops everything into `.agents/skills`, the folder compatible agents auto-discover. Write a skill once, and your whole team gets the same expertise regardless of which agent they prefer. Codex adds a distribution layer on top (it calls the authoring format a "skill" and the installable package a "plugin"), but the core file is identical.

Every official source converges on these. I have ordered them by how much they matter in practice.

If your skill does not trigger, it is almost never the instructions. It is the description. This is the single most important line in the entire file, because it is the only part the agent reads when deciding whether to load your skill at all, often choosing from 100+ candidates.

Three rules from the official guidance:

- Write it in the third person.
- State both what the skill does and when to use it.
- Front-load the trigger words a developer would type, and add a "Do not use" clause so it does not fire on the wrong tasks.

Compare:

```yaml
# Weak: vague, no triggers, will rarely fire correctly
description: Helps with Flutter screens.

# Strong: what + when + triggers + boundary
description: Build a Flutter screen that loads async data with Riverpod, handling loading, error, and data states with AsyncValue. Use when fetching from a repository or API and rendering spinners, retry UI, and lists. Do not use for purely static screens with no async data.
```

Anthropic puts it perfectly: the context window is a public good. Your skill shares it with the system prompt, the conversation, every other skill's metadata, and the user's actual request.
The default assumption must be that the agent is already very smart. Do not explain what Flutter is. Do not explain what a widget is. Do not define JSON. Challenge every sentence: does the agent really not know this?

Keep the `SKILL.md` body under 500 lines. If it grows past that, split it into `references/` files.

```markdown
<!-- Bad: wastes tokens on what the model already knows -->
Flutter is Google's UI toolkit. A widget is a building block of the UI. To make a network call, you first need an HTTP client, which is a piece of software that...

<!-- Good: assumes competence, gets to the point -->
Use the http package for REST calls. Wrap responses in a typed model.
```

This framing from Anthropic is the one most people miss. Think of the agent as a robot walking a path, and match the freedom you give it to how fragile the task is. A low-freedom instruction looks like: "Run `dart run build_runner build --delete-conflicting-outputs`. Do not modify the flags."

Fragile, deterministic Flutter operations (code generation, migrations, platform config) want low freedom. Architectural and design decisions want high freedom. Most skills need a mix.

This is what makes the official Flutter skills so effective, and it is the ingredient that separates a senior skill from a junior one. Do not only say what to do. Ban the wrong instinct explicitly.

The official `flutter-build-responsive-layout` skill does exactly this. It does not just say "be responsive." It says: do NOT switch layouts on `MediaQuery.orientationOf`, do NOT check for "phone" vs "tablet", do NOT lock orientation. Those negative rules are what stop the model from reaching for the plausible-but-wrong pattern it learned from a thousand old tutorials.

```markdown
## Rules
- Use `AsyncValue.when` to render data/loading/error. Never assume data is present.
- Do NOT use `FutureBuilder` for server state. It re-runs its future on every rebuild unless cached, causing duplicate network calls.
- Do NOT swallow exceptions or show an infinite spinner on failure.
```

For any multi-step task, give the agent a checklist it can copy into its response and tick off.
This prevents skipped steps, which is the most common failure mode on complex work. Both Anthropic and Flutter's own skills use this pattern.

```markdown
## Workflow
Copy this checklist and track progress:
- [ ] Define the immutable data model.
- [ ] Add the repository method returning `Future<Model>`.
- [ ] Create the provider that calls the repository.
- [ ] Build the screen with `ref.watch` + `AsyncValue.when`.
- [ ] Implement the error branch with a retry action.
- [ ] Run `dart analyze` and fix everything. Repeat until clean.
```

The highest-leverage pattern in the entire playbook: run validator, fix errors, repeat. Give the agent an objective check it can run and a rule to keep going until it passes. In Flutter, you have world-class validators for free.

```markdown
After generating code, run `dart analyze`. If it reports issues, fix them and run it again. Only present the result when analysis is clean and `flutter test` passes.
```

This single habit improves output quality more than almost anything else, because it converts "looks right" into "provably compiles and lints clean."

Keep the main `SKILL.md` as a lean overview and push depth into linked files. Three patterns, named by the Antigravity docs:

- `SKILL.md` only. For focused, single-purpose skills.
- `SKILL.md` + `references/`. For skills with deep API detail.
- `SKILL.md` + `examples/`. For skills where output quality depends on seeing worked examples.

Two rules when you split: keep references one level deep from `SKILL.md` (the agent may only partially read nested files), and add a table of contents to any reference file longer than 100 lines so the agent can see the full scope even on a partial read.

Never write "before August 2025, use the old API." It rots. Instead, put deprecated guidance in a collapsed "old patterns" section so the current path stays clean while history stays available.

```markdown
## Old patterns
<details>
<summary>Why not FutureBuilder? (legacy)</summary>
FutureBuilder re-runs its future on every rebuild unless cached, causing duplicate calls. Providers cache and dedupe by default. Prefer providers.
</details>
```

If output quality depends on style, show a complete input/output example rather than describing it. The model matches patterns far better than it follows prose. One correct, runnable Dart snippet anchors the entire skill.

Here is a full, working skill that bundles every ingredient above. It targets a spot where AI agents reliably write outdated Flutter: loading async data. Drop this into `.agents/skills/building-riverpod-async-screens/SKILL.md` and it works in Claude Code, Codex, and Antigravity.

````markdown
---
name: building-riverpod-async-screens
description: Build a Flutter screen that loads async data with Riverpod, handling loading, error, and data states with AsyncValue. Use when fetching from a repository, API, or database and rendering spinners, retry UI, and lists. Do not use for static screens with no async data.
---

# Building Riverpod Async Screens

Wire an async data screen the way a senior Flutter dev would: a typed provider, `AsyncValue` state handling, and explicit loading/error/data branches. No raw `FutureBuilder`, no manual `setState` for server state, no swallowed errors.

## Rules
- Use a `FutureProvider` for read-only data, or an `AsyncNotifier` when the screen also mutates state. Do NOT use `StatefulWidget` + `setState` for server state.
- Watch with `ref.watch` inside `build`. Use `ref.read` only inside callbacks.
- Render all three states with `AsyncValue.when`. Never assume data exists.
- Always give the error branch a retry path. Do NOT swallow exceptions or show an infinite spinner on failure.
- Keep shared providers in their own file: one feature, one providers file.

## Workflow
Copy this checklist and track progress:
- [ ] Define the immutable data model.
- [ ] Add the repository method returning `Future<Model>`.
- [ ] Create a `FutureProvider` or `AsyncNotifier` that calls the repository.
- [ ] Build the screen: `ref.watch` the provider, render with `AsyncValue.when`.
- [ ] Implement the error branch with a retry that invalidates the provider.
- [ ] Run `dart analyze` and fix all issues. Repeat until clean.

## Example

```dart
// product_providers.dart
// Product and productRepositoryProvider are defined elsewhere in the app.
import 'package:flutter_riverpod/flutter_riverpod.dart';

final productProvider = FutureProvider.autoDispose<List<Product>>((ref) async {
  final repo = ref.watch(productRepositoryProvider);
  return repo.fetchProducts();
});

// product_screen.dart
// ErrorRetry is an app-level widget defined elsewhere.
import 'package:flutter/material.dart';
import 'package:flutter_riverpod/flutter_riverpod.dart';

class ProductScreen extends ConsumerWidget {
  const ProductScreen({super.key});

  @override
  Widget build(BuildContext context, WidgetRef ref) {
    final products = ref.watch(productProvider);
    return Scaffold(
      appBar: AppBar(title: const Text('Products')),
      body: products.when(
        data: (items) => ListView.builder(
          itemCount: items.length,
          itemBuilder: (_, i) => ListTile(title: Text(items[i].name)),
        ),
        loading: () => const Center(child: CircularProgressIndicator()),
        error: (err, _) => ErrorRetry(
          message: 'Could not load products',
          // Invalidating the provider re-runs the fetch.
          onRetry: () => ref.invalidate(productProvider),
        ),
      ),
    );
  }
}
```

<details>
<summary>Why not FutureBuilder? (legacy)</summary>
FutureBuilder re-runs its future on every rebuild unless cached, causing duplicate network calls. Providers cache and dedupe by default. Prefer providers for any server state.
</details>
````

Notice how much work the description does, how the rules ban wrong instincts before listing right ones, how the workflow ends in a validator loop, and how the example is complete enough to copy. That is the whole recipe in one file.

How to actually evaluate your skill

This is the step almost everyone skips, and it is the one that separates a skill that feels good from one that is good. Both Anthropic and the Flutter team are emphatic: do not trust vibes. Measure.

Build the evaluation first

Anthropic calls this evaluation-driven development, and the order matters:

1. Find the gap. Run the agent on a real task with no skill. Document exactly where it fails or writes outdated code.
2. Write three eval scenarios that target those failures.
3. Establish a baseline. Measure performance without the skill.
4. Write the minimum instructions needed to pass.
5. Iterate. Re-run, compare to baseline, refine.
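
Steps 3 and 5 come down to comparing two runs of the same scenario, one without the skill and one with it. Here is a minimal Python sketch of that bookkeeping. It is my own illustration, not part of any official eval framework: `ScenarioResult`, `compare_to_baseline`, and the recorded results are hypothetical, and it assumes you (or a script) have already judged each expected behavior as met or not.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScenarioResult:
    """One eval scenario's outcome: expected behavior -> was it observed?"""
    name: str
    checks: dict

    @property
    def pass_rate(self) -> float:
        # Fraction of expected behaviors the agent's output satisfied.
        if not self.checks:
            return 0.0
        return sum(self.checks.values()) / len(self.checks)


def compare_to_baseline(baseline, with_skill):
    """Return {scenario name: pass-rate gain of the skill run over the baseline}."""
    base = {r.name: r.pass_rate for r in baseline}
    return {r.name: round(r.pass_rate - base[r.name], 2) for r in with_skill}


# Hypothetical results, recorded after running the same task twice.
baseline = [ScenarioResult("order-history", {
    "uses provider, not setState": False,
    "renders all three states": False,
    "retry invalidates provider": False,
    "passes dart analyze": True,
})]
with_skill = [ScenarioResult("order-history", {
    "uses provider, not setState": True,
    "renders all three states": True,
    "retry invalidates provider": True,
    "passes dart analyze": True,
})]

gains = compare_to_baseline(baseline, with_skill)
print(gains)  # {'order-history': 0.75}
```

A gain near zero on every scenario is the signal that the skill is not changing the agent's behavior.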
This guarantees you are solving a real problem instead of documenting an imaginary one. A simple eval is just structured expectations:

```json
{
  "skills": ["building-riverpod-async-screens"],
  "query": "Build a screen that loads the user's order history from OrderRepository and shows it in a list",
  "expected_behavior": [
    "Creates a FutureProvider or AsyncNotifier that calls OrderRepository, not a StatefulWidget with setState",
    "Renders loading, error, and data states using AsyncValue.when",
    "Includes a retry action in the error branch that invalidates the provider",
    "Generated code passes dart analyze with no errors"
  ]
}
```

The Dart and Flutter teams run an experimental evals framework (open-sourced at the flutter/evals repository) built around critical user journeys: realistic developer tasks rather than toy prompts. They score on two axes, which is a great rubric to copy for your own skills:

- Deterministic correctness: does the code compile, pass `dart analyze`, and pass the tests? Objective, machine-checkable.
- Qualitative quality: is the code meaningfully better than the no-skill baseline?

For your own skill, that translates to a dead-simple loop: run the task with and without the skill, then ask "did the deterministic checks pass, and is the code meaningfully better?" If the skill does not move either axis, it is not earning its context budget.

Anthropic's most practical tip: develop the skill with one instance (call it the author) and test it with a fresh instance that has no memory of the conversation (the tester). The author helps you write and tighten the `SKILL.md`. The tester reveals what the instructions actually communicate to a cold agent. When the tester stumbles, bring the specific failure back to the author and refine. Repeat. This observe-refine-test loop is how the official skills were hardened, and it works because the model understands both how to write agent instructions and what an agent needs to receive.

Skills can include scripts and reference external resources. That means an untrusted skill can introduce vulnerabilities or quietly exfiltrate data.
Before you install a community skill, read it, the same way you would read a dependency before adding it to `pubspec.yaml`. For any skill that runs terminal commands or touches infrastructure, add an explicit "Safety" section documenting exactly what it does. Treat skills as code, because they are.

When the official skills dropped, the Flutter corner of X and Reddit reacted the way it always does: screenshots, threads, and declarations that AI coding just changed again. I want to be straight, because the skeptics have a point worth hearing. More than one experienced Flutter dev read the actual skill files and came away underwhelmed, noting the initial set is fairly thin and covers ground a competent dev already knows. That is fair. And it is also the wrong frame.

A skill is not a magic file that makes your agent brilliant. It is a discipline. The value is not in any single skill the Flutter team shipped. It is in the workflow the format unlocks: codify a pattern once, evaluate it, refine it on a loop, and every future session inherits it. The teams that win with AI in 2026 are not the ones with the best model. They are the ones who got good at writing down what they already know, then testing that the agent actually follows it.

That is the real reason to learn this recipe. Not to consume the official skills, but to write the ones your team actually needs.

Before you commit a Flutter skill, verify:

- `SKILL.md` body is under 500 lines; depth is in `references/` or `examples/`.
- A validator loop runs `dart analyze` / `flutter test`.

What is a Flutter agent skill?
A folder containing a `SKILL.md` file that gives an AI coding agent task-specific, expert instructions for a Flutter or Dart workflow. It loads on demand via progressive disclosure, so it adds expertise without permanently bloating the context window.

What makes an agent skill good?
A precise, trigger-rich description (the single biggest factor), ruthless conciseness, explicitly stated anti-patterns, a checklist workflow, a validator feedback loop, and at least one complete example, all verified against evaluations rather than vibes.

How do I write the description so the skill actually triggers?
Third person, state both what the skill does and when to use it, front-load the trigger words a developer would type, and add a "Do not use" clause to prevent it firing on the wrong tasks.

How do I evaluate a skill?
Build evals before writing docs. Run the task without the skill to establish a baseline, write three scenarios with expected behaviors, then measure deterministic correctness (compiles, passes `dart analyze` and tests) and qualitative quality against that baseline.

Does a skill I write for Claude Code work in Codex and Antigravity?
Yes. `SKILL.md` is an open standard. Skills that stick to the core format (frontmatter plus Markdown instructions) work across Claude Code, Codex, Antigravity, Gemini CLI, and Cursor. Only advanced, tool-specific features need adjustment.

How is a skill different from a rules file or AGENTS.md?
Rules and `AGENTS.md` are always-on, repository-wide instructions (setup commands, standards). A skill is loaded only when its description matches the current task. Use always-on files for global rules and short if/then triggers, and skills for specific, repeatable workflows.

How long should a SKILL.md be?
Keep the body under 500 lines. If it grows past that, move depth into one-level-deep `references/` files and keep the main file as a lean overview.

The official Dart and Flutter skills are a starting point, not the destination. The real unlock is the recipe behind them: a discovery-optimized description, concise expert instructions, anti-patterns stated out loud, a checklist, a validator loop, and an evaluation that proves it works.
Get those right and you can encode your team's hardest-won Flutter patterns into something every agent on the team follows automatically.

Write one skill this week. Pick the task where your AI agent annoys you most, encode the correct pattern, and evaluate it against the no-skill baseline. Then tell me what you built. I read every comment, and I want to know which Flutter pattern you taught your agent first. 🥊

Sources: Flutter docs: Agent skills and AI Evaluations, flutter/skills, Anthropic: Skill authoring best practices, Google Antigravity skills docs, and OpenAI Codex: Agent skills.