{"slug": "ghost-skills-teaching-ai-agents-to-think-like-data-engineers", "title": "Ghost Skills: Teaching AI Agents to Think Like Data Engineers", "summary": "A data engineer created Ghost Skills, a collection of methodology instructions for AI coding agents, to address the gap between model capability and the contextual knowledge needed for real-world data engineering work. The skills repository provides durable craft practices—like declaring grain before building fact tables—that agents lack by default, aiming to improve their effectiveness in complex data environments.", "body_md": "# Ghost Skills: Teaching AI Agents to Think Like Data Engineers\n\nAnother week, another skills repo on the GitHub trending page. I know. There are roughly seventeen of them now, all promising to turn your AI coding agent from a confident intern into a slightly-less-confident intern. Most of them are great. Most of them are also built by solo devs, for solo devs, on solo-dev codebases that fit comfortably in a context window.\n\nWhich is fine, if that’s your world.\n\nLess fine if your world involves a Snowflake warehouse with four tables that *could* be the source of truth for “customer”, an SCD2 someone half-built in 2021 and quietly walked away from, and a dbt project where `stg_users_final_v3_actually_use_this` is, somehow, the one you’re meant to use. (Don’t laugh. You’ve seen worse.)\n\nI’ve been using AI agents day-to-day for a while now — mostly as a peer reviewer, a second pair of eyes, a thinking partner that doesn’t sigh when I ask it to look at the same query three times. They’re genuinely useful.\n\nSo I built something: [Ghost Skills](https://github.com/ghostinthedata-info/skills) — a collection of data-engineering methodology skills for AI coding agents. Before I get into what’s in it, let me explain why I bothered.\n\n### The problem isn’t the model. 
It’s that data work has rules nobody told it about.\n\nYour AI agent is, by default, an extremely bright graduate on their first day. It knows Python. It knows SQL. It can write a beautifully-commented `for` loop and explain window functions to you in five different ways. It will happily generate a fact table, a dbt model, an Airflow DAG, all of it looking very professional.\n\nWhat it doesn’t know is everything that actually matters.\n\nIt doesn’t know that the `customer_id` in that source isn’t really unique because of the 2022 CRM migration. It doesn’t know that Roger pushed a release on Friday afternoon (because of course he did) and the platform’s been wheezing through a seven-year backfill ever since. It doesn’t know that this dimension changes slowly and the audit team will lose their minds if you flatten the history. It doesn’t know that the business definition of “active customer” changed in March, but only in the marketing data mart, and only sometimes.\n\nThese aren’t model intelligence problems. The model is plenty smart. They’re *context* problems — and the context is exactly the stuff that takes a data engineer two years to learn and roughly thirty minutes to forget when they leave the company.\n\nWhat agents are missing isn’t capability. It’s methodology. It’s the durable craft of data engineering — the part that doesn’t change when you swap Snowflake for BigQuery, or dbt for SQLMesh, or your warehouse for a “lakehouse” that’s somehow priced like a warehouse anyway.\n\nThat’s the gap this is attempting to fill.\n\n### Skills, briefly, for anyone who hasn’t drunk this particular Kool-Aid yet\n\nIf you’ve not been living inside Claude Code or Cursor for the last six months: a *skill* is a folder of instructions you give an AI coding agent. 
“When you build a fact table, declare the grain first.” “When you write tests, follow this severity model.” “When something breaks, here’s the comms workflow.” You version it, you commit it, you keep it in the repo. The agent reads it before doing the thing, and re-reads it next session because agents have the long-term memory of a goldfish.\n\nIt’s a deceptively powerful pattern. Instead of pasting the same prompt every session (and forgetting half of it, and writing it slightly differently this time), the standard lives in the repo. New team member joins? They get the skills. Agent updates next month? It still reads the skills.\n\nThe catch — and we’ll come back to this — is that skills tell an agent *how* you do something. They don’t tell it *what* you’ve built, *why* you built it that way, or *which* of your seven utility tables is the one that’s still maintained.\n\nFor solo devs, that’s fine. There is no “seven utility tables”. For data teams at any kind of scale, it’s a real limit. More on that at the end.\n\n### Introducing Ghost Skills\n\nRight. The repo: [github.com/ghostinthedata-info/skills](https://github.com/ghostinthedata-info/skills). Built mostly for myself, polished up for sharing.\n\n30-second setup:\n\n```\nnpx skills@latest add ghostinthedata-info/skills\n```\n\nPick the skills you want, pick your agent (Claude Code, Codex, Cursor — whatever you’re using), and run `/setup-ghost-skills`. It’ll ask you three questions: warehouse dialect, transform tooling, and how your domain docs are laid out. Then it writes itself a configuration block into your `CLAUDE.md` or `AGENTS.md` and gets out of your way.\n\nThe agent reads it next session. And the one after that. And the one after that. The same standards, every time, without you re-explaining them.\n\nThat’s it. 
That’s the whole thing.\n\n### What’s in the catalogue\n\nThe skills group into four areas, mapped to roughly how data work actually happens.\n\n**Discovery** is everything you should do before you build anything, and frequently don’t. `profile-data` runs the baseline checks on a new dataset — row counts, cardinality, null analysis, key uniqueness — so the agent isn’t generating models against assumptions that fall apart in week two. `gather-requirements` pins down grain, sources, consumers, freshness SLAs, and acceptance criteria one question at a time, instead of letting the agent guess. `refine-context` stress-tests a plan against your documented domain model and writes the decisions back to `CONTEXT.md` and ADRs as they crystallise.\n\n**Modeling** is the methodology your agent should already be applying and usually isn’t. `dimensional-modeling` walks the four-step process. `fact-table-design` forces grain declaration and measure classification before a single line of SQL gets generated. `keys` covers business, natural, surrogate, composite, and durable keys — and the anti-patterns that bite you eighteen months later when the source system gets re-platformed. `slowly-changing-dimensions` covers SCD types 0 through 7, including my Healing Tables approach to deterministic, path-independent SCD2.\n\n**Quality** is where defensive engineering lives. `test-data` produces a test plan you can actually defend in a code review — uniqueness, referential integrity, nulls, accepted values, freshness, volume variance — and maps each test to a severity level. `performance-tuning` follows the measure-first philosophy: find the critical path, then fix partition pruning, incremental processing, and the phantom dependencies that are silently doubling your costs. 
`spark-performance` handles the distributed case — partition counts, shuffle minimisation, skew handling, broadcasting small tables.\n\n**Operations** is what should happen when things go pop. `incident-comms` gives the agent the severity classification, notification workflow, update cadence, and post-incident review template — the “Don’t Go Dark” pattern in skill form. `pipeline-design` encodes idempotency, reproducibility, and defensive engineering as defaults instead of afterthoughts. `data-as-a-product` brings data mesh thinking into the agent — domain ownership, discoverability, SLAs, federated governance. `data-security-classification` covers the four A’s, PII/PHI handling, and least privilege.\n\nThere’s a setup skill (`setup-ghost-skills`) that runs once per repo and handles the configuration. Everything else, you opt into.\n\n### Why tool-agnostic, even though tool-specific would sell better\n\nThe temptation when building something like this is to specialise. Snowflake-only. dbt-only. Make every skill assume your exact stack and produce immediately runnable code.\n\nI deliberately didn’t do that, and the tradeoff is worth naming honestly. Generic skills are less immediately useful than highly specific ones. A skill that knows your exact dbt project structure, your column naming conventions, your team’s snake_case-vs-camelCase argument that’s been going on since 2022 — that’s more powerful, day one.\n\nBut it’s also yours. It’s not shareable. And the bits that are genuinely universal — *declare the grain before you build the fact*, *a business key isn’t a real business key until you’ve checked it’s unique*, *test before publish* — those don’t change between platforms. They didn’t change when we moved from Teradata to Snowflake. They won’t change when we move from Snowflake to whatever everyone’s furiously rebranding next year.\n\nGhost Skills is the base layer. Your repo’s `CONTEXT.md` is your specialisation layer. 
They’re meant to work together, not replace each other. Fork it. Add a `snowflake-cost-optimisation` skill on top. Open a PR if you want to share it back. (Or don’t. Do what you like.)\n\n### Skills help. They don’t solve everything. Let’s not pretend.\n\nThe honest thing to say at this point is that skills are not a silver bullet, and anyone telling you otherwise is selling you a course.\n\nSkills tell an agent *how* to do things. They don’t tell it *what’s there* or *why*. An agent that knows the Kimball four-step process will still make the wrong grain declaration if it doesn’t understand the business process it’s modelling. A skill can encode what SCD2 is; it can’t replace a conversation with the stakeholder who actually needs the history. A skill can encode the Write-Audit-Publish pattern; it can’t tell the agent which of your fourteen “users” tables is the one to run it against.\n\nThe other thing skills can’t do is fix a stale skill. The conventions you encode today are the conventions you knew today. If your team’s standards evolve — and they should — your skills need to evolve with them. A stale skill is worse than no skill, because it gives the agent a confident wrong answer instead of asking a question.\n\nSo treat them as a starting point. Use them. Fork them. Argue with them in a code review. Update them when your standards shift. The agent will keep reading whatever’s in the repo, faithfully, every session — which means the skills are only ever as good as the last person who maintained them.\n\nThe repo is at [github.com/ghostinthedata-info/skills](https://github.com/ghostinthedata-info/skills). Fork it for your cloud-specific layer. Open issues if something’s wrong. 
Open a PR if you’ve got methodology worth encoding and you don’t want to keep it to yourself.\n\nAnd if you build something better on top of it, please tell me.", "url": "https://wpnews.pro/news/ghost-skills-teaching-ai-agents-to-think-like-data-engineers", "canonical_source": "https://ghostinthedata.info/posts/2026/2026-06-14-ai-ghost-skills-ai-agents/", "published_at": "2026-06-13 23:00:00+00:00", "updated_at": "2026-06-17 19:52:22.869699+00:00", "lang": "en", "topics": ["ai-agents", "ai-tools", "ai-research", "developer-tools"], "entities": ["Ghost Skills", "Snowflake", "BigQuery", "dbt", "SQLMesh", "Claude Code", "Cursor", "Airflow"], "alternates": {"html": "https://wpnews.pro/news/ghost-skills-teaching-ai-agents-to-think-like-data-engineers", "markdown": "https://wpnews.pro/news/ghost-skills-teaching-ai-agents-to-think-like-data-engineers.md", "text": "https://wpnews.pro/news/ghost-skills-teaching-ai-agents-to-think-like-data-engineers.txt", "jsonld": "https://wpnews.pro/news/ghost-skills-teaching-ai-agents-to-think-like-data-engineers.jsonld"}}