# Agnostic Cluster Refactor Skill for Antigravity CLI: Building an AI Agent that Migrates Apps from AWS to GKE (Subagents, HITL Gate & Workload Identity)

> Source: <https://dev.to/gde/agnostic-cluster-refactor-skill-for-antigrafity-cli-building-an-ai-agent-that-migrates-apps-from-e0>
> Published: 2026-06-30 15:10:30+00:00

Have you ever inherited a codebase where `import boto3` appears in 47 different files? Where AWS credentials live in hardcoded environment variables and file storage is a `file.save("/tmp/...")` that will blow up the moment it hits an ephemeral Kubernetes pod?

I did. And instead of refactoring everything by hand, I built an AI agent to do it for me — with mandatory human oversight before any production mutation.

This article documents what I built: a **skill for the Antigravity CLI** (`agy`) that scans cloud dependencies, spawns parallel subagents to refactor code and infrastructure, and validates everything on local Kubernetes before deploying to GKE with keyless Workload Identity.

`boto3` is the AWS SDK for Python. It seems harmless at first:

``` python
# Innocent on day 1
import boto3
s3 = boto3.client('s3', region_name='us-east-1')
s3.upload_fileobj(file, bucket_name, filename)
```

Six months later:

``` python
# examples/legacy-app/app.py — the real state after it grows
import os
import boto3
from flask import Flask, request, jsonify

app = Flask(__name__)

# "Temporary" hardcoded since 2022
DB_PASSWORD = os.getenv("DB_PASSWORD", "default-insecure-password")

S3_BUCKET = os.getenv("AWS_S3_BUCKET_NAME")
AWS_REGION = os.getenv("AWS_DEFAULT_REGION", "us-east-1")

s3_client = boto3.client(
    's3',
    aws_access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
    aws_secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY"),
    region_name=AWS_REGION
)

@app.route("/upload", methods=["POST"])
def upload_file():
    file = request.files['file']
    filename = file.filename
    if S3_BUCKET:
        s3_client.upload_fileobj(file, S3_BUCKET, filename)
        return jsonify({"message": f"Uploaded to AWS S3: {S3_BUCKET}"})
    else:
        # Fallback to local disk — will break in K8s ephemeral pods
        local_path = os.path.join("/tmp", filename)
        file.save(local_path)
        return jsonify({"message": f"Saved locally at {local_path}"})
```

Three coupling problems in a single file: a proprietary SDK (`boto3`), AWS-specific credentials, and local disk storage that doesn't survive ephemeral Kubernetes pods.

Now multiply that by 10 services.

A **skill** for the Antigravity CLI that adds two commands to the agent chat:

```
/agnostic-cluster-refactor:scan-deps
/agnostic-cluster-refactor:spawn-refactor
```

But before diving into the code, let me introduce the players.

`agy` is not a script. It's an LLM-powered agent: you describe what you want in the chat and it decides how to do it, using a toolset of `read_file`, `write_to_file`, `run_command`, and `invoke_subagent`.

The difference from a web chatbot: `agy` has access to your local filesystem, runs terminal commands, and operates in autonomous loops. It's an engineer working on your machine.

| Script | Agent |
|---|---|
| `sed 's/boto3/gcs/g'` across all files | Analyzes the semantic context of each import and replaces it with the correct equivalent API |
| Fails if the environment changed | Adapts to the current state |
| Deterministic | Probabilistic + adaptive |

A skill is a `SKILL.md` file with YAML frontmatter that defines when and how the agent uses that capability. The agent reads the `description` field and decides whether the skill is relevant to the current task.

```
---
name: scan-deps
description: Scans the project for cloud-provider dependencies and generates
             dependency-map.json. Use when the user wants to map vendor lock-in
             before migrating to GKE.
---

## Steps

1. Ask which directory to scan
2. Run: python3 .agents/skills/.../scan_deps.py <PATH>
3. Present the DAG summary
```
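To make the frontmatter concrete, here is a minimal sketch of how the `name` and `description` fields could be read out of a `SKILL.md`. It is an illustration only, not how `agy` itself loads skills: `parse_frontmatter` is a hypothetical helper, and it handles just flat `key: value` pairs with indented continuation lines rather than full YAML.

```python
def parse_frontmatter(text: str) -> dict[str, str]:
    """Parse simple `key: value` frontmatter delimited by `---` lines.

    Indented continuation lines are folded into the previous value.
    This is a deliberately tiny parser, not a full YAML implementation.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta: dict[str, str] = {}
    key = None
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of the frontmatter block
        if line[:1].isspace() and key is not None:
            # Continuation of a multi-line value
            meta[key] += " " + line.strip()
        elif ":" in line:
            key, _, value = line.partition(":")
            key = key.strip()
            meta[key] = value.strip()
    return meta


SKILL_MD = """---
name: scan-deps
description: Scans the project for cloud-provider dependencies and generates
             dependency-map.json.
---

## Steps
"""

meta = parse_frontmatter(SKILL_MD)
print(meta["name"])
print(meta["description"])
```

The `description` comes back as one folded string, which is the text the agent would weigh when deciding whether the skill applies.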

💡 **Key distinction:** skills in `.agents/skills/` are injected silently into context. To appear as a `/command` in autocomplete, you need a **plugin** installed at `~/.gemini/config/plugins/<plugin>/`. More on that in Part 6.

A subagent is a child agent with completely isolated context. It doesn't "see" the parent's history or the other subagent's — exactly what we want: the Backend agent can't get confused by the YAML the Infra agent is writing.

```
# Pseudocode — how agy orchestrates this
invoke_subagent(
    name="backend-engine",
    system_prompt="You are an expert in migrating boto3 to GCS...",
    toolset=["read_file", "write_to_file", "run_command"],
    workspace="/path/to/shadow-worktree-backend",
    message="Refactor the files from dependency-map.json"
)
# Subagent B is invoked in parallel — no blocking
invoke_subagent(
    name="infra-engine",
    toolset=["write_to_file"],  # write only — principle of least privilege
    workspace="/path/to/shadow-worktree-infra",
    message="Generate serviceaccount.yaml, deployment.yaml, ingress.yaml for GKE"
)
```

Each subagent operates in an isolated **Git Worktree**: a separate working directory checked out on its own branch, backed by the same repository. If Subagent A introduces a bug, `main` stays untouched.
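As a rough sketch of the worktree setup, the helper below builds the `git worktree add -b <branch> <path>` command for each subagent. The branch names (`refactor/backend`, `refactor/infra`) and relative paths are assumptions chosen to mirror the workspaces in the pseudocode above; the article does not state what the skill actually uses.

```python
import subprocess


def worktree_add_cmd(repo: str, path: str, branch: str) -> list[str]:
    """Build `git worktree add -b <branch> <path>` for the given repository."""
    return ["git", "-C", repo, "worktree", "add", "-b", branch, path]


def create_worktree(repo: str, path: str, branch: str) -> None:
    """Create the worktree; raises CalledProcessError if git fails."""
    subprocess.run(worktree_add_cmd(repo, path, branch), check=True)


# Hypothetical names: one worktree and one branch per subagent
for name in ("backend", "infra"):
    cmd = worktree_add_cmd(".", f"../shadow-worktree-{name}", f"refactor/{name}")
    print(" ".join(cmd))
```

Because each worktree lives on its own branch, reviewing a subagent's output is an ordinary branch diff against `main`.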

The first step is mapping the problem. `scan_deps.py` walks the project with `os.walk()`, applies regex patterns by category, and generates a DAG (Directed Acyclic Graph) as JSON.

```
# scripts/scan_deps.py
import os
import re
import sys

patterns = {
    "storage": [
        r"google\.cloud\.storage",
        r"boto3.*s3",         # AWS-coupled
        r"aws-sdk.*s3"
    ],
    "messaging": [
        r"google\.cloud\.pubsub",
        r"boto3.*sqs",        # AWS-coupled
        r"kafka-python",
    ],
    "secrets": [
        r"boto3.*secretsmanager",
        r"python-dotenv",
    ],
    "databases": [
        r"psycopg2", r"pymongo"
    ]
}

# Assumption for this excerpt: the project path is the first CLI argument
path = sys.argv[1] if len(sys.argv) > 1 else "."
dependencies = {dep_type: [] for dep_type in patterns}

for root, dirs, files in os.walk(path):
    # Prune hidden and vendored directories in place so os.walk skips them
    dirs[:] = [d for d in dirs if not d.startswith('.')
               and d not in ['venv', 'node_modules', '__pycache__']]
    for file in files:
        if not file.endswith(('.py', '.js', '.yaml', '.tf')):
            continue
        file_path = os.path.join(root, file)
        with open(file_path, encoding="utf-8", errors="ignore") as f:
            content = f.read()
        # Record every pattern that matches anywhere in the file
        for dep_type, pattern_list in patterns.items():
            for pattern in pattern_list:
                if re.search(pattern, content, re.IGNORECASE):
                    dependencies[dep_type].append({
                        "file": os.path.relpath(file_path, path),
                        "matched_pattern": pattern
                    })
```
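One property of these patterns is worth seeing in isolation: without `re.DOTALL`, `.` does not match a newline, so `boto3.*s3` only fires when both tokens sit on the same line. The sketch below uses a subset of the patterns; `categorize` and `make_client` are hypothetical names used for illustration.

```python
import re

# Subset of the patterns from scan_deps.py
PATTERNS = {
    "storage": [r"boto3.*s3", r"google\.cloud\.storage"],
    "messaging": [r"boto3.*sqs", r"google\.cloud\.pubsub"],
}


def categorize(content: str) -> set[str]:
    """Return the dependency categories whose patterns match the content."""
    return {
        dep_type
        for dep_type, pattern_list in PATTERNS.items()
        if any(re.search(p, content, re.IGNORECASE) for p in pattern_list)
    }


same_line = "s3 = boto3.client('s3', region_name='us-east-1')"
split_lines = "import boto3\nclient = make_client(\n    's3'\n)"

print(categorize(same_line))    # {'storage'}
print(categorize(split_lines))  # set()
```

The second case is a false negative: the file imports `boto3` and uses S3, but never on one line. Treat the scanner as a fast first pass rather than a complete inventory.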

The output is a `dependency-map.json` with the full dependency graph:

```
{
  "dependencies": {
    "storage": [
      { "file": "examples/legacy-app/app.py", "matched_pattern": "boto3.*s3" },
      { "file": "examples/legacy-app/api.py",  "matched_pattern": "boto3.*s3" }
    ],
    "messaging": [
      { "file": "examples/legacy-app/worker.py", "matched_pattern": "boto3.*sqs" }
    ]
  },
  "architectural_dag": {
    "nodes": [
      { "id": "application",   "type": "component" },
      { "id": "dep-storage",   "files": ["app.py", "api.py"] },
      { "id": "provider-aws",  "type": "cloud-provider" }
    ],
    "edges": [
      { "source": "application",  "target": "dep-storage",  "relation": "uses_storage"   },
      { "source": "dep-storage",  "target": "provider-aws", "relation": "coupled_to_aws" }
    ]
  },
  "recommended_action": "Execute '/spawn-refactor' targeting GCP GKE"
}
```
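The `architectural_dag` section can be derived mechanically from the `dependencies` section. The sketch below shows one way to do that; it is not the repository's implementation. In particular, the rule that a pattern mentioning `boto3` or `aws-sdk` implies AWS coupling is an assumption, as is the `build_dag` name.

```python
import os

# Assumption: a pattern marks AWS coupling if it mentions boto3 or aws-sdk
AWS_MARKERS = ("boto3", "aws-sdk")


def build_dag(dependencies: dict[str, list[dict]]) -> dict:
    """Derive nodes and edges from the per-category dependency lists."""
    nodes = [{"id": "application", "type": "component"}]
    edges = []
    aws_coupled = False
    for dep_type, hits in dependencies.items():
        if not hits:
            continue  # categories with no matches get no node
        dep_id = f"dep-{dep_type}"
        files = sorted({os.path.basename(h["file"]) for h in hits})
        nodes.append({"id": dep_id, "files": files})
        edges.append({"source": "application", "target": dep_id,
                      "relation": f"uses_{dep_type}"})
        if any(m in h["matched_pattern"] for h in hits for m in AWS_MARKERS):
            aws_coupled = True
            edges.append({"source": dep_id, "target": "provider-aws",
                          "relation": "coupled_to_aws"})
    if aws_coupled:
        nodes.append({"id": "provider-aws", "type": "cloud-provider"})
    return {"nodes": nodes, "edges": edges}


deps = {
    "storage": [
        {"file": "examples/legacy-app/app.py", "matched_pattern": "boto3.*s3"},
        {"file": "examples/legacy-app/api.py", "matched_pattern": "boto3.*s3"},
    ],
    "messaging": [],
}
dag = build_dag(deps)
for edge in dag["edges"]:
    print(edge["source"], "->", edge["target"], f"({edge['relation']})")
```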

❓ **Why a DAG and not a plain list?** The graph reveals transitive relationships: `app.py` and `worker.py` both depend on AWS via `boto3`, so they need to be refactored together. A list would only say "these files have boto3."
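A short traversal shows what the graph buys you. The sketch below walks the edges and collects every file attached to a node that can reach a given provider. The function name and the extra `dep-messaging` and `dep-databases` nodes (including `models.py`) are hypothetical, added to give the query something to exclude.

```python
from collections import defaultdict


def files_coupled_to(dag: dict, provider: str) -> set[str]:
    """Collect files on every node that has a path to `provider`."""
    successors = defaultdict(set)
    for edge in dag["edges"]:
        successors[edge["source"]].add(edge["target"])

    def reaches(node: str, seen: set[str]) -> bool:
        if node == provider:
            return True
        seen.add(node)
        return any(reaches(n, seen) for n in successors[node] - seen)

    return {
        f
        for node in dag["nodes"]
        if node["id"] != provider and reaches(node["id"], set())
        for f in node.get("files", [])
    }


dag = {
    "nodes": [
        {"id": "application", "type": "component"},
        {"id": "dep-storage", "files": ["app.py", "api.py"]},
        {"id": "dep-messaging", "files": ["worker.py"]},
        {"id": "dep-databases", "files": ["models.py"]},
        {"id": "provider-aws", "type": "cloud-provider"},
    ],
    "edges": [
        {"source": "application", "target": "dep-storage", "relation": "uses_storage"},
        {"source": "application", "target": "dep-messaging", "relation": "uses_messaging"},
        {"source": "application", "target": "dep-databases", "relation": "uses_databases"},
        {"source": "dep-storage", "target": "provider-aws", "relation": "coupled_to_aws"},
        {"source": "dep-messaging", "target": "provider-aws", "relation": "coupled_to_aws"},
    ],
}
print(sorted(files_coupled_to(dag, "provider-aws")))
# ['api.py', 'app.py', 'worker.py']
```

The database files stay out of the result because no edge connects `dep-databases` to the provider, which is exactly the grouping a flat list cannot express.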

This was the most important design decision: how do I ensure the agent doesn't refactor the wrong file without me seeing what's happening first?

The answer lives in two places.

The `.agents/hooks.json` file registers a `PreToolUse` hook, a command that runs **before** any `write_to_file` the agent attempts:

```
{
  "hitl-production-gate": {
    "enabled": true,
    "PreToolUse": [
      {
        "matcher": "write_to_file|replace_file_content|multi_replace_file_content",
        "hooks": [
          {
            "type": "command",
            "command": "python3 .agents/skills/agnostic-cluster-refactor/scripts/scan_deps.py --check-only",
            "timeout": 5
          }
        ]
      }
    ]
  }
}
```

The hook receives a JSON payload via stdin and responds with a decision:

```
# scan_deps.py — --check-only mode
import json
import os
import sys

# Directories the agent may write to without a human prompt
SAFE_WRITE_PREFIXES = ("examples/", "terraform/", ".agents/")

def check_only_hook():
    # The pending tool call arrives as JSON on stdin
    payload = json.load(sys.stdin)
    target = payload.get("toolCall", {}).get("args", {}).get("TargetFile", "")
    workspace_root = payload.get("workspacePaths", ["."])[0]
    # os.path.relpath raises on an empty path, so a missing target falls through to force_ask
    rel_path = os.path.relpath(target, workspace_root) if target else ""

    if not any(rel_path.startswith(p) for p in SAFE_WRITE_PREFIXES):
        print(json.dumps({
            "decision": "force_ask",
            "reason": f"[HITL Gate] '{rel_path}' is outside safe directories. Confirm before proceeding."
        }))
    else:
        print(json.dumps({"decision": "allow"}))
```

Three possible decisions the hook can return:

| Decision | Effect |
|---|---|
| `"allow"` | Agent proceeds automatically |
| `"force_ask"` | agy pauses and asks the human |
| `"deny"` | Completely blocked, no prompt |
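The hook shown earlier only ever returns `allow` or `force_ask`. As a sketch of how the third decision could fit in, the variant below adds a deny list and rejects paths that escape the workspace. Both `DENY_PREFIXES` and the escape rule are assumptions made for illustration, not part of the project's hook.

```python
import json
import posixpath

SAFE_WRITE_PREFIXES = ("examples/", "terraform/", ".agents/")
# Hypothetical deny list: paths the agent should never write to
DENY_PREFIXES = (".git/", "secrets/")


def decide(rel_path: str) -> dict:
    """Map a workspace-relative path to an allow / force_ask / deny decision."""
    # Normalize first so "examples/../.git/config" cannot pass as "examples/"
    rel_path = posixpath.normpath(rel_path)
    if rel_path == ".." or rel_path.startswith("../") or posixpath.isabs(rel_path):
        return {"decision": "deny", "reason": f"'{rel_path}' escapes the workspace."}
    if any(rel_path.startswith(p) for p in DENY_PREFIXES):
        return {"decision": "deny", "reason": f"'{rel_path}' is protected."}
    if any(rel_path.startswith(p) for p in SAFE_WRITE_PREFIXES):
        return {"decision": "allow"}
    return {"decision": "force_ask",
            "reason": f"[HITL Gate] '{rel_path}' is outside safe directories."}


for path in ("examples/k8s/deployment.yaml", "src/app.py",
             "examples/../.git/config"):
    print(path, "->", json.dumps(decide(path)))
```

Normalizing before the prefix check matters: a plain `startswith("examples/")` would wave through a path that climbs back out of the safe directory.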

Testing it from the command line:

```
# File OUTSIDE safe directories
echo '{"toolCall":{"name":"write_to_file","args":{"TargetFile":"/project/src/app.py"}},
      "workspacePaths":["/project"]}' | python3 scripts/scan_deps.py --check-only
# → {"decision": "force_ask", "reason": "[HITL Gate] 'src/app.py' is outside safe directories..."}

# File INSIDE safe directories
echo '{"toolCall":{"name":"write_to_file","args":{"TargetFile":"/project/examples/k8s/deployment.yaml"}},
      "workspacePaths":["/project"]}' | python3 scripts/scan_deps.py --check-only
# → {"decision": "allow"}
```

Beyond the automatic hook, the `SKILL.md` for `/spawn-refactor` instructs the agent to always ask for explicit confirmation before spawning subagents:

```
## HITL Gate — mandatory before any mutation

Display the list of files that will be changed and ask:

  The following files will be modified:
    - examples/legacy-app/app.py    (replace boto3 → GCS)
    - examples/legacy-app/worker.py (replace SQS → Pub/Sub)

  Type YES to confirm or NO to abort.

Halt if the user does not confirm with YES.
```
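In the skill this gate is enforced through prompt instructions, so the agent does the asking. If you wanted the same rule in deterministic code, a minimal sketch could look like the following; `render_plan` and `is_confirmed` are hypothetical helpers, not part of the project.

```python
def render_plan(changes: dict[str, str]) -> str:
    """Format the list of planned file changes shown to the user."""
    width = max(len(path) for path in changes)
    lines = ["The following files will be modified:"]
    lines += [f"  - {path.ljust(width)} ({why})" for path, why in changes.items()]
    lines.append("Type YES to confirm or NO to abort.")
    return "\n".join(lines)


def is_confirmed(answer: str) -> bool:
    """Only the exact word YES confirms; anything else aborts."""
    return answer.strip() == "YES"


plan = {
    "examples/legacy-app/app.py": "replace boto3 → GCS",
    "examples/legacy-app/worker.py": "replace SQS → Pub/Sub",
}
print(render_plan(plan))
print(is_confirmed("yes"))  # False: lowercase does not count
```

Requiring the exact string keeps an accidental Enter or a casual "y" from being read as consent.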

🛡️ Two layers of protection: the hook catches any write automatically, and the SKILL.md forces you to see the full plan before anything moves.

After Subagent A runs, `app.py` goes from the boto3 mess above to this:

``` python
# examples/refactored-app/app.py
import os
from flask import Flask, request, jsonify

app = Flask(__name__)

DB_PASSWORD = os.getenv("DB_PASSWORD")
if not DB_PASSWORD:
    raise RuntimeError("DB_PASSWORD environment variable is required!")

GCS_BUCKET_NAME = os.getenv("GCS_BUCKET_NAME", "local-mock")

# LOCAL_MOCK=true → bypasses GCS; useful for K8s plumbing tests without real credentials
LOCAL_MOCK = os.getenv("LOCAL_MOCK", "false").lower() == "true"

if LOCAL_MOCK:
    storage_client = None
    print("[LOCAL_MOCK] GCS disabled. Uploads will be simulated.")
else:
    from google.cloud import storage  # import only when we actually need GCS
    storage_client = storage.Client()  # zero credentials — ADC via Workload Identity

_mock_store: dict[str, bytes] = {}

@app.route("/health", methods=["GET"])
def health():
    return jsonify({
        "status": "healthy",
        "platform": "local-k8s" if LOCAL_MOCK else "gcp-gke",
        "gcs_bucket": GCS_BUCKET_NAME,
        "mock_mode": LOCAL_MOCK,
    })

@app.route("/upload", methods=["POST"])
def upload_file():
    file = request.files["file"]
    filename = file.filename

    if LOCAL_MOCK:
        data = file.read()
        _mock_store[filename] = data
        return jsonify({
            "message": f"[LOCAL_MOCK] {filename} stored in memory ({len(data)} bytes)",
            "gcs_uri": f"gs://local-mock/{filename}",
            "files_in_mock": list(_mock_store.keys()),
        })

    bucket = storage_client.bucket(GCS_BUCKET_NAME)
    blob = bucket.blob(filename)
    blob.upload_from_file(file)
    return jsonify({
        "message": f"Uploaded {filename} to {GCS_BUCKET_NAME}",
        "gcs_uri": f"gs://{GCS_BUCKET_NAME}/{filename}",
    })

@app.route("/files", methods=["GET"])
def list_files():
    if LOCAL_MOCK:
        return jsonify({"files": list(_mock_store.keys()), "source": "local-mock"})
    blobs = storage_client.list_blobs(GCS_BUCKET_NAME)
    return jsonify({"files": [b.name for b in blobs], "source": f"gs://{GCS_BUCKET_NAME}"})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080, debug=LOCAL_MOCK)
```

**What changed:**

| Before | After |
|---|---|
| `import boto3` | `from google.cloud import storage` (conditional) |
| `boto3.client('s3', aws_access_key_id=...)` | `storage.Client()` (zero credentials) |
| `file.save("/tmp/...")` | `blob.upload_from_file(file)` |
| `DB_PASSWORD` with insecure default | `RuntimeError` if missing |

```
# ❌ Wrong — crashes at startup without GCP credentials
from google.cloud import storage
storage_client = storage.Client()   # raises DefaultCredentialsError before any request is handled

# ✅ Correct — import only happens when we actually need it
if LOCAL_MOCK:
    storage_client = None
else:
    from google.cloud import storage   # ← inside the else block
    storage_client = storage.Client()
```

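At module level, `from google.cloud import storage` and `storage.Client()` both execute when Python loads the module, before any request is served. The import itself only needs the package to be installed; it is the `storage.Client()` call that looks up credentials and, finding none, crashes the app at startup. Moving both inside the `else` fixes it: with `LOCAL_MOCK=true`, the module is never imported and no client is created.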
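An alternative that reaches the same goal is to defer both the import and the client construction to first use. This is a sketch of that variant, not the approach in the refactored `app.py`; `get_storage_client` is a hypothetical name.

```python
import os
from functools import lru_cache


@lru_cache(maxsize=1)
def get_storage_client():
    """Return a GCS client, or None in mock mode; built once on first call."""
    if os.getenv("LOCAL_MOCK", "false").lower() == "true":
        return None
    # Deferred import: only reached when real GCS access is requested
    from google.cloud import storage
    return storage.Client()


os.environ["LOCAL_MOCK"] = "true"
print(get_storage_client())  # None
```

The trade-off is that a credentials problem would surface on the first request instead of at startup, which may or may not be what you want from a readiness probe.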

I wanted to validate the entire K8s stack (Deployment, ConfigMap, Secret, Service, health checks, routing) locally using Docker Desktop — without needing real GCP credentials.

The solution was `LOCAL_MOCK=true` combined with a Docker Desktop quirk that catches a lot of people off guard.

Docker Desktop uses **two completely separate runtimes** that don't share images:

```
┌──────────────────────────────────────┐
│  Docker daemon                       │  ← docker build, docker images
│  (images here are NOT visible to K8s)│
└──────────────────────────────────────┘

┌──────────────────────────────────────┐
│  containerd                          │  ← used by the Kubernetes cluster
│  (separate namespace)                │
└──────────────────────────────────────┘
```

When you run `docker build -t my-image .`, the image exists in the Docker daemon but **not** in containerd. With `imagePullPolicy: Never`, K8s looks in containerd and fails:

```
Failed to pull image "my-image:local": ErrImageNeverPull
```

The fix: a **local registry** as the bridge between both runtimes.

```
# registry:2 on port 5001 (port 5000 is taken by macOS AirPlay)
docker run -d -p 5001:5000 --restart=always --name local-registry registry:2
```

Now the flow works end-to-end:

```
docker build → Docker daemon
      ↓
docker tag + push → localhost:5001 → registry:2
      ↓
containerd pulls from registry:2 ← K8s Pod starts successfully
```
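To confirm that the push landed where Kubernetes will look, you can query the registry's repository listing. The sketch below assumes the registry from the previous command is listening on `localhost:5001` and uses the `/v2/_catalog` endpoint of the Registry HTTP API; the helper names are hypothetical.

```python
import json
import urllib.request


def catalog_url(registry: str) -> str:
    """URL of the registry's repository listing (Registry HTTP API v2)."""
    return f"http://{registry}/v2/_catalog"


def has_repository(catalog_body: str, repository: str) -> bool:
    """Check a `/v2/_catalog` JSON response for a repository name."""
    return repository in json.loads(catalog_body).get("repositories", [])


def fetch_catalog(registry: str = "localhost:5001") -> str:
    """Fetch the catalog; requires the local registry to be running."""
    with urllib.request.urlopen(catalog_url(registry), timeout=5) as resp:
        return resp.read().decode("utf-8")


# Sample response body, so the check can be tried without a running registry
sample = '{"repositories": ["agnostic-cluster-refactor"]}'
print(has_repository(sample, "agnostic-cluster-refactor"))  # True
```

With the registry up, `has_repository(fetch_catalog(), "agnostic-cluster-refactor")` performs the same check against the live listing.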

The `Makefile` handles all of this in a single command:

```
REGISTRY       = localhost:5001
REGISTRY_IMAGE = $(REGISTRY)/agnostic-cluster-refactor:local

registry-start:
    @docker ps --filter name=local-registry --filter status=running | grep local-registry || \
        docker run -d -p 5001:5000 --restart=always --name local-registry registry:2

build: registry-start
    docker build -t agnostic-cluster-refactor:local .
    docker tag agnostic-cluster-refactor:local $(REGISTRY_IMAGE)
    docker push $(REGISTRY_IMAGE)
    @echo "Image available to K8s: $(REGISTRY_IMAGE)"

local-up:
    kubectl config use-context docker-desktop
    kubectl apply -f examples/k8s/local/secret-db.yaml
    kubectl apply -f examples/k8s/local/configmap.local.yaml
    kubectl apply -f examples/k8s/local/deployment.local.yaml
    kubectl apply -f examples/k8s/local/service.local.yaml
    kubectl rollout status deployment/agnostic-cluster-app --timeout=60s
    @echo "Access: http://localhost:8080/health"
```

```
# examples/k8s/local/deployment.local.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: agnostic-cluster-app
spec:
  replicas: 1
  template:
    spec:
      containers:
        - name: app
          image: localhost:5001/agnostic-cluster-refactor:local
          imagePullPolicy: Always   # always pull from local registry
          envFrom:
            - configMapRef:
                name: app-config-local   # injects LOCAL_MOCK=true
          env:
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: app-secrets
                  key: db-password
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 5
```

```
# examples/k8s/local/configmap.local.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config-local
data:
  GCS_BUCKET_NAME: "local-mock"
  GCP_PROJECT_ID: "local-dev"
  LOCAL_MOCK: "true"    # ← activates the in-memory store
```

Running it:

```
make build      # build + push to local registry
make local-up   # apply all manifests

curl http://localhost:8080/health
# {"status":"healthy","platform":"local-k8s","mock_mode":true,"gcs_bucket":"local-mock"}

curl -X POST http://localhost:8080/upload -F "file=@package.json"
# {"message":"[LOCAL_MOCK] package.json stored in memory (842 bytes)",
#  "gcs_uri":"gs://local-mock/package.json"}

curl http://localhost:8080/files
# {"files":["package.json"],"source":"local-mock"}

make local-down  # teardown
```

✅ Entire K8s stack validated — Deployment, ConfigMap, Secret, Service, health checks, routing — without a single GCP token.

On GKE, the story is completely different.

**The naive approach:**

```
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/app/sa-key.json"
storage_client = storage.Client()
```

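This requires a JSON key file inside the container, which means a long-lived credential that can leak through logs, has to be rotated by hand, and tends to end up baked into a Dockerfile or mounted as a Secret volume on the Pod's disk.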

**The Workload Identity approach:** annotate a Kubernetes Service Account (KSA) with a Google Service Account (GSA) email:

```
# examples/k8s/serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: refactored-app-ksa
  annotations:
    iam.gke.io/gcp-service-account: "gke-app-sa@MY_PROJECT.iam.gserviceaccount.com"
```

GKE's internal metadata server intercepts Application Default Credentials (ADC) calls from Pods, verifies the annotation, and returns a short-lived OAuth2 token.

The application code becomes:

```
# Zero credentials — works automatically on GKE
storage_client = storage.Client()
```
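The client library handles the token exchange for you, so you never write this yourself. Purely to illustrate what an ADC lookup roughly amounts to on GKE, the sketch below builds the request to the well-known metadata endpoint. The helper names are hypothetical, and `fetch_access_token` only resolves from inside a Pod or VM.

```python
import json
import urllib.request

# Well-known metadata endpoint for the default service account's token
TOKEN_URL = ("http://metadata.google.internal/computeMetadata/v1/"
             "instance/service-accounts/default/token")


def build_token_request() -> urllib.request.Request:
    """Build the token request; the Metadata-Flavor header is mandatory."""
    return urllib.request.Request(TOKEN_URL, headers={"Metadata-Flavor": "Google"})


def fetch_access_token() -> str:
    """Only resolves inside a Pod or VM where the metadata server is reachable."""
    with urllib.request.urlopen(build_token_request(), timeout=5) as resp:
        return json.load(resp)["access_token"]


req = build_token_request()
print(req.full_url)
```

No key material appears anywhere in this exchange: the Pod's identity is established by where the request comes from, not by a secret it carries.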

Terraform provisions the IAM binding automatically:

```
# terraform/iam.tf
resource "google_service_account_iam_member" "workload_identity" {
  service_account_id = google_service_account.app.name
  role               = "roles/iam.workloadIdentityUser"
  member = "serviceAccount:${var.project_id}.svc.id.goog[default/refactored-app-ksa]"
}
```
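The `member` string is the part most often mistyped, because it encodes three things at once: the project's workload pool, the Kubernetes namespace, and the KSA name. A tiny helper makes that structure explicit; `workload_identity_member` is a hypothetical name and `my-project` is a placeholder project ID.

```python
def workload_identity_member(project_id: str, namespace: str, ksa: str) -> str:
    """Format the IAM member that identifies a Kubernetes Service Account."""
    return f"serviceAccount:{project_id}.svc.id.goog[{namespace}/{ksa}]"


member = workload_identity_member("my-project", "default", "refactored-app-ksa")
print(member)
# serviceAccount:my-project.svc.id.goog[default/refactored-app-ksa]
```

If the Deployment runs in a namespace other than `default`, the namespace segment has to change with it, or the binding will not match the Pod's KSA.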

🔐 This binding is the handshake between the Kubernetes world and GCP IAM. Without it, no token is issued, and requests made through `storage.Client()` fail with a 403.

When I first tested, `/scan-deps` and `/spawn-refactor` **did not appear** in the `agy` autocomplete. I spent a good chunk of time debugging this.

The discovery: `agy` has three distinct skill-loading mechanisms:

| Mechanism | Location | Shows in `/` autocomplete? |
|---|---|---|
| Project skill | `.agents/skills/<name>/SKILL.md` | ❌ No |
| Global contextual skill | `~/.gemini/antigravity-cli/skills/` | ❌ No |
| Plugin with namespace | `~/.gemini/config/plugins/<plugin>/` | ✅ Yes |

To make the commands appear, create the plugin structure:

```
mkdir -p ~/.gemini/config/plugins/agnostic-cluster-refactor/skills/scan-deps
mkdir -p ~/.gemini/config/plugins/agnostic-cluster-refactor/skills/spawn-refactor

cat > ~/.gemini/config/plugins/agnostic-cluster-refactor/plugin.json << 'EOF'
{
  "name": "agnostic-cluster-refactor",
  "version": "1.0.0",
  "description": "Migrates apps from AWS to GCP GKE with Workload Identity."
}
EOF
```

After restarting `agy`, the autocomplete shows:

```
/agnostic-cluster-refactor:scan-deps
/agnostic-cluster-refactor:spawn-refactor
```

The namespace prevents collisions: two different plugins can both have a skill named `scan-deps` and they'll appear as `/plugin-a:scan-deps` and `/plugin-b:scan-deps`.
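The naming scheme is easy to reproduce: plugin directory name, a colon, then skill directory name. The sketch below derives command names from a `plugins/<plugin>/skills/<skill>/` layout. It is a guess at the convention based on the observed output, not `agy`'s actual discovery code, and `list_commands` is a hypothetical helper.

```python
import tempfile
from pathlib import Path


def list_commands(plugins_root: Path) -> list[str]:
    """Return `/plugin:skill` for every plugins/<plugin>/skills/<skill>/ dir."""
    commands = []
    for skill_dir in plugins_root.glob("*/skills/*"):
        if skill_dir.is_dir():
            plugin = skill_dir.parent.parent.name
            commands.append(f"/{plugin}:{skill_dir.name}")
    return sorted(commands)


# Demo against a throwaway directory; plugin-a and plugin-b are made-up names
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for plugin in ("plugin-a", "plugin-b"):
        (root / plugin / "skills" / "scan-deps").mkdir(parents=True)
    commands = list_commands(root)
    print(commands)  # ['/plugin-a:scan-deps', '/plugin-b:scan-deps']
```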

When I ran `/agnostic-cluster-refactor:spawn-refactor` and confirmed the HITL Gate, Gemini (the `agy` engine) orchestrated:

**Subagent A (Backend), in shadow-worktree-backend:**

- Read `dependency-map.json` to identify boto3 files
- Replaced `import boto3` → `from google.cloud import storage, pubsub_v1` in each file
- Converted `boto3.client('s3', ...)` → `storage.Client().bucket(...)` with semantically equivalent calls
- Converted `boto3.client('sqs', ...)` → `pubsub_v1.SubscriberClient()`
- Updated `requirements.txt`: removed `boto3==1.28.0`, added `google-cloud-storage==2.10.0` and `google-cloud-pubsub==2.18.0`

**Subagent B (Infra), in shadow-worktree-infra:**

- Generated `serviceaccount.yaml` with the `iam.gke.io/gcp-service-account` annotation
- Generated `deployment.yaml` with env vars via ConfigMap/Secret and no hardcoded credentials
- Generated `ingress.yaml` with `ingressClassName: gce` (the current format, not the deprecated annotation)

All in isolated Git Worktrees, in parallel, without touching `main`.

**1. The conditional import is intentional, not lazy.**

When `LOCAL_MOCK=true`, `from google.cloud import storage` and the `storage.Client()` call that follows it must not run at module level. Without GCP credentials, creating the client throws at startup before any request is served. Import conditionally.

**2. Docker Desktop K8s and the Docker daemon live in separate worlds.**

`imagePullPolicy: Never` breaks with Docker Desktop because K8s uses containerd, not the daemon. Use a local registry on port 5001 (5000 is taken by macOS AirPlay) and `imagePullPolicy: Always`.

**3. `.agents/workflows/` does not create slash commands in `agy`.**

Skills in `.agents/skills/` are context injections, not interactive commands. The `/` autocomplete requires a plugin installed in `~/.gemini/config/plugins/`.

**4. The HITL Gate needs two independent layers.**

A hook catches unexpected writes automatically. But for `/spawn-refactor`, which modifies multiple files in parallel, explicit plan confirmation in the SKILL.md is non-negotiable. Without both layers, the agent can act before you understand the blast radius.

**5. Workload Identity eliminates an entire security problem class.**

No JSON keys in containers means no credential leaks in logs, no manual rotation, no hardcoded keys in Dockerfiles, and no Secret volumes mounted on Pod disk. The Metadata Server's short-lived tokens are genuinely safer.

```
# Clone
git clone https://github.com/carlosrgomes/agnostic-cluster-refactor
cd agnostic-cluster-refactor

# Test locally without GCP (Docker Desktop K8s)
make build      # build + push to local registry
make local-up   # apply manifests to docker-desktop context
curl http://localhost:8080/health

# Scan your own project's dependencies
python3 scripts/scan_deps.py /path/to/your/project
cat dependency-map.json | python3 -m json.tool

# Validate the HITL Gate hook
echo '{"toolCall":{"name":"write_to_file","args":{"TargetFile":"/project/src/main.py"}},
      "workspacePaths":["/project"]}' | python3 scripts/scan_deps.py --check-only
# → {"decision": "force_ask", ...}

# Teardown
make local-down
```

For the full GKE deployment with Workload Identity, the project README includes the Terraform that provisions all the infrastructure.

The project started from a real problem (boto3 everywhere) and ended up with a surprisingly complete solution: automatic dependency scanning, parallel subagent refactoring, mandatory human oversight, local K8s testing without cloud credentials, and keyless production auth.

What impressed me most wasn't the AI doing the refactoring. It was the **supervision system design**: hooks intercepting any write outside safe directories, SKILL.md with an explicit gate before destructive actions, and Git Worktrees ensuring `main` is never touched without human review.

An autonomous agent without oversight is a chaotic script. An agent with a well-designed HITL Gate is a trustworthy teammate.

