Agnostic Cluster Refactor Skill for Antigravity CLI: Building an AI Agent that Migrates Apps from AWS to GKE (Subagents, HITL Gate & Workload Identity)

Have you ever inherited a codebase where `import boto3` appears in 47 different files? Where AWS credentials live in hardcoded environment variables and file storage is a `file.save("/tmp/...")` that will blow up the moment it hits an ephemeral Kubernetes pod?

I did. And instead of refactoring everything by hand, I built an AI agent to do it for me — with mandatory human oversight before any production mutation.

This article documents what I built: a skill for the Antigravity CLI (`agy`) that scans cloud dependencies, spawns parallel subagents to refactor code and infrastructure, and validates everything on local Kubernetes before deploying to GKE with keyless Workload Identity.

`boto3` is the AWS SDK for Python. It seems harmless at first:

import boto3
s3 = boto3.client('s3', region_name='us-east-1')
s3.upload_fileobj(file, bucket_name, filename)

Six months later:

import os
import boto3
from flask import Flask, request, jsonify

app = Flask(__name__)

DB_PASSWORD = os.getenv("DB_PASSWORD", "default-insecure-password")

S3_BUCKET = os.getenv("AWS_S3_BUCKET_NAME")
AWS_REGION = os.getenv("AWS_DEFAULT_REGION", "us-east-1")

s3_client = boto3.client(
    's3',
    aws_access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
    aws_secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY"),
    region_name=AWS_REGION
)

@app.route("/upload", methods=["POST"])
def upload_file():
    file = request.files['file']
    filename = file.filename
    if S3_BUCKET:
        s3_client.upload_fileobj(file, S3_BUCKET, filename)
        return jsonify({"message": f"Uploaded to AWS S3: {S3_BUCKET}"})
    else:
        local_path = os.path.join("/tmp", filename)
        file.save(local_path)
        return jsonify({"message": f"Saved locally at {local_path}"})

Three coupling problems in a single file: a proprietary SDK (`boto3`), AWS-specific credentials, and local disk storage that doesn't survive ephemeral Kubernetes pods.

Now multiply that by 10 services.

A skill for the Antigravity CLI that adds two commands to the agent chat:

/agnostic-cluster-refactor:scan-deps
/agnostic-cluster-refactor:spawn-refactor

The complete flow:

But before diving into the code, let me introduce the players.

`agy` is not a script. It's an LLM-powered agent. You describe what you want in the chat and it decides how to do it, using a toolset: `read_file`, `write_to_file`, `run_command`, `invoke_subagent`.

The difference from a web chatbot: `agy` has access to your local filesystem, runs terminal commands, and operates in autonomous loops. It's an engineer working on your machine.

Script	Agent
`sed 's/boto3/gcs/g'` across all files	Analyzes the semantic context of each import and replaces it with the correct equivalent API
Fails if the environment changed	Adapts to the current state
Deterministic	Probabilistic + adaptive

A skill is a `SKILL.md` file with YAML frontmatter that defines when and how the agent uses that capability. The agent reads the `description` field and decides whether the skill is relevant to the current task.

---
name: scan-deps
description: Scans the project for cloud-provider dependencies and generates
             dependency-map.json. Use when the user wants to map vendor lock-in
             before migrating to GKE.
---

## Steps

1. Ask which directory to scan
2. Run: python3 .agents/skills/.../scan_deps.py <PATH>
3. Present the DAG summary

💡 Key distinction: skills in `.agents/skills/` are injected silently into context. To appear as a `/command` in autocomplete, you need a plugin installed at `~/.gemini/config/plugins/<plugin>/`. More on that later in the article.

A subagent is a child agent with completely isolated context. It doesn't "see" the parent's history or the other subagent's — exactly what we want: the Backend agent can't get confused by the YAML the Infra agent is writing.

invoke_subagent(
    name="backend-engine",
    system_prompt="You are an expert in migrating boto3 to GCS...",
    toolset=["read_file", "write_to_file", "run_command"],
    workspace="/path/to/shadow-worktree-backend",
    message="Refactor the files from dependency-map.json"
)
invoke_subagent(
    name="infra-engine",
    toolset=["write_to_file"],  # write only — principle of least privilege
    workspace="/path/to/shadow-worktree-infra",
    message="Generate serviceaccount.yaml, deployment.yaml, ingress.yaml for GKE"
)

Each subagent operates in an isolated Git worktree: a separate working directory, checked out on its own branch, that shares the repository's history. If Subagent A introduces a bug, `main` stays untouched.
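
As a rough sketch of what setting up one of those worktrees could look like, the helper below builds a `git worktree add` command for a subagent. The `refactor/<name>` branch scheme and the function names are assumptions for illustration; the directory name follows the `shadow-worktree-<name>` paths used in the calls above.

```python
import subprocess
from pathlib import Path


def worktree_add_command(repo: Path, name: str, base: str = "main") -> list[str]:
    """Build the `git worktree add` invocation for one subagent."""
    worktree_dir = repo.parent / f"shadow-worktree-{name}"
    branch = f"refactor/{name}"  # hypothetical branch naming scheme
    return ["git", "-C", str(repo), "worktree", "add",
            "-b", branch, str(worktree_dir), base]


def create_worktree(repo: Path, name: str) -> None:
    """Run the command; raises CalledProcessError if git refuses."""
    subprocess.run(worktree_add_command(repo, name), check=True)


cmd = worktree_add_command(Path("/path/to/repo"), "backend")
print(" ".join(cmd))
```

Keeping command construction separate from execution makes it easy to show the exact command to a human before anything touches the repository.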

The first step is mapping the problem. `scan_deps.py` walks the project with `os.walk()`, applies regex patterns by category, and generates a DAG (Directed Acyclic Graph) as JSON.

import os
import re

patterns = {
    "storage": [
        r"google\.cloud\.storage",
        r"boto3.*s3",         # AWS-coupled
        r"aws-sdk.*s3"
    ],
    "messaging": [
        r"google\.cloud\.pubsub",
        r"boto3.*sqs",        # AWS-coupled
        r"kafka-python",
    ],
    "secrets": [
        r"boto3.*secretsmanager",
        r"python-dotenv",
    ],
    "databases": [
        r"psycopg2", r"pymongo"
    ]
}

# `path` is the project directory to scan
dependencies = {dep_type: [] for dep_type in patterns}

for root, dirs, files in os.walk(path):
    # Prune hidden and vendored directories in place so os.walk skips them
    dirs[:] = [d for d in dirs if not d.startswith('.')
               and d not in ['venv', 'node_modules', '__pycache__']]
    for file in files:
        if not file.endswith(('.py', '.js', '.yaml', '.tf')):
            continue
        file_path = os.path.join(root, file)
        with open(file_path, encoding="utf-8", errors="ignore") as f:
            content = f.read()
        for dep_type, pattern_list in patterns.items():
            for pattern in pattern_list:
                if re.search(pattern, content, re.IGNORECASE):
                    dependencies[dep_type].append({
                        "file": os.path.relpath(file_path, path),
                        "matched_pattern": pattern
                    })

The output is a `dependency-map.json` with the full dependency graph:

{
  "dependencies": {
    "storage": [
      { "file": "examples/legacy-app/app.py", "matched_pattern": "boto3.*s3" },
      { "file": "examples/legacy-app/api.py",  "matched_pattern": "boto3.*s3" }
    ],
    "messaging": [
      { "file": "examples/legacy-app/worker.py", "matched_pattern": "boto3.*sqs" }
    ]
  },
  "architectural_dag": {
    "nodes": [
      { "id": "application",   "type": "component" },
      { "id": "dep-storage",   "files": ["app.py", "api.py"] },
      { "id": "provider-aws",  "type": "cloud-provider" }
    ],
    "edges": [
      { "source": "application",  "target": "dep-storage",  "relation": "uses_storage"   },
      { "source": "dep-storage",  "target": "provider-aws", "relation": "coupled_to_aws" }
    ]
  },
  "recommended_action": "Execute '/spawn-refactor' targeting GCP GKE"
}

❓ Why a DAG and not a plain list? The graph reveals transitive relationships: `app.py` and `worker.py` both depend on AWS via `boto3`, so they need to be refactored together. A list would only say "these files have boto3."
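
One way the graph could be derived from the flat match list is sketched below. The `PROVIDER_HINTS` mapping and both helper functions are assumptions for illustration, not code from the project; the node and edge shapes follow the JSON shown above.

```python
# Assumption: which pattern prefixes imply which provider. Not from the project.
PROVIDER_HINTS = {"boto3": "aws", "aws-sdk": "aws", "google": "gcp"}


def build_dag(dependencies: dict[str, list[dict[str, str]]]) -> dict:
    nodes = [{"id": "application", "type": "component"}]
    edges = []
    providers: list[str] = []

    for dep_type, matches in dependencies.items():
        if not matches:
            continue
        dep_id = f"dep-{dep_type}"
        nodes.append({"id": dep_id, "files": sorted({m["file"] for m in matches})})
        edges.append({"source": "application", "target": dep_id,
                      "relation": f"uses_{dep_type}"})
        linked = set()  # providers already linked to this dependency node
        for match in matches:
            for hint, provider in PROVIDER_HINTS.items():
                if match["matched_pattern"].startswith(hint) and provider not in linked:
                    linked.add(provider)
                    if provider not in providers:
                        providers.append(provider)
                    edges.append({"source": dep_id, "target": f"provider-{provider}",
                                  "relation": f"coupled_to_{provider}"})

    nodes += [{"id": f"provider-{p}", "type": "cloud-provider"} for p in providers]
    return {"nodes": nodes, "edges": edges}


def files_coupled_to(dag: dict, provider: str) -> list[str]:
    """Follow edges backwards from a provider node to the files behind it."""
    target = f"provider-{provider}"
    dep_ids = {e["source"] for e in dag["edges"] if e["target"] == target}
    return sorted(f for n in dag["nodes"] if n["id"] in dep_ids for f in n["files"])


deps = {
    "storage": [{"file": "examples/legacy-app/app.py", "matched_pattern": "boto3.*s3"},
                {"file": "examples/legacy-app/api.py", "matched_pattern": "boto3.*s3"}],
    "messaging": [{"file": "examples/legacy-app/worker.py", "matched_pattern": "boto3.*sqs"}],
}
dag = build_dag(deps)
print(files_coupled_to(dag, "aws"))
```

With the graph in hand, "everything coupled to AWS" becomes a two-hop lookup across categories, which is the grouping a flat per-category list does not give you directly.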

This was the most important design decision: how do I ensure the agent doesn't refactor the wrong file without me seeing what's happening first?

The answer lives in two places.

The `.agents/hooks.json` file registers a `PreToolUse` hook, a command that runs before any `write_to_file` the agent attempts:

{
  "hitl-production-gate": {
    "enabled": true,
    "PreToolUse": [
      {
        "matcher": "write_to_file|replace_file_content|multi_replace_file_content",
        "hooks": [
          {
            "type": "command",
            "command": "python3 .agents/skills/agnostic-cluster-refactor/scripts/scan_deps.py --check-only",
            "timeout": 5
          }
        ]
      }
    ]
  }
}

The hook receives a JSON payload via stdin and responds with a decision:

import json
import os
import sys

SAFE_WRITE_PREFIXES = ("examples/", "terraform/", ".agents/")

def check_only_hook():
    payload = json.load(sys.stdin)
    target = payload.get("toolCall", {}).get("args", {}).get("TargetFile", "")
    workspace_root = payload.get("workspacePaths", ["."])[0]
    rel_path = os.path.relpath(target, workspace_root)

    if not any(rel_path.startswith(p) for p in SAFE_WRITE_PREFIXES):
        print(json.dumps({
            "decision": "force_ask",
            "reason": f"[HITL Gate] '{rel_path}' is outside safe directories. Confirm before proceeding."
        }))
    else:
        print(json.dumps({"decision": "allow"}))

Three possible decisions the hook can return:

Decision	Effect
`"allow"`	Agent proceeds automatically
`"force_ask"`	`agy` pauses and asks the human
`"deny"`	Completely blocked, no prompt
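
The decision logic above can also be exercised as a pure function, without stdin. The sketch below mirrors the hook shown earlier; the `decide` name, the empty-target guard, and the slash normalization are additions for illustration.

```python
import os

SAFE_WRITE_PREFIXES = ("examples/", "terraform/", ".agents/")


def decide(payload: dict) -> dict:
    target = payload.get("toolCall", {}).get("args", {}).get("TargetFile", "")
    workspace_root = payload.get("workspacePaths", ["."])[0]
    if not target:
        # Nothing to classify: ask rather than guess (assumed policy)
        return {"decision": "force_ask", "reason": "[HITL Gate] no target file in payload."}
    # relpath also collapses '..' segments, so 'examples/../src' is not treated as safe
    rel_path = os.path.relpath(target, workspace_root).replace(os.sep, "/")
    if any(rel_path.startswith(p) for p in SAFE_WRITE_PREFIXES):
        return {"decision": "allow"}
    return {
        "decision": "force_ask",
        "reason": f"[HITL Gate] '{rel_path}' is outside safe directories. Confirm before proceeding.",
    }


def payload_for(target: str) -> dict:
    return {"toolCall": {"name": "write_to_file", "args": {"TargetFile": target}},
            "workspacePaths": ["/project"]}


print(decide(payload_for("/project/src/app.py"))["decision"])                   # force_ask
print(decide(payload_for("/project/examples/k8s/deployment.yaml"))["decision"])  # allow
print(decide(payload_for("/project/examples/../src/app.py"))["decision"])        # force_ask
```

Note that nothing in this logic returns `"deny"`; the gate as written only chooses between proceeding and asking.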

Testing it from the command line:

echo '{"toolCall":{"name":"write_to_file","args":{"TargetFile":"/project/src/app.py"}},
      "workspacePaths":["/project"]}' | python3 scripts/scan_deps.py --check-only

echo '{"toolCall":{"name":"write_to_file","args":{"TargetFile":"/project/examples/k8s/deployment.yaml"}},
      "workspacePaths":["/project"]}' | python3 scripts/scan_deps.py --check-only

Beyond the automatic hook, the `/spawn-refactor` `SKILL.md` instructs the agent to always ask for explicit confirmation before spawning subagents:

## HITL Gate — mandatory before any mutation

Display the list of files that will be changed and ask:

  The following files will be modified:
    - examples/legacy-app/app.py    (replace boto3 → GCS)
    - examples/legacy-app/worker.py (replace SQS → Pub/Sub)

  Type YES to confirm or NO to abort.

Halt if the user does not confirm with YES.

🛡️ Two layers of protection: the hook catches any write automatically, and the SKILL.md forces you to see the full plan before anything moves.
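
In the skill, this gate is enforced by the agent following the SKILL.md instructions, not by code. If you wanted the same rule in a script, a minimal sketch could look like this; the helper names are hypothetical, and treating only an exact `YES` as confirmation is an assumed, deliberately strict reading of the instruction.

```python
def render_plan(changes: dict[str, str]) -> str:
    """Format the list of planned modifications for the human to review."""
    width = max(len(path) for path in changes)
    lines = ["The following files will be modified:"]
    lines += [f"  - {path.ljust(width)} ({why})" for path, why in changes.items()]
    lines.append("Type YES to confirm or NO to abort.")
    return "\n".join(lines)


def is_confirmed(answer: str) -> bool:
    # Only an exact YES proceeds; "yes", "y", or an empty answer all halt
    return answer.strip() == "YES"


plan = {
    "examples/legacy-app/app.py": "replace boto3 → GCS",
    "examples/legacy-app/worker.py": "replace SQS → Pub/Sub",
}
print(render_plan(plan))
print(is_confirmed("YES"), is_confirmed("yes"))  # True False
```

Defaulting to "halt" on anything ambiguous keeps a stray keystroke from being read as approval.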

After Subagent A runs, `app.py` goes from the boto3 mess above to this:

import os
from flask import Flask, request, jsonify

app = Flask(__name__)

DB_PASSWORD = os.getenv("DB_PASSWORD")
if not DB_PASSWORD:
    raise RuntimeError("DB_PASSWORD environment variable is required!")

GCS_BUCKET_NAME = os.getenv("GCS_BUCKET_NAME", "local-mock")

LOCAL_MOCK = os.getenv("LOCAL_MOCK", "false").lower() == "true"

if LOCAL_MOCK:
    storage_client = None
    print("[LOCAL_MOCK] GCS disabled. Uploads will be simulated.")
else:
    from google.cloud import storage  # import only when we actually need GCS
    storage_client = storage.Client()  # zero credentials — ADC via Workload Identity

_mock_store: dict[str, bytes] = {}

@app.route("/health", methods=["GET"])
def health():
    return jsonify({
        "status": "healthy",
        "platform": "local-k8s" if LOCAL_MOCK else "gcp-gke",
        "gcs_bucket": GCS_BUCKET_NAME,
        "mock_mode": LOCAL_MOCK,
    })

@app.route("/upload", methods=["POST"])
def upload_file():
    file = request.files["file"]
    filename = file.filename

    if LOCAL_MOCK:
        data = file.read()
        _mock_store[filename] = data
        return jsonify({
            "message": f"[LOCAL_MOCK] {filename} stored in memory ({len(data)} bytes)",
            "gcs_uri": f"gs://local-mock/{filename}",
            "files_in_mock": list(_mock_store.keys()),
        })

    bucket = storage_client.bucket(GCS_BUCKET_NAME)
    blob = bucket.blob(filename)
    blob.upload_from_file(file)
    return jsonify({
        "message": f"Uploaded {filename} to {GCS_BUCKET_NAME}",
        "gcs_uri": f"gs://{GCS_BUCKET_NAME}/{filename}",
    })

@app.route("/files", methods=["GET"])
def list_files():
    if LOCAL_MOCK:
        return jsonify({"files": list(_mock_store.keys()), "source": "local-mock"})
    blobs = storage_client.list_blobs(GCS_BUCKET_NAME)
    return jsonify({"files": [b.name for b in blobs], "source": f"gs://{GCS_BUCKET_NAME}"})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080, debug=LOCAL_MOCK)

What changed:

Before	After
`import boto3`	`from google.cloud import storage` (conditional)
`boto3.client('s3', aws_access_key_id=...)`	`storage.Client()` with zero credentials
`file.save("/tmp/...")`	`blob.upload_from_file(file)`
`DB_PASSWORD` with insecure default	`RuntimeError` if missing

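# Before: client constructed unconditionally at module level
from google.cloud import storage
storage_client = storage.Client()   # raises DefaultCredentialsError before any request is handled

# After: import and client construction only when GCS is actually used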
if LOCAL_MOCK:
    storage_client = None
else:
    from google.cloud import storage   # ← inside the else block
    storage_client = storage.Client()

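Module-level code executes when Python loads the module, before any request is served. The import itself only needs the package to be installed; it is the `storage.Client()` call that looks up credentials, and without GCP credentials it raises and the app crashes at startup. Moving the import and the client construction inside `else` fixes it: with `LOCAL_MOCK=true`, neither ever runs.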
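
The same pattern can be wrapped in a small factory, which keeps the deferred import in one place and makes the mock path easy to test. This is a sketch; `make_storage_client` is a hypothetical name, not part of the project.

```python
import os


def make_storage_client(local_mock: bool):
    """Return None in mock mode, otherwise a real GCS client."""
    if local_mock:
        # Mock path: google-cloud-storage is never imported, no credentials needed
        return None
    from google.cloud import storage  # deferred: only needed on the real path
    return storage.Client()           # resolves credentials via ADC


# How the flag is read in app.py
LOCAL_MOCK = os.getenv("LOCAL_MOCK", "false").lower() == "true"

# Forcing the mock path works on a machine with no GCP libraries or credentials
client = make_storage_client(local_mock=True)
print(client)  # None
```

In the application you would call `make_storage_client(LOCAL_MOCK)` once at startup, so a misconfigured production pod still fails fast instead of failing on the first upload.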

I wanted to validate the entire K8s stack (Deployment, ConfigMap, Secret, Service, health checks, routing) locally using Docker Desktop — without needing real GCP credentials.

The solution was `LOCAL_MOCK=true` combined with a Docker Desktop quirk that catches a lot of people off guard.

Docker Desktop uses two completely separate runtimes that don't share images:

┌──────────────────────────────────────┐
│  Docker daemon                       │  ← docker build, docker images
│  (images here are NOT visible to K8s)│
└──────────────────────────────────────┘

┌──────────────────────────────────────┐
│  containerd                          │  ← used by the Kubernetes cluster
│  (separate namespace)                │
└──────────────────────────────────────┘

When you run `docker build -t my-image .`, the image exists in the Docker daemon but not in containerd. With `imagePullPolicy: Never`, K8s looks in containerd and fails:

Failed to pull image "my-image:local": ErrImageNeverPull

The fix: a local registry as the bridge between both runtimes.

docker run -d -p 5001:5000 --restart=always --name local-registry registry:2

Now the flow works end-to-end:

docker build → Docker daemon
      ↓
docker tag + push → localhost:5001 → registry:2
      ↓
containerd pulls from registry:2 ← K8s Pod starts successfully

The `Makefile` handles all of this in a single command:

REGISTRY       = localhost:5001
REGISTRY_IMAGE = $(REGISTRY)/agnostic-cluster-refactor:local

# Recipe lines must be indented with a tab, not spaces
registry-start:
	@docker ps --filter name=local-registry --filter status=running | grep local-registry || \
		docker run -d -p 5001:5000 --restart=always --name local-registry registry:2

build: registry-start
	docker build -t agnostic-cluster-refactor:local .
	docker tag agnostic-cluster-refactor:local $(REGISTRY_IMAGE)
	docker push $(REGISTRY_IMAGE)
	@echo "Image available to K8s: $(REGISTRY_IMAGE)"

local-up:
	kubectl config use-context docker-desktop
	kubectl apply -f examples/k8s/local/secret-db.yaml
	kubectl apply -f examples/k8s/local/configmap.local.yaml
	kubectl apply -f examples/k8s/local/deployment.local.yaml
	kubectl apply -f examples/k8s/local/service.local.yaml
	kubectl rollout status deployment/agnostic-cluster-app --timeout=60s
	@echo "Access: http://localhost:8080/health"

apiVersion: apps/v1
kind: Deployment
metadata:
  name: agnostic-cluster-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: agnostic-cluster-app   # apps/v1 requires a selector; label name assumed
  template:
    metadata:
      labels:
        app: agnostic-cluster-app   # must match the selector above
    spec:
      containers:
        - name: app
          image: localhost:5001/agnostic-cluster-refactor:local
          imagePullPolicy: Always   # always pull from local registry
          envFrom:
            - configMapRef:
                name: app-config-local   # injects LOCAL_MOCK=true
          env:
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: app-secrets
                  key: db-password
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 5

apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config-local
data:
  GCS_BUCKET_NAME: "local-mock"
  GCP_PROJECT_ID: "local-dev"
  LOCAL_MOCK: "true"    # ← activates the in-memory store

Running it:

make build      # build + push to local registry
make local-up   # apply all manifests

curl http://localhost:8080/health

curl -X POST http://localhost:8080/upload -F "file=@package.json"

curl http://localhost:8080/files

make local-down  # teardown

✅ Entire K8s stack validated — Deployment, ConfigMap, Secret, Service, health checks, routing — without a single GCP token.

On GKE, the story is completely different.

The naive approach:

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/app/sa-key.json"
storage_client = storage.Client()

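This requires a JSON key file inside the container, which means a long-lived credential that can leak through logs, has to be rotated by hand, and ends up either hardcoded in the Dockerfile or mounted on the Pod's disk from a Secret volume.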

The Workload Identity approach: annotate a Kubernetes Service Account (KSA) with a Google Service Account (GSA) email:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: refactored-app-ksa
  annotations:
    iam.gke.io/gcp-service-account: "gke-app-sa@MY_PROJECT.iam.gserviceaccount.com"

GKE's internal metadata server intercepts ADC calls from Pods, verifies the annotation, and returns a short-lived OAuth2 token.

The application code becomes:

storage_client = storage.Client()
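
You never have to call the metadata server yourself, since `storage.Client()` does it through ADC. Still, seeing the request helps when debugging from inside a Pod. The sketch below builds the standard token request against the metadata endpoint; the helper names are illustrative, and `fetch_access_token` only works when run inside a GCE VM or GKE Pod.

```python
import json
import urllib.request

TOKEN_URL = ("http://metadata.google.internal/computeMetadata/v1/"
             "instance/service-accounts/default/token")


def build_token_request() -> urllib.request.Request:
    # The metadata server rejects requests that lack this header
    return urllib.request.Request(TOKEN_URL, headers={"Metadata-Flavor": "Google"})


def fetch_access_token(timeout: float = 2.0) -> str:
    """Fetch a short-lived token. Fails outside GCP, where the host does not resolve."""
    with urllib.request.urlopen(build_token_request(), timeout=timeout) as resp:
        return json.load(resp)["access_token"]


request = build_token_request()
print(request.full_url)
```

If `fetch_access_token()` fails from inside the Pod, the usual suspects are the missing KSA annotation or the missing IAM binding, not the application code.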

Terraform provisions the IAM binding automatically:

resource "google_service_account_iam_member" "workload_identity" {
  service_account_id = google_service_account.app.name
  role               = "roles/iam.workloadIdentityUser"
  member = "serviceAccount:${var.project_id}.svc.id.goog[default/refactored-app-ksa]"
}

🔐 This binding is the handshake between the Kubernetes world and GCP IAM. Without it, no token is issued, and requests made through `storage.Client()` fail with a 403.
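
The two strings that have to line up, the `member` in Terraform and the annotation on the KSA, are easy to get subtly wrong. A small sketch that builds both from the same inputs follows; the function names are hypothetical, and the formats are the ones shown in the manifests above.

```python
def workload_identity_member(project_id: str, namespace: str, ksa_name: str) -> str:
    """IAM member string that identifies a Kubernetes ServiceAccount."""
    return f"serviceAccount:{project_id}.svc.id.goog[{namespace}/{ksa_name}]"


def ksa_annotation(gsa_name: str, project_id: str) -> dict[str, str]:
    """Annotation that points the KSA at its Google Service Account."""
    return {
        "iam.gke.io/gcp-service-account":
            f"{gsa_name}@{project_id}.iam.gserviceaccount.com"
    }


print(workload_identity_member("MY_PROJECT", "default", "refactored-app-ksa"))
print(ksa_annotation("gke-app-sa", "MY_PROJECT"))
```

A mismatch in the namespace or the KSA name inside the brackets is enough to break the binding, so generating both values from one source of truth is worth the few lines.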

When I first tested, `/scan-deps` and `/spawn-refactor` did not appear in the `agy` autocomplete. I spent a good chunk of time debugging this.

The discovery: `agy` has three distinct skill mechanisms:

Mechanism	Location	Shows in `/` autocomplete?
Project skill	`.agents/skills/<name>/SKILL.md`	❌ No
Global contextual skill	`~/.gemini/antigravity-cli/skills/`	❌ No
Plugin with namespace	`~/.gemini/config/plugins/<plugin>/`	✅ Yes

To make the commands appear, create the plugin structure:

mkdir -p ~/.gemini/config/plugins/agnostic-cluster-refactor/skills/scan-deps
mkdir -p ~/.gemini/config/plugins/agnostic-cluster-refactor/skills/spawn-refactor

cat > ~/.gemini/config/plugins/agnostic-cluster-refactor/plugin.json << 'EOF'
{
  "name": "agnostic-cluster-refactor",
  "version": "1.0.0",
  "description": "Migrates apps from AWS to GCP GKE with Workload Identity."
}
EOF

After restarting `agy`, the autocomplete shows:

/agnostic-cluster-refactor:scan-deps
/agnostic-cluster-refactor:spawn-refactor

The namespace prevents collisions: two different plugins can both have a skill named `scan-deps` and they'll appear as `/plugin-a:scan-deps` and `/plugin-b:scan-deps`.
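
The naming rule is simple enough to state as code. This is only an illustration of the `/<plugin>:<skill>` scheme described above, with a hypothetical `command_names` helper; it says nothing about how `agy` resolves commands internally.

```python
def command_names(plugins: dict[str, list[str]]) -> list[str]:
    """Map each plugin's skills to their namespaced slash commands."""
    return sorted(
        f"/{plugin}:{skill}"
        for plugin, skills in plugins.items()
        for skill in skills
    )


installed = {
    "plugin-a": ["scan-deps"],
    "plugin-b": ["scan-deps"],
    "agnostic-cluster-refactor": ["scan-deps", "spawn-refactor"],
}
for name in command_names(installed):
    print(name)
```

Three plugins share the skill name `scan-deps` here, yet every resulting command is unique because the plugin name is part of it.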

When I ran `/agnostic-cluster-refactor:spawn-refactor` and confirmed the HITL Gate, Gemini (the `agy` engine) orchestrated:

Subagent A (Backend), in shadow-worktree-backend:

- Read `dependency-map.json` to identify boto3 files
- Replaced `import boto3` → `from google.cloud import storage, pubsub_v1` in each file
- Replaced `boto3.client('s3', ...)` → `storage.Client().bucket(...)` with semantically equivalent calls
- Replaced `boto3.client('sqs', ...)` → `pubsub_v1.SubscriberClient()`
- Updated `requirements.txt`: removed `boto3==1.28.0`, added `google-cloud-storage==2.10.0` and `google-cloud-pubsub==2.18.0`

Subagent B (Infra), in shadow-worktree-infra:

- Generated `serviceaccount.yaml` with the `iam.gke.io/gcp-service-account` annotation
- Generated `deployment.yaml` with env vars via ConfigMap/Secret, no hardcoded credentials
- Generated `ingress.yaml` with `ingressClassName: gce` (the current format, not the deprecated annotation)

All in isolated Git Worktrees, in parallel, without touching `main`.

1. The conditional import is intentional, not lazy.

When `LOCAL_MOCK=true`, `from google.cloud import storage` and the `storage.Client()` call that follows it must not run at module level. Without GCP credentials, constructing the client throws at startup before any request is served. Import and construct conditionally.

2. Docker Desktop K8s and the Docker daemon live in separate worlds.

`imagePullPolicy: Never` breaks with Docker Desktop because K8s uses containerd, not the daemon. Use a local registry on port 5001 (on macOS, port 5000 is typically taken by the AirPlay Receiver) and `imagePullPolicy: Always`.

3. .agents/workflows/ does not create slash commands in agy.

Skills in `.agents/skills/` are context injections, not interactive commands. The `/` autocomplete requires a plugin installed in `~/.gemini/config/plugins/`.

4. The HITL Gate needs two independent layers.

A hook catches unexpected writes automatically. But for `/spawn-refactor`, which modifies multiple files in parallel, explicit plan confirmation in the SKILL.md is non-negotiable. Without both layers, the agent can act before you understand the blast radius.

5. Workload Identity eliminates an entire security problem class.

No JSON keys in containers means no credential leaks in logs, no manual rotation, no hardcoded keys in Dockerfiles, and no Secret volumes mounted on Pod disk. The Metadata Server's short-lived tokens are genuinely safer.

git clone https://github.com/carlosrgomes/agnostic-cluster-refactor
cd agnostic-cluster-refactor

make build      # build + push to local registry
make local-up   # apply manifests to docker-desktop context
curl http://localhost:8080/health

python3 scripts/scan_deps.py /path/to/your/project
cat dependency-map.json | python3 -m json.tool

echo '{"toolCall":{"name":"write_to_file","args":{"TargetFile":"/project/src/main.py"}},
      "workspacePaths":["/project"]}' | python3 scripts/scan_deps.py --check-only

make local-down

For the full GKE deployment with Workload Identity, the project README includes the Terraform that provisions all the infrastructure.

The project started from a real problem (boto3 everywhere) and ended up with a surprisingly complete solution: automatic dependency scanning, parallel subagent refactoring, mandatory human oversight, local K8s testing without cloud credentials, and keyless production auth.

What impressed me most wasn't the AI doing the refactoring. It was the supervision system design: hooks intercepting any write outside safe directories, a SKILL.md with an explicit gate before destructive actions, and Git Worktrees ensuring `main` is never touched without human review.

An autonomous agent without oversight is a chaotic script. An agent with a well-designed HITL Gate is a trustworthy teammate.

Source: dev.to (original article)