# Agnostic Cluster Refactor Skill for Antigravity CLI: Building an AI Agent that Migrates Apps from AWS to GKE (Subagents, HITL Gate & Workload Identity)

> Source:
> Published: 2026-06-30 15:10:30+00:00

Have you ever inherited a codebase where `import boto3` appears in 47 different files? Where AWS credentials live in hardcoded environment variables and file storage is a `file.save("/tmp/...")` that will blow up the moment it hits an ephemeral Kubernetes pod?

I did. And instead of refactoring everything by hand, I built an AI agent to do it for me, with mandatory human oversight before any production mutation.

This article documents what I built: a **skill for the Antigravity CLI** (`agy`) that scans cloud dependencies, spawns parallel subagents to refactor code and infrastructure, and validates everything on local Kubernetes before deploying to GKE with keyless Workload Identity.

`boto3` is the AWS SDK for Python. It seems harmless at first:

```python
# Innocent on day 1
import boto3

s3 = boto3.client('s3', region_name='us-east-1')
s3.upload_fileobj(file, bucket_name, filename)
```

Six months later:

```python
# examples/legacy-app/app.py: the real state after it grows
import os
import boto3
from flask import Flask, request, jsonify

app = Flask(__name__)

# "Temporary" hardcoded since 2022
DB_PASSWORD = os.getenv("DB_PASSWORD", "default-insecure-password")
S3_BUCKET = os.getenv("AWS_S3_BUCKET_NAME")
AWS_REGION = os.getenv("AWS_DEFAULT_REGION", "us-east-1")

s3_client = boto3.client(
    's3',
    aws_access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
    aws_secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY"),
    region_name=AWS_REGION
)

@app.route("/upload", methods=["POST"])
def upload_file():
    file = request.files['file']
    filename = file.filename
    if S3_BUCKET:
        s3_client.upload_fileobj(file, S3_BUCKET, filename)
        return jsonify({"message": f"Uploaded to AWS S3: {S3_BUCKET}"})
    else:
        # Fallback to local disk: will break in K8s ephemeral pods
        local_path = os.path.join("/tmp",
                                  filename)
        file.save(local_path)
        return jsonify({"message": f"Saved locally at {local_path}"})
```

Three coupling problems in a single file: a proprietary SDK (`boto3`), AWS-specific credentials, and local disk storage that doesn't survive ephemeral Kubernetes pods. Now multiply that by 10 services.

What I built is a **skill** for the Antigravity CLI that adds two commands to the agent chat:

```
/agnostic-cluster-refactor:scan-deps
/agnostic-cluster-refactor:spawn-refactor
```

The complete flow: scan dependencies, confirm the plan at the HITL gate, spawn the parallel subagents, validate on local Kubernetes, then deploy to GKE.

But before diving into the code, let me introduce the players.

`agy` is not a script. It's an LLM-powered agent: you describe what you want in the chat and it decides how to do it, using a toolset: `read_file`, `write_to_file`, `run_command`, `invoke_subagent`.

The difference from a web chatbot: `agy` has access to your local filesystem, runs terminal commands, and operates in autonomous loops. It's an engineer working on your machine.

| Script | Agent |
|---|---|
| `sed 's/boto3/gcs/g'` across all files | Analyzes the semantic context of each import and replaces it with the correct equivalent API |
| Fails if the environment changed | Adapts to the current state |
| Deterministic | Probabilistic + adaptive |

A skill is a `SKILL.md` file with YAML frontmatter that defines when and how the agent uses that capability. The agent reads the `description` field and decides whether the skill is relevant to the current task.

```
---
name: scan-deps
description: Scans the project for cloud-provider dependencies and generates dependency-map.json. Use when the user wants to map vendor lock-in before migrating to GKE.
---
## Steps
1. Ask which directory to scan
2. Run: python3 .agents/skills/.../scan_deps.py
3. Present the DAG summary
```

💡 **Key distinction:** skills in `.agents/skills/` are injected silently into context. To appear as a `/command` in autocomplete, you need a **plugin** installed at `~/.gemini/config/plugins//`. More on that in Part 6.
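To make the role of the frontmatter concrete, here is a minimal sketch of reading the `name` and `description` fields from a `SKILL.md` like the one shown earlier. How `agy` itself parses the file is not covered in this article; `parse_frontmatter` is a hypothetical helper that only handles flat `key: value` pairs, not full YAML.

```python
def parse_frontmatter(text: str) -> dict[str, str]:
    """Return the flat key/value pairs between the leading '---' lines.

    Hypothetical helper: handles only single-line `key: value` entries,
    which is all the SKILL.md example in this article uses.
    """
    lines = text.strip().splitlines()
    if not lines or lines[0].strip() != "---":
        return {}  # no frontmatter block at the top of the file
    meta: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break  # closing delimiter: the Markdown body starts after this
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta


skill_md = """---
name: scan-deps
description: Scans the project for cloud-provider dependencies and generates dependency-map.json.
---
## Steps
1. Ask which directory to scan
"""

meta = parse_frontmatter(skill_md)
print(meta["name"])         # scan-deps
print(meta["description"])
```

The `description` value is the text the agent weighs when deciding whether the skill applies, so it is worth keeping it specific about when the skill should be used.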
A subagent is a child agent with completely isolated context. It doesn't "see" the parent's history or the other subagent's, which is exactly what we want: the Backend agent can't get confused by the YAML the Infra agent is writing.

```
# Pseudocode: how agy orchestrates this
invoke_subagent(
    name="backend-engine",
    system_prompt="You are an expert in migrating boto3 to GCS...",
    toolset=["read_file", "write_to_file", "run_command"],
    workspace="/path/to/shadow-worktree-backend",
    message="Refactor the files from dependency-map.json"
)

# Subagent B is invoked in parallel (no blocking)
invoke_subagent(
    name="infra-engine",
    toolset=["write_to_file"],  # write only: principle of least privilege
    workspace="/path/to/shadow-worktree-infra",
    message="Generate serviceaccount.yaml, deployment.yaml, ingress.yaml for GKE"
)
```

Each subagent operates in an isolated **Git Worktree**: a separate working directory, checked out on its own branch, that shares the repository's history. If Subagent A introduces a bug, `main` stays untouched.

The first step is mapping the problem. `scan_deps.py` walks the project with `os.walk()`, applies regex patterns by category, and generates a DAG (Directed Acyclic Graph) as JSON.
```
# scripts/scan_deps.py
import os
import re
import sys
from collections import defaultdict

# Note: '.' does not match newlines, so each pattern must match within one line
patterns = {
    "storage": [
        r"google\.cloud\.storage",
        r"boto3.*s3",        # AWS-coupled
        r"aws-sdk.*s3",
    ],
    "messaging": [
        r"google\.cloud\.pubsub",
        r"boto3.*sqs",       # AWS-coupled
        r"kafka-python",
    ],
    "secrets": [
        r"boto3.*secretsmanager",
        r"python-dotenv",
    ],
    "databases": [
        r"psycopg2",
        r"pymongo",
    ],
}

path = sys.argv[1] if len(sys.argv) > 1 else "."  # project root to scan
dependencies = defaultdict(list)

for root, dirs, files in os.walk(path):
    # Prune hidden and vendored directories in place so os.walk skips them
    dirs[:] = [d for d in dirs
               if not d.startswith('.') and d not in ['venv', 'node_modules', '__pycache__']]
    for file in files:
        if not file.endswith(('.py', '.js', '.yaml', '.tf')):
            continue
        file_path = os.path.join(root, file)
        with open(file_path, encoding="utf-8", errors="ignore") as f:
            content = f.read()
        for dep_type, pattern_list in patterns.items():
            for pattern in pattern_list:
                if re.search(pattern, content, re.IGNORECASE):
                    dependencies[dep_type].append({
                        "file": os.path.relpath(file_path, path),
                        "matched_pattern": pattern
                    })
```

The output is a `dependency-map.json` with the full dependency graph:

```
{
  "dependencies": {
    "storage": [
      { "file": "examples/legacy-app/app.py", "matched_pattern": "boto3.*s3" },
      { "file": "examples/legacy-app/api.py", "matched_pattern": "boto3.*s3" }
    ],
    "messaging": [
      { "file": "examples/legacy-app/worker.py", "matched_pattern": "boto3.*sqs" }
    ]
  },
  "architectural_dag": {
    "nodes": [
      { "id": "application", "type": "component" },
      { "id": "dep-storage", "files": ["app.py", "api.py"] },
      { "id": "provider-aws", "type": "cloud-provider" }
    ],
    "edges": [
      { "source": "application", "target": "dep-storage", "relation": "uses_storage" },
      { "source": "dep-storage", "target": "provider-aws", "relation": "coupled_to_aws" }
    ]
  },
  "recommended_action": "Execute '/spawn-refactor' targeting GCP GKE"
}
```

❓ **Why a DAG and not a plain list?** The graph reveals transitive relationships: `app.py` and `worker.py` both depend on AWS via `boto3`, so they need to be refactored together. A list would only say "these files have boto3."
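The article shows the scan loop and the resulting JSON, but not the step that turns matches into the `architectural_dag`. The following is a sketch of one way to derive nodes and edges of the same shape from the `dependencies` mapping. The `build_dag` function and the `AWS_MARKERS` heuristic are assumptions for illustration, not the project's actual implementation.

```python
import os

# Assumption: a matched pattern mentioning one of these substrings counts as AWS coupling.
AWS_MARKERS = ("boto3", "aws-sdk")


def build_dag(dependencies: dict[str, list[dict[str, str]]]) -> dict[str, list[dict]]:
    """Derive an application -> dependency -> provider graph from scan matches."""
    nodes: list[dict] = [{"id": "application", "type": "component"}]
    edges: list[dict] = []
    coupled_to_aws = False

    for dep_type, matches in sorted(dependencies.items()):
        if not matches:
            continue
        dep_id = f"dep-{dep_type}"
        # One node per dependency category, listing the distinct file names involved
        files = sorted({os.path.basename(m["file"]) for m in matches})
        nodes.append({"id": dep_id, "files": files})
        edges.append({"source": "application", "target": dep_id,
                      "relation": f"uses_{dep_type}"})
        if any(marker in m["matched_pattern"] for m in matches for marker in AWS_MARKERS):
            edges.append({"source": dep_id, "target": "provider-aws",
                          "relation": "coupled_to_aws"})
            coupled_to_aws = True

    if coupled_to_aws:
        nodes.append({"id": "provider-aws", "type": "cloud-provider"})
    return {"nodes": nodes, "edges": edges}


sample = {
    "storage": [
        {"file": "examples/legacy-app/app.py", "matched_pattern": "boto3.*s3"},
        {"file": "examples/legacy-app/api.py", "matched_pattern": "boto3.*s3"},
    ],
    "messaging": [
        {"file": "examples/legacy-app/worker.py", "matched_pattern": "boto3.*sqs"},
    ],
}
dag = build_dag(sample)
print([node["id"] for node in dag["nodes"]])
```

Because every AWS-coupled category ends up with an edge to the same `provider-aws` node, files in different categories become visibly related through the provider, which is the transitive relationship a flat list hides.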
This was the most important design decision: how do I ensure the agent doesn't refactor the wrong file without me seeing what's happening first? The answer lives in two places.

The `.agents/hooks.json` file registers a `PreToolUse` hook, a command that runs **before** any `write_to_file` the agent attempts:

```
{
  "hitl-production-gate": {
    "enabled": true,
    "PreToolUse": [
      {
        "matcher": "write_to_file|replace_file_content|multi_replace_file_content",
        "hooks": [
          {
            "type": "command",
            "command": "python3 .agents/skills/agnostic-cluster-refactor/scripts/scan_deps.py --check-only",
            "timeout": 5
          }
        ]
      }
    ]
  }
}
```

The hook receives a JSON payload via stdin and responds with a decision:

```
# scan_deps.py, --check-only mode
import json
import os
import sys

SAFE_WRITE_PREFIXES = ("examples/", "terraform/", ".agents/")

def check_only_hook():
    payload = json.load(sys.stdin)
    target = payload.get("toolCall", {}).get("args", {}).get("TargetFile", "")
    workspace_root = payload.get("workspacePaths", ["."])[0]
    # Compare the path relative to the workspace against the safe prefixes
    rel_path = os.path.relpath(target, workspace_root)
    if not any(rel_path.startswith(p) for p in SAFE_WRITE_PREFIXES):
        print(json.dumps({
            "decision": "force_ask",
            "reason": f"[HITL Gate] '{rel_path}' is outside safe directories. Confirm before proceeding."
        }))
    else:
        print(json.dumps({"decision": "allow"}))
```

The hook can return three possible decisions:

| Decision | Effect |
|---|---|
| `"allow"` | Agent proceeds automatically |
| `"force_ask"` | agy pauses and asks the human |
| `"deny"` | Completely blocked, no prompt |

Testing it from the command line:

```
# File OUTSIDE safe directories
echo '{"toolCall":{"name":"write_to_file","args":{"TargetFile":"/project/src/app.py"}}, "workspacePaths":["/project"]}' \
  | python3 scripts/scan_deps.py --check-only
# → {"decision": "force_ask", "reason": "[HITL Gate] 'src/app.py' is outside safe directories..."}

# File INSIDE safe directories
echo '{"toolCall":{"name":"write_to_file","args":{"TargetFile":"/project/examples/k8s/deployment.yaml"}}, "workspacePaths":["/project"]}' \
  | python3 scripts/scan_deps.py --check-only
# → {"decision": "allow"}
```

Beyond the automatic hook, the `/spawn-refactor` `SKILL.md` instructs the agent to always ask for explicit confirmation before spawning subagents:

```
## HITL Gate: mandatory before any mutation
Display the list of files that will be changed and ask:

The following files will be modified:
- examples/legacy-app/app.py (replace boto3 → GCS)
- examples/legacy-app/worker.py (replace SQS → Pub/Sub)

Type YES to confirm or NO to abort.
Halt if the user does not confirm with YES.
```

🛡️ Two layers of protection: the hook catches any write automatically, and the SKILL.md forces you to see the full plan before anything moves.
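The hook's decision logic is easier to test when it is separated from stdin and stdout. Below is a sketch of that idea as a pure function. The `decide` name is hypothetical, and returning `"deny"` for targets that resolve outside the workspace is an extra rule assumed here for illustration; the article's hook only distinguishes `allow` from `force_ask`.

```python
import posixpath

SAFE_WRITE_PREFIXES = ("examples/", "terraform/", ".agents/")


def decide(target: str, workspace_root: str) -> dict[str, str]:
    """Map a write target to a hook decision without touching stdin/stdout."""
    # relpath normalizes the path, so 'examples/../src/app.py' becomes 'src/app.py'
    rel_path = posixpath.relpath(target, workspace_root)

    # Assumption: writes that escape the workspace are blocked outright.
    if rel_path == ".." or rel_path.startswith("../"):
        return {"decision": "deny",
                "reason": f"[HITL Gate] '{rel_path}' is outside the workspace."}

    if rel_path.startswith(SAFE_WRITE_PREFIXES):
        return {"decision": "allow"}

    return {"decision": "force_ask",
            "reason": f"[HITL Gate] '{rel_path}' is outside safe directories. "
                      "Confirm before proceeding."}


print(decide("/project/src/app.py", "/project")["decision"])
print(decide("/project/examples/k8s/deployment.yaml", "/project")["decision"])
print(decide("/project/examples/../src/app.py", "/project")["decision"])
```

The third call shows why the check runs on the normalized relative path: a target that merely starts with a safe directory name but climbs back out of it is still routed to the human.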
After Subagent A runs, `app.py` goes from the boto3 mess above to this:

```python
# examples/refactored-app/app.py
import os
from flask import Flask, request, jsonify

app = Flask(__name__)

DB_PASSWORD = os.getenv("DB_PASSWORD")
if not DB_PASSWORD:
    raise RuntimeError("DB_PASSWORD environment variable is required!")

GCS_BUCKET_NAME = os.getenv("GCS_BUCKET_NAME", "local-mock")

# LOCAL_MOCK=true → bypasses GCS; useful for K8s plumbing tests without real credentials
LOCAL_MOCK = os.getenv("LOCAL_MOCK", "false").lower() == "true"

if LOCAL_MOCK:
    storage_client = None
    print("[LOCAL_MOCK] GCS disabled. Uploads will be simulated.")
else:
    from google.cloud import storage  # import only when we actually need GCS
    storage_client = storage.Client()  # zero credentials: ADC via Workload Identity

_mock_store: dict[str, bytes] = {}

@app.route("/health", methods=["GET"])
def health():
    return jsonify({
        "status": "healthy",
        "platform": "local-k8s" if LOCAL_MOCK else "gcp-gke",
        "gcs_bucket": GCS_BUCKET_NAME,
        "mock_mode": LOCAL_MOCK,
    })

@app.route("/upload", methods=["POST"])
def upload_file():
    file = request.files["file"]
    filename = file.filename
    if LOCAL_MOCK:
        data = file.read()
        _mock_store[filename] = data
        return jsonify({
            "message": f"[LOCAL_MOCK] {filename} stored in memory ({len(data)} bytes)",
            "gcs_uri": f"gs://local-mock/{filename}",
            "files_in_mock": list(_mock_store.keys()),
        })
    bucket = storage_client.bucket(GCS_BUCKET_NAME)
    blob = bucket.blob(filename)
    blob.upload_from_file(file)
    return jsonify({
        "message": f"Uploaded {filename} to {GCS_BUCKET_NAME}",
        "gcs_uri": f"gs://{GCS_BUCKET_NAME}/{filename}",
    })

@app.route("/files", methods=["GET"])
def list_files():
    if LOCAL_MOCK:
        return jsonify({"files": list(_mock_store.keys()), "source": "local-mock"})
    blobs = storage_client.list_blobs(GCS_BUCKET_NAME)
    return jsonify({"files": [b.name for b in blobs], "source": f"gs://{GCS_BUCKET_NAME}"})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080, debug=LOCAL_MOCK)
```

**What
changed:**

| Before | After |
|---|---|
| `import boto3` | `from google.cloud import storage` (conditional) |
| `boto3.client('s3', aws_access_key_id=...)` | `storage.Client()`, zero credentials |
| `file.save("/tmp/...")` | `blob.upload_from_file(file)` |
| `DB_PASSWORD` with insecure default | `RuntimeError` if missing |

```
# ❌ Wrong: crashes at startup without GCP credentials
from google.cloud import storage
storage_client = storage.Client()  # raises DefaultCredentialsError before any request is handled

# ✅ Correct: import and client creation only happen when we actually need them
if LOCAL_MOCK:
    storage_client = None
else:
    from google.cloud import storage  # ← inside the else block
    storage_client = storage.Client()
```

Module-level code executes when Python loads the module, before serving any request. Without GCP credentials, `storage.Client()` cannot resolve Application Default Credentials and the app crashes at startup. Moving the import and the client creation inside `else` fixes it: with `LOCAL_MOCK=true`, the module is never imported and no client is created.

I wanted to validate the entire K8s stack (Deployment, ConfigMap, Secret, Service, health checks, routing) locally using Docker Desktop, without needing real GCP credentials. The solution was `LOCAL_MOCK=true` combined with a workaround for a Docker Desktop quirk that catches a lot of people off guard.

Docker Desktop uses **two completely separate runtimes** that don't share images:

```
┌──────────────────────────────────────┐
│ Docker daemon                        │  ← docker build, docker images
│ (images here are NOT visible to K8s) │
└──────────────────────────────────────┘
┌──────────────────────────────────────┐
│ containerd                           │  ← used by the Kubernetes cluster
│ (separate namespace)                 │
└──────────────────────────────────────┘
```

When you run `docker build -t my-image .`, the image exists in the Docker daemon but **not** in containerd. With `imagePullPolicy: Never`, K8s looks in containerd and fails:

```
Failed to pull image "my-image:local": ErrImageNeverPull
```

The fix: a **local registry** as the bridge between both runtimes.
```
# registry:2 on port 5001 (port 5000 is taken by macOS AirPlay)
docker run -d -p 5001:5000 --restart=always --name local-registry registry:2
```

Now the flow works end-to-end:

```
docker build → Docker daemon
      ↓
docker tag + push → localhost:5001 → registry:2
      ↓
containerd pulls from registry:2 ← K8s Pod starts successfully
```

The `Makefile` handles all of this in a single command:

```
# Makefile
REGISTRY = localhost:5001
REGISTRY_IMAGE = $(REGISTRY)/agnostic-cluster-refactor:local

registry-start:
	@docker ps --filter name=local-registry --filter status=running | grep local-registry || \
	docker run -d -p 5001:5000 --restart=always --name local-registry registry:2

build: registry-start
	docker build -t agnostic-cluster-refactor:local .
	docker tag agnostic-cluster-refactor:local $(REGISTRY_IMAGE)
	docker push $(REGISTRY_IMAGE)
	@echo "Image available to K8s: $(REGISTRY_IMAGE)"

local-up:
	kubectl config use-context docker-desktop
	kubectl apply -f examples/k8s/local/secret-db.yaml
	kubectl apply -f examples/k8s/local/configmap.local.yaml
	kubectl apply -f examples/k8s/local/deployment.local.yaml
	kubectl apply -f examples/k8s/local/service.local.yaml
	kubectl rollout status deployment/agnostic-cluster-app --timeout=60s
	@echo "Access: http://localhost:8080/health"

# examples/k8s/local/deployment.local.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: agnostic-cluster-app
spec:
  replicas: 1
  selector:                        # required by apps/v1; label value is assumed here
    matchLabels:
      app: agnostic-cluster-app
  template:
    metadata:
      labels:
        app: agnostic-cluster-app  # must match the selector above
    spec:
      containers:
        - name: app
          image: localhost:5001/agnostic-cluster-refactor:local
          imagePullPolicy: Always  # always pull from local registry
          envFrom:
            - configMapRef:
                name: app-config-local  # injects LOCAL_MOCK=true
          env:
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: app-secrets
                  key: db-password
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 5

# examples/k8s/local/configmap.local.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config-local
data:
  GCS_BUCKET_NAME: "local-mock"
  GCP_PROJECT_ID: "local-dev"
  LOCAL_MOCK: "true"  # ← activates the in-memory
  # store
```

Running it:

```
make build      # build + push to local registry
make local-up   # apply all manifests

curl http://localhost:8080/health
# {"status":"healthy","platform":"local-k8s","mock_mode":true,"gcs_bucket":"local-mock"}

curl -X POST http://localhost:8080/upload -F "file=@package.json"
# {"message":"[LOCAL_MOCK] package.json stored in memory (842 bytes)",
#  "gcs_uri":"gs://local-mock/package.json"}

curl http://localhost:8080/files
# {"files":["package.json"],"source":"local-mock"}

make local-down # teardown
```

✅ Entire K8s stack validated (Deployment, ConfigMap, Secret, Service, health checks, routing) without a single GCP token.

On GKE, the story is completely different.

**The naive approach:**

```
import os
from google.cloud import storage

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/app/sa-key.json"
storage_client = storage.Client()
```

This requires a JSON key file inside the container, which means credentials that can leak into logs, manual key rotation, keys baked into Dockerfiles, and Secret volumes mounted on Pod disk.

**The Workload Identity approach:** annotate a Kubernetes Service Account (KSA) with a Google Service Account (GSA) email:

```
# examples/k8s/serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: refactored-app-ksa
  annotations:
    iam.gke.io/gcp-service-account: "gke-app-sa@MY_PROJECT.iam.gserviceaccount.com"
```

GKE's internal metadata server intercepts ADC calls from Pods, verifies the annotation, and returns a short-lived OAuth2 token.

The application code becomes:

```
# Zero credentials: works automatically on GKE
storage_client = storage.Client()
```

Terraform provisions the IAM binding automatically:

```
# terraform/iam.tf
resource "google_service_account_iam_member" "workload_identity" {
  service_account_id = google_service_account.app.name
  role               = "roles/iam.workloadIdentityUser"
  member             = "serviceAccount:${var.project_id}.svc.id.goog[default/refactored-app-ksa]"
}
```

🔐 This binding is the handshake between the Kubernetes world and GCP IAM. Without it, no token is issued for the GSA, and calls made through `storage.Client()` fail with a 403.
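The annotation and the Terraform `member` string have to agree on project, namespace, and KSA name, and a mismatch in any of them leaves the binding ineffective. As a small illustration of how the two strings are composed, here is a sketch with hypothetical helper names; the project ID `my-project` is a placeholder, while the KSA and GSA names are the ones used in the manifests above.

```python
def gsa_email(gsa_name: str, project_id: str) -> str:
    """Email of the Google Service Account, as used in the KSA annotation."""
    return f"{gsa_name}@{project_id}.iam.gserviceaccount.com"


def workload_identity_member(project_id: str, namespace: str, ksa_name: str) -> str:
    """IAM member that identifies a Kubernetes Service Account to GCP IAM."""
    return f"serviceAccount:{project_id}.svc.id.goog[{namespace}/{ksa_name}]"


project_id = "my-project"  # placeholder project ID
annotation = {"iam.gke.io/gcp-service-account": gsa_email("gke-app-sa", project_id)}
member = workload_identity_member(project_id, "default", "refactored-app-ksa")

print(annotation)
print(member)
```

Generating both values from the same inputs is one way to keep the Kubernetes manifest and the IAM binding from drifting apart when the namespace or account name changes.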
When I first tested, `/scan-deps` and `/spawn-refactor` **did not appear** in the `agy` autocomplete. I spent a good chunk of time debugging this.

The discovery: `agy` has three distinct skill-loading mechanisms:

| Mechanism | Location | Shows in `/` autocomplete? |
|---|---|---|
| Project skill | `.agents/skills//SKILL.md` | ❌ No |
| Global contextual skill | `~/.gemini/antigravity-cli/skills/` | ❌ No |
| Plugin with namespace | `~/.gemini/config/plugins//` | ✅ Yes |

To make the commands appear, create the plugin structure:

```
mkdir -p ~/.gemini/config/plugins/agnostic-cluster-refactor/skills/scan-deps
mkdir -p ~/.gemini/config/plugins/agnostic-cluster-refactor/skills/spawn-refactor

cat > ~/.gemini/config/plugins/agnostic-cluster-refactor/plugin.json << 'EOF'
{
  "name": "agnostic-cluster-refactor",
  "version": "1.0.0",
  "description": "Migrates apps from AWS to GCP GKE with Workload Identity."
}
EOF
```

After restarting `agy`, the autocomplete shows:

```
/agnostic-cluster-refactor:scan-deps
/agnostic-cluster-refactor:spawn-refactor
```

The namespace prevents collisions: two different plugins can both have a skill named `scan-deps` and they'll appear as `/plugin-a:scan-deps` and `/plugin-b:scan-deps`.
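For repeatable setup, the same plugin layout can be scripted. The sketch below mirrors the shell commands above in Python and returns the namespaced command names one would expect in the autocomplete. `scaffold_plugin` is a hypothetical helper, the `/<plugin>:<skill>` naming is inferred from the behavior described in this article rather than from `agy` documentation, and the demo writes into a temporary directory instead of `~/.gemini/config/plugins/`.

```python
import json
import tempfile
from pathlib import Path


def scaffold_plugin(plugins_root: Path, name: str, skills: list[str],
                    description: str) -> list[str]:
    """Create <root>/<name>/skills/<skill>/ plus plugin.json; return command names."""
    plugin_dir = plugins_root / name
    for skill in skills:
        # Each skill gets its own directory, where its SKILL.md would live
        (plugin_dir / "skills" / skill).mkdir(parents=True, exist_ok=True)
    manifest = {"name": name, "version": "1.0.0", "description": description}
    (plugin_dir / "plugin.json").write_text(json.dumps(manifest, indent=2) + "\n")
    # Assumed naming scheme: the plugin name acts as the namespace
    return [f"/{name}:{skill}" for skill in skills]


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    commands = scaffold_plugin(
        root,
        "agnostic-cluster-refactor",
        ["scan-deps", "spawn-refactor"],
        "Migrates apps from AWS to GCP GKE with Workload Identity.",
    )
    manifest = json.loads((root / "agnostic-cluster-refactor" / "plugin.json").read_text())

print(commands)
```

Pointing `plugins_root` at the real plugins directory would reproduce the structure created by the shell commands; the skill directories would still need their `SKILL.md` files.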
When I ran `/agnostic-cluster-refactor:spawn-refactor` and confirmed the HITL Gate, Gemini (the `agy` engine) orchestrated:

**Subagent A (Backend), in shadow-worktree-backend:**

- Read `dependency-map.json` to identify boto3 files
- Replaced `import boto3` → `from google.cloud import storage, pubsub_v1` in each file
- Replaced `boto3.client('s3', ...)` → `storage.Client().bucket(...)` with semantically equivalent calls
- Replaced `boto3.client('sqs', ...)` → `pubsub_v1.SubscriberClient()`
- Updated `requirements.txt`: removed `boto3==1.28.0`, added `google-cloud-storage==2.10.0` and `google-cloud-pubsub==2.18.0`

**Subagent B (Infra), in shadow-worktree-infra:**

- Generated `serviceaccount.yaml` with the `iam.gke.io/gcp-service-account` annotation
- Generated `deployment.yaml` with env vars via ConfigMap/Secret, no hardcoded credentials
- Generated `ingress.yaml` with `ingressClassName: gce` (the current format, not the deprecated annotation)

All in isolated Git Worktrees, in parallel, without touching `main`.

**1. The conditional import is intentional, not lazy.** When `LOCAL_MOCK=true`, `from google.cloud import storage` and `storage.Client()` must not run at module level. Without GCP credentials, the client constructor throws at startup before any request is served. Import and instantiate conditionally.

**2. Docker Desktop K8s and the Docker daemon live in separate worlds.** `imagePullPolicy: Never` breaks with Docker Desktop because K8s uses containerd, not the daemon. Use a local registry on port 5001 (5000 is taken by macOS) and `imagePullPolicy: Always`.

**3. `.agents/workflows/` does not create slash commands in agy.** Skills in `.agents/skills/` are context injections, not interactive commands. The `/` autocomplete requires a plugin installed in `~/.gemini/config/plugins/`.

**4. The HITL Gate needs two independent layers.** A hook catches unexpected writes automatically. But for `/spawn-refactor`, which modifies multiple files in parallel, explicit plan confirmation in the SKILL.md is non-negotiable. Without both layers, the agent can act before you understand the blast radius.

**5.
Workload Identity eliminates an entire security problem class.** No JSON keys in containers means no credential leaks in logs, no manual rotation, no hardcoded keys in Dockerfiles, and no Secret volumes mounted on Pod disk. The Metadata Server's short-lived tokens are genuinely safer.

```
# Clone
git clone https://github.com/carlosrgomes/agnostic-cluster-refactor
cd agnostic-cluster-refactor

# Test locally without GCP (Docker Desktop K8s)
make build      # build + push to local registry
make local-up   # apply manifests to docker-desktop context
curl http://localhost:8080/health

# Scan your own project's dependencies
python3 scripts/scan_deps.py /path/to/your/project
cat dependency-map.json | python3 -m json.tool

# Validate the HITL Gate hook
echo '{"toolCall":{"name":"write_to_file","args":{"TargetFile":"/project/src/main.py"}}, "workspacePaths":["/project"]}' \
  | python3 scripts/scan_deps.py --check-only
# → {"decision": "force_ask", ...}

# Teardown
make local-down
```

For the full GKE deployment with Workload Identity, the project README includes the Terraform that provisions all the infrastructure.

The project started from a real problem (boto3 everywhere) and ended up with a surprisingly complete solution: automatic dependency scanning, parallel subagent refactoring, mandatory human oversight, local K8s testing without cloud credentials, and keyless production auth.

What impressed me most wasn't the AI doing the refactoring. It was the **supervision system design**: hooks intercepting any write outside safe directories, SKILL.md with an explicit gate before destructive actions, and Git Worktrees ensuring `main` is never touched without human review.

An autonomous agent without oversight is a chaotic script. An agent with a well-designed HITL Gate is a trustworthy teammate.

Complete technical tutorial: autonomous migration of AWS-coupled applications to Google Kubernetes Engine (GKE) using the Antigravity CLI with Workload Identity, parallel subagents, and a HITL gate.