Have you ever inherited a codebase where `import boto3` appears in 47 different files? Where AWS credentials live in hardcoded environment variables and file storage is a `file.save("/tmp/...")` that will blow up the moment it hits an ephemeral Kubernetes pod?
I did. And instead of refactoring everything by hand, I built an AI agent to do it for me, with mandatory human oversight before any production mutation.
This article documents what I built: a skill for the Antigravity CLI (`agy`) that scans cloud dependencies, spawns parallel subagents to refactor code and infrastructure, and validates everything on local Kubernetes before deploying to GKE with keyless Workload Identity.
`boto3` is the AWS SDK for Python. It seems harmless at first:
import boto3
s3 = boto3.client('s3', region_name='us-east-1')
s3.upload_fileobj(file, bucket_name, filename)
Six months later:
import os
import boto3
from flask import Flask, request, jsonify
app = Flask(__name__)
DB_PASSWORD = os.getenv("DB_PASSWORD", "default-insecure-password")
S3_BUCKET = os.getenv("AWS_S3_BUCKET_NAME")
AWS_REGION = os.getenv("AWS_DEFAULT_REGION", "us-east-1")
s3_client = boto3.client(
    's3',
    aws_access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
    aws_secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY"),
    region_name=AWS_REGION
)
@app.route("/upload", methods=["POST"])
def upload_file():
    file = request.files['file']
    filename = file.filename
    if S3_BUCKET:
        s3_client.upload_fileobj(file, S3_BUCKET, filename)
        return jsonify({"message": f"Uploaded to AWS S3: {S3_BUCKET}"})
    else:
        local_path = os.path.join("/tmp", filename)
        file.save(local_path)
        return jsonify({"message": f"Saved locally at {local_path}"})
Three coupling problems in a single file: a proprietary SDK (`boto3`), AWS-specific credentials, and local disk storage that doesn't survive ephemeral Kubernetes pods.
Now multiply that by 10 services.
A skill for the Antigravity CLI that adds two commands to the agent chat:
/agnostic-cluster-refactor:scan-deps
/agnostic-cluster-refactor:spawn-refactor
The complete flow:
But before diving into the code, let me introduce the players.
`agy` is not a script. It's an LLM-powered agent. You describe what you want in the chat and it decides how to do it, using a toolset: `read_file`, `write_to_file`, `run_command`, `invoke_subagent`.
The difference from a web chatbot: `agy` has access to your local filesystem, runs terminal commands, and operates in autonomous loops. It's an engineer working on your machine.
| Script | Agent |
|---|---|
| `sed 's/boto3/gcs/g'` across all files | Analyzes the semantic context of each import and replaces it with the correct equivalent API |
| Fails if the environment changed | Adapts to the current state |
| Deterministic | Probabilistic + adaptive |
A skill is a `SKILL.md` file with YAML frontmatter that defines when and how the agent uses that capability. The agent reads the `description` field and decides whether the skill is relevant to the current task.
---
name: scan-deps
description: Scans the project for cloud-provider dependencies and generates
  dependency-map.json. Use when the user wants to map vendor lock-in
  before migrating to GKE.
---
## Steps
1. Ask which directory to scan
2. Run: python3 .agents/skills/.../scan_deps.py <PATH>
3. Present the DAG summary
Key distinction: skills in `.agents/skills/` are injected silently into context. To appear as a `/command` in autocomplete, you need a plugin installed at `~/.gemini/config/plugins/<plugin>/`. More on that in Part 6.
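To make the role of the frontmatter concrete, here is a minimal sketch of reading the `name` and `description` fields from a `SKILL.md`. This is an illustration only: `read_frontmatter` is a hypothetical helper, not how `agy` parses skills, and it handles just flat `key: value` pairs with indented continuation lines. A real implementation would more likely use a YAML parser.

```python
def read_frontmatter(text: str) -> dict[str, str]:
    """Parse flat `key: value` frontmatter between the first two `---` lines.

    Indented lines are treated as continuations of the previous value.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields: dict[str, str] = {}
    key = None
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of the frontmatter block
        if line[:1].isspace() and key is not None:
            fields[key] += " " + line.strip()
        elif ":" in line:
            key, _, value = line.partition(":")
            key = key.strip()
            fields[key] = value.strip()
    return fields


SKILL_MD = """---
name: scan-deps
description: Scans the project for cloud-provider dependencies and generates
  dependency-map.json. Use when the user wants to map vendor lock-in
  before migrating to GKE.
---
## Steps
"""

meta = read_frontmatter(SKILL_MD)
print(meta["name"])
print(meta["description"])
```

The `description` value is the text the agent would weigh against the current task, which is why it spells out both what the skill does and when to use it.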
A subagent is a child agent with completely isolated context. It doesn't "see" the parent's history or the other subagent's, which is exactly what we want: the Backend agent can't get confused by the YAML the Infra agent is writing.
invoke_subagent(
    name="backend-engine",
    system_prompt="You are an expert in migrating boto3 to GCS...",
    toolset=["read_file", "write_to_file", "run_command"],
    workspace="/path/to/shadow-worktree-backend",
    message="Refactor the files from dependency-map.json"
)
invoke_subagent(
    name="infra-engine",
    toolset=["write_to_file"],  # write only: principle of least privilege
    workspace="/path/to/shadow-worktree-infra",
    message="Generate serviceaccount.yaml, deployment.yaml, ingress.yaml for GKE"
)
Each subagent operates in an isolated Git worktree: a separate working directory, checked out on its own branch and linked to the same repository. If Subagent A introduces a bug, `main` stays untouched.
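As a rough sketch of how such worktrees could be created, the helper below builds a `git worktree add -b <branch> <path>` command for each subagent. The `refactor/<name>` branch naming and the sibling-directory layout are assumptions made for illustration; the project may set its worktrees up differently.

```python
import subprocess
from pathlib import Path


def worktree_add_cmd(repo: Path, name: str) -> list[str]:
    """Build the git command that adds a worktree next to the repository."""
    target = repo.parent / f"shadow-worktree-{name}"  # hypothetical layout
    branch = f"refactor/{name}"                       # hypothetical branch name
    return ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(target)]


def create_worktree(repo: Path, name: str) -> None:
    """Run the command; raises CalledProcessError if git reports a failure."""
    subprocess.run(worktree_add_cmd(repo, name), check=True)


cmd = worktree_add_cmd(Path("/path/to/repo"), "backend")
print(" ".join(cmd))
```

Because each worktree has its own branch and working directory, the two subagents can write files at the same time without stepping on each other or on `main`.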
The first step is mapping the problem. `scan_deps.py` walks the project with `os.walk()`, applies regex patterns by category, and generates a DAG (Directed Acyclic Graph) as JSON.
patterns = {
    "storage": [
        r"google\.cloud\.storage",
        r"boto3.*s3",   # AWS-coupled
        r"aws-sdk.*s3"
    ],
    "messaging": [
        r"google\.cloud\.pubsub",
        r"boto3.*sqs",  # AWS-coupled
        r"kafka-python",
    ],
    "secrets": [
        r"boto3.*secretsmanager",
        r"python-dotenv",
    ],
    "databases": [
        r"psycopg2", r"pymongo"
    ]
}
# Excerpt: `path` is the directory being scanned and `dependencies` maps each
# category to a list of hits (for example, a collections.defaultdict(list)).
for root, dirs, files in os.walk(path):
    # Prune hidden and vendored directories in place so os.walk skips them
    dirs[:] = [d for d in dirs if not d.startswith('.')
               and d not in ['venv', 'node_modules', '__pycache__']]
    for file in files:
        if not file.endswith(('.py', '.js', '.yaml', '.tf')):
            continue
        file_path = os.path.join(root, file)
        with open(file_path) as f:
            content = f.read()
        for dep_type, pattern_list in patterns.items():
            for pattern in pattern_list:
                if re.search(pattern, content, re.IGNORECASE):
                    dependencies[dep_type].append({
                        "file": os.path.relpath(file_path, path),
                        "matched_pattern": pattern
                    })
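One property of these patterns is worth checking before trusting the map: without `re.DOTALL`, `.` does not match a newline, so `boto3.*s3` only fires when both tokens sit on the same line. The short sketch below shows the difference; the two sample strings are made up for the demonstration.

```python
import re

PATTERN = r"boto3.*s3"

same_line = "s3 = boto3.client('s3', region_name='us-east-1')"
split_lines = "import boto3\nclient = make_client('s3')"

hit_same = bool(re.search(PATTERN, same_line, re.IGNORECASE))
hit_split = bool(re.search(PATTERN, split_lines, re.IGNORECASE))
hit_split_dotall = bool(re.search(PATTERN, split_lines, re.IGNORECASE | re.DOTALL))

print(hit_same)          # True: both tokens are on one line
print(hit_split)         # False: '.' stops at the newline
print(hit_split_dotall)  # True: DOTALL lets '.*' span lines
```

Whether the stricter or the looser behavior is preferable depends on how many false positives you are willing to review by hand.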
The output is a `dependency-map.json` with the full dependency graph:
{
  "dependencies": {
    "storage": [
      { "file": "examples/legacy-app/app.py", "matched_pattern": "boto3.*s3" },
      { "file": "examples/legacy-app/api.py", "matched_pattern": "boto3.*s3" }
    ],
    "messaging": [
      { "file": "examples/legacy-app/worker.py", "matched_pattern": "boto3.*sqs" }
    ]
  },
  "architectural_dag": {
    "nodes": [
      { "id": "application", "type": "component" },
      { "id": "dep-storage", "files": ["app.py", "api.py"] },
      { "id": "provider-aws", "type": "cloud-provider" }
    ],
    "edges": [
      { "source": "application", "target": "dep-storage", "relation": "uses_storage" },
      { "source": "dep-storage", "target": "provider-aws", "relation": "coupled_to_aws" }
    ]
  },
  "recommended_action": "Execute '/spawn-refactor' targeting GCP GKE"
}
Why a DAG and not a plain list? The graph reveals transitive relationships: `app.py` and `worker.py` both depend on AWS via `boto3`, so they need to be refactored together. A list would only say "these files have boto3."
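The sketch below shows one way the graph could be derived from the `dependencies` section and then queried for "everything coupled to AWS". It is not the code from `scan_deps.py`: `build_dag` and `files_coupled_to` are hypothetical helpers, and treating any pattern that mentions `boto3` or `aws-sdk` as AWS coupling is an assumption.

```python
def build_dag(dependencies: dict[str, list[dict[str, str]]]) -> dict[str, list[dict]]:
    """Turn per-category hits into nodes and edges."""
    nodes: list[dict] = [{"id": "application", "type": "component"}]
    edges: list[dict] = []
    aws_coupled = False
    for dep_type, hits in dependencies.items():
        if not hits:
            continue
        dep_id = f"dep-{dep_type}"
        nodes.append({"id": dep_id, "files": sorted({h["file"] for h in hits})})
        edges.append({"source": "application", "target": dep_id,
                      "relation": f"uses_{dep_type}"})
        # Assumption: these substrings in a pattern imply AWS coupling.
        if any("boto3" in h["matched_pattern"] or "aws-sdk" in h["matched_pattern"]
               for h in hits):
            aws_coupled = True
            edges.append({"source": dep_id, "target": "provider-aws",
                          "relation": "coupled_to_aws"})
    if aws_coupled:
        nodes.append({"id": "provider-aws", "type": "cloud-provider"})
    return {"nodes": nodes, "edges": edges}


def files_coupled_to(dag: dict[str, list[dict]], provider_id: str) -> list[str]:
    """Collect files from every dependency node that points at the provider."""
    dep_ids = {e["source"] for e in dag["edges"] if e["target"] == provider_id}
    files: set[str] = set()
    for node in dag["nodes"]:
        if node["id"] in dep_ids:
            files.update(node.get("files", []))
    return sorted(files)


deps = {
    "storage": [
        {"file": "examples/legacy-app/app.py", "matched_pattern": "boto3.*s3"},
        {"file": "examples/legacy-app/api.py", "matched_pattern": "boto3.*s3"},
    ],
    "messaging": [
        {"file": "examples/legacy-app/worker.py", "matched_pattern": "boto3.*sqs"},
    ],
}
dag = build_dag(deps)
print(files_coupled_to(dag, "provider-aws"))
```

The query crosses category boundaries: storage and messaging files come back together because both dependency nodes have an edge to `provider-aws`.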
This was the most important design decision: how do I ensure the agent doesn't refactor the wrong file without me seeing what's happening first?
The answer lives in two places.
The `.agents/hooks.json` file registers a `PreToolUse` hook: a command that runs before any `write_to_file` the agent attempts:
{
  "hitl-production-gate": {
    "enabled": true,
    "PreToolUse": [
      {
        "matcher": "write_to_file|replace_file_content|multi_replace_file_content",
        "hooks": [
          {
            "type": "command",
            "command": "python3 .agents/skills/agnostic-cluster-refactor/scripts/scan_deps.py --check-only",
            "timeout": 5
          }
        ]
      }
    ]
  }
}
The hook receives a JSON payload via stdin and responds with a decision:
import json
import os
import sys

SAFE_WRITE_PREFIXES = ("examples/", "terraform/", ".agents/")

def check_only_hook():
    # The pending tool call arrives as JSON on stdin
    payload = json.load(sys.stdin)
    target = payload.get("toolCall", {}).get("args", {}).get("TargetFile", "")
    workspace_root = payload.get("workspacePaths", ["."])[0]
    # Compare paths relative to the workspace root
    rel_path = os.path.relpath(target, workspace_root)
    if not any(rel_path.startswith(p) for p in SAFE_WRITE_PREFIXES):
        print(json.dumps({
            "decision": "force_ask",
            "reason": f"[HITL Gate] '{rel_path}' is outside safe directories. Confirm before proceeding."
        }))
    else:
        print(json.dumps({"decision": "allow"}))
Three possible decisions the hook can return:
| Decision | Effect |
|---|---|
| `"allow"` | Agent proceeds automatically |
| `"force_ask"` | `agy` pauses and asks the human |
| `"deny"` | Completely blocked, no prompt |
Testing it from the command line:
echo '{"toolCall":{"name":"write_to_file","args":{"TargetFile":"/project/src/app.py"}},
"workspacePaths":["/project"]}' | python3 scripts/scan_deps.py --check-only
echo '{"toolCall":{"name":"write_to_file","args":{"TargetFile":"/project/examples/k8s/deployment.yaml"}},
"workspacePaths":["/project"]}' | python3 scripts/scan_deps.py --check-only
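The two shell checks above can also be expressed as plain assertions if the decision logic is pulled into a pure function. The sketch below is a hypothetical refactoring, not the hook shipped in the repository: `decide` is an assumed name, and treating an empty target as `force_ask` is a choice made here because `os.path.relpath` raises `ValueError` on an empty path.

```python
import os

SAFE_WRITE_PREFIXES = ("examples/", "terraform/", ".agents/")


def decide(target: str, workspace_root: str) -> str:
    """Return "allow" for writes under a safe prefix, otherwise "force_ask"."""
    if not target:
        return "force_ask"  # no target in the payload: ask rather than guess
    # relpath normalizes the path, so "examples/../src" resolves to "src"
    rel_path = os.path.relpath(target, workspace_root).replace(os.sep, "/")
    return "allow" if rel_path.startswith(SAFE_WRITE_PREFIXES) else "force_ask"


print(decide("/project/src/app.py", "/project"))
print(decide("/project/examples/k8s/deployment.yaml", "/project"))
print(decide("/project/examples/../src/app.py", "/project"))
```

The third call is the interesting one: a path that starts inside `examples/` but climbs back out is normalized before the prefix check, so it still triggers the prompt.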
Beyond the automatic hook, the `/spawn-refactor` `SKILL.md` instructs the agent to always ask for explicit confirmation before spawning subagents:
## HITL Gate: mandatory before any mutation
Display the list of files that will be changed and ask:
The following files will be modified:
- examples/legacy-app/app.py (replace boto3 → GCS)
- examples/legacy-app/worker.py (replace SQS → Pub/Sub)
Type YES to confirm or NO to abort.
Halt if the user does not confirm with YES.
Two layers of protection: the hook catches any write automatically, and the SKILL.md forces you to see the full plan before anything moves.
After Subagent A runs, `app.py` goes from the boto3 mess above to this:
import os
from flask import Flask, request, jsonify
app = Flask(__name__)
DB_PASSWORD = os.getenv("DB_PASSWORD")
if not DB_PASSWORD:
    raise RuntimeError("DB_PASSWORD environment variable is required!")
GCS_BUCKET_NAME = os.getenv("GCS_BUCKET_NAME", "local-mock")
LOCAL_MOCK = os.getenv("LOCAL_MOCK", "false").lower() == "true"
if LOCAL_MOCK:
    storage_client = None
    print("[LOCAL_MOCK] GCS disabled. Uploads will be simulated.")
else:
    from google.cloud import storage  # import only when we actually need GCS
    storage_client = storage.Client()  # zero credentials: ADC via Workload Identity
_mock_store: dict[str, bytes] = {}
@app.route("/health", methods=["GET"])
def health():
    return jsonify({
        "status": "healthy",
        "platform": "local-k8s" if LOCAL_MOCK else "gcp-gke",
        "gcs_bucket": GCS_BUCKET_NAME,
        "mock_mode": LOCAL_MOCK,
    })
@app.route("/upload", methods=["POST"])
def upload_file():
    file = request.files["file"]
    filename = file.filename
    if LOCAL_MOCK:
        data = file.read()
        _mock_store[filename] = data
        return jsonify({
            "message": f"[LOCAL_MOCK] {filename} stored in memory ({len(data)} bytes)",
            "gcs_uri": f"gs://local-mock/{filename}",
            "files_in_mock": list(_mock_store.keys()),
        })
    bucket = storage_client.bucket(GCS_BUCKET_NAME)
    blob = bucket.blob(filename)
    blob.upload_from_file(file)
    return jsonify({
        "message": f"Uploaded {filename} to {GCS_BUCKET_NAME}",
        "gcs_uri": f"gs://{GCS_BUCKET_NAME}/{filename}",
    })
@app.route("/files", methods=["GET"])
def list_files():
    if LOCAL_MOCK:
        return jsonify({"files": list(_mock_store.keys()), "source": "local-mock"})
    blobs = storage_client.list_blobs(GCS_BUCKET_NAME)
    return jsonify({"files": [b.name for b in blobs], "source": f"gs://{GCS_BUCKET_NAME}"})
if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080, debug=LOCAL_MOCK)
What changed:
| Before | After |
|---|---|
| `import boto3` | `from google.cloud import storage` (conditional) |
| `boto3.client('s3', aws_access_key_id=...)` | `storage.Client()` with zero credentials |
| `file.save("/tmp/...")` | `blob.upload_from_file(file)` |
| `DB_PASSWORD` with insecure default | `RuntimeError` if missing |
# Crashes at startup when no GCP credentials are available:
from google.cloud import storage
storage_client = storage.Client()  # raises DefaultCredentialsError before any request is handled
# Works in mock mode:
if LOCAL_MOCK:
    storage_client = None
else:
    from google.cloud import storage  # inside the else block
    storage_client = storage.Client()
Module-level code executes when Python loads the module, before serving any request. Without GCP credentials, the unconditional `storage.Client()` call raises and the app crashes at startup. Moving the import and the client construction inside `else` fixes it: with `LOCAL_MOCK=true`, neither one ever runs.
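A related pattern, shown here only as an alternative sketch and not as what the project does, is to defer the client to first use behind a cached factory. `get_storage_client` is a hypothetical name; the example forces `LOCAL_MOCK=true` so it runs without credentials or the `google-cloud-storage` package.

```python
import os
from functools import lru_cache


@lru_cache(maxsize=1)
def get_storage_client():
    """Return a GCS client, or None in mock mode. Built once, then cached."""
    if os.getenv("LOCAL_MOCK", "false").lower() == "true":
        return None
    # Imported here so mock mode never touches the package or credentials.
    from google.cloud import storage
    return storage.Client()


os.environ["LOCAL_MOCK"] = "true"  # demo setting; normally injected by the ConfigMap
print(get_storage_client())
```

The trade-off is that a missing credential would surface on the first request instead of at startup, which is why failing fast at module level, as the refactored `app.py` does outside mock mode, can be the better default for a service behind a readiness probe.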
I wanted to validate the entire K8s stack (Deployment, ConfigMap, Secret, Service, health checks, routing) locally using Docker Desktop, without needing real GCP credentials.
The solution was `LOCAL_MOCK=true` combined with a Docker Desktop quirk that catches a lot of people off guard.
Docker Desktop uses two completely separate runtimes that don't share images:
┌──────────────────────────────────────┐
│  Docker daemon                       │  ← docker build, docker images
│  (images here are NOT visible to K8s)│
└──────────────────────────────────────┘
┌──────────────────────────────────────┐
│  containerd                          │  ← used by the Kubernetes cluster
│  (separate namespace)                │
└──────────────────────────────────────┘
When you run `docker build -t my-image .`, the image exists in the Docker daemon but not in containerd. With `imagePullPolicy: Never`, K8s looks in containerd and fails:
Failed to pull image "my-image:local": ErrImageNeverPull
The fix: a local registry as the bridge between both runtimes.
docker run -d -p 5001:5000 --restart=always --name local-registry registry:2
Now the flow works end-to-end:
docker build → Docker daemon
      ↓
docker tag + push → localhost:5001 → registry:2
      ↓
containerd pulls from registry:2 → K8s Pod starts successfully
The `Makefile` handles all of this in a single command:
REGISTRY = localhost:5001
REGISTRY_IMAGE = $(REGISTRY)/agnostic-cluster-refactor:local
registry-start:
	@docker ps --filter name=local-registry --filter status=running | grep local-registry || \
		docker run -d -p 5001:5000 --restart=always --name local-registry registry:2
build: registry-start
	docker build -t agnostic-cluster-refactor:local .
	docker tag agnostic-cluster-refactor:local $(REGISTRY_IMAGE)
	docker push $(REGISTRY_IMAGE)
	@echo "Image available to K8s: $(REGISTRY_IMAGE)"
local-up:
	kubectl config use-context docker-desktop
	kubectl apply -f examples/k8s/local/secret-db.yaml
	kubectl apply -f examples/k8s/local/configmap.local.yaml
	kubectl apply -f examples/k8s/local/deployment.local.yaml
	kubectl apply -f examples/k8s/local/service.local.yaml
	kubectl rollout status deployment/agnostic-cluster-app --timeout=60s
	@echo "Access: http://localhost:8080/health"
apiVersion: apps/v1
kind: Deployment
metadata:
  name: agnostic-cluster-app
spec:
  replicas: 1
  template:
    spec:
      containers:
        - name: app
          image: localhost:5001/agnostic-cluster-refactor:local
          imagePullPolicy: Always   # always pull from local registry
          envFrom:
            - configMapRef:
                name: app-config-local   # injects LOCAL_MOCK=true
          env:
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: app-secrets
                  key: db-password
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 5
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config-local
data:
  GCS_BUCKET_NAME: "local-mock"
  GCP_PROJECT_ID: "local-dev"
  LOCAL_MOCK: "true"   # activates the in-memory store
Running it:
make build # build + push to local registry
make local-up # apply all manifests
curl http://localhost:8080/health
curl -X POST http://localhost:8080/upload -F "file=@package.json"
curl http://localhost:8080/files
make local-down # teardown
Entire K8s stack validated (Deployment, ConfigMap, Secret, Service, health checks, routing) without a single GCP token.
On GKE, the story is completely different.
The naive approach:
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/app/sa-key.json"
storage_client = storage.Client()
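This requires a JSON key file inside the container, which means a long-lived credential that can leak through logs, has to be rotated by hand, and ends up either baked into the image or mounted from a Secret volume.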
The Workload Identity approach: annotate a Kubernetes Service Account (KSA) with a Google Service Account (GSA) email:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: refactored-app-ksa
  annotations:
    iam.gke.io/gcp-service-account: "gke-app-sa@MY_PROJECT.iam.gserviceaccount.com"
GKE's internal metadata server intercepts ADC calls from Pods, verifies the annotation, and returns a short-lived OAuth2 token.
The application code becomes:
storage_client = storage.Client()
Terraform provisions the IAM binding automatically:
resource "google_service_account_iam_member" "workload_identity" {
  service_account_id = google_service_account.app.name
  role               = "roles/iam.workloadIdentityUser"
  member             = "serviceAccount:${var.project_id}.svc.id.goog[default/refactored-app-ksa]"
}
This binding is the handshake between the Kubernetes world and GCP IAM. Without it, no token is issued, and calls made through `storage.Client()` fail with a 403.
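For intuition about what ADC does on the Pod's behalf, the sketch below builds the metadata-server token request by hand and formats the `member` string used in the Terraform binding. In the application you would keep using `storage.Client()`; this is only an illustration. The helper names are hypothetical, and `fetch_access_token` only works from inside a GCE or GKE workload, so the example never calls it.

```python
import json
import urllib.request

METADATA_TOKEN_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/service-accounts/default/token"
)


def build_token_request() -> urllib.request.Request:
    # The metadata server rejects requests that lack this header.
    return urllib.request.Request(
        METADATA_TOKEN_URL, headers={"Metadata-Flavor": "Google"}
    )


def fetch_access_token(timeout: float = 2.0) -> str:
    """Fetch a short-lived token; raises URLError outside Google Cloud."""
    with urllib.request.urlopen(build_token_request(), timeout=timeout) as resp:
        return json.load(resp)["access_token"]


def workload_identity_member(project_id: str, namespace: str, ksa: str) -> str:
    """Format the IAM member that identifies a Kubernetes Service Account."""
    return f"serviceAccount:{project_id}.svc.id.goog[{namespace}/{ksa}]"


req = build_token_request()
print(req.full_url)
print(workload_identity_member("MY_PROJECT", "default", "refactored-app-ksa"))
```

The namespace and KSA name inside the brackets have to match the ServiceAccount manifest exactly, which is a common place for the binding to silently miss.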
When I first tested, `/scan-deps` and `/spawn-refactor` did not appear in the `agy` autocomplete. I spent a good chunk of time debugging this.
The discovery: `agy` has three distinct skill mechanisms:
| Mechanism | Location | Shows in `/` autocomplete? |
|---|---|---|
| Project skill | `.agents/skills/<name>/SKILL.md` | No |
| Global contextual skill | `~/.gemini/antigravity-cli/skills/` | No |
| Plugin with namespace | `~/.gemini/config/plugins/<plugin>/` | Yes |
To make the commands appear, create the plugin structure:
mkdir -p ~/.gemini/config/plugins/agnostic-cluster-refactor/skills/scan-deps
mkdir -p ~/.gemini/config/plugins/agnostic-cluster-refactor/skills/spawn-refactor
cat > ~/.gemini/config/plugins/agnostic-cluster-refactor/plugin.json << 'EOF'
{
"name": "agnostic-cluster-refactor",
"version": "1.0.0",
"description": "Migrates apps from AWS to GCP GKE with Workload Identity."
}
EOF
After restarting `agy`, the autocomplete shows:
/agnostic-cluster-refactor:scan-deps
/agnostic-cluster-refactor:spawn-refactor
The namespace prevents collisions: two different plugins can both have a skill named `scan-deps` and they'll appear as `/plugin-a:scan-deps` and `/plugin-b:scan-deps`.
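The naming scheme can be sketched as a directory walk that maps `<plugin>/skills/<skill>/` to `/<plugin>:<skill>`. This only illustrates the convention described above; `list_commands` is a hypothetical helper, and nothing here claims to mirror how `agy` discovers plugins internally.

```python
import tempfile
from pathlib import Path


def list_commands(plugins_root: Path) -> list[str]:
    """Map every <plugin>/skills/<skill>/ directory to a namespaced command."""
    commands = []
    for skill_dir in plugins_root.glob("*/skills/*"):
        if skill_dir.is_dir():
            plugin = skill_dir.parent.parent.name
            commands.append(f"/{plugin}:{skill_dir.name}")
    return sorted(commands)


# Build a throwaway plugin tree with the same skill name in two plugins.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for plugin in ("plugin-a", "plugin-b"):
        (root / plugin / "skills" / "scan-deps").mkdir(parents=True)
    commands = list_commands(root)

print(commands)
```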
When I ran `/agnostic-cluster-refactor:spawn-refactor` and confirmed the HITL Gate, Gemini (the `agy` engine) orchestrated:
Subagent A (Backend), in shadow-worktree-backend:
- Read `dependency-map.json` to identify boto3 files
- `import boto3` → `from google.cloud import storage, pubsub_v1` in each file
- `boto3.client('s3', ...)` → `storage.Client().bucket(...)` with semantically equivalent calls
- `boto3.client('sqs', ...)` → `pubsub_v1.SubscriberClient()`
- `requirements.txt`: removed `boto3==1.28.0`, added `google-cloud-storage==2.10.0` and `google-cloud-pubsub==2.18.0`
Subagent B (Infra), in shadow-worktree-infra:
- `serviceaccount.yaml` with the `iam.gke.io/gcp-service-account` annotation
- `deployment.yaml` with env vars via ConfigMap/Secret, no hardcoded credentials
- `ingress.yaml` with `ingressClassName: gce` (the current format, not the deprecated annotation)
All in isolated Git Worktrees, in parallel, without touching `main`.
1. The conditional import is intentional, not lazy.
When `LOCAL_MOCK=true`, `from google.cloud import storage` and `storage.Client()` must not run at module level. Without GCP credentials, the client constructor throws at startup before any request is served. Import conditionally.
2. Docker Desktop K8s and the Docker daemon live in separate worlds.
`imagePullPolicy: Never` breaks with Docker Desktop because K8s uses containerd, not the daemon. Use a local registry on port 5001 (on macOS, port 5000 is typically taken by the AirPlay Receiver) and `imagePullPolicy: Always`.
3. `.agents/workflows/` does not create slash commands in agy.
Skills in `.agents/skills/` are context injections, not interactive commands. The `/` autocomplete requires a plugin installed in `~/.gemini/config/plugins/`.
4. The HITL Gate needs two independent layers.
A hook catches unexpected writes automatically. But for `/spawn-refactor`, which modifies multiple files in parallel, explicit plan confirmation in the SKILL.md is non-negotiable. Without both layers, the agent can act before you understand the blast radius.
5. Workload Identity eliminates an entire security problem class.
No JSON keys in containers means no credential leaks in logs, no manual rotation, no hardcoded keys in Dockerfiles, and no Secret volumes mounted on Pod disk. The Metadata Server's short-lived tokens are genuinely safer.
git clone https://github.com/carlosrgomes/agnostic-cluster-refactor
cd agnostic-cluster-refactor
make build # build + push to local registry
make local-up # apply manifests to docker-desktop context
curl http://localhost:8080/health
python3 scripts/scan_deps.py /path/to/your/project
cat dependency-map.json | python3 -m json.tool
echo '{"toolCall":{"name":"write_to_file","args":{"TargetFile":"/project/src/main.py"}},
"workspacePaths":["/project"]}' | python3 scripts/scan_deps.py --check-only
make local-down
For the full GKE deployment with Workload Identity, the project README includes the Terraform that provisions all the infrastructure.
The project started from a real problem (boto3 everywhere) and ended up with a surprisingly complete solution: automatic dependency scanning, parallel subagent refactoring, mandatory human oversight, local K8s testing without cloud credentials, and keyless production auth.
What impressed me most wasn't the AI doing the refactoring. It was the supervision system design: hooks intercepting any write outside safe directories, SKILL.md with an explicit gate before destructive actions, and Git Worktrees ensuring `main` is never touched without human review.
An autonomous agent without oversight is a chaotic script. An agent with a well-designed HITL Gate is a trustworthy teammate.