# Automate Database Backups Across a Server Fleet with AI: 7 Recipes for 2026

> Source: <https://dev.to/bitwiserokos/automate-database-backups-across-a-server-fleet-with-ai-7-recipes-for-2026-395n>
> Published: 2026-06-18 14:13:35+00:00

If you are still SSH-ing into four boxes to confirm last night's dump actually

ran, **automated database backups across servers** are exactly the kind of chore

that should be running on autopilot — not eating your evenings. In this guide we

wire up **postgres backup automation** for a small fleet (one primary, two

streaming replicas, an offsite backup box) and drive the whole thing from a

single AI interface using the MCP server

[ remote-agents](https://www.npmjs.com/package/remote-agents). No agent

In short:every database host runs a lightweight agent connected to an

encrypted relay room. Your AI assistant (Claude or opencode) sees the whole

fleet as one machine and calls tools like`schedule_add`

,`fleet_exec`

,

`file_stat`

,`fleet_git`

,`send_file`

, and`set_mode`

. Cron jobs liveon the, so a nightly

host`pg_dump`

keeps running even if the relay link drops.

Payloads are end-to-end encrypted (AES-GCM-256) — the relay forwards

ciphertext blind.

Let's take a realistic small-SaaS database tier: a Postgres primary, two

streaming replicas for read scaling and failover, and an offsite box that only

stores compressed dumps. We'll also nod to a Windows SQL Server host at the end

to show the cross-platform story.

| Host | Role | Tag | Stack |
|---|---|---|---|
`db-primary` |
Postgres primary (writes) | `db,primary` |
PostgreSQL 16, Linux |
`db-replica-1` |
Streaming replica | `db,replica` |
PostgreSQL 16, Linux |
`db-replica-2` |
Streaming replica | `db,replica` |
PostgreSQL 16, Linux |
`backup-box` |
Offsite dump target | `backup` |
rsync, gzip, Linux |
`sql-win` |
SQL Server (optional) | `db,windows` |
SQL Server 2022, Windows |

All agents join one relay room (say `dbfleet`

). Tags let you address a **group**

in a single call: `target="db,replica"`

hits both replicas, `target="os:linux"`

hits every Linux box, and `target="all"`

sweeps the whole fleet. Note that

multi-tag targets match a host carrying **either** tag, so `db,primary`

resolves

to every database host.

```
            ┌────────────── AI (Claude / opencode) ──────────────┐
            │            remote-agents (MCP, stdio)               │
            └───────────────────────┬────────────────────────────┘
                                    │ wss:// (E2E-encrypted)
                            ┌───────┴────────┐  relay (CF Worker or self-hosted)
                            │   room=dbfleet  │
        ┌───────────┬───────┴────┬────────────┬─────────────┐
   db-primary   db-replica-1  db-replica-2  backup-box   sql-win
  (db,primary)  (db,replica)  (db,replica)  (backup)    (db,windows)
```

The relay is interchangeable: use the hosted Cloudflare Worker, or run your own

Rust relay with `remote-agents-relay --bind 0.0.0.0:8080`

and point agents at

`ws://your-host:8080`

. Either way the relay only ever sees encrypted frames.

On each host you install the package once and start the agent with the right

tags. For 24/7 database servers you'll want the background service form so the

agent survives reboots:

```
# once on every machine
npm i -g remote-agents

# the Postgres primary
remote-agents run --relay wss://<relay> --room dbfleet --token <secret> \
  --name db-primary --tags db,primary

# the two replicas
remote-agents run ... --name db-replica-1 --tags db,replica
remote-agents run ... --name db-replica-2 --tags db,replica

# the offsite backup target
remote-agents run ... --name backup-box --tags backup

# install as a systemd service for 24/7 hosts (instead of `run`)
remote-agents install --relay wss://<relay> --room dbfleet --token <secret> \
  --name db-primary --tags db,primary
```

Then confirm the whole fleet is online. Ask your AI:

"List the agents in the room."

Under the hood that calls ** list_agents**, which returns each peer's OS family,

`update_available`

flag when a newer agent*Also read: Run a security audit across your whole fleet*

| # Recipe | What it does |
|---|---|
1. Plan-mode look |
Tag db hosts, set them to read-only `plan` , and inspect Postgres safely before touching anything. |
2. Nightly pg_dump |
`schedule_add` a `0 3 * * * *` host-local cron that dumps + gzips and prunes anything older than 14 days. |
3. Verify freshness |
`fleet_exec` + `file_stat` to confirm the dump exists, its size, and its mtime on every host at once. |
4. Replica migration |
`fleet_git` pull migrations + `fleet_exec` across `db,replica` with per-host results so a failed node is obvious. |
5. Lag / health check |
`fleet_exec target="db,replica"` running `pg_stat_replication` to catch lag and broken streaming. |
6. Read prod config |
`read_file` a production `.env` in `plan` mode — zero write risk. |
7. Ship a dump |
`send_file` a dump from primary to `backup-box` over a direct UDP channel, SHA-256 verified. |

Database fleets are where "I'll just SSH in real quick" goes to die. A single

forgotten retention prune fills a disk; a silent `pg_dump`

failure isn't noticed

until the day you need the dump; a schema migration applied to the primary but

not the replicas causes mysterious read errors hours later. Here's where driving

the fleet through one AI interface pays off:

`schedule_add`

installs the cron `0 3 * * * *`

dump runs even if the relay link
is down, your laptop is closed, or the AI session has long ended. The schedule
is not tied to your connection.`fleet_exec`

and `file_stat`

give you the dump
size and mtime on all four boxes in a single round trip instead of four SSH
sessions. One failing host does not sink the batch — it just shows up red in
the aggregated result.`fleet_git`

+ `fleet_exec`

against the
`db,replica`

tag apply the same migration to both replicas and report each one
separately, so a half-applied change is impossible to miss.`plan`

mode makes a host read-only, so you can pull
a prod `.env`

or run `pg_stat_replication`

during an incident with no chance of
fat-fingering a write.`target="os:linux"`

hits the Postgres boxes; a
Windows SQL Server host joins the same room and answers `os:windows`

targeting
with a `sqlcmd`

backup instead of `pg_dump`

.This is not a replacement for a declarative backup orchestrator or a managed

RDS-style service. It shines for self-hosted databases, small-to-mid fleets,

dev/staging tiers, and the operational glue around backups that nobody wants to

maintain in YAML.

Before automating anything, look first and touch nothing. Set every database

host to ** plan** mode — read-only

`read_file`

, `git_status`

, and safe `exec`

```
set_mode    target="db,replica"   mode=plan
set_mode    agent_id=db-primary   mode=plan

fleet_exec  target="db,primary"   command="systemctl is-active postgresql && psql -tAc 'SELECT version();'"
fleet_exec  target="db,primary"   command="du -sh /var/lib/postgresql/16/main; df -h /var"
list_dir    agent_id=backup-box   path=/srv/backups
```

Tip:`plan`

mode still allows a read-only`exec`

, so`psql -tAc 'SELECT ...'`

and`du -sh`

work fine. The moment a command would write, the agent rejects it.

Start every incident here — you literally cannot break prod from`plan`

.

The aggregated reply comes back **per host**: you'll see `db-primary`

reporting

`active`

and a 240 GB data directory, while a replica that's catching up might

report differently. That per-host shape is the whole point — no guessing which

box you're looking at.

This is the core of **scheduled pg_dump** automation. We install a host-local

cron on `db-primary`

that dumps the database, gzips it, and prunes anything

older than 14 days. Because `schedule_add`

writes the cron **to the host**, it

keeps firing at 03:00 every night whether or not anyone is connected.

Remember the cron is a **6-field** spec — `sec min hour day month dow`

— so

"3 AM nightly" is `0 0 3 * * *`

.

`edit`

so the agent may create the dump file and the
wrapper script.

```
# the command the cron runs on db-primary (one line, gzip + 14-day prune)
pg_dump -Fc -U postgres app_prod \
  | gzip > /srv/backups/app_prod-$(date +\%F).sql.gz \
  && find /srv/backups -name 'app_prod-*.sql.gz' -mtime +14 -delete
set_mode      agent_id=db-primary  mode=edit

schedule_add  agent_id=db-primary  name=nightly-pgdump \
  cron="0 0 3 * * *" \
  command="pg_dump -Fc -U postgres app_prod | gzip > /srv/backups/app_prod-$(date +%F).sql.gz && find /srv/backups -name 'app_prod-*.sql.gz' -mtime +14 -delete"

schedule_list agent_id=db-primary
```

The `-Fc`

custom format gives you a compressed, `pg_restore`

-friendly archive; a

240 GB cluster typically lands around a 30–45 GB gzipped dump depending on how

much of it is indexes and TOAST. The `-mtime +14 -delete`

clause keeps roughly

two weeks of dumps and nothing more, so the disk doesn't quietly fill.

Note:for amysqldump cronthe only thing that changes is the command —

`mysqldump --single-transaction app_prod | gzip > ...`

— wrapped in the exact

same`schedule_add`

. And for the Windows SQL Server host you'd schedule a

`sqlcmd -Q "BACKUP DATABASE ..."`

instead. The scheduler, retention, and

verification flow are identical across all of them.

To remove the schedule later (say you've migrated to WAL archiving):

```
schedule_remove  agent_id=db-primary  name=nightly-pgdump
```

A backup you never check is a backup you don't have. This recipe answers the only

question that matters at 9 AM: **did last night's dump actually run, and is it the
right size, on every host?** We combine

`fleet_exec`

to find the newest dump with`file_stat`

to read its exact size and mtime.`file_stat`

that file for an authoritative size and modification time.

```
fleet_exec  target="db,primary"  \
  command="ls -t /srv/backups/app_prod-*.sql.gz 2>/dev/null | head -n1"

file_stat   agent_id=db-primary  path=/srv/backups/app_prod-2026-06-18.sql.gz

fleet_exec  target="db,primary"  \
  command="find /srv/backups -name 'app_prod-*.sql.gz' -mmin -1560 -printf '%p %s bytes\n' || echo NO_FRESH_BACKUP"
```

`file_stat`

returns the size and mtime directly, which is more trustworthy than

parsing `ls`

output. The `-mmin -1560`

window (26 hours) gives the 03:00 job some

slack — anything that didn't write a fresh file in that window prints

`NO_FRESH_BACKUP`

, and because results are aggregated **per host**, a replica

that skipped its dump stands out instantly while the healthy boxes report a clean

`~42 GB`

file.

Important:size is your early-warning signal. A dump that suddenly drops

from 42 GB to 200 KB almost always means`pg_dump`

errored out mid-run (bad

credentials, a dropped connection, a full disk) and gzipped only the error.

Watching size over time catches silent failures that a green exit code hides.

You can fold this into a once-a-day self-check by wrapping the `find`

in another

`schedule_add`

that appends to a log — same host-local cron pattern as Recipe 2.

*Also read: Build a cross-platform CI test farm*

Replicas in **database fleet management** drift the moment a migration lands on

one box but not another. With `fleet_git`

to pull the migration repo and

`fleet_exec`

to apply it across the `db,replica`

tag, both replicas move in

lockstep and you get a per-host verdict.

On true streaming replicas you don't run DDL directly (they're read-only and

replay WAL from the primary). This recipe fits the common real-world setup where

"replica" hosts also run migration tooling against a logical target, or where you

roll out an out-of-band maintenance script. Adjust to your topology.

```
fleet_git   target="db,replica"  op=pull  repo=/srv/migrations  remote=origin  branch=main

fleet_exec  target="db,replica"  \
  command="cd /srv/migrations && ./run_migrations.sh 2>&1 | tail -n 5 || echo MIGRATION_FAILED"

fleet_exec  target="db,replica"  \
  command="psql -tAc 'SELECT max(version) FROM schema_migrations;'"
```

The payoff is the **per-host aggregation**: if `db-replica-2`

returns

`MIGRATION_FAILED`

while `db-replica-1`

reports the new version, you know exactly

which node to fix — no scrolling through interleaved SSH output trying to figure

out whose error you're reading. One failing replica does not block the batch from

finishing on the healthy one.

Tip:put the hosts in`edit`

mode (not`bypass`

) for migrations. In`edit`

,

the agent auto-creates abackup before any overwrite, so a botched

`write_file`

to a config or script leaves you the original. Drop back to`plan`

the moment you're done.

The single most useful **replica maintenance** check is replication lag: how far

behind the primary each replica is. From the primary you query

`pg_stat_replication`

; from each replica you check `pg_last_wal_replay_lag`

.

Both are one `fleet_exec`

away.

```
fleet_exec  target="db,primary"  \
  command="psql -xtAc \"SELECT client_addr, state, pg_wal_lsn_diff(sent_lsn, replay_lsn) AS lag_bytes FROM pg_stat_replication;\""

fleet_exec  target="db,replica"  \
  command="psql -tAc 'SELECT now() - pg_last_xact_replay_timestamp() AS replay_lag;'"

fleet_exec  target="db,replica"  \
  command="psql -tAc 'SELECT CASE WHEN pg_is_in_recovery() THEN 1 ELSE 0 END;' | grep -q 1 && echo OK_RECOVERY || echo NOT_A_REPLICA"
```

Run from `plan`

mode — these are all read-only queries, so there's zero risk even

on a busy primary. The first query, sent to `db-primary`

, shows you both replicas

as rows with their `state`

(`streaming`

is what you want) and `lag_bytes`

. The

second, fanned out across the `db,replica`

tag, returns each replica's replay lag

side by side. A replica that has fallen out of streaming shows up either as a

missing row in the primary's view or a ballooning `replay_lag`

.

Note:a healthy LAN replica usually sits under a few hundred milliseconds

of replay lag. A sudden jump to tens of seconds (or a replica vanishing from

`pg_stat_replication`

entirely) means streaming broke — often a WAL retention

or network issue. Catching it here, across both replicas in one call, beats

discovering it when a read query returns stale data.

Sometimes you just need to confirm a connection string or a backup target path

without any chance of editing it. With the host in ** plan** mode,

`read_file`

`plan`

mode.`read_file`

the config.`exec`

.

```
set_mode   agent_id=db-primary  mode=plan

read_file  agent_id=db-primary  path=/srv/app/.env

exec       agent_id=db-primary  command="grep -E '^(PGHOST|PGDATABASE|BACKUP_DIR)=' /srv/app/.env"
```

Because the agent is read-only, even if you (or the AI) followed up with a

`write_file`

, it would be rejected. On top of that, a hard **deny-list** —

covering paths like `/etc/shadow`

and `/boot`

— applies on **every** host in

**every** mode, including `bypass`

, so the truly sensitive system files are never

readable or writable through the agent regardless of how you set things up.

Important:`plan`

mode is the right default for production database hosts.

Keep them in`plan`

day to day and only flip a single host to`edit`

for the

specific window you're changing something — then flip it straight back. It

turns "I'll be careful" into "I literally can't write right now."

Finally, get the dump off the primary and onto the offsite `backup-box`

.

`send_file`

streams a file **host→host over a direct UDP data channel** (opened on

demand, with automatic relay fallback) and verifies it end-to-end with a

**SHA-256** checksum. Because the receiving side performs a write, the

destination host needs `edit`

or `bypass`

.

`backup-box`

in `edit`

so it can write the incoming file.`send_file`

the latest dump from `db-primary`

to `backup-box`

.`file_stat`

on the receiver.

```
set_mode     agent_id=backup-box  mode=edit

send_file    agent_id=db-primary  \
  path=/srv/backups/app_prod-2026-06-18.sql.gz \
  to=backup-box \
  dest=/srv/offsite/app_prod-2026-06-18.sql.gz

file_stat    agent_id=backup-box  path=/srv/offsite/app_prod-2026-06-18.sql.gz
```

The transfer goes peer-to-peer where the network allows, so a 42 GB dump doesn't

have to round-trip through the relay — and if a direct UDP path can't be

established, it transparently falls back to the relay so the copy still

completes. The SHA-256 check means you find out about a corrupted transfer

immediately, not when a restore fails three weeks later. If you'd rather pull

from the receiving side, `transfer_get`

is the mirror-image tool.

Tip:in the browser fleet-chat panel there's a Files view that does the same

move with a live progress bar, plus a binary-safe chunked download of any file

through the relay. Handy when you want to eyeball a dump's size or grab one to

your laptop without dropping to the CLI.

Set `backup-box`

back to `plan`

when the copy is done, so the offsite store is

read-only at rest:

```
set_mode  agent_id=backup-box  mode=plan
```

It's the same idea, but the cron is installed and managed through one AI

interface across the whole fleet, and `schedule_add`

registers it **on the host**

so it survives relay outages and reboots. The difference is operational: you add,

list, verify, and remove backup schedules on four boxes from one place, and you

get per-host confirmation that each dump actually ran, instead of trusting four

independent crontabs you never look at.

Yes. `schedule_add`

writes a host-local cron that lives entirely on the database

server. The relay link and your AI session are only needed to **create, list, or
remove** the schedule — not to run it. Your

`0 0 3 * * *`

dump fires at 03:00Start every host in `plan`

mode, which is read-only — you can run

`pg_stat_replication`

, `read_file`

a `.env`

, and check dump freshness with zero

write risk. Flip a single host to `edit`

only for the specific change you're

making (`edit`

auto-backs-up before overwriting), and a hard deny-list on paths

like `/etc/shadow`

and `/boot`

applies even in `bypass`

. All command payloads and

results are end-to-end encrypted; the relay forwards ciphertext and never sees

your data or keys.

Yes — the workflow is database-agnostic. Swap the command inside `schedule_add`

:

`mysqldump --single-transaction ... | gzip`

for MySQL, or

`sqlcmd -Q "BACKUP DATABASE ..."`

on a Windows host targeted via `os:windows`

.

The scheduling, 14-day retention prune, freshness verification, and SHA-256

transfer all stay identical.

Run ** fleet_update_check**, which tells you which idle hosts have a newer agent

`npm i -g remote-agents@latest`

on those hosts. ItOne MCP interface, five hosts, zero manual SSH sessions: a nightly **scheduled
pg_dump** with 14-day retention runs host-local on the primary, freshness is

`file_stat`

, replicas get migrations and lag checks`plan`

mode, and the dump lands offsiteInstall:

`npm i -g remote-agents`

→

[package on npm]·

[source and docs]