Backing Up a Vector Database to Box: Preserving Vector and ID Fields in JSONL

I built a RAG system.

I ingested documents, computed embeddings, and wrote them into a managed vector database. In my case, that database was DataStax AstraDB.

A few weeks later, someone asked the obvious question:

What is our backup story?

And I realized there wasn’t one.

There is no pg_dump equivalent here. There is no one-button “export keyspace” option that hands you a file you can confidently restore from.

You can call find() and write the documents to disk, sure. But I learned the hard way that a naive dump can quietly throw away the single most expensive thing in the database:

the embedding vectors.

That is the trap I want to save you from.

Your dump may look complete. Every document is there. Every metadata field is present. The counts match. At a glance, everything looks fine.

Then, during restore, you realize the vectors are missing.

And now your “backup” is not really a backup. It is just a metadata export.

This article walks through the backup script I built for an internal catalog app. It runs as a scheduled cron job on a serverless container platform and pushes a zipped snapshot to Box as off-host storage.

The interesting parts are not the plumbing. The interesting parts are the small decisions that separated a backup I could actually trust from one that merely looked fine.

In practice, five details mattered most: using projection={"*": True} to actually capture $vector, choosing JSONL instead of one giant JSON file, wrapping non-JSON values in $backupType envelopes, adding a fail-loud guard against silent embedding loss, and making the backup atomic with .inprogress writes, SHA-256 manifests, and useful exit codes.

Here is the most important line in the entire backup script:

cursor = collection.find(
    {},
    projection={"*": True},
    request_timeout_ms=request_timeout_ms,
)

By default, AstraDB’s Data API does not necessarily return special fields like $vector when you call find().

That default makes sense for normal application queries. Embeddings are large. A 384-dimension float array per document adds up fast, and most app queries do not need the raw vector echoed back.

But backups are different.

The obvious backup code looks like this:

for doc in collection.find({}):
    write(doc)

That can produce a backup with every document and zero embeddings.

It passes the eyeball test. The document counts match. The metadata is all there.

But it is not a full-fidelity restore.

If you need to regenerate those embeddings later, you may have to re-run your entire ingestion pipeline against an embedding model that may have changed versions since the original data was written.

That is not a restore. That is a rebuild.

projection={"*": True} is the explicit instruction to return everything, including special $-prefixed fields.

One line.

Miss it, and your backup can become a lie.

There is a companion decision in how I build the client.

AstraPy can return rich custom types, such as DataAPIVector instead of a plain list of floats, or custom timestamp wrappers instead of standard Python types.

Those are useful in application code.

They are less fun when your goal is to serialize a backup.

So the backup client disables custom read datatypes:

from astrapy import DataAPIClient
from astrapy.api_options import APIOptions, SerdesOptions

# Ask AstraPy for plain Python values instead of its rich wrapper types.
options = APIOptions(
    serdes_options=SerdesOptions(
        custom_datatypes_in_reading=False,
        unroll_iterables_to_lists=True,
    )
)
client = DataAPIClient(token=token, api_options=options)

With custom_datatypes_in_reading=False, the common cases come back closer to JSON-native values. $vector becomes a list of floats, _id remains a string, numbers remain normal ints or floats, and iterable values can be unrolled into lists.

The closer the read output is to plain JSON, the less custom serialization code I need.

And the less custom serialization code I need, the fewer places the backup can lie to me.

The second important point is that _id is preserved as-is.

A restore that re-inserts documents with the original _id and original $vector can reproduce the database faithfully.

No re-chunking.

No re-embedding.

No drift.

The document format is JSON Lines: one JSON object per line.

opener = gzip.open if compress else open
written = 0
with opener(data_path, "wt", encoding="utf-8") as handle:
    for doc in cursor:
        # One document per line; unknown types go through _json_default.
        handle.write(json.dumps(doc, ensure_ascii=False, default=_json_default))
        handle.write("\n")
        written += 1

This looks like a small choice, but it matters.

The tempting alternative is to dump one giant JSON array containing every document.

I avoided that for three reasons: memory, failure isolation, and compression.

JSONL lets me write each document as I read it from the cursor.

A large collection, with each document carrying a 384-float vector, never has to fit in memory all at once.

A single giant JSON array pushes you toward buffering too much data before serialization.

On restore, JSONL is just as useful.

You read line by line and insert in batches.

The memory profile stays flat in both directions.
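The restore-side read described above can be sketched roughly as follows. This is a minimal illustration, not the article's restore tool: the function name iter_batches and the default batch size of 20 are assumptions made for the example.

```python
import gzip
import json
from pathlib import Path
from typing import Iterator


def iter_batches(data_path: Path, batch_size: int = 20) -> Iterator[list[dict]]:
    """Yield lists of up to batch_size documents from a .jsonl or .jsonl.gz file.

    Hypothetical helper: only one batch is held in memory at a time.
    """
    opener = gzip.open if data_path.suffix == ".gz" else open
    batch: list[dict] = []
    with opener(data_path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # tolerate a trailing blank line
            batch.append(json.loads(line))
            if len(batch) >= batch_size:
                yield batch
                batch = []
    if batch:
        yield batch  # final, possibly smaller, batch
```

Each yielded batch could then be handed to the collection's bulk insert call (for example insert_many in AstraPy), so memory use depends on the batch size rather than on the size of the collection.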

If something truncates a JSONL file, I may lose the last line.

If something corrupts a giant JSON array, the entire file may become unparseable.
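A small experiment makes the difference concrete. The sketch below uses made-up documents and a hypothetical helper, parse_jsonl_tolerant, and simulates truncation by cutting the last five characters off each representation.

```python
import json


def parse_jsonl_tolerant(text: str) -> tuple[list[dict], int]:
    """Parse JSONL text; return (documents, number of unparseable lines)."""
    docs, bad = [], 0
    for line in text.splitlines():
        if not line.strip():
            continue
        try:
            docs.append(json.loads(line))
        except json.JSONDecodeError:
            bad += 1  # a damaged line costs one document, not the whole file
    return docs, bad


lines = [json.dumps({"_id": str(i), "n": i}) for i in range(3)]
jsonl_text = "\n".join(lines) + "\n"
array_text = json.dumps([json.loads(line) for line in lines])

# Truncated JSONL: every complete line still parses.
docs, bad = parse_jsonl_tolerant(jsonl_text[:-5])

# Truncated JSON array: the single top-level value no longer parses at all.
try:
    json.loads(array_text[:-5])
    array_ok = True
except json.JSONDecodeError:
    array_ok = False
```

This only illustrates the failure mode. A restore would normally still reject a damaged file outright rather than load the surviving lines.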

Embedding floats and repeated metadata keys compress nicely.

The backup defaults to .jsonl.gz, with a --no-compress flag for debugging.

I also record the format choice in the manifest, so the restore tool never has to guess:

{
  "document_format": {
    "layout": "one JSON document per line (JSONL)",
    "vector_field": "$vector stored as a plain JSON array of floats",
    "special_types": "non-JSON types wrapped as {\"$backupType\": ..., \"value\": ...}"
  }
}

The format is not just an implementation detail.

It is part of the backup contract.

Even with serdes configured to return JSON-native values for common cases, production data is a zoo.

Sooner or later, a document carries values that plain JSON cannot represent cleanly: datetime, date, Decimal, UUID, raw bytes, or set-like values.

Plain json.dumps() does not know what to do with those.

The lazy fix is default=str.

I would avoid that.

default=str is lossy.

A real timestamp and the string "2026-06-16T08:52:46+00:00" are not the same thing. On restore, you would insert a string where a timestamp used to be.

That means the backup round-trips into a subtly different database.
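The loss is easy to reproduce with the standard library alone. In this sketch the field name created_at is just an example:

```python
import datetime as dt
import json

original = {
    "created_at": dt.datetime(2026, 6, 16, 8, 52, 46, tzinfo=dt.timezone.utc)
}

# default=str makes the dump succeed, but the type information is gone.
restored = json.loads(json.dumps(original, default=str))

came_back_as_string = isinstance(restored["created_at"], str)
round_trip_is_faithful = restored == original
```

The dump succeeds without complaint, which is exactly the problem: nothing in the file records that the value used to be a timestamp.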

Instead, I wrap every non-JSON type in a typed envelope:

import base64
import datetime as dt
from collections.abc import Iterable
from decimal import Decimal
from uuid import UUID


def _json_default(obj):
    # datetime must be checked before date: datetime is a subclass of date.
    if isinstance(obj, dt.datetime):
        return {"$backupType": "datetime", "value": obj.isoformat()}
    if isinstance(obj, dt.date):
        return {"$backupType": "date", "value": obj.isoformat()}
    if isinstance(obj, (bytes, bytearray)):
        return {
            "$backupType": "bytes_b64",
            "value": base64.b64encode(bytes(obj)).decode("ascii"),
        }
    if isinstance(obj, Decimal):
        return {"$backupType": "decimal", "value": str(obj)}
    if isinstance(obj, UUID):
        return {"$backupType": "uuid", "value": str(obj)}
    if isinstance(obj, (set, frozenset)):
        # Sort for a deterministic file; repr works for mixed element types.
        return {"$backupType": "set", "value": sorted(obj, key=repr)}
    if isinstance(obj, Iterable) and not isinstance(obj, (str, bytes, dict)):
        return list(obj)
    raise TypeError(
        f"Refusing to write a lossy backup: cannot serialize object of type "
        f"{type(obj).__name__!r}. Extend _json_default() to handle it."
    )

Two design principles are worth calling out here: every envelope must be reversible, and unknown types must fail loudly.

This survives a round trip:

{
  "$backupType": "decimal",
  "value": "19.99"
}

This does not:

19.99

The envelope is reversible by design.

The float is not.

The final TypeError is the whole philosophy in one line.

If a type appears that the encoder does not recognize, the backup crashes instead of writing a value it cannot promise to restore.

That may sound harsh, but it is exactly what I want.

A backup that loudly refuses to run is better than one that silently corrupts data I will not inspect again until disaster recovery.

The error also tells the next maintainer what to do:

extend _json_default().

projection={"*": True} is necessary.

It is not sufficient.

What if a future AstraPy upgrade changes projection semantics?

What if a field gets renamed?

What if a collection definition changes?

You could be back to a vectorless backup, and the whole problem is that vector loss is easy to miss.

So the script counts vectors as it writes and refuses to finish a backup that lost them:

if vector_dim and written > 0 and vectors_seen == 0:
    if vectorize_service:
        LOG.warning(
            "%s: declares a %d-dim VECTORIZE vector but no dumped doc carried "
            "a $vector, restore will regenerate from $vectorize.",
            name,
            vector_dim,
        )
    else:
        raise RuntimeError(
            f"{name}: collection declares a {vector_dim}-dim vector but NONE of the "
            f"{written} dumped document(s) contained a $vector. Refusing to write a "
            f"silently vectorless backup. Check the find projection / astrapy version."
        )

The nuance is the distinction between two kinds of vector collections.

In a vectorize collection, the document carries source text in $vectorize, and the server computes $vector.

If the dump is missing $vector, that may be recoverable because re-inserting $vectorize can regenerate the embedding.

So the script warns, but proceeds.

In an explicit-vector collection, embeddings were supplied by the client.

If $vector is missing, that is real data loss.

The script fails the collection, marks the backup incomplete, and blocks the Box upload.

That is the difference between a backup system that hopes and a backup system that checks.

The guard turns an invisible failure mode into a loud one.
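The guard depends on a vectors_seen counter maintained while writing. The article does not show that part, so the following is a minimal sketch of how the counting could work; write_and_count is a hypothetical name, and the custom default= encoder is left out to keep the example self-contained.

```python
import json
from typing import Iterable, TextIO


def write_and_count(docs: Iterable[dict], handle: TextIO) -> tuple[int, int]:
    """Write docs as JSONL and return (written, vectors_seen)."""
    written = vectors_seen = 0
    for doc in docs:
        handle.write(json.dumps(doc, ensure_ascii=False))
        handle.write("\n")
        written += 1
        # Count only documents that actually carry a non-empty $vector.
        if doc.get("$vector"):
            vectors_seen += 1
    return written, vectors_seen
```

Counting in the same loop that writes means the check costs nothing extra and reflects exactly what landed in the file, not what the database claims to hold.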

A backup is only as good as its weakest failure mode.

Three mechanisms keep a half-finished or corrupted snapshot from ever pretending to be a good one: an atomic rename, a self-describing manifest, and meaningful exit codes.

The backup writes into a hidden .inprogress directory first.

Only at the very end does it rename the directory to its final name.

work_dir = output_dir / f".{backup_name}.inprogress"
# ... dump every collection and write manifest.json into work_dir ...
if failures:
    os.replace(work_dir, output_dir / f"{backup_name}.INCOMPLETE")
    return 2
os.replace(work_dir, output_dir / backup_name)

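os.replace() is atomic on POSIX systems as long as the source and destination sit on the same filesystem, which is why the work directory lives inside output_dir.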

That means a directory named like this:

20260616_085246

is guaranteed to be a complete, finished backup.

A crash mid-run leaves a .inprogress directory that the restore tool ignores.

A partial failure leaves an .INCOMPLETE directory that the restore tool refuses to use unless explicitly forced.

There is no in-between state that looks complete but is not.

The same idea applies inside a collection: a retry truncates and rewrites the data file from scratch, so a failed page can never leave behind a half-written .jsonl.
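Under that naming scheme, the restore side can pick usable snapshots from directory names alone. A rough sketch, assuming snapshot directories are always named YYYYMMDD_HHMMSS as in the example above (complete_snapshots is a hypothetical helper):

```python
import re
from pathlib import Path

# Only a bare timestamp counts as finished; ".<name>.inprogress" and
# "<name>.INCOMPLETE" deliberately do not match.
_COMPLETE_NAME = re.compile(r"^\d{8}_\d{6}$")


def complete_snapshots(output_dir: Path) -> list[Path]:
    """Return finished snapshot directories, oldest first."""
    return sorted(
        p
        for p in output_dir.iterdir()
        if p.is_dir() and _COMPLETE_NAME.match(p.name)
    )
```

Because the timestamp format sorts lexicographically, the last element of the returned list is the newest complete snapshot.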

This is the part I am quietly proud of.

Every backup writes a manifest.json that fully describes itself.

The restore tool does not have to guess anything.

It does not have to trust the filesystem.

The manifest becomes the contract between:

I wrote this.

and:

I can safely read this back.

A simplified manifest looks like this:

{
  "backup_format_version": "1.0",
  "status": "complete",
  "created_at_utc": "2026-06-16T08:52:46+00:00",
  "finished_at_utc": "2026-06-16T08:54:09+00:00",
  "duration_seconds": 83.4,
  "astrapy_version": "...",
  "source": {
    "api_endpoint": "https://<your-db>.apps.astra.datastax.com",
    "keyspace": "default_keyspace"
  },
  "serdes_options": {
    "custom_datatypes_in_reading": false,
    "unroll_iterables_to_lists": true
  },
  "document_format": {
    "layout": "one JSON document per line (JSONL)",
    "vector_field": "$vector stored as a plain JSON array of floats"
  },
  "totals": {
    "collections_ok": 3,
    "collections_failed": 0,
    "documents": 12873
  },
  "collections": [
    {
      "name": "assets",
      "status": "ok",
      "definition": {
        "vector": {
          "dimension": 384,
          "metric": "cosine"
        }
      },
      "data_file": "assets.jsonl.gz",
      "compressed": true,
      "document_count": 5120,
      "estimated_document_count": 5118,
      "data_sha256": "9f2c...",
      "data_size_bytes": 41203847
    }
  ]
}

The manifest gives me three important things: integrity, schema, and provenance.

Each collection’s data file is hashed with SHA-256 as it is written.

Before restore, the restore tool verifies every file against its recorded digest.

If a zip gets truncated in transit to or from Box, the restore catches it before writing a single document to the database.

That matters.

You do not want to discover corruption halfway through clobbering a live collection.
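The pre-restore verification might look like the sketch below. It assumes the manifest layout shown earlier (data_file and data_sha256 per collection); the function names are hypothetical, and this version hashes the finished file on the restore side rather than during the write.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in fixed-size chunks so large dumps never sit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(backup_dir: Path, manifest: dict) -> list[str]:
    """Return names of collections whose data file is missing or mismatched."""
    bad = []
    for entry in manifest["collections"]:
        data_path = backup_dir / entry["data_file"]
        if (
            not data_path.is_file()
            or sha256_of_file(data_path) != entry["data_sha256"]
        ):
            bad.append(entry["name"])
    return bad
```

A restore would call verify_backup() first and abort if the returned list is non-empty, before touching the target database.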

Each collection entry carries the collection definition: vector dimension, similarity metric, lexical config, rerank config, and indexing rules.

A restore recreates the collection from that definition first, then streams the documents in.

Schema and data are never separated.

The manifest records where the snapshot came from, which keyspace it used, when it ran, how long it took, which AstraPy version created it, and which serdes options were active.

One deliberate omission:

the application token is never written anywhere in the backup.

The endpoint is not a secret.

The token is.

So the token stays out.

The status field and per-collection statuses make the backup self-auditing. You can tell whether a backup is trustworthy by reading one small JSON file, without unpacking gigabytes of vectors.
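As an illustration of that self-audit, a check over the manifest alone could be as small as this. It is a sketch that assumes the field names from the simplified manifest above; backup_is_trustworthy is a hypothetical helper.

```python
import json
from pathlib import Path


def backup_is_trustworthy(manifest: dict) -> bool:
    """Decide from the manifest alone whether a snapshot claims to be complete."""
    if manifest.get("status") != "complete":
        return False
    if manifest.get("totals", {}).get("collections_failed", 0) != 0:
        return False
    # Every collection entry must individually report success.
    return all(c.get("status") == "ok" for c in manifest.get("collections", []))


def load_manifest(backup_dir: Path) -> dict:
    """Read manifest.json from a snapshot directory."""
    return json.loads((backup_dir / "manifest.json").read_text(encoding="utf-8"))
```

This only checks what the manifest claims. Verifying the SHA-256 digests is still a separate step.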

Finally, the process speaks in exit codes that a scheduler can branch on.

Exit code 0 means the backup completed cleanly and was uploaded to Box, unless --no-box was used.

Exit code 1 means there was a config or connection error, and nothing useful was written.

Exit code 2 means one or more collections failed, so the backup was marked INCOMPLETE and was not uploaded.

Exit code 3 means the local backup is good, but the Box upload failed, so the local copy was kept.

Exit code 3 is the one I like most.

A Box outage should not destroy a perfectly good local backup.

The upload step zips the finished directory, pushes it to Box, and if anything fails, it keeps the local snapshot untouched:

except Exception as exc:
    LOG.error("Box upload FAILED: %s. Local backup kept at %s", exc, final_dir)
    return 3

Distinct codes let the cron job, or a human reading the logs, tell the difference between the database being unreachable, the backup being incomplete, and the database being fine while Box is down.

Those are three very different things to get paged about.
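One way to keep those codes readable on both sides is a small enum shared by the script and whatever wraps it. The numeric values below come from the article; the member names, the suggested actions, and the describe() helper are assumptions made for this sketch.

```python
import enum


class ExitCode(enum.IntEnum):
    OK = 0
    CONFIG_OR_CONNECTION_ERROR = 1
    INCOMPLETE = 2
    UPLOAD_FAILED = 3


# Hypothetical follow-up actions a scheduler or on-call human might take.
_ACTIONS = {
    ExitCode.OK: "nothing to do",
    ExitCode.CONFIG_OR_CONNECTION_ERROR: "check credentials and database reachability",
    ExitCode.INCOMPLETE: "inspect the .INCOMPLETE directory and per-collection statuses",
    ExitCode.UPLOAD_FAILED: "local backup is good; retry the Box upload",
}


def describe(returncode: int) -> str:
    """Translate a process return code into a short, actionable message."""
    try:
        code = ExitCode(returncode)
    except ValueError:
        return f"unexpected exit code {returncode}"
    return f"{code.name}: {_ACTIONS[code]}"
```

A cron wrapper could log describe(proc.returncode) after running the backup and alert only on the codes that need a human.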

Off-host storage matters.

A backup sitting on the same host as the system it protects is not much of a backup.

The script zips the finished snapshot as a timestamped file, for example:

20260616_085246.zip

Then it uploads that zip to a Box folder using Client Credentials Grant auth.

A few pragmatic details made this more reliable.

Small files use a simple one-shot upload.

Larger files use Box's chunked upload.

Vector backups can get big quickly.

If a file with the same backup name already exists, Box may return an HTTP 409.

Instead of failing, the helper uploads the new contents as a new version of the existing file.

That makes idempotent re-runs easier.

The restore tool determines the latest snapshot from the timestamp embedded in the filename, not from Box’s upload time.

That makes the restore behavior deterministic, even if files are copied, re-uploaded, or synced later.
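Selecting by filename can be sketched like this, assuming the zip names follow the YYYYMMDD_HHMMSS.zip pattern shown earlier. latest_snapshot is a hypothetical helper that works on a plain list of names, however the listing was obtained.

```python
import datetime as dt
from typing import Iterable, Optional


def latest_snapshot(names: Iterable[str]) -> Optional[str]:
    """Return the file name with the newest embedded timestamp, or None."""
    best_name, best_ts = None, None
    for name in names:
        if not name.endswith(".zip"):
            continue
        try:
            ts = dt.datetime.strptime(name[: -len(".zip")], "%Y%m%d_%H%M%S")
        except ValueError:
            continue  # not a snapshot name; ignore it
        if best_ts is None or ts > best_ts:
            best_name, best_ts = name, ts
    return best_name
```

Files that do not parse as a timestamp are skipped rather than guessed at, so stray uploads in the same folder cannot be mistaken for a snapshot.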

If you are rolling your own backup for a vector database, or any managed datastore without a blessed dump tool, the lesson is not one specific line of code.

The lesson is a posture:

Assume your backup is broken until something proves it is not, and make that proof part of the backup itself.

For me, that meant being explicit about expensive fields, streaming instead of buffering, refusing to flatten types that cannot be reconstructed, counting vectors and failing loudly when they disappear, making “done” atomic, carrying SHA-256 integrity checks, and using exit codes that distinguish different failure modes.

None of this is exotic.

It was maybe a day of work on top of the naive version.

But it is the difference between a backup you technically have and a backup you can bet your data on.

And you only find out which one you built on the day you need it.

The system behind this was an internal catalog app: a React SPA and Flask API over a hybrid RAG pipeline on AstraDB, with backup and restore running as scheduled cron jobs on a serverless container platform.

The backup script is strictly read-only.

It never issues a single write to the database it protects.

The real lesson for me was simple: a backup is not successful because the script finished. It is successful only when it proves that the data, vectors, schema, and restore path are all still intact. For vector databases, that proof has to be designed in from the beginning.

Backing Up a Vector Database to Box: Preserving Vector and ID Fields in JSONL was originally published in Towards AI on Medium.
