Backing Up a Vector Database to Box: Preserving Vector and ID Fields in JSONL

A developer built a backup script for DataStax AstraDB vector databases that preserves embedding vectors and ID fields in JSONL format, addressing the common pitfall of naive dumps that silently discard vectors. The script uses projection={"*": True} to capture $vector fields, disables custom datatypes for JSON-native output, and implements atomic writes with SHA-256 manifests to ensure trustworthy backups. The solution runs as a scheduled cron job on a serverless container platform and pushes zipped snapshots to Box for off-host storage.

I built a RAG system. I ingested documents, computed embeddings, and wrote them into a managed vector database. In my case, that database was DataStax AstraDB.

A few weeks later, someone asked the obvious question: What is our backup story?

And I realized there wasn’t one.

There is no pg_dump equivalent here. There is no one-button “export keyspace” option that hands you a file you can confidently restore from. You can call find and write the documents to disk, sure. But I learned the hard way that a naive dump can quietly throw away the single most expensive thing in the database: the embedding vectors.

That is the trap I want to save you from. Your dump may look complete. Every document is there. Every metadata field is present. The counts match. At a glance, everything looks fine.

Then, during restore, you realize the vectors are missing. And now your “backup” is not really a backup. It is just a metadata export.

This article walks through the backup script I built for an internal catalog app. It runs as a scheduled cron job on a serverless container platform and pushes a zipped snapshot to Box as off-host storage. The interesting parts are not the plumbing. The interesting parts are the small decisions that separated a backup I could actually trust from one that merely looked fine.
In practice, five details mattered most: using projection={"*": True} to actually capture $vector, choosing JSONL instead of one giant JSON file, wrapping non-JSON values in $backupType envelopes, adding a fail-loud guard against silent embedding loss, and making the backup atomic with .inprogress writes, SHA-256 manifests, and useful exit codes.

Here is the most important line in the entire backup script:

    cursor = collection.find(
        {},
        projection={"*": True},
        request_timeout_ms=request_timeout_ms,
    )

By default, AstraDB’s Data API does not necessarily return special fields like $vector when you call find. That default makes sense for normal application queries. Embeddings are large. A 384-dimension float array per document adds up fast, and most app queries do not need the raw vector echoed back.

But backups are different. The obvious backup code looks like this:

    for doc in collection.find({}):
        write(doc)

That can produce a backup with every document and zero embeddings. It passes the eyeball test. The document counts match. The metadata is all there. But it is not a full-fidelity restore.

If you need to regenerate those embeddings later, you may have to re-run your entire ingestion pipeline against an embedding model that may have changed versions since the original data was written. That is not a restore. That is a rebuild.

projection={"*": True} is the explicit instruction to return everything, including special $-prefixed fields. One line. Miss it, and your backup can become a lie.

There is a companion decision in how I build the client. AstraPy can return rich custom types, such as DataAPIVector instead of a plain list of floats, or custom timestamp wrappers instead of standard Python types. Those are useful in application code. They are less fun when your goal is to serialize a backup.
So the backup client disables custom read datatypes:

    from astrapy import DataAPIClient
    from astrapy.api_options import APIOptions, SerdesOptions

    # Ask AstraPy for plain Python values instead of its rich wrapper types.
    options = APIOptions(
        serdes_options=SerdesOptions(
            custom_datatypes_in_reading=False,
            unroll_iterables_to_lists=True,
        ),
    )
    client = DataAPIClient(token=token, api_options=options)

With custom_datatypes_in_reading=False, the common cases come back closer to JSON-native values. $vector becomes a list of floats, _id remains a string, numbers remain normal ints or floats, and iterable values can be unrolled into lists.

The closer the read output is to plain JSON, the less custom serialization code I need. And the less custom serialization code I need, the fewer places the backup can lie to me.

The second important point is that _id is preserved as-is. A restore that re-inserts documents with the original _id and original $vector can reproduce the database faithfully. No re-chunking. No re-embedding. No drift.

The document format is JSON Lines: one JSON object per line.

    # Stream each document to disk as it arrives from the cursor.
    opener = gzip.open if compress else open
    with opener(data_path, "wt", encoding="utf-8") as handle:
        for doc in cursor:
            handle.write(json.dumps(doc, ensure_ascii=False, default=_json_default))
            handle.write("\n")
            written += 1

This looks like a small choice, but it matters. The tempting alternative is to dump one giant JSON array containing every document. I avoided that for a few reasons.

JSONL lets me write each document as I read it from the cursor. A large collection, with each document carrying a 384-float vector, never has to fit in memory all at once. A single giant JSON array pushes you toward buffering too much data before serialization.

On restore, JSONL is just as useful. You read line by line and insert in batches. The memory profile stays flat in both directions.

If something truncates a JSONL file, I may lose the last line. If something corrupts a giant JSON array, the entire file may become unparseable.

Embedding floats and repeated metadata keys compress nicely.
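The restore half of that streaming argument can be sketched in a few lines. This is a minimal illustration, not the article's restore tool: the function name iter_batches, the batch_size default, and the choice to detect compression from the .gz suffix are all assumptions made for the example.

```python
import gzip
import json
from pathlib import Path
from typing import Iterator


def iter_batches(data_path: Path, batch_size: int = 100) -> Iterator[list[dict]]:
    """Yield documents from a .jsonl or .jsonl.gz file, batch_size at a time.

    Hypothetical helper: only one batch is held in memory at any moment.
    """
    # Assumption: compressed dumps end in ".gz", plain dumps do not.
    opener = gzip.open if data_path.suffix == ".gz" else open
    batch: list[dict] = []
    with opener(data_path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate a trailing blank line
            batch.append(json.loads(line))
            if len(batch) >= batch_size:
                yield batch
                batch = []
    if batch:
        yield batch  # final, possibly short, batch
```

Each yielded batch could then be handed to a bulk insert call. Any typed envelopes in the documents would need to be decoded back into real values before that step.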
The backup defaults to .jsonl.gz, with a --no-compress flag for debugging. I also record the format choice in the manifest, so the restore tool never has to guess:

    {
      "document_format": {
        "layout": "one JSON document per line (JSONL)",
        "vector_field": "$vector stored as a plain JSON array of floats",
        "special_types": "non-JSON types wrapped as {\"$backupType\": ..., \"value\": ...}"
      }
    }

The format is not just an implementation detail. It is part of the backup contract.

Even with serdes configured to return JSON-native values for common cases, production data is a zoo. Sooner or later, a document carries values that plain JSON cannot represent cleanly: datetime, date, Decimal, UUID, raw bytes, or set-like values. Plain json.dumps does not know what to do with those.

The lazy fix is default=str. I would avoid that. default=str is lossy. A real timestamp and the string "2026-06-16T08:52:46+00:00" are not the same thing. On restore, you would insert a string where a timestamp used to be. That means the backup round-trips into a subtly different database.

Instead, I wrap every non-JSON type in a typed envelope:

    import base64
    import datetime as dt
    from collections.abc import Iterable
    from decimal import Decimal
    from uuid import UUID

    def _json_default(obj):
        # datetime is a subclass of date, so it has to be checked first.
        if isinstance(obj, dt.datetime):
            return {"$backupType": "datetime", "value": obj.isoformat()}
        if isinstance(obj, dt.date):
            return {"$backupType": "date", "value": obj.isoformat()}
        if isinstance(obj, (bytes, bytearray)):
            return {
                "$backupType": "bytes_b64",
                "value": base64.b64encode(bytes(obj)).decode("ascii"),
            }
        if isinstance(obj, Decimal):
            return {"$backupType": "decimal", "value": str(obj)}
        if isinstance(obj, UUID):
            return {"$backupType": "uuid", "value": str(obj)}
        if isinstance(obj, (set, frozenset)):
            # Sort by repr so the output is deterministic across runs.
            return {"$backupType": "set", "value": sorted(obj, key=repr)}
        if isinstance(obj, Iterable) and not isinstance(obj, (str, bytes, dict)):
            return list(obj)
        # Unknown type: crash rather than write something lossy.
        raise TypeError(
            f"Refusing to write a lossy backup: cannot serialize object of type "
            f"{type(obj).__name__!r}. Extend _json_default to handle it."
        )

Two design principles are worth calling out here.
This survives a round trip:

    { "$backupType": "decimal", "value": "19.99" }

This does not:

    19.99

The envelope is reversible by design. The float is not.

The final TypeError is the whole philosophy in one line. If a type appears that the encoder does not recognize, the backup crashes instead of writing a value it cannot promise to restore. That may sound harsh, but it is exactly what I want. A backup that loudly refuses to run is better than one that silently corrupts data I will not inspect again until disaster recovery. The error also tells the next maintainer what to do: extend _json_default.

projection={"*": True} is necessary. It is not sufficient.

What if a future AstraPy upgrade changes projection semantics? What if a field gets renamed? What if a collection definition changes? You could be back to a vectorless backup, and the whole problem is that vector loss is easy to miss.

So the script counts vectors as it writes and refuses to finish a backup that lost them:

    if vector_dim and written > 0 and vectors_seen == 0:
        if vectorize_service:
            LOG.warning(
                "%s: declares a %d-dim VECTORIZE vector but no dumped doc carried "
                "a $vector, restore will regenerate from $vectorize.",
                name,
                vector_dim,
            )
        else:
            raise RuntimeError(
                f"{name}: collection declares a {vector_dim}-dim vector but NONE of the "
                f"{written} dumped document(s) contained a $vector. Refusing to write a "
                f"silently vectorless backup. Check the find projection / astrapy version."
            )

The nuance is the distinction between two kinds of vector collections.

In a vectorize collection, the document carries source text in $vectorize, and the server computes $vector. If the dump is missing $vector, that may be recoverable because re-inserting $vectorize can regenerate the embedding. So the script warns, but proceeds.

In an explicit-vector collection, embeddings were supplied by the client. If $vector is missing, that is real data loss. The script fails the collection, marks the backup incomplete, and blocks the Box upload.
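The guard depends on two running counters, written and vectors_seen, whose bookkeeping is not shown in the excerpt. A minimal sketch of how they might be maintained follows. The function name vector_guard, its string return values, and the rule that only a non-empty list counts as a vector are assumptions made for illustration, not the script's actual code.

```python
from typing import Iterable, Optional


def vector_guard(
    docs: Iterable[dict],
    vector_dim: Optional[int],
    vectorize_service: Optional[str] = None,
) -> str:
    """Count dumped documents and vectors, then apply the fail-loud rule.

    Hypothetical helper: returns "ok" or "warn", or raises RuntimeError
    when an explicit-vector collection produced no vectors at all.
    """
    written = 0
    vectors_seen = 0
    for doc in docs:
        written += 1
        vec = doc.get("$vector")
        # Assumption: with custom datatypes disabled, $vector is a plain list.
        if isinstance(vec, list) and len(vec) > 0:
            vectors_seen += 1

    if vector_dim and written > 0 and vectors_seen == 0:
        if vectorize_service:
            return "warn"  # recoverable: the server can re-embed from $vectorize
        raise RuntimeError(
            f"none of the {written} dumped document(s) contained a $vector"
        )
    return "ok"
```

In the real script the counting would happen inside the write loop rather than in a separate pass, so the cursor is only consumed once.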
That is the difference between a backup system that hopes and a backup system that checks. The guard turns an invisible failure mode into a loud one.

A backup is only as good as its weakest failure mode. Three mechanisms keep a half-finished or corrupted snapshot from ever pretending to be a good one.

The backup writes into a hidden .inprogress directory first. Only at the very end does it rename the directory to its final name.

    work_dir = output_dir / f".{backup_name}.inprogress"

    if failures:
        os.replace(work_dir, output_dir / f"{backup_name}.INCOMPLETE")
        return 2
    os.replace(work_dir, output_dir / backup_name)

os.replace is atomic on a local filesystem. That means a directory named like this:

    20260616_085246

is guaranteed to be a complete, finished backup. A crash mid-run leaves a .inprogress directory that the restore tool ignores. A partial failure leaves an .INCOMPLETE directory that the restore tool refuses to use unless explicitly forced. There is no in-between state that looks complete but is not.

The same idea applies inside a collection: a retry truncates and rewrites the data file from scratch, so a failed page can never leave behind a half-written .jsonl.

This is the part I am quietly proud of. Every backup writes a manifest.json that fully describes itself. The restore tool does not have to guess anything. It does not have to trust the filesystem. The manifest becomes the contract between “I wrote this” and “I can safely read this back.”

A simplified manifest looks like this:

{ "backup_format_version": "1.0", "status": "complete", "created_at_utc": "2026-06-16T08:52:46+00:00", "finished_at_utc": "2026-06-16T08:54:09+00:00", "duration_seconds": 83.4, "astrapy_version": "...", "source": { "api_endpoint": "https://