{"slug": "backing-up-a-vector-database-to-box-preserving-vector-and-id-fields-in-jsonl", "title": "Backing Up a Vector Database to Box: Preserving Vector and ID Fields in JSONL", "summary": "A developer built a backup script for DataStax AstraDB vector databases that preserves embedding vectors and ID fields in JSONL format, addressing the common pitfall of naive dumps that silently discard vectors. The script uses projection={\"*\": True} to capture $vector fields, disables custom datatypes for JSON-native output, and implements atomic writes with SHA-256 manifests to ensure trustworthy backups. The solution runs as a scheduled cron job on a serverless container platform and pushes zipped snapshots to Box for off-host storage.", "body_md": "I built a RAG system.\n\nI ingested documents, computed embeddings, and wrote them into a managed vector database. In my case, that database was DataStax AstraDB.\n\nA few weeks later, someone asked the obvious question:\n\nWhat is our backup story?\n\nAnd I realized there wasn’t one.\n\nThere is no pg_dump equivalent here. There is no one-button “export keyspace” option that hands you a file you can confidently restore from.\n\nYou can call find() and write the documents to disk, sure. But I learned the hard way that a naive dump can quietly throw away the single most expensive thing in the database:\n\n**the embedding vectors.**\n\nThat is the trap I want to save you from.\n\nYour dump may look complete. Every document is there. Every metadata field is present. The counts match. At a glance, everything looks fine.\n\nThen, during restore, you realize the vectors are missing.\n\nAnd now your “backup” is not really a backup. It is just a metadata export.\n\nThis article walks through the backup script I built for an internal catalog app. It runs as a scheduled cron job on a serverless container platform and pushes a zipped snapshot to Box as off-host storage.\n\nThe interesting parts are not the plumbing. 
The interesting parts are the small decisions that separated a backup I could actually trust from one that merely looked fine.\n\nIn practice, five details mattered most: using projection={\"*\": True} to actually capture $vector, choosing JSONL instead of one giant JSON file, wrapping non-JSON values in $backupType envelopes, adding a fail-loud guard against silent embedding loss, and making the backup atomic with .inprogress writes, SHA-256 manifests, and useful exit codes.\n\nHere is the most important line in the entire backup script:\n\n```\ncursor = collection.find(\n    {},\n    projection={\"*\": True},\n    request_timeout_ms=request_timeout_ms,\n)\n```\n\nBy default, AstraDB’s Data API does not necessarily return special fields like $vector when you call find().\n\nThat default makes sense for normal application queries. Embeddings are large. A 384-dimension float array per document adds up fast, and most app queries do not need the raw vector echoed back.\n\nBut backups are different.\n\nThe obvious backup code looks like this:\n\n```\nfor doc in collection.find({}):\n    write(doc)\n```\n\nThat can produce a backup with every document and zero embeddings.\n\nIt passes the eyeball test. The document counts match. The metadata is all there.\n\nBut it is not a full-fidelity restore.\n\nIf you need to regenerate those embeddings later, you may have to re-run your entire ingestion pipeline against an embedding model that may have changed versions since the original data was written.\n\nThat is not a restore. 
That is a rebuild.\n\nprojection={\"*\": True} is the explicit instruction to return everything, including special $-prefixed fields.\n\nOne line.\n\nMiss it, and your backup can become a lie.\n\nThere is a companion decision in how I build the client.\n\nAstraPy can return rich custom types, such as DataAPIVector instead of a plain list of floats, or custom timestamp wrappers instead of standard Python types.\n\nThose are useful in application code.\n\nThey are less fun when your goal is to serialize a backup.\n\nSo the backup client disables custom read datatypes:\n\n``` python\nfrom astrapy import DataAPIClient\nfrom astrapy.api_options import APIOptions, SerdesOptions\n\noptions = APIOptions(\n    serdes_options=SerdesOptions(\n        custom_datatypes_in_reading=False,\n        unroll_iterables_to_lists=True,\n    )\n)\nclient = DataAPIClient(token=token, api_options=options)\n```\n\nWith custom_datatypes_in_reading=False, the common cases come back closer to JSON-native values. $vector becomes a list of floats, _id remains a string, numbers remain normal ints or floats, and iterable values can be unrolled into lists.\n\nThe closer the read output is to plain JSON, the less custom serialization code I need.\n\nAnd the less custom serialization code I need, the fewer places the backup can lie to me.\n\nThe second important point is that _id is preserved as-is.\n\nA restore that re-inserts documents with the original _id and original $vector can reproduce the database faithfully.\n\nNo re-chunking.\n\nNo re-embedding.\n\nNo drift.\n\nThe document format is JSON Lines: one JSON object per line.\n\n```\nopener = gzip.open if compress else open\nwith opener(data_path, \"wt\", encoding=\"utf-8\") as handle:\n    for doc in cursor:\n        handle.write(json.dumps(doc, ensure_ascii=False, default=_json_default))\n        handle.write(\"\\n\")\n        written += 1\n```\n\nThis looks like a small choice, but it matters.\n\nThe tempting alternative is to dump one giant JSON array containing every document.\n\nI avoided that 
for a few reasons.\n\nJSONL lets me write each document as I read it from the cursor.\n\nA large collection, with each document carrying a 384-float vector, never has to fit in memory all at once.\n\nA single giant JSON array pushes you toward buffering too much data before serialization.\n\nOn restore, JSONL is just as useful.\n\nYou read line by line and insert in batches.\n\nThe memory profile stays flat in both directions.\n\nIf something truncates a JSONL file, I may lose the last line.\n\nIf something corrupts a giant JSON array, the entire file may become unparseable.\n\nEmbedding floats and repeated metadata keys compress nicely.\n\nThe backup defaults to .jsonl.gz, with a --no-compress flag for debugging.\n\nI also record the format choice in the manifest, so the restore tool never has to guess:\n\n```\n{\n  \"document_format\": {\n    \"layout\": \"one JSON document per line (JSONL)\",\n    \"vector_field\": \"$vector stored as a plain JSON array of floats\",\n    \"special_types\": \"non-JSON types wrapped as {\\\"$backupType\\\": ..., \\\"value\\\": ...}\"\n  }\n}\n```\n\nThe format is not just an implementation detail.\n\nIt is part of the backup contract.\n\nEven with serdes configured to return JSON-native values for common cases, production data is a zoo.\n\nSooner or later, a document carries values that plain JSON cannot represent cleanly: datetime, date, Decimal, UUID, raw bytes, or set-like values.\n\nPlain json.dumps() does not know what to do with those.\n\nThe lazy fix is default=str.\n\nI would avoid that.\n\ndefault=str is lossy.\n\nA real timestamp and the string \"2026-06-16T08:52:46+00:00\" are not the same thing. 
On restore, you would insert a string where a timestamp used to be.\n\nThat means the backup round-trips into a subtly different database.\n\nInstead, I wrap every non-JSON type in a typed envelope:\n\n``` python\nimport base64\nimport datetime as dt\nfrom collections.abc import Iterable\nfrom decimal import Decimal\nfrom uuid import UUID\n\n\ndef _json_default(obj):\n    # datetime is a subclass of date, so it must be checked first.\n    if isinstance(obj, dt.datetime):\n        return {\"$backupType\": \"datetime\", \"value\": obj.isoformat()}\n    if isinstance(obj, dt.date):\n        return {\"$backupType\": \"date\", \"value\": obj.isoformat()}\n    if isinstance(obj, (bytes, bytearray)):\n        return {\n            \"$backupType\": \"bytes_b64\",\n            \"value\": base64.b64encode(bytes(obj)).decode(\"ascii\"),\n        }\n    if isinstance(obj, Decimal):\n        return {\"$backupType\": \"decimal\", \"value\": str(obj)}\n    if isinstance(obj, UUID):\n        return {\"$backupType\": \"uuid\", \"value\": str(obj)}\n    if isinstance(obj, (set, frozenset)):\n        return {\"$backupType\": \"set\", \"value\": sorted(obj, key=repr)}\n    if isinstance(obj, Iterable) and not isinstance(obj, (str, bytes, dict)):\n        return list(obj)\n    raise TypeError(\n        f\"Refusing to write a lossy backup: cannot serialize object of type \"\n        f\"{type(obj).__name__!r}. 
Extend _json_default() to handle it.\"\n    )\n```\n\nTwo design principles are worth calling out here.\n\nThis survives a round trip:\n\n```\n{\n  \"$backupType\": \"decimal\",\n  \"value\": \"19.99\"\n}\n```\n\nThis does not:\n\n```\n19.99\n```\n\nThe envelope is reversible by design.\n\nThe float is not.\n\nThe final TypeError is the whole philosophy in one line.\n\nIf a type appears that the encoder does not recognize, the backup crashes instead of writing a value it cannot promise to restore.\n\nThat may sound harsh, but it is exactly what I want.\n\nA backup that loudly refuses to run is better than one that silently corrupts data I will not inspect again until disaster recovery.\n\nThe error also tells the next maintainer what to do:\n\nextend _json_default().\n\nprojection={\"*\": True} is necessary.\n\nIt is not sufficient.\n\nWhat if a future AstraPy upgrade changes projection semantics?\n\nWhat if a field gets renamed?\n\nWhat if a collection definition changes?\n\nYou could be back to a vectorless backup, and the whole problem is that vector loss is easy to miss.\n\nSo the script counts vectors as it writes and refuses to finish a backup that lost them:\n\n```\nif vector_dim and written > 0 and vectors_seen == 0:\n    if vectorize_service:\n        LOG.warning(\n            \"%s: declares a %d-dim VECTORIZE vector but no dumped doc carried \"\n            \"a $vector, restore will regenerate from $vectorize.\",\n            name,\n            vector_dim,\n        )\n    else:\n        raise RuntimeError(\n            f\"{name}: collection declares a {vector_dim}-dim vector but NONE of the \"\n            f\"{written} dumped document(s) contained a $vector. Refusing to write a \"\n            f\"silently vectorless backup. 
Check the find projection / astrapy version.\"\n        )\n```\n\nThe nuance is the distinction between two kinds of vector collections.\n\nIn a vectorize collection, the document carries source text in $vectorize, and the server computes $vector.\n\nIf the dump is missing $vector, that may be recoverable because re-inserting $vectorize can regenerate the embedding.\n\nSo the script warns, but proceeds.\n\nIn an explicit-vector collection, embeddings were supplied by the client.\n\nIf $vector is missing, that is real data loss.\n\nThe script fails the collection, marks the backup incomplete, and blocks the Box upload.\n\nThat is the difference between a backup system that hopes and a backup system that checks.\n\nThe guard turns an invisible failure mode into a loud one.\n\nA backup is only as good as its weakest failure mode.\n\nThree mechanisms keep a half-finished or corrupted snapshot from ever pretending to be a good one.\n\nThe backup writes into a hidden .inprogress directory first.\n\nOnly at the very end does it rename the directory to its final name.\n\n```\nwork_dir = output_dir / f\".{backup_name}.inprogress\"\nif failures:\n    os.replace(work_dir, output_dir / f\"{backup_name}.INCOMPLETE\")\n    return 2\nos.replace(work_dir, output_dir / backup_name)\n```\n\nos.replace() is an atomic rename on POSIX systems, provided the source and destination live on the same local filesystem.\n\nThat means a directory named like this:\n\n```\n20260616_085246\n```\n\nis guaranteed to be a complete, finished backup.\n\nA crash mid-run leaves a .inprogress directory that the restore tool ignores.\n\nA partial failure leaves an .INCOMPLETE directory that the restore tool refuses to use unless explicitly forced.\n\nThere is no in-between state that looks complete but is not.\n\nThe same idea applies inside a collection: a retry truncates and rewrites the data file from scratch, so a failed page can never leave behind a half-written .jsonl.\n\nThis is the part I am quietly proud of.\n\nEvery backup writes a manifest.json that fully describes 
itself.\n\nThe restore tool does not have to guess anything.\n\nIt does not have to trust the filesystem.\n\nThe manifest becomes the contract between:\n\nI wrote this.\n\nand:\n\nI can safely read this back.\n\nA simplified manifest looks like this:\n\n```\n{\n  \"backup_format_version\": \"1.0\",\n  \"status\": \"complete\",\n  \"created_at_utc\": \"2026-06-16T08:52:46+00:00\",\n  \"finished_at_utc\": \"2026-06-16T08:54:09+00:00\",\n  \"duration_seconds\": 83.4,\n  \"astrapy_version\": \"...\",\n  \"source\": {\n    \"api_endpoint\": \"https://<your-db>.apps.astra.datastax.com\",\n    \"keyspace\": \"default_keyspace\"\n  },\n  \"serdes_options\": {\n    \"custom_datatypes_in_reading\": false,\n    \"unroll_iterables_to_lists\": true\n  },\n  \"document_format\": {\n    \"layout\": \"one JSON document per line (JSONL)\",\n    \"vector_field\": \"$vector stored as a plain JSON array of floats\"\n  },\n  \"totals\": {\n    \"collections_ok\": 3,\n    \"collections_failed\": 0,\n    \"documents\": 12873\n  },\n  \"collections\": [\n    {\n      \"name\": \"assets\",\n      \"status\": \"ok\",\n      \"definition\": {\n        \"vector\": {\n          \"dimension\": 384,\n          \"metric\": \"cosine\"\n        }\n      },\n      \"data_file\": \"assets.jsonl.gz\",\n      \"compressed\": true,\n      \"document_count\": 5120,\n      \"estimated_document_count\": 5118,\n      \"data_sha256\": \"9f2c...\",\n      \"data_size_bytes\": 41203847\n    }\n  ]\n}\n```\n\nThe manifest gives me three important things.\n\nEach collection’s data file is hashed with SHA-256 as it is written.\n\nBefore restore, the restore tool verifies every file against its recorded digest.\n\nIf a zip gets truncated in transit to or from Box, the restore catches it before writing a single document to the database.\n\nThat matters.\n\nYou do not want to discover corruption halfway through clobbering a live collection.\n\nEach collection entry carries the collection definition: vector dimension, similarity metric, lexical config, rerank config, and indexing rules.\n\nA restore 
recreates the collection from that definition first, then streams the documents in.\n\nSchema and data are never separated.\n\nThe manifest records where the snapshot came from, which keyspace it used, when it ran, how long it took, which AstraPy version created it, and which serdes options were active.\n\nOne deliberate omission:\n\n**the application token is never written anywhere in the backup.**\n\nThe endpoint is not a secret.\n\nThe token is.\n\nSo the token stays out.\n\nThe status field and per-collection statuses make the backup self-auditing. You can tell whether a backup is trustworthy by reading one small JSON file, without unpacking gigabytes of vectors.\n\nFinally, the process speaks in exit codes that a scheduler can branch on.\n\nExit code 0 means the backup completed cleanly and was uploaded to Box, unless --no-box was used.\n\nExit code 1 means there was a config or connection error, and nothing useful was written.\n\nExit code 2 means one or more collections failed, so the backup was marked INCOMPLETE and was not uploaded.\n\nExit code 3 means the local backup is good, but the Box upload failed, so the local copy was kept.\n\nExit code 3 is the one I like most.\n\nA Box outage should not destroy a perfectly good local backup.\n\nThe upload step zips the finished directory, pushes it to Box, and if anything fails, it keeps the local snapshot untouched:\n\n```\nexcept Exception as exc:\n    LOG.error(\"Box upload FAILED: %s. 
Local backup kept at %s\", exc, final_dir)\n    return 3\n```\n\nDistinct codes let the cron job, or a human reading the logs, tell the difference between the database being unreachable, the backup being incomplete, and the database being fine while Box is down.\n\nThose are three very different things to get paged about.\n\nOff-host storage matters.\n\nA backup sitting on the same host as the system it protects is not much of a backup.\n\nThe script zips the finished snapshot as a timestamped file, for example:\n\n```\n20260616_085246.zip\n```\n\nThen it uploads that zip to a Box folder using Client Credentials Grant auth.\n\nA few pragmatic details made this more reliable.\n\nSmall files use a simple one-shot upload.\n\nLarger files use Box’s chunked uploader.\n\nVector backups can get big quickly.\n\nIf a file with the same backup name already exists, Box may return an HTTP 409.\n\nInstead of failing, the helper uploads the new contents as a new version of the existing file.\n\nThat makes idempotent re-runs easier.\n\nThe restore tool determines the latest snapshot from the timestamp embedded in the filename, not from Box’s upload time.\n\nThat makes the restore behavior deterministic, even if files are copied, re-uploaded, or synced later.\n\nIf you are rolling your own backup for a vector database, or any managed datastore without a blessed dump tool, the lesson is not one specific line of code.\n\nThe lesson is a posture:\n\nAssume your backup is broken until something proves it is not, and make that proof part of the backup itself.\n\nFor me, that meant being explicit about expensive fields, streaming instead of buffering, refusing to flatten types that cannot be reconstructed, counting vectors and failing loudly when they disappear, making “done” atomic, carrying SHA-256 integrity checks, and using exit codes that distinguish different failure modes.\n\nNone of this is exotic.\n\nIt was maybe a day of work on top of the naive version.\n\nBut it is the 
difference between a backup you technically have and a backup you can bet your data on.\n\nAnd you only find out which one you built on the day you need it.\n\nThe system behind this was an internal catalog app: a React SPA and Flask API over a hybrid RAG pipeline on AstraDB, with backup and restore running as scheduled cron jobs on a serverless container platform.\n\nThe backup script is strictly read-only.\n\nIt never issues a single write to the database it protects.\n\nThe real lesson for me was simple: a backup is not successful because the script finished. It is successful only when it proves that the data, vectors, schema, and restore path are all still intact. For vector databases, that proof has to be designed in from the beginning.\n\n[Backing Up a Vector Database to Box: Preserving Vector and ID Fields in JSONL](https://pub.towardsai.net/backing-up-a-vector-database-to-box-preserving-vector-and-id-fields-in-jsonl-bd43bc728e3a) was originally published in [Towards AI](https://pub.towardsai.net) on Medium, where people are continuing the conversation by highlighting and responding to this story.", "url": "https://wpnews.pro/news/backing-up-a-vector-database-to-box-preserving-vector-and-id-fields-in-jsonl", "canonical_source": "https://pub.towardsai.net/backing-up-a-vector-database-to-box-preserving-vector-and-id-fields-in-jsonl-bd43bc728e3a?source=rss----98111c9905da---4", "published_at": "2026-06-27 07:32:56+00:00", "updated_at": "2026-06-27 07:39:19.662109+00:00", "lang": "en", "topics": ["ai-infrastructure"], "entities": ["DataStax AstraDB", "Box", "AstraPy", "DataAPIClient"], "alternates": {"html": "https://wpnews.pro/news/backing-up-a-vector-database-to-box-preserving-vector-and-id-fields-in-jsonl", "markdown": "https://wpnews.pro/news/backing-up-a-vector-database-to-box-preserving-vector-and-id-fields-in-jsonl.md", "text": "https://wpnews.pro/news/backing-up-a-vector-database-to-box-preserving-vector-and-id-fields-in-jsonl.txt", "jsonld": 
"https://wpnews.pro/news/backing-up-a-vector-database-to-box-preserving-vector-and-id-fields-in-jsonl.jsonld"}}