{"slug": "how-we-vectorize-33-7m-ukrainian-court-decisions-via-voyage-ai", "title": "How We Vectorize 33.7M Ukrainian Court Decisions via Voyage AI", "summary": "The article's authors are vectorizing 33.7 million Ukrainian civil court decisions from the Unified State Register of Court Decisions (EDRSR) using Voyage AI's voyage-3.5 embedding model and storing the vectors in Qdrant. The Qdrant collection, running on a dedicated EC2 instance, currently holds over 44 million vectors across criminal, civil, commercial, and misdemeanor cases; the civil cohort is the largest, with 14.3 million of its 33.7 million documents processed so far. The pipeline uses chunking and asynchronous processing to reach a throughput of 63 documents per second.", "body_md": "*EDRSR, the Unified State Register of Court Decisions, is effectively all of Ukraine's judicial practice in open access. Today Qdrant holds **44M+ vectors**: criminal (19M), civil (14.3M), commercial (5.1M), misdemeanors (5.6M). Vectorization of civil cases (CPC, justice_kind=1), the largest cohort at 33.7M documents, runs on a dedicated EC2 instance (r6a.xlarge, 32 GB RAM, 2 TB gp3). Here's what's under the hood: models, pipeline, cost, pitfalls, and current status.*\n\nWhen a lawyer searches \"is there case law on recovering bank prepayment fees\", they don't want to open 40 decisions and read them through. They want the system to surface the top 5 most relevant ones, pull out key paragraphs, and show how courts reasoned. Full-text search (FTS) over keywords doesn't give that: it returns every document containing the word \"fee\", and there are thousands.\n\nFor this semantic task you need vector representations of text. The model turns a paragraph from a decision into a point in a 1024-dimensional space; semantically similar paragraphs sit near each other. A kNN search in Qdrant returns the top K nearest, and an LLM composes the answer from exactly those relevant fragments.\n\nThe only problem: the register is big. Very big.\n\nOur prod database holds full texts of decisions starting from 2006. 
Breakdown by procedural type:\n\nThe Qdrant collection `edrsr_decisions` on a dedicated EC2 currently holds **44M+ vectors** (122 segments, on_disk=true):\n\n| Proceeding type | justice_kind | Vectors |\n|---|---|---|\n| Criminal (CrPC) | 2 | 19,036,347 |\n| Civil (CPC) | 1 | 14,328,427 |\n| Misdemeanors (CUaP) | 5 | 5,579,432 |\n| Commercial (CC) | 3 | 5,098,662 |\n| **Total** | | **44,042,868** |\n\nCivil cases processed: 14.3M out of 33.7M, or 42%. After CPC completes there will be roughly **63M+ vectors** in a single collection.\n\nFor scale: a typical RAG project holds 100K to 1M vectors. Ours is two orders of magnitude bigger.\n\n**Embedding model.** `voyage-3.5` from Voyage AI. 1024-dimensional output, 6 cents per million tokens. We tested Voyage 3 Large and OpenAI text-embedding-3-large, but the quality gain on legal text didn't justify the cost difference (Voyage 3 Large is 3x more expensive). We already had an index on 3.5 for prior jurisdictions, so we stay on it for compatibility.\n\n**Vector DB.** Qdrant v1.17, self-hosted in Docker on a dedicated EC2 (r6a.xlarge: 4 CPU, 32 GB RAM, 2 TB gp3). Collection `edrsr_decisions` with HNSW index, on_disk=true for both vectors and payload. Payload carries doc_id, court_code, judge, justice_kind, adjudication_date, plus chunk_index/total_chunks and chunk text. Dedicated instance because 44M+ points with HNSW were killing RAM on prod and blocking the chat service (OOM kills during segment optimization).\n\n**Source-of-truth.** PostgreSQL 15, partitioned tables: RANGE by adjudication_date, LIST by adj_year. Full texts live in `edrsr_fulltext`, metadata in `edrsr_documents`. A JOIN across all partitions is 30M+ rows, so the pipeline walks year by year.\n\n**Runtime.** Python 3.11, asyncio, aiohttp. No frameworks, just direct HTTP to Voyage and Qdrant. 440 lines of code, one file.\n\nCourt decisions are long. The average CPC ruling is 8–12K characters; the longest reach 200K. 
Voyage accepts up to 32K tokens per input, but quality falls off on long contexts, and one long vector is poor for retrieval — the LLM can't tell which paragraph is relevant.\n\nSo we chunk: up to 2048 characters per chunk, 50-word overlap between neighbors. We split on paragraph boundaries to keep semantic coherence. On average one decision yields 2.7 chunks.\n\nEach chunk in Qdrant gets a composite ID (doc_id × 1000 + chunk_index) — no collisions, and a single payload filter query pulls all chunks of a specific decision.\n\nVoyage has a rate limit — 2000 RPM per key for voyage-3.5. We have two keys and round-robin between them, giving a theoretical 4000 RPM ceiling. In practice we hold concurrency 50 and get a steady **63 documents per second**. That's ~170 requests per minute per key — comfortably under the rate limit.\n\nWe tried concurrency 70 — first two million were fine, then the process stalled on the GIL (13% CPU, no progress, no errors — just stuck on a thread lock). Dropped to 50 — ran smooth, no deadlocks, no 429s.\n\nEvery 100 documents triggers a batch to Voyage (batch_size=500 chunks/request), gets embeddings, composes Qdrant points, and does one upsert. On Voyage error (429, network) — exponential backoff with jitter, max 5 retries. On Qdrant error — retry the same batch.\n\nAt 33.7M documents any failure — network, OOM, container crash — means hours of lost work. So:\n\n`{last_doc_id, processed_docs, total_chunks, total_tokens, timestamp}`\n\n`WHERE doc_id > last_doc_id`\n\nThis has saved us twice. First time — when postgres-prod ran out of memory (more on that below). Second time — when Qdrant restarted and lost its API key from env. Both times we just restarted from the same checkpoint with no duplicated work.\n\nAt 2.86M documents postgres-prod fell into recovery mode. Root cause: config mismatch — `shared_buffers=16GB`\n\n, container memory limit 12G. 
PG tried to allocate more than it had; the OOM killer terminated the process.\n\nFix in PR #1453: `mem_limit: 24G`, `shm_size: 16g`. After restarting the container with the new limits PG came up in 4 seconds and stopped falling over. The episode highlighted an infra pattern: postgresql.conf parameters (shared_buffers, work_mem, maintenance_work_mem) must align with container limits. Otherwise the system runs fine until the first load spike, then falls into recovery.\n\nWe also bumped swap on the local dev machine from 8GB to 24GB: heavy Voyage API traffic generates a lot of temporary objects in the Python process memory, especially while Qdrant is rebuilding its index in the background.\n\nOne civil document averages 2.7 chunks × 850 tokens ≈ 2,300 tokens. At voyage-3.5 pricing of 6 cents per million tokens, one document costs **0.014 cents**, roughly 138 microdollars.\n\nAs of today, 14.3M documents out of 33.7M are processed, or 42% of the cohort. We've spent approximately **1,980 dollars** on the Voyage API and about 63 hours of pipeline runtime. The remaining 19.4M documents will cost roughly **2,680 dollars** and take about **85 hours** (3.5 days of continuous processing). Total cost of vectorizing the full CPC cohort: around **4,660 dollars**.\n\nPlus the EC2 r6a.xlarge for Qdrant: ~$0.20/hr (on-demand), roughly $145/month. Cheaper than OOM incidents on prod.\n\nFor scale: the same budget on OpenAI text-embedding-3-large would get us only a quarter of the volume. Voyage wins specifically at this scale.\n\nSemantic search already works across 44M+ vectors today. Once the civil cohort is fully indexed, the collection will hold 63M+ chunks. A lawyer types a natural-language query (\"case law on voiding a sale contract due to seller incapacity\") and the system returns the most relevant decisions from the right jurisdiction, with key paragraph extracts and EDRSR links.\n\nThat's a different class of product compared to FTS. FTS finds documents where a phrase appears. 
Semantic search finds documents where your situation is being discussed — even when the court used entirely different words.\n\nRuns in tmux on a dedicated EC2, checkpoint fires every 1000 docs. Snapshot sync to prod Qdrant every 6 hours via cron. Boring reliable engineering, not heroics.\n\n*Originally published on legal.org.ua.*", "url": "https://wpnews.pro/news/how-we-vectorize-33-7m-ukrainian-court-decisions-via-voyage-ai", "canonical_source": "https://dev.to/overthelex/how-we-vectorize-337m-ukrainian-court-decisions-via-voyage-ai-3hlc", "published_at": "2026-07-03 21:35:48+00:00", "updated_at": "2026-07-03 21:49:11.875665+00:00", "lang": "en", "topics": ["machine-learning", "large-language-models", "ai-products", "ai-infrastructure", "developer-tools"], "entities": ["Qdrant", "Voyage AI", "Unified State Register of Court Decisions", "EC2", "PostgreSQL", "Python", "Docker", "HNSW"], "alternates": {"html": "https://wpnews.pro/news/how-we-vectorize-33-7m-ukrainian-court-decisions-via-voyage-ai", "markdown": "https://wpnews.pro/news/how-we-vectorize-33-7m-ukrainian-court-decisions-via-voyage-ai.md", "text": "https://wpnews.pro/news/how-we-vectorize-33-7m-ukrainian-court-decisions-via-voyage-ai.txt", "jsonld": "https://wpnews.pro/news/how-we-vectorize-33-7m-ukrainian-court-decisions-via-voyage-ai.jsonld"}}