# Deriving slugs from embeddings: vec2slug

> Source: <https://hash.dev/blog/vec2slug>
> Published: 2026-05-25 00:00:00+00:00

Building a tiny transformer decoder to extract URL slugs from vector embeddings

May 25th, 2026

Vector embeddings let us compare how similar two things are. Whether for search, RAG, deduplication, pattern analysis, or recommendation, similarity comparison is *usually* where we stop.

An embedding compresses some input (a document, a file, or perhaps just a string) into a single fixed-size vector. Embedding models are typically trained with a contrastive objective that pushes semantically related documents together in that space. That's why similarity search works: take the cosine distance between two embeddings, and a low score means they're similar in topic. But if embeddings encode enough semantic meaning for similarity, then in principle they should also answer harder questions about the original document than just *is this near that?*
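As a minimal sketch of that comparison, here is cosine distance in plain Python. The three-dimensional vectors are made-up stand-ins for real 1536-dimensional embeddings, and the variable names are purely illustrative.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """1 - cosine similarity: 0 means same direction, 2 means opposite."""
    return 1.0 - cosine_similarity(a, b)

# Toy vectors (assumed values, not real embeddings).
doc_aviation = [0.9, 0.1, 0.0]
doc_pilots = [0.8, 0.2, 0.1]
doc_cooking = [0.0, 0.1, 0.9]

near = cosine_distance(doc_aviation, doc_pilots)   # small: similar topics
far = cosine_distance(doc_aviation, doc_cooking)   # large: unrelated topics
print(near < far)
```

The score only says how close two vectors are; it says nothing about what either document is actually about, which is the gap the rest of this post explores.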

Last weekend we spent some time exploring answers and applications. The first model we trained produced the string `of-the-a` for every input. The final model produces `amelia-earhart-pilot` for an article about Amelia Earhart, terminates at the right length on its own, runs in ~89 ms on a budget VPS, and is ~14–19× faster and ~85× cheaper than a Haiku-class LLM call for the same task [1].

To test whether embeddings actually encode recoverable semantic context, we needed a task harder than classification — ideally something useful enough that, if the model worked, it could go to production.

We landed on URL slug generation. Titles were an option too, but slugs are much easier to mine at scale. Slugs are the hyphenated strings at the end of a URL path (like `slug-from-embedding`). Producing them requires a human, an LLM, or a verbatim extraction from the title, and the verbatim approach loses most of an article's semantic context.

Slugs sit in a gap that makes them an interesting test case. A classifier can't recover them: a slug is a composed sequence with ordering and specific word choices, not a label from a fixed set. But an LLM is overkill: the task is bounded enough that a small model and a single 1536-dim vector turn out to be sufficient. And the training data is free: it's in the URL of every SEO-optimized web page.

Embeddings are cheap to compute (~$11 for 2.3M documents at OpenAI batch rates) and moderately expensive to store (1536 × f32, so 6 KiB per vector), but neither cost applies here. The idea is to piggyback on embeddings a system already has for search or deduplication. Deriving a slug from an existing vector costs the CPU time of one inference call — far faster and cheaper than the source-text retrieval, API call, and inference cost of shelling out to an LLM for the same task.
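The storage figure is easy to check. The sketch below reproduces the per-vector size and, as an extra illustration, extrapolates to the 2.3M-document corpus; it counts raw vector bytes only and assumes no index or metadata overhead.

```python
DIMENSIONS = 1536          # vector size quoted above
BYTES_PER_F32 = 4          # a 32-bit float occupies 4 bytes
NUM_DOCUMENTS = 2_300_000  # corpus size quoted above

bytes_per_vector = DIMENSIONS * BYTES_PER_F32
kib_per_vector = bytes_per_vector / 1024
total_gib = bytes_per_vector * NUM_DOCUMENTS / 1024**3

print(f"{kib_per_vector:.0f} KiB per vector")
print(f"{total_gib:.1f} GiB for {NUM_DOCUMENTS:,} vectors")
```

That works out to 6 KiB per vector and roughly 13 GiB of raw vectors for the full corpus.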

The work that convinced us this was worth trying is [vec2text](https://github.com/jxmorris12/vec2text) (Morris et al., 2023). They showed that iterative correction can reconstruct 92% of 32-token inputs exactly from their embeddings, using a 235M-parameter T5-base model with multiple correction passes. If embeddings preserve enough information for that, recovering a five-word slug should be feasible with something much smaller. Our final model is 25M parameters with a single forward pass: 10× fewer parameters, no iteration, and it achieves comparable topic-recovery for the narrow task of slug generation. If a single-pass network can extract useful auxiliary outputs from the same vector, slug generation is just the first application. For HASH this matters for any human-readable identifier the UI generates: draft titles, suggested filenames, entity slugs.

Before training anything, we needed a dataset. We built an extraction pipeline that takes in arbitrary document sources, funnels them through a series of filters, and outputs a cleaned corpus.

We built two corpora, both targeting diverse educational content (more likely to yield meaningful slugs and meaningful text). The first was a 10,000-document feasibility set mixing arXiv (25%), GitHub issues (25%), and [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) (50%). Slugs are unavailable for both arXiv and GitHub issues, so instead of mechanical extraction we used Haiku at temperature 0 to generate synthetic slugs. The slugs were meaningful and representative, but at $5.25 per 10,000 documents, scaling to our target corpus would've cost ~$1.2k. Not crazy… but perhaps a little expensive for a weekend mini-project. That left local inference (too slow to finish in time) or finding another way to derive slugs.
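The extrapolation behind that estimate is a straight linear scale-up. The sketch assumes the target corpus is the 2.3M documents we ended up with and that the per-document cost stays constant.

```python
COST_PER_BATCH_USD = 5.25     # observed cost for the feasibility set
BATCH_SIZE = 10_000           # documents in the feasibility set
TARGET_DOCUMENTS = 2_300_000  # assumed size of the target corpus

cost_per_document = COST_PER_BATCH_USD / BATCH_SIZE
projected_cost = cost_per_document * TARGET_DOCUMENTS
print(f"${projected_cost:,.2f}")
```

Under those assumptions the projection lands at a little over $1,200, matching the ~$1.2k figure.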

The actual solution turned out to be obvious. Slugs exist to tell users (and search engines) what a webpage is all about. So instead of *deriving* our training slugs from documents, we *extracted* them directly from real URLs. FineWeb-Edu, our primary dataset, annotates each document with its source URL, so meaningful slugs come essentially free. The tradeoff is that we don't control them: some are truncated, SEO-stuffed, editorially inconsistent, or just look like slugs but are actually ids in disguise. Good enough for a prototype, we thought, and cleaning this up later *should* be straightforward — for example, a logistic-regression re-ranker that scores quality using both the slug's embedding and the URL.

The filtering pipeline is built on the [datatrove](https://github.com/huggingface/datatrove) library. A language filter ([fasttext](https://fasttext.cc/)) rejects non-English documents to keep the vocabulary manageable (FineWeb-Edu is primarily English, but other sources like GitHub issues needed filtering). Slug extraction from URLs comes next: take the last meaningful path segment, reject anything that doesn't match a kebab-case pattern, filter by length, numeric density, and stopword ratio. This is the largest single drop: 62% of documents lack an extractable slug. A Gopher repetition filter removes spam and boilerplate. Finally, a token-count filter admits only documents in the 50 to 1,000 token range. The token length filter isn't strictly necessary — the model sees a fixed-size embedding vector regardless — but it serves as a quality proxy and makes embedding-cost estimation easier. We didn't want to wake up to a multi-hundred-dollar OpenAI bill.
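To make the slug-extraction step concrete, here is a rough standalone sketch of that kind of filter. It is not the pipeline's actual code: the function name, the stopword list, the file-extension handling, and every threshold are assumptions chosen for illustration, and "last meaningful path segment" is simplified to "last non-empty segment".

```python
import re
from urllib.parse import urlparse

# All constants below are illustrative assumptions, not the real settings.
KEBAB_CASE = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)+$")
FILE_EXTENSION = re.compile(r"\.(?:html?|php|aspx?)$")
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "on", "with"}
MIN_WORDS, MAX_WORDS = 2, 8
MAX_DIGIT_RATIO = 0.3
MAX_STOPWORD_RATIO = 0.5

def extract_slug(url):
    """Return a candidate slug from the URL, or None if it is filtered out."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    if not segments:
        return None
    # Take the last path segment and drop a trailing file extension.
    candidate = FILE_EXTENSION.sub("", segments[-1].lower())
    if not KEBAB_CASE.match(candidate):
        return None
    words = candidate.split("-")
    if not MIN_WORDS <= len(words) <= MAX_WORDS:
        return None
    # Numeric density: reject ids and dates dressed up as slugs.
    digits = sum(ch.isdigit() for ch in candidate)
    if digits / len(candidate) > MAX_DIGIT_RATIO:
        return None
    # Stopword ratio: reject slugs that carry little content.
    stopword_hits = sum(word in STOPWORDS for word in words)
    if stopword_hits / len(words) > MAX_STOPWORD_RATIO:
        return None
    return candidate

print(extract_slug("https://example.com/blog/amelia-earhart-pilot/"))
print(extract_slug("https://example.com/a/2021-05-25-1234"))
```

The first URL passes every check and yields its slug; the second matches the kebab-case pattern but is rejected on numeric density.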

For the smaller corpus, slug generation happens after the fact (via Haiku), so the URL-based slug filter doesn't apply. For the URL corpus, the pipeline takes FineWeb-Edu's ~9.7M documents down to 2.3M usable training samples ([Figure 2](#figure-pipeline)).

In a language without a large vocabulary, we might just pool words together, seek to predict the right ones, and assemble a slug. 5,000 output classes for slug-specific nouns might even be enough. But the English language has approximately 500,000 to 600,000 dictionary words. Including scientific terms, technical jargon, regional slang, and old/obsolete words this rises to somewhere north of a million.

The 2.3M slugs in our training corpus contained 316k unique words, with an astonishingly long tail: 62% of these were hapax legomena (words appearing exactly once). URL slugs are particularly tail-heavy because they exist to index *distinctive* content: proper nouns, niche topics, and specific terminology, as well as various conjugations and compound words. Predicting over 316k classes isn't practical, so two options remained: compress the vocabulary through clustering, or abandon word-level prediction entirely and switch to a subword tokenization approach like byte-pair encoding (BPE).
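Measuring that tail takes only a word count. A minimal sketch, using a handful of made-up slugs rather than the real corpus (the function name is ours for illustration):

```python
from collections import Counter

def vocabulary_stats(slugs):
    """Count unique slug words and the share that occur exactly once."""
    counts = Counter(word for slug in slugs for word in slug.split("-"))
    if not counts:
        return 0, 0.0
    hapaxes = [word for word, n in counts.items() if n == 1]
    return len(counts), len(hapaxes) / len(counts)

# Made-up slugs for illustration only.
toy_slugs = [
    "amelia-earhart-pilot",
    "history-of-aviation",
    "aviation-pilot-training",
    "history-of-jazz",
]
unique_words, hapax_share = vocabulary_stats(toy_slugs)
print(unique_words, hapax_share)
```

In this toy set, four of the eight unique words appear exactly once, so the hapax share is 0.5.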

But we weren't ready to give up on word-level prediction yet. We hypothesized that a simpler architecture, a multilayer perceptron (MLP) rather than a transformer, would be cheaper to train and run, and we wanted to see whether it could recover slugs from embeddings. To compress the vocabulary we tried clustering: embed each of the 316k unique slug words with the same OpenAI model we used for the documents, then group similar embeddings together. We tested four approaches: [K-means clustering](https://en.wikipedia.org/wiki/K-means_clustering), hierarchical density-based clustering ([HDBSCAN](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.HDBSCAN.html), specifically), [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) thresholding, and [Louvain](https://en.wikipedia.org/wiki/Louvain_method) community detection.

Most of these attempts failed:

That single words cluster so tightly makes sense. The embedding model employed ([OpenAI's text-embedding-3-small](https://platform.openai.com/docs/guides/embeddings)) is trained on full text with a contrastive objective; its job is to capture semantic similarity. For a single word, `anne` and `amelia` are the same thing: female first names. Without surrounding context, the model has no way to distinguish them. From its perspective, it doesn't matter.

K-means works differently from the graph-based approaches: it doesn't need pairwise similarity to be meaningful, just distance from centroids. With k=5,000 it compressed the vocabulary from 316k words to 5,000 clusters (63×). Of the algorithms we'd scoped, it was the only real option ([Figure 3](#figure-vocab-compression)).
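To show the mechanics of compressing a vocabulary this way, here is a toy pure-Python version of Lloyd's algorithm. It is a sketch under loud assumptions, not our training code: the two-dimensional vectors are invented, k is 2 rather than 5,000, and seeding with the first k points is a simplification to keep the example deterministic. A real run would use a library implementation on the 1536-dimensional word embeddings.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))

def kmeans(points, k, iterations=10):
    """Plain Lloyd's algorithm, seeded with the first k points."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        # Assignment step: group each point with its nearest centroid.
        groups = [[] for _ in range(k)]
        for point in points:
            groups[nearest(point, centroids)].append(point)
        # Update step: move each centroid to the mean of its group.
        for i, group in enumerate(groups):
            if group:  # keep the old centroid if a cluster goes empty
                centroids[i] = [sum(dim) / len(group) for dim in zip(*group)]
    return centroids

# Toy 2-d "word embeddings" (assumed values for illustration).
word_vectors = {
    "anne": [0.9, 0.1],
    "pilot": [0.1, 0.8],
    "amelia": [1.0, 0.2],
    "aviator": [0.2, 0.9],
}
centroids = kmeans(list(word_vectors.values()), k=2)
word_to_cluster = {w: nearest(v, centroids) for w, v in word_vectors.items()}
print(word_to_cluster)
```

The resulting mapping replaces each word with a cluster id, which is what shrinks the number of output classes. It also shows the cost: `anne` and `amelia` land in the same cluster, so the distinction between them is lost.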
