# Dell - data gravity is more than byte count

> Source: <https://www.blocksandfiles.com/ai-ml/2026/06/23/dell-data-gravity-is-more-than-byte-count/5259971>
> Published: 2026-06-23 12:36:13+00:00

Dell has published [three blogs](https://www.blocksandfiles.com/ai-ml/2026/06/22/dell-and-data-physics/5259281) setting out its position that data gravity is real and intrinsic to the distributed IT activities of enterprises. The company argues that enterprises need a federated IT architecture to handle this and feed data to AI processing, rather than a single-namespace, storage-embedded design.

Jon Hyde, Dell’s Senior Director for Competitive Intelligence, says there are three kinds of data:

Data - the heavy form. Files, records, images, video, telemetry, regulated tables. Massive, slow, expensive to move. It stays where it is for a reason.

Metadata - the descriptive form. Tags, lineage, schema, classification, ownership. Lightweight. Cheap to propagate. It lets AI see every asset without traveling to it.

[Vectors](https://www.blocksandfiles.com/ai-ml/2022/04/28/vector-embedding/1596580) - the meaning form. Mathematical representations generated by AI. Locality-sensitive, GPU-adjacent. They carry meaning across the estate without carrying the underlying data.

As I understand it, a vectorized dataset can be significantly larger than the raw data on which it is based. Vector embeddings are [typically several times larger](https://discuss.elastic.co/t/vector-embedding-huge-size-increase/356670) than the original raw data they represent, often 3x to 20x or more per item, depending on chunking, dimensionality, and storage format. The original dataset is usually chunked before embedding (for example, documents are split into paragraphs or fixed token windows) to improve retrieval quality. A typical text chunk (250–500 words) is roughly 0.5–2 KB of raw text, and the vector alone is often 3–10x larger than its source chunk. That suggests vector data, like Data (the heavy form), is also "massive, slow, expensive to move."
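A rough back-of-envelope sketch of that arithmetic, in Python. The 1,536-dimension figure, the chunk sizes, and the function names are illustrative assumptions of ours, not numbers from Dell or the linked thread, and index and metadata overhead is ignored.

```python
def embedding_bytes(dimensions: int, bytes_per_value: int = 4) -> int:
    """Size of one dense vector in bytes (4 bytes per value = float32)."""
    return dimensions * bytes_per_value


def expansion_ratio(chunk_bytes: int, dimensions: int,
                    bytes_per_value: int = 4, overlap: float = 0.0) -> float:
    """Vector bytes produced per byte of source text.

    overlap is the fraction of each chunk shared with its neighbour.
    Overlapping chunks mean more chunks, and so more vectors, per source byte.
    """
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in the range [0, 1)")
    new_source_bytes_per_chunk = chunk_bytes * (1.0 - overlap)
    return embedding_bytes(dimensions, bytes_per_value) / new_source_bytes_per_chunk


# Assumed example: 2 KB chunks, 1,536 float32 dimensions, no overlap.
print(expansion_ratio(2048, 1536))                     # 3.0
# Same model, 50 percent chunk overlap.
print(expansion_ratio(2048, 1536, overlap=0.5))        # 6.0
# Same chunks, vectors quantized to one byte per value.
print(expansion_ratio(2048, 1536, bytes_per_value=1))  # 0.75
```

Under these assumptions the ratio moves from below 1x to several times the source size purely through chunk size, overlap, and numeric precision.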

We asked Dell whether Jon Hyde agreed with this.

Since we asked the question, we’ll publish Hyde’s reply in full and let it speak for itself.

He tells us: “The 3x–20x figure is real, and I'll concede it under the right conditions: dense text, high dimensionality, full-precision floats, overlapping chunks. But "gravity" in my original post was never about byte-count. It's about what keeps data in place, the fact that it's a system of record, wired into the applications around it, carrying regulatory weight. Vectors carry none of that. And if you lose a vector index, you rebuild it. You can't do that with a regulated record.

“There's also a size argument that runs the other way. Video and imaging embed to a tiny fraction of the original and only the five-minute anomaly out of six hours of surveillance footage is worth embedding. Vector size depends on choices you control, not on how big the source is.

“But here's what the question is really pointing at: what happens before the embedding? Vectors aren't generated directly from raw data. Data has to be preprocessed first - chunked, cleaned, normalized, filtered. That step is compute and I/O-intensive, and it has to happen close to the data. You can't efficiently preprocess petabytes of video, telemetry or regulated records across a wide-area network. This is where a federated architecture earns its keep. Preprocessing runs where the data already lives, and only the lighter resulting artifacts move onward. It also means vectors can be created and kept inside the same regional or legal boundary as the source, instead of pulling everything into one place first.

“Last thing: an embedding pipeline isn't a sync job. It's a one-way, repeatable process — run it again when the source changes, and you're done. No ongoing reconciliation, no broken downstream copies, no duplicate governance burden. The cost scales with how much new data you embed and doesn't compound as you add sources. A permanent sync relationship does the opposite.

“A storage-embedded platform assumes all of this happens inside its own namespace. Everything outside is invisible until it's been copied in. In a real enterprise estate that is distributed, sovereign and application-coupled, that copy step is often the bottleneck. Sometimes it simply isn't possible.”
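Hyde's points about rebuilding a lost index and re-running the pipeline when the source changes can be illustrated with a minimal Python sketch. It is not Dell's implementation: `toy_embed` is a hypothetical stand-in for a real embedding model, the in-memory dictionaries stand in for a source system and a vector index, and every name here is ours.

```python
import hashlib
from typing import Callable, Dict, List, Tuple

Vector = List[float]
# Index layout (assumed): source id -> (content hash, vectors for its chunks)
Index = Dict[str, Tuple[str, List[Vector]]]


def chunk_text(text: str, size: int = 200) -> List[str]:
    """Preprocess: normalize whitespace, then split into fixed-size chunks."""
    cleaned = " ".join(text.split())
    return [cleaned[i:i + size] for i in range(0, len(cleaned), size)]


def toy_embed(chunk: str) -> Vector:
    """Hypothetical, deterministic stand-in for an embedding model."""
    digest = hashlib.sha256(chunk.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:8]]


def refresh_index(sources: Dict[str, str], index: Index,
                  embed: Callable[[str], Vector] = toy_embed) -> List[str]:
    """Re-run the pipeline, embedding only sources whose content changed.

    Data flows one way, from source to index. Returns the ids re-embedded.
    """
    changed = []
    for source_id, text in sources.items():
        content_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
        cached = index.get(source_id)
        if cached is not None and cached[0] == content_hash:
            continue  # unchanged source: no work, no reconciliation
        index[source_id] = (content_hash, [embed(c) for c in chunk_text(text)])
        changed.append(source_id)
    # Vectors for sources that no longer exist are simply dropped.
    for stale_id in set(index) - set(sources):
        del index[stale_id]
    return changed


sources = {"a": "alpha record " * 50, "b": "beta record " * 50}
index: Index = {}
print(refresh_index(sources, index))  # ['a', 'b']  first run embeds everything
print(refresh_index(sources, index))  # []          nothing changed
sources["b"] += " amended"
print(refresh_index(sources, index))  # ['b']       only the changed source

rebuilt: Index = {}                   # simulate losing the index entirely
refresh_index(sources, rebuilt)
print(rebuilt == index)               # True: the index is a derived artifact
```

In this sketch the work on each run is proportional to the sources that changed, and the index can always be regenerated from the sources, which is the property Hyde contrasts with a regulated record.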
