Near-lossless delta compression for fine-tuned neural network models.
Instead of storing 50 fine-tunes of the same base model, store one base and 50 small `.wdelta` delta files. `deltatensors` compresses the delta between a base model and a fine-tuned model, and reconstructs the fine-tuned weights with a sub-1% perplexity difference.
Tested on Qwen2.5-0.5B fine-tuned on WikiText-2:
- Perplexity: 19.11 (original) → 19.22 (reconstructed) — 0.58% perplexity difference
- Less degradation than standard int4 quantization of the full model
- 294 MB delta vs 953 MB fine-tuned model (3.2x smaller)
- ~2.8x total storage reduction across 10 fine-tunes

```
base_model.safetensors    1.0 GB
checkpoint_01.wdelta      294 MB
checkpoint_02.wdelta      294 MB
...
checkpoint_10.wdelta      294 MB
─────────────────────────────────
Total                     3.9 GB   vs 11 GB naive
```
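
The totals above follow from simple arithmetic. Here is a minimal sketch, assuming every full checkpoint is rounded to 1.0 GB (as in the layout) and every delta is 294 MB. `storage_mb` is a hypothetical helper for illustration, not part of deltatensors:

```python
def storage_mb(n_finetunes, full_mb=1000, delta_mb=294):
    """Return (naive, with_deltas) storage in MB for a base plus n fine-tunes.

    Assumption: base and fine-tuned checkpoints are the same size (full_mb).
    """
    naive = (1 + n_finetunes) * full_mb                # base + n full checkpoints
    with_deltas = full_mb + n_finetunes * delta_mb     # base + n delta files
    return naive, with_deltas


naive, with_deltas = storage_mb(10)
print(f"naive: {naive / 1000:.1f} GB, "
      f"with deltas: {with_deltas / 1000:.1f} GB, "
      f"reduction: {naive / with_deltas:.1f}x")
# naive: 11.0 GB, with deltas: 3.9 GB, reduction: 2.8x
```

Under the same assumptions, the reduction grows with the number of fine-tunes and approaches 1000 / 294 ≈ 3.4x, because the one-off cost of the base is amortized.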

```bash
pip install deltatensors
pip install torch safetensors  # needed when loading from safetensors directories
```
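
```python
import deltatensors as dt

# Compress the delta between the fine-tuned (qwen-wiki/) and base (qwen-base/) checkpoints.
dt.save_delta_from_paths("checkpoint.wdelta", "qwen-wiki/", "qwen-base/", strategy="int4")
# Rebuild the fine-tuned weights from the base checkpoint plus the delta file.
recon_sd = dt.load_delta_from_paths("checkpoint.wdelta", "qwen-base/")
# Inspect the delta file and print what it reports.
info = dt.inspect("checkpoint.wdelta")
print(info)
```
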
| Strategy | Quality | Compression |
|---|---|---|
| `int4` | near-lossless (~0.5% PPL) | best |
| `sparse` | tunable via `sparsity=` | good |
| `quantized` | BitDelta-style 1-bit | aggressive |

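
To make the `sparse` and `quantized` rows more concrete, here is a minimal NumPy sketch of the two general ideas: magnitude pruning of the delta, and a BitDelta-style sign-plus-scale encoding. This is not the deltatensors implementation. The function names are hypothetical, and treating `sparsity` as the fraction of entries dropped is an assumption about what that parameter means.

```python
import numpy as np


def sparsify_delta(delta, sparsity):
    """Zero the smallest-magnitude entries, keeping a (1 - sparsity) fraction.

    Assumption: `sparsity` is the fraction of entries that get dropped.
    """
    keep = int(round(delta.size * (1.0 - sparsity)))
    if keep <= 0:
        return np.zeros_like(delta)
    if keep >= delta.size:
        return delta.copy()
    k = delta.size - keep
    # The k-th smallest magnitude is the cut-off for the `keep` largest entries.
    threshold = np.partition(np.abs(delta).ravel(), k)[k]
    return np.where(np.abs(delta) >= threshold, delta, 0).astype(delta.dtype)


def one_bit_delta(delta):
    """BitDelta-style encoding: one sign per entry plus one scale per tensor."""
    scale = float(np.abs(delta).mean())
    signs = np.where(delta >= 0, 1, -1).astype(np.int8)
    return signs, scale


def reconstruct_one_bit(base, signs, scale):
    """Approximate the fine-tuned tensor as base + scale * sign(delta)."""
    return base + scale * signs.astype(base.dtype)


rng = np.random.default_rng(0)
base = rng.normal(size=(8, 8)).astype(np.float32)
delta = (rng.normal(size=(8, 8)) * 1e-2).astype(np.float32)

sparse = sparsify_delta(delta, sparsity=0.75)
signs, scale = one_bit_delta(delta)
approx = reconstruct_one_bit(base, signs, scale)
print("non-zero entries kept:", np.count_nonzero(sparse), "of", delta.size)
print("1-bit mean abs error:", float(np.abs(approx - (base + delta)).mean()))
```

Note that the sketch stores the signs as `int8` for readability; a real 1-bit format would pack eight signs per byte.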

`int4` uses outlier extraction (the top k% of weights are stored in float16) plus 4-bit quantization for the remainder. This is the strategy used for the example at the start.
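
The idea of outlier extraction followed by 4-bit quantization can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the deltatensors implementation: outliers are chosen by magnitude, a single symmetric scale is used per tensor, and the names `compress_int4`, `decompress_int4`, and `outlier_frac` are hypothetical.

```python
import numpy as np


def compress_int4(delta, outlier_frac=0.01):
    """Store the largest-magnitude entries in float16 and the rest as 4-bit codes."""
    flat = delta.astype(np.float32).ravel()
    n_out = max(1, int(flat.size * outlier_frac))
    # Indices of the n_out largest-magnitude entries (assumed outlier criterion).
    k = flat.size - n_out
    out_idx = np.argpartition(np.abs(flat), k)[k:]
    out_val = flat[out_idx].astype(np.float16)

    rest = flat.copy()
    rest[out_idx] = 0.0  # outliers are restored separately on decompression
    # Symmetric quantization to integer codes in [-7, 7], one scale per tensor.
    max_abs = float(np.abs(rest).max())
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    codes = np.clip(np.round(rest / scale), -7, 7).astype(np.int8)
    # Codes are kept in int8 for readability; a real format would pack two per byte.
    return {"codes": codes, "scale": scale, "out_idx": out_idx,
            "out_val": out_val, "shape": delta.shape}


def decompress_int4(packed):
    """Dequantize the 4-bit codes, then overwrite the outlier positions."""
    flat = packed["codes"].astype(np.float32) * packed["scale"]
    flat[packed["out_idx"]] = packed["out_val"].astype(np.float32)
    return flat.reshape(packed["shape"])


rng = np.random.default_rng(0)
delta = (rng.normal(size=(64, 64)) * 1e-3).astype(np.float32)
delta[0, 0] = 0.5  # one large outlier
packed = compress_int4(delta, outlier_frac=0.01)
recon = decompress_int4(packed)
print("max abs error:", float(np.abs(recon - delta).max()))
```

Pulling the outliers out first keeps the quantization step small for the bulk of the weights. In this toy example, quantizing without extraction would set the scale from the single 0.5 entry, and almost every other value would round to zero.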

LoRA constrains the delta to be low-rank during training, which limits expressiveness. `deltatensors` compresses arbitrary full fine-tune deltas after training, with no constraints on how you fine-tune.
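
To illustrate what the low-rank constraint means, the sketch below builds a LoRA-style update `B @ A` and compares it with an unconstrained matrix. The unconstrained matrix is random noise used as a stand-in, which is an assumption made purely for illustration: real fine-tune deltas have structure, so the error printed here says nothing about actual models. The helper `best_rank_r` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 64, 4

# LoRA parameterizes the weight update as B @ A, so its rank is at most r.
B = rng.normal(size=(d_out, r))
A = rng.normal(size=(r, d_in))
lora_delta = B @ A

# Stand-in for a full fine-tune delta: no rank structure imposed (assumption).
full_delta = rng.normal(size=(d_out, d_in))


def best_rank_r(matrix, rank):
    """Best rank-`rank` approximation in Frobenius norm, via truncated SVD."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank]


rel_err = (np.linalg.norm(full_delta - best_rank_r(full_delta, r))
           / np.linalg.norm(full_delta))
print("rank of LoRA-style delta:", np.linalg.matrix_rank(lora_delta))
print("rank of unconstrained delta:", np.linalg.matrix_rank(full_delta))
print("relative error of best rank-r fit:", float(rel_err))
```

A post-training compressor works on whatever delta the fine-tune produced, so it does not need the update to fit a rank-`r` form in the first place.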

Lineage: chain multiple `.wdelta` files to track and reconstruct full fine-tuning histories.

License: MIT

p.s. If you find deltatensors useful, please consider leaving a ⭐ star on the repository to help others find it!