{"slug": "pollux-a-natively-vector-quantized-llm-with-0-76-bits-per-parameter", "title": "Pollux – a natively vector quantized LLM with 0.76 bits per parameter", "summary": "Researchers introduced Pollux, a new class of decoder-only LLMs that use native Leech-lattice quantization to achieve 0.76 bits per parameter, compressing a 1B-class model into 76 MB of SRAM. The model abandons continuous floating-point weights, training from scratch at its final quantization resolution, and achieves parity with continuous baselines on fluid syntax benchmarks while rejecting factual noise to eliminate hallucinations. This breakthrough enables SRAM-resident edge AI and a stateless reasoning engine for retrieval-augmented generation.", "body_md": "Paper: Alexander Lavicka · *0.76 Bits Is All You Need: Vector Ternary Logic via Native H24 Leech-Lattice Quantization in LLMs* · [lavicka@cantab.net](mailto:lavicka@cantab.net) · Preprint 2026 · WIPO Patent Application No. PCT/AT2026/060108 and Austrian Patent Application No. A65086/2026\n\nPollux is a fundamentally new class of decoder-only LLMs that abandons continuous floating-point weights in the transformer backbone to overcome the von Neumann memory wall.\n\n- **0.76 Bits per Parameter:** By mapping the neural parameter manifold natively onto the $H_{24}$ Leech lattice (the densest sphere packing in 24D), the backbone is compressed to extreme sub-1-bit levels.\n- **Zero-Continuous-Weight Backbone:** Observable layers carry no continuous structural weights, only discrete 18-bit codebook indices and a single global FP16 scale per row.\n- **SRAM-Resident Edge AI:** A 1B-class transformer backbone (Pollux-1920) is compressed into just **76 MB of SRAM**, converting inference from a memory-bandwidth-bound to a compute-bound operation.\n- **The \"Stateless CPU\" for RAG:** Pollux physically decouples fluid intelligence (syntax) from crystallised intelligence (factual trivia). 
Through a geometric Voronoi filter, it mechanically rejects high-entropy factual noise, eliminating the parametric knowledge conflicts that trigger hallucinations.\n- **Parameter-Free Thermodynamic Training:** Trained without learning rate schedules, warmup, or weight decay. The network is optimized via endogenous thermodynamic kinetics and Landauer erasure. The only environmental input is `H_floor`, the empirically measured corpus noise floor, analogous to ambient temperature in Carnot theory. All architectural constants are derived from two axioms; no hyperparameter search is required.\n- **Empirical Parity:** At less than 1% of the training data and less than half the active SRAM footprint, Pollux achieves strict fluid-syntax parity (BLiMP) with continuous baselines (Pythia 160M–410M).\n\nStandard quantization (INT8, GPTQ, 1.58-bit) either approximates a pre-trained FP16 model after the fact (destroying capacity) or relies on one-dimensional scalar rounding that wastes combinatorial state-space. Pollux is trained **at its final quantization resolution from step zero**. There is no continuous baseline to approximate.\n\nInstead of scalar ternary quantization, Pollux maps each 24-dimensional weight block onto the **196,560 mathematically optimal kissing points** of the Leech lattice.\n\nThe **$C=\\sqrt{2}$ Voronoi deep-hole barrier** of the lattice acts as a physical high-pass filter on gradients:\n\n- **Fluid intelligence (structural syntax):** Coherent, recurring syntactic rules accumulate directed momentum, cross the Voronoi barrier, and permanently crystallise into $H_{24}$ kissing points.\n- **Crystallised intelligence (factual trivia):** Incoherent, high-entropy factual noise fails to cross the threshold and is routed into the zero-potential null attractor. 
Pollux therefore scores near or modestly above random chance on factual benchmarks — bounded by high-frequency leakage for ubiquitous facts that generate coherent gradients over billions of tokens.\n\nThis makes Pollux the ultimate engine for **Retrieval-Augmented Generation (Macro-RAG)**: It acts as a pure, stateless reasoning engine that blindly obeys external vector databases without contaminating the output with internal parametric bias or hallucinations.\n\nPollux models are evaluated under a strict **Iso-Memory paradigm**: the active backbone SRAM footprint—not raw parameter count—is the execution-critical metric during autoregressive generation. *(Note: the Iso-Memory criterion isolates memory-bandwidth footprint under the targeted native LUT runtime. Under the current FP16 reference materialisation, FLOPs per token scale with backbone parameter count and are not matched between Pollux and Pythia baselines.)*\n\nContinuous models are trapped in a Pareto trilemma: they cannot simultaneously minimize SRAM, maximize fluid syntax (BLiMP), and suppress factual contamination (SciQ/HellaSwag). 
Pollux-1920 breaks this frontier, matching the BLiMP syntax score of the early Pythia-410M checkpoint (73.0% vs. 73.1% at step 2k) inside a 76 MB envelope while mechanically resisting factual memorisation.\n\n| Model | Training tokens | BLiMP (Syntax) | SciQ (Facts) | HellaSwag (Facts) | PIQA (Facts) | Backbone SRAM | Total disk |\n|---|---|---|---|---|---|---|---|\n| Pollux-1152 | 2.6B (step 10k) | 69.9% | 50.3% | 26.4% | 57.7% | 27 MB | 142 MB |\n| Pollux-1920 | 2.6B (step 10k) | 73.0% | 60.7% | 27.2% | 59.8% | 76 MB | 265 MB |\n| Pythia-160M @ step 2k | 4.2B | 69.7% | 58.7% | 26.9% | 58.4% | 162 MB | 247 MB |\n| Pythia-410M @ step 2k | 4.2B | 73.1% | 57.2% | 27.3% | 58.2% | 577 MB | 707 MB |\n| Pythia-160M @ Asymptotic | 300B | 73.1% | 72.3% | 29.1% | 61.9% | 162 MB | 247 MB |\n| Pythia-410M @ Asymptotic | 300B | 81.9% | 82.4% | 34.5% | 67.2% | 577 MB | 707 MB |\n\n(Random-chance baselines: BLiMP (2-way) = 50.0%; HellaSwag / SciQ (4-way) ≈ 25%; PIQA (2-way) ≈ 50%. All Pollux scores measured on packed `.plx` deployment artifacts.)\n\nAt 10k steps (~2.6B tokens), the network reaches its thermodynamic crystallisation peak. Beyond this point, the network enters **thermodynamic stasis** (the \"Deep Freeze\"): BLiMP shifts by ≤ 0.5% and all factual benchmarks shift by ≤ 1.0% over ≥ 1.3B additional tokens. 
Capacity churn ceases; the model neither gains new factual associations nor loses established syntactic structure.\n\n| Checkpoint | Tokens | BLiMP (Syntax) | SciQ (Facts) | HellaSwag (Facts) | PIQA (Facts) |\n|---|---|---|---|---|---|\n| **Pollux-1152** | | | | | |\n| 5k steps | ~1.3B | 67.5% | 46.5% | 26.6% | 55.7% |\n| 10k steps ⬅ Crystallisation peak | ~2.6B | 69.9% | 50.3% | 26.4% | 57.7% |\n| 15k steps | ~3.9B | 69.9% | 48.4% | 26.6% | 57.7% |\n| **Pollux-1920** | | | | | |\n| 5k steps | ~1.3B | 72.9% | 56.6% | 26.9% | 58.4% |\n| 10k steps ⬅ Crystallisation peak | ~2.6B | 73.0% | 60.7% | 27.2% | 59.8% |\n| 15k steps | ~3.9B | 73.2% | 61.7% | 27.3% | 60.1% |\n\nThe `.plx` serialization mathematically compresses the network to 0.76 bits/param. Compared to the raw continuous `.pt` checkpoint, the maximum deviation across all BLiMP tasks is **0.2 pp** (and 0.01% aggregate mean difference), which confirms that the global row-scale quantization is practically lossless.\n\nPollux is a **functional reference implementation for research**. The following constraints apply to anyone deploying or extending the codebase:\n\n**Packed storage vs. PyTorch runtime:** While the packed `.plx` representation fits entirely in on-chip memory (~27 MB backbone for Pollux-1152), the **current reference PyTorch path materialises dense FP16 weight matrices** at inference time (`PackedH24Linear.materialize()`) for standard `cuBLAS` compatibility. This validates crystallisation and zero-shot benchmarks but **does not** deliver native SRAM-bound latency. **Native matrix-free LUT gather–accumulate kernels** (read index → fetch codebook vector → accumulate) are needed for that and are not yet part of the reference runtime.\n\n**Edge CPU Viability & The RAM Bottleneck:** Standard GPUs severely penalise sub-byte combinatorial addressing. 
However, modern CPUs feature large L3 caches (8–32 MB) capable of holding the entire 9 MB codebook. With a **265 MB on-disk footprint**, Pollux unlocks reasoning for IoT/Edge hardware where continuous models instantly trigger Out-Of-Memory (OOM) failures.\n\n**Architectural Strictness:** Custom configurations must satisfy **`n_embd % 24 == 0`**. Every quantized linear `in_features` must be cleanly divisible by 24 for proper Leech lattice atom tiling.\n\n```\npublish/\n│\n├── castor.py               # Axiom layer: Leech lattice codebook, constants,\n│                           #   nearest-neighbour quantizer, bit-packing.\n│                           #   Leaf node: imports nothing from this project.\n│\n├── pollux.py               # Zero-continuous-weight architecture +\n│                           #   parameter-free thermodynamic estimator\n│                           #   (pollux_step). Depends only on castor.\n│                           #   Contains both training (PolluxH24Linear) and\n│                           #   inference (PackedH24Linear) layer classes.\n│\n├── train.py                # Training entry point. 
Reads FineWeb-Edu memmap,\n│                           #   calls pollux_step, writes .pt checkpoints.\n│                           #   No LR schedule, no weight decay, no warmup.\n│\n├── prepare_fineweb.py      # Downloads FineWeb-Edu 10B, tokenizes with GPT-2,\n│                           #   writes uint16 memmap to data/fineweb_10b.bin.\n│\n├── pack.py                 # Checkpoint → .plx converter.\n│                           #   Quantizes H24 layers to 18-bit LUT indices +\n│                           #   FP16 σ_rms per row, INT8-quantized embeddings.\n│                           #   Pack at the 10k crystallisation peak checkpoint.\n│\n├── generate.py             # Text generation from .plx or .pt files.\n│                           #   .plx: index_select materialisation + F.linear;\n│                           #   native LUT kernels (future) eliminate dense\n│                           #   weight-matrix traffic, not FP activations.\n│\n├── evaluate.py             # lm-eval-harness wrapper. Prints stratified\n│                           #   Structural (4 BLiMP) vs Factual (4 MCQ) table.\n│                           #   Accepts both .plx and .pt inputs.\n│\n├── data/                   # Local training corpus (gitignored; created by\n│   └── fineweb_10b.bin     #   prepare_fineweb.py)\n│\n└── checkpoints/            # Training checkpoints (gitignored; written by\n    └── pollux_step_*.pt    #   train.py every 2.5k optimizer steps)\n```\n\n```\ntrain.py  ──(pollux_10k.pt)──►  pack.py  ──(model.plx)──►  generate.py\n                                                         ──►  evaluate.py\n```\n\nThe fully crystallized, 0.76-bit `.plx` deployment artifacts are hosted on Hugging Face. 
These containers are fully packed and include the immutable H24 codebook indices alongside the global row-wise RMS scales.\n\n- [Pollux-1152](https://huggingface.co/alavicka/pollux-1152): 287M backbone parameters compressed into 27 MB SRAM (142 MB total on-disk including INT8 embeddings).\n- [Pollux-1920](https://huggingface.co/alavicka/pollux-1920): 796M backbone parameters compressed into 76 MB SRAM (265 MB total on-disk including INT8 embeddings).\n\nNote on File Sizes: The on-disk footprints (142 MB / 265 MB) reported here and in the paper refer to binary Megabytes (MiB) as calculated by standard OS environments. The Hugging Face file explorer displays these identical files using decimal SI units (149 MB / 278 MB).\n\nTechnical Note on Native Inference: The current reference PyTorch runtime materialises 18-bit indices to FP16 weight tiles via `index_select`, executing via standard `F.linear` / `cuBLAS`. This explicitly validates the zero-shot crystallisation and Iso-Memory theoretical bounds, but does not yet deliver SRAM-bound latency on standard GPUs. True hardware acceleration requires a native C/CUDA/Triton kernel (or dedicated NPU logic) to perform matrix-free vector scaling: SRAM lookup of codebook vectors by index, combined with continuous FP16/BF16 activations via scalar–vector multiply–accumulate, eliminating dense $\\mathcal{O}(N^2)$ weight-matrix DRAM traffic entirely. 
This hardware-software isomorphism is detailed in Appendix C of the paper.\n\n```\nconda create -n pollux python=3.11 -y\nconda activate pollux\npip install torch torchvision --index-url https://download.pytorch.org/whl/cu124\npip install tiktoken lm_eval tqdm numpy\n# Optional: Triton (highly recommended; massively accelerates the Castor STE\n#   projection during training and avoids VRAM bottlenecks during the H24 snap;\n#   also speeds checkpoint packing)\npip install triton\n```\n\n```\npython generate.py model.plx --prompt \"The second law of thermodynamics\" \\\n    --max-new-tokens 200 --temperature 0.8 --top-k 50\n```\n\n```\n# Evaluate a packed .plx file (structural vs. factual stratified table)\npython evaluate.py model.plx --fullblimp\n\n# Evaluate a raw training checkpoint\npython evaluate.py pollux_10k.pt --batch-size 16\n\n# Quick smoke-test (10% of each task)\npython evaluate.py model.plx --limit 0.1\n```\n\n```\n# Pack the 10k crystallisation peak checkpoint (recommended)\npython pack.py checkpoints/pollux_step10000.pt --output pollux_1152_10k.plx --device cuda\n```\n\n```python\nimport torch\nfrom pathlib import Path\nfrom pollux import PolluxConfig, PolluxModel\n\n# ── Option A: load from a 0.76-bit .plx packed file ─────────────────────────\n# Pollux-1152: ~27.3 MB backbone SRAM; ~142 MB total on disk.\n# Weights are materialised from the SRAM codebook via index_select on first use.\n\nfrom generate import _read_plx   # private .plx reader (standalone, no deps)\n\ndevice = \"cuda\" if torch.cuda.is_available() else \"cpu\"\npayload = _read_plx(\"pollux_1152_10k.plx\")\nmodel = PolluxModel.from_packed_checkpoint(\"pollux_1152_10k.plx\", device, payload=payload)\ncfg = PolluxConfig.from_dict(payload[\"config\"])\nmodel.eval()\n\n# ── Option B: load from a training checkpoint (.pt) ─────────────────────────\n# Observable weights = dynamic Castor H24 projection; continuous pre-weights\n# live in optimiser state only.\n# ckpt = torch.load(\"pollux_10k.pt\", map_location=device, 
weights_only=False)\n# cfg  = PolluxConfig(**ckpt[\"config\"])\n# model = PolluxModel(cfg).to(device)\n# model.load_state_dict(ckpt[\"model\"])\n# model.eval()\n\n# ── Tokenise and generate ────────────────────────────────────────────────────\nimport tiktoken\nenc = tiktoken.get_encoding(\"r50k_base\")\n\nprompt = \"The syntax of a relative clause requires\"\nids    = torch.tensor(enc.encode(prompt), dtype=torch.long, device=device).unsqueeze(0)\n\nwith torch.no_grad():\n    out = model.generate(\n        ids,\n        max_new_tokens=150,\n        temperature=0.8,\n        top_k=50,\n    )\n\nprint(enc.decode(out[0].tolist()))\n```\n\nTo download and tokenize the dataset locally, run `python prepare_fineweb.py`. This script streams the 10B token subset, tokenizes it, and saves the resulting `uint16` binary to `data/fineweb_10b.bin` for fast memmap loading during training.\n\nRequires `datasets`, `transformers`, `numpy`, and `tqdm` (in addition to the core training stack). The download is ~20 GB on disk once complete.\n\nToken Budget Note: At sequence length 1024, batch size 8, and 32 grad-accum steps (1024 × 8 × 32 = 262,144 tokens per optimizer step), 10,000 steps equal roughly 2.6 billion processed tokens. 
For larger configurations (e.g., Pollux-1920), training may be executed across multiple sequential resumed runs due to hardware interruptions; optimizer state is fully preserved at each resume point and loss trajectories are stitched by training step.\n\n```\n# Prepare FineWeb-Edu 10B token shard (creates data/fineweb_10b.bin)\npython prepare_fineweb.py\n\n# Train Pollux-1152 (1152-dim, 18 layers, 48 heads; default pollux.py config)\n# Targets the 10k crystallisation peak on a single RTX 5090 / ~6 hours\npython train.py \\\n    --target-tokens 9_953_989_333 \\\n    --wandboff   # remove to enable W&B logging\n\n# After ~10k steps, pack the checkpoint\npython pack.py checkpoints/pollux_step10000.pt --output pollux_1152_10k.plx\n```\n\nThe optimiser (`pollux_step`) has **no learning-rate schedule, auxiliary weight decay, gradient clipping, or warmup**, but it does rely on exactly **one environmental boundary condition**: the dataset noise floor `H_floor`. (For a full mathematical derivation of how all other optimiser constants, such as the topological drag and Voronoi jitter floor, are derived strictly from the two axioms, see the paper.)\n\n`H_floor` is an **empirical material property** of the training corpus (the irreducible Shannon entropy of its linguistic structure, including factual noise), not an architectural hyperparameter. For FineWeb-Edu 10B, `DATASET_NOISE_FLOOR = 3.2` in `pollux.py` is anchored at the cross-entropy convergence ceiling of an uncompressed FP16 continuous-weight baseline on the same corpus.\n\n**If you train on a different corpus**, measure the FP16 continuous-weight convergence ceiling on your data, set `H_floor` to that value, and update `DATASET_NOISE_FLOOR` in `pollux.py` before launching `train.py`. 
A floor set too high overstates the corpus's irreducible entropy and a floor set too low understates it; either error distorts the heat normalisation.\n\n| Component | Class | Details |\n|---|---|---|\n| Training layer | `PolluxH24Linear` | Forward uses discrete materialised weights; `pollux_step` maintains continuous latents and re-quantizes once per step |\n| Normalization | `RMSNorm` | Continuous FP16 learnable gains; magnitude–structure decoupler for the residual stream |\n| Inference layer | `PackedH24Linear` | Stores `uint8` 18-bit packed indices + one `float16` scale per row; `materialize()` expands to FP16 via `codebook.index_select` |\n| Embeddings | `PackedInt8Embedding` | Per-row INT8 + FP16 scale (untied from LM head by physical necessity) |\n| LM Head | `PackedInt8Linear` | Per-row INT8 + FP16 scale (untied: high-precision logit resolution incompatible with H24 gradient geometry) |\n| Optimizer | `pollux_step` | Heat-modulated Adam with topological drag (environmental input: `H_floor`) |\n| Codebook | `castor.py` | 196,561 entries (196,560 kissing + index-0 null attractor); ~9 MB FP16 |\n| Bit-packing | `castor.pack_indices` | Bijective 4 × 18-bit → 9-byte; reversible via `unpack_indices` |\n\nThe source code is released under the **PolyForm Noncommercial License 1.0.0** for academic research, non-commercial experimentation, and scientific reproduction. A copy of the license is available at [https://polyformproject.org/licenses/noncommercial/1.0.0/](https://polyformproject.org/licenses/noncommercial/1.0.0/).\n\nThe underlying algorithmic principles (specifically the native 24-dimensional Leech lattice straight-through estimation and the thermodynamic optimization protocol) are the subject of a pending patent:\n\nWIPO Application No. PCT/AT2026/060108 and Austrian Patent Application No. A65086/2026\n\nCommercial utilization, deployment, or hardware integration of the proprietary Pollux architecture and its variants requires a commercial license from the patent holders. 
Contact: [lavicka@cantab.net](mailto:lavicka@cantab.net)\n\nIf you use Pollux in your research, please cite:\n\n```\n@misc{lavicka2026pollux,\n  title   = {0.76 Bits Is All You Need: Vector Ternary Logic via Native H24 Leech-Lattice\n             Quantization in LLMs},\n  author  = {Lavicka, Alexander},\n  year    = {2026},\n  note    = {Preprint. WIPO Patent Application No. PCT/AT2026/060108 and Austrian Patent Application No. A65086/2026},\n  url     = {https://papers.ssrn.com/abstract=6973978}\n}\n```\n\n", "url": "https://wpnews.pro/news/pollux-a-natively-vector-quantized-llm-with-0-76-bits-per-parameter", "canonical_source": "https://github.com/alavicka/pollux", "published_at": "2026-07-01 13:03:40+00:00", "updated_at": "2026-07-01 13:20:45.006301+00:00", "lang": "en", "topics": ["large-language-models", "artificial-intelligence", "ai-research", "ai-infrastructure", "ai-safety"], "entities": ["Pollux", "Leech lattice", "Pythia", "BLiMP", "SciQ", "HellaSwag", "Voronoi", "Landauer"], "alternates": {"html": "https://wpnews.pro/news/pollux-a-natively-vector-quantized-llm-with-0-76-bits-per-parameter", "markdown": "https://wpnews.pro/news/pollux-a-natively-vector-quantized-llm-with-0-76-bits-per-parameter.md", "text": "https://wpnews.pro/news/pollux-a-natively-vector-quantized-llm-with-0-76-bits-per-parameter.txt", "jsonld": "https://wpnews.pro/news/pollux-a-natively-vector-quantized-llm-with-0-76-bits-per-parameter.jsonld"}}