
Gefen: Optimized Stochastic Optimizer

Researchers propose Gefen, a memory-efficient optimizer that reduces AdamW's memory footprint by ~8x while maintaining performance, enabling larger microbatches and improved throughput in deep learning training. Gefen automatically shares second-moment estimates across parameter blocks and quantizes the first moment using a learned codebook, reducing memory by 6.5 GiB per billion parameters. The method is validated across diverse experiments and is available as a drop-in replacement for AdamW.
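The two headline figures can be checked against each other with a short calculation. The sketch below assumes that both AdamW moment buffers are stored in 32-bit floats and that the reduction factor is exactly 8; neither assumption is stated in the summary, and the function names are illustrative only.

```python
GIB = 2**30  # bytes per GiB


def adamw_state_bytes(num_params, bytes_per_value=4):
    """Bytes held by AdamW's two parameter-sized moment buffers.

    Assumes fp32 states (4 bytes per value); this is an assumption,
    not a figure taken from the article.
    """
    return 2 * num_params * bytes_per_value


def saved_gib(num_params, reduction_factor=8.0, bytes_per_value=4):
    """GiB saved if optimizer state shrinks by `reduction_factor`."""
    full = adamw_state_bytes(num_params, bytes_per_value)
    return (full - full / reduction_factor) / GIB


full_gib = adamw_state_bytes(10**9) / GIB   # about 7.45 GiB per billion parameters
saving_gib = saved_gib(10**9)               # about 6.5 GiB per billion parameters
print(f"AdamW state: {full_gib:.2f} GiB, saved at 8x: {saving_gib:.2f} GiB")
```

Under these assumptions, an 8x reduction of roughly 7.45 GiB of state leaves about 0.93 GiB, which matches the reported saving of about 6.5 GiB per billion parameters.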

1 min read · published Jun 15, 2026

arXiv:2606.13894v1. Abstract: AdamW is a default optimizer for modern deep learning, but its first- and second-moment states add roughly two parameter-sized buffers to training memory. We propose Gefen, a memory-efficient optimizer that automatically shares second-moment estimates across parameter blocks and quantizes the first moment using a learned codebook. This reduces AdamW's memory footprint by ~8x, a saving of 6.5 GiB per billion parameters, while maintaining the same performance. The method is motivated by a theoretical result showing that large mixed Hessian entries constrain the ratio of squared gradients toward one, which suggests that Hessian-aligned parameters are natural candidates for sharing second-moment statistics. Since computing Hessians is impractical at scale, Gefen infers the block structure from the initial squared gradients and requires no architecture-specific metadata or hyperparameters beyond the AdamW defaults. Gefen learns an exact histogram-based dynamic-programming quantization codebook and reuses the same blocks for first-moment scaling. Across diverse experiments, Gefen achieves the lowest peak optimizer memory among the compared AdamW-like methods while maintaining AdamW-level performance. In FSDP and DDP training, the reduced memory footprint enables larger microbatches and significantly improves throughput over AdamW. Gefen is therefore a practical drop-in replacement that can enable training larger models or using larger batch sizes. We provide the complete Python implementation, including fused CUDA kernels, at https://github.com/ndvbd/Gefen
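To illustrate the idea of sharing second-moment estimates across parameter blocks, here is a minimal sketch of an AdamW-style step in which each block keeps one second-moment scalar instead of one per parameter. It is not the Gefen implementation: the fixed-size contiguous blocks, the function name `block_shared_adam_step`, and the use of the block's mean squared gradient are all assumptions made for illustration. Gefen infers its blocks from the initial squared gradients, and the sketch omits first-moment quantization entirely.

```python
import math


def block_shared_adam_step(params, grads, m, v_blocks, step, block_size=4,
                           lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
                           weight_decay=0.01):
    """One AdamW-style update with a single second-moment scalar per block.

    Hypothetical layout: block b covers params[b*block_size:(b+1)*block_size].
    The hyperparameter defaults are illustrative, not taken from the paper.
    """
    bias1 = 1.0 - beta1 ** step
    bias2 = 1.0 - beta2 ** step
    for b in range(len(v_blocks)):
        lo = b * block_size
        hi = min(lo + block_size, len(params))
        # Shared statistic: mean squared gradient over the block.
        mean_sq = sum(g * g for g in grads[lo:hi]) / (hi - lo)
        v_blocks[b] = beta2 * v_blocks[b] + (1.0 - beta2) * mean_sq
        denom = math.sqrt(v_blocks[b] / bias2) + eps
        for i in range(lo, hi):
            # The first moment stays per-parameter and unquantized here.
            m[i] = beta1 * m[i] + (1.0 - beta1) * grads[i]
            params[i] -= lr * weight_decay * params[i]  # decoupled weight decay
            params[i] -= lr * (m[i] / bias1) / denom
    return params


params = [1.0] * 8
grads = [0.5] * 8
m = [0.0] * 8
v_blocks = [0.0] * 2  # 2 shared scalars instead of 8 per-parameter entries
block_shared_adam_step(params, grads, m, v_blocks, step=1)
```

In this toy case the second-moment state shrinks from 8 values to 2. When all gradients in a block have the same magnitude, the shared estimate equals the per-parameter one, so the update matches a standard AdamW step. That is the regime the abstract points to when it says the ratio of squared gradients is constrained toward one.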
