Max-Window Scale Estimation for Near-Lossless HiF8 W8A8 Quantization-Aware Training


Source: arxiv.org · Published: 2026-05-27T04:00Z · Topic: machine-learning


A study of HiF8 W8A8 quantization-aware training (QAT) on the OpenPangu-Embedded-1B model identifies two failure modes: amax saturation, in which delayed scale estimates cause forward-pass clipping that corrupts knowledge-sensitive representations, and catastrophic forgetting caused by an aggressive learning rate. The authors address the first with a max-algorithm Delayed Tensor Scaling (DTS) strategy over a 64-step history window, and the second with a 500-step BF16 warmup followed by QAT at a learning rate of 10⁻⁵. The final configuration loses less than 0.6% accuracy on each of MMLU, HellaSwag, and ARC-Challenge relative to a matched BF16 baseline.
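The two-phase schedule can be pictured with a minimal sketch. The abstract gives the warmup length (500 steps) and the learning rate of the QAT phase (10⁻⁵); the function name `phase_for_step` and the phase labels are hypothetical, and the sketch assumes steps are counted from zero with the warmup coming first. The learning rate used during warmup is not stated in the source, so it is left out.

```python
WARMUP_STEPS = 500   # BF16 warmup length, from the abstract
QAT_LR = 1e-5        # learning rate of the QAT phase, from the abstract


def phase_for_step(step: int) -> str:
    """Return a (hypothetical) phase label for a zero-based training step."""
    if step < 0:
        raise ValueError("step must be non-negative")
    # Assumption: the first WARMUP_STEPS steps run in BF16, then QAT begins.
    return "bf16_warmup" if step < WARMUP_STEPS else "hif8_qat"
```

A training loop would consult this label at each step to decide whether to insert the fake-quantization operators.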


arXiv:2605.26189v1

Abstract: Quantization-aware training (QAT) with low-bit floating-point formats enables efficient LLM deployment, yet introduces subtle failure modes invisible to standard training metrics. We present a systematic study of HiF8 W8A8 QAT for OpenPangu-Embedded-1B through the lens of Delayed Tensor Scaling (DTS). Across eight controlled experiments, we identify and disentangle two orthogonal failure modes: (i) amax saturation, where delayed scale estimates silently corrupt knowledge-sensitive representations via forward-pass clipping, and (ii) catastrophic forgetting, where an aggressive learning rate overwrites pretrained commonsense knowledge independently of quantization. Neither is detectable from training loss alone. We address amax saturation with a conservative max-algorithm DTS strategy over a 64-step history window, and mitigate forgetting via a 500-step BF16 warmup followed by QAT at lr = 10⁻⁵. Both fixes are necessary and sufficient: our final configuration achieves 0.43% MMLU drop, 0.58% HellaSwag drop, and 0.22% ARC-Challenge drop versus a matched BF16 baseline, with a training loss APE of only 0.11% over 10,000 steps.
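To make the amax-saturation mechanism concrete, here is a minimal sketch of delayed scaling with a history window. It is not the authors' implementation: the class `DelayedScale`, the helper `count_clipped`, and the `"most_recent"` comparison algorithm are illustrative assumptions, and `FMT_MAX` is a placeholder (the FP8 E4M3 maximum), not the actual HiF8 range. The idea shown is only that a scale derived from past amax values clips any current value larger than the estimate, and that taking the maximum over the window is the more conservative choice.

```python
from collections import deque

# Placeholder largest representable magnitude. This is the FP8 E4M3 value,
# used purely for illustration; it is NOT taken from the HiF8 definition.
FMT_MAX = 448.0


class DelayedScale:
    """Estimate a tensor's amax from previous steps only (delayed scaling)."""

    def __init__(self, window: int = 64, algo: str = "max"):
        if algo not in ("max", "most_recent"):
            raise ValueError("algo must be 'max' or 'most_recent'")
        self.history = deque(maxlen=window)  # oldest entries fall out automatically
        self.algo = algo

    def record(self, amax: float) -> None:
        """Store the amax observed at the step that just finished."""
        self.history.append(float(amax))

    def estimate(self) -> float:
        """Return the amax estimate used for the next step."""
        if not self.history:
            raise ValueError("no amax history recorded yet")
        return max(self.history) if self.algo == "max" else self.history[-1]

    def scale(self) -> float:
        """Multiplier mapping the estimated amax onto FMT_MAX."""
        est = self.estimate()
        return FMT_MAX / est if est > 0 else 1.0


def count_clipped(values, est_amax: float) -> int:
    """Number of values whose magnitude exceeds the estimate and would saturate."""
    return sum(1 for v in values if abs(v) > est_amax)


# Past steps saw amax 4.0, then 1.0, then 1.0; the current tensor peaks at 3.0.
past_amax = [4.0, 1.0, 1.0]
current = [0.5, -3.0, 2.0]

max_window = DelayedScale(window=64, algo="max")
most_recent = DelayedScale(window=64, algo="most_recent")
for a in past_amax:
    max_window.record(a)
    most_recent.record(a)

clipped_max = count_clipped(current, max_window.estimate())      # estimate is 4.0
clipped_recent = count_clipped(current, most_recent.estimate())  # estimate is 1.0
```

In this toy trace the max-window estimate clips nothing, while the most-recent estimate clips two of the three values. The trade-off is that a larger estimate spends more of the format's dynamic range on headroom, and an old spike stops influencing the scale once it leaves the window.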

source & further reading

arxiv.org — original article


Read original on arxiv.org → arxiv.org/abs/2605.26189

