FP16

mentions: 7 | type: Organization | feed: RSS

// recent coverage 7 mentions

14:39

2026-06-19

letsdatascience.com

large-language-models

DigitalOcean Demonstrates LLM Compression with SparseGPT

DigitalOcean published a tutorial on June 19 demonstrating how to compress large language models using SparseGPT and Wanda pruning methods for GPU cloud deployment, targeting reduced inference costs a…

03:56

2026-06-17

dev.to

large-language-models

How much VRAM do you actually need to run Llama 3 or Gemma locally?

A developer calculated the actual VRAM requirements for running Llama 3 8B and Gemma 2 9B locally, revealing that the KV cache can consume far more memory than the model weights, especially at longer …
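The arithmetic behind this kind of estimate can be sketched in a few lines. This is a minimal illustration, not the linked author's calculation: the function names are hypothetical, and the Llama 3 8B figures (32 layers, 8 key/value heads, head dimension 128, roughly 8.03B parameters) are assumed from the publicly released model config. Runtime overhead such as activations and framework buffers is ignored.

```python
GIB = 1024 ** 3

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Size of the KV cache: one key and one value vector per layer,
    per KV head, per token, per sequence in the batch."""
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)

def weight_bytes(n_params, bytes_per_param=2):
    """Size of the model weights at a given precision (2 bytes = FP16)."""
    return n_params * bytes_per_param

# Assumed Llama 3 8B shape (hypothetical constant name).
LLAMA3_8B = dict(n_layers=32, n_kv_heads=8, head_dim=128)

weights = weight_bytes(8_030_000_000)
for batch in (1, 16):
    kv = kv_cache_bytes(seq_len=8192, batch_size=batch, **LLAMA3_8B)
    print(f"batch={batch}: KV cache {kv / GIB:.1f} GiB, "
          f"weights {weights / GIB:.1f} GiB")
```

Under these assumptions the FP16 cache costs 128 KiB per token, so a single 8,192-token sequence needs 1 GiB, and the cache overtakes the weights once context length times batch size grows large enough.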

01:44

2026-06-16

dev.to

artificial-intelligence

Balanced Ternary for optimizing AI

A developer argues that balanced ternary (-1, 0, +1) could replace binary for AI hardware, citing 20× model compression, 3× inference speedup, and 8× power reduction. Microsoft's BitNet b1.58 demonstr…
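To make the storage side of ternary weights concrete, here is a small sketch of packing values from {-1, 0, +1} into bytes. It is an illustration only, not the scheme used by the linked article or by BitNet b1.58; the function names are hypothetical. It relies on the fact that 3^5 = 243 fits in one byte, so five trits cost 8 bits (1.6 bits per weight, close to the log2(3) ≈ 1.58 lower bound).

```python
import math

TRITS_PER_BYTE = 5  # 3**5 = 243 <= 256

def pack_trits(trits):
    """Pack a sequence of -1/0/+1 values into bytes, five per byte."""
    out = bytearray()
    for i in range(0, len(trits), TRITS_PER_BYTE):
        value = 0
        # Build a base-3 number; the first trit is the least significant digit.
        for t in reversed(trits[i:i + TRITS_PER_BYTE]):
            if t not in (-1, 0, 1):
                raise ValueError(f"not a balanced-ternary digit: {t}")
            value = value * 3 + (t + 1)
        out.append(value)
    return bytes(out)

def unpack_trits(data, count):
    """Inverse of pack_trits; `count` trims the padding in the last byte."""
    trits = []
    for byte in data:
        for _ in range(TRITS_PER_BYTE):
            trits.append(byte % 3 - 1)
            byte //= 3
    return trits[:count]

weights = [1, -1, 0, 0, 1, -1, -1]
packed = pack_trits(weights)
print(len(packed), "bytes for", len(weights), "weights")
print("information-theoretic bits per trit:", round(math.log2(3), 3))
```

With this packing, a ternary weight takes one tenth of the space of an FP16 weight (16 bits versus 1.6 bits). The compression and speedup figures quoted in the summary are the article's own claims and are not reproduced by this sketch.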

14:58

2026-06-06

vettedconsumer.com

large-language-models

GGUF vs. GPTQ vs. AWQ: The Plain-English Guide to LLM Quantization

GGUF, GPTQ, and AWQ are the three dominant formats for running quantized large language models locally, each optimized for different hardware and use cases. GGUF, the format used by llama.cpp and its …

04:00

2026-05-29

arxiv.org

machine-learning

Pre-Registering the Detectable Effect: A Paired-MDE Budget for 4-bit Quantization Benchmarks, with a Pilot Audit

A new paired-minimum detectable effect (MDE) budget for 4-bit quantization benchmarks, derived from classical sample-size calculations, provides benchmark designers a one-line formula to pre-register …
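The summary is truncated before the paper's formula, so the sketch below does not reproduce it. It shows only the textbook paired-comparison calculation that such a budget is typically built on, using a normal approximation; the function name, parameter names, and example numbers are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist

def paired_mde(sd_diff, n_items, alpha=0.05, power=0.80):
    """Smallest mean per-item difference detectable by a two-sided paired
    test, under a normal approximation.

    sd_diff is the standard deviation of per-item score differences
    (e.g. quantized minus full-precision) and n_items is the benchmark size.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return (z_alpha + z_power) * sd_diff / sqrt(n_items)

# Hypothetical benchmark: 1,000 items, per-item difference SD of 0.5.
print(f"MDE: {paired_mde(0.5, 1000):.4f}")
```

Because the detectable effect shrinks with the square root of the item count, halving it requires four times as many benchmark items.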

16:18

2026-05-28

blog.kog.ai

large-language-models

Building a single-kernel, latency-optimized LLM inference engine on AMD MI300X GPUs

The Kog AI team implemented a single-kernel LLM inference engine on AMD MI300X GPUs, achieving over 3,000 output tokens per second per request for a 2B-parameter model in FP16 precision. The monokerne…

06:22

2026-05-26

arxiv.org

machine-learning

ThriftAttention: Selective Mixed Precision for Long-Context FP4 Attention

Researchers have developed ThriftAttention, a mixed-precision attention algorithm that selectively computes only 5% of query-key blocks in FP16 precision while processing the remaining 95% in FP4, rec…

// co-occurs with top 8 entities

NF4 (2), Microsoft (1), BitNet b1.58 (1), Transformer (1), ResNet-50 (1), FPGA (1), ASIC (1), Elixir (1)