grep -r "CUDA" /news
grep -rli "CUDA" /news

Full-text search across 9 articles. Combine with topic and date filters; results sorted by relevance.

results 9

19:12
2026-05-23
dev.to
developer-tools

The Microsecond Lie: Why your Go timers are lying about the GPU

The article explains that CPU-side timers in Go are unreliable for measuring GPU kernel execution time because CUDA kernel launches are asynchronous, meaning the CPU only measures the time to enqueue the task rather than…

18:48
2026-05-23
dev.to
large-language-models

GGUF & Modelfile: The Power User's Guide to Local LLMs

The article explains how power users can download GGUF (GPT-Generated Unified Format) model files directly from Hugging Face, quantize them (using Q4_K_M as the optimal balance of size and quality), and import them into …

13:14
2026-05-23
dev.to
large-language-models

Multi-Head Latent Attention (MLA)

Multi-Head Latent Attention (MLA) is an attention mechanism used in DeepSeek-V2/V3 and Kimi K2.x models that compresses the Key-Value (KV) cache by projecting full KV pairs into a shared, low-dimensional lat…

11:50
2026-05-23
horace.io
machine-learning

Making Deep Learning Go Brrrr from First Principles (2022)

The article explains that optimizing deep learning performance should be approached by identifying whether a system is bottlenecked by compute, memory bandwidth, or overhead, rather than relying on ad-hoc tricks. It argu…

11:50
2026-05-23
horace.io
machine-learning

Making Deep Learning Go Brrrr from First Principles

The article explains that optimizing deep learning performance should be approached by reasoning from first principles—identifying whether a system is bottlenecked by compute, memory bandwidth, or overhead—rather than re…

10:23
2026-05-23
dev.to
artificial-intelligence

The Brutal Reality of Running Gemma 4 Locally

The article details the author's experience running Google's Gemma 4 models locally on a consumer laptop with an RTX 3050 (4GB VRAM), revealing a gap between Google's demo claims and real-world performance. While the sma…

15:33
2026-05-22
dev.to
artificial-intelligence

Zero-Idle Local LLMs: Running Llama 3 in AWS Lambda Containers

The article explains how to deploy quantized open-source LLMs like Llama 3 8B directly within AWS Lambda containers using llama.cpp, enabling serverless, auto-scaling inference for high-volume, low-reasoning tasks such a…

20:22
2026-05-21
dev.to
artificial-intelligence

How to Fix CUDA Out of Memory Errors in Stable Diffusion WebUI

The "CUDA out of memory" error in Stable Diffusion WebUI is often caused by configuration issues rather than insufficient GPU hardware, particularly due to PyTorch's memory allocator failing to release VRAM between gener…

11:37
2026-05-21
dev.to
large-language-models

End-to-End Observability for vLLM and TGI: from DCGM to Tokens

Running large language model inference servers like vLLM and TGI in production requires specialized observability because they behave differently from standard web services, with key metrics like latency being multi-dime…

06:20
2026-05-20
dev.to
large-language-models

KV Cache Explained Like You're an LLM Engineer

The KV cache is a critical optimization for LLM inference that stores the Key and Value matrices from previously generated tokens, eliminating the need to recompute attention over the entire sequence at each generation s…