grep -rli "CUDA" /news

Full-text search across 9 articles. Combine with topic and date filters; results sorted by relevance.

results 9

03:51

2026-05-24

dev.to

artificial-intelligence

Shipping Gemma 4 speech recognition in a Windows .NET desktop app: a 5-variant model-selection tour

The article describes integrating Google's Gemma 4 speech recognition model into Parlotype, a privacy-focused Windows voice-to-text desktop app that runs entirely on-device. The author evaluated five available GGUF varia…

20:21

2026-05-23

dev.to

artificial-intelligence

I Build the Infrastructure That Serves AI Models. Gemma 4 Just Made My Job Existential.

The article describes the author's work building NeuroScale, a self-service AI inference platform that manages complex Kubernetes infrastructure for deploying models like Gemma 4. The author explains that most AI serving…

19:12

2026-05-23

dev.to

developer-tools

The Microsecond Lie: Why your Go timers are lying about the GPU

The article explains that CPU-side timers in Go are unreliable for measuring GPU kernel execution time because CUDA kernel launches are asynchronous, meaning the CPU only measures the time to enqueue the task rather than…

18:48

2026-05-23

dev.to

large-language-models

GGUF & Modelfile: The Power User's Guide to Local LLMs

The article explains how power users can download GGUF (GPT-Generated Unified Format) model files directly from Hugging Face, quantize them (using Q4_K_M as the optimal balance of size and quality), and import them into …

16:54

2026-05-23

dev.to

machine-learning

I Built a Neural Network Engine in C# That Runs in Your Browser - No ONNX Runtime, No JavaScript Bridge, No Native Binaries

The article announces the release of SpawnDev.ILGPU.ML 4.0.0-preview.4, a C# library that runs neural networks directly in the browser using six backends (WebGPU, WebGL, WebAssembly, CUDA, OpenCL, and CPU) without requir…

13:14

2026-05-23

dev.to

large-language-models

Multi-Head Latent Attention (MLA)

Multi-Head Latent Attention (MLA) is an attention mechanism used in DeepSeek-V2/V3 and Kimi K2.x models that compresses the Key-Value (KV) cache by projecting full KV pairs into a shared, low-dimensional lat…

11:50

2026-05-23

horace.io

machine-learning

Making Deep Learning Go Brrrr from First Principles (2022)

The article explains that optimizing deep learning performance should be approached by identifying whether a system is bottlenecked by compute, memory bandwidth, or overhead, rather than relying on ad-hoc tricks. It argu…

11:50

2026-05-23

horace.io

machine-learning

Making Deep Learning Go Brrrr from First Principles

The article explains that optimizing deep learning performance should be approached by reasoning from first principles—identifying whether a system is bottlenecked by compute, memory bandwidth, or overhead—rather than re…

10:23

2026-05-23

dev.to

artificial-intelligence

The Brutal Reality of Running Gemma 4 Locally

The article details the author's experience running Google's Gemma 4 models locally on a consumer laptop with an RTX 3050 (4GB VRAM), revealing a gap between Google's demo claims and real-world performance. While the sma…

04:38

2026-05-23

dev.to

large-language-models

Diffusion Language Models: How NVIDIA Nemotron-Labs Diffusion Shatters the Autoregressive Speed Ceiling

NVIDIA released Nemotron-Labs Diffusion on May 23, 2026, a family of diffusion language models (DLMs) that generate entire blocks of tokens in parallel and iteratively refine them, rather than producing one token at a ti…

18:09

2026-05-22

dev.to

artificial-intelligence

Running Flux Schnell (12B) + LLMs on a Legacy AMD RX 580 (8GB) via Native Vulkan — Full Architecture Guide [2026]

The article provides a technical guide for running AI models like Flux Schnell (12B) and LLMs on a legacy AMD RX 580 (8GB) GPU using native Vulkan support, bypassing CUDA (which does not run on AMD hardware) and ROCm (which no longer officially supports this card). It details specif…

15:33

2026-05-22

dev.to

artificial-intelligence

Zero-Idle Local LLMs: Running Llama 3 in AWS Lambda Containers

The article explains how to deploy quantized open-source LLMs like Llama 3 8B directly within AWS Lambda containers using llama.cpp, enabling serverless, auto-scaling inference for high-volume, low-reasoning tasks such a…

11:23

2026-05-22

dev.to

artificial-intelligence

Running LTX-2.3 Alongside TTS on a Single 96GB GPU with a Cold-Start Architecture

The article presents a technical solution for running the LTX-2.3 audio-to-video model (22B parameters) alongside TTS and other models on a single 96GB GPU by switching from a persistent server architecture to a cold-start design. The author r…

11:23

2026-05-22

dev.to

artificial-intelligence

Cutting LTX-2 22B Peak VRAM by 40% with fp8_cast — and Why optimum-quanto Was a Trap

The author successfully reduced peak VRAM usage of the LTX-2 22B video generation model from 40 GiB to 24 GiB using the model's native `fp8_cast` quantization method. In contrast, the author found that `optimum-quanto` q…

11:23

2026-05-22

dev.to

artificial-intelligence

HiDream Skeleton Mode: Prompt Beats OpenPose Ref — 8 Patterns Benchmarked

Benchmarking the HiDream-O1-Image model revealed that its "skeleton mode" does not have a dedicated code path and instead processes all reference images (face, background, pose) through the same pipeline, relying solely …

04:47

2026-05-22

pythongiant.github.io

large-language-models

Show HN: KVBoost – chunk-level KV cache reuse for HuggingFace, 5–48x faster TTFT

KVBoost is a new open-source Python library that accelerates HuggingFace LLM inference by implementing chunk-level KV cache reuse, achieving 3–5× faster time-to-first-token (TTFT) and up to 85% cache hit rates in multi-t…

20:22

2026-05-21

dev.to

artificial-intelligence

How to Fix CUDA Out of Memory Errors in Stable Diffusion WebUI

The "CUDA out of memory" error in Stable Diffusion WebUI is often caused by configuration issues rather than insufficient GPU hardware, particularly due to PyTorch's memory allocator failing to release VRAM between gener…

11:37

2026-05-21

dev.to

large-language-models

End-to-End Observability for vLLM and TGI: from DCGM to Tokens

Running large language model inference servers like vLLM and TGI in production requires specialized observability because they behave differently from standard web services, with key metrics like latency being multi-dime…

14:16

2026-05-20

dev.to

artificial-intelligence

DeepSeek V4 on Huawei's Ascend 950: A Real Stress Test for China's AI Chip Ecosystem

In April 2026, DeepSeek released its V4 model, a 1.6 trillion parameter MoE architecture, and for the first time officially validated its inference on Huawei's Ascend 950PR chip, marking a significant milestone for China…

06:20

2026-05-20

dev.to

large-language-models

KV Cache Explained Like You're an LLM Engineer

The KV cache is a critical optimization for LLM inference that stores the Key and Value matrices from previously generated tokens, eliminating the need to recompute attention over the entire sequence at each generation s…