Run Gemma-4 12B on WSL2 with llama.cpp
A developer has published a guide for running Google's Gemma-4 12B instruction-tuned model on Windows Subsystem for Linux 2 (WSL2) using the llama.cpp framework. The process involves installing build …
NVIDIA released the Jetson Orin Nano Super Developer Kit for $249, offering up to 67 TOPS of AI performance for edge computing and robotics. The tiny dev board runs the full NVIDIA AI software stack a…
JAX defaults to loading data directly onto GPU memory when a CUDA-enabled version is installed, causing out-of-memory errors for large datasets that would fit in system RAM. The framework's `jax.devic…
A developer released a tool that allows Linux users to repurpose Nvidia GPU VRAM as swap space, effectively tripling addressable memory on systems with soldered RAM. The nbd-vram daemon uses CUDA's me…
NVIDIA Apex's FusedAdam optimizer and FusedLayerNorm normalization layers can accelerate Transformer training by up to 30% compared to standard PyTorch implementations, according to benchmark tests. T…
The CUDA caching allocator in PyTorch fragments memory when allocated blocks prevent adjacent free blocks from merging, causing allocation failures despite sufficient total free memory. This fragmenta…
PyTorch users can now implement custom operations in C++ and CUDA for use in both Python and C++ inference programs, with automatic device dispatch between CPU and CUDA implementations. The approach s…
Microsoft announced the Surface Laptop Ultra, a device built in partnership with NVIDIA and designed for creators, developers, and AI builders. The laptop features an NVIDIA Blackwell RTX GPU, up to 1…
An open-source developer released NBD-VRAM, a tool that creates swap space on consumer NVIDIA GeForce GPUs under Linux. The software targets laptop users with soldered memory who need additional syste…
Researchers have developed DiffusionBlocks, a framework that partitions transformer neural networks into independently trainable blocks to reduce memory requirements proportionally while maintaining c…
A developer has released tiny-vllm, a high-performance LLM inference engine written in C++ and CUDA that serves as a smaller sibling to the vLLM project. The open-source repository includes both the f…
A developer benchmarked the Qwen3.6 27B model on Modal using llama.cpp, deploying a serverless pipeline that downloads GGUF shards from Hugging Face and runs perplexity evaluation on an A100-80GB GPU.…
An eBPF agent that attaches to the CUDA runtime, CUDA driver, and Linux kernel scheduler simultaneously can trace a GPU stall back to the exact Python source line that triggered it. The tool correlate…
AMD's ROCm (Radeon Open Compute platform) has reached production-ready maturity as an open-source alternative to NVIDIA's CUDA, now supporting LLM inference, fine-tuning, and image generation on AMD G…
AMD's ROCm platform now supports PyTorch, Ollama, LM Studio, and ComfyUI out of the box, enabling local AI workloads on AMD GPUs without the compatibility issues that plagued earlier versions. Users w…
NVIDIA has introduced Dynamo Snapshot, a checkpoint/restore approach for AI inference workloads on Kubernetes that reduces cold-start latency from minutes to near-instantaneous startup times. The solu…
NVIDIA released CUDA 13.3, introducing tile programming in C++ that automates low-level GPU management for optimized kernel development across all supported architectures. The update also includes CUD…
A developer implemented FlashAttention's forward and backward passes from scratch in pure CUDA C++, achieving O(N) memory complexity through manual SRAM tiling and online softmax recurrence. A rejecte…
KlongPy now supports a PyTorch backend that enables GPU acceleration and automatic differentiation for gradient-based computations. The torch backend outperforms NumPy by up to 8x on large arrays and …
At the ninth MLSys conference in Seattle, researchers and industry leaders focused overwhelmingly on improving the efficiency of training and deploying large language models, with specialized hardware…