PyTorch Meetup Singapore: A milestone in APAC
Eighty engineers, researchers, and community builders gathered for the inaugural PyTorch Meetup Singapore, hosted at the Red Hat Asia Pacific office. The event featured technical talks on inference, d…
Helion kernels were integrated into vLLM for FP8 inference using Qwen3 models and evaluated across NVIDIA H100 and B200 GPUs. The experiments demonstrated that Helion provides a productive PyTorch-nat…
DeepSpeed has integrated the Muon Optimizer, a memory-efficient optimizer that uses a single momentum buffer and Newton-Schulz orthogonalization to improve training convergence, particularly for 2D we…
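To make the orthogonalization step concrete, here is a minimal sketch in plain Python of the classical cubic Newton-Schulz iteration, which drives a matrix toward its nearest orthogonal factor. This is an illustration only: it is not DeepSpeed's code, it uses the textbook coefficients (1.5 and -0.5) rather than whatever tuned coefficients a production Muon implementation may use, and the helper names `matmul`, `transpose`, and `newton_schulz_orthogonalize` are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def newton_schulz_orthogonalize(m, steps=30):
    """Approximate the orthogonal polar factor of a full-rank matrix m.

    Uses the classical iteration X <- 1.5 * X - 0.5 * X @ X.T @ X.
    """
    # Scale by the Frobenius norm so every singular value lies in (0, 1],
    # which is inside the convergence region of the iteration.
    norm = math.sqrt(sum(v * v for row in m for v in row))
    x = [[v / norm for v in row] for row in m]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [
            [1.5 * xv - 0.5 * yv for xv, yv in zip(x_row, y_row)]
            for x_row, y_row in zip(x, xxt_x)
        ]
    return x


# Stand-in for a 2D momentum/update matrix (values are made up).
update = [[3.0, 1.0], [0.0, 2.0]]
q = newton_schulz_orthogonalize(update)
gram = matmul(q, transpose(q))  # close to the identity if q is orthogonal
```

Each step pushes every singular value toward 1 while leaving the singular vectors unchanged, so `gram` ends up close to the identity matrix. The iteration needs only matrix multiplications, which is why it suits GPU execution.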
The CUDA caching allocator in PyTorch fragments memory when allocated blocks prevent adjacent free blocks from merging, causing allocation failures despite sufficient total free memory. This fragmenta…
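The failure mode described above can be reproduced with a toy model. The sketch below is not PyTorch's allocator; it treats a memory pool as a number line, with hypothetical helpers `free_segments` and `can_allocate` and arbitrary units, to show how live blocks sitting between free regions stop those regions from merging.

```python
def free_segments(pool_size, live_blocks):
    """Return the (offset, size) free gaps left by live (offset, size) blocks."""
    gaps, cursor = [], 0
    for offset, size in sorted(live_blocks):
        if offset > cursor:
            gaps.append((cursor, offset - cursor))
        cursor = offset + size
    if cursor < pool_size:
        gaps.append((cursor, pool_size - cursor))
    return gaps


def can_allocate(pool_size, live_blocks, request):
    """A request succeeds only if one contiguous gap is large enough."""
    return any(size >= request for _, size in free_segments(pool_size, live_blocks))


POOL = 100
# Three small live blocks scattered through the pool.
live = [(20, 10), (50, 10), (80, 10)]

gaps = free_segments(POOL, live)
total_free = sum(size for _, size in gaps)    # 70 units free in total
largest_gap = max(size for _, size in gaps)   # but the largest gap is 20
fits = can_allocate(POOL, live, 30)           # False: no gap holds 30 units
```

Here 70 of 100 units are free, yet a 30-unit request fails because the three live blocks split the free space into gaps of at most 20 units.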
LinkedIn re-architected its distributed linear programming solver, DuaLip, using a GPU-accelerated PyTorch version to solve extreme-scale optimization problems involving hundreds of millions of users …
PyTorch's Inductor compiler uses kernel fusion to accelerate model execution by up to 10x, grouping dependent operations into single Triton kernels to reduce memory traffic and kernel launch overhead.…
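As a rough analogy for why fusion helps, the plain-Python sketch below compares three dependent elementwise operations run as separate passes with the same computation done in one pass. It is not Inductor's generated code and involves no Triton; the function names and the "launch" counter are hypothetical stand-ins for kernel launches and intermediate buffers. In PyTorch itself, fusion of this kind is normally requested through `torch.compile`.

```python
def unfused(xs):
    """Run three dependent elementwise ops as three separate passes."""
    launches = 0
    scaled = [x * 2.0 for x in xs]             # pass 1: writes an intermediate buffer
    launches += 1
    shifted = [v + 1.0 for v in scaled]        # pass 2: reads it, writes another
    launches += 1
    clamped = [max(v, 0.0) for v in shifted]   # pass 3: reads again, writes the result
    launches += 1
    return clamped, launches


def fused(xs):
    """Compute the same result in one pass with no intermediate buffers."""
    return [max(x * 2.0 + 1.0, 0.0) for x in xs], 1


data = [-2.0, -0.25, 0.0, 1.5]
unfused_out, unfused_launches = unfused(data)
fused_out, fused_launches = fused(data)
```

Both versions return the same values, but the fused version reads the input once and writes the output once, where the unfused version makes three passes and materializes two intermediate lists.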
TokenSpeed, an open-source inference engine, achieved a record-breaking 580 tokens per second running the Qwen3.5-397B-A17B model on GPUs. The performance gain for agentic workloads comes from elimina…
Alibaba Cloud has joined the PyTorch Foundation as a Platinum member, gaining a seat on the foundation's Governing Board and a position on its Technical Advisory Committee. The Chinese cloud computing…
The PyTorch Foundation reopened applications for its Ambassador Program, seeking community leaders to mentor users, create tutorials, and organize events for a two-year term. The foundation especially…
The PyTorch Docathon 2026, held from May 5 to May 19, drew more than 260 registrants, and its 30 active participants produced over 150 merged pull requests, contributing fixes, API documentation, and Exe…
PyTorch 2.11 now enables direct installation of CUDA-enabled PyTorch wheels on aarch64 Linux from PyPI, eliminating the need for custom package indexes and workarounds that previously complicated depl…
The ExecuTorch MLX delegate now enables GPU-accelerated inference for PyTorch models on Apple Silicon Macs through Apple's MLX framework. The new backend achieves 3-6x higher throughput on generative …
PyTorch 2.12 introduces a new device-agnostic `torch.accelerator.Graph` API that unifies graph capture and replay across CUDA, XPU, and out-of-tree backends. The release delivers up to 100x faster bat…
Arm has released a set of hands-on Jupyter labs demonstrating how to deploy AI models on edge devices using ExecuTorch, an extension of the PyTorch ecosystem designed for local inference on constraine…
Meta researchers developed In-Kernel Broadcast Optimization (IKBO), a kernel-model-system co-design that eliminates redundant user-embedding replication during recommendation model inference. Deployed…
Shepherd Model Gateway (SMG) has disaggregated all CPU-bound workloads from GPU inference in large language model serving, moving tokenization, detokenization, and parsing into a dedicated Rust gatewa…
Researchers at Microsoft have introduced AutoSP, a compiler-based solution that automatically converts standard training code into multi-GPU sequence parallel code for long-context language model trai…