PyTorch Blog

Articles: 17 · Domain: pytorch.org · Feed: RSS

13:47

2026-06-12

pytorch.org

artificial-intelligence

PyTorch Meetup Singapore: A milestone in APAC

Eighty engineers, researchers, and community builders gathered for the inaugural PyTorch Meetup Singapore, hosted at the Red Hat Asia Pacific office. The event featured technical talks on inference, d…

17:00

2026-06-10

pytorch.org

large-language-models

Portable vLLM Model Inference Kernels in Helion

Helion kernels were integrated into vLLM for FP8 inference using Qwen3 models and evaluated across NVIDIA H100 and B200 GPUs. The experiments demonstrated that Helion provides a productive PyTorch-nat…

15:05

2026-06-03

pytorch.org

machine-learning

Using Muon Optimizer with DeepSpeed

DeepSpeed has integrated the Muon Optimizer, a memory-efficient optimizer that uses a single momentum buffer and Newton-Schulz orthogonalization to improve training convergence, particularly for 2D we…

18:43

2026-06-01

pytorch.org

machine-learning

When does fragmentation occur in the CUDA caching allocator?

The CUDA caching allocator in PyTorch fragments memory when allocated blocks prevent adjacent free blocks from merging, causing allocation failures despite sufficient total free memory. This fragmenta…

14:53

2026-06-01

pytorch.org

machine-learning

How LinkedIn Uses PyTorch to Solve Extreme-Scale Optimization Problems

LinkedIn re-architected its distributed linear programming solver, DuaLip, using a GPU-accelerated PyTorch version to solve extreme-scale optimization problems involving hundreds of millions of users …

19:09

2026-05-27

pytorch.org

machine-learning

Why Is PyTorch Compile So Fast: Kernel Fusion

PyTorch's Inductor compiler uses kernel fusion to accelerate model execution by up to 10x, grouping dependent operations into single Triton kernels to reduce memory traffic and kernel launch overhead.…

15:39

2026-05-27

pytorch.org

large-language-models

Up to 580tps! New Speed Record of Qwen3.5-397B-A17B on GPU for Agentic Workloads with TokenSpeed

TokenSpeed, an open-source inference engine, achieved a record-breaking 580 tokens per second running the Qwen3.5-397B-A17B model on GPUs. The performance gain for agentic workloads comes from elimina…

01:00

2026-05-27

pytorch.org

artificial-intelligence

Alibaba Cloud Joins the PyTorch Foundation as a Platinum Member

Alibaba Cloud has joined the PyTorch Foundation as a Platinum member, gaining a seat on the foundation's Governing Board and a position on its Technical Advisory Committee. The Chinese cloud computing…

20:24

2026-05-22

pytorch.org

machine-learning

Join the PyTorch Foundation Ambassador Program: A Global Network of Community Leaders

The PyTorch Foundation reopened applications for its Ambassador Program, seeking community leaders to mentor users, create tutorials, and organize events for a two-year term. The foundation especially…

15:45

2026-05-20

pytorch.org

machine-learning

PyTorch Docathon 2026 Results in 150+ Merged Pull Requests

The PyTorch Docathon 2026, held from May 5 to May 19, resulted in over 150 merged pull requests after more than 260 registrants and 30 active participants contributed fixes, API documentation, and Exe…

17:25

2026-05-18

pytorch.org

artificial-intelligence

vLLM and PyTorch Work Together to Improve the Developer Experience on aarch64

PyTorch 2.11 now enables direct installation of CUDA-enabled PyTorch wheels on aarch64 Linux from PyPI, eliminating the need for custom package indexes and workarounds that previously complicated depl…

15:30

2026-05-18

pytorch.org

machine-learning

Running PyTorch Models on Apple Silicon GPUs with the ExecuTorch MLX Delegate

The ExecuTorch MLX delegate now enables GPU-accelerated inference for PyTorch models on Apple Silicon Macs through Apple's MLX framework. The new backend achieves 3-6x higher throughput on generative …

18:36

2026-05-13

pytorch.org

machine-learning

PyTorch 2.12 Release Blog

PyTorch 2.12 introduces a new device-agnostic `torch.accelerator.Graph` API that unifies graph capture and replay across CUDA, XPU, and out-of-tree backends. The release delivers up to 100x faster bat…

15:50

2026-05-12

pytorch.org

artificial-intelligence

Efficient Edge AI on Arm CPUs and NPUs: Understanding ExecuTorch through Practical Labs

Arm has released a set of hands-on Jupyter labs demonstrating how to deploy AI models on edge devices using ExecuTorch, an extension of the PyTorch ecosystem designed for local inference on constraine…

16:56

2026-05-05

pytorch.org

machine-learning

In-Kernel Broadcast Optimization: Co-Designing Kernels for RecSys Inference

Meta researchers developed In-Kernel Broadcast Optimization (IKBO), a kernel-model-system co-design that eliminates redundant user-embedding replication during recommendation model inference. Deployed…

18:56

2026-04-30

pytorch.org

large-language-models

SMG: The Case for Disaggregating CPU from GPU in LLM Serving

Shepherd Model Gateway (SMG) has disaggregated all CPU-bound workloads from GPU inference in large language model serving, moving tokenization, detokenization, and parsing into a dedicated Rust gatewa…

15:25

2026-04-29

pytorch.org

large-language-models

Introducing AutoSP

Researchers at Microsoft have introduced AutoSP, a compiler-based solution that automatically converts standard training code into multi-GPU sequence parallel code for long-context language model trai…