CUTLASS

Type: Organization, 7 mentions

// recent coverage 7 mentions

16:36

2026-07-07

int21.ai

ai-agents

Stop Waiting for a Bigger Context Window

INT21 has built SwarmOS, a cloud-native platform for multi-agent AI systems, arguing that orchestrating specialized agents is more effective than relying on larger context windows. The company demonst…

17:00

2026-06-10

pytorch.org

large-language-models

Portable vLLM Model Inference Kernels in Helion

Helion kernels were integrated into vLLM for FP8 inference using Qwen3 models and evaluated across NVIDIA H100 and B200 GPUs. The experiments demonstrated that Helion provides a productive PyTorch-nat…

12:53

2026-05-29

zartbot.github.io

ai-chips

Dissecting the SM_120 Microarchitecture

NVIDIA's Blackwell consumer GPU (GB203/SM_120) features a unified TensorCore pipeline where all 12 non-FP64 precision formats share identical 29-cycle latency and 23-cycle throughput, reducing precisi…

03:32

2026-05-28

dev.to

ai-infrastructure

NVIDIA CUTLASS: High-Performance CUDA Templates for AI Linear Algebra

NVIDIA's CUTLASS library, a header-only C++ template framework for writing custom CUDA kernels, powers much of the AI infrastructure behind FlashAttention, vLLM, and PyTorch's internal kernels. The li…

03:12

2026-05-27

metaworld.me

ai-research

Finding deadlocks in CuTe kernels with SPIN

Researchers at the FlashInfer MLSys Challenge developed a formal verification method using the SPIN model checker to detect deadlocks in CuTe DSL kernels running on NVIDIA B200 GPUs. The approach, dem…

08:50

2026-05-26

dev.to

ai-tools

Writing High-Performance Kernels in TileLang, from GEMM to MLA

TileLang introduces a middle-ground approach for writing high-performance GPU kernels, offering explicit control over shared memory allocation, pipeline staging, and warp partitioning through Python c…

12:11

2026-05-23

thonking.ai

artificial-intelligence

Matrix Multiplications on GPUs Run Faster When Given "Predictable" Data

Matrix multiplications on NVIDIA A100 GPUs run up to 15% faster when the input matrices contain predictable values like zeros or integers, rather than random data. The performance difference stems fro…

// co-occurs with top 8 entities

NVIDIA (6), CuTe (2), FlashAttention (2), PyTorch (2), Triton (2), vLLM (2), SPIN (1), FlashInfer (1)