Exploiting GPU Tensor Cores from Java Using Babylon

Researchers extended the Heterogeneous Accelerator Toolkit (HAT) with a tensor-aware API and code transformations using OpenJDK Project Babylon's code reflection API, enabling Java programs to exploit GPU tensor cores for matrix-multiply-accumulate operations. On an NVIDIA Ampere A10 GPU, the approach sped up naïve matrix multiplication from 240 GFLOP/s to 7.3 TFLOP/s, while remaining portable to the Apple M4 GPU via OpenCL 1.2 with an 8x performance improvement.

Abstract

Tensor Cores are dedicated hardware on NVIDIA GPUs that can be programmed to accelerate matrix-multiply-accumulate (MMA) operations. Running MMA operations on these cores can increase the performance of specific applications dramatically. However, Tensor Cores are only available on NVIDIA GPUs and are exposed to the CUDA programming model through low-level APIs. Ideally, we would also like to make those operations accessible from Java to accelerate domain-specific workloads (e.g., LLMs), but those operations must be portable across accelerators.

MMA capabilities are also available on other computing platforms, such as Apple devices using the Metal programming model, or Intel XPUs via the OpenCL and oneAPI software stacks. However, these operations are not always available in other programming models such as OpenCL 1.2 (the OpenCL version that Apple supports), which emphasizes the need for portability.

This article tackles the architectural specificity of NVIDIA Tensor Cores by exploring a portable approach to tensor operations across multiple hardware accelerators that can be used from Java. The goal of this article is twofold. First, we show that Java programs can reach close-to-native performance for matrix-multiply computations on hardware with accelerated MMA support, such as NVIDIA GPUs.
Second, we study how the same Java Tensor API can be mapped across different parallel programming models and vendors while remaining portable in both source code and runtime scheduling parameters.

To support this approach, we extended the Heterogeneous Accelerator Toolkit (HAT) (https://github.com/openjdk/babylon/tree/code-reflection/hat), a parallel programming framework for accelerating data-parallel workloads on hardware accelerators, with a tensor-aware API and a set of code transformations using the code reflection API from the OpenJDK Project Babylon (https://github.com/openjdk/babylon).

Finally, we evaluate the performance of the system using the HAT Tensor API from Java on two GPU platforms: an Apple M4 Max GPU and an NVIDIA Ampere A10 GPU (https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a10/pdf/a10-datasheet.pdf). We show that, by enabling tensor cores on supported hardware (NVIDIA), we can speed up the naïve matrix multiplication kernel from 240 GFLOP/s to 7.3 TFLOP/s, while the application remains portable to the Apple M4 GPU via OpenCL 1.2, where, with some parameter tuning, we can increase performance by 8x over the naïve matrix multiplication.

Disclaimer: this article shows an approach to extend the HAT programming model with an API for explicit tensor-core programming. Furthermore, it shows how to make this approach generic, so that computations expressed with the proposed HAT tensor core API can also be processed on accelerators without explicit tensor instructions. While this article shows a complete approach, the final integration into the HAT programming model is under discussion.

What are GPU Tensor Cores?

Tensor Cores are programmable, dedicated matrix multiply-accumulate (MMA) hardware units that have been present on NVIDIA GPUs since the NVIDIA Volta GPU microarchitecture (https://en.wikipedia.org/wiki/Volta_(microarchitecture)).
These specialized units can improve, for example, training and inference performance for deep learning applications. The direct programmability of these units is exposed via APIs to the CUDA programming model, and they can also be used from specific NVIDIA libraries such as cuDNN (https://developer.nvidia.com/cudnn) and cuBLAS (https://developer.nvidia.com/cublas).

While Tensor Cores are specific to the NVIDIA architecture, comparable MMA operations can be found on other GPU platforms and vendors. For instance, Intel exposes Advanced Matrix Extensions (AMX) (https://www.intel.com/content/www/us/en/products/docs/accelerator-engines/what-is-intel-amx.html) on recent CPUs and XMX (https://www.intel.com/content/www/us/en/docs/oneapi/optimization-guide-gpu/2024-1/xmx.html) matrix engines on some GPUs; these features accelerate matrix operations through different software stacks such as Intel oneAPI (https://www.intel.com/content/www/us/en/products/docs/accelerator-engines/what-is-intel-amx.html).

Using the NVIDIA terminology (https://www.nvidia.com/en-us/data-center/tensor-cores), a tensor operation is represented as a product and addition of small matrices. The following diagram shows a representation of this operation in which three tensors of 16x16 elements, represented as 2D matrices, are multiplied and accumulated.

The Tensor Cores can process hundreds or thousands of scalar FMA operations in a single GPU clock cycle, leading to a performance increase. Note that NVIDIA tends to support larger MMA operations with each new GPU generation. Tensor operations often use mixed precision: input matrices are represented in a lower-precision format, such as FP16, while accumulation and output may use a higher-precision format, such as FP32.

But why is this set of operations important?
Matrix multiply-accumulate operations are widely used in many types of applications, including AI and LLMs for training and inference, where they take more than 80% of the total computation for upstream LLMs, as reported by Modular (https://www.modular.com/blog/matrix-multiplication-on-nvidias-blackwell-part-1-introduction). Besides AI and LLMs, matrix multiplication is used in other types of applications such as scientific computing, graphics, and data analytics, to name a few.

NVIDIA Tensor Architecture and Native Tensor Performance

In a way, Tensor Cores are equivalent to CPU vector units but for 2D-range operations, and specifically, for matrix-multiply operations. Thus, in hardware, GPUs implement a set of functional units to perform multiple MMA operations per GPU clock cycle. The number of functional units per GPU depends on the GPU generation and the GPU tier (e.g., a GPU for a data center vs. a consumer-grade GPU).

To better understand where tensors are computed on GPUs, let’s look at the common organization of CUDA cores and Tensor Cores on current NVIDIA GPU architectures.

The diagram represents a processing block from the NVIDIA GPU architecture. It is composed of a so-called warp scheduler (a warp on NVIDIA GPUs is a set of 32 consecutive threads that run in lockstep on the GPU cores, although since the NVIDIA Volta microarchitecture, independent thread scheduling is also possible). The warp scheduler can dispatch one warp per clock cycle. This is really a throughput machine.

Each processing block contains a large register file (for example, the NVIDIA B200 GPU contains more than 16k private registers). This is private space for the CUDA threads that run inside the processing block. Furthermore, the processing block contains a large set of functional units for computing 32-bit floating-point and 32-bit integer operations. These represent the common CUDA cores that NVIDIA advertises.
Each GPU tier and GPU generation varies in the number of CUDA cores per processing block. For instance, the NVIDIA Blackwell microarchitecture contains 32 CUDA cores for each processing block.

Additionally, each processing block contains units for performing loads and stores, and a special functional unit (SFU), in which math operations such as sqrt, exp, etc., are processed. Finally, each processing block contains a large unit for explicitly computing tensors. The MMA tensor operation that we described previously is executed in these units.

NVIDIA GPUs do not provide just one processing block per GPU. Instead, processing blocks are organized into larger structures called streaming multiprocessors (SMs). In the Blackwell microarchitecture, each SM contains four processing blocks as follows:

Again, different GPU generations and different GPU tiers provide different numbers of SMs per GPU. To give you an example, the NVIDIA B200 GPU (https://developer.nvidia.com/blog/inside-nvidia-blackwell-ultra-the-chip-powering-the-ai-factory-era/) contains up to 160 SMs, and each SM contains 4 Tensor Cores. Furthermore, each Tensor Core can perform up to 512 FMA operations per cycle (https://cvw.cac.cornell.edu/gpu-architecture/horizon-gpus-blackwell-b200/tensor cores fifth gen) in half precision (using 16-bit floating-point numbers). This gives a total of 4 x 512 = 2048 FMA (Fused Multiply-Add) operations per cycle, per SM.

We, as CUDA/GPU programmers, and hopefully as Java developers too, can directly program the Tensor Core unit via an API for performing fast MMA operations.

How Fast can we Process MMA Operations with Tensor Cores?

Let’s run an experiment. I am going to use an NVIDIA A10 GPU, and the CUDA code used has been adapted from an article from NVIDIA (https://developer.nvidia.com/blog/programming-tensor-cores-cuda-9/).
The cited article includes a naïve matrix multiplication that uses Tensor Cores explicitly through the WMMA API, and compares this version against the cublasGemmEx (https://docs.nvidia.com/cuda/cublas/index.html#cublasgemmex) kernel from the cuBLAS library. While this experiment was written for illustration purposes, it gives an idea of how the APIs are used.

I slightly modified that example to also include the naïve matrix multiplication implemented in CUDA without tensors. This gives a better idea of the performance gains from using CUDA Tensor Cores. The source code can be found at the following link: https://github.com/jjfumero/cuda-tensor-samples/blob/main/tensorsExample.cu

Thus, we now have 3 versions in this example:

- A naïve matrix multiplication implemented in CUDA.
- A naïve matrix multiplication implemented in CUDA using the Tensor API (wmma) in a row-major layout.
- A library call to NVIDIA cuBLAS for GEMM in FP16 (16-bit floating-point numbers).

Let’s analyze the performance of each implementation using the NVIDIA Nsight Compute Profiler (NCU) (https://developer.nvidia.com/nsight-compute). The matrix size used is 1024x1024.

The following Figure shows a screenshot of the CUDA NCU profiler for each of the kernels evaluated. The relevant columns for us are the third column (kernel name) and the fifth column (duration in ms) for each kernel.

As we can see, the naïve implementation, shown in row 6 of the Figure, takes 2.15 milliseconds. The naïve implementation using tensors (the wmma example kernel), shown in line 7 of the profiler report, takes 0.33 milliseconds. This means a speedup of ~6.5x in kernel time, just by enabling tensor cores.

And how do we know tensors are being used? When we look at the source section of the NCU profiler, we can identify the mma operations from the CUDA source and correlate them with their SASS (NVIDIA GPU assembly) instructions.
The SASS code looks as follows:

```
HMMA.16816.F32 R4, R12, R22, R4
HMMA.16816.F32 R8, R12, R24, R8
```

HMMA operations represent tensor core operations, as described in the NVIDIA documentation (https://docs.nvidia.com/cuda/ampere-tuning-guide/index.html#improved-tensor-core-operations).

Using the NCU profiler, we also see that the GEMM kernel from cuBLAS takes 0.05 ms (line 9 in the Figure). Thus, this kernel is about 42x faster than the naïve matrix multiplication, and about 6.5x faster than the naïve version using Tensor Core operations, for this input size on the NVIDIA A10 GPU.

In terms of throughput, counting 2 x 1024^3 floating-point operations per multiplication, the naïve matmul kernel achieves ~1 TFLOP/s, while the naïve matmul with tensors enabled reaches over 6.5 TFLOP/s and cuBLAS reaches 42 TFLOP/s. Note that the performance gap between wmma and cuBLAS is expected, since our tensor version does not use GPU shared memory, 2D register tiling, or double buffering (https://siboehm.com/articles/22/CUDA-MMM).

Enabling tensor core programming from Java with HAT and Babylon

Now, the fun stuff. How can we enable Tensor Core programming from Java? Since the goal of tensor core programming is to accelerate matrix multiplications, let’s start with how the naïve matrix multiplication (matmul) is expressed in HAT.

```java
static void mxmNaiveF16(KernelContext kc, F16Array matrixA, F16Array matrixB, F32Array matrixC, int size) {
    // Each thread of the 2D range computes one element of the result.
    float acc = 0.0f;
    for (int k = 0; k < size; k++) {
        F16 ha = matrixA.array(k * size + kc.giy);
        F16 hb = matrixB.array(kc.gix * size + k);
        F16 hc = F16.mul(ha, hb);       // multiply in half precision
        float fc = F16.f16ToFloat(hc);  // widen to FP32 before accumulating
        acc += fc;
    }
    matrixC.array(kc.gix * size + kc.giy, acc);
}
```

This version of the matrix multiplication for HAT already swaps kc.gix and kc.giy for better memory coalescing, and it uses a 2D kernel representation. Thus, it already includes some common optimizations for matrix multiplication on GPUs (https://openjdk.org/projects/babylon/articles/hat-matmul/hat-matmul).
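
To make the indexing in the HAT kernel above concrete, here is a minimal CPU-only sketch in plain Java. It is not part of HAT: the class `NaiveMatmulReference` and its methods `kernelBody` and `multiply` are hypothetical names chosen for illustration, plain `float[]` arrays stand in for `F16Array` and `F32Array` (so no half-precision rounding is modeled), and two nested loops stand in for the 2D range of GPU threads.

```java
// Illustrative CPU reference; names and types are assumptions, not HAT APIs.
class NaiveMatmulReference {

    // Same index arithmetic as the HAT kernel body, for one (gix, giy) pair.
    static void kernelBody(int gix, int giy, float[] a, float[] b, float[] c, int size) {
        float acc = 0.0f;
        for (int k = 0; k < size; k++) {
            acc += a[k * size + giy] * b[gix * size + k];
        }
        c[gix * size + giy] = acc;
    }

    // Emulates a size x size 2D range sequentially: one kernelBody call per "thread".
    static float[] multiply(float[] a, float[] b, int size) {
        float[] c = new float[size * size];
        for (int gix = 0; gix < size; gix++) {
            for (int giy = 0; giy < size; giy++) {
                kernelBody(gix, giy, a, b, c, size);
            }
        }
        return c;
    }
}
```

Reading the flat arrays as column-major matrices, where element (row, col) lives at index col * size + row, these loops compute C = A x B, with giy selecting the row and gix the column of the output element. On the GPU, the HAT kernel runs one such loop body per thread across the whole 2D range in parallel.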
However, as we will show in the performance evaluation section, although this kernel is faster than the Java multithreaded baseline, it remains significantly slower than native and more specialized GPU implementations. We can improve its performance by expressing the matrix multiply-accumulate operation through tensor computation.

Going through the previous HAT kernel, we see that it uses the F16Array data type, a predefined data type in the HAT API for operating on arrays of 16-bit floating-point values (half-float). In addition, this kernel is coded using a 2D range to access the x,y coordinates of the matrix in parallel, mapping the thread ids (namely kc.gix and kc.giy) to the data in the corresponding positions. To learn more about how to optimize matrix multiplication in HAT, I recommend the article on matrix multiplication in HAT referenced above.

The initial design of the HAT Tensor API was heavily inspired by the CUDA Tensor WMMA API (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#wmma), and we have shaped this API to make it friendlier for Java programmers. For illustration purposes, let’s start with how tensor cores can be programmed with CUDA C++ using the WMMA API.

include