Exploiting GPU Tensor Cores from Java Using Babylon

Researchers extended the Heterogeneous Accelerator Toolkit (HAT) with a tensor-aware API and code transformations using OpenJDK Project Babylon's code reflection API, enabling Java programs to exploit GPU tensor cores for matrix-multiply-accumulate operations. On an NVIDIA Ampere A10 GPU, the approach achieved a speedup from 240 GFLOP/s to 7.3 TFLOP/s for naïve matrix multiplication, while remaining portable to Apple M4 GPU via OpenCL 1.2 with an 8x performance improvement.

Abstract

Tensor Cores are dedicated hardware units on NVIDIA GPUs that can be programmed to accelerate matrix-multiply-accumulate (MMA) operations. Running MMA operations on these cores can dramatically increase the performance of specific applications. However, Tensor Cores are only available on NVIDIA GPUs, and they are exposed to the CUDA programming model through low-level APIs. Ideally, we would also like to make those operations accessible from Java to accelerate domain-specific workloads (e.g., LLMs), but those operations must be portable across accelerators.

MMA capabilities are also available on other computing platforms, such as Apple devices using the Metal programming model, or Intel XPUs via the OpenCL and oneAPI software stacks. However, these operations are not always reachable from other programming models, such as OpenCL 1.2 (the OpenCL version that Apple supports), which emphasizes the need for portability.

This article tackles the architectural specificity of NVIDIA Tensor Cores by exploring a portable approach to tensor operations across multiple hardware accelerators that can be used from Java. The goal of this article is twofold. First, we show that Java programs can reach close-to-native performance for matrix-multiply computations on hardware with accelerated MMA support, such as NVIDIA GPUs. Second, we study how the same Java Tensor API can be mapped across different parallel programming models and vendors while remaining portable in both source code and runtime scheduling parameters.

To support this approach, we extended the Heterogeneous Accelerator Toolkit, HAT (https://github.com/openjdk/babylon/tree/code-reflection/hat), a parallel programming framework to accelerate data-parallel workloads on hardware accelerators, with a tensor-aware API and a set of code transformations using the code reflection API from the OpenJDK Project Babylon (https://github.com/openjdk/babylon).
Finally, we evaluate the performance of the system using the HAT Tensor API from Java on two GPU platforms: an Apple M4 Max GPU and an NVIDIA Ampere A10 GPU (https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a10/pdf/a10-datasheet.pdf). We show that, by enabling tensor cores on supported hardware (NVIDIA), we can speed up the naïve matrix multiplication kernel from 240 GFLOP/s to 7.3 TFLOP/s, while the application remains portable and runs on the Apple M4 Max GPU via OpenCL 1.2, where, with some parameter tuning, we can increase performance by 8x over the naïve matrix multiplication.

Disclaimer: this article shows an approach to extend the HAT programming model with an API for explicit tensor-core programming. Furthermore, it shows how to make this approach generic, so that computations expressed with the proposed HAT tensor core API can also run on accelerators without explicit tensor instructions. While this article shows a complete approach, the final integration into the HAT programming model is under discussion.

What are GPU Tensor Cores?

Tensor Cores are dedicated, programmable matrix multiply-accumulate (MMA) hardware units that have been present on NVIDIA GPUs since the NVIDIA Volta GPU microarchitecture (https://en.wikipedia.org/wiki/Volta_(microarchitecture)). These specialized units can improve, for example, training and inference performance for deep learning applications. The direct programmability of these units is exposed via APIs to the CUDA programming model, and they can also be used from specific NVIDIA libraries such as cuDNN (https://developer.nvidia.com/cudnn) and cuBLAS (https://developer.nvidia.com/cublas). While Tensor Cores are specific to the NVIDIA architecture, comparable MMA operations can be found in other GPU platforms and vendors.
For instance, Intel exposes Advanced Matrix Extensions, AMX (https://www.intel.com/content/www/us/en/products/docs/accelerator-engines/what-is-intel-amx.html), on recent CPUs and XMX (https://www.intel.com/content/www/us/en/docs/oneapi/optimization-guide-gpu/2024-1/xmx.html) matrix engines on some GPUs; these features accelerate matrix operations through different software stacks such as Intel oneAPI (https://www.intel.com/content/www/us/en/products/docs/accelerator-engines/what-is-intel-amx.html).

Using the NVIDIA terminology (https://www.nvidia.com/en-us/data-center/tensor-cores), a tensor operation is represented as a product and addition of small matrices. The following diagram shows a representation of this operation, in which three tensors of 16x16 elements, represented as 2D matrices, are multiplied and accumulated. Tensor Cores can process hundreds or thousands of scalar FMA operations in a single GPU clock cycle, which is where the performance increase comes from. Note that NVIDIA tends to support larger MMA operations with each new GPU generation.

Tensor operations often use mixed precision: input matrices are represented in a lower-precision format, such as FP16, while accumulation and output may use a higher-precision format, such as FP32.

But why is this set of operations important? Matrix multiply-accumulate operations are widely used in many types of applications, including AI and LLMs for training and inference, where they take more than 80% of the total computation for upstream LLMs, as reported by Modular (https://www.modular.com/blog/matrix-multiplication-on-nvidias-blackwell-part-1-introduction). Besides AI and LLMs, matrix multiplication is used in other types of applications such as scientific computing, graphics and data analytics, just to name a few.
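As a point of reference for what a single tensor operation computes, the following plain-Java sketch performs D = A x B + C on small row-major tiles. It is only a scalar model of the semantics under some assumptions: the class and method names (`MmaReference`, `mma`) are made up for this illustration, it uses `float` for every operand instead of FP16 inputs, and it says nothing about how the hardware schedules the work.

```java
/** Scalar reference model of one matrix-multiply-accumulate (MMA) on small tiles. */
final class MmaReference {

    /**
     * Computes d = a * b + c for row-major tiles.
     * a is m x k, b is k x n, c and the returned d are m x n.
     * Hypothetical helper for illustration; not part of HAT or CUDA.
     */
    static float[] mma(float[] a, float[] b, float[] c, int m, int n, int k) {
        float[] d = new float[m * n];
        for (int row = 0; row < m; row++) {
            for (int col = 0; col < n; col++) {
                // Start from the accumulator value, then add the dot product.
                float sum = c[row * n + col];
                for (int i = 0; i < k; i++) {
                    sum += a[row * k + i] * b[i * n + col];
                }
                d[row * n + col] = sum;
            }
        }
        return d;
    }

    public static void main(String[] args) {
        float[] a = {1, 2, 3, 4};   // 2 x 2
        float[] b = {5, 6, 7, 8};   // 2 x 2
        float[] c = {1, 1, 1, 1};   // 2 x 2 accumulator
        float[] d = mma(a, b, c, 2, 2, 2);
        System.out.println(java.util.Arrays.toString(d));
    }
}
```

For a 16x16x16 shape, the three nested loops execute 16 x 16 x 16 = 4096 scalar multiply-add steps; this is the amount of work that the hardware units batch into a few tensor instructions.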
NVIDIA Tensor Architecture and Native Tensor Performance

In a way, Tensor Cores are equivalent to CPU vector units, but for 2D-range operations and, specifically, for matrix-multiply operations. Thus, in hardware, GPUs implement a set of functional units to perform multiple MMA operations per GPU clock cycle. The number of functional units per GPU depends on the GPU generation and the GPU tier (e.g., a GPU for a data center vs a consumer-grade GPU).

To better understand where tensors are computed on GPUs, let's look at the common organization of CUDA cores and Tensor Cores on current NVIDIA GPU architectures. The diagram represents a processing block from the NVIDIA GPU architecture. It is composed of a so-called warp scheduler (a warp on NVIDIA GPUs is a set of 32 consecutive threads that run in lockstep on the GPU cores, although independent thread scheduling is also possible since the NVIDIA Volta microarchitecture). The warp scheduler can dispatch one warp per clock cycle. This is really a throughput machine.

Each processing block contains a large register file (for example, the NVIDIA B200 GPU contains more than 16k private registers). This is private space for the CUDA threads that run inside the processing block. Furthermore, the processing block contains a large set of functional units for computing 32-bit floating point and 32-bit integer operations. These represent the common CUDA cores that NVIDIA advertises. The number of CUDA cores per processing block varies with each GPU tier and GPU generation. For instance, the NVIDIA Blackwell microarchitecture contains 32 CUDA cores for each processing block.

Additionally, each processing block contains units for performing loads and stores, and a special functional unit (SFU), in which math operations such as sqrt, exp, etc., are processed. Finally, each processing block contains a large unit for explicitly computing tensors.
The MMA tensor operation that we described previously is executed in these units.

NVIDIA GPUs do not provide just one processing block per GPU. Processing blocks are organized into larger structures called streaming multiprocessors (SMs). In the Blackwell microarchitecture, each SM contains four processing blocks as follows:

Again, different GPU generations and different GPU tiers provide different numbers of SMs per GPU. To give you an example, the NVIDIA B200 GPU (https://developer.nvidia.com/blog/inside-nvidia-blackwell-ultra-the-chip-powering-the-ai-factory-era/) contains up to 160 Streaming Multiprocessors (SMs), and each SM contains 4 Tensor Cores. Furthermore, each Tensor Core can perform up to 512 FMA operations per cycle (https://cvw.cac.cornell.edu/gpu-architecture/horizon-gpus-blackwell-b200/tensor_cores_fifth_gen) in half precision (using 16-bit floating point numbers). This gives a total of 4 x 512 = 2048 FMA (Fused Multiply-Add) operations per cycle, per SM.

We, as CUDA/GPU programmers, and hopefully as Java developers too, can directly program the Tensor Core unit via an API for performing fast MMA operations.

How Fast can we Process MMA Operations with Tensor Cores?

Let's run an experiment. I am going to use an NVIDIA A10 GPU, and the CUDA code used has been adapted from an article from NVIDIA (https://developer.nvidia.com/blog/programming-tensor-cores-cuda-9/). The cited article implements a naïve version of the matrix multiplication that uses tensors explicitly through the WMMA API, and compares this version against the cublasGemmEx (https://docs.nvidia.com/cuda/cublas/index.html#cublasgemmex) kernel from the cuBLAS library. While this experiment was written for illustration purposes, it gives an idea about how the APIs are used. I slightly modified that example to also include the naïve matrix multiplication implemented in CUDA without tensors.
This gives a better idea about the performance gains from using CUDA Tensors. The source code can be found at the following link: https://github.com/jjfumero/cuda-tensor-samples/blob/main/tensorsExample.cu. Thus, we now have 3 versions in this example:

- A naïve matrix-multiplication implemented in CUDA.
- A naïve matrix-multiplication implemented in CUDA using the Tensor API (wmma) in a row-major layout.
- A library call to the NVIDIA cuBLAS for GEMM in FP16 (16-bit floating point numbers).

Let's analyze the performance of each implementation using the NVIDIA Nsight Compute Profiler, NCU (https://developer.nvidia.com/nsight-compute). The matrix size used is 1024x1024. The following Figure shows a screenshot of the CUDA NCU profiler for each of the kernels evaluated. The relevant columns for us are the third column (kernel name) and the fifth column (duration in ms) for each kernel.

As we can see, the naïve implementation (row 6 in the Figure) takes 2.15 milliseconds. The naïve implementation using tensors (the wmma_example kernel, line 7 in the profiler report) takes 0.33 milliseconds. This means a speedup of ~6.5x in kernel time, just by enabling tensor cores.

And how do we know tensors are being used? When we look at the source section of the NCU profiler, we can identify the mma operations from the CUDA source and correlate them with the corresponding SASS (NVIDIA GPU assembly) instructions. The SASS code looks as follows:

    HMMA.16816.F32 R4, R12, R22, R4
    HMMA.16816.F32 R8, R12, R24, R8

HMMA operations represent tensor core operations, as described in the NVIDIA documentation (https://docs.nvidia.com/cuda/ampere-tuning-guide/index.html#improved-tensor-core-operations).

Using the NCU profiler, we also see that the GEMM kernel from cuBLAS takes 0.05 ms (line 9 in the Figure). Thus, this kernel performs about 42x faster than the naïve matrix multiplication, and about 6.5x faster than the naïve version using Tensor Core operations for this input size on the NVIDIA A10 GPU.
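The kernel durations above can be turned into throughput with a little arithmetic. The sketch below assumes the usual operation count of 2 x N^3 floating point operations for an N x N matrix multiplication (one multiply and one add per inner-loop step) and plugs in the rounded durations reported by the profiler; the class and method names are hypothetical helpers for this illustration.

```java
/** Converts kernel durations for an N x N matrix multiplication into TFLOP/s and speedups. */
final class KernelThroughput {

    /** Floating point operations of an n x n x n matmul: one multiply plus one add per step. */
    static double flops(int n) {
        return 2.0 * n * n * n;
    }

    /** Throughput in TFLOP/s for a kernel that takes the given time in milliseconds. */
    static double tflops(int n, double millis) {
        return flops(n) / (millis * 1e-3) / 1e12;
    }

    /** How many times faster the optimized kernel is compared with the baseline. */
    static double speedup(double baselineMs, double optimizedMs) {
        return baselineMs / optimizedMs;
    }

    public static void main(String[] args) {
        int n = 1024;
        // Rounded kernel durations taken from the profiler report discussed above.
        double naiveMs = 2.15;
        double wmmaMs = 0.33;
        double cublasMs = 0.05;

        System.out.printf("naive : %.2f TFLOP/s%n", tflops(n, naiveMs));
        System.out.printf("wmma  : %.2f TFLOP/s (%.1fx over naive)%n",
                tflops(n, wmmaMs), speedup(naiveMs, wmmaMs));
        System.out.printf("cuBLAS: %.2f TFLOP/s (%.1fx over wmma)%n",
                tflops(n, cublasMs), speedup(wmmaMs, cublasMs));
    }
}
```

With these rounded durations, the three kernels come out at roughly 1.0, 6.5 and 43 TFLOP/s. Small differences with respect to figures derived from unrounded profiler values are expected.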
In terms of throughput, the naïve matmul kernel achieves ~1 TFLOP/s, while the naïve matmul with tensors enabled performs over 6.5 TFLOP/s and cuBLAS reaches 42 TFLOP/s. Note that the performance gap between wmma and cuBLAS is expected, since our tensor version does not use GPU shared memory, 2D register tiling, or double buffering (https://siboehm.com/articles/22/CUDA-MMM).

Enabling tensor core programming from Java with HAT and Babylon

Now, the fun stuff. How can we enable Tensor Core programming from Java? Since the goal of tensor core programming is to accelerate matrix multiplications, let's start with how the naïve matrix-multiplication (matmul) is expressed in HAT.

    static void mxmNaiveF16(KernelContext kc, F16Array matrixA, F16Array matrixB, F32Array matrixC, int size) {
        float acc = 0.0f;
        for (int k = 0; k < size; k++) {
            F16 ha = matrixA.array(k * size + kc.giy);
            F16 hb = matrixB.array(kc.gix * size + k);
            F16 hc = F16.mul(ha, hb);
            float fc = F16.f16ToFloat(hc);
            acc += fc;
        }
        matrixC.array(kc.gix * size + kc.giy, acc);
    }

This version of the matrix multiplication for HAT already swaps kc.gix and kc.giy for better memory coalescing, and it uses a 2D kernel representation. Thus, it already includes some common optimizations for matrix-multiply on GPUs (https://openjdk.org/projects/babylon/articles/hat-matmul/hat-matmul). However, as we will show in the Performance Evaluation section, although this kernel is faster than the Java multithreaded baseline, it remains significantly slower than native and more specialized GPU implementations. We can improve its performance by expressing the matrix multiply-accumulate operation through tensor computation.

Going through the previous HAT kernel, we see that it uses the F16Array data type, which is a predefined data type in the HAT API to operate on arrays of 16-bit floating point values (half-float).
Besides, this kernel is coded using a 2D-Range to access the (x, y) coordinates of the matrix in parallel and to map the thread-ids, namely kc.gix and kc.giy, to the data in the corresponding positions. To know more about how to optimize the matrix multiplication in HAT, I recommend the following article:

The initial design of the HAT Tensor API was heavily inspired by the CUDA Tensor WMMA API (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#wmma), and we have shaped this API to make it more friendly for Java programmers. For illustration purposes, let's start with how tensor cores can be programmed with CUDA C++ using the WMMA API.

    #include <cuda_fp16.h>
    #include <mma.h>
    using namespace nvcuda;

    #define WMMA_M 16
    #define WMMA_N 16
    #define WMMA_K 16

    // A: MxK
    // B: KxN
    // C: MxN
    __global__ void cuda_tensors_row_major(half *a, half *b, float *c, int M, int N, int K) {
        int lda = K;
        int ldb = N;
        int ldc = N;
        int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
        int warpN = (blockIdx.y * blockDim.y + threadIdx.y);

        // Declare the fragments
        wmma::fragment<wmma::matrix_a, WMMA_M, WMMA_N, WMMA_K, half, wmma::row_major> a_frag;
        wmma::fragment<wmma::matrix_b, WMMA_M, WMMA_N, WMMA_K, half, wmma::row_major> b_frag;
        wmma::fragment<wmma::accumulator, WMMA_M, WMMA_N, WMMA_K, float> acc_frag;
        wmma::fill_fragment(acc_frag, 0.0f);

        // Loop over the tiles
        int aRow = warpM * WMMA_M;
        int bCol = warpN * WMMA_N;
        for (int i = 0; aRow < M && bCol < N && i < K; i += WMMA_K) {
            // Load the inputs
            wmma::load_matrix_sync(a_frag, a + i + aRow * lda, lda);
            wmma::load_matrix_sync(b_frag, b + bCol + i * ldb, ldb);
            // Perform the matrix multiplication
            wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);
        }

        int cRow = warpM * WMMA_M;
        int cCol = warpN * WMMA_N;
        if (cRow < M && cCol < N) {
            // Store the output
            wmma::store_matrix_sync(c + cCol + cRow * ldc, acc_frag, ldc, wmma::mem_row_major);
        }
    }

The CUDA C++ API is a low-level API: it uses the wmma library, which stands for Warp Matrix Multiply Accumulate, and its operations are performed cooperatively by all threads in a warp. However, we can use the same core ideas as inspiration for the HAT Tensor API.

In this CUDA example, fragments (tensors) are declared using a particular shape, type and memory access layout. In this case, we use a shape of (16, 16, 16) and 16-bit floating point values (half). For the memory access layout, we use row major (row_major). The memory access layout describes how data is stored in memory for the input matrices a and b. If the data is stored in column-major order, we need to define the corresponding tensors as col_major. Otherwise, we use row_major.

The previous code snippet runs a complete matrix multiplication, and it assumes that tiles and matrix dimensions are multiples of 16. Note that CUDA WMMA supports a limited set of tile shapes, with the supported shapes depending on the operand and accumulator types.

The overall strategy to compute matrix-multiplication using tensors is as follows:

1. Declare Tensor A with specific Shape and Float16 type
2. Declare Tensor B with specific Shape and Float16 type
3. Declare Tensor Accumulator with specific Shape
4. Initialize the accumulator tensor
5. Loop over the tiles from 0 to num-tiles, and for each tile do:
   5.1 load tensorA from input matrix A
   5.2 load tensorB from input matrix B
   5.3 perform the MMA operation and store in accumulator
6. Store final result in resulting matrix C

In CUDA, steps 1 and 2 also declare the access layout for each tensor from global memory. Furthermore, each declaration tags whether the tensor will be used as the first or the second operand. These are low-level details that we can abstract in HAT with the help of code reflection. For the memory access layout, while we can make it explicit, we can simplify its use by making row-major the default option, since it is also the default layout in other languages and libraries, such as C, NumPy and Java.
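To make the two layouts concrete, the following plain-Java sketch shows how a (row, column) pair maps to a linear offset in each case, given a leading dimension. The helper names are hypothetical and exist only for this illustration; the row-major formula is the same arithmetic as the `a + i + aRow * lda` address computation in the CUDA kernel above.

```java
/** Linear offsets of a 2D element for the two common memory layouts. */
final class LayoutOffsets {

    /** Row-major: consecutive elements of a row are adjacent in memory. */
    static int rowMajor(int row, int col, int leadingDim) {
        return row * leadingDim + col;
    }

    /** Column-major: consecutive elements of a column are adjacent in memory. */
    static int columnMajor(int row, int col, int leadingDim) {
        return col * leadingDim + row;
    }

    public static void main(String[] args) {
        int lda = 1024; // leading dimension of a square 1024 x 1024 matrix
        int aRow = 16;  // first row of the tile
        int i = 32;     // first column of the tile

        // Offset of the first element of the tile in each layout.
        System.out.println("row-major offset:    " + rowMajor(aRow, i, lda));
        System.out.println("column-major offset: " + columnMajor(aRow, i, lda));
    }
}
```

The two layouts only agree on the diagonal, which is why a tensor load has to know which one the input matrix uses.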
If HAT developers need to access the data using a column-major layout, they can pass an extra parameter when defining the tensor.

Another observation is that tensors A and B can be declared directly when we load the data from the input matrices. Besides, when we load a tensor, we can specify via a method call that we want to load it in FP16 format. Thus, we end up with the following strategy for HAT:

1. Initialize the accumulator (e.g., 0.0f)
2. For each of the tiles from 0 to num-tiles do
   2.1 load-f16 tensorA from matrix A with specified shape
   2.2 load-f16 tensorB from matrix B with specified shape
   2.3 perform the MMA operation and store in accumulator
3. Store final result in matrix C

This way of programming is a bit more generic and facilitates code portability across different programming models and GPU vendors (e.g., by mapping this tensor API to explicit loop-tile processing with OpenCL 1.2, as we will see in the next section).

Note that WMMA fragments in CUDA are mutable objects. However, in our model, we assume tensors are immutable, at least at the API level in HAT. While this eases programmability, the HAT runtime and compiler are still free to reorganize those operations to better match the underlying programming model (e.g., CUDA and OpenCL).

Next, we will explain the details of how HAT exposes these operations to Java programmers.

1. Declare Tensor Accumulator in HAT

Tensor operations are defined in the Tensor class in HAT. The following code snippet shows how to create a tensor that is already initialized to zeros. Besides, it defines a common shape object to be used for the accumulator and load operations.

    final int sizeShape = 16;
    // We can share a shape
    var shape = Tensor.shape(sizeShape, sizeShape, sizeShape);
    Tensor acc = Tensor.zeros(shape, float.class);

In the current implementation, HAT does not choose the tensor shape automatically.
The programmer selects a shape that best matches the backend, target device and data types used. For the CUDA backend, the selected shape must map to one of the WMMA fragment sizes supported for the operand and accumulator types (https://docs.nvidia.com/cuda/cuda-programming-guide/05-appendices/cpp-language-extensions.html#element-types-and-matrix-sizes). For the OpenCL backend, tensors are lowered to explicit per-work-item tiles in private memory, as we will discuss shortly. Larger shapes can increase private register pressure, and can lead to register spilling or even compile/launch failures on some devices and drivers. For instance, using shapeSize = 16 may fail on NVIDIA GPUs through the OpenCL backend, while reducing it to shapeSize = 4 can compile and run. Other platforms, such as Apple Silicon and Intel Integrated Graphics, may accept the larger shape, although tuning can still improve performance.

A more viable solution in the longer run would be to select the shape size via a PREFERRED_SHAPE_SIZE, and let the HAT runtime/compiler automatically select the best fit for the targeted backend and architecture. This feature is currently being discussed on GitHub (https://github.com/openjdk/babylon-docs/pull/20#discussion_r3372431807).

2. For each of the tiles from 0 to num-tiles

The next step is to iterate over the tiles, load the input matrices into tensors and perform the mma operation. In our example, each tile is of size WMMA_K.

    final int WMMA_K = 16;
    for (int i = 0; i < size; i += WMMA_K) {
        // load tensors
        // perform mma
    }

Let's dive into each of these operations.

2.1 Load tensorA from matrix A with specified shape

To load a tensor, we use the loadF16 method from the Tensor class. The parameters are defined as follows:

- Input matrix
- Row to load
- Column to load
- Leading dimension
- Shape

    Tensor tensorA = Tensor.loadF16(matrixA, aRow, i, lda, shape);

If we do not specify the memory access layout, the HAT runtime and compiler use row-major by default. This means that the input matrices are expected to be stored in memory using a row-major layout. If the Java programmer requires a column-major layout (because the input data is stored in such a layout), we can pass an extra parameter to define the accessor as follows:

    Tensor tensorA = Tensor.loadF16(matrixA, aRow, i, lda, shape, Tensor.ofColumnMajor);

For the evaluation section, we will use the default row-major variant, since it is the one we evaluated before for the CUDA C++ performance report in the previous section (How Fast can we Process MMA Operations with Tensor Cores?).

2.2 Load tensorB from matrix B with specified shape

Similarly, we load the data from the second matrix into tensor B.

    Tensor tensorB = Tensor.loadF16(matrixB, i, bCol, ldb, shape);

2.3 Perform the MMA Operation

Now we have everything ready to perform the MMA operation.

    // acc = tensorA * tensorB + acc
    acc = Tensor.mma(tensorA, tensorB, acc);

This operation performs acc = tensorA * tensorB + acc. If we run this code on a platform with tensor-core processing support, the HAT compiler maps these instructions to explicit mma tensor core instructions. Otherwise, as we will see in the following sections, the HAT compiler maps the mma operation into an explicit loop-tile operation that contains a matrix-multiply and accumulate operation as follows:

    acc = add(dot(tensorA, tensorB), acc);

3. Store final result in matrix C

    Tensor.store(matrixC, cRow, cCol, acc, ldc);

Complete Kernel Code in HAT

The following code snippet shows the complete HAT kernel to compute the matrix-multiplication using the proposed Tensor Core API.
    @Reflect
    public static void hatTensors(@RO KernelContext kc, @RO F16Array matrixA, @RO F16Array matrixB, @WO F32Array matrixC, int size) {
        final int shapeSize = 16;
        final int WMMA_M = shapeSize;
        final int WMMA_N = shapeSize;
        final int WMMA_K = shapeSize;

        int warpM = kc.gix / kc.wrs;
        int warpN = kc.giy;

        // multiply square matrices
        final int lda = size;
        final int ldb = size;
        final int ldc = size;

        var shape = Tensor.shape(WMMA_M, WMMA_N, WMMA_K);
        Tensor acc = Tensor.zeros(shape, float.class);

        int aRow = warpM * WMMA_M;
        int bCol = warpN * WMMA_N;
        for (int i = 0; i < size && aRow < size && bCol < size; i += WMMA_K) {
            Tensor tensorA = Tensor.loadF16(matrixA, aRow, i, lda, shape);
            Tensor tensorB = Tensor.loadF16(matrixB, i, bCol, ldb, shape);
            acc = Tensor.mma(tensorA, tensorB, acc);
        }

        int cRow = warpM * WMMA_M;
        int cCol = warpN * WMMA_N;
        if (cRow < size && cCol < size) {
            Tensor.store(matrixC, cRow, cCol, acc, ldc);
        }
    }

If we emit the CUDA code generated by the HAT compiler, we obtain the following CUDA C++ kernel:

    HAT_KERNEL void hatTensors(
        HAT_GLOBAL_MEM KernelContext_t* kc,
        HAT_GLOBAL_MEM F16Array_t* matrixA,
        HAT_GLOBAL_MEM F16Array_t* matrixB,
        HAT_GLOBAL_MEM F32Array_t* matrixC,
        int size
    ) {
        int sizeShape = 16;
        int WMMA_M = sizeShape;
        int WMMA_N = sizeShape;
        int WMMA_K = sizeShape;
        int warpM = HAT_GIX/32;
        int warpN = HAT_GIY;
        int lda = size;
        int ldb = size;
        int ldc = size;
        nvcuda::wmma::fragment<nvcuda::wmma::accumulator, 16, 16, 16, float> acc;
        nvcuda::wmma::fill_fragment(acc, 0.0);
        nvcuda::wmma::fragment<nvcuda::wmma::matrix_a, 16, 16, 16, half, nvcuda::wmma::row_major> tensorA;
        nvcuda::wmma::fragment<nvcuda::wmma::matrix_b, 16, 16, 16, half, nvcuda::wmma::row_major> tensorB;
        int aRow = warpM * WMMA_M;
        int bCol = warpN * WMMA_N;
        for (int i = 0; i < size && aRow < size && bCol < size; i = i + WMMA_K) {
            nvcuda::wmma::load_matrix_sync(tensorA, (half*)matrixA->array + (i + aRow * lda), lda);
            nvcuda::wmma::load_matrix_sync(tensorB, (half*)matrixB->array + (bCol + i * ldb), ldb);
            nvcuda::wmma::mma_sync(acc, tensorA, tensorB, acc);
        }
        int cRow = warpM * WMMA_M;
        int cCol = warpN * WMMA_N;
        if (cRow < size && cCol < size) {
            nvcuda::wmma::store_matrix_sync(matrixC->array + (cCol + cRow * ldc), acc, ldc, nvcuda::wmma::mem_row_major);
        }
        return;
    }

As we can see, the kernel generated by the HAT JIT compiler for CUDA is extremely similar to the handwritten CUDA kernel for tensors.

If readers want to check and play with the HAT tensor API, it is fully available on GitHub under the hat/tensors/v2 branch, HAT Tensor Support (https://github.com/jjfumero/babylon/tree/hat/tensors/v2): yes, we have iterated this API a couple of times already ;-).

Before we show the performance impact of this API from the Java/HAT level, there is another component in the API that we haven't described yet: the warp size and how to launch the kernel on the GPU. Let's now describe it and analyze how we can make it portable across different programming models.

Enabling Tile and Warp Sizes from the HAT ND-Range API

In our code example, we compute the value of warpM from the global thread index and the warp size:

    int warpM = kc.gix / kc.wrs; // warp-size usage 1st dim

But what does this mean? As explained in the NVIDIA GPU architecture section, a warp is a basic execution unit composed of 32 consecutive threads that run on the GPU. Furthermore, to make things a bit more complicated, different GPU vendors may have a different warp size (e.g., on AMD GPUs, a wavefront, the equivalent of a warp in AMD terminology, is composed of 64 threads). However, if we want to make this code portable across vendors and different heterogeneous programming models, we should not have to change any line of the source code to run the application and still obtain correct results. Thus, we need to design the thread dispatcher and warp assignment carefully within the HAT compiler and runtime.
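The effect of the warp size on this index can be checked with a few lines of plain Java. This is only a sketch of the integer arithmetic behind `kc.gix / kc.wrs`: the helper names are made up for this illustration, and the warp sizes 32 and 64 are the NVIDIA and AMD values mentioned above.

```java
/** Integer arithmetic behind the warp index and the tile row it selects. */
final class WarpIndex {

    /** All threads of the same warp obtain the same index. */
    static int warpIndex(int globalThreadId, int warpSize) {
        return globalThreadId / warpSize;
    }

    /** First matrix row of the tile that a given thread cooperates on. */
    static int tileStartRow(int globalThreadId, int warpSize, int tileSize) {
        return warpIndex(globalThreadId, warpSize) * tileSize;
    }

    public static void main(String[] args) {
        int tileSize = 16;
        for (int warpSize : new int[] {32, 64}) {
            for (int gix : new int[] {0, 31, 32, 63, 64}) {
                System.out.printf("warpSize=%d gix=%d -> warp %d, tile row %d%n",
                        warpSize, gix,
                        warpIndex(gix, warpSize),
                        tileStartRow(gix, warpSize, tileSize));
            }
        }
    }
}
```

The same global thread id lands in a different warp, and therefore on a different tile, when the warp size changes. This is why the value cannot be hard-coded in the kernel source if the kernel is meant to stay portable.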
How do we tackle the runtime portability for Warps and Tiles? Currently, HAT exposes an API to specify the global size and the local work-group size through the ND-Range API. We extended the ND-Range API to also include a tile size and a warp size, and we modified the HAT runtime to make warps portable. By portability, we mean that the generated kernel, along with the runtime scheduling parameters, must be functionally correct across the supported platforms. We will still have the opportunity to tune the ND-range if needed, but this should not be an obstacle to obtaining correct functionality.

Introducing a warp-size construct in the HAT API whose value can change at runtime affects how we define the total number of threads to be deployed using the ND-Range API in HAT. For instance, a simple 2D ND-range in HAT is defined as follows:

    NDRange range = NDRange.of2D(ofGlobal(1024, 1024), ofLocal(16, 16));

This means that we want to launch groups of 16x16 threads in a total mesh of 1024x1024 threads. This follows the OpenCL semantics. However, the equivalent schedule following the CUDA semantics is recalculated as follows:

    var range = Grid(64, 64), Block(16,16);

Thus, the ND-range is recomputed as a grid of 64x64 blocks (1024 / 16 = 64 per dimension) using the same block size (16x16). Keep in mind that this is handled automatically by the HAT runtime, without developer intervention.

Introducing Warp-Size and Tile-Size in the ND-Range API

By introducing warps and tiles, what we are doing is reducing the total number of threads to be deployed, while increasing the work to be done per thread.
With the new extension, we can make this visible as follows:

    // total number of threads
    var ndRange = NDRange2D.of(Global2D.of(1024, 1024),
            // Local work-group sizes
            Local2D.of(128, 4),
            // our tile shape size
            NDRange.Tile2D.of(16, 16),
            // indicate warp-enabled per dimension
            Warp2D.of(true, false));

Thus, for a given ND-range, the thread dispatcher is recalculated as follows:

For the OpenCL backend: ndRange = (size / tileSize, size / tileSize)
For the CUDA backend: ndRange = ((size / tileSize) * warpSize, size / tileSize)

This conversion requires the input size to be a multiple of the tile size. Otherwise, the HAT runtime throws a runtime exception. The calculation of the new ND-range may not cover every possible scheduling strategy, but it gives HAT a portable default that launches the correct number of threads for each backend without requiring source-code changes, thus achieving functional portability.

But how is the warp size recalculated? In this case, the value of the warp size used in the HAT kernel is automatically calculated and inserted directly into the tree of the original code model. Currently, this is implemented as a new operation (Op) that is contained in a dialect for tensors within the HAT code transformer. A dialect is a domain-specific extension of the code model that does not modify the core Java code models. It can be seen as an extension for a specific domain of applications, in our case, for tensors. A detailed discussion of tensor dialects is beyond the scope of this article; however, we may cover this topic in a dedicated article in the future.

Thus, for instance, the following code snippet shows how to access the warp size and compute the right warp-thread index:

    int warpM = kc.gix / kc.wrs; // warp-size usage 1st dim
    int warpN = kc.giy;

The kc.wrs becomes a constant value of 32 during code specialization when compiling for NVIDIA architectures.
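Putting the dispatch rule and the specialized warp size together, the sketch below recomputes the number of threads per dimension for both backends. It is a plain-Java model of the two formulas given above, not the HAT runtime code: the class and method names are hypothetical, and the exception type is an assumption, since the text only states that a runtime exception is raised.

```java
/** Model of the thread-dispatch recalculation for tile- and warp-aware ND-ranges. */
final class DispatchSize {

    /** OpenCL backend: one work-item per tile in each dimension. */
    static int[] openCl(int size, int tileSize) {
        checkMultiple(size, tileSize);
        return new int[] {size / tileSize, size / tileSize};
    }

    /** CUDA backend: one warp per tile in the first dimension, one thread per tile in the second. */
    static int[] cuda(int size, int tileSize, int warpSize) {
        checkMultiple(size, tileSize);
        return new int[] {(size / tileSize) * warpSize, size / tileSize};
    }

    /** The conversion is only defined when the input size is a multiple of the tile size. */
    private static void checkMultiple(int size, int tileSize) {
        if (tileSize <= 0 || size % tileSize != 0) {
            // Assumed exception type for this sketch.
            throw new IllegalArgumentException(
                    "size " + size + " is not a multiple of tile size " + tileSize);
        }
    }

    public static void main(String[] args) {
        int size = 1024;
        int tileSize = 16;
        int[] ocl = openCl(size, tileSize);
        int[] cuda = cuda(size, tileSize, 32);
        System.out.println("OpenCL threads: " + ocl[0] + " x " + ocl[1]);
        System.out.println("CUDA threads:   " + cuda[0] + " x " + cuda[1]);
    }
}
```

For a 1024x1024 problem with 16x16 tiles, this yields 64 x 64 work-items for OpenCL and 2048 x 64 threads for CUDA with a warp size of 32. Dividing the first CUDA dimension by 32 inside the kernel brings the index back into the range of 64 tiles.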
In the case of OpenCL, we might choose a value of 1 if OpenCL 1.2 or older is selected as the target, or any other value that matches the architecture on which the kernel will be deployed. Thus, for instance, the resulting code when selecting the CUDA backend looks as follows:

    int warpM = kc.gix / 32;
    int warpN = kc.giy;

Thus, the warpM value is computed as the global thread-id of the first dimension divided by the 32 threads within a warp, while the warpN value is set to the thread-id of the second dimension. Recall that, with the underlying CUDA and OpenCL programming models that HAT targets, developers can launch a multidimensional grid of threads of up to three dimensions, which facilitates the mapping between thread indexing and data.

Enabling Compiler and Runtime Device Portability for HAT

One of the main goals in HAT is not to define vendor-specific interfaces, but rather to provide abstractions that can be mapped to different architectures, hardware accelerators and programming models. So far, we have explained how we map the Tensor API to tensor instructions on supported hardware (e.g., NVIDIA). But how do we make it portable to devices that do not have explicit tensor instructions?

To enable functional portability, we represent tensors as loop-tiles. Tensor operations are represented as tile-level operations for creating, loading and storing tiles, and for computing math operations such as the MMA.
Readers with OpenCL experience might think of OpenCL subgroups, which can be used to target tensor operations, such as the Intel extensions for subgroups. While this is a possible and promising way to map tensor operations to other hardware, what we want initially is a generic reference implementation that guarantees we can run the corresponding HAT programs on older hardware and older OpenCL versions (e.g., OpenCL subgroups were promoted to core in OpenCL 2.1, but platforms such as Apple Silicon only support OpenCL 1.2). Note that OpenCL extensions may expose subgroups, even for older OpenCL versions. However, this is not the case for Apple OpenCL.

As such, a tensor load operation:

```java
var tensorA = Tensor.load(matrixA, aRow, aCol, lda);
```

can be lowered to the following code using loop tiles:

```c
int aRow = warpM * WMMA_M;
for (int m = 0; m < WMMA_M; m++) {
    int rowA = aRow + m;
    for (int n = 0; n < WMMA_K; n++) {
        int colA = aCol + n;
        int idxA = rowA * lda + colA;
        HAT_GLOBAL_MEM F16Impl_t* ha = &matrixA->array[idxA];
        F16_t r = (F16_t){ha->value};
        a_frag[m * WMMA_K + n] = r;
    }
}
```

Similarly, the tensor store operation is mapped as follows:

```c
for (int m = 0; m < WMMA_M; m++) {
    int rowC = cRow + m;
    for (int n = 0; n < WMMA_N; n++) {
        int colC = cCol + n;
        int idxC = rowC * ldc + colC;
        matrixC->array[idxC] = acc[m][n];
    }
}
```

The mma operation is mapped as follows:

```c
for (int m = 0; m < WMMA_M; m++) {
    for (int n = 0; n < WMMA_N; n++) {
        float sum = acc[m][n];
        for (int k = 0; k < WMMA_K; k++) {
            F16_t ha = a_frag[m * WMMA_K + k];
            F16_t hb = b_frag[k * WMMA_N + n];
            F16_t result = (F16_t){ ha.value * hb.value };
            sum += (float) result.value;
        }
        acc[m][n] = sum;
    }
}
```

The tensor and accumulator arrays are mapped to the accelerator's private memory. Each accelerator thread processes a small tile: it loads tiles from global memory into private memory, performs the MMA operation locally, and stores the resulting tile back to global memory.
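To make the load, MMA, and store steps concrete, here is a minimal plain-Java emulation of the loop-tile scheme, checked against a naive triple loop. It is a sketch under stated assumptions rather than the code HAT generates: it uses `float` instead of FP16, row-major square matrices, a single tile edge for all three WMMA dimensions, and a sequential loop over tiles where the accelerator would use one thread per tile. The class and method names (`TileMmaSketch`, `loadTile`, `mma`, `storeTile`, `matmul`, `reference`) are hypothetical.

```java
import java.util.Arrays;

/** Plain-Java emulation of tile load, MMA, and store; all names are illustrative. */
final class TileMmaSketch {
    static final int TILE = 4; // assumed tile edge, standing in for WMMA_M = WMMA_N = WMMA_K

    /** Copies a TILE x TILE block starting at (row, col) into a private fragment. */
    static float[] loadTile(float[] matrix, int row, int col, int ld) {
        float[] frag = new float[TILE * TILE];
        for (int m = 0; m < TILE; m++) {
            for (int n = 0; n < TILE; n++) {
                frag[m * TILE + n] = matrix[(row + m) * ld + (col + n)];
            }
        }
        return frag;
    }

    /** Accumulates acc += aFrag * bFrag on private fragments. */
    static void mma(float[] acc, float[] aFrag, float[] bFrag) {
        for (int m = 0; m < TILE; m++) {
            for (int n = 0; n < TILE; n++) {
                float sum = acc[m * TILE + n];
                for (int k = 0; k < TILE; k++) {
                    sum += aFrag[m * TILE + k] * bFrag[k * TILE + n];
                }
                acc[m * TILE + n] = sum;
            }
        }
    }

    /** Writes the accumulator tile back to the output matrix at (row, col). */
    static void storeTile(float[] matrix, int row, int col, int ld, float[] acc) {
        for (int m = 0; m < TILE; m++) {
            for (int n = 0; n < TILE; n++) {
                matrix[(row + m) * ld + (col + n)] = acc[m * TILE + n];
            }
        }
    }

    /** Multiplies two size x size matrices; each (tileRow, tileCol) pair plays the role of one thread. */
    static float[] matmul(float[] a, float[] b, int size) {
        if (size % TILE != 0) {
            throw new IllegalArgumentException("size must be a multiple of the tile size");
        }
        float[] c = new float[size * size];
        for (int tileRow = 0; tileRow < size; tileRow += TILE) {
            for (int tileCol = 0; tileCol < size; tileCol += TILE) {
                float[] acc = new float[TILE * TILE]; // zero-initialised accumulator tile
                for (int k = 0; k < size; k += TILE) {
                    mma(acc, loadTile(a, tileRow, k, size), loadTile(b, k, tileCol, size));
                }
                storeTile(c, tileRow, tileCol, size, acc);
            }
        }
        return c;
    }

    /** Naive triple-loop reference used to check the tiled result. */
    static float[] reference(float[] a, float[] b, int size) {
        float[] c = new float[size * size];
        for (int i = 0; i < size; i++) {
            for (int j = 0; j < size; j++) {
                float sum = 0f;
                for (int k = 0; k < size; k++) {
                    sum += a[i * size + k] * b[k * size + j];
                }
                c[i * size + j] = sum;
            }
        }
        return c;
    }

    public static void main(String[] args) {
        int size = 8;
        float[] a = new float[size * size];
        float[] b = new float[size * size];
        for (int i = 0; i < a.length; i++) {
            a[i] = i % 7;       // small integers keep float arithmetic exact
            b[i] = (i % 5) - 2;
        }
        boolean same = Arrays.equals(matmul(a, b, size), reference(a, b, size));
        System.out.println("tiled result matches reference: " + same);
    }
}
```

Because each emulated thread keeps its fragments and accumulator in local arrays, the sketch also shows why the per-thread private-memory footprint grows with the square of the tile edge, which is one reason the tile size is worth tuning per device.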
The next question is how HAT performs with the CUDA backend targeting CUDA tensor cores, as well as with the OpenCL backend, which processes tensor operations as tiles. The following section tackles the performance evaluation of this approach.

## Performance Evaluation

To evaluate tensor support in HAT, we use two different platforms to check performance and portability for the two available backends in HAT. Namely:

- An NVIDIA A10 GPU with programmable tensor cores.
- Apple M4 Max GPU with OpenCL 1.2: this platform does not provide explicit tensor support via the OpenCL 1.2 software stack, and it is used to check code portability, functionality and performance by mapping to explicit loop-tile processing for tensors.

The following table summarizes the main hardware and software components for each platform:

| HW/SW Component | NVIDIA A10 24GB VRAM | Apple M4 Max Laptop |
|---|---|---|
| GPU | NVIDIA A10 | Apple M4 Max GPU |
| Programming Backend | CUDA 13.2.78 | OpenCL 1.2 |
| Driver Version | 595.71.05 | Apple OpenCL 1.2 1.0 |
| System Memory | 235 GB | 64 GB |
| CPU | Intel Xeon Platinum 8358 CPU @ 2.60GHz | Apple M4 Max |
| Operating System | Ubuntu 22.04.5 - 6.8.0-1052-oracle | macOS 26.3.1 |
| Java Version | 26-ea+10-1053. | 26+35-2893 |
| HAT Backend | CUDA | OpenCL |

Common setup for both machines:

| Component | Value |
|---|---|
| Babylon Revision | |

Main.java: https://github.com/jjfumero/babylon/blob/hat/tensors/v2/hat/examples/tensors/src/main/java/tensors/Main.java

Note that the A10 GPU is running in a BM.GPU.A10 instance from Oracle Cloud OCI (https://docs.oracle.com/en-us/iaas/Content/Compute/References/computeshapes.htm). The instance is a paravirtualized VM with 15 CPU cores assigned.
The versions to be compared are: (a) the Java parallel-streams version running on the corresponding CPU system; (b) the Java/HAT naïve matrix multiplication using FP32 (float) values; (c) the Java/HAT naïve matrix multiplication using FP16 (half) values; and (d) the naïve matrix multiply implemented using the proposed Tensor API for HAT. For our experiments, we use the column-major version, as it was the default option in the CUDA C++ code explained in the original NVIDIA blog post (https://developer.nvidia.com/blog/programming-tensor-cores-cuda-9/).

We ran each benchmark 100 times, collected the last 50 runs, and report the average and standard deviation. Furthermore, we set the Java heap size to 16 GB. We also ran the benchmark for four different sizes, ranging from 512x512 matrices to 4kx4k matrices. The following script shows how this benchmark was run:

```bash
# Use "ffi-cuda" for the CUDA backend
backend="ffi-opencl"
timestamp=`date "+%a-%b-%d-%I-%M-%S-%p-%Y"`
directory="tensorResults_$timestamp"
mkdir -p $directory
inputSize=(512 1024 2048 4096)
for s in ${inputSize[@]}; do
    echo -e "Running with MatMul Tensors with Size: $s x $s"
    java -Xmx16g -Xms16g -cp hat/job.jar hat.java run \
        $backend tensors --size=$s --iterations=100 \
        --verbose --skip-sequential > $directory/$s.log
    sleep 10
done
```

Each run generates a CSV table with all timers. Those timers were used to plot the end-to-end performance. For the kernel performance, we use the NVIDIA NCU profiler.

### Performance of HAT with Tensors on NVIDIA A10 GPU

This section shows how HAT performs when explicit tensor cores are used in hardware via the CUDA backend. The following performance graph shows the speedups of each version, namely naïve matrix multiplication in FP32, FP16 and the naïve matrix multiplication implemented using the HAT Tensor API.
All these versions are compared to the Java parallel stream implementation evaluated on the Intel Xeon Platinum 8358 with 15 physical CPU cores. Furthermore, the performance graph reports the speedup computed from end-to-end time after warm-up, including data transfers (copying data from the CPU to the GPU, matrix-multiplication processing, and copying data from the GPU to the CPU). We can see that the HAT tensor implementation performs up to 2200 times faster than Java parallel streams on the CPU, and it is consistently faster for all data points compared to the naïve implementations in FP16 and FP32.

### Kernel Performance Profiled with NCU

We also want to analyze whether our tensor kernel implementation in HAT matches the performance of the handwritten CUDA C++ kernel using the WMMA API. Ideally, the kernel performance of the CUDA code generated by HAT and the native CUDA WMMA version should match. To check this, we profiled the HAT execution with the NCU kernel profiler and obtained the GPU kernel elapsed time.

The following figure shows a screenshot of the NCU profiler when running the Tensor benchmark with the same input size (1024x1024) used for the initial evaluation of the CUDA C++ versions. As we can see, the `mxmTensorCM` kernel takes ~0.35 ms, which is ~5% slower than the equivalent CUDA C++ WMMA kernel implementation (0.33 ms for the same size), as we saw in the section "How Fast can we Process MMA Operations with Tensor Cores?".

#### But why doesn't it run at the same speed as CUDA Native?

The HAT compiler transforms arrays represented with Java interfaces into C99 structs. With this in mind, let's inspect the profiler reports of both kernels (CUDA C++ and the one generated by HAT) in more detail. At first glance, using the SoL (Speed of Light) section of the NCU, we see the following differences.
The diagram shows the compute throughput and memory throughput for both kernels. The green bar represents the baseline (CUDA C++), while the blue line represents the throughput of the CUDA kernel generated by HAT. As we can see, the compute throughput of the HAT kernel is 4.47% lower than that of the equivalent CUDA C++ version.

Looking at the memory subsystem report, we see the following values: keeping the CUDA WMMA C++ version as the baseline against the CUDA code generated by HAT, we see a large increase in memory sectors (+50%, see the row "Global Load" and column "Sectors"). Besides that, we see a 50% increase in memory sectors per request, which increases the memory traffic between the L2 cache and global memory, producing slower overall kernel executions.

The main reason for this seems to be related to memory alignment. Looking at the NVIDIA documentation for tensor loads/stores (https://docs.nvidia.com/cuda/cuda-programming-guide/05-appendices/cpp-language-extensions.html), we find the following description for the `load_matrix_sync` operation:

> Waits until all warp lanes have arrived at load_matrix_sync and then loads the matrix fragment a from memory. mptr must be a 256-bit aligned pointer pointing to the first element of the matrix in memory.

This means we need 32-byte alignment for the input `F16Array` arrays.

```diff
--- a/hat/core/src/main/java/hat/buffer/F16Array.java
+++ b/hat/core/src/main/java/hat/buffer/F16Array.java
@@ -41,6 +41,7 @@ interface F16Impl extends Struct, F16 {
     Schema<F16Array> schema = Schema.of(F16Array.class, f16array -> f16array
             .arrayLen("length")   // 4 bytes for length
+            .pad(28)              // + 28 bytes for padding -> 32 bytes
             .array("array", half -> half.fields("value")));
```

After applying the alignment, we now obtain the following report from the Speed-of-Light section: the green line represents the baseline CUDA C++ kernel and the blue line represents the kernel generated by HAT for both compute and memory throughput.
The main difference compared to the previous version is in the compute section, which now runs only 0.85% slower, so the two CUDA kernels are effectively equivalent.

If we benchmark the different versions and input sizes with the NCU, we can compare the kernel quality across the different versions. The following figure shows the GFLOP/s (the higher, the better) for each version, including the CUDA C++ WMMA and cuBLAS versions. In this case, the HAT version for tensors does not include alignment. As we can see, the HAT Tensor generated kernel achieves up to 7.0 TFLOP/s, while the 2D naïve implementations achieve 239 GFLOP/s.

The following performance plot shows the performance when the data alignment is set to 32 bytes for the input `F16Array` types. As we can see, HAT achieves up to 7.3 TFLOP/s. The CUDA C++ and the HAT generated tensor kernel achieve, in this case, the same performance, as we discussed earlier. The highest performance is achieved by the CUDA cuBLAS version, which is 3-7x faster. However, let's keep in mind that the cuBLAS implementation is optimized, while our initial tensor version still has plenty of room for improvement.

### Performance on Apple M4 Max GPU with OpenCL 1.2 and Explicit Tiles

The following performance plot shows the speedups of each version compared to the Java parallel streams version evaluated on the CPU (the higher, the better) when running on the Apple M4 Max GPU. The x-axis shows the input data size (matrix sizes), and the y-axis shows the speedup. For this version, we use a default local group size (`Local.of` from the HAT API) of 128x4 threads, and a tile size of 16 elements. We initially chose these values as they were the default ones when running with the CUDA backend.

Using explicit tile processing does not automatically improve performance.
In this experiment, the tensor version improves performance for the 1k and 2k matrix sizes, but it is slower than the naïve matrix multiplication for the smallest and largest sizes (512x512 and 4kx4k).

But why is it slower? One important performance factor to keep in mind when running applications on GPUs is the selection of the group size and the tile size. This can influence performance, even if the application is suitable for acceleration. Thus, on many occasions we need to tune the group size and tile size. The following code snippet shows the parameters tuned for this application. Note that other combinations may also improve performance, and if the reader tries another GPU, these parameters will quite likely need to be tuned again.

```diff
--- a/hat/examples/tensors/src/main/java/tensors/Main.java
+++ b/hat/examples/tensors/src/main/java/tensors/Main.java
@@ -67,7 +67,7 @@ public class Main {
-    final int shapeSize = 16;
+    final int shapeSize = 4;
-            Local2D.of(128, 4),
-            NDRange.Tile2D.of(16, 16),
+            Local2D.of(128, 1),
+            NDRange.Tile2D.of(4, 4),
```

By selecting a tile size of 4x4 elements and a group size of 128x1 threads, we obtain the following performance graph: with these parameters, the automatic loop-tiled implementation improves by up to 9x compared to the previous configuration using the Tensor API. It becomes the fastest version across all evaluated input sizes and reaches more than 1000x speedup over the CPU parallel-stream implementation. Furthermore, it improves on the naïve matrix multiplication without tensors by 8x.

Since we reran the entire experiment across all versions and input data sizes, some fluctuations are visible even in the baseline implementations. It is also worth noting that the experiments were run on an Apple M4 Max MacBook laptop without full isolation, so interference from other applications may have affected the measurements.
The main takeaway is that the HAT compiler and runtime can compile and launch an equivalent application by mapping tensors to explicit tiles, and with some parameter tuning, we can obtain performance gains even on devices without tensor support. This is due to the applicability of thread coarsening to this compute kernel and the increased work per thread compared to the naïve matrix multiplication.

## Conclusions

In this article, we have explained how to enable Tensor Core computation for NVIDIA GPUs from Java using Project Babylon and HAT. Furthermore, we have explained how to make the proposed representation portable across different backends by explicitly mapping tensors to loop tiles in OpenCL, allowing tile execution on accelerators that lack dedicated MMA units.

We evaluated our approach for the two supported backends in HAT (CUDA and OpenCL) and two computing systems, one for NVIDIA and another one for Apple Silicon with OpenCL. For the NVIDIA/CUDA platform we used an A10 GPU, on which we showed that the performance of HAT matches that of the equivalent CUDA C++ version, achieving up to 7.3 TFLOP/s. The second platform was an Apple M4 Max GPU laptop, on which we showed that, with a bit of tuning, performance of the whole application improves by 8x compared to the naïve matrix multiplication without the tensor API.

The main conclusion is that with Project Babylon and code reflection, it is possible to model portable Java applications that target hardware accelerators while still achieving high performance.

While this article has focused on accelerating matrix multiplication via tensor cores, prior work demonstrates that tensor core programming can also be applied to a wider class of computations, including reductions (https://ieeexplore.ieee.org/document/9147055) and Graph Convolutional Networks, GCNs (https://dl.acm.org/doi/10.1145/3629526.3653835).
## Appendix

The Tensor API implementation for HAT is fully available on GitHub:

### How to run the Java Benchmark

For OpenCL:

```bash
java @hat/run ffi-opencl tensors --iterations=100 --verbose --size=1024 --check
```

For CUDA:

```bash
java @hat/run ffi-cuda tensors --iterations=100 --verbose --size=1024 --check
```

### Show Performance Numbers

The benchmark stores a CSV table in the local directory:

```bash
cat table-tensors-1024.csv
```

### Show the Generated Code

For OpenCL:

```bash
HAT=SHOW_CODE java @hat/run ffi-opencl tensors --iterations=100 --verbose --size=1024 --check
```

For CUDA:

```bash
HAT=SHOW_CODE java @hat/run ffi-cuda tensors --iterations=100 --verbose --size=1024 --check
```