KernelAbstractions.jl

KernelAbstractions.jl (KA) is a package that allows you to write GPU-like kernels targeting different execution backends. KA intends to be a minimal and performant library that explores ways to write heterogeneous code. Although parts of the package are still experimental, it has been used successfully as part of the Exascale Computing Project to run Julia code on pre-Frontier and pre-Aurora systems. Currently, profiling and debugging require backend-specific calls, for example through CUDA.jl.
While KernelAbstractions.jl is focused on performance portability, it emulates GPU semantics, and therefore the kernel language has several constructs that are necessary for good performance on the GPU but serve no purpose on the CPU. In these cases, we either ignore such statements entirely (such as with @synchronize) or swap out the construct for something similar on the CPU (such as using an MVector to replace @localmem). This means that CPU execution will still be fast, but might perform extra work to provide a consistent programming model across GPU and CPU.
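As a sketch of how these constructs appear in practice, the following hypothetical kernel (the name `copy_via_shared!` is illustrative, not part of KA) stages data through @localmem and uses @synchronize; on the CPU() backend the barrier is elided and local memory is emulated:

```julia
using KernelAbstractions

# Illustrative kernel: stage data through local (shared) memory.
# On the GPU, @localmem maps to shared memory and @synchronize to a
# barrier; on the CPU these are emulated or ignored, respectively.
@kernel function copy_via_shared!(dst, src)
    i  = @index(Global, Linear)
    li = @index(Local, Linear)
    tmp = @localmem eltype(src) (16,)  # one slot per work-item in the group
    tmp[li] = src[i]
    @synchronize
    dst[i] = tmp[li]
end

src = rand(Float32, 64)
dst = similar(src)
kernel! = copy_via_shared!(CPU(), 16)            # backend and workgroup size
event = kernel!(dst, src, ndrange=length(src))   # launch; returns an event
wait(event)                                      # synchronize before reading dst
```

The same kernel body runs unchanged on a GPU backend; only the device passed at instantiation time differs.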
Supported backends
All supported backends rely on their respective Julia interface to the compiler backend and depend on GPUArrays.jl and GPUCompiler.jl. KA provides interface kernel generation modules to those packages in /lib. After loading the kernel packages, KA will provide a KA.Device for that backend to be used in the kernel generation stage.
CUDA
using CUDA
using KernelAbstractions
using CUDAKernels
CUDA.jl is currently the most mature way to program for GPUs. This provides the device CUDADevice <: KA.Device.
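A minimal sketch of launching a kernel on the CUDA backend (assumes a CUDA-capable GPU; the kernel name `mul2!` is illustrative):

```julia
using CUDA
using KernelAbstractions
using CUDAKernels

# Illustrative kernel: double every element in place.
@kernel function mul2!(a)
    i = @index(Global)
    a[i] = 2 * a[i]
end

a = CUDA.ones(Float32, 1024)
kernel! = mul2!(CUDADevice(), 256)      # select the CUDA device, workgroup size 256
event = kernel!(a, ndrange=length(a))   # asynchronous launch
wait(event)                             # wait for the GPU kernel to finish
```

Swapping CUDADevice() for CPU() (or ROCDevice()) is the only change needed to run the same kernel elsewhere.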
AMDGPU
using AMDGPU
using KernelAbstractions
using ROCKernels
Experimental AMDGPU (ROCm) support is available via AMDGPU.jl and ROCKernels.jl. This provides the device ROCDevice <: KA.Device. Please get in touch with @jpsamaroo for any issues specific to the ROCKernels backend.
oneAPI
Experimental support for Intel GPUs has also been added through the oneAPI Intel Compute Runtime, interfaced to in oneAPI.jl.
Semantic differences
To Julia
- Functions inside kernels are forcefully inlined, except when marked with @noinline.
- Floating-point multiplication, addition, and subtraction are marked contractable.
To CUDA.jl/AMDGPU.jl
- The kernels are automatically bounds-checked against either the dynamic or statically provided ndrange.
- Functions like Base.sin are mapped to CUDA.sin/AMDGPU.sin.
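For instance, calling plain `sin` inside a kernel works uniformly across backends, and the automatic ndrange bounds check means no manual index guard is needed even when the ndrange is not a multiple of the workgroup size (the kernel name here is illustrative):

```julia
using KernelAbstractions

@kernel function sin_kernel!(out, x)
    i = @index(Global)     # indices beyond ndrange are masked automatically
    out[i] = sin(x[i])     # lowered to CUDA.sin / AMDGPU.sin on GPU backends
end

x = rand(100)
out = similar(x)
# ndrange = 100 with workgroup size 64: the partial second group is bounds-checked.
event = sin_kernel!(CPU(), 64)(out, x, ndrange=length(x))
wait(event)
```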
To GPUifyLoops.jl
@scratch has been renamed to @private, and the semantics have changed. Instead of denoting how many dimensions are implicit on the GPU, you now only provide the explicit number of dimensions that you require; the implicit CPU dimensions are appended.
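A sketch of the @private semantics, assuming a hypothetical kernel that keeps a per-work-item accumulator of an explicitly stated size (only the explicit dimension is written; CPU-side implicit dimensions are handled by KA):

```julia
using KernelAbstractions

# Illustrative kernel: sum each row of A into out using a private
# per-work-item accumulator declared with its explicit size only.
@kernel function rowsum!(out, A)
    i = @index(Global)
    acc = @private eltype(A) (1,)   # one explicit slot per work-item
    acc[1] = zero(eltype(A))
    for j in 1:size(A, 2)
        acc[1] += A[i, j]
    end
    out[i] = acc[1]
end

A = rand(8, 4)
out = zeros(8)
wait(rowsum!(CPU(), 8)(out, A, ndrange=size(A, 1)))
```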
Contributing
Please file any bug reports through GitHub issues or fixes through a pull request. Any heterogeneous hardware or code aficionado is welcome to join us on our journey.