
KernelAbstractions.jl is a package that allows you to write GPU-like kernels that target different execution backends. It is intended to be a minimal and performant library that explores ways to best write heterogeneous code.


While KernelAbstractions.jl is focused on performance portability, it is GPU-biased: the kernel language has several constructs that are necessary for good performance on the GPU but may hurt performance on the CPU.


Writing your first kernel

Kernel functions have to be marked with the @kernel macro. Inside a @kernel function you can use the kernel language. As an example, the mul2 kernel below multiplies each element of the array A by 2. It uses the @index macro to obtain the global linear index of the current workitem.

using KernelAbstractions

@kernel function mul2(A)
  I = @index(Global)   # global linear index of the current workitem
  A[I] = 2 * A[I]
end

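To make the per-workitem view concrete, here is a minimal plain-Julia sketch of what mul2 computes. It does not use KernelAbstractions at all: the name mul2_reference! is hypothetical, and the serial loop only stands in for the workitems that a backend would run in parallel.

```julia
# Plain-Julia reference for the result of the mul2 kernel.
# `mul2_reference!` is a hypothetical helper, not part of KernelAbstractions.
function mul2_reference!(A)
    for I in eachindex(A)    # each iteration plays the role of one workitem
        A[I] = 2 * A[I]      # same body as the kernel
    end
    return A
end

A = ones(4, 4)
mul2_reference!(A)
```

A reference like this can be handy for checking a kernel's output on small inputs.
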
Launching your first kernel

You can construct a kernel for a specific backend by calling the kernel function with the device kind as the first argument, the size of the workgroup as the second argument, and a static ndrange as the third argument. The second and third arguments are optional. After instantiating the kernel, you launch it by calling the kernel object with the kernel's arguments plus keyword arguments that configure the specific launch. The example below creates a kernel with a static workgroup size of 16 and a dynamic ndrange. Since the ndrange is dynamic, it has to be provided at launch as a keyword argument.

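A = ones(1024, 1024)
kernel = mul2(CPU(), 16)            # static workgroup size 16, dynamic ndrange
event = kernel(A, ndrange=size(A))  # asynchronous launch, returns an event
wait(event)                         # block until the kernel has finished
all(A .== 2.0)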

All kernel launches are asynchronous. Each launch produces an event token that has to be waited upon before reading or writing memory that was passed as an argument to the kernel. See dependencies for a full explanation.

Important differences to Julia

  1. Functions inside kernels are forcefully inlined, except when marked with @noinline.
  2. Floating-point multiplication, addition, and subtraction are marked contractable.
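
Contractable means the compiler is permitted to fuse a multiply with a following add or subtract into a single fused multiply-add, which rounds once instead of twice. Whether fusion actually happens depends on the backend and hardware. The sketch below uses plain Julia and Base.fma outside of any kernel to show how the two evaluation orders can differ; the input values are chosen purely for illustration.

```julia
# Compare separately rounded multiply/subtract with a fused multiply-add.
a = 1.0 + eps()            # 1 + 2^-52
b = 1.0 - eps()            # 1 - 2^-52, exactly representable in Float64

# The exact product is 1 - 2^-104, which rounds to 1.0 before the subtraction.
separate = a * b - 1.0

# fma rounds only once, so the tiny residual survives.
fused = fma(a, b, -1.0)

println("separate = ", separate)
println("fused    = ", fused)
```

With these inputs the separately rounded result is exactly zero while the fused result is -2^-104, so code inside a kernel should not rely on bit-identical results with the same arithmetic outside a kernel.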

Important differences to CUDAnative

  1. The kernels are automatically bounds-checked against either the dynamic or statically provided ndrange.
  2. Functions like Base.sin are mapped to CUDAnative.sin.
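
As a conceptual illustration of the first point, the following plain-Julia sketch emulates a launch in which the workgroups cover more indices than the ndrange and the surplus workitems are masked out. It is not how KernelAbstractions is implemented; launch_masked! and its arguments are hypothetical names, and the serial loop stands in for parallel execution.

```julia
# Conceptual sketch only: emulate bounds-checking workitems against an ndrange.
# `launch_masked!` is a hypothetical helper, not a KernelAbstractions function.
function launch_masked!(f, A, workgroupsize)
    n = length(A)                       # plays the role of the ndrange
    ngroups = cld(n, workgroupsize)     # round up so every element is covered
    skipped = 0
    for group in 1:ngroups, lane in 1:workgroupsize
        I = (group - 1) * workgroupsize + lane   # global linear index
        if I <= n
            A[I] = f(A[I])              # in range: run the kernel body
        else
            skipped += 1                # out of range: the workitem does nothing
        end
    end
    return skipped
end

A = ones(10)
skipped = launch_masked!(x -> 2x, A, 4)
```

Here 10 elements with a workgroup size of 4 need 3 workgroups, so 2 of the 12 workitems fall outside the range and are skipped.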

Important differences to GPUifyLoops

  1. @scratch has been renamed to @private, and the semantics have changed. Instead of denoting how many dimensions are implicit on the GPU, you only ever provide the explicit number of dimensions that you require. The implicit CPU dimensions are appended.

How to debug kernels


How to profile kernels