Execution

CUDAnative.@cudaMacro
@cuda [kwargs...] func(args...)

High-level interface for executing code on a GPU. The @cuda macro should prefix a call, with func a callable function or object that should return nothing. It will be compiled to a CUDA function upon first use, and to a certain extent arguments will be converted and managed automatically using cudaconvert. Finally, a call to CUDAdrv.cudacall is performed, scheduling a kernel launch on the current CUDA context.

Several keyword arguments are supported that influence the behavior of @cuda.

The underlying operations (argument conversion, kernel compilation, kernel call) can be performed explicitly when more control is needed, e.g. to reflect on the resource usage of a kernel to determine the launch configuration. A host-side kernel launch is done as follows:

args = ...
GC.@preserve args begin
    kernel_args = cudaconvert.(args)
    kernel_tt = Tuple{Core.Typeof.(kernel_args)...}
    kernel = cufunction(f, kernel_tt; compilation_kwargs)
    kernel(kernel_args...; launch_kwargs)
end

A device-side launch, aka. dynamic parallelism, is similar but more restricted:

args = ...
# GC.@preserve is not supported
# we're on the device already, so no need to cudaconvert
kernel_tt = Tuple{Core.Typeof(args[1]), ...}    # this needs to be fully inferred!
kernel = dynamic_cufunction(f, kernel_tt)       # no compiler kwargs supported
kernel(args...; launch_kwargs)
CUDAnative.HostKernelType
(::HostKernel)(args...; kwargs...)
(::DeviceKernel)(args...; kwargs...)

Low-level interface to call a compiled kernel, passing GPU-compatible arguments in args. For a higher-level interface, use CUDAnative.@cuda.

The following keyword arguments are supported:

  • threads (defaults to 1)
  • blocks (defaults to 1)
  • shmem (defaults to 0)
  • config: callback function to dynamically compute the launch configuration. should accept a HostKernel and return a name tuple with any of the above as fields. this functionality is intended to be used in combination with the CUDA occupancy API.
  • stream (defaults to the default stream)
CUDAnative.cufunctionFunction
cufunction(f, tt=Tuple{}; kwargs...)

Low-level interface to compile a function invocation for the currently-active GPU, returning a callable kernel object. For a higher-level interface, use @cuda.

The following keyword arguments are supported:

  • minthreads: the required number of threads in a thread block
  • maxthreads: the maximum number of threads in a thread block
  • blocks_per_sm: a minimum number of thread blocks to be scheduled on a single multiprocessor
  • maxregs: the maximum number of registers to be allocated to a single thread (only supported on LLVM 4.0+)
  • name: override the name that the kernel will have in the generated code

The output of this function is automatically cached, i.e. you can simply call cufunction in a hot path without degrading performance. New code will be generated automatically, when when function changes, or when different types or keyword arguments are provided.

Device-side execution

CUDAnative.DeviceKernelType
(::HostKernel)(args...; kwargs...)
(::DeviceKernel)(args...; kwargs...)

Low-level interface to call a compiled kernel, passing GPU-compatible arguments in args. For a higher-level interface, use CUDAnative.@cuda.

The following keyword arguments are supported:

  • threads (defaults to 1)
  • blocks (defaults to 1)
  • shmem (defaults to 0)
  • config: callback function to dynamically compute the launch configuration. should accept a HostKernel and return a name tuple with any of the above as fields. this functionality is intended to be used in combination with the CUDA occupancy API.
  • stream (defaults to the default stream)
CUDAnative.dynamic_cufunctionFunction
dynamic_cufunction(f, tt=Tuple{})

Low-level interface to compile a function invocation for the currently-active GPU, returning a callable kernel object. Device-side equivalent of CUDAnative.cufunction.

No keyword arguments are supported.

Utilities

CUDAnative.cudaconvertFunction
cudaconvert(x)

This function is called for every argument to be passed to a kernel, allowing it to be converted to a GPU-friendly format. By default, the function does nothing and returns the input object x as-is.

Do not add methods to this function, but instead extend the underlying Adapt.jl package and register methods for the the CUDAnative.Adaptor type.

CUDAnative.nextwarpFunction
nextwarp(dev, threads)
prevwarp(dev, threads)

Returns the next or previous nearest number of threads that is a multiple of the warp size of a device dev. This is a common requirement when using intra-warp communication.