Execution
CUDAnative.@cuda
— Macro

@cuda [kwargs...] func(args...)
High-level interface for executing code on a GPU. The @cuda
macro should prefix a call, with func
a callable function or object that should return nothing. It will be compiled to a CUDA function upon first use, and to a certain extent arguments will be converted and managed automatically using cudaconvert
. Finally, a call to CUDAdrv.cudacall
is performed, scheduling a kernel launch on the current CUDA context.
Several keyword arguments are supported that influence the behavior of @cuda
.
- dynamic: use dynamic parallelism to launch device-side kernels
- arguments that influence kernel compilation: see cufunction and dynamic_cufunction
- arguments that influence kernel launch: see CUDAnative.HostKernel and CUDAnative.DeviceKernel
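For illustration, a minimal sketch of such a call (index_kernel is a made-up name, and CuArrays is assumed for device allocation):

```julia
using CUDAnative, CuArrays

# a simple kernel that writes its global thread index into an array
function index_kernel(a)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(a)
        @inbounds a[i] = i
    end
    return nothing
end

a = CuArrays.zeros(Int, 1024)
@cuda threads=256 blocks=4 index_kernel(a)
```

The array argument is converted by cudaconvert behind the scenes, so the kernel receives a device-compatible object.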
The underlying operations (argument conversion, kernel compilation, kernel call) can be performed explicitly when more control is needed, e.g. to reflect on the resource usage of a kernel to determine the launch configuration. A host-side kernel launch is done as follows:
args = ...
GC.@preserve args begin
kernel_args = cudaconvert.(args)
kernel_tt = Tuple{Core.Typeof.(kernel_args)...}
kernel = cufunction(f, kernel_tt; compilation_kwargs)
kernel(kernel_args...; launch_kwargs)
end
A device-side launch, aka. dynamic parallelism, is similar but more restricted:
args = ...
# GC.@preserve is not supported
# we're on the device already, so no need to cudaconvert
kernel_tt = Tuple{Core.Typeof(args[1]), ...} # this needs to be fully inferred!
kernel = dynamic_cufunction(f, kernel_tt) # no compiler kwargs supported
kernel(args...; launch_kwargs)
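The same device-side launch can also be expressed with the dynamic keyword to @cuda; a hedched sketch with hypothetical kernel names (a is assumed to be a device array allocated beforehand):

```julia
using CUDAnative

function child(a)
    a[threadIdx().x] = threadIdx().x
    return nothing
end

function parent(a)
    # launched from the device: @cuda dynamic=true goes through dynamic_cufunction
    @cuda dynamic=true threads=length(a) child(a)
    return nothing
end

# host-side outer launch
@cuda parent(a)
```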
CUDAnative.HostKernel
— Type

(::HostKernel)(args...; kwargs...)
(::DeviceKernel)(args...; kwargs...)
Low-level interface to call a compiled kernel, passing GPU-compatible arguments in args
. For a higher-level interface, use CUDAnative.@cuda
.
The following keyword arguments are supported:
- threads (defaults to 1)
- blocks (defaults to 1)
- shmem (defaults to 0)
- config: callback function to dynamically compute the launch configuration. Should accept a HostKernel and return a named tuple with any of the above as fields. This functionality is intended to be used in combination with the CUDA occupancy API.
- stream (defaults to the default stream)
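As a hedged sketch of the config callback (this assumes an occupancy query such as CUDAdrv.launch_configuration returning suggested blocks and threads; the exact name and availability may differ across package versions):

```julia
kernel = cufunction(my_kernel, kernel_tt)   # my_kernel, kernel_tt as before
kernel(kernel_args...; config = function (k)
    # assumption: launch_configuration wraps the CUDA occupancy API
    cfg = CUDAdrv.launch_configuration(k.fun)
    (threads = cfg.threads, blocks = cfg.blocks)
end)
```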
CUDAnative.cufunction
— Function

cufunction(f, tt=Tuple{}; kwargs...)
Low-level interface to compile a function invocation for the currently-active GPU, returning a callable kernel object. For a higher-level interface, use @cuda
.
The following keyword arguments are supported:
- minthreads: the required number of threads in a thread block
- maxthreads: the maximum number of threads in a thread block
- blocks_per_sm: a minimum number of thread blocks to be scheduled on a single multiprocessor
- maxregs: the maximum number of registers to be allocated to a single thread (only supported on LLVM 4.0+)
- name: override the name that the kernel will have in the generated code
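A brief sketch of passing compilation keyword arguments (my_kernel and a are placeholders):

```julia
a = CuArrays.rand(Float32, 1024)
kernel_tt = Tuple{Core.Typeof(cudaconvert(a))}
# bound the block size at compile time and give the kernel a recognizable name
kernel = cufunction(my_kernel, kernel_tt; maxthreads=256, name="my_kernel")
```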
The output of this function is automatically cached, i.e. you can simply call cufunction
in a hot path without degrading performance. New code will be generated automatically when the function changes, or when different types or keyword arguments are provided.
Device-side execution
CUDAnative.DeviceKernel
— Type

(::HostKernel)(args...; kwargs...)
(::DeviceKernel)(args...; kwargs...)
Low-level interface to call a compiled kernel, passing GPU-compatible arguments in args
. For a higher-level interface, use CUDAnative.@cuda
.
The following keyword arguments are supported:
- threads (defaults to 1)
- blocks (defaults to 1)
- shmem (defaults to 0)
- config: callback function to dynamically compute the launch configuration. Should accept a HostKernel and return a named tuple with any of the above as fields. This functionality is intended to be used in combination with the CUDA occupancy API.
- stream (defaults to the default stream)
CUDAnative.dynamic_cufunction
— Function

dynamic_cufunction(f, tt=Tuple{})
Low-level interface to compile a function invocation for the currently-active GPU, returning a callable kernel object. Device-side equivalent of CUDAnative.cufunction
.
No keyword arguments are supported.
Utilities
CUDAnative.cudaconvert
— Function

cudaconvert(x)
This function is called for every argument to be passed to a kernel, allowing it to be converted to a GPU-friendly format. By default, the function does nothing and returns the input object x
as-is.
Do not add methods to this function; instead, extend the underlying Adapt.jl package and register methods for the CUDAnative.Adaptor
type.
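For example, a user-defined struct carrying GPU arrays can be made kernel-convertible through Adapt.jl (Interpolation is a hypothetical type; this sketches the generic Adapt.adapt_structure pattern rather than CUDAnative-specific API):

```julia
using Adapt

struct Interpolation{A}
    xs::A
    ys::A
end

# recursively adapt the fields, so that e.g. host-side GPU arrays become
# device-side arrays when cudaconvert processes an argument of this type
Adapt.adapt_structure(to, itp::Interpolation) =
    Interpolation(adapt(to, itp.xs), adapt(to, itp.ys))
```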
CUDAnative.nextwarp
— Function

nextwarp(dev, threads)
prevwarp(dev, threads)
Returns the next or previous nearest number of threads that is a multiple of the warp size of a device dev
. This is a common requirement when using intra-warp communication.
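A small sketch, assuming a device with a warp size of 32:

```julia
using CUDAnative, CUDAdrv

dev = CUDAdrv.CuDevice(0)
nextwarp(dev, 100)   # rounds up to a warp multiple: 128 for warp size 32
prevwarp(dev, 100)   # rounds down: 96 for warp size 32
```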