Compiler

Execution

The main entry-point to the compiler is the @cuda macro:

CUDA.@cuda (Macro)
@cuda [kwargs...] func(args...)

High-level interface for executing code on a GPU. The @cuda macro should prefix a call, with func a callable function or object that should return nothing. It will be compiled to a CUDA function upon first use, and to a certain extent arguments will be converted and managed automatically using cudaconvert. Finally, a call to cudacall is performed, scheduling a kernel launch on the current CUDA context.

Several keyword arguments are supported that influence the behavior of @cuda.
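As an illustrative sketch (the vadd! kernel and the sizes here are made up for demonstration), a typical launch combines the threads and blocks keyword arguments with a kernel that returns nothing:

```julia
using CUDA

# hypothetical element-wise addition kernel; note that it returns `nothing`
function vadd!(c, a, b)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(c)
        @inbounds c[i] = a[i] + b[i]
    end
    return nothing
end

a = CUDA.rand(1024); b = CUDA.rand(1024); c = similar(a)
@cuda threads=256 blocks=cld(length(c), 256) vadd!(c, a, b)
```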

The underlying operations (argument conversion, kernel compilation, kernel call) can be performed explicitly when more control is needed, e.g. to reflect on the resource usage of a kernel to determine the launch configuration. A host-side kernel launch is done as follows:

args = ...
GC.@preserve args begin
    kernel_args = cudaconvert.(args)
    kernel_tt = Tuple{Core.Typeof.(kernel_args)...}
    kernel = cufunction(f, kernel_tt; compilation_kwargs)
    kernel(kernel_args...; launch_kwargs)
end

A device-side launch, also known as dynamic parallelism, is similar but more restricted:

args = ...
# GC.@preserve is not supported
# we're on the device already, so no need to cudaconvert
kernel_tt = Tuple{Core.Typeof(args[1]), ...}    # this needs to be fully inferred!
kernel = dynamic_cufunction(f, kernel_tt)       # no compiler kwargs supported
kernel(args...; launch_kwargs)
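For example, a parent kernel can launch a child kernel by passing the dynamic keyword to @cuda, which routes through dynamic_cufunction (a sketch; both kernels are hypothetical):

```julia
using CUDA

function child(out)
    out[threadIdx().x] = threadIdx().x
    return nothing
end

function parent(out)
    # device-side launch: dynamic=true uses dynamic_cufunction under the hood
    @cuda dynamic=true threads=length(out) child(out)
    return nothing
end

out = CuArray{Int}(undef, 4)
@cuda parent(out)
synchronize()
```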

If needed, you can use a lower-level API that lets you inspect the compiled kernel:

CUDA.cudaconvert (Function)
cudaconvert(x)

This function is called for every argument to be passed to a kernel, allowing it to be converted to a GPU-friendly format. By default, the function does nothing and returns the input object x as-is.

Do not add methods to this function, but instead extend the underlying Adapt.jl package and register methods for the CUDA.Adaptor type.
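For instance, a custom wrapper type can opt into argument conversion by extending Adapt.jl (a sketch; the Interpolate type is hypothetical):

```julia
using Adapt

# hypothetical wrapper around an array
struct Interpolate{A}
    xs::A
end

# adapt the wrapped field; Adapt reconstructs the struct with a
# GPU-compatible array when the object is passed to a kernel
Adapt.adapt_structure(to, itp::Interpolate) = Interpolate(adapt(to, itp.xs))
```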

CUDA.cufunction (Function)
cufunction(f, tt=Tuple{}; kwargs...)

Low-level interface to compile a function invocation for the currently-active GPU, returning a callable kernel object. For a higher-level interface, use @cuda.

The following keyword arguments are supported:

  • minthreads: the required number of threads in a thread block
  • maxthreads: the maximum number of threads in a thread block
  • blocks_per_sm: a minimum number of thread blocks to be scheduled on a single multiprocessor
  • maxregs: the maximum number of registers to be allocated to a single thread (only supported on LLVM 4.0+)
  • name: override the name that the kernel will have in the generated code

The output of this function is automatically cached, i.e. you can simply call cufunction in a hot path without degrading performance. New code will be generated automatically when the function changes, or when different types or keyword arguments are provided.

CUDA.HostKernel (Type)
(::HostKernel)(args...; kwargs...)
(::DeviceKernel)(args...; kwargs...)

Low-level interface to call a compiled kernel, passing GPU-compatible arguments in args. For a higher-level interface, use @cuda.

The following keyword arguments are supported:

  • threads (defaults to 1)
  • blocks (defaults to 1)
  • shmem (defaults to 0)
  • config: callback function to dynamically compute the launch configuration. It should accept a HostKernel and return a named tuple with any of the above as fields. This functionality is intended to be used in combination with the CUDA occupancy API.
  • stream (defaults to the default stream)
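One way to make use of the occupancy API is to compile the kernel without launching it, query a suggested configuration, and then launch by hand (a sketch, reusing a hypothetical vadd! kernel):

```julia
using CUDA

function vadd!(c, a, b)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    i <= length(c) && (@inbounds c[i] = a[i] + b[i])
    return nothing
end

a = CUDA.rand(1024); b = CUDA.rand(1024); c = similar(a)

kernel = @cuda launch=false vadd!(c, a, b)      # compile only
config = launch_configuration(kernel.fun)       # occupancy-based suggestion
threads = min(length(c), config.threads)
blocks = cld(length(c), threads)
kernel(c, a, b; threads, blocks)
```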

CUDA.version (Function)
version()

Returns the CUDA version as reported by the driver.

version(k::HostKernel)

Queries the PTX and SM versions a kernel was compiled for. Returns a named tuple.

CUDA.maxthreads (Function)
maxthreads(k::HostKernel)

Queries the maximum number of threads a kernel can use in a single block.

CUDA.registers (Function)
registers(k::HostKernel)

Queries the register usage of a kernel.

CUDA.memory (Function)
memory(k::HostKernel)

Queries the local, shared and constant memory usage of a compiled kernel in bytes. Returns a named tuple.
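These reflection functions are handy when picking a launch configuration by hand; for example (a sketch, with a hypothetical kernel):

```julia
using CUDA

dummy(a) = (a[1] = 0f0; nothing)

a = CUDA.zeros(Float32, 1)
kernel = @cuda launch=false dummy(a)

CUDA.maxthreads(kernel)   # upper bound on threads per block
CUDA.registers(kernel)    # registers per thread
CUDA.memory(kernel)       # named tuple of local/shared/constant bytes
```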

Reflection

If you want to inspect generated code, you can use macros that resemble functionality from the InteractiveUtils standard library:

@device_code_lowered
@device_code_typed
@device_code_warntype
@device_code_llvm
@device_code_ptx
@device_code_sass
@device_code
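For example, to dump the PTX code generated for a particular launch, prefix the @cuda invocation with one of these macros (a sketch with a hypothetical kernel):

```julia
using CUDA

dummy(a) = (a[1] = 42f0; nothing)

a = CUDA.zeros(Float32, 1)
@device_code_ptx @cuda dummy(a)
```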

These macros are also available in function-form:

CUDA.code_typed
CUDA.code_warntype
CUDA.code_llvm
CUDA.code_ptx
CUDA.code_sass

For more information, please consult the GPUCompiler.jl documentation. Only the code_sass functionality is actually defined in CUDA.jl:

CUDA.code_sass (Function)
code_sass([io], f, types, cap::VersionNumber)

Prints the SASS code generated for the method matching the given generic function and type signature to io, which defaults to stdout.

The following keyword arguments are supported:

  • cap: which device to generate code for
  • kernel: treat the function as an entry-point kernel
  • verbose: enable verbose mode, which displays code generation statistics

See also: @device_code_sass