Kernel programming

This section lists the package's public functionality that corresponds to special CUDA functions for use in device code. It is loosely organized according to the C language extensions appendix from the CUDA C programming guide. For more information about certain intrinsics, refer to the aforementioned NVIDIA documentation.

Indexing and Dimensions

CUDA.gridDimFunction
gridDim()::CuDim3

Returns the dimensions of the grid.

CUDA.blockIdxFunction
blockIdx()::CuDim3

Returns the block index within the grid.

CUDA.blockDimFunction
blockDim()::CuDim3

Returns the dimensions of the block.

CUDA.threadIdxFunction
threadIdx()::CuDim3

Returns the thread index within the block.

CUDA.warpsizeFunction
warpsize(dev::CuDevice)

Returns the warp size (in threads) of the device.

warpsize()::UInt32

Returns the warp size (in threads).
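
For illustration, these intrinsics are typically combined to compute a unique global index per thread. A minimal sketch (the kernel and array names are hypothetical):

    function increment_kernel(a)
        i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
        if i <= length(a)
            @inbounds a[i] += 1
        end
        return
    end

    # launch with enough blocks to cover the array:
    # @cuda threads=256 blocks=cld(length(d_a), 256) increment_kernel(d_a)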

Memory Types

Shared Memory

CUDA.@cuStaticSharedMemMacro
@cuStaticSharedMem(T::Type, dims) -> CuDeviceArray{T,AS.Shared}

Get an array of type T and dimensions dims (either an integer length or tuple shape) pointing to a statically-allocated piece of shared memory. The type should be statically inferable and the dimensions should be constant, or an error will be thrown and the generator function will be called dynamically.

CUDA.@cuDynamicSharedMemMacro
@cuDynamicSharedMem(T::Type, dims, offset::Integer=0) -> CuDeviceArray{T,AS.Shared}

Get an array of type T and dimensions dims (either an integer length or tuple shape) pointing to a dynamically-allocated piece of shared memory. The type should be statically inferable or an error will be thrown and the generator function will be called dynamically.

Note that the amount of dynamic shared memory needs to be specified when launching the kernel.

Optionally, an offset parameter indicating how many bytes to add to the base shared memory pointer can be specified. This is useful when dealing with a heterogeneous buffer of dynamic shared memory; in the case of a homogeneous multi-part buffer it is preferred to use view.
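
To illustrate both macros, here is a sketch of a single-block kernel that reverses a 64-element vector through shared memory (the kernel name and sizes are hypothetical):

    function reverse_kernel(a)
        buf = @cuStaticSharedMem(Float32, 64)
        i = threadIdx().x
        @inbounds buf[i] = a[i]
        sync_threads()
        @inbounds a[i] = buf[64 - i + 1]
        return
    end

    # with @cuDynamicSharedMem(Float32, 64) instead, the size is provided at launch:
    # @cuda threads=64 shmem=64*sizeof(Float32) reverse_kernel(d_a)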

Synchronization

CUDA.sync_threadsFunction
sync_threads()

Waits until all threads in the thread block have reached this point and all global and shared memory accesses made by these threads prior to sync_threads() are visible to all threads in the block.

CUDA.sync_warpFunction
sync_warp(mask::Integer=0xffffffff)

Waits until all threads in the warp, selected by means of the bitmask mask, have reached this point and all global and shared memory accesses made by these threads prior to sync_warp() are visible to those threads in the warp. The default value for mask selects all threads in the warp.

Note

Requires CUDA >= 9.0 and sm_6.2

CUDA.threadfence_blockFunction
threadfence_block()

A memory fence that ensures that:

  • All writes to all memory made by the calling thread before the call to threadfence_block() are observed by all threads in the block of the calling thread as occurring before all writes to all memory made by the calling thread after the call to threadfence_block()
  • All reads from all memory made by the calling thread before the call to threadfence_block() are ordered before all reads from all memory made by the calling thread after the call to threadfence_block().
CUDA.threadfenceFunction
threadfence()

A memory fence that acts as threadfence_block for all threads in the block of the calling thread and also ensures that no writes to all memory made by the calling thread after the call to threadfence() are observed by any thread in the device as occurring before any write to all memory made by the calling thread before the call to threadfence().

Note that for this ordering guarantee to be true, the observing threads must truly observe the memory and not cached versions of it; this requires the use of volatile loads and stores, which is not available from Julia right now.

CUDA.threadfence_systemFunction
threadfence_system()

A memory fence that acts as threadfence_block for all threads in the block of the calling thread and also ensures that all writes to all memory made by the calling thread before the call to threadfence_system() are observed by all threads in the device, host threads, and all threads in peer devices as occurring before all writes to all memory made by the calling thread after the call to threadfence_system().

Clock & Sleep

CUDA.clockFunction
clock(UInt32)

Returns the value of a per-multiprocessor counter that is incremented every clock cycle.

clock(UInt64)

Returns the value of a per-multiprocessor counter that is incremented every clock cycle.
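
Inside a kernel, the counter can be sampled before and after a region of interest to get a rough cycle count (a sketch):

    t0 = clock(UInt32)
    # ... device code to time ...
    cycles = clock(UInt32) - t0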

CUDA.nanosleepFunction
nanosleep(t)

Puts a thread to sleep for a given amount of time t (in nanoseconds).

Note

Requires CUDA >= 10.0 and sm_6.2

Warp Vote

The warp vote functions allow the threads of a given warp to perform a reduction-and-broadcast operation. These functions take as input a boolean predicate from each thread in the warp and evaluate it. The results of that evaluation are combined (reduced) across the active threads of the warp in one of several ways, broadcasting a single return value to each participating thread.

CUDA.vote_allFunction
vote_all(predicate::Bool)

Evaluate predicate for all active threads of the warp and return non-zero if and only if predicate evaluates to non-zero for all of them.

CUDA.vote_anyFunction
vote_any(predicate::Bool)

Evaluate predicate for all active threads of the warp and return non-zero if and only if predicate evaluates to non-zero for any of them.

CUDA.vote_ballotFunction
vote_ballot(predicate::Bool)

Evaluate predicate for all active threads of the warp and return an integer whose Nth bit is set if and only if predicate evaluates to non-zero for the Nth thread of the warp and the Nth thread is active.
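
As a sketch of how the vote functions are used (flags and data are hypothetical device arrays):

    function vote_kernel(flags, data)
        i = threadIdx().x
        pred = @inbounds data[i] > 0f0
        if vote_all(pred)
            # every active lane in this warp saw a positive value
            flags[1] = true
        end
        return
    end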

Warp Shuffle

CUDA.shfl_syncFunction
shfl_sync(threadmask::UInt32, val, lane::Integer, width::Integer=32)

Shuffle a value from a directly indexed lane lane, and synchronize threads according to threadmask.

CUDA.shfl_up_syncFunction
shfl_up_sync(threadmask::UInt32, val, delta::Integer, width::Integer=32)

Shuffle a value from a lane with lower ID relative to caller, and synchronize threads according to threadmask.

CUDA.shfl_down_syncFunction
shfl_down_sync(threadmask::UInt32, val, delta::Integer, width::Integer=32)

Shuffle a value from a lane with higher ID relative to caller, and synchronize threads according to threadmask.

CUDA.shfl_xor_syncFunction
shfl_xor_sync(threadmask::UInt32, val, mask::Integer, width::Integer=32)

Shuffle a value from a lane based on bitwise XOR of own lane ID with mask, and synchronize threads according to threadmask.
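
A common use of the shuffle intrinsics is a warp-wide reduction; a minimal sketch that sums a value across all 32 lanes of a fully active warp:

    function warp_sum(val)
        offset = 16
        while offset > 0
            val += shfl_down_sync(0xffffffff, val, offset)
            offset >>= 1
        end
        return val  # the first lane now holds the sum over the warp
    end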

Formatted Output

CUDA.@cuprintMacro
@cuprint(xs...)
@cuprintln(xs...)

Print a textual representation of values xs to standard output from the GPU. The functionality builds on @cuprintf, and is intended as a more user-friendly alternative to that API. However, that also means there's only limited support for argument types, handling 16, 32 and 64-bit signed and unsigned integers, 32 and 64-bit floating-point numbers, chars and pointers. For more complex output, use @cuprintf directly.

Limited string interpolation is also possible:

    @cuprint("Hello, World ", 42, "\n")
    @cuprint "Hello, World $(42)\n"
CUDA.@cuprintfMacro
@cuprintf("%Fmt", args...)

Print a formatted string in device context on the host standard output.

Note that this is not a fully C-compliant printf implementation; see the CUDA documentation for supported options and inputs.

Also beware that it is an untyped and unforgiving printf implementation. Type widths need to match, e.g., printing a 64-bit Julia integer requires the %ld format specifier.
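
For example, a sketch of a correctly typed invocation (explicit conversions keep the argument widths in sync with the format specifiers):

    @cuprintf("thread %d of %d: value %f\n",
              Int32(threadIdx().x), Int32(blockDim().x), Float64(1.5))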

CUDA.@cushowMacro
@cushow(ex)

GPU analog of Base.@show. It comes with the same type restrictions as @cuprintf.

@cushow threadIdx().x

Assertions

CUDA.@cuassertMacro
@assert cond [text]

Signal assertion failure to the CUDA driver if cond is false. Preferred syntax for writing assertions, mimicking Base.@assert. Message text is optionally displayed upon assertion failure.

Warning

A failed assertion will crash the GPU, so use sparingly as a debugging tool. Furthermore, the assertion might be disabled at various optimization levels, and thus should not cause any side-effects.

Atomics

A high-level macro is available to annotate expressions with:

CUDA.@atomicMacro
@atomic a[I] = op(a[I], val)
@atomic a[I] ...= val

Atomically perform a sequence of operations that loads an array element a[I], performs the operation op on that value and a second value val, and writes the result back to the array. This sequence can be written out as a regular assignment, in which case the same array element should be used in the left and right hand side of the assignment, or as an in-place application of a known operator. In both cases, the array reference should be pure and not induce any side-effects.

Warning

This interface is experimental, and might change without warning. Use the lower-level atomic_...! functions for a stable API.
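
As an illustration, a histogram-style kernel where concurrent threads atomically increment shared bins (a sketch; counts and data are hypothetical Int32 device arrays):

    function hist_kernel(counts, data)
        i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
        if i <= length(data)
            bin = @inbounds data[i]
            @atomic counts[bin] += Int32(1)
        end
        return
    end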

If your expression is not recognized, or you need more control, use the underlying functions:

CUDA.atomic_cas!Function
atomic_cas!(ptr::DevicePtr{T}, cmp::T, val::T)

Reads the value old located at address ptr and compares it with cmp. If old equals cmp, stores val at the same address; otherwise, the value at ptr is left unchanged. These operations are performed in one atomic transaction. The function returns old.

This operation is supported for values of type Int32, Int64, UInt32 and UInt64.

CUDA.atomic_xchg!Function
atomic_xchg!(ptr::DevicePtr{T}, val::T)

Reads the value old located at address ptr and stores val at the same address. These operations are performed in one atomic transaction. The function returns old.

This operation is supported for values of type Int32, Int64, UInt32 and UInt64.

CUDA.atomic_add!Function
atomic_add!(ptr::DevicePtr{T}, val::T)

Reads the value old located at address ptr, computes old + val, and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.

This operation is supported for values of type Int32, Int64, UInt32, UInt64, and Float32. Additionally, on GPU hardware with compute capability 6.0+, values of type Float64 are supported.

CUDA.atomic_sub!Function
atomic_sub!(ptr::DevicePtr{T}, val::T)

Reads the value old located at address ptr, computes old - val, and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.

This operation is supported for values of type Int32, Int64, UInt32 and UInt64. Additionally, on GPU hardware with compute capability 6.0+, values of type Float32 and Float64 are supported.

CUDA.atomic_mul!Function
atomic_mul!(ptr::DevicePtr{T}, val::T)

Reads the value old located at address ptr, computes *(old, val), and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.

This operation is supported on GPU hardware with compute capability 6.0+ for values of type Float32 and Float64.

CUDA.atomic_div!Function
atomic_div!(ptr::DevicePtr{T}, val::T)

Reads the value old located at address ptr, computes /(old, val), and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.

This operation is supported on GPU hardware with compute capability 6.0+ for values of type Float32 and Float64.

CUDA.atomic_and!Function
atomic_and!(ptr::DevicePtr{T}, val::T)

Reads the value old located at address ptr, computes old & val, and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.

This operation is supported for values of type Int32, Int64, UInt32 and UInt64.

CUDA.atomic_or!Function
atomic_or!(ptr::DevicePtr{T}, val::T)

Reads the value old located at address ptr, computes old | val, and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.

This operation is supported for values of type Int32, Int64, UInt32 and UInt64.

CUDA.atomic_xor!Function
atomic_xor!(ptr::DevicePtr{T}, val::T)

Reads the value old located at address ptr, computes old ⊻ val, and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.

This operation is supported for values of type Int32, Int64, UInt32 and UInt64.

CUDA.atomic_min!Function
atomic_min!(ptr::DevicePtr{T}, val::T)

Reads the value old located at address ptr, computes min(old, val), and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.

This operation is supported for values of type Int32, Int64, UInt32 and UInt64. Additionally, on GPU hardware with compute capability 6.0+, values of type Float32 and Float64 are supported.

CUDA.atomic_max!Function
atomic_max!(ptr::DevicePtr{T}, val::T)

Reads the value old located at address ptr, computes max(old, val), and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.

This operation is supported for values of type Int32, Int64, UInt32 and UInt64. Additionally, on GPU hardware with compute capability 6.0+, values of type Float32 and Float64 are supported.

CUDA.atomic_inc!Function
atomic_inc!(ptr::DevicePtr{T}, val::T)

Reads the value old located at address ptr, computes ((old >= val) ? 0 : (old+1)), and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns old.

This operation is only supported for values of type Int32.

CUDA.atomic_dec!Function
atomic_dec!(ptr::DevicePtr{T}, val::T)

Reads the value old located at address ptr, computes (((old == 0) | (old > val)) ? val : (old-1) ), and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns old.

This operation is only supported for values of type Int32.
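
For comparison, the increment from the histogram sketch above could also be written against this lower-level interface, assuming counts is an Int32 device array:

    ptr = pointer(counts, bin)            # device pointer to the selected bin
    old = CUDA.atomic_add!(ptr, Int32(1)) # returns the value before the addition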

Dynamic parallelism

Similarly to launching kernels from the host, you can launch kernels from the device by passing dynamic=true to @cuda. A lower-level API is available as well:

CUDA.dynamic_cufunctionFunction
dynamic_cufunction(f, tt=Tuple{})

Low-level interface to compile a function invocation for the currently-active GPU, returning a callable kernel object. Device-side equivalent of CUDA.cufunction.

No keyword arguments are supported.

CUDA.DeviceKernelType
(::HostKernel)(args...; kwargs...)
(::DeviceKernel)(args...; kwargs...)

Low-level interface to call a compiled kernel, passing GPU-compatible arguments in args. For a higher-level interface, use @cuda.

The following keyword arguments are supported:

  • threads (defaults to 1)
  • blocks (defaults to 1)
  • shmem (defaults to 0)
  • config: callback function to dynamically compute the launch configuration. Should accept a HostKernel and return a named tuple with any of the above as fields. This functionality is intended to be used in combination with the CUDA occupancy API.
  • stream (defaults to the default stream)
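
A minimal sketch of a device-side launch using the high-level interface (the kernel names are hypothetical):

    function child()
        @cuprintln("running on the device, launched from the device")
        return
    end

    function parent()
        @cuda dynamic=true child()
        return
    end

    @cuda parent()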

CUDA runtime

Certain parts of the CUDA API are available for use on the GPU, for example to launch dynamic kernels or set-up cooperative groups. Coverage of this part of the API, provided by the libcudadevrt library, is under development and contributions are welcome.

Calls to these functions are often ambiguous with their host-side equivalents. To avoid confusion, you need to prefix device-side API interactions with the CUDA module, e.g., CUDA.synchronize.

CUDA.synchronizeFunction
synchronize()

Block for the current context's tasks to complete.

synchronize(s::CuStream)

Wait until a stream's tasks are completed.

synchronize(e::CuEvent)

Waits for an event to complete.

Math

Many mathematical functions are provided by the libdevice library, and are wrapped by CUDA.jl. These functions implement interfaces that are similar to existing functions in Base, albeit often with support for fewer types.

To avoid confusion with existing implementations in Base, you need to prefix calls to this library with the CUDA module. For example, in kernel code, call CUDA.sin instead of plain sin.

For a list of available functions, look at src/device/cuda/libdevice.jl.
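
For example, a sketch of a kernel that applies the libdevice-provided sine rather than Base.sin:

    function sin_kernel(a)
        i = threadIdx().x
        @inbounds a[i] = CUDA.sin(a[i])
        return
    end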

Device array

CUDA.jl provides a primitive, lightweight array type to manage GPU data organized in a plain, dense fashion. This is the device-counterpart to the CuArray, and implements (part of) the array interface as well as other functionality for use on the GPU:

CUDA.CuDeviceArrayType
CuDeviceArray(dims, ptr)
CuDeviceArray{T}(dims, ptr)
CuDeviceArray{T,A}(dims, ptr)
CuDeviceArray{T,A,N}(dims, ptr)

Construct an N-dimensional dense CUDA device array with element type T wrapping a pointer, where N is determined from the length of dims and T is determined from the type of ptr. dims may be a single scalar, or a tuple of integers corresponding to the lengths in each dimension. If the rank N is supplied explicitly as in Array{T,N}(dims), then it must match the length of dims. The same applies to the element type T, which should match the type of the pointer ptr.

CUDA.ConstType
Const(A::CuDeviceArray)

Mark a CuDeviceArray as constant/read-only. The invariant guaranteed is that you will not modify a CuDeviceArray for the duration of the current kernel.

This API can only be used on devices with compute capability 3.5 or higher.

Warning

Experimental API. Subject to change without deprecation.
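
A sketch of how a read-only input might be wrapped (the kernel and argument names are hypothetical):

    function copy_kernel(out, inp)
        cinp = CUDA.Const(inp)
        i = threadIdx().x
        @inbounds out[i] = cinp[i]
        return
    end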

WMMA

Warp matrix multiply-accumulate (WMMA) is a CUDA API to access Tensor Cores, a new hardware feature in Volta GPUs to perform mixed precision matrix multiply-accumulate operations. The interface is split in two levels, both available in the WMMA submodule: low level wrappers around the LLVM intrinsics, and a higher-level API similar to that of CUDA C.

Note

Requires Julia v"1.4.0-DEV.666" or later, or you will run into LLVM errors.

Note

For optimal performance, you should use Julia v1.5.0-DEV.324 or later.

Terminology

The WMMA operations perform a matrix multiply-accumulate. More concretely, it calculates $D = A \cdot B + C$, where $A$ is a $M \times K$ matrix, $B$ is a $K \times N$ matrix, and $C$ and $D$ are $M \times N$ matrices.

However, not all values of $M$, $N$ and $K$ are allowed. The tuple $(M, N, K)$ is often called the "shape" of the multiply accumulate operation.

The multiply-accumulate consists of the following steps:

  • Load the matrices $A$, $B$ and $C$ from memory to registers using a WMMA load operation.
  • Perform the matrix multiply-accumulate of $A$, $B$ and $C$ to obtain $D$ using a WMMA MMA operation. $D$ is stored in hardware registers after this step.
  • Store the result $D$ back to memory using a WMMA store operation.

Note that WMMA is a warp-wide operation, which means that all threads in a warp must cooperate, and execute the WMMA operations in lockstep. Failure to do so will result in undefined behaviour.

Each thread in a warp will hold a part of the matrix in its registers. In WMMA parlance, this part is referred to as a "fragment". Note that the exact mapping between matrix elements and fragment is unspecified, and subject to change in future versions.

Finally, it is important to note that the resultant $D$ matrix can be used as a $C$ matrix for a subsequent multiply-accumulate. This is useful if one needs to calculate a sum of the form $\sum_{i=0}^{n} A_i B_i$, where $A_i$ and $B_i$ are matrices of the correct dimension.

LLVM Intrinsics

The LLVM intrinsics are accessible by using the one-to-one Julia wrappers. The return type of each wrapper is the Julia type that corresponds closest to the return type of the LLVM intrinsic. For example, LLVM's [8 x <2 x half>] becomes NTuple{8, NTuple{2, VecElement{Float16}}} in Julia. In essence, these wrappers return the SSA values returned by the LLVM intrinsic. Currently, all intrinsics that are available in LLVM 6, PTX 6.0 and SM 70 are implemented.

These LLVM intrinsics are then lowered to the correct PTX instructions by the LLVM NVPTX backend. For more information about the PTX instructions, please refer to the PTX Instruction Set Architecture Manual.

The LLVM intrinsics are subdivided in three categories: load, store and multiply-accumulate. In what follows, each of these will be discussed.

Load matrix

Missing docstring for WMMA.llvm_wmma_load.

Perform multiply-accumulate

Missing docstring for WMMA.llvm_wmma_mma.

Store matrix

Missing docstring for WMMA.llvm_wmma_store.

Example

using Test

using CUDA

# Generate input matrices
a     = rand(Float16, (16, 16))
a_dev = CuArray(a)
b     = rand(Float16, (16, 16))
b_dev = CuArray(b)
c     = rand(Float32, (16, 16))
c_dev = CuArray(c)

# Allocate space for result
d_dev = similar(c_dev)

# Matrix multiply-accumulate kernel (D = A * B + C)
function kernel(a_dev, b_dev, c_dev, d_dev)
    a_frag = WMMA.llvm_wmma_load_a_col_m16n16k16_stride_f16(pointer(a_dev), 16)
    b_frag = WMMA.llvm_wmma_load_b_col_m16n16k16_stride_f16(pointer(b_dev), 16)
    c_frag = WMMA.llvm_wmma_load_c_col_m16n16k16_stride_f32(pointer(c_dev), 16)

    d_frag = WMMA.llvm_wmma_mma_col_col_m16n16k16_f32_f32(a_frag, b_frag, c_frag)

    WMMA.llvm_wmma_store_d_col_m16n16k16_stride_f32(pointer(d_dev), d_frag, 16)
    return
end

@cuda threads=32 kernel(a_dev, b_dev, c_dev, d_dev)
@test all(isapprox.(a * b + c, Array(d_dev); rtol=0.01))

CUDA C-like API

The main difference between the CUDA C-like API and the lower level wrappers, is that the former enforces several constraints when working with WMMA. For example, it ensures that the $A$ fragment argument to the MMA instruction was obtained by a load_a call, and not by a load_b or load_c. Additionally, it makes sure that the data type and storage layout of the load/store operations and the MMA operation match.

The CUDA C-like API heavily uses Julia's dispatch mechanism. As such, the method names are much shorter than the LLVM intrinsic wrappers, as most information is baked into the type of the arguments rather than the method name.

Note that, in CUDA C++, the fragment is responsible for both the storage of intermediate results and the WMMA configuration. All CUDA C++ WMMA calls are function templates that take the resultant fragment as a by-reference argument. As a result, the type of this argument can be used during overload resolution to select the correct WMMA instruction to call.

In contrast, the API in Julia separates the WMMA storage (WMMA.Fragment) and configuration (WMMA.Config). Instead of taking the resultant fragment by reference, the Julia functions just return it. This makes the dataflow clearer, but it also means that the type of that fragment cannot be used for selection of the correct WMMA instruction. Thus, there is still a limited amount of information that cannot be inferred from the argument types, but must nonetheless match for all WMMA operations, such as the overall shape of the MMA. This is accomplished by a separate "WMMA configuration" (see WMMA.Config) that you create once, and then give as an argument to all intrinsics.

Fragment

Missing docstrings for WMMA.RowMajor, WMMA.ColMajor, WMMA.Unspecified, WMMA.FragmentLayout and WMMA.Fragment.

WMMA configuration

Missing docstring for WMMA.Config.

Load matrix

Missing docstring for WMMA.load_a.

WMMA.load_b and WMMA.load_c have the same signature.

Perform multiply-accumulate

Missing docstring for WMMA.mma.

Store matrix

Missing docstring for WMMA.store_d.

Fill fragment

Missing docstring for WMMA.fill_c.

Element access and broadcasting

Similar to the CUDA C++ WMMA API, WMMA.Fragments have an x member that can be used to access individual elements. Note that, in contrast to the values returned by the LLVM intrinsics, the x member is flattened. For example, while the Float16 variants of the load_a intrinsics return NTuple{8, NTuple{2, VecElement{Float16}}}, the x member has type NTuple{16, Float16}.

Typically, you will only need to access the x member to perform elementwise operations. This can be more succinctly expressed using Julia's broadcast mechanism. For example, to double each element in a fragment, you can simply use:

frag = 2.0f0 .* frag

Example

using Test

using CUDA

a     = rand(Float16, (16, 16))
b     = rand(Float16, (16, 16))
c     = rand(Float32, (16, 16))

a_dev = CuArray(a)
b_dev = CuArray(b)
c_dev = CuArray(c)
d_dev = similar(c_dev)

function kernel(a_dev, b_dev, c_dev, d_dev)
    conf = WMMA.Config{16, 16, 16, Float32}

    a_frag = WMMA.load_a(pointer(a_dev), 16, WMMA.ColMajor, conf)
    b_frag = WMMA.load_b(pointer(b_dev), 16, WMMA.ColMajor, conf)
    c_frag = WMMA.load_c(pointer(c_dev), 16, WMMA.ColMajor, conf)

    c_frag = 0.5f0 .* c_frag

    d_frag = WMMA.mma(a_frag, b_frag, c_frag, conf)

    WMMA.store_d(pointer(d_dev), d_frag, 16, WMMA.ColMajor, conf)

    return
end

@cuda threads=32 kernel(a_dev, b_dev, c_dev, d_dev)
d = Array(d_dev)

@test all(isapprox.(a * b + 0.5 * c, d; rtol=0.01))