CUDA

This section lists the package's public functionality that corresponds to special CUDA functions for use in device code. It is loosely organized according to the C language extensions appendix from the CUDA C programming guide. For more information about certain intrinsics, refer to the aforementioned NVIDIA documentation.

Indexing and Dimensions

Memory Types

Shared Memory

CUDAnative.@cuStaticSharedMem (Macro)
@cuStaticSharedMem(T::Type, dims) -> CuDeviceArray{T,AS.Shared}

Get an array of type T and dimensions dims (either an integer length or tuple shape) pointing to a statically-allocated piece of shared memory. The type should be statically inferable and the dimensions should be constant, or an error will be thrown and the generator function will be called dynamically.
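As a sketch (kernel and array names are illustrative; assumes a 64-thread launch), a statically-sized shared buffer can be allocated and used as follows:

```julia
using CUDAnative, CuArrays

# Illustrative kernel: stage a 64-element block of `a` through statically
# allocated shared memory. The type and dimensions are compile-time constants.
function staged_copy(a, b)
    buf = @cuStaticSharedMem(Float32, 64)
    i = threadIdx().x
    buf[i] = a[i]
    sync_threads()
    b[i] = buf[i]
    return
end

a = CuArray(rand(Float32, 64))
b = similar(a)
@cuda threads=64 staged_copy(a, b)
```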

CUDAnative.@cuDynamicSharedMem (Macro)
@cuDynamicSharedMem(T::Type, dims, offset::Integer=0) -> CuDeviceArray{T,AS.Shared}

Get an array of type T and dimensions dims (either an integer length or tuple shape) pointing to a dynamically-allocated piece of shared memory. The type should be statically inferable or an error will be thrown and the generator function will be called dynamically.

Note that the amount of dynamic shared memory needs to be specified when launching the kernel.

Optionally, an offset parameter indicating how many bytes to add to the base shared memory pointer can be specified. This is useful when dealing with a heterogeneous buffer of dynamic shared memory; in the case of a homogeneous multi-part buffer it is preferred to use view.
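An illustrative sketch of such a heterogeneous buffer (names are hypothetical): `n` Float32 values followed by `n` Int32 values, the latter placed at a byte offset past the Float32 part.

```julia
# Two differently-typed regions carved out of one dynamic shared-memory buffer.
function kernel(n)
    floats = @cuDynamicSharedMem(Float32, n)
    ints   = @cuDynamicSharedMem(Int32, n, n * sizeof(Float32))
    # ... use `floats` and `ints` ...
    return
end

# The total amount of dynamic shared memory is specified at launch time:
n = 32
@cuda threads=n shmem=n * (sizeof(Float32) + sizeof(Int32)) kernel(n)
```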

Synchronization

CUDAnative.sync_threads (Function)
sync_threads()

Waits until all threads in the thread block have reached this point and all global and shared memory accesses made by these threads prior to sync_threads() are visible to all threads in the block.
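A typical use is staging data through shared memory, where the barrier separates the write phase from the read phase. A minimal sketch (names are illustrative; assumes a 64-thread block):

```julia
# Reverse a 64-element array within a single block: every thread writes its
# element to shared memory, waits for all others, then reads the mirrored slot.
function reverse_block(a)
    buf = @cuStaticSharedMem(Float32, 64)
    i = threadIdx().x
    buf[i] = a[i]
    sync_threads()    # all writes to `buf` are now visible block-wide
    a[i] = buf[65 - i]
    return
end
```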

CUDAnative.sync_warp (Function)
sync_warp(mask::Integer=0xffffffff)

Waits until all threads in the warp, selected by means of the bitmask mask, have reached this point and all global and shared memory accesses made by these threads prior to sync_warp() are visible to those threads in the warp. The default value for mask selects all threads in the warp.

Note

Requires CUDA >= 9.0 and sm_6.2

CUDAnative.threadfence_block (Function)
threadfence_block()

A memory fence that ensures that:

  • All writes to all memory made by the calling thread before the call to threadfence_block() are observed by all threads in the block of the calling thread as occurring before all writes to all memory made by the calling thread after the call to threadfence_block()
  • All reads from all memory made by the calling thread before the call to threadfence_block() are ordered before all reads from all memory made by the calling thread after the call to threadfence_block().

CUDAnative.threadfence (Function)
threadfence()

A memory fence that acts as threadfence_block for all threads in the block of the calling thread and also ensures that no writes to all memory made by the calling thread after the call to threadfence() are observed by any thread in the device as occurring before any write to all memory made by the calling thread before the call to threadfence().

Note that for this ordering guarantee to be true, the observing threads must truly observe the memory and not cached versions of it; this requires the use of volatile loads and stores, which is not available from Julia right now.

CUDAnative.threadfence_system (Function)
threadfence_system()

A memory fence that acts as threadfence_block for all threads in the block of the calling thread and also ensures that all writes to all memory made by the calling thread before the call to threadfence_system() are observed by all threads in the device, host threads, and all threads in peer devices as occurring before all writes to all memory made by the calling thread after the call to threadfence_system().

Clock & Sleep

CUDAnative.clock (Function)
clock(UInt32)

Returns the value of a per-multiprocessor counter that is incremented every clock cycle.

clock(UInt64)

Returns the value of a per-multiprocessor counter that is incremented every clock cycle, as a 64-bit value.

CUDAnative.nanosleep (Function)
nanosleep(t)

Puts the calling thread to sleep for a given duration t (in nanoseconds).

Note

Requires CUDA >= 10.0 and sm_6.2
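For instance, nanosleep can be used to back off between polls of a flag in device memory (a hypothetical spin-wait sketch, subject to the hardware requirements above):

```julia
# Hypothetical polling loop: sleep roughly 100 ns between checks of a
# device-memory flag instead of spinning at full speed.
function wait_for_flag(flag)
    while flag[1] == 0
        nanosleep(UInt32(100))
    end
    return
end
```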

Warp Vote

The warp vote functions allow the threads of a given warp to perform a reduction-and-broadcast operation. These functions take as input a boolean predicate from each thread in the warp and evaluate it. The results of that evaluation are combined (reduced) across the active threads of the warp in one of several ways, broadcasting a single return value to each participating thread.

CUDAnative.vote_all (Function)
vote_all(predicate::Bool)

Evaluate predicate for all active threads of the warp and return non-zero if and only if predicate evaluates to non-zero for all of them.

CUDAnative.vote_any (Function)
vote_any(predicate::Bool)

Evaluate predicate for all active threads of the warp and return non-zero if and only if predicate evaluates to non-zero for any of them.

CUDAnative.vote_ballot (Function)
vote_ballot(predicate::Bool)

Evaluate predicate for all active threads of the warp and return an integer whose Nth bit is set if and only if predicate evaluates to non-zero for the Nth thread of the warp and the Nth thread is active.
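For example, vote_ballot can be combined with count_ones to count how many threads of a warp satisfy a predicate (an illustrative sketch; names are hypothetical):

```julia
# Each thread evaluates a predicate; the ballot packs the warp's results into
# a bitmask, whose population count is the number of satisfying threads.
function count_positive(a, out)
    i = threadIdx().x
    mask = vote_ballot(a[i] > 0f0)
    if i == 1
        out[1] = count_ones(mask)
    end
    return
end
```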

Warp Shuffle

Missing docstrings.

Docstrings for the non-synchronizing shuffle intrinsics CUDAnative.shfl, CUDAnative.shfl_up, CUDAnative.shfl_down, and CUDAnative.shfl_xor are missing; check Documenter's build log for details.

If using CUDA 9.0 or later, and PTX ISA 6.0 is supported, synchronizing versions of these intrinsics are available as well:

CUDAnative.shfl_sync (Function)
shfl_sync(threadmask::UInt32, val, lane::Integer, width::Integer=32)

Shuffle a value from a directly indexed lane lane, and synchronize threads according to threadmask.

CUDAnative.shfl_up_sync (Function)
shfl_up_sync(threadmask::UInt32, val, delta::Integer, width::Integer=32)

Shuffle a value from a lane with lower ID relative to caller, and synchronize threads according to threadmask.

CUDAnative.shfl_down_sync (Function)
shfl_down_sync(threadmask::UInt32, val, delta::Integer, width::Integer=32)

Shuffle a value from a lane with higher ID relative to caller, and synchronize threads according to threadmask.

CUDAnative.shfl_xor_sync (Function)
shfl_xor_sync(threadmask::UInt32, val, mask::Integer, width::Integer=32)

Shuffle a value from a lane based on bitwise XOR of own lane ID with mask, and synchronize threads according to threadmask.
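A common use of shfl_xor_sync is a butterfly reduction across a warp (a sketch, assuming a full 32-thread warp and a numeric val):

```julia
# Sum `val` across all 32 lanes of a warp using a butterfly pattern; after the
# loop, every lane holds the warp-wide total.
function warp_sum(val)
    offset = 16
    while offset > 0
        val += shfl_xor_sync(0xffffffff, val, offset)
        offset >>= 1
    end
    return val
end
```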

Formatted Output

CUDAnative.@cuprintf (Macro)
@cuprintf("%Fmt", args...)

Print a formatted string in device context on the host standard output.

Note that this is not a fully C-compliant printf implementation; see the CUDA documentation for supported options and inputs.

Also beware that it is an untyped and unforgiving printf implementation: type widths need to match, e.g., printing a 64-bit Julia integer requires the %ld format specifier.
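For example (an illustrative kernel; note the width-matched format specifiers):

```julia
# %ld expects a 64-bit integer and %f a Float64; mismatched widths print garbage.
function kernel()
    @cuprintf("thread %ld has value %f\n", Int64(threadIdx().x), Float64(42))
    return
end
```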

Assertions

CUDAnative.@cuassert (Macro)
@cuassert cond [text]

Signal assertion failure to the CUDA driver if cond is false. Preferred syntax for writing assertions, mimicking Base.@assert. Message text is optionally displayed upon assertion failure.

Warning

A failed assertion will crash the GPU, so use sparingly as a debugging tool. Furthermore, the assertion might be disabled at certain optimization levels, so the asserted condition should not have side effects.
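A minimal usage sketch (names are illustrative):

```julia
# The asserted condition must be free of side effects, since assertions may be
# compiled away at certain optimization levels.
function kernel(a)
    i = threadIdx().x
    @cuassert a[i] >= 0 "input must be non-negative"
    a[i] = CUDAnative.sqrt(a[i])
    return
end
```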

CUDA runtime

Certain parts of the CUDA API are available for use on the GPU, for example to launch dynamic kernels or set up cooperative groups. Coverage of this part of the API, provided by the libcudadevrt library, is under development and contributions are welcome.

Calls to these functions are often ambiguous with their host-side equivalents as implemented in CUDAdrv. To avoid confusion, you need to prefix device-side API interactions with the CUDAnative module, e.g., CUDAnative.synchronize.

CUDAnative.synchronize (Function)
synchronize()

Wait for the device to finish. This is the device-side version, and should not be called from the host.

synchronize acts as a synchronization point for child grids in the context of dynamic parallelism.
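For instance, a kernel can launch a child grid and wait for it to complete (a hedged sketch, assuming device-side launches via @cuda with dynamic=true; names are illustrative):

```julia
function child(a)
    a[threadIdx().x] += 1
    return
end

# Hypothetical parent kernel: launch a child grid from the device, then wait
# for it to complete before continuing.
function parent(a)
    @cuda dynamic=true threads=length(a) child(a)
    CUDAnative.synchronize()
    return
end
```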

Math

Many mathematical functions are provided by the libdevice library, and are wrapped by CUDAnative.jl. These functions implement interfaces that are similar to existing functions in Base, albeit often with support for fewer types.

To avoid confusion with existing implementations in Base, you need to prefix calls to this library with the CUDAnative module. For example, in kernel code, call CUDAnative.sin instead of plain sin.
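An illustrative kernel:

```julia
# Apply the libdevice sine to each element; note the explicit CUDAnative prefix,
# since plain `sin` would refer to the Base implementation.
function sin_kernel(a)
    i = threadIdx().x
    a[i] = CUDAnative.sin(a[i])
    return
end
```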

For a list of available functions, look at src/device/cuda/libdevice.jl.