CUDA
This section lists the package's public functionality that corresponds to special CUDA functions for use in device code. It is loosely organized according to the C language extensions appendix from the CUDA C programming guide. For more information about certain intrinsics, refer to the aforementioned NVIDIA documentation.
Indexing and Dimensions
CUDAnative.gridDim
— Function gridDim()::CuDim3
Returns the dimensions of the grid.
CUDAnative.blockIdx
— Function blockIdx()::CuDim3
Returns the block index within the grid.
CUDAnative.blockDim
— Function blockDim()::CuDim3
Returns the dimensions of the block.
CUDAnative.threadIdx
— Function threadIdx()::CuDim3
Returns the thread index within the block.
CUDAnative.warpsize
— Function warpsize()::UInt32
Returns the warp size (in threads).
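Combined, these intrinsics give each thread a unique global index into the grid. A minimal bounds-checked vector-add sketch (the kernel name and the use of CuArrays for device storage are illustrative):

```julia
using CUDAnative, CuArrays

function vadd(a, b, c)
    # 1-based global index of this thread across the whole grid
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(c)
        @inbounds c[i] = a[i] + b[i]
    end
    return
end

a = CuArray(rand(Float32, 1024))
b = CuArray(rand(Float32, 1024))
c = similar(a)
@cuda threads=256 blocks=4 vadd(a, b, c)
```

The bounds check guards against the common case where the grid is slightly larger than the array.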
Memory Types
Shared Memory
CUDAnative.@cuStaticSharedMem
— Macro @cuStaticSharedMem(T::Type, dims) -> CuDeviceArray{T,AS.Shared}
Get an array of type T and dimensions dims (either an integer length or tuple shape) pointing to a statically-allocated piece of shared memory. The type should be statically inferable and the dimensions should be constant, or an error will be thrown and the generator function will be called dynamically.
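A sketch of a hypothetical kernel that reverses a block's worth of data through statically-sized shared memory, assuming a Float32 array and a block of exactly 64 threads:

```julia
function reverse_block(d)
    # type and size must be compile-time constants
    buf = @cuStaticSharedMem(Float32, 64)
    t = threadIdx().x
    @inbounds buf[t] = d[t]
    sync_threads()                 # wait until the buffer is fully populated
    @inbounds d[t] = buf[64 - t + 1]
    return
end

# launch with exactly 64 threads per block:
# @cuda threads=64 reverse_block(d)
```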
CUDAnative.@cuDynamicSharedMem
— Macro @cuDynamicSharedMem(T::Type, dims, offset::Integer=0) -> CuDeviceArray{T,AS.Shared}
Get an array of type T and dimensions dims (either an integer length or tuple shape) pointing to a dynamically-allocated piece of shared memory. The type should be statically inferable or an error will be thrown and the generator function will be called dynamically.
Note that the amount of dynamic shared memory needs to be specified when launching the kernel.
Optionally, an offset parameter indicating how many bytes to add to the base shared memory pointer can be specified. This is useful when dealing with a heterogeneous buffer of dynamic shared memory; in the case of a homogeneous multi-part buffer it is preferred to use view.
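A dynamically-sized variant of the same block-reverse idea (kernel name hypothetical; here the element count n is only known at launch time, so the buffer size is passed through the shmem launch keyword, in bytes):

```julia
function reverse_dynamic(d, n)
    buf = @cuDynamicSharedMem(Float32, n)
    t = threadIdx().x
    @inbounds buf[t] = d[t]
    sync_threads()
    @inbounds d[t] = buf[n - t + 1]
    return
end

# the dynamic shared memory size is set at launch time, in bytes:
# @cuda threads=n shmem=n*sizeof(Float32) reverse_dynamic(d, n)
```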
Synchronization
CUDAnative.sync_threads
— Function sync_threads()
Waits until all threads in the thread block have reached this point and all global and shared memory accesses made by these threads prior to sync_threads() are visible to all threads in the block.
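A typical use is a tree reduction in shared memory, where every halving step must be fenced by sync_threads() so each thread sees its partner's partial sum before the next step. A sketch for a single 256-thread block (kernel name hypothetical):

```julia
function block_sum(d, out)
    buf = @cuStaticSharedMem(Float32, 256)
    t = threadIdx().x
    @inbounds buf[t] = d[t]
    sync_threads()
    stride = 128
    while stride >= 1
        if t <= stride
            @inbounds buf[t] += buf[t + stride]
        end
        sync_threads()   # all partial sums must land before the next halving
        stride ÷= 2
    end
    t == 1 && @inbounds(out[1] = buf[1])
    return
end
```

Note that sync_threads() is called by every thread, not just the ones doing the addition; diverging around a barrier is undefined behaviour.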
CUDAnative.sync_warp
— Function sync_warp(mask::Integer=0xffffffff)
Waits until all threads in the warp selected by the bitmask mask have reached this point, and ensures that all global and shared memory accesses made by these threads prior to sync_warp() are visible to those threads in the warp. The default value for mask selects all threads in the warp.
Requires CUDA >= 9.0 and sm_6.2
CUDAnative.threadfence_block
— Function threadfence_block()
A memory fence that ensures that:
- All writes to all memory made by the calling thread before the call to threadfence_block() are observed by all threads in the block of the calling thread as occurring before all writes to all memory made by the calling thread after the call to threadfence_block()
- All reads from all memory made by the calling thread before the call to threadfence_block() are ordered before all reads from all memory made by the calling thread after the call to threadfence_block().
CUDAnative.threadfence
— Function threadfence()
A memory fence that acts as threadfence_block for all threads in the block of the calling thread and also ensures that no writes to all memory made by the calling thread after the call to threadfence() are observed by any thread in the device as occurring before any write to all memory made by the calling thread before the call to threadfence().
Note that for this ordering guarantee to be true, the observing threads must truly observe the memory and not cached versions of it; this requires the use of volatile loads and stores, which is not available from Julia right now.
CUDAnative.threadfence_system
— Function threadfence_system()
A memory fence that acts as threadfence_block for all threads in the block of the calling thread and also ensures that all writes to all memory made by the calling thread before the call to threadfence_system() are observed by all threads in the device, host threads, and all threads in peer devices as occurring before all writes to all memory made by the calling thread after the call to threadfence_system().
Clock & Sleep
CUDAnative.clock
— Function clock(UInt32)
Returns the value of a per-multiprocessor counter that is incremented every clock cycle.
CUDAnative.nanosleep
— Function nanosleep(t)
Suspends the calling thread for a given duration t (in nanoseconds).
Requires CUDA >= 10.0 and sm_6.2
Warp Vote
The warp vote functions allow the threads of a given warp to perform a reduction-and-broadcast operation. These functions take as input a boolean predicate from each thread in the warp and evaluate it. The results of that evaluation are combined (reduced) across the active threads of the warp in one of several ways, broadcasting a single return value to each participating thread.
CUDAnative.vote_all
— Function vote_all(predicate::Bool)
Evaluate predicate for all active threads of the warp and return non-zero if and only if predicate evaluates to non-zero for all of them.
CUDAnative.vote_any
— Function vote_any(predicate::Bool)
Evaluate predicate for all active threads of the warp and return non-zero if and only if predicate evaluates to non-zero for any of them.
CUDAnative.vote_ballot
— Function vote_ballot(predicate::Bool)
Evaluate predicate for all active threads of the warp and return an integer whose Nth bit is set if and only if predicate evaluates to non-zero for the Nth thread of the warp and the Nth thread is active.
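As an illustration, the ballot mask can be used to count how many threads in a 32-thread block satisfy a predicate (kernel name hypothetical; count_ones is Base's population count):

```julia
# count how many threads in the warp hold a positive value; every
# participating thread receives the same ballot mask
function count_positive(d, out)
    t = threadIdx().x
    mask = vote_ballot(@inbounds(d[t]) > 0f0)
    if t == 1
        @inbounds out[1] = count_ones(mask)
    end
    return
end

# launch with a single warp:
# @cuda threads=32 count_positive(d, out)
```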
Warp Shuffle
The warp shuffle functions shfl, shfl_up, shfl_down and shfl_xor exchange a value between the threads of a warp without going through shared memory.
When using CUDA 9.0 or above, and if PTX ISA 6.0 is supported, synchronizing versions of these intrinsics are available as well:
CUDAnative.shfl_sync
— Function shfl_sync(threadmask::UInt32, val, lane::Integer, width::Integer=32)
Shuffle a value from a directly indexed lane lane, and synchronize threads according to threadmask.
CUDAnative.shfl_up_sync
— Function shfl_up_sync(threadmask::UInt32, val, delta::Integer, width::Integer=32)
Shuffle a value from a lane with lower ID relative to caller, and synchronize threads according to threadmask.
CUDAnative.shfl_down_sync
— Function shfl_down_sync(threadmask::UInt32, val, delta::Integer, width::Integer=32)
Shuffle a value from a lane with higher ID relative to caller, and synchronize threads according to threadmask.
CUDAnative.shfl_xor_sync
— Function shfl_xor_sync(threadmask::UInt32, val, mask::Integer, width::Integer=32)
Shuffle a value from a lane based on bitwise XOR of own lane ID with mask, and synchronize threads according to threadmask.
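These primitives enable register-only warp reductions without any shared memory. A sketch of a device-side helper that sums a Float32 across a full warp (function name hypothetical):

```julia
# warp-wide sum: each step pulls the partial sum from the lane `offset`
# places higher, using the full-warp mask 0xffffffff; after the last
# step, lane 1 holds the total for the warp
function warp_sum(val::Float32)
    offset = 16
    while offset > 0
        val += shfl_down_sync(0xffffffff, val, offset)
        offset >>= 1
    end
    return val
end
```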
Formatted Output
CUDAnative.@cuprintf
— Macro @cuprintf("%Fmt", args...)
Print a formatted string in device context on the host standard output.
Note that this is not a fully C-compliant printf implementation; see the CUDA documentation for supported options and inputs.
Also beware that it is an untyped and unforgiving printf implementation. Type widths need to match, e.g. printing a 64-bit Julia integer requires the %ld formatting string.
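For example (kernel name hypothetical), note the explicit widening to Int64 so the arguments match the %ld specifier:

```julia
function hello()
    # Int64 values match %ld; mismatched widths print garbage
    @cuprintf("thread %ld of %ld\n", Int64(threadIdx().x), Int64(blockDim().x))
    return
end

# @cuda threads=4 hello()
```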
Assertions
CUDAnative.@cuassert
— Macro @cuassert cond [text]
Signal assertion failure to the CUDA driver if cond is false. Preferred syntax for writing assertions, mimicking Base.@assert. Message text is optionally displayed upon assertion failure.
A failed assertion will crash the GPU, so use it sparingly as a debugging tool. Furthermore, the assertion might be disabled at certain optimization levels, so the condition should not cause any side effects.
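A sketch of a bounds-checking helper (name hypothetical):

```julia
function checked_store(d, i)
    # crashes the GPU if the check fails; the condition has no side effects,
    # so behaviour is unchanged if assertions are compiled out
    @cuassert 1 <= i <= length(d) "index out of bounds"
    @inbounds d[i] = 0f0
    return
end
```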
CUDA runtime
Certain parts of the CUDA API are available for use on the GPU, for example to launch dynamic kernels or set up cooperative groups. Coverage of this part of the API, provided by the libcudadevrt library, is under development and contributions are welcome.
Calls to these functions are often ambiguous with their host-side equivalents as implemented in CUDAdrv. To avoid confusion, you need to prefix device-side API interactions with the CUDAnative module, e.g., CUDAnative.synchronize.
CUDAnative.synchronize
— Function synchronize()
Wait for the device to finish. This is the device-side version, and should not be called from the host.
synchronize acts as a synchronization point for child grids in the context of dynamic parallelism.
Math
Many mathematical functions are provided by the libdevice library, and are wrapped by CUDAnative.jl. These functions implement interfaces that are similar to existing functions in Base, albeit often with support for fewer types.
To avoid confusion with existing implementations in Base, you need to prefix calls to this library with the CUDAnative module. For example, in kernel code, call CUDAnative.sin instead of plain sin.
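For example, a hypothetical element-wise kernel would call the prefixed version:

```julia
function apply_sin(d)
    t = threadIdx().x
    # plain `sin` would dispatch to the Base implementation, which cannot
    # be compiled for the GPU; use the libdevice wrapper instead
    @inbounds d[t] = CUDAnative.sin(d[t])
    return
end
```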
For a list of available functions, look at src/device/cuda/libdevice.jl.