CUDA.AbstractKernel
— Type(::HostKernel)(args...; kwargs...)
(::DeviceKernel)(args...; kwargs...)
Low-level interface to call a compiled kernel, passing GPU-compatible arguments in args. For a higher-level interface, use @cuda.
The following keyword arguments are supported:
- threads (defaults to 1)
- blocks (defaults to 1)
- shmem (defaults to 0)
- config: callback function to dynamically compute the launch configuration. It should accept a HostKernel and return a named tuple with any of the above as fields. This functionality is intended to be used in combination with the CUDA occupancy API.
- stream (defaults to the default stream)
CUDA.Const
— TypeConst(A::CuDeviceArray)
Mark a CuDeviceArray as constant/read-only. The invariant guaranteed is that you will not modify a CuDeviceArray for the duration of the current kernel.
This API can only be used on devices with compute capability 3.5 or higher.
Experimental API. Subject to change without deprecation.
CUDA.CuContext
— TypeCuContext(dev::CuDevice, flags=CTX_SCHED_AUTO)
CuContext(f::Function, ...)
Create a CUDA context for device. A context on the GPU is analogous to a process on the CPU, with its own distinct address space and allocated resources. When a context is destroyed, the system cleans up the resources allocated to it.
When you are done using the context, call unsafe_destroy! to mark it for deletion, or use do-block syntax with this constructor.
CUDA.CuContext
— MethodCuContext(pctx::CuPrimaryContext)
Retain the primary context on the GPU, returning a context compatible with the driver API. The primary context will be released when the returned driver context is finalized.
As these contexts are refcounted by CUDA, you should not call CUDA.unsafe_destroy! on them but use CUDA.unsafe_release! instead (available with do-block syntax as well).
CUDA.CuDevice
— TypeCuDevice(i::Integer)
Get a handle to a compute device.
CUDA.CuDevice
— MethodCuDevice(::CuContext)
Returns the device for a context.
CUDA.CuDeviceArray
— TypeCuDeviceArray(dims, ptr)
CuDeviceArray{T}(dims, ptr)
CuDeviceArray{T,A}(dims, ptr)
CuDeviceArray{T,A,N}(dims, ptr)
Construct an N-dimensional dense CUDA device array with element type T wrapping a pointer, where N is determined from the length of dims and T is determined from the type of ptr. dims may be a single scalar, or a tuple of integers corresponding to the lengths in each dimension. If the rank N is supplied explicitly as in Array{T,N}(dims), then it must match the length of dims. The same applies to the element type T, which should match the type of the pointer ptr.
CUDA.CuDeviceTexture
— TypeCuDeviceTexture{T,N,NC}
N-dimensional device texture with elements of type T. This type is the device-side counterpart of CuTexture, and can be used to access textures using regular indexing notation. If NC is true, indices used by these accesses should be normalized, i.e., fall into the [0,1) domain.
Device-side texture objects cannot be created directly, but should be created host-side using CuTexture and passed to the kernel as an argument.
Experimental API. Subject to change without deprecation.
CUDA.CuDim3
— TypeCuDim3(x)
CuDim3((x,))
CuDim3((x, y))
CuDim3((x, y, z))
A type used to specify dimensions, consisting of 3 integers for the x, y and z dimensions respectively. Unspecified dimensions default to 1.
Often accepted as an argument through the CuDim type alias, e.g. in the case of cudacall or launch, which allows dimensions to be passed as a plain integer or a tuple without having to construct an explicit CuDim3 object.
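The defaulting rule can be illustrated without a GPU. The following is a minimal host-side sketch in plain Julia. Dim3 and normalize_dims are hypothetical names invented for this illustration and are not part of CUDA.jl; the sketch only mirrors the behavior described above.

```julia
# Hypothetical stand-in for CuDim3: three integer fields x, y and z.
struct Dim3
    x::Int
    y::Int
    z::Int
end

# Accept a plain integer or a tuple of 1 to 3 integers;
# unspecified dimensions default to 1.
normalize_dims(x::Integer) = Dim3(x, 1, 1)
normalize_dims(d::Tuple{Integer}) = Dim3(d[1], 1, 1)
normalize_dims(d::Tuple{Integer,Integer}) = Dim3(d[1], d[2], 1)
normalize_dims(d::Tuple{Integer,Integer,Integer}) = Dim3(d[1], d[2], d[3])

dims_a = normalize_dims(256)        # x = 256, y = 1, z = 1
dims_b = normalize_dims((16, 16))   # x = 16, y = 16, z = 1
```

Dispatching on the tuple length keeps the defaulting logic in one place, which is presumably why an alias such as CuDim is convenient for callers.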
CUDA.CuError
— TypeCuError(code)
CuError(code, meta)
Create a CUDA error object with error code code. The optional meta parameter indicates whether extra information, such as error logs, is known.
CUDA.CuEvent
— TypeCuEvent()
Create a new CUDA event.
CUDA.CuFunction
— TypeCuFunction(mod::CuModule, name::String)
Acquires a function handle from a named function in a module.
CUDA.CuGlobal
— TypeCuGlobal{T}(mod::CuModule, name::String)
Acquires a typed global variable handle from a named global in a module.
CUDA.CuIterator
— TypeCuIterator(batches)
Return a CuIterator that can iterate through the provided batches via Base.iterate.
Upon each iteration, the current batch is adapted to the GPU (via map(x -> adapt(CuArray, x), batch)) and the previous iteration is marked as freeable from GPU memory (via unsafe_free!).
This abstraction is useful for batching data into GPU memory in a manner that allows old iterations to potentially be freed (or marked as reusable) earlier than they otherwise would via CuArray's internal polling mechanism.
CUDA.CuLink
— TypeCuLink()
Creates a pending JIT linker invocation.
CUDA.CuLinkImage
— TypeThe result of a linking operation.
This object keeps its parent linker object alive, as destroying a linker destroys linked images too.
CUDA.CuModule
— TypeCuModule(data, options::Dict{CUjit_option,Any})
CuModuleFile(path, options::Dict{CUjit_option,Any})
Create a CUDA module from data, or from a file containing data. The data may be PTX code, a CUBIN, or a FATBIN.
The options argument is an optional dictionary of JIT options and their respective values.
CUDA.CuModule
— MethodCuModule(img::CuLinkImage, ...)
Create a CUDA module from a completed linking operation. Options from CuModule apply.
CUDA.CuPrimaryContext
— TypeCuPrimaryContext(dev::CuDevice)
Create a primary CUDA context for a given device.
Each primary context is unique per device and is shared with the CUDA runtime API. It is meant for interoperability with (applications using) the runtime API.
CUDA.CuPtr
— TypeCuPtr{T}
A memory address that refers to data of type T that is accessible from the GPU. A CuPtr is ABI compatible with regular Ptr objects, e.g. it can be used to ccall a function that expects a Ptr to GPU memory, but it prevents erroneous conversions between the two.
CUDA.CuStream
— MethodCuStream(; flags=STREAM_DEFAULT, priority=nothing)
Create a CUDA stream.
CUDA.CuTexture
— TypeCuTexture{T,N,P}
N-dimensional texture object with elements of type T. These objects do not store data themselves, but are bound to another source of device memory. Texture objects can be passed to CUDA kernels, where they will be accessible through the CuDeviceTexture type.
Experimental API. Subject to change without deprecation.
CUDA.CuTexture
— MethodCuTexture(x::CuArray{T,N})
Create an N-dimensional texture object that reads from a CuArray.
Note that the array's memory must be well aligned and strided (good pitch). Currently, this is not enforced.
Experimental API. Subject to change without deprecation.
CUDA.CuTexture
— MethodCuTexture(x::CuTextureArray{T,N})
Create an N-dimensional texture object with elements of type T that will be read from x.
Experimental API. Subject to change without deprecation.
CUDA.CuTexture
— MethodCuTexture{T,N,P}(parent::P; address_mode, filter_mode, normalized_coordinates)
Construct an N-dimensional texture object with elements of type T as stored in parent.
Several keyword arguments alter the behavior of texture objects:
- address_mode (wrap, clamp, mirror): how out-of-bounds values are accessed. Can be specified as a value for all dimensions, or as a tuple of N entries.
- filter_mode (point, linear): how non-integral indices are fetched. Point mode fetches a single value, linear results in linear interpolation between values.
- normalized_coordinates (true, false): whether indices are expected to fall in the normalized [0,1) range.
Experimental API. Subject to change without deprecation.
CUDA.CuTextureArray
— TypeCuTextureArray{T,N}(undef, dims)
N-dimensional dense texture array with elements of type T. These arrays are optimized for texture fetching, and are only meant to be used as a source for CuTexture objects.
Experimental API. Subject to change without deprecation.
CUDA.CuTextureArray
— MethodCuTextureArray(A::AbstractArray)
Allocate and initialize a texture buffer from host memory in A.
Experimental API. Subject to change without deprecation.
CUDA.CuTextureArray
— MethodCuTextureArray(A::CuArray)
Allocate and initialize a texture buffer from device memory in A.
Experimental API. Subject to change without deprecation.
CUDA.CuTextureArray
— MethodCuTextureArray{T,N}(undef, dims)
Construct an uninitialized texture array of N dimensions specified in the dims tuple, with elements of type T. Use Base.copyto! to initialize this texture array, or use constructors that take a non-texture array to do so automatically.
Experimental API. Subject to change without deprecation.
CUDA.DeviceKernel
— Type(::HostKernel)(args...; kwargs...)
(::DeviceKernel)(args...; kwargs...)
Low-level interface to call a compiled kernel, passing GPU-compatible arguments in args. For a higher-level interface, use @cuda.
The following keyword arguments are supported:
- threads (defaults to 1)
- blocks (defaults to 1)
- shmem (defaults to 0)
- config: callback function to dynamically compute the launch configuration. It should accept a HostKernel and return a named tuple with any of the above as fields. This functionality is intended to be used in combination with the CUDA occupancy API.
- stream (defaults to the default stream)
CUDA.DevicePtr
— TypeDevicePtr{T,A}
A memory address that refers to data of type T that is accessible from the GPU. It is the on-device counterpart of CuPtr, additionally keeping track of the address space A where the data resides (shared, global, constant, etc). This information is used to provide optimized implementations of operations such as unsafe_load and unsafe_store!.
CUDA.HostKernel
— Type(::HostKernel)(args...; kwargs...)
(::DeviceKernel)(args...; kwargs...)
Low-level interface to call a compiled kernel, passing GPU-compatible arguments in args. For a higher-level interface, use @cuda.
The following keyword arguments are supported:
- threads (defaults to 1)
- blocks (defaults to 1)
- shmem (defaults to 0)
- config: callback function to dynamically compute the launch configuration. It should accept a HostKernel and return a named tuple with any of the above as fields. This functionality is intended to be used in combination with the CUDA occupancy API.
- stream (defaults to the default stream)
CUDA.OutOfGPUMemoryError
— TypeOutOfGPUMemoryError()
An operation allocated too much GPU memory for either the system or the memory pool to handle properly.
CUDA.PtrOrCuPtr
— TypePtrOrCuPtr{T}
A special pointer type, ABI-compatible with both Ptr and CuPtr, for use in ccall expressions to convert values to either a GPU or a CPU type (in that order). This is required for CUDA APIs which accept pointers that point to either host or device memory.
Base.eltype
— Methodeltype(var::CuGlobal)
Return the element type of a global variable object.
Base.getindex
— MethodBase.getindex(var::CuGlobal)
Return the current value of a global variable.
Base.pop!
— Methodpop!(CuContext)
Pops the current CUDA context from the current CPU thread, and returns that context.
Base.push!
— Methodpush!(CuContext, ctx::CuContext)
Pushes a context on the current CPU thread.
Base.resize!
— Methodresize!(a::CuVector, n::Int)
Resize a to contain n elements. If n is smaller than the current collection length, the first n elements will be retained. If n is larger, the new elements are not guaranteed to be initialized.
Several restrictions apply to which types of CuArrays can be resized:
- the array should be backed by the memory pool, and not have been constructed with unsafe_wrap
- the array cannot be derived (view, reshape) from another array
- the array cannot have any derived arrays itself
Base.setindex!
— MethodBase.setindex!(var::CuGlobal{T}, val::T)
Set the value of a global variable to val.
Base.unsafe_wrap
— Methodunsafe_wrap(::CuArray, ptr::CuPtr{T}, dims; own=false, ctx=context())
Wrap a CuArray object around the data at the address given by ptr. The pointer element type T determines the array element type. dims is either an integer (for a 1d array) or a tuple of the array dimensions. own optionally specifies whether Julia should take ownership of the memory, calling cudaFree when the array is no longer referenced. The ctx argument determines the CUDA context in which the data is allocated.
CUDA.CuCurrentContext
— MethodCuCurrentContext()
Return the current context, or nothing if there is no active context.
CUDA.CuCurrentDevice
— MethodCuCurrentDevice()
Returns the current device, or nothing if there is no active device.
CUDA.CuDefaultStream
— MethodCuDefaultStream()
Return the default stream.
CUDA.activate
— Methodactivate(ctx::CuContext)
Binds the specified CUDA context to the calling CPU thread.
CUDA.active_blocks
— Methodactive_blocks(fun::CuFunction, threads; shmem=0)
Calculate the maximum number of active blocks per multiprocessor when running threads threads of a kernel fun requiring shmem bytes of dynamic shared memory.
CUDA.add_data!
— Methodadd_data!(link::CuLink, name::String, data::Vector{UInt8}, type::CUjitInputType)
Add object code to a pending link operation.
CUDA.add_data!
— Methodadd_data!(link::CuLink, name::String, code::String)
Add PTX code to a pending link operation.
CUDA.add_file!
— Methodadd_file!(link::CuLink, path::String, typ::CUjitInputType)
Add data from a file to a link operation. The argument typ indicates the type of the contained data.
CUDA.alloc
— Methodalloc(sz)
Allocate a number of bytes sz from the memory pool. Returns a CuPtr{Nothing}; may throw an OutOfGPUMemoryError if the allocation request cannot be satisfied.
CUDA.alloc_timings
— Methodalloc_timings()
Show the timings of the CUDA allocator. Assumes CUDA.enable_timings() has been called.
CUDA.atdevicereset
— MethodCUDA.atdevicereset(f::Function)
Register a function to be called after resetting devices. The function is passed one argument: the device which has been reset.
Use this hook to invalidate global state that depends on the current device.
CUDA.atdeviceswitch
— MethodCUDA.atdeviceswitch(f::Function)
Register a function to be called after switching to or initializing a device on a thread.
Use this hook to invalidate thread-local state that depends on the current device. If that state is also context dependent, be sure to query the context in your callback.
CUDA.atomic_add!
— Functionatomic_add!(ptr::DevicePtr{T}, val::T)
Reads the value old located at address ptr, computes old + val, and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.
This operation is supported for values of type Int32, Int64, UInt32, UInt64, and Float32. Additionally, on GPU hardware with compute capability 6.0+, values of type Float64 are supported.
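The "returns old" convention is the same one used by Julia's host-side atomics. As an illustration only, the sketch below uses Base.Threads.Atomic on the CPU. It is not the CUDA device function documented here, and the variable names are arbitrary.

```julia
using Base.Threads: Atomic, atomic_add!

# A CPU-side atomic cell holding an Int32, standing in for device memory.
counter = Atomic{Int32}(10)

# atomic_add! returns the value held before the addition.
old = atomic_add!(counter, Int32(5))

println(old)        # value before the update: 10
println(counter[])  # value after the update: 15
```

Returning the previous value is what lets each caller obtain a unique slot, for example when several threads reserve positions in a shared output buffer.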
CUDA.atomic_and!
— Functionatomic_and!(ptr::DevicePtr{T}, val::T)
Reads the value old located at address ptr, computes old & val, and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.
This operation is supported for values of type Int32, Int64, UInt32 and UInt64.
CUDA.atomic_cas!
— Functionatomic_cas!(ptr::DevicePtr{T}, cmp::T, val::T)
Reads the value old located at address ptr and compares it with cmp. If old equals cmp, stores val at the same address. Otherwise, the value at that address is left unchanged. These operations are performed in one atomic transaction. The function returns old.
This operation is supported for values of type Int32, Int64, UInt32 and UInt64.
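Compare-and-swap is usually used in a retry loop to build other read-modify-write operations. The sketch below shows the pattern with the host-side Base.Threads.atomic_cas!, which follows the same "returns old" convention. It runs on the CPU and is not the device function documented here; atomic_double! is a hypothetical helper written for this illustration.

```julia
using Base.Threads: Atomic, atomic_cas!

x = Atomic{Int32}(7)

# cmp matches the stored value: val is stored and the old value is returned.
old1 = atomic_cas!(x, Int32(7), Int32(42))

# cmp does not match: the stored value is left alone and returned.
old2 = atomic_cas!(x, Int32(7), Int32(99))

# Retry loop: double the stored value atomically using only compare-and-swap.
function atomic_double!(a::Atomic{Int32})
    while true
        seen = a[]
        # The swap succeeds only if nobody changed the value since it was read.
        atomic_cas!(a, seen, seen * Int32(2)) == seen && return seen
    end
end

before = atomic_double!(x)
```

Comparing the returned value with cmp is how the caller learns whether the swap took place.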
CUDA.atomic_dec!
— Functionatomic_dec!(ptr::DevicePtr{T}, val::T)
Reads the value old located at address ptr, computes (((old == 0) | (old > val)) ? val : (old-1) ), and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns old.
This operation is only supported for values of type Int32.
CUDA.atomic_div!
— Functionatomic_div!(ptr::DevicePtr{T}, val::T)
Reads the value old located at address ptr, computes /(old, val), and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.
This operation is supported on GPU hardware with compute capability 6.0+ for values of type Float32 and Float64.
CUDA.atomic_inc!
— Functionatomic_inc!(ptr::DevicePtr{T}, val::T)
Reads the value old located at address ptr, computes ((old >= val) ? 0 : (old+1)), and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns old.
This operation is only supported for values of type Int32.
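The wrap-around rules given for atomic_inc! and atomic_dec! are easier to see when traced over a few steps. The following is a plain-Julia transcription of the two formulas, with no atomicity and no GPU involved; inc_rule, dec_rule and trace are hypothetical names used only for this illustration.

```julia
# Update rules copied from the descriptions of atomic_inc! and atomic_dec!.
inc_rule(old, val) = (old >= val) ? zero(old) : old + one(old)
dec_rule(old, val) = ((old == 0) | (old > val)) ? val : old - one(old)

# Apply a rule repeatedly and record every value the counter takes.
function trace(rule, start, val, steps)
    seq = [start]
    for _ in 1:steps
        push!(seq, rule(seq[end], val))
    end
    return seq
end

up = trace(inc_rule, Int32(0), Int32(3), 5)    # 0, 1, 2, 3, 0, 1
down = trace(dec_rule, Int32(0), Int32(3), 5)  # 0, 3, 2, 1, 0, 3
```

With val = 3 the counter cycles through the four values 0 to 3, so val acts as an inclusive upper bound rather than a modulus.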
CUDA.atomic_max!
— Functionatomic_max!(ptr::DevicePtr{T}, val::T)
Reads the value old located at address ptr, computes max(old, val), and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.
This operation is supported for values of type Int32, Int64, UInt32 and UInt64. Additionally, on GPU hardware with compute capability 6.0+, values of type Float32 and Float64 are supported.
CUDA.atomic_min!
— Functionatomic_min!(ptr::DevicePtr{T}, val::T)
Reads the value old located at address ptr, computes min(old, val), and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.
This operation is supported for values of type Int32, Int64, UInt32 and UInt64. Additionally, on GPU hardware with compute capability 6.0+, values of type Float32 and Float64 are supported.
CUDA.atomic_mul!
— Functionatomic_mul!(ptr::DevicePtr{T}, val::T)
Reads the value old located at address ptr, computes *(old, val), and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.
This operation is supported on GPU hardware with compute capability 6.0+ for values of type Float32 and Float64.
CUDA.atomic_or!
— Functionatomic_or!(ptr::DevicePtr{T}, val::T)
Reads the value old located at address ptr, computes old | val, and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.
This operation is supported for values of type Int32, Int64, UInt32 and UInt64.
CUDA.atomic_sub!
— Functionatomic_sub!(ptr::DevicePtr{T}, val::T)
Reads the value old located at address ptr, computes old - val, and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.
This operation is supported for values of type Int32, Int64, UInt32 and UInt64. Additionally, on GPU hardware with compute capability 6.0+, values of type Float32 and Float64 are supported.
CUDA.atomic_xchg!
— Functionatomic_xchg!(ptr::DevicePtr{T}, val::T)
Reads the value old located at address ptr and stores val at the same address. These operations are performed in one atomic transaction. The function returns old.
This operation is supported for values of type Int32, Int64, UInt32 and UInt64.
CUDA.atomic_xor!
— Functionatomic_xor!(ptr::DevicePtr{T}, val::T)
Reads the value old located at address ptr, computes old ⊻ val, and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.
This operation is supported for values of type Int32, Int64, UInt32 and UInt64.
CUDA.attaskswitch
— MethodCUDA.attaskswitch(f::Function)
Register a function to be called after switching to or initializing a task on a thread.
Use this hook to invalidate thread-local state that depends on the current task.
CUDA.attribute!
— Methodattribute!(ptr::Union{Ptr,CuPtr}, attr, val)
Sets attribute attr on a pointer ptr to val.
CUDA.attribute
— Methodattribute(dev::CuDevice, code)
Returns information about the device.
CUDA.attribute
— Methodattribute(X, ptr::Union{Ptr,CuPtr}, attr)
Returns attribute attr about pointer ptr. The type of the returned value depends on the attribute, and as such must be passed as the X parameter.
CUDA.available_memory
— Methodavailable_memory()
Returns the amount of memory (in bytes) available for allocation by the CUDA context.
CUDA.blockDim
— MethodblockDim()::CuDim3
Returns the dimensions of the block.
CUDA.blockIdx
— MethodblockIdx()::CuDim3
Returns the block index within the grid.
CUDA.capability
— Methodcapability(dev::CuDevice)
Returns the compute capability of the device.
CUDA.clock
— Methodclock(UInt32)
Returns the value of a per-multiprocessor counter that is incremented every clock cycle.
CUDA.clock
— Methodclock(UInt32)
Returns the value of a per-multiprocessor counter that is incremented every clock cycle.
CUDA.code_sass
— Functioncode_sass([io], f, types, cap::VersionNumber)
Prints the SASS code generated for the method matching the given generic function and type signature to io, which defaults to stdout.
The following keyword arguments are supported:
- cap: which device to generate code for
- kernel: treat the function as an entry-point kernel
- verbose: enable verbose mode, which displays code generation statistics
See also: @device_code_sass
CUDA.complete
— Methodcomplete(link::CuLink)
Complete a pending linker invocation, returning an output image.
CUDA.context!
— Methodcontext!(ctx::CuContext)
Bind the current host thread to the context ctx.
Note that the contexts used with this call should be previously acquired by calling context, and not arbitrary contexts created by calling the CuContext constructor.
CUDA.context!
— Methodcontext!(f, ctx)
Sets the active context for the duration of f.
CUDA.context
— Methodcontext()::CuContext
Get or create a CUDA context for the current thread (as opposed to CuCurrentContext, which may return nothing if there is no context bound to the current thread).
CUDA.cudacall
— Functioncudacall(f::CuFunction, types, values...; blocks::CuDim, threads::CuDim,
cooperative=false, shmem=0, stream=CuDefaultStream())
ccall-like interface for launching a CUDA function f on a GPU.
For example:
    vadd = CuFunction(md, "vadd")  # md is a previously loaded CuModule
    a = rand(Float32, 10)
    b = rand(Float32, 10)
    ad = Mem.alloc(DeviceBuffer, 10*sizeof(Float32))
    unsafe_copyto!(ad, convert(Ptr{Cvoid}, a), 10*sizeof(Float32))
    bd = Mem.alloc(DeviceBuffer, 10*sizeof(Float32))
    unsafe_copyto!(bd, convert(Ptr{Cvoid}, b), 10*sizeof(Float32))
    c = zeros(Float32, 10)
    cd = Mem.alloc(DeviceBuffer, 10*sizeof(Float32))
    cudacall(vadd, (CuPtr{Cfloat},CuPtr{Cfloat},CuPtr{Cfloat}), ad, bd, cd; threads=10)
    unsafe_copyto!(convert(Ptr{Cvoid}, c), cd, 10*sizeof(Float32))
The blocks and threads arguments control the launch configuration, and should both consist of either an integer, or a tuple of 1 to 3 integers (omitted dimensions default to 1). The types argument can be either a tuple of types or a tuple type, the latter being slightly faster.
CUDA.cudaconvert
— Methodcudaconvert(x)
This function is called for every argument to be passed to a kernel, allowing it to be converted to a GPU-friendly format. By default, the function does nothing and returns the input object x as-is.
Do not add methods to this function, but instead extend the underlying Adapt.jl package and register methods for the CUDA.Adaptor type.
CUDA.cufunction
— Methodcufunction(f, tt=Tuple{}; kwargs...)
Low-level interface to compile a function invocation for the currently-active GPU, returning a callable kernel object. For a higher-level interface, use @cuda.
The following keyword arguments are supported:
- minthreads: the required number of threads in a thread block
- maxthreads: the maximum number of threads in a thread block
- blocks_per_sm: a minimum number of thread blocks to be scheduled on a single multiprocessor
- maxregs: the maximum number of registers to be allocated to a single thread (only supported on LLVM 4.0+)
- name: override the name that the kernel will have in the generated code
The output of this function is automatically cached, i.e. you can simply call cufunction in a hot path without degrading performance. New code will be generated automatically when the function changes, or when different types or keyword arguments are provided.
CUDA.description
— Methoddescription(err::CuError)
Gets the string description of an error code.
CUDA.device!
— Functiondevice!(dev::Integer)
device!(dev::CuDevice)
Sets dev as the current active device for the calling host thread. Devices can be specified by integer id, or as a CuDevice (slightly faster).
If your library or code needs to perform an action when the active device changes, add a hook using CUDA.atdeviceswitch.
CUDA.device!
— Methoddevice!(f, dev)
Sets the active device for the duration of f.
Note that this call is intended for temporarily switching devices, and does not change the default device used to initialize new threads or tasks.
CUDA.device
— Methoddevice()::CuDevice
Get the CUDA device for the current thread, similar to how context() works compared to CuCurrentContext().
CUDA.device_reset!
— Functiondevice_reset!(dev::CuDevice=device())
Reset the CUDA state associated with a device. This call will release the underlying context, at which point any objects allocated in that context will be invalidated.
If your library or code needs to perform an action when the active context changes, add a hook using CUDA.atdevicereset. Resetting the device will also cause subsequent API calls to fire the CUDA.atdeviceswitch hook.
CUDA.device_synchronize
— Methoddevice_synchronize()
Wait for the device to finish. This is the device side version, and should not be called from the host.
device_synchronize acts as a synchronization point for child grids in the context of dynamic parallelism.
CUDA.deviceid
— Functiondeviceid()::Int
deviceid(dev::CuDevice)::Int
Get the ID number of the current device of execution. This is a 0-indexed number, corresponding to the device ID as known to CUDA.
CUDA.devices
— Methoddevices()
Get an iterator for the compute devices.
CUDA.dynamic_cufunction
— Functiondynamic_cufunction(f, tt=Tuple{})
Low-level interface to compile a function invocation for the currently-active GPU, returning a callable kernel object. Device-side equivalent of CUDA.cufunction.
No keyword arguments are supported.
CUDA.elapsed
— Methodelapsed(start::CuEvent, stop::CuEvent)
Computes the elapsed time between two events (in seconds).
CUDA.enable_timings
— Methodenable_timings()
Enable the recording of debug timings.
CUDA.find_binary
— Methodfind_binary(name; locations=String[])
Similar to find_library, performs an exhaustive search for a binary in various subdirectories of locations, and finally PATH by using Sys.which.
CUDA.find_libcudadevrt
— Methodfind_libcudadevrt(toolkit_dirs::Vector{String})
Look for the CUDA device runtime library in any of the CUDA toolkit directories toolkit_dirs.
CUDA.find_libdevice
— Methodfind_libdevice(toolkit_dirs::Vector{String})
Look for the CUDA device library supporting targets in any of the CUDA toolkit directories toolkit_dirs. On CUDA >= 9.0, a single unified library is discovered and returned as a string. On older toolkits, individual libraries for each of the targets are returned as a vector of strings.
CUDA.find_library
— Functionfind_library(name, version; locations=String[])
Wrapper for Libdl.find_library, performing a more exhaustive search:
- variants of the library name (including version numbers, platform-specific tags, etc);
- various subdirectories of the locations list, and finally system library directories.
Returns the full path to the library.
CUDA.find_toolkit
— Methodfind_toolkit()::Vector{String}
Look for directories where (parts of) the CUDA toolkit might be installed. This returns a (possibly empty) list of paths that can be used as an argument to other discovery functions.
The behavior of this function can be overridden by defining the CUDA_PATH, CUDA_HOME or CUDA_ROOT environment variables, which should point to the root of the CUDA toolkit.
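To make the override mechanism concrete, here is a minimal sketch of how such environment variables could be collected into a list of candidate directories. This is not the CUDA.jl implementation: toolkit_dirs_from_env is a hypothetical helper, the order in which the variables are checked is an assumption, and /opt/cuda is a made-up path.

```julia
# Hypothetical helper: gather toolkit roots from the environment variables
# named in the text, skipping unset or empty ones and dropping duplicates.
function toolkit_dirs_from_env(env=ENV)
    dirs = String[]
    for var in ("CUDA_PATH", "CUDA_HOME", "CUDA_ROOT")
        if haskey(env, var) && !isempty(env[var])
            push!(dirs, env[var])
        end
    end
    return unique(dirs)
end

# Use an explicit dictionary so the example does not depend on the real
# environment of the machine it runs on.
example_env = Dict("CUDA_HOME" => "/opt/cuda", "CUDA_ROOT" => "/opt/cuda")
found = toolkit_dirs_from_env(example_env)
```

Returning a possibly empty vector, rather than throwing, matches the documented contract that the result can be passed straight to the other discovery functions.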
CUDA.flags
— Methodflags(pctx::CuPrimaryContext)
Query the flags of a primary context.
CUDA.free
— Methodfree(ptr)
Releases a buffer pointed to by ptr to the memory pool.
CUDA.functional
— Functionfunctional(show_reason=false)
Check if the package has been configured successfully and is ready to use.
This call is intended for packages that support conditionally using an available GPU. If you fail to check whether CUDA is functional, actual use of functionality might warn and error.
CUDA.gridDim
— MethodgridDim()::CuDim3
Returns the dimensions of the grid.
CUDA.has_cuda
— Functionhas_cuda()::Bool
Check whether the local system provides an installation of the CUDA driver and toolkit. Use this function if your code loads packages that require CUDA.jl.
Note that CUDA-dependent packages might still fail to load if the installation is broken, so it's recommended to guard against that and print a warning to inform the user:
    using CUDA
    if has_cuda()
        try
            using CuArrays
        catch ex
            @warn "CUDA is installed, but CuArrays.jl fails to load" exception=(ex,catch_backtrace())
        end
    end
CUDA.has_cuda_gpu
— Functionhas_cuda_gpu()::Bool
Check whether the local system provides an installation of the CUDA driver and toolkit, and if it contains a CUDA-capable GPU. See has_cuda for more details.
Note that this function initializes the CUDA API in order to check for the number of GPUs.
CUDA.initializer
— Methodinitializer(f::Function)
Register a function to be called before making a CUDA API call that requires an initialized context.
CUDA.isactive
— Methodisactive(pctx::CuPrimaryContext)
Query whether a primary context is active.
CUDA.launch
— Methodlaunch(f::CuFunction, args...; blocks::CuDim=1, threads::CuDim=1,
cooperative=false, shmem=0, stream=CuDefaultStream())
Low-level call to launch a CUDA function f on the GPU, using blocks and threads as the grid and block configuration respectively. Dynamic shared memory is allocated according to shmem, and the kernel is launched on stream stream.
Arguments to a kernel should either be bitstype, in which case they will be copied to the internal kernel parameter buffer, or a pointer to device memory.
This is a low-level call; prefer cudacall instead.
CUDA.launch_configuration
— Methodlaunch_configuration(fun::CuFunction; shmem=0, max_threads=0)
Calculate a suggested launch configuration for kernel fun requiring shmem bytes of dynamic shared memory. Returns a tuple with a suggested number of threads, and the minimal number of blocks to reach maximal occupancy. Optionally, the maximum number of threads can be constrained using max_threads.
In the case of a variable amount of shared memory, pass a callable object for shmem instead, taking a single integer representing the block size and returning the amount of dynamic shared memory for that configuration.
CUDA.maxthreads
— Methodmaxthreads(k::HostKernel)
Queries the maximum amount of threads a kernel can use in a single block.
CUDA.memory
— Methodmemory(k::HostKernel)
Queries the local, shared and constant memory usage of a compiled kernel in bytes. Returns a named tuple.
CUDA.memory_status
— Functionmemory_status([io=stdout])
Report to io on the memory status of the current GPU and the active memory pool.
CUDA.name
— Methodname(dev::CuDevice)
Returns an identifier string for the device.
CUDA.name
— Methodname(err::CuError)
Gets the string representation of an error code.
This name can often be used as a symbol in source code to get an instance of this error. For example:
julia> err = CuError(1)
CuError(1, ERROR_INVALID_VALUE)
julia> name(err)
"ERROR_INVALID_VALUE"
julia> ERROR_INVALID_VALUE
CuError(1, ERROR_INVALID_VALUE)
CUDA.nanosleep
— Methodnanosleep(t)
Puts a thread to sleep for a given amount t (in nanoseconds).
Requires CUDA >= 10.0 and sm_6.2
CUDA.nextwarp
— Methodnextwarp(dev, threads)
prevwarp(dev, threads)
Returns the next or previous nearest number of threads that is a multiple of the warp size of a device dev. This is a common requirement when using intra-warp communication.
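The rounding involved can be sketched on the host in plain Julia. The real functions take a device and query its warp size; in this hypothetical model the warp size is passed in directly, and the value 32 is only an assumed example. The names next_multiple and prev_multiple are invented for this illustration.

```julia
# Round a thread count up or down to a multiple of the warp size ws.
next_multiple(ws::Integer, threads::Integer) = cld(threads, ws) * ws
prev_multiple(ws::Integer, threads::Integer) = fld(threads, ws) * ws

ws = 32                              # assumed warp size for the example
up = next_multiple(ws, 100)          # 128
down = prev_multiple(ws, 100)        # 96
unchanged = next_multiple(ws, 64)    # already a multiple: 64
```

A thread count that is already a multiple of the warp size is returned unchanged by both helpers in this model.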
CUDA.occupancy
— Methodoccupancy(fun::CuFunction, threads; shmem=0)
Calculate the theoretical occupancy of launching threads threads of a kernel fun requiring shmem bytes of dynamic shared memory.
CUDA.pool_timings
— Methodpool_timings()
Show the timings of the currently active memory pool. Assumes CUDA.enable_timings()
has been called.
CUDA.prepare_cuda_call
— MethodCUDA.prepare_cuda_call()
Prepare state for calling CUDA API functions.
Many CUDA APIs, like the CUDA driver API used by CUDA.jl, use implicit thread-local state to determine, e.g., which device to use. With Julia however, code is grouped in tasks. Execution can switch between them, and tasks can be executing on (and in the future migrate between) different threads. To synchronize these two worlds, call this function before any CUDA API call to update thread-local state based on the current task and its context.
If you need to maintain your own task-local state, subscribe to device and task switch events using CUDA.atdeviceswitch
and CUDA.attaskswitch
for proper invalidation. If your state is device-specific, but global (i.e. not task-bound), it suffices to index your state with the current deviceid()
and invalidate that state when the device is reset by subscribing to CUDA.atdevicereset()
.
CUDA.prevwarp
— Methodnextwarp(dev, threads)
prevwarp(dev, threads)
Returns the next or previous nearest number of threads that is a multiple of the warp size of a device dev
. This is a common requirement when using intra-warp communication.
CUDA.priority
— Methodpriority(s::CuStream)
Return the priority of a stream s
.
CUDA.priority_range
— Methodpriority_range()
Return the valid range of stream priorities as a StepRange
(with step size 1). The lower bound of the range denotes the least priority (typically 0), with the upper bound representing the greatest possible priority (typically -1).
CUDA.query
— Methodquery(e::CuEvent)
Return false
if there is outstanding work preceding the most recent call to record(e)
and true
if all captured work has been completed.
CUDA.query
— Methodquery(s::CuStream)
Return false
if a stream is busy (has tasks running or queued) and true
if that stream is free.
CUDA.reclaim
— Functionreclaim([sz=typemax(Int)])
Reclaims sz
bytes of cached memory. Use this to free GPU memory before calling into functionality that does not use the CUDA memory pool. Returns the number of bytes actually reclaimed.
CUDA.record
— Functionrecord(e::CuEvent, stream=CuDefaultStream())
Record an event on a stream.
CUDA.registers
— Methodregisters(k::HostKernel)
Queries the register usage of a kernel.
CUDA.release
— Methodrelease()
Returns the CUDA release part of the version as returned by version
.
CUDA.setflags!
— Methodsetflags!(pctx::CuPrimaryContext)
Set the flags of a primary context.
CUDA.shfl_down_sync
— Functionshfl_down_sync(threadmask::UInt32, val, delta::Integer, width::Integer=32)
Shuffle a value from a lane with higher ID relative to caller, and synchronize threads according to threadmask
.
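A common use of a downward shuffle is a tree reduction within a warp. The sketch below emulates that pattern on the host with an ordinary vector standing in for the per-lane registers; emulated_shfl_down and emulated_warp_sum are hypothetical names, and the code makes no claim about how CUDA.jl implements the intrinsic. It assumes the usual CUDA behavior that a lane whose source lane falls outside the warp keeps its own value.

```julia
# Host-side emulation only; these helpers are hypothetical and not part of CUDA.jl.
# Element i of `vals` plays the role of the value held by lane i-1.
# Each lane reads the value of the lane `delta` positions above it; lanes whose
# source would fall outside the warp keep their own value.
function emulated_shfl_down(vals::Vector{T}, delta::Integer) where {T}
    n = length(vals)
    return [i + delta <= n ? vals[i + delta] : vals[i] for i in 1:n]
end

# Tree reduction: halve the shuffle distance each step; lane 0 ends up with the sum.
function emulated_warp_sum(vals::Vector{T}) where {T}
    @assert ispow2(length(vals)) "the emulated warp must have a power-of-two size"
    offset = length(vals) ÷ 2
    while offset > 0
        vals = vals .+ emulated_shfl_down(vals, offset)
        offset ÷= 2
    end
    return vals[1]
end

println(emulated_warp_sum(collect(1:32)))  # 528
```

In a real kernel each lane would hold a single value and call shfl_down_sync with a full thread mask at every step; only the value in the first lane is meaningful after the loop.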
CUDA.shfl_recurse
— Methodshfl_recurse(op, x::T)::T
Register how a shuffle operation op
should be applied to a value x
of type T
that is not natively supported by the shuffle intrinsics.
CUDA.shfl_sync
— Functionshfl_sync(threadmask::UInt32, val, lane::Integer, width::Integer=32)
Shuffle a value from a directly indexed lane lane
, and synchronize threads according to threadmask
.
CUDA.shfl_up_sync
— Functionshfl_up_sync(threadmask::UInt32, val, delta::Integer, width::Integer=32)
Shuffle a value from a lane with lower ID relative to caller, and synchronize threads according to threadmask
.
CUDA.shfl_xor_sync
— Functionshfl_xor_sync(threadmask::UInt32, val, mask::Integer, width::Integer=32)
Shuffle a value from a lane based on bitwise XOR of own lane ID with mask
, and synchronize threads according to threadmask
.
CUDA.sync_grid
— Methodsync_grid(grid_handle::Culonglong)
Waits until all threads in all blocks in the grid grid_handle
have reached this point and all global memory accesses made by these threads prior to sync_grid()
are visible to all threads in the grid. A 32-bit integer cudaError_t
is returned.
CUDA.sync_threads
— Methodsync_threads()
Waits until all threads in the thread block have reached this point and all global and shared memory accesses made by these threads prior to sync_threads()
are visible to all threads in the block.
CUDA.sync_threads_and
— Methodsync_threads_and(predicate::Int32)
Identical to __syncthreads()
with the additional feature that it evaluates predicate for all threads of the block and returns non-zero if and only if predicate
evaluates to non-zero for all of them.
sync_threads_and(predicate::Bool)
Identical to __syncthreads()
with the additional feature that it evaluates predicate for all threads of the block and returns true
if and only if predicate
evaluates to true
for all of them.
CUDA.sync_threads_count
— Methodsync_threads_count(predicate::Int32)
Identical to __syncthreads()
with the additional feature that it evaluates predicate for all threads of the block and returns the number of threads for which predicate
evaluates to non-zero.
sync_threads_count(predicate::Bool)
Identical to __syncthreads()
with the additional feature that it evaluates predicate for all threads of the block and returns the number of threads for which predicate
evaluates to true
.
CUDA.sync_threads_or
— Methodsync_threads_or(predicate::Int32)
Identical to __syncthreads()
with the additional feature that it evaluates predicate for all threads of the block and returns non-zero if and only if predicate
evaluates to non-zero for any of them.
sync_threads_or(predicate::Bool)
Identical to __syncthreads()
with the additional feature that it evaluates predicate for all threads of the block and returns true
if and only if predicate
evaluates to true
for any of them.
CUDA.sync_warp
— Functionsync_warp(mask::Integer=0xffffffff)
Waits until the threads in the warp, selected by means of the bitmask mask
, have reached this point and all global and shared memory accesses made by these threads prior to sync_warp()
are visible to those threads in the warp. The default value for mask
selects all threads in the warp.
Requires CUDA >= 9.0 and sm_6.2
CUDA.synchronize
— Methodsynchronize(e::CuEvent)
Waits for an event to complete.
CUDA.synchronize
— Methodsynchronize(s::CuStream)
Wait until a stream's tasks are completed.
CUDA.synchronize
— Methodsynchronize()
Block for the current context's tasks to complete.
CUDA.this_grid
— Methodthis_grid()
Returns a grid_handle
of the grid group this thread belongs to. Only available if a cooperative kernel is launched.
CUDA.threadIdx
— MethodthreadIdx()::CuDim3
Returns the thread index within the block.
CUDA.threadfence
— Methodthreadfence()
A memory fence that acts as threadfence_block
for all threads in the block of the calling thread and also ensures that no writes to all memory made by the calling thread after the call to threadfence()
are observed by any thread in the device as occurring before any write to all memory made by the calling thread before the call to threadfence()
.
Note that for this ordering guarantee to be true, the observing threads must truly observe the memory and not cached versions of it; this requires the use of volatile loads and stores, which is not available from Julia right now.
CUDA.threadfence_block
— Methodthreadfence_block()
A memory fence that ensures that:
- All writes to all memory made by the calling thread before the call to
threadfence_block()
are observed by all threads in the block of the calling thread as occurring before all writes to all memory made by the calling thread after the call to threadfence_block()
- All reads from all memory made by the calling thread before the call to
threadfence_block()
are ordered before all reads from all memory made by the calling thread after the call to threadfence_block()
.
CUDA.threadfence_system
— Methodthreadfence_system()
A memory fence that acts as threadfence_block
for all threads in the block of the calling thread and also ensures that all writes to all memory made by the calling thread before the call to threadfence_system()
are observed by all threads in the device, host threads, and all threads in peer devices as occurring before all writes to all memory made by the calling thread after the call to threadfence_system()
.
CUDA.toolkit_origin
— Methodtoolkit_origin()
Returns the origin of the CUDA toolkit in use (either :artifact, or :local).
CUDA.toolkit_release
— Methodtoolkit_release()
Returns the CUDA release part of the version as returned by version
.
CUDA.toolkit_version
— Methodtoolkit_version()
Returns the version of the CUDA toolkit in use.
CUDA.total_memory
— Methodtotal_memory()
Returns the total amount of memory (in bytes), available for allocation by the CUDA context.
CUDA.totalmem
— Methodtotalmem(dev::CuDevice)
Returns the total amount of memory (in bytes) on the device.
CUDA.unsafe_destroy!
— Methodunsafe_destroy!(ctx::CuContext)
Immediately destroy a context, freeing up all resources associated with it. This does not respect any users of the context, and might make other objects unusable.
CUDA.unsafe_release!
— MethodCUDA.unsafe_release!(ctx::CuContext)
Lower the refcount of a context, possibly freeing up all resources associated with it. This does not respect any users of the context, and might make other objects unusable.
CUDA.unsafe_reset!
— Methodunsafe_reset!(pctx::CuPrimaryContext)
Explicitly destroys and cleans up all resources associated with a device's primary context in the current process. Note that this forcibly invalidates all contexts derived from this primary context, and as a result outstanding resources might become invalid.
CUDA.version
— Methodversion(k::HostKernel)
Queries the PTX and SM versions a kernel was compiled for. Returns a named tuple.
CUDA.version
— Methodversion()
Returns the CUDA version as reported by the driver.
CUDA.vote_all
— Methodvote_all(predicate::Bool)
Evaluate predicate
for all active threads of the warp and return non-zero if and only if predicate
evaluates to non-zero for all of them.
CUDA.vote_any
— Methodvote_any(predicate::Bool)
Evaluate predicate
for all active threads of the warp and return non-zero if and only if predicate
evaluates to non-zero for any of them.
CUDA.vote_ballot
— Methodvote_ballot(predicate::Bool)
Evaluate predicate
for all active threads of the warp and return an integer whose Nth bit is set if and only if predicate
evaluates to non-zero for the Nth thread of the warp and the Nth thread is active.
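The bit layout of the ballot result can be illustrated with a host-side emulation. The function emulated_ballot below is a hypothetical name used only for this sketch, with Boolean vectors standing in for the per-lane predicate and activity state.

```julia
# Host-side emulation only; `emulated_ballot` is hypothetical and not part of CUDA.jl.
# predicates[i] is the predicate of lane i-1; active[i] tells whether that lane is active.
function emulated_ballot(predicates::AbstractVector{Bool}, active::AbstractVector{Bool})
    @assert length(predicates) == length(active) <= 32
    mask = UInt32(0)
    for lane in 0:length(predicates)-1
        if active[lane+1] && predicates[lane+1]
            mask |= UInt32(1) << lane   # set the bit belonging to this lane
        end
    end
    return mask
end

preds  = [isodd(lane) for lane in 0:31]   # odd lanes vote true
active = trues(32)                        # every lane is active
ballot = emulated_ballot(preds, active)

println(string(ballot, base=16))  # aaaaaaaa
println(count_ones(ballot))       # 16
```

Seen this way, the other vote functions are special cases of the ballot: an "any" vote corresponds to a non-zero ballot, and an "all" vote corresponds to a ballot that equals the mask of active lanes.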
CUDA.wait
— Functionwait(e::CuEvent, stream=CuDefaultStream())
Make a stream wait on an event. This only makes the stream wait, and not the host; use synchronize(::CuEvent)
for that.
CUDA.warpsize
— Methodwarpsize(dev::CuDevice)
Returns the warp size (in threads) of the device.
CUDA.warpsize
— Methodwarpsize()::UInt32
Returns the warp size (in threads).
CUDA.@allocated
— Macro@allocated
A macro to evaluate an expression, discarding the resulting value, instead returning the total number of bytes allocated during evaluation of the expression.
CUDA.@atomic
— Macro@atomic a[I] = op(a[I], val)
@atomic a[I] ...= val
Atomically perform a sequence of operations that loads an array element a[I]
, performs the operation op
on that value and a second value val
, and writes the result back to the array. This sequence can be written out as a regular assignment, in which case the same array element should be used in the left and right hand side of the assignment, or as an in-place application of a known operator. In both cases, the array reference should be pure and not induce any side-effects.
This interface is experimental, and might change without warning. Use the lower-level atomic_...!
functions for a stable API.
CUDA.@cuDynamicSharedMem
— Macro@cuDynamicSharedMem(T::Type, dims, offset::Integer=0) -> CuDeviceArray{T,AS.Shared}
Get an array of type T
and dimensions dims
(either an integer length or tuple shape) pointing to a dynamically-allocated piece of shared memory. The type should be statically inferable or an error will be thrown and the generator function will be called dynamically.
Note that the amount of dynamic shared memory needs to be specified when launching the kernel.
Optionally, an offset parameter indicating how many bytes to add to the base shared memory pointer can be specified. This is useful when dealing with a heterogeneous buffer of dynamic shared memory; in the case of a homogeneous multi-part buffer it is preferred to use view
.
CUDA.@cuStaticSharedMem
— Macro@cuStaticSharedMem(T::Type, dims) -> CuDeviceArray{T,AS.Shared}
Get an array of type T
and dimensions dims
(either an integer length or tuple shape) pointing to a statically-allocated piece of shared memory. The type should be statically inferable and the dimensions should be constant, or an error will be thrown and the generator function will be called dynamically.
CUDA.@cuassert
— Macro@cuassert cond [text]
Signal assertion failure to the CUDA driver if cond
is false
. Preferred syntax for writing assertions, mimicking Base.@assert
. Message text
is optionally displayed upon assertion failure.
A failed assertion will crash the GPU, so use sparingly as a debugging tool. Furthermore, the assertion might be disabled at various optimization levels, and thus should not cause any side-effects.
CUDA.@cuda
— Macro@cuda [kwargs...] func(args...)
High-level interface for executing code on a GPU. The @cuda
macro should prefix a call, with func
a callable function or object that should return nothing. It will be compiled to a CUDA function upon first use, and to a certain extent arguments will be converted and managed automatically using cudaconvert
. Finally, a call to cudacall
is performed, scheduling a kernel launch on the current CUDA context.
Several keyword arguments are supported that influence the behavior of @cuda
.
- dynamic: use dynamic parallelism to launch device-side kernels
- arguments that influence kernel compilation: see cufunction and dynamic_cufunction
- arguments that influence kernel launch: see CUDA.HostKernel and CUDA.DeviceKernel
The underlying operations (argument conversion, kernel compilation, kernel call) can be performed explicitly when more control is needed, e.g. to reflect on the resource usage of a kernel to determine the launch configuration. A host-side kernel launch is done as follows:
args = ...
GC.@preserve args begin
    kernel_args = cudaconvert.(args)
    kernel_tt = Tuple{Core.Typeof.(kernel_args)...}
    kernel = cufunction(f, kernel_tt; compilation_kwargs)
    kernel(kernel_args...; launch_kwargs)
end
A device-side launch, aka. dynamic parallelism, is similar but more restricted:
args = ...
# GC.@preserve is not supported
# we're on the device already, so no need to cudaconvert
kernel_tt = Tuple{Core.Typeof(args[1]), ...} # this needs to be fully inferred!
kernel = dynamic_cufunction(f, kernel_tt) # no compiler kwargs supported
kernel(args...; launch_kwargs)
CUDA.@cuprint
— Macro@cuprint(xs...)
@cuprintln(xs...)
Print a textual representation of values xs
to standard output from the GPU. The functionality builds on @cuprintf
, and is intended as a more user-friendly alternative to that API. However, that also means there's only limited support for argument types, handling 16/32/64-bit signed and unsigned integers, 32 and 64-bit floating point numbers, chars and pointers. For more complex output, use @cuprintf
directly.
Limited string interpolation is also possible:
@cuprint("Hello, World ", 42, "\n")
@cuprint "Hello, World $(42)\n"
CUDA.@cuprintf
— Macro@cuprintf("%Fmt", args...)
Print a formatted string in device context on the host standard output.
Note that this is not a fully C-compliant printf
implementation; see the CUDA documentation for supported options and inputs.
Also beware that it is an untyped and unforgiving printf
implementation. Type widths need to match, e.g. printing a 64-bit Julia integer requires the %ld
formatting string.
CUDA.@cuprintln
— Macro@cuprint(xs...)
@cuprintln(xs...)
Print a textual representation of values xs
to standard output from the GPU. The functionality builds on @cuprintf
, and is intended as a more user-friendly alternative to that API. However, that also means there's only limited support for argument types, handling 16/32/64-bit signed and unsigned integers, 32 and 64-bit floating point numbers, chars and pointers. For more complex output, use @cuprintf
directly.
Limited string interpolation is also possible:
@cuprint("Hello, World ", 42, "\n")
@cuprint "Hello, World $(42)\n"
CUDA.@cushow
— Macro@cushow(ex)
GPU analog of Base.@show
. It comes with the same type restrictions as @cuprintf
.
@cushow threadIdx().x
CUDA.@device_code_sass
— Macro@device_code_sass [io::IO=stdout, ...] ex
Evaluates the expression ex
and prints the result of CUDA.code_sass
to io
for every compiled CUDA kernel. For other supported keywords, see CUDA.code_sass
.
CUDA.@elapsed
— Macro@elapsed stream ex
@elapsed ex
A macro to evaluate an expression, discarding the resulting value, instead returning the number of seconds it took to execute on the GPU, as a floating-point number.
CUDA.@profile
— Macro@profile ex
Run expressions while activating the CUDA profiler.
Note that this API is used to programmatically control the profiling granularity by allowing profiling to be done only on selected pieces of code. It does not perform any profiling by itself; you need external tools for that.
CUDA.@retry_reclaim
— Macro@retry_reclaim isfailed(ret) ex
Run a block of code ex
repeatedly until it successfully allocates the memory it needs. A retry is only attempted when calling isfailed
with the current return value returns true. At each try, more and more memory is freed from the CUDA memory pool. When that is not possible anymore, the latest returned value will be returned.
This macro is intended for use with CUDA APIs, which sometimes allocate (outside of the CUDA memory pool) and return a specific error code when the allocation fails.
CUDA.@sync
— Macro@sync [blocking=true] ex
Run expression ex
and synchronize the GPU afterwards. By default, this is a CPU-friendly synchronization, i.e. it performs a blocking synchronization without increasing CPU load. As such, this operation is preferred over implicit synchronization (e.g. when performing a memory copy) for high-performance applications.
It is also useful for timing code that executes asynchronously.
CUDA.@time
— Macro@time ex
Run expression ex
and report on execution time and GPU/CPU memory behavior. The GPU is synchronized right before and after executing ex
to exclude any external effects.