CUDA.AbstractKernelType
(::HostKernel)(args...; kwargs...)
(::DeviceKernel)(args...; kwargs...)

Low-level interface to call a compiled kernel, passing GPU-compatible arguments in args. For a higher-level interface, use @cuda.

The following keyword arguments are supported:

  • threads (defaults to 1)
  • blocks (defaults to 1)
  • shmem (defaults to 0)
  • config: callback function to dynamically compute the launch configuration. It should accept a HostKernel and return a named tuple with any of the above as fields. This functionality is intended to be used in combination with the CUDA occupancy API.
  • stream (defaults to the default stream)

CUDA.ConstType
Const(A::CuDeviceArray)

Mark a CuDeviceArray as constant/read-only. The invariant guaranteed is that you will not modify a CuDeviceArray for the duration of the current kernel.

This API can only be used on devices with compute capability 3.5 or higher.

Warning

Experimental API. Subject to change without deprecation.

CUDA.CuContextType
CuContext(dev::CuDevice, flags=CTX_SCHED_AUTO)
CuContext(f::Function, ...)

Create a CUDA context for device. A context on the GPU is analogous to a process on the CPU, with its own distinct address space and allocated resources. When a context is destroyed, the system cleans up the resources allocated to it.

When you are done using the context, call unsafe_destroy! to mark it for deletion, or use do-block syntax with this constructor.

CUDA.CuContextMethod
CuContext(pctx::CuPrimaryContext)

Retain the primary context on the GPU, returning a context compatible with the driver API. The primary context will be released when the returned driver context is finalized.

As these contexts are refcounted by CUDA, you should not call CUDA.unsafe_destroy! on them but use CUDA.unsafe_release! instead (available with do-block syntax as well).

CUDA.CuDeviceType
CuDevice(i::Integer)

Get a handle to a compute device.

CUDA.CuDeviceMethod
CuDevice(::CuContext)

Returns the device for a context.

CUDA.CuDeviceArrayType
CuDeviceArray(dims, ptr)
CuDeviceArray{T}(dims, ptr)
CuDeviceArray{T,A}(dims, ptr)
CuDeviceArray{T,A,N}(dims, ptr)

Construct an N-dimensional dense CUDA device array with element type T wrapping a pointer, where N is determined from the length of dims and T is determined from the type of ptr. dims may be a single scalar, or a tuple of integers corresponding to the lengths in each dimension. If the rank N is supplied explicitly as in Array{T,N}(dims), then it must match the length of dims. The same applies to the element type T, which should match the type of the pointer ptr.

CUDA.CuDeviceTextureType
CuDeviceTexture{T,N,NC}

N-dimensional device texture with elements of type T. This type is the device-side counterpart of CuTexture, and can be used to access textures using regular indexing notation. If NC is true, indices used by these accesses should be normalized, i.e., fall into the [0,1) domain.

Device-side texture objects cannot be created directly, but should be created host-side using CuTexture and passed to the kernel as an argument.

Warning

Experimental API. Subject to change without deprecation.

CUDA.CuDim3Type
CuDim3(x)

CuDim3((x,))
CuDim3((x, y))
CuDim3((x, y, z))

A type used to specify dimensions, consisting of 3 integers for respectively the x, y and z dimension. Unspecified dimensions default to 1.

Often accepted as an argument through the CuDim type alias, e.g. in the case of cudacall or launch, which makes it possible to pass dimensions as a plain integer or a tuple without having to construct an explicit CuDim3 object.
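As an illustration of the defaulting rule, the following pure-Julia sketch pads a dimension specification to three entries. The helper pad_dims is hypothetical and not part of CUDA.jl; it only models how unspecified dimensions fall back to 1.

```julia
# Hypothetical helper (not a CUDA.jl function): model of how a CuDim-style
# argument is padded to (x, y, z), with unspecified dimensions set to 1.
pad_dims(x::Integer) = pad_dims((x,))

function pad_dims(dims::NTuple{N,Integer}) where {N}
    1 <= N <= 3 || throw(ArgumentError("expected 1 to 3 dimensions, got $N"))
    return ntuple(i -> i <= N ? Int(dims[i]) : 1, 3)
end

pad_dims(256)       # (256, 1, 1)
pad_dims((16, 16))  # (16, 16, 1)
```

Accepting either an integer or a short tuple keeps one-dimensional launches terse while still allowing a full three-dimensional configuration.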

CUDA.CuErrorType
CuError(code)
CuError(code, meta)

Create a CUDA error object with error code code. The optional meta parameter indicates whether extra information, such as error logs, is known.

CUDA.CuFunctionType
CuFunction(mod::CuModule, name::String)

Acquires a function handle from a named function in a module.

CUDA.CuGlobalType
CuGlobal{T}(mod::CuModule, name::String)

Acquires a typed global variable handle from a named global in a module.

CUDA.CuIteratorType
CuIterator(batches)

Return a CuIterator that can iterate through the provided batches via Base.iterate.

Upon each iteration, the current batch is adapted to the GPU (via map(x -> adapt(CuArray, x), batch)) and the previous iteration is marked as freeable from GPU memory (via unsafe_free!).

This abstraction is useful for batching data into GPU memory in a manner that allows old iterations to potentially be freed (or marked as reusable) earlier than they otherwise would via CuArray's internal polling mechanism.

CUDA.CuLinkType
CuLink()

Creates a pending JIT linker invocation.

CUDA.CuLinkImageType

The result of a linking operation.

This object keeps its parent linker object alive, as destroying a linker destroys linked images too.

CUDA.CuModuleType
CuModule(data, options::Dict{CUjit_option,Any})
CuModuleFile(path, options::Dict{CUjit_option,Any})

Create a CUDA module from data, or from a file containing data. The data may be PTX code, a CUBIN, or a FATBIN.

options is an optional dictionary of JIT options and their respective values.

CUDA.CuModuleMethod
CuModule(img::CuLinkImage, ...)

Create a CUDA module from a completed linking operation. Options from CuModule apply.

CUDA.CuPrimaryContextType
CuPrimaryContext(dev::CuDevice)

Create a primary CUDA context for a given device.

Each primary context is unique per device and is shared with the CUDA runtime API. It is meant for interoperability with (applications using) the runtime API.

CUDA.CuPtrType
CuPtr{T}

A memory address that refers to data of type T that is accessible from the GPU. A CuPtr is ABI compatible with regular Ptr objects, e.g. it can be used to ccall a function that expects a Ptr to GPU memory, but it prevents erroneous conversions between the two.

CUDA.CuStreamMethod
CuStream(; flags=STREAM_DEFAULT, priority=nothing)

Create a CUDA stream.

CUDA.CuTextureType
CuTexture{T,N,P}

N-dimensional texture object with elements of type T. These objects do not store data themselves, but are bound to another source of device memory. Texture objects can be passed to CUDA kernels, where they will be accessible through the CuDeviceTexture type.

Warning

Experimental API. Subject to change without deprecation.

CUDA.CuTextureMethod
CuTexture(x::CuArray{T,N})

Create an N-dimensional texture object that reads from a CuArray.

Note that the array's memory must be well aligned and strided (good pitch). Currently, this is not enforced.

Warning

Experimental API. Subject to change without deprecation.

CUDA.CuTextureMethod
CuTexture(x::CuTextureArray{T,N})

Create an N-dimensional texture object with elements of type T that will be read from x.

Warning

Experimental API. Subject to change without deprecation.

CUDA.CuTextureMethod
CuTexture{T,N,P}(parent::P; address_mode, filter_mode, normalized_coordinates)

Construct an N-dimensional texture object with elements of type T as stored in parent.

Several keyword arguments alter the behavior of texture objects:

  • address_mode (wrap, clamp, mirror): how out-of-bounds values are accessed. Can be specified as a value for all dimensions, or as a tuple of N entries.
  • filter_mode (point, linear): how non-integral indices are fetched. Point mode fetches a single value, linear results in linear interpolation between values.
  • normalized_coordinates (true, false): whether indices are expected to fall in the normalized [0,1) range.

Warning

Experimental API. Subject to change without deprecation.

CUDA.CuTextureArrayType
CuTextureArray{T,N}(undef, dims)

N-dimensional dense texture array with elements of type T. These arrays are optimized for texture fetching, and are only meant to be used as a source for CuTexture objects.

Warning

Experimental API. Subject to change without deprecation.

CUDA.CuTextureArrayMethod
CuTextureArray(A::AbstractArray)

Allocate and initialize a texture buffer from host memory in A.

Warning

Experimental API. Subject to change without deprecation.

CUDA.CuTextureArrayMethod
CuTextureArray(A::CuArray)

Allocate and initialize a texture buffer from device memory in A.

Warning

Experimental API. Subject to change without deprecation.

CUDA.CuTextureArrayMethod
CuTextureArray{T,N}(undef, dims)

Construct an uninitialized texture array of N dimensions specified in the dims tuple, with elements of type T. Use Base.copyto! to initialize this texture array, or use constructors that take a non-texture array to do so automatically.

Warning

Experimental API. Subject to change without deprecation.

CUDA.DeviceKernelType
(::HostKernel)(args...; kwargs...)
(::DeviceKernel)(args...; kwargs...)

Low-level interface to call a compiled kernel, passing GPU-compatible arguments in args. For a higher-level interface, use @cuda.

The following keyword arguments are supported:

  • threads (defaults to 1)
  • blocks (defaults to 1)
  • shmem (defaults to 0)
  • config: callback function to dynamically compute the launch configuration. It should accept a HostKernel and return a named tuple with any of the above as fields. This functionality is intended to be used in combination with the CUDA occupancy API.
  • stream (defaults to the default stream)

CUDA.DevicePtrType
DevicePtr{T,A}

A memory address that refers to data of type T that is accessible from the GPU. It is the on-device counterpart of CuPtr, additionally keeping track of the address space A where the data resides (shared, global, constant, etc). This information is used to provide optimized implementations of operations such as unsafe_load and unsafe_store!.

CUDA.HostKernelType
(::HostKernel)(args...; kwargs...)
(::DeviceKernel)(args...; kwargs...)

Low-level interface to call a compiled kernel, passing GPU-compatible arguments in args. For a higher-level interface, use @cuda.

The following keyword arguments are supported:

  • threads (defaults to 1)
  • blocks (defaults to 1)
  • shmem (defaults to 0)
  • config: callback function to dynamically compute the launch configuration. It should accept a HostKernel and return a named tuple with any of the above as fields. This functionality is intended to be used in combination with the CUDA occupancy API.
  • stream (defaults to the default stream)

CUDA.OutOfGPUMemoryErrorType
OutOfGPUMemoryError()

An operation allocated too much GPU memory for either the system or the memory pool to handle properly.

CUDA.PtrOrCuPtrType
PtrOrCuPtr{T}

A special pointer type, ABI-compatible with both Ptr and CuPtr, for use in ccall expressions to convert values to either a GPU or a CPU type (in that order). This is required for CUDA APIs which accept pointers that either point to host or device memory.

Base.eltypeMethod
eltype(var::CuGlobal)

Return the element type of a global variable object.

Base.getindexMethod
Base.getindex(var::CuGlobal)

Return the current value of a global variable.

Base.pop!Method
pop!(CuContext)

Pops the current CUDA context from the current CPU thread, and returns that context.

Base.push!Method
push!(CuContext, ctx::CuContext)

Pushes a context on the current CPU thread.

Base.resize!Method

resize!(a::CuVector, n::Int)

Resize a to contain n elements. If n is smaller than the current collection length, the first n elements will be retained. If n is larger, the new elements are not guaranteed to be initialized.

Several restrictions apply to which types of CuArrays can be resized:

  • the array should be backed by the memory pool, and not have been constructed with unsafe_wrap
  • the array cannot be derived (view, reshape) from another array
  • the array cannot have any derived arrays itself

Base.setindex!Method
Base.setindex!(var::CuGlobal{T}, val::T)

Set the value of a global variable to val.

Base.unsafe_wrapMethod

unsafe_wrap(::CuArray, ptr::CuPtr{T}, dims; own=false, ctx=context())

Wrap a CuArray object around the data at the address given by ptr. The pointer element type T determines the array element type. dims is either an integer (for a 1d array) or a tuple of the array dimensions. own optionally specifies whether Julia should take ownership of the memory, calling cudaFree when the array is no longer referenced. The ctx argument determines the CUDA context in which the data is allocated.

CUDA.CuCurrentContextMethod
CuCurrentContext()

Return the current context, or nothing if there is no active context.

CUDA.CuCurrentDeviceMethod
CuCurrentDevice()

Returns the current device, or nothing if there is no active device.

CUDA.activateMethod
activate(ctx::CuContext)

Binds the specified CUDA context to the calling CPU thread.

CUDA.active_blocksMethod
active_blocks(fun::CuFunction, threads; shmem=0)

Calculate the maximum number of active blocks per multiprocessor when running threads threads of a kernel fun requiring shmem bytes of dynamic shared memory.

CUDA.add_data!Method
add_data!(link::CuLink, name::String, data::Vector{UInt8}, type::CUjitInputType)

Add object code to a pending link operation.

CUDA.add_data!Method
add_data!(link::CuLink, name::String, code::String)

Add PTX code to a pending link operation.

CUDA.add_file!Method
add_file!(link::CuLink, path::String, typ::CUjitInputType)

Add data from a file to a link operation. The argument typ indicates the type of the contained data.

CUDA.allocMethod
alloc(sz)

Allocate a number of bytes sz from the memory pool. Returns a CuPtr{Nothing}; may throw an OutOfGPUMemoryError if the allocation request cannot be satisfied.

CUDA.atdeviceresetMethod
CUDA.atdevicereset(f::Function)

Register a function to be called after resetting devices. The function is passed one argument: the device which has been reset.

Use this hook to invalidate global state that depends on the current device.

CUDA.atdeviceswitchMethod
CUDA.atdeviceswitch(f::Function)

Register a function to be called after switching to or initializing a device on a thread.

Use this hook to invalidate thread-local state that depends on the current device. If that state is also context dependent, be sure to query the context in your callback.

CUDA.atomic_add!Function
atomic_add!(ptr::DevicePtr{T}, val::T)

Reads the value old located at address ptr, computes old + val, and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.

This operation is supported for values of type Int32, Int64, UInt32, UInt64, and Float32. Additionally, on GPU hardware with compute capability 6.0+, values of type Float64 are supported.

CUDA.atomic_and!Function
atomic_and!(ptr::DevicePtr{T}, val::T)

Reads the value old located at address ptr, computes old & val, and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.

This operation is supported for values of type Int32, Int64, UInt32 and UInt64.

CUDA.atomic_cas!Function
atomic_cas!(ptr::DevicePtr{T}, cmp::T, val::T)

Reads the value old located at address ptr and compares it with cmp. If old equals cmp, stores val at the same address. Otherwise, the value at that address is left unchanged. These operations are performed in one atomic transaction. The function returns old.

This operation is supported for values of type Int32, Int64, UInt32 and UInt64.
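The compare-and-swap rule described above can be sketched on the CPU as follows. This is only a model under stated assumptions: cas_model! is a hypothetical name, it operates on a Ref rather than a DevicePtr, and it is not atomic.

```julia
# CPU-only model of the compare-and-swap rule. `cas_model!` is a hypothetical
# name; unlike atomic_cas!, this sketch is NOT atomic.
function cas_model!(cell::Base.RefValue{T}, cmp::T, val::T) where {T}
    old = cell[]
    if old == cmp
        cell[] = val   # swap only when the stored value matches cmp
    end
    return old         # the previous value is returned either way
end

cell = Ref(Int32(5))
cas_model!(cell, Int32(5), Int32(9))  # returns 5, cell[] is now 9
cas_model!(cell, Int32(5), Int32(1))  # returns 9, cell[] stays 9
```

Because the old value is always returned, a caller can tell whether the swap happened by comparing the result with cmp.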

CUDA.atomic_dec!Function
atomic_dec!(ptr::DevicePtr{T}, val::T)

Reads the value old located at address ptr, computes (((old == 0) | (old > val)) ? val : (old-1) ), and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns old.

This operation is only supported for values of type Int32.

CUDA.atomic_div!Function
atomic_div!(ptr::DevicePtr{T}, val::T)

Reads the value old located at address ptr, computes /(old, val), and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.

This operation is supported on GPU hardware with compute capability 6.0+ for values of type Float32 and Float64.

CUDA.atomic_inc!Function
atomic_inc!(ptr::DevicePtr{T}, val::T)

Reads the value old located at address ptr, computes ((old >= val) ? 0 : (old+1)), and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns old.

This operation is only supported for values of type Int32.
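The wrap-around formulas given for atomic_inc! and atomic_dec! can be checked with a small pure-Julia model. The function names inc_model and dec_model are hypothetical; the sketch computes only the value that would be stored and models neither memory access nor atomicity.

```julia
# Pure functions modelling the value that atomic_inc! and atomic_dec! store,
# given the old value and the bound val. The names are hypothetical, and no
# memory access or atomicity is modelled here.
inc_model(old::T, val::T) where {T<:Integer} =
    (old >= val) ? zero(T) : old + one(T)

dec_model(old::T, val::T) where {T<:Integer} =
    ((old == 0) | (old > val)) ? val : old - one(T)

inc_model(Int32(2), Int32(3))  # 3
inc_model(Int32(3), Int32(3))  # 0 (wraps around at the bound)
dec_model(Int32(2), Int32(3))  # 1
dec_model(Int32(0), Int32(3))  # 3 (wraps around at zero)
```

With a bound of val, repeated increments cycle through 0, 1, ..., val, which is why these operations are handy for ring-buffer style counters.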

CUDA.atomic_max!Function
atomic_max!(ptr::DevicePtr{T}, val::T)

Reads the value old located at address ptr, computes max(old, val), and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.

This operation is supported for values of type Int32, Int64, UInt32 and UInt64. Additionally, on GPU hardware with compute capability 6.0+, values of type Float32 and Float64 are supported.

CUDA.atomic_min!Function
atomic_min!(ptr::DevicePtr{T}, val::T)

Reads the value old located at address ptr, computes min(old, val), and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.

This operation is supported for values of type Int32, Int64, UInt32 and UInt64. Additionally, on GPU hardware with compute capability 6.0+, values of type Float32 and Float64 are supported.

CUDA.atomic_mul!Function
atomic_mul!(ptr::DevicePtr{T}, val::T)

Reads the value old located at address ptr, computes *(old, val), and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.

This operation is supported on GPU hardware with compute capability 6.0+ for values of type Float32 and Float64.

CUDA.atomic_or!Function
atomic_or!(ptr::DevicePtr{T}, val::T)

Reads the value old located at address ptr, computes old | val, and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.

This operation is supported for values of type Int32, Int64, UInt32 and UInt64.

CUDA.atomic_sub!Function
atomic_sub!(ptr::DevicePtr{T}, val::T)

Reads the value old located at address ptr, computes old - val, and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.

This operation is supported for values of type Int32, Int64, UInt32 and UInt64. Additionally, on GPU hardware with compute capability 6.0+, values of type Float32 and Float64 are supported.

CUDA.atomic_xchg!Function
atomic_xchg!(ptr::DevicePtr{T}, val::T)

Reads the value old located at address ptr and stores val at the same address. These operations are performed in one atomic transaction. The function returns old.

This operation is supported for values of type Int32, Int64, UInt32 and UInt64.

CUDA.atomic_xor!Function
atomic_xor!(ptr::DevicePtr{T}, val::T)

Reads the value old located at address ptr, computes old ⊻ val, and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.

This operation is supported for values of type Int32, Int64, UInt32 and UInt64.

CUDA.attaskswitchMethod
CUDA.attaskswitch(f::Function)

Register a function to be called after switching to or initializing a task on a thread.

Use this hook to invalidate thread-local state that depends on the current task.

CUDA.attribute!Method
attribute!(ptr::Union{Ptr,CuPtr}, attr, val)

Sets attribute attr on a pointer ptr to val.

CUDA.attributeMethod
attribute(dev::CuDevice, code)

Returns information about the device.

CUDA.attributeMethod
attribute(X, ptr::Union{Ptr,CuPtr}, attr)

Returns attribute attr about pointer ptr. The type of the returned value depends on the attribute, and as such must be passed as the X parameter.

CUDA.available_memoryMethod
available_memory()

Returns the amount of memory (in bytes) available for allocation by the CUDA context.

CUDA.blockDimMethod
blockDim()::CuDim3

Returns the dimensions of the block.

CUDA.blockIdxMethod
blockIdx()::CuDim3

Returns the block index within the grid.

CUDA.capabilityMethod
capability(dev::CuDevice)

Returns the compute capability of the device.

CUDA.clockMethod
clock(UInt32)

Returns the value of a per-multiprocessor counter that is incremented every clock cycle.

CUDA.clockMethod
clock(UInt64)

Returns the value of a per-multiprocessor counter that is incremented every clock cycle.

CUDA.code_sassFunction
code_sass([io], f, types, cap::VersionNumber)

Prints the SASS code generated for the method matching the given generic function and type signature to io which defaults to stdout.

The following keyword arguments are supported:

  • cap: which device to generate code for
  • kernel: treat the function as an entry-point kernel
  • verbose: enable verbose mode, which displays code generation statistics

See also: @device_code_sass

CUDA.completeMethod
complete(link::CuLink)

Complete a pending linker invocation, returning an output image.

CUDA.context!Method
context!(ctx::CuContext)

Bind the current host thread to the context ctx.

Note that the contexts used with this call should be previously acquired by calling context, and not arbitrary contexts created by calling the CuContext constructor.

CUDA.context!Method
context!(f, ctx)

Sets the active context for the duration of f.

CUDA.contextMethod
context()::CuContext

Get or create a CUDA context for the current thread (as opposed to CuCurrentContext which may return nothing if there is no context bound to the current thread).

CUDA.cudacallFunction
cudacall(f::CuFunction, types, values...; blocks::CuDim, threads::CuDim,
         cooperative=false, shmem=0, stream=CuDefaultStream())

ccall-like interface for launching a CUDA function f on a GPU.

For example:

vadd = CuFunction(md, "vadd")
a = rand(Float32, 10)
b = rand(Float32, 10)
ad = Mem.alloc(DeviceBuffer, 10*sizeof(Float32))
unsafe_copyto!(ad, convert(Ptr{Cvoid}, a), 10*sizeof(Float32))
bd = Mem.alloc(DeviceBuffer, 10*sizeof(Float32))
unsafe_copyto!(bd, convert(Ptr{Cvoid}, b), 10*sizeof(Float32))
c = zeros(Float32, 10)
cd = Mem.alloc(DeviceBuffer, 10*sizeof(Float32))

cudacall(vadd, (CuPtr{Cfloat},CuPtr{Cfloat},CuPtr{Cfloat}), ad, bd, cd; threads=10)
unsafe_copyto!(convert(Ptr{Cvoid}, c), cd, 10*sizeof(Float32))

The blocks and threads arguments control the launch configuration; each should be either an integer or a tuple of 1 to 3 integers (omitted dimensions default to 1). The types argument can be either a tuple of types or a tuple type, the latter being slightly faster.
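The two accepted forms of the types argument can be compared in plain Julia. In this sketch Ptr stands in for CuPtr so that it runs without CUDA.jl; that substitution and the variable names are assumptions made for illustration.

```julia
# Two equivalent ways to describe an argument signature. Ptr is used here in
# place of CuPtr so the snippet runs without CUDA.jl (an assumption made for
# illustration only).
types_tuple = (Ptr{Cfloat}, Ptr{Cfloat}, Cint)        # a tuple of types
types_type  = Tuple{Ptr{Cfloat}, Ptr{Cfloat}, Cint}   # a tuple type

# Converting between the two forms with plain Julia:
Tuple{types_tuple...} === types_type      # true
fieldtypes(types_type) == types_tuple     # true
```

The tuple type carries the signature in the type domain, which is a plausible reason for it being the slightly faster form.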

CUDA.cudaconvertMethod
cudaconvert(x)

This function is called for every argument to be passed to a kernel, allowing it to be converted to a GPU-friendly format. By default, the function does nothing and returns the input object x as-is.

Do not add methods to this function, but instead extend the underlying Adapt.jl package and register methods for the CUDA.Adaptor type.

CUDA.cufunctionMethod
cufunction(f, tt=Tuple{}; kwargs...)

Low-level interface to compile a function invocation for the currently-active GPU, returning a callable kernel object. For a higher-level interface, use @cuda.

The following keyword arguments are supported:

  • minthreads: the required number of threads in a thread block
  • maxthreads: the maximum number of threads in a thread block
  • blocks_per_sm: a minimum number of thread blocks to be scheduled on a single multiprocessor
  • maxregs: the maximum number of registers to be allocated to a single thread (only supported on LLVM 4.0+)
  • name: override the name that the kernel will have in the generated code

The output of this function is automatically cached, i.e. you can simply call cufunction in a hot path without degrading performance. New code will be generated automatically when the function changes, or when different types or keyword arguments are provided.

CUDA.descriptionMethod
description(err::CuError)

Gets the string description of an error code.

CUDA.device!Function
device!(dev::Integer)
device!(dev::CuDevice)

Sets dev as the current active device for the calling host thread. Devices can be specified by integer id, or as a CuDevice (slightly faster).

If your library or code needs to perform an action when the active device changes, add a hook using CUDA.atdeviceswitch.

CUDA.device!Method
device!(f, dev)

Sets the active device for the duration of f.

Note that this call is intended for temporarily switching devices, and does not change the default device used to initialize new threads or tasks.

CUDA.device_reset!Function
device_reset!(dev::CuDevice=device())

Reset the CUDA state associated with a device. This call will release the underlying context, at which point any objects allocated in that context will be invalidated.

If your library or code needs to perform an action when the active context changes, add a hook using CUDA.atdevicereset. Resetting the device will also cause subsequent API calls to fire the CUDA.atdeviceswitch hook.

CUDA.device_synchronizeMethod
device_synchronize()

Wait for the device to finish. This is the device side version, and should not be called from the host.

device_synchronize acts as a synchronization point for child grids in the context of dynamic parallelism.

CUDA.deviceidFunction
deviceid()::Int
deviceid(dev::CuDevice)::Int

Get the ID number of the current device of execution. This is a 0-indexed number, corresponding to the device ID as known to CUDA.

CUDA.devicesMethod
devices()

Get an iterator for the compute devices.

CUDA.dynamic_cufunctionFunction
dynamic_cufunction(f, tt=Tuple{})

Low-level interface to compile a function invocation for the currently-active GPU, returning a callable kernel object. Device-side equivalent of CUDA.cufunction.

No keyword arguments are supported.

CUDA.elapsedMethod
elapsed(start::CuEvent, stop::CuEvent)

Computes the elapsed time between two events (in seconds).

CUDA.find_binaryMethod
find_binary(name; locations=String[])

Similar to find_library, performs an exhaustive search for a binary in various subdirectories of locations, and finally PATH by using Sys.which.

CUDA.find_libcudadevrtMethod
find_libcudadevrt(toolkit_dirs::Vector{String})

Look for the CUDA device runtime library in any of the CUDA toolkit directories toolkit_dirs.

CUDA.find_libdeviceMethod
find_libdevice(toolkit_dirs::Vector{String})

Look for the CUDA device library supporting targets in any of the CUDA toolkit directories toolkit_dirs. On CUDA >= 9.0, a single unified library is discovered and returned as a string. On older toolkits, individual libraries for each of the targets are returned as a vector of strings.

CUDA.find_libraryFunction
find_library(name, version; locations=String[])

Wrapper for Libdl.find_library, performing a more exhaustive search:

  • variants of the library name (including version numbers, platform-specific tags, etc);
  • various subdirectories of the locations list, and finally system library directories.

Returns the full path to the library.

CUDA.find_toolkitMethod
find_toolkit()::Vector{String}

Look for directories where (parts of) the CUDA toolkit might be installed. This returns a (possibly empty) list of paths that can be used as an argument to other discovery functions.

The behavior of this function can be overridden by defining the CUDA_PATH, CUDA_HOME or CUDA_ROOT environment variables, which should point to the root of the CUDA toolkit.

CUDA.flagsMethod
flags(pctx::CuPrimaryContext)

Query the flags of a primary context.

CUDA.freeMethod
free(ptr)

Releases a buffer pointed to by ptr to the memory pool.

CUDA.functionalFunction
functional(show_reason=false)

Check if the package has been configured successfully and is ready to use.

This call is intended for packages that support conditionally using an available GPU. If you fail to check whether CUDA is functional, actual use of functionality might warn and error.

CUDA.gridDimMethod
gridDim()::CuDim3

Returns the dimensions of the grid.

CUDA.has_cudaFunction
has_cuda()::Bool

Check whether the local system provides an installation of the CUDA driver and toolkit. Use this function if your code loads packages that require CUDA.jl.

Note that CUDA-dependent packages might still fail to load if the installation is broken, so it's recommended to guard against that and print a warning to inform the user:

using CUDA
if has_cuda()
    try
        using CuArrays
    catch ex
        @warn "CUDA is installed, but CuArrays.jl fails to load" exception=(ex,catch_backtrace())
    end
end

CUDA.has_cuda_gpuFunction
has_cuda_gpu()::Bool

Check whether the local system provides an installation of the CUDA driver and toolkit, and if it contains a CUDA-capable GPU. See has_cuda for more details.

Note that this function initializes the CUDA API in order to check for the number of GPUs.

CUDA.initializerMethod
initializer(f::Function)

Register a function to be called before making a CUDA API call that requires an initialized context.

CUDA.isactiveMethod
isactive(pctx::CuPrimaryContext)

Query whether a primary context is active.

CUDA.launchMethod
launch(f::CuFunction, args...; blocks::CuDim=1, threads::CuDim=1,
       cooperative=false, shmem=0, stream=CuDefaultStream())

Low-level call to launch a CUDA function f on the GPU, using blocks and threads as respectively the grid and block configuration. Dynamic shared memory is allocated according to shmem, and the kernel is launched on stream stream.

Arguments to a kernel should either be of an isbits type, in which case they will be copied to the internal kernel parameter buffer, or be a pointer to device memory.

This is a low-level call, prefer to use cudacall instead.

CUDA.launch_configurationMethod
launch_configuration(fun::CuFunction; shmem=0, max_threads=0)

Calculate a suggested launch configuration for kernel fun requiring shmem bytes of dynamic shared memory. Returns a tuple with a suggested amount of threads, and the minimal amount of blocks to reach maximal occupancy. Optionally, the maximum amount of threads can be constrained using max_threads.

In the case of a variable amount of shared memory, pass a callable object for shmem instead, taking a single integer representing the block size and returning the amount of dynamic shared memory for that configuration.
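A callable for the shmem keyword might look like the sketch below. The name shmem_per_block and the one-Float32-per-thread sizing are assumptions chosen for illustration; the actual call to launch_configuration is shown only as a comment because it needs a compiled kernel.

```julia
# Hypothetical callable for the shmem keyword: one Float32 of dynamic shared
# memory per thread in the block.
shmem_per_block(threads::Integer) = threads * sizeof(Float32)

shmem_per_block(256)  # 1024 bytes

# With a compiled CuFunction `fun` (not constructed here), the callable would
# be passed in place of a fixed byte count:
#   launch_configuration(fun; shmem=shmem_per_block)
```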

CUDA.maxthreadsMethod
maxthreads(k::HostKernel)

Queries the maximum amount of threads a kernel can use in a single block.

CUDA.memoryMethod
memory(k::HostKernel)

Queries the local, shared and constant memory usage of a compiled kernel in bytes. Returns a named tuple.

CUDA.memory_statusFunction
memory_status([io=stdout])

Report to io on the memory status of the current GPU and the active memory pool.

CUDA.nameMethod
name(dev::CuDevice)

Returns an identifier string for the device.

CUDA.nameMethod
name(err::CuError)

Gets the string representation of an error code.

This name can often be used as a symbol in source code to get an instance of this error. For example:

julia> err = CuError(1)
CuError(1, ERROR_INVALID_VALUE)

julia> name(err)
"ERROR_INVALID_VALUE"

julia> ERROR_INVALID_VALUE
CuError(1, ERROR_INVALID_VALUE)

CUDA.nanosleepMethod
nanosleep(t)

Puts a thread to sleep for a given amount of time t (in nanoseconds).

Note

Requires CUDA >= 10.0 and sm_6.2

CUDA.nextwarpMethod
nextwarp(dev, threads)
prevwarp(dev, threads)

Returns the next or previous nearest number of threads that is a multiple of the warp size of a device dev. This is a common requirement when using intra-warp communication.
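The rounding that nextwarp and prevwarp describe can be modelled on the CPU as follows. The function names are hypothetical, and the warp size of 32 is an assumption; the real functions take a device dev and use its warp size.

```julia
# CPU-only model of rounding a thread count to a multiple of the warp size.
# The names are hypothetical, and ws = 32 is an assumption; CUDA.jl queries
# the warp size from the device instead.
next_multiple(threads::Integer, ws::Integer=32) = cld(threads, ws) * ws
prev_multiple(threads::Integer, ws::Integer=32) = fld(threads, ws) * ws

next_multiple(100)  # 128
prev_multiple(100)  # 96
next_multiple(64)   # 64 (already a multiple)
```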

CUDA.occupancyMethod
occupancy(fun::CuFunction, threads; shmem=0)

Calculate the theoretical occupancy of launching threads threads of a kernel fun requiring shmem bytes of dynamic shared memory.

CUDA.prepare_cuda_callMethod
CUDA.prepare_cuda_call()

Prepare state for calling CUDA API functions.

Many CUDA APIs, like the CUDA driver API used by CUDA.jl, use implicit thread-local state to determine, e.g., which device to use. With Julia however, code is grouped in tasks. Execution can switch between them, and tasks can be executing on (and in the future migrate between) different threads. To synchronize these two worlds, call this function before any CUDA API call to update thread-local state based on the current task and its context.

If you need to maintain your own task-local state, subscribe to device and task switch events using CUDA.atdeviceswitch and CUDA.attaskswitch for proper invalidation. If your state is device-specific, but global (i.e. not task-bound), it suffices to index your state with the current deviceid() and invalidate that state when the device is reset by subscribing to CUDA.atdevicereset().

CUDA.prevwarpMethod
nextwarp(dev, threads)
prevwarp(dev, threads)

Returns the next or previous nearest number of threads that is a multiple of the warp size of a device dev. This is a common requirement when using intra-warp communication.

CUDA.priorityMethod
priority(s::CuStream)

Return the priority of a stream s.

CUDA.priority_rangeMethod
priority_range()

Return the valid range of stream priorities as a StepRange (with step size 1). The lower bound of the range denotes the least priority (typically 0), with the upper bound representing the greatest possible priority (typically -1).

CUDA.queryMethod
query(e::CuEvent)

Return false if there is outstanding work preceding the most recent call to record(e) and true if all captured work has been completed.

CUDA.queryMethod
query(s::CuStream)

Return false if a stream is busy (has tasks running or queued) and true if that stream is free.

CUDA.reclaimFunction
reclaim([sz=typemax(Int)])

Reclaims sz bytes of cached memory. Use this to free GPU memory before calling into functionality that does not use the CUDA memory pool. Returns the number of bytes actually reclaimed.

CUDA.recordFunction
record(e::CuEvent, stream=CuDefaultStream())

Record an event on a stream.

CUDA.registersMethod
registers(k::HostKernel)

Queries the register usage of a kernel.

CUDA.releaseMethod
release()

Returns the CUDA release part of the version as returned by version.

CUDA.setflags!Method
setflags!(pctx::CuPrimaryContext)

Set the flags of a primary context.

CUDA.shfl_down_syncFunction
shfl_down_sync(threadmask::UInt32, val, delta::Integer, width::Integer=32)

Shuffle a value from a lane with higher ID relative to caller, and synchronize threads according to threadmask.

CUDA.shfl_recurseMethod
shfl_recurse(op, x::T)::T

Register how a shuffle operation op should be applied to a value x of type T that is not natively supported by the shuffle intrinsics.

CUDA.shfl_syncFunction
shfl_sync(threadmask::UInt32, val, lane::Integer, width::Integer=32)

Shuffle a value from a directly indexed lane lane, and synchronize threads according to threadmask.

CUDA.shfl_up_syncFunction
shfl_up_sync(threadmask::UInt32, val, delta::Integer, width::Integer=32)

Shuffle a value from a lane with lower ID relative to caller, and synchronize threads according to threadmask.

CUDA.shfl_xor_syncFunction
shfl_xor_sync(threadmask::UInt32, val, mask::Integer, width::Integer=32)

Shuffle a value from a lane based on bitwise XOR of own lane ID with mask, and synchronize threads according to threadmask.
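To illustrate the XOR lane selection, the following host-side emulation in plain Julia models one full warp of 32 lanes as a vector. It runs on the CPU and does not call CUDA.jl; the names `emulate_shfl_xor` and `emulate_warp_sum` are hypothetical. On the device, the corresponding step would presumably read `val += shfl_xor_sync(0xffffffff, val, mask)`.

```julia
# Host-side emulation of one full warp (32 lanes), for illustration only.
# Lanes are numbered 0..31 as on the device; Julia arrays are 1-based.
const WARP = 32

# Each lane reads the value held by lane (own_id ⊻ mask); assumes 0 <= mask < 32.
emulate_shfl_xor(vals::Vector, mask::Integer) =
    [vals[(lane ⊻ mask) + 1] for lane in 0:WARP-1]

# Butterfly reduction: after log2(32) = 5 exchange steps every lane holds the total.
function emulate_warp_sum(vals::Vector)
    mask = WARP ÷ 2
    while mask > 0
        vals = vals .+ emulate_shfl_xor(vals, mask)
        mask ÷= 2
    end
    return vals
end

result = emulate_warp_sum(collect(1:32))
all(==(528), result)   # true, since sum(1:32) == 528
```

Because every lane exchanges with a partner at each step, all lanes end up with the reduced value, in contrast to a shfl_down_sync reduction where only the first lane holds the full result.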

CUDA.sync_gridMethod
sync_grid(grid_handle::Culonglong)

Waits until all threads in all blocks in the grid grid_handle have reached this point and all global memory accesses made by these threads prior to sync_grid() are visible to all threads in the grid. A 32-bit integer cudaError_t is returned.

CUDA.sync_threadsMethod
sync_threads()

Waits until all threads in the thread block have reached this point and all global and shared memory accesses made by these threads prior to sync_threads() are visible to all threads in the block.

CUDA.sync_threads_andMethod
sync_threads_and(predicate::Int32)

Identical to __syncthreads() with the additional feature that it evaluates predicate for all threads of the block and returns non-zero if and only if predicate evaluates to non-zero for all of them.

sync_threads_and(predicate::Bool)

Identical to __syncthreads() with the additional feature that it evaluates predicate for all threads of the block and returns true if and only if predicate evaluates to true for all of them.

CUDA.sync_threads_countMethod
sync_threads_count(predicate::Int32)

Identical to __syncthreads() with the additional feature that it evaluates predicate for all threads of the block and returns the number of threads for which predicate evaluates to non-zero.

sync_threads_count(predicate::Bool)

Identical to __syncthreads() with the additional feature that it evaluates predicate for all threads of the block and returns the number of threads for which predicate evaluates to true.

CUDA.sync_threads_orMethod
sync_threads_or(predicate::Int32)

Identical to __syncthreads() with the additional feature that it evaluates predicate for all threads of the block and returns non-zero if and only if predicate evaluates to non-zero for any of them.

sync_threads_or(predicate::Bool)

Identical to __syncthreads() with the additional feature that it evaluates predicate for all threads of the block and returns true if and only if predicate evaluates to true for any of them.

CUDA.sync_warpFunction
sync_warp(mask::Integer=0xffffffff)

Waits until threads in the warp, selected by means of the bitmask mask, have reached this point and all global and shared memory accesses made by these threads prior to sync_warp() are visible to those threads in the warp. The default value for mask selects all threads in the warp.

Note

Requires CUDA >= 9.0 and sm_6.2

CUDA.synchronizeMethod
synchronize(e::CuEvent)

Waits for an event to complete.

CUDA.synchronizeMethod
synchronize(s::CuStream)

Wait until a stream's tasks are completed.

CUDA.synchronizeMethod
synchronize()

Block for the current context's tasks to complete.

CUDA.this_gridMethod
this_grid()

Returns a grid_handle of the grid group this thread belongs to. Only available if a cooperative kernel is launched.

CUDA.threadIdxMethod
threadIdx()::CuDim3

Returns the thread index within the block.

CUDA.threadfenceMethod
threadfence()

A memory fence that acts as threadfence_block for all threads in the block of the calling thread and also ensures that no writes to all memory made by the calling thread after the call to threadfence() are observed by any thread in the device as occurring before any write to all memory made by the calling thread before the call to threadfence().

Note that for this ordering guarantee to be true, the observing threads must truly observe the memory and not cached versions of it; this requires the use of volatile loads and stores, which is not available from Julia right now.

CUDA.threadfence_blockMethod
threadfence_block()

A memory fence that ensures that:

  • All writes to all memory made by the calling thread before the call to threadfence_block() are observed by all threads in the block of the calling thread as occurring before all writes to all memory made by the calling thread after the call to threadfence_block().
  • All reads from all memory made by the calling thread before the call to threadfence_block() are ordered before all reads from all memory made by the calling thread after the call to threadfence_block().

CUDA.threadfence_systemMethod
threadfence_system()

A memory fence that acts as threadfence_block for all threads in the block of the calling thread and also ensures that all writes to all memory made by the calling thread before the call to threadfence_system() are observed by all threads in the device, host threads, and all threads in peer devices as occurring before all writes to all memory made by the calling thread after the call to threadfence_system().

CUDA.toolkit_originMethod
toolkit_origin()

Returns the origin of the CUDA toolkit in use (either :artifact or :local).

CUDA.total_memoryMethod
total_memory()

Returns the total amount of memory (in bytes) available for allocation by the CUDA context.

CUDA.totalmemMethod
totalmem(dev::CuDevice)

Returns the total amount of memory (in bytes) on the device.

CUDA.unsafe_destroy!Method
unsafe_destroy!(ctx::CuContext)

Immediately destroy a context, freeing up all resources associated with it. This does not respect any users of the context, and might make other objects unusable.

CUDA.unsafe_release!Method
CUDA.unsafe_release!(ctx::CuContext)

Lower the refcount of a context, possibly freeing up all resources associated with it. This does not respect any users of the context, and might make other objects unusable.

CUDA.unsafe_reset!Method
unsafe_reset!(pctx::CuPrimaryContext)

Explicitly destroys and cleans up all resources associated with a device's primary context in the current process. Note that this forcibly invalidates all contexts derived from this primary context, and as a result outstanding resources might become invalid.

CUDA.versionMethod
version(k::HostKernel)

Queries the PTX and SM versions a kernel was compiled for. Returns a named tuple.

CUDA.versionMethod
version()

Returns the CUDA version as reported by the driver.

CUDA.vote_allMethod
vote_all(predicate::Bool)

Evaluate predicate for all active threads of the warp and return non-zero if and only if predicate evaluates to non-zero for all of them.

CUDA.vote_anyMethod
vote_any(predicate::Bool)

Evaluate predicate for all active threads of the warp and return non-zero if and only if predicate evaluates to non-zero for any of them.

CUDA.vote_ballotMethod
vote_ballot(predicate::Bool)

Evaluate predicate for all active threads of the warp and return an integer whose Nth bit is set if and only if predicate evaluates to non-zero for the Nth thread of the warp and the Nth thread is active.
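The bit layout of the ballot result can be sketched with a host-side emulation in plain Julia. It models a warp of 32 lanes that are all assumed active and does not call CUDA.jl; the function name `emulate_ballot` is hypothetical.

```julia
# Host-side emulation of a ballot over one warp of 32 lanes (all assumed active).
# Bit N of the result (counting from 0) is set when lane N's predicate is true.
function emulate_ballot(predicates::AbstractVector{Bool})
    length(predicates) == 32 || throw(ArgumentError("expected 32 lanes"))
    mask = UInt32(0)
    for (lane, p) in enumerate(predicates)   # lane runs 1..32 here
        p && (mask |= UInt32(1) << (lane - 1))
    end
    return mask
end

preds  = [lane % 2 == 0 for lane in 0:31]   # even-numbered lanes vote true
ballot = emulate_ballot(preds)              # 0x55555555
count_ones(ballot)                          # 16
```

In this model, the vote_all and vote_any results correspond to `ballot == 0xffffffff` and `ballot != 0` respectively.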

CUDA.waitFunction
wait(e::CuEvent, stream=CuDefaultStream())

Make a stream wait on an event. This only makes the stream wait, and not the host; use synchronize(::CuEvent) for that.

CUDA.warpsizeMethod
warpsize(dev::CuDevice)

Returns the warp size (in threads) of the device.

CUDA.warpsizeMethod
warpsize()::UInt32

Returns the warp size (in threads).

CUDA.@allocatedMacro
@allocated

A macro to evaluate an expression, discarding the resulting value, instead returning the total number of bytes allocated during evaluation of the expression.

CUDA.@atomicMacro
@atomic a[I] = op(a[I], val)
@atomic a[I] ...= val

Atomically perform a sequence of operations that loads an array element a[I], performs the operation op on that value and a second value val, and writes the result back to the array. This sequence can be written out as a regular assignment, in which case the same array element should be used in the left and right hand side of the assignment, or as an in-place application of a known operator. In both cases, the array reference should be pure and not induce any side-effects.

Warning

This interface is experimental, and might change without warning. Use the lower-level atomic_...! functions for a stable API.
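A minimal usage sketch of the in-place form, assuming a CUDA-capable GPU and the CUDA.jl version documented here. The kernel name `count_kernel` and the launch sizes are illustrative.

```julia
using CUDA

# Every thread increments the same element; without @atomic the
# read-modify-write sequences of different threads could interleave.
function count_kernel(counter)
    CUDA.@atomic counter[1] += Int32(1)
    return nothing
end

counter = CUDA.zeros(Int32, 1)
@cuda threads=256 blocks=4 count_kernel(counter)
total = Array(counter)[1]   # copying back waits for the kernel; 256 * 4 = 1024
```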

CUDA.@cuDynamicSharedMemMacro
@cuDynamicSharedMem(T::Type, dims, offset::Integer=0) -> CuDeviceArray{T,AS.Shared}

Get an array of type T and dimensions dims (either an integer length or tuple shape) pointing to a dynamically-allocated piece of shared memory. The type should be statically inferable or an error will be thrown and the generator function will be called dynamically.

Note that the amount of dynamic shared memory needs to be specified when launching the kernel.

Optionally, an offset parameter indicating how many bytes to add to the base shared memory pointer can be specified. This is useful when dealing with a heterogeneous buffer of dynamic shared memory; in the case of a homogeneous multi-part buffer it is preferred to use view.
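As a sketch of how the macro and the shmem launch argument fit together, the following reverses a vector within a single block. It assumes a CUDA-capable GPU and the CUDA.jl version documented here; the kernel name `reverse_kernel` and the problem size are illustrative.

```julia
using CUDA

# Reverse a vector within a single block, staging it in dynamic shared memory.
function reverse_kernel(a::CuDeviceArray{T}, n) where {T}
    t  = threadIdx().x
    tr = n - t + 1                    # mirrored index
    s  = @cuDynamicSharedMem(T, n)    # backed by memory sized at launch time
    s[t] = a[t]
    sync_threads()                    # make all writes to s visible to the block
    a[t] = s[tr]
    return nothing
end

n   = 64
a   = collect(Float32, 1:n)
d_a = CuArray(a)
# shmem is given in bytes and must cover the requested array.
@cuda threads=n shmem=n*sizeof(Float32) reverse_kernel(d_a, n)
reversed_ok = Array(d_a) == reverse(a)
```

The barrier is needed because each thread reads an element that a different thread wrote to shared memory.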

CUDA.@cuStaticSharedMemMacro
@cuStaticSharedMem(T::Type, dims) -> CuDeviceArray{T,AS.Shared}

Get an array of type T and dimensions dims (either an integer length or tuple shape) pointing to a statically-allocated piece of shared memory. The type should be statically inferable and the dimensions should be constant, or an error will be thrown and the generator function will be called dynamically.

CUDA.@cuassertMacro
@cuassert cond [text]

Signal assertion failure to the CUDA driver if cond is false. Preferred syntax for writing assertions, mimicking Base.@assert. Message text is optionally displayed upon assertion failure.

Warning

A failed assertion will crash the GPU, so use sparingly as a debugging tool. Furthermore, the assertion might be disabled at various optimization levels, and thus should not cause any side-effects.

CUDA.@cudaMacro
@cuda [kwargs...] func(args...)

High-level interface for executing code on a GPU. The @cuda macro should prefix a call, with func a callable function or object that should return nothing. It will be compiled to a CUDA function upon first use, and to a certain extent arguments will be converted and managed automatically using cudaconvert. Finally, a call to cudacall is performed, scheduling a kernel launch on the current CUDA context.

Several keyword arguments are supported that influence the behavior of @cuda.

The underlying operations (argument conversion, kernel compilation, kernel call) can be performed explicitly when more control is needed, e.g. to reflect on the resource usage of a kernel to determine the launch configuration. A host-side kernel launch is done as follows:

args = ...
GC.@preserve args begin
    kernel_args = cudaconvert.(args)
    kernel_tt = Tuple{Core.Typeof.(kernel_args)...}
    kernel = cufunction(f, kernel_tt; compilation_kwargs)
    kernel(kernel_args...; launch_kwargs)
end

A device-side launch, also known as dynamic parallelism, is similar but more restricted:

args = ...
# GC.@preserve is not supported
# we're on the device already, so no need to cudaconvert
kernel_tt = Tuple{Core.Typeof(args[1]), ...}    # this needs to be fully inferred!
kernel = dynamic_cufunction(f, kernel_tt)       # no compiler kwargs supported
kernel(args...; launch_kwargs)

CUDA.@cuprintMacro
@cuprint(xs...)
@cuprintln(xs...)

Print a textual representation of values xs to standard output from the GPU. The functionality builds on @cuprintf, and is intended as a more user-friendly alternative to that API. However, that also means there's only limited support for argument types: 16, 32 and 64-bit signed and unsigned integers, 32 and 64-bit floating-point numbers, chars and pointers. For more complex output, use @cuprintf directly.

Limited string interpolation is also possible:

    @cuprint("Hello, World ", 42, "\n")
    @cuprint "Hello, World $(42)\n"

CUDA.@cuprintfMacro
@cuprintf("%Fmt", args...)

Print a formatted string in device context on the host standard output.

Note that this is not a fully C-compliant printf implementation; see the CUDA documentation for supported options and inputs.

Also beware that it is an untyped and unforgiving printf implementation. Type widths need to match, e.g., printing a 64-bit Julia integer requires the %ld formatting string.

CUDA.@cuprintlnMacro
@cuprint(xs...)
@cuprintln(xs...)

Print a textual representation of values xs to standard output from the GPU. The functionality builds on @cuprintf, and is intended as a more user-friendly alternative to that API. However, that also means there's only limited support for argument types: 16, 32 and 64-bit signed and unsigned integers, 32 and 64-bit floating-point numbers, chars and pointers. For more complex output, use @cuprintf directly.

Limited string interpolation is also possible:

    @cuprint("Hello, World ", 42, "\n")
    @cuprint "Hello, World $(42)\n"

CUDA.@cushowMacro
@cushow(ex)

GPU analog of Base.@show. It comes with the same type restrictions as @cuprintf.

@cushow threadIdx().x

CUDA.@elapsedMacro
@elapsed stream ex
@elapsed ex

A macro to evaluate an expression, discarding the resulting value, instead returning the number of seconds it took to execute on the GPU, as a floating-point number.

CUDA.@profileMacro
@profile ex

Run expressions while activating the CUDA profiler.

Note that this API is used to programmatically control the profiling granularity by allowing profiling to be done only on selected pieces of code. It does not perform any profiling by itself; you need external tools for that.

CUDA.@retry_reclaimMacro
@retry_reclaim isfailed(ret) ex

Run a block of code ex repeatedly until it successfully allocates the memory it needs. Retries are only attempted while isfailed, called with the current return value, returns true. Each attempt frees more memory from the CUDA memory pool. When no more memory can be freed, the most recent return value is returned.

This macro is intended for use with CUDA APIs, which sometimes allocate (outside of the CUDA memory pool) and return a specific error code when failing to.

CUDA.@syncMacro
@sync [blocking=true] ex

Run expression ex and synchronize the GPU afterwards. By default, this is a CPU-friendly synchronization, i.e. it performs a blocking synchronization without increasing CPU load. As such, this operation is preferred over implicit synchronization (e.g. when performing a memory copy) for high-performance applications.

It is also useful for timing code that executes asynchronously.

CUDA.@timeMacro
@time ex

Run expression ex and report on execution time and GPU/CPU memory behavior. The GPU is synchronized right before and after executing ex to exclude any external effects.