Docstrings · CUDA.jl

CUDA.AbstractKernel — Type

(::HostKernel)(args...; kwargs...)
(::DeviceKernel)(args...; kwargs...)

Low-level interface to call a compiled kernel, passing GPU-compatible arguments in args. For a higher-level interface, use @cuda.

The following keyword arguments are supported:

threads (defaults to 1)
blocks (defaults to 1)
shmem (defaults to 0)
config: callback function to dynamically compute the launch configuration. It should accept a HostKernel and return a named tuple with any of the above as fields. This functionality is intended to be used in combination with the CUDA occupancy API.
stream (defaults to the default stream)
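How a config callback interacts with the defaults may be easier to see in a host-only model. The sketch below does not call CUDA.jl: FakeKernel, launch_settings and my_config are hypothetical names, and the rule that the callback's fields override the keyword values is an assumption about the intent, not a description of the package's implementation.

```julia
# Host-only model; nothing here touches the GPU or CUDA.jl.
struct FakeKernel            # hypothetical stand-in for a compiled kernel
    max_threads::Int
end

# Merge the documented defaults, explicit keyword arguments, and the
# callback's result. Later arguments to `merge` win, so the callback
# has the last word (assumed behavior).
function launch_settings(kernel; config = nothing, kwargs...)
    defaults = (threads = 1, blocks = 1, shmem = 0)
    settings = merge(defaults, (; kwargs...))
    return config === nothing ? settings : merge(settings, config(kernel))
end

# A callback may return any subset of the supported fields.
my_config(kernel) = (threads = kernel.max_threads,)

k = FakeKernel(256)
plain = launch_settings(k)                                  # all defaults
tuned = launch_settings(k; blocks = 4, config = my_config)  # threads set by the callback
```

In this model the callback only needs to return the fields it wants to control; everything else falls back to the keyword arguments or the defaults.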

CUDA.Const — Type

Const(A::CuDeviceArray)

Mark a CuDeviceArray as constant/read-only. The invariant guaranteed is that you will not modify a CuDeviceArray for the duration of the current kernel.

This API can only be used on devices with compute capability 3.5 or higher.

Warning

Experimental API. Subject to change without deprecation.

CUDA.CuContext — Type

CuContext(dev::CuDevice, flags=CTX_SCHED_AUTO)
CuContext(f::Function, ...)

Create a CUDA context for the device dev. A context on the GPU is analogous to a process on the CPU, with its own distinct address space and allocated resources. When a context is destroyed, the system cleans up the resources allocated to it.

When you are done using the context, call unsafe_destroy! to mark it for deletion, or use do-block syntax with this constructor.

CUDA.CuContext — Method

CuContext(pctx::CuPrimaryContext)

Retain the primary context on the GPU, returning a context compatible with the driver API. The primary context will be released when the returned driver context is finalized.

As these contexts are refcounted by CUDA, you should not call CUDA.unsafe_destroy! on them but use CUDA.unsafe_release! instead (available with do-block syntax as well).

CUDA.CuDevice — Type

CuDevice(i::Integer)

Get a handle to a compute device.

CUDA.CuDevice — Method

CuDevice(::CuContext)

Returns the device for a context.

CUDA.CuDeviceArray — Type

CuDeviceArray(dims, ptr)
CuDeviceArray{T}(dims, ptr)
CuDeviceArray{T,A}(dims, ptr)
CuDeviceArray{T,A,N}(dims, ptr)

Construct an N-dimensional dense CUDA device array with element type T wrapping a pointer, where N is determined from the length of dims and T is determined from the type of ptr. dims may be a single scalar, or a tuple of integers corresponding to the lengths in each dimension. If the rank N is supplied explicitly as in Array{T,N}(dims), then it must match the length of dims. The same applies to the element type T, which should match the type of the pointer ptr.

CUDA.CuDeviceTexture — Type

CuDeviceTexture{T,N,NC}

N-dimensional device texture with elements of type T. This type is the device-side counterpart of CuTexture, and can be used to access textures using regular indexing notation. If NC is true, indices used by these accesses should be normalized, i.e., fall into the [0,1) domain.

Device-side texture objects cannot be created directly, but should be created host-side using CuTexture and passed to the kernel as an argument.

Warning

Experimental API. Subject to change without deprecation.

CUDA.CuDim3 — Type

CuDim3(x)

CuDim3((x,))
CuDim3((x, y))
CuDim3((x, y, z))

A type used to specify dimensions, consisting of 3 integers for respectively the x, y and z dimension. Unspecified dimensions default to 1.

Often accepted as an argument through the CuDim type alias, e.g., in the case of cudacall or launch, which allows dimensions to be passed as a plain integer or a tuple without having to construct an explicit CuDim3 object.
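The padding rule for unspecified dimensions can be modeled in plain Julia. This is only a sketch of the behavior described above: Dim3 is a hypothetical type defined for illustration and is not the CuDim3 type from the package.

```julia
# Host-only model of the padding rule; `Dim3` is a hypothetical name.
struct Dim3
    x::Int
    y::Int
    z::Int
end

# Accept a plain integer or a tuple of 1 to 3 integers;
# entries that are not given become 1.
Dim3(x::Integer) = Dim3(x, 1, 1)
Dim3(dims::NTuple{N,Integer}) where {N} = Dim3(dims..., ntuple(_ -> 1, 3 - N)...)

d1 = Dim3(8)           # x = 8, y = 1, z = 1
d2 = Dim3((4, 2))      # x = 4, y = 2, z = 1
d3 = Dim3((4, 2, 3))   # all three dimensions given
```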

CUDA.CuError — Type

CuError(code)
CuError(code, meta)

Create a CUDA error object with error code code. The optional meta parameter indicates whether extra information, such as error logs, is known.

CUDA.CuEvent — Type

CuEvent()

Create a new CUDA event.

CUDA.CuFunction — Type

CuFunction(mod::CuModule, name::String)

Acquires a function handle from a named function in a module.

CUDA.CuGlobal — Type

CuGlobal{T}(mod::CuModule, name::String)

Acquires a typed global variable handle from a named global in a module.

CUDA.CuIterator — Type

CuIterator(batches)

Return a CuIterator that can iterate through the provided batches via Base.iterate.

Upon each iteration, the current batch is adapted to the GPU (via map(x -> adapt(CuArray, x), batch)) and the previous iteration is marked as freeable from GPU memory (via unsafe_free!).

This abstraction is useful for batching data into GPU memory in a manner that allows old iterations to potentially be freed (or marked as reusable) earlier than they otherwise would via CuArray's internal polling mechanism.
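The upload-then-release pattern described above can be sketched without a GPU. The model below is an illustration under stated assumptions: FreeingIterator is a hypothetical type, copy stands in for adapting a batch to the GPU, and a counter stands in for unsafe_free!. It does not reproduce the package's implementation.

```julia
# Host-only model of the iteration pattern; all names are hypothetical.
mutable struct FreeingIterator{B}
    batches::B
    previous::Any    # the batch handed out by the last iteration, if any
    released::Int    # how many batches have been released so far
end
FreeingIterator(batches) = FreeingIterator(batches, nothing, 0)

# The number of batches is not known up front in general.
Base.IteratorSize(::Type{<:FreeingIterator}) = Base.SizeUnknown()

function Base.iterate(it::FreeingIterator, state...)
    item = iterate(it.batches, state...)
    # Release the batch from the previous iteration before moving on.
    if it.previous !== nothing
        it.released += 1          # stand-in for unsafe_free!
        it.previous = nothing
    end
    item === nothing && return nothing
    batch, next_state = item
    it.previous = copy(batch)     # stand-in for adapting the batch to the GPU
    return it.previous, next_state
end

it = FreeingIterator([[1, 2], [3, 4], [5, 6]])
sums = [sum(batch) for batch in it]
```

Each batch is released as soon as the loop asks for the next one, so at most one batch is held at a time in this model.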

CUDA.CuLink — Type

CuLink()

Creates a pending JIT linker invocation.

CUDA.CuLinkImage — Type

The result of a linking operation.

This object keeps its parent linker object alive, as destroying a linker destroys linked images too.

CUDA.CuModule — Type

CuModule(data, options::Dict{CUjit_option,Any})
CuModuleFile(path, options::Dict{CUjit_option,Any})

Create a CUDA module from data, or from a file containing data. The data may be PTX code, a CUBIN, or a FATBIN.

The options is an optional dictionary of JIT options and their respective value.

CUDA.CuModule — Method

CuModule(img::CuLinkImage, ...)

Create a CUDA module from a completed linking operation. Options from CuModule apply.

CUDA.CuPrimaryContext — Type

CuPrimaryContext(dev::CuDevice)

Create a primary CUDA context for a given device.

Each primary context is unique per device and is shared with the CUDA runtime API. It is meant for interoperability with (applications using) the runtime API.

CUDA.CuPtr — Type

CuPtr{T}

A memory address that refers to data of type T that is accessible from the GPU. A CuPtr is ABI compatible with regular Ptr objects, e.g. it can be used to ccall a function that expects a Ptr to GPU memory, but it prevents erroneous conversions between the two.

CUDA.CuStream — Method

CuStream(; flags=STREAM_DEFAULT, priority=nothing)

Create a CUDA stream.

CUDA.CuTexture — Type

CuTexture{T,N,P}

N-dimensional texture object with elements of type T. These objects do not store data themselves, but are bound to another source of device memory. Texture objects can be passed to CUDA kernels, where they will be accessible through the CuDeviceTexture type.

Warning

Experimental API. Subject to change without deprecation.

CUDA.CuTexture — Method

CuTexture(x::CuArray{T,N})

Create an N-dimensional texture object that reads from a CuArray.

Note that the memory of the array must be well aligned and strided (good pitch). Currently, this is not enforced.

Warning

Experimental API. Subject to change without deprecation.

CUDA.CuTexture — Method

CuTexture(x::CuTextureArray{T,N})

Create an N-dimensional texture object with elements of type T that will be read from x.

Warning

Experimental API. Subject to change without deprecation.

CUDA.CuTexture — Method

CuTexture{T,N,P}(parent::P; address_mode, filter_mode, normalized_coordinates)

Construct an N-dimensional texture object with elements of type T as stored in parent.

Several keyword arguments alter the behavior of texture objects:

address_mode (wrap, clamp, mirror): how out-of-bounds values are accessed. Can be specified as a value for all dimensions, or as a tuple of N entries.
filter_mode (point, linear): how non-integral indices are fetched. Point mode fetches a single value, linear results in linear interpolation between values.
normalized_coordinates (true, false): whether indices are expected to fall in the normalized [0,1) range.

Warning

Experimental API. Subject to change without deprecation.

CUDA.CuTextureArray — Type

CuTextureArray{T,N}(undef, dims)

N-dimensional dense texture array with elements of type T. These arrays are optimized for texture fetching, and are only meant to be used as a source for CuTexture objects.

Warning

Experimental API. Subject to change without deprecation.

CUDA.CuTextureArray — Method

CuTextureArray(A::AbstractArray)

Allocate and initialize a texture buffer from host memory in A.

Warning

Experimental API. Subject to change without deprecation.

CUDA.CuTextureArray — Method

CuTextureArray(A::CuArray)

Allocate and initialize a texture buffer from device memory in A.

Warning

Experimental API. Subject to change without deprecation.

CUDA.CuTextureArray — Method

CuTextureArray{T,N}(undef, dims)

Construct an uninitialized texture array of N dimensions specified in the dims tuple, with elements of type T. Use Base.copyto! to initialize this texture array, or use constructors that take a non-texture array to do so automatically.

Warning

Experimental API. Subject to change without deprecation.

CUDA.DeviceKernel — Type

(::HostKernel)(args...; kwargs...)
(::DeviceKernel)(args...; kwargs...)

Low-level interface to call a compiled kernel, passing GPU-compatible arguments in args. For a higher-level interface, use @cuda.

The following keyword arguments are supported:

threads (defaults to 1)
blocks (defaults to 1)
shmem (defaults to 0)
config: callback function to dynamically compute the launch configuration. It should accept a HostKernel and return a named tuple with any of the above as fields. This functionality is intended to be used in combination with the CUDA occupancy API.
stream (defaults to the default stream)

CUDA.DevicePtr — Type

DevicePtr{T,A}

A memory address that refers to data of type T that is accessible from the GPU. It is the on-device counterpart of CuPtr, additionally keeping track of the address space A where the data resides (shared, global, constant, etc). This information is used to provide optimized implementations of operations such as unsafe_load and unsafe_store!.

CUDA.HostKernel — Type

(::HostKernel)(args...; kwargs...)
(::DeviceKernel)(args...; kwargs...)

Low-level interface to call a compiled kernel, passing GPU-compatible arguments in args. For a higher-level interface, use @cuda.

The following keyword arguments are supported:

threads (defaults to 1)
blocks (defaults to 1)
shmem (defaults to 0)
config: callback function to dynamically compute the launch configuration. It should accept a HostKernel and return a named tuple with any of the above as fields. This functionality is intended to be used in combination with the CUDA occupancy API.
stream (defaults to the default stream)

CUDA.OutOfGPUMemoryError — Type

OutOfGPUMemoryError()

An operation allocated too much GPU memory for either the system or the memory pool to handle properly.

CUDA.PtrOrCuPtr — Type

PtrOrCuPtr{T}

A special pointer type, ABI-compatible with both Ptr and CuPtr, for use in ccall expressions to convert values to either a GPU or a CPU type (in that order). This is required for CUDA APIs which accept pointers that either point to host or device memory.

Base.eltype — Method

eltype(var::CuGlobal)

Return the element type of a global variable object.

Base.getindex — Method

Base.getindex(var::CuGlobal)

Return the current value of a global variable.

Base.pop! — Method

pop!(CuContext)

Pops the current CUDA context from the current CPU thread, and returns that context.

Base.push! — Method

push!(CuContext, ctx::CuContext)

Pushes a context on the current CPU thread.

Base.resize! — Method

resize!(a::CuVector, n::Int)

Resize a to contain n elements. If n is smaller than the current collection length, the first n elements will be retained. If n is larger, the new elements are not guaranteed to be initialized.

Several restrictions apply to which types of CuArrays can be resized:

the array should be backed by the memory pool, and not have been constructed with unsafe_wrap
the array cannot be derived (view, reshape) from another array
the array cannot have any derived arrays itself

Base.setindex! — Method

Base.setindex!(var::CuGlobal{T}, val::T)

Set the value of a global variable to val.

Base.unsafe_wrap — Method

unsafe_wrap(::CuArray, ptr::CuPtr{T}, dims; own=false, ctx=context())

Wrap a CuArray object around the data at the address given by ptr. The pointer element type T determines the array element type. dims is either an integer (for a 1d array) or a tuple of the array dimensions. own optionally specifies whether Julia should take ownership of the memory, calling cudaFree when the array is no longer referenced. The ctx argument determines the CUDA context in which the data is allocated.

CUDA.CuCurrentContext — Method

CuCurrentContext()

Return the current context, or nothing if there is no active context.

CUDA.CuCurrentDevice — Method

CuCurrentDevice()

Returns the current device, or nothing if there is no active device.

CUDA.CuDefaultStream — Method

CuDefaultStream()

Return the default stream.

CUDA.activate — Method

activate(ctx::CuContext)

Binds the specified CUDA context to the calling CPU thread.

CUDA.active_blocks — Method

active_blocks(fun::CuFunction, threads; shmem=0)

Calculate the maximum number of active blocks per multiprocessor when running threads threads of a kernel fun requiring shmem bytes of dynamic shared memory.

CUDA.add_data! — Method

add_data!(link::CuLink, name::String, data::Vector{UInt8}, type::CUjitInputType)

Add object code to a pending link operation.

CUDA.add_data! — Method

add_data!(link::CuLink, name::String, code::String)

Add PTX code to a pending link operation.

CUDA.add_file! — Method

add_file!(link::CuLink, path::String, typ::CUjitInputType)

Add data from a file to a link operation. The argument typ indicates the type of the contained data.

CUDA.alloc — Method

alloc(sz)

Allocate a number of bytes sz from the memory pool. Returns a CuPtr{Nothing}; may throw an OutOfGPUMemoryError if the allocation request cannot be satisfied.

CUDA.alloc_timings — Method

alloc_timings()

Show the timings of the CUDA allocator. Assumes CUDA.enable_timings() has been called.

CUDA.atdevicereset — Method

CUDA.atdevicereset(f::Function)

Register a function to be called after resetting devices. The function is passed one argument: the device which has been reset.

Use this hook to invalidate global state that depends on the current device.

CUDA.atdeviceswitch — Method

CUDA.atdeviceswitch(f::Function)

Use this hook to invalidate thread-local state that depends on the current device. If that state is also context dependent, be sure to query the context in your callback.

CUDA.atomic_add! — Function

atomic_add!(ptr::DevicePtr{T}, val::T)

Reads the value old located at address ptr, computes old + val, and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.

This operation is supported for values of type Int32, Int64, UInt32, UInt64, and Float32. Additionally, on GPU hardware with compute capability 6.0+, values of type Float64 are supported.
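Returning the old value is what makes fetch-and-add useful for handing out unique slots. As a rough host-side analogue, the sketch below uses Base.Threads.atomic_add!, which follows the same read, add, store, return-old pattern on the CPU. It does not call the device function documented here, and the counter and slots names are chosen only for this example.

```julia
using Base.Threads: Atomic, atomic_add!

# A shared counter; every call reserves the next free slot.
counter = Atomic{Int32}(0)

# Each call returns the value found before the addition,
# so no two callers can ever receive the same slot number.
slots = [atomic_add!(counter, Int32(1)) for _ in 1:4]

final_value = counter[]   # the counter itself has advanced by four
```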

CUDA.atomic_and! — Function

atomic_and!(ptr::DevicePtr{T}, val::T)

Reads the value old located at address ptr, computes old & val, and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.

This operation is supported for values of type Int32, Int64, UInt32 and UInt64.

CUDA.atomic_cas! — Function

atomic_cas!(ptr::DevicePtr{T}, cmp::T, val::T)

Reads the value old located at address ptr and compares it with cmp. If old equals cmp, stores val at the same address. Otherwise, leaves the value unchanged. These operations are performed in one atomic transaction. The function returns old.

This operation is supported for values of type Int32, Int64, UInt32 and UInt64.
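Compare-and-swap is typically used in a retry loop to build operations that have no dedicated atomic. The sketch below shows that pattern on the CPU with Base.Threads.atomic_cas!, which has the same compare, conditionally store, return-old contract. It is a host-side analogue, not device code, and cas_max! is a hypothetical helper.

```julia
using Base.Threads: Atomic, atomic_cas!

# Raise the stored value to at least `val`; return the value seen before.
function cas_max!(cell::Atomic{T}, val::T) where {T}
    old = cell[]
    while old < val
        seen = atomic_cas!(cell, old, val)  # returns the value found in `cell`
        seen == old && break                # the swap happened
        old = seen                          # lost a race; retry with the fresh value
    end
    return old
end

cell = Atomic{Int32}(5)
before_raise = cas_max!(cell, Int32(9))   # 9 is larger, so it is stored
after_raise = cell[]
before_noop = cas_max!(cell, Int32(3))    # 3 is smaller, nothing changes
after_noop = cell[]
```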

CUDA.atomic_dec! — Function

atomic_dec!(ptr::DevicePtr{T}, val::T)

Reads the value old located at address ptr, computes (((old == 0) | (old > val)) ? val : (old-1) ), and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns old.

This operation is only supported for values of type Int32.

CUDA.atomic_div! — Function

atomic_div!(ptr::DevicePtr{T}, val::T)

Reads the value old located at address ptr, computes /(old, val), and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.

This operation is supported on GPU hardware with compute capability 6.0+ for values of type Float32 and Float64.

CUDA.atomic_inc! — Function

atomic_inc!(ptr::DevicePtr{T}, val::T)

Reads the value old located at address ptr, computes ((old >= val) ? 0 : (old+1)), and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns old.

This operation is only supported for values of type Int32.
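The wrap-around behavior of atomic_inc! and atomic_dec! is easier to follow when the two update rules are applied repeatedly. The sketch below is a non-atomic, host-only model of the formulas quoted above; inc_rule, dec_rule and apply_repeatedly are hypothetical helpers.

```julia
# Host-only, non-atomic model of the documented update rules.
inc_rule(old, val) = old >= val ? zero(old) : old + one(old)
dec_rule(old, val) = (old == 0) | (old > val) ? val : old - one(old)

# Apply a rule n times and record every intermediate value.
function apply_repeatedly(rule, start, val, n)
    seq = [start]
    for _ in 1:n
        push!(seq, rule(seq[end], val))
    end
    return seq
end

up = apply_repeatedly(inc_rule, Int32(0), Int32(3), 5)     # counts up, wraps to 0 after 3
down = apply_repeatedly(dec_rule, Int32(0), Int32(3), 5)   # wraps to 3, then counts down
```

With val = 3, both rules cycle through the values 0 to 3, which is why they are often used for ring-buffer style counters.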

CUDA.atomic_max! — Function

atomic_max!(ptr::DevicePtr{T}, val::T)

Reads the value old located at address ptr, computes max(old, val), and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.

This operation is supported for values of type Int32, Int64, UInt32 and UInt64. Additionally, on GPU hardware with compute capability 6.0+, values of type Float32 and Float64 are supported.

CUDA.atomic_min! — Function

atomic_min!(ptr::DevicePtr{T}, val::T)

Reads the value old located at address ptr, computes min(old, val), and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.

This operation is supported for values of type Int32, Int64, UInt32 and UInt64. Additionally, on GPU hardware with compute capability 6.0+, values of type Float32 and Float64 are supported.

CUDA.atomic_mul! — Function

atomic_mul!(ptr::DevicePtr{T}, val::T)

Reads the value old located at address ptr, computes *(old, val), and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.

This operation is supported on GPU hardware with compute capability 6.0+ for values of type Float32 and Float64.

CUDA.atomic_or! — Function

atomic_or!(ptr::DevicePtr{T}, val::T)

Reads the value old located at address ptr, computes old | val, and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.

This operation is supported for values of type Int32, Int64, UInt32 and UInt64.

CUDA.atomic_sub! — Function

atomic_sub!(ptr::DevicePtr{T}, val::T)

Reads the value old located at address ptr, computes old - val, and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.

This operation is supported for values of type Int32, Int64, UInt32 and UInt64. Additionally, on GPU hardware with compute capability 6.0+, values of type Float32 and Float64 are supported.

CUDA.atomic_xchg! — Function

atomic_xchg!(ptr::DevicePtr{T}, val::T)

Reads the value old located at address ptr and stores val at the same address. These operations are performed in one atomic transaction. The function returns old.

This operation is supported for values of type Int32, Int64, UInt32 and UInt64.

CUDA.atomic_xor! — Function

atomic_xor!(ptr::DevicePtr{T}, val::T)

Reads the value old located at address ptr, computes old ⊻ val, and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.

This operation is supported for values of type Int32, Int64, UInt32 and UInt64.

CUDA.attaskswitch — Method

CUDA.attaskswitch(f::Function)

Use this hook to invalidate thread-local state that depends on the current task.

CUDA.attribute! — Method

attribute!(ptr::Union{Ptr,CuPtr}, attr, val)

Sets attribute attr on a pointer ptr to val.

CUDA.attribute — Method

attribute(dev::CuDevice, code)

Returns information about the device.

CUDA.attribute — Method

attribute(X, ptr::Union{Ptr,CuPtr}, attr)

Returns attribute attr about pointer ptr. The type of the returned value depends on the attribute, and as such must be passed as the X parameter.

CUDA.available_memory — Method

available_memory()

Returns the amount of memory (in bytes) available for allocation by the CUDA context.

CUDA.blockDim — Method

blockDim()::CuDim3

Returns the dimensions of the block.

CUDA.blockIdx — Method

blockIdx()::CuDim3

Returns the block index within the grid.

CUDA.capability — Method

capability(dev::CuDevice)

Returns the compute capability of the device.

CUDA.clock — Method

clock(UInt32)

Returns the value of a per-multiprocessor counter that is incremented every clock cycle.

CUDA.clock — Method

clock(UInt64)

Returns the value of a per-multiprocessor counter that is incremented every clock cycle.

CUDA.code_sass — Function

code_sass([io], f, types, cap::VersionNumber)

Prints the SASS code generated for the method matching the given generic function and type signature to io which defaults to stdout.

The following keyword arguments are supported:

cap: which device to generate code for
kernel: treat the function as an entry-point kernel
verbose: enable verbose mode, which displays code generation statistics

See also: @device_code_sass

CUDA.complete — Method

complete(link::CuLink)

Complete a pending linker invocation, returning an output image.

CUDA.context! — Method

context!(ctx::CuContext)

Bind the current host thread to the context ctx.

Note that the contexts used with this call should be previously acquired by calling context, and not arbitrary contexts created by calling the CuContext constructor.

CUDA.context! — Method

context!(f, ctx)

Sets the active context for the duration of f.

CUDA.context — Method

context()::CuContext

Get or create a CUDA context for the current thread (as opposed to CuCurrentContext which may return nothing if there is no context bound to the current thread).

CUDA.cudacall — Function

cudacall(f::CuFunction, types, values...; blocks::CuDim, threads::CuDim,
         cooperative=false, shmem=0, stream=CuDefaultStream())

ccall-like interface for launching a CUDA function f on a GPU.

For example:

# md is a CuModule that contains a compiled vadd kernel
vadd = CuFunction(md, "vadd")
a = rand(Float32, 10)
b = rand(Float32, 10)
# allocate device buffers and upload the inputs
ad = Mem.alloc(DeviceBuffer, 10*sizeof(Float32))
unsafe_copyto!(ad, convert(Ptr{Cvoid}, a), 10*sizeof(Float32))
bd = Mem.alloc(DeviceBuffer, 10*sizeof(Float32))
unsafe_copyto!(bd, convert(Ptr{Cvoid}, b), 10*sizeof(Float32))
c = zeros(Float32, 10)
cd = Mem.alloc(DeviceBuffer, 10*sizeof(Float32))

# launch with 10 threads in a single block, then download the result
cudacall(vadd, (CuPtr{Cfloat},CuPtr{Cfloat},CuPtr{Cfloat}), ad, bd, cd; threads=10)
unsafe_copyto!(convert(Ptr{Cvoid}, c), cd, 10*sizeof(Float32))

The blocks and threads arguments control the launch configuration, and should both consist of either an integer, or a tuple of 1 to 3 integers (omitted dimensions default to 1). The types argument can be either a tuple of types or a tuple type, the latter being slightly faster.
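The difference between a tuple of types and a tuple type is a plain Julia distinction and can be shown without a GPU. In the sketch below, as_tuple_type is a hypothetical helper that normalizes both spellings to a tuple type; it is not part of CUDA.jl.

```julia
# A tuple of types is a value; a tuple type is itself a type.
tuple_of_types = (Cfloat, Cfloat, Cint)
tuple_type = Tuple{Cfloat,Cfloat,Cint}

# Hypothetical helper that accepts either form.
as_tuple_type(types::Tuple) = Tuple{types...}
as_tuple_type(::Type{T}) where {T<:Tuple} = T

same = as_tuple_type(tuple_of_types) === as_tuple_type(tuple_type)
```

Passing the tuple type directly skips the conversion step, which is one plausible reading of why that form is described as slightly faster.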

CUDA.cudaconvert — Method

cudaconvert(x)

This function is called for every argument to be passed to a kernel, allowing it to be converted to a GPU-friendly format. By default, the function does nothing and returns the input object x as-is.

Do not add methods to this function, but instead extend the underlying Adapt.jl package and register methods for the CUDA.Adaptor type.

CUDA.cufunction — Method

cufunction(f, tt=Tuple{}; kwargs...)

Low-level interface to compile a function invocation for the currently-active GPU, returning a callable kernel object. For a higher-level interface, use @cuda.

The following keyword arguments are supported:

minthreads: the required number of threads in a thread block
maxthreads: the maximum number of threads in a thread block
blocks_per_sm: a minimum number of thread blocks to be scheduled on a single multiprocessor
maxregs: the maximum number of registers to be allocated to a single thread (only supported on LLVM 4.0+)
name: override the name that the kernel will have in the generated code

The output of this function is automatically cached, i.e. you can simply call cufunction in a hot path without degrading performance. New code will be generated automatically when the function changes, or when different types or keyword arguments are provided.

CUDA.description — Method

description(err::CuError)

Gets the string description of an error code.

CUDA.device! — Function

device!(dev::Integer)
device!(dev::CuDevice)

Sets dev as the current active device for the calling host thread. Devices can be specified by integer id, or as a CuDevice (slightly faster).

If your library or code needs to perform an action when the active device changes, add a hook using CUDA.atdeviceswitch.

CUDA.device! — Method

device!(f, dev)

Sets the active device for the duration of f.

Note that this call is intended for temporarily switching devices, and does not change the default device used to initialize new threads or tasks.

CUDA.device — Method

device()::CuDevice

Get the CUDA device for the current thread, similar to how context() works compared to CuCurrentContext().

CUDA.device_reset! — Function

device_reset!(dev::CuDevice=device())

Reset the CUDA state associated with a device. This call will release the underlying context, at which point any objects allocated in that context will be invalidated.

If your library or code needs to perform an action when the active context changes, add a hook using CUDA.atdevicereset. Resetting the device will also cause subsequent API calls to fire the CUDA.atdeviceswitch hook.

CUDA.device_synchronize — Method

device_synchronize()

Wait for the device to finish. This is the device side version, and should not be called from the host.

device_synchronize acts as a synchronization point for child grids in the context of dynamic parallelism.

CUDA.deviceid — Function

deviceid()::Int
deviceid(dev::CuDevice)::Int

Get the ID number of the current device of execution. This is a 0-indexed number, corresponding to the device ID as known to CUDA.

CUDA.devices — Method

devices()

Get an iterator for the compute devices.

CUDA.dynamic_cufunction — Function

dynamic_cufunction(f, tt=Tuple{})

Low-level interface to compile a function invocation for the currently-active GPU, returning a callable kernel object. Device-side equivalent of CUDA.cufunction.

No keyword arguments are supported.

CUDA.elapsed — Method

elapsed(start::CuEvent, stop::CuEvent)

Computes the elapsed time between two events (in seconds).

CUDA.enable_timings — Method

enable_timings()

Enable the recording of debug timings.

CUDA.find_binary — Method

find_binary(name; locations=String[])

Similar to find_library, performs an exhaustive search for a binary in various subdirectories of locations, and finally PATH by using Sys.which.

CUDA.find_libcudadevrt — Method

find_libcudadevrt(toolkit_dirs::Vector{String})

Look for the CUDA device runtime library in any of the CUDA toolkit directories toolkit_dirs.

CUDA.find_libdevice — Method

find_libdevice(toolkit_dirs::Vector{String})

Look for the CUDA device library supporting targets in any of the CUDA toolkit directories toolkit_dirs. On CUDA >= 9.0, a single unified library is discovered and returned as a string. On older toolkits, individual libraries for each of the targets are returned as a vector of strings.

CUDA.find_library — Function

find_library(name, version; locations=String[])

Wrapper for Libdl.find_library, performing a more exhaustive search:

variants of the library name (including version numbers, platform-specific tags, etc);
various subdirectories of the locations list, and finally system library directories.

Returns the full path to the library.

CUDA.find_toolkit — Method

find_toolkit()::Vector{String}

Look for directories where (parts of) the CUDA toolkit might be installed. This returns a (possibly empty) list of paths that can be used as an argument to other discovery functions.

The behavior of this function can be overridden by defining the CUDA_PATH, CUDA_HOME or CUDA_ROOT environment variables, which should point to the root of the CUDA toolkit.
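A minimal sketch of reading such overrides from the environment is shown below. It is an illustration only: toolkit_roots_from_env is a hypothetical helper, the lookup order simply follows the order in which the variables are listed above, and the real discovery logic may validate or combine the values differently.

```julia
# Host-only sketch; collects the values of the override variables that are set.
function toolkit_roots_from_env(env = ENV)
    names = ("CUDA_PATH", "CUDA_HOME", "CUDA_ROOT")
    return [env[n] for n in names if haskey(env, n) && !isempty(env[n])]
end

# A Dict stands in for ENV so the example does not depend on the local machine.
fake_env = Dict("CUDA_HOME" => "/opt/cuda", "PATH" => "/usr/bin")
roots = toolkit_roots_from_env(fake_env)
none = toolkit_roots_from_env(Dict{String,String}())
```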

CUDA.flags — Method

flags(pctx::CuPrimaryContext)

Query the flags of a primary context.

CUDA.free — Method

free(ptr)

Releases a buffer pointed to by ptr to the memory pool.

CUDA.functional — Function

functional(show_reason=false)

Check if the package has been configured successfully and is ready to use.

This call is intended for packages that support conditionally using an available GPU. If you fail to check whether CUDA is functional, actual use of functionality might warn and error.

CUDA.gridDim — Method

gridDim()::CuDim3

Returns the dimensions of the grid.

CUDA.has_cuda — Function

has_cuda()::Bool

Check whether the local system provides an installation of the CUDA driver and toolkit. Use this function if your code loads packages that require CUDA.jl.

Note that CUDA-dependent packages might still fail to load if the installation is broken, so it's recommended to guard against that and print a warning to inform the user:

using CUDA
if has_cuda()
    try
        using CuArrays
    catch ex
        @warn "CUDA is installed, but CuArrays.jl fails to load" exception=(ex,catch_backtrace())
    end
end

CUDA.has_cuda_gpu — Function

has_cuda_gpu()::Bool

Check whether the local system provides an installation of the CUDA driver and toolkit, and if it contains a CUDA-capable GPU. See has_cuda for more details.

Note that this function initializes the CUDA API in order to check for the number of GPUs.

CUDA.initializer — Method

initializer(f::Function)

CUDA.isactive — Method

isactive(pctx::CuPrimaryContext)

Query whether a primary context is active.

CUDA.launch — Method

launch(f::CuFunction, args...; blocks::CuDim=1, threads::CuDim=1,
       cooperative=false, shmem=0, stream=CuDefaultStream())

Low-level call to launch a CUDA function f on the GPU, using blocks and threads as respectively the grid and block configuration. Dynamic shared memory is allocated according to shmem, and the kernel is launched on stream stream.

Arguments to a kernel should either be of a bits type, in which case they will be copied to the internal kernel parameter buffer, or a pointer to device memory.

This is a low-level call, prefer to use cudacall instead.

CUDA.launch_configuration — Method

launch_configuration(fun::CuFunction; shmem=0, max_threads=0)

Calculate a suggested launch configuration for kernel fun requiring shmem bytes of dynamic shared memory. Returns a tuple with a suggested amount of threads, and the minimal amount of blocks to reach maximal occupancy. Optionally, the maximum amount of threads can be constrained using max_threads.

In the case of a variable amount of shared memory, pass a callable object for shmem instead, taking a single integer representing the block size and returning the amount of dynamic shared memory for that configuration.
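The callable form of shmem is just a function from block size to bytes. The host-only sketch below shows what such a callable could look like and how its result grows with the block size; the one-Float32-per-thread rule, the candidate sizes and the byte budget are arbitrary assumptions, and the filtering step is not how the occupancy API makes its choice.

```julia
# Shared memory request as a function of the block size (assumed rule:
# one Float32 of dynamic shared memory per thread).
shmem_for(threads) = threads * sizeof(Float32)

candidates = (64, 128, 256, 512, 1024)
requests = [(threads = t, shmem = shmem_for(t)) for t in candidates]

# Hypothetical per-block budget, used only to show the dependency.
limit = 2048
fitting = [r.threads for r in requests if r.shmem <= limit]
```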

CUDA.maxthreads — Method

maxthreads(k::HostKernel)

Queries the maximum amount of threads a kernel can use in a single block.

CUDA.memory — Method

memory(k::HostKernel)

Queries the local, shared and constant memory usage of a compiled kernel in bytes. Returns a named tuple.

CUDA.memory_status — Function

memory_status([io=stdout])

Report to io on the memory status of the current GPU and the active memory pool.

CUDA.name — Method

name(dev::CuDevice)

Returns an identifier string for the device.

CUDA.name — Method

name(err::CuError)

Gets the string representation of an error code.

This name can often be used as a symbol in source code to get an instance of this error. For example:

julia> err = CuError(1)
CuError(1, ERROR_INVALID_VALUE)

julia> name(err)
"ERROR_INVALID_VALUE"

julia> ERROR_INVALID_VALUE
CuError(1, ERROR_INVALID_VALUE)

CUDA.nanosleep — Method

nanosleep(t)

Puts a thread to sleep for a given amount of time t (in nanoseconds).

Note

Requires CUDA >= 10.0 and sm_6.2

CUDA.nextwarp — Method

nextwarp(dev, threads)
prevwarp(dev, threads)

Returns the next or previous nearest number of threads that is a multiple of the warp size of a device dev. This is a common requirement when using intra-warp communication.
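The rounding itself can be illustrated on the CPU. The sketch below assumes a warp size of 32 passed in explicitly, whereas the real functions take a device; the helper names round_up_to_warp and round_down_to_warp are hypothetical and not part of CUDA.jl.

```julia
# Hypothetical CPU-side stand-ins: the real functions look up the warp size
# of the device `dev`. A warp size of 32 is assumed by default here.
round_up_to_warp(threads::Integer, ws::Integer=32) = cld(threads, ws) * ws
round_down_to_warp(threads::Integer, ws::Integer=32) = fld(threads, ws) * ws

for threads in (33, 64, 100)
    println(threads, " => next ", round_up_to_warp(threads),
            ", previous ", round_down_to_warp(threads))
end
```

A thread count that is already a multiple of the warp size is left unchanged by both helpers.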

CUDA.occupancy — Method

occupancy(fun::CuFunction, threads; shmem=0)

Calculate the theoretical occupancy of launching threads threads of a kernel fun requiring shmem bytes of dynamic shared memory.
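Theoretical occupancy is commonly defined as the number of warps resident on a multiprocessor divided by the maximum number of warps it can hold. The sketch below works through that ratio with made-up device limits; the helper theoretical_occupancy and its keyword arguments are hypothetical, and the number of active blocks is taken as an input that would in practice come from the driver's occupancy API.

```julia
# Hypothetical helper illustrating the usual definition of theoretical
# occupancy: resident warps divided by the maximum warps per multiprocessor.
# The device limits are example values, not queried from a GPU.
function theoretical_occupancy(active_blocks, threads; warpsize=32,
                               max_threads_per_sm=2048)
    warps_per_block = cld(threads, warpsize)
    max_warps = div(max_threads_per_sm, warpsize)
    return active_blocks * warps_per_block / max_warps
end

println(theoretical_occupancy(4, 256))
```

With these example limits, four resident blocks of 256 threads fill half of the available warp slots.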

CUDA.pool_timings — Method

pool_timings()

Show the timings of the currently active memory pool. Assumes CUDA.enable_timings() has been called.

CUDA.prepare_cuda_call — Method

CUDA.prepare_cuda_call()

Prepare state for calling CUDA API functions.

Many CUDA APIs, like the CUDA driver API used by CUDA.jl, use implicit thread-local state to determine, e.g., which device to use. With Julia however, code is grouped in tasks. Execution can switch between them, and tasks can be executing on (and in the future migrate between) different threads. To synchronize these two worlds, call this function before any CUDA API call to update thread-local state based on the current task and its context.

If you need to maintain your own task-local state, subscribe to device and task switch events using CUDA.atdeviceswitch and CUDA.attaskswitch for proper invalidation. If your state is device-specific, but global (i.e. not task-bound), it suffices to index your state with the current deviceid() and invalidate that state when the device is reset by subscribing to CUDA.atdevicereset().

CUDA.prevwarp — Method

nextwarp(dev, threads)
prevwarp(dev, threads)

Returns the next or previous nearest number of threads that is a multiple of the warp size of a device dev. This is a common requirement when using intra-warp communication.

CUDA.priority — Method

priority(s::CuStream)

Return the priority of a stream s.

CUDA.priority_range — Method

priority_range()

Return the valid range of stream priorities as a StepRange (with step size 1). The lower bound of the range denotes the least priority (typically 0), with the upper bound representing the greatest possible priority (typically -1).

CUDA.query — Method

query(e::CuEvent)

Return false if there is outstanding work preceding the most recent call to record(e) and true if all captured work has been completed.

CUDA.query — Method

query(s::CuStream)

Return false if a stream is busy (has tasks running or queued) and true if that stream is free.

CUDA.reclaim — Function

reclaim([sz=typemax(Int)])

Reclaims sz bytes of cached memory. Use this to free GPU memory before calling into functionality that does not use the CUDA memory pool. Returns the number of bytes actually reclaimed.

CUDA.record — Function

record(e::CuEvent, stream=CuDefaultStream())

Record an event on a stream.

CUDA.registers — Method

registers(k::HostKernel)

Queries the register usage of a kernel.

CUDA.release — Method

release()

Returns the CUDA release part of the version as returned by version.

CUDA.setflags! — Method

setflags!(pctx::CuPrimaryContext)

Set the flags of a primary context.

CUDA.shfl_down_sync — Function

shfl_down_sync(threadmask::UInt32, val, delta::Integer, width::Integer=32)

Shuffle a value from a lane with higher ID relative to caller, and synchronize threads according to threadmask.

CUDA.shfl_recurse — Method

shfl_recurse(op, x::T)::T

Register how a shuffle operation op should be applied to a value x of type T that is not natively supported by the shuffle intrinsics.

CUDA.shfl_sync — Function

shfl_sync(threadmask::UInt32, val, lane::Integer, width::Integer=32)

Shuffle a value from a directly indexed lane lane, and synchronize threads according to threadmask.

CUDA.shfl_up_sync — Function

shfl_up_sync(threadmask::UInt32, val, delta::Integer, width::Integer=32)

Shuffle a value from a lane with lower ID relative to caller, and synchronize threads according to threadmask.

CUDA.shfl_xor_sync — Function

shfl_xor_sync(threadmask::UInt32, val, mask::Integer, width::Integer=32)

Shuffle a value from a lane based on bitwise XOR of own lane ID with mask, and synchronize threads according to threadmask.
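A typical use of the XOR pattern is a butterfly reduction, where every lane ends up holding the sum over the whole warp. The following is a CPU emulation only, assuming 32 lanes; butterfly_sum is a hypothetical helper, and on the GPU each step would instead exchange values through shfl_xor_sync.

```julia
# CPU emulation of a butterfly reduction across the lanes of one warp.
# Lanes are 0-based as on the GPU, while Julia arrays are 1-based, hence
# the `+ 1` when indexing.
function butterfly_sum(input::Vector{T}) where {T}
    width = length(input)          # assumed to be a power of two, e.g. 32
    vals = copy(input)
    mask = div(width, 2)
    while mask >= 1
        # all lanes read their partner's old value "at the same time", so
        # gather the exchanged values before adding them
        partner = [vals[xor(lane, mask) + 1] for lane in 0:width-1]
        vals = vals .+ partner
        mask = div(mask, 2)
    end
    return vals
end

result = butterfly_sum(collect(1:32))
println(result[1] == sum(1:32))
```

After log2(32) = 5 exchange steps every lane holds the same total, so no separate broadcast step is needed.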

CUDA.sync_grid — Method

sync_grid(grid_handle::Culonglong)

Waits until all threads in all blocks in the grid grid_handle have reached this point and all global memory accesses made by these threads prior to sync_grid() are visible to all threads in the grid. A 32-bit integer cudaError_t is returned.

CUDA.sync_threads — Method

sync_threads()

Waits until all threads in the thread block have reached this point and all global and shared memory accesses made by these threads prior to sync_threads() are visible to all threads in the block.

CUDA.sync_threads_and — Method

sync_threads_and(predicate::Int32)

Identical to __syncthreads() with the additional feature that it evaluates predicate for all threads of the block and returns non-zero if and only if predicate evaluates to non-zero for all of them.

sync_threads_and(predicate::Bool)

Identical to __syncthreads() with the additional feature that it evaluates predicate for all threads of the block and returns true if and only if predicate evaluates to true for all of them.

CUDA.sync_threads_count — Method

sync_threads_count(predicate::Int32)

Identical to __syncthreads() with the additional feature that it evaluates predicate for all threads of the block and returns the number of threads for which predicate evaluates to non-zero.

sync_threads_count(predicate::Bool)

Identical to __syncthreads() with the additional feature that it evaluates predicate for all threads of the block and returns the number of threads for which predicate evaluates to true.

CUDA.sync_threads_or — Method

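sync_threads_or(predicate::Int32)

Identical to __syncthreads() with the additional feature that it evaluates predicate for all threads of the block and returns non-zero if and only if predicate evaluates to non-zero for any of them.

sync_threads_or(predicate::Bool)

Identical to __syncthreads() with the additional feature that it evaluates predicate for all threads of the block and returns true if and only if predicate evaluates to true for any of them.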

CUDA.sync_warp — Function

sync_warp(mask::Integer=0xffffffff)

Waits until threads in the warp, selected by means of the bitmask mask, have reached this point and all global and shared memory accesses made by these threads prior to sync_warp() are visible to those threads in the warp. The default value for mask selects all threads in the warp.

Note

Requires CUDA >= 9.0 and sm_6.2

CUDA.synchronize — Method

synchronize(e::CuEvent)

Waits for an event to complete.

CUDA.synchronize — Method

synchronize(s::CuStream)

Wait until a stream's tasks are completed.

CUDA.synchronize — Method

synchronize()

Block until the current context's tasks have completed.

CUDA.this_grid — Method

this_grid()

Returns a grid_handle of the grid group this thread belongs to. Only available if a cooperative kernel is launched.

CUDA.threadIdx — Method

threadIdx()::CuDim3

Returns the thread index within the block.

CUDA.threadfence — Method

threadfence()

A memory fence that acts as threadfence_block for all threads in the block of the calling thread and also ensures that no writes to all memory made by the calling thread after the call to threadfence() are observed by any thread in the device as occurring before any write to all memory made by the calling thread before the call to threadfence().

Note that for this ordering guarantee to be true, the observing threads must truly observe the memory and not cached versions of it; this requires the use of volatile loads and stores, which is not available from Julia right now.

CUDA.threadfence_block — Method

threadfence_block()

A memory fence that ensures that:

All writes to all memory made by the calling thread before the call to threadfence_block() are observed by all threads in the block of the calling thread as occurring before all writes to all memory made by the calling thread after the call to threadfence_block().
All reads from all memory made by the calling thread before the call to threadfence_block() are ordered before all reads from all memory made by the calling thread after the call to threadfence_block().

CUDA.threadfence_system — Method

threadfence_system()

A memory fence that acts as threadfence_block for all threads in the block of the calling thread and also ensures that all writes to all memory made by the calling thread before the call to threadfence_system() are observed by all threads in the device, host threads, and all threads in peer devices as occurring before all writes to all memory made by the calling thread after the call to threadfence_system().

CUDA.toolkit_origin — Method

toolkit_origin()

Returns the origin of the CUDA toolkit in use (either :artifact or :local).

CUDA.toolkit_release — Method

toolkit_release()

Returns the CUDA release part of the version as returned by version.

CUDA.toolkit_version — Method

toolkit_version()

Returns the version of the CUDA toolkit in use.

CUDA.total_memory — Method

total_memory()

Returns the total amount of memory (in bytes) available for allocation by the CUDA context.

CUDA.totalmem — Method

totalmem(dev::CuDevice)

Returns the total amount of memory (in bytes) on the device.

CUDA.unsafe_destroy! — Method

unsafe_destroy!(ctx::CuContext)

Immediately destroy a context, freeing up all resources associated with it. This does not respect any users of the context, and might make other objects unusable.

CUDA.unsafe_release! — Method

CUDA.unsafe_release!(ctx::CuContext)

Lower the refcount of a context, possibly freeing up all resources associated with it. This does not respect any users of the context, and might make other objects unusable.

CUDA.unsafe_reset! — Method

unsafe_reset!(pctx::CuPrimaryContext)

Explicitly destroys and cleans up all resources associated with a device's primary context in the current process. Note that this forcibly invalidates all contexts derived from this primary context, and as a result outstanding resources might become invalid.

CUDA.version — Method

version(k::HostKernel)

Queries the PTX and SM versions a kernel was compiled for. Returns a named tuple.

CUDA.version — Method

version()

Returns the CUDA version as reported by the driver.

CUDA.vote_all — Method

vote_all(predicate::Bool)

Evaluate predicate for all active threads of the warp and return non-zero if and only if predicate evaluates to non-zero for all of them.

CUDA.vote_any — Method

vote_any(predicate::Bool)

Evaluate predicate for all active threads of the warp and return non-zero if and only if predicate evaluates to non-zero for any of them.

CUDA.vote_ballot — Method

vote_ballot(predicate::Bool)

Evaluate predicate for all active threads of the warp and return an integer whose Nth bit is set if and only if predicate evaluates to non-zero for the Nth thread of the warp and the Nth thread is active.
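The bit layout of the ballot result can be illustrated with a CPU emulation. This is only a sketch assuming a 32-lane warp; the helper ballot and its active argument are hypothetical and stand in for what the hardware does across lanes.

```julia
# CPU emulation: `predicates[n + 1]` is the predicate of lane n (0-based),
# and `active[n + 1]` says whether that lane takes part in the vote.
function ballot(predicates::Vector{Bool}, active::Vector{Bool})
    mask = UInt32(0)
    for lane in 0:length(predicates)-1
        if active[lane + 1] && predicates[lane + 1]
            mask |= UInt32(1) << lane   # set bit `lane` of the result
        end
    end
    return mask
end

preds  = [isodd(lane) for lane in 0:31]   # true for lanes 1, 3, 5, ...
active = fill(true, 32)
mask = ballot(preds, active)
println(string(mask, base=16))
println(count_ones(mask))                 # number of lanes that voted true
```

Combining the returned mask with count_ones is a common way to count how many lanes satisfied the predicate.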

CUDA.wait — Function

wait(e::CuEvent, stream=CuDefaultStream())

Make a stream wait on an event. This only makes the stream wait, and not the host; use synchronize(::CuEvent) for that.

CUDA.warpsize — Method

warpsize(dev::CuDevice)

Returns the warp size (in threads) of the device.

CUDA.warpsize — Method

warpsize()::UInt32

Returns the warp size (in threads).

CUDA.@allocated — Macro

@allocated

A macro to evaluate an expression, discarding the resulting value, instead returning the total number of bytes allocated during evaluation of the expression.

CUDA.@atomic — Macro

@atomic a[I] = op(a[I], val)
@atomic a[I] ...= val

Atomically perform a sequence of operations that loads an array element a[I], performs the operation op on that value and a second value val, and writes the result back to the array. This sequence can be written out as a regular assignment, in which case the same array element should be used in the left and right hand side of the assignment, or as an in-place application of a known operator. In both cases, the array reference should be pure and not induce any side-effects.

Warning

This interface is experimental, and might change without warning. Use the lower-level atomic_...! functions for a stable API.

CUDA.@cuDynamicSharedMem — Macro

@cuDynamicSharedMem(T::Type, dims, offset::Integer=0) -> CuDeviceArray{T,AS.Shared}

Get an array of type T and dimensions dims (either an integer length or tuple shape) pointing to a dynamically-allocated piece of shared memory. The type should be statically inferable or an error will be thrown and the generator function will be called dynamically.

Note that the amount of dynamic shared memory needs to be specified when launching the kernel.

Optionally, an offset parameter indicating how many bytes to add to the base shared memory pointer can be specified. This is useful when dealing with a heterogeneous buffer of dynamic shared memory; in the case of a homogeneous multi-part buffer it is preferred to use view.
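For a heterogeneous buffer, the byte offset and the total size to request at launch follow from the element sizes. The sketch below assumes a hypothetical layout of n Float64 values followed by m Int32 values; the variable names are illustrative, and the macro calls appear only as comments because they are valid in device code only.

```julia
# Hypothetical layout for one block: `n` Float64 values followed by `m` Int32
# values in a single dynamic shared memory buffer. The wider element type is
# placed first so the second array starts at a suitably aligned byte offset.
n, m = 64, 32
offset_ints = n * sizeof(Float64)              # byte offset of the Int32 part
shmem_bytes = offset_ints + m * sizeof(Int32)  # total to request at launch

# Inside the kernel, the two arrays would then be obtained roughly as:
#   floats = @cuDynamicSharedMem(Float64, n)
#   ints   = @cuDynamicSharedMem(Int32, m, offset_ints)
# and the launch would pass `shmem=shmem_bytes`.

println((offset_ints, shmem_bytes))
```

Computing both numbers on the host in one place keeps the offset used in the kernel and the size passed at launch consistent.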

CUDA.@cuStaticSharedMem — Macro

@cuStaticSharedMem(T::Type, dims) -> CuDeviceArray{T,AS.Shared}

Get an array of type T and dimensions dims (either an integer length or tuple shape) pointing to a statically-allocated piece of shared memory. The type should be statically inferable and the dimensions should be constant, or an error will be thrown and the generator function will be called dynamically.

CUDA.@cuassert — Macro

@cuassert cond [text]

Signal assertion failure to the CUDA driver if cond is false. Preferred syntax for writing assertions, mimicking Base.@assert. Message text is optionally displayed upon assertion failure.

Warning

A failed assertion will crash the GPU, so use sparingly as a debugging tool. Furthermore, the assertion might be disabled at various optimization levels, and thus should not cause any side-effects.

CUDA.@cuda — Macro

@cuda [kwargs...] func(args...)

High-level interface for executing code on a GPU. The @cuda macro should prefix a call, with func a callable function or object that should return nothing. It will be compiled to a CUDA function upon first use, and to a certain extent arguments will be converted and managed automatically using cudaconvert. Finally, a call to cudacall is performed, scheduling a kernel launch on the current CUDA context.

Several keyword arguments are supported that influence the behavior of @cuda.

dynamic: use dynamic parallelism to launch device-side kernels
arguments that influence kernel compilation: see cufunction and dynamic_cufunction
arguments that influence kernel launch: see CUDA.HostKernel and CUDA.DeviceKernel

The underlying operations (argument conversion, kernel compilation, kernel call) can be performed explicitly when more control is needed, e.g. to reflect on the resource usage of a kernel to determine the launch configuration. A host-side kernel launch is done as follows:

args = ...
GC.@preserve args begin
    kernel_args = cudaconvert.(args)
    kernel_tt = Tuple{Core.Typeof.(kernel_args)...}
    kernel = cufunction(f, kernel_tt; compilation_kwargs)
    kernel(kernel_args...; launch_kwargs)
end

A device-side launch, a.k.a. dynamic parallelism, is similar but more restricted:

args = ...
# GC.@preserve is not supported
# we're on the device already, so no need to cudaconvert
kernel_tt = Tuple{Core.Typeof(args[1]), ...}    # this needs to be fully inferred!
kernel = dynamic_cufunction(f, kernel_tt)       # no compiler kwargs supported
kernel(args...; launch_kwargs)

CUDA.@cuprint — Macro

@cuprint(xs...)
@cuprintln(xs...)

Print a textual representation of values xs to standard output from the GPU. The functionality builds on @cuprintf, and is intended as a more user-friendly alternative to that API. However, that also means there's only limited support for argument types, handling 16/32/64 signed and unsigned integers, 32 and 64-bit floating point numbers, chars and pointers. For more complex output, use @cuprintf directly.

Limited string interpolation is also possible:

    @cuprint("Hello, World ", 42, "\n")
    @cuprint "Hello, World $(42)\n"

CUDA.@cuprintf — Macro

@cuprintf("%Fmt", args...)

Print a formatted string in device context on the host standard output.

Note that this is not a fully C-compliant printf implementation; see the CUDA documentation for supported options and inputs.

Also beware that it is an untyped and unforgiving printf implementation. Type widths need to match, e.g. printing a 64-bit Julia integer requires the %ld formatting string.

CUDA.@cuprintln — Macro

@cuprint(xs...)
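@cuprintln(xs...)

Print a textual representation of values xs to standard output from the GPU. The functionality builds on @cuprintf, and is intended as a more user-friendly alternative to that API. However, that also means there's only limited support for argument types, handling 16/32/64 signed and unsigned integers, 32 and 64-bit floating point numbers, chars and pointers. For more complex output, use @cuprintf directly.

Limited string interpolation is also possible: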

    @cuprint("Hello, World ", 42, "\n")
    @cuprint "Hello, World $(42)\n"

CUDA.@cushow — Macro

@cushow(ex)

GPU analog of Base.@show. It comes with the same type restrictions as @cuprintf.

@cushow threadIdx().x

CUDA.@device_code_sass — Macro

@device_code_sass [io::IO=stdout, ...] ex

Evaluates the expression ex and prints the result of CUDA.code_sass to io for every compiled CUDA kernel. For other supported keywords, see CUDA.code_sass.

CUDA.@elapsed — Macro

@elapsed stream ex
@elapsed ex

A macro to evaluate an expression, discarding the resulting value, instead returning the number of seconds it took to execute on the GPU, as a floating-point number.

CUDA.@profile — Macro

@profile ex

Run expressions while activating the CUDA profiler.

Note that this API is used to programmatically control the profiling granularity by allowing profiling to be done only on selected pieces of code. It does not perform any profiling by itself; you need external tools for that.

CUDA.@retry_reclaim — Macro

@retry_reclaim isfailed(ret) ex

Run a block of code ex repeatedly until it successfully allocates the memory it needs. A retry is only attempted when calling isfailed on the current return value returns true. At each try, more and more memory is freed from the CUDA memory pool. When that is not possible anymore, the latest returned value will be returned.

This macro is intended for use with CUDA APIs, which sometimes allocate (outside of the CUDA memory pool) and return a specific error code when failing to.

CUDA.@sync — Macro

@sync [blocking=true] ex

Run expression ex and synchronize the GPU afterwards. By default, this is a CPU-friendly synchronization, i.e. it performs a blocking synchronization without increasing CPU load. As such, this operation is preferred over implicit synchronization (e.g. when performing a memory copy) for high-performance applications.

It is also useful for timing code that executes asynchronously.

CUDA.@time — Macro

@time ex

Run expression ex and report on execution time and GPU/CPU memory behavior. The GPU is synchronized right before and after executing ex to exclude any external effects.