AMDGPU.ExceptionHolder
— Type

    ExceptionHolder

- `exception_flag::Mem.HostBuffer`: Pinned host memory containing one element of `Int32` type. If the stored value is non-zero, an exception occurred during kernel execution on the respective device.
- `gate::ROCArray{UInt64}`: Linear index for the x, y & z dimensions at which the exception occurred. This is used to filter out other threads from duplicating exceptions.
- `buffers_counter::ROCArray{Int32}`: Counts the number of printf buffers (`errprintf_buffers`) currently used.
- `str_buffers_counter::ROCArray{Int32}`: Error string counter. Counts the number of string buffers (`string_buffers`) used for exception reporting.
- `errprintf_buffers::Vector{Mem.HostBuffer}`: Array of buffers used for writing exceptions. These buffers are used in the same way as device-printf buffers, except they are pre-allocated.
- `string_buffers::Vector{Mem.HostBuffer}`: Array of string buffers. These buffers are used every time we need to report the name of the exception, file, or line.
AMDGPU.device!
— Method

    device!(device::HIPDevice)

Switch the current device. The switch applies only to the task inside which it is called.
AMDGPU.device
— Method

    device(A::ROCArray) -> HIPDevice

Return the device associated with the array `A`.
AMDGPU.device
— Method

    device()::HIPDevice

Get the currently active device. This device is used when launching kernels via `@roc`.
AMDGPU.device_id!
— Method

    device_id!(idx::Integer)

Sets the current device to `AMDGPU.devices()[idx]`. See `device_id` for details on the numbering semantics.
AMDGPU.device_id
— Method

    device_id() -> Int
    device_id(device::HIPDevice) -> Int

Returns the numerical device ID for `device`, or for the current `AMDGPU.device()`.
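Putting the device-query API together, here is a short sketch (assuming at least one visible AMD GPU) of enumerating devices and switching between them:

```julia
using AMDGPU

# List all visible devices with their 1-based IDs.
for dev in AMDGPU.devices()
    println(AMDGPU.device_id(dev), ": ", AMDGPU.HIP.name(dev))
end

# Switch the current task to the first device...
AMDGPU.device_id!(1)
# ...which is equivalent to:
AMDGPU.device!(AMDGPU.devices()[1])

@assert AMDGPU.device_id() == 1
```

Note that the switch is task-local: other Julia tasks keep their own current device.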
AMDGPU.device_synchronize
— Method

Blocks until all kernels on all streams have completed. Uses the currently active device.
AMDGPU.devices
— Method

    devices()

Get the list of all devices.
AMDGPU.finalize_shmem!
— Method

Copy `swap` back into global memory `x`.
AMDGPU.free
— Method

    free()

Returns the amount of free memory (in bytes) available for allocation on the device.
AMDGPU.functional
— Method

    functional(component::Symbol) -> Bool

Returns `true` if the ROCm component `component` is configured and expected to function correctly. Available `component` values are:

- `:hip`: Queries HIP library availability.
- `:lld`: Queries `ld.lld` tool availability.
- `:device_libs`: Queries ROCm device libraries availability.
- `:rocblas`: Queries rocBLAS library availability.
- `:rocsolver`: Queries rocSOLVER library availability.
- `:rocalution`: Queries rocALUTION library availability.
- `:rocsparse`: Queries rocSPARSE library availability.
- `:rocrand`: Queries rocRAND library availability.
- `:rocfft`: Queries rocFFT library availability.
- `:MIOpen`: Queries MIOpen library availability.
- `:all`: Queries all of the above components.

This query should never throw for valid `component` values.
AMDGPU.functional
— Method

    functional() -> Bool

Returns `true` if AMDGPU is nominally functional; "functional" currently means that HSA, HIP, lld, and the device libraries are available (although it does not imply that usages of these components will be successful).

Packages may use the result of this query to determine whether it is safe to:

- Use AMDGPU to compile code
- Query devices, queues, and other runtime state
- Launch compiled kernels on a device
- Wait on launched kernels to complete
- Utilize external ROCm libraries (rocBLAS et al.)

If the full compilation and launch pipeline is desired, then this query should be sufficient for most packages and applications. This query combines sub-queries of multiple components; a failing sub-query will propagate to a `false` return value. For more fine-grained queries, use `functional(::Symbol)`.
This query should never throw.
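A minimal sketch of the guard pattern a package might use with these queries (only documented functions are assumed):

```julia
using AMDGPU

if AMDGPU.functional()
    # Safe to allocate, compile and launch.
    x = AMDGPU.ones(Float32, 16)
    println(sum(x))
else
    @info "AMDGPU not functional; falling back to CPU"
end

# Fine-grained check for a single component, e.g. rocBLAS:
AMDGPU.functional(:rocblas)
```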
AMDGPU.hard_memory_limit!
— Method

Set a hard limit for total GPU memory allocations.
AMDGPU.info
— Method

    info()

Returns a tuple of two integers indicating, respectively, the free and total amounts of memory (in bytes) available for allocation on the device.
AMDGPU.priority!
— Method

    priority!(f::Base.Callable, priority::Symbol)

Change the priority of the default stream, execute `f` and revert to the original priority. Accepted values are `:normal` (the default), `:low` and `:high`.

Returns the return value of the function `f`.
AMDGPU.priority!
— Method

    priority!(p::Symbol)

Change the priority of the default stream. Accepted values are `:normal` (the default), `:low` and `:high`.
AMDGPU.rocconvert
— Method

    rocconvert(x)

This function is called for every argument to be passed to a kernel, allowing it to be converted to a GPU-friendly format. By default, the function does nothing and returns the input object `x` as-is.

Do not add methods to this function; instead extend the underlying Adapt.jl package and register methods for the `AMDGPU.Adaptor` type.
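As a sketch of the recommended Adapt.jl route, here is how a custom wrapper type could be made kernel-convertible (the `MyWrapper` type is hypothetical, introduced only for illustration):

```julia
using Adapt
using AMDGPU

struct MyWrapper{A}
    data::A
end

# Recursively adapt the wrapped field. With this method defined,
# `rocconvert` (via Adapt) turns a `MyWrapper{ROCArray}` into a
# `MyWrapper{ROCDeviceArray}` when passed to a kernel.
Adapt.adapt_structure(to, w::MyWrapper) = MyWrapper(adapt(to, w.data))
```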
AMDGPU.stream!
— Method

    stream!(f::Base.Callable, stream::HIPStream)

Change the default stream to be used within the same Julia task, execute `f` and revert to the original stream.

Returns the return value of the function `f`.
AMDGPU.stream!
— Method

    stream!(s::HIPStream)

Change the default stream to be used within the same Julia task.
AMDGPU.stream
— Method

    stream()::HIPStream

Get the HIP stream that should be used as the default one for the currently executing task.
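A short sketch combining the stream API above (assumes a working AMD GPU):

```julia
using AMDGPU

s = AMDGPU.HIPStream(:high)  # high-priority stream

# Run a block of work with `s` as the task-local default stream,
# then revert to the previous default.
AMDGPU.stream!(s) do
    x = AMDGPU.rand(Float32, 1024)
    sum(x)  # executes on stream `s`
end

AMDGPU.stream() == s  # false again outside the block
```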
AMDGPU.synchronize
— Function

    synchronize(stream::HIPStream = stream(); blocking::Bool = false)

Wait until all kernels executing on `stream` have completed.

If there are running HostCalls, then `blocking` must be `false`. Additionally, if you want to stop host calls afterwards, pass the `stop_hostcalls=true` keyword argument.
AMDGPU.total
— Method

    total()

Returns the total amount of memory (in bytes) available for allocation on the device.
AMDGPU.used
— Method

    used()

Returns the amount of memory (in bytes) currently allocated on the device.
Base.resize!
— Method

    resize!(a::ROCVector, n::Integer)

Resize `a` to contain `n` elements. If `n` is smaller than the current collection length, the first `n` elements will be retained. If `n` is larger, the new elements are not guaranteed to be initialized.

Note that this operation is only supported on managed buffers, i.e. not on arrays created by `unsafe_wrap`.
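A quick sketch of `resize!` semantics on a `ROCVector` (assumes a working AMD GPU):

```julia
using AMDGPU

a = ROCArray(Float32[1, 2, 3, 4])
resize!(a, 2)   # shrinks; keeps the first 2 elements
resize!(a, 6)   # grows; the 4 trailing elements are uninitialized
length(a)       # 6
```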
AMDGPU.@elapsed
— Macro

    @elapsed ex

A macro to evaluate an expression, discarding the resulting value and instead returning the number of seconds it took to execute on the GPU, as a floating-point number.
AMDGPU.@roc
— Macro

    @roc [kwargs...] func(args...)

High-level interface for launching kernels on the GPU. Upon the first call the kernel will be compiled; subsequent calls re-use the compiled object.

Several keyword arguments are supported:

- `launch::Bool = true`: whether to launch the kernel. If `false`, returns a compiled kernel which can be launched by calling it and passing arguments.
- Arguments that influence kernel compilation: see `AMDGPU.Compiler.hipfunction`.
- Arguments that influence kernel launch: see `AMDGPU.Runtime.HIPKernel`.
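An end-to-end sketch of `@roc` (assumes a working AMD GPU; the kernel and sizes are illustrative, with `gridsize` counting workgroups):

```julia
using AMDGPU

function vadd!(c, a, b)
    i = workitemIdx().x + (workgroupIdx().x - 1) * workgroupDim().x
    i <= length(c) && (@inbounds c[i] = a[i] + b[i])
    return
end

n = 256
a, b = AMDGPU.rand(Float32, n), AMDGPU.rand(Float32, n)
c = similar(a)

@roc groupsize=64 gridsize=cld(n, 64) vadd!(c, a, b)

# `launch=false` returns the compiled kernel without running it:
kernel = @roc launch=false vadd!(c, a, b)
kernel(c, a, b; groupsize=64, gridsize=cld(n, 64))
AMDGPU.synchronize()
```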
AMDGPU.@sync
— Macro

    @sync ex

Run expression `ex` on the currently active stream and synchronize the GPU on that stream afterwards.

See also: `synchronize`.
AMDGPU.Device.DEVICE_ERR_SENTINEL
— Constant

Fatal error on a device wavefront accessing the signal.

AMDGPU.Device.DEVICE_LOCK_SENTINEL
— Constant

Device has locked the signal.

AMDGPU.Device.DEVICE_MSG_SENTINEL
— Constant

Device-sourced message is available.

AMDGPU.Device.HOST_ERR_SENTINEL
— Constant

Fatal error on a host thread accessing the signal.

AMDGPU.Device.HOST_LOCK_SENTINEL
— Constant

Host has locked the signal.

AMDGPU.Device.HOST_MSG_SENTINEL
— Constant

Host-sourced message is available.

AMDGPU.Device.READY_SENTINEL
— Constant

Signal is ready for access by host or device.
AMDGPU.Device.DeviceStaticString
— Type

Internal representation of a static string.
AMDGPU.Device.HostCall
— Type

    HostCall{RT,AT}

GPU-compatible struct for making hostcalls.
AMDGPU.Device.HostCallHolder
— Method

    HostCallHolder(func, return_type::Type, arg_types::Type{Tuple}) -> HostCall

Construct a `HostCall` that executes `func` with the arguments passed from the calling kernel.

`func` must be passed arguments of the types contained in `arg_types`, and must return a value of type `return_type`, or else the hostcall will fail with undefined behavior.

Note: this API is currently experimental and is subject to change at any time.
AMDGPU.Device.ROCDeviceArray
— Type

    ROCDeviceArray(dims, ptr)
    ROCDeviceArray{T}(dims, ptr)
    ROCDeviceArray{T,A}(dims, ptr)
    ROCDeviceArray{T,A,N}(dims, ptr)

Construct an `N`-dimensional dense ROC device array with element type `T`, wrapping a pointer, where `N` is determined from the length of `dims` and `T` is determined from the type of `ptr`. `dims` may be a single scalar, or a tuple of integers corresponding to the lengths in each dimension. If the rank `N` is supplied explicitly, as in `ROCDeviceArray{T,A,N}(dims, ptr)`, then it must match the length of `dims`. The same applies to the element type `T`, which should match the type of the pointer `ptr`.
AMDGPU.Device.activelane
— Method

    activelane()::Cuint

Get the ID of the current lane within a wavefront/warp.
julia> function ker!(x)
           i = AMDGPU.Device.activelane()
           x[i + 1] = i
           return
       end
ker! (generic function with 1 method)
julia> x = ROCArray{Cint}(undef, 1, 8);
julia> @roc groupsize=8 ker!(x);
julia> Array(x)
1×8 Matrix{Int32}:
0 1 2 3 4 5 6 7
AMDGPU.Device.activemask
— Method

    activemask()::UInt64

Get the mask of all active lanes in a warp.
AMDGPU.Device.all_sync
— Method

    all_sync(mask::UInt64, predicate::Bool)::Bool

Evaluate `predicate` for all non-exited threads in `mask` and return non-zero if and only if `predicate` evaluates to non-zero for all of them.
julia> function ker!(x)
           i = AMDGPU.Device.activelane()
           if i % 2 == 0
               mask = 0x0000000055555555 # Only even threads.
               x[1] = AMDGPU.Device.all_sync(mask, true)
           end
           return
       end
ker! (generic function with 1 method)
julia> x = ROCArray{Bool}(undef, 1);
julia> @roc groupsize=32 ker!(x);
julia> x
1-element ROCArray{Bool, 1, AMDGPU.Runtime.Mem.HIPBuffer}:
1
AMDGPU.Device.alloc_special
— Method

Allocates on-device memory statically from the specified address space.
AMDGPU.Device.any_sync
— Method

    any_sync(mask::UInt64, predicate::Bool)::Bool

Evaluate `predicate` for all non-exited threads in `mask` and return non-zero if and only if `predicate` evaluates to non-zero for any of them.
julia> function ker!(x)
           i = AMDGPU.Device.activelane()
           if i % 2 == 0
               mask = 0x0000000055555555 # Only even threads.
               x[1] = AMDGPU.Device.any_sync(mask, i == 0)
           end
           return
       end
ker! (generic function with 1 method)
julia> x = ROCArray{Bool}(undef, 1);
julia> @roc groupsize=32 ker!(x);
julia> x
1-element ROCArray{Bool, 1, AMDGPU.Runtime.Mem.HIPBuffer}:
1
AMDGPU.Device.ballot
— Method

    ballot(predicate::Bool)::UInt64

Return a value whose `N`th bit is set if and only if `predicate` evaluates to `true` for the `N`th lane and the lane is active.
julia> function ker!(x)
           x[1] = AMDGPU.Device.ballot(true)
           return
       end
ker! (generic function with 1 method)
julia> x = ROCArray{Culong}(undef, 1);
julia> @roc groupsize=32 ker!(x);
julia> x
1-element ROCArray{UInt64, 1, AMDGPU.Runtime.Mem.HIPBuffer}:
0x00000000ffffffff
AMDGPU.Device.ballot_sync
— Method

    ballot_sync(mask::UInt64, predicate::Bool)::UInt64

Evaluate `predicate` for all non-exited threads in `mask` and return an integer whose `N`th bit is set if and only if `predicate` is `true` for the `N`th thread of the wavefront and the `N`th thread is active.
julia> function ker!(x)
           i = AMDGPU.Device.activelane()
           if i % 2 == 0
               mask = 0x0000000055555555 # Only even threads.
               x[1] = AMDGPU.Device.ballot_sync(mask, true)
           end
           return
       end
ker! (generic function with 1 method)
julia> x = ROCArray{UInt64}(undef, 1);
julia> @roc groupsize=32 ker!(x);
julia> bitstring(Array(x)[1])
"0000000000000000000000000000000001010101010101010101010101010101"
AMDGPU.Device.blockDim
— Method

    blockDim()::ROCDim3

Returns the dimensions of the block. See also: `workgroupDim`.
AMDGPU.Device.blockIdx
— Method

    blockIdx()::ROCDim3

Returns the block index within the grid. See also: `workgroupIdx`.
AMDGPU.Device.bpermute
— Method

    bpermute(addr::Integer, val::Cint)::Cint

Read data stored in `val` from the lane VGPR (vector general-purpose register) given by `addr`.

The permute instruction moves data between lanes but still uses the notion of byte addressing, as do other LDS instructions. Hence, the value in the `addr` VGPR should be `desired_lane_id * 4`, since VGPR values are 4 bytes wide.
The example below shifts all values in the wavefront by 1 to the "left".
julia> function ker!(x)
           i::Cint = AMDGPU.Device.activelane()
           # `addr` points to the next immediate lane.
           addr = ((i + 1) % 8) * 4 # VGPRs are 4 bytes wide.
           # Read data from the next immediate lane.
           x[i + 1] = AMDGPU.Device.bpermute(addr, i)
           return
       end
ker! (generic function with 1 method)
julia> x = ROCArray{Cint}(undef, 1, 8);
julia> @roc groupsize=8 ker!(x);
julia> x
1×8 ROCArray{Int32, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
1 2 3 4 5 6 7 0
AMDGPU.Device.gridGroupDim
— Method

    gridGroupDim()::ROCDim3

Returns the size of the grid in workgroups. This is equivalent to CUDA's `gridDim`.
AMDGPU.Device.gridItemDim
— Method

    gridItemDim()::ROCDim3

Returns the size of the grid in workitems. This behaviour differs from CUDA, where `gridDim` gives the size of the grid in blocks.
AMDGPU.Device.hostcall!
— Method

Calls the host function stored in `hc` with arguments `args`.
AMDGPU.Device.permute
— Method

    permute(addr::Integer, val::Cint)::Cint

Put data stored in `val` into the lane VGPR (vector general-purpose register) given by `addr`.

The example below shifts all values in the wavefront by 1 to the "right".
julia> function ker!(x)
           i::Cint = AMDGPU.Device.activelane()
           # `addr` points to the next immediate lane.
           addr = ((i + 1) % 8) * 4 # VGPRs are 4 bytes wide.
           # Put data into the next immediate lane.
           x[i + 1] = AMDGPU.Device.permute(addr, i)
           return
       end
ker! (generic function with 1 method)
julia> x = ROCArray{Cint}(undef, 1, 8);
julia> @roc groupsize=8 ker!(x);
julia> x
1×8 ROCArray{Int32, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
7 0 1 2 3 4 5 6
AMDGPU.Device.readfirstlane
— Method

    readfirstlane(val)

Read the value stored in `val` from the first lane of the wavefront.
AMDGPU.Device.shfl
— Function

    shfl(val, lane, width = wavefrontsize())

Read data stored in `val` from a given `lane` (this is a more high-level op than `bpermute`).

If `lane` is outside the range `[0:width - 1]`, the value returned corresponds to the value held by the `lane` modulo `width` (within the same subsection).
julia> function ker!(x)
           i::UInt32 = AMDGPU.Device.activelane()
           x[i + 1] = AMDGPU.Device.shfl(i, i + 1)
           return
       end
ker! (generic function with 1 method)
julia> x = ROCArray{UInt32}(undef, 1, 8);
julia> @roc groupsize=8 ker!(x);
julia> Int.(x)
1×8 ROCArray{Int64, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
1 2 3 4 5 6 7 0
If `width` is less than the wavefront size, then each subsection of the wavefront behaves as a separate entity with a starting logical lane ID of 0.
julia> function ker!(x)
           i::UInt32 = AMDGPU.Device.activelane()
           x[i + 1] = AMDGPU.Device.shfl(i, i + 1, 4) # <-- Notice width = 4.
           return
       end
ker! (generic function with 1 method)
julia> x = ROCArray{UInt32}(undef, 1, 8);
julia> @roc groupsize=8 ker!(x);
julia> Int.(x)
1×8 ROCArray{Int64, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
1 2 3 0 5 6 7 4
AMDGPU.Device.shfl_down
— Function

    shfl_down(val, δ, width = wavefrontsize())

Same as `shfl`, but instead of specifying a lane ID, accepts `δ`, which is added to the current lane ID, i.e. it reads from a lane with a higher ID relative to the caller.
julia> function ker!(x)
           i = AMDGPU.Device.activelane()
           x[i + 1] = AMDGPU.Device.shfl_down(i, 1, 8)
           return
       end
ker! (generic function with 1 method)
julia> x = ROCArray{Int}(undef, 1, 8);
julia> @roc groupsize=8 ker!(x);
julia> x
1×8 ROCArray{Int64, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
1 2 3 4 5 6 7 7
AMDGPU.Device.shfl_down_sync
— Function

    shfl_down_sync(mask::UInt64, val, δ, width = wavefrontsize())

Synchronize threads according to `mask` and read data stored in `val` from a lane with a higher ID relative to the caller.
AMDGPU.Device.shfl_sync
— Function

    shfl_sync(mask::UInt64, val, lane, width = wavefrontsize())

Synchronize threads according to `mask` and read data stored in `val` from the given `lane` ID.
AMDGPU.Device.shfl_up
— Function

    shfl_up(val, δ, width = wavefrontsize())

Same as `shfl`, but instead of specifying a lane ID, accepts `δ`, which is subtracted from the current lane ID, i.e. it reads from a lane with a lower ID relative to the caller.
julia> function ker!(x)
           i = AMDGPU.Device.activelane()
           x[i + 1] = AMDGPU.Device.shfl_up(i, 1)
           return
       end
ker! (generic function with 1 method)
julia> x = ROCArray{Int}(undef, 1, 8);
julia> @roc groupsize=8 ker!(x);
julia> x
1×8 ROCArray{Int64, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
0 0 1 2 3 4 5 6
AMDGPU.Device.shfl_up_sync
— Function

    shfl_up_sync(mask::UInt64, val, δ, width = wavefrontsize())

Synchronize threads according to `mask` and read data stored in `val` from a lane with a lower ID relative to the caller.
AMDGPU.Device.shfl_xor
— Function

    shfl_xor(val, lane_mask, width = wavefrontsize())

Same as `shfl`, but instead of specifying a lane ID, performs a bitwise XOR of the caller's lane ID with the `lane_mask`.
julia> function ker!(x)
           i = AMDGPU.Device.activelane()
           x[i + 1] = AMDGPU.Device.shfl_xor(i, 1)
           return
       end
ker! (generic function with 1 method)
julia> x = ROCArray{Int}(undef, 1, 8);
julia> @roc groupsize=8 ker!(x);
julia> x
1×8 ROCArray{Int64, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
1 0 3 2 5 4 7 6
AMDGPU.Device.shfl_xor_sync
— Function

    shfl_xor_sync(mask::UInt64, val, lane_mask, width = wavefrontsize())

Synchronize threads according to `mask` and read data stored in `val` from a lane according to a bitwise XOR of the caller's lane ID with the `lane_mask`.
AMDGPU.Device.sync_workgroup
— Method

    sync_workgroup()

Waits until all wavefronts in a workgroup have reached this call.
AMDGPU.Device.sync_workgroup_and
— Method

    sync_workgroup_and(predicate::Cint)::Cint

Identical to `sync_workgroup`, with the additional feature that it evaluates `predicate` for all workitems in the workgroup and returns non-zero if and only if `predicate` evaluates to non-zero for all of them.
AMDGPU.Device.sync_workgroup_count
— Method

    sync_workgroup_count(predicate::Cint)::Cint

Identical to `sync_workgroup`, with the additional feature that it evaluates `predicate` for all workitems in the workgroup and returns the number of workitems for which `predicate` evaluates to non-zero.
AMDGPU.Device.sync_workgroup_or
— Method

    sync_workgroup_or(predicate::Cint)::Cint

Identical to `sync_workgroup`, with the additional feature that it evaluates `predicate` for all workitems in the workgroup and returns non-zero if and only if `predicate` evaluates to non-zero for any of them.
AMDGPU.Device.threadIdx
— Method

    threadIdx()::ROCDim3

Returns the thread index within the block. See also: `workitemIdx`.
AMDGPU.Device.wavefrontsize
— Method

    wavefrontsize()::Cuint

Get the wavefront size of the device executing the current kernel.
AMDGPU.Device.wfred
— Function

    wfred(op::Function, val::T) where T -> T

Performs a wavefront-wide reduction on `val` in each lane, and returns the result. Only a limited subset of functions may be passed as `op`. When `op` is one of `(+, max, min, &, |, ⊻)`, `T` may be `<: Union{Cint, Clong, Cuint, Culong}`. When `op` is one of `(+, max, min)`, `T` may also be `<: Union{Float32, Float64}`.
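A sketch of using `wfred` inside a kernel to count active lanes, in the spirit of the REPL examples above (assumes a working AMD GPU; the exact count depends on the device's wavefront size):

```julia
using AMDGPU

function ker!(x)
    # Each lane contributes 1; wfred(+) sums across the wavefront.
    s = AMDGPU.Device.wfred(+, Cint(1))
    # Only lane 0 writes the result.
    AMDGPU.Device.activelane() == 0 && (x[1] = s)
    return
end

x = ROCArray{Cint}(undef, 1)
@roc groupsize=32 ker!(x)
AMDGPU.synchronize()
# Array(x)[1] should equal the number of active lanes in the wavefront.
```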
AMDGPU.Device.wfscan
— Function

    wfscan(op::Function, val::T) where T -> T

Performs a wavefront-wide scan on `val` in each lane, and returns the result. Only a limited subset of functions may be passed as `op`. When `op` is one of `(+, max, min, &, |, ⊻)`, `T` may be `<: Union{Cint, Clong, Cuint, Culong}`. When `op` is one of `(+, max, min)`, `T` may also be `<: Union{Float32, Float64}`.
AMDGPU.Device.workgroupDim
— Method

    workgroupDim()::ROCDim3

Returns the size of each workgroup in workitems. See also: `blockDim`.
AMDGPU.Device.workgroupIdx
— Method

    workgroupIdx()::ROCDim3

Returns the workgroup index. See also: `blockIdx`.
AMDGPU.Device.workitemIdx
— Method

    workitemIdx()::ROCDim3

Returns the workitem index within the workgroup. See also: `threadIdx`.
Base.rand
— Method

    Random.rand(rng::Philox2x32, UInt32)

Generate a `UInt32` of random data using the on-device generator.
Base.unsafe_load
— Method

Read from the printf buffer on the host, from the HostCall task.
Random.seed!
— Function

    Random.seed!(rng::Philox2x32, seed::Integer, [counter::Integer=0])

Seed the on-device Philox2x32 generator with a `UInt32` number. Should be called by at least one thread per warp.
AMDGPU.Device.@rocassert
— Macro

    @rocassert cond [text]

Signal assertion failure if `cond` is `false`. Preferred syntax for writing assertions, mimicking `Base.@assert`. Message `text` is optionally displayed upon assertion failure.

A failed assertion will crash the GPU, so use it sparingly as a debugging tool. Furthermore, the assertion might be disabled at various optimization levels, and thus should not cause any side effects.
AMDGPU.Runtime.Mem.unsafe_copy3d!
— Method

Asynchronous 3D array copy.

Arguments:

- `width::Integer`: Width of the 3D memory copy.
- `height::Integer = 1`: Height of the 3D memory copy.
- `depth::Integer = 1`: Depth of the 3D memory copy.
- `dstPos::ROCDim3 = (1, 1, 1)`: Starting position of the destination data for the copy.
- `srcPos::ROCDim3 = (1, 1, 1)`: Starting position of the source data for the copy.
- `dstPitch::Integer = 0`: Pitch in bytes of the destination data for the copy.
- `srcPitch::Integer = 0`: Pitch in bytes of the source data for the copy.
- `async::Bool = false`: If `false`, then synchronize `stream` immediately.
- `stream::HIP.HIPStream = AMDGPU.stream()`: Stream on which to perform the copy.
AMDGPU.Compiler.hipfunction
— Method

    hipfunction(f::F, tt::TT = Tuple{}; kwargs...)

Compile the Julia function `f` to a HIP kernel, given a tuple of the argument types `tt` that it accepts.

The following kwargs are supported:

- `name::Union{String, Nothing} = nothing`: A unique name to give the compiled kernel.
- `unsafe_fp_atomics::Bool = true`: Whether to use 'unsafe' floating-point atomics. AMD GPU devices support fast atomic read-modify-write (RMW) operations on floating-point values. On single- or double-precision floating-point values this may generate a hardware RMW instruction that is faster than emulating the atomic operation using an atomic compare-and-swap (CAS) loop.
AMDGPU.Runtime.AbstractKernel
— Type

    (::HostKernel)(args...; kwargs...)
    (::DeviceKernel)(args...; kwargs...)

Low-level interface to call a compiled kernel, passing GPU-compatible arguments in `args`. For a higher-level interface, use `AMDGPU.@roc`.

The following keyword arguments are supported:

- `groupsize` (defaults to `1`) or `threads` (CUDA.jl compatibility shim): can be either an `Int` or an `NTuple{N,Int}` (where `1 <= N <= 3`).
- `gridsize` (defaults to `1`): can be either an `Int` or an `NTuple{N,Int}` (where `1 <= N <= 3`).
- `blocks` (CUDA.jl compatibility shim): can be either an `Int` or an `NTuple{N,Int}` (where `1 <= N <= 3`).
- `config`: callback function to dynamically compute the launch configuration. Should accept a `HostKernel` and return a named tuple with any of the above as fields.
- `queue` (defaults to the default queue).
AMDGPU.Runtime.HIPKernel
— Type

    (ker::HIPKernel)(args::Vararg{Any, N}; kwargs...)

Launch a compiled HIPKernel by passing arguments to it.

The following kwargs are supported:

- `gridsize::ROCDim = 1`: Size of the grid.
- `groupsize::ROCDim = 1`: Size of the workgroup.
- `shmem::Integer = 0`: Amount of dynamically-allocated shared memory in bytes.
- `stream::HIP.HIPStream = AMDGPU.stream()`: Stream on which to launch the kernel.
AMDGPU.Runtime.ROCDim3
— Type

    ROCDim3(x)
    ROCDim3((x,))
    ROCDim3((x, y))
    ROCDim3((x, y, z))

A type used to specify dimensions, consisting of 3 integers for the `x`, `y`, and `z` dimensions, respectively. Unspecified dimensions default to `1`.

Often accepted as an argument through the `ROCDim` type alias, e.g. in the case of `roccall` or `launch_kernel!`, allowing dimensions to be passed as a plain integer or a tuple without having to construct an explicit `ROCDim3` object.
AMDGPU.Runtime.check
— Method

    check(result::AMDGPU.HSA.Status)

Check the return code of an HSA call, and throw an `HSAError` exception if it is not the success code.
AMDGPU.Runtime.description
— Method

    description(err::HSAError)

Gets the string description of an error code.
AMDGPU.HIP.HIPStream
— Type

    HIPStream(priority::Symbol = :normal)

Create a HIPStream with the given priority. The device is the default device that is currently in use.

Arguments:

- `priority::Symbol`: Priority of the stream: `:normal`, `:high` or `:low`.
AMDGPU.HIP.HIPStream
— Method

    HIPStream(stream::hipStream_t)

Create a HIPStream from a `hipStream_t` handle. The device is the default device that is currently in use.
AMDGPU.HIP.device_id
— Method

    device_id(d::HIPDevice)

Zero-based device ID, as expected by HIP functions. Differs from the `AMDGPU.device_id` method by `1`.
AMDGPU.HIP.elapsed
— Method

    elapsed(start::HIPEvent, stop::HIPEvent)

Computes the elapsed time between two events (in seconds).

See also: `@elapsed`.
AMDGPU.HIP.gcn_arch
— Method

    gcn_arch(d::HIPDevice)::String

Get the GCN architecture for the device.
AMDGPU.HIP.name
— Method

    name(dev::HIPDevice)::String

Get the name of the device.
AMDGPU.HIP.properties
— Method

    properties(dev::HIPDevice)::hipDeviceProp_t

Get all properties for the device. See the HIP documentation for `hipDeviceProp_t` for the meaning of each field.
AMDGPU.HIP.wavefrontsize
— Method

    wavefrontsize(d::HIPDevice)::Cint

Get the size of the wavefront. AMD GPUs support either 32 or 64.
AMDGPU.HIP.@gcsafe_ccall
— Macro

    @gcsafe_ccall ...

Call a foreign function just like `@ccall`, but mark it safe for the GC to run. This is useful for functions that may block, so that the GC isn't blocked from running; it may also be required to prevent deadlocks (see JuliaGPU/CUDA.jl#2261).

Note that this is generally only safe with non-Julia C functions that do not call back into Julia. When using callbacks, the code should make sure to transition back into GC-unsafe mode using the `@gcunsafe` macro.
AMDGPU.MIOpen.ConvolutionDescriptor
— Method

    ConvolutionDescriptor(
        n_dims::Int64,
        padding::Vector{Int32}, stride::Vector{Int32},
        dilation::Vector{Int32}, groups::Int64)

Low-level constructor. Users should use the one with keyword arguments.

Arguments:

- `padding::Vector{Int32}`: Convolution padding in reversed order.
- `stride::Vector{Int32}`: Convolution stride in reversed order.
- `dilation::Vector{Int32}`: Convolution dilation in reversed order.
AMDGPU.MIOpen.ConvolutionDescriptor
— Method

    ConvolutionDescriptor(
        n_dims::Integer; padding, stride, dilation, groups::Integer)

Arguments:

- `padding`: Convolution padding (array or tuple).
- `stride`: Convolution stride (array or tuple).
- `dilation`: Convolution dilation (array or tuple).
AMDGPU.MIOpen.PoolingDescriptor
— Method

    PoolingDescriptor(
        mode::miopenPoolingMode_t, n_dims::Integer; dims, padding, stride)

Arguments:

- `dims`: Pooling window dimensions (array or tuple).
- `padding`: Pooling padding (array or tuple).
- `stride`: Pooling stride (array or tuple).
AMDGPU.MIOpen.TensorDescriptor
— Method

    TensorDescriptor(
        dtype::miopenDataType_t, dims::Int64,
        sizes::Vector{Int32}, strides::Vector{Int32})

Arguments:

- `sizes::Vector{Int32}`: Dimensions of the tensor in reversed order, i.e. `reverse(size(x))`.
- `strides::Vector{Int32}`: Strides of the tensor in reversed order, i.e. `reverse(strides(x))`.
AMDGPU.MIOpen.batchnorm_inference
— Method

    batchnorm_inference(
        x::ROCArray{T},
        γ::ROCArray{T}, β::ROCArray{T},
        μ::ROCArray{T}, ν::ROCArray{T}; ϵ::Float64 = 1e-5,
    ) where T <: MIOPENFloat

Arguments:

- `γ`: Scaling.
- `β`: Bias.
- `μ`: Running mean for inference.
- `ν`: Running variance for inference.

If `x` has `N` dims, then dimension `N - 1` is considered the 'feature' dimension, meaning `γ`, `β`, `μ`, `ν` must have `size(x, N - 1)` shape.
AMDGPU.MIOpen.batchnorm_training
— Method

    batchnorm_training(
        x::ROCArray{T},
        γ::ROCArray{T}, β::ROCArray{T},
        μ::ROCArray{T}, ν::ROCArray{T}; iteration::Int, ϵ::Float64 = 1e-5,
    ) where T <: MIOPENFloat

Arguments:

- `γ`: Scaling.
- `β`: Bias.
- `μ`: Running mean for inference.
- `ν`: Running variance for inference.

If `x` has `N` dims, then dimension `N - 1` is considered the 'feature' dimension, meaning `γ`, `β`, `μ`, `ν` must have `size(x, N - 1)` shape.
AMDGPU.rocSPARSE.ROCSparseMatrix
— Type

Utility union type of `ROCSparseMatrixCSC`, `ROCSparseMatrixCSR`, `ROCSparseMatrixBSR`, and `ROCSparseMatrixCOO`.
AMDGPU.rocSPARSE.CSRIterator
— Type

    CSRIterator{Ti}(row, args...)

A GPU-compatible iterator for accessing the elements of a single row `row` of several CSR matrices `args` in one go. The row should be in-bounds for every sparse argument. Each iteration returns a 2-element tuple: the current column, and each argument's pointer index (or 0 if that input didn't have an element at that column). The pointers can then be used to access the elements themselves.

For convenience, this iterator can be passed non-sparse arguments as well, which will be ignored (with the returned `col`/`ptr` values set to 0).
AMDGPU.rocSPARSE.ROCSparseMatrixBSR
— Type

Container to hold sparse matrices in block compressed sparse row (BSR) format on the GPU. The BSR format is also used in Intel MKL, and is suited to matrices that are "block" sparse, i.e. whose non-zeros are clustered in dense sub-blocks.
AMDGPU.rocSPARSE.ROCSparseMatrixCOO
— Type

Container to hold sparse matrices in coordinate (COO) format on the GPU. The COO format is mainly useful for initially constructing sparse matrices; afterwards, switch to `ROCSparseMatrixCSR` for more functionality.
AMDGPU.rocSPARSE.ROCSparseMatrixCSR
— Type

    ROCSparseMatrixCSR{Tv, Ti} <: AbstractROCSparseMatrix{Tv, Ti}

Container to hold sparse matrices in compressed sparse row (CSR) format on the GPU. Most rocSPARSE operations work with CSR-formatted matrices rather than CSC.
AMDGPU.rocSPARSE.axpby
— Method

    axpby(alpha::Number, x::ROCSparseVector, beta::Number, y::ROCSparseVector, index::SparseChar)

Performs `z = alpha * x + beta * y`. `x` and `y` are sparse vectors.
AMDGPU.rocSPARSE.axpyi!
— Method

    axpyi!(alpha::BlasFloat, X::ROCSparseVector, Y::ROCVector, index::SparseChar)

Computes `alpha * X + Y` for sparse `X` and dense `Y`.
AMDGPU.rocSPARSE.chkmmdims
— Method

Check that the dimensions of matrices `X` and `Y` make sense for a multiplication.
AMDGPU.rocSPARSE.chkmvdims
— Method

Check that the dimensions of matrix `X` and vector `Y` make sense for a multiplication.
AMDGPU.rocSPARSE.geam
— Method

    geam(
        alpha::Number, A::ROCSparseMatrix, beta::Number,
        B::ROCSparseMatrix, index::SparseChar)

Performs `C = alpha * A + beta * B`. `A` and `B` are sparse matrices defined in CSR storage format.
AMDGPU.rocSPARSE.gthr!
— Method

    gthr!(X::ROCSparseVector, Y::ROCVector, index::SparseChar)

Sets the nonzero elements of `X` equal to the nonzero elements of `Y` at the same indices.
AMDGPU.rocSPARSE.gthrz!
— Method

    gthrz!(X::ROCSparseVector, Y::ROCVector, index::SparseChar)

Sets the nonzero elements of `X` equal to the nonzero elements of `Y` at the same indices, and zeros out those elements of `Y`.
AMDGPU.rocSPARSE.ic0!
— Method

    ic0!(A::ROCSparseMatrix, index::SparseChar)

Incomplete Cholesky factorization with no pivoting. Preserves the sparse layout of matrix `A`.
AMDGPU.rocSPARSE.ilu0!
— Method

    ilu0!(A::ROCSparseMatrix, index::SparseChar)

Incomplete LU factorization with no pivoting. Preserves the sparse layout of matrix `A`.
AMDGPU.rocSPARSE.mm!
— Method

    mm!(
        transa::SparseChar, transb::SparseChar, alpha::BlasFloat,
        A::ROCSparseMatrix, B::ROCMatrix, beta::BlasFloat,
        C::ROCMatrix, index::SparseChar)

Performs `C = alpha * op(A) * op(B) + beta * C`, where `op` can be the identity (`transa = N`), transpose (`transa = T`) or conjugate transpose (`transa = C`). `A` is a sparse matrix defined in BSR storage format. `B` and `C` are dense matrices.
AMDGPU.rocSPARSE.mv!
— Method

    mv!(
        transa::SparseChar, alpha::BlasFloat, A::ROCSparseMatrix, X::ROCVector,
        beta::BlasFloat, Y::ROCVector, index::SparseChar)

Performs `Y = alpha * op(A) * X + beta * Y`, where `op` can be the identity (`transa = N`), transpose (`transa = T`) or conjugate transpose (`transa = C`). `X` is a sparse vector, and `Y` is dense.
AMDGPU.rocSPARSE.roti!
— Method

    roti!(X::ROCSparseVector, Y::ROCVector, c::BlasFloat, s::BlasFloat, index::SparseChar)

Performs the Givens rotation specified by `c` and `s` on sparse `X` and dense `Y`.
AMDGPU.rocSPARSE.sctr!
— Method

    sctr!(X::ROCSparseVector, Y::ROCVector, index::SparseChar)

Set `Y[:] = X[:]` for dense `Y` and sparse `X`.
AMDGPU.rocSPARSE.sm2!
— Method

    sm2!(
        transa::SparseChar, transxy::SparseChar, uplo::SparseChar,
        diag::SparseChar, alpha::BlasFloat, A::ROCSparseMatrix,
        X::ROCMatrix, index::SparseChar)

Performs `X = alpha * op(A) \ op(X)`, where `op` can be the identity (`transa = N`), transpose (`transa = T`) or conjugate transpose (`transa = C`). `X` is a dense matrix, and `uplo` tells `sm2!` which triangle of the block sparse matrix `A` to reference. If the triangle has a unit diagonal, set `diag` to 'U'.
AMDGPU.rocSPARSE.sv2!
— Method

    sv2!(
        transa::SparseChar, uplo::SparseChar, diag::SparseChar,
        alpha::BlasFloat, A::ROCSparseMatrix, X::ROCVector, index::SparseChar)

Performs `X = alpha * op(A) \ X`, where `op` can be the identity (`transa = N`), transpose (`transa = T`) or conjugate transpose (`transa = C`). `X` is a dense vector, and `uplo` tells `sv2!` which triangle of the block sparse matrix `A` to reference. If the triangle has a unit diagonal, set `diag` to 'U'.
AMDGPU.rocSPARSE.with_workspace
— Method

    with_workspace([eltyp=UInt8], size) do workspace
        ...
    end

Create a GPU workspace vector with element type `eltyp` and a size in number of elements (in the default case of a `UInt8` element type this equals the number of bytes) specified by `size`, and pass it to the do-block.
SparseArrays.sparse
— Method

    sparse(x::DenseROCMatrix; fmt=:csc)
    sparse(I::ROCVector, J::ROCVector, V::ROCVector, [m, n]; fmt=:csc)

Return a sparse ROCm matrix, with the type determined by `fmt`. Possible formats are `:csc`, `:csr`, `:bsr`, and `:coo`.
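A brief sketch of constructing a GPU sparse matrix from COO-style triplets using the documented signature (assumes a working AMD GPU with rocSPARSE available):

```julia
using AMDGPU
using SparseArrays

I = ROCArray(Int32[1, 2, 3])
J = ROCArray(Int32[1, 2, 3])
V = ROCArray(Float32[10, 20, 30])

# 3×3 diagonal matrix in CSR format, the format most rocSPARSE routines expect.
A = sparse(I, J, V, 3, 3; fmt=:csr)
```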
AMDGPU.ROCmDiscovery.find_roc_path
— Method

Find the root ROCm directory.
AMDGPU.ROCmDiscovery.use_artifacts!
— Function

    use_artifacts!(flag::Bool = true)

Pass `true` to switch from a system-wide ROCm installation to artifacts. When using artifacts, a system-wide installation is not needed at all.