AMDGPU.ExceptionHolderType

ExceptionHolder

  • exception_flag::Mem.HostBuffer: Pinned host memory. Contains one element of Int32 type. If stored value is not 0, then there is an exception that occurred during kernel execution on the respective device.
  • gate::ROCArray{UInt64}: Linear index for x, y & z dimensions at which the exception occurred. This is used to filter out other threads from duplication exceptions.
  • buffers_counter::ROCArray{Int32}: Counts number of printf buffers errprintf_buffers currently used.
  • str_buffers_counter::ROCArray{Int32}: Error string counter. Counts number of string buffers string_buffers used for exception reporting.
  • errprintf_buffers::Vector{Mem.HostBuffer}: Array of buffers used for writing exceptions. These buffers are used in the same way as device-printf buffers, except they are pre-allocated.
  • string_buffers::Vector{Mem.HostBuffer}: Array of string buffers. These buffers are used every time we need to report the name of the exception, file, or line.
AMDGPU.device!Method
device!(device::HIPDevice)

Switch current device being used. This switches only for a task inside which it is called.

AMDGPU.deviceMethod
device(A::ROCArray) -> HIPDevice

Return the device associated with the array A.

AMDGPU.deviceMethod
device()::HIPDevice

Get currently active device. This device is used when launching kernels via @roc.

AMDGPU.device_id!Method
device_id!(idx::Integer)

Sets the current device to AMDGPU.devices()[idx]. See device_id for details on the numbering semantics.

AMDGPU.device_idMethod
device_id() -> Int
device_id(device::HIPDevice) -> Int

Returns the numerical device ID for device or for the current AMDGPU.device().

AMDGPU.functionalMethod
functional(component::Symbol) -> Bool

Returns true if the ROCm component component is configured and expected to function correctly. Available component values are:

  • :hip - Queries HIP library availability
  • :lld - Queries ld.lld tool availability
  • :device_libs - Queries ROCm device libraries availability
  • :rocblas - Queries rocBLAS library availability
  • :rocsolver - Queries rocSOLVER library availability
  • :rocalution - Queries rocALUTION library availability
  • :rocsparse - Queries rocSPARSE library availability
  • :rocrand - Queries rocRAND library availability
  • :rocfft - Queries rocFFT library availability
  • :MIOpen - Queries MIOpen library availability
  • :all - Queries all above components

This query should never throw for valid component values.

AMDGPU.functionalMethod
functional() -> Bool

Returns true if AMDGPU is nominally functional; "functional" currently means that HSA, HIP, lld, and device libraries are available (although it does not imply that usages of these components will be successful).

Packages may use the result of this query to determine whether it is safe to:

  • Use AMDGPU to compile code
  • Query devices, queues, and other runtime state
  • Launch compiled kernels on a device
  • Wait on launched kernels to complete
  • Utilize external ROCm libraries (rocBLAS et. al)

If the full compilation and launch pipeline is desired, then this query should be sufficient for most packages and applications. This query combines sub-queries of multiple components; a failing sub-query will propagate to a false return value. For more fine-grained queries, use functional(::Symbol).

This query should never throw.

AMDGPU.priority!Method
priority!(f::Base.Callable, priority::Symbol)

Chnage the priority of default stream, execute f and revert to the original priority. Accepted values are :normal (the default), :low and :high.

Returns:

Return value of the function f.

AMDGPU.priority!Method
priority!(p::Symbol)

Change the priority of the default stream. Accepted values are :normal (the default), :low and :high.

AMDGPU.rocconvertMethod
rocconvert(x)

This function is called for every argument to be passed to a kernel, allowing it to be converted to a GPU-friendly format. By default, the function does nothing and returns the input object x as-is.

Do not add methods to this function, but instead extend the underlying Adapt.jl package and register methods for the the AMDGPU.Adaptor type.

AMDGPU.stream!Method
stream!(f::Base.Callable, stream::HIPStream)

Change the default stream to be used within the same Julia task, execute f and revert to the original stream.

Returns:

Return value of the function f.

AMDGPU.stream!Method
stream!(s::HIPStream)

Change the default stream to be used within the same Julia task.

AMDGPU.streamMethod
stream()::HIPStream

Get the HIP stream that should be used as the default one for the currently executing task.

AMDGPU.synchronizeFunction
synchronize(stream::HIPStream = stream(); blocking::Bool = false)

Wait until all kernels executing on stream have completed.

If there are running HostCalls, then blocking must be false. Additionally, if you want to stop host calls afterwards, then provide stop_hostcalls=true keyword argument.

Base.resize!Method
resize!(a::ROCVector, n::Integer)

Resize a to contain n elements. If n is smaller than the current collection length, the first n elements will be retained. If n is larger, the new elements are not guaranteed to be initialized.

Note that this operation is only supported on managed buffers, i.e., not on arrays that are created by unsafe_wrap.

AMDGPU.@elapsedMacro
@elapsed ex

A macro to evaluate an expression, discarding the resulting value, instead returning the number of seconds it took to execute on the GPU, as a floating-point number.

AMDGPU.@rocMacro
@roc [kwargs...] func(args...)

High-level interface for launching kernels on GPU. Upon a first call it will be compiled, subsequent calls will re-use the compiled object.

Several keyword arguments are supported:

  • launch::Bool = true: whether to launch the kernel. If false, then returns a compiled kernel which can be launched by calling it and passing arguments.
  • Arguments that influence kernel compilation, see AMDGPU.Compiler.hipfunction.
  • Arguments that influence kernel launch, see AMDGPU.Runtime.HIPKernel.
AMDGPU.@syncMacro
@sync ex

Run expression ex on currently active stream and synchronize the GPU on that stream afterwards.

See also: synchronize.

AMDGPU.Device.HostCallHolderMethod
HostCallHolder(func, return_type::Type, arg_types::Type{Tuple}) -> HostCall

Construct a HostCall that executes func with the arguments passed from the calling kernel.

func must be passed arguments of types contained in arg_types, and must return a value of type return_type, or else the hostcall will fail with undefined behavior.

Note: This API is currently experimental and is subject to change at any time.

AMDGPU.Device.ROCDeviceArrayType
ROCDeviceArray(dims, ptr)
ROCDeviceArray{T}(dims, ptr)
ROCDeviceArray{T,A}(dims, ptr)
ROCDeviceArray{T,A,N}(dims, ptr)

Construct an N-dimensional dense ROC device array with element type T wrapping a pointer, where N is determined from the length of dims and T is determined from the type of ptr. dims may be a single scalar, or a tuple of integers corresponding to the lengths in each dimension). If the rank N is supplied explicitly as in Array{T,N}(dims), then it must match the length of dims. The same applies to the element type T, which should match the type of the pointer ptr.

AMDGPU.Device.activelaneMethod
activelane()::Cuint

Get id of the current lane within a wavefront/warp.

julia> function ker!(x)
           i = AMDGPU.Device.activelane()
           x[i + 1] = i
           return
       end
ker! (generic function with 1 method)

julia> x = ROCArray{Cint}(undef, 1, 8);

julia> @roc groupsize=8 ker!(x);

julia> Array(x)
1×8 Matrix{Int32}:
 0  1  2  3  4  5  6  7
AMDGPU.Device.all_syncMethod
all_sync(mask::UInt64, predicate::Bool)::Bool

Evaluate predicate for all non-exited threads in mask and return non-zero if and only if predicate evaluates to non-zero for all of them.

julia> function ker!(x)
           i = AMDGPU.Device.activelane()
           if i % 2 == 0
               mask = 0x0000000055555555 # Only even threads.
               x[1] = AMDGPU.Device.all_sync(mask, true)
           end
           return
       end
ker! (generic function with 1 method)

julia> x = ROCArray{Bool}(undef, 1);

julia> @roc groupsize=32 ker!(x);

julia> x
1-element ROCArray{Bool, 1, AMDGPU.Runtime.Mem.HIPBuffer}:
 1
AMDGPU.Device.any_syncMethod
any_sync(mask::UInt64, predicate::Bool)::Bool

Evaluate predicate for all non-exited threads in mask and return non-zero if and only if predicate evaluates to non-zero for any of them.

julia> function ker!(x)
           i = AMDGPU.Device.activelane()
           if i % 2 == 0
               mask = 0x0000000055555555 # Only even threads.
               x[1] = AMDGPU.Device.any_sync(mask, i == 0)
           end
           return
       end
ker! (generic function with 1 method)

julia> x = ROCArray{Bool}(undef, 1);

julia> @roc groupsize=32 ker!(x);

julia> x
1-element ROCArray{Bool, 1, AMDGPU.Runtime.Mem.HIPBuffer}:
 1
AMDGPU.Device.ballotMethod
ballot(predicate::Bool)::UInt64

Return a value whose Nth bit is set if and only if predicate evaluates to true for the Nth lane and the lane is active.

julia> function ker!(x)
           x[1] = AMDGPU.Device.ballot(true)
           return
       end
ker! (generic function with 1 method)

julia> x = ROCArray{Culong}(undef, 1);

julia> @roc groupsize=32 ker!(x);

julia> x
1-element ROCArray{UInt64, 1, AMDGPU.Runtime.Mem.HIPBuffer}:
 0x00000000ffffffff
AMDGPU.Device.ballot_syncMethod
ballot_sync(mask::UInt64, predicate::Bool)::UInt64

Evaluate predicate for all non-exited threads in mask and return an integer whose Nth bit is set if and only if predicate is true for the Nth thread of the wavefront and the Nth thread is active.

julia> function ker!(x)
           i = AMDGPU.Device.activelane()
           if i % 2 == 0
               mask = 0x0000000055555555 # Only even threads.
               x[1] = AMDGPU.Device.ballot_sync(mask, true)
           end
           return
       end
ker! (generic function with 1 method)

julia> x = ROCArray{UInt64}(undef, 1);

julia> @roc groupsize=32 ker!(x);

julia> bitstring(Array(x)[1])
"0000000000000000000000000000000001010101010101010101010101010101"
AMDGPU.Device.bpermuteMethod
bpermute(addr::Integer, val::Cint)::Cint

Read data stored in val from the lane VGPR (vector general purpose register) given by addr.

The permute instruction moves data between lanes but still uses the notion of byte addressing, as do other LDS instructions. Hence, the value in the addr VGPR should be desired_lane_id * 4, since VGPR values are 4 bytes wide.

Example below shifts all values in the wavefront by 1 to the "left".

julia> function ker!(x)
           i::Cint = AMDGPU.Device.activelane()
           # `addr` points to the next immediate lane.
           addr = ((i + 1) % 8) * 4 # VGPRs are 4 bytes wide
           # Read data from the next immediate lane.
           x[i + 1] = AMDGPU.Device.bpermute(addr, i)
           return
       end
ker! (generic function with 1 method)

julia> x = ROCArray{Cint}(undef, 1, 8);

julia> @roc groupsize=8 ker!(x);

julia> x
1×8 ROCArray{Int32, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
 1  2  3  4  5  6  7  0
AMDGPU.Device.gridGroupDimMethod
gridGroupDim()::ROCDim3

Returns the size of the grid in workgroups. This is equivalent to CUDA's gridDim.

AMDGPU.Device.gridItemDimMethod
gridItemDim()::ROCDim3

Returns the size of the grid in workitems. This behaviour is different from CUDA where gridDim gives the size of the grid in blocks.

AMDGPU.Device.permuteMethod
permute(addr::Integer, val::Cint)::Cint

Put data stored in val to the lane VGPR (vector general purpose register) given by addr.

Example below shifts all values in the wavefront by 1 to the "right".

julia> function ker!(x)
           i::Cint = AMDGPU.Device.activelane()
           # `addr` points to the next immediate lane.
           addr = ((i + 1) % 8) * 4 # VGPRs are 4 bytes wide
           # Put data into the next immediate lane.
           x[i + 1] = AMDGPU.Device.permute(addr, i)
           return
       end
ker! (generic function with 1 method)

julia> x = ROCArray{Cint}(undef, 1, 8);

julia> @roc groupsize=8 ker!(x);

julia> x
1×8 ROCArray{Int32, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
 7  0  1  2  3  4  5  6
AMDGPU.Device.shflFunction
shfl(val, lane, width = wavefrontsize())

Read data stored in val from a lane (this is a more high-level op than bpermute).

If lane is outside the range [0:width - 1], the value returned corresponds to the value held by the lane modulo width (within the same subsection).

julia> function ker!(x)
           i::UInt32 = AMDGPU.Device.activelane()
           x[i + 1] = AMDGPU.Device.shfl(i, i + 1)
           return
       end
ker! (generic function with 1 method)

julia> x = ROCArray{UInt32}(undef, 1, 8);

julia> @roc groupsize=8 ker!(x);

julia> Int.(x)
1×8 ROCArray{Int64, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
 1  2  3  4  5  6  7  0

If width is less than wavefront size then each subsection of the wavefront behaves as a separate entity with a starting logical lane ID of 0.

julia> function ker!(x)
           i::UInt32 = AMDGPU.Device.activelane()
           x[i + 1] = AMDGPU.Device.shfl(i, i + 1, 4) # <-- Notice width = 4.
           return
       end
ker! (generic function with 1 method)

julia> x = ROCArray{UInt32}(undef, 1, 8);

julia> @roc groupsize=8 ker!(x);

julia> Int.(x)
1×8 ROCArray{Int64, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
 1  2  3  0  5  6  7  4
AMDGPU.Device.shfl_downFunction
shfl_down(val, δ, width = wavefrontsize())

Same as shfl, but instead of specifying lane ID, accepts δ that is added to the current lane ID. I.e. read from a lane with higher ID relative to the caller.

julia> function ker!(x)
           i = AMDGPU.Device.activelane()
           x[i + 1] = AMDGPU.Device.shfl_down(i, 1, 8)
           return
       end
ker! (generic function with 1 method)

julia> x = ROCArray{Int}(undef, 1, 8);

julia> @roc groupsize=8 ker!(x);

julia> x
1×8 ROCArray{Int64, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
 1  2  3  4  5  6  7  7
AMDGPU.Device.shfl_down_syncFunction
shfl_down_sync(mask::UInt64, val, δ, width = wavefrontsize())

Synchronize threads according to a mask and read data stored in val from a lane with higher ID relative to the caller.

AMDGPU.Device.shfl_syncFunction
shfl_sync(mask::UInt64, val, lane, width = wavefrontsize())

Synchronize threads according to a mask and read data stored in val from a lane ID.

AMDGPU.Device.shfl_upFunction
shfl_up(val, δ, width = wavefrontsize())

Same as shfl, but instead of specifying lane ID, accepts δ that is subtracted from the current lane ID. I.e. read from a lane with lower ID relative to the caller.

julia> function ker!(x)
           i = AMDGPU.Device.activelane()
           x[i + 1] = AMDGPU.Device.shfl_up(i, 1)
           return
       end
ker! (generic function with 1 method)

julia> x = ROCArray{Int}(undef, 1, 8);

julia> @roc groupsize=8 ker!(x);

julia> x
1×8 ROCArray{Int64, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
 0  0  1  2  3  4  5  6
AMDGPU.Device.shfl_up_syncFunction
shfl_up_sync(mask::UInt64, val, δ, width = wavefrontsize())

Synchronize threads according to a mask and read data stored in val from a lane with lower ID relative to the caller.

AMDGPU.Device.shfl_xorFunction
shfl_xor(val, lane_mask, width = wavefrontsize())

Same as shfl, but instead of specifying lane ID, performs bitwise XOR of the caller's lane ID with the lane_mask.

julia> function ker!(x)
           i = AMDGPU.Device.activelane()
           x[i + 1] = AMDGPU.Device.shfl_xor(i, 1)
           return
       end
ker! (generic function with 1 method)

julia> x = ROCArray{Int}(undef, 1, 8);

julia> @roc groupsize=8 ker!(x);

julia> x
1×8 ROCArray{Int64, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
 1  0  3  2  5  4  7  6
AMDGPU.Device.shfl_xor_syncFunction
shfl_xor_sync(mask::UInt64, val, lane_mask, width = wavefrontsize())

Synchronize threads according to a mask and read data stored in val from a lane according to a bitwise XOR of the caller's lane ID with the lane_mask.

AMDGPU.Device.sync_workgroup_andMethod
sync_workgroup_and(predicate::Cint)::Cint

Identical to sync_workgroup, with the additional feature that it evaluates the predicate for all workitems in the workgroup and returns non-zero if and only if predicate evaluates to non-zero for all of them.

AMDGPU.Device.sync_workgroup_countMethod
sync_workgroup_count(predicate::Cint)::Cint

Identical to sync_workgroup, with the additional feature that it evaluates the predicate for all workitems in the workgroup and returns the number of workitems for which predicate evaluates to non-zero.

AMDGPU.Device.sync_workgroup_orMethod
sync_workgroup_or(predicate::Cint)::Cint

Identical to sync_workgroup, with the additional feature that it evaluates the predicate for all workitems in the workgroup and returns non-zero if and only if predicate evaluates to non-zero for any of them.

AMDGPU.Device.wfredFunction
wfred(op::Function, val::T) where T -> T

Performs a wavefront-wide reduction on val in each lane, and returns the result. A limited subset of functions are available to be passed as op. When op is one of (+, max, min, &, |, ⊻), T may be <:Union{Cint, Clong, Cuint, Culong}. When op is one of (+, max, min), T may also be <:Union{Float32, Float64}.

AMDGPU.Device.wfscanFunction
wfscan(op::Function, val::T) where T -> T

Performs a wavefront-wide scan on val in each lane, and returns the result. A limited subset of functions are available to be passed as op. When op is one of (+, max, min, &, |, ⊻), T may be <:Union{Cint, Clong, Cuint, Culong}. When op is one of (+, max, min), T may also be <:Union{Float32, Float64}.

Base.randMethod
Random.rand(rng::Philox2x32, UInt32)

Generate a byte of random data using the on-device Tausworthe generator.

Base.unsafe_loadMethod

Read from the printf buffer on the host from HostCall task.

Random.seed!Function
Random.seed!(rng::Philox2x32, seed::Integer, [counter::Integer=0])

Seed the on-device Philox2x32 generator with an UInt32 number. Should be called by at least one thread per warp.

AMDGPU.Device.@rocassertMacro
@rocassert cond [text]

Signal assertion failure if cond is false. Preferred syntax for writing assertions, mimicking Base.@assert. Message text is optionally displayed upon assertion failure.

Warning

A failed assertion will crash the GPU, so use sparingly as a debugging tool. Furthermore, the assertion might be disabled at various optimization levels, and thus should not cause any side-effects.

AMDGPU.Runtime.Mem.freeMethod
free()

Returns the free amount of memory (in bytes), available for allocation on the device.

AMDGPU.Runtime.Mem.infoMethod
info()

Returns a tuple of two integers, indicating respectively the free and total amount of memory (in bytes) available for allocation on the device.

AMDGPU.Runtime.Mem.totalMethod
total()

Returns the total amount of memory (in bytes), available for allocation on the device.

AMDGPU.Runtime.Mem.unsafe_copy3d!Method

Asynchronous 3D array copy.

Arguments:

  • width::Integer: Width of 3D memory copy.
  • height::Integer = 1: Height of 3D memory copy.
  • depth::Integer = 1: Depth of 3D memory copy.
  • dstPos::ROCDim3 = (1, 1, 1): Starting position of the destination data for the copy.
  • srcPos::ROCDim3 = (1, 1, 1): Starting position of the source data for the copy.
  • dstPitch::Integer = 0: Pitch in bytes of the destination data for the copy.
  • srcPitch::Integer = 0: Pitch in bytes of the source data for the copy.
  • async::Bool = false: If false, then synchronize stream immediately.
  • stream::HIP.HIPStream = AMDGPU.stream(): Stream on which to perform the copy.
AMDGPU.Compiler.hipfunctionMethod
hipfunction(f::F, tt::TT = Tuple{}; kwargs...)

Compile Julia function f to a HIP kernel given a tuple of argument's types tt that it accepts.

The following kwargs are supported:

  • name::Union{String, Nothing} = nothing: A unique name to give a compiled kernel.
  • unsafe_fp_atomics::Bool = true: Whether to use 'unsafe' floating-point atomics. AMD GPU devices support fast atomic read-modify-write (RMW) operations on floating-point values. On single- or double-precision floating-point values this may generate a hardware RMW instruction that is faster than emulating the atomic operation using an atomic compare-and-swap (CAS) loop.
AMDGPU.Runtime.AbstractKernelType
(::HostKernel)(args...; kwargs...)
(::DeviceKernel)(args...; kwargs...)

Low-level interface to call a compiled kernel, passing GPU-compatible arguments in args. For a higher-level interface, use AMDGPU.@roc.

The following keyword arguments are supported:

  • groupsize (defaults to 1) or threads (CUDA.jl compatibility shim) Can be either an Int or an NTuple{N,Int} (where 1 <= N <= 3)
  • gridsize (defaults to 1) Can be either an Int or an NTuple{N,Int} (where 1 <= N <= 3)
  • blocks (CUDA.jl compatibility shim) Can be either an Int or an NTuple{N,Int} (where 1 <= N <= 3)
  • config: callback function to dynamically compute the launch configuration. should accept a HostKernel and return a name tuple with any of the above as fields.
  • queue (defaults to the default queue)
AMDGPU.Runtime.HIPKernelType
(ker::HIPKernel)(args::Vararg{Any, N}; kwargs...)

Launch compiled HIPKernel by passing arguments to it.

The following kwargs are supported:

  • gridsize::ROCDim = 1: Size of the grid.
  • groupsize::ROCDim = 1: Size of the workgroup.
  • shmem::Integer = 0: Amount of dynamically-allocated shared memory in bytes.
  • stream::HIP.HIPStream = AMDGPU.stream(): Stream on which to launch the kernel.
AMDGPU.Runtime.ROCDim3Type
ROCDim3(x)
ROCDim3((x,))
ROCDim3((x, y))
ROCDim3((x, y, x))

A type used to specify dimensions, consisting of 3 integers for the x, y, and z dimension, respectively. Unspecified dimensions default to 1.

Often accepted as argument through the ROCDim type alias, eg. in the case of roccall or launch_kernel!, allowing to pass dimensions as a plain integer or a tuple without having to construct an explicit ROCDim3 object.

AMDGPU.Runtime.checkMethod
check(result::AMDGPU.HSA.Status)

Check the return code of an HSA call, and throw an HSAError exception if it is not the success code.

AMDGPU.HIP.HIPStreamType
HIPStream(priority::Symbol = :normal)

Arguments:

  • priority::Symbol: Priority of the stream: :normal, :high or :low.

Create HIPStream with given priority. Device is the default device that's currently in use.

AMDGPU.HIP.HIPStreamMethod
HIPStream(stream::hipStream_t)

Create HIPStream from hipStream_t handle. Device is the default device that's currently in use.

AMDGPU.HIP.elapsedMethod
elapsed(start::HIPEvent, stop::HIPEvent)

Computes the elapsed time between two events (in seconds).

See also @elapsed.

AMDGPU.HIP.gcn_archMethod
gcn_arch(d::HIPDevice)::String

Get GCN architecture for the device.

AMDGPU.HIP.nameMethod
name(dev::HIPDevice)::String

Get name of the device.

AMDGPU.HIP.propertiesMethod
properties(dev::HIPDevice)::hipDeviceProp_t

Get all properties for the device. See HIP documentation for hipDeviceProp_t for the meaning of each field.

AMDGPU.HIP.wavefrontsizeMethod
wavefrontsize(d::HIPDevice)::Cint

Get size of the wavefront. AMD GPUs support either 32 or 64.

AMDGPU.HIP.@gcsafe_ccallMacro
@gcsafe_ccall ...

Call a foreign function just like @ccall, but marking it safe for the GC to run. This is useful for functions that may block, so that the GC isn't blocked from running, but may also be required to prevent deadlocks (see JuliaGPU/CUDA.jl#2261).

Note that this is generally only safe with non-Julia C functions that do not call back into Julia. When using callbacks, the code should make sure to transition back into GC-unsafe mode using the @gcunsafe macro.

AMDGPU.MIOpen.ConvolutionDescriptorMethod
ConvolutionDescriptor(
    n_dims::Int64,
    padding::Vector{Int32}, stride::Vector{Int32},
    dilation::Vector{Int32}, groups::Int64)

Low-level constructor. Users should be using one with keyword arguments.

Arguments:

  • padding::Vector{Int32}: Convolution padding in reversed order.
  • stride::Vector{Int32}: Convolution stride in reversed order.
  • dilation::Vector{Int32}: Convolution dilation in reversed order.
AMDGPU.MIOpen.ConvolutionDescriptorMethod
ConvolutionDescriptor(
    n_dims::Integer; padding, stride, dilation, groups::Integer)

Arguments:

  • padding: Convolution padding (array or tuple).
  • stride: Convolution stride (array or tuple).
  • dilation: Convolution dilation (array or tuple).
AMDGPU.MIOpen.PoolingDescriptorMethod
PoolingDescriptor(
    mode::miopenPoolingMode_t, n_dims::Integer; dims, padding, stride)

Arguments:

  • dims: Pooling window dimensions (array or tuple).
  • padding: Pooling padding (array or tuple).
  • stride: Pooling stride (array or tuple).
AMDGPU.MIOpen.TensorDescriptorMethod
TensorDescriptor(
    dtype::miopenDataType_t, dims::Int64,
    sizes::Vector{Int32}, strides::Vector{Int32})

Arguments:

  • sizes::Vector{Int32}: Dimensions of the tensor in reversed order, i.e. reverse(size(x)).
  • strides::Vector{Int32}: Strides of the tensor in reversed order, i.e. reverse(strides(x)).
AMDGPU.MIOpen.batchnorm_inferenceMethod
batchnorm_inference(
    x::ROCArray{T},
    γ::ROCArray{T}, β::ROCArray{T},
    μ::ROCArray{T}, ν::ROCArray{T}; ϵ::Float64 = 1e-5,
) where T <: MIOPENFloat

Arguments:

  • γ: Scaling.
  • β: Bias.
  • μ: Running mean for inference.
  • ν: Running variance for inference.

If x has N dims, then N - 1 is considered as 'feature' dimension. Meaning, γ, β, μ, ν must have size(x, N - 1) shape.

AMDGPU.MIOpen.batchnorm_trainingMethod
batchnorm_training(
    x::ROCArray{T},
    γ::ROCArray{T}, β::ROCArray{T},
    μ::ROCArray{T}, ν::ROCArray{T}; iteration::Int, ϵ::Float64 = 1e-5,
) where T <: MIOPENFloat

Arguments:

  • γ: Scaling.
  • β: Bias.
  • μ: Running mean for inference.
  • ν: Running variance for inference.

If x has N dims, then N - 1 is considered as 'feature' dimension. Meaning, γ, β, μ, ν must have size(x, N - 1) shape.

AMDGPU.rocSPARSE.CSRIteratorType
CSRIterator{Ti}(row, args...)

A GPU-compatible iterator for accessing the elements of a single row row of several CSR matrices args in one go. The row should be in-bounds for every sparse argument. Each iteration returns a 2-element tuple: The current column, and each arguments' pointer index (or 0 if that input didn't have an element at that column). The pointers can then be used to access the elements themselves.

For convenience, this iterator can be passed non-sparse arguments as well, which will be ignored (with the returned col/ptr values set to 0).

AMDGPU.rocSPARSE.ROCSparseMatrixBSRType

Container to hold sparse matrices in block compressed sparse row (BSR) format on the GPU. BSR format is also used in Intel MKL, and is suited to matrices that are "block" sparse - rare blocks of non-sparse regions.

AMDGPU.rocSPARSE.ROCSparseMatrixCSRType
ROCSparseMatrixCSR{Tv, Ti} <: AbstractROCSparseMatrix{Tv, Ti}

Container to hold sparse matrices in compressed sparse row (CSR) format on the GPU.

Note

Most ROCSPARSE operations work with CSR formatted matrices, rather than CSC.

AMDGPU.rocSPARSE.axpbyMethod
axpby(alpha::Number, x::ROCSparseVector, beta::Number, y::ROCSparseVector, index::SparseChar)

Performs z = alpha * x + beta * y. x and y are sparse vectors.

AMDGPU.rocSPARSE.axpyi!Method
axpyi!(alpha::BlasFloat, X::ROCSparseVector, Y::ROCVector, index::SparseChar)

Computes alpha * X + Y for sparse X and dense Y.

AMDGPU.rocSPARSE.geamMethod
geam(
    alpha::Number, A::ROCSparseMatrix, beta::Number,
    B::ROCSparseMatrix, index::SparseChar)

Performs C = alpha * A + beta * B. A and B are sparse matrix defined in CSR storage format.

AMDGPU.rocSPARSE.gthr!Method
gthr!(X::ROCSparseVector, Y::ROCVector, index::SparseChar)

Sets the nonzero elements of X equal to the nonzero elements of Y at the same indices.

AMDGPU.rocSPARSE.gthrz!Method
gthrz!(X::ROCSparseVector, Y::ROCVector, index::SparseChar)

Sets the nonzero elements of X equal to the nonzero elements of Y at the same indices, and zeros out those elements of Y.

AMDGPU.rocSPARSE.ic0!Method
ic0!(A::ROCSparseMatrix, index::SparseChar)

Incomplete Cholesky factorization with no pivoting. Preserves the sparse layout of matrix A.

AMDGPU.rocSPARSE.ilu0!Method
ilu0!(A::ROCSparseMatrix, index::SparseChar)

Incomplete LU factorization with no pivoting. Preserves the sparse layout of matrix A.

AMDGPU.rocSPARSE.mm!Method
mm!(
    transa::SparseChar, transb::SparseChar, alpha::BlasFloat,
    A::ROCSparseMatrix, B::ROCMatrix, beta::BlasFloat,
    C::ROCMatrix, index::SparseChar)

Performs C = alpha * op(A) * op(B) + beta * C, where op can be nothing (transa = N), tranpose (transa = T) or conjugate transpose (transa = C). A is a sparse matrix defined in BSR storage format. B and C are dense matrices.

AMDGPU.rocSPARSE.mv!Method
mv!(
    transa::SparseChar, alpha::BlasFloat, A::ROCSparseMatrix, X::ROCVector,
    beta::BlasFloat, Y::ROCVector, index::SparseChar)

Performs Y = alpha * op(A) * X + beta * Y, where op can be nothing (transa = N), tranpose (transa = T) or conjugate transpose (transa = C). X is a sparse vector, and Y is dense.

AMDGPU.rocSPARSE.roti!Method
roti!(X::ROCSparseVector, Y::ROCVector, c::BlasFloat, s::BlasFloat, index::SparseChar)

Performs the Givens rotation specified by c and s to sparse X and dense Y.

AMDGPU.rocSPARSE.sctr!Method
sctr!(X::ROCSparseVector, Y::ROCVector, index::SparseChar)

Set Y[:] = X[:] for dense Y and sparse X.

AMDGPU.rocSPARSE.sm2!Method
sm2!(
    transa::SparseChar, transxy::SparseChar, uplo::SparseChar,
    diag::SparseChar, alpha::BlasFloat, A::ROCSparseMatrix,
    X::ROCMatrix, index::SparseChar)

Performs X = alpha * op(A) \ op(X), where op can be nothing (transa = N), tranpose (transa = T) or conjugate transpose (transa = C). X is a dense matrix, and uplo tells sm2! which triangle of the block sparse matrix A to reference. If the triangle has unit diagonal, set diag to 'U'.

AMDGPU.rocSPARSE.sv2!Method
sv2!(
    transa::SparseChar, uplo::SparseChar, diag::SparseChar,
    alpha::BlasFloat, A::ROCSparseMatrix, X::ROCVector, index::SparseChar)

Performs X = alpha * op(A) \ X, where op can be nothing (transa = N), tranpose (transa = T) or conjugate transpose (transa = C). X is a dense vector, and uplo tells sv2! which triangle of the block sparse matrix A to reference. If the triangle has unit diagonal, set diag to 'U'.

AMDGPU.rocSPARSE.with_workspaceMethod
with_workspace([eltyp=UInt8], size) do workspace
    ...
end

Create a GPU workspace vector with element type eltyp and size in number of elements (in the default case of an UInt8 element type this equals to the amount of bytes) specified by size, and pass it to the do block.

SparseArrays.sparseMethod
sparse(x::DenseROCMatrix; fmt=:csc)
sparse(I::ROCVector, J::ROCVector, V::ROCVector, [m, n]; fmt=:csc)

Return a sparse rocm matrix, with type determined by fmt. Possible formats are :csc, :csr, :bsr, and :coo.

AMDGPU.ROCmDiscovery.use_artifacts!Function
use_artifacts!(flag::Bool = true)

Pass true to switch from system-wide ROCm installtion to artifacts. When using artifacts, system-wide installation is not needed at all.