AMDGPU.ExceptionHolder
— Type

    ExceptionHolder

- `exception_flag::Mem.HostBuffer`: Pinned host memory containing one element of `Int32` type. If the stored value is non-zero, an exception occurred during kernel execution on the respective device.
- `gate::ROCArray{UInt64}`: Linear index for the x, y & z dimensions at which the exception occurred. This is used to filter out other threads from duplicating exceptions.
- `buffers_counter::ROCArray{Int32}`: Counts the number of printf buffers (`errprintf_buffers`) currently used.
- `str_buffers_counter::ROCArray{Int32}`: Error string counter. Counts the number of string buffers (`string_buffers`) used for exception reporting.
- `errprintf_buffers::Vector{Mem.HostBuffer}`: Array of buffers used for writing exceptions. These buffers are used in the same way as device-printf buffers, except they are pre-allocated.
- `string_buffers::Vector{Mem.HostBuffer}`: Array of string buffers. These buffers are used every time we need to report the name of the exception, file, or line.
AMDGPU.device!
— Method

    device!(device::HIPDevice)

Switch the current device. The switch applies only to the task inside which it is called.
AMDGPU.device
— Method

    device(A::ROCArray) -> HIPDevice

Return the device associated with the array `A`.
AMDGPU.device
— Method

    device()::HIPDevice

Get the currently active device. This device is used when launching kernels via `@roc`.
AMDGPU.device_id!
— Method

    device_id!(idx::Integer)

Sets the current device to `AMDGPU.devices()[idx]`. See `device_id` for details on the numbering semantics.
AMDGPU.device_id
— Method

    device_id() -> Int
    device_id(device::HIPDevice) -> Int

Returns the numerical device ID for `device`, or for the current `AMDGPU.device()`.
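Putting the device-query API together, here is a short sketch (assuming at least one visible AMD GPU) of enumerating devices and switching between them:

```julia
using AMDGPU

# List all visible devices with their 1-based IDs.
for dev in AMDGPU.devices()
    println(AMDGPU.device_id(dev), ": ", AMDGPU.HIP.name(dev))
end

# Switch the current task to the first device...
AMDGPU.device_id!(1)
# ...which is equivalent to:
AMDGPU.device!(AMDGPU.devices()[1])

@assert AMDGPU.device_id() == 1
```

Note that the switch is task-local: other Julia tasks keep their own current device.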
AMDGPU.device_synchronize
— Method

Blocks until all kernels on all streams have completed. Uses the currently active device.
AMDGPU.devices
— Method

    devices()

Get the list of all devices.
AMDGPU.finalize_shmem!
— Method

Copy `swap` back into global memory `x`.
AMDGPU.free
— Method

    free()

Returns the amount of free memory (in bytes) available for allocation on the device.
AMDGPU.functional
— Method

    functional(component::Symbol) -> Bool

Returns `true` if the ROCm component `component` is configured and expected to function correctly. Available `component` values are:

- `:hip`: Queries HIP library availability.
- `:lld`: Queries `ld.lld` tool availability.
- `:device_libs`: Queries ROCm device libraries availability.
- `:rocblas`: Queries rocBLAS library availability.
- `:rocsolver`: Queries rocSOLVER library availability.
- `:rocalution`: Queries rocALUTION library availability.
- `:rocsparse`: Queries rocSPARSE library availability.
- `:rocrand`: Queries rocRAND library availability.
- `:rocfft`: Queries rocFFT library availability.
- `:MIOpen`: Queries MIOpen library availability.
- `:all`: Queries all of the above components.

This query should never throw for valid `component` values.
AMDGPU.functional
— Method

    functional() -> Bool

Returns `true` if AMDGPU is nominally functional; "functional" currently means that HSA, HIP, lld, and the device libraries are available (although it does not imply that usages of these components will be successful).

Packages may use the result of this query to determine whether it is safe to:

- Use AMDGPU to compile code
- Query devices, queues, and other runtime state
- Launch compiled kernels on a device
- Wait on launched kernels to complete
- Utilize external ROCm libraries (rocBLAS et al.)

If the full compilation and launch pipeline is desired, then this query should be sufficient for most packages and applications. This query combines sub-queries of multiple components; a failing sub-query will propagate to a `false` return value. For more fine-grained queries, use `functional(::Symbol)`.
This query should never throw.
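A minimal sketch of the guard pattern a package might use with these queries (only documented functions are assumed):

```julia
using AMDGPU

if AMDGPU.functional()
    # Safe to allocate, compile and launch.
    x = AMDGPU.ones(Float32, 16)
    println(sum(x))
else
    @info "AMDGPU not functional; falling back to CPU"
end

# Fine-grained check for a single component, e.g. rocBLAS:
AMDGPU.functional(:rocblas)
```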
AMDGPU.hard_memory_limit!
— Method

Set a hard limit for total GPU memory allocations.
AMDGPU.info
— Method

    info()

Returns a tuple of two integers indicating, respectively, the free and total amounts of memory (in bytes) available for allocation on the device.
AMDGPU.priority!
— Method

    priority!(f::Base.Callable, priority::Symbol)

Change the priority of the default stream, execute `f` and revert to the original priority. Accepted values are `:normal` (the default), `:low` and `:high`.

Returns the return value of the function `f`.
AMDGPU.priority!
— Method

    priority!(p::Symbol)

Change the priority of the default stream. Accepted values are `:normal` (the default), `:low` and `:high`.
AMDGPU.rocconvert
— Method

    rocconvert(x)

This function is called for every argument to be passed to a kernel, allowing it to be converted to a GPU-friendly format. By default, the function does nothing and returns the input object `x` as-is.

Do not add methods to this function; instead extend the underlying Adapt.jl package and register methods for the `AMDGPU.Adaptor` type.
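As a sketch of the recommended Adapt.jl route, here is how a custom wrapper type could be made kernel-convertible (the `MyWrapper` type is hypothetical, introduced only for illustration):

```julia
using Adapt
using AMDGPU

struct MyWrapper{A}
    data::A
end

# Recursively adapt the wrapped field. With this method defined,
# `rocconvert` (via Adapt) turns a `MyWrapper{ROCArray}` into a
# `MyWrapper{ROCDeviceArray}` when passed to a kernel.
Adapt.adapt_structure(to, w::MyWrapper) = MyWrapper(adapt(to, w.data))
```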
AMDGPU.stream!
— Method

    stream!(f::Base.Callable, stream::HIPStream)

Change the default stream to be used within the same Julia task, execute `f` and revert to the original stream.

Returns the return value of the function `f`.
AMDGPU.stream!
— Method

    stream!(s::HIPStream)

Change the default stream to be used within the same Julia task.
AMDGPU.stream
— Method

    stream()::HIPStream

Get the HIP stream that should be used as the default one for the currently executing task.
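A short sketch combining the stream API above (assumes a working AMD GPU):

```julia
using AMDGPU

s = AMDGPU.HIPStream(:high)  # high-priority stream

# Run a block of work with `s` as the task-local default stream,
# then revert to the previous default.
AMDGPU.stream!(s) do
    x = AMDGPU.rand(Float32, 1024)
    sum(x)  # executes on stream `s`
end

AMDGPU.stream() == s  # false again outside the block
```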
AMDGPU.synchronize
— Function

    synchronize(stream::HIPStream = stream(); blocking::Bool = false)

Wait until all kernels executing on `stream` have completed.

If there are running HostCalls, then `blocking` must be `false`. Additionally, if you want to stop host calls afterwards, pass the `stop_hostcalls=true` keyword argument.
AMDGPU.total
— Method

    total()

Returns the total amount of memory (in bytes) available for allocation on the device.
AMDGPU.used
— Method

    used()

Returns the amount of memory (in bytes) currently allocated on the device.
Base.resize!
— Method

    resize!(a::ROCVector, n::Integer)

Resize `a` to contain `n` elements. If `n` is smaller than the current collection length, the first `n` elements will be retained. If `n` is larger, the new elements are not guaranteed to be initialized.

Note that this operation is only supported on managed buffers, i.e. not on arrays created by `unsafe_wrap`.
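A quick sketch of `resize!` semantics on a `ROCVector` (assumes a working AMD GPU):

```julia
using AMDGPU

a = ROCArray(Float32[1, 2, 3, 4])
resize!(a, 2)   # shrinks; keeps the first 2 elements
resize!(a, 6)   # grows; the 4 trailing elements are uninitialized
length(a)       # 6
```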
AMDGPU.@elapsed
— Macro

    @elapsed ex

A macro to evaluate an expression, discarding the resulting value and instead returning the number of seconds it took to execute on the GPU, as a floating-point number.
AMDGPU.@roc
— Macro

    @roc [kwargs...] func(args...)

High-level interface for launching kernels on the GPU. Upon the first call the kernel will be compiled; subsequent calls re-use the compiled object.

Several keyword arguments are supported:

- `launch::Bool = true`: whether to launch the kernel. If `false`, returns a compiled kernel which can be launched by calling it and passing arguments.
- Arguments that influence kernel compilation: see `AMDGPU.Compiler.hipfunction`.
- Arguments that influence kernel launch: see `AMDGPU.Runtime.HIPKernel`.
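An end-to-end sketch of `@roc` (assumes a working AMD GPU; the kernel and sizes are illustrative, with `gridsize` counting workgroups):

```julia
using AMDGPU

function vadd!(c, a, b)
    i = workitemIdx().x + (workgroupIdx().x - 1) * workgroupDim().x
    i <= length(c) && (@inbounds c[i] = a[i] + b[i])
    return
end

n = 256
a, b = AMDGPU.rand(Float32, n), AMDGPU.rand(Float32, n)
c = similar(a)

@roc groupsize=64 gridsize=cld(n, 64) vadd!(c, a, b)

# `launch=false` returns the compiled kernel without running it:
kernel = @roc launch=false vadd!(c, a, b)
kernel(c, a, b; groupsize=64, gridsize=cld(n, 64))
AMDGPU.synchronize()
```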
AMDGPU.@sync
— Macro

    @sync ex

Run expression `ex` on the currently active stream and synchronize the GPU on that stream afterwards.

See also: `synchronize`.
AMDGPU.Device.DEVICE_ERR_SENTINEL
— Constant

Fatal error on a device wavefront accessing the signal.

AMDGPU.Device.DEVICE_LOCK_SENTINEL
— Constant

Device has locked the signal.

AMDGPU.Device.DEVICE_MSG_SENTINEL
— Constant

Device-sourced message is available.

AMDGPU.Device.HOST_ERR_SENTINEL
— Constant

Fatal error on a host thread accessing the signal.

AMDGPU.Device.HOST_LOCK_SENTINEL
— Constant

Host has locked the signal.

AMDGPU.Device.HOST_MSG_SENTINEL
— Constant

Host-sourced message is available.

AMDGPU.Device.READY_SENTINEL
— Constant

Signal is ready for access by host or device.
AMDGPU.Device.DeviceStaticString
— Type

Internal representation of a static string.
AMDGPU.Device.HostCall
— Type

    HostCall{RT,AT}

GPU-compatible struct for making hostcalls.
AMDGPU.Device.HostCallHolder
— Method

    HostCallHolder(func, return_type::Type, arg_types::Type{Tuple}) -> HostCall

Construct a `HostCall` that executes `func` with the arguments passed from the calling kernel.

`func` must be passed arguments of the types contained in `arg_types`, and must return a value of type `return_type`, or else the hostcall will fail with undefined behavior.

Note: this API is currently experimental and is subject to change at any time.
AMDGPU.Device.ROCDeviceArray
— Type

    ROCDeviceArray(dims, ptr)
    ROCDeviceArray{T}(dims, ptr)
    ROCDeviceArray{T,A}(dims, ptr)
    ROCDeviceArray{T,A,N}(dims, ptr)

Construct an `N`-dimensional dense ROC device array with element type `T`, wrapping a pointer, where `N` is determined from the length of `dims` and `T` is determined from the type of `ptr`. `dims` may be a single scalar, or a tuple of integers corresponding to the lengths in each dimension. If the rank `N` is supplied explicitly, as in `ROCDeviceArray{T,A,N}(dims, ptr)`, then it must match the length of `dims`. The same applies to the element type `T`, which should match the type of the pointer `ptr`.
AMDGPU.Device.activelane
— Method

    activelane()::Cuint

Get the ID of the current lane within a wavefront/warp.
julia> function ker!(x)
           i = AMDGPU.Device.activelane()
           x[i + 1] = i
           return
       end
ker! (generic function with 1 method)
julia> x = ROCArray{Cint}(undef, 1, 8);
julia> @roc groupsize=8 ker!(x);
julia> Array(x)
1×8 Matrix{Int32}:
0 1 2 3 4 5 6 7
AMDGPU.Device.activemask
— Method

    activemask()::UInt64

Get the mask of all active lanes in a warp.
AMDGPU.Device.all_sync
— Method

    all_sync(mask::UInt64, predicate::Bool)::Bool

Evaluate `predicate` for all non-exited threads in `mask` and return non-zero if and only if `predicate` evaluates to non-zero for all of them.
julia> function ker!(x)
           i = AMDGPU.Device.activelane()
           if i % 2 == 0
               mask = 0x0000000055555555 # Only even threads.
               x[1] = AMDGPU.Device.all_sync(mask, true)
           end
           return
       end
ker! (generic function with 1 method)
julia> x = ROCArray{Bool}(undef, 1);
julia> @roc groupsize=32 ker!(x);
julia> x
1-element ROCArray{Bool, 1, AMDGPU.Runtime.Mem.HIPBuffer}:
1
AMDGPU.Device.alloc_special
— Method

Allocates on-device memory statically from the specified address space.
AMDGPU.Device.any_sync
— Method

    any_sync(mask::UInt64, predicate::Bool)::Bool

Evaluate `predicate` for all non-exited threads in `mask` and return non-zero if and only if `predicate` evaluates to non-zero for any of them.
julia> function ker!(x)
           i = AMDGPU.Device.activelane()
           if i % 2 == 0
               mask = 0x0000000055555555 # Only even threads.
               x[1] = AMDGPU.Device.any_sync(mask, i == 0)
           end
           return
       end
ker! (generic function with 1 method)
julia> x = ROCArray{Bool}(undef, 1);
julia> @roc groupsize=32 ker!(x);
julia> x
1-element ROCArray{Bool, 1, AMDGPU.Runtime.Mem.HIPBuffer}:
1
AMDGPU.Device.ballot
— Method

    ballot(predicate::Bool)::UInt64

Return a value whose `N`th bit is set if and only if `predicate` evaluates to `true` for the `N`th lane and the lane is active.
julia> function ker!(x)
           x[1] = AMDGPU.Device.ballot(true)
           return
       end
ker! (generic function with 1 method)
julia> x = ROCArray{Culong}(undef, 1);
julia> @roc groupsize=32 ker!(x);
julia> x
1-element ROCArray{UInt64, 1, AMDGPU.Runtime.Mem.HIPBuffer}:
0x00000000ffffffff
AMDGPU.Device.ballot_sync
— Method

    ballot_sync(mask::UInt64, predicate::Bool)::UInt64

Evaluate `predicate` for all non-exited threads in `mask` and return an integer whose `N`th bit is set if and only if `predicate` is `true` for the `N`th thread of the wavefront and the `N`th thread is active.
julia> function ker!(x)
           i = AMDGPU.Device.activelane()
           if i % 2 == 0
               mask = 0x0000000055555555 # Only even threads.
               x[1] = AMDGPU.Device.ballot_sync(mask, true)
           end
           return
       end
ker! (generic function with 1 method)
julia> x = ROCArray{UInt64}(undef, 1);
julia> @roc groupsize=32 ker!(x);
julia> bitstring(Array(x)[1])
"0000000000000000000000000000000001010101010101010101010101010101"
AMDGPU.Device.blockDim
— Method

    blockDim()::ROCDim3

Returns the dimensions of the block. See also: `workgroupDim`.
AMDGPU.Device.blockIdx
— Method

    blockIdx()::ROCDim3

Returns the block index within the grid. See also: `workgroupIdx`.
AMDGPU.Device.bpermute
— Method

    bpermute(addr::Integer, val::Cint)::Cint

Read data stored in `val` from the lane VGPR (vector general-purpose register) given by `addr`.

The permute instruction moves data between lanes but still uses the notion of byte addressing, as do other LDS instructions. Hence, the value in the `addr` VGPR should be `desired_lane_id * 4`, since VGPR values are 4 bytes wide.
The example below shifts all values in the wavefront by 1 to the "left".
julia> function ker!(x)
           i::Cint = AMDGPU.Device.activelane()
           # `addr` points to the next immediate lane.
           addr = ((i + 1) % 8) * 4 # VGPRs are 4 bytes wide.
           # Read data from the next immediate lane.
           x[i + 1] = AMDGPU.Device.bpermute(addr, i)
           return
       end
ker! (generic function with 1 method)
julia> x = ROCArray{Cint}(undef, 1, 8);
julia> @roc groupsize=8 ker!(x);
julia> x
1×8 ROCArray{Int32, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
1 2 3 4 5 6 7 0
AMDGPU.Device.gridGroupDim
— Method

    gridGroupDim()::ROCDim3

Returns the size of the grid in workgroups. This is equivalent to CUDA's `gridDim`.
AMDGPU.Device.gridItemDim
— Method

    gridItemDim()::ROCDim3

Returns the size of the grid in workitems. This behaviour differs from CUDA, where `gridDim` gives the size of the grid in blocks.
AMDGPU.Device.hostcall!
— Method

Calls the host function stored in `hc` with arguments `args`.
AMDGPU.Device.permute
— Method

    permute(addr::Integer, val::Cint)::Cint

Put data stored in `val` into the lane VGPR (vector general-purpose register) given by `addr`.

The example below shifts all values in the wavefront by 1 to the "right".
julia> function ker!(x)
           i::Cint = AMDGPU.Device.activelane()
           # `addr` points to the next immediate lane.
           addr = ((i + 1) % 8) * 4 # VGPRs are 4 bytes wide.
           # Put data into the next immediate lane.
           x[i + 1] = AMDGPU.Device.permute(addr, i)
           return
       end
ker! (generic function with 1 method)
julia> x = ROCArray{Cint}(undef, 1, 8);
julia> @roc groupsize=8 ker!(x);
julia> x
1×8 ROCArray{Int32, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
7 0 1 2 3 4 5 6
AMDGPU.Device.readfirstlane
— Method

    readfirstlane(val)

Read the value stored in `val` from the first lane of the wavefront.
AMDGPU.Device.shfl
— Function

    shfl(val, lane, width = wavefrontsize())

Read data stored in `val` from a given `lane` (this is a more high-level op than `bpermute`).

If `lane` is outside the range `[0:width - 1]`, the value returned corresponds to the value held by the `lane` modulo `width` (within the same subsection).
julia> function ker!(x)
           i::UInt32 = AMDGPU.Device.activelane()
           x[i + 1] = AMDGPU.Device.shfl(i, i + 1)
           return
       end
ker! (generic function with 1 method)
julia> x = ROCArray{UInt32}(undef, 1, 8);
julia> @roc groupsize=8 ker!(x);
julia> Int.(x)
1×8 ROCArray{Int64, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
1 2 3 4 5 6 7 0
If `width` is less than the wavefront size, then each subsection of the wavefront behaves as a separate entity with a starting logical lane ID of 0.
julia> function ker!(x)
           i::UInt32 = AMDGPU.Device.activelane()
           x[i + 1] = AMDGPU.Device.shfl(i, i + 1, 4) # <-- Notice width = 4.
           return
       end
ker! (generic function with 1 method)
julia> x = ROCArray{UInt32}(undef, 1, 8);
julia> @roc groupsize=8 ker!(x);
julia> Int.(x)
1×8 ROCArray{Int64, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
1 2 3 0 5 6 7 4
AMDGPU.Device.shfl_down
— Function

    shfl_down(val, δ, width = wavefrontsize())

Same as `shfl`, but instead of specifying a lane ID, accepts `δ`, which is added to the current lane ID, i.e. it reads from a lane with a higher ID relative to the caller.
julia> function ker!(x)
           i = AMDGPU.Device.activelane()
           x[i + 1] = AMDGPU.Device.shfl_down(i, 1, 8)
           return
       end
ker! (generic function with 1 method)
julia> x = ROCArray{Int}(undef, 1, 8);
julia> @roc groupsize=8 ker!(x);
julia> x
1×8 ROCArray{Int64, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
1 2 3 4 5 6 7 7
AMDGPU.Device.shfl_down_sync
— Function

    shfl_down_sync(mask::UInt64, val, δ, width = wavefrontsize())

Synchronize threads according to `mask` and read data stored in `val` from a lane with a higher ID relative to the caller.
AMDGPU.Device.shfl_sync
— Function

    shfl_sync(mask::UInt64, val, lane, width = wavefrontsize())

Synchronize threads according to `mask` and read data stored in `val` from the given `lane` ID.
AMDGPU.Device.shfl_up
— Function

    shfl_up(val, δ, width = wavefrontsize())

Same as `shfl`, but instead of specifying a lane ID, accepts `δ`, which is subtracted from the current lane ID, i.e. it reads from a lane with a lower ID relative to the caller.
julia> function ker!(x)
           i = AMDGPU.Device.activelane()
           x[i + 1] = AMDGPU.Device.shfl_up(i, 1)
           return
       end
ker! (generic function with 1 method)
julia> x = ROCArray{Int}(undef, 1, 8);
julia> @roc groupsize=8 ker!(x);
julia> x
1×8 ROCArray{Int64, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
0 0 1 2 3 4 5 6
AMDGPU.Device.shfl_up_sync
— Function

    shfl_up_sync(mask::UInt64, val, δ, width = wavefrontsize())

Synchronize threads according to `mask` and read data stored in `val` from a lane with a lower ID relative to the caller.
AMDGPU.Device.shfl_xor
— Function

    shfl_xor(val, lane_mask, width = wavefrontsize())

Same as `shfl`, but instead of specifying a lane ID, performs a bitwise XOR of the caller's lane ID with the `lane_mask`.
julia> function ker!(x)
           i = AMDGPU.Device.activelane()
           x[i + 1] = AMDGPU.Device.shfl_xor(i, 1)
           return
       end
ker! (generic function with 1 method)
julia> x = ROCArray{Int}(undef, 1, 8);
julia> @roc groupsize=8 ker!(x);
julia> x
1×8 ROCArray{Int64, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
1 0 3 2 5 4 7 6
AMDGPU.Device.shfl_xor_sync
— Function

    shfl_xor_sync(mask::UInt64, val, lane_mask, width = wavefrontsize())

Synchronize threads according to `mask` and read data stored in `val` from a lane according to a bitwise XOR of the caller's lane ID with the `lane_mask`.
AMDGPU.Device.sync_workgroup
— Method

    sync_workgroup()

Waits until all wavefronts in a workgroup have reached this call.
AMDGPU.Device.sync_workgroup_and
— Method

    sync_workgroup_and(predicate::Cint)::Cint

Identical to `sync_workgroup`, with the additional feature that it evaluates `predicate` for all workitems in the workgroup and returns non-zero if and only if `predicate` evaluates to non-zero for all of them.
AMDGPU.Device.sync_workgroup_count
— Method

    sync_workgroup_count(predicate::Cint)::Cint

Identical to `sync_workgroup`, with the additional feature that it evaluates `predicate` for all workitems in the workgroup and returns the number of workitems for which `predicate` evaluates to non-zero.
AMDGPU.Device.sync_workgroup_or
— Method

    sync_workgroup_or(predicate::Cint)::Cint

Identical to `sync_workgroup`, with the additional feature that it evaluates `predicate` for all workitems in the workgroup and returns non-zero if and only if `predicate` evaluates to non-zero for any of them.
AMDGPU.Device.threadIdx
— Method

    threadIdx()::ROCDim3

Returns the thread index within the block. See also: `workitemIdx`.
AMDGPU.Device.wavefrontsize
— Method

    wavefrontsize()::Cuint

Get the wavefront size of the device executing the current kernel.
AMDGPU.Device.wfred
— Function

    wfred(op::Function, val::T) where T -> T

Performs a wavefront-wide reduction on `val` in each lane, and returns the result. Only a limited subset of functions may be passed as `op`. When `op` is one of `(+, max, min, &, |, ⊻)`, `T` may be `<: Union{Cint, Clong, Cuint, Culong}`. When `op` is one of `(+, max, min)`, `T` may also be `<: Union{Float32, Float64}`.
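A sketch of using `wfred` inside a kernel to count active lanes, in the spirit of the REPL examples above (assumes a working AMD GPU; the exact count depends on the device's wavefront size):

```julia
using AMDGPU

function ker!(x)
    # Each lane contributes 1; wfred(+) sums across the wavefront.
    s = AMDGPU.Device.wfred(+, Cint(1))
    # Only lane 0 writes the result.
    AMDGPU.Device.activelane() == 0 && (x[1] = s)
    return
end

x = ROCArray{Cint}(undef, 1)
@roc groupsize=32 ker!(x)
AMDGPU.synchronize()
# Array(x)[1] should equal the number of active lanes in the wavefront.
```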
AMDGPU.Device.wfscan
— Function

    wfscan(op::Function, val::T) where T -> T

Performs a wavefront-wide scan on `val` in each lane, and returns the result. Only a limited subset of functions may be passed as `op`. When `op` is one of `(+, max, min, &, |, ⊻)`, `T` may be `<: Union{Cint, Clong, Cuint, Culong}`. When `op` is one of `(+, max, min)`, `T` may also be `<: Union{Float32, Float64}`.
AMDGPU.Device.workgroupDim
— Method

    workgroupDim()::ROCDim3

Returns the size of each workgroup in workitems. See also: `blockDim`.
AMDGPU.Device.workgroupIdx
— Method

    workgroupIdx()::ROCDim3

Returns the workgroup index. See also: `blockIdx`.
AMDGPU.Device.workitemIdx
— Method

    workitemIdx()::ROCDim3

Returns the workitem index within the workgroup. See also: `threadIdx`.
Base.rand
— Method

    Random.rand(rng::Philox2x32, UInt32)

Generate a `UInt32` of random data using the on-device generator.
Base.unsafe_load
— Method

Read from the printf buffer on the host, from the HostCall task.
Random.seed!
— Function

    Random.seed!(rng::Philox2x32, seed::Integer, [counter::Integer=0])

Seed the on-device Philox2x32 generator with a `UInt32` number. Should be called by at least one thread per warp.
AMDGPU.Device.@rocassert
— Macro

    @rocassert cond [text]

Signal assertion failure if `cond` is `false`. Preferred syntax for writing assertions, mimicking `Base.@assert`. Message `text` is optionally displayed upon assertion failure.

A failed assertion will crash the GPU, so use it sparingly as a debugging tool. Furthermore, the assertion might be disabled at various optimization levels, and thus should not cause any side effects.
AMDGPU.Runtime.Mem.unsafe_copy3d!
— Method

Asynchronous 3D array copy.

Arguments:

- `width::Integer`: Width of the 3D memory copy.
- `height::Integer = 1`: Height of the 3D memory copy.
- `depth::Integer = 1`: Depth of the 3D memory copy.
- `dstPos::ROCDim3 = (1, 1, 1)`: Starting position of the destination data for the copy.
- `srcPos::ROCDim3 = (1, 1, 1)`: Starting position of the source data for the copy.
- `dstPitch::Integer = 0`: Pitch in bytes of the destination data for the copy.
- `srcPitch::Integer = 0`: Pitch in bytes of the source data for the copy.
- `async::Bool = false`: If `false`, then synchronize `stream` immediately.
- `stream::HIP.HIPStream = AMDGPU.stream()`: Stream on which to perform the copy.
AMDGPU.Compiler.hipfunction
— Method

    hipfunction(f::F, tt::TT = Tuple{}; kwargs...)

Compile the Julia function `f` to a HIP kernel, given a tuple of the argument types `tt` that it accepts.

The following kwargs are supported:

- `name::Union{String, Nothing} = nothing`: A unique name to give the compiled kernel.
- `unsafe_fp_atomics::Bool = true`: Whether to use 'unsafe' floating-point atomics. AMD GPU devices support fast atomic read-modify-write (RMW) operations on floating-point values. On single- or double-precision floating-point values this may generate a hardware RMW instruction that is faster than emulating the atomic operation using an atomic compare-and-swap (CAS) loop.
AMDGPU.Runtime.AbstractKernel
— Type

    (::HostKernel)(args...; kwargs...)
    (::DeviceKernel)(args...; kwargs...)

Low-level interface to call a compiled kernel, passing GPU-compatible arguments in `args`. For a higher-level interface, use `AMDGPU.@roc`.

The following keyword arguments are supported:

- `groupsize` (defaults to `1`) or `threads` (CUDA.jl compatibility shim): can be either an `Int` or an `NTuple{N,Int}` (where `1 <= N <= 3`).
- `gridsize` (defaults to `1`): can be either an `Int` or an `NTuple{N,Int}` (where `1 <= N <= 3`).
- `blocks` (CUDA.jl compatibility shim): can be either an `Int` or an `NTuple{N,Int}` (where `1 <= N <= 3`).
- `config`: callback function to dynamically compute the launch configuration. Should accept a `HostKernel` and return a named tuple with any of the above as fields.
- `queue` (defaults to the default queue).
AMDGPU.Runtime.HIPKernel
— Type

    (ker::HIPKernel)(args::Vararg{Any, N}; kwargs...)

Launch a compiled HIPKernel by passing arguments to it.

The following kwargs are supported:

- `gridsize::ROCDim = 1`: Size of the grid.
- `groupsize::ROCDim = 1`: Size of the workgroup.
- `shmem::Integer = 0`: Amount of dynamically-allocated shared memory in bytes.
- `stream::HIP.HIPStream = AMDGPU.stream()`: Stream on which to launch the kernel.
AMDGPU.Runtime.ROCDim3
— Type

    ROCDim3(x)
    ROCDim3((x,))
    ROCDim3((x, y))
    ROCDim3((x, y, z))

A type used to specify dimensions, consisting of 3 integers for the `x`, `y`, and `z` dimensions, respectively. Unspecified dimensions default to `1`.

Often accepted as an argument through the `ROCDim` type alias, e.g. in the case of `roccall` or `launch_kernel!`, allowing dimensions to be passed as a plain integer or a tuple without having to construct an explicit `ROCDim3` object.
AMDGPU.Runtime.check
— Method

    check(result::AMDGPU.HSA.Status)

Check the return code of an HSA call, and throw an `HSAError` exception if it is not the success code.
AMDGPU.Runtime.description
— Method

    description(err::HSAError)

Gets the string description of an error code.
AMDGPU.HIP.HIPStream
— Type

    HIPStream(priority::Symbol = :normal)

Create a HIPStream with the given priority. The device is the default device that is currently in use.

Arguments:

- `priority::Symbol`: Priority of the stream: `:normal`, `:high` or `:low`.
AMDGPU.HIP.HIPStream
— Method

    HIPStream(stream::hipStream_t)

Create a HIPStream from a `hipStream_t` handle. The device is the default device that is currently in use.
AMDGPU.HIP.device_id
— Method

    device_id(d::HIPDevice)

Zero-based device ID, as expected by HIP functions. Differs from the `AMDGPU.device_id` method by `1`.
AMDGPU.HIP.elapsed
— Method

    elapsed(start::HIPEvent, stop::HIPEvent)

Computes the elapsed time between two events (in seconds).

See also: `@elapsed`.
AMDGPU.HIP.gcn_arch
— Method

    gcn_arch(d::HIPDevice)::String

Get the GCN architecture for the device.
AMDGPU.HIP.name
— Method

    name(dev::HIPDevice)::String

Get the name of the device.
AMDGPU.HIP.properties
— Method

    properties(dev::HIPDevice)::hipDeviceProp_t

Get all properties for the device. See the HIP documentation for `hipDeviceProp_t` for the meaning of each field.
AMDGPU.HIP.wavefrontsize
— Method

    wavefrontsize(d::HIPDevice)::Cint

Get the size of the wavefront. AMD GPUs support either 32 or 64.
AMDGPU.HIP.@gcsafe_ccall
— Macro

    @gcsafe_ccall ...

Call a foreign function just like `@ccall`, but mark it safe for the GC to run. This is useful for functions that may block, so that the GC isn't blocked from running; it may also be required to prevent deadlocks (see JuliaGPU/CUDA.jl#2261).

Note that this is generally only safe with non-Julia C functions that do not call back into Julia. When using callbacks, the code should make sure to transition back into GC-unsafe mode using the `@gcunsafe` macro.
AMDGPU.MIOpen.ConvolutionDescriptor
— Method

    ConvolutionDescriptor(
        n_dims::Int64,
        padding::Vector{Int32}, stride::Vector{Int32},
        dilation::Vector{Int32}, groups::Int64)

Low-level constructor. Users should use the one with keyword arguments.

Arguments:

- `padding::Vector{Int32}`: Convolution padding in reversed order.
- `stride::Vector{Int32}`: Convolution stride in reversed order.
- `dilation::Vector{Int32}`: Convolution dilation in reversed order.
AMDGPU.MIOpen.ConvolutionDescriptor
— Method

    ConvolutionDescriptor(
        n_dims::Integer; padding, stride, dilation, groups::Integer)

Arguments:

- `padding`: Convolution padding (array or tuple).
- `stride`: Convolution stride (array or tuple).
- `dilation`: Convolution dilation (array or tuple).
AMDGPU.MIOpen.PoolingDescriptor
— Method

    PoolingDescriptor(
        mode::miopenPoolingMode_t, n_dims::Integer; dims, padding, stride)

Arguments:

- `dims`: Pooling window dimensions (array or tuple).
- `padding`: Pooling padding (array or tuple).
- `stride`: Pooling stride (array or tuple).
AMDGPU.MIOpen.TensorDescriptor
— Method

    TensorDescriptor(
        dtype::miopenDataType_t, dims::Int64,
        sizes::Vector{Int32}, strides::Vector{Int32})

Arguments:

- `sizes::Vector{Int32}`: Dimensions of the tensor in reversed order, i.e. `reverse(size(x))`.
- `strides::Vector{Int32}`: Strides of the tensor in reversed order, i.e. `reverse(strides(x))`.
AMDGPU.MIOpen.batchnorm_inference
— Method

    batchnorm_inference(
        x::ROCArray{T},
        γ::ROCArray{T}, β::ROCArray{T},
        μ::ROCArray{T}, ν::ROCArray{T}; ϵ::Float64 = 1e-5,
    ) where T <: MIOPENFloat

Arguments:

- `γ`: Scaling.
- `β`: Bias.
- `μ`: Running mean for inference.
- `ν`: Running variance for inference.

If `x` has `N` dims, then dimension `N - 1` is considered the 'feature' dimension, meaning `γ`, `β`, `μ`, `ν` must have `size(x, N - 1)` shape.
AMDGPU.MIOpen.batchnorm_training
— Method

    batchnorm_training(
        x::ROCArray{T},
        γ::ROCArray{T}, β::ROCArray{T},
        μ::ROCArray{T}, ν::ROCArray{T}; iteration::Int, ϵ::Float64 = 1e-5,
    ) where T <: MIOPENFloat

Arguments:

- `γ`: Scaling.
- `β`: Bias.
- `μ`: Running mean for inference.
- `ν`: Running variance for inference.

If `x` has `N` dims, then dimension `N - 1` is considered the 'feature' dimension, meaning `γ`, `β`, `μ`, `ν` must have `size(x, N - 1)` shape.
AMDGPU.rocSPARSE.ROCSparseMatrix
— Type

Utility union type of `ROCSparseMatrixCSC`, `ROCSparseMatrixCSR`, `ROCSparseMatrixBSR`, and `ROCSparseMatrixCOO`.
AMDGPU.rocSPARSE.CSRIterator
— Type

    CSRIterator{Ti}(row, args...)

A GPU-compatible iterator for accessing the elements of a single row `row` of several CSR matrices `args` in one go. The row should be in-bounds for every sparse argument. Each iteration returns a 2-element tuple: the current column, and each argument's pointer index (or 0 if that input didn't have an element at that column). The pointers can then be used to access the elements themselves.

For convenience, this iterator can be passed non-sparse arguments as well, which will be ignored (with the returned `col`/`ptr` values set to 0).
AMDGPU.rocSPARSE.ROCSparseMatrixBSR
— Type

Container to hold sparse matrices in block compressed sparse row (BSR) format on the GPU. The BSR format is also used in Intel MKL, and is suited to matrices that are "block" sparse, i.e. whose non-zeros are clustered in dense sub-blocks.
AMDGPU.rocSPARSE.ROCSparseMatrixCOO
— Type

Container to hold sparse matrices in coordinate (COO) format on the GPU. The COO format is mainly useful for initially constructing sparse matrices; afterwards, switch to `ROCSparseMatrixCSR` for more functionality.
AMDGPU.rocSPARSE.ROCSparseMatrixCSR
— Type

    ROCSparseMatrixCSR{Tv, Ti} <: AbstractROCSparseMatrix{Tv, Ti}

Container to hold sparse matrices in compressed sparse row (CSR) format on the GPU. Most rocSPARSE operations work with CSR-formatted matrices rather than CSC.
AMDGPU.rocSPARSE.axpby
— Method

    axpby(alpha::Number, x::ROCSparseVector, beta::Number, y::ROCSparseVector, index::SparseChar)

Performs `z = alpha * x + beta * y`. `x` and `y` are sparse vectors.
AMDGPU.rocSPARSE.axpyi!
— Method

    axpyi!(alpha::BlasFloat, X::ROCSparseVector, Y::ROCVector, index::SparseChar)

Computes `alpha * X + Y` for sparse `X` and dense `Y`.
AMDGPU.rocSPARSE.chkmmdims
— Method

Check that the dimensions of matrices `X` and `Y` make sense for a multiplication.
AMDGPU.rocSPARSE.chkmvdims
— Method

Check that the dimensions of matrix `X` and vector `Y` make sense for a multiplication.
AMDGPU.rocSPARSE.geam
— Method

    geam(
        alpha::Number, A::ROCSparseMatrix, beta::Number,
        B::ROCSparseMatrix, index::SparseChar)

Performs `C = alpha * A + beta * B`. `A` and `B` are sparse matrices defined in CSR storage format.
AMDGPU.rocSPARSE.gthr!
— Method

    gthr!(X::ROCSparseVector, Y::ROCVector, index::SparseChar)

Sets the nonzero elements of `X` equal to the nonzero elements of `Y` at the same indices.
AMDGPU.rocSPARSE.gthrz!
— Method

    gthrz!(X::ROCSparseVector, Y::ROCVector, index::SparseChar)

Sets the nonzero elements of `X` equal to the nonzero elements of `Y` at the same indices, and zeros out those elements of `Y`.
AMDGPU.rocSPARSE.ic0!
— Method

    ic0!(A::ROCSparseMatrix, index::SparseChar)

Incomplete Cholesky factorization with no pivoting. Preserves the sparse layout of matrix `A`.
AMDGPU.rocSPARSE.ilu0!
— Method

    ilu0!(A::ROCSparseMatrix, index::SparseChar)

Incomplete LU factorization with no pivoting. Preserves the sparse layout of matrix `A`.
AMDGPU.rocSPARSE.mm!
— Method

    mm!(
        transa::SparseChar, transb::SparseChar, alpha::BlasFloat,
        A::ROCSparseMatrix, B::ROCMatrix, beta::BlasFloat,
        C::ROCMatrix, index::SparseChar)

Performs `C = alpha * op(A) * op(B) + beta * C`, where `op` can be the identity (`transa = N`), transpose (`transa = T`) or conjugate transpose (`transa = C`). `A` is a sparse matrix defined in BSR storage format. `B` and `C` are dense matrices.
AMDGPU.rocSPARSE.mv!
— Method

    mv!(
        transa::SparseChar, alpha::BlasFloat, A::ROCSparseMatrix, X::ROCVector,
        beta::BlasFloat, Y::ROCVector, index::SparseChar)

Performs `Y = alpha * op(A) * X + beta * Y`, where `op` can be the identity (`transa = N`), transpose (`transa = T`) or conjugate transpose (`transa = C`). `X` is a sparse vector, and `Y` is dense.
AMDGPU.rocSPARSE.roti!
— Method

    roti!(X::ROCSparseVector, Y::ROCVector, c::BlasFloat, s::BlasFloat, index::SparseChar)

Performs the Givens rotation specified by `c` and `s` on sparse `X` and dense `Y`.
AMDGPU.rocSPARSE.sctr!
— Method

    sctr!(X::ROCSparseVector, Y::ROCVector, index::SparseChar)

Set `Y[:] = X[:]` for dense `Y` and sparse `X`.
AMDGPU.rocSPARSE.sm2!
— Method

    sm2!(
        transa::SparseChar, transxy::SparseChar, uplo::SparseChar,
        diag::SparseChar, alpha::BlasFloat, A::ROCSparseMatrix,
        X::ROCMatrix, index::SparseChar)

Performs `X = alpha * op(A) \ op(X)`, where `op` can be the identity (`transa = N`), transpose (`transa = T`) or conjugate transpose (`transa = C`). `X` is a dense matrix, and `uplo` tells `sm2!` which triangle of the block sparse matrix `A` to reference. If the triangle has a unit diagonal, set `diag` to 'U'.
AMDGPU.rocSPARSE.sv2!
— Method

    sv2!(
        transa::SparseChar, uplo::SparseChar, diag::SparseChar,
        alpha::BlasFloat, A::ROCSparseMatrix, X::ROCVector, index::SparseChar)

Performs `X = alpha * op(A) \ X`, where `op` can be the identity (`transa = N`), transpose (`transa = T`) or conjugate transpose (`transa = C`). `X` is a dense vector, and `uplo` tells `sv2!` which triangle of the block sparse matrix `A` to reference. If the triangle has a unit diagonal, set `diag` to 'U'.
AMDGPU.rocSPARSE.with_workspace
— Method

    with_workspace([eltyp=UInt8], size) do workspace
        ...
    end

Create a GPU workspace vector with element type `eltyp` and a size in number of elements (in the default case of a `UInt8` element type this equals the number of bytes) specified by `size`, and pass it to the do-block.
SparseArrays.sparse
— Method

    sparse(x::DenseROCMatrix; fmt=:csc)
    sparse(I::ROCVector, J::ROCVector, V::ROCVector, [m, n]; fmt=:csc)

Return a sparse ROCm matrix, with the type determined by `fmt`. Possible formats are `:csc`, `:csr`, `:bsr`, and `:coo`.
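A brief sketch of constructing a GPU sparse matrix from COO-style triplets using the documented signature (assumes a working AMD GPU with rocSPARSE available):

```julia
using AMDGPU
using SparseArrays

I = ROCArray(Int32[1, 2, 3])
J = ROCArray(Int32[1, 2, 3])
V = ROCArray(Float32[10, 20, 30])

# 3×3 diagonal matrix in CSR format, the format most rocSPARSE routines expect.
A = sparse(I, J, V, 3, 3; fmt=:csr)
```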
AMDGPU.ROCmDiscovery.find_roc_path
— Method

Find the root ROCm directory.
AMDGPU.ROCmDiscovery.use_artifacts!
— Function

    use_artifacts!(flag::Bool = true)

Pass `true` to switch from a system-wide ROCm installation to artifacts. When using artifacts, a system-wide installation is not needed at all.