CUDA.AbstractKernel
— Type(::HostKernel)(args...; kwargs...)
(::DeviceKernel)(args...; kwargs...)
Low-level interface to call a compiled kernel, passing GPU-compatible arguments in args
. For a higher-level interface, use @cuda
.
A HostKernel
is callable on the host, and a DeviceKernel
is callable on the device (created by @cuda
with dynamic=true
).
The following keyword arguments are supported:
- threads (default: 1): Number of threads per block, or a 1-, 2- or 3-tuple of dimensions (e.g. threads=(32, 32) for a 2D block of 32×32 threads). Use threadIdx() and blockDim() to query from within the kernel.
- blocks (default: 1): Number of thread blocks to launch, or a 1-, 2- or 3-tuple of dimensions (e.g. blocks=(2, 4, 2) for a 3D grid of blocks). Use blockIdx() and gridDim() to query from within the kernel.
- shmem (default: 0): Amount of dynamic shared memory in bytes to allocate per thread block; used by CuDynamicSharedArray.
- stream (default: stream()): CuStream to launch the kernel on.
- cooperative (default: false): whether to launch a cooperative kernel that supports grid synchronization (see CG.this_grid and CG.sync). Note that this requires care with respect to the number of blocks launched.
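As a rough illustration of the launch keywords listed above, the sketch below compiles a kernel with `@cuda launch=false` and then calls the resulting kernel object directly. The kernel name `vadd!`, the array sizes and the block size of 256 are made up for this example, and the host-side part only runs when a functional GPU is available.

```julia
using CUDA

# Hypothetical element-wise addition kernel (the name is illustrative).
function vadd!(c, a, b)
    # Global 1-based index computed from the block and thread indices.
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(c)
        @inbounds c[i] = a[i] + b[i]
    end
    return nothing
end

if CUDA.functional()  # only run when a usable GPU is present
    a = CUDA.fill(1f0, 1000)
    b = CUDA.fill(2f0, 1000)
    c = CUDA.zeros(Float32, 1000)

    # Compile without launching; this yields a host-callable kernel object.
    kernel = @cuda launch=false vadd!(c, a, b)

    threads = 256
    blocks = cld(length(c), threads)  # enough blocks to cover every element
    kernel(c, a, b; threads=threads, blocks=blocks)

    result = Array(c)  # copy the result back to the host
end
```

The bounds check inside the kernel matters here because `blocks * threads` (1024) is larger than the number of elements (1000).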
CUDA.ArrayMemory
— TypeArrayMemory
Array memory residing on the GPU, possibly in a specially-formatted way.
CUDA.Const
— TypeConst(A::CuDeviceArray)
Mark a CuDeviceArray as constant/read-only. The invariant guaranteed is that you will not modify a CuDeviceArray for the duration of the current kernel.
This API can only be used on devices with compute capability 3.5 or higher.
Experimental API. Subject to change without deprecation.
CUDA.CuContext
— TypeCuContext(dev::CuDevice, flags=CTX_SCHED_AUTO)
CuContext(f::Function, ...)
Create a CUDA context for device. A context on the GPU is analogous to a process on the CPU, with its own distinct address space and allocated resources. When a context is destroyed, the system cleans up the resources allocated to it.
When you are done using the context, call CUDA.unsafe_destroy!
to mark it for deletion, or use do-block syntax with this constructor.
CUDA.CuContext
— MethodCuContext(pctx::CuPrimaryContext)
Derive a context from a primary context.
Calling this function increases the reference count of the primary context. The returned context should not be freed with the unsafe_destroy!
function that's used with ordinary contexts. Instead, the refcount of the primary context should be decreased by calling unsafe_release!
, or set to zero by calling unsafe_reset!
. The easiest way to do this is by using the do
-block syntax.
CUDA.CuDevice
— TypeCuDevice(ordinal::Integer)
Get a handle to a compute device.
CUDA.CuDeviceArray
— TypeCuDeviceArray{T,N,A}(ptr, dims, [maxsize])
Construct an N
-dimensional dense CUDA device array with element type T
wrapping a pointer, where N
is determined from the length of dims
and T
is determined from the type of ptr
. dims
may be a single scalar, or a tuple of integers corresponding to the lengths in each dimension. If the rank N
is supplied explicitly as in Array{T,N}(dims)
, then it must match the length of dims
. The same applies to the element type T
, which should match the type of the pointer ptr
.
CUDA.CuDeviceTexture
— TypeCuDeviceTexture{T,N,M,NC,I}
N
-dimensional device texture with elements of type T
. This type is the device-side counterpart of CuTexture{T,N,P}
, and can be used to access textures using regular indexing notation. If NC
is true, indices used by these accesses should be normalized, i.e., fall into the [0,1)
domain. The I
type parameter indicates the kind of interpolation that happens when indexing into this texture. The source memory of the texture is specified by the M
parameter, either linear memory or a texture array.
Device-side texture objects cannot be created directly, but should be created host-side using CuTexture{T,N,P}
and passed to the kernel as an argument.
Experimental API. Subject to change without deprecation.
CUDA.CuDim3
— TypeCuDim3(x)
CuDim3((x,))
CuDim3((x, y))
CuDim3((x, y, z))
A type used to specify dimensions, consisting of 3 integers for respectively the x
, y
and z
dimension. Unspecified dimensions default to 1
.
Often accepted as an argument through the CuDim
type alias, e.g. in the case of cudacall
or CUDA.launch
, which allows passing dimensions as a plain integer or a tuple without having to construct an explicit CuDim3
object.
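The defaulting rule can be checked with a small sketch. It assumes the three components are exposed as fields named `x`, `y` and `z`, as the description above suggests; the variable names are illustrative only.

```julia
using CUDA

# A bare integer or a short tuple fills the leading dimensions;
# the remaining ones default to 1.
d1 = CUDA.CuDim3(4)
d2 = CUDA.CuDim3((2, 4))
d3 = CUDA.CuDim3((2, 4, 8))

# Assumption: the components are stored in fields x, y and z.
println((d1.x, d1.y, d1.z))
println((d2.x, d2.y, d2.z))
println((d3.x, d3.y, d3.z))
```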
CUDA.CuError
— TypeCuError(code)
Create a CUDA error object with error code code
.
CUDA.CuEvent
— TypeCuEvent()
Create a new CUDA event.
CUDA.CuFunction
— TypeCuFunction(mod::CuModule, name::String)
Acquires a function handle from a named function in a module.
CUDA.CuGlobal
— TypeCuGlobal{T}(mod::CuModule, name::String)
Acquires a typed global variable handle from a named global in a module.
CUDA.CuGraph
— Type
CUDA.CuIterator
— TypeCuIterator([to], batches)
Create a CuIterator
that iterates through the provided batches
via iterate
. Upon each iteration, the current batch
is copied to the GPU, and the previous iteration is marked as freeable from GPU memory (via unsafe_free!
).
The conversion to GPU memory is done recursively, using Adapt.jl, so that each batch can be an array, an array of arrays, or more complex iterable objects. To customize the conversion, an adaptor can be specified as the first argument, e.g., to change the element type:
julia> first(CuIterator([[1.]]))
1-element CuArray{Float64, 1, CUDA.DeviceMemory}:
1.0
julia> first(CuIterator(CuArray{Float32}, [[1.]]))
1-element CuArray{Float32, 1, CUDA.DeviceMemory}:
1.0
This abstraction is useful for batching data into GPU memory in a manner that allows old iterations to potentially be freed (or marked as reusable) earlier than they otherwise would via CuArray
's internal polling mechanism.
CUDA.CuLink
— TypeCuLink()
Creates a pending JIT linker invocation.
CUDA.CuLinkImage
— TypeThe result of a linking operation.
This object keeps its parent linker object alive, as destroying a linker destroys linked images too.
CUDA.CuModule
— TypeCuModule(data, options::Dict{CUjit_option,Any})
CuModuleFile(path, options::Dict{CUjit_option,Any})
Create a CUDA module from data, or from a file containing data. The data may be PTX code, a CUBIN, or a FATBIN.
The options
argument is an optional dictionary of JIT options and their respective values.
CUDA.CuModule
— MethodCuModule(img::CuLinkImage, ...)
Create a CUDA module from a completed linking operation. Options from CuModule
apply.
CUDA.CuPrimaryContext
— TypeCuPrimaryContext(dev::CuDevice)
Create a primary CUDA context for a given device.
Each primary context is unique per device and is shared with the CUDA runtime API. It is meant for interoperability with (applications using) the runtime API.
CUDA.CuPtr
— TypeCuPtr{T}
A memory address that refers to data of type T
that is accessible from the GPU. A CuPtr
is ABI compatible with regular Ptr
objects, e.g. it can be used to ccall
a function that expects a Ptr
to GPU memory, but it prevents erroneous conversions between the two.
CUDA.CuStream
— TypeCuStream(; flags=STREAM_DEFAULT, priority=nothing)
Create a CUDA stream.
CUDA.CuTexture
— TypeCuTexture{T,N,P}
N
-dimensional texture object with elements of type T
. These objects do not store data themselves, but are bound to another source of device memory. Texture objects can be passed to CUDA kernels, where they will be accessible through the CuDeviceTexture
type.
Experimental API. Subject to change without deprecation.
CUDA.CuTexture
— MethodCuTexture(x::CuArray{T,N})
Create a N
-dimensional texture object that reads from a CuArray
.
Note that the underlying memory must be well aligned and strided (good pitch). Currently, this is not enforced.
Experimental API. Subject to change without deprecation.
CUDA.CuTexture
— MethodCuTexture(x::CuTextureArray{T,N})
Create a N
-dimensional texture object with elements of type T
that will be read from x
.
Experimental API. Subject to change without deprecation.
CUDA.CuTexture
— MethodCuTexture{T,N,P}(parent::P; address_mode, filter_mode, normalized_coordinates)
Construct a N
-dimensional texture object with elements of type T
as stored in parent
.
Several keyword arguments alter the behavior of texture objects:
- address_mode (wrap, clamp, mirror): how out-of-bounds values are accessed. Can be specified as a value for all dimensions, or as a tuple of N entries.
- interpolation (nearest neighbour, linear, bilinear): how non-integral indices are fetched. Nearest-neighbour fetches a single value, others interpolate between multiple.
- normalized_coordinates (true, false): whether indices are expected to fall in the normalized [0,1) range.
Experimental API. Subject to change without deprecation.
CUDA.CuTextureArray
— TypeCuTextureArray{T,N}(undef, dims)
N
-dimensional dense texture array with elements of type T
. These arrays are optimized for texture fetching, and are only meant to be used as a source for CuTexture{T,N,P}
objects.
Experimental API. Subject to change without deprecation.
CUDA.CuTextureArray
— MethodCuTextureArray(A::AbstractArray)
Allocate and initialize a texture array from host memory in A
.
Experimental API. Subject to change without deprecation.
CUDA.CuTextureArray
— MethodCuTextureArray(A::CuArray)
Allocate and initialize a texture array from device memory in A
.
Experimental API. Subject to change without deprecation.
CUDA.CuTextureArray
— MethodCuTextureArray{T,N}(undef, dims)
Construct an uninitialized texture array of N
dimensions specified in the dims
tuple, with elements of type T
. Use Base.copyto!
to initialize this texture array, or use constructors that take a non-texture array to do so automatically.
Experimental API. Subject to change without deprecation.
CUDA.DeviceKernel
— Type(::HostKernel)(args...; kwargs...)
(::DeviceKernel)(args...; kwargs...)
Low-level interface to call a compiled kernel, passing GPU-compatible arguments in args
. For a higher-level interface, use @cuda
.
A HostKernel
is callable on the host, and a DeviceKernel
is callable on the device (created by @cuda
with dynamic=true
).
The following keyword arguments are supported:
- threads (default: 1): Number of threads per block, or a 1-, 2- or 3-tuple of dimensions (e.g. threads=(32, 32) for a 2D block of 32×32 threads). Use threadIdx() and blockDim() to query from within the kernel.
- blocks (default: 1): Number of thread blocks to launch, or a 1-, 2- or 3-tuple of dimensions (e.g. blocks=(2, 4, 2) for a 3D grid of blocks). Use blockIdx() and gridDim() to query from within the kernel.
- shmem (default: 0): Amount of dynamic shared memory in bytes to allocate per thread block; used by CuDynamicSharedArray.
- stream (default: stream()): CuStream to launch the kernel on.
- cooperative (default: false): whether to launch a cooperative kernel that supports grid synchronization (see CG.this_grid and CG.sync). Note that this requires care with respect to the number of blocks launched.
CUDA.DeviceMemory
— TypeDeviceMemory
Device memory residing on the GPU.
CUDA.HostKernel
— Type(::HostKernel)(args...; kwargs...)
(::DeviceKernel)(args...; kwargs...)
Low-level interface to call a compiled kernel, passing GPU-compatible arguments in args
. For a higher-level interface, use @cuda
.
A HostKernel
is callable on the host, and a DeviceKernel
is callable on the device (created by @cuda
with dynamic=true
).
The following keyword arguments are supported:
- threads (default: 1): Number of threads per block, or a 1-, 2- or 3-tuple of dimensions (e.g. threads=(32, 32) for a 2D block of 32×32 threads). Use threadIdx() and blockDim() to query from within the kernel.
- blocks (default: 1): Number of thread blocks to launch, or a 1-, 2- or 3-tuple of dimensions (e.g. blocks=(2, 4, 2) for a 3D grid of blocks). Use blockIdx() and gridDim() to query from within the kernel.
- shmem (default: 0): Amount of dynamic shared memory in bytes to allocate per thread block; used by CuDynamicSharedArray.
- stream (default: stream()): CuStream to launch the kernel on.
- cooperative (default: false): whether to launch a cooperative kernel that supports grid synchronization (see CG.this_grid and CG.sync). Note that this requires care with respect to the number of blocks launched.
CUDA.HostMemory
— TypeHostMemory
Pinned memory residing on the CPU, possibly accessible on the GPU.
CUDA.OutOfGPUMemoryError
— TypeOutOfGPUMemoryError()
An operation allocated too much GPU memory for either the system or the memory pool to handle properly.
CUDA.PerDevice
— TypePerDevice{T}()
A helper struct for maintaining per-device state that's lazily initialized and automatically invalidated when the device is reset. Use get!(per_device, dev) do ... end
to initialize and fetch a value.
Mutating or deleting state is not supported. If this is required, use a boxed value, like a Ref
or a Threads.Atomic
.
Furthermore, even though the initialization of this helper, fetching its value for a given device, and clearing it when the device is reset are all performed in a thread-safe manner, you should still take care about thread-safety when using the contained value. For example, if you need to update the value, use atomics; if it's a complex structure like an array or a dictionary, use additional locks.
CUDA.PtrOrCuPtr
— TypePtrOrCuPtr{T}
A special pointer type, ABI-compatible with both Ptr
and CuPtr
, for use in ccall
expressions to convert values to either a GPU or a CPU type (in that order). This is required for CUDA APIs which accept pointers that either point to host or device memory.
CUDA.RNG
— Type
CUDA.UnifiedMemory
— TypeUnifiedMemory
Unified memory that is accessible on both the CPU and GPU.
CUDA.align
— TypeCUDA.align{N}(obj)
Construct an aligned object, providing alignment information to APIs that require it.
Base.eltype
— Methodeltype(var::CuGlobal)
Return the element type of a global variable object.
Base.getindex
— MethodBase.getindex(var::CuGlobal)
Return the current value of a global variable.
Base.pop!
— Methodpop!(CuContext)
Pops the current CUDA context from the current CPU thread.
Base.push!
— Methodpush!(CuContext, ctx::CuContext)
Pushes a context on the current CPU thread.
Base.rand
— MethodRandom.rand(rng::Philox2x32, UInt32)
Generate a byte of random data using the on-device Tausworthe generator.
Base.resize!
— Methodresize!(a::CuVector, n::Integer)
Resize a
to contain n
elements. If n
is smaller than the current collection length, the first n
elements will be retained. If n
is larger, the new elements are not guaranteed to be initialized.
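A minimal sketch of both directions, assuming a functional GPU; the variable names are illustrative. Only the shrinking case has a defined result, so the grown elements are not inspected.

```julia
using CUDA

if CUDA.functional()  # only run when a usable GPU is present
    a = CuArray([1, 2, 3, 4])

    resize!(a, 2)        # shrink: the first two elements are retained
    kept = Array(a)

    resize!(a, 5)        # grow: the new elements hold unspecified values
    grown_length = length(a)
end
```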
Base.setindex!
— MethodBase.setindex!(var::CuGlobal{T}, val::T)
Set the value of a global variable to val
.
Base.unsafe_wrap
— Function
# simple case, wrapping a CuArray around an existing GPU pointer
unsafe_wrap(CuArray, ptr::CuPtr{T}, dims; own=false, ctx=context())
# wraps a CPU array object around a unified GPU array
unsafe_wrap(Array, a::CuArray)
# wraps a GPU array object around a CPU array.
# if your system supports HMM, this is a fast operation.
# in other cases, it has to use page locking, which can be slow.
unsafe_wrap(CuArray, ptr::Ptr{T}, dims)
unsafe_wrap(CuArray, a::Array)
Wrap a CuArray
object around the data at the address given by the CUDA-managed pointer ptr
. The element type T
determines the array element type. dims
is either an integer (for a 1d array) or a tuple of the array dimensions. own
optionally specifies whether Julia should take ownership of the memory, calling cudaFree
when the array is no longer referenced. The ctx
argument determines the CUDA context in which the data is allocated.
CUDA.CuDynamicSharedArray
— MethodCuDynamicSharedArray(T::Type, dims, offset::Integer=0) -> CuDeviceArray{T,N,AS.Shared}
Get an array of type T
and dimensions dims
(either an integer length or tuple shape) pointing to a dynamically-allocated piece of shared memory. The type should be statically inferable or an error will be thrown and the generator function will be called dynamically.
Note that the amount of dynamic shared memory needs to be specified when launching the kernel.
Optionally, an offset parameter indicating how many bytes to add to the base shared memory pointer can be specified. This is useful when dealing with a heterogeneous buffer of dynamic shared memory; in the case of a homogeneous multi-part buffer it is preferred to use view
.
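The interplay between the kernel-side request and the launch-time `shmem` amount can be sketched as follows. The kernel `reverse_kernel!` is a hypothetical example that reverses a vector within a single block; it assumes a functional GPU and a vector no longer than the block size.

```julia
using CUDA

# Hypothetical kernel: reverse a vector in place through dynamic shared memory.
function reverse_kernel!(a)
    t = threadIdx().x
    n = length(a)
    # The buffer length is only known at launch time, hence dynamic shared memory.
    tmp = CuDynamicSharedArray(Float32, n)
    @inbounds tmp[t] = a[t]
    sync_threads()  # wait until every thread has stored its element
    @inbounds a[t] = tmp[n - t + 1]
    return nothing
end

if CUDA.functional()  # only run when a usable GPU is present
    a = CuArray(Float32.(1:64))
    # shmem must cover the number of bytes requested inside the kernel.
    @cuda threads=64 shmem=64*sizeof(Float32) reverse_kernel!(a)
    reversed = Array(a)
end
```

If `shmem` is left at its default of 0, the kernel would index memory that was never reserved, so the launch-side byte count and the kernel-side dimensions have to be kept in agreement by hand.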
CUDA.CuStaticSharedArray
— MethodCuStaticSharedArray(T::Type, dims) -> CuDeviceArray{T,N,AS.Shared}
Get an array of type T
and dimensions dims
(either an integer length or tuple shape) pointing to a statically-allocated piece of shared memory. The type should be statically inferable and the dimensions should be constant, or an error will be thrown and the generator function will be called dynamically.
CUDA.access!
— Methodaccess!(pool::CuMemoryPool, dev::CuDevice, flags::CUmemAccess_flags)
access!(pool::CuMemoryPool, devs::Vector{CuDevice}, flags::CUmemAccess_flags)
Control the visibility of memory pool pool
on device dev
or a list of devices devs
.
CUDA.activate
— Methodactivate(ctx::CuContext)
Binds the specified CUDA context to the calling CPU thread.
CUDA.active_blocks
— Methodactive_blocks(fun::CuFunction, threads; shmem=0)
Calculate the maximum number of active blocks per multiprocessor when running threads
threads of a kernel fun
requiring shmem
bytes of dynamic shared memory.
CUDA.active_mask
— Methodactive_mask()
Returns a 32-bit mask indicating which threads in a warp are active together with the currently executing thread.
CUDA.add_data!
— Methodadd_data!(link::CuLink, name::String, code::String)
Add PTX code to a pending link operation.
CUDA.add_data!
— Methodadd_data!(link::CuLink, name::String, data::Vector{UInt8})
Add object code to a pending link operation.
CUDA.add_file!
— Methodadd_file!(link::CuLink, path::String, typ::CUjitInputType)
Add data from a file to a link operation. The argument typ
indicates the type of the contained data.
CUDA.advise
— Functionadvise(::UnifiedMemory, advice::CUDA.CUmem_advise, [bytes::Integer]; [device::CuDevice])
Advise about the usage of a given memory range.
CUDA.alloc
— Functionalloc(UnifiedMemory, bytesize::Integer, [flags::CUmemAttach_flags])
Allocate bytesize
bytes of unified memory. This memory is accessible from both the CPU and GPU, with the CUDA driver automatically copying upon first access.
CUDA.alloc
— Functionalloc(HostMemory, bytesize::Integer, [flags])
Allocate bytesize
bytes of page-locked memory on the host. This memory is accessible from the CPU, and makes it possible to perform faster memory copies to the GPU. Furthermore, if flags
is set to MEMHOSTALLOC_DEVICEMAP
the memory is also accessible from the GPU. These accesses are direct, and go through the PCI bus. If flags
is set to MEMHOSTALLOC_PORTABLE
, the memory is considered mapped by all CUDA contexts, not just the one that created the memory, which is useful if the memory needs to be accessed from multiple devices. Multiple flags
can be set at one time using a bitwise OR
:
flags = MEMHOSTALLOC_PORTABLE | MEMHOSTALLOC_DEVICEMAP
CUDA.alloc
— Methodalloc(DeviceMemory, bytesize::Integer;
[async=false], [stream::CuStream], [pool::CuMemoryPool])
Allocate bytesize
bytes of memory on the device. This memory is only accessible on the GPU, and requires explicit calls to unsafe_copyto!
, which wraps cuMemcpy
, for access on the CPU.
CUDA.alloc
— Methodalloc(ArrayMemory, dims::Dims)
Allocate array memory with dimensions dims
. The memory is accessible on the GPU, but can only be used in conjunction with special intrinsics (e.g., texture intrinsics).
CUDA.atomic_add!
— Functionatomic_add!(ptr::LLVMPtr{T}, val::T)
Reads the value old
located at address ptr
, computes old + val
, and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old
.
This operation is supported for values of type Int32, Int64, UInt32, UInt64, and Float32. Additionally, on GPU hardware with compute capability 6.0+, values of type Float64 are supported.
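As a small illustration, the sketch below lets every launched thread add 1 to a single shared counter. The kernel name `count_kernel!` and the launch sizes are made up for this example, and it assumes that `pointer` applied to a device array inside a kernel yields a pointer these functions accept.

```julia
using CUDA

# Hypothetical kernel: every thread atomically increments the same counter.
function count_kernel!(counter)
    # pointer(counter) refers to the first element of the device array.
    CUDA.atomic_add!(pointer(counter), Int32(1))
    return nothing
end

if CUDA.functional()  # only run when a usable GPU is present
    counter = CUDA.zeros(Int32, 1)
    @cuda threads=256 blocks=4 count_kernel!(counter)
    total = Array(counter)[1]  # one increment per launched thread
end
```

Without the atomic transaction, concurrent read-modify-write sequences from different threads could overwrite each other and the final count would generally be too low.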
CUDA.atomic_and!
— Functionatomic_and!(ptr::LLVMPtr{T}, val::T)
Reads the value old
located at address ptr
, computes old & val
, and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old
.
This operation is supported for values of type Int32, Int64, UInt32 and UInt64.
CUDA.atomic_cas!
— Functionatomic_cas!(ptr::LLVMPtr{T}, cmp::T, val::T)
Reads the value old
located at address ptr
and compares it with cmp
. If old
equals cmp
, stores val
at the same address. Otherwise, leaves the stored value unchanged
. These operations are performed in one atomic transaction. The function returns old
.
This operation is supported for values of type Int32, Int64, UInt32 and UInt64. Additionally, on GPU hardware with compute capability 7.0+, values of type UInt16 are supported.
CUDA.atomic_dec!
— Functionatomic_dec!(ptr::LLVMPtr{T}, val::T)
Reads the value old
located at address ptr
, computes (((old == 0) | (old > val)) ? val : (old-1) )
, and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns old
.
This operation is only supported for values of type Int32.
CUDA.atomic_inc!
— Functionatomic_inc!(ptr::LLVMPtr{T}, val::T)
Reads the value old
located at address ptr
, computes ((old >= val) ? 0 : (old+1))
, and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns old
.
This operation is only supported for values of type Int32.
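The wrap-around rules of atomic_dec! and atomic_inc! are easier to see when traced on the CPU. The helpers below (`ref_inc`, `ref_dec`, `trace`) are hypothetical and not part of CUDA.jl; they only evaluate the quoted formulas in plain Julia and provide no atomicity.

```julia
# Plain-Julia reference model of the stored value (no GPU, no atomicity).
ref_inc(old, val) = (old >= val) ? zero(old) : old + one(old)
ref_dec(old, val) = ((old == 0) | (old > val)) ? val : old - one(old)

# Apply `step` repeatedly and record the value stored after each application.
function trace(step, start, val, n)
    vals = typeof(start)[]
    cur = start
    for _ in 1:n
        cur = step(cur, val)
        push!(vals, cur)
    end
    return vals
end

inc_trace = trace(ref_inc, Int32(0), Int32(3), 6)
dec_trace = trace(ref_dec, Int32(0), Int32(3), 6)

println(inc_trace)  # counts up to val, then wraps to 0
println(dec_trace)  # wraps from 0 to val, then counts down
```

With `val = 3`, the increment sequence cycles through 1, 2, 3, 0 and the decrement sequence through 3, 2, 1, 0, so both behave like counters modulo `val + 1`.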
CUDA.atomic_max!
— Functionatomic_max!(ptr::LLVMPtr{T}, val::T)
Reads the value old
located at address ptr
, computes max(old, val)
, and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old
.
This operation is supported for values of type Int32, Int64, UInt32 and UInt64.
CUDA.atomic_min!
— Functionatomic_min!(ptr::LLVMPtr{T}, val::T)
Reads the value old
located at address ptr
, computes min(old, val)
, and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old
.
This operation is supported for values of type Int32, Int64, UInt32 and UInt64.
CUDA.atomic_or!
— Functionatomic_or!(ptr::LLVMPtr{T}, val::T)
Reads the value old
located at address ptr
, computes old | val
, and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old
.
This operation is supported for values of type Int32, Int64, UInt32 and UInt64.
CUDA.atomic_sub!
— Functionatomic_sub!(ptr::LLVMPtr{T}, val::T)
Reads the value old
located at address ptr
, computes old - val
, and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old
.
This operation is supported for values of type Int32, Int64, UInt32 and UInt64.
CUDA.atomic_xchg!
— Functionatomic_xchg!(ptr::LLVMPtr{T}, val::T)
Reads the value old
located at address ptr
and stores val
at the same address. These operations are performed in one atomic transaction. The function returns old
.
This operation is supported for values of type Int32, Int64, UInt32 and UInt64.
CUDA.atomic_xor!
— Functionatomic_xor!(ptr::LLVMPtr{T}, val::T)
Reads the value old
located at address ptr
, computes old ⊻ val
, and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old
.
This operation is supported for values of type Int32, Int64, UInt32 and UInt64.
CUDA.attribute!
— Methodattribute!(ptr::Union{Ptr,CuPtr}, attr, val)
Sets attribute attr
on a pointer ptr
to val
.
CUDA.attribute!
— Methodattribute!(ptr::Union{Ptr,CuPtr}, attr, val)
Sets attribute attr
on a pointer ptr
to val
.
CUDA.attribute
— Methodattribute(dev::CuDevice, code)
Returns information about the device.
CUDA.attribute
— Methodattribute(X, ptr::Union{Ptr,CuPtr}, attr)
Returns attribute attr
about pointer ptr
. The type of the returned value depends on the attribute, and as such must be passed as the X
parameter.
CUDA.attribute
— Methodattribute(X, pool::CuMemoryPool, attr)
Returns attribute attr
about pool
. The type of the returned value depends on the attribute, and as such must be passed as the X
parameter.
CUDA.blockDim
— MethodblockDim()::NamedTuple
Returns the dimensions of the block.
CUDA.blockIdx
— MethodblockIdx()::NamedTuple
Returns the block index within the grid.
CUDA.cached_memory
— Methodcached_memory()
Returns the amount of backing memory currently allocated for the CUDA memory pool.
CUDA.capability
— Methodcapability(dev::CuDevice)
Returns the compute capability of the device.
CUDA.capture
— Functioncapture([flags], [throw_error::Bool=true]) do
...
end
Capture a graph of CUDA operations. The returned graph can then be instantiated and executed repeatedly for improved performance.
Note that many operations, like initial kernel compilation or memory allocations, cannot be captured. To work around this, you can set the throw_error
keyword to false, which will cause this function to return nothing
if such a failure happens. You can then try to evaluate the function in a regular way, and re-record afterwards.
See also: instantiate
.
CUDA.clock
— Methodclock(UInt32)
Returns the value of a per-multiprocessor counter that is incremented every clock cycle.
CUDA.clock
— Methodclock(UInt64)
Returns the value of a per-multiprocessor counter that is incremented every clock cycle.
CUDA.code_sass
— Methodcode_sass([io], f, types; raw=false)
code_sass(f, [io]; raw=false)
Prints the SASS code corresponding to one or more CUDA modules to io
, which defaults to stdout
.
If providing both f
and types
, it is assumed that this uniquely identifies a kernel function, for which SASS code will be generated, and printed to io
.
If only providing a callable function f
, typically specified using the do
syntax, the SASS code for all modules executed during evaluation of f
will be printed. This can be convenient to display the SASS code for functions whose source code is not available.
The following keyword arguments are supported:
- raw: dump the assembly like nvdisasm reports it, without post-processing;
- in the case of specifying f and types: all keyword arguments from cufunction
See also: @device_code_sass
CUDA.complete
— Methodcomplete(link::CuLink)
Complete a pending linker invocation, returning an output image.
CUDA.context!
— Methodcontext!(ctx::CuContext)
context!(ctx::CuContext) do ... end
Bind the current host thread to the context ctx
. Returns the previously-bound context. If used with do-block syntax, the change is only temporary.
Note that the contexts used with this call should be previously acquired by calling context
, and not arbitrary contexts created by calling the CuContext
constructor.
CUDA.context
— Methodcontext(ptr)
Identify the context memory was allocated in.
CUDA.context
— Methodcontext()::CuContext
Get or create a CUDA context for the current thread (as opposed to current_context
which may return nothing
if there is no context bound to the current thread).
CUDA.cu
— Methodcu(A; unified=false)
Opinionated GPU array adaptor, which may alter the element type T
of arrays:
- For T<:AbstractFloat, it makes a CuArray{Float32} for performance reasons. (Except that Float16 and BFloat16 element types are not changed.)
- For T<:Complex{<:AbstractFloat} it makes a CuArray{ComplexF32}.
- For other isbitstype(T), it makes a CuArray{T}.
By contrast, CuArray(A)
never changes the element type.
Uses Adapt.jl to act inside some wrapper structs.
Examples
julia> cu(ones(3)')
1×3 adjoint(::CuArray{Float32, 1, CUDA.DeviceMemory}) with eltype Float32:
1.0 1.0 1.0
julia> cu(zeros(1, 3); unified=true)
1×3 CuArray{Float32, 2, CUDA.UnifiedMemory}:
0.0 0.0 0.0
julia> cu(1:3)
1:3
julia> CuArray(ones(3)') # ignores Adjoint, preserves Float64
1×3 CuArray{Float64, 2, CUDA.DeviceMemory}:
1.0 1.0 1.0
julia> adapt(CuArray, ones(3)') # this restores Adjoint wrapper
1×3 adjoint(::CuArray{Float64, 1, CUDA.DeviceMemory}) with eltype Float64:
1.0 1.0 1.0
julia> CuArray(1:3)
3-element CuArray{Int64, 1, CUDA.DeviceMemory}:
1
2
3
CUDA.cudacall
— Functioncudacall(f, types, values...; blocks::CuDim, threads::CuDim,
cooperative=false, shmem=0, stream=stream())
ccall
-like interface for launching a CUDA function f
on a GPU.
For example:
vadd = CuFunction(md, "vadd")  # md is a previously loaded CuModule
a = rand(Float32, 10)
b = rand(Float32, 10)
# allocate device buffers and upload the inputs
ad = alloc(CUDA.DeviceMemory, 10*sizeof(Float32))
unsafe_copyto!(ad, convert(Ptr{Cvoid}, a), 10*sizeof(Float32))
bd = alloc(CUDA.DeviceMemory, 10*sizeof(Float32))
unsafe_copyto!(bd, convert(Ptr{Cvoid}, b), 10*sizeof(Float32))
c = zeros(Float32, 10)
cd = alloc(CUDA.DeviceMemory, 10*sizeof(Float32))
# launch a single block of 10 threads
cudacall(vadd, (CuPtr{Cfloat},CuPtr{Cfloat},CuPtr{Cfloat}), ad, bd, cd; threads=10)
# download the result
unsafe_copyto!(convert(Ptr{Cvoid}, c), cd, 10*sizeof(Float32))
The blocks
and threads
arguments control the launch configuration, and should both consist of either an integer, or a tuple of 1 to 3 integers (omitted dimensions default to 1). The types
argument can be either a tuple of types or a tuple type, the latter being slightly faster.
CUDA.cudaconvert
— Methodcudaconvert(x)
This function is called for every argument to be passed to a kernel, allowing it to be converted to a GPU-friendly format. By default, the function does nothing and returns the input object x
as-is.
Do not add methods to this function, but instead extend the underlying Adapt.jl package and register methods for the CUDA.KernelAdaptor
type.
CUDA.cufunction
— Methodcufunction(f, tt=Tuple{}; kwargs...)
Low-level interface to compile a function invocation for the currently-active GPU, returning a callable kernel object. For a higher-level interface, use @cuda
.
The following keyword arguments are supported:
- minthreads: the required number of threads in a thread block
- maxthreads: the maximum number of threads in a thread block
- blocks_per_sm: a minimum number of thread blocks to be scheduled on a single multiprocessor
- maxregs: the maximum number of registers to be allocated to a single thread (only supported on LLVM 4.0+)
- name: override the name that the kernel will have in the generated code
- always_inline: inline all function calls in the kernel
- fastmath: use less precise square roots and flush denormals
- cap and ptx: to override the compute capability and PTX version to compile for
The output of this function is automatically cached, i.e. you can simply call cufunction
in a hot path without degrading performance. New code will be generated automatically when the function changes, or when different types or keyword arguments are provided.
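A rough sketch of the low-level path, combining cudaconvert and cufunction by hand instead of going through `@cuda`. The kernel `double!` and the launch sizes are illustrative assumptions, and the host-side part requires a functional GPU.

```julia
using CUDA

# Hypothetical kernel that doubles every element (the name is illustrative).
function double!(x)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(x)
        @inbounds x[i] *= 2
    end
    return nothing
end

if CUDA.functional()  # only run when a usable GPU is present
    x = CUDA.ones(Float32, 512)
    GC.@preserve x begin
        # Convert host-side arguments to their device-side representation.
        kernel_args = cudaconvert.((x,))
        # Build the tuple type describing the converted arguments.
        kernel_tt = Tuple{map(Core.Typeof, kernel_args)...}
        # Compile (or fetch from the cache) a kernel for this signature.
        kernel = cufunction(double!, kernel_tt)
        kernel(kernel_args...; threads=256, blocks=2)
    end
    doubled = Array(x)
end
```

Because the result is cached, repeating the `cufunction` call with the same function and argument types should not trigger another compilation.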
CUDA.current_context
— Methodcurrent_context()
Returns the current context. Throws an undefined reference error if the current thread has no context bound to it, or if the bound context has been destroyed.
This is a low-level API, returning the current context as known to the CUDA driver. For most users, it is recommended to use the context
method instead.
CUDA.current_device
— Methodcurrent_device()
Returns the current device.
This is a low-level API, returning the current device as known to the CUDA driver. For most users, it is recommended to use the device
method instead.
CUDA.default_stream
— Methoddefault_stream()
Return the default stream.
It is generally better to use stream()
to get a stream object that's local to the current task. That way, operations scheduled in other tasks can overlap.
CUDA.description
— Methoddescription(err::CuError)
Gets the string description of an error code.
CUDA.device!
— Functiondevice!(dev::Integer)
device!(dev::CuDevice)
device!(dev) do ... end
Sets dev
as the current active device for the calling host thread. Devices can be specified by integer id, or as a CuDevice
(slightly faster). Both functions can be used with do-block syntax, in which case the device is only changed temporarily, without changing the default device used to initialize new threads or tasks.
Calling this function at the start of a session will make sure CUDA is initialized (i.e., a primary context will be created and activated).
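The two calling styles can be sketched as follows, assuming a functional GPU with at least one device; the variable names are illustrative.

```julia
using CUDA

if CUDA.functional()  # only run when a usable GPU is present
    # Select device 0 by integer id for the calling thread.
    device!(0)
    selected = CUDA.deviceid(device())

    # Temporarily switch to a device given as a CuDevice; the previous
    # selection is restored once the block finishes.
    device!(CuDevice(0)) do
        println("inside do-block on device ", CUDA.deviceid(device()))
    end
end
```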
CUDA.device
— Methoddevice(::CuContext)
Returns the device for a context.
CUDA.device
— Methoddevice(ptr)
Identify the device on which the memory was allocated.
CUDA.device
— Methoddevice()::CuDevice
Get the CUDA device for the current thread, similar to how context()
works compared to current_context()
.
CUDA.device_pointer
— Methoddevice_pointer(ptr::Ptr)
Returns the device pointer value through which ptr
may be accessed by kernels running in the current context.
CUDA.device_reset!
— Functiondevice_reset!(dev::CuDevice=device())
Reset the CUDA state associated with a device. This call will release the underlying context, at which point any objects allocated in that context will be invalidated.
Note that this does not guarantee to free up all memory allocations, as many are not bound to a context, so it is generally not useful to call this function to free up memory.
This function is only reliable on CUDA driver >= v12.0, and may lead to crashes if used on older drivers.
CUDA.device_synchronize
— Methoddevice_synchronize()
Block until all operations on ctx
have completed. This is a heavyweight operation; typically you only need to call synchronize
which only synchronizes the stream associated with the current task.
On the device, device_synchronize
acts as a synchronization point for child grids in the context of dynamic parallelism.
CUDA.deviceid
— Methoddeviceid(dev::CuDevice)::Int
Get the ID number of the current device of execution. This is a 0-indexed number, corresponding to the device ID as known to CUDA.
CUDA.devices
— Methoddevices()
Get an iterator for the compute devices.
CUDA.driver_version
— Methoddriver_version()
Returns the latest version of CUDA supported by the loaded driver.
CUDA.dynamic_cufunction
— Methoddynamic_cufunction(f, tt=Tuple{})
Low-level interface to compile a function invocation for the currently-active GPU, returning a callable kernel object. Device-side equivalent of CUDA.cufunction
.
No keyword arguments are supported.
CUDA.elapsed
— Methodelapsed(start::CuEvent, stop::CuEvent)
Computes the elapsed time between two events (in seconds).
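As a rough sketch of how elapsed fits together with record and synchronize, the snippet below times a broadcast between two events. The array size and the broadcast are arbitrary choices for illustration, and a functional GPU is assumed.

```julia
using CUDA

a = CUDA.rand(Float32, 1024, 1024)

start = CuEvent()
stop  = CuEvent()

CUDA.record(start)        # recorded on the current task's stream
b = a .+ 1f0              # some GPU work to time
CUDA.record(stop)

CUDA.synchronize(stop)    # block the host until `stop` has completed
seconds = CUDA.elapsed(start, stop)
println("broadcast took ", seconds, " s")
```

For everyday timing, the CUDA.@elapsed macro documented further down wraps the same idea in a single call.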
CUDA.exit
— Methodexit()
Terminate a thread.
CUDA.flags
— Methodflags(pctx::CuPrimaryContext)
Query the flags of a primary context.
CUDA.free_memory
— Methodfree_memory()
Returns the free amount of memory (in bytes), available for allocation by the CUDA context.
CUDA.functional
— Functionfunctional(show_reason=false)
Check if the package has been configured successfully and is ready to use.
This call is intended for packages that support conditionally using an available GPU. If you fail to check whether CUDA is functional, actual use of functionality might warn and error.
CUDA.gridDim
— MethodgridDim()::NamedTuple
Returns the dimensions of the grid.
CUDA.has_cuda
— Functionhas_cuda()::Bool
Check whether the local system provides an installation of the CUDA driver and runtime. Use this function if your code loads packages that require CUDA.jl.
CUDA.has_cuda_gpu
— Functionhas_cuda_gpu()::Bool
Check whether the local system provides an installation of the CUDA driver and runtime, and if it contains a CUDA-capable GPU. See has_cuda
for more details.
Note that this function initializes the CUDA API in order to check for the number of GPUs.
CUDA.host_pointer
— Methodhost_pointer(ptr::CuPtr)
Returns the host pointer value through which ptr
may be accessed by the host program.
CUDA.instantiate
— Function
CUDA.isactive
— Methodisactive(pctx::CuPrimaryContext)
Query whether a primary context is active.
CUDA.isdone
— Methodisdone(e::CuEvent)
Return false
if there is outstanding work preceding the most recent call to record(e)
and true
if all captured work has been completed.
CUDA.isdone
— Methodisdone(s::CuStream)
Return false
if a stream is busy (has task running or queued) and true
if that stream is free.
CUDA.laneid
— Methodlaneid()::Int32
Returns the thread's lane within the warp.
CUDA.lanemask
— Methodlanemask(pred)::UInt32
Returns a 32-bit mask indicating which threads in a warp satisfy the given predicate. Supported predicates are ==
, <
, <=
, >=
, and >
.
CUDA.launch
— Functionlaunch(exec::CuGraphExec, [stream::CuStream])
Launches an executable graph, by default in the currently-active stream.
CUDA.launch
— Methodlaunch(f::CuFunction, args...; blocks::CuDim=1, threads::CuDim=1,
cooperative=false, shmem=0, stream=stream())
Low-level call to launch a CUDA function f
on the GPU, using blocks
and threads
as respectively the grid and block configuration. Dynamic shared memory is allocated according to shmem
, and the kernel is launched on stream stream
.
Arguments to a kernel should either be bitstype, in which case they will be copied to the internal kernel parameter buffer, or a pointer to device memory.
This is a low-level call, prefer to use cudacall
instead.
CUDA.launch_configuration
— Methodlaunch_configuration(fun::CuFunction; shmem=0, max_threads=0)
Calculate a suggested launch configuration for kernel fun
requiring shmem
bytes of dynamic shared memory. Returns a tuple with a suggested amount of threads, and the minimal amount of blocks to reach maximal occupancy. Optionally, the maximum amount of threads can be constrained using max_threads
.
In the case of a variable amount of shared memory, pass a callable object for shmem
instead, taking a single integer representing the block size and returning the amount of dynamic shared memory for that configuration.
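One common way to use launch_configuration is to compile a kernel without launching it, ask for a suggested configuration, and then launch with it. The sketch below assumes a simple element-wise kernel; vadd! and the array sizes are illustrative names and values, not part of the API.

```julia
using CUDA

# Illustrative kernel: element-wise addition with a bounds check.
function vadd!(c, a, b)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(c)
        @inbounds c[i] = a[i] + b[i]
    end
    return nothing
end

a = CUDA.rand(Float32, 10_000)
b = CUDA.rand(Float32, 10_000)
c = similar(a)

# Compile only, then query a suggested configuration for the compiled function.
kernel = @cuda launch=false vadd!(c, a, b)
config = launch_configuration(kernel.fun)

# Use the suggested block size, and enough blocks to cover the whole array.
threads = min(length(c), config.threads)
blocks  = cld(length(c), threads)
kernel(c, a, b; threads, blocks)
```

The suggested number of blocks is the minimum needed for full occupancy, so for a kernel that processes one element per thread it is usually the number of threads that matters, with the block count derived from the problem size as above.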
CUDA.legacy_stream
— Methodlegacy_stream()
Return a special object to use an implicit stream with legacy synchronization behavior.
You can use this stream to perform operations that should block on all streams (with the exception of streams created with STREAM_NON_BLOCKING
). This matches the old pre-CUDA 7 global stream behavior.
CUDA.maxthreads
— Methodmaxthreads(k::HostKernel)
Queries the maximum amount of threads a kernel can use in a single block.
CUDA.memory
— Methodmemory(k::HostKernel)
Queries the local, shared and constant memory usage of a compiled kernel in bytes. Returns a named tuple.
CUDA.memset
— Functionmemset(mem::CuPtr, value::Union{UInt8,UInt16,UInt32}, len::Integer; [stream::CuStream])
Initialize device memory by copying value
for len
times.
CUDA.name
— Methodname(dev::CuDevice)
Returns an identifier string for the device.
CUDA.name
— Methodname(err::CuError)
Gets the string representation of an error code.
julia> err = CuError(CUDA.cudaError_enum(1))
CuError(CUDA_ERROR_INVALID_VALUE)
julia> name(err)
"ERROR_INVALID_VALUE"
CUDA.nanosleep
— Methodnanosleep(t)
Puts a thread to sleep for a given amount t
(in nanoseconds).
Requires CUDA >= 10.0 and sm_6.2
CUDA.nextwarp
— Methodnextwarp(dev, threads)
prevwarp(dev, threads)
Returns the next or previous nearest number of threads that is a multiple of the warp size of a device dev
. This is a common requirement when using intra-warp communication.
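The rounding described above can be illustrated on the host with plain integer arithmetic. The helpers below are hypothetical stand-ins (they are not CUDA.jl functions) and assume a warp size of 32; on a real device, query warpsize(dev) instead. They reflect one reading of the description, in which a value that is already a multiple of the warp size is returned unchanged.

```julia
# Hypothetical helpers, not part of CUDA.jl: round a thread count to a multiple
# of the warp size `ws` (assumed to be 32 here).
round_up_to_warp(threads::Integer, ws::Integer=32) = cld(threads, ws) * ws
round_down_to_warp(threads::Integer, ws::Integer=32) = fld(threads, ws) * ws

up   = round_up_to_warp(100)     # 128
down = round_down_to_warp(100)   # 96
same = round_up_to_warp(64)      # 64, already a multiple of the warp size
```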
CUDA.occupancy
— Methodoccupancy(fun::CuFunction, threads; shmem=0)
Calculate the theoretical occupancy of launching threads
threads of a kernel fun
requiring shmem
bytes of dynamic shared memory.
CUDA.p2p_attribute
— Methodp2p_attribute(src::CuDevice, dst::CuDevice, code)
Returns information about the P2P relationship between a pair of devices.
CUDA.per_thread_stream
— Methodper_thread_stream()
Return a special object to use an implicit stream with per-thread synchronization behavior. This stream object is normally meant to be used with APIs that do not have per-thread versions of their APIs (i.e. without a ptsz
or ptds
suffix).
It is generally not needed to use this type of stream. With CUDA.jl, each task already gets its own non-blocking stream, and multithreading in Julia is typically accomplished using tasks.
CUDA.pool_alloc
— Methodpool_alloc([DeviceMemory], sz)::Managed{<:AbstractMemory}
Allocate a number of bytes sz
from the memory pool on the current stream. Returns a managed memory object; may throw an OutOfGPUMemoryError
if the allocation request cannot be satisfied.
CUDA.pool_free
— Methodpool_free(mem::Managed{<:AbstractMemory})
Releases memory to the pool. If possible, this operation will not block but will be ordered against the stream that last used the memory.
CUDA.pool_status
— Functionpool_status([io=stdout])
Report to io
on the memory status of the current GPU and the active memory pool.
CUDA.prefetch
— Functionprefetch(::UnifiedMemory, [bytes::Integer]; [device::CuDevice], [stream::CuStream])
Prefetches memory to the specified destination device.
CUDA.prevwarp
— Methodnextwarp(dev, threads)
prevwarp(dev, threads)
Returns the next or previous nearest number of threads that is a multiple of the warp size of a device dev
. This is a common requirement when using intra-warp communication.
CUDA.priority
— Methodpriority(s::CuStream)
Return the priority of a stream s
.
CUDA.priority_range
— Methodpriority_range()
Return the valid range of stream priorities as a StepRange
(with step size 1). The lower bound of the range denotes the least priority (typically 0), with the upper bound representing the greatest possible priority (typically -1).
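A short sketch of how the range might be used when creating streams: the first element is taken as the least priority and the last as the greatest, following the description above. It assumes the priority keyword of the CuStream constructor and a functional GPU.

```julia
using CUDA

prio = CUDA.priority_range()

# First element: least priority; last element: greatest priority.
low_stream  = CuStream(; priority=first(prio))
high_stream = CuStream(; priority=last(prio))

println("low: ", CUDA.priority(low_stream), ", high: ", CUDA.priority(high_stream))
```

Note that numerically smaller values mean higher priority, which is why the greatest priority is typically the negative end of the range.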
CUDA.reclaim
— Functionreclaim([sz=typemax(Int)])
Reclaims sz
bytes of cached memory. Use this to free GPU memory before calling into functionality that does not use the CUDA memory pool. Returns the number of bytes actually reclaimed.
CUDA.record
— Functionrecord(e::CuEvent, [stream::CuStream])
Record an event on a stream.
CUDA.register
— Functionregister(HostMemory, ptr::Ptr, bytesize::Integer, [flags])
Page-lock the host memory pointed to by ptr
. Subsequent transfers to and from devices will be faster, and can be executed asynchronously. If the MEMHOSTREGISTER_DEVICEMAP
flag is specified, the buffer will also be accessible directly from the GPU. These accesses are direct, and go through the PCI bus. If the MEMHOSTREGISTER_PORTABLE
flag is specified, any CUDA context can access the memory.
CUDA.registers
— Methodregisters(k::HostKernel)
Queries the register usage of a kernel.
CUDA.reset_runtime_version!
— MethodCUDA.reset_runtime_version!()
Resets the CUDA version preferences in the active project to the default, which is to use the most recent compatible runtime available from an artifact source, unless a higher-up depot has configured a different preference. To force use of the default behavior for the local project, use CUDA.set_runtime_version!
with no arguments.
CUDA.retry_reclaim
— Methodretry_reclaim(retry_if) do
# code that may fail due to insufficient GPU memory
end
Run a block of code repeatedly until it successfully allocates the memory it needs. Retries are only attempted when calling retry_if
with the current return value returns true. At each try, more memory is freed from the CUDA memory pool. When that is no longer possible, the last returned value is returned.
This function is intended for use with CUDA APIs, which sometimes allocate (outside of the CUDA memory pool) and return a specific error code when failing to. It is similar to Base.retry
, but deals with return values instead of exceptions for performance reasons.
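To make the control flow concrete, the sketch below uses a hypothetical stand-in for a library call that reports allocation failure through its return value. fake_library_call and the symbols it returns are invented for illustration; a real use would wrap a CUDA library call and test for that library's allocation-failure status code.

```julia
using CUDA

# Hypothetical stand-in for a library call that signals failure through its
# return value; it "fails" twice before succeeding.
attempts = Ref(0)
function fake_library_call()
    attempts[] += 1
    return attempts[] < 3 ? :alloc_failed : :success
end

# Retry for as long as the predicate holds for the returned value.
status = CUDA.retry_reclaim(isequal(:alloc_failed)) do
    fake_library_call()
end
```

Because the predicate inspects a return value rather than catching an exception, the fast path (first call succeeds) costs no more than a single comparison.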
CUDA.return_type
— MethodCUDA.return_type(f, tt) -> r::Type
Return a type r
such that f(args...)::r
where args::tt
.
CUDA.runtime_version
— Methodruntime_version()
Returns the CUDA Runtime version.
CUDA.set_runtime_version!
— FunctionCUDA.set_runtime_version!([version::VersionNumber]; [local_toolkit::Bool])
Configures the active project to use a specific CUDA toolkit version from a specific source.
If local_toolkit
is set, the CUDA toolkit will be used from the local system, otherwise it will be downloaded from an artifact source. In the case of a local toolkit, version
informs CUDA.jl which version that is (this may be useful if auto-detection fails). In the case of artifact sources, version
controls which version will be downloaded and used.
When not specifying either the version
or the local_toolkit
argument, the default behavior will be used, which is to use the most recent compatible runtime available from an artifact source. Note that this will override any Preferences that may be configured in a higher-up depot; to clear preferences nondestructively, use CUDA.reset_runtime_version!
instead.
CUDA.setflags!
— Methodsetflags!(pctx::CuPrimaryContext)
Set the flags of a primary context.
CUDA.shfl_down_sync
— Functionshfl_down_sync(threadmask::UInt32, val, delta::Integer, width::Integer=32)
Shuffle a value from a lane with higher ID relative to caller, and synchronize threads according to threadmask
.
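A typical use of shfl_down_sync is a warp-level reduction, where each step halves the number of lanes that still hold partial results. The kernel below is a minimal sketch that sums 32 values within a single warp; the kernel name and the input data are illustrative.

```julia
using CUDA

# Sum 32 values within one warp using shuffles; lane 1 ends up with the total.
function warp_sum_kernel(out, input)
    val = input[threadIdx().x]
    offset = 16
    while offset > 0
        # Add the value held by the lane `offset` positions higher.
        val += shfl_down_sync(0xffffffff, val, offset)
        offset >>= 1
    end
    if threadIdx().x == 1
        out[1] = val
    end
    return nothing
end

input = CuArray(Float32.(1:32))
out = CUDA.zeros(Float32, 1)
@cuda threads=32 warp_sum_kernel(out, input)
```

The mask 0xffffffff selects all 32 lanes, which matches the launch configuration of exactly one full warp.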
CUDA.shfl_recurse
— Methodshfl_recurse(op, x::T)::T
Register how a shuffle operation op
should be applied to a value x
of type T
that is not natively supported by the shuffle intrinsics.
CUDA.shfl_sync
— Functionshfl_sync(threadmask::UInt32, val, lane::Integer, width::Integer=32)
Shuffle a value from a directly indexed lane lane
, and synchronize threads according to threadmask
.
CUDA.shfl_up_sync
— Functionshfl_up_sync(threadmask::UInt32, val, delta::Integer, width::Integer=32)
Shuffle a value from a lane with lower ID relative to caller, and synchronize threads according to threadmask
.
CUDA.shfl_xor_sync
— Functionshfl_xor_sync(threadmask::UInt32, val, mask::Integer, width::Integer=32)
Shuffle a value from a lane based on bitwise XOR of own lane ID with mask
, and synchronize threads according to threadmask
.
CUDA.stream
— Functionstream()
Get the CUDA stream that should be used as the default one for the currently executing task.
CUDA.stream!
— Functionstream!(::CuStream)
stream!(::CuStream) do ... end
Change the default CUDA stream for the currently executing task, temporarily if using the do-block version of this function.
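The do-block form can be sketched as follows. The stream s and the array a are illustrative; the explicit synchronize() before switching streams is a conservative choice so the example does not depend on how a particular CUDA.jl version orders work across streams.

```julia
using CUDA

a = CUDA.ones(Float32, 256)
synchronize()             # make sure the fill has finished before using `a` elsewhere

s = CuStream()

# Temporarily make `s` the default stream of the current task.
stream!(s) do
    a .+= 1f0             # launched on `s`
    synchronize()         # synchronizes the task's current stream, here `s`
end

# Outside the block, the task's previous stream is active again.
host = Array(a)
```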
CUDA.sync_threads
— Methodsync_threads()
Waits until all threads in the thread block have reached this point and all global and shared memory accesses made by these threads prior to sync_threads()
are visible to all threads in the block.
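A small sketch of why the barrier matters: each thread writes one element to shared memory, and only after sync_threads() is it safe to read an element written by a different thread. The kernel name, the block size of 64, and the use of CuStaticSharedArray are choices made for this example.

```julia
using CUDA

# Reverse a 64-element array in place, staging the data in shared memory.
function reverse_kernel(a)
    i = threadIdx().x
    n = blockDim().x
    tmp = CuStaticSharedArray(Float32, 64)
    tmp[i] = a[i]
    sync_threads()            # all writes to `tmp` are visible past this point
    a[i] = tmp[n - i + 1]     # read an element written by another thread
    return nothing
end

a = CuArray(Float32.(1:64))
@cuda threads=64 reverse_kernel(a)
```

Without the barrier, a thread could read tmp[n - i + 1] before the thread responsible for that slot has written it.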
CUDA.sync_threads_and
— Methodsync_threads_and(predicate)
Identical to sync_threads()
with the additional feature that it evaluates predicate for all threads of the block and returns true
if and only if predicate
evaluates to true
for all of them.
CUDA.sync_threads_count
— Methodsync_threads_count(predicate)
Identical to sync_threads()
with the additional feature that it evaluates predicate for all threads of the block and returns the number of threads for which predicate
evaluates to true.
CUDA.sync_threads_or
— Methodsync_threads_or(predicate)
Identical to sync_threads()
with the additional feature that it evaluates predicate for all threads of the block and returns true
if and only if predicate
evaluates to true
for any of them.
CUDA.sync_warp
— Functionsync_warp(mask::Integer=FULL_MASK)
Waits until threads in the warp, selected by means of the bitmask mask
, have reached this point and all global and shared memory accesses made by these threads prior to sync_warp()
are visible to those threads in the warp. The default value for mask
selects all threads in the warp.
Requires CUDA >= 9.0 and sm_6.2
CUDA.synchronize
— Functionsynchronize([stream::CuStream])
Wait until stream
has finished executing, with stream
defaulting to the stream associated with the current Julia task.
See also: device_synchronize
CUDA.synchronize
— Methodsynchronize(ctx::Context)
Block until all operations on ctx
have completed. This is a heavyweight operation; typically you only need to call synchronize
which only synchronizes the stream associated with the current task.
CUDA.synchronize
— Methodsynchronize(e::CuEvent)
Waits for an event to complete.
CUDA.system_driver_version
— Methodsystem_driver_version()
Returns the latest version of CUDA supported by the original system driver, or nothing
if the driver was not upgraded.
CUDA.threadIdx
— MethodthreadIdx()::NamedTuple
Returns the thread index within the block.
CUDA.threadfence
— Methodthreadfence()
A memory fence that acts as threadfence_block
for all threads in the block of the calling thread and also ensures that no writes to all memory made by the calling thread after the call to threadfence()
are observed by any thread in the device as occurring before any write to all memory made by the calling thread before the call to threadfence()
.
Note that for this ordering guarantee to be true, the observing threads must truly observe the memory and not cached versions of it; this requires the use of volatile loads and stores, which is not available from Julia right now.
CUDA.threadfence_block
— Methodthreadfence_block()
A memory fence that ensures that:
- All writes to all memory made by the calling thread before the call to
threadfence_block()
are observed by all threads in the block of the calling thread as occurring before all writes to all memory made by the calling thread after the call to threadfence_block()
- All reads from all memory made by the calling thread before the call to
threadfence_block()
are ordered before all reads from all memory made by the calling thread after the call to threadfence_block()
.
CUDA.threadfence_system
— Methodthreadfence_system()
A memory fence that acts as threadfence_block
for all threads in the block of the calling thread and also ensures that all writes to all memory made by the calling thread before the call to threadfence_system()
are observed by all threads in the device, host threads, and all threads in peer devices as occurring before all writes to all memory made by the calling thread after the call to threadfence_system()
.
CUDA.total_memory
— Methodtotal_memory()
Returns the total amount of memory (in bytes), available for allocation by the CUDA context.
CUDA.totalmem
— Methodtotalmem(dev::CuDevice)
Returns the total amount of memory (in bytes) on the device.
CUDA.unregister
— Methodunregister(::HostMemory)
Unregisters a memory range that was registered with register
.
CUDA.unsafe_copy2d!
— Methodunsafe_copy2d!(dst, dstTyp, src, srcTyp, width, height=1;
dstPos=(1,1), dstPitch=0,
srcPos=(1,1), srcPitch=0,
async=false, stream=nothing)
Perform a 2D memory copy between pointers src
and dst
, at respectively position srcPos
and dstPos
(1-indexed). Pitch can be specified for both the source and destination; consult the CUDA documentation for more details. This call is executed asynchronously if async
is set, otherwise stream
is synchronized.
CUDA.unsafe_copy3d!
— Methodunsafe_copy3d!(dst, dstTyp, src, srcTyp, width, height=1, depth=1;
dstPos=(1,1,1), dstPitch=0, dstHeight=0,
srcPos=(1,1,1), srcPitch=0, srcHeight=0,
async=false, stream=nothing)
Perform a 3D memory copy between pointers src
and dst
, at respectively position srcPos
and dstPos
(1-indexed). Both pitch and height can be specified for both the source and destination; consult the CUDA documentation for more details. This call is executed asynchronously if async
is set, otherwise stream
is synchronized.
CUDA.unsafe_destroy!
— Methodunsafe_destroy!(ctx::CuContext)
Immediately destroy a context, freeing up all resources associated with it. This does not respect any users of the context, and might make other objects unusable.
CUDA.unsafe_free!
— MethodCUDA.unsafe_free!(a::CuArray)
Release the memory of an array for reuse by future allocations. This operation is performed automatically by the GC when an array goes out of scope, but can be called earlier to reduce pressure on the memory allocator.
CUDA.unsafe_release!
— MethodCUDA.unsafe_release!(pctx::CuPrimaryContext)
Lower the refcount of a context, possibly freeing up all resources associated with it. This does not respect any users of the context, and might make other objects unusable.
CUDA.unsafe_reset!
— Methodunsafe_reset!(pctx::CuPrimaryContext)
Explicitly destroys and cleans up all resources associated with a device's primary context in the current process. Note that this forcibly invalidates all contexts derived from this primary context, and as a result outstanding resources might become invalid.
CUDA.update
— Methodupdate(exec::CuGraphExec, graph::CuGraph; [throw_error::Bool=true])
Check whether an executable graph can be updated with a graph and perform the update if possible. Returns a boolean indicating whether the update was successful. Unless throw_error
is set to false, also throws an error if the update failed.
CUDA.used_memory
— Methodused_memory()
Returns the amount of memory from the CUDA memory pool that is currently in use by the application.
CUDA.version
— Methodversion(k::HostKernel)
Queries the PTX and SM versions a kernel was compiled for. Returns a named tuple.
CUDA.vote_all_sync
— Functionvote_all_sync(mask::UInt32, predicate::Bool)
Evaluate predicate
for all active threads of the warp and return whether predicate
is true for all of them.
CUDA.vote_any_sync
— Functionvote_any_sync(mask::UInt32, predicate::Bool)
Evaluate predicate
for all active threads of the warp and return whether predicate
is true for any of them.
CUDA.vote_ballot_sync
— Functionvote_ballot_sync(mask::UInt32, predicate::Bool)
Evaluate predicate
for all active threads of the warp and return an integer whose Nth bit is set if and only if predicate
is true for the Nth thread of the warp and the Nth thread is active.
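As an illustration of the ballot result, the sketch below counts how many lanes of a single warp hold a positive value by counting the set bits of the returned mask. The kernel name and input data are invented for the example, and a launch of exactly one full warp is assumed.

```julia
using CUDA

# Count the lanes of one warp whose input value is positive.
function ballot_kernel(out, data)
    i = threadIdx().x
    # Bit N of `mask` is set when the predicate holds for the Nth lane.
    mask = vote_ballot_sync(0xffffffff, data[i] > 0)
    if i == 1
        out[1] = count_ones(mask)
    end
    return nothing
end

# 32 values: positive at odd positions, negative at even positions.
data = CuArray(Float32[isodd(k) ? 1 : -1 for k in 1:32])
out = CUDA.zeros(Int, 1)
@cuda threads=32 ballot_kernel(out, data)
```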
CUDA.vote_uni_sync
— Functionvote_uni_sync(mask::UInt32, predicate::Bool)
Evaluate predicate
for all active threads of the warp and return whether predicate
is the same for all of them.
CUDA.wait
— Functionwait(e::CuEvent, [stream::CuStream])
Make a stream wait on an event. This only makes the stream wait, and not the host; use synchronize(::CuEvent)
for that.
CUDA.warpsize
— Methodwarpsize(dev::CuDevice)
Returns the warp size (in threads) of the device.
CUDA.warpsize
— Methodwarpsize()::Int32
Returns the warp size (in threads).
Random.seed!
— FunctionRandom.seed!(rng::Philox2x32, seed::Integer, [counter::Integer=0])
Seed the on-device Philox2x32 generator with an UInt32 number. Should be called by at least one thread per warp.
CUDA.@allocated
— Macro@allocated
A macro to evaluate an expression, discarding the resulting value, instead returning the total number of bytes allocated during evaluation of the expression.
CUDA.@atomic
— Macro@atomic a[I] = op(a[I], val)
@atomic a[I] ...= val
Atomically perform a sequence of operations that loads an array element a[I]
, performs the operation op
on that value and a second value val
, and writes the result back to the array. This sequence can be written out as a regular assignment, in which case the same array element should be used in the left and right hand side of the assignment, or as an in-place application of a known operator. In both cases, the array reference should be pure and not induce any side-effects.
This interface is experimental, and might change without warning. Use the lower-level atomic_...!
functions for a stable API, albeit one limited to natively-supported ops.
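The in-place form can be sketched with a small histogram, where several threads may increment the same bin concurrently. The kernel name, the bin data, and the launch configuration are illustrative.

```julia
using CUDA

# Each thread increments the histogram bin selected by its input element.
function hist_kernel(hist, bins)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(bins)
        # Atomic read-modify-write: safe even when threads hit the same bin.
        CUDA.@atomic hist[bins[i]] += Int32(1)
    end
    return nothing
end

bins = CuArray(Int32[1, 2, 2, 3, 3, 3])
hist = CUDA.zeros(Int32, 3)
@cuda threads=32 hist_kernel(hist, bins)
```

With a plain hist[bins[i]] += 1, concurrent updates to the same bin could be lost; the atomic version performs each increment as one indivisible step.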
CUDA.@bprofile
— MacroCUDA.@bprofile [time=1.0] [kwargs...] code...
Benchmark the given code by running it for time
seconds, and report the results using the internal profiler CUDA.@profile
.
The time
keyword argument is optional, and defaults to 1.0
seconds. Other keyword arguments are forwarded to CUDA.@profile
.
See also: CUDA.@profile
.
CUDA.@captured
— Macrofor ...
@captured begin
# code that executes several kernels or CUDA operations
end
end
A convenience macro for recording a graph of CUDA operations and automatically caching and updating the execution. This can improve performance when executing kernels in a loop, where the launch overhead might dominate the execution.
For this to be effective, the kernels and operations executed inside of the captured region should not significantly change across iterations of the loop. It is allowed to, e.g., change kernel arguments or inputs to operations, as this will be processed by updating the cached executable graph. However, significant changes will result in an instantiation of the graph from scratch, which is an expensive operation.
See also: capture
.
CUDA.@cuassert
— Macro@cuassert cond [text]
Signal assertion failure to the CUDA driver if cond
is false
. Preferred syntax for writing assertions, mimicking Base.@assert
. Message text
is optionally displayed upon assertion failure.
A failed assertion will crash the GPU, so use sparingly as a debugging tool. Furthermore, the assertion might be disabled at various optimization levels, and thus should not cause any side-effects.
CUDA.@cuda
— Macro@cuda [kwargs...] func(args...)
High-level interface for executing code on a GPU. The @cuda
macro should prefix a call, with func
a callable function or object that should return nothing. It will be compiled to a CUDA function upon first use, and to a certain extent arguments will be converted and managed automatically using cudaconvert
. Finally, a call to cudacall
is performed, scheduling a kernel launch on the current CUDA context.
Several keyword arguments are supported that influence the behavior of @cuda
.
- launch: whether to launch this kernel, defaults to true. If false, the returned kernel object should be launched by calling it and passing arguments again.
- dynamic: use dynamic parallelism to launch device-side kernels, defaults to false.
- arguments that influence kernel compilation: see cufunction and dynamic_cufunction
- arguments that influence kernel launch: see CUDA.HostKernel and CUDA.DeviceKernel
CUDA.@cuprint
— Macro@cuprint(xs...)
@cuprintln(xs...)
Print a textual representation of values xs
to standard output from the GPU. The functionality builds on @cuprintf
, and is intended as a more user-friendly alternative to that API. However, that also means there's only limited support for argument types, handling 16/32/64 signed and unsigned integers, 32 and 64-bit floating point numbers, Cchar
s and pointers. For more complex output, use @cuprintf
directly.
Limited string interpolation is also possible:
@cuprint("Hello, World ", 42, "\n")
@cuprint "Hello, World $(42)\n"
CUDA.@cuprintf
— Macro@cuprintf("%Fmt", args...)
Print a formatted string in device context on the host standard output.
Note that this is not a fully C-compliant printf
implementation; see the CUDA documentation for supported options and inputs.
Also beware that it is an untyped, and unforgiving printf
implementation. Type widths need to match, e.g. printing a 64-bit Julia integer requires the %ld
formatting string.
CUDA.@cuprintln
— Macro@cuprint(xs...)
@cuprintln(xs...)
Print a textual representation of values xs
to standard output from the GPU. The functionality builds on @cuprintf
, and is intended as a more user-friendly alternative to that API. However, that also means there's only limited support for argument types, handling 16/32/64 signed and unsigned integers, 32 and 64-bit floating point numbers, Cchar
s and pointers. For more complex output, use @cuprintf
directly.
Limited string interpolation is also possible:
@cuprint("Hello, World ", 42, "\n")
@cuprint "Hello, World $(42)\n"
CUDA.@cushow
— Macro@cushow(ex)
GPU analog of Base.@show
. It comes with the same type restrictions as @cuprintf
.
@cushow threadIdx().x
CUDA.@device_code_sass
— Macro@device_code_sass [io::IO=stdout, ...] ex
Evaluates the expression ex
and prints the result of CUDA.code_sass
to io
for every executed CUDA kernel. For other supported keywords, see CUDA.code_sass
.
CUDA.@elapsed
— Macro@elapsed [blocking=false] ex
A macro to evaluate an expression, discarding the resulting value, instead returning the number of seconds it took to execute on the GPU, as a floating-point number.
See also: @sync
.
CUDA.@profile
— Macro@profile [trace=false] [raw=false] code...
@profile external=true code...
Profile the GPU execution of code
.
There are two modes of operation, depending on whether external
is true
or false
. The default value depends on whether Julia is being run under an external profiler.
Integrated profiler (external=false
, the default)
In this mode, CUDA.jl will profile the execution of code
and display the result. By default, a summary of host and device-side execution will be shown, including any NVTX events. To display a chronological trace of the captured activity instead, trace
can be set to true
. Trace output will include an ID column that can be used to match host-side and device-side activity. If raw
is true
, all data will always be included, even if it may not be relevant. The output will be written to io
, which defaults to stdout
.
Slow operations will be highlighted in the output: Entries colored in yellow are among the slowest 25%, while entries colored in red are among the slowest 5% of all operations.
!!! compat "Julia 1.9" This functionality is only available on Julia 1.9 and later.
!!! compat "CUDA 11.2" Older versions of CUDA, before 11.2, contain bugs that may prevent the CUDA.@profile
macro from working. It is recommended to use a newer runtime.
External profilers (external=true
, when an external profiler is detected)
For more advanced profiling, it is possible to use an external profiling tool, such as NSight Systems or NSight Compute. When doing so, it is often advisable to only enable the profiler for the specific code region of interest. This can be done by wrapping the code with CUDA.@profile external=true
, which used to be the only way to use this macro.
CUDA.@sync
— Macro@sync [blocking=false] ex
Run expression ex
and synchronize the GPU afterwards.
The blocking
keyword argument determines how synchronization is performed. By default, non-blocking synchronization will be used, which gives other Julia tasks a chance to run while waiting for the GPU to finish. This may increase latency, so for short operations, or when benchmarking code that does not use multiple tasks, it may be beneficial to use blocking synchronization instead by setting blocking=true
. Blocking synchronization can also be enabled globally by changing the nonblocking_synchronization
preference.
See also: synchronize
.
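Both synchronization modes can be sketched as follows; the array and the broadcasts are arbitrary stand-ins for real GPU work.

```julia
using CUDA

a = CUDA.rand(Float32, 1024)

# Run the broadcast and wait for the GPU to finish before continuing
# (non-blocking synchronization, the default).
CUDA.@sync a .= a .* 2f0

# Blocking synchronization, which may suit short operations or benchmarks.
CUDA.@sync blocking=true a .+= 1f0
```

When timing GPU code with host-side tools, wrapping the measured expression in CUDA.@sync ensures that the asynchronous work has actually completed before the timer stops.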
CUDA.@time
— Macro@time ex
Run expression ex
and report on execution time and GPU/CPU memory behavior. The GPU is synchronized right before and after executing ex
to exclude any external effects.
CUDA.BitonicSortImpl
— ModuleThis is an iterative bitonic sort that mimics a recursive version to support non-power2 lengths.
Credit for the recursive form of this algorithm goes to: https://www.inf.hs-flensburg.de/lang/algorithmen/sortieren/bitonic/oddn.htm
CUDA.jl implementation originally by @xaellison
Overview: comparator_kernel
implements a layer of sorting network comparators generally. The sort could run just by looping over comparator
, but comparator_small_kernel
copies values into shmem and loops over several comparators that don't need to access any values outside the range held in shared memory. It provides a moderate speedup.
Notation: k
, j
denote the level of the sorting network (equivalently, recursion depth). vals
is the array of values of type T
that is either being sort
-ed or sortperm
-ed. inds
is an array of indices of type J
that gets permuted in sortperm!
(standard 1-indexed) i1
, i2
index either vals
or inds
depending on the operation. lo
, n
, and m
are integers of type I
used to denote/calculate ranges as described in the recursive algorithm link above. Note these follow the 0-indexing convention from the above source.
CUDA.BitonicSortImpl.bitonic_sort!
— MethodCall bitonic sort on c
which can be a CuArray of values to sort!
or a tuple of values and an index array for doing sortperm!
. Cannot provide a stable sort!
although sortperm!
is properly stable. To reverse, set rev=true
rather than lt=!isless
(otherwise stability of sortperm breaks down).
CUDA.BitonicSortImpl.block_range
— MethodFor each thread in the block, "re-compute" the range which would have been passed in recursively. This range only depends on the block, and guarantees all threads perform swaps accessible using shmem.
Various negative exit values just for debugging.
CUDA.BitonicSortImpl.comparator_kernel
— MethodPerforms a step of bitonic sort requiring swaps between indices further apart than the size of block allows (eg, 1 <–> 10000)
The grid index directly maps to the index of c
that will be used in the swap.
Note that to avoid synchronization issues, only one thread from each pair of indices being swapped will actually move data.
CUDA.BitonicSortImpl.comparator_small_kernel
— Method
Performs consecutive steps of bitonic sort requiring swaps between indices no further apart than the size of block allows. This effectively moves part of the inner loop (over `j`, below) inside of a kernel to minimize launches and do swaps in shared memory.
Note that the x dimension of a thread block is treated as a comparator, so when the maximum size of a comparator in this kernel is small, multiple may be executed along the block y dimension, allowing for higher occupancy. These threads in a block with the same `threadIdx().x` are a 'pseudo-block', and are indexed by `pseudo_block_idx`.
Unlike `comparator_kernel`, a thread's grid index does not directly map to the index of `c` it will read from. `block_range` gives each pseudo-block a unique range of indices corresponding to a comparator in the sorting network.
Note that this moves the array values copied within shmem, but doesn't copy them back to global the way it does for indices.
CUDA.BitonicSortImpl.finalize_shmem!
— Method
For sortperm/sortperm!, copy shmem view `swap` back to the global index array. `index` is expected to be from a 0-indexing context, but the indices stored in `val_inds` are expected to be 1-indexed.
CUDA.BitonicSortImpl.finalize_shmem!
— Method
For sort/sort!, copy shmem view `swap` back into global array `c`. `index` is expected to be from a 0-indexing context.
CUDA.BitonicSortImpl.get_range
— MethodDetermines parameters for swapping when the grid index directly maps to an Array index for swapping
CUDA.BitonicSortImpl.initialize_shmem!
— Method
For sort/sort! `c`, allocate and return shared memory view of `c`. Each view is indexed along block x dim: one view per pseudo-block. `index` is expected to be from a 0-indexing context.
CUDA.BitonicSortImpl.initialize_shmem!
— Method
For sortperm/sortperm!, allocate and return shared memory views of `c` and index array. Each view is indexed along block x dim: one view per pseudo-block. `index` is expected to be from a 0-indexing context, but the indices stored in `val_inds` are expected to be 1-indexed.
CUDA.Profile.ProfileResults
— TypeProfileResults(...)
The results of a profiling run, as returned by `@profile`. The recommended way to interpret these results is to visualize them using the I/O stack (e.g. by calling `display`, `print`, `string`, ...).
For programmatic access, it is possible to access the fields of this struct. However, the exact format is not guaranteed to be stable, and may change between CUDA.jl releases. Currently, it contains three dataframes:
- `host`, containing host-side activity;
- `device`, containing device-side activity;
- `nvtx`, with information on captured NVTX ranges and events.
See also: `@profile`
CUDA.Profile.start
— Methodstart()
Enables profile collection by the active profiling tool for the current context. If profiling is already enabled, then this call has no effect.
CUDA.Profile.stop
— Methodstop()
Disables profile collection by the active profiling tool for the current context. If profiling is already disabled, then this call has no effect.
CUDA.CUSPARSE.CuSparseMatrix
— Type
Utility union type of `CuSparseMatrixCSC`, `CuSparseMatrixCSR`, `CuSparseMatrixBSR`, `CuSparseMatrixCOO`.
CUDA.CUSPARSE.CSRIterator
— TypeCSRIterator{Ti}(row, args...)
A GPU-compatible iterator for accessing the elements of a single row `row` of several CSR matrices `args` in one go. The row should be in-bounds for every sparse argument. Each iteration returns a 2-element tuple: the current column, and each argument's pointer index (or 0 if that input didn't have an element at that column). The pointers can then be used to access the elements themselves.
For convenience, this iterator can be passed non-sparse arguments as well, which will be ignored (with the returned `col`/`ptr` values set to 0).
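The iteration pattern described above can be illustrated on the CPU. This is a sketch only: `HostCSR` and `merged_row` are hypothetical names, not part of CUDA.jl, and the sketch assumes column indices within each row are stored in ascending order.

```julia
# Hypothetical CPU stand-in for the structure of a CSR matrix.
struct HostCSR
    rowptr::Vector{Int}   # 1-based offsets, length = nrows + 1
    colval::Vector{Int}   # column index of every stored element
end

# Walk one row of several CSR structures at once. Each result is
# (column, pointers), where pointers[k] indexes into mats[k].colval
# (and the matching values array), or is 0 if mats[k] has no entry there.
function merged_row(row::Int, mats::HostCSR...)
    ptrs  = [m.rowptr[row] for m in mats]
    stops = [m.rowptr[row + 1] for m in mats]
    out = Tuple{Int, Vector{Int}}[]
    while true
        # smallest column among the inputs that still have entries in this row
        col = typemax(Int)
        for (k, m) in enumerate(mats)
            if ptrs[k] < stops[k]
                col = min(col, m.colval[ptrs[k]])
            end
        end
        col == typemax(Int) && break
        hit = zeros(Int, length(mats))
        for (k, m) in enumerate(mats)
            if ptrs[k] < stops[k] && m.colval[ptrs[k]] == col
                hit[k] = ptrs[k]
                ptrs[k] += 1
            end
        end
        push!(out, (col, hit))
    end
    return out
end

A = HostCSR([1, 3, 4], [1, 3, 2])   # row 1: columns 1, 3; row 2: column 2
B = HostCSR([1, 3, 3], [2, 3])      # row 1: columns 2, 3; row 2: empty
row1 = merged_row(1, A, B)
```

A zero pointer tells the caller to treat that input as an implicit zero at the current column, which is what makes element-wise operations over differing sparsity patterns possible.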
CUDA.CUSPARSE.CuSparseMatrixBSR
— TypeContainer to hold sparse matrices in block compressed sparse row (BSR) format on the GPU. BSR format is also used in Intel MKL, and is suited to matrices that are "block" sparse - rare blocks of non-sparse regions.
CUDA.CUSPARSE.CuSparseMatrixCOO
— Type
Container to hold sparse matrices in coordinate (COO) format on the GPU. COO format is mainly useful for initially constructing sparse matrices; afterwards, switch to `CuSparseMatrixCSR` for more functionality.
CUDA.CUSPARSE.CuSparseMatrixCSR
— TypeCuSparseMatrixCSR{Tv, Ti} <: AbstractCuSparseMatrix{Tv, Ti}
Container to hold sparse matrices in compressed sparse row (CSR) format on the GPU.
Most CUSPARSE operations work with CSR formatted matrices, rather than CSC.
Support for index types other than `Cint` (`Int32`) requires at least CUDA 11.
CUDA.CUSPARSE.axpby!
— Methodaxpby!(alpha::Number, X::CuSparseVector, beta::Number, Y::CuVector, index::SparseChar)
Computes `alpha * X + beta * Y` for sparse `X` and dense `Y`.
CUDA.CUSPARSE.axpby
— Methodaxpby(alpha::Number, x::CuSparseVector, beta::Number, y::CuSparseVector, index::SparseChar)
Performs `z = alpha * x + beta * y`. `x` and `y` are sparse vectors.
CUDA.CUSPARSE.chkbmmdims
— Method
Check that the dimensions of arrays `B` and `C` make sense for a batched matrix-matrix multiplication.
CUDA.CUSPARSE.chkmmdims
— Method
Check that the dimensions of matrices `B` and `C` make sense for a multiplication.
CUDA.CUSPARSE.chkmvdims
— Method
Check that the dimensions of matrix `X` and vector `Y` make sense for a multiplication.
CUDA.CUSPARSE.color
— Functioncolor(A::CuSparseMatrixCSC, index::SparseChar; percentage::Number=1.0)
color(A::CuSparseMatrixCSR, index::SparseChar; percentage::Number=1.0)
This function performs the coloring of the adjacency graph associated with the matrix A. The coloring is an assignment of colors (integer numbers) to nodes, such that neighboring nodes have distinct colors. An approximate coloring algorithm is used in this routine, and is stopped when a certain percentage of nodes has been colored. The rest of the nodes are assigned distinct colors (an increasing sequence of integers, starting from the last integer used previously). The reordering is such that nodes that have been assigned the same color are reordered to be next to each other.
The matrix A passed to this routine must be stored as a general matrix and have a symmetric sparsity pattern. If the matrix is non-symmetric, the user should pass A + Aᵀ as a parameter to this routine.
CUDA.CUSPARSE.gather!
— Methodgather!(X::CuSparseVector, Y::CuVector, index::SparseChar)
Sets the nonzero elements of `X` equal to the nonzero elements of `Y` at the same indices.
CUDA.CUSPARSE.geam
— Methodgeam(alpha::Number, A::CuSparseMatrix, beta::Number, B::CuSparseMatrix, index::SparseChar)
Performs `C = alpha * A + beta * B`. `A` and `B` are sparse matrices defined in CSR or CSC storage formats.
CUDA.CUSPARSE.gtsv2!
— Functiongtsv2!(dl::CuVector, d::CuVector, du::CuVector, B::CuVecOrMat, index::SparseChar='O'; pivoting::Bool=true)
Solve the linear system `A * X = B` where `A` is a tridiagonal matrix defined by three vectors corresponding to its lower (`dl`), main (`d`), and upper (`du`) diagonals. With `pivoting`, the solution is more accurate but also more expensive. Note that the solution `X` overwrites the right-hand side `B`.
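As a CPU reference for the system being solved, the same problem can be written with `LinearAlgebra.Tridiagonal`. This is a sketch for checking results, not the GPU routine: `Tridiagonal` expects off-diagonals of length n - 1, whereas the vector lengths expected by `gtsv2!` are not stated here and may differ.

```julia
using LinearAlgebra

dl = [1.0, 1.0]        # lower diagonal (length n - 1 for `Tridiagonal`)
d  = [4.0, 4.0, 4.0]   # main diagonal
du = [1.0, 1.0]        # upper diagonal
B  = [5.0, 6.0, 5.0]   # right-hand side

A = Tridiagonal(dl, d, du)
X = A \ B              # out-of-place here; the GPU routine overwrites B instead
```

Comparing `A * X` against `B` is a cheap way to validate a solver result of this kind.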
CUDA.CUSPARSE.ic02!
— Functionic02!(A::CuSparseMatrix, index::SparseChar='O')
Incomplete Cholesky factorization with no pivoting. Preserves the sparse layout of matrix `A`.
CUDA.CUSPARSE.ilu02!
— Functionilu02!(A::CuSparseMatrix, index::SparseChar='O')
Incomplete LU factorization with no pivoting. Preserves the sparse layout of matrix `A`.
CUDA.CUSPARSE.mm!
— Methodmm!(transa::SparseChar, transb::SparseChar, alpha::Number, A::CuSparseMatrix, B::CuMatrix, beta::Number, C::CuMatrix, index::SparseChar)
Performs `C = alpha * op(A) * op(B) + beta * C`, where `op` can be nothing (`transa = N`), transpose (`transa = T`) or conjugate transpose (`transa = C`). `B` and `C` are dense matrices.
CUDA.CUSPARSE.mv!
— Methodmv!(transa::SparseChar, alpha::Number, A::CuSparseMatrix, X::CuVector, beta::Number, Y::CuVector, index::SparseChar)
Performs `Y = alpha * op(A) * X + beta * Y`, where `op` can be nothing (`transa = N`), transpose (`transa = T`) or conjugate transpose (`transa = C`). `X` and `Y` are dense vectors.
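The update formula can be checked against a CPU reference. The sketch below uses the five-argument `mul!` from `LinearAlgebra` with a `SparseArrays` matrix; it is an assumed stand-in for validating results and does not call CUSPARSE.

```julia
using LinearAlgebra, SparseArrays

A = sparse([1, 2, 2], [1, 1, 2], [2.0, 1.0, 3.0], 2, 2)   # [2 0; 1 3]
X = [1.0, 1.0]
alpha, beta = 2.0, 0.5

Y_n = [1.0, 1.0]
mul!(Y_n, A, X, alpha, beta)              # Y = alpha * A * X + beta * Y

Y_t = [1.0, 1.0]
mul!(Y_t, transpose(A), X, alpha, beta)   # Y = alpha * transpose(A) * X + beta * Y
```

Wrapping the matrix in `transpose` (or `adjoint`) plays the role that the `transa` character plays in the signature above.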
CUDA.CUSPARSE.rot!
— Methodrot!(X::CuSparseVector, Y::CuVector, c::Number, s::Number, index::SparseChar)
Applies the Givens rotation specified by `c` and `s` to sparse `X` and dense `Y`.
CUDA.CUSPARSE.scatter!
— Methodscatter!(Y::CuVector, X::CuSparseVector, index::SparseChar)
Set `Y[:] = X[:]` for dense `Y` and sparse `X`.
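On the CPU, the scatter operation and its counterpart `gather!` can be sketched with `SparseArrays`. This is an illustration under the assumption that only the stored entries of `X` are read or written; `SparseArrays.nonzeroinds` is an unexported helper and is therefore qualified.

```julia
using SparseArrays

X = sparsevec([2, 4], [10.0, 20.0], 5)   # stored entries at indices 2 and 4
Y = zeros(5)

# scatter: write the stored values of X into the matching positions of Y
Y[SparseArrays.nonzeroinds(X)] .= nonzeros(X)

# gather: read those positions of a dense vector back into X's stored values
Z = [1.0, 2.0, 3.0, 4.0, 5.0]
nonzeros(X) .= Z[SparseArrays.nonzeroinds(X)]
```

Both directions touch only `length(nonzeros(X))` elements, so the cost scales with the number of stored entries rather than the dense length.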
CUDA.CUSPARSE.sm2!
— Methodsm2!(transa::SparseChar, transxy::SparseChar, uplo::SparseChar, diag::SparseChar, alpha::BlasFloat, A::CuSparseMatrix, X::CuMatrix, index::SparseChar)
Performs `X = alpha * op(A) \ op(X)`, where `op` can be nothing (`transa = N`), transpose (`transa = T`) or conjugate transpose (`transa = C`). `X` is a dense matrix, and `uplo` tells `sm2!` which triangle of the block sparse matrix `A` to reference. If the triangle has unit diagonal, set `diag` to 'U'.
CUDA.CUSPARSE.sv2!
— Methodsv2!(transa::SparseChar, uplo::SparseChar, diag::SparseChar, alpha::BlasFloat, A::CuSparseMatrix, X::CuVector, index::SparseChar)
Performs `X = alpha * op(A) \ X`, where `op` can be nothing (`transa = N`), transpose (`transa = T`) or conjugate transpose (`transa = C`). `X` is a dense vector, and `uplo` tells `sv2!` which triangle of the block sparse matrix `A` to reference. If the triangle has unit diagonal, set `diag` to 'U'.
SparseArrays.sparse
— Methodsparse(x::DenseCuMatrix; fmt=:csc)
sparse(I::CuVector, J::CuVector, V::CuVector, [m, n]; fmt=:csc)
Return a sparse CUDA matrix, with type determined by `fmt`. Possible formats are :csc, :csr, :bsr, and :coo.
CUDA.QuickSortImpl
— Module
The main quicksort kernel uses dynamic parallelism. Let's call blocksize `M`. The first part of the kernel bubble sorts `M` elements with maximal stride between `lo` and `hi`. If the sublist is <= `M` elements, `stride` = 1 and no recursion happens. Otherwise, we pick element `lo + M ÷ 2 * stride` as a pivot. This is an efficient choice for random lists and pre-sorted lists.
Partition is done in stages:
- For batches of M values, cumsum how many > pivot are left of each index. The comparison alternates between < and <= with recursion depth. This makes no difference when there are many unique values, but when there are many duplicates, this effectively partitions into <, =, and >.
- Consolidate batches. This runs inside the quicksort kernel.
Sublists (ranges of the list being sorted) are denoted by `lo` and one of `L` and `hi`. `lo` is an exclusive lower bound, `hi` is an inclusive upper bound, and `L` is their difference. `b_sums` is "batch sums", the number of values in a batch which are >= pivot or > pivot depending on the relevant parity.
Originally developed by @xaellison (Alex Ellison).
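The effect of alternating the comparison between < and <= can be seen in a sequential sketch. This is not the GPU implementation: `goes_left` and `parity_quicksort!` are hypothetical names, the pivot is simply the middle element, and the `stuck` counter here is a simplified assumption that stops recursion after two consecutive attempts fail to split a region. `lo` is an exclusive lower bound and `hi` an inclusive upper bound, as in the text.

```julia
# Hypothetical helper: does `x` belong to the left part for this parity?
goes_left(x, pivot, parity::Bool) = parity ? x <= pivot : x < pivot

function parity_quicksort!(v::AbstractVector, lo::Int=0, hi::Int=length(v),
                           parity::Bool=false, stuck::Int=0)
    L = hi - lo
    L <= 1 && return v
    pivot = v[lo + (L + 1) ÷ 2]
    region = v[lo+1:hi]
    left  = filter(x -> goes_left(x, pivot, parity), region)
    right = filter(x -> !goes_left(x, pivot, parity), region)
    v[lo+1:hi] = vcat(left, right)
    mid = lo + length(left)
    if mid == lo || mid == hi
        # The region did not split. If that also happened on the previous
        # attempt (with the other comparison), every value equals the pivot.
        stuck += 1
        stuck == 2 && return v
    else
        stuck = 0
    end
    parity_quicksort!(v, lo, mid, !parity, stuck)   # comparison flips each level
    parity_quicksort!(v, mid, hi, !parity, stuck)
    return v
end
```

With many duplicates, one level moves values < pivot to the left and the next level moves values <= pivot to the left, so runs of equal values end up isolated between them.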
CUDA.QuickSortImpl.batch_partition
— Method
Partition the region of `values` after index `lo` up to (inclusive) `hi` with respect to `pivot`. Computes each value's comparison to pivot, performs a cumsum of those comparisons, and performs one movement using shmem. Comparison is affected by `parity`. See `flex_lt`. `swap` is an array for exchanging values and `sums` is an array of Ints used during the merge sort. Uses block y index to decide which values to operate on.
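The compare, cumsum, move sequence can be sketched sequentially. The function `cumsum_partition` below is hypothetical and fixes the comparison to `>`; it only illustrates how a prefix sum of comparison flags yields each value's destination in a single movement.

```julia
# Sequential sketch of a cumsum-based partition (hypothetical name).
function cumsum_partition(vals::AbstractVector, pivot)
    flags = [x > pivot ? 1 : 0 for x in vals]
    sums = cumsum(flags)            # sums[i] = number of values > pivot in vals[1:i]
    n_small = length(vals) - (isempty(sums) ? 0 : sums[end])
    out = similar(vals)
    for (i, x) in enumerate(vals)
        # values <= pivot keep their order at the front, the rest follow
        dest = flags[i] == 1 ? n_small + sums[i] : i - sums[i]
        out[dest] = x
    end
    return out, n_small
end

parts, n_small = cumsum_partition([5, 1, 4, 2, 3], 3)
```

Because every destination depends only on the prefix sums, all moves are independent of each other once the cumsum is known, which is what makes the scheme suitable for one thread per value.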
CUDA.QuickSortImpl.bitonic_median
— Method
Finds the median of `vals` starting after `lo` and going for `blockDim().x` elements spaced by `stride`. Performs bitonic sort in shmem, returns middle value. Faster than bubble sort, but not as flexible. Does not modify `vals`.
CUDA.QuickSortImpl.bubble_sort
— Method
Performs bubble sort on `vals` starting after `lo` and going for `min(L, blockDim().x)` elements spaced by `stride`. Good for sampling pivot values as well as short sorts.
CUDA.QuickSortImpl.call_batch_partition
— MethodPartition batches in a loop using a single block
CUDA.QuickSortImpl.call_batch_partition
— MethodLaunch batch partition kernel and sync
CUDA.QuickSortImpl.consolidate_batch_partition
— Method
This assumes the region of `vals` of length `L` starting after `lo` has been batch partitioned with respect to `pivot`. Further, it assumes that these batches are of size `blockDim().x`.
Using 1 step per batch, consolidate these partitioned batches such that the region is fully partitioned. Each step moves at most `blockDim().x` values.
`b_sums`: either shared memory or a global array which serves as scratch space for storing the partition of each batch.
`parity`: see top docstring
Must only run on 1 SM.
CUDA.QuickSortImpl.cumsum!
— MethodPerforms in-place cumsum using shared memory. Intended for use with indexes
CUDA.QuickSortImpl.find_partition
— Method
Finds the index in `array` of the last value <= `pivot` if `parity` = true or the last value < `pivot` if `parity` = false. Searches after index `lo` up to (inclusive) index `hi`.
CUDA.QuickSortImpl.partial_range_overlap
— Method
Quicksort recursion condition. If the domain to sort `lo` to `hi` overlaps with `partial`, then we should do recursion on it, and this returns true (if not, then false).
CUDA.QuickSortImpl.partial_range_overlap
— Method
Quicksort recursion condition. For a full sort, `partial` is nothing so it shouldn't affect whether recursion happens.
CUDA.QuickSortImpl.partition_batches_kernel
— Method
Each block evaluates `batch_partition` on consecutive regions of length `blockDim().x` from `lo` to `hi` of `values`.
CUDA.QuickSortImpl.qsort_kernel
— Method
Perform quicksort on dimension `dims` of `vals` for the region with `lo` as an exclusive floor and `hi` as an inclusive ceiling. `parity` is a boolean which says whether to partition by < or <= with respect to the pivot. `sync_depth` is how many (more) levels of recursion with `qsort_kernel` can be done before reaching `cudaLimitDevRuntimeSyncDepth`. From the host, this value must not exceed that limit.
`sync` and enclosed type `S` determine how partition occurs: If `sync` is `true`, the kernel partitions batches in a child kernel, synchronizes, and then consolidates the batches. The benefit of this kernel is that it distributes the work of partitioning batches across multiple SMs. If `sync` is `false`, the kernel partitions without launching any child kernels, then has recursive `qsort_kernel` children for left and right partitions. `device_synchronize` is never called from this kernel, so there is no practical limit on recursion.
To detect the scenario of all values in the region being the same, we have two args: `prev_pivot` and `stuck`. If two consecutive partitions have the same pivot and both failed to split the region in two, that means all the values are equal. `stuck` is incremented when the pivot hasn't changed and partition = `lo` or `hi`. If `stuck` reaches 2, recursion ends. `stuck` is initialized at -1 because `prev_pivot` must be initialized to some value, and it's possible that the first pivot will be that value, which could lead to an incorrectly early end to recursion if we started `stuck` at 0.
CUDA.APIUtils.LazyInitialized
— TypeLazyInitialized{T}()
A thread-safe, lazily-initialized wrapper for a value of type `T`. Initialize and fetch the value by calling `get!`. The constructor is ensured to only be called once.
This type is intended for lazy initialization of e.g. global structures, without using `__init__`. It is similar to protecting accesses using a lock, but is much cheaper.
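For comparison, the lock-based pattern that the text contrasts this type with can be sketched as follows. `LazyBox` and `fetch_or_init!` are hypothetical names for illustration; this is not how `LazyInitialized` is implemented, which the text describes as cheaper than a lock.

```julia
# Hypothetical lock-protected lazy container, shown only as a baseline.
mutable struct LazyBox{T}
    lock::ReentrantLock
    value::Union{Nothing, Some{T}}
    LazyBox{T}() where {T} = new{T}(ReentrantLock(), nothing)
end

# Run `f` at most once, then return the stored value on every call.
function fetch_or_init!(f, box::LazyBox{T}) where {T}
    lock(box.lock) do
        if box.value === nothing
            box.value = Some{T}(f())
        end
        something(box.value)
    end
end

calls = Ref(0)
box = LazyBox{Int}()
a = fetch_or_init!(box) do
    calls[] += 1      # counts how often the initializer runs
    42
end
b = fetch_or_init!(box) do
    calls[] += 1
    0
end
```

Every access in this baseline takes the lock, even after initialization, which is the cost a cheaper scheme would aim to avoid.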
CUDA.APIUtils.with_workspace
— Methodwith_workspace([cache], bytesize) do workspace
...
end
Create a GPU workspace vector with size `bytesize` (either a number, or a callable function), and pass it to the do block. Afterwards, the buffer is freed. If you instead want to cache the workspace, pass any previous instance as the first argument, which will result in it getting resized instead.
This helper protects against the rare but real issue of the workspace size getter returning different results based on the GPU device memory pressure, which might change after initial allocation of the workspace (which can cause a GC collection).
See also: `with_workspaces`, if you need both a GPU and CPU workspace.
CUDA.APIUtils.with_workspaces
— Methodwith_workspaces([cache_gpu], [cache_cpu], size_gpu, size_cpu) do workspace_gpu, workspace_cpu
...
end
Create GPU and CPU workspace vectors with sizes `size_gpu` and `size_cpu` (each either a number, or a callable function), and pass them to the do block. Afterwards, the buffers are freed. If you instead want to cache the workspaces, pass any previous instances as the first arguments, which will result in them getting resized instead.
This helper protects against the rare but real issue of the workspace size getters returning different results based on the memory pressure, which might change after initial allocation of the workspace (which can cause a GC collection).
See also: `with_workspace`, if you only need a GPU workspace.
CUDA.APIUtils.@checked
— Macro@checked function foo(...)
rv = ...
return rv
end
Macro for wrapping a function definition returning a status code. Two versions of the function will be generated: `foo`, with the function execution wrapped by an invocation of the `check` function (to be implemented by the caller of this macro), and `unchecked_foo` where no such invocation is present and the status code is returned to the caller.
CUDA.APIUtils.@gcsafe_ccall
— Macro@gcsafe_ccall ...
Call a foreign function just like `@ccall`, but marking it safe for the GC to run. This is useful for functions that may block, so that the GC isn't blocked from running, but may also be required to prevent deadlocks (see JuliaGPU/CUDA.jl#2261).
Note that this is generally only safe with non-Julia C functions that do not call back into Julia. When using callbacks, the code should make sure to transition back into GC-unsafe mode using the `@gcunsafe_callback` macro.
CUDA.APIUtils.@gcunsafe_callback
— Macro@gcunsafe_callback function callback(...)
...
end
Mark a callback function as unsafe for the GC to run. This is normally the default for Julia code, and is meant to be used in combination with `@gcsafe_ccall`.
CUDA.APIUtils.@memoize
— Macro@memoize [key::T] [maxlen=...] begin
# expensive computation
end::T
Low-level, no-frills memoization macro that stores values in a thread-local, typed cache. The types of the caches are derived from the syntactical type assertions.
The cache consists of two levels, the outer one indexed with the thread index. If no `key` is specified, the second level of the cache is dropped.
If the `maxlen` option is specified, the `key` is assumed to be an integer, and the secondary cache will be a vector with length `maxlen`. Otherwise, a dictionary is used.
CUDA.CUPTI.ActivityConfig
— Typecfg = CUPTI.ActivityConfig(activity_kinds)
CUPTI.enable!(cfg) do
# do stuff
end
CUPTI.process(cfg) do ctx, stream_id, record
# inspect record
end
High-level interface to the CUPTI activity API.
CUDA.CUPTI.CallbackConfig
— Typecfg = CUPTI.CallbackConfig(callback_kinds) do domain, id, data
# inspect data
end
CUPTI.enable!(cfg) do
# do stuff
end
CUDA.CG
— ModuleCUDA.jl's cooperative groups implementation.
Cooperative groups in CUDA offer a structured approach to synchronize and communicate among threads. They allow developers to define specific groups of threads, providing a means to fine-tune inter-thread communication granularity. By offering a more nuanced alternative to traditional CUDA synchronization methods, cooperative groups enable a more controlled and efficient parallel decomposition in kernel design.
The following functionality is available in CUDA.jl:
- implicit groups: thread blocks, grid groups, and coalesced groups.
- synchronization: `sync`, `barrier_arrive`, `barrier_wait`
- warp collectives for coalesced groups: shuffle and voting
- data transfer: `memcpy_async`, `wait` and `wait_prior`
Noteworthy missing functionality:
- implicit groups: clusters, and multi-grid groups (which are deprecated)
- explicit groups: tiling and partitioning
CUDA.CG.coalesced_group
— Typecoalesced_group <: thread_group
A group representing the current set of converged threads in a warp. The size of the group is not guaranteed and it may return a group of only one thread (itself).
This group exposes warp-synchronous builtins. Constructed via `coalesced_threads`.
CUDA.CG.grid_group
— Typegrid_group <: thread_group
Threads within this group are guaranteed to be co-resident on the same device within the same launched kernel. To use this group, the kernel must have been launched with `@cuda cooperative=true`, and the device must support it (queryable device attribute).
Constructed via `this_grid`.
CUDA.CG.thread_block
— Typethread_block <: thread_group
Every GPU kernel is executed by a grid of thread blocks, and threads within each block are guaranteed to reside on the same streaming multiprocessor. A `thread_block` represents a thread block whose dimensions are not known until runtime.
Constructed via `this_thread_block`.
CUDA.CG.barrier_arrive
— Functionbarrier_arrive(group)
Arrive on the barrier, returns a token that needs to be passed into `barrier_wait`.
CUDA.CG.barrier_wait
— Functionbarrier_wait(group, token)
Wait on the barrier, takes arrival token returned from `barrier_arrive`.
CUDA.CG.block_index
— Methodblock_index(gg::grid_group)
3-Dimensional index of the block within the launched grid.
CUDA.CG.block_rank
— Methodblock_rank(gg::grid_group)
Rank of the calling block within [0, num_blocks)
CUDA.CG.coalesced_threads
— Methodcoalesced_threads()
Constructs a `coalesced_group`.
CUDA.CG.dim_blocks
— Methoddim_blocks(gg::grid_group)
Dimensions of the launched grid in units of blocks.
CUDA.CG.dim_threads
— Methoddim_threads(tb::thread_block)
Dimensions of the launched block in units of threads.
CUDA.CG.group_index
— Methodgroup_index(tb::thread_block)
3-Dimensional index of the block within the launched grid.
CUDA.CG.is_valid
— Methodis_valid(gg::grid_group)
Returns whether the grid_group can synchronize
CUDA.CG.memcpy_async
— Functionmemcpy_async(group, dst, src, bytes)
Perform a group-wide collective memory copy from `src` to `dst` of `bytes` bytes. This operation may be performed asynchronously, so you should `wait` or `wait_prior` before using the data. It is only supported by thread blocks and coalesced groups.
For this operation to be performed asynchronously, the following conditions must be met:
- the source and destination memory should be aligned to 4, 8 or 16 bytes. This will be deduced from the datatype, but can also be specified explicitly using `CUDA.align`.
- the source should be global memory, and the destination should be shared memory.
- the device should have compute capability 8.0 or higher.
CUDA.CG.meta_group_rank
— Methodmeta_group_rank(cg::coalesced_group)
Rank of this group in the upper level of the hierarchy.
CUDA.CG.meta_group_size
— Methodmeta_group_size(cg::coalesced_group)
Total number of partitions created out of all CTAs when the group was created.
CUDA.CG.num_blocks
— Methodnum_blocks(gg::grid_group)
Total number of blocks in the group.
CUDA.CG.num_threads
— Functionnum_threads(group)
Returns the total number of threads in the group.
CUDA.CG.sync
— Functionsync(group)
Synchronize the threads named in the group, equivalent to calling `barrier_arrive` and `barrier_wait` in sequence.
CUDA.CG.this_grid
— Methodthis_grid()
Constructs a `grid_group`.
CUDA.CG.this_thread_block
— Methodthis_thread_block()
Constructs a `thread_block` group.
CUDA.CG.thread_index
— Methodthread_index(tb::thread_block)
3-Dimensional index of the thread within the launched block.
CUDA.CG.thread_rank
— Functionthread_rank(group)
Returns the linearized rank of the calling thread along the interval `[1, num_threads()]`.
CUDA.CG.wait
— Methodwait(group)
Make all threads in this group wait for all previously submitted `memcpy_async` operations to complete.
CUDA.CG.wait_prior
— Methodwait_prior(group, stage)
Make all threads in this group wait for all but `stage` previously submitted `memcpy_async` operations to complete.
CUDA.WMMA.ColMajor
— TypeWMMA.ColMajor
Type that represents a matrix stored in column major (Julia style) order.
CUDA.WMMA.Config
— TypeWMMA.Config{M, N, K, d_type}
Type that contains all information for WMMA operations that cannot be inferred from the argument's types.
WMMA instructions calculate the matrix multiply-accumulate operation $D = A \cdot B + C$, where $A$ is a $M \times K$ matrix, $B$ a $K \times N$ matrix, and $C$ and $D$ are $M \times N$ matrices.
`d_type` refers to the type of the elements of matrix $D$, and can be either `Float16` or `Float32`.
All WMMA operations take a `Config` as their final argument.
Examples
julia> config = WMMA.Config{16, 16, 16, Float32}
CUDA.WMMA.Config{16, 16, 16, Float32}
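For checking results, the multiply-accumulate that such a configuration describes can be written as a CPU reference. This sketch uses plain arrays and no WMMA calls; the input values are arbitrary, and accumulating in `Float32` is chosen to match `d_type = Float32` in the example above.

```julia
M, N, K = 16, 16, 16
A = fill(Float16(0.5), M, K)   # M × K input
B = fill(Float16(2), K, N)     # K × N input
C = ones(Float32, M, N)        # M × N accumulator input

# D = A * B + C, with the product accumulated in Float32
D = Float32.(A) * Float32.(B) .+ C
```

Each entry of `D` sums `K` products of 0.5 and 2 and then adds the matching entry of `C`, which makes the expected value easy to verify by hand.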
CUDA.WMMA.Fragment
— TypeWMMA.Fragment
Type that represents per-thread intermediate results of WMMA operations.
You can access individual elements using the `x` member or `[]` operator, but beware that the exact ordering of elements is unspecified.
CUDA.WMMA.FragmentLayout
— TypeWMMA.FragmentLayout
Abstract type that specifies the storage layout of a matrix.
Possible values are `WMMA.RowMajor`, `WMMA.ColMajor` and `WMMA.Unspecified`.
CUDA.WMMA.RowMajor
— TypeWMMA.RowMajor
Type that represents a matrix stored in row major (C style) order.
CUDA.WMMA.Unspecified
— TypeWMMA.Unspecified
Type that represents a matrix stored in an unspecified order.
This storage format is not valid for all WMMA operations!
CUDA.WMMA.fill_c
— FunctionWMMA.fill_c(value, config)
Return a `WMMA.Fragment` filled with the value `value`.
This operation is useful if you want to implement a matrix multiplication (and thus want to set $C = O$).
Arguments
- `value`: The value used to fill the fragment. Can be a `Float16` or `Float32`.
- `config`: The WMMA configuration that should be used for this WMMA operation. See `WMMA.Config`.
CUDA.WMMA.llvm_wmma_load
— MethodWMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic `@llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}`.
Arguments
- `src_addr`: The memory address to load from.
- `stride`: The leading dimension of the matrix, in numbers of elements.
Placeholders
- `{matrix}`: The matrix to load. Can be `a`, `b` or `c`.
- `{layout}`: The storage layout for the matrix. Can be `row` or `col`, for row major (C style) or column major (Julia style), respectively.
- `{shape}`: The overall shape of the MAC operation. Valid values are `m16n16k16`, `m32n8k16`, and `m8n32k16`.
- `{addr_space}`: The address space of `src_addr`. Can be empty (generic addressing), `shared` or `global`.
- `{elem_type}`: The type of each element in the matrix. For `a` and `b` matrices, valid values are `u8` (byte unsigned integer), `s8` (byte signed integer), and `f16` (half precision floating point). For `c` and `d` matrices, valid values are `s32` (32-bit signed integer), `f16` (half precision floating point), and `f32` (full precision floating point).
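How the placeholders combine into a concrete wrapper name can be sketched with a small helper. `wmma_load_name` is hypothetical and only assembles a string; it assumes, based on the function names listed in this reference, that an empty `{addr_space}` is simply omitted from the name.

```julia
# Hypothetical helper: build a wrapper name from the placeholder values.
function wmma_load_name(matrix::String, layout::String, shape::String,
                        addr_space::String, elem_type::String)
    parts = ["llvm_wmma_load", matrix, layout, shape]
    isempty(addr_space) || push!(parts, addr_space)   # generic addressing: no segment
    push!(parts, "stride", elem_type)
    return join(parts, "_")
end

global_name  = wmma_load_name("a", "col", "m16n16k16", "global", "f16")
generic_name = wmma_load_name("a", "col", "m16n16k16", "", "f16")
```

The helper does not validate combinations, so it will also produce names for element types that are not valid for a given matrix.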
CUDA.WMMA.llvm_wmma_load_a_col_m16n16k16_global_stride_f16
— FunctionWMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
CUDA.WMMA.llvm_wmma_load_a_col_m16n16k16_global_stride_s8
— FunctionWMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
CUDA.WMMA.llvm_wmma_load_a_col_m16n16k16_global_stride_u8
— FunctionWMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
CUDA.WMMA.llvm_wmma_load_a_col_m16n16k16_shared_stride_f16
— FunctionWMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
CUDA.WMMA.llvm_wmma_load_a_col_m16n16k16_shared_stride_s8
— FunctionWMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
CUDA.WMMA.llvm_wmma_load_a_col_m16n16k16_shared_stride_u8
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_a_col_m16n16k16_stride_f16
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_a_col_m16n16k16_stride_s8
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_a_col_m16n16k16_stride_u8
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_a_col_m32n8k16_global_stride_f16
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_a_col_m32n8k16_global_stride_s8
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_a_col_m32n8k16_global_stride_u8
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_a_col_m32n8k16_shared_stride_f16
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_a_col_m32n8k16_shared_stride_s8
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_a_col_m32n8k16_shared_stride_u8
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_a_col_m32n8k16_stride_f16
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_a_col_m32n8k16_stride_s8
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_a_col_m32n8k16_stride_u8
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_a_col_m8n32k16_global_stride_f16
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_a_col_m8n32k16_global_stride_s8
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_a_col_m8n32k16_global_stride_u8
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_a_col_m8n32k16_shared_stride_f16
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_a_col_m8n32k16_shared_stride_s8
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_a_col_m8n32k16_shared_stride_u8
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_a_col_m8n32k16_stride_f16
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_a_col_m8n32k16_stride_s8
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_a_col_m8n32k16_stride_u8
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_a_row_m16n16k16_global_stride_f16
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_a_row_m16n16k16_global_stride_s8
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_a_row_m16n16k16_global_stride_u8
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_a_row_m16n16k16_shared_stride_f16
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_a_row_m16n16k16_shared_stride_s8
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_a_row_m16n16k16_shared_stride_u8
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_a_row_m16n16k16_stride_f16
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_a_row_m16n16k16_stride_s8
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_a_row_m16n16k16_stride_u8
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_a_row_m32n8k16_global_stride_f16
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
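To make the {shape} placeholder concrete, the sketch below splits a shape string into its M, N and K sizes and derives the per-matrix dimensions. It relies on the usual matrix multiply-accumulate convention D = A * B + C, where A is M×K, B is K×N, and C and D are M×N; `wmma_shape` is a hypothetical helper, not a CUDA.jl function.

```julia
# Hypothetical helper (not part of CUDA.jl): decode a shape string such as
# "m32n8k16" into the dimensions of each matrix in D = A * B + C.
function wmma_shape(shape::AbstractString)
    mt = match(r"^m(\d+)n(\d+)k(\d+)$", shape)
    mt === nothing && throw(ArgumentError("not a WMMA shape: $shape"))
    M, N, K = parse.(Int, mt.captures)
    # Assumed convention: A is M×K, B is K×N, C and D are M×N.
    return (a = (M, K), b = (K, N), c = (M, N), d = (M, N))
end

dims = wmma_shape("m32n8k16")
println("a: ", dims.a, ", b: ", dims.b, ", c: ", dims.c)
```

Under that convention, the a matrix loaded by an m32n8k16 wrapper is a 32×16 tile, while the b matrix is 16×8.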
CUDA.WMMA.llvm_wmma_load_a_row_m32n8k16_global_stride_s8
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
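The {elem_type} rule above can be written down as a small lookup. The following is a minimal sketch that encodes only what the placeholder description states; `is_valid_elem_type` and the two constant tuples are hypothetical names introduced for illustration.

```julia
# Element types listed in the {elem_type} description above.
const AB_ELEM_TYPES = ("u8", "s8", "f16")    # for a and b matrices
const CD_ELEM_TYPES = ("s32", "f16", "f32")  # for c and d matrices

# Hypothetical helper (not part of CUDA.jl): check a matrix/element type pair.
function is_valid_elem_type(matrix::AbstractString, elem_type::AbstractString)
    if matrix in ("a", "b")
        return elem_type in AB_ELEM_TYPES
    elseif matrix in ("c", "d")
        return elem_type in CD_ELEM_TYPES
    else
        return false
    end
end

println(is_valid_elem_type("a", "s8"))
println(is_valid_elem_type("a", "f32"))
```

The check covers the element type of a single matrix only; it says nothing about which element types can be combined across a, b, c and d in one operation.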
CUDA.WMMA.llvm_wmma_load_a_row_m32n8k16_global_stride_u8
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_a_row_m32n8k16_shared_stride_f16
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_a_row_m32n8k16_shared_stride_s8
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_a_row_m32n8k16_shared_stride_u8
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_a_row_m32n8k16_stride_f16
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_a_row_m32n8k16_stride_s8
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_a_row_m32n8k16_stride_u8
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_a_row_m8n32k16_global_stride_f16
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_a_row_m8n32k16_global_stride_s8
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_a_row_m8n32k16_global_stride_u8
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_a_row_m8n32k16_shared_stride_f16
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_a_row_m8n32k16_shared_stride_s8
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_a_row_m8n32k16_shared_stride_u8
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_a_row_m8n32k16_stride_f16
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_a_row_m8n32k16_stride_s8
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_a_row_m8n32k16_stride_u8
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_b_col_m16n16k16_global_stride_f16
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
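The interaction between stride and {layout} can be illustrated with plain index arithmetic. The sketch below assumes the conventional meaning of a leading dimension: stride is the element distance between the starts of consecutive columns (col layout) or consecutive rows (row layout). `element_offset` is a hypothetical helper; it describes memory addressing only, not how the loaded elements are distributed over the threads of a warp.

```julia
# Hypothetical helper (not part of CUDA.jl): zero-based offset, in elements,
# of tile entry (i, j) relative to src_addr. Both i and j are zero-based.
function element_offset(layout::AbstractString, i::Integer, j::Integer, stride::Integer)
    if layout == "col"
        return i + j * stride   # column major: one column is contiguous in memory
    elseif layout == "row"
        return j + i * stride   # row major: one row is contiguous in memory
    else
        throw(ArgumentError("layout must be \"row\" or \"col\""))
    end
end

# A 16×16 tile taken from a parent matrix whose leading dimension is 64.
println(element_offset("col", 1, 2, 64))
println(element_offset("row", 1, 2, 64))
```

When the tile is the whole matrix, stride equals the tile's own leading dimension (for example 16 for a column major 16×16 matrix); a larger stride lets the same wrapper read a tile embedded in a larger parent matrix.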
CUDA.WMMA.llvm_wmma_load_b_col_m16n16k16_global_stride_s8
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_b_col_m16n16k16_global_stride_u8
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_b_col_m16n16k16_shared_stride_f16
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_b_col_m16n16k16_shared_stride_s8
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_b_col_m16n16k16_shared_stride_u8
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_b_col_m16n16k16_stride_f16
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_b_col_m16n16k16_stride_s8
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_b_col_m16n16k16_stride_u8
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
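The stride argument is a leading dimension counted in elements, not bytes. The sketch below shows what that means for addressing, using plain host-side Julia and no GPU. `element_offset` is a hypothetical helper introduced here for illustration; it assumes one-based (i, j) indices and returns a zero-based offset relative to src_addr.

```julia
# Hypothetical helper: zero-based element offset of entry (i, j) in a matrix
# whose leading dimension is `stride` elements.
function element_offset(i::Integer, j::Integer, stride::Integer; layout::Symbol = :col)
    if layout === :col
        return (j - 1) * stride + (i - 1)   # consecutive rows of a column are adjacent
    elseif layout === :row
        return (i - 1) * stride + (j - 1)   # consecutive columns of a row are adjacent
    else
        throw(ArgumentError("layout must be :col or :row"))
    end
end

# A 16x16 tile taken from a 64x64 column-major parent matrix keeps stride = 64.
println(element_offset(1, 2, 64))                 # start of the tile's second column
println(element_offset(2, 1, 64; layout = :row))  # start of the second row, row major
```

When a tile is a view into a larger matrix, the stride stays that of the parent matrix, which is why it is passed separately from the tile shape.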
CUDA.WMMA.llvm_wmma_load_b_col_m32n8k16_global_stride_f16
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
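The {shape} placeholder encodes three sizes. As a rough sketch, the helper below parses a shape string and returns the dimensions of one operand. `fragment_dims` is a hypothetical name, and the code assumes the usual matrix multiply-accumulate convention D = A * B + C, with A of size M×K, B of size K×N, and C and D of size M×N; the text above does not spell this convention out.

```julia
# Hypothetical helper: dimensions of one operand for a given MAC shape,
# assuming D = A * B + C with A (M x K), B (K x N), and C, D (M x N).
function fragment_dims(shape::AbstractString, matrix::AbstractString)
    m = match(r"^m(\d+)n(\d+)k(\d+)$", shape)
    m === nothing && throw(ArgumentError("unrecognized shape: $shape"))
    M, N, K = parse.(Int, m.captures)
    matrix == "a" && return (M, K)
    matrix == "b" && return (K, N)
    matrix in ("c", "d") && return (M, N)
    throw(ArgumentError("unrecognized matrix: $matrix"))
end

println(fragment_dims("m32n8k16", "b"))
println(fragment_dims("m8n32k16", "c"))
```

Under that assumption, the b operand of an m32n8k16 operation is 16×8, so the shape names the whole operation rather than any single matrix.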
CUDA.WMMA.llvm_wmma_load_b_col_m32n8k16_global_stride_s8
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
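The {elem_type} rules above can be written down as a small lookup. The following is only a sketch: `WMMA_ELEM_TYPES` and `is_valid_elem_type` are hypothetical names, and the mapping from suffixes to Julia types is an assumption based on the descriptions in the list (for example, s8 as a signed byte), not something taken from the CUDA.jl sources.

```julia
# Assumed correspondence between element-type suffixes and Julia types.
const WMMA_ELEM_TYPES = Dict(
    "u8"  => UInt8,    # byte unsigned integer
    "s8"  => Int8,     # byte signed integer
    "f16" => Float16,  # half precision floating point
    "s32" => Int32,    # 32-bit signed integer
    "f32" => Float32,  # full (single) precision floating point
)

# Hypothetical check: is `elem_type` listed as valid for this matrix?
function is_valid_elem_type(matrix::AbstractString, elem_type::AbstractString)
    if matrix in ("a", "b")
        return elem_type in ("u8", "s8", "f16")
    elseif matrix in ("c", "d")
        return elem_type in ("s32", "f16", "f32")
    end
    return false
end

println(is_valid_elem_type("b", "s8"))   # input operand
println(is_valid_elem_type("c", "u8"))   # accumulator operand
```

Only f16 appears in both lists, so it is the one element type that can be used for inputs and accumulators alike.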
CUDA.WMMA.llvm_wmma_load_b_col_m32n8k16_global_stride_u8
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_b_col_m32n8k16_shared_stride_f16
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_b_col_m32n8k16_shared_stride_s8
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_b_col_m32n8k16_shared_stride_u8
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_b_col_m32n8k16_stride_f16
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_b_col_m32n8k16_stride_s8
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_b_col_m32n8k16_stride_u8
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_b_col_m8n32k16_global_stride_f16
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_b_col_m8n32k16_global_stride_s8
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_b_col_m8n32k16_global_stride_u8
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_b_col_m8n32k16_shared_stride_f16
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_b_col_m8n32k16_shared_stride_s8
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_b_col_m8n32k16_shared_stride_u8
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_b_col_m8n32k16_stride_f16
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_b_col_m8n32k16_stride_s8
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_b_col_m8n32k16_stride_u8
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_b_row_m16n16k16_global_stride_f16
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_b_row_m16n16k16_global_stride_s8
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_b_row_m16n16k16_global_stride_u8
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_b_row_m16n16k16_shared_stride_f16
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_b_row_m16n16k16_shared_stride_s8
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_b_row_m16n16k16_shared_stride_u8
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_b_row_m16n16k16_stride_f16
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_b_row_m16n16k16_stride_s8
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_b_row_m16n16k16_stride_u8
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_b_row_m32n8k16_global_stride_f16
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_b_row_m32n8k16_global_stride_s8
— FunctionWMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
CUDA.WMMA.llvm_wmma_load_b_row_m32n8k16_global_stride_u8
— FunctionWMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
CUDA.WMMA.llvm_wmma_load_b_row_m32n8k16_shared_stride_f16
— FunctionWMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
CUDA.WMMA.llvm_wmma_load_b_row_m32n8k16_shared_stride_s8
— FunctionWMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
CUDA.WMMA.llvm_wmma_load_b_row_m32n8k16_shared_stride_u8
— FunctionWMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
CUDA.WMMA.llvm_wmma_load_b_row_m32n8k16_stride_f16
— FunctionWMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
CUDA.WMMA.llvm_wmma_load_b_row_m32n8k16_stride_s8
— FunctionWMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
CUDA.WMMA.llvm_wmma_load_b_row_m32n8k16_stride_u8
— FunctionWMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
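To make the meaning of stride more concrete, here is a small host-side Julia sketch of how a leading dimension translates into element offsets for the two layouts. The function element_offset is a hypothetical illustration, not part of CUDA.jl, and it assumes the usual convention that the leading dimension is the distance in elements between consecutive rows (row major) or consecutive columns (column major).

```julia
# Hypothetical helper: zero-based linear offset, in elements, of entry (i, j).
# i and j are zero-based row and column indices; stride is the leading dimension.
function element_offset(layout::Symbol, i::Integer, j::Integer, stride::Integer)
    if layout === :row
        return i * stride + j      # consecutive rows are `stride` elements apart
    elseif layout === :col
        return j * stride + i      # consecutive columns are `stride` elements apart
    else
        throw(ArgumentError("layout must be :row or :col"))
    end
end

# A tile taken from a larger row-major buffer that is 64 elements wide:
println(element_offset(:row, 2, 3, 64))
# A tightly packed column-major tile with 16 rows:
println(element_offset(:col, 2, 3, 16))
```

Passing a stride larger than the tile width (or height, for column major) is what allows a tile to be read out of a larger matrix without copying it first.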
CUDA.WMMA.llvm_wmma_load_b_row_m8n32k16_global_stride_f16
— FunctionWMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
CUDA.WMMA.llvm_wmma_load_b_row_m8n32k16_global_stride_s8
— FunctionWMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
CUDA.WMMA.llvm_wmma_load_b_row_m8n32k16_global_stride_u8
— FunctionWMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
CUDA.WMMA.llvm_wmma_load_b_row_m8n32k16_shared_stride_f16
— FunctionWMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
CUDA.WMMA.llvm_wmma_load_b_row_m8n32k16_shared_stride_s8
— FunctionWMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
CUDA.WMMA.llvm_wmma_load_b_row_m8n32k16_shared_stride_u8
— FunctionWMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
CUDA.WMMA.llvm_wmma_load_b_row_m8n32k16_stride_f16
— FunctionWMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
CUDA.WMMA.llvm_wmma_load_b_row_m8n32k16_stride_s8
— FunctionWMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
CUDA.WMMA.llvm_wmma_load_b_row_m8n32k16_stride_u8
— FunctionWMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
CUDA.WMMA.llvm_wmma_load_c_col_m16n16k16_global_stride_f16
— FunctionWMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
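The element types allowed for each loadable matrix, as listed above, can be summarized in a short Julia sketch. The functions valid_elem_types and is_valid_elem_type are hypothetical helpers that simply transcribe the placeholder description; they are not part of CUDA.jl and do not check hardware support.

```julia
# Hypothetical helper: element types listed in the docstring for each loadable matrix.
function valid_elem_types(matrix::AbstractString)
    if matrix in ("a", "b")
        return ("u8", "s8", "f16")      # multiplicand matrices
    elseif matrix == "c"
        return ("s32", "f16", "f32")    # accumulator matrix
    else
        throw(ArgumentError("the load wrappers cover matrices a, b and c only"))
    end
end

# True if the docstring lists `elem_type` as valid for `matrix`.
is_valid_elem_type(matrix, elem_type) = elem_type in valid_elem_types(matrix)

println(is_valid_elem_type("c", "f32"))
println(is_valid_elem_type("b", "f32"))
```

This explains, for example, why the c entries in this reference end in f16, f32 or s32, while the b entries end in f16, s8 or u8.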
CUDA.WMMA.llvm_wmma_load_c_col_m16n16k16_global_stride_f32
— FunctionWMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
CUDA.WMMA.llvm_wmma_load_c_col_m16n16k16_global_stride_s32
— FunctionWMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
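Because stride is counted in elements rather than bytes, the byte distance between rows or columns depends on the element type. The Julia sketch below maps each {elem_type} value to a plausible Julia type and converts a stride to bytes. Both elem_julia_type and stride_in_bytes are hypothetical names, and the mapping is an assumption based on the bit widths given above, not something taken from the CUDA.jl sources.

```julia
# Hypothetical mapping from the {elem_type} placeholder to a Julia element type,
# assumed from the bit widths described in the docstring.
function elem_julia_type(elem_type::AbstractString)
    elem_type == "u8"  && return UInt8
    elem_type == "s8"  && return Int8
    elem_type == "s32" && return Int32
    elem_type == "f16" && return Float16
    elem_type == "f32" && return Float32
    throw(ArgumentError("unknown element type: $elem_type"))
end

# Convert a stride given in elements into a distance in bytes.
stride_in_bytes(stride::Integer, elem_type::AbstractString) =
    stride * sizeof(elem_julia_type(elem_type))

println(stride_in_bytes(16, "s32"))
println(stride_in_bytes(16, "f16"))
```

The same stride of 16 elements therefore spans a different number of bytes for an s32 accumulator than for an f16 one.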
CUDA.WMMA.llvm_wmma_load_c_col_m16n16k16_shared_stride_f16
— FunctionWMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
CUDA.WMMA.llvm_wmma_load_c_col_m16n16k16_shared_stride_f32
— FunctionWMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
CUDA.WMMA.llvm_wmma_load_c_col_m16n16k16_shared_stride_s32
— FunctionWMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
CUDA.WMMA.llvm_wmma_load_c_col_m16n16k16_stride_f16
— FunctionWMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
CUDA.WMMA.llvm_wmma_load_c_col_m16n16k16_stride_f32
— FunctionWMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
CUDA.WMMA.llvm_wmma_load_c_col_m16n16k16_stride_s32
— FunctionWMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
CUDA.WMMA.llvm_wmma_load_c_col_m32n8k16_global_stride_f16
— FunctionWMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
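The {shape} placeholder encodes the m, n and k extents of the multiply-accumulate operation. As a hedged sketch, the Julia code below parses a shape string and derives the dimensions of each matrix, assuming the conventional GEMM layout in which a is m×k, b is k×n and c is m×n. The helpers parse_shape and matrix_dims are hypothetical and not part of CUDA.jl.

```julia
# Hypothetical helper: split a shape string such as "m32n8k16" into (m, n, k).
function parse_shape(shape::AbstractString)
    m = match(r"^m(\d+)n(\d+)k(\d+)$", shape)
    m === nothing && throw(ArgumentError("unrecognized shape: $shape"))
    return (parse(Int, m[1]), parse(Int, m[2]), parse(Int, m[3]))
end

# Hypothetical helper: (rows, columns) of a matrix, assuming D = A * B + C
# with A of size m×k, B of size k×n, and C of size m×n.
function matrix_dims(matrix::AbstractString, shape::AbstractString)
    m, n, k = parse_shape(shape)
    matrix == "a" && return (m, k)
    matrix == "b" && return (k, n)
    matrix == "c" && return (m, n)
    throw(ArgumentError("matrix must be a, b or c"))
end

println(matrix_dims("c", "m32n8k16"))
println(matrix_dims("b", "m8n32k16"))
```

Under that assumption, a tightly packed column-major c matrix for the m32n8k16 shape would use a stride of 32, its number of rows.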
CUDA.WMMA.llvm_wmma_load_c_col_m32n8k16_global_stride_f32
— FunctionWMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
CUDA.WMMA.llvm_wmma_load_c_col_m32n8k16_global_stride_s32
— FunctionWMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
CUDA.WMMA.llvm_wmma_load_c_col_m32n8k16_shared_stride_f16 — Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_c_col_m32n8k16_shared_stride_f32 — Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_c_col_m32n8k16_shared_stride_s32 — Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_c_col_m32n8k16_stride_f16 — Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
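The entry names on this page follow the placeholder pattern in the signature, with the address-space component dropped when it is empty (generic addressing). The sketch below composes such a name from its placeholders. The helper wmma_load_name is hypothetical and not part of CUDA.jl; it only builds a string and does not check that the resulting wrapper exists.

```julia
# Hypothetical helper (not a CUDA.jl function): builds a wrapper name from
# the {matrix}, {layout}, {shape}, {addr_space} and {elem_type} placeholders.
function wmma_load_name(matrix, layout, shape, addr_space, elem_type)
    parts = ["llvm_wmma_load", matrix, layout, shape]
    # An empty address space (generic addressing) adds no name component.
    isempty(addr_space) || push!(parts, addr_space)
    push!(parts, "stride", elem_type)
    return join(parts, "_")
end

global_name  = wmma_load_name("c", "col", "m32n8k16", "global", "f32")
generic_name = wmma_load_name("c", "col", "m32n8k16", "", "f16")
```

Here global_name is "llvm_wmma_load_c_col_m32n8k16_global_stride_f32" and generic_name is "llvm_wmma_load_c_col_m32n8k16_stride_f16", both of which appear as entries on this page.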
CUDA.WMMA.llvm_wmma_load_c_col_m32n8k16_stride_f32 — Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_c_col_m32n8k16_stride_s32 — Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_c_col_m8n32k16_global_stride_f16 — Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_c_col_m8n32k16_global_stride_f32 — Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_c_col_m8n32k16_global_stride_s32 — Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_c_col_m8n32k16_shared_stride_f16 — Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_c_col_m8n32k16_shared_stride_f32 — Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_c_col_m8n32k16_shared_stride_s32 — Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_c_col_m8n32k16_stride_f16 — Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_c_col_m8n32k16_stride_f32 — Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_c_col_m8n32k16_stride_s32 — Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_c_row_m16n16k16_global_stride_f16 — Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
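The {elem_type} rule above can be written as a small lookup, sketched below. Both helpers are hypothetical and not part of CUDA.jl. They check only whether an element type is listed for a given matrix, not whether a particular combination of a, b, c and d types is supported by the hardware.

```julia
# Hypothetical helpers (not CUDA.jl functions), transcribed from the
# {elem_type} placeholder description.
function valid_elem_types(matrix)
    matrix in ("a", "b") && return ("u8", "s8", "f16")
    matrix in ("c", "d") && return ("s32", "f16", "f32")
    throw(ArgumentError("unknown matrix: $matrix"))
end

# True when `elem_type` is one of the values listed for `matrix`.
is_valid_elem_type(matrix, elem_type) = elem_type in valid_elem_types(matrix)

c_accepts_f32 = is_valid_elem_type("c", "f32")  # f32 is listed for c and d
a_accepts_f32 = is_valid_elem_type("a", "f32")  # f32 is not listed for a and b
```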
CUDA.WMMA.llvm_wmma_load_c_row_m16n16k16_global_stride_f32 — Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_c_row_m16n16k16_global_stride_s32 — Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_c_row_m16n16k16_shared_stride_f16 — Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_c_row_m16n16k16_shared_stride_f32 — Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_c_row_m16n16k16_shared_stride_s32 — Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_c_row_m16n16k16_stride_f16 — Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_c_row_m16n16k16_stride_f32 — Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_c_row_m16n16k16_stride_s32 — Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_c_row_m32n8k16_global_stride_f16 — Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_c_row_m32n8k16_global_stride_f32 — Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_c_row_m32n8k16_global_stride_s32 — Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_c_row_m32n8k16_shared_stride_f16 — Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_c_row_m32n8k16_shared_stride_f32 — Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_c_row_m32n8k16_shared_stride_s32
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_c_row_m32n8k16_stride_f16
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_c_row_m32n8k16_stride_f32
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_c_row_m32n8k16_stride_s32
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_c_row_m8n32k16_global_stride_f16
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_c_row_m8n32k16_global_stride_f32
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
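As a rough sketch of how this wrapper might be called from a kernel, the example below has one warp load an 8×32 Float32 C tile from global memory and write it back unchanged. Several parts are assumptions rather than documented facts: the kernel name copy_c_tile is hypothetical, the stride of 32 assumes a densely packed row-major tile, and the store wrapper llvm_wmma_store_d_row_m8n32k16_global_stride_f32 is assumed to follow the same naming scheme with arguments (dst_addr, data, stride). The launch is guarded because WMMA needs a GPU with compute capability 7.0 or higher.

```julia
using CUDA

# Hypothetical kernel: the 32 threads of one warp cooperatively load a C tile
# and store it again, so the output should equal the input.
function copy_c_tile(c_dev, d_dev)
    stride = 32  # assumed leading dimension: 32 elements per row-major row
    c_frag = WMMA.llvm_wmma_load_c_row_m8n32k16_global_stride_f32(pointer(c_dev), stride)
    # Assumption: the matching store wrapper exists under this name.
    WMMA.llvm_wmma_store_d_row_m8n32k16_global_stride_f32(pointer(d_dev), c_frag, stride)
    return
end

# Only launch when a usable, WMMA-capable device is present.
if CUDA.functional() && capability(device()) >= v"7.0"
    # A 32×8 column-major Julia array has the memory layout of an
    # 8×32 row-major tile: 256 contiguous Float32 values.
    c_dev = CuArray(rand(Float32, 32, 8))
    d_dev = CUDA.zeros(Float32, 32, 8)
    @cuda threads=32 copy_c_tile(c_dev, d_dev)
    @assert Array(d_dev) == Array(c_dev)
end
```

The launch uses exactly 32 threads because the WMMA intrinsics are warp-wide operations: every thread in the warp has to execute the load and the store together.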
CUDA.WMMA.llvm_wmma_load_c_row_m8n32k16_global_stride_s32
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_c_row_m8n32k16_shared_stride_f16
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_c_row_m8n32k16_shared_stride_f32
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_c_row_m8n32k16_shared_stride_s32
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_c_row_m8n32k16_stride_f16
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_c_row_m8n32k16_stride_f32
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_c_row_m8n32k16_stride_s32
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_d_col_m16n16k16_global_stride_f16
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_d_col_m16n16k16_global_stride_f32
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_d_col_m16n16k16_global_stride_s32
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_d_col_m16n16k16_shared_stride_f16
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_d_col_m16n16k16_shared_stride_f32
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_d_col_m16n16k16_shared_stride_s32
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_d_col_m16n16k16_stride_f16
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_d_col_m16n16k16_stride_f32
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_d_col_m16n16k16_stride_s32
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_d_col_m32n8k16_global_stride_f16
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_d_col_m32n8k16_global_stride_f32
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_d_col_m32n8k16_global_stride_s32
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_d_col_m32n8k16_shared_stride_f16
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_d_col_m32n8k16_shared_stride_f32
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_d_col_m32n8k16_shared_stride_s32
— FunctionWMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
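As a rough illustration of what a leading dimension "in numbers of elements" means, the Julia sketch below computes the zero-based element offset of entry (i, j) for both layouts. It assumes the usual BLAS-style convention for leading dimensions; the helper names are hypothetical, and how the intrinsic distributes elements across threads is not modeled here.

```julia
# Hypothetical helpers for illustration; not CUDA.jl API.
# Zero-based offset, counted in elements, of the 1-based entry (i, j)
# relative to the base address, given the leading dimension `stride`.
col_major_offset(i, j, stride) = (j - 1) * stride + (i - 1)  # column major (Julia style)
row_major_offset(i, j, stride) = (i - 1) * stride + (j - 1)  # row major (C style)

# Cross-check against a plain Julia array, which is column major:
# a 64x8 array has leading dimension 64, and each entry stores its own offset.
A = reshape(collect(0:(64 * 8 - 1)), 64, 8)
@assert all(A[i, j] == col_major_offset(i, j, 64) for i in 1:64, j in 1:8)
```

Under this convention, a tile that lives inside a larger array is described by passing the parent array's leading dimension as the stride, not the tile's own height or width.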
CUDA.WMMA.llvm_wmma_load_d_col_m32n8k16_stride_f16
— FunctionWMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_d_col_m32n8k16_stride_f32
— FunctionWMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_d_col_m32n8k16_stride_s32
— FunctionWMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_d_col_m8n32k16_global_stride_f16
— FunctionWMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_d_col_m8n32k16_global_stride_f32
— FunctionWMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_d_col_m8n32k16_global_stride_s32
— FunctionWMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_d_col_m8n32k16_shared_stride_f16
— FunctionWMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_d_col_m8n32k16_shared_stride_f32
— FunctionWMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_d_col_m8n32k16_shared_stride_s32
— FunctionWMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_d_col_m8n32k16_stride_f16
— FunctionWMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_d_col_m8n32k16_stride_f32
— FunctionWMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_d_col_m8n32k16_stride_s32
— FunctionWMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_d_row_m16n16k16_global_stride_f16
— FunctionWMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_d_row_m16n16k16_global_stride_f32
— FunctionWMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_d_row_m16n16k16_global_stride_s32
— FunctionWMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_d_row_m16n16k16_shared_stride_f16
— FunctionWMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_d_row_m16n16k16_shared_stride_f32
— FunctionWMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_d_row_m16n16k16_shared_stride_s32
— FunctionWMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_d_row_m16n16k16_stride_f16
— FunctionWMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_d_row_m16n16k16_stride_f32
— FunctionWMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_d_row_m16n16k16_stride_s32
— FunctionWMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_d_row_m32n8k16_global_stride_f16
— FunctionWMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_d_row_m32n8k16_global_stride_f32
— FunctionWMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_d_row_m32n8k16_global_stride_s32
— FunctionWMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_d_row_m32n8k16_shared_stride_f16
— FunctionWMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_d_row_m32n8k16_shared_stride_f32
— FunctionWMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_d_row_m32n8k16_shared_stride_s32
— FunctionWMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_d_row_m32n8k16_stride_f16
— FunctionWMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_d_row_m32n8k16_stride_f32
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
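To make the role of `stride` concrete, the sketch below computes where an element sits relative to `src_addr` for each layout. It assumes the usual leading-dimension convention (the stride separates consecutive rows in row major storage and consecutive columns in column major storage); `element_offset` is an illustrative name, not a CUDA.jl function.

```julia
# Illustrative helper (an assumption, not part of CUDA.jl): zero-based offset, in
# elements from src_addr, of the element in zero-based row i and column j.
function element_offset(layout::Symbol, i::Integer, j::Integer, stride::Integer)
    if layout === :row
        return i * stride + j   # row major: consecutive rows are `stride` elements apart
    elseif layout === :col
        return j * stride + i   # column major: consecutive columns are `stride` elements apart
    else
        throw(ArgumentError("layout must be :row or :col"))
    end
end

element_offset(:row, 2, 3, 16)  # 2 * 16 + 3 = 35
element_offset(:col, 2, 3, 16)  # 3 * 16 + 2 = 50
```

For a densely packed 16×16 matrix the stride is 16 in either layout; a larger stride describes a tile embedded in a bigger matrix.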
CUDA.WMMA.llvm_wmma_load_d_row_m32n8k16_stride_s32
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_d_row_m8n32k16_global_stride_f16
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_d_row_m8n32k16_global_stride_f32
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_d_row_m8n32k16_global_stride_s32
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_d_row_m8n32k16_shared_stride_f16
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_d_row_m8n32k16_shared_stride_f32
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_d_row_m8n32k16_shared_stride_s32
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_d_row_m8n32k16_stride_f16
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_d_row_m8n32k16_stride_f32
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_load_d_row_m8n32k16_stride_s32
— Function
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_mma
— Method
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
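The two naming forms above (element types of $D$ and $C$ for floating point, element type of $A$ otherwise) can be sketched as follows. `wmma_mma_name` is a hypothetical helper written for illustration, not a CUDA.jl function, and it assumes that an `f16` input type is what selects the floating point form.

```julia
# Hypothetical helper (not part of CUDA.jl): compose the name of a WMMA MMA wrapper.
function wmma_mma_name(a_layout, b_layout, shape;
                       a_elem_type="f16", d_elem_type="f32", c_elem_type="f32")
    prefix = join(("llvm_wmma_mma", a_layout, b_layout, shape), "_")
    if a_elem_type == "f16"
        # Floating point: the name carries the element types of D and C.
        return string(prefix, "_", d_elem_type, "_", c_elem_type)
    else
        # Integer inputs (u8 or s8): the name carries only the element type of A.
        return string(prefix, "_", a_elem_type)
    end
end

wmma_mma_name("col", "col", "m16n16k16"; d_elem_type="f16", c_elem_type="f32")
# "llvm_wmma_mma_col_col_m16n16k16_f16_f32"
wmma_mma_name("col", "col", "m32n8k16"; a_elem_type="s8")
# "llvm_wmma_mma_col_col_m32n8k16_s8"
```

The layouts and shape passed here should be the same ones used to build the corresponding load and store names, in line with the matching requirement stated above.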
CUDA.WMMA.llvm_wmma_mma_col_col_m16n16k16_f16_f16
— Function
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
CUDA.WMMA.llvm_wmma_mma_col_col_m16n16k16_f16_f32
— Function
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
CUDA.WMMA.llvm_wmma_mma_col_col_m16n16k16_f32_f16
— Function
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
CUDA.WMMA.llvm_wmma_mma_col_col_m16n16k16_f32_f32
— Function
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
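As a rough guide to what the MAC operation computes, the following CPU-only sketch evaluates $D = A B + C$ for a given shape, reading `mMnNkK` as $A$ being M×K, $B$ being K×N, and $C$ and $D$ being M×N. `parse_shape` and `mac_reference` are illustrative names, not CUDA.jl functions, and the rounding of the GPU result may differ from this reference.

```julia
# Illustrative helper (not part of CUDA.jl): parse "m16n16k16" into (m, n, k).
function parse_shape(shape::AbstractString)
    caps = match(r"^m(\d+)n(\d+)k(\d+)$", shape)
    caps === nothing && throw(ArgumentError("unrecognised shape: $shape"))
    return Tuple(parse(Int, x) for x in caps.captures)
end

# CPU reference for D = A * B + C with f16 inputs and f32 accumulation.
function mac_reference(A::Matrix{Float16}, B::Matrix{Float16}, C::Matrix{Float32})
    return Float32.(A) * Float32.(B) + C
end

m, n, k = parse_shape("m16n16k16")
A = rand(Float16, m, k)   # A is m×k
B = rand(Float16, k, n)   # B is k×n
C = rand(Float32, m, n)   # C and D are m×n
D = mac_reference(A, B, C)
size(D)  # (16, 16)
```

A reference like this can serve as the expected value when checking a kernel built from the wrappers, compared with a tolerance rather than for exact equality.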
CUDA.WMMA.llvm_wmma_mma_col_col_m16n16k16_s8
— Function
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
CUDA.WMMA.llvm_wmma_mma_col_col_m16n16k16_u8
— Function
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
CUDA.WMMA.llvm_wmma_mma_col_col_m32n8k16_f16_f16
— Function
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
CUDA.WMMA.llvm_wmma_mma_col_col_m32n8k16_f16_f32
— Function
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
CUDA.WMMA.llvm_wmma_mma_col_col_m32n8k16_f32_f16
— Function
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
CUDA.WMMA.llvm_wmma_mma_col_col_m32n8k16_f32_f32
— Function
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
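The naming scheme described by the placeholders can be sketched in plain Julia. The helper `wmma_mma_name` below is hypothetical and not part of CUDA.jl; it only assembles a wrapper name as a string, assuming two element types (`d`, then `c`) for floating point operations and a single one (`a`) for all other operations.

```julia
# Hypothetical helper, not part of CUDA.jl: build a wrapper name from placeholder values.
const LAYOUTS = ("row", "col")
const SHAPES = ("m16n16k16", "m32n8k16", "m8n32k16")

function wmma_mma_name(a_layout, b_layout, shape, elem_types...)
    a_layout in LAYOUTS || throw(ArgumentError("invalid a_layout: $a_layout"))
    b_layout in LAYOUTS || throw(ArgumentError("invalid b_layout: $b_layout"))
    shape in SHAPES || throw(ArgumentError("invalid shape: $shape"))
    # Floating point: (d_elem_type, c_elem_type); all other operations: (a_elem_type,)
    length(elem_types) in (1, 2) ||
        throw(ArgumentError("expected one or two element types"))
    return join(("llvm_wmma_mma", a_layout, b_layout, shape, elem_types...), "_")
end

wmma_mma_name("col", "col", "m32n8k16", "f32", "f32")  # "llvm_wmma_mma_col_col_m32n8k16_f32_f32"
wmma_mma_name("col", "row", "m16n16k16", "u8")         # "llvm_wmma_mma_col_row_m16n16k16_u8"
```

The helper checks layouts and shapes only; it does not verify that a given combination of element types is actually provided by the library.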
CUDA.WMMA.llvm_wmma_mma_col_col_m32n8k16_s8
— Function
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
CUDA.WMMA.llvm_wmma_mma_col_col_m32n8k16_u8
— Function
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
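For the integer variants such as the u8 wrapper above, the arithmetic can be checked against a CPU-side reference. The sketch below assumes 8-bit inputs and 32-bit signed accumulators (the s32 value listed for {c_elem_type} and {d_elem_type}); `reference_mac_int` is a hypothetical name, the values are made up, and no WMMA intrinsic is called.

```julia
# Hypothetical CPU reference for D = A * B + C with 8-bit inputs and Int32 accumulators.
function reference_mac_int(A::AbstractMatrix{<:Union{UInt8,Int8}},
                           B::AbstractMatrix{<:Union{UInt8,Int8}},
                           C::AbstractMatrix{Int32})
    # Widen to Int32 before multiplying so that products do not wrap around in 8 bits.
    return Int32.(A) * Int32.(B) + C
end

A = fill(UInt8(200), 16, 16)
B = fill(UInt8(2), 16, 16)
C = fill(Int32(1), 16, 16)
D = reference_mac_int(A, B, C)
D[1, 1]  # 16 * 200 * 2 + 1 = 6401
```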
CUDA.WMMA.llvm_wmma_mma_col_col_m8n32k16_f16_f16
— Function
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
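To make the {shape} placeholder concrete: a shape such as m8n32k16 describes a multiply-accumulate $D = A B + C$ where $A$ is $m \times k$, $B$ is $k \times n$, and $C$ and $D$ are $m \times n$. The following plain-Julia sketch parses a shape string and computes a CPU reference result. The names `parse_shape` and `reference_mac` are hypothetical and not part of CUDA.jl, and the code runs on the host without touching any WMMA intrinsic.

```julia
# Hypothetical helper: parse a shape string such as "m8n32k16" into (m, n, k).
function parse_shape(shape::AbstractString)
    mt = match(r"^m(\d+)n(\d+)k(\d+)$", shape)
    mt === nothing && throw(ArgumentError("not a WMMA shape: $shape"))
    return Tuple(parse(Int, x) for x in mt.captures)
end

# CPU reference of the multiply-accumulate D = A * B + C for a given shape.
function reference_mac(shape, A, B, C)
    m, n, k = parse_shape(shape)
    size(A) == (m, k) || throw(DimensionMismatch("A must be $m x $k"))
    size(B) == (k, n) || throw(DimensionMismatch("B must be $k x $n"))
    size(C) == (m, n) || throw(DimensionMismatch("C must be $m x $n"))
    return A * B + C
end

A = ones(Float32, 8, 16)
B = ones(Float32, 16, 32)
C = zeros(Float32, 8, 32)
D = reference_mac("m8n32k16", A, B, C)
size(D)  # (8, 32)
```

A reference like this can be useful for comparing against the output of a GPU kernel, keeping in mind that half precision inputs will introduce rounding differences.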
CUDA.WMMA.llvm_wmma_mma_col_col_m8n32k16_f16_f32
— Function
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
CUDA.WMMA.llvm_wmma_mma_col_col_m8n32k16_f32_f16
— Function
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
CUDA.WMMA.llvm_wmma_mma_col_col_m8n32k16_f32_f32
— Function
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
CUDA.WMMA.llvm_wmma_mma_col_col_m8n32k16_s8
— Function
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
CUDA.WMMA.llvm_wmma_mma_col_col_m8n32k16_u8
— Function
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
CUDA.WMMA.llvm_wmma_mma_col_row_m16n16k16_f16_f16
— Function
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
CUDA.WMMA.llvm_wmma_mma_col_row_m16n16k16_f16_f32
— Function
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
CUDA.WMMA.llvm_wmma_mma_col_row_m16n16k16_f32_f16
— Function
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
CUDA.WMMA.llvm_wmma_mma_col_row_m16n16k16_f32_f32
— Function
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
CUDA.WMMA.llvm_wmma_mma_col_row_m16n16k16_s8
— Function
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
CUDA.WMMA.llvm_wmma_mma_col_row_m16n16k16_u8
— Function
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
CUDA.WMMA.llvm_wmma_mma_col_row_m32n8k16_f16_f16
— Function
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
CUDA.WMMA.llvm_wmma_mma_col_row_m32n8k16_f16_f32
— Function
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
CUDA.WMMA.llvm_wmma_mma_col_row_m32n8k16_f32_f16
— Function
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
CUDA.WMMA.llvm_wmma_mma_col_row_m32n8k16_f32_f32
— Function
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a
: The WMMA fragment corresponding to the matrix $A$.b
: The WMMA fragment corresponding to the matrix $B$.c
: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}
: The storage layout for matrix $A$. Can berow
orcol
, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.{b_layout}
: The storage layout for matrix $B$. Can berow
orcol
, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.{shape}
: The overall shape of the MAC operation. Valid values arem16n16k16
,m32n8k16
, andm8n32k16
.{a_elem_type}
: The type of each element in the $A$ matrix. Valid values areu8
(byte unsigned integer),s8
(byte signed integer), andf16
(half precision floating point).{d_elem_type}
: The type of each element in the resultant $D$ matrix. Valid values ares32
(32-bit signed integer),f16
(half precision floating point), andf32
(full precision floating point).{c_elem_type}
: The type of each element in the $C$ matrix. Valid values ares32
(32-bit signed integer),f16
(half precision floating point), andf32
(full precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
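The placeholders above combine into a concrete wrapper name by joining the parts with underscores. The following plain-Julia sketch illustrates that naming scheme. The helper wmma_mma_name is hypothetical (it is not part of CUDA.jl), and the rule that integer variants take a single element type while floating point variants take two is an assumption inferred from the function names listed in this reference.

```julia
# Hypothetical helper (not part of CUDA.jl): assemble a WMMA MMA wrapper name
# from the placeholders documented above.
function wmma_mma_name(a_layout, b_layout, shape, elem_types...)
    a_layout in ("row", "col") || throw(ArgumentError("invalid a_layout: $a_layout"))
    b_layout in ("row", "col") || throw(ArgumentError("invalid b_layout: $b_layout"))
    shape in ("m16n16k16", "m32n8k16", "m8n32k16") ||
        throw(ArgumentError("invalid shape: $shape"))

    if length(elem_types) == 1
        # Integer form: only {a_elem_type} appears in the name (assumed: u8 or s8).
        elem_types[1] in ("u8", "s8") ||
            throw(ArgumentError("invalid a_elem_type: $(elem_types[1])"))
    elseif length(elem_types) == 2
        # Floating point form: {d_elem_type} followed by {c_elem_type}.
        all(t -> t in ("f16", "f32"), elem_types) ||
            throw(ArgumentError("invalid d/c element types: $elem_types"))
    else
        throw(ArgumentError("expected (a_elem_type,) or (d_elem_type, c_elem_type)"))
    end

    return join(("llvm_wmma_mma", a_layout, b_layout, shape, elem_types...), "_")
end

wmma_mma_name("col", "row", "m32n8k16", "f32", "f32")
# "llvm_wmma_mma_col_row_m32n8k16_f32_f32"

wmma_mma_name("col", "row", "m32n8k16", "s8")
# "llvm_wmma_mma_col_row_m32n8k16_s8"
```

The helper only builds strings; it does not check that the resulting function is available on a given GPU or CUDA.jl version.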
CUDA.WMMA.llvm_wmma_mma_col_row_m32n8k16_s8
— Function
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the multiply-accumulate (MAC) operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
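For the integer variants only {a_elem_type} appears in the name, while s32 is the only integer type listed for {c_elem_type} and {d_elem_type}. The CPU-only sketch below models what such an operation computes for the m32n8k16 shape and shows why a 32-bit accumulator is needed. It calls nothing from CUDA.jl. Reading the shape as M=32, N=8, K=16 (with $A$ being M×K, $B$ being K×N, and $C$ and $D$ being M×N) is an assumption based on the usual WMMA convention.

```julia
# CPU reference model (illustration only, no GPU code) of D = A * B + C
# for the m32n8k16 shape with s8 inputs and s32 accumulators.
M, N, K = 32, 8, 16            # assumed meaning of "m32n8k16"

a = fill(Int8(127), M, K)      # A: M×K, s8 elements
b = fill(Int8(127), K, N)      # B: K×N, s8 elements
c = zeros(Int32, M, N)         # C: M×N, s32 accumulator

# Widen the inputs to Int32 before multiplying, so that neither the individual
# products nor the sum over K terms wrap around.
d = Int32.(a) * Int32.(b) .+ c

size(d)     # (32, 8)
d[1, 1]     # 16 * 127 * 127 = 258064, far beyond typemax(Int8) == 127
```

Each output element sums K = 16 products, so the result can greatly exceed the range of the 8-bit input type.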
CUDA.WMMA.llvm_wmma_mma_col_row_m32n8k16_u8
— Function
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the multiply-accumulate (MAC) operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
CUDA.WMMA.llvm_wmma_mma_col_row_m8n32k16_f16_f16
— Function
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the multiply-accumulate (MAC) operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
CUDA.WMMA.llvm_wmma_mma_col_row_m8n32k16_f16_f32
— Function
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the multiply-accumulate (MAC) operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
CUDA.WMMA.llvm_wmma_mma_col_row_m8n32k16_f32_f16
— Function
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the multiply-accumulate (MAC) operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
CUDA.WMMA.llvm_wmma_mma_col_row_m8n32k16_f32_f32
— Function
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the multiply-accumulate (MAC) operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
CUDA.WMMA.llvm_wmma_mma_col_row_m8n32k16_s8
— Function
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the multiply-accumulate (MAC) operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
CUDA.WMMA.llvm_wmma_mma_col_row_m8n32k16_u8
— Function
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the multiply-accumulate (MAC) operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
CUDA.WMMA.llvm_wmma_mma_row_col_m16n16k16_f16_f16
— Function
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the multiply-accumulate (MAC) operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
CUDA.WMMA.llvm_wmma_mma_row_col_m16n16k16_f16_f32
— Function
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the multiply-accumulate (MAC) operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
CUDA.WMMA.llvm_wmma_mma_row_col_m16n16k16_f32_f16
— Function
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the multiply-accumulate (MAC) operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
CUDA.WMMA.llvm_wmma_mma_row_col_m16n16k16_f32_f32
— Function
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the multiply-accumulate (MAC) operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
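The requirement that the MMA layout match the layout used in the load operation can be illustrated on the CPU. The sketch below is plain Julia and calls nothing from CUDA.jl. The helper read_tile is hypothetical, and it models a layout simply as an indexing rule over a flat buffer with a leading-dimension stride, which is an assumption made for illustration. It shows that the same memory read with the row layout yields the transpose of what the col layout yields.

```julia
# Hypothetical CPU model (not CUDA.jl code): read a rows×cols tile from a flat
# buffer, interpreting the memory as row major (:row) or column major (:col).
function read_tile(buf, rows, cols, layout, stride)
    layout in (:row, :col) || throw(ArgumentError("layout must be :row or :col"))
    tile = Matrix{eltype(buf)}(undef, rows, cols)
    for j in 1:cols, i in 1:rows
        # Column major: consecutive elements of a column are adjacent in memory.
        # Row major: consecutive elements of a row are adjacent in memory.
        idx = layout === :col ? (j - 1) * stride + i : (i - 1) * stride + j
        tile[i, j] = buf[idx]
    end
    return tile
end

a = reshape(Float32.(1:256), 16, 16)   # a Julia matrix is stored column major
buf = vec(a)                           # the underlying memory as a flat buffer

read_tile(buf, 16, 16, :col, 16) == a              # true
read_tile(buf, 16, 16, :row, 16) == transpose(a)   # true
```

In this model, mixing layouts between operations silently reinterprets the data as its transpose, which is one way a mismatch produces wrong results.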
CUDA.WMMA.llvm_wmma_mma_row_col_m16n16k16_s8
— Function
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the multiply-accumulate (MAC) operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
CUDA.WMMA.llvm_wmma_mma_row_col_m16n16k16_u8
— Function
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the multiply-accumulate (MAC) operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
CUDA.WMMA.llvm_wmma_mma_row_col_m32n8k16_f16_f16
— Function
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the multiply-accumulate (MAC) operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
CUDA.WMMA.llvm_wmma_mma_row_col_m32n8k16_f16_f32
— Function
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the multiply-accumulate (MAC) operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
CUDA.WMMA.llvm_wmma_mma_row_col_m32n8k16_f32_f16
— Function
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the multiply-accumulate (MAC) operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
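To make the shape and element-type placeholders concrete, the following CPU-only sketch computes the multiply-accumulate $D = A \cdot B + C$ for the m32n8k16 shape with an f32 result and an f16 accumulator input. It assumes the usual reading of the shape string (A is M×K, B is K×N, C and D are M×N); `parse_shape` is a hypothetical helper, and the arithmetic is a plain reference, not a model of Tensor Core numerics.

```julia
# Hypothetical helper: parse a shape string such as "m32n8k16" into (M, N, K).
function parse_shape(shape::AbstractString)
    m = match(r"^m(\d+)n(\d+)k(\d+)$", shape)
    m === nothing && throw(ArgumentError("not a WMMA shape: $shape"))
    return Tuple(parse(Int, x) for x in m.captures)
end

M, N, K = parse_shape("m32n8k16")

A = rand(Float16, M, K)   # a_elem_type = f16
B = rand(Float16, K, N)
C = rand(Float16, M, N)   # c_elem_type = f16

# CPU reference for the MAC operation, producing an f32 result (d_elem_type = f32).
D = Float32.(A) * Float32.(B) + Float32.(C)
```

A reference like this is mainly useful for checking the output of a kernel against host results within a tolerance.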
CUDA.WMMA.llvm_wmma_mma_row_col_m32n8k16_f32_f32
— Function
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
CUDA.WMMA.llvm_wmma_mma_row_col_m32n8k16_s8
— Function
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
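For the integer variants, the docstring lists s32 as the integer type for $C$ and $D$ while $A$ uses 8-bit elements. A short CPU-only sketch shows why a wider accumulator is needed for the m32n8k16 shape; it is a plain host reference under the assumption that A is 32×16 and B is 16×8, not a description of how the hardware performs the computation.

```julia
# Worst-case s8 inputs for the m32n8k16 shape (A is 32×16, B is 16×8).
A = fill(Int8(127), 32, 16)
B = fill(Int8(127), 16, 8)
C = zeros(Int32, 32, 8)

# Widen to Int32 before multiplying so the products and sums cannot wrap around.
D = Int32.(A) * Int32.(B) + C

# Each entry sums K = 16 products of 127 * 127, far outside the Int8 range.
largest = maximum(D)
```

Here every entry of `D` equals 16 * 127 * 127 = 258064, which fits comfortably in a 32-bit signed integer but not in 8 bits.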
CUDA.WMMA.llvm_wmma_mma_row_col_m32n8k16_u8
— Function
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
CUDA.WMMA.llvm_wmma_mma_row_col_m8n32k16_f16_f16
— Function
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
CUDA.WMMA.llvm_wmma_mma_row_col_m8n32k16_f16_f32
— Function
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
CUDA.WMMA.llvm_wmma_mma_row_col_m8n32k16_f32_f16
— Function
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
CUDA.WMMA.llvm_wmma_mma_row_col_m8n32k16_f32_f32
— Function
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
CUDA.WMMA.llvm_wmma_mma_row_col_m8n32k16_s8
— Function
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
CUDA.WMMA.llvm_wmma_mma_row_col_m8n32k16_u8
— Function
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
CUDA.WMMA.llvm_wmma_mma_row_row_m16n16k16_f16_f16
— Function
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
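Because Julia arrays are column major, the row layout used by the row_row variants deserves a concrete illustration. The sketch below reads the same flat buffer under both layouts on the CPU; `read_matrix` is a hypothetical helper for illustration only and is unrelated to the WMMA load intrinsics themselves.

```julia
# Hypothetical helper: interpret a flat buffer as an M×N matrix
# in the given storage layout ("row" or "col").
function read_matrix(buf::AbstractVector, M::Int, N::Int, layout::AbstractString)
    if layout == "col"
        # Column major (Julia style): consecutive elements form a column.
        return reshape(buf, M, N)
    elseif layout == "row"
        # Row major (C style): consecutive elements form a row.
        return permutedims(reshape(buf, N, M))
    else
        throw(ArgumentError("layout must be \"row\" or \"col\""))
    end
end

buf = collect(1:6)
as_col = read_matrix(buf, 2, 3, "col")   # [1 3 5; 2 4 6]
as_row = read_matrix(buf, 2, 3, "row")   # [1 2 3; 4 5 6]
```

For a square matrix, reading a Julia array's memory with the row layout yields its transpose, which is worth keeping in mind when comparing results of a row-layout kernel against host computations.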
CUDA.WMMA.llvm_wmma_mma_row_row_m16n16k16_f16_f32
— Function
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
CUDA.WMMA.llvm_wmma_mma_row_row_m16n16k16_f32_f16
— Function
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
CUDA.WMMA.llvm_wmma_mma_row_row_m16n16k16_f32_f32
— Function
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
CUDA.WMMA.llvm_wmma_mma_row_row_m16n16k16_s8
— Function
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
CUDA.WMMA.llvm_wmma_mma_row_row_m16n16k16_u8
— Function
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
CUDA.WMMA.llvm_wmma_mma_row_row_m32n8k16_f16_f16
— Function
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
CUDA.WMMA.llvm_wmma_mma_row_row_m32n8k16_f16_f32
— Function
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
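The naming scheme described by the placeholders can be illustrated with a small sketch. The helper wmma_mma_name below is hypothetical and not part of CUDA.jl; it only assembles the name of a wrapper from its placeholder values, assuming that an a_elem_type of f16 selects the floating point form and that u8 or s8 selects the integer form.

```julia
# Hypothetical helper (not part of CUDA.jl): build the name of an MMA wrapper
# from the placeholder values described above.
function wmma_mma_name(a_layout, b_layout, shape;
                       a_elem_type="f16", d_elem_type="f32", c_elem_type="f32")
    prefix = join(["llvm_wmma_mma", a_layout, b_layout, shape], "_")
    if a_elem_type == "f16"
        # Floating point form: the suffix is {d_elem_type}_{c_elem_type}.
        return join([prefix, d_elem_type, c_elem_type], "_")
    else
        # Integer form (u8 or s8): the suffix is {a_elem_type}.
        return join([prefix, a_elem_type], "_")
    end
end

println(wmma_mma_name("row", "row", "m32n8k16"; d_elem_type="f16", c_elem_type="f32"))
println(wmma_mma_name("row", "row", "m32n8k16"; a_elem_type="s8"))
```

The two calls produce llvm_wmma_mma_row_row_m32n8k16_f16_f32 and llvm_wmma_mma_row_row_m32n8k16_s8, both of which are names listed in this reference.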
CUDA.WMMA.llvm_wmma_mma_row_row_m32n8k16_f32_f16
— FunctionWMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
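To make the {shape} placeholder concrete, here is a CPU-only sketch of the multiply-accumulate $D = A B + C$ that the wrapper computes on fragments. It assumes the usual WMMA convention that a shape mMnNkK means $A$ is M×K, $B$ is K×N, and $C$ and $D$ are M×N. The helpers parse_shape and mac_reference are hypothetical and only illustrate the dimensions; they do not touch the GPU.

```julia
# Hypothetical helper: parse a shape string such as "m32n8k16" into (m, n, k).
function parse_shape(shape::AbstractString)
    m = match(r"^m(\d+)n(\d+)k(\d+)$", shape)
    m === nothing && throw(ArgumentError("unrecognized shape: $shape"))
    return (parse(Int, m[1]), parse(Int, m[2]), parse(Int, m[3]))
end

# Hypothetical CPU reference for the MAC operation D = A * B + C.
function mac_reference(A, B, C, shape)
    m, n, k = parse_shape(shape)
    size(A) == (m, k) || throw(DimensionMismatch("A must be $m x $k"))
    size(B) == (k, n) || throw(DimensionMismatch("B must be $k x $n"))
    size(C) == (m, n) || throw(DimensionMismatch("C must be $m x $n"))
    return A * B + C
end

A = ones(Float32, 32, 16)
B = ones(Float32, 16, 8)
C = zeros(Float32, 32, 8)
D = mac_reference(A, B, C, "m32n8k16")
println(size(D))
```

With these inputs, D has size (32, 8) and every entry equals 16, since each entry sums k = 16 products of ones.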
CUDA.WMMA.llvm_wmma_mma_row_row_m32n8k16_f32_f32
— FunctionWMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
CUDA.WMMA.llvm_wmma_mma_row_row_m32n8k16_s8
— FunctionWMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
CUDA.WMMA.llvm_wmma_mma_row_row_m32n8k16_u8
— FunctionWMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
CUDA.WMMA.llvm_wmma_mma_row_row_m8n32k16_f16_f16
— FunctionWMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
CUDA.WMMA.llvm_wmma_mma_row_row_m8n32k16_f16_f32
— FunctionWMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
CUDA.WMMA.llvm_wmma_mma_row_row_m8n32k16_f32_f16
— FunctionWMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
CUDA.WMMA.llvm_wmma_mma_row_row_m8n32k16_f32_f32
— FunctionWMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
CUDA.WMMA.llvm_wmma_mma_row_row_m8n32k16_s8
— FunctionWMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
CUDA.WMMA.llvm_wmma_mma_row_row_m8n32k16_u8
— FunctionWMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)
For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}
For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}
Arguments
a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.
Placeholders
{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
CUDA.WMMA.llvm_wmma_store
— MethodWMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
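As an illustration of how the placeholders combine into a concrete wrapper name, consider the following sketch. The helper wmma_store_name is hypothetical and not part of CUDA.jl. It assumes that an empty {addr_space} (generic addressing) simply drops that component from the name, which matches names such as llvm_wmma_store_d_col_m16n16k16_stride_f16 in this reference.

```julia
# Hypothetical helper (not part of CUDA.jl): build the name of a store wrapper
# from the placeholder values described above.
function wmma_store_name(layout, shape, addr_space, elem_type)
    parts = String["llvm_wmma_store_d", layout, shape]
    # An empty address space means generic addressing and adds no name component.
    isempty(addr_space) || push!(parts, addr_space)
    push!(parts, "stride", elem_type)
    return join(parts, "_")
end

println(wmma_store_name("col", "m16n16k16", "global", "f16"))
println(wmma_store_name("col", "m16n16k16", "", "f32"))
```

The two calls produce llvm_wmma_store_d_col_m16n16k16_global_stride_f16 and llvm_wmma_store_d_col_m16n16k16_stride_f32.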
CUDA.WMMA.llvm_wmma_store_d_col_m16n16k16_global_stride_f16
— FunctionWMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
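The stride argument is easiest to understand as an address computation. The sketch below is CPU-only and uses a hypothetical helper, element_offset, that is not part of CUDA.jl. It computes the zero-based offset, in elements from dst_addr, at which entry (i, j) of the tile lands, assuming the usual leading-dimension convention for the two layouts.

```julia
# Hypothetical helper: zero-based element offset of tile entry (i, j)
# (one-based indices) relative to dst_addr, for a given leading dimension.
function element_offset(i::Integer, j::Integer, stride::Integer;
                        layout::AbstractString="col")
    if layout == "col"
        # Column major: consecutive columns are `stride` elements apart.
        return (i - 1) + (j - 1) * stride
    elseif layout == "row"
        # Row major: consecutive rows are `stride` elements apart.
        return (i - 1) * stride + (j - 1)
    else
        throw(ArgumentError("layout must be \"row\" or \"col\", got \"$layout\""))
    end
end

# Storing a tile into a larger matrix whose leading dimension is 64 elements.
println(element_offset(2, 3, 64))
println(element_offset(2, 3, 64; layout="row"))
```

For a tile stored into a standalone 16×16 matrix the stride is 16. A larger stride, such as the 64 used here, places the tile inside a bigger matrix without disturbing its neighbours.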
CUDA.WMMA.llvm_wmma_store_d_col_m16n16k16_global_stride_f32
— FunctionWMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
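A minimal sketch of how this store wrapper might be combined with the matching load and MMA wrappers in a kernel is shown below. It assumes an NVIDIA GPU with tensor cores (compute capability 7.0 or higher) and assumes that the load wrappers follow the same naming scheme as the store wrappers documented here. The kernel name and the 16×16 problem size are illustrative choices. Note that shape, layout and element types agree across all five calls, as required above.

```julia
using CUDA

# Input matrices on the host and the device (illustrative 16x16 problem).
a = rand(Float16, (16, 16)); a_dev = CuArray(a)
b = rand(Float16, (16, 16)); b_dev = CuArray(b)
c = rand(Float32, (16, 16)); c_dev = CuArray(c)
d_dev = similar(c_dev)

# Computes D = A * B + C for a single 16x16 tile, using one warp.
function mac_kernel(a_dev, b_dev, c_dev, d_dev)
    # Load the fragments; the stride is 16 because each matrix is 16x16.
    a_frag = CUDA.WMMA.llvm_wmma_load_a_col_m16n16k16_global_stride_f16(pointer(a_dev), 16)
    b_frag = CUDA.WMMA.llvm_wmma_load_b_col_m16n16k16_global_stride_f16(pointer(b_dev), 16)
    c_frag = CUDA.WMMA.llvm_wmma_load_c_col_m16n16k16_global_stride_f32(pointer(c_dev), 16)

    # Multiply-accumulate with f32 accumulator and f32 result.
    d_frag = CUDA.WMMA.llvm_wmma_mma_col_col_m16n16k16_f32_f32(a_frag, b_frag, c_frag)

    # Store the result with the same layout, shape and address space.
    CUDA.WMMA.llvm_wmma_store_d_col_m16n16k16_global_stride_f32(pointer(d_dev), d_frag, 16)
    return
end

# WMMA operations are warp-wide, so launch exactly one warp of 32 threads.
@cuda threads=32 mac_kernel(a_dev, b_dev, c_dev, d_dev)

# Compare against a CPU reference, with a tolerance for Float16 inputs.
println(all(isapprox.(a * b + c, Array(d_dev); rtol=0.01)))
```

The comparison uses a loose relative tolerance because the inputs are half precision and the CPU reference accumulates in a different order than the tensor cores.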
CUDA.WMMA.llvm_wmma_store_d_col_m16n16k16_global_stride_s32
— FunctionWMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_store_d_col_m16n16k16_shared_stride_f16
— FunctionWMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_store_d_col_m16n16k16_shared_stride_f32
— FunctionWMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_store_d_col_m16n16k16_shared_stride_s32
— FunctionWMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_store_d_col_m16n16k16_stride_f16
— FunctionWMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_store_d_col_m16n16k16_stride_f32
— FunctionWMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_store_d_col_m16n16k16_stride_s32
— FunctionWMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_store_d_col_m32n8k16_global_stride_f16
— FunctionWMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_store_d_col_m32n8k16_global_stride_f32
— FunctionWMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in numbers of elements.
Placeholders
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_store_d_col_m32n8k16_global_stride_s32
— Function
WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_store_d_col_m32n8k16_shared_stride_f16
— Function
WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_store_d_col_m32n8k16_shared_stride_f32
— Function
WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_store_d_col_m32n8k16_shared_stride_s32
— Function
WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_store_d_col_m32n8k16_stride_f16
— Function
WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_store_d_col_m32n8k16_stride_f32
— Function
WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_store_d_col_m32n8k16_stride_s32
— Function
WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_store_d_col_m8n32k16_global_stride_f16
— Function
WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
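To make the `{shape}` and `{elem_type}` placeholders more concrete, here is a small sketch that computes how many bytes the stored $D$ tile occupies. It assumes that the $D$ matrix of an `m{M}n{N}k{K}` operation is M×N, and that the element type names map to the Julia types shown in `ELEM_TYPES`; both the table and the `d_tile_bytes` helper are hypothetical and introduced only for illustration.

```julia
# Assumed mapping from {elem_type} names to Julia element types (illustrative).
const ELEM_TYPES = Dict(
    "u8"  => UInt8,
    "s8"  => Int8,
    "f16" => Float16,
    "f32" => Float32,
    "s32" => Int32,
)

# Hypothetical helper: size in bytes of the M×N D tile for a given shape string.
function d_tile_bytes(shape::AbstractString, elem_type::AbstractString)
    parsed = match(r"^m(\d+)n(\d+)k(\d+)$", shape)
    parsed === nothing && throw(ArgumentError("unrecognised shape: $shape"))
    rows = parse(Int, parsed[1])  # M
    cols = parse(Int, parsed[2])  # N
    return rows * cols * sizeof(ELEM_TYPES[elem_type])
end

println(d_tile_bytes("m8n32k16", "f16"))
```

All three listed shapes have M·N = 256, so under these assumptions the tile size depends only on the element type.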
CUDA.WMMA.llvm_wmma_store_d_col_m8n32k16_global_stride_f32
— Function
WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_store_d_col_m8n32k16_global_stride_s32
— Function
WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_store_d_col_m8n32k16_shared_stride_f16
— Function
WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_store_d_col_m8n32k16_shared_stride_f32
— Function
WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_store_d_col_m8n32k16_shared_stride_s32
— Function
WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_store_d_col_m8n32k16_stride_f16
— Function
WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_store_d_col_m8n32k16_stride_f32
— Function
WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_store_d_col_m8n32k16_stride_s32
— Function
WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_store_d_row_m16n16k16_global_stride_f16
— Function
WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
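The interaction between `{layout}` and the `stride` argument follows the usual leading-dimension convention, which the sketch below illustrates on the host. `element_offset` is a hypothetical helper, not a CUDA.jl function: it returns the zero-based offset, in elements from `dst_addr`, at which entry (i, j) of the stored matrix would land. How the fragment is distributed across the threads of a warp is hardware-defined and not modelled here.

```julia
# Hypothetical helper: zero-based element offset of entry (i, j), with
# one-based i and j, for a matrix whose leading dimension is `stride`.
function element_offset(layout::Symbol, i::Integer, j::Integer, stride::Integer)
    if layout === :col
        # Column major (Julia style): consecutive rows are adjacent in memory.
        return (i - 1) + (j - 1) * stride
    elseif layout === :row
        # Row major (C style): consecutive columns are adjacent in memory.
        return (i - 1) * stride + (j - 1)
    else
        throw(ArgumentError("layout must be :col or :row"))
    end
end

# Entry (2, 3) of a 16×16 tile stored with stride 16.
println(element_offset(:col, 2, 3, 16))
println(element_offset(:row, 2, 3, 16))
```

Under this convention, a tile embedded in a larger matrix uses the leading dimension of the enclosing matrix as `stride`, not the tile width.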
CUDA.WMMA.llvm_wmma_store_d_row_m16n16k16_global_stride_f32
— Function
WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_store_d_row_m16n16k16_global_stride_s32
— Function
WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_store_d_row_m16n16k16_shared_stride_f16
— Function
WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_store_d_row_m16n16k16_shared_stride_f32
— Function
WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_store_d_row_m16n16k16_shared_stride_s32
— Function
WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_store_d_row_m16n16k16_stride_f16
— Function
WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_store_d_row_m16n16k16_stride_f32
— Function
WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_store_d_row_m16n16k16_stride_s32
— Function
WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_store_d_row_m32n8k16_global_stride_f16
— Function
WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_store_d_row_m32n8k16_global_stride_f32
— Function
WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_store_d_row_m32n8k16_global_stride_s32
— Function
WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_store_d_row_m32n8k16_shared_stride_f16
— Function
WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_store_d_row_m32n8k16_shared_stride_f32
— Function
WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For the a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For the c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_store_d_row_m32n8k16_shared_stride_s32
— Function
WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For the a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For the c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_store_d_row_m32n8k16_stride_f16
— Function
WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For the a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For the c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_store_d_row_m32n8k16_stride_f32
— Function
WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For the a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For the c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_store_d_row_m32n8k16_stride_s32
— Function
WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For the a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For the c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_store_d_row_m8n32k16_global_stride_f16
— Function
WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For the a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For the c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_store_d_row_m8n32k16_global_stride_f32
— Function
WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For the a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For the c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_store_d_row_m8n32k16_global_stride_s32
— Function
WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For the a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For the c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_store_d_row_m8n32k16_shared_stride_f16
— Function
WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For the a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For the c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_store_d_row_m8n32k16_shared_stride_f32
— Function
WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For the a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For the c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_store_d_row_m8n32k16_shared_stride_s32
— Function
WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For the a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For the c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_store_d_row_m8n32k16_stride_f16
— Function
WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For the a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For the c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_store_d_row_m8n32k16_stride_f32
— Function
WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For the a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For the c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
CUDA.WMMA.llvm_wmma_store_d_row_m8n32k16_stride_s32
— Function
WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)
Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.
Arguments
dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in number of elements.
Placeholders
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For the a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For the c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
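The placeholder scheme used by the store wrappers above can be made concrete with a small sketch. The helper store_d_wrapper_name below is hypothetical and not part of CUDA.jl; it only assembles a name from the four placeholders, assuming that an empty {addr_space} simply drops that segment, as the entry names in this reference suggest. It does not check that a given combination is actually defined as a wrapper.

```julia
# Hypothetical helper (NOT part of CUDA.jl): expand the placeholders of
# llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}
# into a concrete wrapper name.
function store_d_wrapper_name(layout::String, shape::String,
                              addr_space::String, elem_type::String)
    layout in ("row", "col") ||
        throw(ArgumentError("layout must be \"row\" or \"col\""))
    shape in ("m16n16k16", "m32n8k16", "m8n32k16") ||
        throw(ArgumentError("unsupported shape: $shape"))
    addr_space in ("", "shared", "global") ||
        throw(ArgumentError("addr_space must be \"\", \"shared\" or \"global\""))
    # Only the c/d element types are meaningful for a store of D.
    elem_type in ("s32", "f16", "f32") ||
        throw(ArgumentError("unsupported element type: $elem_type"))

    parts = ["llvm_wmma_store_d", layout, shape]
    # Generic addressing (empty address space) omits the segment entirely.
    isempty(addr_space) || push!(parts, addr_space)
    push!(parts, "stride", elem_type)
    return join(parts, "_")
end

println(store_d_wrapper_name("row", "m32n8k16", "global", "s32"))
println(store_d_wrapper_name("row", "m8n32k16", "", "f16"))
```

The two calls reproduce the names of two entries listed in this reference, one with an explicit address space and one with generic addressing.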
CUDA.WMMA.load_a
— Function
WMMA.load_a(addr, stride, layout, config)
WMMA.load_b(addr, stride, layout, config)
WMMA.load_c(addr, stride, layout, config)
Load the matrix a, b or c from the memory location indicated by addr, and return the resulting WMMA.Fragment.
Arguments
addr: The address to load the matrix from.
stride: The leading dimension of the matrix pointed to by addr, specified in number of elements.
layout: The storage layout of the matrix. Possible values are WMMA.RowMajor and WMMA.ColMajor.
config: The WMMA configuration that should be used for loading this matrix. See WMMA.Config.
See also: WMMA.Fragment, WMMA.FragmentLayout, WMMA.Config
All threads in a warp MUST execute the load operation in lockstep, and have to use exactly the same arguments. Failure to do so will result in undefined behaviour.
CUDA.WMMA.load_b
— Function
WMMA.load_a(addr, stride, layout, config)
WMMA.load_b(addr, stride, layout, config)
WMMA.load_c(addr, stride, layout, config)
Load the matrix a, b or c from the memory location indicated by addr, and return the resulting WMMA.Fragment.
Arguments
addr: The address to load the matrix from.
stride: The leading dimension of the matrix pointed to by addr, specified in number of elements.
layout: The storage layout of the matrix. Possible values are WMMA.RowMajor and WMMA.ColMajor.
config: The WMMA configuration that should be used for loading this matrix. See WMMA.Config.
See also: WMMA.Fragment, WMMA.FragmentLayout, WMMA.Config
All threads in a warp MUST execute the load operation in lockstep, and have to use exactly the same arguments. Failure to do so will result in undefined behaviour.
CUDA.WMMA.load_c
— Function
WMMA.load_a(addr, stride, layout, config)
WMMA.load_b(addr, stride, layout, config)
WMMA.load_c(addr, stride, layout, config)
Load the matrix a, b or c from the memory location indicated by addr, and return the resulting WMMA.Fragment.
Arguments
addr: The address to load the matrix from.
stride: The leading dimension of the matrix pointed to by addr, specified in number of elements.
layout: The storage layout of the matrix. Possible values are WMMA.RowMajor and WMMA.ColMajor.
config: The WMMA configuration that should be used for loading this matrix. See WMMA.Config.
See also: WMMA.Fragment, WMMA.FragmentLayout, WMMA.Config
All threads in a warp MUST execute the load operation in lockstep, and have to use exactly the same arguments. Failure to do so will result in undefined behaviour.
CUDA.WMMA.mma
— Function
WMMA.mma(a, b, c, conf)
Perform the matrix multiply-accumulate operation $D = A \cdot B + C$.
Arguments
a: The WMMA.Fragment corresponding to the matrix $A$.
b: The WMMA.Fragment corresponding to the matrix $B$.
c: The WMMA.Fragment corresponding to the matrix $C$.
conf: The WMMA.Config that should be used in this WMMA operation.
All threads in a warp MUST execute the mma operation in lockstep, and have to use exactly the same arguments. Failure to do so will result in undefined behaviour.
CUDA.WMMA.store_d
— Function
WMMA.store_d(addr, d, stride, layout, config)
Store the result matrix d to the memory location indicated by addr.
Arguments
addr: The address to store the matrix to.
d: The WMMA.Fragment corresponding to the d matrix.
stride: The leading dimension of the matrix pointed to by addr, specified in number of elements.
layout: The storage layout of the matrix. Possible values are WMMA.RowMajor and WMMA.ColMajor.
config: The WMMA configuration that should be used for storing this matrix. See WMMA.Config.
See also: WMMA.Fragment, WMMA.FragmentLayout, WMMA.Config
All threads in a warp MUST execute the store operation in lockstep, and have to use exactly the same arguments. Failure to do so will result in undefined behaviour.
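As a rough illustration of how load_a, load_b, load_c, mma and store_d fit together, the sketch below computes $D = A \cdot B + C$ for a single 16×16 tile with one warp. It is a minimal sketch under stated assumptions, not a definitive implementation: it assumes CUDA.jl is installed, that the GPU has tensor cores (compute capability 7.0 or higher), and that WMMA.Config{16, 16, 16, Float32} selects the m16n16k16 shape with a Float32 accumulator. The names wmma_kernel and run_wmma_example are made up for this example.

```julia
using CUDA
using CUDA: WMMA  # explicit import, in case WMMA is not exported in your version

# One warp multiplies a single 16x16 tile: D = A * B + C.
function wmma_kernel(a_dev, b_dev, c_dev, d_dev)
    # Assumed parameters: M = N = K = 16, Float32 accumulator.
    conf = WMMA.Config{16, 16, 16, Float32}

    # Julia arrays are column major; the stride (leading dimension) is 16 elements.
    a_frag = WMMA.load_a(pointer(a_dev), 16, WMMA.ColMajor, conf)
    b_frag = WMMA.load_b(pointer(b_dev), 16, WMMA.ColMajor, conf)
    c_frag = WMMA.load_c(pointer(c_dev), 16, WMMA.ColMajor, conf)

    d_frag = WMMA.mma(a_frag, b_frag, c_frag, conf)

    WMMA.store_d(pointer(d_dev), d_frag, 16, WMMA.ColMajor, conf)
    return
end

function run_wmma_example()
    a = rand(Float16, 16, 16)
    b = rand(Float16, 16, 16)
    c = rand(Float32, 16, 16)

    a_dev, b_dev, c_dev = CuArray(a), CuArray(b), CuArray(c)
    d_dev = similar(c_dev)

    # Exactly one full warp (32 threads), so every thread runs each WMMA call
    # in lockstep with identical arguments.
    @cuda threads=32 wmma_kernel(a_dev, b_dev, c_dev, d_dev)

    return a, b, c, Array(d_dev)
end

# Only run on a functional GPU that has tensor cores.
if CUDA.functional() && CUDA.capability(CUDA.device()) >= v"7.0"
    a, b, c, d = run_wmma_example()
    println(isapprox(Float32.(a) * Float32.(b) + c, d; rtol=0.01))
end
```

Launching with threads=32 is deliberate: the operations are warp-wide, so the kernel avoids any thread-dependent branching around them.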