Docstrings · CUDA.jl

CUDA.AbstractKernel — Type

(::HostKernel)(args...; kwargs...)
(::DeviceKernel)(args...; kwargs...)

Low-level interface to call a compiled kernel, passing GPU-compatible arguments in args. For a higher-level interface, use @cuda.

A HostKernel is callable on the host, and a DeviceKernel is callable on the device (created by @cuda with dynamic=true).

The following keyword arguments are supported:

threads (default: 1): Number of threads per block, or a 1-, 2- or 3-tuple of dimensions (e.g. threads=(32, 32) for a 2D block of 32×32 threads). Use threadIdx() and blockDim() to query from within the kernel.
blocks (default: 1): Number of thread blocks to launch, or a 1-, 2- or 3-tuple of dimensions (e.g. blocks=(2, 4, 2) for a 3D grid of blocks). Use blockIdx() and gridDim() to query from within the kernel.
shmem (default: 0): Amount of dynamic shared memory in bytes to allocate per thread block; used by CuDynamicSharedArray.
stream (default: stream()): CuStream to launch the kernel on.
cooperative (default: false): Whether to launch a cooperative kernel that supports grid synchronization (see CG.this_grid and CG.sync). Note that this requires care with respect to the number of blocks launched.
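
As a rough illustration of how threads and blocks combine, the following CPU-only sketch emulates the 1-based global index that a 1D kernel would typically compute from threadIdx(), blockIdx() and blockDim(). The helpers launch_config and global_index are hypothetical names introduced here, not CUDA.jl functions, and no GPU is involved.

```julia
# CPU-only model of a 1D launch; `launch_config` and `global_index` are
# hypothetical helpers, not part of CUDA.jl.

# Pick enough blocks of `threads` threads to cover `n` elements.
launch_config(n; threads=256) = (threads=threads, blocks=cld(n, threads))

# 1-based global index, mirroring the usual in-kernel expression
# (blockIdx().x - 1) * blockDim().x + threadIdx().x
global_index(block_idx, block_dim, thread_idx) =
    (block_idx - 1) * block_dim + thread_idx

n = 1000
cfg = launch_config(n)

# Emulate every (block, thread) pair and record which elements get visited.
visited = falses(n)
for b in 1:cfg.blocks, t in 1:cfg.threads
    i = global_index(b, cfg.threads, t)
    if i <= n              # bounds check: the last block is only partly used
        visited[i] = true
    end
end

println(cfg)               # (threads = 256, blocks = 4)
println(all(visited))      # true
```

The bounds check matters because the number of launched threads (here 4 × 256 = 1024) is usually rounded up past the number of elements.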

CUDA.ArrayMemory — Type

ArrayMemory

Array memory residing on the GPU, possibly in a specially-formatted way.

CUDA.Const — Type

Const(A::CuDeviceArray)

Mark a CuDeviceArray as constant/read-only. The invariant guaranteed is that you will not modify a CuDeviceArray for the duration of the current kernel.

This API can only be used on devices with compute capability 3.5 or higher.

Warning

Experimental API. Subject to change without deprecation.

CUDA.CuContext — Type

CuContext(dev::CuDevice, flags=CTX_SCHED_AUTO)
CuContext(f::Function, ...)

Create a CUDA context for device. A context on the GPU is analogous to a process on the CPU, with its own distinct address space and allocated resources. When a context is destroyed, the system cleans up the resources allocated to it.

When you are done using the context, call CUDA.unsafe_destroy! to mark it for deletion, or use do-block syntax with this constructor.

CUDA.CuContext — Method

CuContext(pctx::CuPrimaryContext)

Derive a context from a primary context.

Calling this function increases the reference count of the primary context. The returned context should not be freed with the unsafe_destroy! function that's used with ordinary contexts. Instead, the refcount of the primary context should be decreased by calling unsafe_release!, or set to zero by calling unsafe_reset!. The easiest way to do this is by using the do-block syntax.

CUDA.CuDevice — Type

CuDevice(ordinal::Integer)

Get a handle to a compute device.

CUDA.CuDeviceArray — Type

CuDeviceArray{T,N,A}(ptr, dims, [maxsize])

Construct an N-dimensional dense CUDA device array with element type T wrapping a pointer, where N is determined from the length of dims and T is determined from the type of ptr. dims may be a single scalar, or a tuple of integers corresponding to the lengths in each dimension. If the rank N is supplied explicitly as in Array{T,N}(dims), then it must match the length of dims. The same applies to the element type T, which should match the type of the pointer ptr.

CUDA.CuDeviceTexture — Type

CuDeviceTexture{T,N,M,NC,I}

N-dimensional device texture with elements of type T. This type is the device-side counterpart of CuTexture{T,N,P}, and can be used to access textures using regular indexing notation. If NC is true, indices used by these accesses should be normalized, i.e., fall into the [0,1) domain. The I type parameter indicates the kind of interpolation that happens when indexing into this texture. The source memory of the texture is specified by the M parameter, either linear memory or a texture array.

Device-side texture objects cannot be created directly, but should be created host-side using CuTexture{T,N,P} and passed to the kernel as an argument.

Warning

Experimental API. Subject to change without deprecation.

CUDA.CuDim3 — Type

CuDim3(x)

CuDim3((x,))
CuDim3((x, y))
CuDim3((x, y, z))

A type used to specify dimensions, consisting of 3 integers for respectively the x, y and z dimension. Unspecified dimensions default to 1.

Often accepted as an argument through the CuDim type alias, e.g. in the case of cudacall or CUDA.launch, which makes it possible to pass dimensions as a plain integer or a tuple without having to construct an explicit CuDim3 object.
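
The defaulting rule can be modelled on the CPU with a small stand-in type. Dim3 below is a hypothetical illustration written for this sketch; it is not the CUDA.jl type and only mirrors the behaviour described above (unspecified dimensions default to 1).

```julia
# Hypothetical CPU-side model of the defaulting behaviour; `Dim3` is NOT
# CUDA.jl's CuDim3.
struct Dim3
    x::Int
    y::Int
    z::Int
end

# A plain integer sets x; y and z default to 1.
Dim3(x::Integer) = Dim3(x, 1, 1)

# Tuples of length 1, 2 or 3 fill x, y, z in order; the rest default to 1.
Dim3(dims::Tuple{Integer}) = Dim3(dims[1], 1, 1)
Dim3(dims::Tuple{Integer,Integer}) = Dim3(dims[1], dims[2], 1)
Dim3(dims::Tuple{Integer,Integer,Integer}) = Dim3(dims[1], dims[2], dims[3])

block = Dim3((32, 32))     # 2D block: z defaults to 1
grid  = Dim3(4)            # 1D grid: y and z default to 1
println((block.x, block.y, block.z))   # (32, 32, 1)
println((grid.x, grid.y, grid.z))      # (4, 1, 1)
```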

CUDA.CuError — Type

CuError(code)

Create a CUDA error object with error code code.

CUDA.CuEvent — Type

CuEvent()

Create a new CUDA event.

CUDA.CuFunction — Type

CuFunction(mod::CuModule, name::String)

Acquires a function handle from a named function in a module.

CUDA.CuGlobal — Type

CuGlobal{T}(mod::CuModule, name::String)

Acquires a typed global variable handle from a named global in a module.

CUDA.CuGraph — Type

CuGraph([flags])

Create an empty graph for use with low-level graph operations. If you want to create a graph while directly recording operations, use capture. For a high-level interface that also automatically executes the graph, use the @captured macro.

CUDA.CuIterator — Type

CuIterator([to], batches)

Create a CuIterator that iterates through the provided batches via iterate. Upon each iteration, the current batch is copied to the GPU, and the previous iteration is marked as freeable from GPU memory (via unsafe_free!).

The conversion to GPU memory is done recursively, using Adapt.jl, so that each batch can be an array, an array of arrays, or more complex iterable objects. To customize the conversion, an adaptor can be specified as the first argument, e.g., to change the element type:

julia> first(CuIterator([[1.]]))
1-element CuArray{Float64, 1, CUDA.DeviceMemory}:
 1.0

julia> first(CuIterator(CuArray{Float32}, [[1.]]))
1-element CuArray{Float32, 1, CUDA.DeviceMemory}:
 1.0

This abstraction is useful for batching data into GPU memory in a manner that allows old iterations to potentially be freed (or marked as reusable) earlier than they otherwise would via CuArray's internal polling mechanism.

CUDA.CuLink — Type

CuLink()

Creates a pending JIT linker invocation.

CUDA.CuLinkImage — Type

The result of a linking operation.

This object keeps its parent linker object alive, as destroying a linker destroys linked images too.

CUDA.CuModule — Type

CuModule(data, options::Dict{CUjit_option,Any})
CuModuleFile(path, options::Dict{CUjit_option,Any})

Create a CUDA module from data, or from a file containing data. The data may be PTX code, a CUBIN, or a FATBIN.

options is an optional dictionary of JIT options and their respective values.

CUDA.CuModule — Method

CuModule(img::CuLinkImage, ...)

Create a CUDA module from a completed linking operation. Options from CuModule apply.

CUDA.CuPrimaryContext — Type

CuPrimaryContext(dev::CuDevice)

Create a primary CUDA context for a given device.

Each primary context is unique per device and is shared with the CUDA runtime API. It is meant for interoperability with (applications using) the runtime API.

CUDA.CuPtr — Type

CuPtr{T}

A memory address that refers to data of type T that is accessible from the GPU. A CuPtr is ABI compatible with regular Ptr objects, e.g. it can be used to ccall a function that expects a Ptr to GPU memory, but it prevents erroneous conversions between the two.

CUDA.CuStream — Type

CuStream(; flags=STREAM_DEFAULT, priority=nothing)

Create a CUDA stream.

CUDA.CuTexture — Type

CuTexture{T,N,P}

N-dimensional texture object with elements of type T. These objects do not store data themselves, but are bound to another source of device memory. Texture objects can be passed to CUDA kernels, where they will be accessible through the CuDeviceTexture type.

Warning

Experimental API. Subject to change without deprecation.

CUDA.CuTexture — Method

CuTexture(x::CuArray{T,N})

Create an N-dimensional texture object that reads from a CuArray.

Note that the array's memory must be well aligned and strided (good pitch). Currently, this is not enforced.

Warning

Experimental API. Subject to change without deprecation.

CUDA.CuTexture — Method

CuTexture(x::CuTextureArray{T,N})

Create an N-dimensional texture object with elements of type T that will be read from x.

Warning

Experimental API. Subject to change without deprecation.

CUDA.CuTexture — Method

CuTexture{T,N,P}(parent::P; address_mode, filter_mode, normalized_coordinates)

Construct an N-dimensional texture object with elements of type T as stored in parent.

Several keyword arguments alter the behavior of texture objects:

address_mode (wrap, clamp, mirror): how out-of-bounds values are accessed. Can be specified as a value for all dimensions, or as a tuple of N entries.
interpolation (nearest neighbour, linear, bilinear): how non-integral indices are fetched. Nearest-neighbour fetches a single value, others interpolate between multiple.
normalized_coordinates (true, false): whether indices are expected to fall in the normalized [0,1) range.

Warning

Experimental API. Subject to change without deprecation.

CUDA.CuTextureArray — Type

CuTextureArray{T,N}(undef, dims)

N-dimensional dense texture array with elements of type T. These arrays are optimized for texture fetching, and are only meant to be used as a source for CuTexture{T,N,P} objects.

Warning

Experimental API. Subject to change without deprecation.

CUDA.CuTextureArray — Method

CuTextureArray(A::AbstractArray)

Allocate and initialize a texture array from host memory in A.

Warning

Experimental API. Subject to change without deprecation.

CUDA.CuTextureArray — Method

CuTextureArray(A::CuArray)

Allocate and initialize a texture array from device memory in A.

Warning

Experimental API. Subject to change without deprecation.

CUDA.CuTextureArray — Method

CuTextureArray{T,N}(undef, dims)

Construct an uninitialized texture array of N dimensions specified in the dims tuple, with elements of type T. Use Base.copyto! to initialize this texture array, or use constructors that take a non-texture array to do so automatically.

Warning

Experimental API. Subject to change without deprecation.

CUDA.DeviceKernel — Type

(::HostKernel)(args...; kwargs...)
(::DeviceKernel)(args...; kwargs...)

Low-level interface to call a compiled kernel, passing GPU-compatible arguments in args. For a higher-level interface, use @cuda.

A HostKernel is callable on the host, and a DeviceKernel is callable on the device (created by @cuda with dynamic=true).

The following keyword arguments are supported:

threads (default: 1): Number of threads per block, or a 1-, 2- or 3-tuple of dimensions (e.g. threads=(32, 32) for a 2D block of 32×32 threads). Use threadIdx() and blockDim() to query from within the kernel.
blocks (default: 1): Number of thread blocks to launch, or a 1-, 2- or 3-tuple of dimensions (e.g. blocks=(2, 4, 2) for a 3D grid of blocks). Use blockIdx() and gridDim() to query from within the kernel.
shmem (default: 0): Amount of dynamic shared memory in bytes to allocate per thread block; used by CuDynamicSharedArray.
stream (default: stream()): CuStream to launch the kernel on.
cooperative (default: false): Whether to launch a cooperative kernel that supports grid synchronization (see CG.this_grid and CG.sync). Note that this requires care with respect to the number of blocks launched.

CUDA.DeviceMemory — Type

DeviceMemory

Device memory residing on the GPU.

CUDA.HostKernel — Type

(::HostKernel)(args...; kwargs...)
(::DeviceKernel)(args...; kwargs...)

Low-level interface to call a compiled kernel, passing GPU-compatible arguments in args. For a higher-level interface, use @cuda.

A HostKernel is callable on the host, and a DeviceKernel is callable on the device (created by @cuda with dynamic=true).

The following keyword arguments are supported:

threads (default: 1): Number of threads per block, or a 1-, 2- or 3-tuple of dimensions (e.g. threads=(32, 32) for a 2D block of 32×32 threads). Use threadIdx() and blockDim() to query from within the kernel.
blocks (default: 1): Number of thread blocks to launch, or a 1-, 2- or 3-tuple of dimensions (e.g. blocks=(2, 4, 2) for a 3D grid of blocks). Use blockIdx() and gridDim() to query from within the kernel.
shmem (default: 0): Amount of dynamic shared memory in bytes to allocate per thread block; used by CuDynamicSharedArray.
stream (default: stream()): CuStream to launch the kernel on.
cooperative (default: false): Whether to launch a cooperative kernel that supports grid synchronization (see CG.this_grid and CG.sync). Note that this requires care with respect to the number of blocks launched.

CUDA.HostMemory — Type

HostMemory

Pinned memory residing on the CPU, possibly accessible on the GPU.

CUDA.OutOfGPUMemoryError — Type

OutOfGPUMemoryError()

An operation allocated too much GPU memory for either the system or the memory pool to handle properly.

CUDA.PerDevice — Type

PerDevice{T}()

A helper struct for maintaining per-device state that's lazily initialized and automatically invalidated when the device is reset. Use get!(per_device, dev) do ... end to initialize and fetch a value.

Mutating or deleting state is not supported. If this is required, use a boxed value, like a Ref or a Threads.Atomic.

Furthermore, even though the initialization of this helper, fetching its value for a given device, and clearing it when the device is reset are all performed in a thread-safe manner, you should still take care about thread-safety when using the contained value. For example, if you need to update the value, use atomics; if it's a complex structure like an array or a dictionary, use additional locks.

CUDA.PtrOrCuPtr — Type

PtrOrCuPtr{T}

A special pointer type, ABI-compatible with both Ptr and CuPtr, for use in ccall expressions to convert values to either a GPU or a CPU type (in that order). This is required for CUDA APIs which accept pointers that either point to host or device memory.

CUDA.RNG — Type

CUDA.RNG()

A random number generator using rand() in a device kernel.

See also: @profile

CUDA.Profile.start — Method

start()

Enables profile collection by the active profiling tool for the current context. If profiling is already enabled, then this call has no effect.

CUDA.Profile.stop — Method

stop()

Disables profile collection by the active profiling tool for the current context. If profiling is already disabled, then this call has no effect.

CUDA.CUSPARSE.CuSparseMatrix — Type

Utility union type of CuSparseMatrixCSC, CuSparseMatrixCSR, CuSparseMatrixBSR, CuSparseMatrixCOO.

CUDA.CUSPARSE.CSRIterator — Type

CSRIterator{Ti}(row, args...)

A GPU-compatible iterator for accessing the elements of a single row row of several CSR matrices args in one go. The row should be in-bounds for every sparse argument. Each iteration returns a 2-element tuple: the current column, and each argument's pointer index (or 0 if that input didn't have an element at that column). The pointers can then be used to access the elements themselves.

For convenience, this iterator can be passed non-sparse arguments as well, which will be ignored (with the returned col/ptr values set to 0).
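
To make the iteration order concrete, here is a CPU-only sketch of walking one row of several CSR matrices at once. The CSR struct and the row_entries function are hypothetical stand-ins written for this example; they model the behaviour described above (merged columns, pointer index or 0) and are not how CUDA.jl implements the iterator. Column indices are assumed to be sorted within each row.

```julia
# Minimal CPU-side CSR container; hypothetical, not CuSparseMatrixCSR.
struct CSR
    rowptr::Vector{Int}     # rowptr[r]:rowptr[r+1]-1 indexes row r's entries
    colval::Vector{Int}     # column of each stored entry (sorted within a row)
    nzval::Vector{Float64}  # stored values
end

# Walk one row of several CSR matrices together. For each column present in
# at least one matrix, collect (col, ptrs), where ptrs[k] indexes into the
# k-th matrix's colval/nzval, or is 0 if that matrix has no entry there.
function row_entries(row::Int, mats::CSR...)
    cursors = [m.rowptr[row] for m in mats]           # next unread entry
    stops   = [m.rowptr[row + 1] - 1 for m in mats]   # last entry of the row
    out = Tuple{Int,Vector{Int}}[]
    while any(cursors .<= stops)
        # The smallest pending column across all inputs comes next.
        col = minimum(mats[k].colval[cursors[k]]
                      for k in eachindex(cursors) if cursors[k] <= stops[k])
        ptrs = zeros(Int, length(mats))
        for k in eachindex(cursors)
            if cursors[k] <= stops[k] && mats[k].colval[cursors[k]] == col
                ptrs[k] = cursors[k]
                cursors[k] += 1
            end
        end
        push!(out, (col, ptrs))
    end
    return out
end

# A = [1 0 2; 0 0 3] and B = [0 4 5; 6 0 0] in CSR form.
A = CSR([1, 3, 4], [1, 3, 3], [1.0, 2.0, 3.0])
B = CSR([1, 3, 4], [2, 3, 1], [4.0, 5.0, 6.0])

for (col, ptrs) in row_entries(1, A, B)
    println(col, " => ", ptrs)
end
```

For row 1 this visits columns 1, 2 and 3; a pointer of 0 marks a matrix with no stored entry in that column, and a nonzero pointer can be used to read nzval directly.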

CUDA.CUSPARSE.CuSparseMatrixBSR — Type

Container to hold sparse matrices in block compressed sparse row (BSR) format on the GPU. BSR format is also used in Intel MKL, and is suited to matrices that are "block" sparse - rare blocks of non-sparse regions.

CUDA.CUSPARSE.CuSparseMatrixCOO — Type

Container to hold sparse matrices in coordinate (COO) format on the GPU. COO format is mainly useful for initially constructing sparse matrices; afterwards, switch to CuSparseMatrixCSR for more functionality.

CUDA.CUSPARSE.CuSparseMatrixCSR — Type

CuSparseMatrixCSR{Tv, Ti} <: AbstractCuSparseMatrix{Tv, Ti}

Container to hold sparse matrices in compressed sparse row (CSR) format on the GPU.

Note

Most CUSPARSE operations work with CSR formatted matrices, rather than CSC.

CUDA 11

Support for index types other than Cint (Int32) requires at least CUDA 11.

CUDA.CUSPARSE.axpby! — Method

axpby!(alpha::Number, X::CuSparseVector, beta::Number, Y::CuVector, index::SparseChar)

Computes alpha * X + beta * Y for sparse X and dense Y.

CUDA.CUSPARSE.axpby — Method

axpby(alpha::Number, x::CuSparseVector, beta::Number, y::CuSparseVector, index::SparseChar)

Performs z = alpha * x + beta * y. x and y are sparse vectors.

CUDA.CUSPARSE.chkbmmdims — Method

Check that the dimensions of arrays B and C make sense for a batched matrix-matrix multiplication.

CUDA.CUSPARSE.chkmmdims — Method

Check that the dimensions of matrices B and C make sense for a multiplication.

CUDA.CUSPARSE.chkmvdims — Method

Check that the dimensions of matrix X and vector Y make sense for a multiplication.

CUDA.CUSPARSE.color — Function

color(A::CuSparseMatrixCSC, index::SparseChar; percentage::Number=1.0)
color(A::CuSparseMatrixCSR, index::SparseChar; percentage::Number=1.0)

This function performs the coloring of the adjacency graph associated with the matrix A. The coloring is an assignment of colors (integer numbers) to nodes, such that neighboring nodes have distinct colors. An approximate coloring algorithm is used in this routine, and is stopped when a certain percentage of nodes has been colored. The rest of the nodes are assigned distinct colors (an increasing sequence of integer numbers, starting from the last integer used previously). The reordering is such that nodes that have been assigned the same color are reordered to be next to each other.

The matrix A passed to this routine must be stored as a general matrix and have a symmetric sparsity pattern. If the matrix is non-symmetric, the user should pass A + Aᵀ as a parameter to this routine.
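
The idea of a valid coloring, and the A + Aᵀ symmetrization, can be tried on the CPU with the SparseArrays standard library. The sketch below uses a plain sequential greedy coloring, which is a swapped-in technique for illustration and not the approximate algorithm used by the GPU routine; greedy_coloring is a hypothetical helper.

```julia
using SparseArrays

# Directed 4-cycle 1→2→3→4→1: the sparsity pattern is not symmetric,
# so symmetrize it first, as suggested above.
A = sparse([1, 2, 3, 4], [2, 3, 4, 1], ones(4), 4, 4)
S = A + copy(transpose(A))

# Plain greedy coloring on the CPU (NOT the algorithm the GPU routine uses):
# each node gets the smallest color not taken by an already-colored neighbor.
function greedy_coloring(S::SparseMatrixCSC)
    n = size(S, 1)
    colors = zeros(Int, n)
    rows = rowvals(S)
    for v in 1:n
        used = Set{Int}()
        for p in nzrange(S, v)          # stored entries in column v
            u = rows[p]
            u != v && colors[u] != 0 && push!(used, colors[u])
        end
        c = 1
        while c in used
            c += 1
        end
        colors[v] = c
    end
    return colors
end

colors = greedy_coloring(S)
println(colors)   # [1, 2, 1, 2]
```

Nodes sharing a color have no edge between them, which is what allows them to be grouped together in a reordering.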

CUDA.CUSPARSE.gather! — Method

gather!(X::CuSparseVector, Y::CuVector, index::SparseChar)

Sets the nonzero elements of X equal to the nonzero elements of Y at the same indices.

CUDA.CUSPARSE.geam — Method

geam(alpha::Number, A::CuSparseMatrix, beta::Number, B::CuSparseMatrix, index::SparseChar)

Performs C = alpha * A + beta * B. A and B are sparse matrices defined in CSR or CSC storage formats.

CUDA.CUSPARSE.gtsv2! — Function

gtsv2!(dl::CuVector, d::CuVector, du::CuVector, B::CuVecOrMat, index::SparseChar='O'; pivoting::Bool=true)

Solve the linear system A * X = B where A is a tridiagonal matrix defined by three vectors corresponding to its lower (dl), main (d), and upper (du) diagonals. With pivoting, the solution is more accurate but also more expensive. Note that the solution X overwrites the right-hand side B.
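
For reference, the same kind of system can be set up and solved on the CPU with Julia's LinearAlgebra standard library. This sketch only shows what A * X = B means for a tridiagonal A; it does not call the GPU routine. Julia's Tridiagonal type takes off-diagonals of length n-1, and the storage convention expected by the GPU routine for dl and du is not modelled here.

```julia
using LinearAlgebra

# CPU reference using the stdlib Tridiagonal type (off-diagonals have
# length n-1 here; the GPU routine's own convention is not modelled).
dl = [1.0, 1.0, 1.0]         # lower diagonal
d  = [4.0, 4.0, 4.0, 4.0]    # main diagonal
du = [1.0, 1.0, 1.0]         # upper diagonal
A  = Tridiagonal(dl, d, du)

B = [5.0, 6.0, 6.0, 5.0]     # right-hand side
X = A \ B                    # solve A * X = B

println(X)
println(A * X ≈ B)           # true
```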

CUDA.CUSPARSE.ic02! — Function

ic02!(A::CuSparseMatrix, index::SparseChar='O')

Incomplete Cholesky factorization with no pivoting. Preserves the sparse layout of matrix A.

CUDA.CUSPARSE.ilu02! — Function

ilu02!(A::CuSparseMatrix, index::SparseChar='O')

Incomplete LU factorization with no pivoting. Preserves the sparse layout of matrix A.

CUDA.CUSPARSE.mm! — Method

mm!(transa::SparseChar, transb::SparseChar, alpha::Number, A::CuSparseMatrix, B::CuMatrix, beta::Number, C::CuMatrix, index::SparseChar)

Performs C = alpha * op(A) * op(B) + beta * C, where op can be nothing (transa = N), transpose (transa = T) or conjugate transpose (transa = C). B and C are dense matrices.

CUDA.CUSPARSE.mv! — Method

mv!(transa::SparseChar, alpha::Number, A::CuSparseMatrix, X::CuVector, beta::Number, Y::CuVector, index::SparseChar)

Performs Y = alpha * op(A) * X + beta * Y, where op can be nothing (transa = N), transpose (transa = T) or conjugate transpose (transa = C). X and Y are dense vectors.
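
The update formula can be checked on the CPU with the SparseArrays and LinearAlgebra standard libraries, whose five-argument mul! computes the same expression. This is a host-side reference for the arithmetic only; it does not use CUSPARSE, and the variable names are chosen for this example.

```julia
using SparseArrays, LinearAlgebra

# A = [1 0 2; 0 3 0; 0 0 4] stored as a CPU sparse matrix.
A = sparse([1, 1, 2, 3], [1, 3, 2, 3], [1.0, 2.0, 3.0, 4.0], 3, 3)
X = [1.0, 2.0, 3.0]
Y = [10.0, 20.0, 30.0]
alpha, beta = 2.0, 0.5

# op(A) = A (the transa = N case): Yn = alpha * A * X + beta * Y
Yn = copy(Y)
mul!(Yn, A, X, alpha, beta)

# op(A) = transpose(A) (the transa = T case)
Yt = copy(Y)
mul!(Yt, transpose(A), X, alpha, beta)

println(Yn)
println(Yt)
```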

CUDA.CUSPARSE.rot! — Method

rot!(X::CuSparseVector, Y::CuVector, c::Number, s::Number, index::SparseChar)

Performs the Givens rotation specified by c and s to sparse X and dense Y.

CUDA.CUSPARSE.scatter! — Method

scatter!(Y::CuVector, X::CuSparseVector, index::SparseChar)

Set Y[:] = X[:] for dense Y and sparse X.

CUDA.CUSPARSE.sm2! — Method

sm2!(transa::SparseChar, transxy::SparseChar, uplo::SparseChar, diag::SparseChar, alpha::BlasFloat, A::CuSparseMatrix, X::CuMatrix, index::SparseChar)

Performs X = alpha * op(A) \ op(X), where op can be nothing (transa = N), transpose (transa = T) or conjugate transpose (transa = C). X is a dense matrix, and uplo tells sm2! which triangle of the block sparse matrix A to reference. If the triangle has unit diagonal, set diag to 'U'.

CUDA.CUSPARSE.sv2! — Method

sv2!(transa::SparseChar, uplo::SparseChar, diag::SparseChar, alpha::BlasFloat, A::CuSparseMatrix, X::CuVector, index::SparseChar)

Performs X = alpha * op(A) \ X, where op can be nothing (transa = N), transpose (transa = T) or conjugate transpose (transa = C). X is a dense vector, and uplo tells sv2! which triangle of the block sparse matrix A to reference. If the triangle has unit diagonal, set diag to 'U'.

SparseArrays.sparse — Method

sparse(x::DenseCuMatrix; fmt=:csc)
sparse(I::CuVector, J::CuVector, V::CuVector, [m, n]; fmt=:csc)

Return a sparse CUDA matrix, with type determined by fmt. Possible formats are :csc, :csr, :bsr, and :coo.

CUDA.QuickSortImpl — Module

The main quicksort kernel uses dynamic parallelism. Let's call blocksize M. The first part of the kernel bubble sorts M elements with maximal stride between lo and hi. If the sublist is <= M elements, stride = 1 and no recursion happens. Otherwise, we pick element lo + M ÷ 2 * stride as a pivot. This is an efficient choice for random lists and pre-sorted lists.

Partition is done in stages:

For batches of M values, cumsum how many > pivot are left of each index. The comparison alternates between < and <= with recursion depth. This makes no difference when there are many unique values, but when there are many duplicates, this effectively partitions into <, =, and >.
Consolidate batches. This runs inside the quicksort kernel.

Sublists (ranges of the list being sorted) are denoted by lo and one of L and hi. lo is an exclusive lower bound, hi is an inclusive upper bound, and L is their difference. b_sums is "batch sums", the number of values in a batch which are >= pivot or > pivot depending on the relevant parity.
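
The bound convention and the alternating comparison can be seen in a small sequential sketch. This is a CPU model only: it has no batches, shared memory or dynamic parallelism, it simply takes the middle element as the pivot, and qsort_sketch! is a hypothetical name. The mapping of parity to < and <= follows the find_partition description (true means <=), and the stuck counter mirrors the all-equal detection described for qsort_kernel.

```julia
# Sequential CPU sketch of the conventions described above; not the GPU kernel.

# parity = false compares with <, parity = true with <=.
flex_lt(a, b, parity::Bool) = parity ? a <= b : a < b

# Sort vals[lo+1:hi] in place; lo is exclusive, hi is inclusive.
function qsort_sketch!(vals, lo=0, hi=length(vals), parity=false, stuck=0)
    L = hi - lo
    (L <= 1 || stuck >= 2) && return vals
    pivot = vals[lo + (L + 1) ÷ 2]          # middle element (sketch choice)
    region = vals[lo+1:hi]
    # Stable partition: values that compare "less" go left, the rest right.
    left  = filter(v -> flex_lt(v, pivot, parity), region)
    right = filter(v -> !flex_lt(v, pivot, parity), region)
    vals[lo+1:hi] = vcat(left, right)
    mid = lo + length(left)
    # A failed split leaves the region unchanged. Two failures in a row with
    # alternating comparisons mean every value equals the pivot.
    next_stuck = (mid == lo || mid == hi) ? stuck + 1 : 0
    qsort_sketch!(vals, lo, mid, !parity, next_stuck)
    qsort_sketch!(vals, mid, hi, !parity, next_stuck)
    return vals
end

data = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]
qsort_sketch!(data)
println(data)   # [1, 1, 2, 3, 3, 4, 5, 5, 5, 6, 9]
```

Without the stuck counter, an input of identical values would keep re-partitioning the same region forever, since neither comparison can split it.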

Originally developed by @xaellison (Alex Ellison).

CUDA.QuickSortImpl.batch_partition — Method

Partition the region of values after index lo up to (inclusive) hi with respect to pivot. Computes each value's comparison to pivot, performs a cumsum of those comparisons, and performs one movement using shmem. Comparison is affected by parity. See flex_lt. swap is an array for exchanging values and sums is an array of Ints used during the merge sort. Uses block y index to decide which values to operate on.

CUDA.QuickSortImpl.bitonic_median — Method

Finds the median of vals starting after lo and going for blockDim().x elements spaced by stride. Performs bitonic sort in shmem and returns the middle value. Faster than bubble sort, but not as flexible. Does not modify vals.

CUDA.QuickSortImpl.bubble_sort — Method

Performs bubble sort on vals starting after lo and going for min(L, blockDim().x) elements spaced by stride. Good for sampling pivot values as well as short sorts.

CUDA.QuickSortImpl.call_batch_partition — Method

Partition batches in a loop using a single block

CUDA.QuickSortImpl.call_batch_partition — Method

Launch batch partition kernel and sync

CUDA.QuickSortImpl.consolidate_batch_partition — Method

This assumes the region of vals of length L starting after lo has been batch partitioned with respect to pivot. Further, it assumes that these batches are of size blockDim().x.

Using 1 step per batch, consolidate these partitioned batches such that the region is fully partitioned. Each step moves at most blockDim().x values.

b_sums: either shared memory or a global array which serves as scratch space for storing the partition of each batch.

parity: see top docstring

Must only run on 1 SM.

CUDA.QuickSortImpl.cumsum! — Method

Performs in-place cumsum using shared memory. Intended for use with indexes

CUDA.QuickSortImpl.find_partition — Method

Finds the index in array of the last value <= pivot if parity = true or the last value < pivot if parity = false. Searches after index lo up to (inclusive) index hi
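
A sequential reading of this description might look like the sketch below. find_partition_sketch is a hypothetical CPU helper using a plain linear scan; the actual search strategy is not stated above, and returning lo when no value qualifies is an assumption made for this example.

```julia
# CPU sketch: index of the last value in array[lo+1:hi] that is <= pivot
# (parity = true) or < pivot (parity = false).
# Assumption: returns lo when no value in the range qualifies.
function find_partition_sketch(array, pivot, lo, hi, parity::Bool)
    idx = lo
    for i in lo+1:hi
        keep = parity ? array[i] <= pivot : array[i] < pivot
        keep && (idx = i)
    end
    return idx
end

vals = [1, 2, 2, 3, 5]
println(find_partition_sketch(vals, 2, 0, 5, true))    # 3
println(find_partition_sketch(vals, 2, 0, 5, false))   # 1
```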

CUDA.QuickSortImpl.partial_range_overlap — Method

Quicksort recursion condition. If the domain to sort, lo to hi, overlaps with partial, then we should do recursion on it, and this returns true (if not, then false).

CUDA.QuickSortImpl.partial_range_overlap — Method

Quicksort recursion condition. For a full sort, partial is nothing, so it shouldn't affect whether recursion happens.

CUDA.QuickSortImpl.partition_batches_kernel — Method

Each block evaluates batch_partition on consecutive regions of length blockDim().x from lo to hi of values.

CUDA.QuickSortImpl.qsort_kernel — Method

Perform quicksort on dimension dims of vals for the region with lo as an exclusive floor and hi as an inclusive ceiling. parity is a boolean which says whether to partition by < or <= with respect to the pivot. sync_depth is how many (more) levels of recursion with qsort_kernel can be done before reaching cudaLimitDevRuntimeSyncDepth. From the host, this value must not exceed that limit.

sync and enclosed type S determine how partition occurs: If sync is true, the kernel partitions batches in a child kernel, synchronizes, and then consolidates the batches. The benefit of this kernel is that it distributes the work of partitioning batches across multiple SMs. If sync is false, the kernel partitions without launching any child kernels, then has recursive qsort_kernel children for left and right partitions. device_synchronize is never called from this kernel, so there is no practical limit on recursion.

To detect the scenario of all values in the region being the same, we have two args: prev_pivot and stuck. If two consecutive partitions have the same pivot and both failed to split the region in two, that means all the values are equal. stuck is incremented when the pivot hasn't changed and partition = lo or hi. If stuck reaches 2, recursion ends. stuck is initialized at -1 because prev_pivot must be initialized to some value, and it's possible that the first pivot will be that value, which could lead to an incorrectly early end to recursion if we started stuck at 0.

CUDA.APIUtils.LazyInitialized — Type

LazyInitialized{T}()

A thread-safe, lazily-initialized wrapper for a value of type T. Initialize and fetch the value by calling get!. The constructor is guaranteed to be called only once.

This type is intended for lazy initialization of e.g. global structures, without using __init__. It is similar to protecting accesses using a lock, but is much cheaper.
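
The general pattern (a cheap check on the fast path, a lock only during first initialization) can be sketched in plain Julia. LazySketch below is a hypothetical model written for this example, not CUDA.jl's implementation; it assumes Julia 1.7 or later for @atomic fields and does not support T == Nothing.

```julia
# Hypothetical model of lazy, thread-safe initialization; not CUDA.jl's code.
mutable struct LazySketch{T}
    @atomic value::Union{Nothing,T}   # `nothing` means "not initialized yet"
    lock::ReentrantLock
    LazySketch{T}() where {T} = new{T}(nothing, ReentrantLock())
end

function Base.get!(init::Function, lazy::LazySketch{T}) where {T}
    v = @atomic lazy.value
    v === nothing || return v::T          # fast path: no lock taken
    lock(lazy.lock) do
        w = @atomic lazy.value            # re-check while holding the lock
        if w === nothing
            w = init()::T                 # runs at most once
            @atomic lazy.value = w
        end
        return w::T
    end
end

calls = Ref(0)
lazy = LazySketch{Int}()

a = get!(lazy) do
    calls[] += 1
    42
end
b = get!(lazy) do          # initializer is skipped: value already set
    calls[] += 1
    0
end

println((a, b, calls[]))   # (42, 42, 1)
```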

CUDA.APIUtils.with_workspace — Method

with_workspace([cache], bytesize) do workspace
    ...
end

Create a GPU workspace vector with size bytesize (either a number, or a callable function), and pass it to the do block. Afterwards, the buffer is freed. If you instead want to cache the workspace, pass any previous instance as the first argument, which will result in it getting resized instead.

This helper protects against the rare but real issue of the workspace size getter returning different results based on the GPU device memory pressure, which might change after initial allocation of the workspace (which can cause a GC collection).

See also: with_workspaces, if you need both a GPU and CPU workspace.

CUDA.APIUtils.with_workspaces — Method

with_workspaces([cache_gpu], [cache_cpu], size_gpu, size_cpu) do workspace_gpu, workspace_cpu
    ...
end

Create GPU and CPU workspace vectors with sizes size_gpu and size_cpu (each either a number, or a callable function), and pass them to the do block. Afterwards, the buffers are freed. If you instead want to cache the workspaces, pass any previous instances as the first arguments, which will result in them getting resized instead.

This helper protects against the rare but real issue of the workspace size getters returning different results based on the memory pressure, which might change after initial allocation of the workspace (which can cause a GC collection).

See also: with_workspace, if you only need a GPU workspace.

CUDA.APIUtils.@checked — Macro

@checked function foo(...)
    rv = ...
    return rv
end

Macro for wrapping a function definition returning a status code. Two versions of the function will be generated: foo, with the function execution wrapped by an invocation of the check function (to be implemented by the caller of this macro), and unchecked_foo where no such invocation is present and the status code is returned to the caller.
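
Written out by hand, the two generated versions might look roughly like the following. Everything here is a hypothetical illustration: the status convention (0 means success), the body of check, and the way check receives the wrapped call as a function are assumptions, not the macro's actual expansion.

```julia
# Hand-written model of the two versions described above, for a hypothetical
# status-returning function `foo`. Assumption: status 0 means success.
const SUCCESS = 0

# `check` is supplied by the user of the macro; in this sketch it runs the
# wrapped call and throws if the returned status is not SUCCESS.
function check(f)
    status = f()
    status == SUCCESS || error("call failed with status $status")
    return nothing
end

# Unchecked version: the raw status code is returned to the caller.
unchecked_foo(x) = x >= 0 ? SUCCESS : 1

# Checked version: the same call, wrapped in `check`.
foo(x) = check(() -> unchecked_foo(x))

println(unchecked_foo(-1))   # 1
println(foo(1))              # nothing
```

The unchecked variant is useful when a nonzero status is expected and should be handled by the caller instead of being turned into an exception.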

CUDA.APIUtils.@gcsafe_ccall — Macro

@gcsafe_ccall ...

Call a foreign function just like @ccall, but marking it safe for the GC to run. This is useful for functions that may block, so that the GC isn't blocked from running, but may also be required to prevent deadlocks (see JuliaGPU/CUDA.jl#2261).

Note that this is generally only safe with non-Julia C functions that do not call back into Julia. When using callbacks, the code should make sure to transition back into GC-unsafe mode using the @gcunsafe_callback macro.

CUDA.APIUtils.@gcunsafe_callback — Macro

@gcunsafe_callback function callback(...)
    ...
end

Mark a callback function as unsafe for the GC to run. This is normally the default for Julia code, and is meant to be used in combination with @gcsafe_ccall.

CUDA.APIUtils.@memoize — Macro

@memoize [key::T] [maxlen=...] begin
    # expensive computation
end::T

Low-level, no-frills memoization macro that stores values in a thread-local, typed cache. The types of the caches are derived from the syntactical type assertions.

The cache consists of two levels, the outer one indexed with the thread index. If no key is specified, the second level of the cache is dropped.

If the maxlen option is specified, the key is assumed to be an integer, and the secondary cache will be a vector with length maxlen. Otherwise, a dictionary is used.

CUDA.CUPTI.ActivityConfig — Type

cfg = CUPTI.ActivityConfig(activity_kinds)

CUPTI.enable!(cfg) do
    # do stuff
end

CUPTI.process(cfg) do ctx, stream_id, record
    # inspect record
end

High-level interface to the CUPTI activity API.

CUDA.CUPTI.CallbackConfig — Type

cfg = CUPTI.CallbackConfig(callback_kinds) do domain, id, data
    # inspect data
end

CUPTI.enable!(cfg) do
    # do stuff
end

CUDA.CG — Module

CUDA.jl's cooperative groups implementation.

Cooperative groups in CUDA offer a structured approach to synchronize and communicate among threads. They allow developers to define specific groups of threads, providing a means to fine-tune inter-thread communication granularity. By offering a more nuanced alternative to traditional CUDA synchronization methods, cooperative groups enable a more controlled and efficient parallel decomposition in kernel design.

The following functionality is available in CUDA.jl:

implicit groups: thread blocks, grid groups, and coalesced groups.
synchronization: sync, barrier_arrive, barrier_wait
warp collectives for coalesced groups: shuffle and voting
data transfer: memcpy_async, wait and wait_prior

Noteworthy missing functionality:

implicit groups: clusters, and multi-grid groups (which are deprecated)
explicit groups: tiling and partitioning

CUDA.CG.coalesced_group — Type

coalesced_group <: thread_group

A group representing the current set of converged threads in a warp. The size of the group is not guaranteed and it may return a group of only one thread (itself).

This group exposes warp-synchronous builtins. Constructed via coalesced_threads.

CUDA.CG.grid_group — Type

grid_group <: thread_group

Threads within this group are guaranteed to be co-resident on the same device within the same launched kernel. To use this group, the kernel must have been launched with @cuda cooperative=true, and the device must support it (queryable device attribute).

Constructed via this_grid.

CUDA.CG.thread_block — Type

thread_block <: thread_group

Every GPU kernel is executed by a grid of thread blocks, and threads within each block are guaranteed to reside on the same streaming multiprocessor. A thread_block represents a thread block whose dimensions are not known until runtime.

Constructed via this_thread_block.

CUDA.CG.barrier_arrive — Function

barrier_arrive(group)

Arrive on the barrier. Returns a token that must be passed to barrier_wait.

CUDA.CG.barrier_wait — Function

barrier_wait(group, token)

Wait on the barrier. Takes the arrival token returned by barrier_arrive.
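
The split barrier can be sketched as follows: each thread arrives once its shared-memory write is done, does some independent work, and only then waits. The kernel name rotate_kernel and the launch configuration are illustrative assumptions, and a CUDA-capable GPU is required.

```julia
using CUDA
using CUDA: CG   # explicit import, in case CG is not exported in your version

# Illustrative kernel: rotate `d` left by one element within a single block.
function rotate_kernel(d::CuDeviceArray{T}) where {T}
    block = CG.this_thread_block()
    n = length(d)
    t = CG.thread_rank(block)

    s = CuDynamicSharedArray(T, n)
    @inbounds s[t] = d[t]
    token = CG.barrier_arrive(block)   # signal that this thread's write is done

    # Independent work that does not touch `s` can overlap with the barrier.
    next = t == n ? 1 : t + 1

    CG.barrier_wait(block, token)      # block until all threads have arrived
    @inbounds d[t] = s[next]

    return
end

a = rand(Float32, 64)
d = CuArray(a)
@cuda threads=length(d) shmem=sizeof(a) rotate_kernel(d)
```

The token ties the wait to the matching arrival, so it has to be the value returned by barrier_arrive on the same group.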

CUDA.CG.block_index — Method

block_index(gg::grid_group)

3-Dimensional index of the block within the launched grid.

CUDA.CG.block_rank — Method

block_rank(gg::grid_group)

Rank of the calling block within [0, num_blocks).

CUDA.CG.coalesced_threads — Method

coalesced_threads()

Constructs a coalesced_group.

CUDA.CG.dim_blocks — Method

dim_blocks(gg::grid_group)

Dimensions of the launched grid in units of blocks.

CUDA.CG.dim_threads — Method

dim_threads(tb::thread_block)

Dimensions of the launched block in units of threads.

CUDA.CG.group_index — Method

group_index(tb::thread_block)

3-Dimensional index of the block within the launched grid.

CUDA.CG.is_valid — Method

is_valid(gg::grid_group)

Returns whether the grid_group can synchronize.

CUDA.CG.memcpy_async — Function

memcpy_async(group, dst, src, bytes)

Perform a group-wide collective memory copy from src to dst of bytes bytes. This operation may be performed asynchronously, so you should wait or wait_prior before using the data. It is only supported by thread blocks and coalesced groups.

For this operation to be performed asynchronously, the following conditions must be met:

The source and destination memory should be aligned to 4, 8 or 16 bytes. This will be deduced from the datatype, but can also be specified explicitly using CUDA.align.
The source should be global memory, and the destination should be shared memory.
The device should have compute capability 8.0 or higher.
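
A minimal sketch of a collective copy from global to shared memory, followed by wait. It assumes that dst and src are passed as pointers obtained with pointer(...) and that the byte count is a plain integer; the kernel name staged_copy_kernel and the launch configuration are illustrative. On devices or layouts that do not meet the conditions above, the copy is expected to still work, just not asynchronously.

```julia
using CUDA
using CUDA: CG   # explicit import, in case CG is not exported in your version

# Illustrative kernel: stage `input` in shared memory, then write `input .+ 1`.
function staged_copy_kernel(input::CuDeviceArray{T}, output::CuDeviceArray{T}) where {T}
    block = CG.this_thread_block()
    n = length(input)

    smem = CuDynamicSharedArray(T, n)
    bytes = n * sizeof(T)

    # Collective call: every thread of the block has to take part.
    # Passing pointers here is an assumption about the expected argument types.
    CG.memcpy_async(block, pointer(smem), pointer(input), bytes)
    CG.wait(block)                     # the staged data is usable after this

    t = CG.thread_rank(block)
    @inbounds output[t] = smem[t] + one(T)

    return
end

a = rand(Float32, 64)
input = CuArray(a)
output = similar(input)
@cuda threads=length(input) shmem=sizeof(a) staged_copy_kernel(input, output)
```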

CUDA.CG.meta_group_rank — Method

meta_group_rank(cg::coalesced_group)

Rank of this group in the upper level of the hierarchy.

CUDA.CG.meta_group_size — Method

meta_group_size(cg::coalesced_group)

Total number of partitions created out of all CTAs when the group was created.

CUDA.CG.num_blocks — Method

num_blocks(gg::grid_group)

Total number of blocks in the group.

CUDA.CG.num_threads — Function

num_threads(group)

Returns the total number of threads in the group.

CUDA.CG.sync — Function

sync(group)

Synchronize the threads named in the group. Equivalent to calling barrier_arrive followed by barrier_wait.

CUDA.CG.this_grid — Method

this_grid()

Constructs a grid_group.

CUDA.CG.this_thread_block — Method

this_thread_block()

Constructs a thread_block group.

CUDA.CG.thread_index — Method

thread_index(tb::thread_block)

3-Dimensional index of the thread within the launched block.

CUDA.CG.thread_rank — Function

thread_rank(group)

Returns the linearized rank of the calling thread along the interval [1, num_threads()].
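
To illustrate what a linearized 1-based rank means for a 3-dimensional block, here is a host-side sketch that runs without a GPU. It assumes the usual CUDA convention that x varies fastest, then y, then z; linear_rank is a hypothetical helper and not part of CUDA.jl.

```julia
# Hypothetical helper (not part of CUDA.jl): map a 1-based 3D index within a
# block of size `dims` to a 1-based linear rank, with x varying fastest.
linear_rank(idx::NTuple{3,Int}, dims::NTuple{3,Int}) =
    (idx[3] - 1) * dims[1] * dims[2] + (idx[2] - 1) * dims[1] + idx[1]

dims = (4, 2, 3)   # block of 4 x 2 x 3 threads
ranks = [linear_rank((x, y, z), dims) for x in 1:4, y in 1:2, z in 1:3]

# Every rank in 1:prod(dims) occurs exactly once, in column-major order.
println(vec(ranks) == collect(1:prod(dims)))
```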

CUDA.CG.wait — Method

wait(group)

Make all threads in this group wait for all previously submitted memcpy_async operations to complete.

CUDA.CG.wait_prior — Method

wait_prior(group, stage)

Make all threads in this group wait until all but the stage most recently submitted memcpy_async operations have completed.

CUDA.WMMA.ColMajor — Type

WMMA.ColMajor

Type that represents a matrix stored in column major (Julia style) order.

CUDA.WMMA.Config — Type

WMMA.Config{M, N, K, d_type}

Type that contains all information for WMMA operations that cannot be inferred from the argument types.

WMMA instructions calculate the matrix multiply-accumulate operation $D = A \cdot B + C$, where $A$ is an $M \times K$ matrix, $B$ a $K \times N$ matrix, and $C$ and $D$ are $M \times N$ matrices.

d_type refers to the type of the elements of matrix $D$, and can be either Float16 or Float32.

All WMMA operations take a Config as their final argument.

Examples

julia> config = WMMA.Config{16, 16, 16, Float32}
CUDA.WMMA.Config{16, 16, 16, Float32}
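
A sketch of how such a config is passed through a complete multiply-accumulate, using the high-level WMMA.load_a, WMMA.load_b, WMMA.load_c, WMMA.mma and WMMA.store_d functions. The kernel name wmma_kernel and the host-side variable names are illustrative, and the example assumes a GPU with tensor cores (compute capability 7.0 or higher).

```julia
using CUDA
using CUDA: WMMA   # explicit import, in case WMMA is not exported in your version

# Illustrative kernel computing D = A * B + C for 16x16 column-major matrices.
function wmma_kernel(a, b, c, d)
    conf = WMMA.Config{16, 16, 16, Float32}

    # The second argument is the leading dimension, in elements.
    a_frag = WMMA.load_a(pointer(a), 16, WMMA.ColMajor, conf)
    b_frag = WMMA.load_b(pointer(b), 16, WMMA.ColMajor, conf)
    c_frag = WMMA.load_c(pointer(c), 16, WMMA.ColMajor, conf)

    d_frag = WMMA.mma(a_frag, b_frag, c_frag, conf)

    WMMA.store_d(pointer(d), d_frag, 16, WMMA.ColMajor, conf)
    return
end

a = rand(Float16, 16, 16)
b = rand(Float16, 16, 16)
c = rand(Float32, 16, 16)

a_dev, b_dev, c_dev = CuArray(a), CuArray(b), CuArray(c)
d_dev = similar(c_dev)

# WMMA operations are warp-wide, so launch exactly one warp.
@cuda threads=32 wmma_kernel(a_dev, b_dev, c_dev, d_dev)
```

Because the inputs are Float16, the result should only be compared against a host-side reference with a relative tolerance.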

CUDA.WMMA.Fragment — Type

WMMA.Fragment

Type that represents per-thread intermediate results of WMMA operations.

You can access individual elements using the x member or [] operator, but beware that the exact ordering of elements is unspecified.

CUDA.WMMA.FragmentLayout — Type

WMMA.FragmentLayout

Abstract type that specifies the storage layout of a matrix.

Possible values are WMMA.RowMajor, WMMA.ColMajor and WMMA.Unspecified.

CUDA.WMMA.RowMajor — Type

WMMA.RowMajor

Type that represents a matrix stored in row major (C style) order.

CUDA.WMMA.Unspecified — Type

WMMA.Unspecified

Type that represents a matrix stored in an unspecified order.

Warning

This storage format is not valid for all WMMA operations!

CUDA.WMMA.fill_c — Function

WMMA.fill_c(value, config)

Return a WMMA.Fragment filled with the value value.

This operation is useful if you want to implement a plain matrix multiplication (and thus want to set $C = O$, the zero matrix).

Arguments

value: The value used to fill the fragment. Can be a Float16 or Float32.
config: The WMMA configuration that should be used for this WMMA operation. See WMMA.Config.
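
A minimal sketch of that use case: a plain product $D = A \cdot B$ in which the accumulator comes from fill_c instead of being loaded from memory. The kernel name matmul_kernel and the host-side variable names are illustrative, and a GPU with compute capability 7.0 or higher is assumed.

```julia
using CUDA
using CUDA: WMMA   # explicit import, in case WMMA is not exported in your version

# Illustrative kernel computing D = A * B for 16x16 column-major matrices.
function matmul_kernel(a, b, d)
    conf = WMMA.Config{16, 16, 16, Float32}

    a_frag = WMMA.load_a(pointer(a), 16, WMMA.ColMajor, conf)
    b_frag = WMMA.load_b(pointer(b), 16, WMMA.ColMajor, conf)
    c_frag = WMMA.fill_c(0.0f0, conf)   # zero accumulator, no C matrix in memory

    d_frag = WMMA.mma(a_frag, b_frag, c_frag, conf)

    WMMA.store_d(pointer(d), d_frag, 16, WMMA.ColMajor, conf)
    return
end

a = rand(Float16, 16, 16)
b = rand(Float16, 16, 16)

a_dev, b_dev = CuArray(a), CuArray(b)
d_dev = CUDA.zeros(Float32, 16, 16)

@cuda threads=32 matmul_kernel(a_dev, b_dev, d_dev)
```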

CUDA.WMMA.llvm_wmma_load — Method

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
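
To make the naming scheme concrete, the following host-side sketch assembles a wrapper name from its placeholders. wmma_load_name is a hypothetical helper, not part of CUDA.jl, and it does not check that a combination is actually valid. An empty address space is assumed to contribute no name component.

```julia
# Hypothetical helper (not part of CUDA.jl): build the name of a load wrapper
# from the placeholder values described above.
function wmma_load_name(matrix, layout, shape, addr_space, elem_type)
    parts = ["llvm_wmma_load", matrix, layout, shape]
    # Generic addressing (empty address space) adds nothing to the name.
    isempty(addr_space) || push!(parts, addr_space)
    push!(parts, "stride", elem_type)
    return join(parts, "_")
end

println(wmma_load_name("a", "col", "m16n16k16", "global", "f16"))
println(wmma_load_name("a", "col", "m16n16k16", "", "f16"))
```

The two calls print llvm_wmma_load_a_col_m16n16k16_global_stride_f16 and llvm_wmma_load_a_col_m16n16k16_stride_f16, respectively.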

CUDA.WMMA.llvm_wmma_load_a_col_m16n16k16_global_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
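
A sketch of this wrapper in context, combined with the corresponding low-level wrappers for loading b and c, the multiply-accumulate, and the store. The kernel name lowlevel_kernel and the host-side variable names are illustrative, and a GPU with compute capability 7.0 or higher is assumed.

```julia
using CUDA
using CUDA: WMMA   # explicit import, in case WMMA is not exported in your version

# Illustrative kernel computing D = A * B + C with the low-level wrappers.
# All matrices are 16x16, column major, and live in global memory.
function lowlevel_kernel(a, b, c, d)
    # Second argument: leading dimension of the matrix, in elements.
    a_frag = WMMA.llvm_wmma_load_a_col_m16n16k16_global_stride_f16(pointer(a), 16)
    b_frag = WMMA.llvm_wmma_load_b_col_m16n16k16_global_stride_f16(pointer(b), 16)
    c_frag = WMMA.llvm_wmma_load_c_col_m16n16k16_global_stride_f32(pointer(c), 16)

    d_frag = WMMA.llvm_wmma_mma_col_col_m16n16k16_f32_f32(a_frag, b_frag, c_frag)

    WMMA.llvm_wmma_store_d_col_m16n16k16_global_stride_f32(pointer(d), d_frag, 16)
    return
end

a = rand(Float16, 16, 16)
b = rand(Float16, 16, 16)
c = rand(Float32, 16, 16)

a_dev, b_dev, c_dev = CuArray(a), CuArray(b), CuArray(c)
d_dev = similar(c_dev)

# The intrinsics are warp-wide, so launch exactly one warp.
@cuda threads=32 lowlevel_kernel(a_dev, b_dev, c_dev, d_dev)
```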

CUDA.WMMA.llvm_wmma_load_a_col_m16n16k16_global_stride_s8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

CUDA.WMMA.llvm_wmma_load_a_col_m16n16k16_global_stride_u8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

CUDA.WMMA.llvm_wmma_load_a_col_m16n16k16_shared_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

CUDA.WMMA.llvm_wmma_load_a_col_m16n16k16_shared_stride_s8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

CUDA.WMMA.llvm_wmma_load_a_col_m16n16k16_shared_stride_u8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

CUDA.WMMA.llvm_wmma_load_a_col_m16n16k16_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

CUDA.WMMA.llvm_wmma_load_a_col_m16n16k16_stride_s8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

CUDA.WMMA.llvm_wmma_load_a_col_m16n16k16_stride_u8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

CUDA.WMMA.llvm_wmma_load_a_col_m32n8k16_global_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

CUDA.WMMA.llvm_wmma_load_a_col_m32n8k16_global_stride_s8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

CUDA.WMMA.llvm_wmma_load_a_col_m32n8k16_global_stride_u8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

CUDA.WMMA.llvm_wmma_load_a_col_m32n8k16_shared_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

CUDA.WMMA.llvm_wmma_load_a_col_m32n8k16_shared_stride_s8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

CUDA.WMMA.llvm_wmma_load_a_col_m32n8k16_shared_stride_u8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

CUDA.WMMA.llvm_wmma_load_a_col_m32n8k16_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

CUDA.WMMA.llvm_wmma_load_a_col_m32n8k16_stride_s8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

CUDA.WMMA.llvm_wmma_load_a_col_m32n8k16_stride_u8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

CUDA.WMMA.llvm_wmma_load_a_col_m8n32k16_global_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

CUDA.WMMA.llvm_wmma_load_a_col_m8n32k16_global_stride_s8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

CUDA.WMMA.llvm_wmma_load_a_col_m8n32k16_global_stride_u8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

CUDA.WMMA.llvm_wmma_load_a_col_m8n32k16_shared_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

CUDA.WMMA.llvm_wmma_load_a_col_m8n32k16_shared_stride_s8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

CUDA.WMMA.llvm_wmma_load_a_col_m8n32k16_shared_stride_u8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_a_col_m8n32k16_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_a_col_m8n32k16_stride_s8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_a_col_m8n32k16_stride_u8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_a_row_m16n16k16_global_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_a_row_m16n16k16_global_stride_s8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_a_row_m16n16k16_global_stride_u8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_a_row_m16n16k16_shared_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_a_row_m16n16k16_shared_stride_s8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_a_row_m16n16k16_shared_stride_u8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_a_row_m16n16k16_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_a_row_m16n16k16_stride_s8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_a_row_m16n16k16_stride_u8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_a_row_m32n8k16_global_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_a_row_m32n8k16_global_stride_s8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_a_row_m32n8k16_global_stride_u8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_a_row_m32n8k16_shared_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_a_row_m32n8k16_shared_stride_s8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_a_row_m32n8k16_shared_stride_u8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_a_row_m32n8k16_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_a_row_m32n8k16_stride_s8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_a_row_m32n8k16_stride_u8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_a_row_m8n32k16_global_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_a_row_m8n32k16_global_stride_s8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_a_row_m8n32k16_global_stride_u8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_a_row_m8n32k16_shared_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_a_row_m8n32k16_shared_stride_s8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_a_row_m8n32k16_shared_stride_u8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_a_row_m8n32k16_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_a_row_m8n32k16_stride_s8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_a_row_m8n32k16_stride_u8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_b_col_m16n16k16_global_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).
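
To make the stride argument concrete, here is a small CPU-only sketch of what a leading dimension means for a tile of a larger column-major matrix, assuming the col layout. The 32×32 parent matrix, the tile origin, and the names ld, origin_offset and tile_offset are made up for illustration; nothing here calls the wrapper itself.

```julia
# Illustrative only: plain Julia on the CPU, no GPU or CUDA.jl required.
# A 32×32 column-major parent matrix whose entries equal their linear index.
A = reshape(collect(1:32*32), 32, 32)

# We look at the 16×16 tile whose top-left element is A[17, 1].
row0, col0 = 17, 1

# Leading dimension: the number of elements between the starts of two
# consecutive columns of the parent. This is the value a col-layout load
# would take as its stride argument.
ld = size(A, 1)

# 0-based element offset of the tile origin, i.e. where src_addr would point
# relative to the start of A.
origin_offset = (col0 - 1) * ld + (row0 - 1)

# 0-based element offset of tile element (i, j), with 1-based i and j.
tile_offset(i, j) = origin_offset + (j - 1) * ld + (i - 1)

# Cross-check against Julia's own view and stride machinery.
tile = view(A, row0:row0+15, col0:col0+15)
@assert stride(tile, 2) == ld
@assert all(A[tile_offset(i, j) + 1] == tile[i, j] for i in 1:16, j in 1:16)

println("leading dimension (stride) = ", ld)
```

Under the same assumptions, a row layout would instead use the number of elements between the starts of two consecutive rows of a row-major parent.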

CUDA.WMMA.llvm_wmma_load_b_col_m16n16k16_global_stride_s8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_b_col_m16n16k16_global_stride_u8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_b_col_m16n16k16_shared_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_b_col_m16n16k16_shared_stride_s8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_b_col_m16n16k16_shared_stride_u8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_b_col_m16n16k16_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_b_col_m16n16k16_stride_s8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_b_col_m16n16k16_stride_u8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_b_col_m32n8k16_global_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_b_col_m32n8k16_global_stride_s8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_b_col_m32n8k16_global_stride_u8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_b_col_m32n8k16_shared_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_b_col_m32n8k16_shared_stride_s8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_b_col_m32n8k16_shared_stride_u8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_b_col_m32n8k16_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_b_col_m32n8k16_stride_s8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_b_col_m32n8k16_stride_u8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_b_col_m8n32k16_global_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_b_col_m8n32k16_global_stride_s8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_b_col_m8n32k16_global_stride_u8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_b_col_m8n32k16_shared_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_b_col_m8n32k16_shared_stride_s8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_b_col_m8n32k16_shared_stride_u8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_b_col_m8n32k16_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_b_col_m8n32k16_stride_s8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (unsigned 8-bit integer), s8 (signed 8-bit integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_b_col_m8n32k16_stride_u8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (unsigned 8-bit integer), s8 (signed 8-bit integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_b_row_m16n16k16_global_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (unsigned 8-bit integer), s8 (signed 8-bit integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).
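
To make the interaction between stride and {layout} concrete, the following sketch computes where an element sits relative to src_addr. It assumes the usual leading-dimension convention (consecutive columns are stride elements apart for col, consecutive rows for row); element_offset is a hypothetical helper, not a CUDA.jl function.

```julia
# Hypothetical helper: zero-based offset, in elements, of entry (i, j) from the
# start of the matrix. i and j are one-based; `stride` is the leading dimension.
function element_offset(layout::Symbol, i::Integer, j::Integer, stride::Integer)
    if layout === :col
        # Column major: moving to the next column skips `stride` elements.
        return (i - 1) + (j - 1) * stride
    elseif layout === :row
        # Row major: moving to the next row skips `stride` elements.
        return (i - 1) * stride + (j - 1)
    else
        throw(ArgumentError("layout must be :row or :col"))
    end
end

# A tile taken from a larger buffer whose leading dimension is 64 elements:
println(element_offset(:col, 3, 2, 64))  # prints 66
println(element_offset(:row, 3, 2, 64))  # prints 129
```

Because stride is counted in elements rather than bytes, the same value applies regardless of {elem_type}.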

CUDA.WMMA.llvm_wmma_load_b_row_m16n16k16_global_stride_s8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (unsigned 8-bit integer), s8 (signed 8-bit integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_b_row_m16n16k16_global_stride_u8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (unsigned 8-bit integer), s8 (signed 8-bit integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_b_row_m16n16k16_shared_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (unsigned 8-bit integer), s8 (signed 8-bit integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_b_row_m16n16k16_shared_stride_s8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (unsigned 8-bit integer), s8 (signed 8-bit integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_b_row_m16n16k16_shared_stride_u8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (unsigned 8-bit integer), s8 (signed 8-bit integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_b_row_m16n16k16_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (unsigned 8-bit integer), s8 (signed 8-bit integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_b_row_m16n16k16_stride_s8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (unsigned 8-bit integer), s8 (signed 8-bit integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_b_row_m16n16k16_stride_u8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (unsigned 8-bit integer), s8 (signed 8-bit integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_b_row_m32n8k16_global_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (unsigned 8-bit integer), s8 (signed 8-bit integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).
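
The {shape} placeholder describes the whole multiply-accumulate D = A * B + C, not the dimensions of the single matrix being loaded. The sketch below derives the per-operand dimensions from a shape string, using the standard convention that a is m×k, b is k×n, and c and d are m×n. Both helpers are hypothetical and not part of CUDA.jl.

```julia
# Hypothetical helper: parse a shape string such as "m32n8k16" into (m, n, k).
function parse_shape(shape::AbstractString)
    caps = match(r"^m(\d+)n(\d+)k(\d+)$", shape)
    caps === nothing && throw(ArgumentError("unrecognized shape: $shape"))
    return Tuple(parse(Int, c) for c in caps.captures)
end

# Hypothetical helper: (rows, columns) of each operand in D = A * B + C.
function operand_dims(shape::AbstractString)
    m, n, k = parse_shape(shape)
    return (a = (m, k), b = (k, n), c = (m, n), d = (m, n))
end

dims = operand_dims("m32n8k16")
println(dims.a)  # prints (32, 16)
println(dims.b)  # prints (16, 8)
println(dims.c)  # prints (32, 8)
```

For the m32n8k16 shape, a b load therefore reads a 16×8 matrix, so a suitable stride depends on {layout} as well as on the buffer the matrix is embedded in.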

CUDA.WMMA.llvm_wmma_load_b_row_m32n8k16_global_stride_s8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (unsigned 8-bit integer), s8 (signed 8-bit integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_b_row_m32n8k16_global_stride_u8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (unsigned 8-bit integer), s8 (signed 8-bit integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_b_row_m32n8k16_shared_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (unsigned 8-bit integer), s8 (signed 8-bit integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_b_row_m32n8k16_shared_stride_s8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (unsigned 8-bit integer), s8 (signed 8-bit integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_b_row_m32n8k16_shared_stride_u8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (unsigned 8-bit integer), s8 (signed 8-bit integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_b_row_m32n8k16_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (unsigned 8-bit integer), s8 (signed 8-bit integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_b_row_m32n8k16_stride_s8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (unsigned 8-bit integer), s8 (signed 8-bit integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_b_row_m32n8k16_stride_u8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (unsigned 8-bit integer), s8 (signed 8-bit integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_b_row_m8n32k16_global_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (unsigned 8-bit integer), s8 (signed 8-bit integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_b_row_m8n32k16_global_stride_s8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (unsigned 8-bit integer), s8 (signed 8-bit integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_b_row_m8n32k16_global_stride_u8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (unsigned 8-bit integer), s8 (signed 8-bit integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_b_row_m8n32k16_shared_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (unsigned 8-bit integer), s8 (signed 8-bit integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_b_row_m8n32k16_shared_stride_s8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (unsigned 8-bit integer), s8 (signed 8-bit integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_b_row_m8n32k16_shared_stride_u8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (unsigned 8-bit integer), s8 (signed 8-bit integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_b_row_m8n32k16_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (unsigned 8-bit integer), s8 (signed 8-bit integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_b_row_m8n32k16_stride_s8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (unsigned 8-bit integer), s8 (signed 8-bit integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_b_row_m8n32k16_stride_u8 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in number of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

CUDA.WMMA.llvm_wmma_load_c_col_m16n16k16_global_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, measured in elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
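
The placeholder scheme can be made concrete with a small sketch. The helper below, wmma_load_name, is hypothetical and not part of CUDA.jl; it only assembles a wrapper name from the placeholders, assuming (as the names listed on this page suggest) that an empty {addr_space} contributes no component to the name.

```julia
# Hypothetical helper (not part of CUDA.jl): build a wrapper name from its placeholders.
function wmma_load_name(matrix, layout, shape, addr_space, elem_type)
    parts = ["llvm_wmma_load", matrix, layout, shape, addr_space, "stride", elem_type]
    # Assumption: an empty addr_space (generic addressing) is simply left out of the name.
    return join(filter(!isempty, parts), "_")
end

println(wmma_load_name("c", "col", "m16n16k16", "global", "f16"))
println(wmma_load_name("b", "row", "m8n32k16", "", "u8"))
```

The two calls reproduce names that appear in this listing, one with the global address space and one with generic addressing.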

CUDA.WMMA.llvm_wmma_load_c_col_m16n16k16_global_stride_f32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, measured in elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
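
To illustrate how stride and {layout} interact, here is a minimal host-side sketch. The function element_offset is hypothetical; it assumes the conventional meaning of a leading dimension, namely that stride is the distance in elements between the starts of consecutive columns (col) or consecutive rows (row). It does not call any WMMA intrinsic.

```julia
# Hypothetical helper: 0-based offset, in elements from src_addr, of entry (i, j),
# where i is the 0-based row index and j the 0-based column index.
function element_offset(layout::Symbol, i::Integer, j::Integer, stride::Integer)
    layout === :col && return i + j * stride   # column major (Julia style)
    layout === :row && return i * stride + j   # row major (C style)
    throw(ArgumentError("layout must be :row or :col"))
end

# A 16x16 column-major Julia matrix whose entries equal their own linear offset.
A = reshape(collect(0:255), 16, 16)
println(A[3, 4] == element_offset(:col, 2, 3, 16))
```

For a tightly packed 16×16 matrix the stride is 16 in either layout; a larger stride would describe a tile embedded in a bigger matrix.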

CUDA.WMMA.llvm_wmma_load_c_col_m16n16k16_global_stride_s32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, measured in elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
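
The {elem_type} rules above can be summarized as a lookup. Both functions in this sketch are hypothetical illustrations, not CUDA.jl API; the mapping from suffixes to Julia number types is an assumption based on the bit widths stated in the placeholder description.

```julia
# Hypothetical helper: element-type suffixes accepted for each matrix,
# transcribed from the placeholder description.
function valid_elem_types(matrix)
    matrix in ("a", "b") && return ("u8", "s8", "f16")
    matrix in ("c", "d") && return ("s32", "f16", "f32")
    throw(ArgumentError("matrix must be a, b, c or d"))
end

# Assumed correspondence between suffixes and Julia's built-in number types.
function julia_elem_type(suffix)
    table = Dict("u8" => UInt8, "s8" => Int8, "s32" => Int32,
                 "f16" => Float16, "f32" => Float32)
    return table[suffix]
end

println("s32" in valid_elem_types("c"))
println(sizeof(julia_elem_type("f16")))
```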

CUDA.WMMA.llvm_wmma_load_c_col_m16n16k16_shared_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, measured in elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

CUDA.WMMA.llvm_wmma_load_c_col_m16n16k16_shared_stride_f32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, measured in elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

CUDA.WMMA.llvm_wmma_load_c_col_m16n16k16_shared_stride_s32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, measured in elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

CUDA.WMMA.llvm_wmma_load_c_col_m16n16k16_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, measured in elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

CUDA.WMMA.llvm_wmma_load_c_col_m16n16k16_stride_f32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, measured in elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

CUDA.WMMA.llvm_wmma_load_c_col_m16n16k16_stride_s32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, measured in elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

CUDA.WMMA.llvm_wmma_load_c_col_m32n8k16_global_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, measured in elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
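
The {shape} placeholder encodes three sizes. The sketch below uses a hypothetical helper, fragment_dims, and assumes the usual convention for a multiply-accumulate D = A·B + C, in which a is M×K, b is K×N, and c and d are M×N. This convention is an assumption of the example, not something stated in this docstring.

```julia
# Hypothetical helper: dimensions of one matrix for a given MAC shape string.
function fragment_dims(matrix, shape)
    m = match(r"^m(\d+)n(\d+)k(\d+)$", shape)
    m === nothing && throw(ArgumentError("unrecognized shape: $shape"))
    M, N, K = parse.(Int, m.captures)
    # Assumed convention: D (M×N) = A (M×K) * B (K×N) + C (M×N)
    matrix == "a" && return (M, K)
    matrix == "b" && return (K, N)
    matrix in ("c", "d") && return (M, N)
    throw(ArgumentError("unrecognized matrix: $matrix"))
end

println(fragment_dims("c", "m32n8k16"))
println(fragment_dims("b", "m8n32k16"))
```

Under this assumption, the c matrix loaded by an m32n8k16 wrapper has 32 rows and 8 columns, so a tightly packed column-major copy would use a stride of 32.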

CUDA.WMMA.llvm_wmma_load_c_col_m32n8k16_global_stride_f32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, measured in elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

CUDA.WMMA.llvm_wmma_load_c_col_m32n8k16_global_stride_s32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, measured in elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

CUDA.WMMA.llvm_wmma_load_c_col_m32n8k16_shared_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, measured in elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

CUDA.WMMA.llvm_wmma_load_c_col_m32n8k16_shared_stride_f32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, measured in elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

CUDA.WMMA.llvm_wmma_load_c_col_m32n8k16_shared_stride_s32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, measured in elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

CUDA.WMMA.llvm_wmma_load_c_col_m32n8k16_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, measured in elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

CUDA.WMMA.llvm_wmma_load_c_col_m32n8k16_stride_f32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, measured in elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

CUDA.WMMA.llvm_wmma_load_c_col_m32n8k16_stride_s32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, measured in elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

CUDA.WMMA.llvm_wmma_load_c_col_m8n32k16_global_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, measured in elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

CUDA.WMMA.llvm_wmma_load_c_col_m8n32k16_global_stride_f32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, measured in elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

CUDA.WMMA.llvm_wmma_load_c_col_m8n32k16_global_stride_s32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, measured in elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

CUDA.WMMA.llvm_wmma_load_c_col_m8n32k16_shared_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, measured in elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

CUDA.WMMA.llvm_wmma_load_c_col_m8n32k16_shared_stride_f32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, measured in elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

CUDA.WMMA.llvm_wmma_load_c_col_m8n32k16_shared_stride_s32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, measured in elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

CUDA.WMMA.llvm_wmma_load_c_col_m8n32k16_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, measured in elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

CUDA.WMMA.llvm_wmma_load_c_col_m8n32k16_stride_f32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, measured in elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

CUDA.WMMA.llvm_wmma_load_c_col_m8n32k16_stride_s32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, measured in elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

CUDA.WMMA.llvm_wmma_load_c_row_m16n16k16_global_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, measured in elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

CUDA.WMMA.llvm_wmma_load_c_row_m16n16k16_global_stride_f32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).
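
As a rough guide, the {elem_type} suffixes line up with the Julia types below. The table `ELEM_TYPES` and the helper `stride_in_bytes` are hypothetical names introduced for illustration; the sketch only shows how a stride given in elements relates to a distance in bytes.

```julia
# Assumed correspondence between {elem_type} suffixes and Julia types (illustrative only).
const ELEM_TYPES = Dict(
    "u8"  => UInt8,
    "s8"  => Int8,
    "f16" => Float16,
    "s32" => Int32,
    "f32" => Float32,
)

# Hypothetical helper: distance in bytes between consecutive rows (or columns)
# for a stride expressed in elements.
stride_in_bytes(stride::Integer, elem_type::AbstractString) =
    stride * sizeof(ELEM_TYPES[elem_type])

println(stride_in_bytes(16, "f16"))  # prints 32
println(stride_in_bytes(16, "f32"))  # prints 64
```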

CUDA.WMMA.llvm_wmma_load_c_row_m16n16k16_global_stride_s32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_c_row_m16n16k16_shared_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_c_row_m16n16k16_shared_stride_f32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_c_row_m16n16k16_shared_stride_s32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_c_row_m16n16k16_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_c_row_m16n16k16_stride_f32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_c_row_m16n16k16_stride_s32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_c_row_m32n8k16_global_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_c_row_m32n8k16_global_stride_f32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_c_row_m32n8k16_global_stride_s32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_c_row_m32n8k16_shared_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_c_row_m32n8k16_shared_stride_f32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_c_row_m32n8k16_shared_stride_s32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_c_row_m32n8k16_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_c_row_m32n8k16_stride_f32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_c_row_m32n8k16_stride_s32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_c_row_m8n32k16_global_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_c_row_m8n32k16_global_stride_f32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_c_row_m8n32k16_global_stride_s32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_c_row_m8n32k16_shared_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_c_row_m8n32k16_shared_stride_f32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_c_row_m8n32k16_shared_stride_s32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_c_row_m8n32k16_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_c_row_m8n32k16_stride_f32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_c_row_m8n32k16_stride_s32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_d_col_m16n16k16_global_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_d_col_m16n16k16_global_stride_f32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_d_col_m16n16k16_global_stride_s32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, measured in elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_d_col_m16n16k16_shared_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, measured in elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_d_col_m16n16k16_shared_stride_f32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, measured in elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_d_col_m16n16k16_shared_stride_s32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, measured in elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_d_col_m16n16k16_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, measured in elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_d_col_m16n16k16_stride_f32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, measured in elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_d_col_m16n16k16_stride_s32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, measured in elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_d_col_m32n8k16_global_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, measured in elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_d_col_m32n8k16_global_stride_f32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, measured in elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_d_col_m32n8k16_global_stride_s32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, measured in elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_d_col_m32n8k16_shared_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, measured in elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_d_col_m32n8k16_shared_stride_f32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, measured in elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_d_col_m32n8k16_shared_stride_s32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, measured in elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_d_col_m32n8k16_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, measured in elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_d_col_m32n8k16_stride_f32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, measured in elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_d_col_m32n8k16_stride_s32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, measured in elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_d_col_m8n32k16_global_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, measured in elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_d_col_m8n32k16_global_stride_f32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, measured in elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_d_col_m8n32k16_global_stride_s32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, measured in elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_d_col_m8n32k16_shared_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, measured in elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_d_col_m8n32k16_shared_stride_f32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, measured in elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_d_col_m8n32k16_shared_stride_s32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, measured in elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_d_col_m8n32k16_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, measured in elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_d_col_m8n32k16_stride_f32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, measured in elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_d_col_m8n32k16_stride_s32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, measured in elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_d_row_m16n16k16_global_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, measured in elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_d_row_m16n16k16_global_stride_f32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, measured in elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_d_row_m16n16k16_global_stride_s32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, measured in elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half-precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half-precision floating point), and f32 (single-precision floating point).

CUDA.WMMA.llvm_wmma_load_d_row_m16n16k16_shared_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

CUDA.WMMA.llvm_wmma_load_d_row_m16n16k16_shared_stride_f32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

CUDA.WMMA.llvm_wmma_load_d_row_m16n16k16_shared_stride_s32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

CUDA.WMMA.llvm_wmma_load_d_row_m16n16k16_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

CUDA.WMMA.llvm_wmma_load_d_row_m16n16k16_stride_f32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

CUDA.WMMA.llvm_wmma_load_d_row_m16n16k16_stride_s32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

CUDA.WMMA.llvm_wmma_load_d_row_m32n8k16_global_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

CUDA.WMMA.llvm_wmma_load_d_row_m32n8k16_global_stride_f32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

CUDA.WMMA.llvm_wmma_load_d_row_m32n8k16_global_stride_s32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

CUDA.WMMA.llvm_wmma_load_d_row_m32n8k16_shared_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

CUDA.WMMA.llvm_wmma_load_d_row_m32n8k16_shared_stride_f32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

CUDA.WMMA.llvm_wmma_load_d_row_m32n8k16_shared_stride_s32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

CUDA.WMMA.llvm_wmma_load_d_row_m32n8k16_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

CUDA.WMMA.llvm_wmma_load_d_row_m32n8k16_stride_f32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

CUDA.WMMA.llvm_wmma_load_d_row_m32n8k16_stride_s32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

CUDA.WMMA.llvm_wmma_load_d_row_m8n32k16_global_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

CUDA.WMMA.llvm_wmma_load_d_row_m8n32k16_global_stride_f32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

CUDA.WMMA.llvm_wmma_load_d_row_m8n32k16_global_stride_s32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

CUDA.WMMA.llvm_wmma_load_d_row_m8n32k16_shared_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

CUDA.WMMA.llvm_wmma_load_d_row_m8n32k16_shared_stride_f32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

CUDA.WMMA.llvm_wmma_load_d_row_m8n32k16_shared_stride_s32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

CUDA.WMMA.llvm_wmma_load_d_row_m8n32k16_stride_f16 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

CUDA.WMMA.llvm_wmma_load_d_row_m8n32k16_stride_f32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

CUDA.WMMA.llvm_wmma_load_d_row_m8n32k16_stride_s32 — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

CUDA.WMMA.llvm_wmma_mma — Method

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}.

For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}.

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
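
The two signatures above differ only in their suffix, which the following plain-Julia sketch makes explicit. The helper wmma_mma_name is hypothetical and not part of CUDA.jl; it only builds a name string and does not validate the combination. It assumes that f16 is the only floating point choice for {a_elem_type}, so any other value selects the single-suffix form.

```julia
# Hypothetical helper (not part of CUDA.jl): compose the name of a WMMA MMA
# wrapper from the documented placeholders. Pure string manipulation.
function wmma_mma_name(a_layout, b_layout, shape;
                       a_elem_type="f16", d_elem_type="f32", c_elem_type="f32")
    prefix = join(["llvm_wmma_mma", a_layout, b_layout, shape], "_")
    if a_elem_type == "f16"
        # Floating point form: suffix is {d_elem_type}_{c_elem_type}.
        return string(prefix, "_", d_elem_type, "_", c_elem_type)
    else
        # Assumption: every non-f16 input type uses the {a_elem_type} suffix.
        return string(prefix, "_", a_elem_type)
    end
end

println(wmma_mma_name("col", "col", "m16n16k16"; d_elem_type="f16", c_elem_type="f32"))
println(wmma_mma_name("row", "col", "m16n16k16"; a_elem_type="s8"))
```

The first call prints llvm_wmma_mma_col_col_m16n16k16_f16_f32 and the second prints llvm_wmma_mma_row_col_m16n16k16_s8.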

CUDA.WMMA.llvm_wmma_mma_col_col_m16n16k16_f16_f16 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

CUDA.WMMA.llvm_wmma_mma_col_col_m16n16k16_f16_f32 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

CUDA.WMMA.llvm_wmma_mma_col_col_m16n16k16_f32_f16 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

CUDA.WMMA.llvm_wmma_mma_col_col_m16n16k16_f32_f32 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
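
The following is a minimal sketch of how this wrapper might be used for a single 16×16×16 tile, with column-major Float16 inputs and Float32 accumulators. It assumes the matching load and store wrappers (`llvm_wmma_load_a_col_m16n16k16_global_stride_f16` and friends) follow the same naming scheme, that the arrays live in global memory with a leading dimension of 16, and that the GPU has tensor cores (compute capability 7.0 or higher). The names `wmma_kernel` and `run_wmma_example` are illustrative only.

```julia
using CUDA

# D = A * B + C for one 16×16×16 tile, executed cooperatively by one warp.
function wmma_kernel(a_dev, b_dev, c_dev, d_dev)
    # Shape, element type and layout must agree across load, MMA and store.
    a_frag = WMMA.llvm_wmma_load_a_col_m16n16k16_global_stride_f16(pointer(a_dev), 16)
    b_frag = WMMA.llvm_wmma_load_b_col_m16n16k16_global_stride_f16(pointer(b_dev), 16)
    c_frag = WMMA.llvm_wmma_load_c_col_m16n16k16_global_stride_f32(pointer(c_dev), 16)

    d_frag = WMMA.llvm_wmma_mma_col_col_m16n16k16_f32_f32(a_frag, b_frag, c_frag)

    WMMA.llvm_wmma_store_d_col_m16n16k16_global_stride_f32(pointer(d_dev), d_frag, 16)
    return
end

# Runs the kernel and returns (CPU reference, GPU result).
function run_wmma_example()
    a = rand(Float16, 16, 16)
    b = rand(Float16, 16, 16)
    c = rand(Float32, 16, 16)

    a_dev, b_dev, c_dev = CuArray(a), CuArray(b), CuArray(c)
    d_dev = similar(c_dev)

    # WMMA operations are warp-wide, so launch exactly one warp (32 threads).
    @cuda threads=32 wmma_kernel(a_dev, b_dev, c_dev, d_dev)

    return Float32.(a) * Float32.(b) + c, Array(d_dev)
end

# Only run when a suitable GPU is available.
if CUDA.functional() && CUDA.capability(CUDA.device()) >= v"7.0"
    expected, result = run_wmma_example()
    println(isapprox(expected, result; rtol=0.01))
end
```

The comparison uses a loose relative tolerance because the inputs are half precision.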

CUDA.WMMA.llvm_wmma_mma_col_col_m16n16k16_s8 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

CUDA.WMMA.llvm_wmma_mma_col_col_m16n16k16_u8 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

CUDA.WMMA.llvm_wmma_mma_col_col_m32n8k16_f16_f16 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

CUDA.WMMA.llvm_wmma_mma_col_col_m32n8k16_f16_f32 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

CUDA.WMMA.llvm_wmma_mma_col_col_m32n8k16_f32_f16 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

CUDA.WMMA.llvm_wmma_mma_col_col_m32n8k16_f32_f32 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

CUDA.WMMA.llvm_wmma_mma_col_col_m32n8k16_s8 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

CUDA.WMMA.llvm_wmma_mma_col_col_m32n8k16_u8 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

CUDA.WMMA.llvm_wmma_mma_col_col_m8n32k16_f16_f16 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

CUDA.WMMA.llvm_wmma_mma_col_col_m8n32k16_f16_f32 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

CUDA.WMMA.llvm_wmma_mma_col_col_m8n32k16_f32_f16 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

CUDA.WMMA.llvm_wmma_mma_col_col_m8n32k16_f32_f32 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

CUDA.WMMA.llvm_wmma_mma_col_col_m8n32k16_s8 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

CUDA.WMMA.llvm_wmma_mma_col_col_m8n32k16_u8 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

CUDA.WMMA.llvm_wmma_mma_col_row_m16n16k16_f16_f16 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

CUDA.WMMA.llvm_wmma_mma_col_row_m16n16k16_f16_f32 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

CUDA.WMMA.llvm_wmma_mma_col_row_m16n16k16_f32_f16 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

CUDA.WMMA.llvm_wmma_mma_col_row_m16n16k16_f32_f32 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

CUDA.WMMA.llvm_wmma_mma_col_row_m16n16k16_s8 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
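
The `{shape}` placeholder follows the usual mMnNkK convention, where $A$ is M×K, $B$ is K×N, and $C$ and $D$ are M×N. The sketch below is a CPU-only reference for the arithmetic $D = A \cdot B + C$. `parse_wmma_shape` and `mma_reference` are hypothetical names used for illustration; the code does not touch WMMA fragments or the GPU.

```julia
# Hypothetical helpers for illustration; not part of CUDA.jl.

# Parse a shape string such as "m32n8k16" into its M, N and K dimensions.
function parse_wmma_shape(shape::AbstractString)
    m = match(r"^m(\d+)n(\d+)k(\d+)$", shape)
    m === nothing && throw(ArgumentError("invalid shape: $shape"))
    M, N, K = parse.(Int, m.captures)
    return (M = M, N = N, K = K)
end

# CPU reference for the multiply-accumulate: D = A * B + C.
mma_reference(a::AbstractMatrix, b::AbstractMatrix, c::AbstractMatrix) = a * b + c

dims = parse_wmma_shape("m32n8k16")
A = ones(Float32, dims.M, dims.K)   # 32×16
B = ones(Float32, dims.K, dims.N)   # 16×8
C = ones(Float32, dims.M, dims.N)   # 32×8
D = mma_reference(A, B, C)          # 32×8, every entry is 16 + 1 = 17
```

A reference like this can be handy for checking the result of a GPU kernel on small inputs, within the rounding limits of the element types involved.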

CUDA.WMMA.llvm_wmma_mma_col_row_m16n16k16_u8 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

CUDA.WMMA.llvm_wmma_mma_col_row_m32n8k16_f16_f16 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

CUDA.WMMA.llvm_wmma_mma_col_row_m32n8k16_f16_f32 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

CUDA.WMMA.llvm_wmma_mma_col_row_m32n8k16_f32_f16 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

CUDA.WMMA.llvm_wmma_mma_col_row_m32n8k16_f32_f32 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

CUDA.WMMA.llvm_wmma_mma_col_row_m32n8k16_s8 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

CUDA.WMMA.llvm_wmma_mma_col_row_m32n8k16_u8 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

CUDA.WMMA.llvm_wmma_mma_col_row_m8n32k16_f16_f16 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

CUDA.WMMA.llvm_wmma_mma_col_row_m8n32k16_f16_f32 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

CUDA.WMMA.llvm_wmma_mma_col_row_m8n32k16_f32_f16 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

CUDA.WMMA.llvm_wmma_mma_col_row_m8n32k16_f32_f32 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

CUDA.WMMA.llvm_wmma_mma_col_row_m8n32k16_s8 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

CUDA.WMMA.llvm_wmma_mma_col_row_m8n32k16_u8 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

CUDA.WMMA.llvm_wmma_mma_row_col_m16n16k16_f16_f16 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

CUDA.WMMA.llvm_wmma_mma_row_col_m16n16k16_f16_f32 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

CUDA.WMMA.llvm_wmma_mma_row_col_m16n16k16_f32_f16 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

CUDA.WMMA.llvm_wmma_mma_row_col_m16n16k16_f32_f32 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

CUDA.WMMA.llvm_wmma_mma_row_col_m16n16k16_s8 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

CUDA.WMMA.llvm_wmma_mma_row_col_m16n16k16_u8 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

CUDA.WMMA.llvm_wmma_mma_row_col_m32n8k16_f16_f16 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
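
To make the `{shape}` placeholder concrete, the following CPU-only sketch assumes the usual WMMA convention that `mMnNkK` describes an $M \times K$ matrix $A$, a $K \times N$ matrix $B$, and $M \times N$ matrices $C$ and $D$, with $D = A \cdot B + C$. The helpers `parse_shape` and `mac_reference` are hypothetical names introduced for illustration; they run on the host with ordinary arrays and say nothing about how fragments are distributed across a warp.

```julia
# Hypothetical helper: parse a shape string such as "m32n8k16" into (m, n, k).
function parse_shape(shape::AbstractString)
    parts = match(r"^m(\d+)n(\d+)k(\d+)$", shape)
    parts === nothing && throw(ArgumentError("unrecognized shape: $shape"))
    return ntuple(i -> parse(Int, parts.captures[i]), 3)
end

# Hypothetical CPU reference for the multiply-accumulate D = A * B + C,
# accumulating in Float32 (corresponding to d_elem_type = c_elem_type = f32).
function mac_reference(shape, A, B, C)
    m, n, k = parse_shape(shape)
    size(A) == (m, k) || throw(DimensionMismatch("A must be $m x $k"))
    size(B) == (k, n) || throw(DimensionMismatch("B must be $k x $n"))
    size(C) == (m, n) || throw(DimensionMismatch("C must be $m x $n"))
    return Float32.(A) * Float32.(B) + Float32.(C)
end

A = ones(Float16, 32, 16)
B = ones(Float16, 16, 8)
C = zeros(Float32, 32, 8)
D = mac_reference("m32n8k16", A, B, C)  # 32 x 8 result
```

A reference like this can be handy for checking device results on the host, with a tolerance that accounts for half-precision inputs.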

CUDA.WMMA.llvm_wmma_mma_row_col_m32n8k16_f16_f32 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

CUDA.WMMA.llvm_wmma_mma_row_col_m32n8k16_f32_f16 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

CUDA.WMMA.llvm_wmma_mma_row_col_m32n8k16_f32_f32 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

CUDA.WMMA.llvm_wmma_mma_row_col_m32n8k16_s8 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

CUDA.WMMA.llvm_wmma_mma_row_col_m32n8k16_u8 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

CUDA.WMMA.llvm_wmma_mma_row_col_m8n32k16_f16_f16 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

CUDA.WMMA.llvm_wmma_mma_row_col_m8n32k16_f16_f32 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

CUDA.WMMA.llvm_wmma_mma_row_col_m8n32k16_f32_f16 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

CUDA.WMMA.llvm_wmma_mma_row_col_m8n32k16_f32_f32 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

CUDA.WMMA.llvm_wmma_mma_row_col_m8n32k16_s8 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

CUDA.WMMA.llvm_wmma_mma_row_col_m8n32k16_u8 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
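
For the integer variants (`u8` and `s8`), the inputs are narrow integers while the accumulator and result matrices use `s32`. The following host-side sketch illustrates why a wider accumulator matters. It uses plain Julia arrays sized for the `m8n32k16` shape (assuming $A$ is $8 \times 16$, $B$ is $16 \times 32$, and $C$, $D$ are $8 \times 32$) and does not call any CUDA.jl function; the variable names are illustrative only.

```julia
# Inputs stored as unsigned 8-bit integers (a_elem_type = u8).
A = fill(UInt8(200), 8, 16)
B = fill(UInt8(3), 16, 32)
# Accumulator stored as signed 32-bit integers (c_elem_type = s32).
C = zeros(Int32, 8, 32)

# Widen the inputs before multiplying so that products and sums fit:
# each entry is 200 * 3 summed over k = 16 terms.
D = Int32.(A) * Int32.(B) + C

# For comparison, the same product computed in 8-bit arithmetic wraps around.
wrapped = A * B
```

Here every entry of `D` holds the exact sum, whereas `wrapped` has element type `UInt8` and its entries have overflowed.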

CUDA.WMMA.llvm_wmma_mma_row_row_m16n16k16_f16_f16 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

CUDA.WMMA.llvm_wmma_mma_row_row_m16n16k16_f16_f32 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

CUDA.WMMA.llvm_wmma_mma_row_row_m16n16k16_f32_f16 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

CUDA.WMMA.llvm_wmma_mma_row_row_m16n16k16_f32_f32 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

CUDA.WMMA.llvm_wmma_mma_row_row_m16n16k16_s8 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

CUDA.WMMA.llvm_wmma_mma_row_row_m16n16k16_u8 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

CUDA.WMMA.llvm_wmma_mma_row_row_m32n8k16_f16_f16 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

CUDA.WMMA.llvm_wmma_mma_row_row_m32n8k16_f16_f32 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

CUDA.WMMA.llvm_wmma_mma_row_row_m32n8k16_f32_f16 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

CUDA.WMMA.llvm_wmma_mma_row_row_m32n8k16_f32_f32 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
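
As an illustration of what the {shape} placeholder implies, here is a minimal CPU-side sketch, assuming the usual WMMA convention that a shape mMnNkK multiplies an M×K matrix $A$ by a K×N matrix $B$ and accumulates into M×N matrices $C$ and $D$. The helpers parse_wmma_shape and mma_reference are hypothetical names introduced for this example only and are not part of CUDA.jl. The hardware result may differ from this reference in rounding.

```julia
# Hypothetical helper (not part of CUDA.jl): parse a WMMA shape string such as
# "m32n8k16" into its m, n and k extents.
function parse_wmma_shape(shape::AbstractString)
    caps = match(r"^m(\d+)n(\d+)k(\d+)$", shape)
    caps === nothing && throw(ArgumentError("not a WMMA shape: $shape"))
    m, n, k = parse.(Int, caps.captures)
    return (m = m, n = n, k = k)
end

# Hypothetical CPU reference for D = A * B + C with Float16 inputs and Float32
# accumulation (assumed to mirror the f32_f32 variant; rounding may differ).
function mma_reference(A::Matrix{Float16}, B::Matrix{Float16}, C::Matrix{Float32})
    return Float32.(A) * Float32.(B) .+ C
end

dims = parse_wmma_shape("m32n8k16")
A = ones(Float16, dims.m, dims.k)    # 32×16
B = ones(Float16, dims.k, dims.n)    # 16×8
C = zeros(Float32, dims.m, dims.n)   # 32×8
D = mma_reference(A, B, C)           # 32×8
```

With all-ones inputs and a zero accumulator, each entry of D is the sum of k products of 1, which makes the role of the k extent easy to check by hand.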

CUDA.WMMA.llvm_wmma_mma_row_row_m32n8k16_s8 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!
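
The two signature forms above can be told apart by how many element types appear in the name. Judging from the entry names in this reference, the floating-point variants carry {d_elem_type} and {c_elem_type}, while the integer variants carry only {a_elem_type}. The sketch below assembles such names as plain strings; wmma_mma_name is a hypothetical helper for illustration, not a CUDA.jl function, and it does not check that a combination actually exists.

```julia
# Hypothetical helper (not part of CUDA.jl): assemble an MMA wrapper name from
# its placeholders. Pass two element types (d, c) for the floating-point form,
# or a single one (a) for the integer form.
function wmma_mma_name(a_layout, b_layout, shape, elem_types...)
    length(elem_types) in (1, 2) ||
        throw(ArgumentError("expected 1 or 2 element types"))
    return join(("llvm_wmma_mma", a_layout, b_layout, shape, elem_types...), "_")
end

float_name = wmma_mma_name("row", "row", "m32n8k16", "f32", "f16")
int_name   = wmma_mma_name("row", "row", "m32n8k16", "s8")
```

Both resulting strings correspond to entries documented in this reference.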

CUDA.WMMA.llvm_wmma_mma_row_row_m32n8k16_u8 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

CUDA.WMMA.llvm_wmma_mma_row_row_m8n32k16_f16_f16 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

CUDA.WMMA.llvm_wmma_mma_row_row_m8n32k16_f16_f32 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

CUDA.WMMA.llvm_wmma_mma_row_row_m8n32k16_f32_f16 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

CUDA.WMMA.llvm_wmma_mma_row_row_m8n32k16_f32_f32 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

CUDA.WMMA.llvm_wmma_mma_row_row_m8n32k16_s8 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

CUDA.WMMA.llvm_wmma_mma_row_row_m8n32k16_u8 — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

CUDA.WMMA.llvm_wmma_store — Method

WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
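
The template above is filled in by substituting each placeholder. Judging from the entry names in this reference, an empty {addr_space} drops its segment together with one underscore rather than leaving a double underscore. The following sketch builds the names as strings; wmma_store_name is a hypothetical helper used only for illustration and is not part of CUDA.jl.

```julia
# Hypothetical helper (not part of CUDA.jl): assemble a store wrapper name from
# its placeholders. An empty addr_space (generic addressing) is skipped.
function wmma_store_name(layout, shape, addr_space, elem_type)
    parts = ["llvm_wmma_store_d", layout, shape]
    isempty(addr_space) || push!(parts, addr_space)
    push!(parts, "stride", elem_type)
    return join(parts, "_")
end

global_name  = wmma_store_name("col", "m16n16k16", "global", "f16")
generic_name = wmma_store_name("col", "m16n16k16", "", "f16")
```

The first call yields the name of the global-memory variant and the second the generic-addressing variant, both of which are listed as separate entries in this reference.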

CUDA.WMMA.llvm_wmma_store_d_col_m16n16k16_global_stride_f16 — Function

WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
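
To make the stride argument concrete: as a leading dimension, it is the number of elements between the starts of consecutive columns (col layout) or consecutive rows (row layout) in memory. The sketch below computes the resulting element offsets on the CPU; element_offset is a hypothetical helper for illustration and not part of CUDA.jl.

```julia
# Hypothetical helper (not part of CUDA.jl): zero-based element offset of the
# entry at one-based position (i, j), for a given layout and stride.
function element_offset(i, j, stride; layout = :col)
    layout == :col && return (i - 1) + (j - 1) * stride
    layout == :row && return (i - 1) * stride + (j - 1)
    throw(ArgumentError("layout must be :col or :row"))
end

# A 16×16 tile stored at the top-left corner of a 64×64 matrix uses stride 64,
# not 16, because consecutive columns (or rows) of the tile are 64 elements apart.
col_step = element_offset(1, 2, 64; layout = :col)   # start of the second column
row_step = element_offset(2, 1, 64; layout = :row)   # start of the second row
```

When the destination is a standalone 16×16 array, the stride is simply 16.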

CUDA.WMMA.llvm_wmma_store_d_col_m16n16k16_global_stride_f32 — Function

WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

CUDA.WMMA.llvm_wmma_store_d_col_m16n16k16_global_stride_s32 — Function

WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

CUDA.WMMA.llvm_wmma_store_d_col_m16n16k16_shared_stride_f16 — Function

WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

CUDA.WMMA.llvm_wmma_store_d_col_m16n16k16_shared_stride_f32 — Function

WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

CUDA.WMMA.llvm_wmma_store_d_col_m16n16k16_shared_stride_s32 — Function

WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

CUDA.WMMA.llvm_wmma_store_d_col_m16n16k16_stride_f16 — Function

WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

CUDA.WMMA.llvm_wmma_store_d_col_m16n16k16_stride_f32 — Function

WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

CUDA.WMMA.llvm_wmma_store_d_col_m16n16k16_stride_s32 — Function

WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

CUDA.WMMA.llvm_wmma_store_d_col_m32n8k16_global_stride_f16 — Function

WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

CUDA.WMMA.llvm_wmma_store_d_col_m32n8k16_global_stride_f32 — Function

WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

CUDA.WMMA.llvm_wmma_store_d_col_m32n8k16_global_stride_s32 — Function

WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

CUDA.WMMA.llvm_wmma_store_d_col_m32n8k16_shared_stride_f16 — Function

WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

CUDA.WMMA.llvm_wmma_store_d_col_m32n8k16_shared_stride_f32 — Function

WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

CUDA.WMMA.llvm_wmma_store_d_col_m32n8k16_shared_stride_s32 — Function

WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

CUDA.WMMA.llvm_wmma_store_d_col_m32n8k16_stride_f16 — Function

WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
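
To make the role of stride and {layout} concrete, the following sketch computes where an entry of the stored matrix would land relative to dst_addr. It assumes the usual leading-dimension convention (the stride separates consecutive columns in col layout and consecutive rows in row layout) and that the D matrix of an m32n8k16 operation has 32 rows and 8 columns. The helper element_offset is hypothetical and not part of CUDA.jl.

```julia
# Hypothetical helper: zero-based offset, in elements, of entry (i, j)
# (1-based indices) from the start of the matrix.
function element_offset(layout::Symbol, i::Integer, j::Integer, stride::Integer)
    if layout === :col
        return (j - 1) * stride + (i - 1)   # columns are `stride` elements apart
    elseif layout === :row
        return (i - 1) * stride + (j - 1)   # rows are `stride` elements apart
    else
        throw(ArgumentError("layout must be :row or :col"))
    end
end

# A tightly packed 32x8 matrix: stride is 32 for col layout, 8 for row layout.
println(element_offset(:col, 2, 3, 32))  # 65
println(element_offset(:row, 2, 3, 8))   # 10
```

A stride larger than the packed value would describe a tile embedded in a bigger matrix.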

CUDA.WMMA.llvm_wmma_store_d_col_m32n8k16_stride_f32 — Function

WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

CUDA.WMMA.llvm_wmma_store_d_col_m32n8k16_stride_s32 — Function

WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

CUDA.WMMA.llvm_wmma_store_d_col_m8n32k16_global_stride_f16 — Function

WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
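
The {shape} placeholder encodes the three dimensions of the multiply-accumulate operation. As a rough illustration, the sketch below parses a shape string and derives the size of the D matrix, assuming the standard convention that D is m by n. The helper parse_shape is hypothetical and not part of CUDA.jl.

```julia
# Hypothetical helper: split a shape string such as "m8n32k16" into its
# m, n and k dimensions.
function parse_shape(shape::AbstractString)
    mt = match(r"^m(\d+)n(\d+)k(\d+)$", shape)
    mt === nothing && throw(ArgumentError("unrecognized shape: " * shape))
    M, N, K = parse.(Int, mt.captures)
    return (m = M, n = N, k = K)
end

s = parse_shape("m8n32k16")
d_dims = (s.m, s.n)   # assumed size of the D matrix: m rows, n columns
println(d_dims)       # (8, 32)
println(prod(d_dims)) # 256 elements in total
```

Under this convention all three valid shapes yield a D matrix of 256 elements.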

CUDA.WMMA.llvm_wmma_store_d_col_m8n32k16_global_stride_f32 — Function

WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

CUDA.WMMA.llvm_wmma_store_d_col_m8n32k16_global_stride_s32 — Function

WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

CUDA.WMMA.llvm_wmma_store_d_col_m8n32k16_shared_stride_f16 — Function

WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
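
Because stride is given in elements rather than bytes, the byte distance between consecutive rows or columns depends on {elem_type}. The sketch below pairs each element type abbreviation with the Julia type that matches its description above and converts a stride to bytes. The table ELEM_TYPES and the helper stride_bytes are hypothetical names introduced for illustration only.

```julia
# Hypothetical lookup table: element type abbreviation => Julia type of the
# same width and signedness as described in the placeholder list.
const ELEM_TYPES = Dict(
    "u8"  => UInt8,    # byte unsigned integer
    "s8"  => Int8,     # byte signed integer
    "s32" => Int32,    # 32-bit signed integer
    "f16" => Float16,  # half precision floating point
    "f32" => Float32,  # full precision floating point
)

# Hypothetical helper: convert a stride in elements to a distance in bytes.
stride_bytes(elem_type::AbstractString, stride::Integer) =
    stride * sizeof(ELEM_TYPES[elem_type])

println(stride_bytes("f16", 8))  # 16
println(stride_bytes("f32", 8))  # 32
```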

CUDA.WMMA.llvm_wmma_store_d_col_m8n32k16_shared_stride_f32 — Function

WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

CUDA.WMMA.llvm_wmma_store_d_col_m8n32k16_shared_stride_s32 — Function

WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

CUDA.WMMA.llvm_wmma_store_d_col_m8n32k16_stride_f16 — Function

WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

CUDA.WMMA.llvm_wmma_store_d_col_m8n32k16_stride_f32 — Function

WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

CUDA.WMMA.llvm_wmma_store_d_col_m8n32k16_stride_s32 — Function

WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

CUDA.WMMA.llvm_wmma_store_d_row_m16n16k16_global_stride_f16 — Function

WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

CUDA.WMMA.llvm_wmma_store_d_row_m16n16k16_global_stride_f32 — Function

WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
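
A minimal usage sketch of this wrapper inside a kernel might look as follows. It assumes a GPU with compute capability 7.0 or higher, a launch with a single warp of 32 threads, and that the load and mma wrappers named in the code exist under the same naming scheme (they are not documented in this part of the listing). The kernel name wmma_row_store_kernel is hypothetical.

```julia
using CUDA

# Sketch of a single-warp D = A * B + C kernel on 16x16 tiles.
# Assumption: the load and mma wrappers below follow the same placeholder
# scheme as the store wrapper documented above.
function wmma_row_store_kernel(a_dev, b_dev, c_dev, d_dev)
    a_frag = CUDA.WMMA.llvm_wmma_load_a_col_m16n16k16_global_stride_f16(pointer(a_dev), 16)
    b_frag = CUDA.WMMA.llvm_wmma_load_b_col_m16n16k16_global_stride_f16(pointer(b_dev), 16)
    c_frag = CUDA.WMMA.llvm_wmma_load_c_col_m16n16k16_global_stride_f32(pointer(c_dev), 16)

    d_frag = CUDA.WMMA.llvm_wmma_mma_col_col_m16n16k16_f32_f32(a_frag, b_frag, c_frag)

    # Store the D fragment in row-major order; the stride of 16 is the number
    # of elements per row of the tightly packed 16x16 result.
    CUDA.WMMA.llvm_wmma_store_d_row_m16n16k16_global_stride_f32(pointer(d_dev), d_frag, 16)
    return nothing
end

# Only launch when a suitable device is available.
if CUDA.functional() && capability(device()) >= v"7.0"
    a_dev = CuArray(rand(Float16, 16, 16))
    b_dev = CuArray(rand(Float16, 16, 16))
    c_dev = CuArray(rand(Float32, 16, 16))
    d_dev = similar(c_dev)

    @cuda threads=32 wmma_row_store_kernel(a_dev, b_dev, c_dev, d_dev)

    # Julia arrays are column major, so a row-major store is expected to
    # appear transposed when the result is read back on the host.
    d = Array(d_dev)
end
```

Switching the store to the col variant would be expected to write the result in Julia's native column-major order instead.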

CUDA.WMMA.llvm_wmma_store_d_row_m16n16k16_global_stride_s32 — Function

WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

CUDA.WMMA.llvm_wmma_store_d_row_m16n16k16_shared_stride_f16 — Function

WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

CUDA.WMMA.llvm_wmma_store_d_row_m16n16k16_shared_stride_f32 — Function

WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

CUDA.WMMA.llvm_wmma_store_d_row_m16n16k16_shared_stride_s32 — Function

WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

CUDA.WMMA.llvm_wmma_store_d_row_m16n16k16_stride_f16 — Function

WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

CUDA.WMMA.llvm_wmma_store_d_row_m16n16k16_stride_f32 — Function

WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

CUDA.WMMA.llvm_wmma_store_d_row_m16n16k16_stride_s32 — Function

WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

CUDA.WMMA.llvm_wmma_store_d_row_m32n8k16_global_stride_f16 — Function

WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

CUDA.WMMA.llvm_wmma_store_d_row_m32n8k16_global_stride_f32 — Function

WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

CUDA.WMMA.llvm_wmma_store_d_row_m32n8k16_global_stride_s32 — Function

WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

CUDA.WMMA.llvm_wmma_store_d_row_m32n8k16_shared_stride_f16 — Function

WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

CUDA.WMMA.llvm_wmma_store_d_row_m32n8k16_shared_stride_f32 — Function

WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

CUDA.WMMA.llvm_wmma_store_d_row_m32n8k16_shared_stride_s32 — Function

WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

CUDA.WMMA.llvm_wmma_store_d_row_m32n8k16_stride_f16 — Function

WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

CUDA.WMMA.llvm_wmma_store_d_row_m32n8k16_stride_f32 — Function

WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

CUDA.WMMA.llvm_wmma_store_d_row_m32n8k16_stride_s32 — Function

WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

CUDA.WMMA.llvm_wmma_store_d_row_m8n32k16_global_stride_f16 — Function

WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

CUDA.WMMA.llvm_wmma_store_d_row_m8n32k16_global_stride_f32 — Function

WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

CUDA.WMMA.llvm_wmma_store_d_row_m8n32k16_global_stride_s32 — Function

WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

CUDA.WMMA.llvm_wmma_store_d_row_m8n32k16_shared_stride_f16 — Function

WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

CUDA.WMMA.llvm_wmma_store_d_row_m8n32k16_shared_stride_f32 — Function

WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

CUDA.WMMA.llvm_wmma_store_d_row_m8n32k16_shared_stride_s32 — Function

WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

CUDA.WMMA.llvm_wmma_store_d_row_m8n32k16_stride_f16 — Function

WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

CUDA.WMMA.llvm_wmma_store_d_row_m8n32k16_stride_f32 — Function

WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

CUDA.WMMA.llvm_wmma_store_d_row_m8n32k16_stride_s32 — Function

WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (8-bit unsigned integer), s8 (8-bit signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (single precision floating point).

CUDA.WMMA.load_a — Function

WMMA.load_a(addr, stride, layout, config)
WMMA.load_b(addr, stride, layout, config)
WMMA.load_c(addr, stride, layout, config)

Load the matrix a, b or c from the memory location indicated by addr, and return the resulting WMMA.Fragment.

Arguments

addr: The address to load the matrix from.
stride: The leading dimension of the matrix pointed to by addr, specified in number of elements.
layout: The storage layout of the matrix. Possible values are WMMA.RowMajor and WMMA.ColMajor.
config: The WMMA configuration that should be used for loading this matrix. See WMMA.Config.

Warning

All threads in a warp MUST execute the load operation in lockstep, and have to use exactly the same arguments. Failure to do so will result in undefined behaviour.
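
To make the role of stride concrete, the following CPU-only sketch mimics how a column-major tile could be gathered from a larger matrix. The helpers `tile_offset` and `read_tile` are hypothetical and written for illustration; they are not CUDA.jl API and do not reflect how the hardware distributes elements over threads.

```julia
# CPU-only illustration; `tile_offset` and `read_tile` are hypothetical helpers.

# Zero-based linear offset of element (i, j) (one-based indices) in a
# column-major tile whose consecutive columns lie `stride` elements apart.
tile_offset(i, j, stride) = (i - 1) + (j - 1) * stride

# Gather an m×n column-major tile starting at linear index `start` of `buf`.
function read_tile(buf::AbstractVector, start::Int, stride::Int, m::Int, n::Int)
    return [buf[start + tile_offset(i, j, stride)] for i in 1:m, j in 1:n]
end

big = reshape(collect(1:32*32), 32, 32)  # 32×32 matrix, column-major like any Julia Array
buf = vec(big)                           # flat view of the same memory

# Tile whose top-left element is big[17, 9]. The stride is the leading
# dimension of the enclosing matrix (32), not the height of the tile (16).
start = LinearIndices(big)[17, 9]
tile = read_tile(buf, start, 32, 16, 16)
```

Under these assumptions, `tile` equals `big[17:32, 9:24]`. Passing 16 as the stride would instead read a contiguous block of 256 elements.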

CUDA.WMMA.load_b — Function

WMMA.load_a(addr, stride, layout, config)
WMMA.load_b(addr, stride, layout, config)
WMMA.load_c(addr, stride, layout, config)

Load the matrix a, b or c from the memory location indicated by addr, and return the resulting WMMA.Fragment.

Arguments

addr: The address to load the matrix from.
stride: The leading dimension of the matrix pointed to by addr, specified in number of elements.
layout: The storage layout of the matrix. Possible values are WMMA.RowMajor and WMMA.ColMajor.
config: The WMMA configuration that should be used for loading this matrix. See WMMA.Config.

Warning

All threads in a warp MUST execute the load operation in lockstep, and have to use exactly the same arguments. Failure to do so will result in undefined behaviour.

CUDA.WMMA.load_c — Function

WMMA.load_a(addr, stride, layout, config)
WMMA.load_b(addr, stride, layout, config)
WMMA.load_c(addr, stride, layout, config)

Load the matrix a, b or c from the memory location indicated by addr, and return the resulting WMMA.Fragment.

Arguments

addr: The address to load the matrix from.
stride: The leading dimension of the matrix pointed to by addr, specified in number of elements.
layout: The storage layout of the matrix. Possible values are WMMA.RowMajor and WMMA.ColMajor.
config: The WMMA configuration that should be used for loading this matrix. See WMMA.Config.

Warning

All threads in a warp MUST execute the load operation in lockstep, and have to use exactly the same arguments. Failure to do so will result in undefined behaviour.

CUDA.WMMA.mma — Function

WMMA.mma(a, b, c, conf)

Perform the matrix multiply-accumulate operation $D = A \cdot B + C$.

Arguments

a: The WMMA.Fragment corresponding to the matrix $A$.
b: The WMMA.Fragment corresponding to the matrix $B$.
c: The WMMA.Fragment corresponding to the matrix $C$.
conf: The WMMA.Config that should be used in this WMMA operation.

Warning

All threads in a warp MUST execute the mma operation in lockstep, and have to use exactly the same arguments. Failure to do so will result in undefined behaviour.
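
When checking results, a CPU reference for $D = A \cdot B + C$ can be handy. The sketch below assumes Float16 inputs with a Float32 accumulator, chosen purely for illustration; `mma_reference` is a hypothetical helper, not CUDA.jl API, and rounding on the GPU may differ from this host computation.

```julia
# CPU reference for the multiply-accumulate; hypothetical helper for testing.
function mma_reference(A::Matrix{Float16}, B::Matrix{Float16}, C::Matrix{Float32})
    # Promote the inputs so that products and sums are accumulated in Float32.
    return Float32.(A) * Float32.(B) + C
end

M, N, K = 16, 16, 16
# Small integers are exactly representable in Float16, so this check is exact.
A = Float16.(rand(-2:2, M, K))
B = Float16.(rand(-2:2, K, N))
C = Float32.(rand(-2:2, M, N))

D = mma_reference(A, B, C)
```

For general floating-point inputs, compare against the GPU result with a tolerance rather than exact equality.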

CUDA.WMMA.store_d — Function

WMMA.store_d(addr, d, stride, layout, config)

Store the result matrix d to the memory location indicated by addr.

Arguments

addr: The address to store the matrix to.
d: The WMMA.Fragment corresponding to the d matrix.
stride: The leading dimension of the matrix pointed to by addr, specified in number of elements.
layout: The storage layout of the matrix. Possible values are WMMA.RowMajor and WMMA.ColMajor.
config: The WMMA configuration that should be used for storing this matrix. See WMMA.Config.

Warning

All threads in a warp MUST execute the store operation in lockstep, and have to use exactly the same arguments. Failure to do so will result in undefined behaviour.
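
The effect of the layout argument can be illustrated on the CPU. In this sketch, `write_tile!` is a hypothetical helper (not CUDA.jl API) that writes a matrix into a flat buffer in either layout, using symbols `:col` and `:row` as stand-ins for WMMA.ColMajor and WMMA.RowMajor.

```julia
# CPU-only illustration; `write_tile!` is a hypothetical helper.
function write_tile!(buf::AbstractVector, D::AbstractMatrix, stride::Int, layout::Symbol)
    m, n = size(D)
    for j in 1:n, i in 1:m
        if layout === :col
            offset = (i - 1) + (j - 1) * stride  # columns are contiguous
        else
            offset = (j - 1) + (i - 1) * stride  # rows are contiguous
        end
        buf[1 + offset] = D[i, j]
    end
    return buf
end

D = reshape(collect(1:16), 4, 4)
col_buf = write_tile!(zeros(Int, 16), D, 4, :col)
row_buf = write_tile!(zeros(Int, 16), D, 4, :row)

# Julia arrays are column major, so reinterpreting each buffer as a matrix gives:
same = reshape(col_buf, 4, 4) == D                     # the original matrix
transposed = reshape(row_buf, 4, 4) == permutedims(D)  # its transpose
```

Both `same` and `transposed` are true here, which is why data stored in row-major layout looks transposed when read back as an ordinary Julia array.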