WMMA

This section details CUDAnative's interface to CUDA's warp matrix multiply-accumulate (WMMA) operations. This interface enables programmatic access to Tensor Cores, a hardware feature introduced with the Volta architecture that performs mixed-precision matrix multiply-accumulate operations.

Access to WMMA using CUDAnative is available on two levels: low-level wrappers around the LLVM intrinsics, and a higher-level API similar to that of CUDA C.

Note that to use the WMMA intrinsics, you need a sufficiently recent version of Julia: v1.4.0-DEV.666 or later. You can check this by running the following in the REPL:

VERSION >= v"1.4.0-DEV.666"
Note

If you're running into any of the following errors while using the WMMA interfaces:

LLVM error: Do not know how to split the result of this operator!

or

CUDA error: a PTX JIT compilation failed (code 218, ERROR_INVALID_PTX)
ptxas application ptx input, line <line>; error   : .aligned modifier required for instruction '<instr>'

then make sure you are running Julia v1.4.0-DEV.666 or later!

Note

For optimal performance, you should use Julia v1.5.0-DEV.324 or later.

Terminology

The WMMA operations perform a matrix multiply-accumulate. More concretely, they calculate $D = A \cdot B + C$, where $A$ is an $M \times K$ matrix, $B$ is a $K \times N$ matrix, and $C$ and $D$ are $M \times N$ matrices.

Note that not all values of $M$, $N$ and $K$ are allowed. The tuple $(M, N, K)$ is often called the "shape" of the multiply-accumulate operation. For example, the examples in this section all use the $(16, 16, 16)$ shape with Float16 inputs.

The multiply-accumulate consists of the following steps:

  • Load the matrices $A$, $B$ and $C$ from memory to registers using a WMMA load operation.
  • Perform the matrix multiply-accumulate of $A$, $B$ and $C$ to obtain $D$ using a WMMA MMA operation. $D$ is stored in hardware registers after this step.
  • Store the result $D$ back to memory using a WMMA store operation.

Note that WMMA is a warp-wide operation, which means that all threads in a warp must cooperate and execute the WMMA operations in lockstep. Failure to do so will result in undefined behaviour. In practice, this means launching kernels with (a multiple of) 32 threads, as the examples below do.

Each thread in a warp will hold a part of the matrix in its registers. In WMMA parlance, this part is referred to as a "fragment". Note that the exact mapping between matrix elements and fragments is unspecified, and subject to change in future versions.

Finally, it is important to note that the resultant $D$ matrix can be used as a $C$ matrix for a subsequent multiply-accumulate. This is useful if one needs to calculate a sum of the form $\sum_{i=0}^{n} A_i B_i$, where $A_i$ and $B_i$ are matrices of the correct dimensions.
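
As an illustration of this pattern, here is a minimal sketch using the CUDA C-like API described later in this section. It assumes each 16×16 tile is stored contiguously in column-major order; the kernel and the names a_ptr, b_ptr, d_ptr and num_tiles are hypothetical and only serve to illustrate the dataflow.

function sum_of_products_kernel(a_ptr, b_ptr, d_ptr, num_tiles)
    conf = WMMA.Config{16, 16, 16, Float32}

    # Start from an all-zero accumulator fragment
    acc = WMMA.fill_c(Float32(0), conf)

    for i in 0:num_tiles - 1
        # Each 16x16 Float16 tile occupies 16 * 16 * sizeof(Float16) bytes
        a_frag = WMMA.load_a(a_ptr + i * 16 * 16 * sizeof(Float16), 16, WMMA.ColMajor, conf)
        b_frag = WMMA.load_b(b_ptr + i * 16 * 16 * sizeof(Float16), 16, WMMA.ColMajor, conf)

        # Feed the previous result back in as the C matrix of the next MMA
        acc = WMMA.mma(a_frag, b_frag, acc, conf)
    end

    WMMA.store_d(d_ptr, acc, 16, WMMA.ColMajor, conf)
    return
end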

LLVM Intrinsics

The LLVM intrinsics are accessible by using the one-to-one Julia wrappers. The return type of each wrapper is the Julia type that corresponds most closely to the return type of the LLVM intrinsic. For example, LLVM's [8 x <2 x half>] becomes NTuple{8, NTuple{2, VecElement{Float16}}} in Julia. In essence, these wrappers return the SSA values returned by the LLVM intrinsic. Currently, all intrinsics that are available in LLVM 6, PTX 6.0 and SM 70 are implemented.
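
To make this mapping concrete, the fragment type of the Float16 load intrinsics can be spelled out in Julia as follows (the alias Float16Fragment is purely illustrative and not part of the API):

# Julia counterpart of LLVM's [8 x <2 x half>]
const Float16Fragment = NTuple{8, NTuple{2, VecElement{Float16}}}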

These LLVM intrinsics are then lowered to the correct PTX instructions by the LLVM NVPTX backend. For more information about the PTX instructions, please refer to the PTX Instruction Set Architecture Manual.

The LLVM intrinsics are subdivided into three categories: load, store and multiply-accumulate, each of which is discussed below. The wrapper names encode the operation and its parameters: for example, WMMA.llvm_wmma_load_a_col_m16n16k16_stride_f16 loads the $A$ matrix in column-major (col) layout for the $(16, 16, 16)$ shape, using a stride argument and Float16 (f16) elements.

Load matrix

CUDAnative.WMMA.llvm_wmma_load

Perform multiply-accumulate

CUDAnative.WMMA.llvm_wmma_mma

Store matrix

CUDAnative.WMMA.llvm_wmma_store

Example

using Test

using CUDAnative
const CuArray = CUDAnative.CuHostArray  # real applications: use CuArrays.jl

# Generate input matrices
a     = rand(Float16, (16, 16))
a_dev = CuArray(a)
b     = rand(Float16, (16, 16))
b_dev = CuArray(b)
c     = rand(Float32, (16, 16))
c_dev = CuArray(c)

# Allocate space for result
d_dev = similar(c_dev)

# Matrix multiply-accumulate kernel (D = A * B + C)
function kernel(a_dev, b_dev, c_dev, d_dev)
    # Load the A, B and C matrices from memory into fragments (column-major, stride 16)
    a_frag = WMMA.llvm_wmma_load_a_col_m16n16k16_stride_f16(pointer(a_dev), 16)
    b_frag = WMMA.llvm_wmma_load_b_col_m16n16k16_stride_f16(pointer(b_dev), 16)
    c_frag = WMMA.llvm_wmma_load_c_col_m16n16k16_stride_f32(pointer(c_dev), 16)

    # Perform the matrix multiply-accumulate: D = A * B + C
    d_frag = WMMA.llvm_wmma_mma_col_col_m16n16k16_f32_f32(a_frag, b_frag, c_frag)

    # Store the result D back to memory
    WMMA.llvm_wmma_store_d_col_m16n16k16_stride_f32(pointer(d_dev), d_frag, 16)
    return
end

@cuda threads=32 kernel(a_dev, b_dev, c_dev, d_dev)
@test all(isapprox.(a * b + c, Array(d_dev); rtol=0.01))

CUDA C-like API

The main difference between the CUDA C-like API and the lower-level wrappers is that the former enforces several constraints when working with WMMA. For example, it ensures that the $A$ fragment argument to the MMA instruction was obtained by a load_a call, and not by a load_b or load_c. Additionally, it makes sure that the data type and storage layout of the load/store operations and the MMA operation match.

The CUDA C-like API heavily uses Julia's dispatch mechanism. As such, the method names are much shorter than those of the LLVM intrinsic wrappers, as most information is baked into the types of the arguments rather than the method name.

Note that, in CUDA C++, the fragment is responsible for both the storage of intermediate results and the WMMA configuration. All CUDA C++ WMMA calls are function templates that take the resultant fragment as a by-reference argument. As a result, the type of this argument can be used during overload resolution to select the correct WMMA instruction to call.

In contrast, the API in Julia separates the WMMA storage (WMMA.Fragment) and configuration (WMMA.Config). Instead of taking the resultant fragment by reference, the Julia functions simply return it. This makes the dataflow clearer, but it also means that the type of that fragment cannot be used to select the correct WMMA instruction. There is thus a limited amount of information that cannot be inferred from the argument types, but must nonetheless match across all WMMA operations, such as the overall shape of the MMA. This is handled by a separate "WMMA configuration" (see WMMA.Config) that you create once, and then pass as an argument to all intrinsics.
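
For example, the configuration used in the example below, for the $(16, 16, 16)$ shape with Float32 accumulation, is created as follows:

conf = WMMA.Config{16, 16, 16, Float32}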

Fragment

CUDAnative.WMMA.FragmentLayout
CUDAnative.WMMA.RowMajor
CUDAnative.WMMA.ColMajor
CUDAnative.WMMA.Unspecified
CUDAnative.WMMA.Fragment

WMMA configuration

CUDAnative.WMMA.Config

Load matrix

CUDAnative.WMMA.load_a
CUDAnative.WMMA.load_b
CUDAnative.WMMA.load_c

Perform multiply-accumulate

CUDAnative.WMMA.mma

Store matrix

CUDAnative.WMMA.store_d

Fill fragment

CUDAnative.WMMA.fill_c

Element access and broadcasting

Similar to the CUDA C++ WMMA API, WMMA.Fragments have an x member that can be used to access individual elements. Note that, in contrast to the values returned by the LLVM intrinsics, the x member is flattened. For example, while the Float16 variants of the load_a intrinsics return NTuple{8, NTuple{2, VecElement{Float16}}}, the x member has type NTuple{16, Float16}.
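
For example (a sketch; a_frag obtained from WMMA.load_a as in the example below), a thread can read the first element it holds of a fragment like this:

# First element of this thread's part of the fragment
element = a_frag.x[1]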

Typically, you will only need to access the x member to perform elementwise operations. This can be more succinctly expressed using Julia's broadcast mechanism. For example, to double each element in a fragment, you can simply use:

frag = 2.0f0 .* frag

Example

using Test

using CUDAnative
const CuArray = CUDAnative.CuHostArray  # real applications: use CuArrays.jl

a     = rand(Float16, (16, 16))
b     = rand(Float16, (16, 16))
c     = rand(Float32, (16, 16))

a_dev = CuArray(a)
b_dev = CuArray(b)
c_dev = CuArray(c)
d_dev = similar(c_dev)

function kernel(a_dev, b_dev, c_dev, d_dev)
    # One configuration, shared by all WMMA operations in this kernel
    conf = WMMA.Config{16, 16, 16, Float32}

    # Load the A, B and C matrices (column-major, stride 16)
    a_frag = WMMA.load_a(pointer(a_dev), 16, WMMA.ColMajor, conf)
    b_frag = WMMA.load_b(pointer(b_dev), 16, WMMA.ColMajor, conf)
    c_frag = WMMA.load_c(pointer(c_dev), 16, WMMA.ColMajor, conf)

    # Elementwise scaling of the fragment using broadcast
    c_frag = 0.5f0 .* c_frag

    # D = A * B + 0.5 * C
    d_frag = WMMA.mma(a_frag, b_frag, c_frag, conf)

    # Store the result D back to memory
    WMMA.store_d(pointer(d_dev), d_frag, 16, WMMA.ColMajor, conf)

    return
end

@cuda threads=32 kernel(a_dev, b_dev, c_dev, d_dev)
d = Array(d_dev)

@test all(isapprox.(a * b + 0.5 * c, d; rtol=0.01))