Nvidia only provides batched versions of the BLAS gemv, gemm, and trsm functions. Further, they only support floating point, and alpha and beta can only be scalars. BatchedBLAS.jl extends support for batched arrays by (currently) providing batched versions of dot, gemv, symv, spmv, ger, syr, and spr that work with arrays of AbstractFloats and Integers, and scaling coefficients which can be scalars or Vectors.

In addition to the type flexibility, there is a performance benefit for symmetric and packed symmetric matrices, where execution times for syr and spr are faster than the equivalent batched gemm. Benchmarks on an A100 follow. The dashed lines are for the transposed version of gemv and the upper-triangle versions of all other functions. Lower numbers are better.

benchmarks

Example usage:

julia> using CUDA, SymmetricFormats, BatchedBLAS

julia> L = 4  # the matrix size
4

julia> N = 6  # the batch dimension
6

julia> A = reshape(1:L*L*N, L, L, N)
4×4×6 reshape(::UnitRange{Int64}, 4, 4, 6) with eltype Int64:
[:, :, 1] =
 1  5   9  13
 2  6  10  14
 3  7  11  15
 4  8  12  16

[:, :, 2] =
 17  21  25  29
 18  22  26  30
 19  23  27  31
 20  24  28  32

[:, :, 3] =
 33  37  41  45
 34  38  42  46
 35  39  43  47
 36  40  44  48

[:, :, 4] =
 49  53  57  61
 50  54  58  62
 51  55  59  63
 52  56  60  64

[:, :, 5] =
 65  69  73  77
 66  70  74  78
 67  71  75  79
 68  72  76  80

[:, :, 6] =
 81  85  89  93
 82  86  90  94
 83  87  91  95
 84  88  92  96

julia> SP = CuArray{Float64}(undef, packedsize(A), N);

julia> for i in 1:N
           SP[:,i] = SymmetricPacked(view(A,:,:,i), :U).tri
       end

julia> SP
10×6 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:
  1.0  17.0  33.0  49.0  65.0  81.0
  5.0  21.0  37.0  53.0  69.0  85.0
  6.0  22.0  38.0  54.0  70.0  86.0
  9.0  25.0  41.0  57.0  73.0  89.0
 10.0  26.0  42.0  58.0  74.0  90.0
 11.0  27.0  43.0  59.0  75.0  91.0
 13.0  29.0  45.0  61.0  77.0  93.0
 14.0  30.0  46.0  62.0  78.0  94.0
 15.0  31.0  47.0  63.0  79.0  95.0
 16.0  32.0  48.0  64.0  80.0  96.0

julia> x = CuArray{Float64}(reshape(1:L*N, L, N))
4×6 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:
 1.0  5.0   9.0  13.0  17.0  21.0
 2.0  6.0  10.0  14.0  18.0  22.0
 3.0  7.0  11.0  15.0  19.0  23.0
 4.0  8.0  12.0  16.0  20.0  24.0

julia> batched_spr!('U', 1.0, x, SP)

julia> SP
10×6 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:
  2.0  42.0  114.0  218.0  354.0  522.0
  7.0  51.0  127.0  235.0  375.0  547.0
 10.0  58.0  138.0  250.0  394.0  570.0
 12.0  60.0  140.0  252.0  396.0  572.0
 16.0  68.0  152.0  268.0  416.0  596.0
 20.0  76.0  164.0  284.0  436.0  620.0
 17.0  69.0  153.0  269.0  417.0  597.0
 22.0  78.0  166.0  286.0  438.0  622.0
 27.0  87.0  179.0  303.0  459.0  647.0
 32.0  96.0  192.0  320.0  480.0  672.0

julia> SymmetricPacked(SP[:,1])
4×4 SymmetricPacked{Float64, CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}}:
  2.0   7.0  12.0  17.0
  7.0  10.0  16.0  22.0
 12.0  16.0  20.0  27.0
 17.0  22.0  27.0  32.0

julia> SymmetricPacked(A[:,:,1] .+ Array(x[:,1]*transpose(x[:,1])))
4×4 SymmetricPacked{Float64, Matrix{Float64}}:
  2.0   7.0  12.0  17.0
  7.0  10.0  16.0  22.0
 12.0  16.0  20.0  27.0
 17.0  22.0  27.0  32.0