# CUSOLVERRF.jl

This package is a thin Julia wrapper for cusolverRF It extends CUSOLVER.jl, and is compatible with the rest of the Julia CUDA ecosystem.

## What is cusolverRF?

cusolverRF is a package to perform fast sparse refactorization on CUDA GPU. The LU factorization of a sparse matrix usually happens in two steps.

- During the symbolic factorization, the matrix is reordered to minimize the fill-in, and the sparsity patterns of the
`L`

and`U`

factor are computed. - During the numerical factorization, the coefficients of the factors
`L`

and`U`

are computed.

The first step is challenging to compute on the GPU, as most sparse matrices have unstructured sparsity pattern. cusolver implements a default LU factorization, but it transfers the sparse matrix on the host under the hood to perform the symbolic factorization. This impairs the performance when the same matrix has to be factorized multiple times.

`cusolverRF`

uses a different approach: it computes the symbolic factorization on the
CPU and then transfers it on the device. If the coefficient of the sparse matrix
are updated **without affecting the sparsity pattern of the matrix**, then there
is no need to recompute the symbolic factorization and we can compute the
numerical factorization entirely on the device, **without data transfer between
the host and the device**.

Hence, `cusolverRF`

is very efficient when the same sparse matrix has to be factorized
multiple times. The package is also efficient to solve a sparse linear system
with multiple right-hand-side entirely on the GPU.

## How to use cusolverRF?

Suppose we have a sparse matrix `dA`

instantiated on the GPU.

using SparseArrays, CUDA, LinearAlgebra
using CUDA.CUSPARSE
# Generate a random example
n = 10
A = sprand(n, n, .2) + I
# Move A to the GPU
dA = CuSparseMatrixCSR(A)

Computing the LU factorization of the sparse matrix `dA`

with `cusolverRF`

simply amount to

using CUSOLVERRF
rf = CUSOLVERRF.RFLU(dA; symbolic=:RF)

In this step, the matrix is moved back to the host to compute the
symbolic factorization. The factors `L`

and `U`

are then deported on
the device, and can be accessed as a matrix `M = L + U`

rf.M

The symbolic factorization is computed by default (`symbolic=:RF`

) using `cusolver`

internal
routines, which can be inefficient.
As an alternative, CUSOLVERRF.jl allows to compute the symbolic factorization
with KLU (`symbolic=:KLU`

). However, this second option is still experimental.

Then, computing the solution of the linear system $Ax = b$ translates to

b = rand(n) # random RHS
db = CuVector(b) # move RHS to the device
ldiv!(rf, db) # compute solution inplace

Suppose now we change the coefficients of the matrix `dA`

, **without
modifying its sparsity pattern**. Then, the new system can be
solved entirely on the device as:

# update coefficients (use whichever update you want here)
dA.nzVal .*= 2.0
# update factorization on the device
lu!(rf, dA)
# resolve Ax = b
copyto!(db, b)
ldiv!(rf, b)

## How to solve a system with multiple right-hand-side?

Any `RFLU`

instance stores different buffers to avoid unnecessary
allocations when calling `lu!`

and `ldiv!`

. To solve a system
with `k`

multiple right-hand-side $AX=B$, one has
to instantiate a `RFLU`

with the proper buffers, as:

k = 64
rf_batched = CUSOLVERRF.RFLU(dA; nrhs=k)

Then, solving the system $AX=B$ on the GPU simply amounts to

B = rand(n, k)
dX = CuMatrix(B)
# Compute solution inplace
ldiv!(rf_batched, dX)

## How to use the low-level wrapper?

Someone, one prefers to have full control on the factorization,
without any boilerplate code. CUSOLVERRF provides a direct interface
to `cusolverRF`

for this purpose. In that case, a `cusolverRF`

instance
is instantiated as

rf_lowlevel = CUSOLVERRF.RFLowLevel(
dA; # matrix to factorize
fast_mode=true, # fast mode is recommended
ordering=:AMD, # :AMD, :MDQ, :METIS or :RCM
check=true,
factorization_algo=CUSOLVERRF.CUSOLVERRF_FACTORIZATION_ALG0,
triangular_algo=CUSOLVERRF.CUSOLVERRF_TRIANGULAR_SOLVE_ALG1,
)

Once the matrix factorized, solving the linear system $Ax =b$ translates to

b = rand(n)
dx = CuVector(b)
CUSOLVERRF.rf_solve!(rf_lowlevel, dx)

And refactorizing the matrix inplace on the device:

dA.nzVal .*= 2.0
CUSOLVERRF.rf_refactor!(rf_lowlevel, dA)

## Funding

This package was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S. Department of Energy’s Office of Science and National Nuclear Security Administration, responsible for delivering a capable exascale ecosystem, including software, applications, and hardware technology, to support the nation’s exascale computing imperative.