MultiPrecisionArrays.jl v0.0.7

C. T. Kelley

MultiPrecisionArrays.jl [1] is a package for iterative refinement.

These docs are enough to get you started. The complete version with a better account of the theory is [2].

This package uses SIAMFANLEquations.jl [3], the solver package associated with a book [4] and suite of IJulia notebooks [5].

This package provides data structures and solvers for several variants of iterative refinement (IR). It will become much more useful when half precision (aka Float16) is fully supported in LAPACK/BLAS. For now, its only general-purpose application is classical iterative refinement with double precision equations and single precision factorizations.

The half precision stuff is good for those of us doing research in this field. Half precision performance has progressed to the point where you can actually get things done. On an Apple M2-Pro, a half precision LU only costs 3–5 times what a double precision LU costs. This may be as good as it gets unless someone wants to duplicate the LAPACK implementation and get the benefits from blocking, recursion, and clever cache management.

We use a hack-job LU factorization for half precision. Look at the source for hlu!.jl.

What is iterative refinement?

The idea is to solve $Ax=b$ in high precision using a factorization in lower precision.

IR is a perfect example of a storage/time tradeoff. To solve a linear system $Ax=b$ with IR, one incurs the storage penalty of making a low precision copy of $A$ and reaps the benefit of only having to factor the low precision copy.

Here is the textbook version [6] using the LU factorization.

IR(A, b)

  • Initialize: $x = 0$, $r = b$
  • Factor $A = LU$ in a lower precision
  • While $\| r \|$ is too large
    • Compute the defect $d = (LU)^{-1} r$
    • Correct the solution $x = x + d$
    • Update the residual $r = b - Ax$
  • end

In Julia, a code to do this would solve the linear system $A x = b$ in double precision by using a factorization in a lower precision, say single, within a residual correction iteration. This means that one would need to allocate storage for a copy of $A$ in the lower precision and factor that copy.

Then one has to determine what the line $d = (LU)^{-1} r$ means. Do you cast $r$ into the lower precision before the solve or not? MultiPrecisionArrays.jl provides data structures and solvers to manage this.

Here's a simple Julia function for IR that implements the textbook algorithm.

"""
IR(A,b)
Simple minded iterative refinement
Solve Ax=b
"""
function IR(A, b)
    x = zeros(length(b))
    r = copy(b)
    tol = 100.0 * eps(Float64)
    #
    # Allocate a single precision copy of A and factor in place
    #
    A32 = Float32.(A)
    AF = lu!(A32)
    #
    # Give IR at most ten iterations, which it should not need
    # in this case
    #
    itcount = 0
    # The while loop will get more attention later.
    while (norm(r) > tol * norm(b)) && (itcount < 10)
        #
        # Store r and d = AF\r in the same place.
        #
        ldiv!(AF, r)
        x .+= r
        r .= b - A * x
        itcount += 1
    end
    return x
end
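
As a quick sanity check, one might call this function on a small, well conditioned system. This is only a sketch; the test matrix below is made up for illustration and is not part of the package.

# Build a well conditioned random test problem (illustrative only).
N = 1000
A = I + 0.1 * randn(N, N) / sqrt(N)
xstar = ones(N)
b = A * xstar
x = IR(A, b)
println(norm(x - xstar, Inf))   # error should be at the level of double precision roundoff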

The MPArray structure contains both $A$, the low precision copy, and a vector for the residual. This lets you allocate the data in advance and reuse the structure for other right hand sides without rebuilding (or refactoring!) the low precision copy.

As written in the function, the defect computation uses ldiv! to compute AF\r. This means that the two triangular factors are stored in single precision and interprecision transfers are done with each step of the triangular solves. While that on-the-fly interprecision transfer is an option, and is needed in many situations, the default is to downcast $r$ to low precision, do the solve entirely in low precision, and then upcast the result. The code for that looks like

normr = norm(r)
ds = Float32.(r) / normr      # scale and downcast the residual
ldiv!(AF, ds)                 # triangular solves entirely in single precision
r .= Float64.(ds) * normr     # upcast the correction and undo the scaling

The scaling by 1.0/normr avoids underflow, which is most important when the low precision is Float16. We will discuss interprecision transfer costs later.
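
To see why the scaling matters, note that Float16 has a very narrow range, so residual entries that are harmless in double precision can underflow (or overflow) when downcast. A quick illustration at the REPL:

julia> Float16(1.0e-8)    # underflows to zero
Float16(0.0)

julia> Float16(70000.0)   # overflows past floatmax(Float16) = 65504
Inf16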

Integral Equations Example

The submodule MultiPrecisionArrays.Examples has an example which we will use for most of the documentation. The function Gmat(N) returns the trapezoid rule discretization of the Green's operator for $-d^2/dx^2$ on $[0,1]$ with homogeneous Dirichlet boundary conditions.

\[G u(x) = \int_0^1 g(x,y) u(y) \, dy \]

where

\[g(x,y) = \begin{cases} y\,(1-x), & x > y \\ x\,(1-y), & x \le y. \end{cases}\]

The eigenvalues of $G$ are $1/(n^2 \pi^2)$ for $n = 1, 2, \dots$.

The code for this is in the /src/Examples directory. The file is Gmat.jl.
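
For readers who want a self-contained version, here is a sketch of a trapezoid rule discretization of $G$ on the interior grid points. The function name greens_matrix is made up for illustration and is not necessarily identical to the implementation in Gmat.jl.

# Sketch of a trapezoid rule discretization of the Green's operator.
# The kernel vanishes at y = 0 and y = 1, so only interior nodes contribute.
function greens_matrix(N)
    h = 1.0 / (N + 1)
    x = collect(1:N) .* h                       # interior grid points
    g(x, y) = x > y ? y * (1.0 - x) : x * (1.0 - y)
    G = [g(x[i], x[j]) for i = 1:N, j = 1:N]
    return h .* G                               # trapezoid rule weight for interior nodes
end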

In the examples we will build the matrix $A = I - \alpha G$ and use two values of $\alpha$: $\alpha=1.0$, a very well conditioned case, and $\alpha=800.0$, which is very near singularity (since the eigenvalues of $G$ are $1/(n^2 \pi^2)$ and $81 \pi^2 \approx 799.4$, $A$ then has an eigenvalue of roughly $-7 \times 10^{-4}$).

We will solve a linear system with both double precision $LU$ and an MPArray and compare execution time and the quality of the results.

The example below compares the cost of a double precision factorization to an MPArray factorization. The MPArray structure has a high precision and a low precision matrix. The structure we will start with is

struct MPArray{TH<:AbstractFloat,TL<:AbstractFloat}
    AH::Array{TH,2}
    AL::Array{TL,2}
    residual::Vector{TH}
    onthefly::Bool
end

The structure also stores the residual. The onthefly Boolean tells the solver how to do the interprecision transfers. The easy way to get started is to use the mplu command directly on the matrix. That will build the MPArray, follow that with the factorization, and put it all in a structure that you can use with \.

Now we will see how the results look. In this example we compare the result from iterative refinement with the one from A\b, which uses LAPACK's LU. As you can see, the results are equally good. Note that the factorization object MPF is the output of mplu; this is analogous to AF=lu(A) in LAPACK.

julia> using MultiPrecisionArrays

julia> using MultiPrecisionArrays.Examples

julia> using BenchmarkTools

julia> using LinearAlgebra

julia> N=4096;

julia> G=Gmat(N);

julia> A = I - G;

julia> x = ones(N); b = A * x;

julia> MPF=mplu(A); AF=lu(A);

julia> z=MPF\b; w=AF\b;

julia> ze=norm(z-x,Inf); zr=norm(b-A*z,Inf)/norm(b,Inf);

julia> we=norm(w-x,Inf); wr=norm(b-A*w,Inf)/norm(b,Inf);

julia> println("Errors: $ze, $we. Residuals: $zr, $wr")
Errors: 8.88178e-16, 7.41629e-14. Residuals: 1.33243e-15, 7.40609e-14

So the results are equally good.

The compute time for mplu should be roughly half that of lu.

julia> @belapsed mplu($A)
8.55328e-02

julia> @belapsed lu($A)
1.49645e-01

So the single precision factorization is roughly half the cost of the double precision one. Now for the solves. Both lu and mplu produce a factorization object and \ works with both.
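
To compare the solve costs, one could time the two factorization objects in the same way. This is only a sketch; the output is omitted because it is machine dependent, and note that MPF\b runs the entire IR loop, not just the triangular solves.

julia> @belapsed $MPF\$b;

julia> @belapsed $AF\$b;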

A few subtleties in the example

Here is the source for mplu

"""
mplu(A::Array{Float64,2}; TL=Float32, onthefly=false)

Combines the constructor of the multiprecision array with the
factorization.
"""
function mplu(A::Array{TH,2}; TL=Float32, onthefly=nothing) where TH <: Real
#
# If the high precision matrix is single, the low precision must be half.
#
(TH == Float32) && (TL = Float16)
#
# Unless you tell me otherwise, onthefly is true if low precision is half
# and false if low precision is single.
#
(onthefly == nothing ) && (onthefly = (TL==Float16))
MPA=MPArray(A; TL=TL, onthefly=onthefly)
MPF=mplu!(MPA)
return MPF
end

The function mplu has two keyword arguments. The easy one to understand is TL, which is the precision of the factorization. Julia has support for single (Float32) and half (Float16) precisions. If you set TL=Float16 then low precision will be half. Don't do that unless you know what you're doing. Using half precision is a fast way to get incorrect results. Look at the section on half precision in this Readme for a bit more bad news.

The other keyword argument is onthefly. That keyword controls how the triangular solvers from the factorization work. When you solve

\[LU d = r\]

the LU factors are in low precision and the residual $r$ is in high precision. If you let Julia and LAPACK figure out what to do, then the solves will be done in high precision and the entries in the LU factors will be converted to high precision with each binary operation. The output $d$ will be in high precision. This is called on-the-fly interprecision transfer, and onthefly = true will tell the solvers to do it that way. You have $N^2$ interprecision transfers with each solve and, as we will see, that can have a non-trivial cost.

When low precision is Float32, the default is (onthefly = false). This converts $r$ to low precision, does the solve entirely in low precision, and then promotes $d$ to high precision. You need to be careful to avoid overflow and, more importantly, underflow when you do that, so we scale $r$ to be a unit vector before converting to low precision and reverse the scaling when we promote $d$. We take care of this for you.
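
As a concrete illustration of the two keyword arguments (a sketch; A is the matrix from the example above):

julia> MPF16 = mplu(A; TL=Float16);    # half precision factors; on-the-fly transfers by default

julia> MPF32 = mplu(A; onthefly=true); # single precision factors, solves done in double precision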

mplu calls the constructor for the multiprecision array and then factors the low precision matrix. In some cases, such as nonlinear solvers, you will want to separate the constructor and the factorization. When you do that, remember that mplu! overwrites the low precision copy of A with the factors, so you can't reuse the multiprecision array for other problems unless you restore the low precision copy.
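
Here is a sketch of that separation, using the MPArray constructor call that appears inside mplu above:

MPA = MPArray(A; TL=Float32, onthefly=false)  # allocate the high and low precision copies
MPF = mplu!(MPA)                              # factor; this overwrites the low precision copy
z = MPF \ b                                   # solve as before
# Reusing MPA for a different matrix would require restoring the low precision copy first.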