Code Optimization for Small Nonlinear Systems in Julia

General Code Optimization in Julia

Before starting this tutorial, we recommend that the reader check out one of the many tutorials on optimizing Julia code, such as the Performance Tips section of the Julia manual.

User-side optimizations are important because, for sufficiently difficult problems, most of the time will be spent inside your f, the function you are trying to solve. “Efficient solvers” are those that reduce the number of f calls required to hit the error tolerance. The main ideas for optimizing your nonlinear solver code, or any Julia function, are the following:

  • Make it non-allocating
  • Use StaticArrays for small arrays
  • Use broadcast fusion
  • Make it type-stable
  • Reduce redundant calculations
  • Make use of BLAS calls
  • Optimize algorithm choice

We'll discuss these strategies in the context of nonlinear solvers. Let's start with small systems.
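As a quick illustration of the type-stability point before we dive in, here is a toy example (the names `unstable` and `stable` are ours, not from any library):

```julia
# Type-unstable: for a Float64 input the branch can return either
# a Float64 (x) or an Int (0), so the compiler cannot infer a single
# concrete return type.
unstable(x) = x > 0 ? x : 0

# Type-stable: zero(x) matches the input's type, so the return type
# is always the same as the input type.
stable(x) = x > 0 ? x : zero(x)

# `@code_warntype unstable(1.0)` highlights the Union{Float64, Int64}
# return type; `@code_warntype stable(1.0)` shows a clean Float64.
```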

Optimizing Nonlinear Solver Code for Small Systems

Take for example a prototypical small nonlinear solver code in its out-of-place form:

using NonlinearSolve

function f(u, p)
    u .* u .- p
end
u0 = [1.0, 1.0]
p = 2.0
prob = NonlinearProblem(f, u0, p)
sol = solve(prob, NewtonRaphson())
retcode: Success
u: 2-element Vector{Float64}:
 1.4142135623730951
 1.4142135623730951

We can use BenchmarkTools.jl to get more precise timings:

using BenchmarkTools

@benchmark solve(prob, NewtonRaphson())
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  16.889 μs … 83.924 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     17.898 μs               GC (median):    0.00%
 Time  (mean ± σ):   18.272 μs ±  1.916 μs   GC (mean ± σ):  0.00% ± 0.00%

    ▁▇█                                                     
  ▂▄███▆▅▄▄▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▂▁▁▂▂▁▂▁▂▁▂▂▂▂▂ ▃
  16.9 μs         Histogram: frequency by time        27.9 μs <

 Memory estimate: 13.38 KiB, allocs estimate: 249.

Note that this way of writing the function is a shorthand for:

function f(u, p)
    [u[1] * u[1] - p, u[2] * u[2] - p]
end
f (generic function with 1 method)

where the function f returns an array. This is a common pattern from things like MATLAB's fzero or SciPy's scipy.optimize.fsolve. However, by design it's very slow. In the benchmark you can see that there are many allocations. These allocations can cause the memory allocator to take more time than the actual numerics, which is one of the reasons why codes translated from MATLAB and Python end up slow.
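We can see the per-call allocation directly with `@allocated` (a small check of our own, separate from the benchmarks above):

```julia
# Out-of-place form: every call builds a fresh Vector on the heap.
f_oop(u, p) = [u[1] * u[1] - p, u[2] * u[2] - p]

function count_oop_allocs()
    u = [1.0, 1.0]
    f_oop(u, 2.0)               # warm up so compilation isn't measured
    @allocated f_oop(u, 2.0)    # bytes allocated by one call
end

count_oop_allocs()              # nonzero: a new array per call
```

The solver calls f many times per solve, so these per-call allocations add up quickly.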

In order to avoid this issue, we can use a non-allocating "in-place" approach. Written out by hand, this looks like:

function f(du, u, p)
    du[1] = u[1] * u[1] - p
    du[2] = u[2] * u[2] - p
    nothing
end

prob = NonlinearProblem(f, u0, p)
@benchmark sol = solve(prob, NewtonRaphson())
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  16.789 μs … 82.410 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     18.270 μs               GC (median):    0.00%
 Time  (mean ± σ):   19.008 μs ±  2.679 μs   GC (mean ± σ):  0.00% ± 0.00%

   ▅█▄                                                         
  ▄████▅▄▆▆▄▃▃▄▄▄▄▃▂▃▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  16.8 μs         Histogram: frequency by time        29.5 μs <

 Memory estimate: 12.59 KiB, allocs estimate: 243.

Notice that the allocation count dropped: the solver no longer allocates the output of f on every call (the remaining allocations come from the solver internals). We can make this code even simpler by using .= in-place broadcasting.

function f(du, u, p)
    du .= u .* u .- p
end

@benchmark sol = solve(prob, NewtonRaphson())
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  16.823 μs … 86.488 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     19.203 μs               GC (median):    0.00%
 Time  (mean ± σ):   19.858 μs ±  3.081 μs   GC (mean ± σ):  0.00% ± 0.00%

   ▃▆▅▁ ▁▇█▆▂▁                                              
  ▄████▇███████▇▅▅▅▄▃▃▃▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  16.8 μs         Histogram: frequency by time        31.5 μs <

 Memory estimate: 12.59 KiB, allocs estimate: 243.
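To confirm that the residual function itself is now allocation-free (independent of the solver's own allocations), a quick check of our own:

```julia
# In-place form: writes into a preallocated buffer, no heap allocation.
f!(du, u, p) = (du .= u .* u .- p; nothing)

function count_inplace_allocs()
    u = [1.0, 1.0]
    du = similar(u)
    f!(du, u, 2.0)              # warm up so compilation isn't measured
    @allocated f!(du, u, 2.0)   # bytes allocated by one call
end

count_inplace_allocs()          # 0 bytes once compiled
```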

Further Optimizations for Small Nonlinear Solves with Static Arrays and SimpleNonlinearSolve

Allocations are only expensive if they are “heap allocations”. For a more in-depth definition of heap allocations, there are many sources online. But a good working definition is that heap allocations are variable-sized slabs of memory which have to be pointed to, and this pointer indirection costs time. Additionally, the heap has to be managed, and the garbage collector has to actively keep track of what's on the heap.

However, there's an alternative to heap allocations, known as stack allocations. The stack is statically-sized (known at compile time) and thus its accesses are quick. Additionally, the exact block of memory is known in advance by the compiler, and thus re-using the memory is cheap. This means that allocating on the stack has essentially no cost!
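A tuple vs. Vector comparison makes the difference concrete (our own toy check):

```julia
# Tuples of bits types live on the stack; Vectors live on the heap.
function stack_vs_heap()
    t = @allocated (1.0, 2.0)   # stack-allocated tuple: 0 bytes
    v = @allocated [1.0, 2.0]   # heap-allocated Vector: nonzero
    (t, v)
end

stack_vs_heap()
```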

Arrays have to be heap-allocated because their size (and thus the amount of memory they take up) is determined at runtime. But there are structures in Julia which are stack-allocated. Immutable structs, for example, are stack-allocated “value types”. Tuples are a stack-allocated collection. The most useful data structure for NonlinearSolve, though, is the StaticArray from the package StaticArrays.jl. These arrays have their length determined at compile time. They are created using macros attached to normal array expressions, for example:

using StaticArrays
A = SA[2.0, 3.0, 5.0]
typeof(A)
SVector{3, Float64} (alias for StaticArraysCore.SArray{Tuple{3}, Float64, 1, 3})

Notice that the 3 after SVector gives the size of the SVector. It cannot be changed. Additionally, SVectors are immutable, so we have to create a new SVector to change values. But remember, we don't have to worry about allocations because this data structure is stack-allocated. SArrays have numerous extra optimizations as well: they have fast matrix multiplication, fast QR factorizations, etc. which directly make use of the information about the size of the array. Thus, when possible, they should be used.
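Since SVectors are immutable, “modifying” one means constructing a new one, which is essentially free on the stack. A small sketch:

```julia
using StaticArrays

v = SA[1.0, 2.0]
# v[1] = 5.0              # would error: SVectors are immutable

w = SA[5.0, v[2]]         # "update" by constructing a new SVector
z = v .+ 1                # broadcasting also returns a new SVector

typeof(z)                 # SVector{2, Float64}
```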

Unfortunately, static arrays can only be used for sufficiently small arrays. Beyond a certain size, operations on them are forced to heap allocate anyway and their compile time balloons. Thus, static arrays shouldn't be used if your system has more than ~20 variables. Additionally, only the native Julia algorithms can fully utilize static arrays.

Let's optimize our nonlinear solve using static arrays. Note that in this case, we want to use the out-of-place allocating form, but this time we want to output a static array. Doing it with broadcasting looks like:

function f_SA(u, p)
    u .* u .- p
end
u0 = SA[1.0, 1.0]
p = 2.0
prob = NonlinearProblem(f_SA, u0, p)
@benchmark solve(prob, NewtonRaphson())
BenchmarkTools.Trial: 10000 samples with 3 evaluations.
 Range (min … max):   8.421 μs …  5.607 ms   GC (min … max): 0.00% … 99.44%
 Time  (median):      9.652 μs               GC (median):    0.00%
 Time  (mean ± σ):   10.428 μs ± 55.992 μs   GC (mean ± σ):  5.35% ±  0.99%

   ▂█▆▂                                                        
  ▃████▇▄▄▄▄▄▇▇▆▆▆▅▄▄▄▃▃▂▂▂▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  8.42 μs         Histogram: frequency by time        15.2 μs <

 Memory estimate: 9.80 KiB, allocs estimate: 163.

Note that the only change here is that u0 is made into a StaticArray! If we needed to write f out for a more complex nonlinear case, we'd simply do the following:

function f_SA(u, p)
    SA[u[1] * u[1] - p, u[2] * u[2] - p]
end

@benchmark solve(prob, NewtonRaphson())
BenchmarkTools.Trial: 10000 samples with 3 evaluations.
 Range (min … max):   8.349 μs …  7.043 ms   GC (min … max): 0.00% … 99.57%
 Time  (median):      9.743 μs               GC (median):    0.00%
 Time  (mean ± σ):   10.777 μs ± 70.353 μs   GC (mean ± σ):  6.51% ±  1.00%

   ▅█▂                                                         
  ▄███▅▄▄▄▇▇▇▆▅▅▄▃▃▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  8.35 μs         Histogram: frequency by time        18.1 μs <

 Memory estimate: 9.80 KiB, allocs estimate: 163.

However, notice that this only gave us a modest speedup, and the solve still allocates heavily. This is because many of the methods in NonlinearSolve.jl are designed to scale to larger problems, and thus aggressively do things like caching, which is good for large problems but not for these smaller problems and static arrays. In order to see the full benefit, we need to move to one of the methods from SimpleNonlinearSolve.jl, which are designed for these small-scale static examples. Let's now use SimpleNewtonRaphson:

@benchmark solve(prob, SimpleNewtonRaphson())
BenchmarkTools.Trial: 10000 samples with 870 evaluations.
 Range (min … max):  137.099 ns … 29.335 μs   GC (min … max): 0.00% … 99.13%
 Time  (median):     144.776 ns                GC (median):    0.00%
 Time  (mean ± σ):   169.705 ns ± 677.971 ns   GC (mean ± σ):  9.74% ±  2.43%

  ▄▇█▆▅▅▄▄▄▃▃▃▃▃▃▄▄▃▃▃▂▂▂▂▂▂▂▁▁                               ▂
  █████████████████████████████████████▇▇▇█▇▇▇█▇█▆▆▆▇▅▆▅▅▆▅▄▅ █
  137 ns        Histogram: log(frequency) by time        221 ns <

 Memory estimate: 128 bytes, allocs estimate: 2.

And there we go: around 145 ns, down from our starting point of almost 18 μs, a speedup of over 100x!
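Putting the pieces together, the fully optimized version of this tutorial's example is short:

```julia
using NonlinearSolve, StaticArrays

# Out-of-place residual that returns a stack-allocated SVector.
f_SA(u, p) = SA[u[1] * u[1] - p, u[2] * u[2] - p]

u0 = SA[1.0, 1.0]
p = 2.0
prob = NonlinearProblem(f_SA, u0, p)

# SimpleNewtonRaphson is the SimpleNonlinearSolve method suited to
# small, static problems.
sol = solve(prob, SimpleNewtonRaphson())
sol.u                     # ≈ [√2, √2]
```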