Study on Optimizers

When working with BFGS/LBFGS, the line search is an important aspect of the algorithm that strongly affects the convergence of the optimizer. In each iteration, the step size $\alpha$ along the search direction is required to satisfy the sufficient decrease (Armijo) condition

\[\phi(\alpha) \leq \phi(0) + c_1 \alpha \phi'(0)\]

Here $c_1=10^{-4}$, and $\phi(\alpha) = f(x_k + \alpha p_k)$ is the objective function restricted to the search direction $p_k$.
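
For concreteness, the sketch below shows a standard backtracking line search that enforces this condition with $c_1 = 10^{-4}$. This is an illustrative implementation rather than the exact line search used by the optimizers benchmarked in this post; the objective `f`, gradient `grad_f`, and the halving factor are placeholder choices.

```python
import numpy as np

def backtracking_line_search(f, grad_f, x, p, alpha0=1.0, c1=1e-4, shrink=0.5):
    """Return a step size alpha along direction p satisfying
    f(x + alpha p) <= f(x) + c1 * alpha * grad_f(x)^T p."""
    phi0 = f(x)
    dphi0 = grad_f(x) @ p          # phi'(0): gradient along the search direction
    alpha = alpha0
    while f(x + alpha * p) > phi0 + c1 * alpha * dphi0:
        alpha *= shrink            # shrink the step until sufficient decrease holds
    return alpha

# Usage on a simple quadratic, stepping along the negative gradient.
f = lambda x: 0.5 * np.sum(x**2)
g = lambda x: x
x = np.array([3.0, -4.0])
print(backtracking_line_search(f, g, x, -g(x)))
```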

One interesting observation is that if we apply BFGS/LBFGS directly, we usually cannot make any progress. There are several possible reasons for this phenomenon; for example, with the default initial Hessian approximation $B_0 = I$, the first search directions can be poorly scaled, and the line search may fail on the highly nonconvex loss landscape near the initialization.

First-order methods, such as the Adam optimizer, usually do not suffer from these difficulties. Therefore, one remedy is a "warm start": run a first-order optimizer (Adam) for a few iterations first. Additionally, for BFGS, we can build the Hessian approximation while the Adam optimizer is running. In this way, we make use of as much historical information as possible.

Our algorithm (BFGS+Adam+Hessian) is as follows; a code sketch is given after the two steps:

  1. Run the Adam optimizer for the first few iterations, and build the approximate Hessian matrix at the same time. That is, in each iteration, we update an approximate Hessian $B$ using

    $B_{k+1} = \left(I - \frac{s_k y_k^T}{y_k^Ts_k}\right)B_k\left(I - \frac{ y_ks_k^T}{y_k^Ts_k}\right) + \frac{s_ks_k^T}{y_k^Ts_k}$

    Here $y_k = \nabla f(x_{k+1}) - \nabla f(x_{k})$, $s_k = x_{k+1}-x_k$, and $B_0 = I$. Note that $B_k$ is not used by the Adam updates themselves; it is only accumulated for Step 2. (Strictly speaking, this recursion is the BFGS update for the inverse Hessian approximation, which is the form BFGS uses to compute search directions.)

  2. Run the BFGS optimizer and use the last Hessian matrix $B_k$ built in Step 1 as the initial guess.
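
The following Python sketch illustrates the two steps on a toy objective (the Rosenbrock function). The Adam hyperparameters, the number of warm-start steps, the curvature safeguard, and the use of SciPy's `hess_inv0` option (available in recent SciPy releases) are assumptions made for illustration; they are not the exact settings of the experiments reported below.

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

def adam_warm_start(f_grad, x0, steps=50, lr=1e-2, b1=0.9, b2=0.999, eps=1e-8):
    x = x0.copy()
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    H = np.eye(x.size)                      # approximate inverse Hessian, B_0 = I
    g_old = f_grad(x)
    for k in range(1, steps + 1):
        # --- Adam update (H is not used here) ---
        m = b1 * m + (1 - b1) * g_old
        v = b2 * v + (1 - b2) * g_old**2
        mhat = m / (1 - b1**k)
        vhat = v / (1 - b2**k)
        x_new = x - lr * mhat / (np.sqrt(vhat) + eps)
        # --- BFGS-style update of the inverse Hessian approximation ---
        g_new = f_grad(x_new)
        s = x_new - x
        y = g_new - g_old
        sy = y @ s
        if sy > 1e-12:                      # skip the update if curvature is not positive
            I = np.eye(x.size)
            rho = 1.0 / sy
            H = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) \
                + rho * np.outer(s, s)
        x, g_old = x_new, g_new
    return x, H

x0 = np.array([-1.2, 1.0])
x_warm, H_warm = adam_warm_start(rosen_der, x0)

# Step 2: run BFGS, seeding the inverse Hessian with the matrix built above.
res = minimize(rosen, x_warm, jac=rosen_der, method="BFGS",
               options={"hess_inv0": H_warm})
print(res.x, res.nit)
```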

We compare our algorithm against the variant without the approximated Hessian (BFGS+Adam) and the variant without the warm start (BFGS). We also compare against the corresponding LBFGS counterparts (LBFGS+Adam and LBFGS).

In Test Cases II and III, a physical field $f$ is approximated by a deep neural network $f_\theta$, where $\theta$ collects the weights and biases. The neural network maps coordinates to a scalar value. $f_\theta$ is coupled with a differential-algebraic equation (DAE):

\[F(u', u, Du, D^2u, \ldots ;f_\theta) = 0 \tag{1}\]

Here $u'$ represents the time derivative of $u$, and $Du$, $D^2u$, $\ldots$ are the first-, second-, $\ldots$ order spatial gradients. Typically, we can observe $u$ at some discrete points $\{\mathbf{x}_k\}_{k\in \mathcal{I}}$. To train the neural network, we consider the following optimization problem

\[\min_\theta \sum_{k\in \mathcal{I}} (u(\mathbf{x}_k) - u_\theta(\mathbf{x}_k))^2\]

Here $u_\theta$ is the solution of Eq. 1 for the current $f_\theta$.
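
To make the structure of this formulation concrete, here is a self-contained toy example: a 1D Poisson equation $-(\kappa(x)u'(x))' = f(x)$ with homogeneous Dirichlet boundary conditions, where $\kappa$ is parameterized by two scalars instead of a neural network and recovered from observations of $u$ with BFGS. Everything in this sketch (the equation, the parameterization, the finite-difference solver) is a simplified stand-in, not the actual setup of the test cases below.

```python
import numpy as np
from scipy.optimize import minimize

n = 101                                    # number of interior grid nodes
x = np.linspace(0.0, 1.0, n + 2)           # nodes including the two boundary nodes
h = x[1] - x[0]
xm = 0.5 * (x[:-1] + x[1:])                # midpoints where kappa is evaluated
f_rhs = np.ones(n)                         # source term f(x) = 1

def solve_poisson(kappa_mid):
    """Finite-difference solve of -(kappa u')' = f with homogeneous Dirichlet BCs."""
    A = np.zeros((n, n))
    for j in range(n):
        A[j, j] = (kappa_mid[j] + kappa_mid[j + 1]) / h**2
        if j > 0:
            A[j, j - 1] = -kappa_mid[j] / h**2
        if j < n - 1:
            A[j, j + 1] = -kappa_mid[j + 1] / h**2
    return np.linalg.solve(A, f_rhs)       # u at the interior nodes

def kappa_param(theta, xs):
    """A two-parameter positive conductivity model, exp(theta0 + theta1 * x)."""
    return np.exp(theta[0] + theta[1] * xs)

# Synthetic observations of u generated from a "true" kappa.
theta_true = np.array([0.0, 1.0])
u_obs = solve_poisson(kappa_param(theta_true, xm))

def misfit(theta):
    u_theta = solve_poisson(kappa_param(theta, xm))
    return np.sum((u_obs - u_theta)**2)

res = minimize(misfit, np.zeros(2), method="BFGS")
print("recovered parameters:", res.x)      # should be close to theta_true
```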

Test Case I

In the first case, we train a 4-layer network with 20 neurons per layer and tanh activation functions. The data set is $\{(x_i, \sin(x_i))\}_{i=1}^{100}$, where the $x_i$ are randomly sampled from $\mathcal{U}(0,1)$. The neural network $f_\theta$ is trained by solving the following optimization problem:

\[\min_\theta \sum_{i=1}^{100} (\sin(x_i) - f_\theta(x_i))^2\]
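
A minimal sketch of this setup is given below. For brevity it uses a smaller network than the 4-layer, 20-neuron one described above and lets SciPy estimate gradients by finite differences; both choices are simplifications for illustration rather than the original configuration.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=(100, 1))
y = np.sin(x)

sizes = [1, 10, 10, 1]                     # layer widths of the toy network

def unpack(theta):
    """Slice the flat parameter vector into weight matrices and bias vectors."""
    params, offset = [], 0
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        W = theta[offset:offset + n_in * n_out].reshape(n_in, n_out)
        offset += n_in * n_out
        b = theta[offset:offset + n_out]
        offset += n_out
        params.append((W, b))
    return params

def forward(theta, inputs):
    h = inputs
    params = unpack(theta)
    for W, b in params[:-1]:
        h = np.tanh(h @ W + b)             # tanh hidden layers
    W, b = params[-1]
    return h @ W + b                       # linear output layer

def loss(theta):
    return np.sum((y - forward(theta, x))**2)

n_params = sum(n_in * n_out + n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))
theta0 = 0.1 * rng.standard_normal(n_params)

res = minimize(loss, theta0, method="L-BFGS-B", options={"maxiter": 500})
print("final loss:", res.fun)
```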

The neural network is small, and thus we can use BFGS/LBFGS for training. In fact, the following plot shows that BFGS/LBFGS is much more accurate and efficient than the commonly used Adam optimizer. Considering the wide range of computational engineering applications where a small neural network is sufficient, this result suggests that BFGS/LBFGS for training neural networks deserves far more attention than it currently receives.

Also, in many applications the majority of the runtime is spent on physical simulations rather than on updating neural network parameters or approximate Hessian matrices. In such settings, these "expensive" BFGS/LBFGS optimization algorithms are a good way to leverage as much history information as possible, so as to reduce the total number of iterations (and thus simulations).

Test Case II

In this test case, we consider the Poisson equation problem described in this post, where the coefficient $\kappa$ is approximated by a neural network and estimated from observations of the solution $u$.

The exact $\kappa$ and the corresponding solution $u$ are shown below.

We run different optimization algorithms and obtain the loss function profiles below. We see that BFGS/LBFGS without the Adam warm start terminates early. BFGS in general achieves much better accuracy than LBFGS. An extra benefit of BFGS+Adam+Hessian over BFGS+Adam is that it reaches a much better accuracy.

(Figures: loss function and zoom-in view)

We also show the mean squared error for $\kappa$, which confirms that BFGS+Adam+Hessian achieves the best test error. The error is calculated using the formula

\[\frac{1}{n} \sum_{i=1}^{n} (\kappa_{\text{true}}(\mathbf{x}_i) - \kappa_\theta(\mathbf{x}_i))^2\]

Here the $\mathbf{x}_i$ are the Gauss quadrature points.

| Algorithm | Adam | BFGS+Adam+Hessian | BFGS+Adam | BFGS | LBFGS+Adam | LBFGS |
| --- | --- | --- | --- | --- | --- | --- |
| MSE | 0.013 | 1.00E-11 | 1.70E-10 | 1.10E+04 | 0.00023 | 97000 |

Test Case III

In the last example, we consider a linear elasticity problem. The problem description can be found here.

We fix the random seed for the neural network initialization and run different optimization algorithms. The initial guess for the Young's modulus and the reference one are shown in the following plot.

The corresponding velocity and stress fields are shown below.

We perform the optimization using different algorithms. In the cases where Adam is used as a warm start, we run the Adam optimizer for 50 iterations. We run the optimization for at most 500 iterations (so there are at most 500 gradient evaluations). The loss functions are shown below.

We see that BFGS+Adam+Hessian achieves the smallest loss among all algorithms. We also show the MSE for $E$, i.e., $\frac{1}{n} \sum_{i=1}^{n} (E_{\text{true}}(\mathbf{x}_i) - E_\theta(\mathbf{x}_i))^2$, where the $\mathbf{x}_i$ are the Gauss quadrature points.

| Algorithm | Adam | BFGS+Adam+Hessian | BFGS+Adam | BFGS | LBFGS+Adam | LBFGS |
| --- | --- | --- | --- | --- | --- | --- |
| MSE | 0.0033 | 1.90E-07 | 4.00E-06 | 6.20E-06 | 0.0031 | 0.0013 |

This confirms that BFGS+Adam+Hessian indeed generates a much more accurate result.

We also compare the results for the BFGS+Adam+Hessian and Adam algorithms:

We see that the Adam optimizer achieves a reasonable result but struggles to deliver a highly accurate estimation within 500 iterations.