# The Emulate stage

Emulation is performed through the construction of an `Emulator` object, which has two components:

- a wrapper for any statistical emulator,
- data-processing and dimensionality-reduction functionality.

## Typical construction from `Lorenz_example.jl`

First, obtain data in a `PairedDataContainer`; for example, get this from an `EnsembleKalmanProcess` `ekpobj` generated during the `Calibrate` stage, or see the constructor here:

```
using CalibrateEmulateSample.Utilities
input_output_pairs = Utilities.get_training_points(ekpobj, 5) # use first 5 iterations as data
```

Wrapping a predefined machine learning tool, e.g. a Gaussian process `gauss_proc`, the `Emulator` can then be built:

```
emulator = Emulator(
    gauss_proc,
    input_output_pairs; # optional arguments after this
    obs_noise_cov = Γy,
    normalize_inputs = true,
    standardize_outputs = true,
    standardize_outputs_factors = factor_vector,
    retained_svd_frac = 0.95,
)
```

The optional arguments above relate to the data processing.

### Emulator training

The emulator is trained when the machine learning tool and the data are combined into the `Emulator` above. For any machine learning tool, hyperparameters are optimized with

`optimize_hyperparameters!(emulator)`

For some machine learning packages this optimization happens automatically during construction, while for others it does not. If it already took place at construction, the `optimize_hyperparameters!` line performs no new task, so it may always be safely called. In the Lorenz example, this line learns the hyperparameters of the Gaussian process, which depend on the choice of kernel and the choice of GP package. Predictions at new inputs can then be made using

`y, cov = Emulators.predict(emulator, new_inputs)`

This returns both a mean value and a covariance.

## Data processing

Some effects of the following are outlined in a practical setting in the results and appendices of Howland, Dunbar, and Schneider (2022).

### Diagonalization and output dimension reduction

This arises from the optional arguments

`obs_noise_cov = Γy` (default: `nothing`)

We always use singular value decomposition to diagonalize the output space, which requires the output covariance `Γy`. *Why?* If we need to train a $\mathbb{R}^{10} \to \mathbb{R}^{100}$ emulator, diagonalization allows us to instead train 100 $\mathbb{R}^{10} \to \mathbb{R}^{1}$ emulators (far cheaper).

`retained_svd_frac = 0.95` (default: `1.0`)

Performance is improved further by discarding the less informative output dimensions: if 95% of the information (i.e., variance) is contained in the first 40 diagonalized output dimensions, then setting `retained_svd_frac = 0.95` will train only 40 emulators.
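The effect of `retained_svd_frac` can be illustrated with plain linear algebra. The sketch below uses a hypothetical helper (not the package's internal code) to count how many diagonalized output dimensions survive a given retained variance fraction:

```julia
using LinearAlgebra

# Hypothetical illustration: given an output covariance Γy, count how many
# diagonalized output dimensions are kept for a retained variance fraction.
function n_retained_dims(Γy::AbstractMatrix, retained_svd_frac::Real)
    s = svd(Γy).S                     # singular values, sorted descending
    cumulative = cumsum(s) ./ sum(s)  # fraction of variance retained so far
    return findfirst(>=(retained_svd_frac), cumulative)
end

# Example: a 5-dimensional output space with rapidly decaying variance
Γy = Matrix(Diagonal([100.0, 10.0, 1.0, 0.1, 0.01]))
n_retained_dims(Γy, 0.95)  # -> 2: two dimensions already carry ≥95% of the variance
```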

Diagonalization is an approximation. It is, however, a good approximation when the observational covariance varies slowly in the parameter space. Severe approximation errors can occur if `obs_noise_cov` is not provided.

### Normalization and standardization

This arises from the optional arguments

`normalize_inputs = true` (default: `true`)

We normalize the input data in a standard way by centering, then scaling with the empirical covariance.
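As an illustration of the centering-and-scaling idea (a hypothetical sketch, not the package's implementation), with inputs stored column-wise as an `input_dim`-by-`N_samples` matrix:

```julia
using LinearAlgebra, Statistics

# Hypothetical sketch: subtract the sample mean, then scale by the inverse
# square root of the empirical covariance (a whitening transform).
function normalize_inputs(X::AbstractMatrix)
    μ = mean(X, dims = 2)             # per-dimension sample mean
    C = Symmetric(cov(X, dims = 2))   # empirical covariance of the inputs
    return sqrt(inv(C)) * (X .- μ)    # centered and whitened inputs
end

X = [1.0 2.0 3.0 4.0; 10.0 12.0 9.0 13.0]
Z = normalize_inputs(X)  # Z has zero mean and identity empirical covariance
```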

`standardize_outputs = true` (default: `false`)

`standardize_outputs_factors = factor_vector` (default: `nothing`)

To help with poor conditioning of the covariance matrix, users can also standardize each output dimension by a multiplicative factor given by the elements of `factor_vector`.
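A minimal sketch of the standardization (hypothetical helper names; scaling the observation covariance by the outer product of the factors is one natural consistent convention, not necessarily the package's exact rule):

```julia
# Hypothetical sketch: divide each output dimension by its factor, and scale
# the observation covariance consistently so the two remain compatible.
standardize(y::AbstractVector, factors::AbstractVector) = y ./ factors
standardize_cov(Γ::AbstractMatrix, factors::AbstractVector) = Γ ./ (factors * factors')

# Outputs spanning several orders of magnitude become O(1) after standardization
y = [200.0, 0.02]
factor_vector = [100.0, 0.01]
standardize(y, factor_vector)  # -> [2.0, 2.0]
```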

## Modular interface

Developers may contribute new tools by performing the following:

- Create `MyMLToolName.jl`, and include "MyMLToolName.jl" in `Emulators.jl`
- Create a struct `MyMLTool <: MachineLearningTool`, containing any arguments or optimizer options
- Create the following three methods to build, train, and predict with your tool (use `GaussianProcess.jl` as a guide):

```
build_models!(mlt::MyMLTool, iopairs::PairedDataContainer) -> Nothing
optimize_hyperparameters!(mlt::MyMLTool, args...; kwargs...) -> Nothing
predict(mlt::MyMLTool, new_inputs::Matrix; kwargs...) -> Matrix, Union{Matrix, Array{<:Real, 3}}
```

The `predict` method takes as input an `input_dim`-by-`N_new` matrix, and returns both a predicted mean and a predicted (co)variance at the new inputs: (i) for scalar-output methods relying on diagonalization, return `output_dim`-by-`N_new` matrices for the mean and variance; (ii) for vector-output methods, return `output_dim`-by-`N_new` for the mean and `output_dim`-by-`output_dim`-by-`N_new` for the covariances.
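The three methods can be sketched as a skeleton (the placeholder `abstract type` below stands in for the package's `MachineLearningTool`, and the bodies are stubs, not a working emulator; in a real contribution these come from `CalibrateEmulateSample`):

```julia
# Placeholder standing in for CalibrateEmulateSample's abstract supertype.
abstract type MachineLearningTool end

struct MyMLTool <: MachineLearningTool
    models::Vector{Any}       # one scalar-output model per output dimension
    optimizer_options::Dict   # any arguments or optimizer options
end

# build_models!: construct one model per (diagonalized) output dimension
function build_models!(mlt::MyMLTool, input_output_pairs)
    # ... populate mlt.models from the training pairs ...
    return nothing
end

# optimize_hyperparameters!: train each stored model
function optimize_hyperparameters!(mlt::MyMLTool, args...; kwargs...)
    # ... optimize the hyperparameters of each model in mlt.models ...
    return nothing
end

# predict: new_inputs is input_dim-by-N_new; a scalar-output method returns
# output_dim-by-N_new matrices for the mean and the variance
function predict(mlt::MyMLTool, new_inputs::AbstractMatrix; kwargs...)
    output_dim, N_new = length(mlt.models), size(new_inputs, 2)
    μ = zeros(output_dim, N_new)   # stub mean
    σ2 = ones(output_dim, N_new)   # stub variance
    return μ, σ2
end
```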

Please get in touch with our development team when contributing new statistical emulators, to help us ensure the smoothest interface with any new tools.