DINCAE.DINCAE - Module

DINCAE (Data-Interpolating Convolutional Auto-Encoder) is a neural network to reconstruct missing data in satellite observations.

For most applications it is sufficient to call the function DINCAE.reconstruct directly.

The code is available at: https://github.com/gher-uliege/DINCAE.jl

DINCAE.NCData - Method
dd = NCData(lon,lat,time,data_full,missingmask,ndims;
            train = false,
            obs_err_std = fill(1.,size(data_full,3)),
            jitter_std = fill(0.05,size(data_full,3)),
            mask = trues(size(data_full)[1:2]),
)

Return a structure holding the data for training (train = true) or testing (train = false) the neural network. obs_err_std is the error standard deviation of the observations. The variable lon is the longitude in degrees east, lat is the latitude in degrees north, time is a DateTime vector, data_full is a 3-d array with the data and missingmask is a boolean mask where true means the data is missing. jitter_std is the standard deviation of the noise to be added to the data during training.

DINCAE.interp_adjn! - Method
All positions should be within the domain. The upper bound is exclusive, i.e. for all i and n:

1 <= pos[i][n] < sz[n]
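
The condition above can be checked before calling the function. The following is a minimal sketch; the helper all_within_domain is hypothetical and not part of DINCAE, and it assumes that pos is a vector of tuples and sz a tuple with the grid size.

```julia
# Hypothetical helper (not part of DINCAE): check 1 <= pos[i][n] < sz[n]
# for every position i and every dimension n.
function all_within_domain(pos, sz)
    return all(all(1 <= p[n] < sz[n] for n in eachindex(sz)) for p in pos)
end

sz = (4, 3)
pos_ok = [(1.0, 1.0), (3.5, 2.9)]
pos_bad = [(4.0, 2.0)]   # 4.0 is not strictly smaller than sz[1] = 4

all_within_domain(pos_ok, sz)    # true
all_within_domain(pos_bad, sz)   # false
```

An exclusive upper bound is typical when an interpolation also reads the grid point following each position; that this is the motivation here is an assumption.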

DINCAE.interpnd! - Method

interpnd! and interp_adjn! are adjoints.

vec must be zero initially.
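
What "adjoint" means here can be illustrated with the dot-product test. The sketch below uses a 1-d linear interpolation written for this purpose; interp1! and interp1_adj! are hypothetical stand-ins, not the DINCAE functions, and their argument order is an assumption. Both accumulate into their output, which is why the outputs start at zero.

```julia
using LinearAlgebra

# Hypothetical 1-d linear interpolation of A at the fractional positions pos.
# The result is accumulated into vec, so vec has to be zero initially.
function interp1!(pos, A, vec)
    for (i, p) in enumerate(pos)
        k = floor(Int, p)
        w = p - k
        vec[i] += (1 - w) * A[k] + w * A[k + 1]
    end
    return vec
end

# Adjoint operation: scatter the values back onto the grid with the same weights.
function interp1_adj!(pos, values, A)
    for (i, p) in enumerate(pos)
        k = floor(Int, p)
        w = p - k
        A[k] += (1 - w) * values[i]
        A[k + 1] += w * values[i]
    end
    return A
end

pos = [1.25, 2.5, 3.75]        # all positions satisfy 1 <= pos[i] < length(A)
A = [1.0, 2.0, 4.0, 8.0]
y = [0.5, -1.0, 2.0]

Ax = interp1!(pos, A, zeros(length(pos)))
Aty = interp1_adj!(pos, y, zeros(length(A)))

# dot-product test: <interp(A), y> equals <A, adjoint(y)>
dot(Ax, y) ≈ dot(A, Aty)       # true
```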

DINCAE.load_gridded_nc - Method
lon,lat,time,data,missingmask,mask = load_gridded_nc(fname,varname; minfrac = 0.05)

Load the variable varname from the NetCDF file fname. The variable lon is the longitude in degrees east, lat is the latitude in degrees north, time is a DateTime vector, data is a 3-d array with the data, missingmask is a boolean mask where true means the data is missing and mask is a boolean mask where true means the data location is valid, e.g. sea points for sea surface temperature.

At a bare minimum, the NetCDF file should have the following variables and attributes:

netcdf file.nc {
dimensions:
        time = UNLIMITED ; // (5266 currently)
        lat = 112 ;
        lon = 112 ;
variables:
        double lon(lon) ;
        double lat(lat) ;
        double time(time) ;
                time:units = "days since 1900-01-01 00:00:00" ;
        int mask(lat, lon) ;
        float SST(time, lat, lon) ;
                SST:_FillValue = -9999.f ;
}

The netCDF mask is 0 for invalid pixels (e.g. land for an ocean application) and 1 for valid pixels (e.g. ocean).
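
The relation between the file content and the returned arrays can be sketched with a few synthetic values. This is not the implementation of load_gridded_nc: the arrays below are made up, and the variable names ncmask and days are assumptions. Note that in Julia the dimensions appear in the reverse order of the listing above, i.e. (lon, lat, time).

```julia
using Dates

fillvalue = -9999f0

# synthetic SST values with the dimensions (lon, lat, time) = (2, 2, 2)
sst = reshape(Float32[20, 21, -9999, 22, -9999, 19, -9999, 23], 2, 2, 2)

# true where the data is missing (equal to the fill value)
missingmask = sst .== fillvalue

# integer mask as stored in the file: 0 = invalid (land), 1 = valid (sea)
ncmask = [1 0; 1 1]
mask = ncmask .== 1

# time in "days since 1900-01-01 00:00:00" converted to DateTime
days = [0.0, 1.5]
time = DateTime(1900, 1, 1) .+ Dates.Millisecond.(round.(Int, days .* 86_400_000))
```

Here mask[1,2] is false (land pixel), and missingmask is true for the three entries equal to the fill value.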

DINCAE.reconstruct - Method
reconstruct(Atype,data_all,fnames_rec;...)

Train a neural network to reconstruct missing data using the training data set and periodically run the neural network on the test dataset. The data is assumed to be available on a regular longitude/latitude grid (which is the case for L3 satellite data).

Mandatory parameters

  • Atype: array type to use
  • data_all: list of named tuples. Every tuple should have filename and varname. data_all[1] will be used for training (and perturbed to prevent overfitting). All other entries data_all[2:end] will be reconstructed using the trained network at the epochs defined by save_epochs.

  • fnames_rec: vector of filenames corresponding to the entries data_all[2:end]

Optional parameters:

  • epochs: the number of epochs (default 1000)
  • batch_size: the size of a mini-batch (default 50)
  • enc_nfilter_internal: number of filters of the internal encoding layers (default [16,24,36,54])
  • skipconnections: list of layers with skip connections (default 2:(length(enc_nfilter_internal)+1))
  • clip_grad: maximum allowed gradient. Elements of the gradients larger than this value will be clipped (default 5.0).
  • regularization_L2_beta: Parameter for L2 regularization (default 0, i.e. no regularization)
  • save_epochs: list of epochs where the results should be saved (default 200:10:epochs)
  • is3D: Switch to apply 2D (is3D == false) or 3D (is3D == true) convolutions (default false)
  • upsampling_method: interpolation method during upsampling which can be either :nearest or :bilinear (default :nearest)
  • ntime_win: number of time instances within the time window. This number should be odd. (default 3)
  • learning_rate: initial learning rate of the ADAM optimizer (default 0.001)
  • learning_rate_decay_epoch: the number of epochs after which the learning rate is halved (exponential decay). The learning rate is computed as learning_rate * 0.5^(epoch / learning_rate_decay_epoch). learning_rate_decay_epoch can be Inf for a constant learning rate (default)
  • min_std_err: minimum error standard deviation preventing a division close to zero (default exp(-5) = 0.006737946999085467)
  • loss_weights_refine: the weights of the individual refinement layers used in the cost function. If loss_weights_refine has a single element, then there is no refinement. (default (1.,))
Note

The optional parameters should also be tuned for a particular application.
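
The learning rate formula and some of the default values listed above can be evaluated directly. The function name decayed_learning_rate is hypothetical, and whether DINCAE counts epochs from 0 or from 1 is not specified here, so the example only illustrates the formula.

```julia
# learning rate at a given epoch, following the formula quoted above
decayed_learning_rate(learning_rate, epoch, learning_rate_decay_epoch) =
    learning_rate * 0.5^(epoch / learning_rate_decay_epoch)

decayed_learning_rate(0.001, 0, 50)      # initial learning rate
decayed_learning_rate(0.001, 50, 50)     # halved after 50 epochs
decayed_learning_rate(0.001, 100, 50)    # quartered after 100 epochs
decayed_learning_rate(0.001, 100, Inf)   # constant when the decay epoch is Inf

# default values that depend on other parameters
enc_nfilter_internal = [16, 24, 36, 54]
skipconnections = 2:(length(enc_nfilter_internal) + 1)   # 2:5
epochs = 1000
save_epochs = 200:10:epochs                              # 200, 210, ..., 1000
```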

Internally the time mean is removed (per default) from the data before it is reconstructed. The time mean is also added back when the file is saved. However, the mean is undefined for pixels that are defined as valid (sea) by the mask but do not have any valid data in the training dataset.
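
Why the mean can be undefined is easiest to see on a small example. The helper time_mean below is a hypothetical sketch and not DINCAE's internal code; it assumes a 3-d array with the dimensions (lon, lat, time) and uses NaN for an undefined mean.

```julia
# Hypothetical helper: time mean over the valid (non-missing) data of each pixel.
# Pixels without any valid data keep the value NaN (undefined mean).
function time_mean(data, missingmask)
    nlon, nlat, ntime = size(data)
    meandata = fill(NaN, nlon, nlat)
    for j in 1:nlat, i in 1:nlon
        valid = .!missingmask[i, j, :]
        if any(valid)
            meandata[i, j] = sum(data[i, j, valid]) / count(valid)
        end
    end
    return meandata
end

# two pixels, two time steps; the second pixel is never observed
data = reshape([1.0, 10.0, 3.0, 20.0], 2, 1, 2)
missingmask = reshape([false, true, false, true], 2, 1, 2)

meandata = time_mean(data, missingmask)   # 2.0 for the first pixel, NaN for the second
anomalies = data .- meandata              # the mean is added back after the reconstruction
```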

See DINCAE.load_gridded_nc for more information about the netCDF file.

DINCAE.reconstruct_points - Method
DINCAE.reconstruct_points(T,Atype,filename,varname,grid,fnames_rec)

Mandatory parameters:

  • T: Float32 or Float64: float-type used by the neural network
  • Atype: Array{T} or KnetArray{T}: array-type used by the neural network.
  • filename: NetCDF file in the format described below.
  • varname: name of the primary variable in the NetCDF file.
  • grid: tuple of ranges with the grid in the longitude and latitude direction e.g. (-180:1:180,-90:1:90).
  • fnames_rec: NetCDF file names of the reconstruction.

Optional parameters:

  • jitter_std_pos: standard deviation of the noise to be added to the position of the observations (default (5,5))
  • auxdata_files: gridded auxiliary data files for a multivariate reconstruction. auxdata_files is an array of named tuples with the fields filename (the file name of the NetCDF file), varname (the NetCDF name of the primary variable) and errvarname (the NetCDF name of the expected standard deviation error).
  • probability_skip_for_training: For a given time step n, every track from the same time step n will be skipped by this probability during training (default 0.2). This does not affect the tracks from previous (n-1,n-2,..) and following time steps (n+1,n+2,...). The goal of this parameter is to force the neural network to learn to interpolate the data in time.
  • paramfile: the path of the file (netCDF) where the parameter values are stored (default: nothing).

For example, a single entry of auxdata_files could be:

auxdata_files = [
  (filename = "big-sst-file.nc",
   varname = "SST",
   errvarname = "SST_error")]

The data in the file should already be interpolated on the target grid. The file structure of the NetCDF file is described in DINCAE.load_gridded_nc. The fields defined in this file should not have any missing values (see DIVAnd.ufill).

See DINCAE.reconstruct for other optional parameters.

A minimal example of the NetCDF file is:

netcdf all-sla.train {
dimensions:
	time_instances = 9628 ;
	obs = 7445528 ;
variables:
	int64 size(time_instances) ;
		size:sample_dimension = "obs" ;
	double dates(time_instances) ;
		dates:units = "days since 1900-01-01 00:00:00" ;
	float sla(obs) ;
	float lon(obs) ;
	float lat(obs) ;
	int64 id(obs) ;
	double dtime(obs) ;
		dtime:long_name = "time of measurement" ;
		dtime:units = "days since 1900-01-01 00:00:00" ;
}

The file should contain the variables lon (longitude), lat (latitude), dtime (time of measurement), id (numeric identifier, only used by post-processing scripts) and dates (time instance of the gridded field). The file should be in the contiguous ragged array representation as specified by the CF convention, which allows data points to be grouped into "features" (e.g. tracks for altimetry). A feature can also contain a single data point.
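
In the contiguous ragged array representation, the count variable (here size, with the attribute sample_dimension = "obs") gives the number of consecutive elements along the obs dimension that belong to each time instance. The sketch below shows how the flat arrays can be split accordingly; the helper ragged_ranges and the sample values are hypothetical.

```julia
# Hypothetical helper: index ranges along the "obs" dimension for every
# time instance, given the number of observations per time instance.
function ragged_ranges(counts)
    stop = cumsum(counts)
    start = stop .- counts .+ 1
    return [start[i]:stop[i] for i in eachindex(counts)]
end

counts = [3, 1, 2]                                  # content of the variable "size"
sla = Float32[0.1, 0.2, 0.3, 0.4, 0.5, 0.6]         # flat array along "obs"

ranges = ragged_ranges(counts)                      # [1:3, 4:4, 5:6]
sla_per_instance = [sla[r] for r in ranges]         # observations grouped per time instance
```

The same ranges apply to lon, lat, id and dtime, since all of them share the obs dimension.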