DataHandling

The DataHandling module is responsible for reading data from files and resampling it onto the simulation grid.

This is no trivial task. Among the challenges:

  • data can be large and cannot be read all in one go and/or held in memory,
  • regridding onto the simulation grid can be very expensive,
  • IO can be very expensive,
  • CPU/GPU communication can be a bottleneck.

DataHandling takes a divide-and-conquer approach: the various core tasks and features are split into independent modules (chiefly FileReaders and Regridders). Such modules can be developed, tested, and extended independently (as long as they maintain a consistent interface). For instance, if the need arises, the DataHandler can be used (almost) directly to process files in a format other than NetCDF.

The key struct in DataHandling is the DataHandler. The DataHandler contains a FileReader, a Regridder, and other metadata necessary to perform its operations (e.g., target ClimaCore.Space). The DataHandler can be used for static or temporal data, and exposes the following key functions:

  • regridded_snapshot(time): to obtain the regridded field at the given time. time has to be available in the data.
  • available_times (available_dates): to list all the times (dates) over which the data is defined.
  • previous_time(time/date) (next_time(time/date)): to obtain the time of the snapshot before (after) the given time or date. This can be used to compute the interpolation weight for linear interpolation, or in combination with regridded_snapshot to read a particular snapshot.
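The lookup semantics of previous_time and next_time, and the interpolation weight derived from them, can be sketched on a plain sorted vector of snapshot times. This is only an illustration: prev_snapshot_time, next_snapshot_time, and interpolation_weight are hypothetical names, not DataHandling functions, and the actual implementation may differ. The convention assumed here is the documented one: if the time coincides with a snapshot, the previous time is the snapshot itself and the next time is the following snapshot.

```julia
# Hypothetical helpers operating on a sorted vector of snapshot times (in seconds).
# They mimic the documented behavior of previous_time/next_time, nothing more.

function prev_snapshot_time(times, t)
    i = searchsortedlast(times, t)  # index of the last snapshot <= t
    i == 0 && error("t = $t is before the first snapshot")
    return times[i]
end

function next_snapshot_time(times, t)
    i = searchsortedlast(times, t) + 1  # index of the first snapshot > t
    i > length(times) && error("t = $t is at or after the last snapshot")
    return times[i]
end

# Weight of the next snapshot in a linear interpolation (between 0 and 1)
function interpolation_weight(times, t)
    t0 = prev_snapshot_time(times, t)
    t1 = next_snapshot_time(times, t)
    return (t - t0) / (t1 - t0)
end

snapshot_times = [0.0, 10.0, 20.0]
w_mid = interpolation_weight(snapshot_times, 5.0)
w_on_snapshot = interpolation_weight(snapshot_times, 10.0)
```

Because the next snapshot is always strictly later than the previous one, the denominator never vanishes, even when the requested time falls exactly on a snapshot.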

Most DataHandling functions take either time or date, with the difference being that time is intended as "simulation time" and is expected to be in seconds; date is a calendar date (from Dates.DateTime). Conversion between time and date is performed using the reference date and simulation starting time provided to the DataHandler.
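To make the relationship between time and date concrete, here is a minimal sketch of such a conversion using only the Dates standard library. The helpers date_to_time and time_to_date are hypothetical names introduced for illustration, and the way t_start enters the formula (a simple offset in seconds) is an assumption, not a statement about the DataHandler's internals.

```julia
import Dates

# Hypothetical helpers for illustration; these are not DataHandling functions.
# Assumption: `time` counts seconds since `reference_date`, shifted by `t_start`.

function date_to_time(date::Dates.DateTime, reference_date::Dates.DateTime, t_start)
    # The difference of two DateTimes is a Millisecond period
    elapsed_ms = Dates.value(date - reference_date)
    return elapsed_ms / 1000 - t_start
end

function time_to_date(time, reference_date::Dates.DateTime, t_start)
    # Round to the nearest millisecond, the resolution of DateTime
    return reference_date + Dates.Millisecond(round(Int, 1000 * (t_start + time)))
end

reference_date = Dates.DateTime(2000, 1, 1)
one_day_later = date_to_time(Dates.DateTime(2000, 1, 2), reference_date, 0.0)
round_trip = time_to_date(one_day_later, reference_date, 0.0)
```

With t_start = 0, one calendar day after the reference date corresponds to 86400 seconds of simulation time, and converting back recovers the original date.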

The DataHandler has a caching mechanism in place: once a field is read and regridded, it is stored in the local cache to be used again (if needed). This is a least-recently-used (LRU) cache implemented in DataStructures, which removes the least-recently-used data when its maximum size is reached. The default maximum size is 128.
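The eviction policy can be illustrated with a toy LRU cache written in plain Julia. This is a sketch of the general technique only: TinyLRU and get_or_compute! are hypothetical names, and the cache actually used by the DataHandler is not reproduced here.

```julia
# Toy LRU cache for illustration; not the cache used by the DataHandler.
struct TinyLRU{K, V}
    data::Dict{K, V}
    order::Vector{K}  # keys sorted from least to most recently used
    max_size::Int
end

TinyLRU{K, V}(max_size::Int) where {K, V} =
    TinyLRU{K, V}(Dict{K, V}(), K[], max_size)

# Return the cached value for `key`, computing and storing it on a miss
function get_or_compute!(compute, cache::TinyLRU, key)
    if haskey(cache.data, key)
        # Cache hit: mark `key` as the most recently used
        filter!(k -> k != key, cache.order)
        push!(cache.order, key)
        return cache.data[key]
    end
    if length(cache.order) >= cache.max_size
        # Cache full: drop the least recently used entry
        oldest = popfirst!(cache.order)
        delete!(cache.data, oldest)
    end
    value = compute(key)
    cache.data[key] = value
    push!(cache.order, key)
    return value
end

# Strings stand in for the (expensive) regridded fields
cache = TinyLRU{Int, String}(2)
get_or_compute!(k -> "field $k", cache, 1)
get_or_compute!(k -> "field $k", cache, 2)
get_or_compute!(k -> "field $k", cache, 1)  # touches 1, so 2 is now the oldest
get_or_compute!(k -> "field $k", cache, 3)  # cache is full: 2 is evicted
```

In this toy version the keys play the role of snapshot times and the values the role of regridded fields; a larger maximum size trades memory for fewer repeated reads and regridding operations.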

While the reading backend could be generic, at the moment, this module uses only the NCFileReader.

This extension is loaded when ClimaCore and NCDatasets are loaded. In addition, a Regridder is needed (which might require importing additional packages); see Regridders for more information.

It is possible to pass keyword arguments down to the underlying constructors in DataHandler via regridder_kwargs and file_reader_kwargs. Each of these has to be a named tuple or a dictionary that maps Symbols to values.
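Both accepted forms work because Julia can splat either a named tuple or a dictionary with Symbol keys into keyword arguments. The sketch below shows this with a stand-in constructor; make_reader is a hypothetical function used only for illustration, and how the DataHandler forwards these arguments internally is not shown here.

```julia
# Hypothetical stand-in for a constructor that accepts keyword arguments
make_reader(path; preprocess_func = identity) = (; path, preprocess_func)

# The same keyword arguments, expressed in the two accepted forms
kwargs_as_named_tuple = (; preprocess_func = x -> 1000 * x)
kwargs_as_dict = Dict(:preprocess_func => x -> 1000 * x)

# Splatting either container forwards its entries as keyword arguments
reader_from_tuple = make_reader("era5_example.nc"; kwargs_as_named_tuple...)
reader_from_dict = make_reader("era5_example.nc"; kwargs_as_dict...)
```

A named tuple is usually the more convenient choice for literal arguments, while a dictionary is handy when the set of keywords is assembled programmatically.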

Example

As an example, let us implement a simple linear interpolation for a variable u defined in the era5_example.nc NetCDF file. The file contains monthly averages starting from the year 2000.

import ClimaUtilities.DataHandling
import ClimaCore
import NCDatasets
# Loading ClimaCore and NCDatasets automatically loads DataHandling
import Interpolations
# This will load InterpolationsRegridder

import Dates

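# Function passed to the file reader as preprocess_func (here, a unit conversion)
unit_conversion_func = (data) -> 1000.0 * data

# target_space is assumed to be an already-constructed ClimaCore space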
data_handler = DataHandling.DataHandler("era5_example.nc",
                                        "u",
                                        target_space,
                                        reference_date = Dates.DateTime(2000, 1, 1),
                                        regridder_type = :InterpolationsRegridder,
                                        file_reader_kwargs = (; preprocess_func = unit_conversion_func))

function linear_interpolation(data_handler, time)
    # Time is assumed to be "simulation time", i.e., seconds starting from reference_date

    time_of_prev_snapshot = DataHandling.previous_time(data_handler, time)
    time_of_next_snapshot = DataHandling.next_time(data_handler, time)

    prev_snapshot = DataHandling.regridded_snapshot(data_handler, time_of_prev_snapshot)
    next_snapshot = DataHandling.regridded_snapshot(data_handler, time_of_next_snapshot)

    # prev and next snapshots are ClimaCore.Fields defined on the target_space

    return @. prev_snapshot + (next_snapshot - prev_snapshot) *
        (time - time_of_prev_snapshot) / (time_of_next_snapshot - time_of_prev_snapshot)
end

API

ClimaUtilities.DataHandling.DataHandler - Function
DataHandler(file_path::AbstractString,
            varname::AbstractString,
            target_space::ClimaCore.Spaces.AbstractSpace;
            reference_date::Dates.DateTime = Dates.DateTime(1979, 1, 1),
            t_start::AbstractFloat = 0.0,
            regridder_type = nothing,
            cache_max_size::Int = 128,
            regridder_kwargs = (),
            file_reader_kwargs = ())

Create a DataHandler to read varname from file_path and remap it to target_space.

The DataHandler maintains an LRU cache of Fields that were previously computed.

Positional arguments

  • file_path: Path of the NetCDF file that contains the data.
  • varname: Name of the dataset in the NetCDF that has to be read and processed.
  • target_space: Space where the simulation is run, where the data has to be regridded to.

Keyword arguments

Time/date information will be ignored for static input files. (They are still set to make everything more type stable.)

  • reference_date: Calendar date corresponding to the start of the simulation.
  • t_start: Simulation time at the beginning of the simulation. Typically this is 0 (seconds), but it might be different if the simulation was restarted.
  • regridder_type: What type of regridding to perform. Currently, the ones implemented are :TempestRegridder (using TempestRemap) and :InterpolationsRegridder (using Interpolations.jl). TempestRemap regrids everything ahead of time and saves the result to HDF5 files. Interpolations.jl is online and GPU compatible but not conservative. If the regridder type is not specified by the user, and multiple are available, the default :InterpolationsRegridder regridder is used.
  • cache_max_size: Maximum number of regridded fields to store in the cache. If the cache is full, the least recently used field is removed.
  • regridder_kwargs: Additional keywords to be passed to the constructor of the regridder. It can be a NamedTuple, or a Dictionary that maps Symbols to values.
  • file_reader_kwargs: Additional keywords to be passed to the constructor of the file reader. It can be a NamedTuple, or a Dictionary that maps Symbols to values.

ClimaUtilities.DataHandling.available_times - Function
available_times(data_handler::DataHandler)

Return the times in seconds of the snapshots in the data, measured relative to the reference date and the starting time of the simulation.

ClimaUtilities.DataHandling.previous_time - Function
previous_time(data_handler::DataHandler, time::AbstractFloat)
previous_time(data_handler::DataHandler, date::Dates.DateTime)

Return the time in seconds of the snapshot before the given time. If time is one of the snapshots, return itself.

ClimaUtilities.DataHandling.next_time - Function
next_time(data_handler::DataHandler, time::AbstractFloat)
next_time(data_handler::DataHandler, date::Dates.DateTime)

Return the time in seconds of the snapshot after the given time. If time is one of the snapshots, return the next time.

ClimaUtilities.DataHandling.regridded_snapshot - Function
regridded_snapshot(data_handler::DataHandler, date::Dates.DateTime)
regridded_snapshot(data_handler::DataHandler, time::AbstractFloat)
regridded_snapshot(data_handler::DataHandler)

Return the regridded snapshot from data_handler associated with the given time (if relevant).

The time has to be available in the data_handler.

regridded_snapshot potentially modifies the internal state of data_handler and it might be a very expensive operation.
