DataHandling
The DataHandling module is responsible for reading data from files and resampling it onto the simulation grid.
This is no trivial task. Among the challenges:
- data can be large and cannot be read all in one go and/or held in memory,
- regridding onto the simulation grid can be very expensive,
- IO can be very expensive,
- CPU/GPU communication can be a bottleneck.
The DataHandling module takes a divide-and-conquer approach: the various core tasks and features are split into other independent modules (chiefly FileReaders and Regridders). Such modules can be developed, tested, and extended independently (as long as they maintain a consistent interface). For instance, if the need arises, the DataHandler can be used (almost) directly to process files with a different format from NetCDF.
The key struct in DataHandling is the DataHandler. The DataHandler contains a FileReader, a Regridder, and other metadata necessary to perform its operations (e.g., target ClimaCore.Space). The DataHandler can be used for static or temporal data, and exposes the following key functions:
- regridded_snapshot(time): to obtain the regridded field at the given time. time has to be available in the data.
- available_times (available_dates): to list all the times (dates) over which the data is defined.
- previous_time(time/date) (next_time(time/date)): to obtain the time of the snapshot before (after) the given time or date. This can be used to compute the interpolation weight for linear interpolation, or in combination with regridded_snapshot to read a particular snapshot.
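To make the bracketing and the interpolation weight concrete, here is a minimal self-contained sketch on a plain sorted vector of snapshot times. The helpers snapshot_before, snapshot_after, and interpolation_weight are hypothetical stand-ins written for illustration; they are not the library's implementation of previous_time and next_time, and they only mimic the documented behavior (the previous time of a snapshot is the snapshot itself, the next time is strictly later).

```julia
# Hypothetical stand-ins for previous_time/next_time, acting on a sorted
# vector of snapshot times in seconds. Not the ClimaUtilities implementation.

# Last snapshot at or before t (a snapshot time maps to itself).
# Throws a BoundsError if t is before the first snapshot.
snapshot_before(times, t) = times[searchsortedlast(times, t)]

# First snapshot strictly after t.
# Throws a BoundsError if t is at or after the last snapshot.
snapshot_after(times, t) = times[searchsortedlast(times, t) + 1]

# Weight of the later snapshot in a linear interpolation at time t
function interpolation_weight(times, t)
    t0 = snapshot_before(times, t)
    t1 = snapshot_after(times, t)
    return (t - t0) / (t1 - t0)
end

times = [0.0, 10.0, 20.0]
w = interpolation_weight(times, 12.5)  # (12.5 - 10.0) / (20.0 - 10.0)
```

Because the later snapshot is strictly after t, the denominator is never zero, even when t coincides with a snapshot.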
Most DataHandling functions take either time or date, with the difference being that time is intended as "simulation time" and is expected to be in seconds; date is a calendar date (from Dates.DateTime). Conversion between time and date is performed using the reference date and simulation starting time provided to the DataHandler.
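The conversion can be pictured with a short sketch based on the Dates standard library. The helper names time_to_date and date_to_time, and the exact formula date = reference_date + (t_start + time) seconds, are assumptions made for illustration; the library's internal convention for t_start may differ.

```julia
import Dates

# Hypothetical helpers. ASSUMPTION: the calendar date is the reference date
# plus (t_start + time) seconds. This may not match the library's convention.
function time_to_date(time::Real; reference_date::Dates.DateTime, t_start::Real = 0.0)
    # Go through milliseconds so that fractional seconds are representable
    return reference_date + Dates.Millisecond(round(Int, 1000 * (t_start + time)))
end

function date_to_time(date::Dates.DateTime; reference_date::Dates.DateTime, t_start::Real = 0.0)
    # The difference of two DateTimes is a Millisecond period
    elapsed_ms = Dates.value(date - reference_date)
    return elapsed_ms / 1000 - t_start
end

ref = Dates.DateTime(2000, 1, 1)
one_day_later = time_to_date(86400.0; reference_date = ref)
seconds_back = date_to_time(one_day_later; reference_date = ref)
```

With the reference date above and t_start equal to zero, 86400 seconds of simulation time correspond to January 2, 2000, and converting back recovers 86400 seconds.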
The DataHandler has a caching mechanism in place: once a field is read and regridded, it is stored in the local cache to be used again (if needed). This is a least-recently-used (LRU) cache implemented in DataStructures, which removes the least-recently-used data when its maximum size is reached. The default maximum size is 128.
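The eviction policy can be illustrated with a deliberately small LRU cache. TinyLRU and get_or_compute! are hypothetical names invented for this sketch; the cache in DataStructures is a separate implementation, and this version favors readability over performance (updating the recency order is linear in the cache size).

```julia
# Minimal LRU cache sketch. Hypothetical type for illustration only;
# this is not the cache used by the DataHandler.
struct TinyLRU{K, V}
    max_size::Int
    store::Dict{K, V}
    order::Vector{K}  # keys from least to most recently used
end

TinyLRU{K, V}(max_size::Int) where {K, V} =
    TinyLRU{K, V}(max_size, Dict{K, V}(), K[])

# Return the cached value for key, computing and storing it on a miss
function get_or_compute!(compute, cache::TinyLRU, key)
    if haskey(cache.store, key)
        # Cache hit: move the key to the most-recently-used position
        filter!(!=(key), cache.order)
        push!(cache.order, key)
        return cache.store[key]
    end
    value = compute()
    if length(cache.order) >= cache.max_size
        # Cache full: evict the least recently used entry
        oldest = popfirst!(cache.order)
        delete!(cache.store, oldest)
    end
    cache.store[key] = value
    push!(cache.order, key)
    return value
end

cache = TinyLRU{Float64, String}(2)
n_computed = Ref(0)
# Stand-in for an expensive read-and-regrid step
expensive(t) = (n_computed[] += 1; "field at t = $t")

get_or_compute!(() -> expensive(0.0), cache, 0.0)
get_or_compute!(() -> expensive(10.0), cache, 10.0)
get_or_compute!(() -> expensive(0.0), cache, 0.0)    # hit, nothing recomputed
get_or_compute!(() -> expensive(20.0), cache, 20.0)  # evicts the entry for 10.0
```

In this run only three computations happen for four requests, and the entry for 10.0 is the one evicted because 0.0 was touched more recently.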
While the reading backend could be generic, at the moment, this module uses only the NCFileReader.
This extension is loaded when ClimaCore and NCDatasets are loaded. In addition to this, a Regridder is needed (which might require importing additional packages); see Regridders for more information.
It is possible to pass keyword arguments down to the underlying constructors in DataHandler with regridder_kwargs and file_reader_kwargs. These have to be a named tuple or a dictionary that maps Symbols to values.
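Both containers work because Julia can splat either one into keyword arguments. The sketch below shows the mechanism with a hypothetical make_reader function standing in for an underlying constructor; the verbose keyword is invented for the example, and forward_kwargs is an illustrative name, not a library function.

```julia
# Hypothetical constructor standing in for a file reader or regridder.
# The `verbose` keyword is made up for this illustration.
function make_reader(path; preprocess_func = identity, verbose = false)
    return (path = path, preprocess_func = preprocess_func, verbose = verbose)
end

# Forward user-provided keyword arguments by splatting them
forward_kwargs(path, kwargs) = make_reader(path; kwargs...)

# A NamedTuple of keyword arguments...
as_tuple = (; preprocess_func = x -> 1000 * x)
# ...or a dictionary mapping Symbols to values
as_dict = Dict(:preprocess_func => (x -> 1000 * x), :verbose => true)

r1 = forward_kwargs("era5_example.nc", as_tuple)
r2 = forward_kwargs("era5_example.nc", as_dict)
```

Keywords that are not supplied keep their defaults, so r1 still has verbose set to false while r2 overrides it.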
Example
As an example, let us implement a simple linear interpolation for a variable u defined in the era5_example.nc NetCDF file. The file contains monthly averages starting from the year 2000.
import ClimaUtilities.DataHandling
import ClimaCore
import NCDatasets # Loading ClimaCore and NCDatasets automatically loads the DataHandling extension
import Interpolations # This will load InterpolationsRegridder
import Dates

# target_space is assumed to be an existing ClimaCore space for the simulation

# Applied to the data when it is read (here, a scaling by a factor of 1000)
unit_conversion_func = (data) -> 1000.0 * data

data_handler = DataHandling.DataHandler("era5_example.nc",
                                        "u",
                                        target_space,
                                        reference_date = Dates.DateTime(2000, 1, 1),
                                        regridder_type = :InterpolationsRegridder,
                                        file_reader_kwargs = (; preprocess_func = unit_conversion_func))

function linear_interpolation(data_handler, time)
    # Time is assumed to be "simulation time", i.e., seconds starting from reference_date
    time_of_prev_snapshot = DataHandling.previous_time(data_handler, time)
    time_of_next_snapshot = DataHandling.next_time(data_handler, time)

    prev_snapshot = DataHandling.regridded_snapshot(data_handler, time_of_prev_snapshot)
    next_snapshot = DataHandling.regridded_snapshot(data_handler, time_of_next_snapshot)

    # prev and next snapshots are ClimaCore.Fields defined on the target_space
    return @. prev_snapshot + (next_snapshot - prev_snapshot) *
        (time - time_of_prev_snapshot) / (time_of_next_snapshot - time_of_prev_snapshot)
end
API
ClimaUtilities.DataHandling.DataHandler - Function
DataHandler(file_path::AbstractString,
            varname::AbstractString,
            target_space::ClimaCore.Spaces.AbstractSpace;
            reference_date::Dates.DateTime = Dates.DateTime(1979, 1, 1),
            t_start::AbstractFloat = 0.0,
            regridder_type = nothing,
            cache_max_size::Int = 128,
            regridder_kwargs = (),
            file_reader_kwargs = ())
Create a DataHandler to read varname from file_path and remap it to target_space.
The DataHandler maintains an LRU cache of Fields that were previously computed.
Positional arguments
- file_path: Path of the NetCDF file that contains the data.
- varname: Name of the dataset in the NetCDF file that has to be read and processed.
- target_space: Space where the simulation is run, where the data has to be regridded to.
Keyword arguments
Time/date information will be ignored for static input files. (They are still set to make everything more type stable.)
- reference_date: Calendar date corresponding to the start of the simulation.
- t_start: Simulation time at the beginning of the simulation. Typically this is 0 (seconds), but it might be different if the simulation was restarted.
- regridder_type: What type of regridding to perform. Currently, the ones implemented are :TempestRegridder (using TempestRemap) and :InterpolationsRegridder (using Interpolations.jl). TempestRemap regrids everything ahead of time and saves the result to HDF5 files. Interpolations.jl is online and GPU compatible but not conservative. If the regridder type is not specified by the user, and multiple are available, the default :InterpolationsRegridder regridder is used.
- cache_max_size: Maximum number of regridded fields to store in the cache. If the cache is full, the least recently used field is removed.
- regridder_kwargs: Additional keywords to be passed to the constructor of the regridder. It can be a NamedTuple, or a Dictionary that maps Symbols to values.
- file_reader_kwargs: Additional keywords to be passed to the constructor of the file reader. It can be a NamedTuple, or a Dictionary that maps Symbols to values.
ClimaUtilities.DataHandling.available_times - Function
available_times(data_handler::DataHandler)
Return the time in seconds of the snapshots in the data, measured considering the starting time of the simulation and the reference date.
ClimaUtilities.DataHandling.available_dates - Function
available_dates(data_handler::DataHandler)
Return the dates of the snapshots in the data.
ClimaUtilities.DataHandling.previous_time - Function
previous_time(data_handler::DataHandler, time::AbstractFloat)
previous_time(data_handler::DataHandler, date::Dates.DateTime)
Return the time in seconds of the snapshot before the given time. If time is one of the snapshots, return itself.
ClimaUtilities.DataHandling.next_time - Function
next_time(data_handler::DataHandler, time::AbstractFloat)
next_time(data_handler::DataHandler, date::Dates.DateTime)
Return the time in seconds of the snapshot after the given time. If time is one of the snapshots, return the next time.
ClimaUtilities.DataHandling.regridded_snapshot - Function
regridded_snapshot(data_handler::DataHandler, date::Dates.DateTime)
regridded_snapshot(data_handler::DataHandler, time::AbstractFloat)
regridded_snapshot(data_handler::DataHandler)
Return the regridded snapshot from data_handler associated with the given time (if relevant).
The time has to be available in the data_handler.
regridded_snapshot potentially modifies the internal state of data_handler and it might be a very expensive operation.