FileReaders

Reading files is a common need in most scientific projects. It comes with a series of problems to solve, from performance (accessing files can be a very expensive operation) to dealing with multiple files that are logically connected. The FileReaders module provides an abstraction layer that decouples the scientific needs from the technical implementation, so that file processing can be optimized and extended independently of the rest of the model.

At this point, the implemented FileReaders are always linked to a specific variable and they come with a caching system to avoid unnecessary reads.

Future extensions might include:

  • dealing with multiple files containing the same variables (e.g. time series when the dates are split in different files);
  • doing chunked reads;
  • async reads.

NCFileReaders

This extension is loaded when NCDatasets is loaded.

The only file reader currently implemented is the NCFileReader, used to read NetCDF files. Each NCFileReader is associated with one particular file and variable (but multiple NCFileReaders can share the same file).

Once created, an NCFileReader is accessed with the read(file_reader, date) function, which returns the Array associated with the given date (if available). The date can be omitted if the data is static.

NCFileReaders implement two additional features: (1) optional preprocessing, and (2) cached reads. NCFileReaders can be created with a preprocess_func keyword argument, a function that is applied to the data every time it is read from the file. preprocess_func should be a lightweight function, such as removing NaNs or changing units.

Every time read(file_reader, date) is called, the NCFileReader checks whether the date is currently stored in the cache. If it is, the cached value is returned without accessing the disk. If not, the NCFileReader reads and processes the data and adds the result to the cache. This uses a least-recently-used (LRU) cache implemented in DataStructures, which removes the least-recently-used entry when the cache reaches its maximum size (the default maximum size is 128).
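The caching logic described above can be illustrated with a toy LRU cache. This is a minimal sketch, not the implementation used by NCFileReaders: the names ToyLRU, cached_read!, and fake_disk_read are hypothetical, and the "disk read" is simulated by a function that counts how often it is called.

```julia
using Dates

# Toy LRU cache. All names here are hypothetical and not part of ClimaUtilities.
struct ToyLRU{K, V}
    data::Dict{K, V}
    order::Vector{K}  # least recently used first, most recently used last
    max_size::Int
end

ToyLRU{K, V}(max_size::Int) where {K, V} =
    ToyLRU{K, V}(Dict{K, V}(), K[], max_size)

# Return the cached value for `key`, calling `load(key)` only on a cache miss.
function cached_read!(load, cache::ToyLRU{K, V}, key::K) where {K, V}
    if haskey(cache.data, key)
        # Cache hit: mark the key as most recently used and skip `load`.
        filter!(!=(key), cache.order)
        push!(cache.order, key)
        return cache.data[key]
    end
    # Cache miss: run the (expensive) loader.
    value = load(key)
    if length(cache.order) >= cache.max_size
        # The cache is full: evict the least recently used entry.
        oldest = popfirst!(cache.order)
        delete!(cache.data, oldest)
    end
    cache.data[key] = value
    push!(cache.order, key)
    return value
end

# Stand-in for reading and preprocessing data from a file.
n_loads = Ref(0)
fake_disk_read(date) = (n_loads[] += 1; fill(Float64(Dates.day(date)), 2, 2))

cache = ToyLRU{DateTime, Matrix{Float64}}(2)
d1, d2, d3 = DateTime(2000, 1, 1), DateTime(2000, 1, 2), DateTime(2000, 1, 3)

cached_read!(fake_disk_read, cache, d1)  # miss: calls the loader
cached_read!(fake_disk_read, cache, d1)  # hit: the loader is not called
cached_read!(fake_disk_read, cache, d2)  # miss
cached_read!(fake_disk_read, cache, d3)  # miss: the cache is full, so d1 is evicted
```

With a maximum size of 2, the loader runs three times in this sequence, and only d2 and d3 remain cached at the end.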

It is good practice to always close the NCFileReaders when they are no longer needed. The function close_all_ncfiles closes all the ones that are currently open.

Example

Assume you have a file era5_2000.nc, which contains two variables u and v, defined for the year 2000.

import ClimaUtilities.FileReaders
import NCDatasets
# Loading NCDatasets automatically loads `NCFileReaders`

u_var = FileReaders.NCFileReader("era5_2000.nc", "u")
# Change units for v
v_var = FileReaders.NCFileReader("era5_2000.nc", "v", preprocess_func = x -> 1000x)

dates = FileReaders.available_dates(u_var)
# dates is a vector of Dates.DateTime

first_date = dates[begin]

# The first time we call read, the file is accessed and read
u_array = FileReaders.read(u_var, first_date)
# As the name suggests, u_array is an Array

# All the other times, we access the cache, so no IO operation is involved
u_array_again = FileReaders.read(u_var, first_date)

close(u_var)
close(v_var)
# Alternatively: FileReaders.close_all_ncfiles()

API

ClimaUtilities.FileReaders.NCFileReader (Function)
FileReaders.NCFileReader(
    file_path::AbstractString,
    varname::AbstractString;
    preprocess_func = identity,
    cache_max_size::Int = 128,
)

A struct to efficiently read and process NetCDF files.
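As an illustration of what a lightweight preprocess_func might look like, here is a small sketch that removes NaNs and changes units in one pass. The name clean_and_rescale is hypothetical, the conversion factor is arbitrary, and the sketch assumes the function receives the whole array that was read (as the x -> 1000x example suggests).

```julia
# Hypothetical helper, not part of ClimaUtilities: replace NaNs with zero
# and multiply every other value by 1000 (for example, m to mm).
clean_and_rescale(data) = map(v -> isnan(v) ? zero(v) : 1000 * v, data)

raw = [1.0 NaN; 0.5 2.0]
processed = clean_and_rescale(raw)  # a new array; `raw` is left untouched
```

Such a function could then be passed as preprocess_func = clean_and_rescale when constructing the reader.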

ClimaUtilities.FileReaders.read (Function)
read(file_reader::NCFileReader, date::Dates.DateTime)

Read and preprocess the data at the given date.

read(file_reader::NCFileReader)

Read and preprocess data (for static datasets).

Base.close (Function)
close(time_varying_input::TimeVaryingInputs.AbstractTimeVaryingInput)

Close files associated with the time_varying_input.

close(time_varying_input::InterpolatingTimeVaryingInput23D)

Close files associated with the time_varying_input.

close(data_handler::DataHandler)

Close any file associated with the given data_handler.

close(file_reader::NCFileReader)

Close NCFileReader. If no other NCFileReader is using the same file, close the NetCDF file.
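One common way to obtain this "close the file only when the last user is done" behavior is reference counting. The sketch below is an assumption for illustration and not necessarily how NCFileReaders tracks open files: OPEN_FILES, acquire!, and release! are hypothetical names, and a plain temporary file stands in for a NetCDF file.

```julia
# Hypothetical registry: path => (open handle, number of readers using it).
const OPEN_FILES = Dict{String, Tuple{IOStream, Int}}()

# Open `path` if needed, otherwise reuse the existing handle.
function acquire!(path)
    handle, count = get(OPEN_FILES, path) do
        (open(path, "r"), 0)
    end
    OPEN_FILES[path] = (handle, count + 1)
    return handle
end

# Drop one user of `path`; close the file when no users remain.
function release!(path)
    handle, count = OPEN_FILES[path]
    if count == 1
        close(handle)
        delete!(OPEN_FILES, path)
    else
        OPEN_FILES[path] = (handle, count - 1)
    end
    return nothing
end

# Use a temporary file as a stand-in for a data file.
path, io = mktemp()
close(io)

h1 = acquire!(path)
h2 = acquire!(path)        # second user shares the same handle
release!(path)
still_open = isopen(h1)    # one user remains, so the file stays open
release!(path)
open_at_end = isopen(h1)   # last user released, so the file is closed
```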