utils

The utils module contains useful functions to handle the raw data and the output results.

The data_to_arrays function takes a tidy dataframe and converts it into the set of arrays used as input for all the models in the model module. Its keyword arguments control the slight variations in the output that each model requires.

BarBay.utils.data_to_arrays — Function

data_to_arrays(data; kwargs)

Function to preprocess the tidy dataframe data into the corresponding inputs for the models in the model submodule.

Arguments

data::DataFrames.AbstractDataFrame: Tidy dataframe with the data to be used for sampling the model posterior distribution.

Optional Keyword Arguments

id_col::Symbol=:barcode: Name of the column in data containing the barcode identifier. The column may include any type of entry.
time_col::Symbol=:time: Name of the column in data defining the time point at which measurements were done. The column may contain any type of entry as long as sort will result in time-ordered names.
count_col::Symbol=:count: Name of the column in data containing the raw barcode count. The column must contain entries of type Int64.
neutral_col::Symbol=:neutral: Name of the column in data defining whether the barcode belongs to a neutral lineage. The column must contain entries of type Bool.
rep_col::Union{Nothing,Symbol}=nothing: Column indicating the experimental replicate each measurement belongs to. Default is nothing.
env_col::Union{Nothing,Symbol}=nothing: Column indicating the environment in which each measurement was performed. Default is nothing.
genotype_col::Union{Nothing,Symbol}=nothing: Column indicating the genotype each barcode belongs to when fitting a hierarchical model on genotypes. Default is nothing.
rm_T0::Bool=false: Optional argument to remove the first time point from the inference. The data from this first time point is commonly of much lower quality. Therefore, removing this first time point might result in a better inference.
verbose::Bool=true: Boolean indicating if printing statements should be made.
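
The sorting requirement on time_col matters in practice when time points are labeled with strings. The following sketch uses only Base Julia and hypothetical labels (the naming scheme is an assumption, not something BarBay prescribes). It shows how unpadded labels sort out of chronological order, and what discarding the first time point, as rm_T0=true is described to do, would look like on such labels.

```julia
# Hypothetical time-point labels; the naming scheme is an assumption.
days = [1, 2, 10]

raw_labels = ["T$(d)" for d in days]
sorted_raw = sort(raw_labels)        # lexicographic order puts "T10" before "T2"

padded_labels = ["T" * lpad(d, 2, '0') for d in days]
sorted_padded = sort(padded_labels)  # zero-padding restores chronological order

# Illustration of the idea behind rm_T0=true: discard the earliest time point.
# This mimics the described behavior; it is not BarBay's own code.
first_label = first(sorted_padded)
kept_labels = filter(!=(first_label), padded_labels)
```

Numeric time columns avoid the issue entirely, since integers sort in their natural order.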

Returns

data_arrays::Dict: Dictionary with the following elements:
- bc_ids: List of barcode IDs in the order they are used for the inference.
- neutral_ids: List of neutral barcode IDs in the order they are used for the inference.
- bc_count: Count time series for each barcode. The options can be:
  - Matrix{Int64}: (ntime) × (nbc) matrix with counts. Rows are time points, and columns are barcodes.
  - Array{Int64, 3}: The same as the matrix, except the third dimension represents multiple experimental replicates.
  - Vector{Matrix{Int64}}: List of matrices, one for each experimental replicate. This is when replicates have a different number of time points.
- bc_total: Total number of barcode counts per time point. The options can be:
  - Vector{Int64}: Equivalent to summing each matrix row.
  - Matrix{Int64}: Equivalent to summing each row of each slice of the tensor.
  - Vector{Vector{Int64}}: Equivalent to summing each row of each matrix in the list.
- n_rep: Number of experimental replicates.
- n_time: Number of time points. The options can be:
  - Int64: Number of time points for a single replicate, or for multiple replicates that share the same number of time points.
  - Vector{Int64}: Number of time points per replicate when replicates have different lengths.
- envs: List of environments. The options can be:
  - String: Single placeholder env1
  - Vector{<:Any}: Environments in the order they were measured.
  - Vector{Vector{<:Any}}: Environments per replicate when replicates have a different number of time points.
- n_env: Number of environmental conditions.
- genotypes: List of genotypes for each of the non-neutral barcodes. The options can be:
  - String: Placeholder N/A when no genotype information is given.
  - Vector{<:Any}: Vector of the corresponding genotype for each of the non-neutral barcodes in the order they are used for the inference.
- n_geno: Number of genotypes. When no genotype information is provided, this defaults to zero.
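
To make the shapes of bc_count and bc_total concrete, here is a minimal sketch in Base Julia that builds them by hand from a few tidy records. It is not BarBay's implementation: the toy data, the ordering of barcodes by sorted ID, and the (ntime) × (nrep) orientation of the multi-replicate totals are all assumptions made for illustration.

```julia
# Hypothetical tidy records (one row per barcode and time point).
records = [
    (barcode="bc1", time=1, count=10, neutral=true),
    (barcode="bc1", time=2, count=12, neutral=true),
    (barcode="bc2", time=1, count=5,  neutral=false),
    (barcode="bc2", time=2, count=9,  neutral=false),
    (barcode="bc3", time=1, count=7,  neutral=false),
    (barcode="bc3", time=2, count=3,  neutral=false),
]

# Sorted time labels define the rows; sorted barcode IDs define the columns.
# Assumption: BarBay may order barcodes differently.
times = sort(unique([r.time for r in records]))
bc_ids = sort(unique([r.barcode for r in records]))

# (ntime) × (nbc) count matrix: rows are time points, columns are barcodes.
bc_count = zeros(Int64, length(times), length(bc_ids))
for r in records
    i = findfirst(==(r.time), times)
    j = findfirst(==(r.barcode), bc_ids)
    bc_count[i, j] = r.count
end

# Single replicate: the totals are the row sums of the matrix.
bc_total = vec(sum(bc_count; dims=2))

# Replicates with equal numbers of time points stack along a third dimension.
# The second replicate here is just the first with doubled counts.
bc_count_3d = cat(bc_count, 2 .* bc_count; dims=3)
# Summing each row of each slice gives one column of totals per replicate.
bc_total_2d = dropdims(sum(bc_count_3d; dims=2); dims=2)
```

With a tidy dataframe holding the same columns, the documented call data_to_arrays(data) would return these quantities in a dictionary, alongside the other elements listed above.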

The advi_to_df function takes the output of fitting a model with variational inference under the mean-field approximation, i.e., assuming a diagonal covariance matrix, and converts it into a tidy dataframe.

BarBay.utils.advi_to_df — Function

advi_to_df(data::DataFrames.AbstractDataFrame, dist::Distributions.Sampleable, vars::Vector{<:Any}; kwargs)

Convert the output of automatic differentiation variational inference (ADVI) to a tidy dataframe.

Arguments

data::DataFrames.AbstractDataFrame: Tidy dataframe used to perform the ADVI inference. See BarBay.vi module for the dataframe requirements.
dist::Distributions.Sampleable: The ADVI posterior sampleable distribution object.
vars::Vector{<:Any}: Vector of variable/parameter names from the ADVI run.

Optional Keyword Arguments

id_col::Symbol=:barcode: Name of the column in data containing the barcode identifier. The column may contain any type of entry.
time_col::Symbol=:time: Name of the column in data defining the time point at which measurements were done. The column may contain any type of entry as long as sort will result in time-ordered names.
count_col::Symbol=:count: Name of the column in data containing the raw barcode count. The column must contain entries of type Int64.
neutral_col::Symbol=:neutral: Name of the column in data defining whether the barcode belongs to a neutral lineage or not. The column must contain entries of type Bool.
rm_T0::Bool=false: Optional argument to remove the first time point from the inference. The data from this first time point is commonly of much lower quality. Therefore, removing this first time point might result in a better inference.
n_samples::Int=10_000: Number of posterior samples to draw. These samples are used for hierarchical models. Default is 10,000.

Returns

df::DataFrames.DataFrame: DataFrame containing summary statistics of posterior samples for each parameter. Columns include:
- mean, std: posterior mean and standard deviation for each variable.
- varname: parameter name from the ADVI posterior distribution.
- vartype: Description of the type of parameter. The types are:
  - pop_mean_fitness: Population mean fitness value s̲ₜ.
  - pop_std: (Nuisance parameter) Log of standard deviation in the likelihood function for the neutral lineages.
  - bc_fitness: Mutant relative fitness s⁽ᵐ⁾.
  - bc_hyperfitness: For hierarchical models, mutant hyperparameter that connects the fitness over multiple experimental replicates or multiple genotypes θ⁽ᵐ⁾.
  - bc_noncenter: (Nuisance parameter) For hierarchical models, non-centered samples used to connect the experimental replicates to the hyperparameter θ̃⁽ᵐ⁾.
  - bc_deviations: (Nuisance parameter) For hierarchical models, samples that define the log of the deviation from the hyperparameter fitness value logτ⁽ᵐ⁾.
  - bc_std: (Nuisance parameter) Log of standard deviation in the likelihood function for the mutant lineages.
  - freq: (Nuisance parameter) Log of the Poisson parameter used to define the frequency of each lineage.
- rep: Experimental replicate number.
- env: Environment for each parameter.
- id: Mutant or neutral strain ID.
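
As an illustration of what a mean-field (diagonal covariance) Gaussian posterior implies for the mean and std columns, the sketch below draws independent samples per variable and summarizes them. It relies only on the Random and Statistics standard libraries. The variable names and the values of mu and sigma are hypothetical, and the code is not how advi_to_df is implemented.

```julia
using Random, Statistics

Random.seed!(42)

# Hypothetical variational parameters for three variables (toy values).
varnames = ["pop_mean[1]", "bc_fitness[1]", "bc_fitness[2]"]
mu = [0.5, -0.1, 0.3]       # per-variable means
sigma = [0.05, 0.2, 0.1]    # per-variable standard deviations

n_samples = 10_000

# Diagonal covariance: every column is an independent univariate Gaussian,
# so a draw is just a shifted and scaled standard normal per coordinate.
samples = mu' .+ sigma' .* randn(n_samples, length(mu))

# One summary row per variable, mirroring the varname, mean, and std columns.
stats_rows = [
    (varname=varnames[k], mean=mean(samples[:, k]), std=std(samples[:, k]))
    for k in eachindex(mu)
]
```

Because the coordinates are independent, the sample mean and standard deviation of each column converge to the corresponding entries of mu and sigma as n_samples grows.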

Notes

Converts multivariate posterior into summarized dataframe format.
Adds metadata like parameter type, replicate, strain ID, etc.
Can handle models with multiple replicates and environments.
Can handle models with hierarchical structure on genotypes.
Useful for post-processing ADVI results for further analysis and plotting.
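
A typical post-processing step is to keep only the fitness rows and attach an approximate credible interval. The sketch below mimics the documented columns with plain named tuples so that it runs without any packages. The rows are toy values, storing vartype as a String is an assumption, and the interval mean ± 1.96 std is only appropriate if the marginal posterior of the parameter is Gaussian.

```julia
# Hypothetical rows mimicking the documented columns (toy values only).
rows = [
    (varname="s[1]", vartype="bc_fitness", mean=0.30,  std=0.10, id="bc2"),
    (varname="s[2]", vartype="bc_fitness", mean=-0.10, std=0.20, id="bc3"),
    (varname="σ[1]", vartype="bc_std",     mean=-1.20, std=0.30, id="bc2"),
]

# Drop nuisance parameters and keep the mutant relative fitness rows.
fitness_rows = filter(r -> r.vartype == "bc_fitness", rows)

# Assumption: Gaussian marginals, so mean ± 1.96 std covers roughly 95%.
intervals = [
    (id=r.id, lower=r.mean - 1.96 * r.std, upper=r.mean + 1.96 * r.std)
    for r in fitness_rows
]
```

On an actual DataFrame the same selection would be a row filter on the vartype column.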