# stats

The `stats`

module includes useful functions to compute statistical characterizations of the data and the output results.

## Posterior Predictive Checks

`BarBay.stats.logfreq_ratio_popmean_ppc`

— Function`logfreq_ratio_popmean_ppc(df, n_ppc; kwargs)`

Function to compute the **posterior predictive checks** (better called the posterior retrodictive checks) for the barcode log-frequency ratio for neutral lineages.

**Model**

The functional form that connects the barcode frequency at tie $t+1$ based on the frequency at time $t$ for **neutral barcodes** is of the form

\[ f_{t+1}^{(n)} = f_{t}^{(n)} \exp\left[ \left( - \bar{s}_t \right) \tau \right],\]

where $\bar{s}_t$ is the population mean fitness between time $t$ and $t+1$, and $\tau$ is the time interval between time $t$ and $t+1$. Our inference model assumes that

\[ \frac{f_{t+1}^{(n)}}{f_{t}^{(n)}} \sim \log-\mathcal{N}\left( - \bar{s}_t, \sigma^{(n)} \right),\]

where $\sigma^{(n)}$ is the inferred standard deviation for the model. This function generates samples out of this distribution.

**Arguments**

`df::DataFrames.DataFrame`

: Dataframe containing the MCMC samples for the variables needed to compute the posterior predictive checks. The dataframe should have MCMC samples for- population mean fitness values. NOTE: The number of columns containing population mean fitness values determines the number of datapoints where the ppc are evaluated.
- (log)-normal likelihood standard deviation.

`n_ppc::Int`

: Number of samples to generate per set of parameters.

**Optional Arguments**

`param::Dict{Symbol, Symbol}`

: Dictionary indicating the name of the variables

in the mcmc chain defining the following variables: - `population_mean_fitness`

: Common pattern in all population mean fitness variables. - `population_std_fitness`

: Common pattern in all standard deviations estimates for the likelihood.

`flatten::Bool=true`

: Boolean indicating whether to flatten the output of multiple chain into a single column.

**Returns**

`log(fₜ₊₁ / fₜ)= s⁽ᵐ⁾ - s̅ₜ::Array{Float64}`

: Evaluation of the log frequency ratio posterior predictive checks at all times for each MCMC sample.

`logfreq_ratio_popmean_ppc(chain, n_ppc; kwargs)`

Function to compute the **posterior predictive checks** (better called the posterior retrodictive checks) for the barcode log-frequency ratio for neutral lineages.

**Model**

The functional form that connects the barcode frequency at tie $t+1$ based on the frequency at time $t$ for **neutral barcodes** is of the form

\[ f_{t+1}^{(n)} = f_{t}^{(n)} \exp\left[ \left( - \bar{s}_t \right) \tau \right],\]

where $\bar{s}_t$ is the population mean fitness between time $t$ and $t+1$, and $\tau$ is the time interval between time $t$ and $t+1$. Our inference model assumes that

\[ \frac{f_{t+1}^{(n)}}{f_{t}^{(n)}} \sim \log-\mathcal{N}\left( - \bar{s}_t, \sigma^{(n)} \right),\]

where $\sigma^{(n)}$ is the inferred standard deviation for the model. This function generates samples out of this distribution.

**Arguments**

`chain::MCMCChains.Chains`

: Chain containing the MCMC samples for the variables needed to compute the posterior predictive checks. The dataframe should have MCMC samples for- mutant relative fitness values.
- population mean fitness values. NOTE: The number of columns containing population mean fitness values determines the number of datapoints where the ppc are evaluated.
- (log)-normal likelihood standard deviation.

`n_ppc::Int`

: Number of samples to generate per set of parameters.

**Optional Arguments**

`param::Dict{Symbol, Symbol}`

: Dictionary indicating the name of the variables

in the mcmc chain defining the following variables: - `population_mean_fitness`

: Common pattern in all population mean fitness variables. - `population_std_fitness`

: Common pattern in all standard deviations estimates for the likelihood.

`model::Symbol=:lognormal`

: Either`:normal`

or`:lognormal`

to indicate if the model used a normal or lognormal distribution for the likelihood. This is because when using a normal distribution, the nuisance parameters are sampled in log scale and need to be exponentiated.`flatten::Bool=true`

: Boolean indicating whether to flatten the output of multiple chain into a single column.

**Returns**

`log(fₜ₊₁ / fₜ) = s⁽ᵐ⁾ - s̅ₜ::Array{Float64}`

: Evaluation of the log frequency ratio posterior predictive checks at all times for each MCMC sample.

`BarBay.stats.logfreq_ratio_bc_ppc`

— Function`logfreq_ratio_bc_ppc(df, n_ppc; kwargs)`

Function to compute the **posterior predictive checks** for the barcode log-frequency ratio for adaptive mutants.

**Model**

The functional form that connects the barcode frequency at time $t+1$ with the frequency at time $t$ is of the form

\[ f_{t+1}^{(m)} = f_{t}^{(m)} \exp\left[ \left( s^{(m)} - \bar{s}_t \right) \tau \right],\]

where $s^{(m)}$ is the mutant relative fitness, $\bar{s}_t$ is the population mean fitness between time $t$ and $t+1$, and $\tau$ is the time interval between time $t$ and $t+1$. The statistical models in this package assume that

\[ \log\left(\frac{f_{t+1}^{(m)}}{f_{t}^{(m)}}\right) \sim \mathcal{N}\left( s^{(m)} - \bar{s}_t, \sigma^{(m)} \right),\]

where $\sigma^{(m)}$ is the inferred standard deviation for the model. This function generates samples out of this distribution to produce the posterior predictive checks.

**Arguments**

`df::DataFrames.DataFrame`

: Dataframe containing the MCMC samples for the variables needed to compute the posterior predictive checks. The dataframe should have MCMC samples for- mutant relative fitness values.
- population mean fitness values. NOTE: The number of columns containing population mean fitness values determines the number of datapoints where the ppc are evaluated.
- (log)-normal likelihood standard deviation.

`n_ppc::Int`

: Number of samples to generate per set of parameters.

**Optional Keyword Arguments**

`param::Dict{Symbol, Symbol}`

: Dictionary indicating the name of the variables

in the mcmc chain defining the following variables:

`:bc_mean_fitness`

: Variable defining the inferred mutant fitness value`s⁽ᵐ⁾`

.`:bc_std_fitness`

: Variable defining the standard defining the inferred standard deviation on the likelihood function`σ⁽ᵐ⁾`

.`population_mean_fitness`

: Common pattern in all population mean fitness variables.`flatten::Bool=true`

: Boolean indicating whether to flatten the output of

multiple chain into a single column.

**Returns**

`log(fₜ₊₁ / fₜ) = s⁽ᵐ⁾ - s̅ₜ::Array{Float64}`

: Evaluation of the frequency posterior predictive checks at all times for each MCMC sample.

`logfreq_ratio_bc_ppc(chain, n_ppc; kwargs)`

Function to compute the **posterior predictive checks** for the barcode log-frequency ratio for adaptive mutants.

**Model**

The functional form that connects the barcode frequency at time $t+1$ with the frequency at time $t$ is of the form

\[ f_{t+1}^{(m)} = f_{t}^{(m)} \exp\left[ \left( s^{(m)} - \bar{s}_t \right) \tau \right],\]

where $s^{(m)}$ is the mutant relative fitness, $\bar{s}_t$ is the population mean fitness between time $t$ and $t+1$, and $\tau$ is the time interval between time $t$ and $t+1$. The statistical models in this package assume that

\[ \log\left(\frac{f_{t+1}^{(m)}}{f_{t}^{(m)}}\right) \sim \mathcal{N}\left( s^{(m)} - \bar{s}_t, \sigma^{(m)} \right),\]

where $\sigma^{(m)}$ is the inferred standard deviation for the model. This function generates samples out of this distribution to produce the posterior predictive checks.

**Arguments**

`chain::MCMCChains.Chains`

: Chain containing the MCMC samples for the variables needed to compute the posterior predictive checks. The dataframe should have MCMC samples for- mutant relative fitness values.
- (log)-normal likelihood standard deviation.

`n_ppc::Int`

: Number of samples to generate per set of parameters.

**Optional Arguments**

`param::Dict{Symbol, Symbol}`

: Dictionary indicating the name of the variables

in the mcmc chain defining the following variables:

`:bc_mean_fitness`

: Variable defining the inferred mutant fitness value`s⁽ᵐ⁾`

.`:bc_std_fitness`

: Variable defining the standard defining the inferred standard deviation on the likelihood function`σ⁽ᵐ⁾`

.`population_mean_fitness`

: Common pattern in all population mean fitness variables.`flatten::Bool=true`

: Boolean indicating whether to flatten the output of

multiple chain into a single column.

**Returns**

`log(fₜ₊₁ / fₜ) = s⁽ᵐ⁾ - s̅ₜ::Array{Float64}`

: Evaluation of the frequency posterior predictive checks at all times for each MCMC sample.

`BarBay.stats.logfreq_ratio_multienv_ppc`

— Function`logfreq_ratio_mutienv_ppc(df, n_ppc; kwargs)`

Function to compute the **posterior predictive checks** for the barcode log-frequency ratio for adaptive mutants.

**Model**

The functional form that connects the barcode frequency at time $t+1$ with the frequency at time $t$ is of the form

\[ f_{t+1}^{(m)} = f_{t}^{(m)} \exp\left[ \left( s^{(m)} - \bar{s}_t \right) \tau \right],\]

where $s^{(m)}$ is the mutant relative fitness, $\bar{s}_t$ is the population mean fitness between time $t$ and $t+1$, and $\tau$ is the time interval between time $t$ and $t+1$. The statistical models in this package assume that

\[ \log\left(\frac{f_{t+1}^{(m)}}{f_{t}^{(m)}}\right) \sim \mathcal{N}\left( s^{(m)} - \bar{s}_t, \sigma^{(m)} \right),\]

where $\sigma^{(m)}$ is the inferred standard deviation for the model. This function generates samples out of this distribution to produce the posterior predictive checks.

**Arguments**

`df::DataFrames.DataFrame`

: Dataframe containing the MCMC samples for the variables needed to compute the posterior predictive checks. The dataframe should have MCMC samples for- mutant relative fitness values.
- (log)-normal likelihood standard deviation.

`n_ppc::Int`

: Number of samples to generate per set of parameters.`envs::Vector{<:Any}`

: List of environments in experiment. This is used to index the corresponding fitness from the chain. NOTE: The list of environments should be the name or corresponding label of the environemnt; the index is generated internally.

**Optional Keyword Arguments**

`param::Dict{Symbol, Symbol}`

: Dictionary indicating the name of the variables

in the mcmc chain defining the following variables:

`:bc_mean_fitness`

: Variable defining the inferred mutant fitness value`s⁽ᵐ⁾`

.`:bc_std_fitness`

: Variable defining the standard defining the inferred standard deviation on the likelihood function`σ⁽ᵐ⁾`

.`population_mean_fitness`

: Common pattern in all population mean fitness variables.`flatten::Bool=true`

: Boolean indicating whether to flatten the output of

multiple chain into a single column.

**Returns**

`log(fₜ₊₁ / fₜ) = s⁽ᵐ⁾ - s̅ₜ::Array{Float64}`

: Evaluation of the frequency posterior predictive checks at all times for each MCMC sample.

logfreq*ratio*mutienv*ppc(chain, n*ppc; kwargs)

Function to compute the **posterior predictive checks** (better called the posterior retrodictive checks) for the barcode log-frequency ratio for neutral lineages.

**Model**

\[ f_{t+1}^{(m)} = f_{t}^{(m)} \exp\left[ \left( s^{(m)} - \bar{s}_t \right) \tau \right],\]

**Arguments**

`chain::MCMCChains.Chains`

: Chain containing the MCMC samples for the variables needed to compute the posterior predictive checks. The dataframe should have MCMC samples for- mutant relative fitness values.
- (log)-normal likelihood standard deviation.

`n_ppc::Int`

: Number of samples to generate per set of parameters.

**Optional Arguments**

`param::Dict{Symbol, Symbol}`

: Dictionary indicating the name of the variables

in the mcmc chain defining the following variables: - `population_mean_fitness`

: Common pattern in all population mean fitness variables. - `population_std_fitness`

: Common pattern in all standard deviations estimates for the likelihood.

`model::Symbol=:lognormal`

: Either`:normal`

or`:lognormal`

to indicate if the model used a normal or lognormal distribution for the likelihood. This is because when using a normal distribution, the nuisance parameters are sampled in log scale and need to be exponentiated.`flatten::Bool=true`

: Boolean indicating whether to flatten the output of multiple chain into a single column.

**Returns**

`log(fₜ₊₁ / fₜ) = s⁽ᵐ⁾ - s̅ₜ::Array{Float64}`

: Evaluation of the log frequency ratio posterior predictive checks at all times for each MCMC sample.

`BarBay.stats.freq_bc_ppc`

— Function`freq_bc_ppc(df, n_ppc; kwargs)`

Function to compute the **posterior predictive checks** for the barcode frequency for adaptive mutants.

**Model**

\[ f_{t+1}^{(m)} = f_{t}^{(m)} \exp\left[ \left( s^{(m)} - \bar{s}_t \right) \tau \right],\]

**Arguments**

`df::DataFrames.DataFrame`

: Dataframe containing the MCMC samples for the variables needed to compute the posterior predictive checks. The dataframe should have MCMC samples for- mutant relative fitness values.
- (log)-normal likelihood standard deviation.
- mutant initial frequency.

`n_ppc::Int`

: Number of samples to generate per set of parameters.

**Optional Arguments**

`param::Dict{Symbol, Symbol}`

: Dictionary indicating the name of the variables in the mcmc chain defining the following variables:`:bc_mean_fitness`

: Variable defining the inferred mutant fitness value`s⁽ᵐ⁾`

.`:bc_std_fitness`

: Variable defining the standard defining the inferred standard deviation on the likelihood function`σ⁽ᵐ⁾`

.`bc_freq`

: Variable defining the inferred initial frequency for the mutant.`population_mean_fitness`

: Common pattern in all population mean fitness variables.

`model::Symbol=:lognormal`

: Either`:normal`

or`:lognormal`

to indicate if the model used a normal or lognormal distribution for the likelihood. This is because when using a normal distribution, the nuisance parameters are sampled in log scale and need to be exponentiated.`flatten::Bool=true`

: Boolean indicating whether to flatten the output of multiple chain into a single column.

**Returns**

`fₜ₊₁ = fₜ × exp(s⁽ᵐ⁾ - s̅ₜ)::Array{Float64}`

: Evaluation of the frequency posterior predictive checks at all times for each MCMC sample.

`freq_bc_ppc(chain, n_ppc; kwargs)`

Function to compute the **posterior predictive checks** for the barcode frequency for adaptive mutants.

**Model**

\[ f_{t+1}^{(m)} = f_{t}^{(m)} \exp\left[ \left( s^{(m)} - \bar{s}_t \right) \tau \right],\]

**Arguments**

`chain::MCMCChains.Chains`

: Chain containing the MCMC samples for the variables needed to compute the posterior predictive checks. The dataframe should have MCMC samples for- mutant relative fitness values.
- (log)-normal likelihood standard deviation.
- mutant initial frequency.

`n_ppc::Int`

: Number of samples to generate per set of parameters.

**Optional Arguments**

`param::Dict{Symbol, Symbol}`

: Dictionary indicating the name of the variables in the mcmc chain defining the following variables:`:bc_mean_fitness`

: Variable defining the inferred mutant fitness value`s⁽ᵐ⁾`

.`:bc_std_fitness`

: Variable defining the standard defining the inferred standard deviation on the likelihood function`σ⁽ᵐ⁾`

.`bc_freq`

: Variable defining the inferred initial frequency for the mutant.`population_mean_fitness`

: Common pattern in all population mean fitness variables.

`flatten::Bool=true`

: Boolean indicating whether to flatten the output of multiple chain into a single column.

**Returns**

`fₜ₊₁ = fₜ × exp(s⁽ᵐ⁾ - s̅ₜ)::Array{Float64}`

: Evaluation of the frequency posterior predictive checks at all times for each MCMC sample.

## Naive estimates

`BarBay.stats.naive_prior`

— Function`naive_prior(data; kwargs)`

Function to compute a naive set of parameters for the prior distributions of the population mean fitness `s̲ₜ`

values, the nuisance parameters in the log-likelihood functions for the frequency ratios `logσ̲ₜ`

, and the log of the Poisson parameters for the observation model `logΛ̲̲`

This function expects the data in a **tidy** format. This means that every row represents **a single observation**. For example, if we measure barcode `i`

in 4 different time points, each of these four measurements gets an individual row. Furthermore, measurements of barcode `j`

over time also get their own individual rows.

The `DataFrame`

must contain at least the following columns:

`id_col`

: Column identifying the ID of the barcode. This can the barcode sequence, for example.`time_col`

: Column defining the measurement time point.`count_col`

: Column with the raw barcode count.`neutral_col`

: Column indicating whether the barcode is from a neutral lineage

or not.

`rep_col`

: (Optional) For hierarchical models to be build with multiple experimental replicates, this column defines which observations belong to which replicate.

**Arguments**

`data::DataFrames.AbstractDataFrame`

:**Tidy dataframe**with the data to be

used to sample from the population mean fitness posterior distribution.

**Optional Keyword Arguments**

`id_col::Symbol=:barcode`

: Name of the column in`data`

containing the barcode identifier. The column may contain any type of entry.`time_col::Symbol=:time`

: Name of the column in`data`

defining the time point at which measurements were done. The column may contain any type of entry as long as`sort`

will resulted in time-ordered names.`count_col::Symbol=:count`

: Name of the column in`data`

containing raw counts per barcode. The column must contain entries of type`<: Int`

.`neutral_col::Symbol=:neutral`

: Name of the column in`data`

defining whether the barcode belongs to a neutral lineage or not. The column must contain entries of type`Bool`

.`rep_col::Union{Nothing,Symbol}=nothing`

: (Optional) Column indicating the experimental replicates each point belongs to.`pseudocount::Int=1`

: Pseudo counts to add to raw counts to avoid dividing by zero. This is useful if some of the barcodes go extinct.`rm_T0::Bool=false`

: Optional argument to remove the first time point from the

inference. Commonly, the data from this first time point is of much lower quality. Therefore, removing this first time point might result in a better inference.

**Returns**

`prior_params::Dict`

: Dictionary with**two**entries:`s_pop_prior`

:**Mean**value of the population mean fitness.**NOTE**: This naive empirical method cannot make statements about the expected standard deviation of the population mean fitness. It is up to the researcher to determine this value.`logσ_pop_prior`

:**Mean**value on the nuisance parameter for the log-likelihood functions on the log-frequency ratios. In other words, the mean for the (log)-Normal distribution on the frequency ratios.**NOTE**: This naive empirical method cannot make statements about the expected standard deviation of the population mean fitness. It is up to the researcher to determine this value.**NOTE**: Typically, one can use the same estimate for both the neutral and the mutant lineages.`logλ_prior`

:**Mean**value of the nuisance parameter for the Poisson observation model parameter.**NOTE**: This naive empirical method cannot make statements about the expected standard deviation of the population mean fitness. It is up to the researcher to determine this value.

`BarBay.stats.naive_fitness`

— Function`naive_fitness(data; id_col, time_col, count_col, neutral_col, pseudocount)`

Function to compute a naive estimate of mutant fitness data based on counts. The fitness estimate is computed as

⟨log(f⁽ᵐ⁾ₜ₊₁ / f⁽ᵐ⁾ₜ) - log(f⁽ⁿ⁾ₜ₊₁ / f⁽ⁿ⁾ₜ))⟩ = s⁽ᵐ⁾

**Arguments**

`data::DataFrames.AbstractDataFrame`

:**Tidy dataframe**with the data to be

used to infer the fitness values on mutants. The `DataFrame`

must contain at least the following columns: - `id_col`

: Column identifying the ID of the barcode. This can the barcode sequence, for example. - `time_col`

: Column defining the measurement time point. - `count_col`

: Column with the raw barcode count. - `neutral_col`

: Column indicating whether the barcode is from a neutral lineage or not.

**Optional Keyword Arguments**

`id_col::Symbol=:barcode`

: Name of the column in`data`

containing the barcode identifier. The column may contain any type of entry.`time_col::Symbol=:time`

: Name of the column in`data`

defining the time point at which measurements were done. The column may contain any type of entry as long as`sort`

will resulted in time-ordered names.`count_col::Symbol=:count`

: Name of the column in`data`

containing the raw barcode count. The column must contain entries of type`Int64`

.`neutral_col::Symbol=:neutral`

: Name of the column in`data`

defining whether the barcode belongs to a neutral lineage or not. The column must contain entries of type`Bool`

.`rm_T0::Bool=false`

: Optional argument to remove the first time point from the

inference. Commonly, the data from this first time point is of much lower quality. Therefore, removing this first time point might result in a better inference.

`pseudocount::Int=1`

: Pseudo count number to add to all counts. This is useful to avoid divisions by zero.

**Returns**

`DataFrames.DataFrame`

: Data frame with two columns:`id_col`

: Column indicating the strain ID.`fitness`

: Naive fitness estimate.

## Miscellaneous statistical functions

`BarBay.stats.build_getq`

— FunctionFunction to build a full-rank distribution to be used for ADVI optimization. The code in this function comes from (`Turing.jl tutorial`

)[https://turinglang.org/v0.28/tutorials/09-variational-inference/]

**Arguments**

`dim::Int`

: Dimensionality of parameter space.`model::DynamicPPL.model`

: Turing model to be fit using ADVI.

**Returns**

Initialized distribution to be used when fitting a full-rank variational model.

`BarBay.stats.matrix_quantile_range`

— Functionmatrix*quantile*range(quantile, matrix; dim=2)

Function to compute the quantile ranges of matrix `matrix`

over dimension `dim`

.

For example, if `quantile[1] = 0.95`

, this function returns the `0.025`

and `0.975`

quantiles that capture 95 percent of the entries in the matrix.

**Arguments**

`quantile::Vector{<:AbstractFloat}`

: List of quantiles to extract from the posterior predictive checks.`matrix::Matrix{<:Real}`

: Array over which to compute quantile ranges.

**Keyword Arguments**

`dim::Int=2`

: Dimension over which to compute quantiles. Default is 2, i.e. columns.

**Returns**

`qs`

: Matrix with requested quantiles over specified dimension.