# Bayesian inference of posterior probabilities

After obtaining the PDFs for the LLRs from real data and the null hypotheses, we can convert LLR values into posterior probabilities $P({\mathcal H}_{\mathrm{alt}}\mid \mathrm{LLR})$. We use a similar technique as in ^{[Chen2007]}, which itself was based on a more general framework to estimate local FDRs in genome-wide studies ^{[Storey2003]}. This framework assumes that the real distribution of a certain test statistic forms a mixture distribution of null and alternative hypotheses. After estimating the null distribution, either analytically or by simulation, it can be compared against the real distribution to determine the proportion of null hypotheses, and consequently the posterior probability that the alternative hypothesis is true at any value of the statistic.

To be precise, consider an arbitrary likelihood ratio test. The fundamental assumption is that in the limit $\mathrm{LLR}\rightarrow 0^+$, all test cases come from the null hypothesis (${\mathcal H}_{\mathrm{null}}$), whilst as $\mathrm{LLR}$ increases, the proportion of alternative hypotheses (${\mathcal H}_{\mathrm{alt}}$) also grows. The mixture distribution of real LLR values is assumed to have a PDF as

\[p(\mathrm{LLR}) = P({\mathcal H}_{\mathrm{null}}) p(\mathrm{LLR}\mid{\mathcal H}_{\mathrm{null}}) + P({\mathcal H}_{\mathrm{alt}})p(\mathrm{LLR}\mid{\mathcal H}_{\mathrm{alt}}).\]

The priors $P({\mathcal H}_{\mathrm{null}})$ and $P({\mathcal H}_{\mathrm{alt}})$ sum to unity and correspond to the proportions of null and alternative hypotheses in the mixture distribution. For any test $i=0,\dots,5$, Bayes' theorem then yields its posterior probability as

\[\begin{aligned} P({\mathcal H}_{\mathrm{alt}}^{(i)}\mid\mathrm{LLR}^{(i)}) &= \frac{p(\mathrm{LLR}^{(i)}\mid{\mathcal H}_{\mathrm{alt}}^{(i)})}{p(\mathrm{LLR}^{(i)})}P({\mathcal H}_{\mathrm{alt}}^{(i)})\\ &= 1 - \frac{p(\mathrm{LLR}^{(i)}\mid{\mathcal H}_{\mathrm{null}}^{(i)})}{p(\mathrm{LLR}^{(i)})}P({\mathcal H}_{\mathrm{null}}^{(i)}) \end{aligned}\]

Based on this, we can define the posterior probabilities of the selected hypotheses according to the table in the Likelihood ratio tests section, i.e. the alternative for tests 0, 1, 2, 4, 5 and the null for test 3 as

\[P_i\equiv \begin{cases} P({\mathcal H}_{\mathrm{alt}}^{(i)}\mid\mathrm{LLR}^{(i)}), & i=0,1,2,4,5\\ P({\mathcal H}_{\mathrm{null}}^{(i)}\mid\mathrm{LLR}^{(i)}),&i=3 \end{cases}\]

Distributions can be estimated either separately for every $(E,A)$ pair or by pooling across all $(E,A)$ pairs. In practice, we test on the order of $10^3$ to $10^4$ candidate targets ("$B$") for every $(E,A)$ such that a separate conversion of LLR values to posterior probabilities is both feasible and recommended, as it accounts for different roles of every gene, especially hub genes, through different rates of alternative hypotheses.

Lastly, in a typical application of BioFindr, inputs of $(E,A)$ pairs will have been pre-determined as the set of significant eQTL-gene pairs from a genome-wide eQTL associaton analysis. In such cases, we may naturally assume $P_1=1$ for all considered pairs, and skip the primary test.

Posterior probabilities are computed in the `pprob_col`

function:

`BioFindr.pprob_col`

— Function`pprob_col(Y::Matrix{T},Ycol::Vector{T}; method="moments") where T<:AbstractFloat`

Compute the posterior probabilities for BioFindr test 0 (**correlation test**) for a given column vector `Ycol`

against all columns of matrix `Y`

.

`Y`

and `Ycol`

are assumed to have undergone supernormalization with each column having mean zero and variance one. The LLRs are scaled by the number of rows (samples).

The optional parameter `method`

determines the mixture distribution fitting method and can be either `moments`

(default) for the method of moments, or `kde`

for kernel-based density estimation.

See also `supernormalize`

, `fit_mixdist_mom`

, `fit_mixdist_KDE`

.

`pprob_col(Y::Matrix{T},Ycol::Vector{T},E::Vector{S}; method="moments") where {T<:AbstractFloat, S<:Integer}`

Compute the posterior probabilities for the BioFindr causal tests for a given column vector `Ycol`

with categorical instrument `E`

against all columns of matrix `Y`

:

- Test 2 (
**Linkage test**) - Test 3 (
**Mediation test**) - Test 4 (
**Relevance test**) - Test 5 (
**Pleiotropy test**)

`Y`

is assumed to have undergone supernormalization with each column having mean zero and variance one.

For test 2, 4, and 5 the posterior probabilities are the probabilities of the alternative hypothesis being true. For test 3 they are the probabilities of the null hypothesis being true.

The optional parameter `method`

determines the mixture distribution fitting method and can be either `moments`

(default) for the method of moments, or `kde`

for kernel-based density estimation.

See also `supernormalize`

, `fit_mixdist_mom`

, `fit_mixdist_KDE`

.

`pprob_col(Y::Matrix{T},E::Vector{S}; method="moments") where {T<:AbstractFloat, S<:Integer}`

Compute the posterior probabilities for differential expression of columns of matrix `Y`

in the groups defined by categorical vector `E`

using BioFindr test 2 (**Linkage test**)

`Y`

is assumed to have undergone supernormalization with each column having mean zero and variance one.

The optional parameter `method`

determines the mixture distribution fitting method and can be either `moments`

(default) for the method of moments, or `kde`

for kernel-based density estimation.

See also `supernormalize`

, `fit_mixdist_mom`

, `fit_mixdist_KDE`

.

Computation of the posterior probabilities requires estimation of the prior probability $P({\mathcal H}_{\mathrm{null}})$ and fitting the mixture distribution $p(\mathrm{LLR})$ from a set of random samples $x_1, x_2,\dots,x_m$.

## Estimating $P({\mathcal H}_{\mathrm{null}})$

To estimate $\pi_0\equiv P({\mathcal H}_{\mathrm{null}})$ we convert the samples $x_1, x_2,\dots,x_m$ to p-values under the null hypothesis,

\[p_i = P(X\geq x_i \mid {\mathcal H}_{\mathrm{null}}) = \int_{x_i}^\infty p(x \mid \alpha_0,\beta_0) = \mathrm{ccdf}(x_i\mid \alpha_0,\beta_0) \]

where $\mathrm{ccdf}$ denotes the complementary CDF. The histogram of p-values would show the characteristic shape of a set of anti-conservative p-values and $\pi_0$ can be estimated by a robust bootstrap procedure:

`BioFindr.pi0est`

— Function`pi0est(pval)`

Estimate the proportion $\pi_0$ of truly null features in a vector `pval`

of p-values using a bootstrap method.

## Method of moments estimation of the mixture distribution

To fit the mixture distribution, the original method^{[Wang2017]} approximated PDFs with histograms. This requires proper choices of histogram bin widths, $P({\mathcal H}_{\mathrm{null}})$, and techniques to ensure the conversion from LLR to posterior probability is monotonically increasing and smooth.

BioFindr uses an alternative, parametric approach, where the basic assumption is that the alternative distribution $p(\mathrm{LLR}\mid{\mathcal H}_{\mathrm{alt}})$ for each test also follows an `LBeta`

distribution, that is, belongs to the same 2-parameter family of distributions as the null distributions.

Let us denote the random variable $X=\mathrm{LLR}$ taking values $x\geq 0$, and $\pi_0=P({\mathcal H}_{\mathrm{null}})$. Our assumption is that

\[X \sim \pi_0 \mathcal{D}(\alpha_0,\beta_0) + (1-\pi_0) \mathcal{D}(\alpha,\beta)\]

or

\[p(x) = \pi_0 p(x \mid \alpha_0,\beta_0) + (1-\pi_0) p(x\mid \alpha,\beta)\]

where $\mathcal{D}$ denotes an `LBeta`

distribution with its PDF $p(x\mid \alpha,\beta)$ defined in Null distributions of the log-likelihood ratios.

Since the parameters $\alpha_0$ and $\beta_0$ of the null distribution are know exactly (see Null distributions of the log-likelihood ratios) and $\pi_0$ can be estimated from the null p-values (see `pi0est`

), it remains to estimate $\alpha$, and $\beta$ from a set of random samples $x_1, x_2,\dots,x_m$ of $X$.

To estimate $\alpha$ and $\beta$ we make use of the relation between the `LBeta`

and Beta distributions (see Null distributions of the log-likelihood ratios). Define the random variable

\[Y = 1 - e^{-2 X}\]

Since by assumption each component of the mixture distribution of $X$ follows an `LBeta`

distribution, it follows that $Y$ follows a mixture of Beta distributions,

\[Y \sim \pi_0 B\Bigl(\frac{\alpha_0}{2},\frac{\beta_0}{2}\Bigr) + (1 - \pi_0) B\Bigl(\frac{\alpha}{2},\frac{\beta}{2}\Bigr)\]

This result can be derived as follows. The function $g(x)= 1-e^{-2x}$ increases monotonically for $x\geq 0$ and has the inverse $g^{-1}(y)=-\frac{1}{2}\ln(1-y)$. By the usual rules of transforming random variables:

\[\begin{aligned} P(Y\leq y) &= P(1 - e^{-2 X} \leq y) \\ &= P(X \leq g^{-1}(y))\\ &= \pi_0 P(X \leq g^{-1}(y) \mid \alpha_0,\beta_0) + (1-\pi_0) P(X \leq g^{-1}(y) \mid \alpha,\beta)\\ &= \pi_0 F_0(g^{-1}(y)) + (1-\pi_0) F_1(g^{-1}(y))\\ &= \pi_0 G_0(y) + (1-\pi_0) G_1(y) \end{aligned}\]

where $F_0$ and $F_1$ are the CDFs of random variables with distributions $\mathcal{D}(\alpha_0,\beta_0)$ and $\mathcal{D}(\alpha,\beta)$, respectively, and $G_0$ and $G_1$ are the CDFs of random variables with distributions $B(\alpha_0/2,\beta_0/2)$ and $B(\alpha/2,\beta/2)$. The last step used the transformation between `LBeta`

and Beta distributions derived in Null distributions of the log-likelihood ratios.

The parameters of a Beta distribution relate to its first and second moment as follows: if $Z\sim B(a,b)$, then

\[\begin{aligned} \mathbb{E}(Z) &= \frac{a}{a+b} \equiv m_1(a,b)\\ \mathbb{E}(Z^2) &= \mathbb{E}(Z) \frac{a+1}{a+b+1} \equiv m_2(a,b) \end{aligned}\]

Conversely, if first and second moments $m_1$ and $m_2$ are given, these equations can be inverted to give the parameters

\[\begin{aligned} a &= \frac{m_1(m_1-m_2)}{(m_2-m_1^2)}\\ b &= \frac{(1-m_1)(m_1-m_2)}{(m_2-m_1^2)}, \end{aligned}\]

which requires $m_1>m_2>m_1^2$

Multiplying by two then gives the parameters of an `LBeta`

distribution given the moments of the corresponding Beta distribution:

`BioFindr.fit_mom`

— Function`fit_mom(LBeta, m1, m2)`

Fit an `LBeta`

distribution to given first and second moments `m1`

and `m2`

of the corresponding Beta distribution. This requires that `m1`

and `m2`

satisfy the following relations (see also the Beta distribution wiki):

$m_1>m_2 \wedge m_2>m_1^2$

An AssertionError is thrown if the condition evaluates to `false`

.

For the mixture distributed $Y$, we have

\[\mathbb{E}(Y^k) = \pi_0 \mathbb{E}(Y^k \mid \alpha_0,\beta_0) + (1-\pi_0) \mathbb{E}(Y^k \mid \alpha,\beta)\]

and hence

\[\begin{aligned} \mathbb{E}(Y) &= \pi_0 m_1\bigl(\alpha_0/2,\beta_0/2\bigr) + (1-\pi_0) m_1\bigl(\alpha/2,\beta/2\bigr)\\ \mathbb{E}(Y^2) &= \pi_0 m_2\bigl(\alpha_0/2,\beta_0/2\bigr) + (1-\pi_0) m_2\bigl(\alpha/2,\beta/2\bigr) \end{aligned}\]

or

\[\begin{aligned} m_1\bigl(\alpha/2,\beta/2\bigr) &= \frac{\mathbb{E}(Y) - \pi_0 m_1\bigl(\alpha_0/2,\beta_0/2\bigr)}{(1-\pi_0)}\\ m_2\bigl(\alpha/2,\beta/2\bigr) &= \frac{\mathbb{E}(Y^2) - \pi_0 m_2\bigl(\alpha_0/2,\beta_0/2\bigr)}{(1-\pi_0)} \end{aligned}\]

Given samples $x_1, x_2,\dots,x_m$ we transform them to

\[y_i = 1 - e^{-2 x_i}\]

and replace $\mathbb{E}(Y)$ and $\mathbb{E}(Y^2)$ in these equations by their estimates $\frac{1}{n}\sum_i y_i$ and $\frac{1}{n}\sum_i y_i^2$, respectively. Likewise we replace $\pi_0$ by its estimate $\hat{\pi}_0$ obtained as explained above (see `pi0est`

). Since $\alpha_0$ and $\beta_0$ are known exactly, we obtain estimates $\hat{m}_1$ and $\hat{m}_2$ for the moments of a Beta distribution with parameters $\alpha/2$ and $\beta/2$. Plugging these moment estimates in the `fit_mom`

function gives estimates $\hat{\alpha}$ and $\hat{\beta}$ for the corresponding `LBeta`

distribution.

To make this estimated distribution a valid alternative distribution component of the LLRs mixture distribution, we need to ensure the validity of two edge cases:

- In the limit $\mathrm{LLR}\to 0^+$, all cases must come from the null distribution. This will be the case if $\hat{\alpha} \geq \alpha_0$, and if this is not the case, we set $\hat{\alpha} = \alpha_0$.
- In the limit $\mathrm{LLR}\to \infty$, all cases must come from the alternative distribution. This will be the case if $\hat{\beta} \leq \beta_0$, and if this is not the case, we set $\hat{\beta} = \beta_0$.

The entire procedure is implemented in the `fit_mixdist_mom`

function:

`BioFindr.fit_mixdist_mom`

— Function`fit_mixdist_mom(llr,ns,ng=1,test=:corr)`

Fit a two-component mixture distribution of two LBeta distributions to a vector of log-likelihood ratios `llr`

using a method-of-moments algorithm. The first component is the true null distribution for a given BioFindr `test`

with sample size `ns`

and number of genotype groups `ng`

. The second component is the alternative distribution, assumed to follow an `LBeta`

distribution. The prior probability `pi0`

of an observation belonging to the null component is fixed and determined by the `pi0est`

function. Hence only the parameters of the alternative component need to be estimated.

The input variable `test`

can take the values:

- :corr -
**correlation test**(test 0) - :link -
**linkage test**(test 1/2) - :med -
**mediation test**(test 3) - :relev -
**relevance test**(test 4) - :pleio -
**pleiotropy test**(test 5)

With two input arguments, the correlation test with `ns`

samples is used. With three input arguments, or with four arguments and `test`

equal to `:corr`

, the correlation test with `ns`

samples is used and the third argument is ignored.

## Kernel density estimation of the mixture distribution

The method of moments mixture distribution works very well in practice (by visual inspection of histograms and fitted distributions), but there is no theory to support it. Moreover the condition on the moments to produce valid distribution parameters (See `fit_mom`

) is sometimes violated (typically when $\pi_0\approx 1$). Hence BioFindr also implements an alternative kernel density estimation method, comparable to the one implemented in the qvalue package for R.

`BioFindr.fit_mixdist_KDE`

— Function`fit_mixdist_KDE(llr,ns,[ng,test])`

Return posterior probabilities for a vector of log-likelihood ratio values `llr`

for a given BioFindr `test`

with sample size `ns`

and number of genotype groups `ng`

. The input variable `test`

can take the values:

- ':corr' -
**correlation test**(test 0) - ':link' -
**linkage test**(test 1/2) - ':med' -
**mediation test**(test 3) - ':relev' -
**relevance test**(test 4) - ':pleio' -
**pleiotropy test**(test 5)

With two input arguments, the correlation test with `ns`

samples is used. With three input arguments, or with four arguments and `test`

equal to ":corr", the correlation test with `ns`

samples is used and the third argument is ignored.

To mitigate edge effects due to LLRs being restricted to the half-open interval $[0,\infty)$, the kernel density estimation of $p(\mathrm{LLR})$ operates on transformed values:

`BioFindr.fit_kde`

— Function`fit_kde(llr)`

Fit a distribution function to a vector of log-likelihood ratio values `llr`

using kernel density estimation. To avoid boundary effects in the KDE, the log-likelihoods (which take values in $[0,\infty)$) are first transformed to a vector

$z = \log \left( e^{2 \mathrm{LLR}} - 1 \right)$

which takes values in $(-\infty,\infty)$. KDE is applied to the transformed variables, and the probability density function (pdf) for `llr`

is obtained from the pdf for $z$ by the usual transformation rule for functions of random variables.

Nevertheless, the KDE of $p(\mathrm{LLR})$ will still fluctuate wildly near $\mathrm{LLR}\approx 0$. Given a set of samples $x_1,\dots,x_m$ with KDE estimated PDF values $\tilde{p}(x_1), \dots, \tilde{p}(x_m)$, `fit_mixdist_KDE`

calculates smoothened and monotonically increasing estimayes $p(x_1),\dots, p(x_m)$ as

\[p(x_i) = \max\left(0, \min_{j\geq i} \tilde{p}(x_j) \right)\]

- Chen2007Chen L, Emmert-Streib F, Storey J. Harnessing naturally randomized transcription to infer regulatory relationships among genes. Genome Biol 8, R219 (2007).
- Storey2003Storey JD, Tibshirani R. Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences. 2003;100(16):9440–9445.
- Wang2017Wang L, Michoel T (2017) Efficient and accurate causal inference with hidden confounders from genome-transcriptome variation data. PLoS Comput Biol 13(8): e1005703.