Bayesian inference of posterior probabilities

After obtaining the PDFs for the LLRs from real data and the null hypotheses, we can convert LLR values into posterior probabilities $P({\mathcal H}_{\mathrm{alt}}\mid \mathrm{LLR})$. We use a similar technique as in ^[Chen2007], which itself was based on a more general framework to estimate local FDRs in genome-wide studies ^[Storey2003]. This framework assumes that the real distribution of a certain test statistic forms a mixture distribution of null and alternative hypotheses. After estimating the null distribution, either analytically or by simulation, it can be compared against the real distribution to determine the proportion of null hypotheses, and consequently the posterior probability that the alternative hypothesis is true at any value of the statistic.

To be precise, consider an arbitrary likelihood ratio test. The fundamental assumption is that in the limit $\mathrm{LLR}\rightarrow 0^+$, all test cases come from the null hypothesis (${\mathcal H}_{\mathrm{null}}$), whilst as $\mathrm{LLR}$ increases, the proportion of alternative hypotheses (${\mathcal H}_{\mathrm{alt}}$) also grows. The mixture distribution of real LLR values is assumed to have a PDF as

\[p(\mathrm{LLR}) = P({\mathcal H}_{\mathrm{null}}) p(\mathrm{LLR}\mid{\mathcal H}_{\mathrm{null}}) + P({\mathcal H}_{\mathrm{alt}})p(\mathrm{LLR}\mid{\mathcal H}_{\mathrm{alt}}).\]

The priors $P({\mathcal H}_{\mathrm{null}})$ and $P({\mathcal H}_{\mathrm{alt}})$ sum to unity and correspond to the proportions of null and alternative hypotheses in the mixture distribution. For any test $i=0,\dots,5$, Bayes' theorem then yields its posterior probability as

\[\begin{aligned} P({\mathcal H}_{\mathrm{alt}}^{(i)}\mid\mathrm{LLR}^{(i)}) &= \frac{p(\mathrm{LLR}^{(i)}\mid{\mathcal H}_{\mathrm{alt}}^{(i)})}{p(\mathrm{LLR}^{(i)})}P({\mathcal H}_{\mathrm{alt}}^{(i)})\\ &= 1 - \frac{p(\mathrm{LLR}^{(i)}\mid{\mathcal H}_{\mathrm{null}}^{(i)})}{p(\mathrm{LLR}^{(i)})}P({\mathcal H}_{\mathrm{null}}^{(i)}) \end{aligned}\]

Based on this, we can define the posterior probabilities of the selected hypotheses according to the table in the Likelihood ratio tests section, i.e. the alternative for tests 0, 1, 2, 4, 5 and the null for test 3 as

\[P_i\equiv \begin{cases} P({\mathcal H}_{\mathrm{alt}}^{(i)}\mid\mathrm{LLR}^{(i)}), & i=0,1,2,4,5\\ P({\mathcal H}_{\mathrm{null}}^{(i)}\mid\mathrm{LLR}^{(i)}),&i=3 \end{cases}\]

Distributions can be estimated either separately for every $(E,A)$ pair or by pooling across all $(E,A)$ pairs. In practice, we test on the order of $10^3$ to $10^4$ candidate targets ("$B$") for every $(E,A)$ such that a separate conversion of LLR values to posterior probabilities is both feasible and recommended, as it accounts for different roles of every gene, especially hub genes, through different rates of alternative hypotheses.

Lastly, in a typical application of BioFindr, inputs of $(E,A)$ pairs will have been pre-determined as the set of significant eQTL-gene pairs from a genome-wide eQTL associaton analysis. In such cases, we may naturally assume $P_1=1$ for all considered pairs, and skip the primary test.

Posterior probabilities are computed in the pprob_col function:

BioFindr.pprob_col — Function

pprob_col(Y::Matrix{T},Ycol::Vector{T}; method="moments") where T<:AbstractFloat

Compute the posterior probabilities for BioFindr test 0 (correlation test) for a given column vector Ycol against all columns of matrix Y.

Y and Ycol are assumed to have undergone supernormalization with each column having mean zero and variance one. The LLRs are scaled by the number of rows (samples).

The optional parameter method determines the mixture distribution fitting method and can be either moments (default) for the method of moments, or kde for kernel-based density estimation.

pprob_col(Y::Matrix{T},Ycol::Vector{T},E::Vector{S}; method="moments") where {T<:AbstractFloat, S<:Integer}

Compute the posterior probabilities for the BioFindr causal tests for a given column vector Ycol with categorical instrument E against all columns of matrix Y:

Test 2 (Linkage test)
Test 3 (Mediation test)
Test 4 (Relevance test)
Test 5 (Pleiotropy test)

Y is assumed to have undergone supernormalization with each column having mean zero and variance one.

For test 2, 4, and 5 the posterior probabilities are the probabilities of the alternative hypothesis being true. For test 3 they are the probabilities of the null hypothesis being true.

The optional parameter method determines the mixture distribution fitting method and can be either moments (default) for the method of moments, or kde for kernel-based density estimation.

pprob_col(Y::Matrix{T},E::Vector{S}; method="moments") where {T<:AbstractFloat, S<:Integer}

Compute the posterior probabilities for differential expression of columns of matrix Y in the groups defined by categorical vector E using BioFindr test 2 (Linkage test)

Y is assumed to have undergone supernormalization with each column having mean zero and variance one.

The optional parameter method determines the mixture distribution fitting method and can be either moments (default) for the method of moments, or kde for kernel-based density estimation.

Computation of the posterior probabilities requires estimation of the prior probability $P({\mathcal H}_{\mathrm{null}})$ and fitting the mixture distribution $p(\mathrm{LLR})$ from a set of random samples $x_1, x_2,\dots,x_m$.

Estimating $P({\mathcal H}_{\mathrm{null}})$

To estimate $\pi_0\equiv P({\mathcal H}_{\mathrm{null}})$ we convert the samples $x_1, x_2,\dots,x_m$ to p-values under the null hypothesis,

\[p_i = P(X\geq x_i \mid {\mathcal H}_{\mathrm{null}}) = \int_{x_i}^\infty p(x \mid \alpha_0,\beta_0) = \mathrm{ccdf}(x_i\mid \alpha_0,\beta_0) \]

where $\mathrm{ccdf}$ denotes the complementary CDF. The histogram of p-values would show the characteristic shape of a set of anti-conservative p-values and $\pi_0$ can be estimated by a robust bootstrap procedure:

BioFindr.pi0est — Function

pi0est(pval)

Estimate the proportion $\pi_0$ of truly null features in a vector pval of p-values using a bootstrap method.

Method of moments estimation of the mixture distribution

To fit the mixture distribution, the original method^[Wang2017] approximated PDFs with histograms. This requires proper choices of histogram bin widths, $P({\mathcal H}_{\mathrm{null}})$, and techniques to ensure the conversion from LLR to posterior probability is monotonically increasing and smooth.

BioFindr uses an alternative, parametric approach, where the basic assumption is that the alternative distribution $p(\mathrm{LLR}\mid{\mathcal H}_{\mathrm{alt}})$ for each test also follows an LBeta distribution, that is, belongs to the same 2-parameter family of distributions as the null distributions.

Let us denote the random variable $X=\mathrm{LLR}$ taking values $x\geq 0$, and $\pi_0=P({\mathcal H}_{\mathrm{null}})$. Our assumption is that

\[X \sim \pi_0 \mathcal{D}(\alpha_0,\beta_0) + (1-\pi_0) \mathcal{D}(\alpha,\beta)\]

\[p(x) = \pi_0 p(x \mid \alpha_0,\beta_0) + (1-\pi_0) p(x\mid \alpha,\beta)\]

where $\mathcal{D}$ denotes an LBeta distribution with its PDF $p(x\mid \alpha,\beta)$ defined in Null distributions of the log-likelihood ratios.

Since the parameters $\alpha_0$ and $\beta_0$ of the null distribution are know exactly (see Null distributions of the log-likelihood ratios) and $\pi_0$ can be estimated from the null p-values (see pi0est), it remains to estimate $\alpha$, and $\beta$ from a set of random samples $x_1, x_2,\dots,x_m$ of $X$.

To estimate $\alpha$ and $\beta$ we make use of the relation between the LBeta and Beta distributions (see Null distributions of the log-likelihood ratios). Define the random variable

\[Y = 1 - e^{-2 X}\]

Since by assumption each component of the mixture distribution of $X$ follows an LBeta distribution, it follows that $Y$ follows a mixture of Beta distributions,

\[Y \sim \pi_0 B\Bigl(\frac{\alpha_0}{2},\frac{\beta_0}{2}\Bigr) + (1 - \pi_0) B\Bigl(\frac{\alpha}{2},\frac{\beta}{2}\Bigr)\]

This result can be derived as follows. The function $g(x)= 1-e^{-2x}$ increases monotonically for $x\geq 0$ and has the inverse $g^{-1}(y)=-\frac{1}{2}\ln(1-y)$. By the usual rules of transforming random variables:

\[\begin{aligned} P(Y\leq y) &= P(1 - e^{-2 X} \leq y) \\ &= P(X \leq g^{-1}(y))\\ &= \pi_0 P(X \leq g^{-1}(y) \mid \alpha_0,\beta_0) + (1-\pi_0) P(X \leq g^{-1}(y) \mid \alpha,\beta)\\ &= \pi_0 F_0(g^{-1}(y)) + (1-\pi_0) F_1(g^{-1}(y))\\ &= \pi_0 G_0(y) + (1-\pi_0) G_1(y) \end{aligned}\]

where $F_0$ and $F_1$ are the CDFs of random variables with distributions $\mathcal{D}(\alpha_0,\beta_0)$ and $\mathcal{D}(\alpha,\beta)$, respectively, and $G_0$ and $G_1$ are the CDFs of random variables with distributions $B(\alpha_0/2,\beta_0/2)$ and $B(\alpha/2,\beta/2)$. The last step used the transformation between LBeta and Beta distributions derived in Null distributions of the log-likelihood ratios.

The parameters of a Beta distribution relate to its first and second moment as follows: if $Z\sim B(a,b)$, then

\[\begin{aligned} \mathbb{E}(Z) &= \frac{a}{a+b} \equiv m_1(a,b)\\ \mathbb{E}(Z^2) &= \mathbb{E}(Z) \frac{a+1}{a+b+1} \equiv m_2(a,b) \end{aligned}\]

Conversely, if first and second moments $m_1$ and $m_2$ are given, these equations can be inverted to give the parameters

\[\begin{aligned} a &= \frac{m_1(m_1-m_2)}{(m_2-m_1^2)}\\ b &= \frac{(1-m_1)(m_1-m_2)}{(m_2-m_1^2)}, \end{aligned}\]

which requires $m_1>m_2>m_1^2$

Multiplying by two then gives the parameters of an LBeta distribution given the moments of the corresponding Beta distribution:

BioFindr.fit_mom — Function

fit_mom(LBeta, m1, m2)

Fit an LBeta distribution to given first and second moments m1 and m2 of the corresponding Beta distribution. This requires that m1 and m2 satisfy the following relations (see also the Beta distribution wiki):

$m_1>m_2 \wedge m_2>m_1^2$

An AssertionError is thrown if the condition evaluates to false.

For the mixture distributed $Y$, we have

\[\mathbb{E}(Y^k) = \pi_0 \mathbb{E}(Y^k \mid \alpha_0,\beta_0) + (1-\pi_0) \mathbb{E}(Y^k \mid \alpha,\beta)\]

and hence

\[\begin{aligned} \mathbb{E}(Y) &= \pi_0 m_1\bigl(\alpha_0/2,\beta_0/2\bigr) + (1-\pi_0) m_1\bigl(\alpha/2,\beta/2\bigr)\\ \mathbb{E}(Y^2) &= \pi_0 m_2\bigl(\alpha_0/2,\beta_0/2\bigr) + (1-\pi_0) m_2\bigl(\alpha/2,\beta/2\bigr) \end{aligned}\]

\[\begin{aligned} m_1\bigl(\alpha/2,\beta/2\bigr) &= \frac{\mathbb{E}(Y) - \pi_0 m_1\bigl(\alpha_0/2,\beta_0/2\bigr)}{(1-\pi_0)}\\ m_2\bigl(\alpha/2,\beta/2\bigr) &= \frac{\mathbb{E}(Y^2) - \pi_0 m_2\bigl(\alpha_0/2,\beta_0/2\bigr)}{(1-\pi_0)} \end{aligned}\]

Given samples $x_1, x_2,\dots,x_m$ we transform them to

\[y_i = 1 - e^{-2 x_i}\]

and replace $\mathbb{E}(Y)$ and $\mathbb{E}(Y^2)$ in these equations by their estimates $\frac{1}{n}\sum_i y_i$ and $\frac{1}{n}\sum_i y_i^2$, respectively. Likewise we replace $\pi_0$ by its estimate $\hat{\pi}_0$ obtained as explained above (see pi0est). Since $\alpha_0$ and $\beta_0$ are known exactly, we obtain estimates $\hat{m}_1$ and $\hat{m}_2$ for the moments of a Beta distribution with parameters $\alpha/2$ and $\beta/2$. Plugging these moment estimates in the fit_mom function gives estimates $\hat{\alpha}$ and $\hat{\beta}$ for the corresponding LBeta distribution.

To make this estimated distribution a valid alternative distribution component of the LLRs mixture distribution, we need to ensure the validity of two edge cases:

In the limit $\mathrm{LLR}\to 0^+$, all cases must come from the null distribution. This will be the case if $\hat{\alpha} \geq \alpha_0$, and if this is not the case, we set $\hat{\alpha} = \alpha_0$.
In the limit $\mathrm{LLR}\to \infty$, all cases must come from the alternative distribution. This will be the case if $\hat{\beta} \leq \beta_0$, and if this is not the case, we set $\hat{\beta} = \beta_0$.

The entire procedure is implemented in the fit_mixdist_mom function:

BioFindr.fit_mixdist_mom — Function

fit_mixdist_mom(llr,ns,ng=1,test=:corr)

Fit a two-component mixture distribution of two LBeta distributions to a vector of log-likelihood ratios llr using a method-of-moments algorithm. The first component is the true null distribution for a given BioFindr test with sample size ns and number of genotype groups ng. The second component is the alternative distribution, assumed to follow an LBeta distribution. The prior probability pi0 of an observation belonging to the null component is fixed and determined by the pi0est function. Hence only the parameters of the alternative component need to be estimated.

The input variable test can take the values:

:corr - correlation test (test 0)
:link - linkage test (test 1/2)
:med - mediation test (test 3)
:relev - relevance test (test 4)
:pleio - pleiotropy test (test 5)

With two input arguments, the correlation test with ns samples is used. With three input arguments, or with four arguments and test equal to :corr, the correlation test with ns samples is used and the third argument is ignored.

Kernel density estimation of the mixture distribution

The method of moments mixture distribution works very well in practice (by visual inspection of histograms and fitted distributions), but there is no theory to support it. Moreover the condition on the moments to produce valid distribution parameters (See fit_mom) is sometimes violated (typically when $\pi_0\approx 1$). Hence BioFindr also implements an alternative kernel density estimation method, comparable to the one implemented in the qvalue package for R.

BioFindr.fit_mixdist_KDE — Function

fit_mixdist_KDE(llr,ns,[ng,test])

Return posterior probabilities for a vector of log-likelihood ratio values llr for a given BioFindr test with sample size ns and number of genotype groups ng. The input variable test can take the values:

':corr' - correlation test (test 0)
':link' - linkage test (test 1/2)
':med' - mediation test (test 3)
':relev' - relevance test (test 4)
':pleio' - pleiotropy test (test 5)

With two input arguments, the correlation test with ns samples is used. With three input arguments, or with four arguments and test equal to ":corr", the correlation test with ns samples is used and the third argument is ignored.

See also pi0est, fit_kde, nulldist, nullpval.

To mitigate edge effects due to LLRs being restricted to the half-open interval $[0,\infty)$, the kernel density estimation of $p(\mathrm{LLR})$ operates on transformed values:

BioFindr.fit_kde — Function

fit_kde(llr)

Fit a distribution function to a vector of log-likelihood ratio values llr using kernel density estimation. To avoid boundary effects in the KDE, the log-likelihoods (which take values in $[0,\infty)$) are first transformed to a vector

$z = \log \left( e^{2 \mathrm{LLR}} - 1 \right)$

which takes values in $(-\infty,\infty)$. KDE is applied to the transformed variables, and the probability density function (pdf) for llr is obtained from the pdf for $z$ by the usual transformation rule for functions of random variables.

Nevertheless, the KDE of $p(\mathrm{LLR})$ will still fluctuate wildly near $\mathrm{LLR}\approx 0$. Given a set of samples $x_1,\dots,x_m$ with KDE estimated PDF values $\tilde{p}(x_1), \dots, \tilde{p}(x_m)$, fit_mixdist_KDE calculates smoothened and monotonically increasing estimayes $p(x_1),\dots, p(x_m)$ as

\[p(x_i) = \max\left(0, \min_{j\geq i} \tilde{p}(x_j) \right)\]

Chen2007Chen L, Emmert-Streib F, Storey J. Harnessing naturally randomized transcription to infer regulatory relationships among genes. Genome Biol 8, R219 (2007).
Storey2003Storey JD, Tibshirani R. Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences. 2003;100(16):9440–9445.
Wang2017Wang L, Michoel T (2017) Efficient and accurate causal inference with hidden confounders from genome-transcriptome variation data. PLoS Comput Biol 13(8): e1005703.