Tests to evaluate

Based on the six likelihood ratio tests, we use the following tests and test combinations for the inference of genetic regulations:

Coexpression analysis

The correlation test is introduced as a benchmark, against which we can compare other methods involving genotype information. Pairwise correlation is a simple measure for the probability of two genes being functionally related either through direct or indirect regulation, or through coregulation by a third factor. Bayesian inference additionally considers different gene roles. Its predicted posterior probability for regulation is $P_0$.

Correlation analysis can be performed by calling findr with one argument, a matrix or dataframe of gene expression values:

BioFindr.findr — Method

findr(X::Matrix{T}; cols=[], method="moments", combination="none") where T<:AbstractFloat

Compute posterior probabilities for nonzero pairwise correlations between columns of input matrix X. The probabilities are directed (asymmetric) in the sense that they are estimated from a column-specific background distribution.

The optional parameter cols (vector of integers) determines whether we consider all columns of X as source nodes (cols=[], default), or only a subset of columns determined by the indices in the vector cols.

The optional parameter method determines the LLR mixture distribution fitting method and can be either moments (default) for the method of moments, or kde for kernel-based density estimation.

The optional parameter combination determines whether the output must be symmetrized. Possible values are none (default), prod, mean, or anti. If the optional parameter cols is non-empty, symmetrization makes no sense and an error will be thrown unless combination="none".

BioFindr.findr — Method

findr(dX::T; colnames=[], method="moments", FDR=1.0, sorted=true, combination="none") where T<:AbstractDataFrame

Wrapper for findr(Matrix(dX)) when the input dX is in the form of a DataFrame. The output is then also wrapped in a DataFrame with Source, Target, (Posterior) Probability, and qvalue columns.

The optional parameter colnames (vector of strings) determines whether we consider all columns of dX as source nodes (colnames=[], default), or only a subset of columns determined by the variable names in the vector colnames.

The optional parameter method determines the LLR mixture distribution fitting method and can be either moments (default) for the method of moments, or kde for kernel-based density estimation.

The optional parameter FDR can be used to return only a subset of interactions with a desired expected FDR value (q-value threshold) (default 1.0, no filtering).

The optional parameter sorted determines whether the output is sorted by increasing q-value / decreasing posterior probability (sorted=true, the default) or by causal factor (column names of dX) (sorted=false).

The optional parameter combination determines whether the output must be symmetrized. Possible values are none (default), prod, mean, or anti. If the optional parameter colnames is non-empty, symmetrization makes no sense and an error will be thrown unless combination="none".
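
The FDR filter described above works on q-values derived from the posterior probabilities. As a rough sketch of that relationship, assume the common convention that the q-value of a pair is the average local FDR (one minus the posterior probability) over all pairs ranked at least as high. The helper `qvalues_sketch` is hypothetical and not part of BioFindr; the package's own routine may differ in details.

```julia
# Hypothetical helper, NOT a BioFindr function: q-values from posterior probabilities.
# ASSUMPTION: q-value = running mean of local FDR (1 - p) over pairs sorted by decreasing p.
function qvalues_sketch(p::AbstractVector{<:Real})
    order = sortperm(p; rev=true)              # most confident pairs first
    lfdr = 1 .- p[order]                       # local FDR of each ranked pair
    q_sorted = cumsum(lfdr) ./ (1:length(p))   # mean local FDR among the top-k pairs
    q = similar(q_sorted)
    q[order] = q_sorted                        # restore the original ordering
    return q
end

p = [0.99, 0.60, 0.95, 0.10]   # made-up posterior probabilities
q = qvalues_sketch(p)
keep = q .<= 0.05              # pairs retained at a 5% q-value threshold
```

With these made-up numbers only the first and third pairs pass the threshold, which mirrors what setting FDR=0.05 is meant to achieve under the stated assumption.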

Association analysis

The secondary linkage test is introduced to test association between genetic variants and gene expression levels, and can be used more generally to analyze differential expression of genes across groups defined by any kind of categorical variable. Its predicted posterior probability for differential expression is $P_2$.

Association analysis can be performed by calling findr with two arguments, matrices or dataframes of continuous gene expression values and categorical genotype or more general grouping values, respectively:

BioFindr.findr — Method

findr(X::Matrix{T},G::Array{S}; method="moments") where {T<:AbstractFloat, S<:Integer}

Compute posterior probabilities for nonzero differential expression of columns of input matrix X across groups defined by one or more categorical variables (columns of G).

Return a matrix of size ncols(X) x ncols(G)

The optional parameter method determines the LLR mixture distribution fitting method and can be either moments (default) for the method of moments, or kde for kernel-based density estimation.

Note

G is currently assumed to be an array (vector or matrix) of integers. CategoricalArrays will be supported in the future.

BioFindr.findr — Method

findr(dX::T, dG::T; method="moments", FDR=1.0, sorted=true) where T<:AbstractDataFrame

Wrapper for findr(Matrix(dX), Matrix(dG)) when the inputs dX and dG are in the form of a DataFrame. The output is then also wrapped in a DataFrame with Source, Target, (Posterior) Probability, and qvalue columns.

The optional parameter method determines the LLR mixture distribution fitting method and can be either moments (default) for the method of moments, or kde for kernel-based density estimation.

The optional parameter FDR can be used to return only a subset of interactions with a desired expected FDR value (q-value threshold) (default 1.0, no filtering).

Note that depending on the element type of Matrix(dG), different matrix-based methods are called. If Matrix(dG) consists of Floats, posterior probabilities for nonzero pairwise correlations between the variables in dG and variables in dX are computed. If Matrix(dG) consists of integers, posterior probabilities for nonzero differential expression of variables in dX across groups defined by the variables in dG are computed.

See also findr(::Matrix,::Array), stackprobs, globalfdr!.
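
The type-dependent behaviour of the two-argument call follows from Julia's multiple dispatch. The sketch below uses a hypothetical stand-in function, `analysis_kind`, whose two methods copy the argument types of the matrix-based signatures on this page and only report which branch would be selected. It does not call BioFindr.

```julia
# Hypothetical stand-ins, NOT BioFindr functions: they only name the selected branch.
# Integer second argument: association (differential expression across groups).
analysis_kind(X::Matrix{T}, G::Array{S}) where {T<:AbstractFloat, S<:Integer} = :association
# Float second argument of the same element type: pairwise correlation.
analysis_kind(X1::Matrix{T}, X2::Matrix{T}) where {T<:AbstractFloat} = :correlation

X = randn(50, 4)            # 50 samples, 4 genes
G_int = rand(0:2, 50, 3)    # genotypes coded as integers
G_float = Float64.(G_int)   # same values stored as floats

kind_int = analysis_kind(X, G_int)       # integer groups select the association branch
kind_float = analysis_kind(X, G_float)   # float values select the correlation branch
```

In practice this means the element type of the genotype columns should be checked before calling findr, since the same numbers stored as floats lead to a different analysis.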

Causal inference

Mediation

The traditional causal inference test, as explained in [Chen2007], suggested that the regulatory relation $E\to A\to B$ can be confirmed with the combination of three separate tests: $E$ regulates $A$, $E$ regulates $B$, and $E$ only regulates $B$ through $A$ (i.e. $E$ and $B$ become independent when conditioning on $A$). They correspond to the primary, secondary, and conditional independence tests respectively. The regulatory relation $E\to A\to B$ is regarded positive only when all three tests return positive. The three tests filter the initial hypothesis space of all possible relations between $E$, $A$, and $B$ sequentially to $E\to A$ (primary test), $E\to A \wedge E\to B$ (secondary test), and $E\to A\to B \wedge$ (no confounder for $A$ and $B$) (conditional independence test). The resulting test is stronger than $E\to A\to B$ because it disallows confounders for $A$ and $B$. Its probability can be broken down as

\[P_{\text{med}} \equiv P_1P_2P_3\]

BioFindr expects a set of significant eQTLs and their associated genes as input, and therefore $P_1=1$ is assured and not calculated separately in BioFindr. Note that $P_{\text{med}}$ is the estimated local precision, i.e. the probability that tests 2 and 3 are both true. Correspondingly, its local FDR (the probability that at least one of them is false) is $1-P_{\text{med}}$.
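
As a small worked example with made-up numbers, suppose that for some triple $(E, A, B)$ the secondary test gives $P_2 = 0.9$ and the conditional independence test gives $P_3 = 0.8$. With $P_1 = 1$, the local precision and local FDR of the mediation test are

\[P_{\text{med}} = 1 \times 0.9 \times 0.8 = 0.72, \qquad 1 - P_{\text{med}} = 0.28\]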

Instrumental variables

The pleiotropy test is introduced to test if an $E\to B$ association is not independent of the $E\to A$ association, that is, if an independent pleiotropic effect of $E$ on both genes can be excluded. If $E$ regulates $A$ (is a cis-eQTL for $A$), and $E$ regulates $B$, and $B$ and $A$ are not independent given $E$, then we can regard $E$ as a proxy or instrumental variable for $A$ and infer a regulatory relation $E\to A\to B$ from the $E\to B$ association. The three tests verify the hypothesis that $B \leftarrow E \to A \wedge \lnot(A ⫫ B | E)$, a superset of $E\to A\to B$. Its probability can be broken down as

\[P_{\text{IV}} \equiv P_1P_2P_5\]

As before, $P_1=1$ is assured and not calculated separately in BioFindr. $P_{\text{IV}}$ is again the estimated local precision, i.e. the probability that tests 2 and 5 are both true, and its local FDR (the probability that at least one of them is false) is $1-P_{\text{IV}}$.

Relevance

The relevance test is introduced to address weak interactions that are undetectable by the secondary test from existing data ($P_2$ close to 0). This term still grants higher-than-null significance to weak interactions, and verifies that $E\to A \wedge (E\to B \vee A - B)$, also a superset of $E\to A\to B$. Its probability can be broken down as

\[P_{\text{relev}} \equiv P_1P_4\]

The original Findr paper proposed to combine the instrumental variable and relevance tests in a novel test whose probability can be broken down as

\[P_{\text{orig}} \equiv \frac{1}{2} P_1 \bigl( P_4 + P_2P_5 \bigr) = \frac{1}{2}\bigl( P_{\text{relev}} + P_{\text{IV}} \bigr)\]

In the extreme undetectable limit where $P_2 = 0$ but $P_4 \neq 0$, the novel test automatically reduces to one half of the relevance test, which assumes equal probability of either direction and assigns half of the relevance test probability to $A \to B$.

The composite design of the novel test aims not to miss any genuine regulation whilst distinguishing the full spectrum of possible interactions. When the signal level is too weak for tests 2 and 5, we expect $P_4$ to still provide distinguishing power better than random predictions. When the interaction is strong, $P_2 P_5$ is then able to pick up true targets regardless of the existence of hidden confounders.
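
The combinations above are simple arithmetic on the individual test probabilities. The following sketch uses hypothetical helper names (`p_iv`, `p_relev`, `p_orig`) and made-up probabilities, with $P_1 = 1$ as in the text, to show the strong-signal case and the undetectable limit $P_2 = 0$.

```julia
# Hypothetical helpers, NOT BioFindr functions; P1 = 1 is assumed throughout.
p_iv(P2, P5) = P2 * P5                                  # instrumental variable test
p_relev(P4) = P4                                        # relevance test
p_orig(P2, P4, P5) = (p_relev(P4) + p_iv(P2, P5)) / 2   # original Findr combination

# Strong interaction: both terms contribute.
strong = p_orig(0.95, 0.97, 0.90)

# Undetectable limit: P2 = 0, so only half of the relevance probability remains.
weak = p_orig(0.0, 0.6, 0.5)
```

In the second call the result equals half of $P_4$, matching the reduction to one half of the relevance test described above.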

Implementation

Causal inference can be performed by calling findr with three arguments, matrices or dataframes of gene expression and genotype values, and a mapping of matching $(E,A)$ pairs; the preferred test can be set through the combination parameter:

BioFindr.findr — Method

findr(X::Matrix{T},G::Matrix{S},pairGX::Matrix{S}; method="moments", combination="none") where {T<:AbstractFloat, S<:Integer}

Compute posterior probabilities for nonzero causal relations between columns of input matrix X. The probabilities are estimated for relations going from a subset of columns of X that have a (discrete) instrumental variable in input matrix G to all columns of X, while excluding self-interactions (given default value 1). The matching between columns of X and columns of G is given by pairGX, a two-column array where the first column corresponds to a column index in G and the second to a column index in X.

Posterior probabilities are computed for the following tests

Test 2 (Linkage test)
Test 3 (Mediation test)
Test 4 (Relevance test)
Test 5 (Pleiotropy test)

which can be combined into the mediation test ($P_2 P_3$; combination="mediation"), the instrumental variable or non-independence test ($P_2 P_5$; combination="IV"), or BioFindr's original combination ($\frac{1}{2}(P_2 P_5 + P_4)$; combination="orig"). By default, individual probability matrices for all tests are returned (combination="none").

The optional parameter method determines the LLR mixture distribution fitting method and can be either moments (default) for the method of moments, or kde for kernel-based density estimation.

If combination="none", then the output has size ncols(X) x 4 x ncols(G), where the middle index indexes the tests, and otherwise the output has size ncols(X) x ncols(G).

Note

G is currently assumed to be an array (vector or matrix) of integers. CategoricalArrays will be supported in the future.
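
When combination="none" is used, the combined tests can still be formed afterwards from the three-dimensional output. The sketch below does not call BioFindr: it fills an array of the documented shape with random numbers and assumes that the middle index follows the listed order of tests 2, 3, 4, 5, which should be checked against the package before relying on it.

```julia
# Stand-in for the output of findr(X, G, pairGX; combination="none"):
# random numbers with the documented shape ncols(X) x 4 x ncols(G).
nX, nG = 6, 2
P = rand(nX, 4, nG)

# ASSUMPTION: the middle index follows the listed order of tests 2, 3, 4, 5.
P2, P3, P4, P5 = (P[:, k, :] for k in 1:4)

P_med = P2 .* P3                   # mediation test
P_iv = P2 .* P5                    # instrumental variable test
P_orig = (P2 .* P5 .+ P4) ./ 2     # original Findr combination
```

Each combined matrix then has size ncols(X) x ncols(G), the same shape that the combined calls are documented to return.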

BioFindr.findr — Method

findr(dX::T, dG::T, dE::T; colX=2, colG=1, method="moments", combination="IV", FDR=1.0, sorted=true) where T<:AbstractDataFrame

Wrapper for findr(Matrix(dX), Matrix(dG), pairGX) when the inputs are in the form of a DataFrame. The output is then also wrapped in a DataFrame with Source, Target, (Posterior) Probability, and qvalue columns. When DataFrames are used, only combined posterior probabilities can be returned (combination="IV" (default), "mediation", or "orig").

The input dataframes are:

dX - DataFrame with expression data, columns are genes
dG - DataFrame with genotype data, columns are variants (SNPs)
dE - DataFrame with eQTL results, must contain columns with gene and SNP IDs that can be mapped to column names in dX and dG, respectively

The numeric mapping between column indices in Matrix(dG) and Matrix(dX) is obtained from these inputs using the getpairs function and the optional parameters:

colG - name or number of variant ID column in dE, default 1
colX - name or number of gene ID column in dE, default 2
namesX - names of a possible subset of columns in dX to be considered as potential causal regulators (default names(dX))

The optional parameter method determines the LLR mixture distribution fitting method and can be either moments (default) for the method of moments, or kde for kernel-based density estimation.

The optional parameter FDR can be used to return only a subset of interactions with a desired expected FDR value (q-value threshold) (default 1.0, no filtering).
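
To illustrate the kind of mapping that turns eQTL IDs into a two-column index array, the sketch below looks up variant and gene IDs in the column names of the genotype and expression data. The function `pairs_sketch` and all IDs are hypothetical. BioFindr's own getpairs function may behave differently, for example in how it treats IDs that cannot be matched.

```julia
# Hypothetical re-implementation of the idea, NOT BioFindr's getpairs.
function pairs_sketch(variant_ids, gene_ids, namesG, namesX)
    colG = indexin(variant_ids, namesG)   # column index in G, or nothing if absent
    colX = indexin(gene_ids, namesX)      # column index in X, or nothing if absent
    keep = .!isnothing.(colG) .& .!isnothing.(colX)
    # First column: index in G; second column: index in X.
    return hcat(Int.(colG[keep]), Int.(colX[keep]))
end

namesG = ["rs1", "rs2", "rs3"]                 # made-up variant column names
namesX = ["geneA", "geneB", "geneC", "geneD"]  # made-up gene column names
variant_ids = ["rs2", "rs3", "rs9"]            # "rs9" has no genotype column
gene_ids = ["geneA", "geneD", "geneB"]

pairGX = pairs_sketch(variant_ids, gene_ids, namesG, namesX)
```

The unmatched third pair is dropped in this sketch, leaving a 2 x 2 integer matrix in the layout expected for pairGX.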

Bipartite causal inference

In the general case, we assume that there is one set of genes, of which the set of $A$-genes (genes with matching instrument $E$) is a subset, and that all possible directed regulations are tested. In some situations we are instead searching for a bipartite network from one set of potential causal factors (e.g. micro-RNAs) to another set of potential targets (e.g. protein-coding genes). In this case, causal inference can be performed by calling findr with four arguments that include separate matrices or dataframes of expression values for the potential causes and targets:

BioFindr.findr — Method

findr(X1::Matrix{T},X2::Array{T},G::Array{S},pairGX::Matrix{R}; method="moments", combination="none")  where {T<:AbstractFloat, S<:Integer}

Compute posterior probabilities for nonzero causal relations from columns of input matrix X2 to columns of input matrix X1. The probabilities are estimated for a subset of columns of X2 that have a (discrete) instrumental variable in input matrix G. The matching between columns of X2 and columns of G is given by pairGX, a two-column array where the first column corresponds to a column index in G and the second to a column index in X2.

Posterior probabilities are computed for the following tests

Test 2 (Linkage test)
Test 3 (Mediation test)
Test 4 (Relevance test)
Test 5 (Pleiotropy test)

The optional parameter method determines the LLR mixture distribution fitting method and can be either moments (default) for the method of moments, or kde for kernel-based density estimation.

If combination="none", then the output has size ncols(X1) x 4 x ncols(X2), where the middle index indexes the tests, and otherwise the output has size ncols(X1) x ncols(X2).

Note

G is currently assumed to be an array (vector or matrix) of integers. CategoricalArrays will be supported in the future.
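
Because the bipartite call takes four arrays that must line up with each other, a quick consistency check before calling findr can catch indexing mistakes early. The function `check_bipartite_inputs` below is a hypothetical helper, not part of BioFindr, and it assumes that rows are samples and columns are genes or variants.

```julia
# Hypothetical pre-flight check, NOT a BioFindr function.
# ASSUMPTION: rows are samples; columns are genes (X1, X2) or variants (G).
function check_bipartite_inputs(X1, X2, G, pairGX)
    # All three data matrices need the same number of samples.
    size(X1, 1) == size(X2, 1) == size(G, 1) || return false
    # pairGX must have two columns: index into G, then index into X2.
    size(pairGX, 2) == 2 || return false
    return all(in(1:size(G, 2)), pairGX[:, 1]) &&
           all(in(1:size(X2, 2)), pairGX[:, 2])
end

X1 = randn(40, 10)       # potential targets
X2 = randn(40, 3)        # potential causal factors
G = rand(0:2, 40, 5)     # genotypes coded as integers
pairGX = [2 1; 5 3]      # variant 2 -> factor 1, variant 5 -> factor 3

ok = check_bipartite_inputs(X1, X2, G, pairGX)
bad = check_bipartite_inputs(X1, X2, G, [2 4])   # X2 has only 3 columns
```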

BioFindr.findr — Method

findr(dX1::T, dX2::T, dG::T, dE::T; colG=1, colX=2, method="moments", combination="IV", FDR=1.0, sorted=true) where T<:AbstractDataFrame

Wrapper for findr(Matrix(dX1), Matrix(dX2), Matrix(dG), pairGX2) when the inputs dX1, dX2, and dG are in the form of a DataFrame. The output is then also wrapped in a DataFrame with Source, Target, (Posterior) Probability, and qvalue columns. When DataFrames are used, only combined posterior probabilities can be returned (combination="IV" (default), "mediation", or "orig").

The numeric mapping between column indices in Matrix(dG) and Matrix(dX2) is obtained from these inputs using the getpairs function and the optional parameters:

colG - name or number of variant ID column in dE, default 1
colX - name or number of gene ID column in dE, default 2
namesX - names of a possible subset of columns in dX2 to be considered as potential causal regulators (default names(dX2))

The optional parameter method determines the LLR mixture distribution fitting method and can be either moments (default) for the method of moments, or kde for kernel-based density estimation.

The optional parameter FDR can be used to return only a subset of interactions with a desired expected FDR value (q-value threshold) (default 1.0, no filtering).

Summary

A summary of all possible calls to the findr function:

BioFindr.findr — Function

findr(X::Matrix{T}; cols=[], method="moments", combination="none") where T<:AbstractFloat

findr(dX::T; colnames=[], method="moments", FDR=1.0, sorted=true, combination="none") where T<:AbstractDataFrame

findr(X1::Matrix{T}, X2::Matrix{T}; method="moments") where T<:AbstractFloat

Compute posterior probabilities for nonzero pairwise correlations between columns of input matrices X1 and X2. The probabilities are directed (asymmetric) from the columns of X2 to the columns of X1 in the sense that they are estimated from a column-specific background distribution for each column of X2.

The optional parameter method determines the LLR mixture distribution fitting method and can be either moments (default) for the method of moments, or kde for kernel-based density estimation.

Only use this method if X1 and X2 are distinct (no overlapping columns). For X2 consisting of a subset of columns with indices idx, use findr(X1; cols=idx) instead.

findr(X::Matrix{T},G::Array{S}; method="moments") where {T<:AbstractFloat, S<:Integer}

findr(dX::T, dG::T; method="moments", FDR=1.0, sorted=true) where T<:AbstractDataFrame

findr(X::Matrix{T},G::Matrix{S},pairGX::Matrix{S}; method="moments", combination="none") where {T<:AbstractFloat, S<:Integer}

findr(dX::T, dG::T, dE::T; colX=2, colG=1, method="moments", combination="IV", FDR=1.0, sorted=true) where T<:AbstractDataFrame

findr(X1::Matrix{T},X2::Array{T},G::Array{S},pairGX::Matrix{R}; method="moments", combination="none")  where {T<:AbstractFloat, S<:Integer}

findr(dX1::T, dX2::T, dG::T, dE::T; colG=1, colX=2, method="moments", combination="IV", FDR=1.0, sorted=true) where T<:AbstractDataFrame

[Chen2007] Chen L, Emmert-Streib F, Storey J. Harnessing naturally randomized transcription to infer regulatory relationships among genes. Genome Biol 8, R219 (2007).