Utility functions that are used when findr is called with DataFrame inputs. Some of them may also be useful when manually post-processing the output of findr calls with matrix-based inputs.
— Function getpairs(dX::T, dG::T, dE::T; colG=1, colX=2)
Get pairs of indices of matching columns from dataframes dX and dG, with the column names that should be matched listed in dataframe dE. The optional parameters colG (default value 1) and colX (default value 2) indicate which columns of dE are used for matching, given either as a column number (integer) or a column name (string). The optional parameter namesX can be used to match rows in dE to only a subset of the column names of dX.
— Function symprobs(P; combination="prod")
Symmetrize a square matrix of posterior probabilities P. The optional parameter combination defines the symmetrization method:
none: do nothing (default)
prod: $P'_{ij}=P_{ij}P_{ji}$
mean: $P'_{ij}=\frac{1}{2}(P_{ij} + P_{ji})$
anti: $P'_{ij}=\frac{1}{2}(P_{ij} + 1 - P_{ji})$
Note that the anti option defines "antisymmetric" probabilities, $P'_{ij} + P'_{ji} = 1$, where evidence for a causal interaction $i\to j$ is also considered evidence against the opposite interaction $j\to i$.
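The three formulas above can be checked numerically with a minimal sketch. The function name symmetrize_sketch is hypothetical and this is an independent re-implementation of the documented formulas, not the package's own code; it covers only the prod, mean, and anti cases.

```julia
# Hypothetical helper, written only to illustrate the documented formulas.
function symmetrize_sketch(P::AbstractMatrix{<:Real}; combination::AbstractString)
    size(P, 1) == size(P, 2) || throw(ArgumentError("P must be a square matrix"))
    Pt = permutedims(P)                 # Pt[i, j] == P[j, i]
    if combination == "prod"
        return P .* Pt                  # P'_ij = P_ij * P_ji
    elseif combination == "mean"
        return (P .+ Pt) ./ 2           # P'_ij = (P_ij + P_ji) / 2
    elseif combination == "anti"
        return (P .+ 1 .- Pt) ./ 2      # P'_ij = (P_ij + 1 - P_ji) / 2
    else
        throw(ArgumentError("unknown combination: $combination"))
    end
end

P = [0.0 0.9; 0.2 0.0]
A = symmetrize_sketch(P; combination="anti")
M = symmetrize_sketch(P; combination="prod")
```

With the anti option, each off-diagonal pair of the result sums to one, which is the antisymmetry property described above.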
— Function combineprobs(P; combination="none")
Combine posterior probabilities P for multiple likelihood ratio tests into a single probability (local precision) value.
The optional parameter combination defines the combination test:
none: do nothing, return the input P
mediation: the mediation test ($P_2 P_3$)
IV: the instrumental variable or non-independence test ($P_2 P_5$)
orig: BioFindr's original combination ($\frac{1}{2}(P_2 P_5 + P_4)$)
The input must be a three-dimensional array whose second dimension has size 4 and indexes the individual BioFindr tests (tests 2-5). The output is a matrix of size size(P,1) x size(P,3).
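As an illustration of the three combination formulas, here is a minimal sketch. It assumes that position k along the second dimension holds test k+1 (so positions 1 to 4 hold tests 2 to 5); that ordering, the function name combine_sketch, and the field names of the returned tuple are assumptions made for this example, not part of the package.

```julia
# Hypothetical helper illustrating the documented formulas; not BioFindr's own code.
# Assumption: P[:, k, :] holds the posterior probabilities of test k + 1.
function combine_sketch(P::AbstractArray{<:Real,3})
    size(P, 2) == 4 || throw(ArgumentError("second dimension must have size 4"))
    P2 = P[:, 1, :]
    P3 = P[:, 2, :]
    P4 = P[:, 3, :]
    P5 = P[:, 4, :]
    return (
        mediation = P2 .* P3,               # P_2 P_3
        IV        = P2 .* P5,               # P_2 P_5
        orig      = (P2 .* P5 .+ P4) ./ 2,  # (P_2 P_5 + P_4) / 2
    )
end

P = fill(0.5, 2, 4, 3)
P[1, 1, 1] = 0.8    # test 2
P[1, 2, 1] = 0.25   # test 3
P[1, 3, 1] = 0.6    # test 4
P[1, 4, 1] = 0.5    # test 5
C = combine_sketch(P)
```

Each field of the result is a matrix of size size(P,1) x size(P,3), matching the output shape stated above.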
— Function stackprobs(P,colnames,rownames;nodiag=true)
Convert a matrix of pairwise posterior probabilities P, with column and row names colnames and rownames, respectively, to a stacked dataframe with Source, Target, and Probability columns. These correspond, respectively, to a column name, a row name, and the value of P for that row and column pair.
The optional parameter nodiag determines whether self-interactions (equal row and column name) are excluded (nodiag=true, default) or not (nodiag=false).
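The reshaping described above can be sketched without any dependencies. The example below returns a vector of named tuples rather than a DataFrame, and the function name stack_sketch is hypothetical; it is meant only to show how column names map to Source, row names map to Target, and how nodiag drops self-interactions.

```julia
# Hypothetical helper; uses named tuples instead of a DataFrame to stay dependency-free.
function stack_sketch(P::AbstractMatrix{<:Real}, colnames, rownames; nodiag::Bool=true)
    size(P) == (length(rownames), length(colnames)) ||
        throw(ArgumentError("name lengths must match the size of P"))
    rows = NamedTuple{(:Source, :Target, :Probability),Tuple{String,String,Float64}}[]
    for (j, src) in enumerate(colnames), (i, tgt) in enumerate(rownames)
        # Skip pairs whose row and column name coincide when nodiag is true.
        nodiag && src == tgt && continue
        push!(rows, (Source=String(src), Target=String(tgt), Probability=Float64(P[i, j])))
    end
    return rows
end

P = [0.1 0.2; 0.3 0.4]
names = ["A", "B"]
stacked = stack_sketch(P, names, names)
full = stack_sketch(P, names, names; nodiag=false)
```

Here stacked contains only the two off-diagonal pairs, while full also keeps the entries for A with A and B with B.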
— Function globalfdr!(dP::T; FDR=1.0, sorted=true) where T<:AbstractDataFrame
For a DataFrame dP of posterior probabilities (local precision values), compute their corresponding q-values and keep only the rows with a q-value less than a desired global false discovery rate FDR (default value 1, no selection). dP is assumed to be the output of a findr run with columns Source, Target, and Probability. The output DataFrame mirrors the structure of dP, keeping only the selected rows, and has an additional column qvalue. The output is sorted by qvalue if the optional argument sorted is true (default). If dP already contains a column qvalue, only the filtering and optional sorting are performed.
— Function globalfdr(P::Array{T},FDR) where T<:AbstractFloat
For an array (matrix or vector) P of posterior probabilities (local precision values), compute their corresponding q-values Q, and return the indices of P with q-value less than a desired global false discovery rate FDR.
See also qvalue.
— Function qvalue(P::Vector{T}) where T<:AbstractFloat
Convert a vector P of posterior probabilities (local precisions) to a vector of q-values. For a threshold value c on the posterior probabilities P, the global FDR, $FDR(c)$, is defined as one minus the average local precision of the selected pairs:
$FDR(c) = 1 - \frac{1}{N_c} \sum_{i\colon P_i\geq c} P_i,$
where $N_c=\sharp\{i\colon P_i\geq c\}$ is the number of selected pairs. The q-value of a given index in P is then defined as the smallest FDR at which this pair is still called significant.
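The definition can be made concrete with a minimal sketch, assuming that a pair is selected when its posterior probability is at least the threshold c. The name qvalue_sketch is hypothetical and the code is an independent illustration of the definition, not the package's implementation.

```julia
# Hypothetical helper illustrating the q-value definition; not BioFindr's own code.
# Assumption: pairs are selected when P_i >= c.
function qvalue_sketch(P::AbstractVector{<:Real})
    order = sortperm(P; rev=true)      # most confident pairs first
    sorted = P[order]
    # FDR when the top k pairs are selected: one minus their mean local precision.
    fdr = 1 .- cumsum(sorted) ./ (1:length(sorted))
    # Tied probabilities share a threshold, so they share the FDR of the last tie.
    for k in length(sorted)-1:-1:1
        if sorted[k] == sorted[k+1]
            fdr[k] = fdr[k+1]
        end
    end
    q = similar(fdr)
    q[order] = fdr                     # map back to the original ordering
    return q
end

P = [0.9, 0.5, 0.7]
q = qvalue_sketch(P)
selected = findall(q .< 0.25)          # indices kept at a global FDR of 0.25
```

Because the running mean of probabilities sorted in decreasing order never increases, the FDR never decreases as more pairs are selected, so the FDR at a pair's own threshold is already the smallest FDR at which it is called significant.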
Generating simulated data
BioFindr includes a function generate_test_data for generating simple simulated data for testing the package:
— Function generate_test_data(nA, nB, fB, ns, ng, maf, bGA, bAB, supernormalize)
Generate test data for BioFindr with nA causal variables, nB potential target variables of which a random fraction fB are true targets for each causal variable, ns samples, ng genotype (instrumental variable) groups with minor allele frequency maf, and effect sizes bGA and bAB. Variables are sampled from a linear model with independent Gaussian noise with variance ϵ and correlated Gaussian noise with variance δ and covariance δρ. If supernormalize is true, the data is supernormalized.