Utilities

Utility functions which are used when findr is called with DataFrame inputs, some of which may be useful when manually post-processing the output of findr calls with matrix-based inputs.

BioFindr.getpairs - Function
getpairs(dX::T, dG::T, dE::T; colG=1, colX=2)

Get pairs of indices of matching columns from dataframes dX and dG, with column names that should be matched listed in dataframe dE. The optional parameters colG (default value 1) and colX (default value 2) indicate which columns of dE need to be used for matching, either as a column number (integer) or column name (string). The optional parameter namesX can be used to match rows in dE to only a subset of the column names of dX.
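The matching step can be illustrated without DataFrames. The following is a minimal sketch, not BioFindr's implementation: getpairs_sketch, xnames, gnames, and matches are hypothetical names, and both the (G index, X index) return order and the silent dropping of unmatched rows are assumptions made for illustration.

```julia
# Hypothetical helper, not part of BioFindr: map name pairs to column indices.
# xnames, gnames: column names of the two data tables.
# matches: (G name, X name) tuples, playing the role of the rows of dE.
function getpairs_sketch(xnames, gnames, matches)
    xindex = Dict(name => i for (i, name) in enumerate(xnames))
    gindex = Dict(name => i for (i, name) in enumerate(gnames))
    # Assumption: rows naming a column that is absent from either table are skipped.
    return [(gindex[g], xindex[x]) for (g, x) in matches
            if haskey(gindex, g) && haskey(xindex, x)]
end

xnames = ["geneA", "geneB", "geneC"]
gnames = ["snp1", "snp2"]
matches = [("snp2", "geneA"), ("snp1", "geneC"), ("snp9", "geneB")]
pairs_found = getpairs_sketch(xnames, gnames, matches)
```

Here the third row is dropped because snp9 is not among gnames.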

BioFindr.symprobs - Function
symprobs(P; combination="prod")

Symmetrize a square matrix of posterior probabilities P. The optional parameter combination defines the symmetrization method:

  • none: do nothing (default)
  • prod: $P'_{ij}=P_{ij}P_{ji}$
  • mean: $P'_{ij}=\frac{1}{2}(P_{ij} + P_{ji})$
  • anti: $P'_{ij}=\frac{1}{2}(P_{ij} + 1 - P_{ji})$

Note that the anti option defines "antisymmetric" probabilities, $P'_{ij} + P'_{ji} = 1$, where evidence for a causal interaction $i\to j$ is also considered evidence against the opposite interaction $j\to i$.
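The four rules can be written out directly on a plain matrix. This is an illustrative sketch of the formulas only, not the package source; symmetrize_sketch is a hypothetical name, and its default of "none" is chosen here for illustration.

```julia
# Hypothetical re-implementation of the symmetrization formulas, for illustration.
function symmetrize_sketch(P::AbstractMatrix; combination="none")
    size(P, 1) == size(P, 2) || throw(ArgumentError("P must be square"))
    Pt = permutedims(P)                 # Pt[i, j] == P[j, i]
    if combination == "none"
        return P
    elseif combination == "prod"
        return P .* Pt
    elseif combination == "mean"
        return (P .+ Pt) ./ 2
    elseif combination == "anti"
        return (P .+ 1 .- Pt) ./ 2
    else
        throw(ArgumentError("unknown combination: $combination"))
    end
end

P = [1.0 0.8; 0.2 1.0]
S = symmetrize_sketch(P; combination="prod")   # S[1, 2] and S[2, 1] both equal P[1, 2] * P[2, 1]
A = symmetrize_sketch(P; combination="anti")   # A[1, 2] + A[2, 1] sums to one, up to rounding
```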

BioFindr.combineprobs - Function
combineprobs(P; combination="none")

Combine posterior probabilities P for multiple likelihood ratio tests into a single probability (local precision) value.

The optional parameter combination defines the combination test:

  • none: do nothing, return the input P (default)
  • mediation: the mediation test ($P_2 P_3$)
  • IV: the instrumental variable or non-independence test ($P_2 P_5$)
  • orig: BioFindr's original combination ($\frac{1}{2}(P_2 P_5 + P_4)$)

The input must be a three-dimensional array whose second dimension has size 4 and indexes the individual BioFindr tests (tests 2 to 5). The output is a matrix of size size(P,1) x size(P,3).
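As a rough illustration of the array layout and the three combinations, the sketch below assumes that slice k of the second dimension holds test k + 1, as the description suggests. combine_sketch is a hypothetical name and this is not the package source.

```julia
# Hypothetical re-implementation of the combination formulas, for illustration.
function combine_sketch(P::AbstractArray{<:Real,3}; combination="none")
    size(P, 2) == 4 || throw(ArgumentError("second dimension must have size 4"))
    # Assumed layout: index 1 => test 2, index 2 => test 3, index 3 => test 4, index 4 => test 5.
    P2, P3, P4, P5 = P[:, 1, :], P[:, 2, :], P[:, 3, :], P[:, 4, :]
    if combination == "none"
        return P
    elseif combination == "mediation"
        return P2 .* P3
    elseif combination == "IV"
        return P2 .* P5
    elseif combination == "orig"
        return (P2 .* P5 .+ P4) ./ 2
    else
        throw(ArgumentError("unknown combination: $combination"))
    end
end

P = fill(0.5, 2, 4, 3)                        # toy input: every test probability is 0.5
C = combine_sketch(P; combination="orig")     # 2 x 3 matrix, each entry (0.5 * 0.5 + 0.5) / 2
```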

BioFindr.stackprobs - Function
stackprobs(P,colnames,rownames;nodiag=true)

Convert a matrix of pairwise posterior probabilities P with column and row names colnames and rownames, respectively, to a stacked dataframe with Source, Target, and Probability columns, corresponding respectively to a column name, a row name, and the value of P in the corresponding row and column pair.

The optional parameter nodiag determines if self-interactions (equal row and column name) are excluded (nodiag=true, default) or not (nodiag=false).
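The stacking can be sketched in base Julia as a vector of named tuples standing in for the dataframe. stack_sketch is a hypothetical name and the column-by-column row order of the result is an assumption; this is not the package source.

```julia
# Hypothetical illustration: one (Source, Target, Probability) record per matrix entry.
function stack_sketch(P::AbstractMatrix, colnames, rownames; nodiag=true)
    size(P) == (length(rownames), length(colnames)) ||
        throw(ArgumentError("names do not match the size of P"))
    # Source is the column name, Target the row name, as in the description above.
    return [(Source=colnames[j], Target=rownames[i], Probability=P[i, j])
            for j in axes(P, 2) for i in axes(P, 1)
            if !(nodiag && colnames[j] == rownames[i])]
end

P = [1.0 0.9; 0.1 1.0]
names = ["A", "B"]
stacked = stack_sketch(P, names, names)                  # self-interactions dropped
full = stack_sketch(P, names, names; nodiag=false)       # all four entries kept
```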

BioFindr.globalfdr! - Function
globalfdr!(dP::T; FDR=1.0, sorted=true) where T<:AbstractDataFrame

For a DataFrame dP of posterior probabilities (local precision values), compute their corresponding q-values and keep only the rows with q-value less than a desired global false discovery rate FDR (default value 1, no selection). dP is assumed to be the output of a findr run with columns Source, Target, and Probability. The output DataFrame mirrors the structure of dP, keeping only the selected rows, and with an additional column qvalue. The output is sorted by qvalue if the optional argument sorted is true (default). If dP already contains a column qvalue, only the filtering and optional sorting are performed.

BioFindr.globalfdr - Function
globalfdr(P::Array{T},FDR) where T<:AbstractFloat

For an array (matrix or vector) P of posterior probabilities (local precision values), compute their corresponding q-values Q, and return the indices of P with q-value less than a desired global false discovery rate FDR.

See also qvalue
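A minimal sketch of this selection on a plain array follows. It assumes that q-values are one minus the running mean of the probabilities sorted in decreasing order, and that ties are broken by sort order. globalfdr_sketch is a hypothetical name, not the package source.

```julia
# Hypothetical illustration: indices of entries whose q-value is below the FDR threshold.
function globalfdr_sketch(P::AbstractArray{<:Real}, FDR)
    p = vec(P)                                   # flatten (column-major for matrices)
    order = sortperm(p; rev=true)                # most confident entries first
    q = similar(p, Float64)
    q[order] = 1 .- cumsum(p[order]) ./ (1:length(p))   # 1 - running mean of local precision
    return findall(reshape(q, size(P)) .< FDR)
end

P = [0.99 0.5; 0.9 0.2]
selected = globalfdr_sketch(P, 0.1)   # Cartesian indices of the entries 0.99 and 0.9
```

For a vector input the same function returns plain integer indices.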

BioFindr.qvalue - Function
qvalue(P::Vector{T}) where T<:AbstractFloat

Convert a vector P of posterior probabilities (local precisions) to a vector of q-values. For a threshold value c on the posterior probabilities P, the global FDR, $FDR(c)$, is defined as one minus the average local precision of the pairs selected at that threshold:

$FDR(c) = 1 - \frac{1}{N_c} \sum_{i\colon P_i\geq c} P_i,$

where $N_c=\sharp\{i\colon P_i\geq c\}$ is the number of selected pairs. The q-value of a given index in P is then defined as the smallest FDR at which this pair is still called significant.
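The definition can be sketched in a few lines, assuming a pair is selected when its posterior probability is at least the threshold and that ties are broken by sort order. qvalue_sketch is a hypothetical name, not the package source.

```julia
# Hypothetical illustration: q-values as one minus the running mean of the
# posterior probabilities, taken in decreasing order.
function qvalue_sketch(P::AbstractVector{<:Real})
    order = sortperm(P; rev=true)                        # most confident pairs first
    fdr = 1 .- cumsum(P[order]) ./ (1:length(P))         # FDR when the top k pairs are selected
    q = similar(fdr)
    q[order] = fdr                                       # restore the original ordering
    return q
end

P = [0.9, 0.5, 0.99]
q = qvalue_sketch(P)   # smallest q-value goes to the largest probability (third entry)
```

Because the running mean of a decreasing sequence never increases, the FDR values computed this way never decrease with rank, so each one is already the smallest FDR at which its pair is selected.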

Generating simulated data

BioFindr includes a function generate_test_data for generating simple simulated data for testing the package:

BioFindr.generate_test_data - Function
generate_test_data(nA, nB, fB, ns, ng, maf, bGA, bAB, supernormalize)

Generate test data for BioFindr with nA causal variables, nB potential target variables of which a random fraction fB are true targets for each causal variable, ns samples, ng genotype (instrumental variable) groups with minor allele frequency maf, and effect sizes bGA and bAB. Variables are sampled from a linear model with independent Gaussian noise with variance ϵ and correlated Gaussian noise with variance δ and covariance δρ. If supernormalize is true, the data is supernormalized.