DCAUtils.jl documentation

This package contains some utilities for Direct Coupling Analysis, an unsupervised technique to analyse Multiple Sequence Alignments of protein families.

The code is written in Julia. It requires Julia 1.5 or later.

Installation and Usage

To install the package, use Julia's package manager: from the Julia REPL, type ] to enter the Pkg REPL mode and run:

(v1.5) pkg> add DCAUtils

Then load it with:

julia> using DCAUtils


The functions in this package are written to maximize performance. Most computationally-heavy functions can use multiple threads (start julia with the -t option or set the JULIA_NUM_THREADS environment variable). In most cases such functions need to perform operations over every pair of sequences or every pair of resiudes; since such operations are symmetric, only the upper-triangular part is needed, and a custom parallelization scheme is used for this purpose. During these parallel parts of the code, BLAS parallelization is disabled, and it is restored at the end.

On top of parallelization, big efficiency gains can be had in certain operations by working with a compressed representation of the data. The general idea is that the multiple sequence alignment is first parsed and converted into a Matrix{Int8} representation (see read_fasta_alignment). Then, internally, functions such as compute_weights and compute_dists further explot the assumption that no value larger than 31 will appear, thus requiring 5 bits at the most: thus, they pack each sequence of $N$Int8 values into $⌈N / 12⌉$UInt64 values. Efficient bit-wise operations are then applied to compute the Hamming distance between sequences, processing 12 entries at a time.


read_fasta_alignment(filename::AbstractString, max_gap_fraction::Real) -> Matrix{Int8}

Parses a FASTA file containing a multiple sequence alignment, and returns a matrix of integers that represents one sequence per column.

The mapping between the aminoacid symbols and the integers uses this table:

  A  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  Y
  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

Any unrecognized capital letter and the gap symbol - are mapped to the value 21; any other symbol or lowercase letter is ignored.

If a sequence contains a fraction of gaps that exceeds max_gap_fraction, it is discarded. Set this value to 1 to keep all the sequences.

The input file can be plaintext (ASCII) or gzip-compressed plaintext (with the extension ".gz")

See also remove_duplicate_sequences.

remove_duplicate_sequences(Z::Matrix{Int8}; verbose::Bool = true) -> (Matrix{Int8}, Vector{Int})

Takes a matrix representing a mutiple sequence alignment (see read_fasta_alignment) and returns a new matrix with all duplicated sequences removed. It also returns a vector of column indices with the positions of the unique sequences in the input matrix.

compute_theta(Z::Matrix{Int8}) -> Float64

This function computes the threshold as used by the :auto setting of compute_weights (but it is more efficient computationally to use the :auto option than to invoke this function and passing the result to compute_weights).

It computes the mean value $ϕ$ of the similarity fraction between all possible pairs of sequences in Z (the similarity fraction is the number of equal entries divided by the length of the sequences).

The result is then computed as $θ = \min(0.5, 0.1216 / ϕ)$.

This function can use multiple threads if available.

compute_weights(Z::Matrix{Int8}, [q::Integer,] θ; verbose::Bool = true) -> (Vector{Float64}, Float64)

This function computes the reweighting vector. It retuns the vector and its sum Meff (the latter represents the number of "effective" sequences).

Z is an $N × M$ multiple sequence aLignment (see read_fasta_alignment). $N$ is the length of each sequence and $M$ the number of sequences.

q is the maximum value in the alphabet, if omitted it's computed from maximum(Z).

θ is the distance threshold: for any sequence, the number $n$ of sequences (including itself) that are at normalized distance smaller than $⌊θN⌋$ is counted, and the weight of that sequence is then $1/n$.

θ can be a real value between 0 and 1, or the symbol :auto, in which case the compute_theta function is used.

This function can use multiple threads if available.

compute_dists(Z::Matrix{Int8}) -> Matrix{Float64}

This function computes the matrix of normalized Hamming distances between sequences of the multiple sequence alignment Z (see read_fasta_alignment).

This function can use multiple threads if available.

compute_weighted_frequencies(Z::Matrix{Int8}, W::Vector{Float64} [, q]) -> (Vector{Float64}, Matrix{Float64})

Given a multiple sequence alignment matrix Z (see read_fasta_alignment), and a vector of weights (see compute_weights), returns the empirical one- and two-point frequencies $P_i$ and $P_{ij}$.

q is the size of the alphabet, if not given it's computed as maximum(Z).

If Z has size $N × M$ (i.e. $M$ sequences of length $N$), the resulting vector $P_i$ has length $N (q-1)$ and contains $N$ blocks (one for each residue position), each block containing the frequencies of the amino-acids, weighted according to W. The frequency of the last symbol, which usually represents the gap, is omitted and can be recovered by normalization. The resulting matrix $P_{ij}$ has size $N (q-1) × N (q-1)$ and it also has a block structure, with $N × N$ blocks, one for each pair of residues (the last row and column of each block are omitted and can be recovered by normalization).

compute_weighted_frequencies(Z::Matrix{Int8}, [q,] θ) -> (Vector{Float64}, Matrix{Float64}, Float64, Vector{Float64})

This form of the function just calls compute_weights with the given values of θ and q and then uses the result to call the version desrcibed above.

Besides returning the one- and two-point frequencies, it also returns the result of compute_weights: the Meff and the reweighting vector.

add_pseudocount(Pi::Vector{Float64}, Pij::Matrix{Float64}, pc::Float64, q::Integer = 21) -> (Vector{Float64}, Matrix{Float64})

This function takes one- and two-points frequencies (see compute_weighted_frequencies) and returns the corresponding frequencies with a pseudocount pc added.

The resulting frequencies are the same that would be obtained by mixing the original data with weight (1-pc) with a uniform distribution with weight pc. So pc must be between 0 (returns a copy of the original data) and 1 (returns the frequencies for the uniform distribution).

The integer q is used to specify the block size of the matrices (each block has size $(q-1)×(q-1)$).

compute_DI_gauss(J::Matrix{Float64}, C::Matrix{Float64}, q::Integer = 21) -> Matrix{Float64}

Compute the Direct Information matrix, assuming a Gaussian model.

C is the covariance matrix $C_{ij} = P_{ij} - P_i P_j$, and J its inverse.

The integer q is used to specify the block size of the matrices (each block has size $(q-1)×(q-1)$). The result has one entry per block.

This function can use multiple threads if available.

compute_FN(J::Matrix{Float64}, q::Integer = 21) -> Matrix{Float64}

Compute the matrix of Frobenius norms.

J is the inverse of the covariance matrix $C_{ij} = P_{ij} - P_i P_j$.

The integer q is used to specify the block size of the matrices (each block has size $(q-1)×(q-1)$). The result has one entry per block.

This function can use multiple threads if available.