# ArDCA

• Autoregressive model learning for protein inference.

## Package Features

• Learn model from multiple sequence alignment
• Sample from the model

See the Index for the complete list of documented functions and types.

## Overview

Protein families are given in form of multiple sequence alignments (MSA) $D = (a^m_i |i = 1,\dots,L;\,m = 1,\dots,M)$ of $M$ proteins of aligned length $L$. The entries $a^m_i$ equal either one of the standard 20 amino acids, or the alignment gap $–$. In total, we have $q = 21$ possible different symbols in D. The aim of unsupervised generative modeling is to earn a statistical model $P(a_1,\dots,a_L)$ of (aligned) full-length sequences, which faithfully reflects the variability found in $D$: sequences belonging to the protein family of interest should have comparably high probabilities, unrelated sequences very small probabilities. Here we propose a computationally efficient approach based on autoregressive models.

We start from the exact decomposition:

$$$P(a_1,\dots,a_L) = P(a_1) \cdot P(a_2|a_1) \cdot \dots \cdot P(a_L|a_1,\dots,a_{L-1})$$$

Here, we use the following parametrization:

$$$P(a_i | a_1,\dots,a_{i-1}) = \frac{\exp \left\{ h_i(a_i) + \sum_{j=1}^{i-1} J_{i,j}(a_i,a_j)\right\} }{z_i(a_1,\dots,a_{i-1})}\,,$$$

where:

$$$z_i(a_1,\dots,a_{i-1})= \sum_{a=1}^{q} \exp \left\{ h_i(a) + \sum_{j=1}^{i-1} J_{i,j}(a,a_j)\right\} \,,$$$

is a the normalization factor. In machine learning, this parametrization is known as soft-max regression, the generalization of logistic regression to multi-class labels.

# Usage

The typical pipeline to use the package is:

• Compute ArDCA parameters from a multiple sequence alignment:
julia> arnet,arvar=ardca(filefasta; kwds...)
• Generate 100 new sequences, and store it in an $L\times 100$ array of integers.
julia> Zgen =  sample(arnet,100);

To fully exploit the the multicore parallel computation, julia should be invoked with

$julia -t nthreads # put here nthreads equal to the number of cores you want to use If you want to set permanently the number of threads to the desired value, you can either create a default environment variable export JULIA_NUM_THREADS=24 in your .bashrc. More information here # Tutorial We will assume we have a Multiple Sequence Alignment (MSA)in FASTA format. We aim at 1. Given a MSA, generate a sample 2. Given a MSA, predict contacts 3. Given a MSA, predict the mutational effect in all (ungapped) position of a given target sequence ## Load ArDCA package The following cell loads the package ArDCA (Warning: the first time it takes a while) • The mypkgdir variable should be set to your path/to/package dir. We will use the PF00014 protein family available in data/PF14/ folder of the package/ mypkgdir = normpath(joinpath(pwd(),"..")) datadir=joinpath(mypkgdir,"data") # put here your path using Pkg Pkg.activate(mypkgdir) using ArDCA ## Learn the autoregressive parameters As a preliminary step, we learn the field and the coupling parameters$h,Jfrom the MSA. To do so we use the ardca method that return the parameters (stored in arnet in the cell below), and the alignment in numerical format and other algorithms variables (stored in arvar in the cell below). The default autoregressive order is set to :ENTROPIC. We set theL_2$regularization to 0.02 for the$J$and 0.001 for the$h. The keyword arguments for the ardca method are (with their default value): • lambdaJ::Real=0.01 coupling L₂ regularization parameter (lagrange multiplier) • lambdaH::Real=0.01 field L₂ regularization parameter (lagrange multiplier) • pc_factor::Real=0 pseudocount factor for calculation of p0 • epsconv::Real=1.0e-5 (convergence parameter) • maxit::Int=1000 (maximum number of iteration - don't change) • verbose::Bool=true (set to false to suppress printing on screen) • method::Symbol=:LD_LBFGS (optimization method) • permorder::Union{Symbol,Vector{Ti}}=:ENTROPIC (permutation order). Possible values are: [:NATURAL, :ENTROPIC, :REV_ENTROPIC, :RANDOM] or a custom permutation vector. arnet,arvar=ardca("data/PF14/PF00014_mgap6.fasta.gz", verbose=false, lambdaJ=0.02,lambdaH=0.001); ## 1. Sequence Generation We now generate M sequences using the sample method: M = 1_000; generated_alignment = sample(arnet,M); The generated alignment has is a 𝐿×𝑀 matrix of integer where 𝐿 is the alignment's length, and 𝑀 the number of samples. Interestingly, we for each sequence we can also compute the likelihood with the samplewithweights method. loglikelihood,generated_alignment = sample_with_weights(arnet,M); ## 2. Contact Prediction We can compute the epistatic score for residue-residue contact prediction. To do so, we can use the epistatic_score method. The epistatic score is computed on any target sequence of the MSA. Empirically, it turns out the the final score does not depend much on the choice of the target sequence. The autput is contained in a Vector of Tuple ranked in descendic score order. Each Tuple containsi,j,s_{ij}$where$s_{ij}$is the epistatic score of the residue pair$i,j. The residue numbering is that of the MSA, and not of the unaligned full sequences. target_sequence = 1 score=epistatic_score(arnet,arvar,target_sequence) ## 3. Predicting mutational effects For any reference sequence, we can easily predict the mutational effect for all single mutants. Of course we can extract this information only for the non-gapped residues of the target sequence. This is done with the dms_single_site method, which returns a q×L matrix D containing\log(P(mut))/\log(P(wt))\$ for all single site mutants of the reference sequence seqid (i.e. the so-called wild type sequence), and idxgap a vector of indices of the residues of the reference sequence that contain gaps (i.e. the 21 amino-acid) for which the score has no sense and is set by convention to +Inf.

A negative value indicate a beneficial mutation, a value 0 indicate the wild-type amino-acid.

target_sequence = 1
D,idxgap=dms_single_site(arnet,arvar,target_sequence)

## Methods Reference

ArDCA.ardcaMethod
ardca(filename::String; kwds...)

Run ardca on the fasta alignment in filename

Return two struct: ::ArNet (containing the inferred hyperparameters) and ::ArVar

Optional arguments:

• max_gap_fraction::Real=0.9 maximum fraction of insert in the sequence
• remove_dups::Bool=true if true remove duplicated sequences
• theta=:auto if :auto compute reweighint automatically. Otherwise set a Float64 value 0 ≤ theta ≤ 1
• lambdaJ::Real=0.01 coupling L₂ regularization parameter (lagrange multiplier)
• lambdaH::Real=0.01 field L₂ regularization parameter (lagrange multiplier)
• pc_factor::Real=1/size(Z,2) pseudocount factor for calculation of p0, defaults to one over the number of sequences.
• epsconv::Real=1.0e-5 convergence value in minimzation
• maxit::Int=1000 maximum number of iteration in minimization
• verbose::Bool=true set to false to stop printing convergence info on stdout
• method::Symbol=:LD_LBFGS optimization strategy see NLopt.jl for other options
• permorder::Union{Symbol,Vector{Ti}}=:ENTROPIC permutation order. Possible values are :NATURAL,:ENTROPIC,:REV_ENTROPIC,:RANDOM or a custom permutation vector

Examples

julia> arnet, arvar =  ardca("pf14.fasta", permorder=:ENTROPIC)
ArDCA.ardcaMethod
ardca(Z::Array{Ti,2},W::Vector{Float64}; kwds...)

Auto-regressive analysis on the L×M alignment Z (numerically encoded in 1,…,21), and the M-dimensional normalized weight vector W.

Return two struct: ::ArNet (containing the inferred hyperparameters) and ::ArVar

Optional arguments:

• lambdaJ::Real=0.01 coupling L₂ regularization parameter (lagrange multiplier)
• lambdaH::Real=0.01 field L₂ regularization parameter (lagrange multiplier)
• pc_factor::Real=0 pseudocount factor for calculation of p0, defaults to one over the number of sequences.
• epsconv::Real=1.0e-5 convergence value in minimzation
• maxit::Int=1000 maximum number of iteration in minimization
• verbose::Bool=true set to false to stop printing convergence info on stdout
• method::Symbol=:LD_LBFGS optimization strategy see NLopt.jl for other options
• permorder::Union{Symbol,Vector{Ti}}=:ENTROPIC permutation order. Possible values are :NATURAL,:ENTROPIC,:REV_ENTROPIC,:RANDOM or a custom permutation vector

Examples

julia> arnet, arvar= ardca(Z,W,lambdaJ=0,lambdaH=0,permorder=:REV_ENTROPIC,epsconv=1e-12);
ArDCA.dms_single_siteMethod
dms_single_site(arnet::ArNet, arvar::ArVar, seqid::Int; pc::Float64=0.1)

Return a q×L matrix of containing -log(P(mut))/log(P(seq)) for all single site mutants of the reference sequence seqid, and a vector of the indices of the residues of the reference sequence that contain gaps (i.e. the 21 amino-acid) for which the score has no sense and is set by convention to +Inf. A negative value indicate a beneficial mutation, a value 0 indicate the wild-type amino-acid.

ArDCA.loglikelihoodMethod
loglikelihood(x0::AbstractVector{T}, arnet::ArNet) where {T<:Integer}

Return the loglikelihood of sequence x0 encoded in integer values in 1:q under the model arnet.

ArDCA.loglikelihoodMethod
loglikelihood(arnet::ArNet, arvar::ArVar)

Return the vector of loglikelihoods computed from arvar.Z under the model arnet. size(arvar.Z) == N,M where N is the sequences length, and M the number of sequences. The returned vector has M elements reweighted by arvar.W

ArDCA.loglikelihoodMethod
loglikelihood(x0::String, arnet::ArNet) where {T<:Integer})

Return the loglikelihood of the String x0 under the model arnet.

ArDCA.loglikelihoodMethod
loglikelihood(x0::Matrix{T}, arnet::ArNet) where {T<:Integer})

Return the vector of loglikelihoods computed from Matrix x0 under the model arnet. size(x0) == N,M where N is the sequences length, and M the number of sequences. The returned vector has M elements.

ArDCA.sampleMethod
sample(arnet::ArNet, msamples::Int)

Return a generated alignment in the form of a N × msamples matrix of type ::Matrix{Int}

Examples

julia> arnet,arvar=ardca("file.fasta",verbose=true,permorder=:ENTROPIC, lambdaJ=0.001,lambdaH=0.001);
julia> Zgen=Zgen=sample(arnet,1000);
ArDCA.sample_subsequenceMethod
sample_subsequence(x::String, arnet::ArNet, msamples)

Return a generated alignment in the form of a N × msamples matrix of type ::Matrix{Int} and the relative probabilities under the model. The alignment is forced to start with with a sequence x (in amino acid single letter alphabet) and then autoregressively generated.

Example

julia> arnet,arvar=ardca("file.fasta",verbose=true,permorder=:ENTROPIC, lambdaJ=0.001,lambdaH=0.001);
julia> Wgen,Zgen=sample_subsequence("MAKG",arnet,1000);
ArDCA.sample_subsequenceMethod
sample_subsequence(x::Vector{T}, arnet::ArNet, msamples)

Return a generated alignment in the form of a N × msamples matrix of type ::Matrix{Int} and the relative probabilities under the model. The alignment is forced to start with with a sequence x (in integer number coding) and then autoregressively generated.

Example

julia> arnet,arvar=ardca("file.fasta",verbose=true,permorder=:ENTROPIC, lambdaJ=0.001,lambdaH=0.001);
julia> Wgen,Zgen=sample_subsequence([11,1,9,6],arnet,1000);
ArDCA.sample_with_weightsMethod
sample_with_weights(arnet::ArNet, msamples::Int)

Return a generated alignment in the form of a N × msamples matrix of type ::Matrix{Int} and the relative probabilities under the module

Examples

julia> arnet,arvar=ardca("file.fasta",verbose=true,permorder=:ENTROPIC, lambdaJ=0.001,lambdaH=0.001);
julia> Wgen,Zgen=sample_with_weights(arnet,1000);
ArDCA.siteloglikelihoodMethod
siteloglikelihood(i::Int,arnet::ArNet, arvar::ArVar)

Return the loglikelihood relative to site i computed from arvar.Z under the model arnet`.