# ArDCA

- Autoregressive model learning for protein inference.

## Package Features

- Learn model from multiple sequence alignment
- Sample from the model

See the Index for the complete list of documented functions and types.

## Overview

Protein families are given in form of multiple sequence alignments (MSA) $D = (a^m_i |i = 1,\dots,L;\,m = 1,\dots,M)$ of $M$ proteins of aligned length $L$. The entries $a^m_i$ equal either one of the standard 20 amino acids, or the alignment gap $–$. In total, we have $q = 21$ possible different symbols in D. The aim of unsupervised generative modeling is to earn a statistical model $P(a_1,\dots,a_L)$ of (aligned) full-length sequences, which faithfully reflects the variability found in $D$: sequences belonging to the protein family of interest should have comparably high probabilities, unrelated sequences very small probabilities. Here we propose a computationally efficient approach based on autoregressive models.

We start from the exact decomposition:

\[P(a_1,\dots,a_L) = P(a_1) \cdot P(a_2|a_1) \cdot \dots \cdot P(a_L|a_1,\dots,a_{L-1})\]

Here, we use the following parametrization:

\[P(a_i | a_1,\dots,a_{i-1}) = \frac{\exp \left\{ h_i(a_i) + \sum_{j=1}^{i-1} J_{i,j}(a_i,a_j)\right\} }{z_i(a_1,\dots,a_{i-1})}\,,\]

where:

\[z_i(a_1,\dots,a_{i-1})= \sum_{a=1}^{q} \exp \left\{ h_i(a) + \sum_{j=1}^{i-1} J_{i,j}(a,a_j)\right\} \,,\]

is a the normalization factor. In machine learning, this parametrization is known as soft-max regression, the generalization of logistic regression to multi-class labels.

# Usage

The typical pipeline to use the package is:

- Compute ArDCA parameters from a multiple sequence alignment:

`julia> arnet,arvar=ardca(filefasta; kwds...)`

- Generate
`100`

new sequences, and store it in an $L\times 100$ array of integers.

`julia> Zgen = sample(arnet,100);`

# Multithreading

To fully exploit the the multicore parallel computation, julia should be invoked with

`$ julia -t nthreads # put here nthreads equal to the number of cores you want to use`

If you want to set permanently the number of threads to the desired value, you can either create a default environment variable `export JULIA_NUM_THREADS=24`

in your `.bashrc`

. More information here

# Tutorial

We will assume we have a Multiple Sequence Alignment (MSA)in FASTA format. We aim at

- Given a MSA, generate a sample
- Given a MSA, predict contacts
- Given a MSA, predict the mutational effect in all (ungapped) position of a given target sequence

## Load ArDCA package

The following cell loads the package `ArDCA`

(*Warning*: the first time it takes a while)

- The
`mypkgdir`

variable should be set to your`path/to/package`

dir.

We will use the PF00014 protein family available in `data/PF14/`

folder of the package/

```
mypkgdir = normpath(joinpath(pwd(),".."))
datadir=joinpath(mypkgdir,"data") # put here your path
using Pkg
Pkg.activate(mypkgdir)
using ArDCA
```

## Learn the autoregressive parameters

As a preliminary step, we learn the field and the coupling parameters $h,J$ from the MSA. To do so we use the `ardca`

method that return the parameters (stored in `arnet`

in the cell below), and the alignment in numerical format and other algorithms variables (stored in `arvar`

in the cell below). The default autoregressive order is set to `:ENTROPIC`

. We set the $L_2$ regularization to 0.02 for the $J$ and 0.001 for the $h$.

The keyword arguments for the `ardca`

method are (with their default value):

`lambdaJ::Real=0.01`

coupling L₂ regularization parameter (lagrange multiplier)`lambdaH::Real=0.01`

field L₂ regularization parameter (lagrange multiplier)`pc_factor::Real=0`

pseudocount factor for calculation of`p0`

`epsconv::Real=1.0e-5`

(convergence parameter)`maxit::Int=1000`

(maximum number of iteration - don't change)`verbose::Bool=true`

(set to`false`

to suppress printing on screen)`method::Symbol=:LD_LBFGS`

(optimization method)`permorder::Union{Symbol,Vector{Ti}}=:ENTROPIC`

(permutation order). Possible values are:`[:NATURAL, :ENTROPIC, :REV_ENTROPIC, :RANDOM]`

or a custom permutation vector.

`arnet,arvar=ardca("data/PF14/PF00014_mgap6.fasta.gz", verbose=false, lambdaJ=0.02,lambdaH=0.001);`

## 1. Sequence Generation

We now generate `M`

sequences using the `sample`

method:

```
M = 1_000;
generated_alignment = sample(arnet,M);
```

The generated alignment has is a 𝐿×𝑀 matrix of integer where 𝐿 is the alignment's length, and 𝑀 the number of samples.

Interestingly, we for each sequence we can also compute the likelihood with the sample*with*weights method.

`loglikelihood,generated_alignment = sample_with_weights(arnet,M);`

## 2. Contact Prediction

We can compute the epistatic score for residue-residue contact prediction. To do so, we can use the `epistatic_score`

method. The epistatic score is computed on any target sequence of the MSA. Empirically, it turns out the the final score does not depend much on the choice of the target sequence.

The autput is contained in a `Vector`

of `Tuple`

ranked in descendic score order. Each `Tuple`

contains $i,j,s_{ij}$ where $s_{ij}$ is the epistatic score of the residue pair $i,j$. The residue numbering is that of the MSA, and not of the unaligned full sequences.

```
target_sequence = 1
score=epistatic_score(arnet,arvar,target_sequence)
```

## 3. Predicting mutational effects

For any reference sequence, we can easily predict the mutational effect for all single mutants. Of course we can extract this information only for the *non-gapped* residues of the target sequence.

This is done with the `dms_single_site`

method, which returns a `q×L`

matrix `D`

containing $\log(P(mut))/\log(P(wt))$ for all single site mutants of the reference sequence `seqid`

(i.e. the so-called wild type sequence), and `idxgap`

a vector of indices of the residues of the reference sequence that contain gaps (i.e. the 21 amino-acid) for which the score has no sense and is set by convention to `+Inf`

.

A negative value indicate a beneficial mutation, a value 0 indicate the wild-type amino-acid.

```
target_sequence = 1
D,idxgap=dms_single_site(arnet,arvar,target_sequence)
```

## Methods Reference

`ArDCA.ardca`

`ArDCA.ardca`

`ArDCA.dms_single_site`

`ArDCA.loglikelihood`

`ArDCA.loglikelihood`

`ArDCA.loglikelihood`

`ArDCA.loglikelihood`

`ArDCA.sample`

`ArDCA.sample_subsequence`

`ArDCA.sample_subsequence`

`ArDCA.sample_with_weights`

`ArDCA.siteloglikelihood`

`ArDCA.softmax!`

`ArDCA.ardca`

— Method`ardca(filename::String; kwds...)`

Run `ardca`

on the fasta alignment in `filename`

Return two `struct`

: `::ArNet`

(containing the inferred hyperparameters) and `::ArVar`

Optional arguments:

`max_gap_fraction::Real=0.9`

maximum fraction of insert in the sequence`remove_dups::Bool=true`

if`true`

remove duplicated sequences`theta=:auto`

if`:auto`

compute reweighint automatically. Otherwise set a`Float64`

value`0 ≤ theta ≤ 1`

`lambdaJ::Real=0.01`

coupling L₂ regularization parameter (lagrange multiplier)`lambdaH::Real=0.01`

field L₂ regularization parameter (lagrange multiplier)`pc_factor::Real=1/size(Z,2)`

pseudocount factor for calculation of`p0`

, defaults to one over the number of sequences.`epsconv::Real=1.0e-5`

convergence value in minimzation`maxit::Int=1000`

maximum number of iteration in minimization`verbose::Bool=true`

set to`false`

to stop printing convergence info on`stdout`

`method::Symbol=:LD_LBFGS`

optimization strategy see`NLopt.jl`

for other options`permorder::Union{Symbol,Vector{Ti}}=:ENTROPIC`

permutation order. Possible values are`:NATURAL,:ENTROPIC,:REV_ENTROPIC,:RANDOM`

or a custom permutation vector

**Examples**

`julia> arnet, arvar = ardca("pf14.fasta", permorder=:ENTROPIC)`

`ArDCA.ardca`

— Method`ardca(Z::Array{Ti,2},W::Vector{Float64}; kwds...)`

Auto-regressive analysis on the L×M alignment `Z`

(numerically encoded in 1,…,21), and the `M`

-dimensional normalized weight vector `W`

.

Return two `struct`

: `::ArNet`

(containing the inferred hyperparameters) and `::ArVar`

Optional arguments:

`lambdaJ::Real=0.01`

coupling L₂ regularization parameter (lagrange multiplier)`lambdaH::Real=0.01`

field L₂ regularization parameter (lagrange multiplier)`pc_factor::Real=0`

pseudocount factor for calculation of`p0`

, defaults to one over the number of sequences.`epsconv::Real=1.0e-5`

convergence value in minimzation`maxit::Int=1000`

maximum number of iteration in minimization`verbose::Bool=true`

set to`false`

to stop printing convergence info on`stdout`

`method::Symbol=:LD_LBFGS`

optimization strategy see`NLopt.jl`

for other options`permorder::Union{Symbol,Vector{Ti}}=:ENTROPIC`

permutation order. Possible values are`:NATURAL,:ENTROPIC,:REV_ENTROPIC,:RANDOM`

or a custom permutation vector

**Examples**

`julia> arnet, arvar= ardca(Z,W,lambdaJ=0,lambdaH=0,permorder=:REV_ENTROPIC,epsconv=1e-12);`

`ArDCA.dms_single_site`

— Method`dms_single_site(arnet::ArNet, arvar::ArVar, seqid::Int; pc::Float64=0.1)`

Return a `q×L`

matrix of containing `-log(P(mut))/log(P(seq))`

for all single site mutants of the reference sequence `seqid`

, and a vector of the indices of the residues of the reference sequence that contain gaps (i.e. the 21 amino-acid) for which the score has no sense and is set by convention to `+Inf`

. A negative value indicate a beneficial mutation, a value 0 indicate the wild-type amino-acid.

`ArDCA.loglikelihood`

— Method`loglikelihood(arnet::ArNet, arvar::ArVar)`

Return the vector of loglikelihoods computed from `arvar.Z`

under the model `arnet`

. `size(arvar.Z) == N,M`

where `N`

is the sequences length, and `M`

the number of sequences. The returned vector has `M`

elements reweighted by `arvar.W`

`ArDCA.loglikelihood`

— Method`loglikelihood(x0::String, arnet::ArNet) where {T<:Integer})`

Return the loglikelihood of the `String`

`x0`

under the model `arnet`

.

`ArDCA.loglikelihood`

— Method`loglikelihood(x0::Matrix{T}, arnet::ArNet) where {T<:Integer})`

Return the vector of loglikelihoods computed from `Matrix`

`x0`

under the model `arnet`

. `size(x0) == N,M`

where `N`

is the sequences length, and `M`

the number of sequences. The returned vector has `M`

elements.

`ArDCA.loglikelihood`

— Method`loglikelihood(x0::Vector{T}, arnet::ArNet) where {T<:Integer})`

Return the loglikelihood of sequence `x0`

encoded in integer values in `1:q`

under the model `arnet`

`.

`ArDCA.sample`

— Method`sample(arnet::ArNet, msamples::Int)`

Return a generated alignment in the form of a `N × msamples`

matrix of type `::Matrix{Int}`

**Examples**

```
julia> arnet,arvar=ardca("file.fasta",verbose=true,permorder=:ENTROPIC, lambdaJ=0.001,lambdaH=0.001);
julia> Zgen=Zgen=sample(arnet,1000);
```

`ArDCA.sample_subsequence`

— Method`sample_subsequence(x::String, arnet::ArNet, msamples)`

Return a generated alignment in the form of a `N × msamples`

matrix of type `::Matrix{Int}`

and the relative probabilities under the model. The alignment is forced to start with with a sequence `x`

(in amino acid single letter alphabet) and then autoregressively generated.

**Example**

```
julia> arnet,arvar=ardca("file.fasta",verbose=true,permorder=:ENTROPIC, lambdaJ=0.001,lambdaH=0.001);
julia> Wgen,Zgen=sample_subsequence("MAKG",arnet,1000);
```

`ArDCA.sample_subsequence`

— Method`sample_subsequence(x::Vector{T}, arnet::ArNet, msamples)`

Return a generated alignment in the form of a `N × msamples`

matrix of type `::Matrix{Int}`

and the relative probabilities under the model. The alignment is forced to start with with a sequence `x`

(in integer number coding) and then autoregressively generated.

**Example**

```
julia> arnet,arvar=ardca("file.fasta",verbose=true,permorder=:ENTROPIC, lambdaJ=0.001,lambdaH=0.001);
julia> Wgen,Zgen=sample_subsequence([11,1,9,6],arnet,1000);
```

`ArDCA.sample_with_weights`

— Method`sample_with_weights(arnet::ArNet, msamples::Int)`

Return a generated alignment in the form of a `N × msamples`

matrix of type `::Matrix{Int}`

and the relative probabilities under the module

**Examples**

```
julia> arnet,arvar=ardca("file.fasta",verbose=true,permorder=:ENTROPIC, lambdaJ=0.001,lambdaH=0.001);
julia> Wgen,Zgen=sample_with_weights(arnet,1000);
```

`ArDCA.siteloglikelihood`

— Method`siteloglikelihood(i::Int,arnet::ArNet, arvar::ArVar)`

Return the loglikelihood relative to site i computed from `arvar.Z`

under the model `arnet`

.

`ArDCA.softmax!`

— Method`softmax(x::AbstractArray{<:Real})`

Return the `softmax transformation`

applied to `x`