Docstrings · MIToS.jl

MIToS.Information — Module

The Information module of MIToS defines types and functions useful to calculate information measures (e.g. Mutual Information (MI) and Entropy) over a Multiple Sequence Alignment (MSA). This module was designed to count Residues (defined in the MSA module) in special contingency tables (as fast as possible) and to derive probabilities from this counts. Also, includes methods for applying corrections to that tables, e.g. pseudocounts and pseudo frequencies. Finally, Information allows to use this probabilities and counts to estimate information measures and other frequency based values.

Features

Estimate multi dimensional frequencies and probabilities tables from sequences, MSAs, etc...
Correction for small number of observations
Correction for data redundancy on a MSA
Estimate information measures
Calculate corrected mutual information between residues

using MIToS.Information

MIToS.Information.BLOSUM62_Pi — Constant

BLOSUM62 probabilities P(aa) for each residue on the UngappedAlphabet. SUM: 0.9987

MIToS.Information.BLOSUM62_Pij — Constant

Table with conditional probabilities of residues based on BLOSUM62. The normalization is done row based. The firts row contains the P(aa|A) and so one.

MIToS.Information.AdditiveSmoothing — Type

Additive Smoothing or fixed pseudocount λ for ResidueCount (in order to estimate probabilities when the number of samples is low).

Common values of λ are:

0 : No cell frequency prior, gives you the maximum likelihood estimator.
0.05 is the optimum value for λ found in Buslje et. al. 2009, similar results was obtained for λ in the range [0.025, 0.075].
1 / p : Perks prior (Perks, 1947) where p the number of parameters (i.e. residues, pairs of residues) to estimate. If p is the number of residues (20 without counting gaps), this gives you 0.05.
sqrt(n) / p : Minimax prior (Trybula, 1958) where n is the number of samples and p the number of parameters to estimate. If the number of samples n is 400 (minimum number of sequence clusters for achieve good performance in Buslje et. al. 2009) for estimating 400 parameters (pairs of residues without counting gaps) this gives you 0.05.
0.5 : Jeffreys prior (Jeffreys, 1946).
1 : Bayes-Laplace uniform prior, aka. Laplace smoothing.

MIToS.Information.BLOSUM_Pseudofrequencies — Type

BLOSUM_Pseudofrequencies type. It takes to arguments/fields:

α : Usually the number of sequences or sequence clusters in the MSA.
β : The weight of the pseudofrequencies, a value close to 8.512 when α is the number of sequence clusters.

MIToS.Information.ContingencyTable — Type

A ContingencyTable is a multidimensional array. It stores the contingency matrix, its marginal values and total. The type also has an internal and private temporal array and an alphabet object. It's a parametric type, taking three ordered parameters:

T : The element type of the multidimensional array.
N : It's the dimension of the array and should be an Int.
A : This should be a type, subtype of ResidueAlphabet, i.e.: UngappedAlphabet,

GappedAlphabet or ReducedAlphabet.

A ContingencyTable can be created from an alphabet if all the parameters are given. Otherwise, you need to give a type, a number (Val) and an alphabet. You can also create a ContingencyTable using a matrix and a alphabet. For example:

ContingencyTable{Float64, 2, UngappedAlphabet}(UngappedAlphabet())
ContingencyTable(Float64, Val{2}, UngappedAlphabet())
ContingencyTable(zeros(Float64,20,20), UngappedAlphabet())

MIToS.Information.Counts — Type

A Counts object wraps a ContingencyTable storing counts/frequencies.

MIToS.Information.NoPseudocount — Type

You can use NoPseudocount() to avoid pseudocount corrections where a Pseudocount type is needed.

MIToS.Information.NoPseudofrequencies — Type

You can use NoPseudofrequencies() to avoid pseudocount corrections where a Pseudofrequencies type is needed.

MIToS.Information.Probabilities — Type

A Probabilities object wraps a ContingencyTable storing probabilities. It doesn't perform any check. If the total isn't one, you must use normalize or normalize!on the ContingencyTable before wrapping it to make the sum of the probabilities equal to one.

MIToS.Information.Pseudocount — Type

Parametric abstract type to define pseudocount types

MIToS.Information.Pseudofrequencies — Type

Parametric abstract type to define pseudofrequencies types

Base.count! — Method

It populates a ContingencyTable (first argument) using the frequencies in the sequences (last positional arguments). The dimension of the table must match the number of sequences and all the sequences must have the same length. You must indicate the used weights and pseudocounts as second and third positional arguments respectively. You can use NoPseudofrequencies() and NoClustering() to avoid the use of sequence weighting and pseudocounts, respectively.

Base.count — Method

It returns a ContingencyTable wrapped in a Counts type with the frequencies of residues in the sequences that takes as arguments. The dimension of the table is equal to the number of sequences. You can use the keyword arguments alphabet, weights and pseudocounts to indicate the alphabet of the table (default to UngappedAlphabet()), a clustering result (default to NoClustering()) and the pseudocounts (default to NoPseudocount()) to be used during the estimation of the frequencies.

LinearAlgebra.normalize! — Method

normalize! makes the sum of the frequencies to be one, in place.

LinearAlgebra.normalize — Method

normalize returns another table where the sum of the frequencies is one.

MIToS.Information.APC! — Method

APC (Dunn et. al. 2008)

MIToS.Information.BLMI — Method

BLMI takes a MSA or a file and a FileFormat as first arguments. It calculates a Z score (ZBLMI) and a corrected MI/MIp as described on Busjle et. al. 2009 but using using BLOSUM62 pseudo frequencies instead of a fixed pseudocount.

Keyword argument, type, default value and descriptions:

  - beta        Float64   8.512   β for BLOSUM62 pseudo frequencies
  - lambda      Float64   0.0     Low count value
  - threshold             62      Percent identity threshold for sequence clustering (Hobohm I)
  - maxgap      Float64   0.5     Maximum fraction of gaps in positions included in calculation
  - apc         Bool      true    Use APC correction (MIp)
  - samples     Int       50      Number of samples for Z-score
  - fixedgaps   Bool      true    Fix gaps positions for the random samples

This function returns:

  - Z score (ZBLMI)
  - MI or MIp using BLOSUM62 pseudo frequencies (BLMI/BLMIp)

MIToS.Information._calculate_blosum_pseudofrequencies! — Method

_calculate_blosum_pseudofrequencies!{T}(Pab::ContingencyTable{T,2,UngappedAlphabet})

This function uses the conditional probability matrix BLOSUM62_Pij to fill the temporal array field of Pab with pseudo frequencies (Gab). This function needs the real frequencies/probabilities Pab because they are used to estimate the pseudofrequencies.

Gab = Σcd Pcd ⋅ BLOSUM62( a | c ) ⋅ BLOSUM62( b | d )

MIToS.Information._marginal — Method

_marginal(1,:A,:i,:value) generates the expression: A[i_1, 1] += value

MIToS.Information._mean_column — Method

Mean mutual information of column a (Dunn et. al. 2008). Summation is over j=1 to N, j ≠ a. Total is N-1.

MIToS.Information._mean_total — Method

Mean mutual information of column a (Dunn et. al. 2008). Summation is over j=1 to N, j ≠ a. Total is N-1.

Overall mean mutual information (Dunn et. al. 2008). 2/(N*(N-1)) by the sum of MI where the indices run i=1 to N-1, j=i+1 to N (triu).

MIToS.Information._test_index — Method

_test_index(1, i, continue) generates the expression: i_1 >= 22 && continue

MIToS.Information.apply_pseudocount! — Method

It adds the pseudocount value to the table cells.

MIToS.Information.apply_pseudofrequencies! — Method

apply_pseudofrequencies!{T}(Pab::ContingencyTable{T,2,UngappedAlphabet}, pseudofrequencies::BLOSUM_Pseudofrequencies)

When a BLOSUM_Pseudofrequencies(α,β) is used, this function applies pseudofrequencies Gab over Pab, as a weighted mean of both. It uses the conditional probability matrix BLOSUM62_Pij and the real frequencies/probabilities Pab to estimate the pseudofrequencies Gab. α is the weight of the real frequencies Pab and β the weight of the pseudofrequencies.

Gab = Σcd Pcd ⋅ BLOSUM62( a | c ) ⋅ BLOSUM62( b | d ) Pab = (α ⋅ Pab + β ⋅ Gab )/(α + β)

MIToS.Information.buslje09 — Method

buslje09 takes a MSA or a file and a FileFormat as first arguments. It calculates a Z score and a corrected MI/MIp as described on Busjle et. al. 2009.

keyword argument, type, default value and descriptions:

  - lambda      Float64   0.05    Low count value
  - clustering  Bool      true    Sequence clustering (Hobohm I)
  - threshold             62      Percent identity threshold for clustering
  - maxgap      Float64   0.5     Maximum fraction of gaps in positions included in calculation
  - apc         Bool      true    Use APC correction (MIp)
  - samples     Int       100     Number of samples for Z-score
  - fixedgaps   Bool      true    Fix gaps positions for the random samples
  - alphabet    ResidueAlphabet UngappedAlphabet()  Residue alphabet to be used

This function returns:

  - Z score
  - MI or MIp

MIToS.Information.cleanup! — Method

cleanup! fills the temporal, table and marginals arrays with zeros. It also sets total to zero.

MIToS.Information.cumulative — Method

cumulative allows to calculate cumulative scores (i.e. cMI) as defined in Buslje et. al. 2010

"We calculated a cumulative mutual information score (cMI) for each residue as the sum of MI values above a certain threshold for every amino acid pair where the particular residue appears. This value defines to what degree a given amino acid takes part in a mutual information network." Buslje, Cristina Marino, Elin Teppa, Tomas Di Doménico, José María Delfino, and Morten Nielsen. Networks of high mutual information define the structural proximity of catalytic sites: implications for catalytic residue identification. PLoS Comput Biol 6, no. 11 (2010): e1000978.

MIToS.Information.delete_dimensions! — Method

delete_dimensions!(out::ContingencyTable, in::ContingencyTable, dimensions::Int...)

This function fills a ContingencyTable with the counts/probabilities on in after the deletion of dimensions. i.e. This is useful for getting Pxy from Pxyz.

MIToS.Information.delete_dimensions — Method

delete_dimensions(in::ContingencyTable, dimensions::Int...)

This function creates a ContingencyTable with the counts/probabilities on in after the deletion of dimensions. i.e. This is useful for getting Pxy from Pxyz.

MIToS.Information.gap_intersection_percentage — Method

It calculates the gap intersection as percentage from a table of Counts.

MIToS.Information.gap_union_percentage — Method

It calculates the gap union as percentage from a table of Counts.

MIToS.Information.gaussdca — Method

Wrapper function to GaussDCA.gDCA. You need to install GaussDCA:

using Pkg

Pkg.add(PackageSpec(url="https://github.com/carlobaldassi/GaussDCA.jl", rev="master"))

Look into GaussDCA.jl README for further information. If you use this wrapper, please cite the GaussDCA publication and the package's doi.

It's possible to indicate the path to the julia binary where GaussDCA is installed. However, it's recommended to use the same version where MIToS is installed. That is because this function use serialize/deserialize to transfer data between the processes.

GaussDCA Publication: Baldassi, Carlo, Marco Zamparo, Christoph Feinauer, Andrea Procaccini, Riccardo Zecchina, Martin Weigt, and Andrea Pagnani. "Fast and accurate multivariate Gaussian modeling of protein families: predicting residue contacts and protein-interaction partners." PloS one 9, no. 3 (2014): e92721.

MIToS.Information.getalphabet — Method

getalphabet allows to access the stored alphabet object.

MIToS.Information.getcontingencytable — Method

getcontingencytable allows to access the wrapped ContingencyTable in a Probabilities or Counts object.

MIToS.Information.getmarginals — Method

getmarginals allows to access the array with the marginal values (NamedArray).

MIToS.Information.getmarginalsarray — Method

getmarginalsarray allows to access the array with the marginal values (Array without names).

MIToS.Information.gettable — Method

gettable allows to access the table (NamedArray).

MIToS.Information.gettablearray — Method

gettablearray allows to access the table (Array without names).

MIToS.Information.gettotal — Method

gettotal allows to access the stored total value.

MIToS.Information.kullback_leibler — Method

It calculates the Kullback-Leibler (KL) divergence from a table of Probabilities. The second positional argument is a Probabilities or ContingencyTable with the background distribution. It's optional, the default is the BLOSUM62_Pi table. Use last and optional positional argument to change the base of the log. The default base is e, so the result is in nats. You can use 2.0 as base to get the result in bits.

MIToS.Information.mapcolfreq! — Method

weights (default: NoClustering()): Weights to be used for table counting.
pseudocounts (default: NoPseudocount()): Pseudocount object to be applied to table.
pseudofrequencies (default: NoPseudofrequencies()): Pseudofrequencies to be applied to the normalized (probabilities) table.

MIToS.Information.mapcolpairfreq! — Method

It efficiently map a function (first argument) that takes a table of Counts or Probabilities (third argument). The table is filled in place with the counts or probabilities of each pair of columns from the msa (second argument). The fourth positional argument usediagonal indicates if the function should be applied to identical element pairs (default to Val{true}).

weights (default: NoClustering()): Weights to be used for table counting.
pseudocounts (default: NoPseudocount()): Pseudocount object to be applied to table.
pseudofrequencies (default: NoPseudofrequencies()): Pseudofrequencies to be applied to the normalized (probabilities) table.
diagonalvalue (default: 0): Value to fill diagonal elements if usediagonal is Val{false}.

MIToS.Information.mapseqfreq! — Method

weights (default: NoClustering()): Weights to be used for table counting.
pseudocounts (default: NoPseudocount()): Pseudocount object to be applied to table.
pseudofrequencies (default: NoPseudofrequencies()): Pseudofrequencies to be applied to the normalized (probabilities) table.

MIToS.Information.mapseqpairfreq! — Method

It efficiently map a function (first argument) that takes a table of Counts or Probabilities (third argument). The table is filled in place with the counts or probabilities of each pair of sequences from the msa (second argument). The fourth positional argument usediagonal indicates if the function should be applied to identical element pairs (default to Val{true}).

weights (default: NoClustering()): Weights to be used for table counting.
pseudocounts (default: NoPseudocount()): Pseudocount object to be applied to table.
pseudofrequencies (default: NoPseudofrequencies()): Pseudofrequencies to be applied to the normalized (probabilities) table.
diagonalvalue (default: 0): Value to fill diagonal elements if usediagonal is Val{false}.

MIToS.Information.marginal_entropy — Method

It calculates marginal entropy (H) from a table of Counts or Probabilities. The second positional argument is used to indicate the magin used to calculate the entropy, e.g. it estimates the entropy H(X) if marginal is 1, H(Y) for 2, etc. Use last and optional positional argument to change the base of the log. The default base is e, so the result is in nats. You can use 2.0 as base to get the result in bits.

MIToS.Information.mutual_information — Method

It calculates Mutual Information (MI) from a table of Counts or Probabilities. Use last and optional positional argument to change the base of the log. The default base is e, so the result is in nats. You can use 2.0 as base to get the result in bits. Calculation of MI from Counts is faster than from Probabilities.

MIToS.Information.normalized_mutual_information — Method

It calculates a Normalized Mutual Information (nMI) by Entropy from a table of Counts or Probabilities.

nMI(X, Y) = MI(X, Y) / H(X, Y)

MIToS.Information.pairwisegapfraction — Method

It takes a MSA or a file and a FileFormat as first arguments. It calculates the percentage of gaps on columns pairs (union and intersection) using sequence clustering (Hobohm I).

Argument, type, default value and descriptions:

    - clustering  Bool      true    Sequence clustering (Hobohm I)
    - threshold             62      Percent identity threshold for sequence clustering (Hobohm I)

This function returns:

    - pairwise gap union as percentage
    - pairwise gap intersection as percentage

MIToS.Information.probabilities! — Method

It populates a ContingencyTable (first argument) using the probabilities in the sequences (last positional arguments). The dimension of the table must match the number of sequences and all the sequences must have the same length. You must indicate the used weights, pseudocounts and pseudofrequencies as second, third and fourth positional arguments respectively. You can use NoClustering(), NoPseudocount() and NoPseudofrequencies() to avoid the use of sequence weighting, pseudocounts and pseudofrequencies, respectively.

MIToS.Information.probabilities — Method

It returns a ContingencyTable wrapped in a Probabilities type with the frequencies of residues in the sequences that takes as arguments. The dimension of the table is equal to the number of sequences. You can use the keyword arguments alphabet, weights, pseudocounts and pseudofrequencies to indicate the alphabet of the table (default to UngappedAlphabet()), a clustering result (default to NoClustering()), the pseudocounts (default to NoPseudocount()) and the pseudofrequencies (default to NoPseudofrequencies()) to be used during the estimation of the probabilities.

MIToS.Information.update_marginals! — Method

update_marginals! updates the marginal and total values using the table.

StatsBase.entropy — Method

It calculates the Shannon entropy (H) from a table of Counts or Probabilities. Use last and optional positional argument to change the base of the log. The default base is e, so the result is in nats. You can use 2.0 as base to get the result in bits.

MIToS.PDB — Module

The module PDB defines types and methods to work with protein structures inside Julia. It is useful to link structural and sequential information, and needed for measure the predictive performance at protein contact prediction of mutual information scores.

Features

Read and parse PDF and PDBML files
Calculate distance and contacts between atoms or residues
Determine interaction between residues

using MIToS.PDB

MIToS.PDB._hbond_acceptor — Constant

Keys come from Table 1 of Bickerton et. al. 2011, Antecedents come from come from: http://biomachina.org/courses/modeling/download/topallh22x.pro Synonyms come from: http://www.bmrb.wisc.edu/refinfo/atomnom.tbl

MIToS.PDB._hbond_donor — Constant

Keys come from Table 1 of Bickerton et. al. 2011, The hydrogen names of the donor comes from: http://biomachina.org/courses/modeling/download/topallh22x.pro Synonyms come from: http://www.bmrb.wisc.edu/refinfo/atomnom.tbl

MIToS.PDB.covalentradius — Constant

Covalent radius in Å of each element from the Additional file 1 of PICCOLO [1]. Hydrogen was updated using the value on Table 2 from Cordero et. al. [2].

Bickerton, G. R., Higueruelo, A. P., & Blundell, T. L. (2011).

Comprehensive, atomic-level characterization of structurally characterized protein-protein interactions: the PICCOLO database. BMC bioinformatics, 12(1), 313.

Cordero, B., Gómez, V., Platero-Prats, A. E., Revés, M.,

Echeverría, J., Cremades, E., ... & Alvarez, S. (2008). Covalent radii revisited. Dalton Transactions, (21), 2832-2838.

MIToS.PDB.vanderwaalsradius — Constant

van der Waals radius in Å from the Additional file 1 of Bickerton et. al. 2011

Bickerton, G. R., Higueruelo, A. P., & Blundell, T. L. (2011).

Comprehensive, atomic-level characterization of structurally characterized protein-protein interactions: the PICCOLO database. BMC bioinformatics, 12(1), 313.

MIToS.PDB.Coordinates — Type

A Coordinates object is a fixed size vector with the coordinates x,y,z.

MIToS.PDB.PDBAtom — Type

A PDBAtom object contains the information from a PDB atom, without information of the residue. It has the following fields that you can access at any moment for query purposes:

- `coordinates` : x,y,z coordinates, e.g. `Coordinates(109.641,73.162,42.7)`.
- `atom` : Atom name, e.g. `"CA"`.
- `element` : Element type of the atom, e.g. `"C"`.
- `occupancy` : A float number with the occupancy, e.g. `1.0`.
- `B` : B factor as a string, e.g. `"23.60"`.

MIToS.PDB.PDBFile — Type

PDBFile <: FileFormat

Protein Data Bank (PDB) format. It provides a standard representation for macromolecular structure data derived from X-ray diffraction and NMR studies.

MIToS.PDB.PDBML — Type

PDBML <: FileFormat

Protein Data Bank Markup Language (PDBML), a representation of PDB data in XML format.

MIToS.PDB.PDBResidue — Type

A PDBResidue object contains all the information about a PDB residue. It has the following fields that you can access at any moment for query purposes:

- `id` : A `PDBResidueIdentifier` object.
- `atoms` : A vector of `PDBAtom`s.

MIToS.PDB.PDBResidueIdentifier — Type

A PDBResidueIdentifier object contains the information needed to identity PDB residues. It has the following fields that you can access at any moment for query purposes:

- `PDBe_number` : It's only used when a PDBML is readed (PDBe number as a string).
- `number` : PDB residue number, it includes insertion codes, e.g. `"34A"`.
- `name` : Three letter residue name in PDB, e.g. `"LYS"`.
- `group` : It can be `"ATOM"` or `"HETATM"`.
- `model` : The model number as a string, e.g. `"1"`.
- `chain` : The chain as a string, e.g. `"A"`.

Base.angle — Method

angle(a::Coordinates, b::Coordinates, c::Coordinates)

Angle (in degrees) at b between a-b and b-c

Base.any — Method

any(f::Function, a::PDBResidue, b::PDBResidue, criteria::Function)

Test if the function f is true for any pair of atoms between the residues a and b. This function only test atoms that returns true for the fuction criteria.

Base.any — Method

any(f::Function, a::PDBResidue, b::PDBResidue)

Test if the function f is true for any pair of atoms between the residues a and b

Base.parse — Method

parse(pdbml, ::Type{PDBML}; chain=All, model=All, group=All, atomname=All, onlyheavy=false, label=true, occupancyfilter=false)

Reads a LightXML.XMLDocument representing a pdb file. Returns a list of PDBResidues (view MIToS.PDB.PDBResidues). Setting chain, model, group, atomname and onlyheavy values can be used to select of a subset of all residues. If not set, all residues are returned. If the keyword argument label (default: true) is false,the auth_ attributes will be use instead of the label_ attributes for chain, atom and residue name fields. The auth_ attributes are alternatives provided by an author in order to match the identification/values used in the publication that describes the structure. If the keyword argument occupancyfilter (default: false) is true, only the atoms with the best occupancy are returned.

Base.parse — Method

parse(io, ::Type{PDBFile}; chain=All, model=All, group=All, atomname=All, onlyheavy=false, occupancyfilter=false)

Reads a text file of a PDB entry. Returns a list of PDBResidue (view MIToS.PDB.PDBResidues). Setting chain, model, group, atomname and onlyheavy values can be used to select of a subset of all residues. Group can be "ATOM" or "HETATM". If not set, all residues are returned. If the keyword argument occupancyfilter (default: false) is true, only the atoms with the best occupancy are returned.

Base.print — Function

print(io, res, format::Type{PDBFile}) print(res, format::Type{PDBFile})

Print a PDBResidue or a vector of PDBResidues in PDB format.

MIToS.PDB.CAmatrix — Method

Returns a matrix with the x, y and z coordinates of the Cα with best occupancy for each PDBResidue of the ATOM group. If a residue doesn't have a Cα, its Cα coordinates are NaNs.

MIToS.PDB._change_B — Method

Returns a new PDBAtom but with a B as B-factor

MIToS.PDB._generate_residues — Function

Used for parsing a PDB file into Vector{PDBResidue}

MIToS.PDB._get_matched_Cαs — Method

Return the matching CA matrices after deleting the rows/residues where the CA is missing in at least one structure.

MIToS.PDB._rmsf_test — Method

This looks for errors in the input to rmsf methods

MIToS.PDB.aromatic — Method

There's an aromatic interaction if centriods are at 6.0 Å or less.

MIToS.PDB.aromaticsulphur — Method

Returns true if an sulphur and an aromatic atoms are 5.3 Å or less"

MIToS.PDB.atoms — Function

atoms(residue_list, model, chain, group, residue, atom)

These return a vector of PDBAtoms with the selected subset of atoms. You can use the type All (default value of the positional arguments) to avoid filtering a that level.

MIToS.PDB.bestoccupancy — Method

Takes a Vector of PDBAtoms and returns a Vector of the PDBAtoms with best occupancy.

MIToS.PDB.center! — Method

center!(A::AbstractMatrix{Float64})

Takes a set of points A as an NxD matrix (N: number of points, D: dimension). Translates A in place so that its centroid is at the origin of coordinates

MIToS.PDB.centeredcoordinates — Function

Returns a Matrix{Float64} with the centered coordinates of all the atoms in residues. An optional positional argument CA (default: true) defines if only Cα carbons should be used to center the matrix.

MIToS.PDB.centeredresidues — Function

Returns a new Vector{PDBResidue} with the PDBResidues having centered coordinates. An optional positional argument CA (default: true) defines if only Cα carbons should be used to center the matrix.

MIToS.PDB.change_coordinates — Function

change_coordinates(residue::PDBResidue, coordinates::AbstractMatrix{Float64}, offset::Int=1)

Returns a new PDBResidues with (x,y,z) from a coordinates AbstractMatrix{Float64} You can give an offset indicating in wich matrix row starts the (x,y,z) coordinates of the residue.

MIToS.PDB.change_coordinates — Method

change_coordinates(residues::AbstractVector{PDBResidue}, coordinates::AbstractMatrix{Float64})

Returns a new Vector{PDBResidues} with (x,y,z) from a coordinates Matrix{Float64}

MIToS.PDB.change_coordinates — Method

change_coordinates(atom::PDBAtom, coordinates::Coordinates)

Returns a new PDBAtom but with a new coordinates

MIToS.PDB.check_atoms_for_interactions — Method

This function takes a PDBResidue and returns true only if all the atoms can be used for checking interactions.

MIToS.PDB.contact — Method

contact(a::Coordinates, b::Coordinates, limit::AbstractFloat)

It returns true if the distance is less or equal to the limit. It doesn't call sqrt because it does squared_distance(a,b) <= limit^2.

MIToS.PDB.contact — Method

contact(A::PDBResidue, B::PDBResidue, limit::AbstractFloat; criteria::String="All")

Returns true if the residues A and B are at contact distance (limit). The available distance criteria are: Heavy, All, CA, CB (CA for GLY)

MIToS.PDB.contact — Method

contact(residues::Vector{PDBResidue}, limit::AbstractFloat; criteria::String="All")

If contact takes a Vector{PDBResidue}, It returns a matrix with all the pairwise comparisons (contact map).

MIToS.PDB.coordinatesmatrix — Method

Returns a matrix with the x, y, z coordinates of each atom in each PDBResidue

MIToS.PDB.covalent — Method

Returns true if the distance between atoms is less than the sum of the covalentradius of each atom.

MIToS.PDB.distance — Method

It calculates the squared euclidean distance.

MIToS.PDB.distance — Method

distance(residues::Vector{PDBResidue}; criteria::String="All")

If distance takes a Vector{PDBResidue} returns a PairwiseListMatrix{Float64, false} with all the pairwise comparisons (distance matrix).

MIToS.PDB.disulphide — Method

Returns true if two CYS's S are at 2.08 Å or less

MIToS.PDB.downloadpdb — Method

It downloads a gzipped PDB file from PDB database. It requires a four character pdbcode. Its default format is PDBML (PDB XML) and It uses the baseurl "http://www.rcsb.org/pdb/files/". filename is the path/name of the output file. This function calls MIToS.Utils.download_file that calls download from the HTTP.jl package. You can use keyword arguments from HTTP.request. Use the headers keyword argument to pass a Dict{String, String} with the header information.

MIToS.PDB.downloadpdbheader — Method

It downloads a JSON file containing the PDB header information.

MIToS.PDB.findCB — Method

Returns a vector of indices for CB (CA for GLY)

MIToS.PDB.findatoms — Method

findatoms(res::PDBResidue, atom::String)

Returns a index vector of the atoms with the given atom name.

MIToS.PDB.findheavy — Method

Returns a list with the index of the heavy atoms (all atoms except hydrogen) in the PDBResidue

MIToS.PDB.getCA — Method

Returns the Cα with best occupancy in the PDBResidue. If the PDBResidue has no Cα, missing is returned.

MIToS.PDB.getpdbdescription — Method

Access general information about a PDB entry (e.g., Header information) using the GraphQL interface of the PDB database. It parses the JSON answer into a Dict.

MIToS.PDB.hydrogenbond — Method

This function only works if there are hydrogens in the structure. The criteria for a hydrogen bond are:

d(Ai, Aj) < 3.9Å
d(Ah, Aacc) < 2.5Å
θ(Adon, Ah, Aacc) > 90°
θ(Adon, Aacc, Aacc-antecedent) > 90°
θ(Ah, Aacc, Aacc-antecedent) > 90°

Where Ah is the donated hydrogen atom, Adon is the hydrogen bond donor atom, Aacc is the hydrogen bond acceptor atom and Aacc-antecednt is the atom antecedent to the hydrogen bond acceptor atom.

MIToS.PDB.hydrophobic — Method

There's an hydrophobic interaction if two hydrophobic atoms are at 5.0 Å or less.

MIToS.PDB.ionic — Method

There's an ionic interaction if a cationic and an anionic atoms are at 6.0 Å or less.

MIToS.PDB.is_aminoacid — Method

is_aminoacid(residue::PDBResidue)
is_aminoacid(residue_id::PDBResidueIdentifier)

This function returns true if the PDB residue is an amino acid residue. It checks if the residue's three-letter name exists in the MIToS.Utils.THREE2ONE dictionary, and returns false otherwise.

MIToS.PDB.isanionic — Method

Returns true if the atom, e.g. ("GLU","CD"), is an anionic atom in the residue.

MIToS.PDB.isaromatic — Method

Returns true if the atom, e.g. ("HIS","CG"), is an aromatic atom in the residue.

MIToS.PDB.isatom — Method

It tests if the atom has the indicated atom name.

MIToS.PDB.iscationic — Method

Returns true if the atom, e.g. ("ARG","NE"), is a cationic atom in the residue.

MIToS.PDB.ishbondacceptor — Method

Returns true if the atom, e.g. ("ARG","O"), is an acceptor in H bonds.

MIToS.PDB.ishbonddonor — Method

Returns true if the atom, e.g. ("ARG","N"), is a donor in H bonds.

MIToS.PDB.isresidue — Method

It tests if the PDB residue has the indicated model, chain, group and residue number.

MIToS.PDB.kabsch — Method

kabsch(A::AbstractMatrix{Float64}, B::AbstractMatrix{Float64})

This function takes two sets of points, A (refrence) and B as NxD matrices, where D is the dimension and N is the number of points. Assumes that the centroids of A and B are at the origin of coordinates. You can call center! on each matrix before calling kabsch to center the matrices in the (0.0, 0.0, 0.0). Rotates B so that rmsd(A,B) is minimized. Returns the rotation matrix. You should do B * RotationMatrix to get the rotated B.

MIToS.PDB.mean_coordinates — Method

Calculates the average/mean position of each atom in a set of structure. The function takes a vector (AbstractVector) of vectors (AbstractVector{PDBResidue}) or matrices (AbstractMatrix{Float64}) as first argument. As second (optional) argument this function can take an AbstractVector{Float64} of matrix/structure weights to return a weighted mean. When a AbstractVector{PDBResidue} is used, if the keyword argument calpha is false the RMSF is calculated for all the atoms. By default only alpha carbons are used (default: calpha=true).

MIToS.PDB.modelled_sequences — Method

modelled_sequences(residue_list::AbstractArray{PDBResidue,N}; 
    model::Union{String,Type{All}}=All, chain::Union{String,Type{All}}=All, 
    group::Union{String,Regex,Type{All}}=All) where N

This function returns an OrderedDict where each key is a named tuple (containing the model and chain identifiers), and each value is the protein sequence corresponding to the modelled residues in those chains. Therefore, the obtained sequences do not contain missing residues. All modelled residues are included by default, but those that don't satisfy specified criteria based on the model, chain, or group keyword arguments are excluded. One-letter residue names are obtained from the MIToS.Utils.THREE2ONE dictionary for all residue names that return true for is_aminoacid.

MIToS.PDB.pication — Method

There's a Π-Cation interaction if a cationic and an aromatic atoms are at 6.0 Å or less

MIToS.PDB.proximitymean — Method

proximitymean calculates the proximity mean/average for each residue as the average score (from a scores list) of all the residues within a certain physical distance to a given amino acid. The score of that residue is not included in the mean unless you set include to true. The default values are 6.05 for the distance threshold/limit and "Heavy" for the criteria keyword argument. This function allows to calculate pMI (proximity mutual information) and pC (proximity conservation) as in Buslje et. al. 2010.

Buslje, Cristina Marino, Elin Teppa, Tomas Di Doménico, José María Delfino, and Morten Nielsen. Networks of high mutual information define the structural proximity of catalytic sites: implications for catalytic residue identification. PLoS Comput Biol 6, no. 11 (2010): e1000978.

MIToS.PDB.residuepairsmatrix — Method

It creates a NamedArray containing a PairwiseListMatrix where each element (column, row) is identified with a PDBResidue from the input vector. You can indicate the value type of the matrix (default to Float64), if the list should have the diagonal values (default to Val{false}) and the diagonal values (default to NaN).

MIToS.PDB.residues — Method

residues(residue_list, model, chain, group, residue)

These return a new vector with the selected subset of residues from a list of residues. You can use the type All (default value) to avoid filtering a that level.

MIToS.PDB.residuesdict — Method

residuesdict(residue_list, model, chain, group, residue)

These return a dictionary (using PDB residue numbers as keys) with the selected subset of residues. You can use the type All (default value) to avoid filtering a that level.

MIToS.PDB.rmsd — Method

rmsd(A::AbstractMatrix{Float64}, B::AbstractMatrix{Float64})

Return RMSD between two sets of points A and B, given as NxD matrices (N: number of points, D: dimension).

MIToS.PDB.rmsd — Method

rmsd(A::AbstractVector{PDBResidue}, B::AbstractVector{PDBResidue}; superimposed::Bool=false)

Returns the Cα RMSD value between two PDB structures: A and B. If the structures are already superimposed between them, use superimposed=true to avoid a new superimposition (superimposed is false by default).

MIToS.PDB.rmsf — Method

Calculates the RMSF (Root Mean-Square-Fluctuation) between an atom and its average position in a set of structures. The function takes a vector (AbstractVector) of vectors (AbstractVector{PDBResidue}) or matrices (AbstractMatrix{Float64}) as first argument. As second (optional) argument this function can take an AbstractVector{Float64} of matrix/structure weights to return the root weighted mean-square-fluctuation around the weighted mean structure. When a Vector{PDBResidue} is used, if the keyword argument calpha is false the RMSF is calculated for all the atoms. By default only alpha carbons are used (default: calpha=true).

MIToS.PDB.selectbestoccupancy — Method

Takes a PDBResidue and a Vector of atom indices. Returns the index value of the Vector with maximum occupancy.

MIToS.PDB.squared_distance — Method

It calculates the squared euclidean distance, i.e. it doesn't spend time in sqrt

MIToS.PDB.squared_distance — Method

squared_distance(A::PDBResidue, B::PDBResidue; criteria::String="All")

Returns the squared distance between the residues A and B. The available criteria are: Heavy, All, CA, CB (CA for GLY)

MIToS.PDB.superimpose — Function

Asuper, Bsuper, RMSD = superimpose(A, B, matches=nothing)

This function takes A::AbstractVector{PDBResidue} (reference) and B::AbstractVector{PDBResidue}. Translates A and B to the origin of coordinates, and rotates B so that rmsd(A,B) is minimized with the Kabsch algorithm (using only their α carbons). Returns the rotated and translated versions of A and B, and the RMSD value.

Optionally provide matches which iterates over matched index pairs in A and B, e.g., matches = [(3, 5), (4, 6), ...]. The alignment will be constructed using just the matching residues.

MIToS.PDB.vanderwaals — Method

Test if two atoms or residues are in van der Waals contact using: distance(a,b) <= 0.5 + vanderwaalsradius[a] + vanderwaalsradius[b]. It returns distance <= 0.5 if the atoms aren't in vanderwaalsradius.

MIToS.PDB.vanderwaalsclash — Method

Returns true if the distance between the atoms is less than the sum of the vanderwaalsradius of the atoms. If the atoms aren't on the list (i.e. OXT), the vanderwaalsradius of the element is used. If there is not data in the dict, distance 0.0 is used.

MIToS.PDB.@atoms — Macro

@atoms ... model ... chain ... group ... residue ... atom ...

These return a vector of PDBAtoms with the selected subset of atoms. You can use the type All to avoid filtering that option.

MIToS.PDB.@residues — Macro

@residues ... model ... chain ... group ... residue ...

These return a new vector with the selected subset of residues from a list of residues. You can use the type All to avoid filtering that option.

MIToS.PDB.@residuesdict — Macro

@residuesdict ... model ... chain ... group ... residue ...

These return a dictionary (using PDB residue numbers as keys) with the selected subset of residues. You can use the type All to avoid filtering that option.

MIToS.Utils.Scripts.close_output — Method

Close output (check if output is STDOUT).

MIToS.Utils.Scripts.loadedversion — Method

Return the version of the loaded module.

Source: https://stackoverflow.com/questions/60587041/julia-getting-the-version-number-of-my-module

MIToS.Utils.Scripts.open_output — Method

Opens the output file or returns STDOUT.

MIToS.Utils.Scripts.parse_commandline — Method

Parse MIToS scripts command line arguments.

MIToS.Utils.Scripts.readorparse — Method

Decides if read or parse, uses parse with STDIN

MIToS.Utils.Scripts.run_single_script — Method

Opens and closes the output stream, tries to run the defined script.

MIToS.Utils.Scripts.runscript — Method

If FILE is not used, calls runsinglescript with STDIN. Otherwise calls runsinglescript with the FILE. If list is true, for loop or pmap is used over eachline of the input.

MIToS.Utils.Scripts.set_parallel — Method

Adds the needed number of workers.

MIToS.SIFTS — Module

The SIFTS module of MIToS allows to obtain the residue-level mapping between databases stored in the SIFTS XML files. It makes easy to assign PDB residues to UniProt/Pfam positions. Given the fact that pairwise alignments can lead to misleading association between residues in both sequences, SIFTS offers more reliable association between sequence and structure residue numbers.

Features

Download and parse SIFTS XML files
Store residue-level mapping in Julia
Easy generation of OrderedDicts between residues numbers

using MIToS.SIFTS

MIToS.SIFTS.SIFTSResidue — Type

A SIFTSResidue object stores the SIFTS residue level mapping for a residue. It has the following fields that you can access at any moment for query purposes:

- `PDBe` : A `dbPDBe` object, it's present in all the `SIFTSResidue`s.
- `UniProt` : A `dbUniProt` object or `missing`.
- `Pfam` : A `dbPfam` object or `missing`.
- `NCBI` : A `dbNCBI` object or `missing`.
- `InterPro` : An array of `dbInterPro` objects.
- `PDB` : A `dbPDB` object or `missing`.
- `SCOP` : A `dbSCOP` object or `missing`.
- `SCOP2` : An array of `dbSCOP2` objects.
- `SCOP2B` : A `dbSCOP2B` object or `missing`.
- `CATH` : A `dbCATH` object or `missing`.
- `Ensembl` : An array of `dbEnsembl` objects.
- `missing` : It's `true` if the residue is missing, i.e. not observed, in the structure.
- `sscode` : A string with the secondary structure code of the residue.
- `ssname` : A string with the secondary structure name of the residue.

MIToS.SIFTS.dbCATH — Type

dbCATH stores the residue id, number, name and chain in CATH as strings.

MIToS.SIFTS.dbEnsembl — Type

dbEnsembl stores the residue (gene) accession id, the transcript, translation and exon ids in Ensembl as strings, together with the residue number and name using the UniProt coordinates.

MIToS.SIFTS.dbInterPro — Type

dbInterPro stores the residue id, number, name and evidence in InterPro as strings.

MIToS.SIFTS.dbNCBI — Type

dbNCBI stores the residue id, number and name in NCBI as strings.

MIToS.SIFTS.dbPDB — Type

dbPDB stores the residue id, number, name and chain in PDB as strings.

MIToS.SIFTS.dbPDBe — Type

dbPDBe stores the residue number and name in PDBe as strings.

MIToS.SIFTS.dbPfam — Type

dbPfam stores the residue id, number and name in Pfam as strings.

MIToS.SIFTS.dbSCOP — Type

dbSCOP stores the residue id, number, name and chain in SCOP as strings.

MIToS.SIFTS.dbSCOP2 — Type

dbSCOP2 stores the residue id, number, name and chain in SCOP2 as strings.

MIToS.SIFTS.dbSCOP2B — Type

dbSCOP2B stores the residue id, number, name and chain in SCOP2B as strings. SCOP2B is expansion of SCOP2 domain annotations at superfamily level to every PDB with same UniProt accession having at least 80% SCOP2 domain coverage.

MIToS.SIFTS.dbUniProt — Type

dbUniProt stores the residue id, number and name in UniProt as strings.

Base.parse — Method

parse(document::LightXML.XMLDocument, ::Type{SIFTSXML}; chain=All, missings::Bool=true)

Returns a Vector{SIFTSResidue} parsed from a SIFTSXML file. By default, parses all the chains and includes missing residues.

MIToS.SIFTS._get_attribute — Method

Returns "" if the attributte is missing

MIToS.SIFTS._get_entities — Method

Gets the entities of a SIFTS XML. In some cases, each entity is a PDB chain. WARNING: Sometimes there are more chains than entities!

<entry dbSource="PDBe" ...
  ...
  <entity type="protein" entityId="A">
    ...
  </entity>
  <entity type="protein" entityId="B">
    ...
  </entity>
<ntry>

MIToS.SIFTS._get_nullable_attribute — Method

Returns missing if the attributte is missing

MIToS.SIFTS._get_residues — Method

Returns an Iterator of the residues on the listResidue

<listResidue>
  <residue>
  ...
  </residue>
  ...
</listResidue>

MIToS.SIFTS._get_segments — Method

Gets an array of the segments, the continuous region of an entity. Chimeras and expression tags generates more than one segment for example.

MIToS.SIFTS._is_missing — Method

Returns true if the residue was annotated as Not_Observed.

MIToS.SIFTS.downloadsifts — Method

downloadsifts(pdbcode::String; filename::String, source::String="https")

Download the gzipped SIFTS XML file for the provided pdbcode. The downloaded file will have the default extension .xml.gz. While you can change the filename, it must include the .xml.gz ending. The source keyword argument is set to "https" by default. Alternatively, you can choose "ftp" as the source, which will retrieve the file from the EBI FTP server at ftp://ftp.ebi.ac.uk/pub/databases/msd/sifts/. However, please note that using "https" is highly recommended. This option will download the file from the EBI PDBe server at https://www.ebi.ac.uk/pdbe/files/sifts/.

MIToS.SIFTS.siftsmapping — Method

Parses a SIFTS XML file and returns a OrderedDict between residue numbers of two DataBases with the given identifiers. A chain could be specified (All by default). If missings is true (default) all the residues are used, even if they haven’t coordinates in the PDB file.

MIToS.Utils — Module

The Utils has common utils functions and types used in other modules.

using MIToS.Utils

MIToS.Utils.THREE2ONE — Constant

THREE2ONE is a dictionary that maps three-letter amino acid residue codes (String) to their corresponding one-letter codes (Char). The dictionary is generated by parsing components.cif file from the Protein Data Bank.

julia> using MIToS.Utils

julia> one_letter_code = THREE2ONE["ALA"]
'A': ASCII/Unicode U+0041 (category Lu: Letter, uppercase)

MIToS.Utils.All — Type

All is used instead of MIToS 1.0 "all" or "*", because it's possible to dispatch on it.

MIToS.Utils.FileFormat — Type

FileFormat is used for write special parse (and read) methods on it.

Base.read — Method

read(pathname, FileFormat [, Type [, … ] ] ) -> Type

This function opens a file in the pathname and calls parse(io, ...) for the given FileFormat and Type on it. If the pathname is an HTTP or FTP URL, the file is downloaded with download in a temporal file. Gzipped files should end on .gz.

Base.write — Method

write{T<:FileFormat}(filename::AbstractString, object, format::Type{T}, mode::ASCIIString="w")

This function opens a file with filename and mode (default: "w") and writes (print) the object with the given format. Gzipped files should end on .gz.

MIToS.Utils._check_gzip_file — Method

This function raises an error if a GZip file doesn't have the 0x1f8b magic number.

MIToS.Utils._modify_kargs_for_proxy — Method

Helper function that modifies keyword argument to include a proxy, the proxy URL is taken from the HTTPSPROXY and HTTPSPROXY enviromental variables.

MIToS.Utils.check_file — Method

Returns the filename. Throws an ErrorException if the file doesn't exist, or a warning if the file is empty.

MIToS.Utils.check_pdbcode — Method

It checks if a PDB code has the correct format.

MIToS.Utils.download_file — Method

download_file uses HTTP.jl to download files from the web. It takes the file url as first argument and, optionally, a path to save it. Keyword arguments are are directly passed to to HTTP.download (HTTP.request). Use the headers keyword argument to pass a Dict{String,String} with the header information. Set the HTTPS_PROXY and HTTPS_PROXY ENViromental variables if you are behind a proxy.

julia> using MIToS.Utils

julia> download_file("http://www.uniprot.org/uniprot/P69905.fasta","seq.fasta",
       headers = Dict("User-Agent" =>
                      "Mozilla/5.0 (compatible; MSIE 7.01; Windows NT 5.0)"))
"seq.fasta"

MIToS.Utils.get_n_words — Method

get_n_words{T <: Union{ASCIIString, UTF8String}}(line::T, n::Int) It returns a Vector{T} with the first n (possibles) words/fields (delimited by space or tab). If there is more than n words, the last word returned contains the finals words and the delimiters. The length of the returned vector is n or less (if the number of words is less than n). This is used for parsing the Stockholm format.

julia> using MIToS.Utils

julia> get_n_words("#=GR O31698/18-71 SS    CCCHHHHHHHHHHHHHHHEEEEEEEEEEEEEEEEHHH", 3)
3-element Vector{String}:
 "#=GR"
 "O31698/18-71"
 "SS    CCCHHHHHHHHHHHHHHHEEEEEEEEEEEEEEEEHHH"

MIToS.Utils.getarray — Method

Getter for the array field of NamedArrays

MIToS.Utils.hascoordinates — Method

hascoordinates(id) It returns true if id/sequence name has the format: UniProt/start-end (i.e. O83071/192-246)

MIToS.Utils.isnotemptyfile — Method

Returns true if the file exists and isn't empty.

MIToS.Utils.lineiterator — Method

Create an iterable object that will yield each line from a stream or string.

MIToS.Utils.list2matrix — Method

Returns a square symmetric matrix from the vector vec. side is the number of rows/columns. The diagonal is not included by default, set to true if there are diagonal elements in the list.

MIToS.Utils.matrix2list — Method

Returns a vector with the part ("upper" or "lower") of the square matrix mat. The diagonal is not included by default.

MIToS.Utils.select_element — Method

Selects the first element of the vector. This is useful for unpacking one element vectors. Throws a warning if there are more elements. element_name is element by default, but the name can be changed using the second argument.

MIToS.MSA — Module

The MSA module of MIToS has utilities for working with Multiple Sequence Alignments of protein Sequences (MSA).

Features

Read and write MSAs in Stockholm, FASTA or Raw format
Handle MSA annotations
Edit the MSA, e.g. delete columns or sequences, change sequence order, shuffling...
Keep track of positions and annotations after modifications on the MSA
Describe a MSA, e.g. mean percent identity, sequence coverage, gap percentage...

using MIToS.MSA

MIToS.MSA.GAP — Constant

GAP is the Residue representation on MIToS for gaps ('-', insertions and deletions). Lowercase residue characters, dots and '*' are encoded as GAP in conversion from Strings and Chars. This Residue constant is encoded as Residue(21).

MIToS.MSA.XAA — Constant

XAA is the Residue representation for unknown, ambiguous and non standard residues. This Residue constant is encoded as Residue(22).

MIToS.MSA._max_char — Constant

'z' is the maximum between 'A':'Z', 'a':'z', '.', '-' and '*'. 'z' is 'GAP' but the next character to 'z' is '{', i.e. XAA.

MIToS.MSA.AbstractAlignedObject — Type

MIToS MSA and aligned sequences (aligned objects) are subtypes of AbstractMatrix{Residue}, because MSAs and sequences are stored as Matrix of Residues.

MIToS.MSA.AbstractAlignedSequence — Type

A MIToS aligned sequence is an AbstractMatrix{Residue} with only 1 row/sequence.

MIToS.MSA.AbstractMultipleSequenceAlignment — Type

MSAs are stored as Matrix{Residue}. It's possible to use a NamedResidueMatrix{Array{Residue,2}} as the most simple MSA with sequence identifiers and column names.

MIToS.MSA.AlignedSequence — Type

An AlignedSequence wraps a NamedResidueMatrix{Array{Residue,2}} with only 1 row/sequence. The NamedArray stores the sequence name and original column numbers as Strings.

MIToS.MSA.AnnotatedAlignedSequence — Type

This type represent an aligned sequence, similar to AlignedSequence, but It also stores its Annotations.

MIToS.MSA.AnnotatedMultipleSequenceAlignment — Type

This type represent an MSA, similar to MultipleSequenceAlignment, but It also stores Annotations. This annotations are used to store residue coordinates (i.e. mapping to UniProt residue numbers).

MIToS.MSA.Annotations — Type

The Annotations type is basically a container for Dicts with the annotations of a multiple sequence alignment. Annotations was designed for storage of annotations of the Stockholm format.

MIToS also uses MSA annotations to keep track of:

Modifications of the MSA (MIToS_...) as deletion of sequences or columns.
Positions numbers in the original MSA file (column mapping: ColMap)
Position of the residues in the sequence (sequence mapping: SeqMap)

MIToS.MSA.Clusters — Type

Data structure to represent sequence clusters. The sequence data itself is not included.

MIToS.MSA.GappedAlphabet — Type

This type defines the usual alphabet of the 20 natural residues and a gap character.

julia> using MIToS.MSA

julia> GappedAlphabet()
GappedAlphabet of length 21. Residues : res"ARNDCQEGHILKMFPSTWYV-"

MIToS.MSA.MultipleSequenceAlignment — Type

This MSA type include a NamedArray wrapping a Matrix of Residues. The use of NamedArray allows to store sequence names and original column numbers as Strings, and fast indexing using them.

MIToS.MSA.NoClustering — Type

Use NoClustering() to avoid the use of clustering where a Clusters type is needed.

MIToS.MSA.ReducedAlphabet — Type

ReducedAlphabet allows the construction of reduced residue alphabets, where residues inside parenthesis belong to the same group.

julia> using MIToS.MSA

julia> ab = ReducedAlphabet("(AILMV)(RHK)(NQST)(DE)(FWY)CGP")
ReducedAlphabet of length 8 : "(AILMV)(RHK)(NQST)(DE)(FWY)CGP"

julia> ab[Residue('K')]
2

MIToS.MSA.Residue — Type

Most of the MIToS design is created around the Residue bitstype. It has representations for the 20 natural amino acids, a value representing insertions and deletions (GAP, '-') and one representing unknown, ambiguous and non standard residues (XAA, 'X'). Each Residue is encoded as an integer number, with the same bit representation and size than a Int. This allows fast indexing operation of probability or frequency matrices.

Residue creation and conversion

Creation and conversion of Residues should be treated carefully. Residue is encoded as a 32 or 64 bits type similar to Int, to get fast indexing using Int(x::Residue). Int simply calls reinterpret without checking if the residue is valid. Valid residues have integer values in the closed interval [1,22]. convert from Int and Char always returns valid residues, however it's possible to find invalid residues (they are shown using the character '�') after the creation of uninitialized Residue arrays (i.e. using Array). You can use zeros, ones or rand to get initialized Residue arrays with valid residues. Conversions to and from Chars changes the bit representation and allows the use of the usual character representation of residues and amino acids. This conversions are used in IO operations and always return valid residues. In conversions from Char, lowercase letters, '*', '-' and '.' are translated to GAP, letters representing the 20 natural amino (ARNDCQEGHILKMFPSTWYV) acids are translated to their corresponding Residue and any other character is translated to XAA. Since lowercase letters and dots are translated to gaps, Pfam MSA insert columns are converted to columns full of gaps.

julia> using MIToS.MSA

julia> alanine = Residue('A')
A

julia> Char(alanine)
'A': ASCII/Unicode U+0041 (category Lu: Letter, uppercase)

julia> for residue in res"ARNDCQEGHILKMFPSTWYV-X"
           println(residue, " ", Int(residue))
       end
A 1
R 2
N 3
D 4
C 5
Q 6
E 7
G 8
H 9
I 10
L 11
K 12
M 13
F 14
P 15
S 16
T 17
W 18
Y 19
V 20
- 21
X 22

MIToS.MSA.ResidueAlphabet — Type

Abstract type to define residue alphabet types.

MIToS.MSA.UngappedAlphabet — Type

This type defines the usual alphabet of the 20 natural residues, without the gap character.

julia> using MIToS.MSA

julia> UngappedAlphabet()
UngappedAlphabet of length 20. Residues : res"ARNDCQEGHILKMFPSTWYV"

Base.isvalid — Method

isvalid(res::Residue)

It returns true if the encoded integer is in the closed interval [1,22].

Base.join — Method

join(msa_a::AnnotatedMultipleSequenceAlignment, 
     msa_b::AnnotatedMultipleSequenceAlignment, 
     pairing; 
     kind::Symbol=:outer, 
     axis::Int=1)::AnnotatedMultipleSequenceAlignment

join(msa_a::AnnotatedMultipleSequenceAlignment, 
     msa_b::AnnotatedMultipleSequenceAlignment, 
     positions_a, 
     positions_b; 
     kind::Symbol=:outer, 
     axis::Int=1)::AnnotatedMultipleSequenceAlignment

Join two Multiple Sequence Alignments (MSAs), msa_a and msa_b, based on specified matching positions or names. The function supports two formats: one takes a pairing argument as a list of correspondences, and the other takes positions_a and positions_b as separate lists indicating matching positions or names in each MSA. This function allows for various types of join operations (:inner, :outer, :left, :right) and can merge MSAs by sequences (axis 1) or by columns (axis 2).

Parameters:

msa_a::AnnotatedMultipleSequenceAlignment: The first MSA.
msa_b::AnnotatedMultipleSequenceAlignment: The second MSA.
pairing: An iterable where each element is a pair of sequence or column positions (Ints) or names (Strings) to match between msa_a and msa_b. For example, it can be a list of two-element tuples or pairs, or and OrderedDict.
positions_a, positions_b: Separate lists of positions or names in msa_a and msa_b, respectively.
kind::Symbol: Type of join operation. Default is :outer.
axis::Int: The axis along which to join (1 to match sequences, 2 to match columns).

Returns:

AnnotatedMultipleSequenceAlignment: A new MSA resulting from the join operation.

Behavior and Sequence Ordering:

The order of sequences or columns in the resulting MSA depends on the kind of join operation and the order of elements in the pairing or positions_a and positions_b lists.

For :inner joins, the function returns an MSA containing only those sequences/columns that are paired in both msa_a and msa_b. The order of elements in the output MSA follows the order in the pairing or position lists.
For :outer joins, the output MSA includes all sequences/columns from both msa_a and msa_b. Unpaired sequences/columns are filled with gaps as needed. The sequences/columns from msa_a are placed first. If the pairing or position lists are sorted, the output MSA columns and sequences will keep the same order as in the inputs. That's nice for situations such as profile alignments where the order of columns is important. If the pairing or position lists are not sorted, then the order of sequences/columns in the output MSA is not guaranteed to be the same as in the inputs. In particular, the matched sequences or columns will be placed first, followed by the unmatched ones.
For :left joins, all sequences/columns from msa_a are included in the output MSA keeping the same order as in msa_a. Sequences/columns from msa_b are added where matches are found, with gaps filling the unmatched positions.
For :right joins, the output MSA behaves like :left joins but with roles of msa_a and msa_b reversed.

Warning: When using Dict for pairing, the order of elements might not be preserved as expected. Dict in Julia does not maintain the order of its elements, which might lead to unpredictable order of sequences/columns in the output MSA. To preserve order, it is recommended to use an OrderedDict or a list of Pairs objects.

Base.merge! — Method

merge!(target::Annotations, sources::Annotations...)

Merge one or more source Annotations into a target Annotations. This function calls Base.merge! on each of the four dictionaries in the Annotations type: file, sequences, columns, and residues. Consequently, it behaves like Base.merge! for dictionaries; if the same key exists in different Annotations objects, the value from the last one is used.

NOTE: This function does not check for consistency among the various Annotations. For example, it does not verify that SeqMap annotations have consistent lengths.

Base.merge — Method

merge(target::Annotations, sources::Annotations...)

Create a new Annotations object by merging two or more Annotations. If the same annotation exists in different Annotations objects, the value from the last one is used. See also merge!.

NOTE: This function does not check for consistency among the various Annotations. For example, it does not verify that SeqMap annotations have consistent lengths.

Base.names — Method

It returns the name of each group. The name is a string with the one letter code of each residue that belong to the group.

julia> using MIToS.MSA

julia> ab = ReducedAlphabet("(AILMV)(RHK)(NQST)(DE)(FWY)CGP")
ReducedAlphabet of length 8 : "(AILMV)(RHK)(NQST)(DE)(FWY)CGP"

julia> names(ab)
8-element Vector{String}:
 "AILMV"
 "RHK"
 "NQST"
 "DE"
 "FWY"
 "C"
 "G"
 "P"

Base.parse — Function

parse(io, format[, output; generatemapping, useidcoordinates, deletefullgaps])

The keyword argument generatemapping (false by default) indicates if the mapping of the sequences ("SeqMap") and columns ("ColMap") and the number of columns in the original MSA ("NCol") should be generated and saved in the annotations. If useidcoordinates is true (default: false) the sequence IDs of the form "ID/start-end" are parsed and used for determining the start and end positions when the mappings are generated. deletefullgaps (true by default) indicates if columns 100% gaps (generally inserts from a HMM) must be removed from the MSA.

Base.rand — Method

It chooses from the 20 natural residues (it doesn't generate gaps).

julia> using MIToS.MSA

julia> using Random

julia> Random.seed!(1); # Reseed the random number generator.

julia> rand(Residue)
R

julia> rand(Residue, 4, 4)
4×4 Matrix{Residue}:
 E  D  D  A
 F  S  K  K
 M  S  I  M
 Y  F  E  D

Clustering.assignments — Method

Get a vector of assignments, where the i value is the index/number of the cluster to which the i-th sequence is assigned.

Clustering.nclusters — Method

Get the number of clusters in a Clusters object.

MIToS.MSA._add_gaps_in_b — Function

_add_gaps_in_b(msa_a, msa_b, positions_a, positions_b, axis::Int=1)

This is an helper function for the :left and :right joins. It aligns msa_b with msa_a by adding gaps to msa_b. Only the sequences or columns in msa_b that match the positions in msa_a are kept. Gaps are inserted in msa_b at positions where msa_a has sequences or columns that are not matched in msa_b. Therefore, we can think of msa_a as the reference MSA and msa_b as the MSA to be aligned with the reference.

The positions_a and positions_b arguments define corresponding positions in msa_a and msa_b, respectively, for accurate gap insertion. The function returns a modified msa_b with gaps inserted to align with msa_a.

The axis argument (1 for sequences, 2 for columns) determines whether gaps are inserted as sequences or columns.

Example

aligned_msa_b = _add_gaps_in_b(msa_a, msa_b, [1, 2, 3], [2, 3, 4], 2) # axis = 2

MIToS.MSA._clean_sequence_mapping! — Method

At this moment, the concatenateannotsequence function does no check for SeqMap annotations. Therefore, if an MSA does not have a SeqMap annotation, the concatenated SeqMap annotation is corrupted. To avoid this, we will delete the SeqMap annotation whose length is not equal to the length of the concatenated MSA.

MIToS.MSA._fill_hobohmI! — Method

Fill cluster and clustersize matrices. These matrices are assumed to be empty (only zeroes) and their length is assumed to be equal to the number of sequences in the alignment (aln). threshold is the minimum identity value between two sequences to be in the same cluster.

MIToS.MSA._filter_mapping — Method

For filter column and sequence mapping of the format: ",,,,10,11,,12"

MIToS.MSA._find_gaps — Method

_find_gaps(positions, n)

Calculate gaps in a sorted sequence of positions given also a maximum value n that would be added at the end as n+1 if n it is not already present.

This function returns a list of tuples, where each tuple represents a gap in the sequence. The first element of the tuple indicates the position after which the gap will be inserted, and the second element is the position that comes just after the gap. The function accounts for gaps at both the beginning and end of the sequence. A gap is identified as a difference greater than 1 between consecutive positions. To account for gaps at the end, the end position of a gap matching the last position of the sequence is set to n+1.

MIToS.MSA._get_selected_sequences — Method

Helper function to create a boolean mask to select sequences.

MIToS.MSA._get_seqid_index — Method

This function takes a vector of sequence names and a sequence id. It returns the position of that id in the vector. If the id isn't in the vector, It throws an error.

MIToS.MSA._get_seqname_mapping_hcat — Method

It returns a Dict mapping the MSA number and sequence name to the horizontally concatenated sequence name.

MIToS.MSA._get_seqname_mapping_vcat — Method

It returns a Dict mapping the MSA number and sequence name to the vertically concatenated sequence name.

MIToS.MSA._get_sequence_weight — Method

Calculates the weight of each sequence in a cluster. The weight is equal to one divided by the number of sequences in the cluster.

MIToS.MSA._insert_sorted_gaps — Method

_insert_sorted_gaps(msa_target, msa_reference, positions_target, positions_reference, 
	block_position::Symbol=:before, axis::Int=1)

Inserts gap blocks into msa_target to match the alignment pattern of msa_reference. This function is utilized for aligning two MSAs based on specific alignment positions. The positions_target and positions_reference parameters dictate the corresponding positions in both MSAs for accurate gap insertion.

The function returns the modified msa_target with inserted gaps, aligned according to msa_reference.

The block_position keyword argument determines whether the gap blocks are inserted :before (the default) or :after the unmatched sequences or columns. The axis keyword argument determines whether the gap blocks are inserted as sequences (1, the default) or columns (2).

Example

gapped_msa_a = _insert_sorted_gaps(msa_a, msa_b, positions_a, positions_b)

MIToS.MSA._keepinserts! — Method

Function to keep insert columns in parse. It uses the first sequence to generate the "Aligned" annotation, and after that, convert all the characters to uppercase.

MIToS.MSA._percentidentity — Method

seq1 and seq2 should have the same len

MIToS.MSA._rename_sequences — Method

_rename_sequences(annotations::Annotations, old2new::Dict{String,String})

Renames sequences in a given Annotations object based on a mapping provided in old2new.

This function iterates through residue-level and sequence-level annotations in the provided Annotations object. For each annotation, it checks if the sequence name exists in the old2new dictionary. If so, the sequence name is updated to the new name from old2new; otherwise, the original sequence name is retained.

The function then returns a new Annotations object with updated sequence names.

MIToS.MSA._renumber_sequence_gaps — Method

_renumber_sequence_gaps(msa::AnnotatedMultipleSequenceAlignment)

Renumbers the gap sequences in a given multiple sequence alignment (MSA) object and returns a new MSA object with updated sequence names.

This function specifically targets sequences within the MSA that are named with the prefix "gap:". It assigns a new sequential number to each gap sequence, ensuring a unique and ordered naming convention (e.g., "gap:1", "gap:2", etc.).

Therefore, this function is useful for renumbering gap sequences in an MSA where gap blocks have been inserted from the end to the beginning. This insertion order can lead to non-sequential numbering of the gap sequences. For example, in the case of two gap blocks, where one block contains a single sequence and the other contains two, the sequences might originally be numbered as "gap:3", "gap:1", "gap:2". This function renumbers them sequentially so that they are numbered as "gap:1", "gap:2", "gap:3".

MIToS.MSA._str2int_mapping — Method

Converts a string of mappings into a vector of Ints

julia> using MIToS.MSA

julia> MSA._str2int_mapping(",,2,,4,5")
6-element Vector{Int64}:
 0
 0
 2
 0
 4
 5

MIToS.MSA._swap! — Method

It swaps the names on the positions i and j of a Vector{String}

MIToS.MSA._v_concatenate_annotcolumn — Method

Column annotations are disambiguated by adding a prefix to the annotation name as we do for the sequence names.

MIToS.MSA._v_concatenate_annotresidue — Method

Residue annotations are disambiguated by adding a prefix to the sequence name holding the annotation as we do for the sequence names.

MIToS.MSA._v_concatenated_seq_names — Method

If returns a vector of sequence names for the vertically concatenated MSA. The prefix is the number associated to the source MSA. If the sequence name has already a number as prefix, the MSA number is increased accordingly.

MIToS.MSA._valid_residue_integer — Method

It takes an Int and returns the Int value used to represent a valid Residue. Invalid residues are encoded as the integer 23.

MIToS.MSA.adjustreference — Function

Creates a new matrix of residues. This function deletes positions/columns of the MSA with gaps in the reference (first) sequence.

MIToS.MSA.adjustreference! — Function

It removes positions/columns of the MSA with gaps in the reference (first) sequence.

MIToS.MSA.annotate_modification! — Method

Annotates on file annotations the modifications realized by MIToS on the MSA. It always returns true, so It can be used in a boolean context.

MIToS.MSA.annotations — Method

The annotations function returns the Annotations of an annotated MSA or aligned sequence. If the object is not annotated, it returns an empty Annotations object.

MIToS.MSA.column_index — Method

column_index(msa, col_name)

Return the index (integer position) of the column with name col_name in the MSA msa. A KeyError is thrown if the column name does not exist. If col_name is an integer, the same integer is returned without checking if it is a valid index.

MIToS.MSA.columngapfraction — Method

Fraction of gaps per column/position on the MSA

MIToS.MSA.columnname_iterator — Method

columnname_iterator(msa)

It returns an iterator that returns the column names of the msa. If the msa is a Matrix{Residue} this function returns the actual column numbers as strings. Otherwise it returns the column number of the original MSA through the wrapped NamedArray column names.

MIToS.MSA.columnnames — Method

columnnames(msa)

It returns a Vector{String} with the sequence names/identifiers. If the msa is a Matrix{Residue} this function returns the actual column numbers as strings. Otherwise it returns the column number of the original MSA through the wrapped NamedArray column names.

MIToS.MSA.columnpairsmatrix — Method

Initialize an empty PairwiseListMatrix for a pairwise measure in sequence pairs. It uses the sequence names if they are available, otherwise it uses the actual sequence numbers. You can use the positional argument to indicate the number Type (default: Float64), if the PairwiseListMatrix should store the diagonal values on the list (default: false) and a default value for the diagonal (default: NaN).

MIToS.MSA.coverage — Method

Coverage of the sequences with respect of the number of positions on the MSA

MIToS.MSA.delete_annotated_modifications! — Method

Deletes all the MIToS annotated modifications

MIToS.MSA.deletefullgapcolumns! — Function

Deletes columns with 100% gaps, this columns are generated by inserts.

MIToS.MSA.filtercolumns! — Function

filtercolumns!(msa, mask[, annotate::Bool=true])

It allows to filter MSA or aligned sequence columns/positions using a AbstractVector{Bool} mask. Annotations are updated if annotate is true (default).

MIToS.MSA.filtercolumns! — Method

filtercolumns!(data::Annotations, mask)

It is useful for deleting column annotations (creating a subset in place).

MIToS.MSA.filtercolumns — Method

It's similar to filtercolumns! but for an AbstractMatrix{Residue}

MIToS.MSA.filtersequences! — Function

filtersequences!(msa, mask[, annotate::Bool=true])

It allows to filter msa sequences using a AbstractVector{Bool} mask (It removes sequences with false values). AnnotatedMultipleSequenceAlignment annotations are updated if annotate is true (default).

MIToS.MSA.filtersequences! — Method

filtersequences!(data::Annotations, ids::Vector{String}, mask::AbstractArray{Bool,1})

It is useful for deleting sequence annotations. ids should be a list of the sequence names and mask should be a logical vector.

MIToS.MSA.filtersequences — Method

It's similar to filtersequences! but for an AbstractMatrix{Residue}

MIToS.MSA.gapfraction — Method

It calculates the fraction of gaps on the Array (alignment, sequence, column, etc.). This function can take an extra dimension argument for calculation of the gap fraction over the given dimension.

MIToS.MSA.gapstrip! — Function

This functions deletes/filters sequences and columns/positions on the MSA on the following order:

Removes all the columns/position on the MSA with gaps on the reference (first) sequence.
Removes all the sequences with a coverage with respect to the number of

columns/positions on the MSA less than a coveragelimit (default to 0.75: sequences with 25% of gaps).

Removes all the columns/position on the MSA with more than a gaplimit

(default to 0.5: 50% of gaps).

MIToS.MSA.gapstrip — Method

Creates a new matrix of Residues (MSA) with deleted sequences and columns/positions. The MSA is edited in the following way:

Removes all the columns/position on the MSA with gaps on the reference (first) sequence
Removes all the sequences with a coverage with respect to the number of

columns/positions on the MSA less than a coveragelimit (default to 0.75: sequences with 25% of gaps)

Removes all the columns/position on the MSA with more than a gaplimit

(default to 0.5: 50% of gaps)

MIToS.MSA.getannotcolumn — Function

getannotcolumn(ann[, feature[,default]])

It returns per column annotation for feature

MIToS.MSA.getannotfile — Function

getannotfile(ann[, feature[,default]])

It returns per file annotation for feature

MIToS.MSA.getannotresidue — Function

getannotresidue(ann[, seqname, feature[,default]])

It returns per residue annotation for (seqname, feature)

MIToS.MSA.getannotsequence — Function

getannotsequence(ann[, seqname, feature[,default]])

It returns per sequence annotation for (seqname, feature)

MIToS.MSA.getcolumnmapping — Method

It returns a Vector{Int} with the original column number of each column on the actual MSA. The mapping is annotated in the ColMap file annotation of an AnnotatedMultipleSequenceAlignment or in the column names of an NamedArray or MultipleSequenceAlignment.

NOTE: When the MSA results from vertically concatenating MSAs using vcat, the column map annotations from the constituent MSAs (such as 1_ColMap, 2_ColMap, etc.) are not returned. Instead, the column numbers referenced in the column names are provided. To access the original annotations, utilize the getannotfile function.

MIToS.MSA.gethcatmapping — Method

It returns a vector of numbers from 1 to N for each column that indicates the source MSA. The mapping is annotated in the "HCat" file annotation of an AnnotatedMultipleSequenceAlignment or in the column names of an NamedArray or MultipleSequenceAlignment.

NOTE: When the MSA results from vertically concatenating MSAs using vcat, the "HCat" annotations from the constituent MSAs are renamed as "1_HCat", "2_HCat", etc. In that case, the MSA numbers referenced in the column names are provided. To access the original annotations, utilize the getannotfile function.

MIToS.MSA.getnamedict — Method

It takes a ResidueAlphabet and returns a dictionary from group name to group position.

julia> using MIToS.MSA

julia> ab = ReducedAlphabet("(AILMV)(RHK)(NQST)(DE)(FWY)CGP")
ReducedAlphabet of length 8 : "(AILMV)(RHK)(NQST)(DE)(FWY)CGP"

julia> getnamedict(ab)
OrderedCollections.OrderedDict{String, Int64} with 8 entries:
  "AILMV" => 1
  "RHK"   => 2
  "NQST"  => 3
  "DE"    => 4
  "FWY"   => 5
  "C"     => 6
  "G"     => 7
  "P"     => 8

MIToS.MSA.getresidues — Method

getresidues allows you to access the residues stored inside an MSA or aligned sequence as a Matrix{Residue} without annotations nor column/row names.

MIToS.MSA.getresiduesequences — Method

getresiduesequences returns a Vector{Vector{Residue}} with all the MSA sequences without annotations nor column/sequence names.

MIToS.MSA.getsequence — Function

getsequence takes an MSA and a sequence number or identifier and returns an aligned sequence object. If the MSA is an AnnotatedMultipleSequenceAlignment, it returns an AnnotatedAlignedSequence with the sequence annotations. From a MultipleSequenceAlignment, It returns an AlignedSequence object. If an Annotations object and a sequence identifier are used, this function returns the annotations related to the sequence.

MIToS.MSA.getsequencemapping — Method

It returns the sequence coordinates as a Vector{Int} for an MSA sequence. That vector has one element for each MSA column. If the number if 0 in the mapping, there is a gap in that column for that sequence.

MIToS.MSA.getweight — Method

getweight(c[, i::Int])

This function returns the weight of the sequence number i. getweight should be defined for any type used for count!/count in order to use his weigths. If i isn't used, this function returns a vector with the weight of each sequence.

MIToS.MSA.hobohmI — Method

Sequence clustering using the Hobohm I method from Hobohm et. al. 1992.

Hobohm, Uwe, et al. "Selection of representative protein data sets." Protein Science 1.3 (1992): 409-417.

MIToS.MSA.meanpercentidentity — Function

Returns the mean of the percent identity between the sequences of a MSA. If the MSA has 300 sequences or less, the mean is exact. If the MSA has more sequences and the exact keyword is false (defualt), 44850 random pairs of sequences are used for the estimation. The number of samples can be changed using the second argument. Use exact=true to perform all the pairwise comparison (the calculation could be slow).

MIToS.MSA.namedmatrix — Method

The namedmatrix function returns the NamedResidueMatrix{Array{Residue,2}} stored in an MSA or aligned sequence.

MIToS.MSA.ncolumns — Method

ncolumns returns the number of MSA columns or positions.

MIToS.MSA.ncolumns — Method

ncolumns(ann::Annotations) returns the number of columns/residues with annotations. This function returns -1 if there is not annotations per column/residue.

MIToS.MSA.nsequences — Method

nsequences returns the number of sequences on the MSA.

MIToS.MSA.percentidentity — Method

percentidentity(seq1, seq2, threshold)

Computes quickly if two aligned sequences have a identity value greater than a given threshold value. Returns a boolean value. Positions with gaps in both sequences doesn't count to the length of the sequences. Positions with a XAA in at least one sequence aren't counted.

MIToS.MSA.percentidentity — Method

percentidentity(seq1, seq2)

Calculates the fraction of identities between two aligned sequences. The identity value is calculated as the number of identical characters in the i-th position of both sequences divided by the length of both sequences. Positions with gaps in both sequences doesn't count to the length of the sequences. Positions with a XAA in at least one sequence aren't counted.

MIToS.MSA.percentidentity — Method

percentidentity(msa[, out::Type=Float64])

Calculates the identity between all the sequences on a MSA. You can indicate the output element type with the last optional parameter (Float64 by default). For a MSA with a lot of sequences, you can use Float32 or Flot16 in order to avoid the OutOfMemoryError().

MIToS.MSA.percentsimilarity — Function

Calculates the similarity percent between two aligned sequences. The 100% is the length of the aligned sequences minus the number of columns with gaps in both sequences and the number of columns with at least one residue outside the alphabet. So, columns with residues outside the alphabet (other than the specially treated GAP) aren't counted to the protein length. Two residues are considered similar if they below to the same group in a ReducedAlphabet. The alphabet (third positional argument) by default is:

ReducedAlphabet("(AILMV)(NQST)(RHK)(DE)(FWY)CGP")

The first group is composed of the non polar residues (AILMV), the second group is composed of polar residues, the third group are positive residues, the fourth group are negative residues, the fifth group is composed by the aromatic residues (FWY). C, G and P are considered unique residues.

Other residue groups/alphabets:

SMS (Sequence Manipulation Suite) Ident and Sim:

ReducedAlphabet("(GAVLI)(FYW)(ST)(KRH)(DENQ)P(CM)")

Stothard P (2000) The Sequence Manipulation Suite: JavaScript programs for analyzing and formatting protein and DNA sequences. Biotechniques 28:1102-1104.

Bio3D 2.2 seqidentity:

ReducedAlphabet("(GA)(MVLI)(FYW)(ST)(KRH)(DE)(NQ)PC")

Grant, B.J. et al. (2006) Bioinformatics 22, 2695–2696.

MIToS.MSA.percentsimilarity — Method

Calculates the similarity percent between all the sequences on a MSA. You can indicate the output element type with the out keyword argument (Float64 by default). For an MSA with a lot of sequences, you can use out=Float32 or out=Flot16 in order to avoid the OutOfMemoryError().

MIToS.MSA.printmodifications — Method

Prints MIToS annotated modifications

MIToS.MSA.rename_sequences! — Method

rename_sequences!(msa, newnames)
rename_sequences!(msa, old2new::AbstractDict)
rename_sequences!(msa, old2new::Pair...)

Rename the sequences of an MSA given a vector of new names, a dictionary mapping old names to new names, or one or more pairs going from old to new names. If the msa is an AnnotatedMultipleSequenceAlignment, the annotations are also updated. The function modifies the msa in place and returns it.

MIToS.MSA.rename_sequences — Method

rename_sequences(msa, newnames)
rename_sequences(msa, old2new::AbstractDict)
rename_sequences(msa, old2new::Pair...)

MIToS.MSA.residue2three — Method

This function returns the three letter name of the Residue.

julia> using MIToS.MSA

julia> residue2three(Residue('G'))
"GLY"

MIToS.MSA.residuefraction — Method

It calculates the fraction of residues (no gaps) on the Array (alignment, sequence, column, etc.). This function can take an extra dimension argument for calculation of the residue fraction over the given dimension

MIToS.MSA.sequence_index — Method

sequence_index(msa, seq_name)

Return the index (integer position) of the sequence with name seq_name in the MSA msa. A KeyError is thrown if the sequence name does not exist. If seq_name is an integer, the same integer is returned without checking if it is a valid index.

MIToS.MSA.sequencename_iterator — Method

sequencename_iterator(msa)

It returns an iterator that returns the sequence names/identifiers of the msa.

MIToS.MSA.sequencenames — Method

sequencenames(msa)

It returns a Vector{String} with the sequence names/identifiers.

MIToS.MSA.sequencepairsmatrix — Method

Initialize an empty PairwiseListMatrix for a pairwise measure in column pairs. It uses the column mapping (column number in the input MSA file) if it’s available, otherwise it uses the actual column numbers. You can use the positional argument to indicate the number Type (default: Float64), if the PairwiseListMatrix should store the diagonal values on the list (default: false) and a default value for the diagonal (default: NaN).

MIToS.MSA.setannotcolumn! — Function

setannotcolumn!(ann, feature, annotation)

It stores per column annotation (1 char per column) for feature

MIToS.MSA.setannotfile! — Function

setannotfile!(ann, feature, annotation)

It stores per file annotation for feature

MIToS.MSA.setannotresidue! — Function

setannotresidue!(ann, seqname, feature, annotation)

It stores per residue annotation (1 char per residue) for (seqname, feature)

MIToS.MSA.setannotsequence! — Function

setannotsequence!(ann, seqname, feature, annotation)

It stores per sequence annotation for (seqname, feature)

MIToS.MSA.setreference! — Function

It puts the sequence i (name or position) as reference (first sequence) of the MSA. This function swaps the sequences 1 and i.

MIToS.MSA.stringsequence — Method

stringsequence(seq)
stringsequence(msa, i::Int)
stringsequence(msa, id::String)

It returns the selected sequence as a String.

MIToS.MSA.swapsequences! — Method

It swaps the sequences on the positions i and j of an MSA. Also it's possible to swap sequences using their sequence names/identifiers when the MSA object as names.

MIToS.MSA.three2residue — Method

It takes a three letter residue name and returns the corresponding Residue. If the name isn't in the MIToS dictionary, a XAA is returned.

julia> using MIToS.MSA

julia> three2residue("ALA")
A

Random.shuffle! — Function

It's like Random.shuffle. When a Matrix{Residue} is used, you can indicate if the gaps should remain their positions using the last boolean argument. The previous argument should be the dimension to shuffle, 1 for shuffling residues in a sequence (row) or 2 for shuffling residues in a column.

julia> using MIToS.MSA

julia> using Random

julia> msa = hcat(res"RRE",res"DDK", res"G--")
3×3 Matrix{Residue}:
 R  D  G
 R  D  -
 E  K  -

julia> Random.seed!(42);

julia> shuffle(msa, 1, true)
3×3 Matrix{Residue}:
 G  D  R
 R  D  -
 E  K  -

julia> Random.seed!(42);

julia> shuffle(msa, 1, false)
3×3 Matrix{Residue}:
 G  D  R
 R  -  D
 E  K  -

Random.shuffle — Method

It's like shuffle but in-place. When a Matrix{Residue} or a AbstractAlignedObject (sequence or MSA) is used, you can indicate if the gaps should remain their positions using the last boolean argument.

StatsBase.counts — Method

Get sample counts of clusters as a Vector. Each k value is the number of samples assigned to the k-th cluster.

MIToS.MSA.@res_str — Macro

The MIToS macro @res_str takes a string and returns a Vector of Residues (sequence).

julia> using MIToS.MSA

julia> res"MIToS"
5-element Vector{Residue}:
 M
 I
 T
 -
 S

MIToS.Pfam — Module

The Pfam module, defines functions to measure the protein contact prediction performance of information measure between column pairs from a Pfam MSA.

Features

Read and download Pfam MSAs
Obtain PDB information from alignment annotations
Map between sequence/alignment residues/columns and PDB structures
Measure of AUC (ROC curve) for contact prediction of MI scores

using MIToS.Pfam

MIToS.Pfam.downloadpfam — Method

It downloads a gzipped Stockholm alignment from InterPro for the Pfam family with the given pfamcode.

By default, it downloads the full Pfam alignment. You can use the alignment keyword argument to download the seed or the uniprot alignment instead. For example, downloadpfam("PF00069") will download the full alignment for the PF00069 Pfam family, while downloadpfam("PF00069", alignment="seed") will download the seed alignment of the family.

The extension of the downloaded file is .stockholm.gz by default; you can change it using the filename keyword argument, but the .gz at the end is mandatory.

MIToS.Pfam.getcontactmasks — Method

This function takes a msacontacts or its list of contacts contact_list with 1.0 for true contacts and 0.0 for not contacts (NaN or other numbers for missing values). Returns two BitVectors, the first with trues where contact_list is 1.0 and the second with trues where contact_list is 0.0. There are useful for AUC calculations.

MIToS.Pfam.getseq2pdb — Method

Generates from a Pfam msa a Dict{String, Vector{Tuple{String,String}}}. Keys are sequence IDs and each value is a list of tuples containing PDB code and chain.

julia> getseq2pdb(msa)
Dict{String,Array{Tuple{String,String},1}} with 1 entry:
  "F112_SSV1/3-112" => [("2VQC","A")]

MIToS.Pfam.hasresidues — Method

Returns a BitVector where there is a true for each column with PDB residue.

MIToS.Pfam.msacolumn2pdbresidue — Method

msacolumn2pdbresidue(msa, seqid, pdbid, chain, pfamid, siftsfile; strict=false, checkpdbname=false, missings=true)

This function returns a OrderedDict{Int,String} with MSA column numbers on the input file as keys and PDB residue numbers ("" for missings) as values. The mapping is performed using SIFTS. This function needs correct ColMap and SeqMap annotations. This checks correspondence of the residues between the MSA sequence and SIFTS (It throws a warning if there are differences). Missing residues are included if the keyword argument missings is true (default: true). If the keyword argument strict is true (default: false), throws an Error, instead of a Warning, when residues don't match. If the keyword argument checkpdbname is true (default: false), throws an Error if the three letter name of the PDB residue isn't the MSA residue. If you are working with a downloaded Pfam MSA without modifications, you should read it using generatemapping=true and useidcoordinates=true. If you don't indicate the path to the siftsfile used in the mapping, this function downloads the SIFTS file in the current folder. If you don't indicate the Pfam accession number (pfamid), this function tries to read the AC file annotation.

MIToS.Pfam.msacontacts — Function

This function takes an AnnotatedMultipleSequenceAlignment with correct ColMap annotations and two dicts:

The first is an OrderedDict{String,PDBResidue} from PDB residue number to PDBResidue.
The second is a Dict{Int,String} from MSA column number on the input file to PDB residue number.

msacontacts returns a PairwiseListMatrix{Float64,false} of 0.0 and 1.0 where 1.0 indicates a residue contact. Contacts are defined with an inter residue distance less or equal to distance_limit (default to 6.05) angstroms between any heavy atom. NaN indicates a missing value.

MIToS.Pfam.msaresidues — Method

This function takes an AnnotatedMultipleSequenceAlignment with correct ColMap annotations and two dicts:

The first is an OrderedDict{String,PDBResidue} from PDB residue number to PDBResidue.
The second is a Dict{Int,String} from MSA column number on the input file to PDB residue number.

msaresidues returns an OrderedDict{Int,PDBResidue} from input column number (ColMap) to PDBResidue. Residues on inserts are not included.