MIToS.InformationModule

The Information module of MIToS defines types and functions useful to calculate information measures (e.g. Mutual Information (MI) and Entropy) over a Multiple Sequence Alignment (MSA). This module was designed to count Residues (defined in the MSA module) in special contingency tables (as fast as possible) and to derive probabilities from this counts. Also, includes methods for applying corrections to that tables, e.g. pseudocounts and pseudo frequencies. Finally, Information allows to use this probabilities and counts to estimate information measures and other frequency based values.

Features

  • Estimate multi dimensional frequencies and probabilities tables from sequences, MSAs, etc...
  • Correction for small number of observations
  • Correction for data redundancy on a MSA
  • Estimate information measures
  • Calculate corrected mutual information between residues
using MIToS.Information
MIToS.Information.BLOSUM62_PijConstant

Table with conditional probabilities of residues based on BLOSUM62. The normalization is done row based. The firts row contains the P(aa|A) and so one.

MIToS.Information.AdditiveSmoothingType

Additive Smoothing or fixed pseudocount λ for ResidueCount (in order to estimate probabilities when the number of samples is low).

Common values of λ are:

  • 0 : No cell frequency prior, gives you the maximum likelihood estimator.
  • 0.05 is the optimum value for λ found in Buslje et. al. 2009, similar results was obtained for λ in the range [0.025, 0.075].
  • 1 / p : Perks prior (Perks, 1947) where p the number of parameters (i.e. residues, pairs of residues) to estimate. If p is the number of residues (20 without counting gaps), this gives you 0.05.
  • sqrt(n) / p : Minimax prior (Trybula, 1958) where n is the number of samples and p the number of parameters to estimate. If the number of samples n is 400 (minimum number of sequence clusters for achieve good performance in Buslje et. al. 2009) for estimating 400 parameters (pairs of residues without counting gaps) this gives you 0.05.
  • 0.5 : Jeffreys prior (Jeffreys, 1946).
  • 1 : Bayes-Laplace uniform prior, aka. Laplace smoothing.
MIToS.Information.BLOSUM_PseudofrequenciesType

BLOSUM_Pseudofrequencies type. It takes to arguments/fields:

  • α : Usually the number of sequences or sequence clusters in the MSA.
  • β : The weight of the pseudofrequencies, a value close to 8.512 when α is the number of sequence clusters.
MIToS.Information.ContingencyTableType

A ContingencyTable is a multidimensional array. It stores the contingency matrix, its marginal values and total. The type also has an internal and private temporal array and an alphabet object. It's a parametric type, taking three ordered parameters:

  • T : The element type of the multidimensional array.
  • N : It's the dimension of the array and should be an Int.
  • A : This should be a type, subtype of ResidueAlphabet, i.e.: UngappedAlphabet,

GappedAlphabet or ReducedAlphabet.

A ContingencyTable can be created from an alphabet if all the parameters are given. Otherwise, you need to give a type, a number (Val) and an alphabet. You can also create a ContingencyTable using a matrix and a alphabet. For example:

ContingencyTable{Float64, 2, UngappedAlphabet}(UngappedAlphabet())
ContingencyTable(Float64, Val{2}, UngappedAlphabet())
ContingencyTable(zeros(Float64,20,20), UngappedAlphabet())
MIToS.Information.ProbabilitiesType

A Probabilities object wraps a ContingencyTable storing probabilities. It doesn't perform any check. If the total isn't one, you must use normalize or normalize!on the ContingencyTable before wrapping it to make the sum of the probabilities equal to one.

Base.count!Method

It populates a ContingencyTable (first argument) using the frequencies in the sequences (last positional arguments). The dimension of the table must match the number of sequences and all the sequences must have the same length. You must indicate the used weights and pseudocounts as second and third positional arguments respectively. You can use NoPseudofrequencies() and NoClustering() to avoid the use of sequence weighting and pseudocounts, respectively.

Base.countMethod

It returns a ContingencyTable wrapped in a Counts type with the frequencies of residues in the sequences that takes as arguments. The dimension of the table is equal to the number of sequences. You can use the keyword arguments alphabet, weights and pseudocounts to indicate the alphabet of the table (default to UngappedAlphabet()), a clustering result (default to NoClustering()) and the pseudocounts (default to NoPseudocount()) to be used during the estimation of the frequencies.

MIToS.Information.BLMIMethod

BLMI takes a MSA or a file and a FileFormat as first arguments. It calculates a Z score (ZBLMI) and a corrected MI/MIp as described on Busjle et. al. 2009 but using using BLOSUM62 pseudo frequencies instead of a fixed pseudocount.

Keyword argument, type, default value and descriptions:

  - beta        Float64   8.512   β for BLOSUM62 pseudo frequencies
  - lambda      Float64   0.0     Low count value
  - threshold             62      Percent identity threshold for sequence clustering (Hobohm I)
  - maxgap      Float64   0.5     Maximum fraction of gaps in positions included in calculation
  - apc         Bool      true    Use APC correction (MIp)
  - samples     Int       50      Number of samples for Z-score
  - fixedgaps   Bool      true    Fix gaps positions for the random samples

This function returns:

  - Z score (ZBLMI)
  - MI or MIp using BLOSUM62 pseudo frequencies (BLMI/BLMIp)
MIToS.Information._calculate_blosum_pseudofrequencies!Method

_calculate_blosum_pseudofrequencies!{T}(Pab::ContingencyTable{T,2,UngappedAlphabet})

This function uses the conditional probability matrix BLOSUM62_Pij to fill the temporal array field of Pab with pseudo frequencies (Gab). This function needs the real frequencies/probabilities Pab because they are used to estimate the pseudofrequencies.

Gab = Σcd Pcd ⋅ BLOSUM62( a | c ) ⋅ BLOSUM62( b | d )

MIToS.Information._mean_totalMethod

Mean mutual information of column a (Dunn et. al. 2008). Summation is over j=1 to N, j ≠ a. Total is N-1.

Overall mean mutual information (Dunn et. al. 2008). 2/(N*(N-1)) by the sum of MI where the indices run i=1 to N-1, j=i+1 to N (triu).

MIToS.Information.apply_pseudofrequencies!Method

apply_pseudofrequencies!{T}(Pab::ContingencyTable{T,2,UngappedAlphabet}, pseudofrequencies::BLOSUM_Pseudofrequencies)

When a BLOSUM_Pseudofrequencies(α,β) is used, this function applies pseudofrequencies Gab over Pab, as a weighted mean of both. It uses the conditional probability matrix BLOSUM62_Pij and the real frequencies/probabilities Pab to estimate the pseudofrequencies Gab. α is the weight of the real frequencies Pab and β the weight of the pseudofrequencies.

Gab = Σcd Pcd ⋅ BLOSUM62( a | c ) ⋅ BLOSUM62( b | d ) Pab = (α ⋅ Pab + β ⋅ Gab )/(α + β)

MIToS.Information.buslje09Method

buslje09 takes a MSA or a file and a FileFormat as first arguments. It calculates a Z score and a corrected MI/MIp as described on Busjle et. al. 2009.

keyword argument, type, default value and descriptions:

  - lambda      Float64   0.05    Low count value
  - clustering  Bool      true    Sequence clustering (Hobohm I)
  - threshold             62      Percent identity threshold for clustering
  - maxgap      Float64   0.5     Maximum fraction of gaps in positions included in calculation
  - apc         Bool      true    Use APC correction (MIp)
  - samples     Int       100     Number of samples for Z-score
  - fixedgaps   Bool      true    Fix gaps positions for the random samples
  - alphabet    ResidueAlphabet UngappedAlphabet()  Residue alphabet to be used

This function returns:

  - Z score
  - MI or MIp
MIToS.Information.cleanup!Method

cleanup! fills the temporal, table and marginals arrays with zeros. It also sets total to zero.

MIToS.Information.cumulativeMethod

cumulative allows to calculate cumulative scores (i.e. cMI) as defined in Buslje et. al. 2010

"We calculated a cumulative mutual information score (cMI) for each residue as the sum of MI values above a certain threshold for every amino acid pair where the particular residue appears. This value defines to what degree a given amino acid takes part in a mutual information network." Buslje, Cristina Marino, Elin Teppa, Tomas Di Doménico, José María Delfino, and Morten Nielsen. Networks of high mutual information define the structural proximity of catalytic sites: implications for catalytic residue identification. PLoS Comput Biol 6, no. 11 (2010): e1000978.

MIToS.Information.delete_dimensions!Method

delete_dimensions!(out::ContingencyTable, in::ContingencyTable, dimensions::Int...)

This function fills a ContingencyTable with the counts/probabilities on in after the deletion of dimensions. i.e. This is useful for getting Pxy from Pxyz.

MIToS.Information.delete_dimensionsMethod

delete_dimensions(in::ContingencyTable, dimensions::Int...)

This function creates a ContingencyTable with the counts/probabilities on in after the deletion of dimensions. i.e. This is useful for getting Pxy from Pxyz.

MIToS.Information.gaussdcaMethod

Wrapper function to GaussDCA.gDCA. You need to install GaussDCA:

using Pkg

Pkg.add(PackageSpec(url="https://github.com/carlobaldassi/GaussDCA.jl", rev="master"))

Look into GaussDCA.jl README for further information. If you use this wrapper, please cite the GaussDCA publication and the package's doi.

It's possible to indicate the path to the julia binary where GaussDCA is installed. However, it's recommended to use the same version where MIToS is installed. That is because this function use serialize/deserialize to transfer data between the processes.

GaussDCA Publication: Baldassi, Carlo, Marco Zamparo, Christoph Feinauer, Andrea Procaccini, Riccardo Zecchina, Martin Weigt, and Andrea Pagnani. "Fast and accurate multivariate Gaussian modeling of protein families: predicting residue contacts and protein-interaction partners." PloS one 9, no. 3 (2014): e92721.

MIToS.Information.kullback_leiblerMethod

It calculates the Kullback-Leibler (KL) divergence from a table of Probabilities. The second positional argument is a Probabilities or ContingencyTable with the background distribution. It's optional, the default is the BLOSUM62_Pi table. Use last and optional positional argument to change the base of the log. The default base is e, so the result is in nats. You can use 2.0 as base to get the result in bits.

MIToS.Information.mapcolfreq!Method

It efficiently map a function (first argument) that takes a table of Counts or Probabilities (third argument). The table is filled in place with the counts or probabilities of each column from the msa (second argument).

  • weights (default: NoClustering()): Weights to be used for table counting.
  • pseudocounts (default: NoPseudocount()): Pseudocount object to be applied to table.
  • pseudofrequencies (default: NoPseudofrequencies()): Pseudofrequencies to be applied to the normalized (probabilities) table.
MIToS.Information.mapcolpairfreq!Method

It efficiently map a function (first argument) that takes a table of Counts or Probabilities (third argument). The table is filled in place with the counts or probabilities of each pair of columns from the msa (second argument). The fourth positional argument usediagonal indicates if the function should be applied to identical element pairs (default to Val{true}).

  • weights (default: NoClustering()): Weights to be used for table counting.

  • pseudocounts (default: NoPseudocount()): Pseudocount object to be applied to table.

  • pseudofrequencies (default: NoPseudofrequencies()): Pseudofrequencies to be applied to the normalized (probabilities) table.

  • diagonalvalue (default: 0): Value to fill diagonal elements if usediagonal is Val{false}.

MIToS.Information.mapseqfreq!Method

It efficiently map a function (first argument) that takes a table of Counts or Probabilities (third argument). The table is filled in place with the counts or probabilities of each sequence from the msa (second argument).

  • weights (default: NoClustering()): Weights to be used for table counting.
  • pseudocounts (default: NoPseudocount()): Pseudocount object to be applied to table.
  • pseudofrequencies (default: NoPseudofrequencies()): Pseudofrequencies to be applied to the normalized (probabilities) table.
MIToS.Information.mapseqpairfreq!Method

It efficiently map a function (first argument) that takes a table of Counts or Probabilities (third argument). The table is filled in place with the counts or probabilities of each pair of sequences from the msa (second argument). The fourth positional argument usediagonal indicates if the function should be applied to identical element pairs (default to Val{true}).

  • weights (default: NoClustering()): Weights to be used for table counting.

  • pseudocounts (default: NoPseudocount()): Pseudocount object to be applied to table.

  • pseudofrequencies (default: NoPseudofrequencies()): Pseudofrequencies to be applied to the normalized (probabilities) table.

  • diagonalvalue (default: 0): Value to fill diagonal elements if usediagonal is Val{false}.

MIToS.Information.marginal_entropyMethod

It calculates marginal entropy (H) from a table of Counts or Probabilities. The second positional argument is used to indicate the magin used to calculate the entropy, e.g. it estimates the entropy H(X) if marginal is 1, H(Y) for 2, etc. Use last and optional positional argument to change the base of the log. The default base is e, so the result is in nats. You can use 2.0 as base to get the result in bits.

MIToS.Information.mutual_informationMethod

It calculates Mutual Information (MI) from a table of Counts or Probabilities. Use last and optional positional argument to change the base of the log. The default base is e, so the result is in nats. You can use 2.0 as base to get the result in bits. Calculation of MI from Counts is faster than from Probabilities.

MIToS.Information.pairwisegapfractionMethod

It takes a MSA or a file and a FileFormat as first arguments. It calculates the percentage of gaps on columns pairs (union and intersection) using sequence clustering (Hobohm I).

Argument, type, default value and descriptions:

    - clustering  Bool      true    Sequence clustering (Hobohm I)
    - threshold             62      Percent identity threshold for sequence clustering (Hobohm I)

This function returns:

    - pairwise gap union as percentage
    - pairwise gap intersection as percentage
MIToS.Information.probabilities!Method

It populates a ContingencyTable (first argument) using the probabilities in the sequences (last positional arguments). The dimension of the table must match the number of sequences and all the sequences must have the same length. You must indicate the used weights, pseudocounts and pseudofrequencies as second, third and fourth positional arguments respectively. You can use NoClustering(), NoPseudocount() and NoPseudofrequencies() to avoid the use of sequence weighting, pseudocounts and pseudofrequencies, respectively.

MIToS.Information.probabilitiesMethod

It returns a ContingencyTable wrapped in a Probabilities type with the frequencies of residues in the sequences that takes as arguments. The dimension of the table is equal to the number of sequences. You can use the keyword arguments alphabet, weights, pseudocounts and pseudofrequencies to indicate the alphabet of the table (default to UngappedAlphabet()), a clustering result (default to NoClustering()), the pseudocounts (default to NoPseudocount()) and the pseudofrequencies (default to NoPseudofrequencies()) to be used during the estimation of the probabilities.

StatsBase.entropyMethod

It calculates the Shannon entropy (H) from a table of Counts or Probabilities. Use last and optional positional argument to change the base of the log. The default base is e, so the result is in nats. You can use 2.0 as base to get the result in bits.

MIToS.PDBModule

The module PDB defines types and methods to work with protein structures inside Julia. It is useful to link structural and sequential information, and needed for measure the predictive performance at protein contact prediction of mutual information scores.

Features

  • Read and parse PDF and PDBML files
  • Calculate distance and contacts between atoms or residues
  • Determine interaction between residues
using MIToS.PDB
MIToS.PDB._hbond_acceptorConstant

Keys come from Table 1 of Bickerton et. al. 2011, Antecedents come from come from: http://biomachina.org/courses/modeling/download/topallh22x.pro Synonyms come from: http://www.bmrb.wisc.edu/refinfo/atomnom.tbl

MIToS.PDB._hbond_donorConstant

Keys come from Table 1 of Bickerton et. al. 2011, The hydrogen names of the donor comes from: http://biomachina.org/courses/modeling/download/topallh22x.pro Synonyms come from: http://www.bmrb.wisc.edu/refinfo/atomnom.tbl

MIToS.PDB.covalentradiusConstant

Covalent radius in Å of each element from the Additional file 1 of PICCOLO [1]. Hydrogen was updated using the value on Table 2 from Cordero et. al. [2].

  1. Bickerton, G. R., Higueruelo, A. P., & Blundell, T. L. (2011).

Comprehensive, atomic-level characterization of structurally characterized protein-protein interactions: the PICCOLO database. BMC bioinformatics, 12(1), 313.

  1. Cordero, B., Gómez, V., Platero-Prats, A. E., Revés, M.,

Echeverría, J., Cremades, E., ... & Alvarez, S. (2008). Covalent radii revisited. Dalton Transactions, (21), 2832-2838.

MIToS.PDB.vanderwaalsradiusConstant

van der Waals radius in Å from the Additional file 1 of Bickerton et. al. 2011

  • Bickerton, G. R., Higueruelo, A. P., & Blundell, T. L. (2011).

Comprehensive, atomic-level characterization of structurally characterized protein-protein interactions: the PICCOLO database. BMC bioinformatics, 12(1), 313.

MIToS.PDB.PDBAtomType

A PDBAtom object contains the information from a PDB atom, without information of the residue. It has the following fields that you can access at any moment for query purposes:

- `coordinates` : x,y,z coordinates, e.g. `Coordinates(109.641,73.162,42.7)`.
- `atom` : Atom name, e.g. `"CA"`.
- `element` : Element type of the atom, e.g. `"C"`.
- `occupancy` : A float number with the occupancy, e.g. `1.0`.
- `B` : B factor as a string, e.g. `"23.60"`.
MIToS.PDB.PDBFileType

PDBFile <: FileFormat

Protein Data Bank (PDB) format. It provides a standard representation for macromolecular structure data derived from X-ray diffraction and NMR studies.

MIToS.PDB.PDBMLType

PDBML <: FileFormat

Protein Data Bank Markup Language (PDBML), a representation of PDB data in XML format.

MIToS.PDB.PDBResidueType

A PDBResidue object contains all the information about a PDB residue. It has the following fields that you can access at any moment for query purposes:

- `id` : A `PDBResidueIdentifier` object.
- `atoms` : A vector of `PDBAtom`s.
MIToS.PDB.PDBResidueIdentifierType

A PDBResidueIdentifier object contains the information needed to identity PDB residues. It has the following fields that you can access at any moment for query purposes:

- `PDBe_number` : It's only used when a PDBML is readed (PDBe number as a string).
- `number` : PDB residue number, it includes insertion codes, e.g. `"34A"`.
- `name` : Three letter residue name in PDB, e.g. `"LYS"`.
- `group` : It can be `"ATOM"` or `"HETATM"`.
- `model` : The model number as a string, e.g. `"1"`.
- `chain` : The chain as a string, e.g. `"A"`.
Base.angleMethod

angle(a::Coordinates, b::Coordinates, c::Coordinates)

Angle (in degrees) at b between a-b and b-c

Base.anyMethod

any(f::Function, a::PDBResidue, b::PDBResidue, criteria::Function)

Test if the function f is true for any pair of atoms between the residues a and b. This function only test atoms that returns true for the fuction criteria.

Base.anyMethod

any(f::Function, a::PDBResidue, b::PDBResidue)

Test if the function f is true for any pair of atoms between the residues a and b

Base.parseMethod

parse(pdbml, ::Type{PDBML}; chain=All, model=All, group=All, atomname=All, onlyheavy=false, label=true, occupancyfilter=false)

Reads a LightXML.XMLDocument representing a pdb file. Returns a list of PDBResidues (view MIToS.PDB.PDBResidues). Setting chain, model, group, atomname and onlyheavy values can be used to select of a subset of all residues. If not set, all residues are returned. If the keyword argument label (default: true) is false,the auth_ attributes will be use instead of the label_ attributes for chain, atom and residue name fields. The auth_ attributes are alternatives provided by an author in order to match the identification/values used in the publication that describes the structure. If the keyword argument occupancyfilter (default: false) is true, only the atoms with the best occupancy are returned.

Base.parseMethod

parse(io, ::Type{PDBFile}; chain=All, model=All, group=All, atomname=All, onlyheavy=false, occupancyfilter=false)

Reads a text file of a PDB entry. Returns a list of PDBResidue (view MIToS.PDB.PDBResidues). Setting chain, model, group, atomname and onlyheavy values can be used to select of a subset of all residues. Group can be "ATOM" or "HETATM". If not set, all residues are returned. If the keyword argument occupancyfilter (default: false) is true, only the atoms with the best occupancy are returned.

Base.printFunction

print(io, res, format::Type{PDBFile}) print(res, format::Type{PDBFile})

Print a PDBResidue or a vector of PDBResidues in PDB format.

MIToS.PDB.CAmatrixMethod

Returns a matrix with the x, y and z coordinates of the Cα with best occupancy for each PDBResidue of the ATOM group. If a residue doesn't have a Cα, its Cα coordinates are NaNs.

MIToS.PDB._get_matched_CαsMethod

Return the matching CA matrices after deleting the rows/residues where the CA is missing in at least one structure.

MIToS.PDB.aromaticMethod

There's an aromatic interaction if centriods are at 6.0 Å or less.

MIToS.PDB.atomsFunction

atoms(residue_list, model, chain, group, residue, atom)

These return a vector of PDBAtoms with the selected subset of atoms. You can use the type All (default value of the positional arguments) to avoid filtering a that level.

MIToS.PDB.bestoccupancyMethod

Takes a Vector of PDBAtoms and returns a Vector of the PDBAtoms with best occupancy.

MIToS.PDB.center!Method

center!(A::AbstractMatrix{Float64})

Takes a set of points A as an NxD matrix (N: number of points, D: dimension). Translates A in place so that its centroid is at the origin of coordinates

MIToS.PDB.centeredcoordinatesFunction

Returns a Matrix{Float64} with the centered coordinates of all the atoms in residues. An optional positional argument CA (default: true) defines if only Cα carbons should be used to center the matrix.

MIToS.PDB.centeredresiduesFunction

Returns a new Vector{PDBResidue} with the PDBResidues having centered coordinates. An optional positional argument CA (default: true) defines if only Cα carbons should be used to center the matrix.

MIToS.PDB.change_coordinatesFunction

change_coordinates(residue::PDBResidue, coordinates::AbstractMatrix{Float64}, offset::Int=1)

Returns a new PDBResidues with (x,y,z) from a coordinates AbstractMatrix{Float64} You can give an offset indicating in wich matrix row starts the (x,y,z) coordinates of the residue.

MIToS.PDB.change_coordinatesMethod

change_coordinates(residues::AbstractVector{PDBResidue}, coordinates::AbstractMatrix{Float64})

Returns a new Vector{PDBResidues} with (x,y,z) from a coordinates Matrix{Float64}

MIToS.PDB.change_coordinatesMethod

change_coordinates(atom::PDBAtom, coordinates::Coordinates)

Returns a new PDBAtom but with a new coordinates

MIToS.PDB.contactMethod

contact(a::Coordinates, b::Coordinates, limit::AbstractFloat)

It returns true if the distance is less or equal to the limit. It doesn't call sqrt because it does squared_distance(a,b) <= limit^2.

MIToS.PDB.contactMethod

contact(A::PDBResidue, B::PDBResidue, limit::AbstractFloat; criteria::String="All")

Returns true if the residues A and B are at contact distance (limit). The available distance criteria are: Heavy, All, CA, CB (CA for GLY)

MIToS.PDB.contactMethod

contact(residues::Vector{PDBResidue}, limit::AbstractFloat; criteria::String="All")

If contact takes a Vector{PDBResidue}, It returns a matrix with all the pairwise comparisons (contact map).

MIToS.PDB.covalentMethod

Returns true if the distance between atoms is less than the sum of the covalentradius of each atom.

MIToS.PDB.distanceMethod

distance(residues::Vector{PDBResidue}; criteria::String="All")

If distance takes a Vector{PDBResidue} returns a PairwiseListMatrix{Float64, false} with all the pairwise comparisons (distance matrix).

MIToS.PDB.downloadpdbMethod

It downloads a gzipped PDB file from PDB database. It requires a four character pdbcode. Its default format is PDBML (PDB XML) and It uses the baseurl "http://www.rcsb.org/pdb/files/". filename is the path/name of the output file. This function calls MIToS.Utils.download_file that calls download from the HTTP.jl package. You can use keyword arguments from HTTP.request. Use the headers keyword argument to pass a Dict{String, String} with the header information.

MIToS.PDB.findatomsMethod

findatoms(res::PDBResidue, atom::String)

Returns a index vector of the atoms with the given atom name.

MIToS.PDB.findheavyMethod

Returns a list with the index of the heavy atoms (all atoms except hydrogen) in the PDBResidue

MIToS.PDB.getCAMethod

Returns the Cα with best occupancy in the PDBResidue. If the PDBResidue has no Cα, missing is returned.

MIToS.PDB.getpdbdescriptionMethod

Access general information about a PDB entry (e.g., Header information) using the GraphQL interface of the PDB database. It parses the JSON answer into a Dict.

MIToS.PDB.hydrogenbondMethod

This function only works if there are hydrogens in the structure. The criteria for a hydrogen bond are:

  • d(Ai, Aj) < 3.9Å
  • d(Ah, Aacc) < 2.5Å
  • θ(Adon, Ah, Aacc) > 90°
  • θ(Adon, Aacc, Aacc-antecedent) > 90°
  • θ(Ah, Aacc, Aacc-antecedent) > 90°

Where Ah is the donated hydrogen atom, Adon is the hydrogen bond donor atom, Aacc is the hydrogen bond acceptor atom and Aacc-antecednt is the atom antecedent to the hydrogen bond acceptor atom.

MIToS.PDB.hydrophobicMethod

There's an hydrophobic interaction if two hydrophobic atoms are at 5.0 Å or less.

MIToS.PDB.ionicMethod

There's an ionic interaction if a cationic and an anionic atoms are at 6.0 Å or less.

MIToS.PDB.is_aminoacidMethod
is_aminoacid(residue::PDBResidue)
is_aminoacid(residue_id::PDBResidueIdentifier)

This function returns true if the PDB residue is an amino acid residue. It checks if the residue's three-letter name exists in the MIToS.Utils.THREE2ONE dictionary, and returns false otherwise.

MIToS.PDB.isanionicMethod

Returns true if the atom, e.g. ("GLU","CD"), is an anionic atom in the residue.

MIToS.PDB.isaromaticMethod

Returns true if the atom, e.g. ("HIS","CG"), is an aromatic atom in the residue.

MIToS.PDB.iscationicMethod

Returns true if the atom, e.g. ("ARG","NE"), is a cationic atom in the residue.

MIToS.PDB.isresidueMethod

It tests if the PDB residue has the indicated model, chain, group and residue number.

MIToS.PDB.kabschMethod

kabsch(A::AbstractMatrix{Float64}, B::AbstractMatrix{Float64})

This function takes two sets of points, A (refrence) and B as NxD matrices, where D is the dimension and N is the number of points. Assumes that the centroids of A and B are at the origin of coordinates. You can call center! on each matrix before calling kabsch to center the matrices in the (0.0, 0.0, 0.0). Rotates B so that rmsd(A,B) is minimized. Returns the rotation matrix. You should do B * RotationMatrix to get the rotated B.

MIToS.PDB.mean_coordinatesMethod

Calculates the average/mean position of each atom in a set of structure. The function takes a vector (AbstractVector) of vectors (AbstractVector{PDBResidue}) or matrices (AbstractMatrix{Float64}) as first argument. As second (optional) argument this function can take an AbstractVector{Float64} of matrix/structure weights to return a weighted mean. When a AbstractVector{PDBResidue} is used, if the keyword argument calpha is false the RMSF is calculated for all the atoms. By default only alpha carbons are used (default: calpha=true).

MIToS.PDB.modelled_sequencesMethod
modelled_sequences(residue_list::AbstractArray{PDBResidue,N}; 
    model::Union{String,Type{All}}=All, chain::Union{String,Type{All}}=All, 
    group::Union{String,Regex,Type{All}}=All) where N

This function returns an OrderedDict where each key is a named tuple (containing the model and chain identifiers), and each value is the protein sequence corresponding to the modelled residues in those chains. Therefore, the obtained sequences do not contain missing residues. All modelled residues are included by default, but those that don't satisfy specified criteria based on the model, chain, or group keyword arguments are excluded. One-letter residue names are obtained from the MIToS.Utils.THREE2ONE dictionary for all residue names that return true for is_aminoacid.

MIToS.PDB.picationMethod

There's a Π-Cation interaction if a cationic and an aromatic atoms are at 6.0 Å or less

MIToS.PDB.proximitymeanMethod

proximitymean calculates the proximity mean/average for each residue as the average score (from a scores list) of all the residues within a certain physical distance to a given amino acid. The score of that residue is not included in the mean unless you set include to true. The default values are 6.05 for the distance threshold/limit and "Heavy" for the criteria keyword argument. This function allows to calculate pMI (proximity mutual information) and pC (proximity conservation) as in Buslje et. al. 2010.

Buslje, Cristina Marino, Elin Teppa, Tomas Di Doménico, José María Delfino, and Morten Nielsen. Networks of high mutual information define the structural proximity of catalytic sites: implications for catalytic residue identification. PLoS Comput Biol 6, no. 11 (2010): e1000978.

MIToS.PDB.residuepairsmatrixMethod

It creates a NamedArray containing a PairwiseListMatrix where each element (column, row) is identified with a PDBResidue from the input vector. You can indicate the value type of the matrix (default to Float64), if the list should have the diagonal values (default to Val{false}) and the diagonal values (default to NaN).

MIToS.PDB.residuesMethod

residues(residue_list, model, chain, group, residue)

These return a new vector with the selected subset of residues from a list of residues. You can use the type All (default value) to avoid filtering a that level.

MIToS.PDB.residuesdictMethod

residuesdict(residue_list, model, chain, group, residue)

These return a dictionary (using PDB residue numbers as keys) with the selected subset of residues. You can use the type All (default value) to avoid filtering a that level.

MIToS.PDB.rmsdMethod

rmsd(A::AbstractMatrix{Float64}, B::AbstractMatrix{Float64})

Return RMSD between two sets of points A and B, given as NxD matrices (N: number of points, D: dimension).

MIToS.PDB.rmsdMethod

rmsd(A::AbstractVector{PDBResidue}, B::AbstractVector{PDBResidue}; superimposed::Bool=false)

Returns the Cα RMSD value between two PDB structures: A and B. If the structures are already superimposed between them, use superimposed=true to avoid a new superimposition (superimposed is false by default).

MIToS.PDB.rmsfMethod

Calculates the RMSF (Root Mean-Square-Fluctuation) between an atom and its average position in a set of structures. The function takes a vector (AbstractVector) of vectors (AbstractVector{PDBResidue}) or matrices (AbstractMatrix{Float64}) as first argument. As second (optional) argument this function can take an AbstractVector{Float64} of matrix/structure weights to return the root weighted mean-square-fluctuation around the weighted mean structure. When a Vector{PDBResidue} is used, if the keyword argument calpha is false the RMSF is calculated for all the atoms. By default only alpha carbons are used (default: calpha=true).

MIToS.PDB.selectbestoccupancyMethod

Takes a PDBResidue and a Vector of atom indices. Returns the index value of the Vector with maximum occupancy.

MIToS.PDB.squared_distanceMethod

squared_distance(A::PDBResidue, B::PDBResidue; criteria::String="All")

Returns the squared distance between the residues A and B. The available criteria are: Heavy, All, CA, CB (CA for GLY)

MIToS.PDB.superimposeFunction
Asuper, Bsuper, RMSD = superimpose(A, B, matches=nothing)

This function takes A::AbstractVector{PDBResidue} (reference) and B::AbstractVector{PDBResidue}. Translates A and B to the origin of coordinates, and rotates B so that rmsd(A,B) is minimized with the Kabsch algorithm (using only their α carbons). Returns the rotated and translated versions of A and B, and the RMSD value.

Optionally provide matches which iterates over matched index pairs in A and B, e.g., matches = [(3, 5), (4, 6), ...]. The alignment will be constructed using just the matching residues.

MIToS.PDB.vanderwaalsMethod

Test if two atoms or residues are in van der Waals contact using: distance(a,b) <= 0.5 + vanderwaalsradius[a] + vanderwaalsradius[b]. It returns distance <= 0.5 if the atoms aren't in vanderwaalsradius.

MIToS.PDB.vanderwaalsclashMethod

Returns true if the distance between the atoms is less than the sum of the vanderwaalsradius of the atoms. If the atoms aren't on the list (i.e. OXT), the vanderwaalsradius of the element is used. If there is not data in the dict, distance 0.0 is used.

MIToS.PDB.@atomsMacro

@atoms ... model ... chain ... group ... residue ... atom ...

These return a vector of PDBAtoms with the selected subset of atoms. You can use the type All to avoid filtering that option.

MIToS.PDB.@residuesMacro

@residues ... model ... chain ... group ... residue ...

These return a new vector with the selected subset of residues from a list of residues. You can use the type All to avoid filtering that option.

MIToS.PDB.@residuesdictMacro

@residuesdict ... model ... chain ... group ... residue ...

These return a dictionary (using PDB residue numbers as keys) with the selected subset of residues. You can use the type All to avoid filtering that option.

MIToS.Utils.Scripts.loadedversionMethod

Return the version of the loaded module.

Source: https://stackoverflow.com/questions/60587041/julia-getting-the-version-number-of-my-module

MIToS.Utils.Scripts.runscriptMethod

If FILE is not used, calls runsinglescript with STDIN. Otherwise calls runsinglescript with the FILE. If list is true, for loop or pmap is used over eachline of the input.

MIToS.SIFTSModule

The SIFTS module of MIToS allows to obtain the residue-level mapping between databases stored in the SIFTS XML files. It makes easy to assign PDB residues to UniProt/Pfam positions. Given the fact that pairwise alignments can lead to misleading association between residues in both sequences, SIFTS offers more reliable association between sequence and structure residue numbers.

Features

  • Download and parse SIFTS XML files
  • Store residue-level mapping in Julia
  • Easy generation of OrderedDicts between residues numbers
using MIToS.SIFTS
MIToS.SIFTS.SIFTSResidueType

A SIFTSResidue object stores the SIFTS residue level mapping for a residue. It has the following fields that you can access at any moment for query purposes:

- `PDBe` : A `dbPDBe` object, it's present in all the `SIFTSResidue`s.
- `UniProt` : A `dbUniProt` object or `missing`.
- `Pfam` : A `dbPfam` object or `missing`.
- `NCBI` : A `dbNCBI` object or `missing`.
- `InterPro` : An array of `dbInterPro` objects.
- `PDB` : A `dbPDB` object or `missing`.
- `SCOP` : A `dbSCOP` object or `missing`.
- `SCOP2` : An array of `dbSCOP2` objects.
- `SCOP2B` : A `dbSCOP2B` object or `missing`.
- `CATH` : A `dbCATH` object or `missing`.
- `Ensembl` : An array of `dbEnsembl` objects.
- `missing` : It's `true` if the residue is missing, i.e. not observed, in the structure.
- `sscode` : A string with the secondary structure code of the residue.
- `ssname` : A string with the secondary structure name of the residue.
MIToS.SIFTS.dbCATHType

dbCATH stores the residue id, number, name and chain in CATH as strings.

MIToS.SIFTS.dbEnsemblType

dbEnsembl stores the residue (gene) accession id, the transcript, translation and exon ids in Ensembl as strings, together with the residue number and name using the UniProt coordinates.

MIToS.SIFTS.dbInterProType

dbInterPro stores the residue id, number, name and evidence in InterPro as strings.

MIToS.SIFTS.dbNCBIType

dbNCBI stores the residue id, number and name in NCBI as strings.

MIToS.SIFTS.dbPDBType

dbPDB stores the residue id, number, name and chain in PDB as strings.

MIToS.SIFTS.dbPfamType

dbPfam stores the residue id, number and name in Pfam as strings.

MIToS.SIFTS.dbSCOPType

dbSCOP stores the residue id, number, name and chain in SCOP as strings.

MIToS.SIFTS.dbSCOP2Type

dbSCOP2 stores the residue id, number, name and chain in SCOP2 as strings.

MIToS.SIFTS.dbSCOP2BType

dbSCOP2B stores the residue id, number, name and chain in SCOP2B as strings. SCOP2B is expansion of SCOP2 domain annotations at superfamily level to every PDB with same UniProt accession having at least 80% SCOP2 domain coverage.

Base.parseMethod

parse(document::LightXML.XMLDocument, ::Type{SIFTSXML}; chain=All, missings::Bool=true)

Returns a Vector{SIFTSResidue} parsed from a SIFTSXML file. By default, parses all the chains and includes missing residues.

MIToS.SIFTS._get_entitiesMethod

Gets the entities of a SIFTS XML. In some cases, each entity is a PDB chain. WARNING: Sometimes there are more chains than entities!

<entry dbSource="PDBe" ...
  ...
  <entity type="protein" entityId="A">
    ...
  </entity>
  <entity type="protein" entityId="B">
    ...
  </entity>
<ntry>
MIToS.SIFTS._get_residuesMethod

Returns an Iterator of the residues on the listResidue

<listResidue>
  <residue>
  ...
  </residue>
  ...
</listResidue>
MIToS.SIFTS._get_segmentsMethod

Gets an array of the segments, the continuous region of an entity. Chimeras and expression tags generates more than one segment for example.

MIToS.SIFTS.downloadsiftsMethod
downloadsifts(pdbcode::String; filename::String, source::String="https")

Download the gzipped SIFTS XML file for the provided pdbcode. The downloaded file will have the default extension .xml.gz. While you can change the filename, it must include the .xml.gz ending. The source keyword argument is set to "https" by default. Alternatively, you can choose "ftp" as the source, which will retrieve the file from the EBI FTP server at ftp://ftp.ebi.ac.uk/pub/databases/msd/sifts/. However, please note that using "https" is highly recommended. This option will download the file from the EBI PDBe server at https://www.ebi.ac.uk/pdbe/files/sifts/.

MIToS.SIFTS.siftsmappingMethod

Parses a SIFTS XML file and returns a OrderedDict between residue numbers of two DataBases with the given identifiers. A chain could be specified (All by default). If missings is true (default) all the residues are used, even if they haven’t coordinates in the PDB file.

MIToS.UtilsModule

The Utils has common utils functions and types used in other modules.

using MIToS.Utils
MIToS.Utils.THREE2ONEConstant

THREE2ONE is a dictionary that maps three-letter amino acid residue codes (String) to their corresponding one-letter codes (Char). The dictionary is generated by parsing components.cif file from the Protein Data Bank.

julia> using MIToS.Utils

julia> one_letter_code = THREE2ONE["ALA"]
'A': ASCII/Unicode U+0041 (category Lu: Letter, uppercase)
MIToS.Utils.AllType

All is used instead of MIToS 1.0 "all" or "*", because it's possible to dispatch on it.

Base.readMethod

read(pathname, FileFormat [, Type [, … ] ] ) -> Type

This function opens a file in the pathname and calls parse(io, ...) for the given FileFormat and Type on it. If the pathname is an HTTP or FTP URL, the file is downloaded with download in a temporal file. Gzipped files should end on .gz.

Base.writeMethod

write{T<:FileFormat}(filename::AbstractString, object, format::Type{T}, mode::ASCIIString="w")

This function opens a file with filename and mode (default: "w") and writes (print) the object with the given format. Gzipped files should end on .gz.

MIToS.Utils._modify_kargs_for_proxyMethod

Helper function that modifies keyword argument to include a proxy, the proxy URL is taken from the HTTPSPROXY and HTTPSPROXY enviromental variables.

MIToS.Utils.check_fileMethod

Returns the filename. Throws an ErrorException if the file doesn't exist, or a warning if the file is empty.

MIToS.Utils.download_fileMethod

download_file uses HTTP.jl to download files from the web. It takes the file url as first argument and, optionally, a path to save it. Keyword arguments are are directly passed to to HTTP.download (HTTP.request). Use the headers keyword argument to pass a Dict{String,String} with the header information. Set the HTTPS_PROXY and HTTPS_PROXY ENViromental variables if you are behind a proxy.

julia> using MIToS.Utils

julia> download_file("http://www.uniprot.org/uniprot/P69905.fasta","seq.fasta",
       headers = Dict("User-Agent" =>
                      "Mozilla/5.0 (compatible; MSIE 7.01; Windows NT 5.0)"))
"seq.fasta"
MIToS.Utils.get_n_wordsMethod

get_n_words{T <: Union{ASCIIString, UTF8String}}(line::T, n::Int) It returns a Vector{T} with the first n (possibles) words/fields (delimited by space or tab). If there is more than n words, the last word returned contains the finals words and the delimiters. The length of the returned vector is n or less (if the number of words is less than n). This is used for parsing the Stockholm format.

julia> using MIToS.Utils

julia> get_n_words("#=GR O31698/18-71 SS    CCCHHHHHHHHHHHHHHHEEEEEEEEEEEEEEEEHHH", 3)
3-element Vector{String}:
 "#=GR"
 "O31698/18-71"
 "SS    CCCHHHHHHHHHHHHHHHEEEEEEEEEEEEEEEEHHH"
MIToS.Utils.hascoordinatesMethod

hascoordinates(id) It returns true if id/sequence name has the format: UniProt/start-end (i.e. O83071/192-246)

MIToS.Utils.list2matrixMethod

Returns a square symmetric matrix from the vector vec. side is the number of rows/columns. The diagonal is not included by default, set to true if there are diagonal elements in the list.

MIToS.Utils.matrix2listMethod

Returns a vector with the part ("upper" or "lower") of the square matrix mat. The diagonal is not included by default.

MIToS.Utils.select_elementMethod

Selects the first element of the vector. This is useful for unpacking one element vectors. Throws a warning if there are more elements. element_name is element by default, but the name can be changed using the second argument.

MIToS.MSAModule

The MSA module of MIToS has utilities for working with Multiple Sequence Alignments of protein Sequences (MSA).

Features

  • Read and write MSAs in Stockholm, FASTA or Raw format
  • Handle MSA annotations
  • Edit the MSA, e.g. delete columns or sequences, change sequence order, shuffling...
  • Keep track of positions and annotations after modifications on the MSA
  • Describe a MSA, e.g. mean percent identity, sequence coverage, gap percentage...
using MIToS.MSA
MIToS.MSA.GAPConstant

GAP is the Residue representation on MIToS for gaps ('-', insertions and deletions). Lowercase residue characters, dots and '*' are encoded as GAP in conversion from Strings and Chars. This Residue constant is encoded as Residue(21).

MIToS.MSA.XAAConstant

XAA is the Residue representation for unknown, ambiguous and non standard residues. This Residue constant is encoded as Residue(22).

MIToS.MSA._max_charConstant

'z' is the maximum between 'A':'Z', 'a':'z', '.', '-' and '*'. 'z' is 'GAP' but the next character to 'z' is '{', i.e. XAA.

MIToS.MSA.AbstractAlignedObjectType

MIToS MSA and aligned sequences (aligned objects) are subtypes of AbstractMatrix{Residue}, because MSAs and sequences are stored as Matrix of Residues.

MIToS.MSA.AlignedSequenceType

An AlignedSequence wraps a NamedResidueMatrix{Array{Residue,2}} with only 1 row/sequence. The NamedArray stores the sequence name and original column numbers as Strings.

MIToS.MSA.AnnotatedMultipleSequenceAlignmentType

This type represent an MSA, similar to MultipleSequenceAlignment, but It also stores Annotations. This annotations are used to store residue coordinates (i.e. mapping to UniProt residue numbers).

MIToS.MSA.AnnotationsType

The Annotations type is basically a container for Dicts with the annotations of a multiple sequence alignment. Annotations was designed for storage of annotations of the Stockholm format.

MIToS also uses MSA annotations to keep track of:

  • Modifications of the MSA (MIToS_...) as deletion of sequences or columns.
  • Positions numbers in the original MSA file (column mapping: ColMap)
  • Position of the residues in the sequence (sequence mapping: SeqMap)
MIToS.MSA.ClustersType

Data structure to represent sequence clusters. The sequence data itself is not included.

MIToS.MSA.GappedAlphabetType

This type defines the usual alphabet of the 20 natural residues and a gap character.

julia> using MIToS.MSA

julia> GappedAlphabet()
GappedAlphabet of length 21. Residues : res"ARNDCQEGHILKMFPSTWYV-"
MIToS.MSA.MultipleSequenceAlignmentType

This MSA type include a NamedArray wrapping a Matrix of Residues. The use of NamedArray allows to store sequence names and original column numbers as Strings, and fast indexing using them.

MIToS.MSA.NoClusteringType

Use NoClustering() to avoid the use of clustering where a Clusters type is needed.

MIToS.MSA.ReducedAlphabetType

ReducedAlphabet allows the construction of reduced residue alphabets, where residues inside parenthesis belong to the same group.

julia> using MIToS.MSA

julia> ab = ReducedAlphabet("(AILMV)(RHK)(NQST)(DE)(FWY)CGP")
ReducedAlphabet of length 8 : "(AILMV)(RHK)(NQST)(DE)(FWY)CGP"

julia> ab[Residue('K')]
2
MIToS.MSA.ResidueType

Most of the MIToS design is created around the Residue bitstype. It has representations for the 20 natural amino acids, a value representing insertions and deletions (GAP, '-') and one representing unknown, ambiguous and non standard residues (XAA, 'X'). Each Residue is encoded as an integer number, with the same bit representation and size than a Int. This allows fast indexing operation of probability or frequency matrices.

Residue creation and conversion

Creation and conversion of Residues should be treated carefully. Residue is encoded as a 32 or 64 bits type similar to Int, to get fast indexing using Int(x::Residue). Int simply calls reinterpret without checking if the residue is valid. Valid residues have integer values in the closed interval [1,22]. convert from Int and Char always returns valid residues, however it's possible to find invalid residues (they are shown using the character '�') after the creation of uninitialized Residue arrays (i.e. using Array). You can use zeros, ones or rand to get initialized Residue arrays with valid residues. Conversions to and from Chars changes the bit representation and allows the use of the usual character representation of residues and amino acids. This conversions are used in IO operations and always return valid residues. In conversions from Char, lowercase letters, '*', '-' and '.' are translated to GAP, letters representing the 20 natural amino (ARNDCQEGHILKMFPSTWYV) acids are translated to their corresponding Residue and any other character is translated to XAA. Since lowercase letters and dots are translated to gaps, Pfam MSA insert columns are converted to columns full of gaps.

julia> using MIToS.MSA

julia> alanine = Residue('A')
A

julia> Char(alanine)
'A': ASCII/Unicode U+0041 (category Lu: Letter, uppercase)

julia> for residue in res"ARNDCQEGHILKMFPSTWYV-X"
           println(residue, " ", Int(residue))
       end
A 1
R 2
N 3
D 4
C 5
Q 6
E 7
G 8
H 9
I 10
L 11
K 12
M 13
F 14
P 15
S 16
T 17
W 18
Y 19
V 20
- 21
X 22
MIToS.MSA.UngappedAlphabetType

This type defines the usual alphabet of the 20 natural residues, without the gap character.

julia> using MIToS.MSA

julia> UngappedAlphabet()
UngappedAlphabet of length 20. Residues : res"ARNDCQEGHILKMFPSTWYV"
Base.isvalidMethod

isvalid(res::Residue)

It returns true if the encoded integer is in the closed interval [1,22].

Base.joinMethod
join(msa_a::AnnotatedMultipleSequenceAlignment, 
     msa_b::AnnotatedMultipleSequenceAlignment, 
     pairing; 
     kind::Symbol=:outer, 
     axis::Int=1)::AnnotatedMultipleSequenceAlignment

join(msa_a::AnnotatedMultipleSequenceAlignment, 
     msa_b::AnnotatedMultipleSequenceAlignment, 
     positions_a, 
     positions_b; 
     kind::Symbol=:outer, 
     axis::Int=1)::AnnotatedMultipleSequenceAlignment

Join two Multiple Sequence Alignments (MSAs), msa_a and msa_b, based on specified matching positions or names. The function supports two formats: one takes a pairing argument as a list of correspondences, and the other takes positions_a and positions_b as separate lists indicating matching positions or names in each MSA. This function allows for various types of join operations (:inner, :outer, :left, :right) and can merge MSAs by sequences (axis 1) or by columns (axis 2).

Parameters:

  • msa_a::AnnotatedMultipleSequenceAlignment: The first MSA.
  • msa_b::AnnotatedMultipleSequenceAlignment: The second MSA.
  • pairing: An iterable where each element is a pair of sequence or column positions (Ints) or names (Strings) to match between msa_a and msa_b. For example, it can be a list of two-element tuples or pairs, or and OrderedDict.
  • positions_a, positions_b: Separate lists of positions or names in msa_a and msa_b, respectively.
  • kind::Symbol: Type of join operation. Default is :outer.
  • axis::Int: The axis along which to join (1 to match sequences, 2 to match columns).

Returns:

  • AnnotatedMultipleSequenceAlignment: A new MSA resulting from the join operation.

Behavior and Sequence Ordering:

The order of sequences or columns in the resulting MSA depends on the kind of join operation and the order of elements in the pairing or positions_a and positions_b lists.

  • For :inner joins, the function returns an MSA containing only those sequences/columns that are paired in both msa_a and msa_b. The order of elements in the output MSA follows the order in the pairing or position lists.
  • For :outer joins, the output MSA includes all sequences/columns from both msa_a and msa_b. Unpaired sequences/columns are filled with gaps as needed. The sequences/columns from msa_a are placed first. If the pairing or position lists are sorted, the output MSA columns and sequences will keep the same order as in the inputs. That's nice for situations such as profile alignments where the order of columns is important. If the pairing or position lists are not sorted, then the order of sequences/columns in the output MSA is not guaranteed to be the same as in the inputs. In particular, the matched sequences or columns will be placed first, followed by the unmatched ones.
  • For :left joins, all sequences/columns from msa_a are included in the output MSA keeping the same order as in msa_a. Sequences/columns from msa_b are added where matches are found, with gaps filling the unmatched positions.
  • For :right joins, the output MSA behaves like :left joins but with roles of msa_a and msa_b reversed.

Warning: When using Dict for pairing, the order of elements might not be preserved as expected. Dict in Julia does not maintain the order of its elements, which might lead to unpredictable order of sequences/columns in the output MSA. To preserve order, it is recommended to use an OrderedDict or a list of Pairs objects.

Base.merge!Method
merge!(target::Annotations, sources::Annotations...)

Merge one or more source Annotations into a target Annotations. This function calls Base.merge! on each of the four dictionaries in the Annotations type: file, sequences, columns, and residues. Consequently, it behaves like Base.merge! for dictionaries; if the same key exists in different Annotations objects, the value from the last one is used.

NOTE: This function does not check for consistency among the various Annotations. For example, it does not verify that SeqMap annotations have consistent lengths.

Base.mergeMethod
merge(target::Annotations, sources::Annotations...)

Create a new Annotations object by merging two or more Annotations. If the same annotation exists in different Annotations objects, the value from the last one is used. See also merge!.

NOTE: This function does not check for consistency among the various Annotations. For example, it does not verify that SeqMap annotations have consistent lengths.

Base.namesMethod

It returns the name of each group. The name is a string with the one letter code of each residue that belong to the group.

julia> using MIToS.MSA

julia> ab = ReducedAlphabet("(AILMV)(RHK)(NQST)(DE)(FWY)CGP")
ReducedAlphabet of length 8 : "(AILMV)(RHK)(NQST)(DE)(FWY)CGP"

julia> names(ab)
8-element Vector{String}:
 "AILMV"
 "RHK"
 "NQST"
 "DE"
 "FWY"
 "C"
 "G"
 "P"
Base.parseFunction

parse(io, format[, output; generatemapping, useidcoordinates, deletefullgaps])

The keyword argument generatemapping (false by default) indicates if the mapping of the sequences ("SeqMap") and columns ("ColMap") and the number of columns in the original MSA ("NCol") should be generated and saved in the annotations. If useidcoordinates is true (default: false) the sequence IDs of the form "ID/start-end" are parsed and used for determining the start and end positions when the mappings are generated. deletefullgaps (true by default) indicates if columns 100% gaps (generally inserts from a HMM) must be removed from the MSA.

Base.randMethod

It chooses from the 20 natural residues (it doesn't generate gaps).

julia> using MIToS.MSA

julia> using Random

julia> Random.seed!(1); # Reseed the random number generator.

julia> rand(Residue)
R

julia> rand(Residue, 4, 4)
4×4 Matrix{Residue}:
 E  D  D  A
 F  S  K  K
 M  S  I  M
 Y  F  E  D
Clustering.assignmentsMethod

Get a vector of assignments, where the i value is the index/number of the cluster to which the i-th sequence is assigned.

MIToS.MSA._add_gaps_in_bFunction
_add_gaps_in_b(msa_a, msa_b, positions_a, positions_b, axis::Int=1)

This is an helper function for the :left and :right joins. It aligns msa_b with msa_a by adding gaps to msa_b. Only the sequences or columns in msa_b that match the positions in msa_a are kept. Gaps are inserted in msa_b at positions where msa_a has sequences or columns that are not matched in msa_b. Therefore, we can think of msa_a as the reference MSA and msa_b as the MSA to be aligned with the reference.

The positions_a and positions_b arguments define corresponding positions in msa_a and msa_b, respectively, for accurate gap insertion. The function returns a modified msa_b with gaps inserted to align with msa_a.

The axis argument (1 for sequences, 2 for columns) determines whether gaps are inserted as sequences or columns.

Example

aligned_msa_b = _add_gaps_in_b(msa_a, msa_b, [1, 2, 3], [2, 3, 4], 2) # axis = 2
MIToS.MSA._clean_sequence_mapping!Method

At this moment, the concatenateannotsequence function does no check for SeqMap annotations. Therefore, if an MSA does not have a SeqMap annotation, the concatenated SeqMap annotation is corrupted. To avoid this, we will delete the SeqMap annotation whose length is not equal to the length of the concatenated MSA.

MIToS.MSA._fill_hobohmI!Method

Fill cluster and clustersize matrices. These matrices are assumed to be empty (only zeroes) and their length is assumed to be equal to the number of sequences in the alignment (aln). threshold is the minimum identity value between two sequences to be in the same cluster.

MIToS.MSA._find_gapsMethod
_find_gaps(positions, n)

Calculate gaps in a sorted sequence of positions given also a maximum value n that would be added at the end as n+1 if n it is not already present.

This function returns a list of tuples, where each tuple represents a gap in the sequence. The first element of the tuple indicates the position after which the gap will be inserted, and the second element is the position that comes just after the gap. The function accounts for gaps at both the beginning and end of the sequence. A gap is identified as a difference greater than 1 between consecutive positions. To account for gaps at the end, the end position of a gap matching the last position of the sequence is set to n+1.

MIToS.MSA._get_seqid_indexMethod

This function takes a vector of sequence names and a sequence id. It returns the position of that id in the vector. If the id isn't in the vector, It throws an error.

MIToS.MSA._get_sequence_weightMethod

Calculates the weight of each sequence in a cluster. The weight is equal to one divided by the number of sequences in the cluster.

MIToS.MSA._insert_sorted_gapsMethod
_insert_sorted_gaps(msa_target, msa_reference, positions_target, positions_reference, 
	block_position::Symbol=:before, axis::Int=1)

Inserts gap blocks into msa_target to match the alignment pattern of msa_reference. This function is utilized for aligning two MSAs based on specific alignment positions. The positions_target and positions_reference parameters dictate the corresponding positions in both MSAs for accurate gap insertion.

The function returns the modified msa_target with inserted gaps, aligned according to msa_reference.

The block_position keyword argument determines whether the gap blocks are inserted :before (the default) or :after the unmatched sequences or columns. The axis keyword argument determines whether the gap blocks are inserted as sequences (1, the default) or columns (2).

Example

gapped_msa_a = _insert_sorted_gaps(msa_a, msa_b, positions_a, positions_b)
MIToS.MSA._keepinserts!Method

Function to keep insert columns in parse. It uses the first sequence to generate the "Aligned" annotation, and after that, convert all the characters to uppercase.

MIToS.MSA._rename_sequencesMethod
_rename_sequences(annotations::Annotations, old2new::Dict{String,String})

Renames sequences in a given Annotations object based on a mapping provided in old2new.

This function iterates through residue-level and sequence-level annotations in the provided Annotations object. For each annotation, it checks if the sequence name exists in the old2new dictionary. If so, the sequence name is updated to the new name from old2new; otherwise, the original sequence name is retained.

The function then returns a new Annotations object with updated sequence names.

MIToS.MSA._renumber_sequence_gapsMethod
_renumber_sequence_gaps(msa::AnnotatedMultipleSequenceAlignment)

Renumbers the gap sequences in a given multiple sequence alignment (MSA) object and returns a new MSA object with updated sequence names.

This function specifically targets sequences within the MSA that are named with the prefix "gap:". It assigns a new sequential number to each gap sequence, ensuring a unique and ordered naming convention (e.g., "gap:1", "gap:2", etc.).

Therefore, this function is useful for renumbering gap sequences in an MSA where gap blocks have been inserted from the end to the beginning. This insertion order can lead to non-sequential numbering of the gap sequences. For example, in the case of two gap blocks, where one block contains a single sequence and the other contains two, the sequences might originally be numbered as "gap:3", "gap:1", "gap:2". This function renumbers them sequentially so that they are numbered as "gap:1", "gap:2", "gap:3".

MIToS.MSA._str2int_mappingMethod

Converts a string of mappings into a vector of Ints

julia> using MIToS.MSA

julia> MSA._str2int_mapping(",,2,,4,5")
6-element Vector{Int64}:
 0
 0
 2
 0
 4
 5
MIToS.MSA._swap!Method

It swaps the names on the positions i and j of a Vector{String}

MIToS.MSA._v_concatenated_seq_namesMethod

If returns a vector of sequence names for the vertically concatenated MSA. The prefix is the number associated to the source MSA. If the sequence name has already a number as prefix, the MSA number is increased accordingly.

MIToS.MSA._valid_residue_integerMethod

It takes an Int and returns the Int value used to represent a valid Residue. Invalid residues are encoded as the integer 23.

MIToS.MSA.adjustreferenceFunction

Creates a new matrix of residues. This function deletes positions/columns of the MSA with gaps in the reference (first) sequence.

MIToS.MSA.annotate_modification!Method

Annotates on file annotations the modifications realized by MIToS on the MSA. It always returns true, so It can be used in a boolean context.

MIToS.MSA.annotationsMethod

The annotations function returns the Annotations of an annotated MSA or aligned sequence. If the object is not annotated, it returns an empty Annotations object.

MIToS.MSA.column_indexMethod
column_index(msa, col_name)

Return the index (integer position) of the column with name col_name in the MSA msa. A KeyError is thrown if the column name does not exist. If col_name is an integer, the same integer is returned without checking if it is a valid index.

MIToS.MSA.columnname_iteratorMethod

columnname_iterator(msa)

It returns an iterator that returns the column names of the msa. If the msa is a Matrix{Residue} this function returns the actual column numbers as strings. Otherwise it returns the column number of the original MSA through the wrapped NamedArray column names.

MIToS.MSA.columnnamesMethod

columnnames(msa)

It returns a Vector{String} with the sequence names/identifiers. If the msa is a Matrix{Residue} this function returns the actual column numbers as strings. Otherwise it returns the column number of the original MSA through the wrapped NamedArray column names.

MIToS.MSA.columnpairsmatrixMethod

Initialize an empty PairwiseListMatrix for a pairwise measure in sequence pairs. It uses the sequence names if they are available, otherwise it uses the actual sequence numbers. You can use the positional argument to indicate the number Type (default: Float64), if the PairwiseListMatrix should store the diagonal values on the list (default: false) and a default value for the diagonal (default: NaN).

MIToS.MSA.coverageMethod

Coverage of the sequences with respect of the number of positions on the MSA

MIToS.MSA.filtercolumns!Function

filtercolumns!(msa, mask[, annotate::Bool=true])

It allows to filter MSA or aligned sequence columns/positions using a AbstractVector{Bool} mask. Annotations are updated if annotate is true (default).

MIToS.MSA.filtercolumns!Method

filtercolumns!(data::Annotations, mask)

It is useful for deleting column annotations (creating a subset in place).

MIToS.MSA.filtersequences!Function

filtersequences!(msa, mask[, annotate::Bool=true])

It allows to filter msa sequences using a AbstractVector{Bool} mask (It removes sequences with false values). AnnotatedMultipleSequenceAlignment annotations are updated if annotate is true (default).

MIToS.MSA.filtersequences!Method

filtersequences!(data::Annotations, ids::Vector{String}, mask::AbstractArray{Bool,1})

It is useful for deleting sequence annotations. ids should be a list of the sequence names and mask should be a logical vector.

MIToS.MSA.gapfractionMethod

It calculates the fraction of gaps on the Array (alignment, sequence, column, etc.). This function can take an extra dimension argument for calculation of the gap fraction over the given dimension.

MIToS.MSA.gapstrip!Function

This functions deletes/filters sequences and columns/positions on the MSA on the following order:

  1. Removes all the columns/position on the MSA with gaps on the reference (first) sequence.
  2. Removes all the sequences with a coverage with respect to the number of

columns/positions on the MSA less than a coveragelimit (default to 0.75: sequences with 25% of gaps).

  1. Removes all the columns/position on the MSA with more than a gaplimit

(default to 0.5: 50% of gaps).

MIToS.MSA.gapstripMethod

Creates a new matrix of Residues (MSA) with deleted sequences and columns/positions. The MSA is edited in the following way:

  1. Removes all the columns/position on the MSA with gaps on the reference (first) sequence
  2. Removes all the sequences with a coverage with respect to the number of

columns/positions on the MSA less than a coveragelimit (default to 0.75: sequences with 25% of gaps)

  1. Removes all the columns/position on the MSA with more than a gaplimit

(default to 0.5: 50% of gaps)

MIToS.MSA.getannotcolumnFunction

getannotcolumn(ann[, feature[,default]])

It returns per column annotation for feature

MIToS.MSA.getannotfileFunction

getannotfile(ann[, feature[,default]])

It returns per file annotation for feature

MIToS.MSA.getannotresidueFunction

getannotresidue(ann[, seqname, feature[,default]])

It returns per residue annotation for (seqname, feature)

MIToS.MSA.getannotsequenceFunction

getannotsequence(ann[, seqname, feature[,default]])

It returns per sequence annotation for (seqname, feature)

MIToS.MSA.getcolumnmappingMethod

It returns a Vector{Int} with the original column number of each column on the actual MSA. The mapping is annotated in the ColMap file annotation of an AnnotatedMultipleSequenceAlignment or in the column names of an NamedArray or MultipleSequenceAlignment.

NOTE: When the MSA results from vertically concatenating MSAs using vcat, the column map annotations from the constituent MSAs (such as 1_ColMap, 2_ColMap, etc.) are not returned. Instead, the column numbers referenced in the column names are provided. To access the original annotations, utilize the getannotfile function.

MIToS.MSA.gethcatmappingMethod

It returns a vector of numbers from 1 to N for each column that indicates the source MSA. The mapping is annotated in the "HCat" file annotation of an AnnotatedMultipleSequenceAlignment or in the column names of an NamedArray or MultipleSequenceAlignment.

NOTE: When the MSA results from vertically concatenating MSAs using vcat, the "HCat" annotations from the constituent MSAs are renamed as "1_HCat", "2_HCat", etc. In that case, the MSA numbers referenced in the column names are provided. To access the original annotations, utilize the getannotfile function.

MIToS.MSA.getnamedictMethod

It takes a ResidueAlphabet and returns a dictionary from group name to group position.

julia> using MIToS.MSA

julia> ab = ReducedAlphabet("(AILMV)(RHK)(NQST)(DE)(FWY)CGP")
ReducedAlphabet of length 8 : "(AILMV)(RHK)(NQST)(DE)(FWY)CGP"

julia> getnamedict(ab)
OrderedCollections.OrderedDict{String, Int64} with 8 entries:
  "AILMV" => 1
  "RHK"   => 2
  "NQST"  => 3
  "DE"    => 4
  "FWY"   => 5
  "C"     => 6
  "G"     => 7
  "P"     => 8
MIToS.MSA.getresiduesMethod

getresidues allows you to access the residues stored inside an MSA or aligned sequence as a Matrix{Residue} without annotations nor column/row names.

MIToS.MSA.getresiduesequencesMethod

getresiduesequences returns a Vector{Vector{Residue}} with all the MSA sequences without annotations nor column/sequence names.

MIToS.MSA.getsequenceFunction

getsequence takes an MSA and a sequence number or identifier and returns an aligned sequence object. If the MSA is an AnnotatedMultipleSequenceAlignment, it returns an AnnotatedAlignedSequence with the sequence annotations. From a MultipleSequenceAlignment, It returns an AlignedSequence object. If an Annotations object and a sequence identifier are used, this function returns the annotations related to the sequence.

MIToS.MSA.getsequencemappingMethod

It returns the sequence coordinates as a Vector{Int} for an MSA sequence. That vector has one element for each MSA column. If the number if 0 in the mapping, there is a gap in that column for that sequence.

MIToS.MSA.getweightMethod

getweight(c[, i::Int])

This function returns the weight of the sequence number i. getweight should be defined for any type used for count!/count in order to use his weigths. If i isn't used, this function returns a vector with the weight of each sequence.

MIToS.MSA.hobohmIMethod

Sequence clustering using the Hobohm I method from Hobohm et. al. 1992.

Hobohm, Uwe, et al. "Selection of representative protein data sets." Protein Science 1.3 (1992): 409-417.

MIToS.MSA.meanpercentidentityFunction

Returns the mean of the percent identity between the sequences of a MSA. If the MSA has 300 sequences or less, the mean is exact. If the MSA has more sequences and the exact keyword is false (defualt), 44850 random pairs of sequences are used for the estimation. The number of samples can be changed using the second argument. Use exact=true to perform all the pairwise comparison (the calculation could be slow).

MIToS.MSA.namedmatrixMethod

The namedmatrix function returns the NamedResidueMatrix{Array{Residue,2}} stored in an MSA or aligned sequence.

MIToS.MSA.ncolumnsMethod

ncolumns(ann::Annotations) returns the number of columns/residues with annotations. This function returns -1 if there is not annotations per column/residue.

MIToS.MSA.percentidentityMethod

percentidentity(seq1, seq2, threshold)

Computes quickly if two aligned sequences have a identity value greater than a given threshold value. Returns a boolean value. Positions with gaps in both sequences doesn't count to the length of the sequences. Positions with a XAA in at least one sequence aren't counted.

MIToS.MSA.percentidentityMethod

percentidentity(seq1, seq2)

Calculates the fraction of identities between two aligned sequences. The identity value is calculated as the number of identical characters in the i-th position of both sequences divided by the length of both sequences. Positions with gaps in both sequences doesn't count to the length of the sequences. Positions with a XAA in at least one sequence aren't counted.

MIToS.MSA.percentidentityMethod

percentidentity(msa[, out::Type=Float64])

Calculates the identity between all the sequences on a MSA. You can indicate the output element type with the last optional parameter (Float64 by default). For a MSA with a lot of sequences, you can use Float32 or Flot16 in order to avoid the OutOfMemoryError().

MIToS.MSA.percentsimilarityFunction

Calculates the similarity percent between two aligned sequences. The 100% is the length of the aligned sequences minus the number of columns with gaps in both sequences and the number of columns with at least one residue outside the alphabet. So, columns with residues outside the alphabet (other than the specially treated GAP) aren't counted to the protein length. Two residues are considered similar if they below to the same group in a ReducedAlphabet. The alphabet (third positional argument) by default is:

ReducedAlphabet("(AILMV)(NQST)(RHK)(DE)(FWY)CGP")

The first group is composed of the non polar residues (AILMV), the second group is composed of polar residues, the third group are positive residues, the fourth group are negative residues, the fifth group is composed by the aromatic residues (FWY). C, G and P are considered unique residues.

Other residue groups/alphabets:

SMS (Sequence Manipulation Suite) Ident and Sim:

ReducedAlphabet("(GAVLI)(FYW)(ST)(KRH)(DENQ)P(CM)")

Stothard P (2000) The Sequence Manipulation Suite: JavaScript programs for analyzing and formatting protein and DNA sequences. Biotechniques 28:1102-1104.

Bio3D 2.2 seqidentity:

ReducedAlphabet("(GA)(MVLI)(FYW)(ST)(KRH)(DE)(NQ)PC")

Grant, B.J. et al. (2006) Bioinformatics 22, 2695–2696.

MIToS.MSA.percentsimilarityMethod

Calculates the similarity percent between all the sequences on a MSA. You can indicate the output element type with the out keyword argument (Float64 by default). For an MSA with a lot of sequences, you can use out=Float32 or out=Flot16 in order to avoid the OutOfMemoryError().

MIToS.MSA.rename_sequences!Method
rename_sequences!(msa, newnames)
rename_sequences!(msa, old2new::AbstractDict)
rename_sequences!(msa, old2new::Pair...)

Rename the sequences of an MSA given a vector of new names, a dictionary mapping old names to new names, or one or more pairs going from old to new names. If the msa is an AnnotatedMultipleSequenceAlignment, the annotations are also updated. The function modifies the msa in place and returns it.

MIToS.MSA.rename_sequencesMethod
rename_sequences(msa, newnames)
rename_sequences(msa, old2new::AbstractDict)
rename_sequences(msa, old2new::Pair...)

Rename the sequences of an MSA given a vector of new names, a dictionary mapping old names to new names, or one or more pairs going from old to new names. If the msa is an AnnotatedMultipleSequenceAlignment, the annotations are also updated. The function returns a new MSA with the sequences renamed without modifying the original MSA.

MIToS.MSA.residue2threeMethod

This function returns the three letter name of the Residue.

julia> using MIToS.MSA

julia> residue2three(Residue('G'))
"GLY"
MIToS.MSA.residuefractionMethod

It calculates the fraction of residues (no gaps) on the Array (alignment, sequence, column, etc.). This function can take an extra dimension argument for calculation of the residue fraction over the given dimension

MIToS.MSA.sequence_indexMethod
sequence_index(msa, seq_name)

Return the index (integer position) of the sequence with name seq_name in the MSA msa. A KeyError is thrown if the sequence name does not exist. If seq_name is an integer, the same integer is returned without checking if it is a valid index.

MIToS.MSA.sequencenamesMethod

sequencenames(msa)

It returns a Vector{String} with the sequence names/identifiers.

MIToS.MSA.sequencepairsmatrixMethod

Initialize an empty PairwiseListMatrix for a pairwise measure in column pairs. It uses the column mapping (column number in the input MSA file) if it’s available, otherwise it uses the actual column numbers. You can use the positional argument to indicate the number Type (default: Float64), if the PairwiseListMatrix should store the diagonal values on the list (default: false) and a default value for the diagonal (default: NaN).

MIToS.MSA.setannotcolumn!Function

setannotcolumn!(ann, feature, annotation)

It stores per column annotation (1 char per column) for feature

MIToS.MSA.setannotfile!Function

setannotfile!(ann, feature, annotation)

It stores per file annotation for feature

MIToS.MSA.setannotresidue!Function

setannotresidue!(ann, seqname, feature, annotation)

It stores per residue annotation (1 char per residue) for (seqname, feature)

MIToS.MSA.setannotsequence!Function

setannotsequence!(ann, seqname, feature, annotation)

It stores per sequence annotation for (seqname, feature)

MIToS.MSA.setreference!Function

It puts the sequence i (name or position) as reference (first sequence) of the MSA. This function swaps the sequences 1 and i.

MIToS.MSA.stringsequenceMethod
stringsequence(seq)
stringsequence(msa, i::Int)
stringsequence(msa, id::String)

It returns the selected sequence as a String.

MIToS.MSA.swapsequences!Method

It swaps the sequences on the positions i and j of an MSA. Also it's possible to swap sequences using their sequence names/identifiers when the MSA object as names.

MIToS.MSA.three2residueMethod

It takes a three letter residue name and returns the corresponding Residue. If the name isn't in the MIToS dictionary, a XAA is returned.

julia> using MIToS.MSA

julia> three2residue("ALA")
A
Random.shuffle!Function

It's like Random.shuffle. When a Matrix{Residue} is used, you can indicate if the gaps should remain their positions using the last boolean argument. The previous argument should be the dimension to shuffle, 1 for shuffling residues in a sequence (row) or 2 for shuffling residues in a column.

julia> using MIToS.MSA

julia> using Random

julia> msa = hcat(res"RRE",res"DDK", res"G--")
3×3 Matrix{Residue}:
 R  D  G
 R  D  -
 E  K  -

julia> Random.seed!(42);

julia> shuffle(msa, 1, true)
3×3 Matrix{Residue}:
 G  D  R
 R  D  -
 E  K  -

julia> Random.seed!(42);

julia> shuffle(msa, 1, false)
3×3 Matrix{Residue}:
 G  D  R
 R  -  D
 E  K  -
Random.shuffleMethod

It's like shuffle but in-place. When a Matrix{Residue} or a AbstractAlignedObject (sequence or MSA) is used, you can indicate if the gaps should remain their positions using the last boolean argument.

StatsBase.countsMethod

Get sample counts of clusters as a Vector. Each k value is the number of samples assigned to the k-th cluster.

MIToS.MSA.@res_strMacro

The MIToS macro @res_str takes a string and returns a Vector of Residues (sequence).

julia> using MIToS.MSA

julia> res"MIToS"
5-element Vector{Residue}:
 M
 I
 T
 -
 S
MIToS.PfamModule

The Pfam module, defines functions to measure the protein contact prediction performance of information measure between column pairs from a Pfam MSA.

Features

  • Read and download Pfam MSAs
  • Obtain PDB information from alignment annotations
  • Map between sequence/alignment residues/columns and PDB structures
  • Measure of AUC (ROC curve) for contact prediction of MI scores
using MIToS.Pfam
MIToS.Pfam.downloadpfamMethod

It downloads a gzipped Stockholm alignment from InterPro for the Pfam family with the given pfamcode.

By default, it downloads the full Pfam alignment. You can use the alignment keyword argument to download the seed or the uniprot alignment instead. For example, downloadpfam("PF00069") will download the full alignment for the PF00069 Pfam family, while downloadpfam("PF00069", alignment="seed") will download the seed alignment of the family.

The extension of the downloaded file is .stockholm.gz by default; you can change it using the filename keyword argument, but the .gz at the end is mandatory.

MIToS.Pfam.getcontactmasksMethod

This function takes a msacontacts or its list of contacts contact_list with 1.0 for true contacts and 0.0 for not contacts (NaN or other numbers for missing values). Returns two BitVectors, the first with trues where contact_list is 1.0 and the second with trues where contact_list is 0.0. There are useful for AUC calculations.

MIToS.Pfam.getseq2pdbMethod

Generates from a Pfam msa a Dict{String, Vector{Tuple{String,String}}}. Keys are sequence IDs and each value is a list of tuples containing PDB code and chain.

julia> getseq2pdb(msa)
Dict{String,Array{Tuple{String,String},1}} with 1 entry:
  "F112_SSV1/3-112" => [("2VQC","A")]
MIToS.Pfam.msacolumn2pdbresidueMethod

msacolumn2pdbresidue(msa, seqid, pdbid, chain, pfamid, siftsfile; strict=false, checkpdbname=false, missings=true)

This function returns a OrderedDict{Int,String} with MSA column numbers on the input file as keys and PDB residue numbers ("" for missings) as values. The mapping is performed using SIFTS. This function needs correct ColMap and SeqMap annotations. This checks correspondence of the residues between the MSA sequence and SIFTS (It throws a warning if there are differences). Missing residues are included if the keyword argument missings is true (default: true). If the keyword argument strict is true (default: false), throws an Error, instead of a Warning, when residues don't match. If the keyword argument checkpdbname is true (default: false), throws an Error if the three letter name of the PDB residue isn't the MSA residue. If you are working with a downloaded Pfam MSA without modifications, you should read it using generatemapping=true and useidcoordinates=true. If you don't indicate the path to the siftsfile used in the mapping, this function downloads the SIFTS file in the current folder. If you don't indicate the Pfam accession number (pfamid), this function tries to read the AC file annotation.

MIToS.Pfam.msacontactsFunction

This function takes an AnnotatedMultipleSequenceAlignment with correct ColMap annotations and two dicts:

  1. The first is an OrderedDict{String,PDBResidue} from PDB residue number to PDBResidue.
  2. The second is a Dict{Int,String} from MSA column number on the input file to PDB residue number.

msacontacts returns a PairwiseListMatrix{Float64,false} of 0.0 and 1.0 where 1.0 indicates a residue contact. Contacts are defined with an inter residue distance less or equal to distance_limit (default to 6.05) angstroms between any heavy atom. NaN indicates a missing value.

MIToS.Pfam.msaresiduesMethod

This function takes an AnnotatedMultipleSequenceAlignment with correct ColMap annotations and two dicts:

  1. The first is an OrderedDict{String,PDBResidue} from PDB residue number to PDBResidue.
  2. The second is a Dict{Int,String} from MSA column number on the input file to PDB residue number.

msaresidues returns an OrderedDict{Int,PDBResidue} from input column number (ColMap) to PDBResidue. Residues on inserts are not included.