MIToS.Utils
— ModuleThe Utils module defines common utility functions and types used in other modules.
using MIToS.Utils
MIToS.Utils.THREE2ONE
— ConstantTHREE2ONE
is a dictionary that maps three-letter amino acid residue codes (String
) to their corresponding one-letter codes (Char
). The dictionary is generated by parsing components.cif
file from the Protein Data Bank.
julia> using MIToS.Utils
julia> one_letter_code = THREE2ONE["ALA"]
'A': ASCII/Unicode U+0041 (category Lu: Letter, uppercase)
MIToS.Utils.All
— TypeAll is used instead of MIToS 1.0 "all" or "*", because it's possible to dispatch on it.
MIToS.Utils.FileFormat
— TypeFileFormat is used to write special parse (and read) methods on it.
Base.read
— Methodread(pathname, FileFormat [, Type [, … ] ] ) -> Type
This function opens the file at pathname and calls parse(io, ...) on it for the given FileFormat and Type. If the pathname is an HTTP or FTP URL, the file is downloaded with download to a temporary file. Gzipped files should end in .gz.
Base.write
— Methodwrite{T<:FileFormat}(filename::AbstractString, object, format::Type{T}, mode::ASCIIString="w")
This function opens a file with filename and mode (default: "w") and writes (print) the object with the given format. Gzipped files should end in .gz.
MIToS.Utils._check_gzip_file
— MethodThis function raises an error if a GZip file doesn't have the 0x1f8b magic number.
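As an illustration only (this is not the MIToS implementation), the magic-number check can be sketched in plain Julia. The helper name has_gzip_magic is hypothetical; it reads the first two bytes from any IO object and compares them with 0x1f and 0x8b.

```julia
# Hypothetical helper, not part of MIToS: true if the stream starts with
# the two-byte gzip magic number 0x1f 0x8b.
function has_gzip_magic(io::IO)
    bytes = read(io, 2)  # reads at most two bytes
    return length(bytes) == 2 && bytes[1] == 0x1f && bytes[2] == 0x8b
end

has_gzip_magic(IOBuffer(UInt8[0x1f, 0x8b, 0x08]))  # gzip-like header
has_gzip_magic(IOBuffer("plain text"))             # not gzip
```

A real check would typically raise an error instead of returning false, as the docstring above describes.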
MIToS.Utils._modify_kargs_for_proxy
— MethodHelper function that modifies the keyword arguments to include a proxy. The proxy URL is taken from the HTTP_PROXY and HTTPS_PROXY environment variables.
MIToS.Utils.check_file
— MethodReturns the filename. Throws an ErrorException if the file doesn't exist, or emits a warning if the file is empty.
MIToS.Utils.check_pdbcode
— MethodIt checks if a PDB code has the correct format.
MIToS.Utils.download_file
— Methoddownload_file uses HTTP.jl to download files from the web. It takes the file URL as first argument and, optionally, a path to save it. Keyword arguments are passed directly to HTTP.download (HTTP.request). Use the headers keyword argument to pass a Dict{String,String} with the header information. Set the HTTP_PROXY and HTTPS_PROXY environment variables (ENV) if you are behind a proxy.
julia> using MIToS.Utils
julia> download_file("http://www.uniprot.org/uniprot/P69905.fasta","seq.fasta",
headers = Dict("User-Agent" =>
"Mozilla/5.0 (compatible; MSIE 7.01; Windows NT 5.0)"))
"seq.fasta"
MIToS.Utils.get_n_words
— Methodget_n_words{T <: Union{ASCIIString, UTF8String}}(line::T, n::Int)
It returns a Vector{T} with the first n (possible) words/fields (delimited by space or tab). If there are more than n words, the last word returned contains the remaining words and their delimiters. The length of the returned vector is n or less (if the number of words is less than n). This is used for parsing the Stockholm format.
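The behaviour described above can be approximated in plain Julia with Base.split and its limit keyword. This is only a sketch: first_n_words is a hypothetical name, and treating a run of spaces or tabs as a single delimiter is an assumption, not necessarily what MIToS does.

```julia
# Hypothetical helper, not the MIToS implementation: split a line into at most
# n fields on runs of spaces or tabs; the last field keeps the rest of the line.
first_n_words(line::AbstractString, n::Int) = split(line, r"[ \t]+"; limit = n)

first_n_words("#=GS O31698/18-71 AC O31698", 3)
```

With limit = 3 the third field holds everything after the second delimiter, including any inner spaces.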
julia> using MIToS.Utils
julia> get_n_words("#=GR O31698/18-71 SS CCCHHHHHHHHHHHHHHHEEEEEEEEEEEEEEEEHHH", 3)
3-element Vector{String}:
"#=GR"
"O31698/18-71"
"SS CCCHHHHHHHHHHHHHHHEEEEEEEEEEEEEEEEHHH"
MIToS.Utils.getarray
— MethodGetter for the array
field of NamedArray
s
MIToS.Utils.hascoordinates
— Methodhascoordinates(id)
It returns true
if id
/sequence name has the format: UniProt/start-end (i.e. O83071/192-246)
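A minimal sketch of such a format check, assuming a regular expression is enough. The helper name looks_like_coordinates and the exact pattern are illustrative assumptions, not taken from MIToS.

```julia
# Hypothetical helper: true for names such as "O83071/192-246",
# i.e. an identifier, a slash, and a start-end pair of integers.
looks_like_coordinates(id::AbstractString) = occursin(r"^\w+/\d+-\d+$", id)

looks_like_coordinates("O83071/192-246")
looks_like_coordinates("O83071")
```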
MIToS.Utils.isnotemptyfile
— MethodReturns true
if the file exists and isn't empty.
MIToS.Utils.lineiterator
— MethodCreate an iterable object that will yield each line from a stream or string.
MIToS.Utils.list2matrix
— MethodReturns a square symmetric matrix from the vector vec. side is the number of rows/columns. The diagonal is not included by default; set it to true if there are diagonal elements in the list.
MIToS.Utils.matrix2list
— MethodReturns a vector with the part
("upper" or "lower") of the square matrix mat
. The diagonal
is not included by default.
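The round trip between a symmetric matrix and a list of its triangle can be sketched in plain Julia. The function names and the row-wise ordering of the upper triangle are assumptions made for this example; MIToS may order the elements differently.

```julia
# Sketch only (hypothetical names): list the upper triangle row by row.
function upper_list(mat::AbstractMatrix; diagonal::Bool = false)
    n = size(mat, 1)
    n == size(mat, 2) || throw(ArgumentError("the matrix must be square"))
    offset = diagonal ? 0 : 1
    return [mat[i, j] for i in 1:n for j in (i + offset):n]
end

# Rebuild a symmetric matrix of the given side from such a list.
function symmetric_from_list(vec::AbstractVector, side::Int; diagonal::Bool = false)
    mat = zeros(eltype(vec), side, side)
    offset = diagonal ? 0 : 1
    k = 0
    for i in 1:side, j in (i + offset):side
        k += 1
        mat[i, j] = vec[k]
        mat[j, i] = vec[k]  # mirror the value to keep the matrix symmetric
    end
    return mat
end

m = [0 1 2; 1 0 3; 2 3 0]
symmetric_from_list(upper_list(m), 3) == m
```

When the diagonal is left out, the rebuilt matrix has zeros on the diagonal.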
MIToS.Utils.select_element
— MethodSelects the first element of the vector. This is useful for unpacking one element vectors. Throws a warning if there are more elements. element_name
is element by default, but the name can be changed using the second argument.
MIToS.Information
— ModuleThe Information module of MIToS defines types and functions useful to calculate information measures (e.g. Mutual Information (MI) and Entropy) over a Multiple Sequence Alignment (MSA). This module was designed to count Residues (defined in the MSA module) in special contingency tables (as fast as possible) and to derive probabilities from these counts. It also includes methods for applying corrections to those tables, e.g. pseudocounts and pseudo frequencies. Finally, Information allows using these probabilities and counts to estimate information measures and other frequency-based values.
Features
- Estimate multi dimensional frequencies and probabilities tables from sequences, MSAs, etc...
- Correction for small number of observations
- Correction for data redundancy on a MSA
- Estimate information measures
- Calculate corrected mutual information between residues
using MIToS.Information
MIToS.Information.BLOSUM62_Pi
— ConstantBLOSUM62 probabilities P(aa) for each residue on the UngappedAlphabet
. SUM: 0.9987
MIToS.Information.BLOSUM62_Pij
— ConstantTable with conditional probabilities of residues based on BLOSUM62. The normalization is done row based. The first row contains the P(aa|A) and so on.
MIToS.Information.AdditiveSmoothing
— TypeAdditive Smoothing or fixed pseudocount λ for ResidueCount (in order to estimate probabilities when the number of samples is low).
Common values of λ are:
- 0 : No cell frequency prior, gives you the maximum likelihood estimator.
- 0.05 : The optimum value for λ found in Buslje et. al. 2009; similar results were obtained for λ in the range [0.025, 0.075].
- 1 / p : Perks prior (Perks, 1947), where p is the number of parameters (i.e. residues, pairs of residues) to estimate. If p is the number of residues (20 without counting gaps), this gives you 0.05.
- sqrt(n) / p : Minimax prior (Trybula, 1958), where n is the number of samples and p the number of parameters to estimate. If the number of samples n is 400 (minimum number of sequence clusters to achieve good performance in Buslje et. al. 2009) for estimating 400 parameters (pairs of residues without counting gaps), this gives you 0.05.
- 0.5 : Jeffreys prior (Jeffreys, 1946).
- 1 : Bayes-Laplace uniform prior, aka. Laplace smoothing.
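The arithmetic behind these values can be checked with a few lines of plain Julia. The function names below are hypothetical and the code does not use the MIToS types; it only shows how a fixed pseudocount λ enters a probability estimate and how the Perks and minimax priors both arrive at 0.05.

```julia
# Sketch only: add λ to every cell and renormalize.
smoothed_probabilities(counts, λ) = (counts .+ λ) ./ (sum(counts) + λ * length(counts))

perks_prior(p) = 1 / p             # p: number of parameters
minimax_prior(n, p) = sqrt(n) / p  # n: number of samples

perks_prior(20)          # 20 residues
minimax_prior(400, 400)  # 400 clusters, 400 residue pairs
smoothed_probabilities([3, 0, 1], 0.05)
```

With λ = 0 the estimate reduces to the plain relative frequencies, i.e. the maximum likelihood estimator mentioned in the list.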
MIToS.Information.BLOSUM_Pseudofrequencies
— TypeBLOSUM_Pseudofrequencies type. It takes two arguments/fields:
- α : Usually the number of sequences or sequence clusters in the MSA.
- β : The weight of the pseudofrequencies, a value close to 8.512 when α is the number of sequence clusters.
MIToS.Information.ContingencyTable
— TypeA ContingencyTable is a multidimensional array. It stores the contingency matrix, its marginal values and total. The type also has an internal and private temporal array and an alphabet object. It's a parametric type, taking three ordered parameters:
- T : The element type of the multidimensional array.
- N : The dimension of the array; it should be an Int.
- A : This should be a type, subtype of ResidueAlphabet, i.e.: UngappedAlphabet, GappedAlphabet or ReducedAlphabet.
A ContingencyTable can be created from an alphabet if all the parameters are given. Otherwise, you need to give a type, a number (Val) and an alphabet. You can also create a ContingencyTable using a matrix and an alphabet. For example:
ContingencyTable{Float64, 2, UngappedAlphabet}(UngappedAlphabet())
ContingencyTable(Float64, Val{2}, UngappedAlphabet())
ContingencyTable(zeros(Float64,20,20), UngappedAlphabet())
MIToS.Information.Counts
— TypeA Counts
object wraps a ContingencyTable
storing counts/frequencies.
MIToS.Information.NoPseudocount
— TypeYou can use NoPseudocount()
to avoid pseudocount corrections where a Pseudocount
type is needed.
MIToS.Information.NoPseudofrequencies
— TypeYou can use NoPseudofrequencies() to avoid pseudofrequency corrections where a Pseudofrequencies type is needed.
MIToS.Information.Probabilities
— TypeA Probabilities
object wraps a ContingencyTable
storing probabilities. It doesn't perform any check. If the total isn't one, you must use normalize
or normalize!
on the ContingencyTable
before wrapping it to make the sum of the probabilities equal to one.
MIToS.Information.Pseudocount
— TypeParametric abstract type to define pseudocount types
MIToS.Information.Pseudofrequencies
— TypeParametric abstract type to define pseudofrequencies types
Base.count!
— MethodIt populates a ContingencyTable (first argument) using the frequencies in the sequences (last positional arguments). The dimension of the table must match the number of sequences and all the sequences must have the same length. You must indicate the used weights and pseudocounts as second and third positional arguments respectively. You can use NoClustering() and NoPseudocount() to avoid the use of sequence weighting and pseudocounts, respectively.
Base.count
— MethodIt returns a ContingencyTable wrapped in a Counts type with the frequencies of residues in the sequences passed as arguments. The dimension of the table is equal to the number of sequences. You can use the keyword arguments alphabet, weights and pseudocounts to indicate the alphabet of the table (default to UngappedAlphabet()), a clustering result (default to NoClustering()) and the pseudocounts (default to NoPseudocount()) to be used during the estimation of the frequencies.
LinearAlgebra.normalize!
— Methodnormalize!
makes the sum of the frequencies to be one, in place.
LinearAlgebra.normalize
— Methodnormalize
returns another table where the sum of the frequencies is one.
MIToS.Information.APC!
— MethodAPC (Dunn et. al. 2008)
MIToS.Information.BLMI
— MethodBLMI takes an MSA or a file and a FileFormat as first arguments. It calculates a Z score (ZBLMI) and a corrected MI/MIp as described in Buslje et. al. 2009 but using BLOSUM62 pseudo frequencies instead of a fixed pseudocount.
Keyword argument, type, default value and descriptions:
- beta Float64 8.512 β for BLOSUM62 pseudo frequencies
- lambda Float64 0.0 Low count value
- threshold 62 Percent identity threshold for sequence clustering (Hobohm I)
- maxgap Float64 0.5 Maximum fraction of gaps in positions included in calculation
- apc Bool true Use APC correction (MIp)
- samples Int 50 Number of samples for Z-score
- fixedgaps Bool true Fix gaps positions for the random samples
This function returns:
- Z score (ZBLMI)
- MI or MIp using BLOSUM62 pseudo frequencies (BLMI/BLMIp)
MIToS.Information._calculate_blosum_pseudofrequencies!
— Method_calculate_blosum_pseudofrequencies!{T}(Pab::ContingencyTable{T,2,UngappedAlphabet})
This function uses the conditional probability matrix BLOSUM62_Pij
to fill the temporal array field of Pab
with pseudo frequencies (Gab
). This function needs the real frequencies/probabilities Pab
because they are used to estimate the pseudofrequencies.
Gab = Σcd Pcd ⋅ BLOSUM62( a | c ) ⋅ BLOSUM62( b | d )
MIToS.Information._marginal
— Method_marginal(1,:A,:i,:value)
generates the expression: A[i_1, 1] += value
MIToS.Information._mean_column
— MethodMean mutual information of column a (Dunn et. al. 2008). Summation is over j=1 to N, j ≠ a. Total is N-1.
MIToS.Information._mean_total
— MethodOverall mean mutual information (Dunn et. al. 2008). 2/(N*(N-1)) by the sum of MI where the indices run i=1 to N-1, j=i+1 to N (triu).
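The column mean and the overall mean described above are the two ingredients of the average product correction (APC) of Dunn et. al. 2008, where MIp(a, b) = MI(a, b) - MI(a, x̄) ⋅ MI(b, x̄) / mean(MI). The following plain Julia sketch works on an ordinary symmetric matrix; the function name apc_sketch and the choice of zero for the diagonal are assumptions of this example, not the MIToS implementation.

```julia
# Sketch only: average product correction on a symmetric N×N score matrix.
function apc_sketch(MI::AbstractMatrix{<:Real})
    N = size(MI, 1)
    # Mean of each column a over j = 1..N, j ≠ a (total is N - 1).
    col_mean = [sum(MI[a, j] for j in 1:N if j != a) / (N - 1) for a in 1:N]
    # Overall mean over the upper triangle, i = 1..N-1, j = i+1..N.
    total_mean = 2 / (N * (N - 1)) * sum(MI[i, j] for i in 1:(N - 1) for j in (i + 1):N)
    return [a == b ? 0.0 : MI[a, b] - col_mean[a] * col_mean[b] / total_mean
            for a in 1:N, b in 1:N]
end

MI = [0.0 1.0 1.0; 1.0 0.0 1.0; 1.0 1.0 0.0]
apc_sketch(MI)
```

In this toy matrix every pair has the same score, so the correction removes all of it.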
MIToS.Information._test_index
— Method_test_index(1, i, continue)
generates the expression: i_1 >= 22 && continue
MIToS.Information.apply_pseudocount!
— MethodIt adds the pseudocount
value to the table cells.
MIToS.Information.apply_pseudofrequencies!
— Methodapply_pseudofrequencies!{T}(Pab::ContingencyTable{T,2,UngappedAlphabet}, pseudofrequencies::BLOSUM_Pseudofrequencies)
When a BLOSUM_Pseudofrequencies(α,β)
is used, this function applies pseudofrequencies Gab
over Pab
, as a weighted mean of both. It uses the conditional probability matrix BLOSUM62_Pij
and the real frequencies/probabilities Pab
to estimate the pseudofrequencies Gab
. α is the weight of the real frequencies Pab
and β the weight of the pseudofrequencies.
Gab = Σcd Pcd ⋅ BLOSUM62( a | c ) ⋅ BLOSUM62( b | d )
Pab = (α ⋅ Pab + β ⋅ Gab )/(α + β)
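The two formulas above can be written as matrix operations. The sketch below uses small plain matrices instead of the MIToS types, and it assumes the conditional table is stored with rows indexed by the conditioning residue (cond[c, a] = P(a | c)), in line with the row-based normalization described for BLOSUM62_Pij. The function name is hypothetical.

```julia
using LinearAlgebra

# Sketch under stated assumptions, not the MIToS implementation.
# P    : joint probabilities P[c, d] (sums to one)
# cond : conditional probabilities, cond[c, a] = P(a | c); each row sums to one
function blend_with_pseudofrequencies(P::AbstractMatrix, cond::AbstractMatrix, α::Real, β::Real)
    # G[a, b] = Σcd P[c, d] ⋅ cond[c, a] ⋅ cond[d, b]
    G = transpose(cond) * P * cond
    # Weighted mean of the real frequencies and the pseudofrequencies.
    return (α .* P .+ β .* G) ./ (α + β)
end

P = [0.4 0.1; 0.1 0.4]
cond = [0.9 0.1; 0.2 0.8]
blended = blend_with_pseudofrequencies(P, cond, 10, 8.512)
sum(blended)  # approximately one, because P and the rows of cond are normalized
```

A larger α (more sequence clusters) pulls the result towards the observed frequencies, while β controls how much the substitution-based pseudofrequencies contribute.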
MIToS.Information.buslje09
— Methodbuslje09 takes an MSA or a file and a FileFormat as first arguments. It calculates a Z score and a corrected MI/MIp as described in Buslje et. al. 2009.
Keyword argument, type, default value and descriptions:
- lambda Float64 0.05 Low count value
- clustering Bool true Sequence clustering (Hobohm I)
- threshold 62 Percent identity threshold for clustering
- maxgap Float64 0.5 Maximum fraction of gaps in positions included in calculation
- apc Bool true Use APC correction (MIp)
- samples Int 100 Number of samples for Z-score
- fixedgaps Bool true Fix gaps positions for the random samples
- alphabet ResidueAlphabet UngappedAlphabet() Residue alphabet to be used
This function returns:
- Z score
- MI or MIp
MIToS.Information.cleanup!
— Methodcleanup!
fills the temporal, table and marginals arrays with zeros. It also sets total to zero.
MIToS.Information.cumulative
— Methodcumulative calculates cumulative scores (e.g. cMI) as defined in Buslje et. al. 2010
"We calculated a cumulative mutual information score (cMI) for each residue as the sum of MI values above a certain threshold for every amino acid pair where the particular residue appears. This value defines to what degree a given amino acid takes part in a mutual information network." Buslje, Cristina Marino, Elin Teppa, Tomas Di Doménico, José María Delfino, and Morten Nielsen. Networks of high mutual information define the structural proximity of catalytic sites: implications for catalytic residue identification. PLoS Comput Biol 6, no. 11 (2010): e1000978.
MIToS.Information.delete_dimensions!
— Methoddelete_dimensions!(out::ContingencyTable, in::ContingencyTable, dimensions::Int...)
This function fills a ContingencyTable with the counts/probabilities of in after the deletion of dimensions. For example, this is useful for getting Pxy from Pxyz.
MIToS.Information.delete_dimensions
— Methoddelete_dimensions(in::ContingencyTable, dimensions::Int...)
This function creates a ContingencyTable with the counts/probabilities of in after the deletion of dimensions. For example, this is useful for getting Pxy from Pxyz.
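Deleting a dimension amounts to summing over it (marginalization). With a plain Julia array, used here as a stand-in for a ContingencyTable, getting Pxy from Pxyz might look like this sketch:

```julia
# Sketch with a plain 3-dimensional array (not a ContingencyTable).
Pxyz = fill(1 / 8, 2, 2, 2)                    # uniform joint distribution
Pxy = dropdims(sum(Pxyz; dims = 3); dims = 3)  # sum over z, then drop that axis

size(Pxy)
sum(Pxy)
```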
MIToS.Information.gap_intersection_percentage
— MethodIt calculates the gap intersection as percentage from a table of Counts
.
MIToS.Information.gap_union_percentage
— MethodIt calculates the gap union as percentage from a table of Counts
.
MIToS.Information.gaussdca
— MethodWrapper function to GaussDCA.gDCA
. You need to install GaussDCA:
using Pkg
Pkg.add(PackageSpec(url="https://github.com/carlobaldassi/GaussDCA.jl", rev="master"))
Look into GaussDCA.jl README for further information. If you use this wrapper, please cite the GaussDCA publication and the package's doi.
It's possible to indicate the path to the julia binary where GaussDCA is installed. However, it's recommended to use the same version where MIToS is installed, because this function uses serialize/deserialize to transfer data between the processes.
GaussDCA Publication: Baldassi, Carlo, Marco Zamparo, Christoph Feinauer, Andrea Procaccini, Riccardo Zecchina, Martin Weigt, and Andrea Pagnani. "Fast and accurate multivariate Gaussian modeling of protein families: predicting residue contacts and protein-interaction partners." PloS one 9, no. 3 (2014): e92721.
MIToS.Information.getalphabet
— Methodgetalphabet gives access to the stored alphabet object.
MIToS.Information.getcontingencytable
— Methodgetcontingencytable gives access to the wrapped ContingencyTable in a Probabilities or Counts object.
MIToS.Information.getmarginals
— Methodgetmarginals gives access to the array with the marginal values (NamedArray).
MIToS.Information.getmarginalsarray
— Methodgetmarginalsarray gives access to the array with the marginal values (Array without names).
MIToS.Information.gettable
— Methodgettable gives access to the table (NamedArray).
MIToS.Information.gettablearray
— Methodgettablearray gives access to the table (Array without names).
MIToS.Information.gettotal
— Methodgettotal gives access to the stored total value.
MIToS.Information.kullback_leibler
— MethodIt calculates the Kullback-Leibler (KL) divergence from a table of Probabilities. The second positional argument is a Probabilities or ContingencyTable with the background distribution. It's optional; the default is the BLOSUM62_Pi table. Use the last, optional positional argument to change the base of the log. The default base is e, so the result is in nats. You can use 2.0 as base to get the result in bits.
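For reference, the quantity itself is KL(P‖Q) = Σ p ⋅ log(p / q). The sketch below computes it for plain probability vectors; kl_divergence is a hypothetical name, and the base is passed as a keyword here for readability, whereas the MIToS function takes it as a positional argument.

```julia
# Sketch only: KL divergence between two discrete distributions.
# Cells with p == 0 contribute nothing to the sum.
function kl_divergence(p, q; base::Real = ℯ)
    total = 0.0
    for (x, y) in zip(p, q)
        x > 0 && (total += x * log(x / y))
    end
    return total / log(base)  # change of base: nats by default, bits for base 2
end

kl_divergence([1.0, 0.0], [0.5, 0.5]; base = 2)  # in bits
kl_divergence([0.5, 0.5], [0.5, 0.5])            # identical distributions
```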
MIToS.Information.mapcolfreq!
— MethodIt efficiently maps a function (first argument) that takes a table of Counts or Probabilities (third argument). The table is filled in place with the counts or probabilities of each column from the msa (second argument).
- weights (default: NoClustering()): Weights to be used for table counting.
- pseudocounts (default: NoPseudocount()): Pseudocount object to be applied to table.
- pseudofrequencies (default: NoPseudofrequencies()): Pseudofrequencies to be applied to the normalized (probabilities) table.
MIToS.Information.mapcolpairfreq!
— MethodIt efficiently maps a function (first argument) that takes a table of Counts or Probabilities (third argument). The table is filled in place with the counts or probabilities of each pair of columns from the msa (second argument). The fourth positional argument usediagonal indicates if the function should be applied to identical element pairs (default to Val{true}).
- weights (default: NoClustering()): Weights to be used for table counting.
- pseudocounts (default: NoPseudocount()): Pseudocount object to be applied to table.
- pseudofrequencies (default: NoPseudofrequencies()): Pseudofrequencies to be applied to the normalized (probabilities) table.
- diagonalvalue (default: 0): Value to fill diagonal elements if usediagonal is Val{false}.
MIToS.Information.mapseqfreq!
— MethodIt efficiently maps a function (first argument) that takes a table of Counts or Probabilities (third argument). The table is filled in place with the counts or probabilities of each sequence from the msa (second argument).
- weights (default: NoClustering()): Weights to be used for table counting.
- pseudocounts (default: NoPseudocount()): Pseudocount object to be applied to table.
- pseudofrequencies (default: NoPseudofrequencies()): Pseudofrequencies to be applied to the normalized (probabilities) table.
MIToS.Information.mapseqpairfreq!
— MethodIt efficiently maps a function (first argument) that takes a table of Counts or Probabilities (third argument). The table is filled in place with the counts or probabilities of each pair of sequences from the msa (second argument). The fourth positional argument usediagonal indicates if the function should be applied to identical element pairs (default to Val{true}).
- weights (default: NoClustering()): Weights to be used for table counting.
- pseudocounts (default: NoPseudocount()): Pseudocount object to be applied to table.
- pseudofrequencies (default: NoPseudofrequencies()): Pseudofrequencies to be applied to the normalized (probabilities) table.
- diagonalvalue (default: 0): Value to fill diagonal elements if usediagonal is Val{false}.
MIToS.Information.marginal_entropy
— MethodIt calculates marginal entropy (H) from a table of Counts or Probabilities. The second positional argument indicates the margin used to calculate the entropy, e.g. it estimates the entropy H(X) if marginal is 1, H(Y) for 2, etc. Use the last, optional positional argument to change the base of the log. The default base is e, so the result is in nats. You can use 2.0 as base to get the result in bits.
MIToS.Information.mutual_information
— MethodIt calculates Mutual Information (MI) from a table of Counts or Probabilities. Use the last, optional positional argument to change the base of the log. The default base is e, so the result is in nats. You can use 2.0 as base to get the result in bits. Calculation of MI from Counts is faster than from Probabilities.
MIToS.Information.normalized_mutual_information
— MethodIt calculates a Normalized Mutual Information (nMI) by Entropy from a table of Counts
or Probabilities
.
nMI(X, Y) = MI(X, Y) / H(X, Y)
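To make the relation between these measures concrete, here is a sketch on a plain joint probability matrix, using MI(X, Y) = H(X) + H(Y) - H(X, Y) and the nMI definition above. The function names are hypothetical and the code does not use the MIToS table types. Returning zero when H(X, Y) is zero is a choice made for this example.

```julia
# Sketch only: information measures from a joint probability matrix Pxy.
shannon(p; base::Real = ℯ) = -sum((x * log(x) for x in p if x > 0); init = 0.0) / log(base)

function mutual_info(Pxy::AbstractMatrix; base::Real = ℯ)
    Px = vec(sum(Pxy; dims = 2))  # marginal of X (rows)
    Py = vec(sum(Pxy; dims = 1))  # marginal of Y (columns)
    return shannon(Px; base = base) + shannon(Py; base = base) - shannon(Pxy; base = base)
end

function normalized_mi(Pxy::AbstractMatrix)
    h = shannon(Pxy)  # joint entropy H(X, Y)
    return h == 0 ? 0.0 : mutual_info(Pxy) / h
end

coupled = [0.5 0.0; 0.0 0.5]          # X determines Y
independent = [0.25 0.25; 0.25 0.25]  # X and Y are independent
mutual_info(coupled; base = 2)
normalized_mi(independent)
```

The ratio does not depend on the base of the logarithm, because the base cancels between numerator and denominator.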
MIToS.Information.pairwisegapfraction
— MethodIt takes an MSA or a file and a FileFormat as first arguments. It calculates the percentage of gaps on column pairs (union and intersection) using sequence clustering (Hobohm I).
Argument, type, default value and descriptions:
- clustering Bool true Sequence clustering (Hobohm I)
- threshold 62 Percent identity threshold for sequence clustering (Hobohm I)
This function returns:
- pairwise gap union as percentage
- pairwise gap intersection as percentage
MIToS.Information.probabilities!
— MethodIt populates a ContingencyTable
(first argument) using the probabilities in the sequences (last positional arguments). The dimension of the table must match the number of sequences and all the sequences must have the same length. You must indicate the used weights, pseudocounts and pseudofrequencies as second, third and fourth positional arguments respectively. You can use NoClustering()
, NoPseudocount()
and NoPseudofrequencies()
to avoid the use of sequence weighting, pseudocounts and pseudofrequencies, respectively.
MIToS.Information.probabilities
— MethodIt returns a ContingencyTable wrapped in a Probabilities type with the frequencies of residues in the sequences passed as arguments. The dimension of the table is equal to the number of sequences. You can use the keyword arguments alphabet, weights, pseudocounts and pseudofrequencies to indicate the alphabet of the table (default to UngappedAlphabet()), a clustering result (default to NoClustering()), the pseudocounts (default to NoPseudocount()) and the pseudofrequencies (default to NoPseudofrequencies()) to be used during the estimation of the probabilities.
MIToS.Information.update_marginals!
— Methodupdate_marginals!
updates the marginal and total values using the table.
StatsBase.entropy
— MethodIt calculates the Shannon entropy (H) from a table of Counts or Probabilities. Use the last, optional positional argument to change the base of the log. The default base is e, so the result is in nats. You can use 2.0 as base to get the result in bits.
MIToS.MSA
— ModuleThe MSA module of MIToS has utilities for working with Multiple Sequence Alignments of protein Sequences (MSA).
Features
- Read and write MSAs in Stockholm, FASTA or Raw format
- Handle MSA annotations
- Edit the MSA, e.g. delete columns or sequences, change sequence order, shuffling...
- Keep track of positions and annotations after modifications on the MSA
- Describe a MSA, e.g. mean percent identity, sequence coverage, gap percentage...
using MIToS.MSA
MIToS.MSA.GAP
— ConstantGAP
is the Residue
representation on MIToS for gaps ('-'
, insertions and deletions). Lowercase residue characters, dots and '*'
are encoded as GAP
in conversion from String
s and Char
s. This Residue
constant is encoded as Residue(21)
.
MIToS.MSA.XAA
— ConstantXAA
is the Residue
representation for unknown, ambiguous and non standard residues. This Residue
constant is encoded as Residue(22)
.
MIToS.MSA._max_char
— Constant'z' is the maximum between 'A':'Z', 'a':'z', '.', '-' and '*'. 'z' is 'GAP' but the next character to 'z' is '{', i.e. XAA
.
MIToS.MSA.AbstractAlignedObject
— TypeMIToS MSA and aligned sequences (aligned objects) are subtypes of AbstractMatrix{Residue}
, because MSAs and sequences are stored as Matrix
of Residue
s.
MIToS.MSA.AbstractAlignedSequence
— TypeA MIToS aligned sequence is an AbstractMatrix{Residue}
with only 1 row/sequence.
MIToS.MSA.AbstractMultipleSequenceAlignment
— TypeMSAs are stored as Matrix{Residue}
. It's possible to use a NamedResidueMatrix{Array{Residue,2}}
as the most simple MSA with sequence identifiers and column names.
MIToS.MSA.AlignedSequence
— TypeAn AlignedSequence
wraps a NamedResidueMatrix{Array{Residue,2}}
with only 1 row/sequence. The NamedArray
stores the sequence name and original column numbers as String
s.
MIToS.MSA.AnnotatedAlignedSequence
— TypeThis type represents an aligned sequence, similar to AlignedSequence, but it also stores its Annotations.
MIToS.MSA.AnnotatedMultipleSequenceAlignment
— TypeThis type represents an MSA, similar to MultipleSequenceAlignment, but it also stores Annotations. These annotations are used to store residue coordinates (i.e. mapping to UniProt residue numbers).
MIToS.MSA.Annotations
— TypeThe Annotations
type is basically a container for Dict
s with the annotations of a multiple sequence alignment. Annotations
was designed for storage of annotations of the Stockholm format.
MIToS also uses MSA annotations to keep track of:
- Modifications of the MSA (MIToS_...), such as deletion of sequences or columns.
- Position numbers in the original MSA file (column mapping: ColMap)
- Position of the residues in the sequence (sequence mapping: SeqMap)
MIToS.MSA.Clusters
— TypeData structure to represent sequence clusters. The sequence data itself is not included.
MIToS.MSA.GappedAlphabet
— TypeThis type defines the usual alphabet of the 20 natural residues and a gap character.
julia> using MIToS.MSA
julia> GappedAlphabet()
GappedAlphabet of length 21. Residues : res"ARNDCQEGHILKMFPSTWYV-"
MIToS.MSA.MultipleSequenceAlignment
— TypeThis MSA type includes a NamedArray wrapping a Matrix of Residues. The use of NamedArray allows storing sequence names and original column numbers as Strings, and fast indexing using them.
MIToS.MSA.NoClustering
— TypeUse NoClustering()
to avoid the use of clustering where a Clusters
type is needed.
MIToS.MSA.ReducedAlphabet
— TypeReducedAlphabet
allows the construction of reduced residue alphabets, where residues inside parenthesis belong to the same group.
julia> using MIToS.MSA
julia> ab = ReducedAlphabet("(AILMV)(RHK)(NQST)(DE)(FWY)CGP")
ReducedAlphabet of length 8 : "(AILMV)(RHK)(NQST)(DE)(FWY)CGP"
julia> ab[Residue('K')]
2
MIToS.MSA.Residue
— TypeMost of the MIToS design is created around the Residue
bitstype. It has representations for the 20 natural amino acids, a value representing insertions and deletions (GAP
, '-'
) and one representing unknown, ambiguous and non standard residues (XAA
, 'X'
). Each Residue is encoded as an integer number, with the same bit representation and size as an Int. This allows fast indexing operations on probability or frequency matrices.
Residue creation and conversion
Creation and conversion of Residue
s should be treated carefully. Residue
is encoded as a 32 or 64 bits type similar to Int
, to get fast indexing using Int(x::Residue)
. Int
simply calls reinterpret
without checking if the residue is valid. Valid residues have integer values in the closed interval [1,22]. convert
from Int
and Char
always returns valid residues, however it's possible to find invalid residues (they are shown using the character '�'
) after the creation of uninitialized Residue
arrays (i.e. using Array
). You can use zeros
, ones
or rand
to get initialized Residue
arrays with valid residues. Conversions to and from Chars change the bit representation and allow the use of the usual character representation of residues and amino acids. These conversions are used in IO operations and always return valid residues. In conversions from Char, lowercase letters, '*', '-' and '.' are translated to GAP, letters representing the 20 natural amino acids (ARNDCQEGHILKMFPSTWYV) are translated to their corresponding Residue and any other character is translated to XAA. Since lowercase letters and dots are translated to gaps, Pfam MSA insert columns are converted to columns full of gaps.
julia> using MIToS.MSA
julia> alanine = Residue('A')
A
julia> Char(alanine)
'A': ASCII/Unicode U+0041 (category Lu: Letter, uppercase)
julia> for residue in res"ARNDCQEGHILKMFPSTWYV-X"
println(residue, " ", Int(residue))
end
A 1
R 2
N 3
D 4
C 5
Q 6
E 7
G 8
H 9
I 10
L 11
K 12
M 13
F 14
P 15
S 16
T 17
W 18
Y 19
V 20
- 21
X 22
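The Char conversion rules described above can be mirrored in a few lines of plain Julia that return the integer code instead of a Residue. This is only a sketch of the documented mapping; the constant and function names are hypothetical and the code does not depend on MIToS.

```julia
# Sketch only: map a character to the integer code listed above.
const RESIDUE_ORDER = "ARNDCQEGHILKMFPSTWYV"  # codes 1 to 20, in this order

function residue_code(c::Char)
    idx = findfirst(==(c), RESIDUE_ORDER)
    idx !== nothing && return idx                          # one of the 20 natural amino acids
    (islowercase(c) || c in ('*', '-', '.')) && return 21  # GAP
    return 22                                              # XAA: anything else
end

[residue_code(c) for c in "AV-.aXB"]
```

Note that the uppercase check comes first, so only characters outside the 20-letter alphabet fall through to the GAP and XAA rules.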
MIToS.MSA.ResidueAlphabet
— TypeAbstract type to define residue alphabet types.
MIToS.MSA.UngappedAlphabet
— TypeThis type defines the usual alphabet of the 20 natural residues, without the gap character.
julia> using MIToS.MSA
julia> UngappedAlphabet()
UngappedAlphabet of length 20. Residues : res"ARNDCQEGHILKMFPSTWYV"
Base.isvalid
— Methodisvalid(res::Residue)
It returns true
if the encoded integer is in the closed interval [1,22].
Base.names
— MethodIt returns the name of each group. The name is a string with the one letter code of each residue that belong to the group.
julia> using MIToS.MSA
julia> ab = ReducedAlphabet("(AILMV)(RHK)(NQST)(DE)(FWY)CGP")
ReducedAlphabet of length 8 : "(AILMV)(RHK)(NQST)(DE)(FWY)CGP"
julia> names(ab)
8-element Vector{String}:
"AILMV"
"RHK"
"NQST"
"DE"
"FWY"
"C"
"G"
"P"
Base.parse
— Functionparse(io, format[, output; generatemapping, useidcoordinates, deletefullgaps])
The keyword argument generatemapping
(false
by default) indicates if the mapping of the sequences ("SeqMap") and columns ("ColMap") and the number of columns in the original MSA ("NCol") should be generated and saved in the annotations. If useidcoordinates
is true
(default: false
) the sequence IDs of the form "ID/start-end" are parsed and used for determining the start and end positions when the mappings are generated. deletefullgaps
(true
by default) indicates if columns 100% gaps (generally inserts from a HMM) must be removed from the MSA.
Base.rand
— MethodIt chooses from the 20 natural residues (it doesn't generate gaps).
julia> using MIToS.MSA
julia> using Random
julia> Random.seed!(1); # Reseed the random number generator.
julia> rand(Residue)
R
julia> rand(Residue, 4, 4)
4×4 Matrix{Residue}:
E D D A
F S K K
M S I M
Y F E D
Clustering.assignments
— MethodGet a vector of assignments, where the i-th value is the index/number of the cluster to which the i-th sequence is assigned.
Clustering.nclusters
— MethodGet the number of clusters in a Clusters
object.
MIToS.MSA._fill_hobohmI!
— MethodFill cluster
and clustersize
matrices. These matrices are assumed to be empty (only zeroes) and their length is assumed to be equal to the number of sequences in the alignment (aln
). threshold
is the minimum identity value between two sequences to be in the same cluster.
MIToS.MSA._filter_mapping
— MethodFilters column and sequence mappings of the format: ",,,,10,11,,12"
MIToS.MSA._get_selected_sequences
— MethodHelper function to create a boolean mask to select sequences.
MIToS.MSA._get_seqid_index
— MethodThis function takes a vector of sequence names and a sequence id. It returns the position of that id in the vector. If the id isn't in the vector, it throws an error.
MIToS.MSA._get_sequence_weight
— MethodCalculates the weight of each sequence in a cluster. The weight is equal to one divided by the number of sequences in the cluster.
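As a small illustration of this weighting rule, the sketch below derives per-sequence weights from a vector of cluster assignments (one cluster index per sequence). The function name is hypothetical and the code uses plain vectors rather than the Clusters type.

```julia
# Sketch only: weight of each sequence = 1 / (size of its cluster).
function sequence_weights(assignments::AbstractVector{<:Integer})
    sizes = Dict{Int,Int}()
    for c in assignments
        sizes[c] = get(sizes, c, 0) + 1  # count sequences per cluster
    end
    return [1 / sizes[c] for c in assignments]
end

weights = sequence_weights([1, 1, 2, 1, 3])
sum(weights)  # equals the number of clusters
```

Because each cluster contributes a total weight of one, the sum of the weights is the number of clusters.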
MIToS.MSA._keepinserts!
— MethodFunction to keep insert columns in parse. It uses the first sequence to generate the "Aligned" annotation and, after that, converts all the characters to uppercase.
MIToS.MSA._percentidentity
— Methodseq1 and seq2 should have the same length
MIToS.MSA._str2int_mapping
— MethodConverts a string of mappings into a vector of Int
s
julia> using MIToS.MSA
julia> MSA._str2int_mapping(",,2,,4,5")
6-element Vector{Int64}:
0
0
2
0
4
5
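The same conversion can be reproduced without MIToS using Base.split and parse. The function name below is hypothetical; empty fields are mapped to 0, matching the output shown above.

```julia
# Sketch only: turn a comma-separated mapping into integers, using 0 for empty fields.
mapping_to_ints(s::AbstractString) = [isempty(f) ? 0 : parse(Int, f) for f in split(s, ',')]

mapping_to_ints(",,2,,4,5")
```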
MIToS.MSA._swap!
— MethodIt swaps the names on the positions i
and j
of a Vector{String}
MIToS.MSA._valid_residue_integer
— MethodIt takes an Int
and returns the Int
value used to represent a valid Residue
. Invalid residues are encoded as the integer 23.
MIToS.MSA.adjustreference
— FunctionCreates a new matrix of residues. This function deletes positions/columns of the MSA with gaps in the reference (first) sequence.
MIToS.MSA.adjustreference!
— FunctionIt removes positions/columns of the MSA with gaps in the reference (first) sequence.
MIToS.MSA.annotate_modification!
— MethodRecords the modifications made by MIToS on the MSA in the file annotations. It always returns true
, so it can be used in a boolean context.
MIToS.MSA.annotations
— Methodannotations
returns the Annotations
of an MSA or aligned sequence.
MIToS.MSA.columngapfraction
— MethodFraction of gaps per column/position on the MSA
MIToS.MSA.columnnames
— Methodcolumnnames(msa)
It returns a Vector{String}
with the column names. If the msa
is a Matrix{Residue}
this function returns the actual column numbers as strings. Otherwise it returns the column number of the original MSA through the wrapped NamedArray
column names.
MIToS.MSA.columnpairsmatrix
— MethodInitialize an empty PairwiseListMatrix
for a pairwise measure in column pairs. It uses the column mapping (column number in the input MSA file) if it's available, otherwise it uses the actual column numbers. You can use the positional arguments to indicate the number Type
(default: Float64
), if the PairwiseListMatrix
should store the diagonal values on the list (default: false
) and a default value for the diagonal (default: NaN
).
MIToS.MSA.coverage
— MethodCoverage of the sequences with respect to the number of positions in the MSA.
MIToS.MSA.delete_annotated_modifications!
— MethodDeletes all the MIToS annotated modifications
MIToS.MSA.deletefullgapcolumns!
— FunctionDeletes columns with 100% gaps; these columns are generated by inserts.
MIToS.MSA.filtercolumns!
— Functionfiltercolumns!(msa, mask[, annotate::Bool=true])
It filters the columns/positions of an MSA or aligned sequence using an AbstractVector{Bool}
mask
. Annotations are updated if annotate
is true
(default).
MIToS.MSA.filtercolumns!
— Methodfiltercolumns!(data::Annotations, mask)
It is useful for deleting column annotations (creating a subset in place).
MIToS.MSA.filtercolumns
— MethodIt's similar to filtercolumns!
but for an AbstractMatrix{Residue}
MIToS.MSA.filtersequences!
— Functionfiltersequences!(msa, mask[, annotate::Bool=true])
It filters msa
sequences using an AbstractVector{Bool}
mask
(it removes sequences with false
values). AnnotatedMultipleSequenceAlignment
annotations are updated if annotate
is true
(default).
MIToS.MSA.filtersequences!
— Methodfiltersequences!(data::Annotations, ids::Vector{String}, mask::AbstractArray{Bool,1})
It is useful for deleting sequence annotations. ids
should be a list of the sequence names and mask
should be a logical vector.
MIToS.MSA.filtersequences
— MethodIt's similar to filtersequences!
but for an AbstractMatrix{Residue}
MIToS.MSA.gapfraction
— MethodIt calculates the fraction of gaps on the Array
(alignment, sequence, column, etc.). This function can take an extra dimension
argument for calculation of the gap fraction over the given dimension.
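As a rough illustration of what a gap fraction over a dimension means, here is a sketch on a plain `Matrix{Char}` where `'-'` stands for a gap. It does not use MIToS types, and the variable names are assumptions made for this example.

```julia
using Statistics

# Toy alignment: rows are sequences, columns are positions.
aln = ['A' '-' 'C';
       'A' 'D' '-';
       '-' '-' 'C';
       'A' 'D' 'C']

isgap = aln .== '-'
overall = mean(isgap)                       # fraction over the whole array
per_column = vec(mean(isgap, dims = 1))     # one value per column
per_sequence = vec(mean(isgap, dims = 2))   # one value per sequence
residue_fraction = 1 - overall              # complement: non-gap residues

println(per_column)
```

The residue fraction is simply the complement of the gap fraction, which is why the two functions share the same optional dimension argument.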
MIToS.MSA.gapstrip!
— FunctionThis function deletes/filters sequences and columns/positions of the MSA in the following order:
- Removes all the columns/position on the MSA with gaps on the reference (first) sequence.
- Removes all the sequences with a coverage with respect to the number of
columns/positions on the MSA less than a coveragelimit
(default to 0.75
: sequences with 25% of gaps).
- Removes all the columns/position on the MSA with more than a
gaplimit
(default to 0.5
: 50% of gaps).
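The three steps above can be sketched on a plain `Matrix{Char}` (with `'-'` as the gap character). This is a minimal sketch under those assumptions, not the MIToS implementation: `gapstrip_sketch` is a hypothetical name, and annotations and names are ignored.

```julia
using Statistics

function gapstrip_sketch(aln::Matrix{Char}; coveragelimit = 0.75, gaplimit = 0.5)
    # 1. Keep only the columns where the reference (first) sequence has a residue.
    aln = aln[:, aln[1, :] .!= '-']
    # 2. Keep the sequences whose coverage (non-gap fraction) reaches coveragelimit.
    coverage = vec(mean(aln .!= '-', dims = 2))
    aln = aln[coverage .>= coveragelimit, :]
    # 3. Keep the columns whose gap fraction does not exceed gaplimit.
    gaps = vec(mean(aln .== '-', dims = 1))
    aln[:, gaps .<= gaplimit]
end

aln = ['A' '-' 'C' 'D' 'E';
       'A' 'W' 'C' 'D' '-';
       '-' 'W' '-' '-' 'E';
       'A' '-' '-' 'D' 'E']

stripped = gapstrip_sketch(aln)
println(size(stripped))
```

The order matters: coverage is measured after the reference gaps are removed, and the column gap fraction is measured after the low-coverage sequences are dropped.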
MIToS.MSA.gapstrip
— MethodCreates a new matrix of Residue
s (MSA) with deleted sequences and columns/positions. The MSA is edited in the following way:
- Removes all the columns/position on the MSA with gaps on the reference (first) sequence
- Removes all the sequences with a coverage with respect to the number of
columns/positions on the MSA less than a coveragelimit
(default to 0.75
: sequences with 25% of gaps)
- Removes all the columns/position on the MSA with more than a
gaplimit
(default to 0.5
: 50% of gaps)
MIToS.MSA.getannotcolumn
— Functiongetannotcolumn(ann[, feature[,default]])
It returns per column annotation for feature
MIToS.MSA.getannotfile
— Functiongetannotfile(ann[, feature[,default]])
It returns per file annotation for feature
MIToS.MSA.getannotresidue
— Functiongetannotresidue(ann[, seqname, feature[,default]])
It returns per residue annotation for (seqname, feature)
MIToS.MSA.getannotsequence
— Functiongetannotsequence(ann[, seqname, feature[,default]])
It returns per sequence annotation for (seqname, feature)
MIToS.MSA.getcolumnmapping
— MethodIt returns a Vector{Int}
with the original column number of each column on the actual MSA. The mapping is annotated in the "ColMap" file annotation of an AnnotatedMultipleSequenceAlignment
or in the column names of an NamedArray
or MultipleSequenceAlignment
.
MIToS.MSA.gethcatmapping
— MethodIt returns a vector of numbers from 1
to N for each column that indicates the source MSA. The mapping is annotated in the "HCat"
file annotation of an AnnotatedMultipleSequenceAlignment
or in the column names of an NamedArray
or MultipleSequenceAlignment
.
MIToS.MSA.getnamedict
— MethodIt takes a ResidueAlphabet
and returns a dictionary from group name to group position.
julia> using MIToS.MSA
julia> ab = ReducedAlphabet("(AILMV)(RHK)(NQST)(DE)(FWY)CGP")
ReducedAlphabet of length 8 : "(AILMV)(RHK)(NQST)(DE)(FWY)CGP"
julia> getnamedict(ab)
OrderedCollections.OrderedDict{String, Int64} with 8 entries:
"AILMV" => 1
"RHK" => 2
"NQST" => 3
"DE" => 4
"FWY" => 5
"C" => 6
"G" => 7
"P" => 8
MIToS.MSA.getresidues
— Methodgetresidues
allows you to access the residues stored inside an MSA or aligned sequence as a Matrix{Residue}
without annotations nor column/row names.
MIToS.MSA.getresiduesequences
— Methodgetresiduesequences
returns a Vector{Vector{Residue}}
with all the MSA sequences without annotations nor column/sequence names.
MIToS.MSA.getsequence
— Functiongetsequence
takes an MSA and a sequence number or identifier and returns an aligned sequence object. If the MSA is an AnnotatedMultipleSequenceAlignment
, it returns an AnnotatedAlignedSequence
with the sequence annotations. From a MultipleSequenceAlignment
, It returns an AlignedSequence
object. If an Annotations
object and a sequence identifier are used, this function returns the annotations related to the sequence.
MIToS.MSA.getsequencemapping
— MethodIt returns the sequence coordinates as a Vector{Int}
for an MSA sequence. That vector has one element for each MSA column. If the number is 0
in the mapping, there is a gap in that column for that sequence.
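To make the convention concrete, the following sketch builds such a mapping from an aligned sequence written as a string. The function name and the `start` argument are assumptions for this example; MIToS reads the mapping from the MSA annotations instead.

```julia
# Hypothetical helper: number the residues of an aligned sequence starting at
# `start`, and write 0 at every column where the sequence has a gap.
function sequence_mapping_sketch(aligned::AbstractString, start::Int = 1)
    mapping = Int[]
    position = start - 1
    for residue in aligned
        if residue == '-'
            push!(mapping, 0)
        else
            position += 1
            push!(mapping, position)
        end
    end
    mapping
end

mapping = sequence_mapping_sketch("AC--DE-F", 10)
println(mapping)
```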
MIToS.MSA.getweight
— Methodgetweight(c[, i::Int])
This function returns the weight of the sequence number i
. getweight should be defined for any type used for count!
/count
in order to use its weights. If i
isn't used, this function returns a vector with the weight of each sequence.
MIToS.MSA.hobohmI
— MethodSequence clustering using the Hobohm I method from Hobohm et al. 1992.
Hobohm, Uwe, et al. "Selection of representative protein data sets." Protein Science 1.3 (1992): 409-417.
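The greedy idea behind this kind of clustering, together with the one-over-cluster-size weights described for `_get_sequence_weight`, can be sketched on plain strings. This is a simplified illustration and not the MIToS code: `identity_fraction` and `greedy_clusters` are hypothetical helpers, and gaps are not treated specially.

```julia
# Fraction of identical positions between two equal-length strings.
identity_fraction(a::AbstractString, b::AbstractString) =
    count(((x, y),) -> x == y, zip(a, b)) / length(a)

# Each sequence joins the first cluster whose first member is at least
# `threshold` identical to it; otherwise it starts a new cluster.
function greedy_clusters(seqs::Vector{String}, threshold::Float64)
    representatives = Int[]
    assignments = zeros(Int, length(seqs))
    for (i, seq) in enumerate(seqs)
        for (k, rep) in enumerate(representatives)
            if identity_fraction(seq, seqs[rep]) >= threshold
                assignments[i] = k
                break
            end
        end
        if assignments[i] == 0
            push!(representatives, i)
            assignments[i] = length(representatives)
        end
    end
    assignments
end

seqs = ["ACDEF", "ACDEY", "WWWWW", "ACDEF"]
assignments = greedy_clusters(seqs, 0.8)
sizes = [count(==(k), assignments) for k in 1:maximum(assignments)]
weights = [1 / sizes[k] for k in assignments]   # one over the cluster size
println(assignments)
```

With these weights, every cluster contributes a total weight of one regardless of how many redundant sequences it contains.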
MIToS.MSA.meanpercentidentity
— FunctionReturns the mean of the percent identity between the sequences of an MSA. If the MSA has 300 sequences or fewer, the mean is exact. If the MSA has more sequences and the exact
keyword is false
(default), 44850 random pairs of sequences are used for the estimation. The number of samples can be changed using the second argument. Use exact=true
to perform all the pairwise comparisons (the calculation could be slow).
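The default sample size matches the number of distinct sequence pairs in an alignment of 300 sequences, which a one-line check makes visible (`npairs` is a hypothetical helper written for this note):

```julia
# Number of distinct unordered pairs among n sequences.
npairs(n) = n * (n - 1) ÷ 2

println(npairs(300))
```

So for larger alignments the default estimate costs about as many comparisons as the exact calculation on 300 sequences.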
MIToS.MSA.namedmatrix
— Methodnamedmatrix
returns the NamedResidueMatrix{Array{Residue,2}}
stored in an MSA or aligned sequence.
MIToS.MSA.ncolumns
— Methodncolumns
returns the number of MSA columns or positions.
MIToS.MSA.ncolumns
— Methodncolumns(ann::Annotations)
returns the number of columns/residues with annotations. This function returns -1
if there are no annotations per column/residue.
MIToS.MSA.nsequences
— Methodnsequences
returns the number of sequences on the MSA.
MIToS.MSA.percentidentity
— Methodpercentidentity(seq1, seq2, threshold)
Quickly checks whether two aligned sequences have an identity value greater than a given threshold
value. Returns a boolean value. Positions with gaps in both sequences don't count towards the length of the sequences. Positions with an XAA
in at least one sequence aren't counted.
MIToS.MSA.percentidentity
— Methodpercentidentity(seq1, seq2)
Calculates the fraction of identities between two aligned sequences. The identity value is calculated as the number of identical characters in the i-th position of both sequences divided by the length of both sequences. Positions with gaps in both sequences don't count towards the length of the sequences. Positions with an XAA
in at least one sequence aren't counted.
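The counting rule can be sketched on plain strings. In this illustration `'-'` stands for a gap and `'X'` stands in for XAA; both conventions and the name `identity_sketch` are assumptions of the example. The sketch returns a fraction between 0 and 1 and makes no claim about the scale MIToS uses for its return value.

```julia
function identity_sketch(seq1::AbstractString, seq2::AbstractString)
    length(seq1) == length(seq2) ||
        throw(ArgumentError("sequences must have the same length"))
    matches = 0
    counted = 0
    for (a, b) in zip(seq1, seq2)
        (a == '-' && b == '-') && continue   # gap in both sequences: skip
        (a == 'X' || b == 'X') && continue   # unknown residue in either: skip
        counted += 1
        matches += (a == b)
    end
    counted == 0 ? 0.0 : matches / counted
end

println(identity_sketch("AC-DX", "AC-EY"))
```

In the example, only three positions are counted (A/A, C/C and D/E), and two of them are identical.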
MIToS.MSA.percentidentity
— Methodpercentidentity(msa[, out::Type=Float64])
Calculates the identity between all the sequences on a MSA. You can indicate the output element type with the last optional parameter (Float64
by default). For an MSA with a lot of sequences, you can use Float32
or Float16
in order to avoid the OutOfMemoryError()
.
MIToS.MSA.percentsimilarity
— FunctionCalculates the similarity percent between two aligned sequences. The 100% is the length of the aligned sequences minus the number of columns with gaps in both sequences and the number of columns with at least one residue outside the alphabet. So, columns with residues outside the alphabet (other than the specially treated GAP
) aren't counted to the protein length. Two residues are considered similar if they below to the same group in a ReducedAlphabet
. The alphabet
(third positional argument) by default is:
ReducedAlphabet("(AILMV)(NQST)(RHK)(DE)(FWY)CGP")
The first group is composed of the non polar residues (AILMV)
, the second group is composed of polar residues, the third group are positive residues, the fourth group are negative residues, the fifth group is composed of the aromatic residues (FWY)
. C
, G
and P
are considered unique residues.
Other residue groups/alphabets:
SMS (Sequence Manipulation Suite) Ident and Sim:
ReducedAlphabet("(GAVLI)(FYW)(ST)(KRH)(DENQ)P(CM)")
Stothard P (2000) The Sequence Manipulation Suite: JavaScript programs for analyzing and formatting protein and DNA sequences. Biotechniques 28:1102-1104.
Bio3D 2.2 seqidentity:
ReducedAlphabet("(GA)(MVLI)(FYW)(ST)(KRH)(DE)(NQ)PC")
Grant, B.J. et al. (2006) Bioinformatics 22, 2695–2696.
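The grouping rule can be illustrated without MIToS types. The sketch below hard-codes the default groups listed above as plain strings and works on sequences written as strings with `'-'` as the gap; `similarity_sketch` and the lookup table are hypothetical names for this example, not the library's implementation.

```julia
# Default groups from "(AILMV)(NQST)(RHK)(DE)(FWY)CGP".
groups = ["AILMV", "NQST", "RHK", "DE", "FWY", "C", "G", "P"]
group_of = Dict(res => k for (k, group) in enumerate(groups) for res in group)

function similarity_sketch(seq1::AbstractString, seq2::AbstractString)
    similar = 0
    counted = 0
    for (a, b) in zip(seq1, seq2)
        (a == '-' && b == '-') && continue      # gap in both sequences
        if (a != '-' && !haskey(group_of, a)) || (b != '-' && !haskey(group_of, b))
            continue                            # residue outside the alphabet
        end
        counted += 1
        if a != '-' && b != '-' && group_of[a] == group_of[b]
            similar += 1
        end
    end
    counted == 0 ? 0.0 : 100 * similar / counted
end

println(similarity_sketch("AK-C", "NKGC"))
```

In the example, A/N fall in different groups and the third column has a gap in only one sequence, so two of the four counted columns are similar.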
MIToS.MSA.percentsimilarity
— MethodCalculates the similarity percent between all the sequences on a MSA. You can indicate the output element type with the out
keyword argument (Float64
by default). For an MSA with a lot of sequences, you can use out=Float32
or out=Float16
in order to avoid the OutOfMemoryError()
.
MIToS.MSA.printmodifications
— MethodPrints MIToS annotated modifications
MIToS.MSA.residue2three
— MethodThis function returns the three letter name of the Residue
.
julia> using MIToS.MSA
julia> residue2three(Residue('G'))
"GLY"
MIToS.MSA.residuefraction
— MethodIt calculates the fraction of residues (no gaps) on the Array
(alignment, sequence, column, etc.). This function can take an extra dimension
argument for calculation of the residue fraction over the given dimension.
MIToS.MSA.sequencenames
— Methodsequencenames(msa)
It returns a Vector{String}
with the sequence names/identifiers.
MIToS.MSA.sequencepairsmatrix
— MethodInitialize an empty PairwiseListMatrix
for a pairwise measure in sequence pairs. It uses the sequence names if they are available, otherwise it uses the actual sequence numbers. You can use the positional arguments to indicate the number Type
(default: Float64
), if the PairwiseListMatrix
should store the diagonal values on the list (default: false
) and a default value for the diagonal (default: NaN
).
MIToS.MSA.setannotcolumn!
— Functionsetannotcolumn!(ann, feature, annotation)
It stores per column annotation
(1 char per column) for feature
MIToS.MSA.setannotfile!
— Functionsetannotfile!(ann, feature, annotation)
It stores per file annotation
for feature
MIToS.MSA.setannotresidue!
— Functionsetannotresidue!(ann, seqname, feature, annotation)
It stores per residue annotation
(1 char per residue) for (seqname, feature)
MIToS.MSA.setannotsequence!
— Functionsetannotsequence!(ann, seqname, feature, annotation)
It stores per sequence annotation
for (seqname, feature)
MIToS.MSA.setreference!
— FunctionIt puts the sequence i
(name or position) as reference (first sequence) of the MSA. This function swaps the sequences 1 and i
.
MIToS.MSA.stringsequence
— Methodstringsequence(seq)
stringsequence(msa, i::Int)
stringsequence(msa, id::String)
It returns the selected sequence as a String
.
MIToS.MSA.swapsequences!
— MethodIt swaps the sequences on the positions i
and j
of an MSA. It's also possible to swap sequences using their sequence names/identifiers when the MSA object has names.
MIToS.MSA.three2residue
— MethodIt takes a three letter residue name and returns the corresponding Residue
. If the name isn't in the MIToS dictionary, an XAA
is returned.
julia> using MIToS.MSA
julia> three2residue("ALA")
A
Random.shuffle!
— FunctionIt's an in-place version of Random.shuffle
. When a Matrix{Residue}
is used, you can indicate if the gaps should remain in their positions using the last boolean argument. The previous argument should be the dimension to shuffle, 1 for shuffling residues in a sequence (row) or 2 for shuffling residues in a column.
julia> using MIToS.MSA
julia> using Random
julia> msa = hcat(res"RRE",res"DDK", res"G--")
3×3 Matrix{Residue}:
R D G
R D -
E K -
julia> Random.seed!(42);
julia> shuffle(msa, 1, true)
3×3 Matrix{Residue}:
G D R
R D -
E K -
julia> Random.seed!(42);
julia> shuffle(msa, 1, false)
3×3 Matrix{Residue}:
G D R
R - D
E K -
Random.shuffle
— MethodIt's like shuffle!
but it returns a shuffled copy instead of modifying its argument. When a Matrix{Residue}
or an AbstractAlignedObject
(sequence or MSA) is used, you can indicate if the gaps should remain in their positions using the last boolean argument.
StatsBase.counts
— MethodGet sample counts of clusters as a Vector
. Each k
value is the number of samples assigned to the k-th cluster.
MIToS.MSA.@res_str
— MacroThe MIToS macro @res_str
takes a string and returns a Vector
of Residues
(sequence).
julia> using MIToS.MSA
julia> res"MIToS"
5-element Vector{Residue}:
M
I
T
-
S
MIToS.SIFTS
— ModuleThe SIFTS
module of MIToS allows you to obtain the residue-level mapping between databases stored in the SIFTS XML files. It makes it easy to assign PDB residues to UniProt/Pfam positions. Because pairwise alignments can lead to misleading associations between residues in both sequences, SIFTS offers a more reliable association between sequence and structure residue numbers.
Features
- Download and parse SIFTS XML files
- Store residue-level mapping in Julia
- Easy generation of
OrderedDict
s between residues numbers
using MIToS.SIFTS
MIToS.SIFTS.SIFTSResidue
— TypeA SIFTSResidue
object stores the SIFTS residue level mapping for a residue. It has the following fields that you can access at any moment for query purposes:
- `PDBe` : A `dbPDBe` object, it's present in all the `SIFTSResidue`s.
- `UniProt` : A `dbUniProt` object or `missing`.
- `Pfam` : A `dbPfam` object or `missing`.
- `NCBI` : A `dbNCBI` object or `missing`.
- `InterPro` : An array of `dbInterPro` objects.
- `PDB` : A `dbPDB` object or `missing`.
- `SCOP` : A `dbSCOP` object or `missing`.
- `SCOP2` : An array of `dbSCOP2` objects.
- `SCOP2B` : A `dbSCOP2B` object or `missing`.
- `CATH` : A `dbCATH` object or `missing`.
- `Ensembl` : An array of `dbEnsembl` objects.
- `missing` : It's `true` if the residue is missing, i.e. not observed, in the structure.
- `sscode` : A string with the secondary structure code of the residue.
- `ssname` : A string with the secondary structure name of the residue.
MIToS.SIFTS.dbCATH
— TypedbCATH
stores the residue id
, number
, name
and chain
in CATH as strings.
MIToS.SIFTS.dbEnsembl
— TypedbEnsembl
stores the residue (gene) accession id
, the transcript
, translation
and exon
ids in Ensembl as strings, together with the residue number
and name
using the UniProt coordinates.
MIToS.SIFTS.dbInterPro
— TypedbInterPro
stores the residue id
, number
, name
and evidence
in InterPro as strings.
MIToS.SIFTS.dbNCBI
— TypedbNCBI
stores the residue id
, number
and name
in NCBI as strings.
MIToS.SIFTS.dbPDB
— TypedbPDB
stores the residue id
, number
, name
and chain
in PDB as strings.
MIToS.SIFTS.dbPDBe
— TypedbPDBe
stores the residue number
and name
in PDBe as strings.
MIToS.SIFTS.dbPfam
— TypedbPfam
stores the residue id
, number
and name
in Pfam as strings.
MIToS.SIFTS.dbSCOP
— TypedbSCOP
stores the residue id
, number
, name
and chain
in SCOP as strings.
MIToS.SIFTS.dbSCOP2
— TypedbSCOP2
stores the residue id
, number
, name
and chain
in SCOP2 as strings.
MIToS.SIFTS.dbSCOP2B
— TypedbSCOP2B
stores the residue id
, number
, name
and chain
in SCOP2B as strings. SCOP2B is an expansion of SCOP2 domain annotations at the superfamily level to every PDB entry with the same UniProt accession that has at least 80% SCOP2 domain coverage.
MIToS.SIFTS.dbUniProt
— TypedbUniProt
stores the residue id
, number
and name
in UniProt as strings.
Base.parse
— Methodparse(document::LightXML.XMLDocument, ::Type{SIFTSXML}; chain=All, missings::Bool=true)
Returns a Vector{SIFTSResidue}
parsed from a SIFTSXML
file. By default, parses all the chain
s and includes missing residues.
MIToS.SIFTS._get_attribute
— MethodReturns "" if the attribute is missing
MIToS.SIFTS._get_entities
— MethodGets the entities of a SIFTS XML. In some cases, each entity is a PDB chain. WARNING: Sometimes there are more chains than entities!
<entry dbSource="PDBe" ...
...
<entity type="protein" entityId="A">
...
</entity>
<entity type="protein" entityId="B">
...
</entity>
</entry>
MIToS.SIFTS._get_nullable_attribute
— MethodReturns missing
if the attribute is missing
MIToS.SIFTS._get_residues
— MethodReturns an Iterator of the residues on the listResidue
<listResidue>
<residue>
...
</residue>
...
</listResidue>
MIToS.SIFTS._get_segments
— MethodGets an array of the segments, the continuous regions of an entity. For example, chimeras and expression tags generate more than one segment.
MIToS.SIFTS._is_missing
— MethodReturns true
if the residue was annotated as Not_Observed.
MIToS.SIFTS.downloadsifts
— Methoddownloadsifts(pdbcode::String; filename::String, source::String="https")
Download the gzipped SIFTS XML file for the provided pdbcode
. The downloaded file will have the default extension .xml.gz
. While you can change the filename
, it must include the .xml.gz
ending. The source
keyword argument is set to "https"
by default. Alternatively, you can choose "ftp"
as the source
, which will retrieve the file from the EBI FTP server at ftp://ftp.ebi.ac.uk/pub/databases/msd/sifts/. However, please note that using "https"
is highly recommended. This option will download the file from the EBI PDBe server at https://www.ebi.ac.uk/pdbe/files/sifts/.
MIToS.SIFTS.siftsmapping
— MethodParses a SIFTS XML file and returns an OrderedDict
between residue numbers of two DataBase
s with the given identifiers. A chain
could be specified (All
by default). If missings
is true
(default) all the residues are used, even if they have no coordinates in the PDB file.
MIToS.Utils.Scripts.close_output
— MethodClose output (check if output is STDOUT).
MIToS.Utils.Scripts.loadedversion
— MethodReturn the version of the loaded module.
Source: https://stackoverflow.com/questions/60587041/julia-getting-the-version-number-of-my-module
MIToS.Utils.Scripts.open_output
— MethodOpens the output file or returns STDOUT.
MIToS.Utils.Scripts.parse_commandline
— MethodParse MIToS scripts command line arguments.
MIToS.Utils.Scripts.readorparse
— MethodDecides whether to read or parse; it uses parse with STDIN.
MIToS.Utils.Scripts.run_single_script
— MethodOpens and closes the output stream, tries to run the defined script.
MIToS.Utils.Scripts.runscript
— MethodIf FILE is not used, calls run_single_script with STDIN. Otherwise calls run_single_script with the FILE. If list
is true
, a for loop or pmap
is used over each line of the input.
MIToS.Utils.Scripts.set_parallel
— MethodAdds the needed number of workers.
MIToS.PDB
— ModuleThe module PDB
defines types and methods to work with protein structures inside Julia. It is useful for linking structural and sequence information, and it is needed to measure the predictive performance of mutual information scores in protein contact prediction.
Features
- Read and parse PDB and PDBML files
- Calculate distance and contacts between atoms or residues
- Determine interaction between residues
using MIToS.PDB
MIToS.PDB._hbond_acceptor
— ConstantKeys come from Table 1 of Bickerton et al. 2011. Antecedents come from: http://biomachina.org/courses/modeling/download/topallh22x.pro Synonyms come from: http://www.bmrb.wisc.edu/refinfo/atomnom.tbl
MIToS.PDB._hbond_donor
— ConstantKeys come from Table 1 of Bickerton et al. 2011. The hydrogen names of the donor come from: http://biomachina.org/courses/modeling/download/topallh22x.pro Synonyms come from: http://www.bmrb.wisc.edu/refinfo/atomnom.tbl
MIToS.PDB.covalentradius
— ConstantCovalent radius in Å of each element from the Additional file 1 of PICCOLO [1]. Hydrogen was updated using the value on Table 2 from Cordero et al. [2].
1. Bickerton, G. R., Higueruelo, A. P., & Blundell, T. L. (2011). Comprehensive, atomic-level characterization of structurally characterized protein-protein interactions: the PICCOLO database. BMC bioinformatics, 12(1), 313.
2. Cordero, B., Gómez, V., Platero-Prats, A. E., Revés, M., Echeverría, J., Cremades, E., ... & Alvarez, S. (2008). Covalent radii revisited. Dalton Transactions, (21), 2832-2838.
MIToS.PDB.vanderwaalsradius
— Constantvan der Waals radius in Å from the Additional file 1 of Bickerton et al. 2011
- Bickerton, G. R., Higueruelo, A. P., & Blundell, T. L. (2011).
Comprehensive, atomic-level characterization of structurally characterized protein-protein interactions: the PICCOLO database. BMC bioinformatics, 12(1), 313.
MIToS.PDB.Coordinates
— TypeA Coordinates
object is a fixed size vector with the coordinates x,y,z.
MIToS.PDB.PDBAtom
— TypeA PDBAtom
object contains the information from a PDB atom, without information of the residue. It has the following fields that you can access at any moment for query purposes:
- `coordinates` : x,y,z coordinates, e.g. `Coordinates(109.641,73.162,42.7)`.
- `atom` : Atom name, e.g. `"CA"`.
- `element` : Element type of the atom, e.g. `"C"`.
- `occupancy` : A float number with the occupancy, e.g. `1.0`.
- `B` : B factor as a string, e.g. `"23.60"`.
MIToS.PDB.PDBFile
— TypePDBFile <: FileFormat
Protein Data Bank (PDB) format. It provides a standard representation for macromolecular structure data derived from X-ray diffraction and NMR studies.
MIToS.PDB.PDBML
— TypePDBML <: FileFormat
Protein Data Bank Markup Language (PDBML), a representation of PDB data in XML format.
MIToS.PDB.PDBResidue
— TypeA PDBResidue
object contains all the information about a PDB residue. It has the following fields that you can access at any moment for query purposes:
- `id` : A `PDBResidueIdentifier` object.
- `atoms` : A vector of `PDBAtom`s.
MIToS.PDB.PDBResidueIdentifier
— TypeA PDBResidueIdentifier
object contains the information needed to identify PDB residues. It has the following fields that you can access at any moment for query purposes:
- `PDBe_number` : It's only used when a PDBML is read (PDBe number as a string).
- `number` : PDB residue number, it includes insertion codes, e.g. `"34A"`.
- `name` : Three letter residue name in PDB, e.g. `"LYS"`.
- `group` : It can be `"ATOM"` or `"HETATM"`.
- `model` : The model number as a string, e.g. `"1"`.
- `chain` : The chain as a string, e.g. `"A"`.
Base.angle
— Methodangle(a::Coordinates, b::Coordinates, c::Coordinates)
Angle (in degrees) at b
between a-b
and b-c
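The geometry can be sketched with the standard library. Here plain tuples stand in for `Coordinates` objects and `angle_sketch` is a hypothetical name; the sketch measures the angle at `b` between the vectors pointing from `b` to `a` and from `b` to `c`.

```julia
using LinearAlgebra

function angle_sketch(a, b, c)
    u = collect(a) .- collect(b)   # vector from b to a
    v = collect(c) .- collect(b)   # vector from b to c
    # Clamp to guard against rounding slightly outside [-1, 1].
    cosine = clamp(dot(u, v) / (norm(u) * norm(v)), -1.0, 1.0)
    acosd(cosine)                  # angle in degrees
end

right = angle_sketch((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 1.0, 0.0))
straight = angle_sketch((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (-2.0, 0.0, 0.0))
println((right, straight))
```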
Base.any
— Methodany(f::Function, a::PDBResidue, b::PDBResidue, criteria::Function)
Test if the function f
is true for any pair of atoms between the residues a
and b
. This function only tests atoms that return true
for the function criteria
.
Base.any
— Methodany(f::Function, a::PDBResidue, b::PDBResidue)
Test if the function f
is true for any pair of atoms between the residues a
and b
Base.parse
— Methodparse(pdbml, ::Type{PDBML}; chain=All, model=All, group=All, atomname=All, onlyheavy=false, label=true, occupancyfilter=false)
Reads a LightXML.XMLDocument
representing a pdb file. Returns a list of PDBResidue
s (view MIToS.PDB.PDBResidues
). Setting chain
, model
, group
, atomname
and onlyheavy
values can be used to select a subset of all residues. If not set, all residues are returned. If the keyword argument label
(default: true
) is false
, the auth_ attributes will be used instead of the label_ attributes for chain
, atom
and residue name
fields. The auth_ attributes are alternatives provided by an author in order to match the identification/values used in the publication that describes the structure. If the keyword argument occupancyfilter
(default: false
) is true
, only the atoms with the best occupancy are returned.
Base.parse
— Methodparse(io, ::Type{PDBFile}; chain=All, model=All, group=All, atomname=All, onlyheavy=false, occupancyfilter=false)
Reads a text file of a PDB entry. Returns a list of PDBResidue
(view MIToS.PDB.PDBResidues
). Setting chain
, model
, group
, atomname
and onlyheavy
values can be used to select a subset of all residues. Group can be "ATOM"
or "HETATM"
. If not set, all residues are returned. If the keyword argument occupancyfilter
(default: false
) is true
, only the atoms with the best occupancy are returned.
Base.print
— Functionprint(io, res, format::Type{PDBFile})
print(res, format::Type{PDBFile})
Print a PDBResidue
or a vector of PDBResidue
s in PDB format.
MIToS.PDB.CAmatrix
— MethodReturns a matrix with the x, y and z coordinates of the Cα with best occupancy for each PDBResidue
of the ATOM group. If a residue doesn't have a Cα, its Cα coordinates are NaNs.
MIToS.PDB._change_B
— MethodReturns a new PDBAtom
but with a B
as B-factor
MIToS.PDB._generate_residues
— FunctionUsed for parsing a PDB file into Vector{PDBResidue}
MIToS.PDB._get_matched_Cαs
— MethodReturn the matching CA matrices after deleting the rows/residues where the CA is missing in at least one structure.
MIToS.PDB._rmsf_test
— MethodThis looks for errors in the input to rmsf methods
MIToS.PDB.aromatic
— MethodThere's an aromatic interaction if centroids are at 6.0 Å or less.
MIToS.PDB.aromaticsulphur
— MethodReturns true
if a sulphur atom and an aromatic atom are at 5.3 Å or less.
MIToS.PDB.atoms
— Functionatoms(residue_list, model, chain, group, residue, atom)
It returns a vector of PDBAtom
s with the selected subset of atoms. You can use the type All
(default value of the positional arguments) to avoid filtering at that level.
MIToS.PDB.bestoccupancy
— MethodTakes a Vector
of PDBAtom
s and returns a Vector
of the PDBAtom
s with best occupancy.
MIToS.PDB.center!
— Methodcenter!(A::AbstractMatrix{Float64})
Takes a set of points A
as an NxD matrix (N: number of points, D: dimension). Translates A
in place so that its centroid is at the origin of coordinates.
MIToS.PDB.centeredcoordinates
— FunctionReturns a Matrix{Float64}
with the centered coordinates of all the atoms in residues
. An optional positional argument CA
(default: true
) defines if only Cα carbons should be used to center the matrix.
MIToS.PDB.centeredresidues
— FunctionReturns a new Vector{PDBResidue}
with the PDBResidue
s having centered coordinates. An optional positional argument CA
(default: true
) defines if only Cα carbons should be used to center the matrix.
MIToS.PDB.change_coordinates
— Functionchange_coordinates(residue::PDBResidue, coordinates::AbstractMatrix{Float64}, offset::Int=1)
Returns a new PDBResidue
with (x,y,z) from a coordinates AbstractMatrix{Float64}
. You can give an offset
indicating the matrix row where the (x,y,z) coordinates of the residue start.
MIToS.PDB.change_coordinates
— Methodchange_coordinates(residues::AbstractVector{PDBResidue}, coordinates::AbstractMatrix{Float64})
Returns a new Vector{PDBResidue}
with (x,y,z) from a coordinates Matrix{Float64}
MIToS.PDB.change_coordinates
— Methodchange_coordinates(atom::PDBAtom, coordinates::Coordinates)
Returns a new PDBAtom
but with new coordinates
MIToS.PDB.check_atoms_for_interactions
— MethodThis function takes a PDBResidue
and returns true
only if all the atoms can be used for checking interactions.
MIToS.PDB.contact
— Methodcontact(a::Coordinates, b::Coordinates, limit::AbstractFloat)
It returns true if the distance is less than or equal to the limit. It doesn't call sqrt
because it does squared_distance(a,b) <= limit^2
.
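The reason the square root can be skipped is that, for non-negative numbers, comparing distances gives the same answer as comparing their squares. A minimal sketch with tuples standing in for `Coordinates` (the `_sketch` names are hypothetical):

```julia
# Squared Euclidean distance between two points given as tuples.
squared_distance_sketch(a, b) = sum(abs2, a .- b)

# Same truth value as `distance <= limit`, without computing a square root.
contact_sketch(a, b, limit) = squared_distance_sketch(a, b) <= limit^2

a = (0.0, 0.0, 0.0)
b = (3.0, 4.0, 0.0)   # 5 Å away from a
println((contact_sketch(a, b, 6.0), contact_sketch(a, b, 4.9)))
```

This matters when building contact maps, where the comparison runs over every pair of atoms.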
MIToS.PDB.contact
— Methodcontact(A::PDBResidue, B::PDBResidue, limit::AbstractFloat; criteria::String="All")
Returns true
if the residues A
and B
are at contact distance (limit
). The available distance criteria
are: Heavy
, All
, CA
, CB
(CA
for GLY
)
MIToS.PDB.contact
— Methodcontact(residues::Vector{PDBResidue}, limit::AbstractFloat; criteria::String="All")
If contact
takes a Vector{PDBResidue}
, it returns a matrix with all the pairwise comparisons (contact map).
MIToS.PDB.coordinatesmatrix
— MethodReturns a matrix with the x, y, z coordinates of each atom in each PDBResidue
MIToS.PDB.covalent
— MethodReturns true
if the distance between atoms is less than the sum of the covalentradius
of each atom.
MIToS.PDB.distance
— MethodIt calculates the Euclidean distance.
MIToS.PDB.distance
— Methoddistance(residues::Vector{PDBResidue}; criteria::String="All")
If distance
takes a Vector{PDBResidue}
returns a PairwiseListMatrix{Float64, false}
with all the pairwise comparisons (distance matrix).
MIToS.PDB.disulphide
— MethodReturns true
if two CYS
's S
are at 2.08 Å or less
MIToS.PDB.downloadpdb
— MethodIt downloads a gzipped PDB file from the PDB database. It requires a four-character pdbcode
. Its default format
is PDBML
(PDB XML) and it uses the baseurl
"http://www.rcsb.org/pdb/files/". filename
is the path/name of the output file. This function calls MIToS.Utils.download_file
that calls download
from the HTTP.jl package. You can use keyword arguments from HTTP.request
. Use the headers
keyword argument to pass a Dict{String, String}
with the header information.
MIToS.PDB.downloadpdbheader
— MethodIt downloads a JSON file containing the PDB header information.
MIToS.PDB.findCB
— MethodReturns a vector of indices for CB
(CA
for GLY
)
MIToS.PDB.findatoms
— Methodfindatoms(res::PDBResidue, atom::String)
Returns an index vector of the atoms with the given atom
name.
MIToS.PDB.findheavy
— MethodReturns a list with the index of the heavy atoms (all atoms except hydrogen) in the PDBResidue
MIToS.PDB.getCA
— MethodReturns the Cα with best occupancy in the PDBResidue
. If the PDBResidue
has no Cα, missing
is returned.
MIToS.PDB.getpdbdescription
— MethodAccess general information about a PDB entry (e.g., Header information) using the GraphQL interface of the PDB database. It parses the JSON answer into a Dict.
MIToS.PDB.hydrogenbond
— MethodThis function only works if there are hydrogens in the structure. The criteria for a hydrogen bond are:
- d(Ai, Aj) < 3.9Å
- d(Ah, Aacc) < 2.5Å
- θ(Adon, Ah, Aacc) > 90°
- θ(Adon, Aacc, Aacc-antecedent) > 90°
- θ(Ah, Aacc, Aacc-antecedent) > 90°
Where Ah is the donated hydrogen atom, Adon is the hydrogen bond donor atom, Aacc is the hydrogen bond acceptor atom and Aacc-antecedent is the atom antecedent to the hydrogen bond acceptor atom.
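The geometric part of these criteria can be sketched for four points given as tuples. This is an illustration only: the helper names are hypothetical, the first criterion is read here as the donor to acceptor distance (an assumption about what Ai and Aj denote), and the selection of donor, acceptor and antecedent atoms from a structure is not covered.

```julia
using LinearAlgebra

dist(a, b) = norm(collect(a) .- collect(b))

# Angle in degrees at vertex b.
function angle_at(a, b, c)
    u = collect(a) .- collect(b)
    v = collect(c) .- collect(b)
    acosd(clamp(dot(u, v) / (norm(u) * norm(v)), -1.0, 1.0))
end

# don: donor, h: donated hydrogen, acc: acceptor, ant: acceptor antecedent.
function hbond_geometry_ok(don, h, acc, ant)
    dist(don, acc) < 3.9 &&
        dist(h, acc) < 2.5 &&
        angle_at(don, h, acc) > 90 &&
        angle_at(don, acc, ant) > 90 &&
        angle_at(h, acc, ant) > 90
end

don, h = (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)
near = hbond_geometry_ok(don, h, (2.9, 0.0, 0.0), (4.1, 0.5, 0.0))
far = hbond_geometry_ok(don, h, (5.0, 0.0, 0.0), (6.2, 0.5, 0.0))
println((near, far))
```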
MIToS.PDB.hydrophobic
— MethodThere's a hydrophobic interaction if two hydrophobic atoms are at 5.0 Å or less.
MIToS.PDB.ionic
— MethodThere's an ionic interaction if a cationic atom and an anionic atom are at 6.0 Å or less.
MIToS.PDB.is_aminoacid
— Methodis_aminoacid(residue::PDBResidue)
is_aminoacid(residue_id::PDBResidueIdentifier)
This function returns true
if the PDB residue is an amino acid residue. It checks if the residue's three-letter name exists in the MIToS.Utils.THREE2ONE
dictionary, and returns false
otherwise.
MIToS.PDB.isanionic
— MethodReturns true if the atom, e.g. ("GLU","CD")
, is an anionic atom in the residue.
MIToS.PDB.isaromatic
— MethodReturns true if the atom, e.g. ("HIS","CG")
, is an aromatic atom in the residue.
MIToS.PDB.isatom
— MethodIt tests if the atom has the indicated atom name.
MIToS.PDB.iscationic
— MethodReturns true if the atom, e.g. ("ARG","NE")
, is a cationic atom in the residue.
MIToS.PDB.ishbondacceptor
— MethodReturns true if the atom, e.g. ("ARG","O")
, is an acceptor in H bonds.
MIToS.PDB.ishbonddonor
— MethodReturns true if the atom, e.g. ("ARG","N")
, is a donor in H bonds.
MIToS.PDB.isresidue
— MethodIt tests if the PDB residue has the indicated model, chain, group and residue number.
MIToS.PDB.kabsch
— Methodkabsch(A::AbstractMatrix{Float64}, B::AbstractMatrix{Float64})
This function takes two sets of points, A
(reference) and B
as NxD matrices, where D is the dimension and N is the number of points. Assumes that the centroids of A
and B
are at the origin of coordinates. You can call center!
on each matrix before calling kabsch
to center the matrices at (0.0, 0.0, 0.0)
. Rotates B
so that rmsd(A,B)
is minimized. Returns the rotation matrix. You should do B * RotationMatrix
to get the rotated B.
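A minimal sketch of the Kabsch step, using the same conventions as the description (NxD matrices, centered inputs, rotated B obtained as B * RotationMatrix). It relies only on the LinearAlgebra standard library. The name `kabsch_sketch` is hypothetical and the code is not taken from MIToS.

```julia
using LinearAlgebra

# Returns the rotation R that minimizes the RMSD between A and B * R.
# Both matrices are assumed to be centered at the origin already.
function kabsch_sketch(A::AbstractMatrix{Float64}, B::AbstractMatrix{Float64})
    M = B' * A                       # DxD cross-covariance matrix
    F = svd(M)
    d = sign(det(F.U * F.Vt))        # -1.0 if the best fit would be a reflection
    D = Diagonal([ones(size(M, 1) - 1); d])
    return F.U * D * F.Vt            # proper rotation, det(R) == 1
end

# Six centered points and a copy rotated by 90 degrees around the z axis.
A = [1.0 0 0; -1 0 0; 0 2 0; 0 -2 0; 0 0 3; 0 0 -3]
Q = [0.0 -1.0 0.0; 1.0 0.0 0.0; 0.0 0.0 1.0]
B = A * Q
R = kabsch_sketch(A, B)
println(isapprox(B * R, A; atol = 1e-8))
```

The diagonal correction keeps the result a proper rotation, so a mirrored structure is never mapped onto the reference by reflection.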
MIToS.PDB.mean_coordinates
— MethodCalculates the average/mean position of each atom in a set of structures. The function takes a vector (AbstractVector
) of vectors (AbstractVector{PDBResidue}
) or matrices (AbstractMatrix{Float64}
) as first argument. As second (optional) argument this function can take an AbstractVector{Float64}
of matrix/structure weights to return a weighted mean. When an AbstractVector{PDBResidue} is used, if the keyword argument calpha
is false
the mean is calculated for all the atoms. By default only alpha carbons are used (default: calpha=true
).
MIToS.PDB.modelled_sequences
— Methodmodelled_sequences(residue_list::AbstractArray{PDBResidue,N};
model::Union{String,Type{All}}=All, chain::Union{String,Type{All}}=All,
group::Union{String,Regex,Type{All}}=All) where N
This function returns an OrderedDict
where each key is a named tuple (containing the model and chain identifiers), and each value is the protein sequence corresponding to the modelled residues in those chains. Therefore, the obtained sequences do not contain missing residues. All modelled residues are included by default, but those that don't satisfy specified criteria based on the model
, chain
, or group
keyword arguments are excluded. One-letter residue names are obtained from the MIToS.Utils.THREE2ONE
dictionary for all residue names that return true
for is_aminoacid
.
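The idea can be sketched with plain named tuples standing in for residues. Everything here is an assumption made for illustration: `THREE2ONE_SUBSET` is a three-entry stand-in for MIToS.Utils.THREE2ONE, `modelled_sequences_sketch` is a hypothetical name, and a plain `Dict` is used instead of an OrderedDict to stay within the standard library.

```julia
# Tiny stand-in for the THREE2ONE dictionary (illustration only).
const THREE2ONE_SUBSET = Dict("ALA" => 'A', "GLY" => 'G', "LYS" => 'K')

# Builds one sequence per (model, chain) from the residues present in the list.
# Residues whose name is not in the dictionary (e.g. water) are skipped.
function modelled_sequences_sketch(residues)
    seqs = Dict{NamedTuple{(:model, :chain),Tuple{String,String}},String}()
    for r in residues
        haskey(THREE2ONE_SUBSET, r.name) || continue
        key = (model = r.model, chain = r.chain)
        seqs[key] = string(get(seqs, key, ""), THREE2ONE_SUBSET[r.name])
    end
    return seqs
end

residues = [
    (model = "1", chain = "A", name = "ALA"),
    (model = "1", chain = "A", name = "HOH"),
    (model = "1", chain = "A", name = "GLY"),
    (model = "1", chain = "B", name = "LYS"),
]
println(modelled_sequences_sketch(residues))
```

Because only residues present in the list contribute, unmodelled positions leave no gap character in the resulting strings.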
MIToS.PDB.pication
— MethodThere's a π-cation interaction if a cationic and an aromatic atom are at 6.0 Å or less.
MIToS.PDB.proximitymean
— Methodproximitymean
calculates the proximity mean/average for each residue as the average score (from a scores
list) of all the residues within a certain physical distance to a given amino acid. The score of that residue is not included in the mean unless you set include
to true
. The default values are 6.05 for the distance threshold/limit
and "Heavy"
for the criteria
keyword argument. This function can be used to calculate pMI (proximity mutual information) and pC (proximity conservation) as in Buslje et al. 2010.
Buslje, Cristina Marino, Elin Teppa, Tomas Di Doménico, José María Delfino, and Morten Nielsen. Networks of high mutual information define the structural proximity of catalytic sites: implications for catalytic residue identification. PLoS Comput Biol 6, no. 11 (2010): e1000978.
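The averaging described for proximitymean can be sketched on a precomputed distance matrix. This is a simplified illustration, not the MIToS code: `proximitymean_sketch` is a hypothetical name, residue distances are assumed to be given as a symmetric matrix, the threshold comparison is assumed to be inclusive, and NaN is assumed for residues with no neighbours.

```julia
# Mean score of the residues within `threshold` of each residue.
# The residue's own score is only counted when `include` is true.
function proximitymean_sketch(distances::AbstractMatrix{Float64},
                              scores::AbstractVector{Float64};
                              threshold::Float64 = 6.05,
                              include::Bool = false)
    n = length(scores)
    result = fill(NaN, n)            # assumption: NaN when nothing is close
    for i in 1:n
        total = 0.0
        n_close = 0
        for j in 1:n
            (i == j && !include) && continue
            if distances[i, j] <= threshold
                total += scores[j]
                n_close += 1
            end
        end
        n_close > 0 && (result[i] = total / n_close)
    end
    return result
end

# Three residues on a line at 0, 5 and 20 Å.
distances = [0.0 5.0 20.0; 5.0 0.0 15.0; 20.0 15.0 0.0]
scores = [1.0, 2.0, 3.0]
println(proximitymean_sketch(distances, scores))
```

With `include=true` the isolated third residue gets its own score back instead of NaN, since it is then its only neighbour.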
MIToS.PDB.residuepairsmatrix
— MethodIt creates a NamedArray
containing a PairwiseListMatrix
where each element (column, row) is identified with a PDBResidue
from the input vector. You can indicate the value type of the matrix (defaults to Float64
), whether the list should store the diagonal values (defaults to Val{false}
), and the value used for the diagonal (defaults to NaN
).
MIToS.PDB.residues
— Methodresidues(residue_list, model, chain, group, residue)
These return a new vector with the selected subset of residues from a list of residues. You can use the type All
(default value) to avoid filtering at that level.
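The reason a type such as All works as a wildcard can be shown with a small dispatch sketch. All names below are hypothetical (`AnyLevel` stands in for All, and residues are plain named tuples). This illustrates the dispatch pattern only, not the MIToS internals.

```julia
# Hypothetical stand-in for the All type: it is never instantiated,
# the type itself is passed as the "do not filter" value.
struct AnyLevel end

# One method per kind of selector; dispatch picks the right comparison.
level_matches(value::String, ::Type{AnyLevel}) = true
level_matches(value::String, wanted::String) = value == wanted
level_matches(value::String, wanted::Regex) = occursin(wanted, value)

# Keeps the residues that match at every level; AnyLevel skips a level.
function select_sketch(residues; model = AnyLevel, chain = AnyLevel, group = AnyLevel)
    return [r for r in residues if level_matches(r.model, model) &&
                                   level_matches(r.chain, chain) &&
                                   level_matches(r.group, group)]
end

residues = [
    (model = "1", chain = "A", group = "ATOM"),
    (model = "1", chain = "B", group = "HETATM"),
    (model = "2", chain = "A", group = "ATOM"),
]
println(length(select_sketch(residues; chain = "A")))
```

Passing a type rather than a string such as "all" means the wildcard case is resolved by method selection, and a chain that is literally named "all" can never be mistaken for it.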
MIToS.PDB.residuesdict
— Methodresiduesdict(residue_list, model, chain, group, residue)
These return a dictionary (using PDB residue numbers as keys) with the selected subset of residues. You can use the type All
(default value) to avoid filtering at that level.
MIToS.PDB.rmsd
— Methodrmsd(A::AbstractMatrix{Float64}, B::AbstractMatrix{Float64})
Returns the RMSD between two sets of points A
and B
, given as NxD matrices (N: number of points, D: dimension).
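For two NxD matrices with matched rows, the RMSD is the square root of the mean squared distance between corresponding points. A one-line sketch under that definition (the name `rmsd_sketch` is hypothetical and no superimposition is performed):

```julia
# Root mean square deviation between matched rows of two NxD matrices.
rmsd_sketch(A::AbstractMatrix{Float64}, B::AbstractMatrix{Float64}) =
    sqrt(sum(abs2, A .- B) / size(A, 1))

A = [0.0 0.0 0.0; 0.0 0.0 0.0]
B = [3.0 4.0 0.0; 0.0 0.0 0.0]
println(rmsd_sketch(A, B))   # one point displaced by 5 Å, the other by 0 Å
```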
MIToS.PDB.rmsd
— Methodrmsd(A::AbstractVector{PDBResidue}, B::AbstractVector{PDBResidue}; superimposed::Bool=false)
Returns the Cα RMSD value between two PDB structures: A
and B
. If the structures are already superimposed, use superimposed=true
to avoid a new superimposition (superimposed
is false
by default).
MIToS.PDB.rmsf
— MethodCalculates the RMSF (Root Mean-Square-Fluctuation) between an atom and its average position in a set of structures. The function takes a vector (AbstractVector
) of vectors (AbstractVector{PDBResidue}
) or matrices (AbstractMatrix{Float64}
) as first argument. As second (optional) argument this function can take an AbstractVector{Float64}
of matrix/structure weights to return the root weighted mean-square-fluctuation around the weighted mean structure. When a Vector{PDBResidue} is used, if the keyword argument calpha
is false
the RMSF is calculated for all the atoms. By default only alpha carbons are used (default: calpha=true
).
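The weighted RMSF on coordinate matrices can be sketched as follows, assuming the structures are already superimposed and share the same atom order. `rmsf_sketch` is a hypothetical name and the code only uses Base Julia. It is an illustration of the definition, not the MIToS implementation.

```julia
# Per-atom root (weighted) mean-square fluctuation around the (weighted) mean structure.
# `structures` is a vector of NxD coordinate matrices with one atom per row.
function rmsf_sketch(structures::AbstractVector{<:AbstractMatrix{Float64}},
                     weights::AbstractVector{Float64} = ones(length(structures)))
    wsum = sum(weights)
    # Weighted mean position of every atom.
    avg = sum(w .* X for (w, X) in zip(weights, structures)) ./ wsum
    # Weighted sum of squared distances of each atom (row) to its mean position.
    sq = sum(w .* vec(sum(abs2, X .- avg; dims = 2)) for (w, X) in zip(weights, structures))
    return sqrt.(sq ./ wsum)
end

X1 = [0.0 0.0 0.0; 1.0 1.0 1.0]
X2 = [2.0 0.0 0.0; 1.0 1.0 1.0]
println(rmsf_sketch([X1, X2]))   # first atom moves, second atom is fixed
```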
MIToS.PDB.selectbestoccupancy
— MethodTakes a PDBResidue
and a Vector
of atom indices. Returns the index value of the Vector
with maximum occupancy.
MIToS.PDB.squared_distance
— MethodIt calculates the squared Euclidean distance, i.e. it doesn't spend time computing sqrt
MIToS.PDB.squared_distance
— Methodsquared_distance(A::PDBResidue, B::PDBResidue; criteria::String="All")
Returns the squared distance between the residues A
and B
. The available criteria
are: Heavy
, All
, CA
, CB
(CA
for GLY
)
MIToS.PDB.superimpose
— FunctionAsuper, Bsuper, RMSD = superimpose(A, B, matches=nothing)
This function takes A::AbstractVector{PDBResidue}
(reference) and B::AbstractVector{PDBResidue}
. Translates A
and B
to the origin of coordinates, and rotates B
so that rmsd(A,B)
is minimized with the Kabsch algorithm (using only their α carbons). Returns the rotated and translated versions of A
and B
, and the RMSD value.
Optionally provide matches
which iterates over matched index pairs in A
and B
, e.g., matches = [(3, 5), (4, 6), ...]
. The alignment will be constructed using just the matching residues.
MIToS.PDB.vanderwaals
— MethodTest if two atoms or residues are in van der Waals contact using: distance(a,b) <= 0.5 + vanderwaalsradius[a] + vanderwaalsradius[b]
. It returns distance <= 0.5
if the atoms aren't in vanderwaalsradius
.
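The contact rule and its fallback can be written down directly. In this sketch the function name `vdw_contact_sketch` is hypothetical and the radii table holds placeholder values chosen for the example, not the values of the MIToS vanderwaalsradius dictionary.

```julia
# Placeholder radii keyed by (residue name, atom name); illustrative values only.
radii = Dict(("ALA", "CB") => 1.9, ("GLY", "CA") => 1.9)

# distance <= 0.5 + r_a + r_b when both radii are known, else distance <= 0.5.
function vdw_contact_sketch(distance::Float64, a, b, radii)
    if haskey(radii, a) && haskey(radii, b)
        return distance <= 0.5 + radii[a] + radii[b]
    end
    return distance <= 0.5
end

println(vdw_contact_sketch(4.0, ("ALA", "CB"), ("GLY", "CA"), radii))
println(vdw_contact_sketch(4.0, ("ALA", "CB"), ("XXX", "Q1"), radii))
```

The fallback is strict: without radii for both atoms, only pairs closer than 0.5 Å are reported as contacts.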
MIToS.PDB.vanderwaalsclash
— MethodReturns true
if the distance between the atoms is less than the sum of the vanderwaalsradius
of the atoms. If the atoms aren't on the list (e.g. OXT
), the vanderwaalsradius
of the element is used. If there is no data in the dict, distance 0.0
is used.
MIToS.PDB.@atoms
— Macro@atoms ... model ... chain ... group ... residue ... atom ...
These return a vector of PDBAtom
s with the selected subset of atoms. You can use the type All
to avoid filtering that option.
MIToS.PDB.@residues
— Macro@residues ... model ... chain ... group ... residue ...
These return a new vector with the selected subset of residues from a list of residues. You can use the type All
to avoid filtering that option.
MIToS.PDB.@residuesdict
— Macro@residuesdict ... model ... chain ... group ... residue ...
These return a dictionary (using PDB residue numbers as keys) with the selected subset of residues. You can use the type All
to avoid filtering that option.
MIToS.Pfam
— ModuleThe Pfam
module defines functions to measure the protein contact prediction performance of information measures between column pairs from a Pfam MSA.
Features
- Read and download Pfam MSAs
- Obtain PDB information from alignment annotations
- Map between sequence/alignment residues/columns and PDB structures
- Measure of AUC (ROC curve) for contact prediction of MI scores
using MIToS.Pfam
MIToS.Pfam.downloadpfam
— MethodIt downloads a gzipped Stockholm alignment from InterPro for the Pfam family with the given pfamcode
.
By default, it downloads the full
Pfam alignment. You can use the alignment
keyword argument to download the seed
or the uniprot
alignment instead. For example, downloadpfam("PF00069")
will download the full alignment for the PF00069 Pfam family, while downloadpfam("PF00069", alignment="seed")
will download the seed alignment of the family.
The extension of the downloaded file is .stockholm.gz
by default; you can change it using the filename
keyword argument, but the .gz
at the end is mandatory.
MIToS.Pfam.getcontactmasks
— MethodThis function takes a msacontacts
or its list of contacts contact_list
with 1.0 for true contacts and 0.0 for non-contacts (NaN or other numbers for missing values). Returns two BitVector
s, the first with true
s where contact_list
is 1.0 and the second with true
s where contact_list
is 0.0. These are useful for AUC calculations.
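On a plain vector of contact values, the two masks reduce to element-wise comparisons. A minimal sketch (the name `contactmasks_sketch` is hypothetical; the real function also accepts the output of msacontacts):

```julia
# Returns (true_contacts, false_contacts) as BitVectors.
# NaN and any value other than 1.0 or 0.0 ends up false in both masks.
function contactmasks_sketch(contact_list::AbstractVector{Float64})
    true_contacts = contact_list .== 1.0
    false_contacts = contact_list .== 0.0
    return true_contacts, false_contacts
end

contacts, noncontacts = contactmasks_sketch([1.0, 0.0, NaN, 1.0])
println(contacts)
println(noncontacts)
```

Keeping two masks instead of one lets missing values drop out of both the positive and the negative class when scoring predictions.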
MIToS.Pfam.getseq2pdb
— MethodGenerates from a Pfam msa
a Dict{String, Vector{Tuple{String,String}}}
. Keys are sequence IDs and each value is a list of tuples containing PDB code and chain.
julia> getseq2pdb(msa)
Dict{String,Array{Tuple{String,String},1}} with 1 entry:
"F112_SSV1/3-112" => [("2VQC","A")]
MIToS.Pfam.hasresidues
— MethodReturns a BitVector
where there is a true
for each column with PDB residue.
MIToS.Pfam.msacolumn2pdbresidue
— Methodmsacolumn2pdbresidue(msa, seqid, pdbid, chain, pfamid, siftsfile; strict=false, checkpdbname=false, missings=true)
This function returns an OrderedDict{Int,String}
with MSA column numbers on the input file as keys and PDB residue numbers (""
for missing residues) as values. The mapping is performed using SIFTS. This function needs correct ColMap and SeqMap annotations. It checks the correspondence of the residues between the MSA sequence and SIFTS, throwing a warning if there are differences. Missing residues are included if the keyword argument missings
is true
(default: true
). If the keyword argument strict
is true
(default: false
), throws an Error, instead of a Warning, when residues don't match. If the keyword argument checkpdbname
is true
(default: false
), throws an Error if the three letter name of the PDB residue isn't the MSA residue. If you are working with a downloaded Pfam MSA without modifications, you should read
it using generatemapping=true
and useidcoordinates=true
. If you don't indicate the path to the siftsfile
used in the mapping, this function downloads the SIFTS file in the current folder. If you don't indicate the Pfam accession number (pfamid
), this function tries to read the AC file annotation.
MIToS.Pfam.msacontacts
— FunctionThis function takes an AnnotatedMultipleSequenceAlignment
with correct ColMap annotations and two dicts:
- The first is an OrderedDict{String,PDBResidue} from PDB residue number to PDBResidue.
- The second is a Dict{Int,String} from MSA column number on the input file to PDB residue number.
msacontacts
returns a PairwiseListMatrix{Float64,false}
of 0.0
and 1.0
where 1.0
indicates a residue contact. Contacts are defined with an inter-residue distance less than or equal to distance_limit
(default to 6.05
) angstroms between any heavy atom. NaN
indicates a missing value.
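The encoding of contacts can be sketched on a dense matrix of inter-residue distances. This is a simplification made for illustration: `contact_map_sketch` is a hypothetical name, the input is assumed to hold the minimal heavy-atom distance per residue pair with NaN where a column has no PDB residue, and a plain `Matrix` is returned instead of a PairwiseListMatrix.

```julia
# 1.0 for contacts (distance <= distance_limit), 0.0 otherwise, NaN if missing.
function contact_map_sketch(distances::AbstractMatrix{Float64};
                            distance_limit::Float64 = 6.05)
    return map(d -> isnan(d) ? NaN : Float64(d <= distance_limit), distances)
end

distances = [0.0 5.0 NaN; 5.0 0.0 7.0; NaN 7.0 0.0]
println(contact_map_sketch(distances))
```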
MIToS.Pfam.msaresidues
— MethodThis function takes an AnnotatedMultipleSequenceAlignment
with correct ColMap annotations and two dicts:
- The first is an OrderedDict{String,PDBResidue} from PDB residue number to PDBResidue.
- The second is a Dict{Int,String} from MSA column number on the input file to PDB residue number.
msaresidues
returns an OrderedDict{Int,PDBResidue}
from input column number (ColMap) to PDBResidue
. Residues on inserts are not included.