MSA
The MSA module of MIToS has utilities for working with Multiple Sequence Alignments of protein Sequences (MSA).
using MIToS.MSA # to load the MSA module
Features
- Read and write MSAs in
Stockholm
,FASTA
,PIR
orRaw
format. - Handle MSA annotations.
- Edit the MSA, e.g. delete columns or sequences, change sequence order, shuffling...
- Keep track of positions and annotations after modifications on the MSA.
- Describe an MSA, e.g. mean percent identity, sequence coverage, gap percentage...
- Sequence clustering with Hobohm I.
Contents
- MSA
MSA IO
Reading MSA files
The main function for reading MSA files in MIToS is read
and it is defined in the Utils
module. This function takes a filename/path as a first argument followed by other arguments. It opens the file and uses the arguments to call the parse
function. read
decides how to open the file, using the prefixes (e.g. https) and suffixes (i.e. extensions) of the file name, while parse
does the actual parsing of the file. You can read
gzipped files if they have the .gz
extension and also urls pointing to a web file. The second argument of read
and parse
is the file FileFormat
. The supported MSA formats at the moment are Stockholm
, FASTA
, PIR
(NBRF) and Raw
. For example, reading with MIToS the full Stockholm MSA of the family PF07388 using the Pfam RESTful interface will be:
using MIToS.MSA
read("http://pfam.xfam.org/family/PF07388/alignment/full", Stockholm)
AnnotatedMultipleSequenceAlignment with 35 annotations : 14×459 Named Array{MIToS.MSA.Residue,2}
Seq ╲ Col │ 24 25 26 27 28 … 551 552 553 554 555
─────────────────────────┼────────────────────────────────────────────────────
A0A1S1T395_9RHIZ/224-427 │ - - - - - … - - - - -
M5A3D2_9ACTN/208-402 │ - - - - - - - - - -
A0A1B8PI30_MORNO/200-434 │ - - - - - - - - - -
A0A1W1V8Q8_9PAST/1-405 │ - - S K F - - - - -
A0A0B0IGB5_9BACI/201-416 │ - - - - - - - - - -
A0A0B0IGB5_9BACI/1-170 │ - - - - - - - - - -
Q4W584_NEIMB/1-222 │ M L K K I - - - - -
A0A1S1TKC5_9RHIZ/184-421 │ - - - - - - - - - -
A0A1S1TKC5_9RHIZ/2-147 │ - - - - - - - - - -
C5ZW53_9HELI/1-498 │ - - K K L K Y L Q L
M5A3D2_9ACTN/2-157 │ - - - - - - - - - -
A0A1B8PI30_MORNO/6-183 │ - - - - - - - - - -
A0A157SWM6_9BORD/1-413 │ - L K K L - - - - -
A0A1S1T395_9RHIZ/6-168 │ - - - - - … - - - - -
The third (and optional) argument of read
and parse
is the output MSA type:
Matrix{Residue}
: It only contains the aligned sequences.MultipleSequenceAlignment
: It contains the aligned sequences and their
names/identifiers.
AnnotatedMultipleSequenceAlignment
: It's the richest MIToS' MSA format and it's the
default. It includes the aligned sequences, their names and the MSA annotations.
Example of Matrix{Residue}
output using a Stockholm
file as input:
read("http://pfam.xfam.org/family/PF07388/alignment/full", Stockholm, Matrix{Residue})
14×459 Array{MIToS.MSA.Residue,2}:
- - - - - - - - - - - - - … - - - - - - - - - - - -
- - - - - - - - - - - - - - - - - - - - - - - - -
- - - - - - - - - - - - - - - - - - - - - - - - -
- - S K F T K F I F N P K - - - - - - - - - - - -
- - - - - - - - - - - - - - - - - - - - - - - - -
- - - - - - - - - - - - - … - - - - - - - - - - - -
M L K K I K K A L F Q P K - - - - - - - - - - - -
- - - - - - - - - - - - - - - - - - - - - - - - -
- - - - - - - - - - - - - - - - - - - - - - - - -
- - K K L S G L M Q D I K D F Q K Y R I K Y L Q L
- - - - - - - - - - - - - … - - - - - - - - - - - -
- - - - - - - - - - - - - - - - - - - - - - - - -
- L K K L R K L I L H P V - - - - - - - - - - - -
- - - - - - - - - - - - - - - - - - - - - - - - -
Because read
calls parse
, you should look into the documentation of parse
to know the available keyword arguments. The optional keyword arguments of those functions are:
generatemapping
: Ifgeneratemapping
istrue
(default:false
), sequences and
columns mappings are generated and saved in the MSA annotations. The default is false
to not overwrite mappings by mistake when you read an annotated MSA file saved with MIToS.
useidcoordinates
: Ifuseidcoordinates
istrue
(default:false
) and the names
have the form seqname/start-end, MIToS uses this coordinates to generate sequence mappings. This is safe and useful with unmodified Pfam MSAs. Do not use it when reading an MSA saved with MIToS. MIToS deletes unaligned insert columns, therefore disrupts sequences that have them.
deletefullgaps
: Given that lowercase characters and dots are converted to gaps,
unaligned insert columns in the MSA (derived from a HMM profile) are converted into full gap columns. deletefullgaps
is true
by default, deleting full gaps columns and therefore insert columns.
If you want to keep the insert columns... Use the keyword argument keepinserts
to true
in read
/parse
. This only works with an AnnotatedMultipleSequenceAlignment
output. A column annotation ("Aligned"
) is stored in the annotations, where insert columns are marked with 0
and aligned columns with 1
.
When read
returns an AnnotatedMultipleSequenceAlignment
, it uses the MSA Annotations
to keep track of performed modifications. To access these notes, use printmodifications
:
msa = read("http://pfam.xfam.org/family/PF01565/alignment/full", Stockholm)
printmodifications(msa)
-------------------
2020-04-20T11:45:46.413
deletefullgaps! : Deletes 861 columns full of gaps (inserts generate full gap columns on MIToS because lowercase and dots are not allowed)
-------------------
2020-04-20T11:45:46.526
filtercolumns! : 861 columns have been deleted.
Writing MSA files
Julia REPL shows MSAs as Matrices. If you want to print them in another format, you should use the print
function with an MSA object as first argument and the FileFormat
FASTA
, Stockholm
, PIR
or Raw
as second argument.
using MIToS.MSA
msa = read("http://pfam.xfam.org/family/PF16996/alignment/full", Stockholm) # reads a Stockholm MSA file
print(msa, FASTA) # prints msa in FASTA format
>U5P5G4_9STRE/5-59
KKDLFYKEVEGRMNALKRRSAEKEKATRSEKVNLTLNVVIGLVILLGVLLTLFRV
>T0TMU0_9STRE/1-43
------------MEELKQKPITKEKETRGEKINKSFSIMLGLVIIIGLIFTLI--
>A0A139NPI6_9STRE/5-59
-EDLFYKEVEGRMADLQQKAPEKEKKTGAERLNTLFSLALGLVILLGLLFTLLR-
>A0A139NTT2_9STRE/3-57
KKDLFYKDVEQKLDSLKQGQPKKEKASLGEKLNKAFVIALGLVILIGLIFTLI--
>T0TY77_9STRE/5-59
KKDLFYKEVEGRMESLKRRPAEKEKTTRSEKINVTFNVIIGLVILLGVIFTLFRV
>A8AWV6_STRGC/3-57
KKDLFYKDIEGRLDELKHGRPKKEKASLGEKFNKAFVIALGLMILIGLIFTLIG-
>A0A139NMD7_9STRE/4-59
KDDIFYKDIEGRMDELKRKPPKKEKKTRAERISTFFSVSLGLVILIGLLFTLFRI
>A3CM62_STRSV/3-57
KKDLFYKDIEGRLDELKHGKPKKEKASLGENLNKAFVIVLGLMILIGLIFTLI--
To save an MSA object to a file, use the write
function. This function takes a filename as a first argument. If the filename ends with .gz
, the output will be a compressed (gzipped) file. The next two arguments of write
are passed to print
, so write
behaves as print
.
write("msa.gz", msa, FASTA) # writes msa in FASTA format in a gzipped file
MSA Annotations
MSA annotations are based on the Stockholm format mark-ups. There are four types of annotations stored as dictionaries. All the annotations have a feature name as part of the key, which should be a single "word" (without spaces) and less than 50 characters long.
- File annotations : The annotations can contain either file or MSA information. They
have feature names as keys and the values are strings (free text). Lines starting with #=GF
in Stockholm format.
- Column annotations : They have feature names as keys and strings with exactly 1 char
per column as values. Lines starting with #=GC
in Stockholm format.
- Sequence annotations : The keys are tuples with the sequence name and the feature
name. The values are free text (strings). Lines starting with #=GS
in Stockholm format. Annotations in the PIR
/NBRF format are also stored as sequence annotations. In particular, we use the names "Type"
and "Title"
to name the sequence type in the identifier line and the first comment line before the sequence in PIR files, respectively.
- Residue annotations : The keys are tuples with the sequence name and the feature
name. The values are strings with exactly 1 char per column/residues. #=GR
lines in Stockholm format.
Julia REPL shows the Annotations
type as they are represented in the Stockholm format. You can get the Annotations
inside an annotated MSA or sequence using the annotations
function.
annotations(msa)
#=GF ID Asp4
#=GF AC PF16996.5
#=GF DE Accessory secretory protein Sec Asp4
#=GF AU Coggill P;0000-0001-5731-1588
#=GF SE Pfam-B_7603 (release 27.0)
#=GF GA 25.70 25.70;
#=GF TC 55.00 54.80;
#=GF NC 21.60 20.60;
#=GF BM hmmbuild HMM.ann SEED.ann
#=GF SM hmmsearch -Z 45638612 -E 1000 --cpu 4 HMM pfamseq
#=GF TP Family
#=GF RN [1]
#=GF RM 23000954
#=GF RT Emerging themes in SecA2-mediated protein export.
#=GF RA Feltcher ME, Braunstein M;
#=GF RL Nat Rev Microbiol. 2012;10:779-789.
#=GF DR INTERPRO; IPR031551;
#=GF DR SO; 0100021; polypeptide_conserved_region;
#=GF CC Asp4 and Asp5 are putative accessory components of the SecY2
#=GF CC channel of the SecA2-SecY2 mediated export system, but they are
#=GF CC not present in all SecA2-SecY2 systems. This family of Asp4 is
#=GF CC found in Firmicutes [1].
#=GF SQ 8
#=GF MIToS_2020-04-20T11:45:46.998 deletefullgaps! : Deletes 3 columns full of gaps (inserts generate full gap columns on MIToS because lowercase and dots are not allowed)
#=GF MIToS_2020-04-20T11:45:46.998 filtercolumns! : 3 columns have been deleted.
#=GS A3CM62_STRSV/3-57 AC A3CM62.1
#=GS A0A139NTT2_9STRE/3-57 AC A0A139NTT2.1
#=GS A0A139NMD7_9STRE/4-59 AC A0A139NMD7.1
#=GS T0TMU0_9STRE/1-43 AC T0TMU0.1
#=GS U5P5G4_9STRE/5-59 AC U5P5G4.1
#=GS T0TY77_9STRE/5-59 AC T0TY77.1
#=GS A0A139NPI6_9STRE/5-59 AC A0A139NPI6.1
#=GS A8AWV6_STRGC/3-57 AC A8AWV6.1
#=GC seq_cons KKDLFYK-lEGRM--LK++sPcKEKsTtuEKlNpsFslsLGLVILIGLIFTLlt.
Particular annotations can be accessed using the functions getannot...
. These functions take the MSA/sequence as first argument and the feature name of the desired annotation as the last. In the case of getannotsequence
and getannotresidue
, the second argument should be the sequence name.
getannotsequence(msa, "A0A139NPI6_9STRE/5-59", "AC") # ("A0A139NPI6_9STRE/5-59", "AC") is the key in the dictionary
"A0A139NPI6.1"
If you want to add new annotations, you should use the setannot…!
functions. These functions have the same arguments that getannot...
functions except for an extra argument used to indicate the new annotation value.
setannotsequence!(msa, "A0A139NPI6_9STRE/5-59", "New_Feature_Name", "New_Annotation")
"New_Annotation"
A getannot...
function without the key (last arguments), returns the particular annotation dictionary. As you can see, the new sequence annotation is now part of our MSA annotations.
getannotsequence(msa)
Dict{Tuple{String,String},String} with 9 entries:
("A3CM62_STRSV/3-57", "AC") => "A3CM62.1"
("A0A139NTT2_9STRE/3-57", "AC") => "A0A139NTT2.1"
("A0A139NMD7_9STRE/4-59", "AC") => "A0A139NMD7.1"
("T0TMU0_9STRE/1-43", "AC") => "T0TMU0.1"
("U5P5G4_9STRE/5-59", "AC") => "U5P5G4.1"
("T0TY77_9STRE/5-59", "AC") => "T0TY77.1"
("A0A139NPI6_9STRE/5-59", "AC") => "A0A139NPI6.1"
("A0A139NPI6_9STRE/5-59", "New_Feature_Name") => "New_Annotation"
("A8AWV6_STRGC/3-57", "AC") => "A8AWV6.1"
Editing your MSA
MIToS offers functions to edit your MSA. Because these functions modify the msa, their names end with a bang !
, following the Julia convention. Some of these functions have an annotate
keyword argument (in general, it's true
by default) to indicate if the modification should be recorded in the MSA/sequence annotations.
One common task is to delete sequences or columns of the MSA. This could be done using the functions filtersequences!
and filtercolumns!
. These functions take the MSA or sequence (if it's possible) as first argument and a BitVector
or Vector{Bool}
mask as second argument. It deletes all the sequences or columns where the mask is false
. These functions are also defined for Annotations
, this allows to automatically update (modify) the annotations (and therefore, sequence and column mappings) in the MSA.
This two deleting operations are used in the second and third mutating functions of the following list:
setreference!
: Sets one of the sequences as the first sequence of the MSA (query or
reference sequence).
adjustreference!
: Deletes columns with gaps in the first sequence of the MSA
(reference).
gapstrip!
: This function first callsadjustreference!
, then deletes sequences with
low (user defined) MSA coverage and finally, columns with user defined % of gaps.
Also, there are several available funtions shuffle_…!
. These functions are useful to generate random alignments. The Information
module of MIToS
uses them to calculate the Z scores of MI values.
Example: Deleting sequences
For example, if you want to keep only the proteins from Actinobacteria you can delete all the sequences that don't have _9ACTN
in their UniProt entry names:
using MIToS.MSA
msa = read("http://pfam.xfam.org/family/PF07388/alignment/full", Stockholm)
sequencenames(msa) # the function sequencenames returns the sequence names in the MSA
14-element Array{String,1}:
"A0A1S1T395_9RHIZ/224-427"
"M5A3D2_9ACTN/208-402"
"A0A1B8PI30_MORNO/200-434"
"A0A1W1V8Q8_9PAST/1-405"
"A0A0B0IGB5_9BACI/201-416"
"A0A0B0IGB5_9BACI/1-170"
"Q4W584_NEIMB/1-222"
"A0A1S1TKC5_9RHIZ/184-421"
"A0A1S1TKC5_9RHIZ/2-147"
"C5ZW53_9HELI/1-498"
"M5A3D2_9ACTN/2-157"
"A0A1B8PI30_MORNO/6-183"
"A0A157SWM6_9BORD/1-413"
"A0A1S1T395_9RHIZ/6-168"
mask = map(x -> occursin(r"_9ACTN", x), sequencenames(msa)) # an element of mask is true if "_9ACTN" is in the name
14-element Array{Bool,1}:
0
1
0
0
0
0
0
0
0
0
1
0
0
0
filtersequences!(msa, mask) # deletes all the sequences where mask is false
sequencenames(msa)
2-element Array{String,1}:
"M5A3D2_9ACTN/208-402"
"M5A3D2_9ACTN/2-157"
Example: Exporting a MSA for freecontact (part I)
The most simple input for the command line tool freecontact (if you don't want to set --mincontsep
) is a Raw
MSA file with a reference sequence without insertions or gaps. This is easy to get with MIToS using read
(deletes the insert columns), setreference!
(to choose a reference), adjustreference!
(to delete columns with gaps in the reference) and write
(to save it in Raw
format) functions.
julia> using MIToS.MSA
julia> msa = read("http://pfam.xfam.org/family/PF02476/alignment/full", Stockholm)
AnnotatedMultipleSequenceAlignment with 41 annotations : 20×126 Named Array{MIToS.MSA.Residue,2}
Seq ╲ Col │ 9 10 11 12 13 … 205 206 207 208 209
─────────────────────────┼────────────────────────────────────────────────────
US02_GAHVM/120-237 │ M L E S E … F C C - -
A0A1R3T8S2_9ALPH/119-225 │ - - - - - - - - - -
B1A4T1_9ALPH/119-238 │ H L S S G A C C I -
A0A288CG55_EHV4/111-257 │ Y L N S S F C Q E -
E2IUH4_SHV1/110-233 │ - - - - - - - - - -
Q8V727_CHV1/110-256 │ - L C E P P C V A C
A0A060Q503_9ALPH/110-250 │ - L H G P S C P A -
Q77L60_MEHV1/120-238 │ M L E S E L C C C -
US02_HHV11/110-247 │ - L H R D P C C A C
⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱ ⋮ ⋮ ⋮ ⋮ ⋮
A0A0Y0DAF2_9ALPH/108-249 │ - - - - - - - - - -
A0A068EPE8_9ALPH/120-239 │ Y L E S G L C C - -
US02_HHV2H/110-246 │ L L H Q E P C F T C
J9R0A8_9ALPH/178-313 │ - - - - - - - - - -
A0A0A7D914_9ALPH/111-268 │ H L N S S T A - - -
Q67639_ILTV/100-201 │ H L S S G - - - - -
Q6R5Q3_9ALPH/120-237 │ M L E S E F C C - -
Q5PP72_9ALPH/135-256 │ Y L N S G R C T I -
Q5EGY0_CHV16/111-255 │ - L C D P … P C T A C
julia> msa_coverage = coverage(msa)
20×1 Named Array{Float64,2}
Seq ╲ Function │ coverage
─────────────────────────┼─────────
US02_GAHVM/120-237 │ 0.904762
A0A1R3T8S2_9ALPH/119-225 │ 0.65873
B1A4T1_9ALPH/119-238 │ 0.912698
A0A288CG55_EHV4/111-257 │ 0.960317
E2IUH4_SHV1/110-233 │ 0.801587
Q8V727_CHV1/110-256 │ 0.992063
A0A060Q503_9ALPH/110-250 │ 0.984127
Q77L60_MEHV1/120-238 │ 0.912698
US02_HHV11/110-247 │ 0.97619
⋮ ⋮
A0A0Y0DAF2_9ALPH/108-249 │ 0.65873
A0A068EPE8_9ALPH/120-239 │ 0.928571
US02_HHV2H/110-246 │ 0.984127
J9R0A8_9ALPH/178-313 │ 0.761905
A0A0A7D914_9ALPH/111-268 │ 0.960317
Q67639_ILTV/100-201 │ 0.325397
Q6R5Q3_9ALPH/120-237 │ 0.904762
Q5PP72_9ALPH/135-256 │ 0.880952
Q5EGY0_CHV16/111-255 │ 0.992063
julia> maxcoverage, maxindex = findmax(msa_coverage) # chooses the sequence with more coverage of the MSA
(0.9920634920634921, CartesianIndex(6, 1))
julia> setreference!(msa, maxindex[1])
AnnotatedMultipleSequenceAlignment with 42 annotations : 20×126 Named Array{MIToS.MSA.Residue,2}
Seq ╲ Col │ 9 10 11 12 13 … 205 206 207 208 209
─────────────────────────┼────────────────────────────────────────────────────
Q8V727_CHV1/110-256 │ - L C E P … P C V A C
A0A1R3T8S2_9ALPH/119-225 │ - - - - - - - - - -
B1A4T1_9ALPH/119-238 │ H L S S G A C C I -
A0A288CG55_EHV4/111-257 │ Y L N S S F C Q E -
E2IUH4_SHV1/110-233 │ - - - - - - - - - -
US02_GAHVM/120-237 │ M L E S E F C C - -
A0A060Q503_9ALPH/110-250 │ - L H G P S C P A -
Q77L60_MEHV1/120-238 │ M L E S E L C C C -
US02_HHV11/110-247 │ - L H R D P C C A C
⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱ ⋮ ⋮ ⋮ ⋮ ⋮
A0A0Y0DAF2_9ALPH/108-249 │ - - - - - - - - - -
A0A068EPE8_9ALPH/120-239 │ Y L E S G L C C - -
US02_HHV2H/110-246 │ L L H Q E P C F T C
J9R0A8_9ALPH/178-313 │ - - - - - - - - - -
A0A0A7D914_9ALPH/111-268 │ H L N S S T A - - -
Q67639_ILTV/100-201 │ H L S S G - - - - -
Q6R5Q3_9ALPH/120-237 │ M L E S E F C C - -
Q5PP72_9ALPH/135-256 │ Y L N S G R C T I -
Q5EGY0_CHV16/111-255 │ - L C D P … P C T A C
julia> adjustreference!(msa)
AnnotatedMultipleSequenceAlignment with 43 annotations : 20×125 Named Array{MIToS.MSA.Residue,2}
Seq ╲ Col │ 10 11 12 13 14 … 205 206 207 208 209
─────────────────────────┼────────────────────────────────────────────────────
Q8V727_CHV1/110-256 │ L C E P R … P C V A C
A0A1R3T8S2_9ALPH/119-225 │ - - - - - - - - - -
B1A4T1_9ALPH/119-238 │ L S S G I A C C I -
A0A288CG55_EHV4/111-257 │ L N S S I F C Q E -
E2IUH4_SHV1/110-233 │ - - - - - - - - - -
US02_GAHVM/120-237 │ L E S E V F C C - -
A0A060Q503_9ALPH/110-250 │ L H G P R S C P A -
Q77L60_MEHV1/120-238 │ L E S E A L C C C -
US02_HHV11/110-247 │ L H R D Q P C C A C
⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱ ⋮ ⋮ ⋮ ⋮ ⋮
A0A0Y0DAF2_9ALPH/108-249 │ - - - - - - - - - -
A0A068EPE8_9ALPH/120-239 │ L E S G I L C C - -
US02_HHV2H/110-246 │ L H Q E R P C F T C
J9R0A8_9ALPH/178-313 │ - - - - - - - - - -
A0A0A7D914_9ALPH/111-268 │ L N S S L T A - - -
Q67639_ILTV/100-201 │ L S S G A - - - - -
Q6R5Q3_9ALPH/120-237 │ L E S E V F C C - -
Q5PP72_9ALPH/135-256 │ L N S G A R C T I -
Q5EGY0_CHV16/111-255 │ L C D P R … P C T A C
julia> write("tofreecontact.msa", msa, Raw)
julia> print(read("tofreecontact.msa", String)) # It displays the contents of the output file
LCEPRPRPRLYHLWVVGAADLCVPFLEHLRRARRLVTIRVTDAWAGGSWPLPDGLAVGETVPWTPFPTAHNHPLAALFGGYEYRYGVVRPGWLRETVPRLWPGSESAAAEAAKTPHPARRPCVAC
------QTDRFRLWIIGAADICQQALSSFPKNMRRITVKVNGEWPGNPCFLDTNLEPVLISSWKPVSWPEK-ENNIDTSKLLCRYELLIP-----------------------------------
LSSGILGNHTFDMWIIGAADICKPAIEQLPNSKRFITIKVPGTWSGLTWEKPDGLSPLTVTEWDPCDDETMSKIEKKLVGIKCCYDLIGQ----------PAAAKHNSELDDSKDCCNGPACCI-
LNSSIIINRPYHLWVLGAADLCRPVFNLIPGPKRIVYVEIEDEF-NKSW-QPSETIPLTTVDARSLLQHKSSFVSPPPNFKRLIYAVVDP--MRLQENLCPQITNRTKTKRRSKKTYNGLFCQE-
------DQRLFHVWIVGAADLCEPLLACLSTGARCVVIEIGGLWGGRSWLPPAYARSGASTSWTPLPPPTGSAAESALEGCEYRYAIIPG----VEARRRPASRDENDERS--------------
LESEVSGNAPHSLWIVGAADICRIALECIPLPKRLLAIKVSGTWSGMPWAIPDNIQTLLTSTWEPKFDTPEDRAHFCDSDMVCVYKILGS----------PPNPLKPPEIEPPQMTPGRLFCC--
LHGPRDRPRLYHMWVVGAADLCAPFLESLWRMRRLITIKIPDGWVGASWHLPEPFLPATSVAWTPFPAPPNHPLESLLQNYEYRYGILSPRWLRNLVNKWKNMGGRPTLPPATSEHITQNSCPA-
LESEASDNVSYSLWILGAADICRTAIESIPLPKRLFAIKVPGTWAGMPWAIPCEIQTLLTSTWEPKFENIEDKAYFNDSNMACVYQIIGS----------PPDVPQLQGLGIESTPPKRNLCCC-
LHRDQPSPRLYHLWVVGAADLCVPFLEYAQKIRRFIAIKTPDAWVGEPWAVPTRFLPEWTVAWTPFPAAPNHPLETLLSRYEYQYGVVLPRWLRSLIALHKPHPATPGPLTTSHP--VRRPCCAC
LNSGIFGNYPLNLWVFGAADLCEPVISNIPGPKRLIYAYVSCEWPEPSW-KPENVKHIDDSGWVPVLDTFLSSFKVPPMLSDIFYGTTFNTPFNFESSPRCPSISSSSSSSSSSS----SSC---
LNSSVMGNRPYHLWVFGAADICRPVFDLLPGPKRLVYAELSEEWPGPSW-QPPP-----AVPWTAPERGTPADLEVPPRVARMVYAVVNRSWLRSRPLRHVPGRATPATPSVSTP-SGRGRC---
--------KLYHMWVIGAADICDPFLSCLQQKARFITIKMDFIWEGAAWLPPPRMCPKKTVPWVPCPISQDPALEQLLSNCSYEYGVVGPR----------------------------------
LESGIMSGGPFSLWIIGAADLCKEAIEHIPTPKRMFALQVKGTWSGMPWAIPEAIPVLLESSWSPQFTHPLDQRGFLESGMRGVYKVIGR-------ASDPPRNNSEPETRKPRGRLLKYLCC--
LHQERPGPRLYHLWVVGAADLCVPFFEYAQKTRRFIATKTNDAWVGEPWPLPDRFLPERTVSWTPFPAAPNHPLENLLSRYEYQYGVVVPRWLRSLVAPHKPRPASSRPHPATHP--TQRPCFTC
--------ELFAVWVVGGADICAPAMEAMMALGRVVAIAGAHSLTGSSWPISARLLPAVWHSWEPYPAAPTNPLKNTFAACTFVAGTVNARWLGALTSHTRPQV---------------------
LNSSLIINQPYHLWVLGAADLCKPVFDLIPGPKRMVYAEIADEF-HKSW-QPPETIPWTTVEHNGFSKHSSNSLVHPPTVKRVIYAVVDPAPGRPRQRRRPSEGGMRAPRRRSRAAPARSTA---
LSSGAYAAKEFHLWVLGTADICMAALN-LPAPKTFLITETG------------------------------------------------------------------------------------
LESEVSGNAPHSLWIVGAADICRIALECIPLPKRLLAIKVSGTWSGMPWAIPDNIQTLLTSTWEPKFDTPEDRAHFCDSDMVCVYKILGS----------PPNPLKPPEIEPPQMTPGRLFCC--
LNSGARGTAPIHLWILGAADLCDQVLLAAS---RSTAAGAPGAPTGARLT---RRRPGLTDA------DALDVIVAGIPATRAMFARVHNEALHAQIVTRGDVRRRRGGRGNGRE--RAPRCTI-
LCDPRPRPRLYHLWVVGAADLCAPFLEHLRDGRRLVAVRVPDAWAGGSWPLPEGLALGETVPWTPFPTAHNHPLAALFGTYEYRYGVLRPGWLREAASRLWAGPPATATEAAQKRHPARRPCTAC
Column and sequence mappings
Inserts in a Stockholm MSA allow to access the full fragment of the aligned sequences. Using this, combined with the sequence names that contain coordinates used in Pfam, you can know what is the UniProt residue number of each residue in the MSA.
"PROT_SPECI/3-15 .....insertALIGNED"
# 3456789111111
# 012345
MIToS read
and parse
functions delete the insert columns, but they do the mapping between each residue and its residue number before deleting insert columns when generatemapping
is true
. If you don't set useidcoordinates
to true
, the residue first i
residue will be 1 instead of 3 in the previous example.
using MIToS.MSA
msa = parse("PROT_SPECI/3-15 .....insertALIGNED", Stockholm, generatemapping=true, useidcoordinates=true)
AnnotatedMultipleSequenceAlignment with 4 annotations : 1×7 Named Array{MIToS.MSA.Residue,2}
Seq ╲ Col │ 12 13 14 15 16 17 18
────────────────┼───────────────────────────
PROT_SPECI/3-15 │ A L I G N E D
MIToS also keeps the column number of the input MSA and its total number of columns. All this data is stored in the MSA annotations using the SeqMap
, ColMap
and NCol
feature names.
annotations(msa)
#=GF NCol 18
#=GF ColMap 12,13,14,15,16,17,18
#=GF MIToS_2020-04-20T11:45:52.202 deletefullgaps! : Deletes 11 columns full of gaps (inserts generate full gap columns on MIToS because lowercase and dots are not allowed)
#=GF MIToS_2020-04-20T11:45:52.202 filtercolumns! : 11 columns have been deleted.
#=GS PROT_SPECI/3-15 SeqMap 9,10,11,12,13,14,15
To have an easy access to mapping data, MIToS provides the getsequencemapping
and getcolumnmapping
functions.
getsequencemapping(msa, "PROT_SPECI/3-15")
7-element Array{Int64,1}:
9
10
11
12
13
14
15
getcolumnmapping(msa)
7-element Array{Int64,1}:
12
13
14
15
16
17
18
Example: Exporting a MSA for freecontact (part II)
If we want to use the --mincontsep
argument of freecontact
to calculate scores between distant residues, we will need to add a header to the MSA. This header should contains the residue number of the first residue of the sequence and the full fragment of that sequence (with the inserts). This data is used by FreeContact to calculate the residue number of each residue in the reference sequence. We are going to use MIToS mapping data to create this header, so we read the MSA with generatemapping
and useidcoordinates
set to true
.
using MIToS.MSA
msa = read( "http://pfam.xfam.org/family/PF02476/alignment/full", Stockholm,
generatemapping=true, useidcoordinates=true)
AnnotatedMultipleSequenceAlignment with 64 annotations : 20×126 Named Array{MIToS.MSA.Residue,2}
Seq ╲ Col │ 9 10 11 12 13 … 205 206 207 208 209
─────────────────────────┼────────────────────────────────────────────────────
US02_GAHVM/120-237 │ M L E S E … F C C - -
A0A1R3T8S2_9ALPH/119-225 │ - - - - - - - - - -
B1A4T1_9ALPH/119-238 │ H L S S G A C C I -
A0A288CG55_EHV4/111-257 │ Y L N S S F C Q E -
E2IUH4_SHV1/110-233 │ - - - - - - - - - -
Q8V727_CHV1/110-256 │ - L C E P P C V A C
A0A060Q503_9ALPH/110-250 │ - L H G P S C P A -
Q77L60_MEHV1/120-238 │ M L E S E L C C C -
US02_HHV11/110-247 │ - L H R D P C C A C
⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱ ⋮ ⋮ ⋮ ⋮ ⋮
A0A0Y0DAF2_9ALPH/108-249 │ - - - - - - - - - -
A0A068EPE8_9ALPH/120-239 │ Y L E S G L C C - -
US02_HHV2H/110-246 │ L L H Q E P C F T C
J9R0A8_9ALPH/178-313 │ - - - - - - - - - -
A0A0A7D914_9ALPH/111-268 │ H L N S S T A - - -
Q67639_ILTV/100-201 │ H L S S G - - - - -
Q6R5Q3_9ALPH/120-237 │ M L E S E F C C - -
Q5PP72_9ALPH/135-256 │ Y L N S G R C T I -
Q5EGY0_CHV16/111-255 │ - L C D P … P C T A C
Here, we are going to choose the sequence with more coverage of the MSA as our reference sequence.
msa_coverage = coverage(msa)
maxcoverage, maxindex = findmax(msa_coverage)
setreference!(msa, maxindex[1])
adjustreference!(msa)
AnnotatedMultipleSequenceAlignment with 65 annotations : 20×125 Named Array{MIToS.MSA.Residue,2}
Seq ╲ Col │ 10 11 12 13 14 … 205 206 207 208 209
─────────────────────────┼────────────────────────────────────────────────────
Q8V727_CHV1/110-256 │ L C E P R … P C V A C
A0A1R3T8S2_9ALPH/119-225 │ - - - - - - - - - -
B1A4T1_9ALPH/119-238 │ L S S G I A C C I -
A0A288CG55_EHV4/111-257 │ L N S S I F C Q E -
E2IUH4_SHV1/110-233 │ - - - - - - - - - -
US02_GAHVM/120-237 │ L E S E V F C C - -
A0A060Q503_9ALPH/110-250 │ L H G P R S C P A -
Q77L60_MEHV1/120-238 │ L E S E A L C C C -
US02_HHV11/110-247 │ L H R D Q P C C A C
⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱ ⋮ ⋮ ⋮ ⋮ ⋮
A0A0Y0DAF2_9ALPH/108-249 │ - - - - - - - - - -
A0A068EPE8_9ALPH/120-239 │ L E S G I L C C - -
US02_HHV2H/110-246 │ L H Q E R P C F T C
J9R0A8_9ALPH/178-313 │ - - - - - - - - - -
A0A0A7D914_9ALPH/111-268 │ L N S S L T A - - -
Q67639_ILTV/100-201 │ L S S G A - - - - -
Q6R5Q3_9ALPH/120-237 │ L E S E V F C C - -
Q5PP72_9ALPH/135-256 │ L N S G A R C T I -
Q5EGY0_CHV16/111-255 │ L C D P R … P C T A C
MIToS deletes the residues in insert columns, so we are going to use the sequence mapping to generate the whole fragment of the reference sequence (filling the missing regions with 'x'
).
seqmap = getsequencemapping(msa, 1) # seqmap will be a vector with the residue numbers of the first sequence (reference)
seq = collect( stringsequence(msa, 1) ) # seq will be a Vector of Chars with the reference sequence
sequence = map(seqmap[1]:seqmap[end]) do seqpos # for each position in the whole fragment
if seqpos in seqmap # if that position is in the MSA
popfirst!(seq) # the residue is taken from seq
else # otherwise
'x' # 'x' is included
end
end
sequence = join(sequence) # join the Chars on the Vector to create a string
"LCEPRPRPRLYHLWVVGAADLCVPFLEHLRRARxxxRLVTIRVTDAWAxGGSWPLPDGLAVGETVPWTPFPTAHNHPLAALFGGYEYRYGVVRPxxxxxxxxxxxxxxGWLRETVPRLWPGSESAAAEAAKTPxxxHPARRPCVAC"
Once we have the whole fragment of the sequence, we create the file and write the header in the required format (as in the man page of freecontact).
open("tofreecontact.msa", "w") do fh
println(fh, "# querystart=", seqmap[1])
println(fh, "# query=", sequence )
end
As last (optional) argument, write
takes the mode in which is opened the file. We use "a"
here to append the MSA to the header.
write("tofreecontact.msa", msa, Raw, "a")
print(read("tofreecontact.msa", String)) # It displays the contents of the output file
# querystart=111
# query=LCEPRPRPRLYHLWVVGAADLCVPFLEHLRRARxxxRLVTIRVTDAWAxGGSWPLPDGLAVGETVPWTPFPTAHNHPLAALFGGYEYRYGVVRPxxxxxxxxxxxxxxGWLRETVPRLWPGSESAAAEAAKTPxxxHPARRPCVAC
LCEPRPRPRLYHLWVVGAADLCVPFLEHLRRARRLVTIRVTDAWAGGSWPLPDGLAVGETVPWTPFPTAHNHPLAALFGGYEYRYGVVRPGWLRETVPRLWPGSESAAAEAAKTPHPARRPCVAC
------QTDRFRLWIIGAADICQQALSSFPKNMRRITVKVNGEWPGNPCFLDTNLEPVLISSWKPVSWPEK-ENNIDTSKLLCRYELLIP-----------------------------------
LSSGILGNHTFDMWIIGAADICKPAIEQLPNSKRFITIKVPGTWSGLTWEKPDGLSPLTVTEWDPCDDETMSKIEKKLVGIKCCYDLIGQ----------PAAAKHNSELDDSKDCCNGPACCI-
LNSSIIINRPYHLWVLGAADLCRPVFNLIPGPKRIVYVEIEDEF-NKSW-QPSETIPLTTVDARSLLQHKSSFVSPPPNFKRLIYAVVDP--MRLQENLCPQITNRTKTKRRSKKTYNGLFCQE-
------DQRLFHVWIVGAADLCEPLLACLSTGARCVVIEIGGLWGGRSWLPPAYARSGASTSWTPLPPPTGSAAESALEGCEYRYAIIPG----VEARRRPASRDENDERS--------------
LESEVSGNAPHSLWIVGAADICRIALECIPLPKRLLAIKVSGTWSGMPWAIPDNIQTLLTSTWEPKFDTPEDRAHFCDSDMVCVYKILGS----------PPNPLKPPEIEPPQMTPGRLFCC--
LHGPRDRPRLYHMWVVGAADLCAPFLESLWRMRRLITIKIPDGWVGASWHLPEPFLPATSVAWTPFPAPPNHPLESLLQNYEYRYGILSPRWLRNLVNKWKNMGGRPTLPPATSEHITQNSCPA-
LESEASDNVSYSLWILGAADICRTAIESIPLPKRLFAIKVPGTWAGMPWAIPCEIQTLLTSTWEPKFENIEDKAYFNDSNMACVYQIIGS----------PPDVPQLQGLGIESTPPKRNLCCC-
LHRDQPSPRLYHLWVVGAADLCVPFLEYAQKIRRFIAIKTPDAWVGEPWAVPTRFLPEWTVAWTPFPAAPNHPLETLLSRYEYQYGVVLPRWLRSLIALHKPHPATPGPLTTSHP--VRRPCCAC
LNSGIFGNYPLNLWVFGAADLCEPVISNIPGPKRLIYAYVSCEWPEPSW-KPENVKHIDDSGWVPVLDTFLSSFKVPPMLSDIFYGTTFNTPFNFESSPRCPSISSSSSSSSSSS----SSC---
LNSSVMGNRPYHLWVFGAADICRPVFDLLPGPKRLVYAELSEEWPGPSW-QPPP-----AVPWTAPERGTPADLEVPPRVARMVYAVVNRSWLRSRPLRHVPGRATPATPSVSTP-SGRGRC---
--------KLYHMWVIGAADICDPFLSCLQQKARFITIKMDFIWEGAAWLPPPRMCPKKTVPWVPCPISQDPALEQLLSNCSYEYGVVGPR----------------------------------
LESGIMSGGPFSLWIIGAADLCKEAIEHIPTPKRMFALQVKGTWSGMPWAIPEAIPVLLESSWSPQFTHPLDQRGFLESGMRGVYKVIGR-------ASDPPRNNSEPETRKPRGRLLKYLCC--
LHQERPGPRLYHLWVVGAADLCVPFFEYAQKTRRFIATKTNDAWVGEPWPLPDRFLPERTVSWTPFPAAPNHPLENLLSRYEYQYGVVVPRWLRSLVAPHKPRPASSRPHPATHP--TQRPCFTC
--------ELFAVWVVGGADICAPAMEAMMALGRVVAIAGAHSLTGSSWPISARLLPAVWHSWEPYPAAPTNPLKNTFAACTFVAGTVNARWLGALTSHTRPQV---------------------
LNSSLIINQPYHLWVLGAADLCKPVFDLIPGPKRMVYAEIADEF-HKSW-QPPETIPWTTVEHNGFSKHSSNSLVHPPTVKRVIYAVVDPAPGRPRQRRRPSEGGMRAPRRRSRAAPARSTA---
LSSGAYAAKEFHLWVLGTADICMAALN-LPAPKTFLITETG------------------------------------------------------------------------------------
LESEVSGNAPHSLWIVGAADICRIALECIPLPKRLLAIKVSGTWSGMPWAIPDNIQTLLTSTWEPKFDTPEDRAHFCDSDMVCVYKILGS----------PPNPLKPPEIEPPQMTPGRLFCC--
LNSGARGTAPIHLWILGAADLCDQVLLAAS---RSTAAGAPGAPTGARLT---RRRPGLTDA------DALDVIVAGIPATRAMFARVHNEALHAQIVTRGDVRRRRGGRGNGRE--RAPRCTI-
LCDPRPRPRLYHLWVVGAADLCAPFLEHLRDGRRLVAVRVPDAWAGGSWPLPEGLALGETVPWTPFPTAHNHPLAALFGTYEYRYGVLRPGWLREAASRLWAGPPATATEAAQKRHPARRPCTAC
Get sequences from a MSA
It's possible to index the MSA as any other matrix to get an aligned sequence. This will be return a Array
of Residue
s without annotations but keeping names/identifiers.
using MIToS.MSA
msa = read( "http://pfam.xfam.org/family/PF16996/alignment/full", Stockholm,
generatemapping=true, useidcoordinates=true)
AnnotatedMultipleSequenceAlignment with 39 annotations : 8×55 Named Array{MIToS.MSA.Residue,2}
Seq ╲ Col │ 2 3 4 5 6 7 … 51 52 53 54 55 56
──────────────────────┼──────────────────────────────────────────────────
U5P5G4_9STRE/5-59 │ K K D L F Y … L T L F R V
T0TMU0_9STRE/1-43 │ - - - - - - F T L I - -
A0A139NPI6_9STRE/5-59 │ - E D L F Y F T L L R -
A0A139NTT2_9STRE/3-57 │ K K D L F Y F T L I - -
T0TY77_9STRE/5-59 │ K K D L F Y F T L F R V
A8AWV6_STRGC/3-57 │ K K D L F Y F T L I G -
A0A139NMD7_9STRE/4-59 │ K D D I F Y F T L F R I
A3CM62_STRSV/3-57 │ K K D L F Y … F T L I - -
msa[2,:] # second sequence of the MSA, it keeps column names
55-element Named Array{MIToS.MSA.Residue,1}
Col │
─────┼──
2 │ -
3 │ -
4 │ -
5 │ -
6 │ -
7 │ -
8 │ -
9 │ -
10 │ -
⋮ ⋮
48 │ G
49 │ L
50 │ I
51 │ F
52 │ T
53 │ L
54 │ I
55 │ -
56 │ -
msa[2:2,:] # Using the range 2:2 to select the second sequence, keeping also the sequence name
1×55 Named Array{MIToS.MSA.Residue,2}
Seq ╲ Col │ 2 3 4 5 6 7 8 … 50 51 52 53 54 55 56
──────────────────┼──────────────────────────────────────────────────────────
T0TMU0_9STRE/1-43 │ - - - - - - - … I F T L I - -
If you want to obtain the aligned sequence with its name and annotations (and therefore sequence and column mappings), you should use the function getsequence
. This function returns an AlignedSequence
with the sequence name from a MultipleSequenceAlignment
or an AnnotatedAlignedSequence
, that also contains annotations, from an AnnotatedMultipleSequenceAlignment
.
secondsequence = getsequence(msa, 2)
AnnotatedAlignedSequence with 25 annotations : 1×55 Named Array{MIToS.MSA.Residue,2}
Seq ╲ Col │ 2 3 4 5 6 7 8 … 50 51 52 53 54 55 56
──────────────────┼──────────────────────────────────────────────────────────
T0TMU0_9STRE/1-43 │ - - - - - - - … I F T L I - -
annotations(secondsequence)
#=GF ID Asp4
#=GF AC PF16996.5
#=GF DE Accessory secretory protein Sec Asp4
#=GF AU Coggill P;0000-0001-5731-1588
#=GF SE Pfam-B_7603 (release 27.0)
#=GF GA 25.70 25.70;
#=GF TC 55.00 54.80;
#=GF NC 21.60 20.60;
#=GF BM hmmbuild HMM.ann SEED.ann
#=GF SM hmmsearch -Z 45638612 -E 1000 --cpu 4 HMM pfamseq
#=GF TP Family
#=GF RN [1]
#=GF RM 23000954
#=GF RT Emerging themes in SecA2-mediated protein export.
#=GF RA Feltcher ME, Braunstein M;
#=GF RL Nat Rev Microbiol. 2012;10:779-789.
#=GF DR INTERPRO; IPR031551;
#=GF DR SO; 0100021; polypeptide_conserved_region;
#=GF CC Asp4 and Asp5 are putative accessory components of the SecY2
#=GF CC channel of the SecA2-SecY2 mediated export system, but they are
#=GF CC not present in all SecA2-SecY2 systems. This family of Asp4 is
#=GF CC found in Firmicutes [1].
#=GF SQ 8
#=GF NCol 58
#=GF ColMap 2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56
#=GF MIToS_2020-04-20T11:45:52.998 deletefullgaps! : Deletes 3 columns full of gaps (inserts generate full gap columns on MIToS because lowercase and dots are not allowed)
#=GF MIToS_2020-04-20T11:45:52.998 filtercolumns! : 3 columns have been deleted.
#=GS T0TMU0_9STRE/1-43 SeqMap ,,,,,,,,,,,,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,,
#=GS T0TMU0_9STRE/1-43 AC T0TMU0.1
#=GC seq_cons KKDLFYK-lEGRM--LK++sPcKEKsTtuEKlNpsFslsLGLVILIGLIFTLlt.
Use stringsequence
if you want to get the sequence as a string.
stringsequence(msa, 2)
"------------MEELKQKPITKEKETRGEKINKSFSIMLGLVIIIGLIFTLI--"
Because matrices are stored columnwise in Julia, you will find useful the getresiduesequences
function when you need to heavily operate over sequences.
getresiduesequences(msa)
8-element Array{Array{MIToS.MSA.Residue,1},1}:
[K, K, D, L, F, Y, K, E, V, E … L, G, V, L, L, T, L, F, R, V]
[-, -, -, -, -, -, -, -, -, - … I, G, L, I, F, T, L, I, -, -]
[-, E, D, L, F, Y, K, E, V, E … L, G, L, L, F, T, L, L, R, -]
[K, K, D, L, F, Y, K, D, V, E … I, G, L, I, F, T, L, I, -, -]
[K, K, D, L, F, Y, K, E, V, E … L, G, V, I, F, T, L, F, R, V]
[K, K, D, L, F, Y, K, D, I, E … I, G, L, I, F, T, L, I, G, -]
[K, D, D, I, F, Y, K, D, I, E … I, G, L, L, F, T, L, F, R, I]
[K, K, D, L, F, Y, K, D, I, E … I, G, L, I, F, T, L, I, -, -]
Describing your MSA
The MSA module has a number of functions to gain insight about your MSA. Using MIToS.MSA
, one can easily ask for...
- The number of columns and sequences with the
ncolumns
andnsequences
functions. - The fraction of columns with residues (coverage) for each sequence making use of the
coverage
method.
- The fraction or percentage of gaps/residues using with the functions
gapfraction
,
residuefraction
and columngapfraction
.
- The percentage of identity (PID) between each sequence of the MSA or its mean value
with percentidentity
and meanpercentidentity
.
The percentage identity between two aligned sequences is a common measure of sequence similarity and is used by the hobohmI
method to estimate and reduce MSA redundancy. MIToS functions to calculate percent identity don't align the sequences, they need already aligned sequences. Full gaps columns don't count to the alignment length.
using MIToS.MSA
msa = permutedims(
hcat( res"--GGG-", # res"..." uses the @res_str macro to create a (column) Vector{Residue}
res"---GGG" ), (2,1))
# identities 000110 sum 2
# aligned residues 001111 sum 4
percentidentity(msa[1,:], msa[2,:]) # 2 / 4
50.0
To quickly calculate if the percentage of identity is greater than a determined value, use that threshold as third argument. percentidentity(seqa, seqb, pid)
is a lot more faster than percentidentity(seqa, seqb) >= pid
.
percentidentity(msa[1,:], msa[2,:], 62) # 50% >= 62%
false
Example: Plotting gap percentage per column and coverage per sequence
The gapfraction
and coverage
functions return a vector of numbers between 0.0
and 1.0
(fraction of...). Sometime it's useful to plot this data to quickly understand the MSA structure. In this example, we are going to use the Plots package for plotting, with the GR backend, but you are free to use any of the Julia plotting libraries.
using MIToS.MSA
msa = read("http://pfam.xfam.org/family/PF09776/alignment/full", Stockholm)
using Plots
gr(size=(600,300))
plot( 1:ncolumns(msa), # x is a range from 1 to the number of columns
vec(columngapfraction(msa)) .* 100.0, # y is a Vector{Float64} with the percentage of gaps of each column
linetype = :line,
ylabel = "gaps [%]",
xlabel = "columns",
legend=false)
/juliateam/.julia/packages/GR/yMV3y/src/../deps/gr/bin/gksqt: error while loading shared libraries: libQt5Widgets.so.5: cannot open shared object file: No such file or directory
connect: Connection refused
GKS: can't connect to GKS socket application
GKS: Open failed in routine OPEN_WS
GKS: GKS not in proper state. GKS must be either in the state WSOP or WSAC in routine ACTIVATE_WS
plot( 1:nsequences(msa), # x is a range from 1 to the number of sequences
vec(coverage(msa)) .* 100, # y is a Vector{Float64} with the coverage of each sequence
linetype = :line,
ylabel = "coverage [%]",
xlabel = "sequences",
legend=false)
/juliateam/.julia/packages/GR/yMV3y/src/../deps/gr/bin/gksqt: error while loading shared libraries: libQt5Widgets.so.5: cannot open shared object file: No such file or directory
connect: Connection refused
GKS: can't connect to GKS socket application
GKS: Open failed in routine OPEN_WS
GKS: GKS not in proper state. GKS must be either in the state WSOP or WSAC in routine ACTIVATE_WS
plot(msa)
/juliateam/.julia/packages/GR/yMV3y/src/../deps/gr/bin/gksqt: error while loading shared libraries: libQt5Widgets.so.5: cannot open shared object file: No such file or directory
connect: Connection refused
GKS: can't connect to GKS socket application
GKS: Open failed in routine OPEN_WS
GKS: GKS not in proper state. GKS must be either in the state WSOP or WSAC in routine ACTIVATE_WS
Example: Filter sequences per coverage and columns per gap fraction
Taking advantage of the filter...!
functions and the coverage
and columngapfraction
functions, it's possible to delete short sequences or columns with a lot of gaps.
println("\tsequences\tcolumns")
println( "Before:\t", nsequences(msa), "\t\t", ncolumns(msa) )
# delete sequences with less than 90% coverage of the MSA length:
filtersequences!(msa, coverage(msa) .>= 0.9)
# delete columns with more than 10% of gaps:
filtercolumns!(msa, columngapfraction(msa) .<= 0.1)
println( "After:\t", nsequences(msa), "\t\t", ncolumns(msa) )
sequences columns
Before: 251 116
After: 82 106
histogram( vec(columngapfraction(msa)),
# Using vec() to get a Vector{Float64} with the fraction of gaps of each column
xlabel = "gap fraction in [0,1]", bins = 20, legend = false)
/juliateam/.julia/packages/GR/yMV3y/src/../deps/gr/bin/gksqt: error while loading shared libraries: libQt5Widgets.so.5: cannot open shared object file: No such file or directory
connect: Connection refused
GKS: can't connect to GKS socket application
GKS: Open failed in routine OPEN_WS
GKS: GKS not in proper state. GKS must be either in the state WSOP or WSAC in routine ACTIVATE_WS
histogram( vec(coverage(msa) .* 100.0), # Column with the coverage of each sequence
xlabel = "coverage [%]", legend=false)
/juliateam/.julia/packages/GR/yMV3y/src/../deps/gr/bin/gksqt: error while loading shared libraries: libQt5Widgets.so.5: cannot open shared object file: No such file or directory
connect: Connection refused
GKS: can't connect to GKS socket application
GKS: Open failed in routine OPEN_WS
GKS: GKS not in proper state. GKS must be either in the state WSOP or WSAC in routine ACTIVATE_WS
Example: Plotting the percentage of identity between sequences
The distribution of the percentage of identity between every pair of sequences in an MSA, gives an idea of the MSA diversity. In this example, we are using percentidentity
over an MSA to get those identity values.
using MIToS.MSA
msa = read("http://pfam.xfam.org/family/PF09776/alignment/full", Stockholm)
pid = percentidentity(msa)
MIToS stores the matrix of percentage of identity between the aligned sequences as a PairwiseListMatrix from the PairwiseListMatrices package. This matrix type saves RAM, allowing the storage of big matrices. In this example, we use the to_table
function of PairwiseListMatrices to convert the matrix into a table with indices.
using PairwiseListMatrices
pidtable = to_table(pid, diagonal=false)
31375×3 Array{Any,2}:
"A0A2A2LU69_9BILA/10-118" "G1RVK4_NOMLE/46-161" 27.8261
"A0A2A2LU69_9BILA/10-118" "A0A194PY65_PAPXU/3-116" 40.5941
"A0A2A2LU69_9BILA/10-118" "A0A0V0S2G9_9BILA/731-840" 33.6634
"A0A2A2LU69_9BILA/10-118" "A0A016T9R4_9BILA/3-89" 45.9184
"A0A2A2LU69_9BILA/10-118" "A0A1V4KDS6_PATFA/159-270" 33.3333
"A0A2A2LU69_9BILA/10-118" "M4A0G1_XIPMA/29-139" 29.1262
"A0A2A2LU69_9BILA/10-118" "A0A1B0BJN8_9MUSC/16-132" 33.6634
"A0A2A2LU69_9BILA/10-118" "A0A1I8N4P6_MUSDO/6-107" 36.6337
"A0A2A2LU69_9BILA/10-118" "A0A1U7TPR2_TARSY/10-125" 23.4783
"A0A2A2LU69_9BILA/10-118" "B4ILV0_DROSE/3-106" 37.6238
⋮
"A0A2I3GIP3_NOMLE/10-125" "RM55_MOUSE/9-124" 73.2759
"A0A2I3GIP3_NOMLE/10-125" "M3XPJ7_MUSPF/10-125" 72.1739
"A0A2I3GIP3_NOMLE/10-125" "A0A182GHC0_AEDAL/6-119" 34.7826
"G3WPK1_SARHA/11-126" "RM55_MOUSE/9-124" 54.3103
"G3WPK1_SARHA/11-126" "M3XPJ7_MUSPF/10-125" 62.6087
"G3WPK1_SARHA/11-126" "A0A182GHC0_AEDAL/6-119" 40.8696
"RM55_MOUSE/9-124" "M3XPJ7_MUSPF/10-125" 70.6897
"RM55_MOUSE/9-124" "A0A182GHC0_AEDAL/6-119" 33.6207
"M3XPJ7_MUSPF/10-125" "A0A182GHC0_AEDAL/6-119" 35.6522
The function quantile
gives a quick idea of the percentage identity distribution of the MSA.
using Statistics
quantile(convert(Vector{Float64}, pidtable[:,3]), [0.00, 0.25, 0.50, 0.75, 1.00])
5-element Array{Float64,1}:
0.0
28.07017543859649
32.6530612244898
39.130434782608695
100.0
The function meanpercentidentity
gives the mean value of the percent identity distribution for MSA with less than 300 sequences, or a quick estimate (mean PID in a random sample of sequence pairs) otherwise unless you set exact
to true
.
meanpercentidentity(msa)
35.41589479524384
One can easily plot that matrix and its distribution using the heatmap
and histogram
functions of the Plots package.
using Plots
gr()
heatmap(convert(Matrix, pid), yflip=true, ratio=:equal)
/juliateam/.julia/packages/GR/yMV3y/src/../deps/gr/bin/gksqt: error while loading shared libraries: libQt5Widgets.so.5: cannot open shared object file: No such file or directory
connect: Connection refused
GKS: can't connect to GKS socket application
GKS: Open failed in routine OPEN_WS
GKS: GKS not in proper state. GKS must be either in the state WSOP or WSAC in routine ACTIVATE_WS
histogram(pidtable[:,3], xlabel ="Percentage of identity", legend=false)
/juliateam/.julia/packages/GR/yMV3y/src/../deps/gr/bin/gksqt: error while loading shared libraries: libQt5Widgets.so.5: cannot open shared object file: No such file or directory
connect: Connection refused
GKS: can't connect to GKS socket application
GKS: Open failed in routine OPEN_WS
GKS: GKS not in proper state. GKS must be either in the state WSOP or WSAC in routine ACTIVATE_WS
Sequence clustering
The MSA
module allows to clusterize sequences in an MSA. The hobohmI
function takes as input an MSA followed by an identity threshold value, and returns a Clusters
type with the result of a Hobohm I sequence clustering. The Hobohm I algorithm will add a sequence to an existing cluster, if the percentage of identity is equal or greater than the threshold. The Clusters
is sub-type of ClusteringResult
from the Clustering.jl package. One advantage of use a sub-type of ClusteringResult
is that you are able to use any method defined on Clustering.jl
like varinfo
(Variation of Information) for example. Also, you can use any clustering algorithm included in Clustering.jl, and convert its result to an Clusters
object to use it with MIToS. MSA
defines the functions nclusters
to get the resulting number of clusters, counts
to get the number of sequences on each cluster and assignments
to get the cluster number of each sequence. The most important method is getweight
, which returns the weight of each sequence. This method is used in the Information
module of MIToS to reduce redundancy.
Example: Reducing redundancy of a MSA
MSAs can suffer from an unnatural sequence redundancy and a high number of protein fragments. In this example, we are using a sequence clustering to make a non-redundant set of representative sequences. We are going to use the function hobohmI
to perform the clustering with the Hobohm I algorithm at 62% identity.
using MIToS.MSA
msa = read("http://pfam.xfam.org/family/PF09776/alignment/full", Stockholm)
println("This MSA has ", nsequences(msa), " sequences...")
This MSA has 251 sequences...
clusters = hobohmI(msa, 62)
MIToS.MSA.Clusters([2, 50, 36, 2, 8, 5, 7, 3, 1, 2 … 1, 1, 1, 1, 2, 2, 1, 1, 2, 1], [1, 2, 3, 4, 5, 6, 7, 8, 3, 2 … 16, 3, 20, 3, 82, 2, 15, 2, 2, 3], [0.5, 0.02, 0.027777777777777776, 0.5, 0.125, 0.2, 0.14285714285714285, 0.3333333333333333, 0.027777777777777776, 0.02 … 0.25, 0.027777777777777776, 0.09090909090909091, 0.027777777777777776, 1.0, 0.02, 0.3333333333333333, 0.02, 0.02, 0.027777777777777776])
println("...but has only ", nclusters(clusters), " sequence clusters after a clustering at 62% identity.")
...but has only 82 sequence clusters after a clustering at 62% identity.
using Plots
gr()
plot(msa)
/juliateam/.julia/packages/GR/yMV3y/src/../deps/gr/bin/gksqt: error while loading shared libraries: libQt5Widgets.so.5: cannot open shared object file: No such file or directory
connect: Connection refused
GKS: can't connect to GKS socket application
GKS: Open failed in routine OPEN_WS
GKS: GKS not in proper state. GKS must be either in the state WSOP or WSAC in routine ACTIVATE_WS
We are going to use the DataFrames package to easily select the sequence with the highest coverage of each cluster.
using DataFrames
df = DataFrame( seqnum = 1:nsequences(msa),
seqname = sequencenames(msa),
cluster = assignments(clusters), # the cluster number/index of each sequence
coverage = vec(coverage(msa)))
251 rows × 4 columns
seqnum | seqname | cluster | coverage | |
---|---|---|---|---|
Int64 | String | Int64 | Float64 | |
1 | 1 | A0A2A2LU69_9BILA/10-118 | 1 | 0.844828 |
2 | 2 | G1RVK4_NOMLE/46-161 | 2 | 0.991379 |
3 | 3 | A0A194PY65_PAPXU/3-116 | 3 | 0.793103 |
4 | 4 | A0A0V0S2G9_9BILA/731-840 | 4 | 0.853448 |
5 | 5 | A0A016T9R4_9BILA/3-89 | 5 | 0.62931 |
6 | 6 | A0A1V4KDS6_PATFA/159-270 | 6 | 0.844828 |
7 | 7 | M4A0G1_XIPMA/29-139 | 7 | 0.87069 |
8 | 8 | A0A1B0BJN8_9MUSC/16-132 | 8 | 0.775862 |
9 | 9 | A0A1I8N4P6_MUSDO/6-107 | 3 | 0.775862 |
10 | 10 | A0A1U7TPR2_TARSY/10-125 | 2 | 0.991379 |
11 | 11 | B4ILV0_DROSE/3-106 | 3 | 0.775862 |
12 | 12 | G3SRJ5_LOXAF/9-123 | 2 | 0.991379 |
13 | 13 | A0A0L8FVE8_OCTBM/8-111 | 9 | 0.818966 |
14 | 14 | W4XCA5_STRPU/9-121 | 10 | 0.887931 |
15 | 15 | RM55_HUMAN/10-125 | 2 | 0.991379 |
16 | 16 | A0A1U8D4J6_MESAU/9-124 | 2 | 1.0 |
17 | 17 | A0A1A9VR26_GLOAU/4-110 | 3 | 0.784483 |
18 | 18 | C4WSU2_ACYPI/3-101 | 11 | 0.775862 |
19 | 19 | B0WMT9_CULQU/8-116 | 12 | 0.827586 |
20 | 20 | A0A182TGF7_9DIPT/27-135 | 12 | 0.784483 |
21 | 21 | A0A1I7TQ78_9PELO/1-113 | 13 | 0.965517 |
22 | 22 | G1TZ81_RABIT/10-122 | 14 | 0.965517 |
23 | 23 | G3WPK0_SARHA/10-125 | 15 | 0.991379 |
24 | 24 | A0A226ND80_CALSU/6-88 | 16 | 0.706897 |
25 | 25 | A0A068WZ28_HYMMI/1-103 | 17 | 0.827586 |
26 | 26 | B4JSF8_DROGR/5-105 | 3 | 0.784483 |
27 | 27 | A0A044VAC1_ONCVO/26-140 | 18 | 0.905172 |
28 | 28 | G1KFV5_ANOCA/11-126 | 19 | 0.948276 |
29 | 29 | E2BEW2_HARSA/189-296 | 20 | 0.784483 |
30 | 30 | A0A1S3F3D0_DIPOR/10-125 | 2 | 1.0 |
31 | 31 | A0A0M3IY57_ANISI/12-103 | 21 | 0.62069 |
32 | 32 | A0A1I7XRJ6_HETBA/24-82 | 22 | 0.37931 |
33 | 33 | A0A151XA25_9HYME/324-435 | 20 | 0.784483 |
34 | 34 | A0A1A9ZL90_GLOPL/3-110 | 3 | 0.793103 |
35 | 35 | G1Q793_MYOLU/11-126 | 2 | 1.0 |
36 | 36 | S7NPL6_MYOBR/845-960 | 2 | 1.0 |
37 | 37 | E2RDI6_CANLF/10-123 | 2 | 0.974138 |
38 | 38 | A0A1S3T723_SALSA/16-92 | 23 | 0.646552 |
39 | 39 | D3ZF99_RAT/9-124 | 2 | 1.0 |
40 | 40 | A0A2D0QYE5_ICTPU/16-131 | 24 | 0.982759 |
41 | 41 | K7FVJ9_PELSI/10-124 | 19 | 0.982759 |
42 | 42 | A0A154PA44_9HYME/355-466 | 25 | 0.827586 |
43 | 43 | A0A075AF58_9TREM/385-439 | 26 | 0.37931 |
44 | 44 | A0A0N4ZNS7_PARTI/18-127 | 27 | 0.793103 |
45 | 45 | F6V4D5_CALJA/10-125 | 2 | 0.991379 |
46 | 46 | A0A091DB95_FUKDA/10-125 | 2 | 0.991379 |
47 | 47 | A0A158NN71_ATTCE/2-108 | 20 | 0.801724 |
48 | 48 | M7BRF2_CHEMY/1-109 | 19 | 0.931034 |
49 | 49 | G1P8D1_MYOLU/10-125 | 2 | 1.0 |
50 | 50 | A0A0R3RPH9_9BILA/11-122 | 18 | 0.896552 |
51 | 51 | A0A1I8CN28_9BILA/12-125 | 28 | 0.853448 |
52 | 52 | A0A0R3QN38_9BILA/6-118 | 18 | 0.913793 |
53 | 53 | A0A195CSW6_9HYME/356-466 | 20 | 0.801724 |
54 | 54 | T1JZX0_TETUR/9-124 | 29 | 0.793103 |
55 | 55 | A0A0Q3TJT0_AMAAE/8-119 | 6 | 0.862069 |
56 | 56 | A0A0N4XDL0_NIPBR/12-122 | 5 | 0.810345 |
57 | 57 | A0A2I4BYR2_9TELE/24-137 | 7 | 0.896552 |
58 | 58 | A0A0J7L295_LASNI/1-107 | 20 | 0.818966 |
59 | 59 | A0A1A6FY34_NEOLE/9-124 | 2 | 0.991379 |
60 | 60 | RM55_BOVIN/10-125 | 2 | 0.991379 |
61 | 61 | A0A151I3L2_9HYME/359-470 | 20 | 0.827586 |
62 | 62 | K7J8V2_NASVI/9-122 | 30 | 0.939655 |
63 | 63 | A0A182N5M4_9DIPT/11-112 | 3 | 0.775862 |
64 | 64 | Q2KHY3_BOVIN/10-81 | 31 | 0.586207 |
65 | 65 | U3JEB6_FICAL/51-139 | 32 | 0.62069 |
66 | 66 | H9JKD3_BOMMO/5-113 | 3 | 0.827586 |
67 | 67 | A0A096NE75_PAPAN/10-125 | 2 | 1.0 |
68 | 68 | A0A0L7K4G1_9NEOP/2-107 | 3 | 0.784483 |
69 | 69 | U3JEB8_FICAL/1-96 | 6 | 0.801724 |
70 | 70 | A0A0K0G1S2_9BILA/1-106 | 33 | 0.853448 |
71 | 71 | Q177J1_AEDAE/2-112 | 12 | 0.87069 |
72 | 72 | A0A183IMM9_9BILA/3-111 | 34 | 0.87069 |
73 | 73 | H0X3U7_OTOGA/12-127 | 2 | 1.0 |
74 | 74 | B5X5L6_SALSA/16-131 | 24 | 0.948276 |
75 | 75 | G7Y6C1_CLOSI/31-139 | 35 | 0.758621 |
76 | 76 | Q7QBA3_ANOGA/3-112 | 3 | 0.784483 |
77 | 77 | A0A0P7WDW8_9TELE/6-121 | 24 | 1.0 |
78 | 78 | A0A0N5DDH1_TRIMR/1044-1157 | 36 | 0.905172 |
79 | 79 | A0A0R3W9U0_TAEAS/1-118 | 17 | 0.818966 |
80 | 80 | A0A183I3F0_9BILA/5-118 | 18 | 0.896552 |
81 | 81 | A0A0L0CIQ9_LUCCU/4-107 | 3 | 0.801724 |
82 | 82 | A0A177AYT9_9METZ/1-103 | 37 | 0.810345 |
83 | 83 | F7H239_MACMU/31-146 | 2 | 1.0 |
84 | 84 | A0A090MXF3_STRRB/1-104 | 33 | 0.844828 |
85 | 85 | F6V3G4_CALJA/44-159 | 2 | 0.991379 |
86 | 86 | W5L7V8_ASTMX/3-117 | 24 | 0.974138 |
87 | 87 | K7AP45_PANTR/10-125 | 2 | 0.991379 |
88 | 88 | A0A0F8BZB4_LARCR/26-137 | 7 | 0.87931 |
89 | 89 | H0VQK5_CAVPO/10-125 | 2 | 0.991379 |
90 | 90 | A0A0D9RV95_CHLSB/10-125 | 2 | 1.0 |
91 | 91 | W2TIQ9_NECAM/261-369 | 38 | 0.862069 |
92 | 92 | A0A0N5AUA8_9BILA/2-96 | 39 | 0.793103 |
93 | 93 | A0A0A0MST4_HUMAN/10-43 | 40 | 0.284483 |
94 | 94 | A0A2G5UEV7_9PELO/1-113 | 13 | 0.948276 |
95 | 95 | G1MUJ2_MELGA/2-113 | 16 | 0.87931 |
96 | 96 | V4BHI3_LOTGI/4-117 | 41 | 0.905172 |
97 | 97 | B4NHY3_DROWI/3-106 | 3 | 0.784483 |
98 | 98 | A0A1Y9H9E9_9DIPT/2-112 | 3 | 0.784483 |
99 | 99 | A0A131ZZJ6_SARSC/3-88 | 42 | 0.603448 |
100 | 100 | A0A2G8JL66_STIJA/34-145 | 10 | 0.87069 |
101 | 101 | G1T9B1_RABIT/10-124 | 2 | 0.982759 |
102 | 102 | A0A183FDR8_HELBK/28-137 | 5 | 0.663793 |
103 | 103 | A0A2K5J272_COLAP/10-125 | 2 | 1.0 |
104 | 104 | W6NTD8_HAECO/10-122 | 38 | 0.853448 |
105 | 105 | A0A182JYN4_9DIPT/5-113 | 3 | 0.784483 |
106 | 106 | B5DTE4_DROPS/4-106 | 3 | 0.775862 |
107 | 107 | H3B9B9_LATCH/16-131 | 24 | 0.982759 |
108 | 108 | RM55_DROME/3-106 | 3 | 0.775862 |
109 | 109 | A0A182YGH3_ANOST/3-112 | 3 | 0.784483 |
110 | 110 | A8WHP2_DANRE/18-132 | 24 | 0.956897 |
111 | 111 | L9KWN5_TUPCH/1-104 | 2 | 0.887931 |
112 | 112 | A0A2K5KHX2_CERAT/10-125 | 2 | 1.0 |
113 | 113 | A0A1W4UAL9_DROFC/4-106 | 3 | 0.775862 |
114 | 114 | A0A151LZI2_ALLMI/58-172 | 43 | 0.982759 |
115 | 115 | G3RI37_GORGO/46-161 | 2 | 0.991379 |
116 | 116 | A0A0N4SYK1_BRUPA/6-118 | 18 | 0.922414 |
117 | 117 | A0A0K0DW22_STRER/1-104 | 33 | 0.844828 |
118 | 118 | A0A1B0GPI2_PHLPP/76-188 | 44 | 0.939655 |
119 | 119 | A0A1J1HIQ9_9DIPT/7-113 | 45 | 0.818966 |
120 | 120 | W5J6W6_ANODA/2-112 | 3 | 0.784483 |
121 | 121 | A0A0I9N7W2_BRUMA/6-118 | 18 | 0.922414 |
122 | 122 | U4U9N4_DENPD/3-106 | 46 | 0.784483 |
123 | 123 | I3JL90_ORENI/24-134 | 7 | 0.887931 |
124 | 124 | B3M1P4_DROAN/4-106 | 3 | 0.775862 |
125 | 125 | H2S474_TAKRU/2-105 | 7 | 0.853448 |
126 | 126 | B4MB45_DROVI/2-105 | 3 | 0.775862 |
127 | 127 | A0A0R3WIR7_HYDTA/14-118 | 17 | 0.775862 |
128 | 128 | A0A077ZGK7_TRITR/1-108 | 36 | 0.896552 |
129 | 129 | F6TE40_MONDO/1-115 | 15 | 0.974138 |
130 | 130 | A0A1X7U318_AMPQE/41-159 | 47 | 0.663793 |
131 | 131 | A0A0N4VDZ6_ENTVE/2-95 | 48 | 0.793103 |
132 | 132 | G5C999_HETGA/4-102 | 2 | 0.793103 |
133 | 133 | A0A1W0XDC4_HYPDU/21-129 | 49 | 0.767241 |
134 | 134 | A0A1B0G6D5_GLOMM/2-110 | 8 | 0.818966 |
135 | 135 | A0A1L8FV00_XENLA/17-133 | 50 | 0.991379 |
136 | 136 | A0A1I7W0F6_LOALO/17-130 | 18 | 0.87931 |
137 | 137 | A0A183WRR5_TRIRE/1-95 | 51 | 0.818966 |
138 | 138 | G3SD22_GORGO/10-125 | 2 | 0.991379 |
139 | 139 | K1RC04_CRAGI/1-50 | 52 | 0.431034 |
140 | 140 | E2AMM9_CAMFO/352-464 | 20 | 0.862069 |
141 | 141 | A0A182E334_ONCOC/37-140 | 18 | 0.784483 |
142 | 142 | A0A2A2LU44_9BILA/10-118 | 1 | 0.844828 |
143 | 143 | G4LYQ5_SCHMA/4-108 | 51 | 0.801724 |
144 | 144 | Q9TYJ8_CAEEL/5-117 | 13 | 0.965517 |
145 | 145 | D6X0J3_TRICA/2-106 | 46 | 0.784483 |
146 | 146 | A0A2K5KHY3_CERAT/33-148 | 2 | 1.0 |
147 | 147 | A0A2K5J2K0_COLAP/39-154 | 2 | 1.0 |
148 | 148 | H2YZU3_CIOSA/21-124 | 53 | 0.767241 |
149 | 149 | A0A1I8BZY4_MELHA/1-67 | 54 | 0.560345 |
150 | 150 | A0A1A9X193_9MUSC/4-107 | 3 | 0.793103 |
151 | 151 | S9YLR7_CAMFR/19-135 | 2 | 0.991379 |
152 | 152 | A0A1V9XBH2_9ACAR/7-121 | 55 | 0.896552 |
153 | 153 | A0A183SQK1_SCHSO/64-115 | 56 | 0.422414 |
154 | 154 | A0A1D1VKA4_RAMVA/12-123 | 57 | 0.853448 |
155 | 155 | A0A0K0D594_ANGCA/3-99 | 5 | 0.775862 |
156 | 156 | A0A2B4RQ34_STYPI/26-138 | 58 | 0.87069 |
157 | 157 | A0A1I8QBY2_STOCA/5-107 | 3 | 0.810345 |
158 | 158 | V8PEJ2_OPHHA/1-78 | 59 | 0.672414 |
159 | 159 | F6SS40_CIOIN/11-120 | 60 | 0.836207 |
160 | 160 | A0A0K0G1J5_9BILA/30-107 | 61 | 0.560345 |
161 | 161 | B4G668_DROPE/4-106 | 3 | 0.775862 |
162 | 162 | A0A0N4W4F7_HAEPC/11-123 | 38 | 0.853448 |
163 | 163 | A0A0L7QW11_9HYME/370-465 | 62 | 0.637931 |
164 | 164 | A7RNQ0_NEMVE/26-143 | 63 | 0.818966 |
165 | 165 | A0A068YAY6_ECHMU/8-118 | 17 | 0.810345 |
166 | 166 | A0A182XIX4_ANOQN/3-112 | 3 | 0.784483 |
167 | 167 | G7NXC9_MACFA/10-125 | 2 | 1.0 |
168 | 168 | T1J5Q4_STRMM/5-116 | 64 | 0.827586 |
169 | 169 | A0A182FKY5_ANOAL/50-161 | 3 | 0.784483 |
170 | 170 | A0A0R3SVQ2_HYMDI/1-110 | 17 | 0.896552 |
171 | 171 | A0A0N4U7G0_DRAME/3-89 | 65 | 0.612069 |
172 | 172 | G3HHN0_CRIGR/9-122 | 2 | 0.965517 |
173 | 173 | A0A0M3HSN6_ASCLU/9-120 | 66 | 0.818966 |
174 | 174 | L5M5N5_MYODS/10-125 | 2 | 1.0 |
175 | 175 | A0A2G8K792_STIJA/34-145 | 67 | 0.87069 |
176 | 176 | A0A0R3PMB8_ANGCS/12-119 | 5 | 0.818966 |
177 | 177 | G1MCX3_AILME/10-125 | 2 | 1.0 |
178 | 178 | A0A182L7I7_9DIPT/46-160 | 3 | 0.784483 |
179 | 179 | A0A1Y3BGN7_EURMA/199-316 | 68 | 0.87069 |
180 | 180 | A0A1U7S2P1_ALLSI/28-142 | 19 | 0.956897 |
181 | 181 | A0A1D5PF37_CHICK/13-125 | 16 | 0.922414 |
182 | 182 | E0VEI6_PEDHC/5-112 | 69 | 0.836207 |
183 | 183 | A0A182V438_ANOME/2-112 | 3 | 0.784483 |
184 | 184 | E3NM70_CAERE/1-113 | 13 | 0.931034 |
185 | 185 | I3L9J0_PIG/10-125 | 2 | 0.991379 |
186 | 186 | A0A2I3LGQ6_PAPAN/32-147 | 2 | 1.0 |
187 | 187 | A0A1S3WRB9_ERIEU/9-121 | 2 | 0.948276 |
188 | 188 | L5JXI7_PTEAL/1-97 | 2 | 0.827586 |
189 | 189 | G3NBS4_GASAC/1-91 | 7 | 0.784483 |
190 | 190 | S7MBG1_MYOBR/10-125 | 2 | 1.0 |
191 | 191 | G5C106_HETGA/10-124 | 2 | 0.974138 |
192 | 192 | A0A182RYH6_ANOFN/1-112 | 12 | 0.853448 |
193 | 193 | E9ICI6_SOLIN/2-109 | 20 | 0.767241 |
194 | 194 | A0A087T3L9_9ARAC/2-109 | 70 | 0.818966 |
195 | 195 | A0A1I7S389_BURXY/3-113 | 71 | 0.87931 |
196 | 196 | E3MY88_CAERE/200-252 | 72 | 0.422414 |
197 | 197 | U6JFQ1_ECHGR/10-118 | 17 | 0.801724 |
198 | 198 | B7Q0Y3_IXOSC/8-117 | 73 | 0.862069 |
199 | 199 | E4XN03_OIKDI/34-148 | 74 | 0.724138 |
200 | 200 | A0A084W390_ANOSI/1-69 | 75 | 0.594828 |
201 | 201 | A0A212FFC4_DANPL/1-73 | 3 | 0.62931 |
202 | 202 | A8XI10_CAEBR/1-113 | 13 | 0.939655 |
203 | 203 | J9EWJ7_WUCBA/1-94 | 76 | 0.715517 |
204 | 204 | A0A1V4KE06_PATFA/103-214 | 6 | 0.844828 |
205 | 205 | A0A0D8XRC7_DICVI/3-104 | 5 | 0.741379 |
206 | 206 | A0A0V0TDB0_9BILA/725-833 | 4 | 0.862069 |
207 | 207 | A0A2C9LIH9_BIOGL/13-128 | 77 | 0.948276 |
208 | 208 | A0A0M9A9F3_9HYME/351-460 | 78 | 0.862069 |
209 | 209 | A0A195EEX3_9HYME/349-459 | 20 | 0.758621 |
210 | 210 | A0A067RHM1_ZOONE/8-126 | 3 | 0.844828 |
211 | 211 | F4WNV6_ACREC/304-415 | 20 | 0.801724 |
212 | 212 | A0A088ARM4_APIME/14-122 | 78 | 0.793103 |
213 | 213 | A0A182W0I3_9DIPT/2-112 | 12 | 0.844828 |
214 | 214 | F7DF84_HORSE/10-124 | 2 | 0.991379 |
215 | 215 | A0A1D2MP76_ORCCI/27-138 | 79 | 0.827586 |
216 | 216 | A0A0B2V267_TOXCA/8-119 | 66 | 0.87069 |
217 | 217 | A0A2C9LJ85_BIOGL/16-131 | 77 | 0.948276 |
218 | 218 | A0A1A9XI44_GLOFF/16-132 | 8 | 0.775862 |
219 | 219 | H2Q197_PANTR/46-161 | 2 | 0.991379 |
220 | 220 | A0A218UFN6_9PASE/5-115 | 6 | 0.87069 |
221 | 221 | A0A1W4XV80_AGRPL/2-113 | 46 | 0.793103 |
222 | 222 | A0A232FIY4_9HYME/9-122 | 30 | 0.931034 |
223 | 223 | A0A261BVI0_9PELO/63-176 | 13 | 0.939655 |
224 | 224 | A0A087Y3U8_POEFO/26-139 | 7 | 0.905172 |
225 | 225 | A0A182PNP6_9DIPT/2-112 | 3 | 0.784483 |
226 | 226 | H3DW02_PRIPA/2-128 | 80 | 0.87931 |
227 | 227 | W5P8R5_SHEEP/10-125 | 2 | 0.991379 |
228 | 228 | X6RIW1_HUMAN/10-76 | 31 | 0.568966 |
229 | 229 | A0A0N5CW25_THECL/95-207 | 18 | 0.836207 |
230 | 230 | X6R631_HUMAN/1-52 | 81 | 0.448276 |
231 | 231 | A0A0R3TCC4_HYMNN/2-99 | 17 | 0.801724 |
232 | 232 | B4KBB5_DROMO/3-105 | 3 | 0.784483 |
233 | 233 | B4QTH5_DROSI/3-106 | 3 | 0.784483 |
234 | 234 | A0A182LXD1_9DIPT/3-113 | 12 | 0.706897 |
235 | 235 | H2N3H3_PONAB/10-125 | 2 | 0.991379 |
236 | 236 | F7GI04_MACMU/1-52 | 81 | 0.448276 |
237 | 237 | A0A094ZMR1_SCHHA/3-104 | 51 | 0.801724 |
238 | 238 | A0A0B1SAW7_OESDE/11-105 | 5 | 0.689655 |
239 | 239 | M3WM55_FELCA/10-125 | 2 | 1.0 |
240 | 240 | A0A0C2BEU2_9BILA/5-105 | 5 | 0.801724 |
241 | 241 | A0A0N5CCD5_STREA/1-104 | 33 | 0.844828 |
242 | 242 | A0A226P586_COLVI/2-108 | 16 | 0.87069 |
243 | 243 | A0A0M3QXL3_DROBS/2-105 | 3 | 0.784483 |
244 | 244 | A0A195FWH0_9HYME/356-468 | 20 | 0.784483 |
245 | 245 | A0A182I7Y1_ANOAR/4-112 | 3 | 0.784483 |
246 | 246 | T1FXF3_HELRO/1-90 | 82 | 0.775862 |
247 | 247 | A0A2I3GIP3_NOMLE/10-125 | 2 | 0.991379 |
248 | 248 | G3WPK1_SARHA/11-126 | 15 | 0.991379 |
249 | 249 | RM55_MOUSE/9-124 | 2 | 1.0 |
250 | 250 | M3XPJ7_MUSPF/10-125 | 2 | 0.991379 |
251 | 251 | A0A182GHC0_AEDAL/6-119 | 3 | 0.836207 |
It is possible to use this DataFrame
and Plots
to plot the sequence coverage of the MSA and also an histogram of the number of sequences in each cluster:
using StatPlots # Plotting DataFrames
h = @df df histogram(:cluster, ylabel="nseq")
p = @df df plot(:cluster, :coverage, linetype=:scatter)
plot(p, h, nc=1, xlim=(0, nclusters(clusters)+1 ), legend=false)
/juliateam/.julia/packages/GR/yMV3y/src/../deps/gr/bin/gksqt: error while loading shared libraries: libQt5Widgets.so.5: cannot open shared object file: No such file or directory
connect: Connection refused
GKS: can't connect to GKS socket application
GKS: Open failed in routine OPEN_WS
GKS: GKS not in proper state. GKS must be either in the state WSOP or WSAC in routine ACTIVATE_WS
We use the Split-Apply-Combine strategy, though the by
function of the DataFrames
package, to select the sequence of highest coverage for each cluster.
maxcoverage = by(df, :cluster, cl -> cl[ findmax(cl[:coverage])[2] ,
[:seqnum, :seqname, :coverage]])
p = @df maxcoverage plot(:cluster, :coverage, linetype=:scatter)
h = @df maxcoverage histogram(:cluster, ylabel="nseq")
plot(p, h, nc=1, xlim=(0, nclusters(clusters)+1 ), legend=false)
png("msa_clusters_iii.png") # hide
nothing # hide
We can easily generate a mask using list comprehension, to select only the representative sequences of the MSA (deleting the rest of the sequences with filtersequences!
).
cluster_references = Bool[ seqnum in maxcoverage[:seqnum] for seqnum in 1:nsequences(msa) ]
filtersequences!(msa, cluster_references)
plot(msa)
/juliateam/.julia/packages/GR/yMV3y/src/../deps/gr/bin/gksqt: error while loading shared libraries: libQt5Widgets.so.5: cannot open shared object file: No such file or directory
connect: Connection refused
GKS: can't connect to GKS socket application
GKS: Open failed in routine OPEN_WS
GKS: GKS not in proper state. GKS must be either in the state WSOP or WSAC in routine ACTIVATE_WS