MSA

The MSA module of MIToS has utilities for working with Multiple Sequence Alignments of protein Sequences (MSA).

using MIToS.MSA # to load the MSA module

Features

Contents

MSA IO

Reading MSA files

The main function for reading MSA files in MIToS is read and it is defined in the Utils module. This function takes a filename/path as a first argument followed by other arguments. It opens the file and uses the arguments to call the parse function. read decides how to open the file, using the prefixes (e.g. https) and suffixes (i.e. extensions) of the file name, while parse does the actual parsing of the file. You can readgzipped files if they have the .gz extension and also urls pointing to a web file. The second argument of read and parse is the file FileFormat. The supported MSA formats at the moment are Stockholm, FASTA, PIR (NBRF) and Raw. For example, reading with MIToS the full Stockholm MSA of the family PF07388 using the Pfam RESTful interface will be:

using MIToS.MSA

read("http://pfam.xfam.org/family/PF07388/alignment/full", Stockholm)
AnnotatedMultipleSequenceAlignment with 35 annotations : 14×459 Named Array{MIToS.MSA.Residue,2}
               Seq ╲ Col │  24   25   26   27   28  …  551  552  553  554  555
─────────────────────────┼────────────────────────────────────────────────────
A0A1S1T395_9RHIZ/224-427 │   -    -    -    -    -  …    -    -    -    -    -
M5A3D2_9ACTN/208-402     │   -    -    -    -    -       -    -    -    -    -
A0A1B8PI30_MORNO/200-434 │   -    -    -    -    -       -    -    -    -    -
A0A1W1V8Q8_9PAST/1-405   │   -    -    S    K    F       -    -    -    -    -
A0A0B0IGB5_9BACI/201-416 │   -    -    -    -    -       -    -    -    -    -
A0A0B0IGB5_9BACI/1-170   │   -    -    -    -    -       -    -    -    -    -
Q4W584_NEIMB/1-222       │   M    L    K    K    I       -    -    -    -    -
A0A1S1TKC5_9RHIZ/184-421 │   -    -    -    -    -       -    -    -    -    -
A0A1S1TKC5_9RHIZ/2-147   │   -    -    -    -    -       -    -    -    -    -
C5ZW53_9HELI/1-498       │   -    -    K    K    L       K    Y    L    Q    L
M5A3D2_9ACTN/2-157       │   -    -    -    -    -       -    -    -    -    -
A0A1B8PI30_MORNO/6-183   │   -    -    -    -    -       -    -    -    -    -
A0A157SWM6_9BORD/1-413   │   -    L    K    K    L       -    -    -    -    -
A0A1S1T395_9RHIZ/6-168   │   -    -    -    -    -  …    -    -    -    -    -

The third (and optional) argument of read and parse is the output MSA type:

  • Matrix{Residue} : It only contains the aligned sequences.
  • MultipleSequenceAlignment : It contains the aligned sequences and their

names/identifiers.

  • AnnotatedMultipleSequenceAlignment : It's the richest MIToS' MSA format and it's the

default. It includes the aligned sequences, their names and the MSA annotations.

Example of Matrix{Residue} output using a Stockholm file as input:

read("http://pfam.xfam.org/family/PF07388/alignment/full", Stockholm, Matrix{Residue})
14×459 Array{MIToS.MSA.Residue,2}:
 -  -  -  -  -  -  -  -  -  -  -  -  -  …  -  -  -  -  -  -  -  -  -  -  -  -
 -  -  -  -  -  -  -  -  -  -  -  -  -     -  -  -  -  -  -  -  -  -  -  -  -
 -  -  -  -  -  -  -  -  -  -  -  -  -     -  -  -  -  -  -  -  -  -  -  -  -
 -  -  S  K  F  T  K  F  I  F  N  P  K     -  -  -  -  -  -  -  -  -  -  -  -
 -  -  -  -  -  -  -  -  -  -  -  -  -     -  -  -  -  -  -  -  -  -  -  -  -
 -  -  -  -  -  -  -  -  -  -  -  -  -  …  -  -  -  -  -  -  -  -  -  -  -  -
 M  L  K  K  I  K  K  A  L  F  Q  P  K     -  -  -  -  -  -  -  -  -  -  -  -
 -  -  -  -  -  -  -  -  -  -  -  -  -     -  -  -  -  -  -  -  -  -  -  -  -
 -  -  -  -  -  -  -  -  -  -  -  -  -     -  -  -  -  -  -  -  -  -  -  -  -
 -  -  K  K  L  S  G  L  M  Q  D  I  K     D  F  Q  K  Y  R  I  K  Y  L  Q  L
 -  -  -  -  -  -  -  -  -  -  -  -  -  …  -  -  -  -  -  -  -  -  -  -  -  -
 -  -  -  -  -  -  -  -  -  -  -  -  -     -  -  -  -  -  -  -  -  -  -  -  -
 -  L  K  K  L  R  K  L  I  L  H  P  V     -  -  -  -  -  -  -  -  -  -  -  -
 -  -  -  -  -  -  -  -  -  -  -  -  -     -  -  -  -  -  -  -  -  -  -  -  -

Because read calls parse, you should look into the documentation of parse to know the available keyword arguments. The optional keyword arguments of those functions are:

  • generatemapping : If generatemapping is true (default: false), sequences and

columns mappings are generated and saved in the MSA annotations. The default is false to not overwrite mappings by mistake when you read an annotated MSA file saved with MIToS.

  • useidcoordinates : If useidcoordinates is true (default: false) and the names

have the form seqname/start-end, MIToS uses this coordinates to generate sequence mappings. This is safe and useful with unmodified Pfam MSAs. Do not use it when reading an MSA saved with MIToS. MIToS deletes unaligned insert columns, therefore disrupts sequences that have them.

  • deletefullgaps : Given that lowercase characters and dots are converted to gaps,

unaligned insert columns in the MSA (derived from a HMM profile) are converted into full gap columns. deletefullgaps is true by default, deleting full gaps columns and therefore insert columns.

Note

If you want to keep the insert columns... Use the keyword argument keepinserts to true in read/parse. This only works with an AnnotatedMultipleSequenceAlignment output. A column annotation ("Aligned") is stored in the annotations, where insert columns are marked with 0 and aligned columns with 1.

When read returns an AnnotatedMultipleSequenceAlignment, it uses the MSA Annotations to keep track of performed modifications. To access these notes, use printmodifications:

msa = read("http://pfam.xfam.org/family/PF01565/alignment/full", Stockholm)

printmodifications(msa)
-------------------
2020-04-20T11:45:46.413

deletefullgaps!  :  Deletes 861 columns full of gaps (inserts generate full gap columns on MIToS because lowercase and dots are not allowed)
-------------------
2020-04-20T11:45:46.526

filtercolumns! : 861 columns have been deleted.

Writing MSA files

Julia REPL shows MSAs as Matrices. If you want to print them in another format, you should use the print function with an MSA object as first argument and the FileFormatFASTA, Stockholm, PIR or Raw as second argument.

using MIToS.MSA

msa = read("http://pfam.xfam.org/family/PF16996/alignment/full", Stockholm) # reads a Stockholm MSA file

print(msa, FASTA) # prints msa in FASTA format
>U5P5G4_9STRE/5-59
KKDLFYKEVEGRMNALKRRSAEKEKATRSEKVNLTLNVVIGLVILLGVLLTLFRV
>T0TMU0_9STRE/1-43
------------MEELKQKPITKEKETRGEKINKSFSIMLGLVIIIGLIFTLI--
>A0A139NPI6_9STRE/5-59
-EDLFYKEVEGRMADLQQKAPEKEKKTGAERLNTLFSLALGLVILLGLLFTLLR-
>A0A139NTT2_9STRE/3-57
KKDLFYKDVEQKLDSLKQGQPKKEKASLGEKLNKAFVIALGLVILIGLIFTLI--
>T0TY77_9STRE/5-59
KKDLFYKEVEGRMESLKRRPAEKEKTTRSEKINVTFNVIIGLVILLGVIFTLFRV
>A8AWV6_STRGC/3-57
KKDLFYKDIEGRLDELKHGRPKKEKASLGEKFNKAFVIALGLMILIGLIFTLIG-
>A0A139NMD7_9STRE/4-59
KDDIFYKDIEGRMDELKRKPPKKEKKTRAERISTFFSVSLGLVILIGLLFTLFRI
>A3CM62_STRSV/3-57
KKDLFYKDIEGRLDELKHGKPKKEKASLGENLNKAFVIVLGLMILIGLIFTLI--

To save an MSA object to a file, use the write function. This function takes a filename as a first argument. If the filename ends with .gz, the output will be a compressed (gzipped) file. The next two arguments of write are passed to print, so write behaves as print.

write("msa.gz", msa, FASTA) # writes msa in FASTA format in a gzipped file

MSA Annotations

MSA annotations are based on the Stockholm format mark-ups. There are four types of annotations stored as dictionaries. All the annotations have a feature name as part of the key, which should be a single "word" (without spaces) and less than 50 characters long.

  • File annotations : The annotations can contain either file or MSA information. They

have feature names as keys and the values are strings (free text). Lines starting with #=GF in Stockholm format.

  • Column annotations : They have feature names as keys and strings with exactly 1 char

per column as values. Lines starting with #=GC in Stockholm format.

  • Sequence annotations : The keys are tuples with the sequence name and the feature

name. The values are free text (strings). Lines starting with #=GS in Stockholm format. Annotations in the PIR/NBRF format are also stored as sequence annotations. In particular, we use the names "Type" and "Title" to name the sequence type in the identifier line and the first comment line before the sequence in PIR files, respectively.

  • Residue annotations : The keys are tuples with the sequence name and the feature

name. The values are strings with exactly 1 char per column/residues. #=GR lines in Stockholm format.

Julia REPL shows the Annotations type as they are represented in the Stockholm format. You can get the Annotations inside an annotated MSA or sequence using the annotations function.

annotations(msa)
#=GF ID	Asp4
#=GF AC	PF16996.5
#=GF DE	Accessory secretory protein Sec Asp4
#=GF AU	Coggill P;0000-0001-5731-1588
#=GF SE	Pfam-B_7603 (release 27.0)
#=GF GA	25.70 25.70;
#=GF TC	55.00 54.80;
#=GF NC	21.60 20.60;
#=GF BM	hmmbuild HMM.ann SEED.ann
#=GF SM	hmmsearch -Z 45638612 -E 1000 --cpu 4 HMM pfamseq
#=GF TP	Family
#=GF RN	[1]
#=GF RM	23000954
#=GF RT	Emerging themes in SecA2-mediated protein export.
#=GF RA	Feltcher ME, Braunstein M;
#=GF RL	Nat Rev Microbiol. 2012;10:779-789.
#=GF DR	INTERPRO; IPR031551;
#=GF DR	SO; 0100021; polypeptide_conserved_region;
#=GF CC	Asp4 and Asp5 are putative accessory components of the SecY2
#=GF CC	channel of the SecA2-SecY2 mediated export system, but they are
#=GF CC	not present in all SecA2-SecY2 systems. This family of Asp4 is
#=GF CC	found in Firmicutes [1].
#=GF SQ	8
#=GF MIToS_2020-04-20T11:45:46.998	deletefullgaps!  :  Deletes 3 columns full of gaps (inserts generate full gap columns on MIToS because lowercase and dots are not allowed)
#=GF MIToS_2020-04-20T11:45:46.998	filtercolumns! : 3 columns have been deleted.
#=GS A3CM62_STRSV/3-57	AC	A3CM62.1
#=GS A0A139NTT2_9STRE/3-57	AC	A0A139NTT2.1
#=GS A0A139NMD7_9STRE/4-59	AC	A0A139NMD7.1
#=GS T0TMU0_9STRE/1-43	AC	T0TMU0.1
#=GS U5P5G4_9STRE/5-59	AC	U5P5G4.1
#=GS T0TY77_9STRE/5-59	AC	T0TY77.1
#=GS A0A139NPI6_9STRE/5-59	AC	A0A139NPI6.1
#=GS A8AWV6_STRGC/3-57	AC	A8AWV6.1
#=GC seq_cons			KKDLFYK-lEGRM--LK++sPcKEKsTtuEKlNpsFslsLGLVILIGLIFTLlt.

Particular annotations can be accessed using the functions getannot.... These functions take the MSA/sequence as first argument and the feature name of the desired annotation as the last. In the case of getannotsequence and getannotresidue, the second argument should be the sequence name.

getannotsequence(msa, "A0A139NPI6_9STRE/5-59", "AC") # ("A0A139NPI6_9STRE/5-59", "AC") is the key in the dictionary
"A0A139NPI6.1"

If you want to add new annotations, you should use the setannot…! functions. These functions have the same arguments that getannot... functions except for an extra argument used to indicate the new annotation value.

setannotsequence!(msa, "A0A139NPI6_9STRE/5-59", "New_Feature_Name", "New_Annotation")
"New_Annotation"

A getannot... function without the key (last arguments), returns the particular annotation dictionary. As you can see, the new sequence annotation is now part of our MSA annotations.

getannotsequence(msa)
Dict{Tuple{String,String},String} with 9 entries:
  ("A3CM62_STRSV/3-57", "AC")                   => "A3CM62.1"
  ("A0A139NTT2_9STRE/3-57", "AC")               => "A0A139NTT2.1"
  ("A0A139NMD7_9STRE/4-59", "AC")               => "A0A139NMD7.1"
  ("T0TMU0_9STRE/1-43", "AC")                   => "T0TMU0.1"
  ("U5P5G4_9STRE/5-59", "AC")                   => "U5P5G4.1"
  ("T0TY77_9STRE/5-59", "AC")                   => "T0TY77.1"
  ("A0A139NPI6_9STRE/5-59", "AC")               => "A0A139NPI6.1"
  ("A0A139NPI6_9STRE/5-59", "New_Feature_Name") => "New_Annotation"
  ("A8AWV6_STRGC/3-57", "AC")                   => "A8AWV6.1"

Editing your MSA

MIToS offers functions to edit your MSA. Because these functions modify the msa, their names end with a bang !, following the Julia convention. Some of these functions have an annotate keyword argument (in general, it's true by default) to indicate if the modification should be recorded in the MSA/sequence annotations.

One common task is to delete sequences or columns of the MSA. This could be done using the functions filtersequences! and filtercolumns!. These functions take the MSA or sequence (if it's possible) as first argument and a BitVector or Vector{Bool} mask as second argument. It deletes all the sequences or columns where the mask is false. These functions are also defined for Annotations, this allows to automatically update (modify) the annotations (and therefore, sequence and column mappings) in the MSA.

This two deleting operations are used in the second and third mutating functions of the following list:

  • setreference! : Sets one of the sequences as the first sequence of the MSA (query or

reference sequence).

  • adjustreference! : Deletes columns with gaps in the first sequence of the MSA

(reference).

  • gapstrip! : This function first calls adjustreference!, then deletes sequences with

low (user defined) MSA coverage and finally, columns with user defined % of gaps.

Also, there are several available funtions shuffle_…!. These functions are useful to generate random alignments. The Information module of MIToS uses them to calculate the Z scores of MI values.

Example: Deleting sequences

For example, if you want to keep only the proteins from Actinobacteria you can delete all the sequences that don't have _9ACTN in their UniProt entry names:

using MIToS.MSA

msa = read("http://pfam.xfam.org/family/PF07388/alignment/full", Stockholm)

sequencenames(msa) # the function sequencenames returns the sequence names in the MSA
14-element Array{String,1}:
 "A0A1S1T395_9RHIZ/224-427"
 "M5A3D2_9ACTN/208-402"
 "A0A1B8PI30_MORNO/200-434"
 "A0A1W1V8Q8_9PAST/1-405"
 "A0A0B0IGB5_9BACI/201-416"
 "A0A0B0IGB5_9BACI/1-170"
 "Q4W584_NEIMB/1-222"
 "A0A1S1TKC5_9RHIZ/184-421"
 "A0A1S1TKC5_9RHIZ/2-147"
 "C5ZW53_9HELI/1-498"
 "M5A3D2_9ACTN/2-157"
 "A0A1B8PI30_MORNO/6-183"
 "A0A157SWM6_9BORD/1-413"
 "A0A1S1T395_9RHIZ/6-168"
mask = map(x -> occursin(r"_9ACTN", x), sequencenames(msa)) # an element of mask is true if "_9ACTN" is in the name
14-element Array{Bool,1}:
 0
 1
 0
 0
 0
 0
 0
 0
 0
 0
 1
 0
 0
 0
filtersequences!(msa, mask) # deletes all the sequences where mask is false

sequencenames(msa)
2-element Array{String,1}:
 "M5A3D2_9ACTN/208-402"
 "M5A3D2_9ACTN/2-157"

Example: Exporting a MSA for freecontact (part I)

The most simple input for the command line tool freecontact (if you don't want to set --mincontsep) is a Raw MSA file with a reference sequence without insertions or gaps. This is easy to get with MIToS using read (deletes the insert columns), setreference! (to choose a reference), adjustreference! (to delete columns with gaps in the reference) and write (to save it in Raw format) functions.

julia> using MIToS.MSA

julia> msa = read("http://pfam.xfam.org/family/PF02476/alignment/full", Stockholm)
AnnotatedMultipleSequenceAlignment with 41 annotations : 20×126 Named Array{MIToS.MSA.Residue,2}
               Seq ╲ Col │   9   10   11   12   13  …  205  206  207  208  209
─────────────────────────┼────────────────────────────────────────────────────
US02_GAHVM/120-237       │   M    L    E    S    E  …    F    C    C    -    -
A0A1R3T8S2_9ALPH/119-225 │   -    -    -    -    -       -    -    -    -    -
B1A4T1_9ALPH/119-238     │   H    L    S    S    G       A    C    C    I    -
A0A288CG55_EHV4/111-257  │   Y    L    N    S    S       F    C    Q    E    -
E2IUH4_SHV1/110-233      │   -    -    -    -    -       -    -    -    -    -
Q8V727_CHV1/110-256      │   -    L    C    E    P       P    C    V    A    C
A0A060Q503_9ALPH/110-250 │   -    L    H    G    P       S    C    P    A    -
Q77L60_MEHV1/120-238     │   M    L    E    S    E       L    C    C    C    -
US02_HHV11/110-247       │   -    L    H    R    D       P    C    C    A    C
⋮                            ⋮    ⋮    ⋮    ⋮    ⋮  ⋱    ⋮    ⋮    ⋮    ⋮    ⋮
A0A0Y0DAF2_9ALPH/108-249 │   -    -    -    -    -       -    -    -    -    -
A0A068EPE8_9ALPH/120-239 │   Y    L    E    S    G       L    C    C    -    -
US02_HHV2H/110-246       │   L    L    H    Q    E       P    C    F    T    C
J9R0A8_9ALPH/178-313     │   -    -    -    -    -       -    -    -    -    -
A0A0A7D914_9ALPH/111-268 │   H    L    N    S    S       T    A    -    -    -
Q67639_ILTV/100-201      │   H    L    S    S    G       -    -    -    -    -
Q6R5Q3_9ALPH/120-237     │   M    L    E    S    E       F    C    C    -    -
Q5PP72_9ALPH/135-256     │   Y    L    N    S    G       R    C    T    I    -
Q5EGY0_CHV16/111-255     │   -    L    C    D    P  …    P    C    T    A    C

julia> msa_coverage = coverage(msa)
20×1 Named Array{Float64,2}
          Seq ╲ Function │ coverage
─────────────────────────┼─────────
US02_GAHVM/120-237       │ 0.904762
A0A1R3T8S2_9ALPH/119-225 │  0.65873
B1A4T1_9ALPH/119-238     │ 0.912698
A0A288CG55_EHV4/111-257  │ 0.960317
E2IUH4_SHV1/110-233      │ 0.801587
Q8V727_CHV1/110-256      │ 0.992063
A0A060Q503_9ALPH/110-250 │ 0.984127
Q77L60_MEHV1/120-238     │ 0.912698
US02_HHV11/110-247       │  0.97619
⋮                                 ⋮
A0A0Y0DAF2_9ALPH/108-249 │  0.65873
A0A068EPE8_9ALPH/120-239 │ 0.928571
US02_HHV2H/110-246       │ 0.984127
J9R0A8_9ALPH/178-313     │ 0.761905
A0A0A7D914_9ALPH/111-268 │ 0.960317
Q67639_ILTV/100-201      │ 0.325397
Q6R5Q3_9ALPH/120-237     │ 0.904762
Q5PP72_9ALPH/135-256     │ 0.880952
Q5EGY0_CHV16/111-255     │ 0.992063

julia> maxcoverage, maxindex = findmax(msa_coverage) # chooses the sequence with more coverage of the MSA
(0.9920634920634921, CartesianIndex(6, 1))

julia> setreference!(msa, maxindex[1])
AnnotatedMultipleSequenceAlignment with 42 annotations : 20×126 Named Array{MIToS.MSA.Residue,2}
               Seq ╲ Col │   9   10   11   12   13  …  205  206  207  208  209
─────────────────────────┼────────────────────────────────────────────────────
Q8V727_CHV1/110-256      │   -    L    C    E    P  …    P    C    V    A    C
A0A1R3T8S2_9ALPH/119-225 │   -    -    -    -    -       -    -    -    -    -
B1A4T1_9ALPH/119-238     │   H    L    S    S    G       A    C    C    I    -
A0A288CG55_EHV4/111-257  │   Y    L    N    S    S       F    C    Q    E    -
E2IUH4_SHV1/110-233      │   -    -    -    -    -       -    -    -    -    -
US02_GAHVM/120-237       │   M    L    E    S    E       F    C    C    -    -
A0A060Q503_9ALPH/110-250 │   -    L    H    G    P       S    C    P    A    -
Q77L60_MEHV1/120-238     │   M    L    E    S    E       L    C    C    C    -
US02_HHV11/110-247       │   -    L    H    R    D       P    C    C    A    C
⋮                            ⋮    ⋮    ⋮    ⋮    ⋮  ⋱    ⋮    ⋮    ⋮    ⋮    ⋮
A0A0Y0DAF2_9ALPH/108-249 │   -    -    -    -    -       -    -    -    -    -
A0A068EPE8_9ALPH/120-239 │   Y    L    E    S    G       L    C    C    -    -
US02_HHV2H/110-246       │   L    L    H    Q    E       P    C    F    T    C
J9R0A8_9ALPH/178-313     │   -    -    -    -    -       -    -    -    -    -
A0A0A7D914_9ALPH/111-268 │   H    L    N    S    S       T    A    -    -    -
Q67639_ILTV/100-201      │   H    L    S    S    G       -    -    -    -    -
Q6R5Q3_9ALPH/120-237     │   M    L    E    S    E       F    C    C    -    -
Q5PP72_9ALPH/135-256     │   Y    L    N    S    G       R    C    T    I    -
Q5EGY0_CHV16/111-255     │   -    L    C    D    P  …    P    C    T    A    C

julia> adjustreference!(msa)
AnnotatedMultipleSequenceAlignment with 43 annotations : 20×125 Named Array{MIToS.MSA.Residue,2}
               Seq ╲ Col │  10   11   12   13   14  …  205  206  207  208  209
─────────────────────────┼────────────────────────────────────────────────────
Q8V727_CHV1/110-256      │   L    C    E    P    R  …    P    C    V    A    C
A0A1R3T8S2_9ALPH/119-225 │   -    -    -    -    -       -    -    -    -    -
B1A4T1_9ALPH/119-238     │   L    S    S    G    I       A    C    C    I    -
A0A288CG55_EHV4/111-257  │   L    N    S    S    I       F    C    Q    E    -
E2IUH4_SHV1/110-233      │   -    -    -    -    -       -    -    -    -    -
US02_GAHVM/120-237       │   L    E    S    E    V       F    C    C    -    -
A0A060Q503_9ALPH/110-250 │   L    H    G    P    R       S    C    P    A    -
Q77L60_MEHV1/120-238     │   L    E    S    E    A       L    C    C    C    -
US02_HHV11/110-247       │   L    H    R    D    Q       P    C    C    A    C
⋮                            ⋮    ⋮    ⋮    ⋮    ⋮  ⋱    ⋮    ⋮    ⋮    ⋮    ⋮
A0A0Y0DAF2_9ALPH/108-249 │   -    -    -    -    -       -    -    -    -    -
A0A068EPE8_9ALPH/120-239 │   L    E    S    G    I       L    C    C    -    -
US02_HHV2H/110-246       │   L    H    Q    E    R       P    C    F    T    C
J9R0A8_9ALPH/178-313     │   -    -    -    -    -       -    -    -    -    -
A0A0A7D914_9ALPH/111-268 │   L    N    S    S    L       T    A    -    -    -
Q67639_ILTV/100-201      │   L    S    S    G    A       -    -    -    -    -
Q6R5Q3_9ALPH/120-237     │   L    E    S    E    V       F    C    C    -    -
Q5PP72_9ALPH/135-256     │   L    N    S    G    A       R    C    T    I    -
Q5EGY0_CHV16/111-255     │   L    C    D    P    R  …    P    C    T    A    C

julia> write("tofreecontact.msa", msa, Raw)

julia> print(read("tofreecontact.msa", String)) # It displays the contents of the output file
LCEPRPRPRLYHLWVVGAADLCVPFLEHLRRARRLVTIRVTDAWAGGSWPLPDGLAVGETVPWTPFPTAHNHPLAALFGGYEYRYGVVRPGWLRETVPRLWPGSESAAAEAAKTPHPARRPCVAC
------QTDRFRLWIIGAADICQQALSSFPKNMRRITVKVNGEWPGNPCFLDTNLEPVLISSWKPVSWPEK-ENNIDTSKLLCRYELLIP-----------------------------------
LSSGILGNHTFDMWIIGAADICKPAIEQLPNSKRFITIKVPGTWSGLTWEKPDGLSPLTVTEWDPCDDETMSKIEKKLVGIKCCYDLIGQ----------PAAAKHNSELDDSKDCCNGPACCI-
LNSSIIINRPYHLWVLGAADLCRPVFNLIPGPKRIVYVEIEDEF-NKSW-QPSETIPLTTVDARSLLQHKSSFVSPPPNFKRLIYAVVDP--MRLQENLCPQITNRTKTKRRSKKTYNGLFCQE-
------DQRLFHVWIVGAADLCEPLLACLSTGARCVVIEIGGLWGGRSWLPPAYARSGASTSWTPLPPPTGSAAESALEGCEYRYAIIPG----VEARRRPASRDENDERS--------------
LESEVSGNAPHSLWIVGAADICRIALECIPLPKRLLAIKVSGTWSGMPWAIPDNIQTLLTSTWEPKFDTPEDRAHFCDSDMVCVYKILGS----------PPNPLKPPEIEPPQMTPGRLFCC--
LHGPRDRPRLYHMWVVGAADLCAPFLESLWRMRRLITIKIPDGWVGASWHLPEPFLPATSVAWTPFPAPPNHPLESLLQNYEYRYGILSPRWLRNLVNKWKNMGGRPTLPPATSEHITQNSCPA-
LESEASDNVSYSLWILGAADICRTAIESIPLPKRLFAIKVPGTWAGMPWAIPCEIQTLLTSTWEPKFENIEDKAYFNDSNMACVYQIIGS----------PPDVPQLQGLGIESTPPKRNLCCC-
LHRDQPSPRLYHLWVVGAADLCVPFLEYAQKIRRFIAIKTPDAWVGEPWAVPTRFLPEWTVAWTPFPAAPNHPLETLLSRYEYQYGVVLPRWLRSLIALHKPHPATPGPLTTSHP--VRRPCCAC
LNSGIFGNYPLNLWVFGAADLCEPVISNIPGPKRLIYAYVSCEWPEPSW-KPENVKHIDDSGWVPVLDTFLSSFKVPPMLSDIFYGTTFNTPFNFESSPRCPSISSSSSSSSSSS----SSC---
LNSSVMGNRPYHLWVFGAADICRPVFDLLPGPKRLVYAELSEEWPGPSW-QPPP-----AVPWTAPERGTPADLEVPPRVARMVYAVVNRSWLRSRPLRHVPGRATPATPSVSTP-SGRGRC---
--------KLYHMWVIGAADICDPFLSCLQQKARFITIKMDFIWEGAAWLPPPRMCPKKTVPWVPCPISQDPALEQLLSNCSYEYGVVGPR----------------------------------
LESGIMSGGPFSLWIIGAADLCKEAIEHIPTPKRMFALQVKGTWSGMPWAIPEAIPVLLESSWSPQFTHPLDQRGFLESGMRGVYKVIGR-------ASDPPRNNSEPETRKPRGRLLKYLCC--
LHQERPGPRLYHLWVVGAADLCVPFFEYAQKTRRFIATKTNDAWVGEPWPLPDRFLPERTVSWTPFPAAPNHPLENLLSRYEYQYGVVVPRWLRSLVAPHKPRPASSRPHPATHP--TQRPCFTC
--------ELFAVWVVGGADICAPAMEAMMALGRVVAIAGAHSLTGSSWPISARLLPAVWHSWEPYPAAPTNPLKNTFAACTFVAGTVNARWLGALTSHTRPQV---------------------
LNSSLIINQPYHLWVLGAADLCKPVFDLIPGPKRMVYAEIADEF-HKSW-QPPETIPWTTVEHNGFSKHSSNSLVHPPTVKRVIYAVVDPAPGRPRQRRRPSEGGMRAPRRRSRAAPARSTA---
LSSGAYAAKEFHLWVLGTADICMAALN-LPAPKTFLITETG------------------------------------------------------------------------------------
LESEVSGNAPHSLWIVGAADICRIALECIPLPKRLLAIKVSGTWSGMPWAIPDNIQTLLTSTWEPKFDTPEDRAHFCDSDMVCVYKILGS----------PPNPLKPPEIEPPQMTPGRLFCC--
LNSGARGTAPIHLWILGAADLCDQVLLAAS---RSTAAGAPGAPTGARLT---RRRPGLTDA------DALDVIVAGIPATRAMFARVHNEALHAQIVTRGDVRRRRGGRGNGRE--RAPRCTI-
LCDPRPRPRLYHLWVVGAADLCAPFLEHLRDGRRLVAVRVPDAWAGGSWPLPEGLALGETVPWTPFPTAHNHPLAALFGTYEYRYGVLRPGWLREAASRLWAGPPATATEAAQKRHPARRPCTAC

Column and sequence mappings

Inserts in a Stockholm MSA allow to access the full fragment of the aligned sequences. Using this, combined with the sequence names that contain coordinates used in Pfam, you can know what is the UniProt residue number of each residue in the MSA.

"PROT_SPECI/3-15 .....insertALIGNED"
#                     3456789111111
#                            012345

MIToS read and parse functions delete the insert columns, but they do the mapping between each residue and its residue number before deleting insert columns when generatemapping is true. If you don't set useidcoordinates to true, the residue first i residue will be 1 instead of 3 in the previous example.

using MIToS.MSA

msa = parse("PROT_SPECI/3-15 .....insertALIGNED", Stockholm, generatemapping=true, useidcoordinates=true)
AnnotatedMultipleSequenceAlignment with 4 annotations : 1×7 Named Array{MIToS.MSA.Residue,2}
      Seq ╲ Col │ 12  13  14  15  16  17  18
────────────────┼───────────────────────────
PROT_SPECI/3-15 │  A   L   I   G   N   E   D

MIToS also keeps the column number of the input MSA and its total number of columns. All this data is stored in the MSA annotations using the SeqMap, ColMap and NCol feature names.

annotations(msa)
#=GF NCol	18
#=GF ColMap	12,13,14,15,16,17,18
#=GF MIToS_2020-04-20T11:45:52.202	deletefullgaps!  :  Deletes 11 columns full of gaps (inserts generate full gap columns on MIToS because lowercase and dots are not allowed)
#=GF MIToS_2020-04-20T11:45:52.202	filtercolumns! : 11 columns have been deleted.
#=GS PROT_SPECI/3-15	SeqMap	9,10,11,12,13,14,15

To have an easy access to mapping data, MIToS provides the getsequencemapping and getcolumnmapping functions.

getsequencemapping(msa, "PROT_SPECI/3-15")
7-element Array{Int64,1}:
  9
 10
 11
 12
 13
 14
 15
getcolumnmapping(msa)
7-element Array{Int64,1}:
 12
 13
 14
 15
 16
 17
 18

Example: Exporting a MSA for freecontact (part II)

If we want to use the --mincontsep argument of freecontact to calculate scores between distant residues, we will need to add a header to the MSA. This header should contains the residue number of the first residue of the sequence and the full fragment of that sequence (with the inserts). This data is used by FreeContact to calculate the residue number of each residue in the reference sequence. We are going to use MIToS mapping data to create this header, so we read the MSA with generatemapping and useidcoordinates set to true.

using MIToS.MSA

msa = read( "http://pfam.xfam.org/family/PF02476/alignment/full", Stockholm,
            generatemapping=true, useidcoordinates=true)
AnnotatedMultipleSequenceAlignment with 64 annotations : 20×126 Named Array{MIToS.MSA.Residue,2}
               Seq ╲ Col │   9   10   11   12   13  …  205  206  207  208  209
─────────────────────────┼────────────────────────────────────────────────────
US02_GAHVM/120-237       │   M    L    E    S    E  …    F    C    C    -    -
A0A1R3T8S2_9ALPH/119-225 │   -    -    -    -    -       -    -    -    -    -
B1A4T1_9ALPH/119-238     │   H    L    S    S    G       A    C    C    I    -
A0A288CG55_EHV4/111-257  │   Y    L    N    S    S       F    C    Q    E    -
E2IUH4_SHV1/110-233      │   -    -    -    -    -       -    -    -    -    -
Q8V727_CHV1/110-256      │   -    L    C    E    P       P    C    V    A    C
A0A060Q503_9ALPH/110-250 │   -    L    H    G    P       S    C    P    A    -
Q77L60_MEHV1/120-238     │   M    L    E    S    E       L    C    C    C    -
US02_HHV11/110-247       │   -    L    H    R    D       P    C    C    A    C
⋮                            ⋮    ⋮    ⋮    ⋮    ⋮  ⋱    ⋮    ⋮    ⋮    ⋮    ⋮
A0A0Y0DAF2_9ALPH/108-249 │   -    -    -    -    -       -    -    -    -    -
A0A068EPE8_9ALPH/120-239 │   Y    L    E    S    G       L    C    C    -    -
US02_HHV2H/110-246       │   L    L    H    Q    E       P    C    F    T    C
J9R0A8_9ALPH/178-313     │   -    -    -    -    -       -    -    -    -    -
A0A0A7D914_9ALPH/111-268 │   H    L    N    S    S       T    A    -    -    -
Q67639_ILTV/100-201      │   H    L    S    S    G       -    -    -    -    -
Q6R5Q3_9ALPH/120-237     │   M    L    E    S    E       F    C    C    -    -
Q5PP72_9ALPH/135-256     │   Y    L    N    S    G       R    C    T    I    -
Q5EGY0_CHV16/111-255     │   -    L    C    D    P  …    P    C    T    A    C

Here, we are going to choose the sequence with more coverage of the MSA as our reference sequence.

msa_coverage = coverage(msa)
maxcoverage, maxindex = findmax(msa_coverage)
setreference!(msa, maxindex[1])
adjustreference!(msa)
AnnotatedMultipleSequenceAlignment with 65 annotations : 20×125 Named Array{MIToS.MSA.Residue,2}
               Seq ╲ Col │  10   11   12   13   14  …  205  206  207  208  209
─────────────────────────┼────────────────────────────────────────────────────
Q8V727_CHV1/110-256      │   L    C    E    P    R  …    P    C    V    A    C
A0A1R3T8S2_9ALPH/119-225 │   -    -    -    -    -       -    -    -    -    -
B1A4T1_9ALPH/119-238     │   L    S    S    G    I       A    C    C    I    -
A0A288CG55_EHV4/111-257  │   L    N    S    S    I       F    C    Q    E    -
E2IUH4_SHV1/110-233      │   -    -    -    -    -       -    -    -    -    -
US02_GAHVM/120-237       │   L    E    S    E    V       F    C    C    -    -
A0A060Q503_9ALPH/110-250 │   L    H    G    P    R       S    C    P    A    -
Q77L60_MEHV1/120-238     │   L    E    S    E    A       L    C    C    C    -
US02_HHV11/110-247       │   L    H    R    D    Q       P    C    C    A    C
⋮                            ⋮    ⋮    ⋮    ⋮    ⋮  ⋱    ⋮    ⋮    ⋮    ⋮    ⋮
A0A0Y0DAF2_9ALPH/108-249 │   -    -    -    -    -       -    -    -    -    -
A0A068EPE8_9ALPH/120-239 │   L    E    S    G    I       L    C    C    -    -
US02_HHV2H/110-246       │   L    H    Q    E    R       P    C    F    T    C
J9R0A8_9ALPH/178-313     │   -    -    -    -    -       -    -    -    -    -
A0A0A7D914_9ALPH/111-268 │   L    N    S    S    L       T    A    -    -    -
Q67639_ILTV/100-201      │   L    S    S    G    A       -    -    -    -    -
Q6R5Q3_9ALPH/120-237     │   L    E    S    E    V       F    C    C    -    -
Q5PP72_9ALPH/135-256     │   L    N    S    G    A       R    C    T    I    -
Q5EGY0_CHV16/111-255     │   L    C    D    P    R  …    P    C    T    A    C

MIToS deletes the residues in insert columns, so we are going to use the sequence mapping to generate the whole fragment of the reference sequence (filling the missing regions with 'x').

seqmap = getsequencemapping(msa, 1) # seqmap will be a vector with the residue numbers of the first sequence (reference)

seq = collect( stringsequence(msa, 1) ) # seq will be a Vector of Chars with the reference sequence

sequence = map(seqmap[1]:seqmap[end]) do seqpos # for each position in the whole fragment
    if seqpos in seqmap                         # if that position is in the MSA
        popfirst!(seq)                          # the residue is taken from seq
    else                                        # otherwise
        'x'                                     # 'x' is included
    end
end

sequence = join(sequence) # join the Chars on the Vector to create a string
"LCEPRPRPRLYHLWVVGAADLCVPFLEHLRRARxxxRLVTIRVTDAWAxGGSWPLPDGLAVGETVPWTPFPTAHNHPLAALFGGYEYRYGVVRPxxxxxxxxxxxxxxGWLRETVPRLWPGSESAAAEAAKTPxxxHPARRPCVAC"

Once we have the whole fragment of the sequence, we create the file and write the header in the required format (as in the man page of freecontact).

open("tofreecontact.msa", "w") do fh

    println(fh, "# querystart=", seqmap[1])

    println(fh, "# query=", sequence )

end

As last (optional) argument, write takes the mode in which is opened the file. We use "a" here to append the MSA to the header.

write("tofreecontact.msa", msa, Raw, "a")
print(read("tofreecontact.msa", String)) # It displays the contents of the output file
# querystart=111
# query=LCEPRPRPRLYHLWVVGAADLCVPFLEHLRRARxxxRLVTIRVTDAWAxGGSWPLPDGLAVGETVPWTPFPTAHNHPLAALFGGYEYRYGVVRPxxxxxxxxxxxxxxGWLRETVPRLWPGSESAAAEAAKTPxxxHPARRPCVAC
LCEPRPRPRLYHLWVVGAADLCVPFLEHLRRARRLVTIRVTDAWAGGSWPLPDGLAVGETVPWTPFPTAHNHPLAALFGGYEYRYGVVRPGWLRETVPRLWPGSESAAAEAAKTPHPARRPCVAC
------QTDRFRLWIIGAADICQQALSSFPKNMRRITVKVNGEWPGNPCFLDTNLEPVLISSWKPVSWPEK-ENNIDTSKLLCRYELLIP-----------------------------------
LSSGILGNHTFDMWIIGAADICKPAIEQLPNSKRFITIKVPGTWSGLTWEKPDGLSPLTVTEWDPCDDETMSKIEKKLVGIKCCYDLIGQ----------PAAAKHNSELDDSKDCCNGPACCI-
LNSSIIINRPYHLWVLGAADLCRPVFNLIPGPKRIVYVEIEDEF-NKSW-QPSETIPLTTVDARSLLQHKSSFVSPPPNFKRLIYAVVDP--MRLQENLCPQITNRTKTKRRSKKTYNGLFCQE-
------DQRLFHVWIVGAADLCEPLLACLSTGARCVVIEIGGLWGGRSWLPPAYARSGASTSWTPLPPPTGSAAESALEGCEYRYAIIPG----VEARRRPASRDENDERS--------------
LESEVSGNAPHSLWIVGAADICRIALECIPLPKRLLAIKVSGTWSGMPWAIPDNIQTLLTSTWEPKFDTPEDRAHFCDSDMVCVYKILGS----------PPNPLKPPEIEPPQMTPGRLFCC--
LHGPRDRPRLYHMWVVGAADLCAPFLESLWRMRRLITIKIPDGWVGASWHLPEPFLPATSVAWTPFPAPPNHPLESLLQNYEYRYGILSPRWLRNLVNKWKNMGGRPTLPPATSEHITQNSCPA-
LESEASDNVSYSLWILGAADICRTAIESIPLPKRLFAIKVPGTWAGMPWAIPCEIQTLLTSTWEPKFENIEDKAYFNDSNMACVYQIIGS----------PPDVPQLQGLGIESTPPKRNLCCC-
LHRDQPSPRLYHLWVVGAADLCVPFLEYAQKIRRFIAIKTPDAWVGEPWAVPTRFLPEWTVAWTPFPAAPNHPLETLLSRYEYQYGVVLPRWLRSLIALHKPHPATPGPLTTSHP--VRRPCCAC
LNSGIFGNYPLNLWVFGAADLCEPVISNIPGPKRLIYAYVSCEWPEPSW-KPENVKHIDDSGWVPVLDTFLSSFKVPPMLSDIFYGTTFNTPFNFESSPRCPSISSSSSSSSSSS----SSC---
LNSSVMGNRPYHLWVFGAADICRPVFDLLPGPKRLVYAELSEEWPGPSW-QPPP-----AVPWTAPERGTPADLEVPPRVARMVYAVVNRSWLRSRPLRHVPGRATPATPSVSTP-SGRGRC---
--------KLYHMWVIGAADICDPFLSCLQQKARFITIKMDFIWEGAAWLPPPRMCPKKTVPWVPCPISQDPALEQLLSNCSYEYGVVGPR----------------------------------
LESGIMSGGPFSLWIIGAADLCKEAIEHIPTPKRMFALQVKGTWSGMPWAIPEAIPVLLESSWSPQFTHPLDQRGFLESGMRGVYKVIGR-------ASDPPRNNSEPETRKPRGRLLKYLCC--
LHQERPGPRLYHLWVVGAADLCVPFFEYAQKTRRFIATKTNDAWVGEPWPLPDRFLPERTVSWTPFPAAPNHPLENLLSRYEYQYGVVVPRWLRSLVAPHKPRPASSRPHPATHP--TQRPCFTC
--------ELFAVWVVGGADICAPAMEAMMALGRVVAIAGAHSLTGSSWPISARLLPAVWHSWEPYPAAPTNPLKNTFAACTFVAGTVNARWLGALTSHTRPQV---------------------
LNSSLIINQPYHLWVLGAADLCKPVFDLIPGPKRMVYAEIADEF-HKSW-QPPETIPWTTVEHNGFSKHSSNSLVHPPTVKRVIYAVVDPAPGRPRQRRRPSEGGMRAPRRRSRAAPARSTA---
LSSGAYAAKEFHLWVLGTADICMAALN-LPAPKTFLITETG------------------------------------------------------------------------------------
LESEVSGNAPHSLWIVGAADICRIALECIPLPKRLLAIKVSGTWSGMPWAIPDNIQTLLTSTWEPKFDTPEDRAHFCDSDMVCVYKILGS----------PPNPLKPPEIEPPQMTPGRLFCC--
LNSGARGTAPIHLWILGAADLCDQVLLAAS---RSTAAGAPGAPTGARLT---RRRPGLTDA------DALDVIVAGIPATRAMFARVHNEALHAQIVTRGDVRRRRGGRGNGRE--RAPRCTI-
LCDPRPRPRLYHLWVVGAADLCAPFLEHLRDGRRLVAVRVPDAWAGGSWPLPEGLALGETVPWTPFPTAHNHPLAALFGTYEYRYGVLRPGWLREAASRLWAGPPATATEAAQKRHPARRPCTAC

Get sequences from a MSA

It's possible to index the MSA as any other matrix to get an aligned sequence. This will be return a Array of Residues without annotations but keeping names/identifiers.

using MIToS.MSA

msa = read( "http://pfam.xfam.org/family/PF16996/alignment/full", Stockholm,
            generatemapping=true, useidcoordinates=true)
AnnotatedMultipleSequenceAlignment with 39 annotations : 8×55 Named Array{MIToS.MSA.Residue,2}
            Seq ╲ Col │  2   3   4   5   6   7  …  51  52  53  54  55  56
──────────────────────┼──────────────────────────────────────────────────
U5P5G4_9STRE/5-59     │  K   K   D   L   F   Y  …   L   T   L   F   R   V
T0TMU0_9STRE/1-43     │  -   -   -   -   -   -      F   T   L   I   -   -
A0A139NPI6_9STRE/5-59 │  -   E   D   L   F   Y      F   T   L   L   R   -
A0A139NTT2_9STRE/3-57 │  K   K   D   L   F   Y      F   T   L   I   -   -
T0TY77_9STRE/5-59     │  K   K   D   L   F   Y      F   T   L   F   R   V
A8AWV6_STRGC/3-57     │  K   K   D   L   F   Y      F   T   L   I   G   -
A0A139NMD7_9STRE/4-59 │  K   D   D   I   F   Y      F   T   L   F   R   I
A3CM62_STRSV/3-57     │  K   K   D   L   F   Y  …   F   T   L   I   -   -
msa[2,:] # second sequence of the MSA, it keeps column names
55-element Named Array{MIToS.MSA.Residue,1}
Col  │ 
─────┼──
2    │ -
3    │ -
4    │ -
5    │ -
6    │ -
7    │ -
8    │ -
9    │ -
10   │ -
⋮      ⋮
48   │ G
49   │ L
50   │ I
51   │ F
52   │ T
53   │ L
54   │ I
55   │ -
56   │ -
msa[2:2,:] # Using the range 2:2 to select the second sequence, keeping also the sequence name
1×55 Named Array{MIToS.MSA.Residue,2}
        Seq ╲ Col │  2   3   4   5   6   7   8  …  50  51  52  53  54  55  56
──────────────────┼──────────────────────────────────────────────────────────
T0TMU0_9STRE/1-43 │  -   -   -   -   -   -   -  …   I   F   T   L   I   -   -

If you want to obtain the aligned sequence with its name and annotations (and therefore sequence and column mappings), you should use the function getsequence. This function returns an AlignedSequence with the sequence name from a MultipleSequenceAlignment or an AnnotatedAlignedSequence, that also contains annotations, from an AnnotatedMultipleSequenceAlignment.

secondsequence = getsequence(msa, 2)
AnnotatedAlignedSequence with 25 annotations : 1×55 Named Array{MIToS.MSA.Residue,2}
        Seq ╲ Col │  2   3   4   5   6   7   8  …  50  51  52  53  54  55  56
──────────────────┼──────────────────────────────────────────────────────────
T0TMU0_9STRE/1-43 │  -   -   -   -   -   -   -  …   I   F   T   L   I   -   -
annotations(secondsequence)
#=GF ID	Asp4
#=GF AC	PF16996.5
#=GF DE	Accessory secretory protein Sec Asp4
#=GF AU	Coggill P;0000-0001-5731-1588
#=GF SE	Pfam-B_7603 (release 27.0)
#=GF GA	25.70 25.70;
#=GF TC	55.00 54.80;
#=GF NC	21.60 20.60;
#=GF BM	hmmbuild HMM.ann SEED.ann
#=GF SM	hmmsearch -Z 45638612 -E 1000 --cpu 4 HMM pfamseq
#=GF TP	Family
#=GF RN	[1]
#=GF RM	23000954
#=GF RT	Emerging themes in SecA2-mediated protein export.
#=GF RA	Feltcher ME, Braunstein M;
#=GF RL	Nat Rev Microbiol. 2012;10:779-789.
#=GF DR	INTERPRO; IPR031551;
#=GF DR	SO; 0100021; polypeptide_conserved_region;
#=GF CC	Asp4 and Asp5 are putative accessory components of the SecY2
#=GF CC	channel of the SecA2-SecY2 mediated export system, but they are
#=GF CC	not present in all SecA2-SecY2 systems. This family of Asp4 is
#=GF CC	found in Firmicutes [1].
#=GF SQ	8
#=GF NCol	58
#=GF ColMap	2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56
#=GF MIToS_2020-04-20T11:45:52.998	deletefullgaps!  :  Deletes 3 columns full of gaps (inserts generate full gap columns on MIToS because lowercase and dots are not allowed)
#=GF MIToS_2020-04-20T11:45:52.998	filtercolumns! : 3 columns have been deleted.
#=GS T0TMU0_9STRE/1-43	SeqMap	,,,,,,,,,,,,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,,
#=GS T0TMU0_9STRE/1-43	AC	T0TMU0.1
#=GC seq_cons			KKDLFYK-lEGRM--LK++sPcKEKsTtuEKlNpsFslsLGLVILIGLIFTLlt.

Use stringsequence if you want to get the sequence as a string.

stringsequence(msa, 2)
"------------MEELKQKPITKEKETRGEKINKSFSIMLGLVIIIGLIFTLI--"

Because matrices are stored columnwise in Julia, you will find useful the getresiduesequences function when you need to heavily operate over sequences.

getresiduesequences(msa)
8-element Array{Array{MIToS.MSA.Residue,1},1}:
 [K, K, D, L, F, Y, K, E, V, E  …  L, G, V, L, L, T, L, F, R, V]
 [-, -, -, -, -, -, -, -, -, -  …  I, G, L, I, F, T, L, I, -, -]
 [-, E, D, L, F, Y, K, E, V, E  …  L, G, L, L, F, T, L, L, R, -]
 [K, K, D, L, F, Y, K, D, V, E  …  I, G, L, I, F, T, L, I, -, -]
 [K, K, D, L, F, Y, K, E, V, E  …  L, G, V, I, F, T, L, F, R, V]
 [K, K, D, L, F, Y, K, D, I, E  …  I, G, L, I, F, T, L, I, G, -]
 [K, D, D, I, F, Y, K, D, I, E  …  I, G, L, L, F, T, L, F, R, I]
 [K, K, D, L, F, Y, K, D, I, E  …  I, G, L, I, F, T, L, I, -, -]

Describing your MSA

The MSA module has a number of functions to gain insight about your MSA. Using MIToS.MSA, one can easily ask for...

  • The number of columns and sequences with the ncolumns and nsequences functions.
  • The fraction of columns with residues (coverage) for each sequence making use of the

coverage method.

  • The fraction or percentage of gaps/residues using with the functions gapfraction,

residuefraction and columngapfraction.

  • The percentage of identity (PID) between each sequence of the MSA or its mean value

with percentidentity and meanpercentidentity.

The percentage identity between two aligned sequences is a common measure of sequence similarity and is used by the hobohmI method to estimate and reduce MSA redundancy. MIToS functions to calculate percent identity don't align the sequences, they need already aligned sequences. Full gaps columns don't count to the alignment length.

using MIToS.MSA

msa = permutedims(
        hcat(   res"--GGG-",      # res"..." uses the @res_str macro to create a (column) Vector{Residue}
                res"---GGG" ), (2,1))
#        identities 000110 sum 2
#  aligned residues 001111 sum 4
percentidentity(msa[1,:], msa[2,:]) # 2 / 4
50.0

To quickly calculate if the percentage of identity is greater than a determined value, use that threshold as third argument. percentidentity(seqa, seqb, pid) is a lot more faster than percentidentity(seqa, seqb) >= pid.

percentidentity(msa[1,:], msa[2,:], 62) # 50% >= 62%
false

Example: Plotting gap percentage per column and coverage per sequence

The gapfraction and coverage functions return a vector of numbers between 0.0 and 1.0 (fraction of...). Sometime it's useful to plot this data to quickly understand the MSA structure. In this example, we are going to use the Plots package for plotting, with the GR backend, but you are free to use any of the Julia plotting libraries.

using MIToS.MSA

msa = read("http://pfam.xfam.org/family/PF09776/alignment/full", Stockholm)

using Plots

gr(size=(600,300))

plot(   1:ncolumns(msa), # x is a range from 1 to the number of columns
        vec(columngapfraction(msa)) .* 100.0, # y is a Vector{Float64} with the percentage of gaps of each column
        linetype = :line,
        ylabel = "gaps [%]",
        xlabel = "columns",
        legend=false)
/juliateam/.julia/packages/GR/yMV3y/src/../deps/gr/bin/gksqt: error while loading shared libraries: libQt5Widgets.so.5: cannot open shared object file: No such file or directory
connect: Connection refused
GKS: can't connect to GKS socket application

GKS: Open failed in routine OPEN_WS
GKS: GKS not in proper state. GKS must be either in the state WSOP or WSAC in routine ACTIVATE_WS

plot(   1:nsequences(msa), # x is a range from 1 to the number of sequences
        vec(coverage(msa)) .* 100, # y is a Vector{Float64} with the coverage of each sequence
        linetype = :line,
        ylabel = "coverage [%]",
        xlabel = "sequences",
        legend=false)
/juliateam/.julia/packages/GR/yMV3y/src/../deps/gr/bin/gksqt: error while loading shared libraries: libQt5Widgets.so.5: cannot open shared object file: No such file or directory
connect: Connection refused
GKS: can't connect to GKS socket application

GKS: Open failed in routine OPEN_WS
GKS: GKS not in proper state. GKS must be either in the state WSOP or WSAC in routine ACTIVATE_WS

plot(msa)
/juliateam/.julia/packages/GR/yMV3y/src/../deps/gr/bin/gksqt: error while loading shared libraries: libQt5Widgets.so.5: cannot open shared object file: No such file or directory
connect: Connection refused
GKS: can't connect to GKS socket application

GKS: Open failed in routine OPEN_WS
GKS: GKS not in proper state. GKS must be either in the state WSOP or WSAC in routine ACTIVATE_WS

Example: Filter sequences per coverage and columns per gap fraction

Taking advantage of the filter...! functions and the coverage and columngapfraction functions, it's possible to delete short sequences or columns with a lot of gaps.

println("\tsequences\tcolumns")
println( "Before:\t", nsequences(msa), "\t\t", ncolumns(msa)  )
# delete sequences with less than 90% coverage of the MSA length:
filtersequences!(msa, coverage(msa) .>= 0.9)
# delete columns with more than 10% of gaps:
filtercolumns!(msa, columngapfraction(msa) .<= 0.1)
println( "After:\t", nsequences(msa), "\t\t",  ncolumns(msa)  )
	sequences	columns
Before:	251		116
After:	82		106
histogram(  vec(columngapfraction(msa)),
            # Using vec() to get a Vector{Float64} with the fraction of gaps of each column
            xlabel = "gap fraction in [0,1]", bins = 20, legend = false)
/juliateam/.julia/packages/GR/yMV3y/src/../deps/gr/bin/gksqt: error while loading shared libraries: libQt5Widgets.so.5: cannot open shared object file: No such file or directory
connect: Connection refused
GKS: can't connect to GKS socket application

GKS: Open failed in routine OPEN_WS
GKS: GKS not in proper state. GKS must be either in the state WSOP or WSAC in routine ACTIVATE_WS

histogram(  vec(coverage(msa) .* 100.0), #  Column with the coverage of each sequence
            xlabel = "coverage [%]", legend=false)
/juliateam/.julia/packages/GR/yMV3y/src/../deps/gr/bin/gksqt: error while loading shared libraries: libQt5Widgets.so.5: cannot open shared object file: No such file or directory
connect: Connection refused
GKS: can't connect to GKS socket application

GKS: Open failed in routine OPEN_WS
GKS: GKS not in proper state. GKS must be either in the state WSOP or WSAC in routine ACTIVATE_WS

Example: Plotting the percentage of identity between sequences

The distribution of the percentage of identity between every pair of sequences in an MSA, gives an idea of the MSA diversity. In this example, we are using percentidentity over an MSA to get those identity values.

using MIToS.MSA
msa = read("http://pfam.xfam.org/family/PF09776/alignment/full", Stockholm)
pid = percentidentity(msa)

MIToS stores the matrix of percentage of identity between the aligned sequences as a PairwiseListMatrix from the PairwiseListMatrices package. This matrix type saves RAM, allowing the storage of big matrices. In this example, we use the to_table function of PairwiseListMatrices to convert the matrix into a table with indices.

using PairwiseListMatrices

pidtable = to_table(pid, diagonal=false)
31375×3 Array{Any,2}:
 "A0A2A2LU69_9BILA/10-118"  "G1RVK4_NOMLE/46-161"       27.8261
 "A0A2A2LU69_9BILA/10-118"  "A0A194PY65_PAPXU/3-116"    40.5941
 "A0A2A2LU69_9BILA/10-118"  "A0A0V0S2G9_9BILA/731-840"  33.6634
 "A0A2A2LU69_9BILA/10-118"  "A0A016T9R4_9BILA/3-89"     45.9184
 "A0A2A2LU69_9BILA/10-118"  "A0A1V4KDS6_PATFA/159-270"  33.3333
 "A0A2A2LU69_9BILA/10-118"  "M4A0G1_XIPMA/29-139"       29.1262
 "A0A2A2LU69_9BILA/10-118"  "A0A1B0BJN8_9MUSC/16-132"   33.6634
 "A0A2A2LU69_9BILA/10-118"  "A0A1I8N4P6_MUSDO/6-107"    36.6337
 "A0A2A2LU69_9BILA/10-118"  "A0A1U7TPR2_TARSY/10-125"   23.4783
 "A0A2A2LU69_9BILA/10-118"  "B4ILV0_DROSE/3-106"        37.6238
 ⋮                                                      
 "A0A2I3GIP3_NOMLE/10-125"  "RM55_MOUSE/9-124"          73.2759
 "A0A2I3GIP3_NOMLE/10-125"  "M3XPJ7_MUSPF/10-125"       72.1739
 "A0A2I3GIP3_NOMLE/10-125"  "A0A182GHC0_AEDAL/6-119"    34.7826
 "G3WPK1_SARHA/11-126"      "RM55_MOUSE/9-124"          54.3103
 "G3WPK1_SARHA/11-126"      "M3XPJ7_MUSPF/10-125"       62.6087
 "G3WPK1_SARHA/11-126"      "A0A182GHC0_AEDAL/6-119"    40.8696
 "RM55_MOUSE/9-124"         "M3XPJ7_MUSPF/10-125"       70.6897
 "RM55_MOUSE/9-124"         "A0A182GHC0_AEDAL/6-119"    33.6207
 "M3XPJ7_MUSPF/10-125"      "A0A182GHC0_AEDAL/6-119"    35.6522

The function quantile gives a quick idea of the percentage identity distribution of the MSA.

using Statistics

quantile(convert(Vector{Float64}, pidtable[:,3]), [0.00, 0.25, 0.50, 0.75, 1.00])
5-element Array{Float64,1}:
   0.0
  28.07017543859649
  32.6530612244898
  39.130434782608695
 100.0

The function meanpercentidentity gives the mean value of the percent identity distribution for MSA with less than 300 sequences, or a quick estimate (mean PID in a random sample of sequence pairs) otherwise unless you set exact to true.

meanpercentidentity(msa)
35.41589479524384

One can easily plot that matrix and its distribution using the heatmap and histogram functions of the Plots package.

using Plots
gr()
heatmap(convert(Matrix, pid), yflip=true, ratio=:equal)
/juliateam/.julia/packages/GR/yMV3y/src/../deps/gr/bin/gksqt: error while loading shared libraries: libQt5Widgets.so.5: cannot open shared object file: No such file or directory
connect: Connection refused
GKS: can't connect to GKS socket application

GKS: Open failed in routine OPEN_WS
GKS: GKS not in proper state. GKS must be either in the state WSOP or WSAC in routine ACTIVATE_WS

histogram(pidtable[:,3], xlabel ="Percentage of identity", legend=false)
/juliateam/.julia/packages/GR/yMV3y/src/../deps/gr/bin/gksqt: error while loading shared libraries: libQt5Widgets.so.5: cannot open shared object file: No such file or directory
connect: Connection refused
GKS: can't connect to GKS socket application

GKS: Open failed in routine OPEN_WS
GKS: GKS not in proper state. GKS must be either in the state WSOP or WSAC in routine ACTIVATE_WS

Sequence clustering

The MSA module allows to clusterize sequences in an MSA. The hobohmI function takes as input an MSA followed by an identity threshold value, and returns a Clusters type with the result of a Hobohm I sequence clustering. The Hobohm I algorithm will add a sequence to an existing cluster, if the percentage of identity is equal or greater than the threshold. The Clusters is sub-type of ClusteringResult from the Clustering.jl package. One advantage of use a sub-type of ClusteringResultis that you are able to use any method defined on Clustering.jl like varinfo (Variation of Information) for example. Also, you can use any clustering algorithm included in Clustering.jl, and convert its result to an Clusters object to use it with MIToS. MSA defines the functions nclusters to get the resulting number of clusters, counts to get the number of sequences on each cluster and assignments to get the cluster number of each sequence. The most important method is getweight, which returns the weight of each sequence. This method is used in the Information module of MIToS to reduce redundancy.

Example: Reducing redundancy of a MSA

MSAs can suffer from an unnatural sequence redundancy and a high number of protein fragments. In this example, we are using a sequence clustering to make a non-redundant set of representative sequences. We are going to use the function hobohmI to perform the clustering with the Hobohm I algorithm at 62% identity.

using MIToS.MSA

msa = read("http://pfam.xfam.org/family/PF09776/alignment/full", Stockholm)

println("This MSA has ", nsequences(msa), " sequences...")
This MSA has 251 sequences...
clusters = hobohmI(msa, 62)
MIToS.MSA.Clusters([2, 50, 36, 2, 8, 5, 7, 3, 1, 2  …  1, 1, 1, 1, 2, 2, 1, 1, 2, 1], [1, 2, 3, 4, 5, 6, 7, 8, 3, 2  …  16, 3, 20, 3, 82, 2, 15, 2, 2, 3], [0.5, 0.02, 0.027777777777777776, 0.5, 0.125, 0.2, 0.14285714285714285, 0.3333333333333333, 0.027777777777777776, 0.02  …  0.25, 0.027777777777777776, 0.09090909090909091, 0.027777777777777776, 1.0, 0.02, 0.3333333333333333, 0.02, 0.02, 0.027777777777777776])
println("...but has only ", nclusters(clusters), " sequence clusters after a clustering at 62% identity.")
...but has only 82 sequence clusters after a clustering at 62% identity.
using Plots
gr()

plot(msa)
/juliateam/.julia/packages/GR/yMV3y/src/../deps/gr/bin/gksqt: error while loading shared libraries: libQt5Widgets.so.5: cannot open shared object file: No such file or directory
connect: Connection refused
GKS: can't connect to GKS socket application

GKS: Open failed in routine OPEN_WS
GKS: GKS not in proper state. GKS must be either in the state WSOP or WSAC in routine ACTIVATE_WS

We are going to use the DataFrames package to easily select the sequence with the highest coverage of each cluster.

using DataFrames

df = DataFrame( seqnum = 1:nsequences(msa),
                seqname = sequencenames(msa),
                cluster = assignments(clusters), # the cluster number/index of each sequence
                coverage = vec(coverage(msa)))

251 rows × 4 columns

seqnumseqnameclustercoverage
Int64StringInt64Float64
11A0A2A2LU69_9BILA/10-11810.844828
22G1RVK4_NOMLE/46-16120.991379
33A0A194PY65_PAPXU/3-11630.793103
44A0A0V0S2G9_9BILA/731-84040.853448
55A0A016T9R4_9BILA/3-8950.62931
66A0A1V4KDS6_PATFA/159-27060.844828
77M4A0G1_XIPMA/29-13970.87069
88A0A1B0BJN8_9MUSC/16-13280.775862
99A0A1I8N4P6_MUSDO/6-10730.775862
1010A0A1U7TPR2_TARSY/10-12520.991379
1111B4ILV0_DROSE/3-10630.775862
1212G3SRJ5_LOXAF/9-12320.991379
1313A0A0L8FVE8_OCTBM/8-11190.818966
1414W4XCA5_STRPU/9-121100.887931
1515RM55_HUMAN/10-12520.991379
1616A0A1U8D4J6_MESAU/9-12421.0
1717A0A1A9VR26_GLOAU/4-11030.784483
1818C4WSU2_ACYPI/3-101110.775862
1919B0WMT9_CULQU/8-116120.827586
2020A0A182TGF7_9DIPT/27-135120.784483
2121A0A1I7TQ78_9PELO/1-113130.965517
2222G1TZ81_RABIT/10-122140.965517
2323G3WPK0_SARHA/10-125150.991379
2424A0A226ND80_CALSU/6-88160.706897
2525A0A068WZ28_HYMMI/1-103170.827586
2626B4JSF8_DROGR/5-10530.784483
2727A0A044VAC1_ONCVO/26-140180.905172
2828G1KFV5_ANOCA/11-126190.948276
2929E2BEW2_HARSA/189-296200.784483
3030A0A1S3F3D0_DIPOR/10-12521.0
3131A0A0M3IY57_ANISI/12-103210.62069
3232A0A1I7XRJ6_HETBA/24-82220.37931
3333A0A151XA25_9HYME/324-435200.784483
3434A0A1A9ZL90_GLOPL/3-11030.793103
3535G1Q793_MYOLU/11-12621.0
3636S7NPL6_MYOBR/845-96021.0
3737E2RDI6_CANLF/10-12320.974138
3838A0A1S3T723_SALSA/16-92230.646552
3939D3ZF99_RAT/9-12421.0
4040A0A2D0QYE5_ICTPU/16-131240.982759
4141K7FVJ9_PELSI/10-124190.982759
4242A0A154PA44_9HYME/355-466250.827586
4343A0A075AF58_9TREM/385-439260.37931
4444A0A0N4ZNS7_PARTI/18-127270.793103
4545F6V4D5_CALJA/10-12520.991379
4646A0A091DB95_FUKDA/10-12520.991379
4747A0A158NN71_ATTCE/2-108200.801724
4848M7BRF2_CHEMY/1-109190.931034
4949G1P8D1_MYOLU/10-12521.0
5050A0A0R3RPH9_9BILA/11-122180.896552
5151A0A1I8CN28_9BILA/12-125280.853448
5252A0A0R3QN38_9BILA/6-118180.913793
5353A0A195CSW6_9HYME/356-466200.801724
5454T1JZX0_TETUR/9-124290.793103
5555A0A0Q3TJT0_AMAAE/8-11960.862069
5656A0A0N4XDL0_NIPBR/12-12250.810345
5757A0A2I4BYR2_9TELE/24-13770.896552
5858A0A0J7L295_LASNI/1-107200.818966
5959A0A1A6FY34_NEOLE/9-12420.991379
6060RM55_BOVIN/10-12520.991379
6161A0A151I3L2_9HYME/359-470200.827586
6262K7J8V2_NASVI/9-122300.939655
6363A0A182N5M4_9DIPT/11-11230.775862
6464Q2KHY3_BOVIN/10-81310.586207
6565U3JEB6_FICAL/51-139320.62069
6666H9JKD3_BOMMO/5-11330.827586
6767A0A096NE75_PAPAN/10-12521.0
6868A0A0L7K4G1_9NEOP/2-10730.784483
6969U3JEB8_FICAL/1-9660.801724
7070A0A0K0G1S2_9BILA/1-106330.853448
7171Q177J1_AEDAE/2-112120.87069
7272A0A183IMM9_9BILA/3-111340.87069
7373H0X3U7_OTOGA/12-12721.0
7474B5X5L6_SALSA/16-131240.948276
7575G7Y6C1_CLOSI/31-139350.758621
7676Q7QBA3_ANOGA/3-11230.784483
7777A0A0P7WDW8_9TELE/6-121241.0
7878A0A0N5DDH1_TRIMR/1044-1157360.905172
7979A0A0R3W9U0_TAEAS/1-118170.818966
8080A0A183I3F0_9BILA/5-118180.896552
8181A0A0L0CIQ9_LUCCU/4-10730.801724
8282A0A177AYT9_9METZ/1-103370.810345
8383F7H239_MACMU/31-14621.0
8484A0A090MXF3_STRRB/1-104330.844828
8585F6V3G4_CALJA/44-15920.991379
8686W5L7V8_ASTMX/3-117240.974138
8787K7AP45_PANTR/10-12520.991379
8888A0A0F8BZB4_LARCR/26-13770.87931
8989H0VQK5_CAVPO/10-12520.991379
9090A0A0D9RV95_CHLSB/10-12521.0
9191W2TIQ9_NECAM/261-369380.862069
9292A0A0N5AUA8_9BILA/2-96390.793103
9393A0A0A0MST4_HUMAN/10-43400.284483
9494A0A2G5UEV7_9PELO/1-113130.948276
9595G1MUJ2_MELGA/2-113160.87931
9696V4BHI3_LOTGI/4-117410.905172
9797B4NHY3_DROWI/3-10630.784483
9898A0A1Y9H9E9_9DIPT/2-11230.784483
9999A0A131ZZJ6_SARSC/3-88420.603448
100100A0A2G8JL66_STIJA/34-145100.87069
101101G1T9B1_RABIT/10-12420.982759
102102A0A183FDR8_HELBK/28-13750.663793
103103A0A2K5J272_COLAP/10-12521.0
104104W6NTD8_HAECO/10-122380.853448
105105A0A182JYN4_9DIPT/5-11330.784483
106106B5DTE4_DROPS/4-10630.775862
107107H3B9B9_LATCH/16-131240.982759
108108RM55_DROME/3-10630.775862
109109A0A182YGH3_ANOST/3-11230.784483
110110A8WHP2_DANRE/18-132240.956897
111111L9KWN5_TUPCH/1-10420.887931
112112A0A2K5KHX2_CERAT/10-12521.0
113113A0A1W4UAL9_DROFC/4-10630.775862
114114A0A151LZI2_ALLMI/58-172430.982759
115115G3RI37_GORGO/46-16120.991379
116116A0A0N4SYK1_BRUPA/6-118180.922414
117117A0A0K0DW22_STRER/1-104330.844828
118118A0A1B0GPI2_PHLPP/76-188440.939655
119119A0A1J1HIQ9_9DIPT/7-113450.818966
120120W5J6W6_ANODA/2-11230.784483
121121A0A0I9N7W2_BRUMA/6-118180.922414
122122U4U9N4_DENPD/3-106460.784483
123123I3JL90_ORENI/24-13470.887931
124124B3M1P4_DROAN/4-10630.775862
125125H2S474_TAKRU/2-10570.853448
126126B4MB45_DROVI/2-10530.775862
127127A0A0R3WIR7_HYDTA/14-118170.775862
128128A0A077ZGK7_TRITR/1-108360.896552
129129F6TE40_MONDO/1-115150.974138
130130A0A1X7U318_AMPQE/41-159470.663793
131131A0A0N4VDZ6_ENTVE/2-95480.793103
132132G5C999_HETGA/4-10220.793103
133133A0A1W0XDC4_HYPDU/21-129490.767241
134134A0A1B0G6D5_GLOMM/2-11080.818966
135135A0A1L8FV00_XENLA/17-133500.991379
136136A0A1I7W0F6_LOALO/17-130180.87931
137137A0A183WRR5_TRIRE/1-95510.818966
138138G3SD22_GORGO/10-12520.991379
139139K1RC04_CRAGI/1-50520.431034
140140E2AMM9_CAMFO/352-464200.862069
141141A0A182E334_ONCOC/37-140180.784483
142142A0A2A2LU44_9BILA/10-11810.844828
143143G4LYQ5_SCHMA/4-108510.801724
144144Q9TYJ8_CAEEL/5-117130.965517
145145D6X0J3_TRICA/2-106460.784483
146146A0A2K5KHY3_CERAT/33-14821.0
147147A0A2K5J2K0_COLAP/39-15421.0
148148H2YZU3_CIOSA/21-124530.767241
149149A0A1I8BZY4_MELHA/1-67540.560345
150150A0A1A9X193_9MUSC/4-10730.793103
151151S9YLR7_CAMFR/19-13520.991379
152152A0A1V9XBH2_9ACAR/7-121550.896552
153153A0A183SQK1_SCHSO/64-115560.422414
154154A0A1D1VKA4_RAMVA/12-123570.853448
155155A0A0K0D594_ANGCA/3-9950.775862
156156A0A2B4RQ34_STYPI/26-138580.87069
157157A0A1I8QBY2_STOCA/5-10730.810345
158158V8PEJ2_OPHHA/1-78590.672414
159159F6SS40_CIOIN/11-120600.836207
160160A0A0K0G1J5_9BILA/30-107610.560345
161161B4G668_DROPE/4-10630.775862
162162A0A0N4W4F7_HAEPC/11-123380.853448
163163A0A0L7QW11_9HYME/370-465620.637931
164164A7RNQ0_NEMVE/26-143630.818966
165165A0A068YAY6_ECHMU/8-118170.810345
166166A0A182XIX4_ANOQN/3-11230.784483
167167G7NXC9_MACFA/10-12521.0
168168T1J5Q4_STRMM/5-116640.827586
169169A0A182FKY5_ANOAL/50-16130.784483
170170A0A0R3SVQ2_HYMDI/1-110170.896552
171171A0A0N4U7G0_DRAME/3-89650.612069
172172G3HHN0_CRIGR/9-12220.965517
173173A0A0M3HSN6_ASCLU/9-120660.818966
174174L5M5N5_MYODS/10-12521.0
175175A0A2G8K792_STIJA/34-145670.87069
176176A0A0R3PMB8_ANGCS/12-11950.818966
177177G1MCX3_AILME/10-12521.0
178178A0A182L7I7_9DIPT/46-16030.784483
179179A0A1Y3BGN7_EURMA/199-316680.87069
180180A0A1U7S2P1_ALLSI/28-142190.956897
181181A0A1D5PF37_CHICK/13-125160.922414
182182E0VEI6_PEDHC/5-112690.836207
183183A0A182V438_ANOME/2-11230.784483
184184E3NM70_CAERE/1-113130.931034
185185I3L9J0_PIG/10-12520.991379
186186A0A2I3LGQ6_PAPAN/32-14721.0
187187A0A1S3WRB9_ERIEU/9-12120.948276
188188L5JXI7_PTEAL/1-9720.827586
189189G3NBS4_GASAC/1-9170.784483
190190S7MBG1_MYOBR/10-12521.0
191191G5C106_HETGA/10-12420.974138
192192A0A182RYH6_ANOFN/1-112120.853448
193193E9ICI6_SOLIN/2-109200.767241
194194A0A087T3L9_9ARAC/2-109700.818966
195195A0A1I7S389_BURXY/3-113710.87931
196196E3MY88_CAERE/200-252720.422414
197197U6JFQ1_ECHGR/10-118170.801724
198198B7Q0Y3_IXOSC/8-117730.862069
199199E4XN03_OIKDI/34-148740.724138
200200A0A084W390_ANOSI/1-69750.594828
201201A0A212FFC4_DANPL/1-7330.62931
202202A8XI10_CAEBR/1-113130.939655
203203J9EWJ7_WUCBA/1-94760.715517
204204A0A1V4KE06_PATFA/103-21460.844828
205205A0A0D8XRC7_DICVI/3-10450.741379
206206A0A0V0TDB0_9BILA/725-83340.862069
207207A0A2C9LIH9_BIOGL/13-128770.948276
208208A0A0M9A9F3_9HYME/351-460780.862069
209209A0A195EEX3_9HYME/349-459200.758621
210210A0A067RHM1_ZOONE/8-12630.844828
211211F4WNV6_ACREC/304-415200.801724
212212A0A088ARM4_APIME/14-122780.793103
213213A0A182W0I3_9DIPT/2-112120.844828
214214F7DF84_HORSE/10-12420.991379
215215A0A1D2MP76_ORCCI/27-138790.827586
216216A0A0B2V267_TOXCA/8-119660.87069
217217A0A2C9LJ85_BIOGL/16-131770.948276
218218A0A1A9XI44_GLOFF/16-13280.775862
219219H2Q197_PANTR/46-16120.991379
220220A0A218UFN6_9PASE/5-11560.87069
221221A0A1W4XV80_AGRPL/2-113460.793103
222222A0A232FIY4_9HYME/9-122300.931034
223223A0A261BVI0_9PELO/63-176130.939655
224224A0A087Y3U8_POEFO/26-13970.905172
225225A0A182PNP6_9DIPT/2-11230.784483
226226H3DW02_PRIPA/2-128800.87931
227227W5P8R5_SHEEP/10-12520.991379
228228X6RIW1_HUMAN/10-76310.568966
229229A0A0N5CW25_THECL/95-207180.836207
230230X6R631_HUMAN/1-52810.448276
231231A0A0R3TCC4_HYMNN/2-99170.801724
232232B4KBB5_DROMO/3-10530.784483
233233B4QTH5_DROSI/3-10630.784483
234234A0A182LXD1_9DIPT/3-113120.706897
235235H2N3H3_PONAB/10-12520.991379
236236F7GI04_MACMU/1-52810.448276
237237A0A094ZMR1_SCHHA/3-104510.801724
238238A0A0B1SAW7_OESDE/11-10550.689655
239239M3WM55_FELCA/10-12521.0
240240A0A0C2BEU2_9BILA/5-10550.801724
241241A0A0N5CCD5_STREA/1-104330.844828
242242A0A226P586_COLVI/2-108160.87069
243243A0A0M3QXL3_DROBS/2-10530.784483
244244A0A195FWH0_9HYME/356-468200.784483
245245A0A182I7Y1_ANOAR/4-11230.784483
246246T1FXF3_HELRO/1-90820.775862
247247A0A2I3GIP3_NOMLE/10-12520.991379
248248G3WPK1_SARHA/11-126150.991379
249249RM55_MOUSE/9-12421.0
250250M3XPJ7_MUSPF/10-12520.991379
251251A0A182GHC0_AEDAL/6-11930.836207

It is possible to use this DataFrame and Plots to plot the sequence coverage of the MSA and also an histogram of the number of sequences in each cluster:

using StatPlots # Plotting DataFrames
h = @df df histogram(:cluster, ylabel="nseq")
p = @df df plot(:cluster, :coverage, linetype=:scatter)
plot(p, h, nc=1, xlim=(0, nclusters(clusters)+1 ), legend=false)
/juliateam/.julia/packages/GR/yMV3y/src/../deps/gr/bin/gksqt: error while loading shared libraries: libQt5Widgets.so.5: cannot open shared object file: No such file or directory
connect: Connection refused
GKS: can't connect to GKS socket application

GKS: Open failed in routine OPEN_WS
GKS: GKS not in proper state. GKS must be either in the state WSOP or WSAC in routine ACTIVATE_WS

We use the Split-Apply-Combine strategy, though the by function of the DataFrames package, to select the sequence of highest coverage for each cluster.

maxcoverage = by(df, :cluster, cl -> cl[ findmax(cl[:coverage])[2] ,
                 [:seqnum, :seqname, :coverage]])
p = @df maxcoverage plot(:cluster, :coverage, linetype=:scatter)
h = @df maxcoverage histogram(:cluster, ylabel="nseq")
plot(p, h, nc=1, xlim=(0, nclusters(clusters)+1 ), legend=false)
png("msa_clusters_iii.png") # hide
nothing # hide

We can easily generate a mask using list comprehension, to select only the representative sequences of the MSA (deleting the rest of the sequences with filtersequences!).

cluster_references = Bool[ seqnum in maxcoverage[:seqnum] for seqnum in 1:nsequences(msa) ]
filtersequences!(msa, cluster_references)
plot(msa)
/juliateam/.julia/packages/GR/yMV3y/src/../deps/gr/bin/gksqt: error while loading shared libraries: libQt5Widgets.so.5: cannot open shared object file: No such file or directory
connect: Connection refused
GKS: can't connect to GKS socket application

GKS: Open failed in routine OPEN_WS
GKS: GKS not in proper state. GKS must be either in the state WSOP or WSAC in routine ACTIVATE_WS