Example walkthrough

Loading the reference

First, let's load the reference:

julia> using BinBencherBackend

julia> ref = Reference(path_to_ref_file)
Reference
  Genomes:    3
  Sequences:  11
  Ranks:      3
  Seq length: 10
  Assembled:  61.9 %

This gives us a few statistics about the reference:

  • Number of genomes
  • Number of sequences
  • Number of taxonomic ranks (strain, species, genus...)
  • Length of shortest sequence
  • Total length of genomes that are assembled

The genomes here contain both plasmids and organisms, and the sequence length of 200 bp is too short. Let's filter the reference using subset! to only retain organisms, and sequences of length 25 or more:

julia> subset!(ref;
           genomes = is_organism,
           sequences = s -> length(s) >= 10
       )
Reference
  Genomes:    2
  Sequences:  11
  Ranks:      3
  Seq length: 10
  Assembled:  91.3 %
Note

The function subset! will mutate the reference, whereas the function subset will create a new independent reference. At the moment, the latter is much slower.

We removed a single genome, namely one labeled as virus.

Genomes

We can explore the genomes contained in the reference with the genomes function, which returns an iterable of Genome (in this case, a Set):

julia> genomes(ref)
Set{Genome} with 2 elements:
  Genome(gA)
  Genome(gB)

Let's look at a Genome in more detail:

julia> genome, genome2 = genomes(ref);

julia> genome
Genome "gA"
  Parent:        "D"
  Genome size:   100
  Assembly size: 88 (88.0 %)
  Sources:       1
  Flags:         1 (organism)

The flags can be extracted with the flags(genome) function - each genome contains zero or more flags:

julia> flags(genome)
FlagSet with 1 element:
  BinBencherBackend.Flags.organism

... in this case, this genome is an organism as opposed to a plasmid or virus. You can see all possible flags with instances(Flags.Flag).

See also the helper functions is_organism, is_virus and is_plasmid

Sources

The genome has one source - let's look at that

julia> source = only(genome.sources)
Source "subjA1"
genome:          Genome(gA)
  Length:        100
  Assembly size: 88
  Sequences:     6

A Source is one of the genomic sequences that genomes are composed of. This is distinct from the assembled sequences that we will be binning - a Source represents the reference sequence, typically the full genome, assembled from a sequencing run on a clonal colony. For this genome, we can see it has a length of 100 bp, and that 5 sequences map to this source, covering 88 bp.

We can get the sequences mapping to this source:

julia> source.sequences
6-element Vector{Tuple{Sequence, Tuple{Int64, Int64}}}:
 (Sequence("s1", 25), (5, 29))
 (Sequence("s1", 25), (10, 34))
 (Sequence("s2", 40), (1, 40))
 (Sequence("s3", 50), (51, 98))
 (Sequence("s7", 20), (21, 40))
 (Sequence("s8", 25), (2, 26))

Where, e.g. the first entrance tells us that the sequence "s2" with a length of 40 maps to positions 1:40 (both inclusive).

Clades

Genomes are organised into a taxonomic hierarchy. We can find the immediate parent of a genome by accessing the field genome.parent. Let's look at another genome:

julia> genome2
Genome "gB"
  Parent:        "D"
  Genome size:   50
  Assembly size: 49 (98.0 %)
  Sources:       2
  Flags:         1 (organism)

julia> clade = genome2.parent
Species "D", 2 genomes
├─ Genome(gA)
└─ Genome(gB)

The parent is an instance of a Clade. Clades are at a specific rank: Rank 1 for species, 2 for genus, 3 for family, etc. Every clade has one or more children: These are the clades one rank lower. Conceptually, rank zero corresponds to Genomes (OTUs, for this reference dataset)

julia> clade.children
2-element Vector{Genome}:
 Genome(gA)
 Genome(gB)

We can find the most recent common ancestor (MRCA) of genome and genome2 like this:

julia> mrca(genome, genome2)
Species "D", 2 genomes
├─ Genome(gA)
└─ Genome(gB)

They are very distantly related, so the domain "Bacteria", one of the highest ranked Clades, are their most recent common ancestor.

The top clade can be found with the top_clade(ref) function, which is the universal ancestor of all clades in the reference.

Binnings

A Binning is a set of bins benchmarked against a reference. We can load a set of Vamb bins and turn it into a Binning object like this:

julia> binning = Binning(path_to_bins_file, ref)
Binning
  Reference
    Genomes:    2
    Sequences:  11
    Ranks:      3
    Seq length: 10
    Assembled:  91.3 %
  Bins:        6
  NC genomes:  0
  Precisions: [0.6, 0.7, 0.8, 0.9, 0.95, 0.99]
  Recalls:    [0.6, 0.7, 0.8, 0.9, 0.95, 0.99]
  Reconstruction (assemblies):
    P\R   0.6  0.7  0.8  0.9 0.95 0.99
    0.6     1    0    0    0    0    0
    0.7     1    0    0    0    0    0
    0.8     1    0    0    0    0    0
    0.9     1    0    0    0    0    0
    0.95    1    0    0    0    0    0
    0.99    1    0    0    0    0    0

A wealth of information is readily available:

  • binning.ref gives the underlying Reference
  • binning.recalls and binning.precisions gives the recall/precision thresholds used in benchmarking

The function print_matrix will display the number of recovered genomes/assemblies. It takes two optional keyword: level, the taxonomic rank (defaults to 0, meaning strain level), and assembly which defaults to true. If set to false, benchmark number of recovered genomes, not number of recovered assemblies.

julia> print_matrix(binning; level=1, assembly=false)
P\R   0.6  0.7  0.8  0.9 0.95 0.99
0.6     1    0    0    0    0    0
0.7     1    0    0    0    0    0
0.8     1    0    0    0    0    0
0.9     1    0    0    0    0    0
0.95    1    0    0    0    0    0
0.99    1    0    0    0    0    0

You can also get the number of genomes or assemblies reconstructed at a given precision/recall level directly with n_recovered:

julia> n_recovered(binning, 0.6, 0.7; assembly=true)
1

julia> n_recovered(binning, 0.66, 0.91; level=1)
0

Bins

The Binning object obviously contains our bins. Let's pick a random bin:

julia> bin = binning.bins[4]
Bin "C4"
  Sequences: 3
  Breadth:   55
  Intersecting 2 genomes

The "breadth" here is the sum of the length of its sequences. bin.sequences gets an iterable of its sequences:

julia> collect(bin.sequences)
3-element Vector{Sequence}:
 Sequence("s5", 25)
 Sequence("s6", 10)
 Sequence("s7", 20)

The "Intersecting 2 genomes" means that the sequences map to 2 different genomes - the only two in the reference. We can get that with the function intersecting, then get the precision/recall with recall_precision:

julia> Set(intersecting(bin)) == genomes(ref)
true

julia> recall_precision(genome2, bin)
(recall = 0.6122448979591837, precision = 1.0)

Or do the same for a higher clade - let's say a genus. In this case, we get the same result.

julia> genus = only(Iterators.filter(i -> i.rank == 2, intersecting(Clade, bin)));

julia> recall_precision(genus, bin)
(recall = 0.6122448979591837, precision = 1.0)