Reference

BinBencherBackend.Bin — Type

Bin(name::AbstractString, ref::Reference, sequences)

Bins each represent a bin created by the binner. Conceptually, they are simply a set of Sequence with a name attached. Practically, every Bin is benchmarked against all Genomes and Clades of a given Reference, so each Bin stores data about its intersection with every genome/clade, e.g. its purity and recall.

Like Sources, Bins also have an assembly size for a given Genome. This is the number of base pairs in the genome covered by any sequence in the Bin, which never exceeds the genome's assembly size.

Benchmark statistics for a Bin/Genome pair can be computed with either assemblies or genomes as the ground truth.

True positives (TP) are defined as the sum of assembly sizes over all sources in the genome.
False positives (FP) are the sum of lengths of sequences in the bin not mapping to the genome.
False negatives (FN) are either the genome assembly size or the genome size, depending on the chosen ground truth, minus TP.
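
The three counts above can be turned into recall, precision and F1. The following is a minimal sketch in plain Julia, assuming the standard definitions recall = TP / (TP + FN), precision = TP / (TP + FP), and F1 as their harmonic mean. The helper name `bin_metrics` and the counts are made up for illustration and are not part of BinBencherBackend.

```julia
# Hypothetical helper, not part of BinBencherBackend.
# Assumes the standard definitions of recall, precision and F1.
function bin_metrics(tp::Integer, fp::Integer, fn::Integer)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * recall * precision / (recall + precision)
    return (; recall, precision, f1)
end

# Made-up counts: 40 bp of the ground truth covered, 0 bp spurious, 60 bp missed.
m = bin_metrics(40, 0, 60)
println(m)
```

With these counts the recall is 0.4 and the precision is 1.0, which gives an F1 of about 0.571. The sketch does not guard against empty inputs: if TP + FP or TP + FN is zero, the result is NaN.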

For Bin/Clade pairs B/C, recall is the maximal recall of B/Ch for all children Ch of C. Precision is the sum of lengths of sequences mapping to any child of the clade divided by the sum of lengths of all sequences in the bin.
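
The Bin/Clade rule can be illustrated with a small worked example. This is only a sketch with hypothetical data: the variable names, the per-child recalls and the sequence lengths are invented, and the package stores this information differently.

```julia
# Hypothetical data, not the BinBencherBackend data model.
# Recall of the bin B against each child Ch of the clade C.
child_recalls = Dict("gA" => 0.4, "gB" => 0.1)

# Lengths of the sequences in the bin, and whether each maps to any child of C.
seq_lengths = [25, 40, 10]
maps_to_clade = [true, true, false]

# Clade recall: the maximal recall over all children.
clade_recall = maximum(values(child_recalls))

# Clade precision: length mapping to any child divided by the total length of the bin.
clade_precision = sum(seq_lengths[maps_to_clade]) / sum(seq_lengths)

println((; clade_recall, clade_precision))
```

Here 65 of the 75 base pairs in the bin map to the clade, so the precision is 65/75, while the recall is taken from the best-matching child.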

See also: Binning, Genome, Clade

Examples

julia> bin = first(binning.bins)
Bin "C1"
  Sequences: 2
  Breadth:   65
  Intersecting 1 genome

julia> first(bin.sequences)
Sequence("s1", 25)

julia> f1(first(ref.genomes), bin)
0.5714285714285715

BinBencherBackend.Binning — Type

Binning(::Union{IO, AbstractString}, ::Reference; kwargs...)

A Binning represents a set of Bins benchmarked against a Reference. Binnings can be created given a set of Bins and a Reference, where the bins may be loaded from a .tsv file. The fields recovered_asms and recovered_genomes are used for benchmarking; they are normally output using the print_matrix function.

A Binning is loaded from a tsv file, which is specified either as an IO, or by its path as an AbstractString. If the path ends with .gz, the file is automatically gzip-decompressed when read.

See also: print_matrix, Bin, Reference

Examples

julia> bins = Binning(path_to_bins_file, ref);


julia> bins isa Binning
true

julia> BinBencherBackend.n_nc(binning)
0

Extended help

Create with:

open(file) do io
    Binning(
        io::Union{IO, AbstractString},
        ref::Reference;
        min_size::Integer=1,
        min_seqs::Integer=1,
        binsplit_separator::Union{AbstractString, Char, Nothing}=nothing,
        disjoint::Bool=true,
        recalls=DEFAULT_RECALLS,
        precisions=DEFAULT_PRECISIONS,
        filter_genomes=Returns(true)
    )
end

min_size: Filter away bins with breadth lower than this
min_seqs: Filter away bins with fewer sequences than this
binsplit_separator: Split bins based on this separator (nothing means no binsplitting)
disjoint: Throw an error if the same sequence is seen in multiple bins
recalls and precisions: The thresholds to benchmark with
filter_genomes: A function f(genome)::Bool. Genomes for which it returns false are ignored in benchmarking.
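
To make the effect of min_size and min_seqs concrete, here is a small sketch in plain Julia. It does not call BinBencherBackend. The bins are hypothetical name-to-lengths pairs, and treating breadth as the summed sequence length is an assumption made for this illustration.

```julia
# Hypothetical bins: name => lengths of the sequences in the bin.
# Assumption for illustration: breadth is the sum of the sequence lengths.
bins = Dict(
    "C1" => [25, 40],
    "C2" => [10],
    "C3" => [10, 10, 10],
    "C4" => [100],
)

min_size, min_seqs = 20, 2

# Keep a bin only if it is not below either threshold.
kept = filter(((name, lens),) -> sum(lens) >= min_size && length(lens) >= min_seqs, bins)

println(sort(collect(keys(kept))))
```

In this sketch C2 is dropped by both thresholds and C4 is dropped for having a single sequence, so only C1 and C3 remain.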

BinBencherBackend.Clade — Type

Clade{Genome}(name::AbstractString, child::Union{Clade{Genome}, Genome})

A Clade represents any clade above Genome. Every Genome is expected to belong to the same number of clades, e.g. there may be exactly 7 levels of clades above every Genome. A Clade always has at least one child (which is either a Genome or a Clade one rank lower), and a parent, unless it is the unique top clade from which all other clades and genomes descend. The rank of a Genome is 0, clades that contain genomes have rank 1, clades containing rank-1 clades have rank 2, etc. By default, zero-indexed ranks correspond to OTU, species, genus, family, order, class, phylum and domain.

Examples

julia> top_clade(ref)
Genus "F", 3 genomes
├─ Species "D", 2 genomes
│  ├─ Genome(gA)
│  └─ Genome(gB)
└─ Species "E", 1 genome
   └─ Genome(gC)

julia> top_clade(ref).children
2-element Vector{Clade{Genome}}:
 Species "D", 2 genomes
 Species "E", 1 genome
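
The mapping from zero-indexed rank to the default rank names listed above can be sketched as a simple lookup. The names `rank_names` and `rank_name` are hypothetical and only serve this illustration. The package may organise this differently.

```julia
# Default rank names as listed in the text, ordered by zero-based rank.
# Both names below are hypothetical, chosen for this sketch.
rank_names = ("OTU", "species", "genus", "family", "order", "class", "phylum", "domain")

# Julia indexing is one-based, so shift the zero-based rank by one.
rank_name(rank::Integer) = rank_names[rank + 1]

println(rank_name(0))  # the rank of a Genome
println(rank_name(2))  # the rank of the top clade in the example tree
```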

BinBencherBackend.FlagSet — Type

FlagSet <: AbstractSet{Flag}

FlagSets are compact sets of Flag associated with a Genome. You can construct them from an iterable of Flag, e.g. a 1-element tuple. FlagSet supports most set operations efficiently.

See also: Flag, FlagSet

Example

julia> flags(genome)
FlagSet with 1 element:
  BinBencherBackend.Flags.organism

BinBencherBackend.gold_standard — Method

gold_standard(
    ref::Reference,
    [sequences, a Binning or an iterable of Sequence];
    disjoint=true,
    recalls=DEFAULT_RECALLS,
    precisions=DEFAULT_PRECISIONS
)::Binning

Create the optimal Binning object given a Reference, by the optimal binning of the Sequences in sequences. If disjoint, assign each sequence to only a single genome.

If sequences is not passed, use all sequences in ref. If a Binning is passed, use all sequences in any of its bins. Else, pass an iterable of Sequence.

Extended help

Currently, the disjoint option uses a simple greedy algorithm to assign sequences to genomes.

BinBencherBackend.intersecting — Method

intersecting([Genome, Clade]=Genome, x::Bin)

Get an iterator of the Genomes or Clades that bin x intersects with. intersecting(::Bin) defaults to genomes.

Example

julia> collect(intersecting(bin))
1-element Vector{Genome}:
 Genome(gA)

julia> sort!(collect(intersecting(Clade, bin)); by=i -> i.name)
2-element Vector{Clade{Genome}}:
 Species "D", 2 genomes
 Genus "F", 3 genomes

BinBencherBackend.is_organism — Method

is_organism(g::Genome)::Bool

Check if g is known to be an organism.

Example

julia> is_organism(genome)
true

BinBencherBackend.is_plasmid — Method

is_plasmid(g::Genome)::Bool

Check if g is known to be a plasmid.

Example

julia> is_plasmid(genome)
false

BinBencherBackend.is_virus — Method

is_virus(g::Genome)::Bool

Check if g is known to be a virus.

Example

julia> is_virus(genome)
false

BinBencherBackend.mrca — Method

mrca(a::Node, b::Node)::Node

Compute the most recent common ancestor (MRCA) of a and b.
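
For readers unfamiliar with the concept, the following sketch shows one common way to find an MRCA by walking parent links. It is not the package's implementation: the tree is stored as a hypothetical child-to-parent dictionary mirroring the example tree from the Clade section, and `lineage` and `mrca_sketch` are invented names.

```julia
# Parent of every node in the example tree; the top clade "F" has no entry.
parents = Dict("gA" => "D", "gB" => "D", "gC" => "E", "D" => "F", "E" => "F")

# Path from a node up to the root, including the node itself.
function lineage(parents, node)
    path = [node]
    while haskey(parents, node)
        node = parents[node]
        push!(path, node)
    end
    return path
end

# The MRCA is the first node on b's lineage that also lies on a's lineage.
function mrca_sketch(parents, a, b)
    ancestors_of_a = Set(lineage(parents, a))
    for node in lineage(parents, b)
        node in ancestors_of_a && return node
    end
    error("nodes do not share a root")
end

println(mrca_sketch(parents, "gA", "gB"))
println(mrca_sketch(parents, "gA", "gC"))
```

In this tree gA and gB meet at species D, while gA and gC only meet at genus F.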

BinBencherBackend.n_passing_bins — Method

n_passing_bins(::Binning, recall, precision; level=0, assembly::Bool=false)::Integer

Return the number of bins which correspond to any genome or clade at the given recall and precision levels. If assembly is set, a recall of 1.0 means a bin corresponds to a whole assembly, else it corresponds to a whole genome. The argument level sets the taxonomic rank: 0 for Genome (or assemblies).

Examples

julia> n_passing_bins(binning, 0.4, 0.71)
1

julia> n_passing_bins(binning, 0.65, 0.71)
0

BinBencherBackend.n_recovered — Method

n_recovered(::Binning, recall, precision; level=0, assembly=false)::Integer

Return the number of genomes or clades reconstructed in the Binning at the given recall and precision levels. If assembly is set, return the number of assemblies reconstructed instead. The argument level sets the taxonomic rank: 0 for Genome (or assemblies).

Examples

julia> n_recovered(binning, 0.4, 0.71)
1

julia> n_recovered(binning, 0.4, 0.71; assembly=true)
2

julia> n_recovered(binning, 0.4, 0.71; assembly=true, level=2)
1
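
The difference between counting passing bins and counting recovered genomes can be sketched in plain Julia. This reading is inferred from the two descriptions above: several bins may match the same genome, in which case the genome is counted once. The records and helper names are hypothetical and do not reflect how the package stores its data.

```julia
# Hypothetical (bin, genome, recall, precision) records.
records = [
    (bin="C1", genome="gA", recall=0.9, precision=0.95),
    (bin="C2", genome="gA", recall=0.8, precision=0.90),
    (bin="C3", genome="gB", recall=0.3, precision=0.99),
]

# A record passes only if both thresholds are met by the same bin/genome pair.
passes(x, min_recall, min_precision) =
    x.recall >= min_recall && x.precision >= min_precision

# Bins that match at least one genome at the thresholds.
passing_bins(records, r, p) = Set(x.bin for x in records if passes(x, r, p))

# Genomes matched by at least one bin at the thresholds.
recovered_genomes(records, r, p) = Set(x.genome for x in records if passes(x, r, p))

println(length(passing_bins(records, 0.7, 0.9)))
println(length(recovered_genomes(records, 0.7, 0.9)))
```

With thresholds 0.7 and 0.9, bins C1 and C2 both pass, but they match the same genome, so two bins pass while one genome is recovered.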

BinBencherBackend.passes_f1 — Method

passes_f1(bin::Bin, threshold::Real; assembly::Bool=false)::Bool

Check whether bin has an F1 score greater than or equal to threshold for any genome.

Examples

julia> obs_f1 = f1(only(intersecting(bin)), bin)
0.5714285714285715

julia> passes_f1(bin, obs_f1)
true

julia> passes_f1(bin, obs_f1 + 0.001)
false

BinBencherBackend.passes_recall_precision — Method

passes_recall_precision(bin::Bin, recall::Real, precision::Real; assembly::Bool=false)::Bool

Check whether bin intersects with any Genome with at least the given recall and precision thresholds.

Examples

julia> (r, p) = recall_precision(only(intersecting(bin)), bin)
(recall = 0.4, precision = 1.0)

julia> passes_recall_precision(bin, 0.40, 1.0)
true

julia> passes_recall_precision(bin, 0.41, 1.0)
false

BinBencherBackend.print_matrix — Method

print_matrix(::Binning; level=0, assembly=false)

Print the number of reconstructed assemblies or genomes at the given taxonomic level (rank). Level 0 corresponds to genomes, level 1 to species, etc. If assembly, print the number of reconstructed assemblies, else print the number of reconstructed genomes.