Reference

BinBencherBackend.Bin - Type
Bin(name::AbstractString, sequences, targets, scratch)

Bins each represent a bin created by the binner. Conceptually, they are simply a set of Sequences with a name attached. Practically, every Bin is benchmarked against all Genomes and Clades of a given Reference, so each Bin stores data about its intersection with every genome/clade, e.g. its purity and recall.

Like Sources, Bins also have an assembly size for a given Genome. This is the number of base pairs in the genome covered by any sequence in the Bin, which is never larger than the genome's assembly size.

Benchmark statistics for a Bin/Genome pair can be computed with either assemblies or genomes as the ground truth.

  • True positives (TP) are defined as the sum of assembly sizes over all sources in the genome.
  • False positives (FP) are the sum of the lengths of sequences in the bin not mapping to the genome.
  • False negatives (FN) are either the genome assembly size or the genome size, minus TP.
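
As a rough illustration of how the three counts above combine, the following self-contained sketch computes recall, precision and F1 from TP, FP and FN. The helper names and the numbers are hypothetical and are not part of BinBencherBackend; the formulas are the standard definitions, and the package's own implementation may differ in detail.

```julia
# Standard definitions; `tp`, `fp` and `fn` are base-pair counts as described above.
# These helpers are illustrative only and are not exported by the package.
recall_of(tp, fn) = tp / (tp + fn)
precision_of(tp, fp) = tp / (tp + fp)
f1_of(tp, fp, fn) = 2tp / (2tp + fp + fn)

# Hypothetical counts: 25 bp recovered, nothing foreign in the bin, 30 bp missed.
tp, fp, fn = 25, 0, 30

r = recall_of(tp, fn)
p = precision_of(tp, fp)
f = f1_of(tp, fp, fn)
println((recall = r, precision = p, f1 = f))
```

Switching the ground truth between assembly and genome only changes the FN term, so precision is unaffected while recall and F1 change.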

For Bin/Clade pairs B/C, recall is the maximal recall of B/Ch for all children Ch of C. Precision is the sum of lengths of sequences mapping to any child of the clade divided by the sum of lengths of all sequences in the bin.
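
The Bin/Clade rule can be sketched with plain numbers. All values below are made up for illustration, and the variable names are not part of the package.

```julia
# Hypothetical recalls of bin B against each child Ch of clade C.
child_recalls = [0.40, 0.10]
clade_recall = maximum(child_recalls)

# Hypothetical sequence lengths in the bin, and whether each sequence
# maps to any child of the clade.
seq_lengths = [25, 40, 10]
maps_to_clade = [true, true, false]

# Precision: length mapping into the clade divided by total length in the bin.
clade_precision = sum(seq_lengths[maps_to_clade]) / sum(seq_lengths)
println((recall = clade_recall, precision = clade_precision))
```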

See also: Binning, Genome, Clade

Examples

julia> bin = first(binning.bins)
Bin "C1"
  Sequences: 2
  Breadth:   65
  Intersecting 1 genome

julia> first(bin.sequences)
Sequence("s1", 25)

julia> f1(first(ref.genomes), bin)
0.625
BinBencherBackend.Binning - Type
Binning(::Union{IO, AbstractString}, ::Reference; kwargs...)

A Binning represents a set of Bins benchmarked against a Reference. Binnings can be created given a set of Bins and a Reference, where the bins may potentially be loaded from a .tsv file. The fields recovered_asms and recovered_genomes are used for benchmarking. These are normally output using the print_matrix function.

A Binning is loaded from a tsv file, which is specified either as an IO, or its path as an AbstractString. If the path ends with .gz, the file is automatically gzip-decompressed when read.

See also: print_matrix, Bin, Reference

Examples

julia> bins = Binning(path_to_bins_file, ref);

julia> bins isa Binning
true

julia> BinBencherBackend.n_nc(binning)
0

Extended help

Create with:

open(file) do io
    Binning(
        io::Union{IO, AbstractString},
        ref::Reference;
        min_size::Integer=1,
        min_seqs::Integer=1,
        binsplit_separator::Union{AbstractString, Char, Nothing}=nothing,
        disjoint::Bool=true,
        recalls=DEFAULT_RECALLS,
        precisions=DEFAULT_PRECISIONS,
        filter_genomes=Returns(true)
    )
end
  • min_size: Filter away bins with breadth lower than this
  • min_seqs: Filter away bins with fewer sequences than this
  • binsplit_separator: Split bins based on this separator (nothing means no binsplitting)
  • disjoint: Throw an error if the same sequence is seen in multiple bins
  • recalls and precisions: The thresholds to benchmark with
  • filter_genomes: A function f(genome)::Bool. Genomes for which it returns false are ignored in benchmarking.
BinBencherBackend.Clade - Type
Clade{Genome}(name::AbstractString, child::Union{Clade{Genome}, Genome})

A Clade represents any clade above Genome. Every Genome is expected to belong to the same number of clades, e.g. there may be exactly 7 levels of clades above every Genome. A Clade always has at least one child (which is either a Genome or a Clade one rank lower), and a parent, unless it is the unique top clade from which all other clades and genomes descend. The rank of a Genome is 0, clades that contain genomes have rank 1, clades containing rank-1 clades have rank 2, etc. By default, zero-indexed ranks correspond to OTU, species, genus, family, order, class, phylum and domain.
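
The rank rule can be illustrated with a minimal tree. The DemoNode type and node_rank function below are hypothetical stand-ins, not the package's Clade or Genome types; the sketch relies on the stated expectation that all genomes sit at the same depth.

```julia
# Hypothetical minimal tree node; a node without children stands in for a Genome.
struct DemoNode
    name::String
    children::Vector{DemoNode}
end

# A leaf has rank 0; any other node is one rank above its children.
# Following only the first child is enough because all leaves share the same depth.
node_rank(n::DemoNode) = isempty(n.children) ? 0 : 1 + node_rank(first(n.children))

gA, gB, gC = DemoNode("gA", DemoNode[]), DemoNode("gB", DemoNode[]), DemoNode("gC", DemoNode[])
species_d = DemoNode("D", [gA, gB])
species_e = DemoNode("E", [gC])
genus_f = DemoNode("F", [species_d, species_e])

println(node_rank(gA), " ", node_rank(species_d), " ", node_rank(genus_f))  # prints "0 1 2"
```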

Examples

julia> top_clade(ref)
Genus "F", 3 genomes
├─ Species "D", 2 genomes
│  ├─ Genome(gA)
│  └─ Genome(gB)
└─ Species "E", 1 genome
   └─ Genome(gC)

julia> top_clade(ref).children
2-element Vector{Clade{Genome}}:
 Species "D", 2 genomes
 Species "E", 1 genome
BinBencherBackend.FlagSet - Type
FlagSet <: AbstractSet{Flag}

A FlagSet is a compact set of Flag associated with a Genome. You can construct one from an iterable of Flag, e.g. a 1-element tuple. FlagSet supports most set operations efficiently.

See also: Flag, Genome

Examples

julia> flags = FlagSet((Flags.organism, Flags.virus));

julia> Flags.virus in flags
true

julia> isdisjoint(flags, FlagSet((Flags.organism,)))
false
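
One common way to make a small set of flags compact is a bitmask. The sketch below is an assumption about how such a set could work, not a description of how FlagSet is actually implemented; DemoFlag, DemoFlagSet, bit and make_set are hypothetical names.

```julia
# Hypothetical stand-ins; the real Flag and FlagSet are defined by the package.
@enum DemoFlag::UInt8 organism=0 virus=1 plasmid=2

struct DemoFlagSet
    bits::UInt8
end

# Each flag owns one bit; a set is the bitwise OR of its members.
bit(f::DemoFlag) = 0x01 << Integer(f)
make_set(flags...) = DemoFlagSet(mapreduce(bit, |, flags; init=0x00))

# Membership and disjointness become single bitwise operations.
Base.in(f::DemoFlag, s::DemoFlagSet) = (s.bits & bit(f)) != 0x00
Base.isdisjoint(a::DemoFlagSet, b::DemoFlagSet) = (a.bits & b.bits) == 0x00

flags = make_set(organism, virus)
println(virus in flags)                         # prints "true"
println(isdisjoint(flags, make_set(organism)))  # prints "false"
```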
BinBencherBackend.Flags.Flag - Type
Flag

A flag is a boolean associated with a Genome, stored in a FlagSet object. A flag may be e.g. Flags.organism, signaling that the genome is known to be an organism.

See also: FlagSet, Genome

Examples

julia> tryparse(Flag, "organism") == Flags.organism
true

julia> tryparse(Flag, "Canada") === nothing
true
BinBencherBackend.Genome - Type
Genome(name::AbstractString [, flags::FlagSet])

Genomes represent individual target genomes (organisms, plasmids, viruses etc), analogous to the lowest-level clade that can be reconstructed. Conceptually, Genomes contain one or more Sources, and belong to a single parent Clade. They are identified uniquely among genomes by their name.

A genome has a genome size, which is the sum of the lengths of all its sources. We consider this to be the true size of the biological genome (assuming its full sequence is contained in its sources). It also has an assembly size, which represents the sum of the assembly sizes of each source.

See also: Clade, Source, mrca

Examples

julia> gA, gB, gC = collect(ref.genomes);

julia> flags(gA)
FlagSet with 1 element:
  BinBencherBackend.Flags.organism

julia> mrca(gA, gB)
Species "D", 2 genomes
├─ Genome(gA)
└─ Genome(gB)
BinBencherBackend.Reference - Type
Reference(::Union{IO, AbstractString}; [min_seq_length=1])

A Reference contains the ground truth to benchmark against. Conceptually, it consists of the following parts:

  • A list of genomes, each with sources
  • The full taxonomic tree, as lists of clades
  • A list of sequences, each with a list of (source, span) to where it maps.

Normally, the types FlagSet, Genome, Source, Clade and Sequence do not need to be constructed manually, but are constructed when the Reference is loaded from a JSON file.

A Reference is loaded from a JSON file, which is specified either as an IO, or its path as an AbstractString. If the path ends with .gz, the file is automatically gzip-decompressed when read.

Examples

julia> ref = Reference(path_to_ref_file; min_seq_length=3);

julia> ref isa Reference
true

julia> length(genomes(ref))
3

julia> nseqs(ref)
11

julia> first(ref.genomes) isa Genome
true

See also: subset, Genome, Clade

BinBencherBackend.Sequence - Type
Sequence(name::AbstractString, length::Integer)

Type that represents a binnable sequence. Sequences contain no information other than their name and their length, and are identified by their name.

Examples

julia> Sequence("abc", 5)
Sequence("abc", 5)

julia> Sequence("abc", 5) == Sequence("abc", 9)
true

julia> Sequence("abc", 0)
ERROR: ArgumentError: Cannot instantiate an empty sequence
BinBencherBackend.Source - Type
Source{Genome}(g::Genome, name::AbstractString, length::Integer)

Sources are the "ground truth" sequences that the binning attempts to recreate. For example, the assembled contigs of the reference genome (typically full, closed circular contigs) as found in NCBI or elsewhere are each Sources. Many Genomes contain only a single Source, namely their full assembled genome. Each Source has a single parent Genome, and a unique name which identifies it.

Sources have zero or more mapping Sequences, each of which maps to the Source at a span given by a 2-tuple Tuple{Int, Int}.

Sources have an assembly size, which is the number of base pairs to which at least one sequence maps.

BinBencherBackend.assembly_size! - Method

Compute the number of positions in v covered at least once. v must be a Vector such that all(by(i) isa Tuple{Integer, Integer} for i in v). The scratch input is mutated.
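
Counting positions covered at least once can be done by sorting the spans and only adding the part of each span that extends past what has already been counted. The sketch below is a simplified illustration under the assumption that spans are inclusive (start, stop) pairs; covered_positions is a hypothetical helper and does not reproduce the by function or the scratch buffer of assembly_size!.

```julia
# Count positions covered by at least one span.
# Assumption: spans are inclusive (start, stop) pairs with start <= stop.
function covered_positions(spans::Vector{Tuple{Int, Int}})
    total = 0
    rightmost = typemin(Int)  # last position that has already been counted
    for (start, stop) in sort(spans)
        # Only the part of this span beyond `rightmost` is new.
        first_new = max(start, rightmost + 1)
        if stop >= first_new
            total += stop - first_new + 1
            rightmost = stop
        end
    end
    return total
end

# Positions 1-15 (two overlapping spans) plus positions 20-25.
println(covered_positions([(1, 10), (5, 15), (20, 25)]))  # prints "21"
```

Sorting first means overlapping and nested spans are handled in a single pass.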

BinBencherBackend.flags - Method
flags(g::Genome)::FlagSet

Returns the Flags of the Genome as a FlagSet.

See also: Flag, FlagSet

Example

julia> flags(genome)
FlagSet with 1 element:
  BinBencherBackend.Flags.organism
BinBencherBackend.gold_standard - Method
gold_standard(
    ref::Reference
    [sequences, an iterable of bins or Binning];
    disjoint=true,
    recalls=DEFAULT_RECALLS,
    precisions=DEFAULT_PRECISIONS
)::Binning

Create the optimal Binning object given a Reference, by the optimal binning of sequences. If disjoint, assign each sequence to only a single genome.

If sequences is not passed, use all sequences in ref. If a Binning is passed, use all sequences in any of its bins. Else, pass an iterable of Sequence, whose elements must be unique.

Extended help

Currently, the disjoint option uses a simple greedy algorithm to assign sequences to genomes.

BinBencherBackend.intersecting - Method
intersecting([Genome, Clade]=Genome, x::Bin)

Get an iterator of the Genomes or Clades that bin x intersects with. intersecting(::Bin) defaults to genomes.

Example

julia> collect(intersecting(bin))
1-element Vector{Genome}:
 Genome(gA)

julia> sort!(collect(intersecting(Clade, bin)); by=i -> i.name)
2-element Vector{Clade{Genome}}:
 Species "D", 2 genomes
 Genus "F", 3 genomes
BinBencherBackend.is_plasmid - Method
is_plasmid(g::Genome)::Bool

Check if g is known to be a plasmid.

Example

julia> is_plasmid(genome)
false
BinBencherBackend.is_virus - Method
is_virus(g::Genome)::Bool

Check if g is known to be a virus.

Example

julia> is_virus(genome)
false
BinBencherBackend.mrca - Method
mrca(a::Node, b::Node)::Node

Compute the most recent common ancestor (MRCA) of a and b.
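
A most recent common ancestor can be found by walking up parent links. The following is a minimal sketch with a hypothetical TreeNode type and demo_mrca function; it is not the package's implementation, and the tree mirrors the small example taxonomy used elsewhere on this page.

```julia
# Hypothetical node type with a parent link; not the package's Node.
mutable struct TreeNode
    name::String
    parent::Union{TreeNode, Nothing}
end

# Collect a node and all of its ancestors, from the node up to the root.
function lineage(n::TreeNode)
    path = TreeNode[n]
    while path[end].parent !== nothing
        push!(path, path[end].parent)
    end
    return path
end

# Walk up from `b` until reaching a node that is also in the lineage of `a`.
function demo_mrca(a::TreeNode, b::TreeNode)
    ancestors_of_a = lineage(a)
    for candidate in lineage(b)
        any(x -> x === candidate, ancestors_of_a) && return candidate
    end
    error("nodes do not share a root")
end

f = TreeNode("F", nothing)
d = TreeNode("D", f)
e = TreeNode("E", f)
gA, gB, gC = TreeNode("gA", d), TreeNode("gB", d), TreeNode("gC", e)

println(demo_mrca(gA, gB).name)  # prints "D"
println(demo_mrca(gA, gC).name)  # prints "F"
```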

BinBencherBackend.n_recovered - Method
n_recovered(::Binning, recall, precision; level=0, assembly=false)::Integer

Return the number of genomes or clades reconstructed in the Binning at the given recall and precision levels. If assembly is set, return the number of assemblies reconstructed instead. The argument level sets the taxonomic rank: 0 for Genome (or assemblies).

Examples

julia> n_recovered(binning, 0.4, 0.71)
1

julia> n_recovered(binning, 0.4, 0.71; assembly=true)
2

julia> n_recovered(binning, 0.4, 0.71; assembly=true, level=2)
1
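
The counting idea behind n_recovered can be sketched without the package. The sketch assumes that a genome counts as recovered when at least one bin reaches both thresholds; that reading, the count_recovered helper and the scores are all hypothetical.

```julia
# Hypothetical (recall, precision) pairs: one inner vector per genome,
# one pair per bin that intersects that genome.
scores = [
    [(0.45, 1.00), (0.10, 0.30)],  # genome 1
    [(0.90, 0.60)],                # genome 2
    Tuple{Float64, Float64}[],     # genome 3: no bin intersects it
]

# Assumption: a genome is recovered if any single bin meets both thresholds.
function count_recovered(scores, min_recall, min_precision)
    return count(scores) do bins
        any(((r, p),) -> r >= min_recall && p >= min_precision, bins)
    end
end

println(count_recovered(scores, 0.4, 0.71))  # prints "1"
println(count_recovered(scores, 0.4, 0.5))   # prints "2"
```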
BinBencherBackend.print_matrix - Method
print_matrix(::Binning; level=0, assembly=true)

Print the number of reconstructed assemblies or genomes at the given taxonomic level (rank). Level 0 corresponds to genomes, level 1 to species, etc. If assembly, print the number of reconstructed assemblies, else print the number of reconstructed genomes.

See also: Binning

BinBencherBackend.recall_precision - Method
recall_precision(x::Union{Genome, Clade}, bin::Bin; assembly::Bool=true)

Get the recall and precision as a named 2-tuple of Float64 for the given genome/bin or clade/bin pair. See the docstring for Bin for how this is computed.

See also: Bin, Binning

Examples

julia> bingenome = only(intersecting(bin));

julia> recall_precision(bingenome, bin)
(recall = 0.45454545454545453, precision = 1.0)

julia> recall_precision(bingenome, bin; assembly=false)
(recall = 0.4, precision = 1.0)

julia> recall_precision(bingenome.parent, bin; assembly=false)
(recall = 0.4, precision = 1.0)
BinBencherBackend.subset! - Method
subset!(
        ref::Reference;
        sequences::Function=Returns(true),
        genomes::Function=Returns(true)
)::Reference

Mutate ref in place, removing genomes and sequences. Keep only sequences S where sequences(S) returns true and genomes G for which genomes(G) returns true.

See also: subset, Reference

Examples

julia> ref
Reference
  Genomes:    3
  Sequences:  11
  Ranks:      3
  Seq length: 10
  Assembled:  61.9 %

julia> subset(ref; genomes=g -> Flags.organism in flags(g))
Reference
  Genomes:    2
  Sequences:  11
  Ranks:      3
  Seq length: 10
  Assembled:  91.3 %

julia> BinBencherBackend.subset(ref; sequences=s -> length(s) ≥ 25)
Reference
  Genomes:    3
  Sequences:  9
  Ranks:      3
  Seq length: 25
  Assembled:  56.2 %
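
The predicate-based filtering described for subset! can be imitated on plain vectors. This is only a sketch of the calling convention: demo_subset! and the named tuples are hypothetical, and the real function also has to keep the rest of the Reference consistent.

```julia
# Hypothetical stand-ins for the sequences and genomes of a reference.
seqs = [(name = "s1", length = 25), (name = "s2", length = 10), (name = "s3", length = 40)]
gens = [(name = "gA", organism = true), (name = "gB", organism = true), (name = "gC", organism = false)]

# Keep only the elements for which the corresponding predicate returns true.
# Both vectors are mutated in place, mirroring the `!` naming convention.
function demo_subset!(seqs, gens; sequences::Function=Returns(true), genomes::Function=Returns(true))
    filter!(sequences, seqs)
    filter!(genomes, gens)
    return (seqs, gens)
end

demo_subset!(seqs, gens; sequences = s -> s.length >= 25, genomes = g -> g.organism)
println([s.name for s in seqs])
println([g.name for g in gens])
```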