Reference
BinBencherBackend.Bin — Type
Bin(name::AbstractString, ref::Reference, sequences)
`Bin`s each represent a bin created by the binner. Conceptually, they are simply a set of `Sequence` with a name attached. Practically, every `Bin` is benchmarked against all `Genome`s and `Clade`s of a given `Reference`, so each `Bin` stores data about its intersection with every genome/clade, e.g. its purity and recall.
Like `Source`s, `Bin`s also have an assembly size for a given `Genome`. This is the number of base pairs in the genome covered by any sequence in the `Bin`, which is never larger than the genome's assembly size.
Benchmark statistics for a `Bin`/`Genome` pair can be computed with either assemblies or genomes as the ground truth.
- True positives (TP) are defined as the sum of the bin's assembly sizes over all sources in the genome
- False positives (FP) are the sum of the lengths of sequences in the bin not mapping to the genome
- False negatives (FN) are either the genome assembly size or the genome size, minus TP.
For `Bin`/`Clade` pairs B/C, recall is the maximal recall of B/Ch over all children Ch of C. Precision is the sum of lengths of sequences mapping to any child of the clade divided by the sum of lengths of all sequences in the bin.
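As a rough illustration of how the three counts above turn into scores, the following self-contained sketch computes recall, precision and F1 from TP, FP and FN. The name `toy_stats` and the numbers are hypothetical and not part of BinBencherBackend; the formulas are the standard definitions, and the package may compute these values differently internally.

```julia
# Hypothetical helper, not the package API.
# tp, fp and fn are base pair counts as defined in the list above.
# Assumes tp > 0, otherwise the divisions produce NaN.
function toy_stats(tp::Integer, fp::Integer, fn::Integer)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * recall * precision / (recall + precision)
    return (; recall, precision, f1)
end

# A bin covering 40 bp of a 100 bp genome, with no foreign sequence
stats = toy_stats(40, 0, 60)
```

Whether FN is derived from the genome size or from the genome assembly size is the only difference between the two ground truths; TP and FP stay the same.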
See also: `Binning`, `Genome`, `Clade`
Examples
julia> bin = first(binning.bins)
Bin "C1"
Sequences: 2
Breadth: 65
Intersecting 1 genome
julia> first(bin.sequences)
Sequence("s1", 25)
julia> f1(first(ref.genomes), bin)
0.5714285714285715
BinBencherBackend.Binning — Type
Binning(::Union{IO, AbstractString}, ::Reference; kwargs...)
A `Binning` represents a set of `Bin`s benchmarked against a `Reference`. `Binning`s can be created given a set of `Bin`s and a `Reference`, where the bins may potentially be loaded from a `.tsv` file. The fields `recovered_asms` and `recovered_genomes` are used for benchmarking; these are normally output using the `print_matrix` function.
A `Binning` is loaded from a tsv file, which is specified either as an `IO`, or its path as an `AbstractString`. If the path ends with `.gz`, the file is automatically gzip-decompressed when read.
See also: `print_matrix`, `Bin`, `Reference`
Examples
julia> bins = Binning(path_to_bins_file, ref);
julia> bins isa Binning
true
julia> BinBencherBackend.n_nc(binning)
0
Extended help
Create with:
open(file) do io
    Binning(
        io::Union{IO, AbstractString},
        ref::Reference;
        min_size::Integer=1,
        min_seqs::Integer=1,
        binsplit_separator::Union{AbstractString, Char, Nothing}=nothing,
        disjoint::Bool=true,
        recalls=DEFAULT_RECALLS,
        precisions=DEFAULT_PRECISIONS,
        filter_genomes=Returns(true)
    )
end
- `min_size`: Filter away bins with breadth lower than this
- `min_seqs`: Filter away bins with fewer sequences than this
- `binsplit_separator`: Split bins based on this separator (`nothing` means no binsplitting)
- `disjoint`: Throw an error if the same sequence is seen in multiple bins
- `recalls` and `precisions`: The thresholds to benchmark with
- `filter_genomes`: A function `f(genome)::Bool`. Genomes for which it returns `false` are ignored in benchmarking.
BinBencherBackend.Clade — Type
Clade{Genome}(name::AbstractString, child::Union{Clade{Genome}, Genome})
A `Clade` represents any clade above `Genome`. Every `Genome` is expected to belong to the same number of clades, e.g. there may be exactly 7 levels of clades above every `Genome`. `Clade`s always have at least one child (which is either a `Genome` or a `Clade` one rank lower), and a parent, unless it's the unique top clade from which all other clades and genomes descend. The rank of a `Genome` is 0, clades that contain genomes have rank 1, and clades containing rank-1 clades have rank 2 etc. By default, zero-indexed ranks correspond to OTU, species, genus, family, order, class, phylum and domain.
Examples
julia> top_clade(ref)
Genus "F", 3 genomes
├─ Species "D", 2 genomes
│ ├─ Genome(gA)
│ └─ Genome(gB)
└─ Species "E", 1 genome
   └─ Genome(gC)
julia> top_clade(ref).children
2-element Vector{Clade{Genome}}:
Species "D", 2 genomes
Species "E", 1 genome
BinBencherBackend.FlagSet — Type
FlagSet <: AbstractSet{Flag}
`FlagSet`s are compact sets of `Flag` associated with a `Genome`. You can construct them from an iterable of `Flag`, e.g. a 1-element tuple. `FlagSet`s support most set operations efficiently.
Examples
julia> flags = FlagSet((Flags.organism, Flags.virus));
julia> Flags.virus in flags
true
julia> isdisjoint(flags, FlagSet((Flags.organism,)))
false
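One common way to make a small set of flags compact is to store one bit per flag in a single integer. The sketch below illustrates that idea only; `ToyFlag`, `ToyFlagSet` and the helper functions are hypothetical names, and the bit layout is an assumption that says nothing about how `FlagSet` is actually implemented.

```julia
# Hypothetical stand-ins, not the package's Flag or FlagSet types.
@enum ToyFlag::UInt8 toy_organism=0 toy_virus=1 toy_plasmid=2

struct ToyFlagSet
    bits::UInt8  # bit i is set if the flag with value i is present
end

# Build a set from any iterable of flags by OR-ing their bits together
toy_flagset(flags) =
    ToyFlagSet(reduce(|, (0x01 << UInt8(f) for f in flags); init=0x00))

# Membership, disjointness and union are single bitwise operations
toy_in(f::ToyFlag, s::ToyFlagSet) = !iszero(s.bits & (0x01 << UInt8(f)))
toy_isdisjoint(a::ToyFlagSet, b::ToyFlagSet) = iszero(a.bits & b.bits)
toy_union(a::ToyFlagSet, b::ToyFlagSet) = ToyFlagSet(a.bits | b.bits)

flags = toy_flagset((toy_organism, toy_virus))
toy_in(toy_virus, flags)                            # true
toy_isdisjoint(flags, toy_flagset((toy_organism,))) # false
```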
BinBencherBackend.Flags.Flag — Type
BinBencherBackend.Genome — Type
Genome(name::AbstractString [flags::FlagSet])
`Genome`s represent individual target genomes (organisms, plasmids, viruses etc), and are conceptually the lowest-level clade that can be reconstructed. `Genome`s contain one or more `Source`s, and belong to a single parent `Clade`. They are identified uniquely among genomes by their name.
A genome has a genome size, which is the sum of the lengths of all its sources. We consider this to be the true size of the biological genome (assuming its full sequence is contained in its sources). A genome also has an assembly size, which is the sum of the assembly sizes of its sources.
Examples
julia> gA, gB, gC = collect(ref.genomes);
julia> flags(gA)
FlagSet with 1 element:
BinBencherBackend.Flags.organism
julia> mrca(gA, gB)
Species "D", 2 genomes
├─ Genome(gA)
└─ Genome(gB)
BinBencherBackend.Reference — Type
Reference(::Union{IO, AbstractString})
A `Reference` contains the ground truth to benchmark against. Conceptually, it consists of the following parts:
- A list of genomes, each with sources
- The full taxonomic tree, as lists of clades
- A list of sequences, each with a list of (source, span) to where it maps.
Normally, the types `FlagSet`, `Genome`, `Source`, `Clade` and `Sequence` do not need to be constructed manually, but are constructed when the `Reference` is loaded from a JSON file.
A `Reference` is loaded from a JSON file, which is specified either as an `IO`, or its path as an `AbstractString`. If the path ends with `.gz`, the file is automatically gzip-decompressed when read.
Examples
julia> ref = Reference(path_to_ref_file);
julia> ref isa Reference
true
julia> length(genomes(ref))
3
julia> n_seqs(ref)
11
julia> first(ref.genomes) isa Genome
true
BinBencherBackend.Sequence — Type
Sequence(name::AbstractString, length::Integer)
Type that represents a binnable sequence. Sequences do not contain other information than their name and their length, and are identified by their name.
Examples
julia> Sequence("abc", 5)
Sequence("abc", 5)
julia> Sequence("abc", 5) == Sequence("abc", 9)
true
julia> Sequence("abc", 0)
ERROR: ArgumentError: Cannot instantiate an empty sequence
[...]
BinBencherBackend.Source — Type
Source{Genome}(g::Genome, name::AbstractString, length::Integer)
`Source`s are the "ground truth" sequences that the binning attempts to recreate. For example, the assembled contigs of the reference genome (typically full, closed circular contigs) as found in NCBI or elsewhere are each a `Source`. Many `Genome`s only contain a single `Source`, namely their full assembled genome. Each `Source` has a single parent `Genome`, and a unique name which identifies it.
`Source`s have zero or more mapping `Sequence`s, each of which maps to the `Source` at a span given by a 2-tuple `Tuple{Int, Int}`.
`Source`s have an assembly size, which is the number of base pairs that at least one sequence maps to.
BinBencherBackend.assembly_size! — Method
Compute -> (breadth, totalbp), where breadth is the number of positions in `v` covered at least once, and totalbp the sum of the lengths of the sequences. `v` must be a `Vector` such that `all(by(i) isa Tuple{Integer, Integer} for i in v)`. The `scratch` input is mutated.
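To make the difference between breadth and totalbp concrete, here is a minimal sketch that computes both from a list of spans by sorting them and counting only positions not already covered. `toy_breadth` is a hypothetical helper, the spans are assumed to be inclusive `(start, stop)` pairs, and unlike `assembly_size!` it takes no `by` function or `scratch` buffer.

```julia
# Hypothetical helper, not the package's assembly_size! method.
# Spans are assumed to be inclusive (start, stop) pairs with start <= stop.
function toy_breadth(spans::Vector{Tuple{Int, Int}})
    breadth = 0
    totalbp = 0
    rightmost = typemin(Int)  # rightmost position counted so far
    for (start, stop) in sort(spans)
        totalbp += stop - start + 1
        # Skip the part of this span that earlier spans already covered
        new_start = max(start, rightmost + 1)
        if stop >= new_start
            breadth += stop - new_start + 1
            rightmost = stop
        end
    end
    return (breadth, totalbp)
end

# Positions 5-10 are covered twice, so breadth is smaller than totalbp
toy_breadth([(1, 10), (5, 15), (20, 25)])  # (21, 27)
```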
BinBencherBackend.flags — Method
BinBencherBackend.gold_standard — Method
gold_standard(
ref::Reference
[sequences, a Binning or an iterable of Sequence];
disjoint=true,
recalls=DEFAULT_RECALLS,
precisions=DEFAULT_PRECISIONS
)::Binning
Create the optimal `Binning` object given a `Reference`, by the optimal binning of the `Sequence`s in `sequences`. If `disjoint`, assign each sequence to only a single genome.
If `sequences` is not passed, use all sequences in `ref`. If a `Binning` is passed, use all sequences in any of its bins. Else, pass an iterable of `Sequence`.
Extended help
Currently, the `disjoint` option uses a simple greedy algorithm to assign sequences to genomes.
BinBencherBackend.intersecting — Method
intersecting([Genome, Clade]=Genome, x::Bin)
Get an iterator of the `Genome`s or `Clade`s that bin `x` intersects with. `intersecting(::Bin)` defaults to genomes.
Example
julia> collect(intersecting(bin))
1-element Vector{Genome}:
Genome(gA)
julia> sort!(collect(intersecting(Clade, bin)); by=i -> i.name)
2-element Vector{Clade{Genome}}:
Species "D", 2 genomes
Genus "F", 3 genomes
BinBencherBackend.is_organism — Method
is_organism(g::Genome)::Bool
Check if `g` is known to be an organism.
Example
julia> is_organism(genome)
true
BinBencherBackend.is_plasmid — Method
is_plasmid(g::Genome)::Bool
Check if `g` is known to be a plasmid.
Example
julia> is_plasmid(genome)
false
BinBencherBackend.is_virus — Method
is_virus(g::Genome)::Bool
Check if `g` is known to be a virus.
Example
julia> is_virus(genome)
false
BinBencherBackend.mrca — Method
mrca(a::Node, b::Node)::Node
Compute the most recent common ancestor (MRCA) of `a` and `b`.
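For intuition, an MRCA on a tree with parent links and known ranks can be found by first lifting the lower-ranked node to the rank of the other, then walking both upwards in lockstep until they meet. The sketch below uses a hypothetical `ToyNode` type and `toy_mrca` function on the small tree from the `Clade` example; it assumes both nodes belong to the same tree and is not the package's implementation.

```julia
# Hypothetical node type, not the package's Genome or Clade.
# A mutable struct is used so that !== compares node identity.
mutable struct ToyNode
    name::String
    rank::Int
    parent::Union{ToyNode, Nothing}
end

function toy_mrca(a::ToyNode, b::ToyNode)
    # Lift whichever node has the lower rank until the ranks match
    while a.rank < b.rank
        a = a.parent
    end
    while b.rank < a.rank
        b = b.parent
    end
    # Walk both upwards together until they are the same node
    while a !== b
        a = a.parent
        b = b.parent
    end
    return a
end

top = ToyNode("F", 2, nothing)
d = ToyNode("D", 1, top)
e = ToyNode("E", 1, top)
gA = ToyNode("gA", 0, d)
gB = ToyNode("gB", 0, d)
gC = ToyNode("gC", 0, e)

toy_mrca(gA, gB).name  # "D"
toy_mrca(gA, gC).name  # "F"
```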
BinBencherBackend.n_passing_bins — Method
n_passing_bins(::Binning, recall, precision; level=0, assembly::Bool=false)::Integer
Return the number of bins which correspond to any genome or clade at the given recall and precision levels. If `assembly` is set, a recall of 1.0 means a bin corresponds to a whole assembly, else it corresponds to a whole genome. The argument `level` sets the taxonomic rank: 0 for `Genome` (or assemblies).
Examples
julia> n_passing_bins(binning, 0.4, 0.71)
1
julia> n_passing_bins(binning, 0.65, 0.71)
0
BinBencherBackend.n_recovered — Method
n_recovered(::Binning, recall, precision; level=0, assembly=false)::Integer
Return the number of genomes or clades reconstructed in the `Binning` at the given recall and precision levels. If `assembly` is set, return the number of assemblies reconstructed instead. The argument `level` sets the taxonomic rank: 0 for `Genome` (or assemblies).
Examples
julia> n_recovered(binning, 0.4, 0.71)
1
julia> n_recovered(binning, 0.4, 0.71; assembly=true)
2
julia> n_recovered(binning, 0.4, 0.71; assembly=true, level=2)
1
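The counts from `n_passing_bins` and `n_recovered` can differ, because several bins may pass the thresholds for the same genome. The following sketch shows that distinction on made-up data; the `toy_` names and the score table are hypothetical and the code does not use the package's data structures.

```julia
# Made-up scores: one row per intersecting bin/genome pair
toy_scores = [
    (bin="C1", genome="gA", recall=0.4, precision=1.0),
    (bin="C2", genome="gA", recall=0.7, precision=0.8),
    (bin="C3", genome="gB", recall=0.2, precision=0.9),
]

toy_passes(s, min_recall, min_precision) =
    s.recall >= min_recall && s.precision >= min_precision

# Number of distinct bins that pass for at least one genome
toy_n_passing_bins(scores, r, p) =
    length(Set(s.bin for s in scores if toy_passes(s, r, p)))

# Number of distinct genomes for which at least one bin passes
toy_n_recovered(scores, r, p) =
    length(Set(s.genome for s in scores if toy_passes(s, r, p)))

toy_n_passing_bins(toy_scores, 0.4, 0.71)  # 2, bins C1 and C2
toy_n_recovered(toy_scores, 0.4, 0.71)     # 1, both bins match gA
```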
BinBencherBackend.passes_f1 — Method
passes_f1(bin::Bin, threshold::Real; assembly::Bool=false)::Bool
Check whether `bin` has an F1 score equal to or higher than `threshold` for any genome.
Examples
julia> obs_f1 = f1(only(intersecting(bin)), bin)
0.5714285714285715
julia> passes_f1(bin, obs_f1)
true
julia> passes_f1(bin, obs_f1 + 0.001)
false
BinBencherBackend.passes_recall_precision — Method
passes_recall_precision(bin::Bin, recall::Real, precision::Real; assembly::Bool=false)::Bool
Check whether `bin` intersects with any `Genome` with at least the given recall and precision thresholds.
Examples
julia> (r, p) = recall_precision(only(intersecting(bin)), bin)
(recall = 0.4, precision = 1.0)
julia> passes_recall_precision(bin, 0.40, 1.0)
true
julia> passes_recall_precision(bin, 0.41, 1.0)
false
BinBencherBackend.print_matrix — Method
print_matrix(::Binning; level=0, assembly=false)
Print the number of reconstructed assemblies or genomes at the given taxonomic level (rank). Level 0 corresponds to genomes, level 1 to species, etc. If `assembly`, print the number of reconstructed assemblies, else print the number of reconstructed genomes.
See also: Binning
BinBencherBackend.recall_precision — Method
recall_precision(x::Union{Genome, Clade}, bin::Bin; assembly::Bool=true)
Get the recall, precision `NamedTuple` of `Float64` for the given genome/bin pair. See the docstring for `Bin` for how this is computed.
Examples
julia> bingenome = only(intersecting(bin));
julia> recall_precision(bingenome, bin)
(recall = 0.4, precision = 1.0)
julia> recall_precision(bingenome, bin; assembly=false)
(recall = 0.4, precision = 1.0)
julia> recall_precision(bingenome.parent, bin; assembly=false)
(recall = 0.4, precision = 1.0)
BinBencherBackend.recursively_delete_child! — Method
Delete a child from the clade tree.
BinBencherBackend.subset! — Method
subset!(
ref::Reference;
sequences::Function=Returns(true),
genomes::Function=Returns(true)
)::Reference
Mutate `ref` in place, removing genomes and sequences. Keep only sequences S where `sequences(S)` returns `true` and genomes G for which `genomes(G)` returns `true`.
Examples
julia> ref
Reference
Genomes: 3
Sequences: 11
Ranks: 3
Seq length: 10
Assembled: 61.9 %
julia> subset(ref; genomes=g -> Flags.organism in flags(g))
Reference
Genomes: 2
Sequences: 11
Ranks: 3
Seq length: 10
Assembled: 91.3 %
julia> BinBencherBackend.subset(ref; sequences=s -> length(s) ≥ 25)
Reference
Genomes: 3
Sequences: 9
Ranks: 3
Seq length: 25
Assembled: 56.2 %
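The pattern of a mutating function with predicate keywords that default to `Returns(true)`, plus a copying wrapper, can be sketched as follows. `ToyRef`, `toy_subset!` and `toy_subset` are hypothetical, and implementing the copying version with `deepcopy` is an assumption made for illustration; the real functions also have to keep derived data such as assembly sizes consistent, which this sketch ignores. `Returns` requires Julia 1.7 or later.

```julia
# Hypothetical miniature reference: genome names and sequence lengths only
struct ToyRef
    genomes::Vector{String}
    seq_lengths::Vector{Int}
end

# Mutating version: drop every element for which a predicate returns false.
# With the defaults, everything is kept.
function toy_subset!(ref::ToyRef; sequences=Returns(true), genomes=Returns(true))
    filter!(sequences, ref.seq_lengths)
    filter!(genomes, ref.genomes)
    return ref
end

# Copying version: work on a deep copy so the input stays untouched
toy_subset(ref::ToyRef; kwargs...) = toy_subset!(deepcopy(ref); kwargs...)

ref = ToyRef(["gA", "gB", "gC"], [10, 25, 30])
small = toy_subset(ref; sequences=len -> len >= 25)
small.seq_lengths  # [25, 30], while ref.seq_lengths is unchanged
```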
BinBencherBackend.subset — Method
subset(ref::Reference; kwargs...)
Non-mutating copying version of `subset!`. This is currently much slower than `subset!`.
See also: subset!
BinBencherBackend.update_matrix! — Method
For each precision column in the matrix, add one to the correct row given by the recall value at the given precision.
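To picture the kind of matrix involved, the sketch below directly builds a table with one row per recall threshold and one column per precision threshold, where each cell counts the genomes that have at least one bin passing both thresholds. This is an assumption about what the finished matrix represents, not a description of how `update_matrix!` fills it in; `toy_recovered_matrix` and the data are hypothetical.

```julia
# Hypothetical helper, not the package's update_matrix! method.
# per_genome holds, for each genome, the scores of its intersecting bins.
function toy_recovered_matrix(per_genome, recalls, precisions)
    m = zeros(Int, length(recalls), length(precisions))
    for scores in per_genome
        for (j, p) in enumerate(precisions), (i, r) in enumerate(recalls)
            # A genome counts once per cell, however many bins pass
            if any(s -> s.recall >= r && s.precision >= p, scores)
                m[i, j] += 1
            end
        end
    end
    return m
end

per_genome = [
    [(recall=0.4, precision=1.0), (recall=0.7, precision=0.8)],  # genome gA
    [(recall=0.2, precision=0.9)],                               # genome gB
]

# Rows: recall >= 0.3, 0.6. Columns: precision >= 0.75, 0.95.
toy_recovered_matrix(per_genome, [0.3, 0.6], [0.75, 0.95])  # [1 1; 1 0]
```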