Indexing & modifying sequences

Indexing

Most BioSequence concrete subtypes for the most part behave like other vector or string types. They can be indexed using integers or ranges:

For example, with LongSequences:

julia> seq = dna"ACGTTTANAGTNNAGTACC"
19nt DNA Sequence:
ACGTTTANAGTNNAGTACC

julia> seq[5]
DNA_T

julia> seq[6:end]
14nt DNA Sequence:
TANAGTNNAGTACC

The biological symbol at a given locus in a biological sequence can be set using setindex:

julia> seq = dna"ACGTTTANAGTNNAGTACC"
19nt DNA Sequence:
ACGTTTANAGTNNAGTACC

julia> seq[5] = DNA_A
DNA_A

Note

Some types such as Kmer can be indexed using integers but not using ranges.

For LongSequence types, indexing a sequence by range creates a subsequence of the original sequence. Unlike Arrays in the standard library, creating this subsequence is copy-free: the subsequence simply points to the original LongSequence data with its range. You may think that this is unsafe because modifying subsequences propagates to the original sequence, but this doesn't happen actually:

julia> seq = dna"AAAA"    # create a sequence
4nt DNA Sequence:
AAAA

julia> subseq = seq[1:2]  # create a subsequence from `seq`
2nt DNA Sequence:
AA

julia> subseq[2] = DNA_T  # modify the second element of it
DNA_T

julia> subseq             # the subsequence is modified
2nt DNA Sequence:
AT

julia> seq                # but the original sequence is not
4nt DNA Sequence:
AAAA

This is because modifying a sequence checks whether its underlying data are shared with other sequences under the hood. If and only if the data are shared, the subsequence creates a copy of itself. Any modifying operation does this check. This is called copy-on-write strategy and users don't need to care about it because it is transparent: If the user modifies a sequence with or subsequence, the job of managing and protecting the underlying data of sequences is handled for them.

Modifying sequences

In addition to setindex, many other modifying operations are possible for biological sequences such as push!, pop!, and insert!, which should be familiar to anyone used to editing arrays.

Base.push! — Function

push!(collection, items...) -> collection

Insert one or more items in collection. If collection is an ordered container, the items are inserted at the end (in the given order).

Examples

julia> push!([1, 2, 3], 4, 5, 6)
6-element Array{Int64,1}:
 1
 2
 3
 4
 5
 6

If collection is ordered, use append! to add all the elements of another collection to it. The result of the preceding example is equivalent to append!([1, 2, 3], [4, 5, 6]). For AbstractSet objects, union! can be used instead.

source

Base.pop! — Function

pop!(collection) -> item

Remove an item in collection and return it. If collection is an ordered container, the last item is returned.

Examples

julia> A=[1, 2, 3]
3-element Array{Int64,1}:
 1
 2
 3

julia> pop!(A)
3

julia> A
2-element Array{Int64,1}:
 1
 2

julia> S = Set([1, 2])
Set{Int64} with 2 elements:
  2
  1

julia> pop!(S)
2

julia> S
Set{Int64} with 1 element:
  1

julia> pop!(Dict(1=>2))
1 => 2

source

pop!(collection, key[, default])

Delete and return the mapping for key if it exists in collection, otherwise return default, or throw an error if default is not specified.

Examples

julia> d = Dict("a"=>1, "b"=>2, "c"=>3);

julia> pop!(d, "a")
1

julia> pop!(d, "d")
ERROR: KeyError: key "d" not found
Stacktrace:
[...]

julia> pop!(d, "e", 4)
4

source

pop!(q::Deque{T})

Remove the element at the back

pop!(h::BinaryMinMaxHeap) = popmin!(h)

pop!(sc, k[, default])

Deletes the item with key k in SortedDict or SortedSet sc and returns the value that was associated with k in the case of SortedDict or k itself in the case of SortedSet. If k is not in sc return default, or throw a KeyError if default is not specified. Time: O(c log n)

pop!(sc, k[, default])

pop!(ss)

Deletes the item with first key in SortedSet ss and returns the key. A BoundsError results if ss is empty. Time: O(c log n)

pop!(collection, key[, default])

Delete and return the mapping for key if it exists in collection, otherwise return default, or throw an error if default is not specified.

Examples

julia> d = RobinDict("a"=>1, "b"=>2, "c"=>3);

julia> pop!(d, "a")
1

julia> pop!(d, "d")
ERROR: KeyError: key "d" not found
Stacktrace:
[...]

julia> pop!(d, "e", 4)
4

pop!(cb::CircularBuffer)

Remove the element at the back.

pop!(seq::BioSequence)

Remove the symbol from the end of a biological sequence seq and return it. Returns a variable of eltype(seq).

Base.pushfirst! — Function

pushfirst!(collection, items...) -> collection

Insert one or more items at the beginning of collection.

Examples

julia> pushfirst!([1, 2, 3, 4], 5, 6)
6-element Array{Int64,1}:
 5
 6
 1
 2
 3
 4

source

pushfirst!(q::Deque{T}, x)

Add an element to the front

pushfirst!(D::CircularDeque, v)

Add an element to the front.

pushfirst!(cb::CircularBuffer, data)

Insert one or more items at the beginning of CircularBuffer and overwrite back if full.

pushfirst!(seq, x)

Insert a biological symbol x at the beginning of a biological sequence seq.

Base.popfirst! — Function

popfirst!(collection) -> item

Remove the first item from collection.

Examples

julia> A = [1, 2, 3, 4, 5, 6]
6-element Array{Int64,1}:
 1
 2
 3
 4
 5
 6

julia> popfirst!(A)
1

julia> A
5-element Array{Int64,1}:
 2
 3
 4
 5
 6

source

popfirst!(q::Deque{T})

Remove the element at the front

popfirst!(D::CircularDeque)

Remove the element at the front.

popfirst!(cb::CircularBuffer)

Remove the element from the front of the CircularBuffer.

popfirst!(seq)

Remove the symbol from the beginning of a biological sequence seq and return it. Returns a variable of eltype(seq).

Base.insert! — Function

insert!(a::Vector, index::Integer, item)

Insert an item into a at the given index. index is the index of item in the resulting a.

Examples

julia> insert!([6, 5, 4, 2, 1], 4, 3)
6-element Array{Int64,1}:
 6
 5
 4
 3
 2
 1

source

insert!(sc, k)

Argument sc is a SortedDict or SortedMultiDict, k is a key and v is the corresponding value. This inserts the (k,v) pair into the container. If the key is already present in a SortedDict, this overwrites the old value. In the case of SortedMultiDict, no overwriting takes place (since SortedMultiDict allows the same key to associate with multiple values). In the case of SortedDict, the return value is a pair whose first entry is boolean and indicates whether the insertion was new (i.e., the key was not previously present) and the second entry is the semitoken of the new entry. In the case of SortedMultiDict, a semitoken is returned (but no boolean). Time: O(c log n)

insert!(sc, k)

insert!(sc, k)

Argument sc is a SortedSet and k is a key. This inserts the key into the container. If the key is already present, this overwrites the old value. (This is not necessarily a no-op; see below for remarks about the customizing the sort order.) The return value is a pair whose first entry is boolean and indicates whether the insertion was new (i.e., the key was not previously present) and the second entry is the semitoken of the new entry. Time: O(c log n)

insert!(seq::BioSequence, i, x)

Insert a biological symbol x into a biological sequence seq, at the given index i.

Base.deleteat! — Method

deleteat!(seq::BioSequence, i::Integer)

Delete a biological symbol at a single position i in a biological sequence seq.

Modifies the input sequence.

Base.append! — Function

append!(collection, collection2) -> collection.

For an ordered container collection, add the elements of collection2 to the end of it.

Examples

julia> append!([1],[2,3])
3-element Array{Int64,1}:
 1
 2
 3

julia> append!([1, 2, 3], [4, 5, 6])
6-element Array{Int64,1}:
 1
 2
 3
 4
 5
 6

Use push! to add individual items to collection which are not already themselves in another collection. The result of the preceding example is equivalent to push!([1, 2, 3], 4, 5, 6).

source

append!(cb::CircularBuffer, datavec::AbstractVector)

Push at most last capacity items.

append!(seq, other)

Add a biological sequence other onto the end of biological sequence seq. Modifies and returns seq.

Base.resize! — Function

resize!(a::Vector, n::Integer) -> Vector

Resize a to contain n elements. If n is smaller than the current collection length, the first n elements will be retained. If n is larger, the new elements are not guaranteed to be initialized.

Examples

julia> resize!([6, 5, 4, 3, 2, 1], 3)
3-element Array{Int64,1}:
 6
 5
 4

julia> a = resize!([6, 5, 4, 3, 2, 1], 8);

julia> length(a)
8

julia> a[1:6]
6-element Array{Int64,1}:
 6
 5
 4
 3
 2
 1

source

resize!(seq, size, [force::Bool])

Resize a biological sequence seq, to a given size. Does not resize the underlying data array unless the new size does not fit. If force, always resize underlying data array.

Base.empty! — Function

empty!(collection) -> collection

Remove all elements from a collection.

Examples

julia> A = Dict("a" => 1, "b" => 2)
Dict{String,Int64} with 2 entries:
  "b" => 2
  "a" => 1

julia> empty!(A);

julia> A
Dict{String,Int64} with 0 entries

source

empty!(q::Deque{T})

Reset the deque.

empty!(collection) -> collection

Remove all elements from a collection.

Examples

julia> A = RobinDict("a" => 1, "b" => 2)
RobinDict{String,Int64} with 2 entries:
  "b" => 2
  "a" => 1

julia> empty!(A);

julia> A
RobinDict{String,Int64} with 0 entries

empty!(cb::CircularBuffer)

Reset the buffer.

empty!(seq::BioSequence)

Completely empty a biological sequence seq of nucleotides.

Here are some examples:

julia> seq = dna"ACG"
3nt DNA Sequence:
ACG

julia> push!(seq, DNA_T)
4nt DNA Sequence:
ACGT

julia> append!(seq, dna"AT")
6nt DNA Sequence:
ACGTAT

julia> deleteat!(seq, 2)
5nt DNA Sequence:
AGTAT

julia> deleteat!(seq, 2:3)
3nt DNA Sequence:
AAT

Additional transformations

In addition to these basic modifying functions, other sequence transformations that are common in bioinformatics are also provided.

Base.reverse! — Method

reverse!(seq::LongSequence)

Reverse a biological sequence seq in place.

Base.reverse — Method

reverse(seq::LongSequence)

Create reversed copy of a biological sequence.

BioSequences.complement! — Function

complement!(seq)

Make a complement sequence of seq in place.

BioSymbols.complement — Function

complement(nt::NucleicAcid)

Return the complementary nucleotide of nt.

This function returns the union of all possible complementary nucleotides.

Examples

julia> complement(DNA_A)
DNA_T

julia> complement(DNA_N)
DNA_N

julia> complement(RNA_U)
RNA_A

complement(seq)

Make a complement sequence of seq.

complement(x::T) where {T <: Skipmer}

Return the complement of a short sequence type x.

BioSequences.reverse_complement! — Function

reverse_complement!(seq)

Make a reversed complement sequence of seq in place.

BioSequences.reverse_complement — Function

reverse_complement(seq)

Make a reversed complement sequence of seq.

reverse_complement(x::Skipmer)

Return the reverse complement of x.

BioSequences.ungap! — Function

Remove gap characters from an input sequence.

BioSequences.ungap — Function

Create a copy of a sequence with gap characters removed.

BioSequences.canonical! — Function

canonical!(seq::NucleotideSeq)

Transforms the seq into its canonical form, if it is not already canonical. Modifies the input sequence inplace.

For any sequence, there is a reverse complement, which is the same sequence, but on the complimentary strand of DNA:

------->
ATCGATCG
CGATCGAT
<-------

Note

Using the reverse_complement of a DNA sequence will give give this reverse complement.

Of the two sequences, the canonical of the two sequences is the lesser of the two i.e. canonical_seq < other_seq.

Using this function on a seq will ensure it is the canonical version.

BioSequences.canonical — Function

canonical(seq::NucleotideSeq)

Create the canonical sequence of seq.

canonical(kmer::Skipmer)

Return the canonical sequence of x.

A canonical sequence is the numerical lesser of a k-mer and its reverse complement. This is useful in hashing/counting sequences in data that is not strand specific, and thus observing the short sequence is equivalent to observing its reverse complement.

Some examples:

julia> seq = dna"ACGTAT"
6nt DNA Sequence:
ACGTAT

julia> reverse!(seq)
6nt DNA Sequence:
TATGCA

julia> complement!(seq)
6nt DNA Sequence:
ATACGT

julia> reverse_complement!(seq)
6nt DNA Sequence:
ACGTAT

Many of these methods also have a version which makes a copy of the input sequence, so you get a modified copy, and don't alter the original sequence. Such methods are named the same, but without the exclamation mark. E.g. reverse instead of reverse!, and ungap instead of ungap!.

Translation

Translation is a slightly more complex transformation for RNA Sequences and so we describe it here in more detail.

The translate function translates a sequence of codons in a RNA sequence to a amino acid sequence based on a genetic code. The BioSequences package provides all NCBI defined genetic codes and they are registered in ncbi_trans_table.

BioSequences.translate — Function

translate(seq, code=standard_genetic_code, allow_ambiguous_codons=true, convert_start_codon=false)

Translate an LongRNASeq or a LongDNASeq to an LongAminoAcidSeq.

Translation uses genetic code code to map codons to amino acids. See ncbi_trans_table for available genetic codes. If codons in the given sequence cannot determine a unique amino acid, they will be translated to AA_X if allow_ambiguous_codons is true and otherwise result in an error. For organisms that utilize alternative start codons, one can set alternative_start=true, in which case the first codon will always be converted to a methionine.

BioSequences.ncbi_trans_table — Constant

Genetic code list of NCBI.

The standard genetic code is ncbi_trans_table[1] and others can be shown by show(ncbi_trans_table). For more details, consult the next link: http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=cgencodes.

julia> ncbi_trans_table
Translation Tables:
  1. The Standard Code (standard_genetic_code)
  2. The Vertebrate Mitochondrial Code (vertebrate_mitochondrial_genetic_code)
  3. The Yeast Mitochondrial Code (yeast_mitochondrial_genetic_code)
  4. The Mold, Protozoan, and Coelenterate Mitochondrial Code and the Mycoplasma/Spiroplasma Code (mold_mitochondrial_genetic_code)
  5. The Invertebrate Mitochondrial Code (invertebrate_mitochondrial_genetic_code)
  6. The Ciliate, Dasycladacean and Hexamita Nuclear Code (ciliate_nuclear_genetic_code)
  9. The Echinoderm and Flatworm Mitochondrial Code (echinoderm_mitochondrial_genetic_code)
 10. The Euplotid Nuclear Code (euplotid_nuclear_genetic_code)
 11. The Bacterial, Archaeal and Plant Plastid Code (bacterial_plastid_genetic_code)
 12. The Alternative Yeast Nuclear Code (alternative_yeast_nuclear_genetic_code)
 13. The Ascidian Mitochondrial Code (ascidian_mitochondrial_genetic_code)
 14. The Alternative Flatworm Mitochondrial Code (alternative_flatworm_mitochondrial_genetic_code)
 16. Chlorophycean Mitochondrial Code (chlorophycean_mitochondrial_genetic_code)
 21. Trematode Mitochondrial Code (trematode_mitochondrial_genetic_code)
 22. Scenedesmus obliquus Mitochondrial Code (scenedesmus_obliquus_mitochondrial_genetic_code)
 23. Thraustochytrium Mitochondrial Code (thraustochytrium_mitochondrial_genetic_code)
 24. Pterobranchia Mitochondrial Code (pterobrachia_mitochondrial_genetic_code)
 25. Candidate Division SR1 and Gracilibacteria Code (candidate_division_sr1_genetic_code)

https://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=cgencodes