Encoding biological sequences into Voss representation

BioVossEncoder

A Julia package for encoding biological sequences into Voss representation

Installation

BioVossEncoder is a Julia Language package. To install it, open Julia's interactive session (the REPL), press the ] key to enter package mode, and type the following command:

pkg> add BioVossEncoder
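
Outside the interactive package mode, for example in a script, the same installation can be done through the Pkg standard library. This is a minimal sketch, assuming the package is available from a registry configured in your Julia setup:

```julia
# Non-interactive equivalent of `pkg> add BioVossEncoder`.
# Assumes network access and a registry that lists the package.
using Pkg
Pkg.add("BioVossEncoder")
```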

Encoding BioSequences

This package provides a simple and fast way to encode biological sequences into Voss representation, in which each row of a binary matrix corresponds to one symbol of the alphabet, each column to one position of the sequence, and an entry is 1 where that symbol occurs at that position. The main struct provided by this package is BinarySequenceMatrix, a wrapper around a BitMatrix that holds this encoding. The following example shows how to encode a DNA sequence into a binary matrix.

julia> using BioSequences, BioVossEncoder
julia> seq = dna"ACGT"
julia> BinarySequenceMatrix(seq)
4×4 BinarySequenceMatrix of DNAAlphabet{4}():
 1  0  0  0
 0  1  0  0
 0  0  1  0
 0  0  0  1
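
To make the encoding itself concrete, here is a minimal sketch in plain Julia that builds the same kind of matrix from a string. It does not use or reproduce the package's implementation: the function name voss_matrix, the default row order A, C, G, T, and the choice to skip characters outside the alphabet are all assumptions made for illustration.

```julia
# Hypothetical helper, NOT part of BioVossEncoder: a plain-Julia sketch
# of the Voss (indicator) encoding. Rows follow the order of `alphabet`,
# columns follow the positions of `seq`.
function voss_matrix(seq::AbstractString, alphabet::Vector{Char} = ['A', 'C', 'G', 'T'])
    m = falses(length(alphabet), length(seq))   # BitMatrix filled with zeros
    for (j, c) in enumerate(seq)
        i = findfirst(==(c), alphabet)          # row index of this symbol
        i === nothing && continue               # assumption: skip unknown symbols
        m[i, j] = true
    end
    return m
end

m = voss_matrix("ACGT")
```

For the input "ACGT" and the assumed row order, each symbol occurs exactly once and in alphabet order, so the result is the 4×4 identity pattern shown above.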

For convenience, the BinarySequenceMatrix struct exposes a bsm property that returns the underlying BitMatrix representation of the sequence.

julia> BinarySequenceMatrix(seq).bsm
4×4 BitMatrix:
 1  0  0  0
 0  1  0  0
 0  0  1  0
 0  0  0  1

Similarly, the function binary_sequence_matrix builds on the BinarySequenceMatrix struct and returns the BitMatrix representation of a sequence directly.

julia> binary_sequence_matrix(seq)
4×4 BitMatrix:
 1  0  0  0
 0  1  0  0
 0  0  1  0
 0  0  0  1

Creating a one-hot vector of a sequence

It is sometimes useful to encode a sequence as a one-hot vector for a single symbol. The binaryseq function takes a BioSequence and a specific symbol (a BioSymbol, either a DNA nucleotide or an amino acid) and returns a binary vector marking the positions where that symbol occurs.

julia> binaryseq(seq, DNA_A)
4-element view(::BitMatrix, 1, :) with eltype Bool:
 1
 0
 0
 0

Note that the output is a view into one row of the BitMatrix representation of the sequence rather than a freshly allocated vector. This is done for performance reasons.
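
The practical difference between a view and a copy can be seen with plain Base Julia. The sketch below does not call BioVossEncoder; it builds a small BitMatrix by hand (the names m, row_a and copy_a are illustrative) and compares a row view with a row copy after the parent matrix is modified.

```julia
# Plain Base Julia, no BioVossEncoder calls: view versus copy of a matrix row.
m = falses(4, 4)
for i in 1:4
    m[i, i] = true          # same diagonal pattern as the "ACGT" example
end

row_a  = view(m, 1, :)      # shares memory with m, no data is copied
copy_a = m[1, :]            # independent BitVector

m[1, 2] = true              # modify the parent matrix

row_a[2]                    # the view reflects the change
copy_a[2]                   # the copy does not
```

Because a view shares memory with its parent, call copy on it (or collect it) if an independent vector is needed.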