Representing biological sequences as Markov chains



A Julia package to represent biological sequences as Markov chains


BioMarkovChains is a Julia Language package. To install it, open Julia's interactive session (known as the REPL), press the ] key to enter package mode, and then type the following command:

pkg> add BioMarkovChains

Creating Markov chains out of DNA sequences

An important step in developing gene-finding algorithms is building a Markov chain representation of the DNA. To do so, we implemented the BioMarkovChain method, which captures the initial and transition probabilities of a DNA sequence (LongSequence) and creates a dedicated object storing the relevant information of a DNA Markov chain. Here is an example:

Let's find one ORF in a random LongDNA:

using BioSequences, GeneFinder, BioMarkovChains

sequence = randdnaseq(10^3)
orfdna = getorfdna(sequence, min_len=75)[1]

If we translate it, we get a 69aa sequence:

69aa Amino Acid Sequence:

Now suppose I want to see how transitions occur in this ORF sequence. I can then use the BioMarkovChain method and set it to a 2nd-order Markov chain:

BioMarkovChain(orfdna, 2)
BioMarkovChain with DNAAlphabet{4}() Alphabet:
  - Transition Probability Matrix -> Matrix{Float64}(4 × 4):
   0.2123  0.2731  0.278   0.2366
   0.2017  0.3072  0.2687  0.2224
   0.1978  0.2651  0.2893  0.2478
   0.2013  0.3436  0.2431  0.212
  - Initial Probabilities -> Vector{Float64}(4 × 1):
  - Markov Chain Order -> Int64:
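
To make the idea of a transition probability matrix concrete, here is a minimal sketch in plain Julia that estimates first-order transition probabilities by counting adjacent nucleotide pairs. The function `transition_matrix_sketch` is a hypothetical helper written for illustration; it is not part of BioMarkovChains, it works on a plain string rather than a LongSequence, and it does not attempt to reproduce how the package handles higher orders.

```julia
# Hypothetical helper, NOT the BioMarkovChains implementation: estimate a
# first-order transition probability matrix from adjacent nucleotide pairs.
function transition_matrix_sketch(seq::AbstractString; alphabet::AbstractString = "ACGT")
    # Map each symbol to its row/column index (A => 1, C => 2, G => 3, T => 4).
    index = Dict(c => i for (i, c) in enumerate(alphabet))
    counts = zeros(Float64, length(alphabet), length(alphabet))
    chars = collect(seq)
    # Count every transition from position k to position k + 1.
    for k in 1:length(chars)-1
        counts[index[chars[k]], index[chars[k+1]]] += 1
    end
    # Normalize each row; rows with no observations stay at zero.
    rowsums = sum(counts, dims = 2)
    return counts ./ max.(rowsums, 1)
end

tpm = transition_matrix_sketch("ACGTACGTAACC")
```

In this sketch, row i holds the estimated probabilities of moving from nucleotide i to each of the four nucleotides, so every row with at least one observed transition sums to 1.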

This is useful for later creating HMMs and calculating the probability of a sequence under a given model. For instance, the E. coli CDS and No-CDS transition models are already implemented as Markov chains. The CDS model, ECOLICDS, looks like this:

BioMarkovChain with DNAAlphabet{4}() Alphabet:
  - Transition Probability Matrix -> Matrix{Float64}(4 × 4):
   0.31    0.224   0.199   0.268
   0.251   0.215   0.313   0.221
   0.236   0.308   0.249   0.207
   0.178   0.217   0.338   0.267
  - Initial Probabilities -> Vector{Float64}(4 × 1):
  - Markov Chain Order -> Int64:

What, then, is the probability of the previous random ORF sequence given this model?

dnaseqprobability(orfdna, ECOLICDS)
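
As a rough illustration of what such a probability means, the sketch below scores a sequence under the textbook first-order definition: the initial probability of the first nucleotide multiplied by the transition probability of each adjacent pair, accumulated in log space to avoid underflow. The function `log_sequence_probability` is hypothetical and is not claimed to match how `dnaseqprobability` is implemented. The matrix is the E. coli CDS matrix printed above, with rows assumed to index the current nucleotide and columns the next one in A, C, G, T order. The initial probabilities are assumed uniform, because their real values are not shown here.

```julia
# E. coli CDS transition matrix as printed above.
# Assumption: rows are the current nucleotide, columns the next (A, C, G, T).
const TPM = [0.31  0.224 0.199 0.268;
             0.251 0.215 0.313 0.221;
             0.236 0.308 0.249 0.207;
             0.178 0.217 0.338 0.267]

# Assumption: uniform initial probabilities (the real values are not shown).
const INITS = fill(0.25, 4)

# Hypothetical helper: log-probability of a non-empty sequence under a
# first-order Markov chain.
function log_sequence_probability(seq::AbstractString, tpm, inits; alphabet = "ACGT")
    index = Dict(c => i for (i, c) in enumerate(alphabet))
    states = [index[c] for c in seq]
    # Start with the initial probability of the first nucleotide.
    logp = log(inits[first(states)])
    # Add the log transition probability of every adjacent pair.
    for k in 1:length(states)-1
        logp += log(tpm[states[k], states[k+1]])
    end
    return logp
end

logp = log_sequence_probability("ATGGCC", TPM, INITS)
```

Because each factor is below 1, the raw probability shrinks quickly with sequence length, which is one reason a single value says little by itself and comparisons between models are more useful.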

This is of course not very informative by itself, but we can later use different criteria to classify new ORFs. For a more detailed explanation, see the docs.