Representing biological sequences as Markov chains

Documentation Latest Release DOI
CI Workflow License Work in Progress Downloads Aqua QA


A Julia package to represent biological sequences as Markov chains


BioMarkovChains is a   Julia Language   package. To install BioMarkovChains, please open Julia's interactive session (known as REPL) and press ] key in the REPL to use the package mode, then type the following command

pkg> add BioMarkovChains

Creating Markov chain out of DNA sequences

An important step before developing several gene finding algorithms consist of having a Markov chain representation of the DNA. To do so, we implemented the BioMarkovChain method that will capture the initials and transition probabilities of a DNA sequence (LongSequence) and will create a dedicated object storing relevant information of a DNA Markov chain. Here an example:

Let find one ORF in a random LongDNA :

using BioSequences, BioMarkovChains, GeneFinder

# > 180195.SAMN03785337.LFLS01000089 -> finds only 1 gene in Prodigal (from Pyrodigal tests)

orfseq = findorfs(seq)[3] |> sequence

21nt DNA Sequence:

If we translate it, we get a 7aa sequence:


7aa Amino Acid Sequence:

Now supposing I do want to see how transitions are occurring in this ORF sequence, the I can use the BioMarkovChain method and tune it to 2nd-order Markov chain:

BioMarkovChain(orfdna, 2)

BioMarkovChain of DNAAlphabet{4}() and order 1:
  - Transition Probability Matrix -> Matrix{Float64}(4 × 4):
   0.25    0.25    0.0     0.5
   0.25    0.0     0.75    0.0
   0.25    0.25    0.25    0.25
   0.0     0.25    0.75    0.0
  - Initial Probabilities -> Vector{Float64}(4 × 1):
   0.2     0.2     0.4     0.2

This is useful to later create HMMs and calculate sequence probability based on a given model, for instance we now have the E. coli CDS and No-CDS transition models or Markov chain implemented:


BioMarkovChain of DNAAlphabet{4}() and order 1:
  - Transition Probability Matrix -> Matrix{Float64}(4 × 4):
   0.31    0.224   0.199   0.268
   0.251   0.215   0.313   0.221
   0.236   0.308   0.249   0.207
   0.178   0.217   0.338   0.267
  - Initial Probabilities -> Vector{Float64}(4 × 1):
   0.245   0.243   0.273   0.239

What is then the probability of the previous DNA sequence given this model?

dnaseqprobability(orfseq, ECOLICDS)


This is off course not very informative, but we can later use different criteria to then classify new ORFs. For a more detailed explanation see the docs