A Gene Finder framework for Julia.

Overview

This is a species-agnostic, algorithm-extensible, sequence-anonymous (genomes, metagenomes) gene finder library for the Julia language.

The main goal of GeneFinder is to provide a versatile module that lets users apply different gene finding algorithms to DNA sequences through a common interface. Compare, for instance, how BioAlignments implements different sequence alignment algorithms (local, global, edit-distance).

Installation

You can install GeneFinder from the Julia REPL. Press ] to enter pkg mode, and enter the following:

add GeneFinder

If you are interested in the cutting edge of the development, please check out the master branch to try new features before release.

Finding complete and overlapped ORFs

The first implemented function is findorfs, a very non-restrictive ORF finder that collects all ORFs in a dedicated structure. Note that it will catch random ORFs that are not necessarily genes, since it places no constraints on ORF size or overlap. Thus it might even report aa"M*" (a start codon immediately followed by a stop codon) as a possible encoded protein among the resulting ORFs.

using BioSequences, GeneFinder

# > 180195.SAMN03785337.LFLS01000089 -> finds only 1 gene in Prodigal (from Pyrodigal tests)
seq = dna"AACCAGGGCAATATCAGTACCGCGGGCAATGCAACCCTGACTGCCGGCGGTAACCTGAACAGCACTGGCAATCTGACTGTGGGCGGTGTTACCAACGGCACTGCTACTACTGGCAACATCGCACTGACCGGTAACAATGCGCTGAGCGGTCCGGTCAATCTGAATGCGTCGAATGGCACGGTGACCTTGAACACGACCGGCAATACCACGCTCGGTAACGTGACGGCACAAGGCAATGTGACGACCAATGTGTCCAACGGCAGTCTGACGGTTACCGGCAATACGACAGGTGCCAACACCAACCTCAGTGCCAGCGGCAACCTGACCGTGGGTAACCAGGGCAATATCAGTACCGCAGGCAATGCAACCCTGACGGCCGGCGACAACCTGACGAGCACTGGCAATCTGACTGTGGGCGGCGTCACCAACGGCACGGCCACCACCGGCAACATCGCGCTGACCGGTAACAATGCACTGGCTGGTCCTGTCAATCTGAACGCGCCGAACGGCACCGTGACCCTGAACACAACCGGCAATACCACGCTGGGTAATGTCACCGCACAAGGCAATGTGACGACTAATGTGTCCAACGGCAGCCTGACAGTCGCTGGCAATACCACAGGTGCCAACACCAACCTGAGTGCCAGCGGCAATCTGACCGTGGGCAACCAGGGCAATATCAGTACCGCGGGCAATGCAACCCTGACTGCCGGCGGTAACCTGAGC"

Now let us find the ORFs:

findorfs(seq)

12-element Vector{ORF}:
 ORF(29:40, '+', 2)
 ORF(137:145, '+', 2)
 ORF(164:184, '+', 2)
 ORF(173:184, '+', 2)
 ORF(236:241, '+', 2)
 ORF(248:268, '+', 2)
 ORF(362:373, '+', 2)
 ORF(470:496, '+', 2)
 ORF(551:574, '+', 2)
 ORF(569:574, '+', 2)
 ORF(581:601, '+', 2)
 ORF(695:706, '+', 2)
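
To illustrate why nested ORFs such as 164:184 and 173:184 both appear in the result, here is a minimal sketch of a non-restrictive ORF scan. It uses only Base Julia on a plain String, searches the forward strand only, and the helper name naive_orfs is hypothetical: it is not GeneFinder's implementation, just one way to express "every ATG extended in frame to the first stop codon".

```julia
# Hypothetical helper, not part of GeneFinder: naive forward-strand ORF scan.
const STOP_CODONS = ("TAA", "TAG", "TGA")

function naive_orfs(seq::AbstractString)
    orfs = UnitRange{Int}[]
    n = length(seq)
    for start in 1:(n - 2)
        # Every ATG is a candidate start, with no size or overlap filter.
        seq[start:start+2] == "ATG" || continue
        # Walk codon by codon in the same frame until the first stop codon.
        for pos in (start + 3):3:(n - 2)
            if seq[pos:pos+2] in STOP_CODONS
                push!(orfs, start:(pos + 2))
                break
            end
        end
    end
    return orfs
end

# Toy input: the third ORF from the output above, flanked by two bases on each side.
toy = "CCATGCGTCGAATGGCACGGTGACC"
naive_orfs(toy)   # two ORFs sharing the same stop codon: 3:23 and 12:23
```

Because an internal ATG in the same frame is also treated as a start, the shorter ORF is reported alongside the longer one, and an input as short as "ATGTGA" still yields an ORF.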

Two other functions, get_orfs_dna and get_orfs_aa, return the ORFs as DNA and amino acid sequences, respectively. They first call findorfs to locate the ORFs and then build the corresponding array of BioSequence objects.

get_orfs_dna(seq)

12-element Vector{LongSubSeq{DNAAlphabet{4}}}:
 ATGCAACCCTGA
 ATGCGCTGA
 ATGCGTCGAATGGCACGGTGA
 ATGGCACGGTGA
 ATGTGA
 ATGTGTCCAACGGCAGTCTGA
 ATGCAACCCTGA
 ATGCACTGGCTGGTCCTGTCAATCTGA
 ATGTCACCGCACAAGGCAATGTGA
 ATGTGA
 ATGTGTCCAACGGCAGCCTGA
 ATGCAACCCTGA
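
The element type above, LongSubSeq, indicates that the returned sequences are views into the original sequence rather than copies. The idea can be sketched with Base Julia alone, assuming plain Strings and hard-coded 1-based inclusive ranges in place of BioSequence and ORF objects (both are stand-ins chosen for illustration):

```julia
# Stand-ins for illustration only: a String instead of a BioSequence,
# and hard-coded ranges instead of the output of findorfs.
seq = "CCATGCGTCGAATGGCACGGTGACC"
orfs = [3:23, 12:23]

# SubString creates a view into seq, so no sequence data is copied.
orf_dna = [SubString(seq, first(r), last(r)) for r in orfs]
```

Here orf_dna holds "ATGCGTCGAATGGCACGGTGA" and "ATGGCACGGTGA", the same two nested ORFs listed third and fourth in the output above.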

Writing ORF information into bioinformatic formats

The package now facilitates the creation of FASTA, BED, and GFF files: it extracts Open Reading Frame (ORF) information from BioSequence instances, particularly those of type NucleicSeqOrView{A} where A, and writes it in the desired format.

Functionality:

The package provides four distinct functions for writing files in different formats:

Function         Description
write_orfs_fna   Writes nucleotide sequences in FASTA format.
write_orfs_faa   Writes amino acid sequences in FASTA format.
write_orfs_bed   Outputs information in BED format.
write_orfs_gff   Generates files in GFF format.

All these functions support processing both BioSequence instances and external FASTA files. For an external FASTA file, simply provide the path to the file as a String. To demonstrate the use of the write_* methods with a BioSequence, consider the following example:

using BioSequences, GeneFinder

# > 180195.SAMN03785337.LFLS01000089 -> finds only 1 gene in Prodigal (from Pyrodigal tests)
seq = dna"AACCAGGGCAATATCAGTACCGCGGGCAATGCAACCCTGACTGCCGGCGGTAACCTGAACAGCACTGGCAATCTGACTGTGGGCGGTGTTACCAACGGCACTGCTACTACTGGCAACATCGCACTGACCGGTAACAATGCGCTGAGCGGTCCGGTCAATCTGAATGCGTCGAATGGCACGGTGACCTTGAACACGACCGGCAATACCACGCTCGGTAACGTGACGGCACAAGGCAATGTGACGACCAATGTGTCCAACGGCAGTCTGACGGTTACCGGCAATACGACAGGTGCCAACACCAACCTCAGTGCCAGCGGCAACCTGACCGTGGGTAACCAGGGCAATATCAGTACCGCAGGCAATGCAACCCTGACGGCCGGCGACAACCTGACGAGCACTGGCAATCTGACTGTGGGCGGCGTCACCAACGGCACGGCCACCACCGGCAACATCGCGCTGACCGGTAACAATGCACTGGCTGGTCCTGTCAATCTGAACGCGCCGAACGGCACCGTGACCCTGAACACAACCGGCAATACCACGCTGGGTAATGTCACCGCACAAGGCAATGTGACGACTAATGTGTCCAACGGCAGCCTGACAGTCGCTGGCAATACCACAGGTGCCAACACCAACCTGAGTGCCAGCGGCAATCTGACCGTGGGCAACCAGGGCAATATCAGTACCGCGGGCAATGCAACCCTGACTGCCGGCGGTAACCTGAGC"

Once a BioSequence object has been instantiated, the write_orfs_fna function proves useful for generating a FASTA file containing the nucleotide sequences of the ORFs. Notably, the write_orfs* methods support either an IOStream or an IOBuffer as an output argument, allowing flexibility in directing the output either to a file or a buffer. In the following example, we demonstrate writing the output directly to a file.

outfile = "LFLS01000089.fna"

open(outfile, "w") do io
    write_orfs_fna(seq, io)
end
cat LFLS01000089.fna

>ORF01 id=01 start=29 stop=40 strand=+ frame=2
ATGCAACCCTGA
>ORF02 id=02 start=137 stop=145 strand=+ frame=2
ATGCGCTGA
>ORF03 id=03 start=164 stop=184 strand=+ frame=2
ATGCGTCGAATGGCACGGTGA
>ORF04 id=04 start=173 stop=184 strand=+ frame=2
ATGGCACGGTGA
>ORF05 id=05 start=236 stop=241 strand=+ frame=2
ATGTGA
>ORF06 id=06 start=248 stop=268 strand=+ frame=2
ATGTGTCCAACGGCAGTCTGA
>ORF07 id=07 start=362 stop=373 strand=+ frame=2
ATGCAACCCTGA
>ORF08 id=08 start=470 stop=496 strand=+ frame=2
ATGCACTGGCTGGTCCTGTCAATCTGA
>ORF09 id=09 start=551 stop=574 strand=+ frame=2
ATGTCACCGCACAAGGCAATGTGA
>ORF10 id=10 start=569 stop=574 strand=+ frame=2
ATGTGA
>ORF11 id=11 start=581 stop=601 strand=+ frame=2
ATGTGTCCAACGGCAGCCTGA
>ORF12 id=12 start=695 stop=706 strand=+ frame=2
ATGCAACCCTGA
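
Writing to an in-memory buffer instead of a file follows the same pattern, because both are IO objects. The sketch below uses only Base Julia and a hypothetical helper, write_fasta_sketch, that loosely mimics the header layout shown above (without the frame field). It is not GeneFinder's implementation, and the String input and hard-coded ranges are assumptions made to keep the example self-contained.

```julia
# Hypothetical helper, not part of GeneFinder: write forward-strand ORFs
# as FASTA records to any IO (an open file, an IOBuffer, stdout, ...).
function write_fasta_sketch(io::IO, seq::AbstractString, orfs::Vector{UnitRange{Int}})
    for (i, r) in enumerate(orfs)
        id = lpad(i, 2, '0')  # zero-padded record number, e.g. "01"
        println(io, ">ORF", id, " id=", id, " start=", first(r), " stop=", last(r), " strand=+")
        println(io, seq[r])
    end
end

buf = IOBuffer()
write_fasta_sketch(buf, "CCATGCGTCGAATGGCACGGTGACC", [3:23, 12:23])
fasta_text = String(take!(buf))  # collect everything written to the buffer
```

Accepting a generic IO argument is what lets the same function serve files and buffers alike, which is convenient for testing output without touching the disk.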

The same approach can be used to write a FASTA file with the amino acid sequences of the ORFs using the write_orfs_faa function, and likewise BED and GFF files using the write_orfs_bed and write_orfs_gff functions, respectively.