A Gene Finder framework for Julia.
Overview
This is a species-agnostic, algorithm-extensible, sequence-anonymous (genomes, metagenomes) gene finder library for the Julia language.
The main goal of GeneFinder is to create a versatile module that enables applying different implemented algorithms to DNA sequences. See, for instance, the BioAlignments implementations of different sequence alignment algorithms (local, global, edit-distance).
Installation
You can install GeneFinder from the Julia REPL. Press ] to enter pkg mode, and enter the following:
add GeneFinder
If you are interested in the cutting edge of the development, please check out the master branch to try new features before release.
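The same installation can be done without entering pkg mode, through the functional Pkg API. This is a minimal sketch: the commented-out line assumes the development version lives on the master branch, as mentioned above.

```julia
using Pkg

# Install the latest registered release.
Pkg.add("GeneFinder")

# Alternatively, track the development version (assumes the branch is named master).
# Pkg.add(name="GeneFinder", rev="master")
```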
Finding complete and overlapped ORFs
The first implemented function is findorfs, a very non-restrictive ORF finder that collects all ORFs in a dedicated structure. Note that it will catch random ORFs, not necessarily genes, since it imposes no ORF size or overlap constraints. Thus it might consider aa"M*" a possible encoded protein among the resulting ORFs.
using BioSequences, GeneFinder
# > 180195.SAMN03785337.LFLS01000089 -> finds only 1 gene in Prodigal (from Pyrodigal tests)
seq = dna"AACCAGGGCAATATCAGTACCGCGGGCAATGCAACCCTGACTGCCGGCGGTAACCTGAACAGCACTGGCAATCTGACTGTGGGCGGTGTTACCAACGGCACTGCTACTACTGGCAACATCGCACTGACCGGTAACAATGCGCTGAGCGGTCCGGTCAATCTGAATGCGTCGAATGGCACGGTGACCTTGAACACGACCGGCAATACCACGCTCGGTAACGTGACGGCACAAGGCAATGTGACGACCAATGTGTCCAACGGCAGTCTGACGGTTACCGGCAATACGACAGGTGCCAACACCAACCTCAGTGCCAGCGGCAACCTGACCGTGGGTAACCAGGGCAATATCAGTACCGCAGGCAATGCAACCCTGACGGCCGGCGACAACCTGACGAGCACTGGCAATCTGACTGTGGGCGGCGTCACCAACGGCACGGCCACCACCGGCAACATCGCGCTGACCGGTAACAATGCACTGGCTGGTCCTGTCAATCTGAACGCGCCGAACGGCACCGTGACCCTGAACACAACCGGCAATACCACGCTGGGTAATGTCACCGCACAAGGCAATGTGACGACTAATGTGTCCAACGGCAGCCTGACAGTCGCTGGCAATACCACAGGTGCCAACACCAACCTGAGTGCCAGCGGCAATCTGACCGTGGGCAACCAGGGCAATATCAGTACCGCGGGCAATGCAACCCTGACTGCCGGCGGTAACCTGAGC"
Now let us find the ORFs:
findorfs(seq)
12-element Vector{ORF}:
ORF(29:40, '+', 2)
ORF(137:145, '+', 2)
ORF(164:184, '+', 2)
ORF(173:184, '+', 2)
ORF(236:241, '+', 2)
ORF(248:268, '+', 2)
ORF(362:373, '+', 2)
ORF(470:496, '+', 2)
ORF(551:574, '+', 2)
ORF(569:574, '+', 2)
ORF(581:601, '+', 2)
ORF(695:706, '+', 2)
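To make the idea of a non-restrictive ORF search concrete, here is a minimal sketch in plain Julia. It is not GeneFinder's implementation: `naive_orfs` and `STOP_CODONS` are hypothetical names, the input is a plain ASCII string rather than a BioSequence, and only the forward strand is scanned. Every ATG is paired with the first in-frame stop codon, with no length or overlap filtering.

```julia
# Illustrative only: a hypothetical helper, not part of GeneFinder.
const STOP_CODONS = ("TAA", "TAG", "TGA")

"""
Return the ranges of all forward-strand ORFs in an ASCII DNA string:
every ATG paired with the first in-frame stop codon downstream of it.
"""
function naive_orfs(seq::AbstractString)
    orfs = UnitRange{Int}[]
    n = ncodeunits(seq)
    for start in 1:(n - 5)                      # an ORF needs at least ATG + stop = 6 nt
        seq[start:start+2] == "ATG" || continue
        for pos in (start + 3):3:(n - 2)        # walk codon by codon in the same frame
            if seq[pos:pos+2] in STOP_CODONS
                push!(orfs, start:(pos + 2))    # 1-based, inclusive, stop codon included
                break
            end
        end
    end
    return orfs
end

fragment = "ATGCGTCGAATGGCACGGTGA"   # the ORF reported at 164:184 in the example sequence
orfs = naive_orfs(fragment)          # two overlapping ORFs sharing one stop codon
println(orfs)
println([fragment[r] for r in orfs])
```

Because no minimum length is enforced, the nested ORF starting at the internal ATG is reported alongside the longer one, which mirrors the overlapping pair 164:184 and 173:184 in the output above.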
Two other functions (get_orfs_dna and get_orfs_aa) are implemented to get the ORFs as DNA and amino acid sequences, respectively. They first call findorfs to locate the ORFs and then return the corresponding array of BioSequence objects.
get_orfs_dna(seq)
12-element Vector{LongSubSeq{DNAAlphabet{4}}}:
ATGCAACCCTGA
ATGCGCTGA
ATGCGTCGAATGGCACGGTGA
ATGGCACGGTGA
ATGTGA
ATGTGTCCAACGGCAGTCTGA
ATGCAACCCTGA
ATGCACTGGCTGGTCCTGTCAATCTGA
ATGTCACCGCACAAGGCAATGTGA
ATGTGA
ATGTGTCCAACGGCAGCCTGA
ATGCAACCCTGA
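The shortest entries in this list, ATGTGA, are the reason aa"M*" can show up as a candidate protein. The sketch below checks this with the `translate` function from BioSequences, assuming a BioSequences version (v3 or later) in which `translate` accepts DNA directly and uses the standard genetic code by default.

```julia
using BioSequences

# Shortest possible ORF: a start codon immediately followed by a stop codon.
shortest = dna"ATGTGA"

# Translate with the default (standard) genetic code.
protein = translate(shortest)

# A single methionine followed by the stop symbol.
println(protein == aa"M*")
```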
Writing ORF information into bioinformatic formats
This package now facilitates the creation of FASTA, BED, and GFF files by extracting Open Reading Frame (ORF) information from BioSequence instances, particularly those of type NucleicSeqOrView{A} where A, and then writing that information in the desired format.
Functionality:
The package provides four distinct functions for writing files in different formats:
| Function | Description |
|---|---|
| write_orfs_fna | Writes nucleotide sequences in FASTA format. |
| write_orfs_faa | Writes amino acid sequences in FASTA format. |
| write_orfs_bed | Outputs information in BED format. |
| write_orfs_gff | Generates files in GFF format. |
All these functions support processing both BioSequence instances and external FASTA files. To process an external FASTA file instead of a BioSequence instance, simply provide the path to the file as a String. To demonstrate the use of the write_* methods with a BioSequence, consider the following example:
using BioSequences, GeneFinder
# > 180195.SAMN03785337.LFLS01000089 -> finds only 1 gene in Prodigal (from Pyrodigal tests)
seq = dna"AACCAGGGCAATATCAGTACCGCGGGCAATGCAACCCTGACTGCCGGCGGTAACCTGAACAGCACTGGCAATCTGACTGTGGGCGGTGTTACCAACGGCACTGCTACTACTGGCAACATCGCACTGACCGGTAACAATGCGCTGAGCGGTCCGGTCAATCTGAATGCGTCGAATGGCACGGTGACCTTGAACACGACCGGCAATACCACGCTCGGTAACGTGACGGCACAAGGCAATGTGACGACCAATGTGTCCAACGGCAGTCTGACGGTTACCGGCAATACGACAGGTGCCAACACCAACCTCAGTGCCAGCGGCAACCTGACCGTGGGTAACCAGGGCAATATCAGTACCGCAGGCAATGCAACCCTGACGGCCGGCGACAACCTGACGAGCACTGGCAATCTGACTGTGGGCGGCGTCACCAACGGCACGGCCACCACCGGCAACATCGCGCTGACCGGTAACAATGCACTGGCTGGTCCTGTCAATCTGAACGCGCCGAACGGCACCGTGACCCTGAACACAACCGGCAATACCACGCTGGGTAATGTCACCGCACAAGGCAATGTGACGACTAATGTGTCCAACGGCAGCCTGACAGTCGCTGGCAATACCACAGGTGCCAACACCAACCTGAGTGCCAGCGGCAATCTGACCGTGGGCAACCAGGGCAATATCAGTACCGCGGGCAATGCAACCCTGACTGCCGGCGGTAACCTGAGC"
Once a BioSequence object has been instantiated, the write_orfs_fna function can generate a FASTA file containing the nucleotide sequences of the ORFs. Notably, the write_orfs* methods accept either an IOStream or an IOBuffer as the output argument, so the output can be directed to a file or to a buffer. The following example writes the output directly to a file.
outfile = "LFLS01000089.fna"
open(outfile, "w") do io
    write_orfs_fna(seq, io)
end
cat LFLS01000089.fna
>ORF01 id=01 start=29 stop=40 strand=+ frame=2
ATGCAACCCTGA
>ORF02 id=02 start=137 stop=145 strand=+ frame=2
ATGCGCTGA
>ORF03 id=03 start=164 stop=184 strand=+ frame=2
ATGCGTCGAATGGCACGGTGA
>ORF04 id=04 start=173 stop=184 strand=+ frame=2
ATGGCACGGTGA
>ORF05 id=05 start=236 stop=241 strand=+ frame=2
ATGTGA
>ORF06 id=06 start=248 stop=268 strand=+ frame=2
ATGTGTCCAACGGCAGTCTGA
>ORF07 id=07 start=362 stop=373 strand=+ frame=2
ATGCAACCCTGA
>ORF08 id=08 start=470 stop=496 strand=+ frame=2
ATGCACTGGCTGGTCCTGTCAATCTGA
>ORF09 id=09 start=551 stop=574 strand=+ frame=2
ATGTCACCGCACAAGGCAATGTGA
>ORF10 id=10 start=569 stop=574 strand=+ frame=2
ATGTGA
>ORF11 id=11 start=581 stop=601 strand=+ frame=2
ATGTGTCCAACGGCAGCCTGA
>ORF12 id=12 start=695 stop=706 strand=+ frame=2
ATGCAACCCTGA
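Since the write_orfs* methods are described as accepting an IOBuffer as well, the same records can be captured in memory instead of on disk. This is a sketch under that assumption, reusing the two-argument call from the file example; the short input is a fragment of the example sequence.

```julia
using BioSequences, GeneFinder

# Short fragment of the example sequence, used here only to keep the sketch compact.
seq = dna"ATGCGTCGAATGGCACGGTGA"

buf = IOBuffer()
write_orfs_fna(seq, buf)          # same call as before, with a buffer as the output
fasta_text = String(take!(buf))   # all FASTA records as a single String
print(fasta_text)
```

Keeping the records in a String is convenient for tests or for passing them to another function without touching the filesystem.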
The same approach works for writing a FASTA file with the amino acid sequences of the ORFs using the write_orfs_faa function, and for BED and GFF files using the write_orfs_bed and write_orfs_gff functions, respectively.
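As a sketch of how the remaining writers might be called, the loop below assumes that each of them takes the same (sequence, io) arguments as write_orfs_fna in the earlier example; check the docstrings of the installed version for any extra keyword arguments. The file names are illustrative.

```julia
using BioSequences, GeneFinder

# Short fragment of the example sequence, used here only to keep the sketch compact.
seq = dna"ATGCGTCGAATGGCACGGTGA"

# Assumed call pattern: writer(sequence, io), mirroring write_orfs_fna.
writers = ((write_orfs_faa, "faa"), (write_orfs_bed, "bed"), (write_orfs_gff, "gff"))

for (writer, ext) in writers
    open("LFLS01000089.$ext", "w") do io
        writer(seq, io)
    end
end
```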