FASTA formatted files

FASTA formatted files

FASTA is a text-based file format for representing biological sequences. A FASTA file stores a list of sequence records with name, description, and sequence.

The template of a sequence record is:

>{name} {description}?
{sequence}

Here is an example of a chromosomal sequence:

>chrI chromosome 1
CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACC
CACACACACACATCCTAACACTACCCTAACACAGCCCTAATCTAACCCTG

Readers and Writers

The reader and writer for FASTA formatted files, are found within the FASTA submodule of FASTX.

They can be created with IOStreams.

using FASTX

r = FASTA.Reader(open("my-seqs.fasta", "r"))
w = FASTA.Writer(open("my-out.fasta", "w"))

As always with julia IO types, remember to close your file readers and writer after you are finished.

Using open with a do-block can help ensure you close a stream after you are finished. Base.open is overloaded with a method for this purpose.

r = open(FASTA.Reader, "my-seqs.fasta")
w = open(FASTA.Writer, "my-out.fasta")

Usually sequence records will be read sequentially from a file by iteration.

open(FASTA.Reader, "my-seqs.fasta") do reader
    for record in reader
        ## Do something
        # like showing the identifiers
        @show FASTA.identifier(record)
    end
end

Gzip compressed files can be streamed to the Reader using the CodecZlib.jl package.

reader = FASTA.Reader(GzipDecompressorStream(open("my-reads.fasta.gz")))
for record in reader
    ## do something
end
close(reader)

You can also overwrite records in a while loop to avoid excessive memory allocation.

open(FASTA.Reader, "my-seqs.fasta") do reader
    record = FASTA.Record()
    while !eof(reader)
        read!(reader, record)
        ## Do something.
    end
end

But if the FASTA file has an auxiliary index file formatted in fai, the reader supports random access to FASTA records, which would be useful when accessing specific parts of a huge genome sequence:

open(FASTA.Reader, "sacCer.fa", index = "sacCer.fa.fai") do reader
    chrIV = reader["chrIV"]  # directly read sequences called chrIV.
end

Reading in a sequence from a FASTA formatted file will give you a variable of type FASTA.Record.

Various getters and setters are available for FASTA.Records:

To write a BioSequence to FASTA file, you first have to create a FASTA.Record:

using BioSequences
x = dna"aaaaatttttcccccggggg"
rec = FASTA.Record("MySeq", x)
open(FASTA.Writer, "my-out.fasta") do
    write(w, rec)
end