FastaIO.jl — FASTA file reader and writer module

This module provides ways to parse and write files in FASTA format in Julia. It is designed to be lightweight and fast; the parsing method is inspired by kseq.h. It can read and write files on the fly, keeping only one entry at a time in memory, and it can read and write gzip-compressed files.

Here is a quick example for reading a file:

julia> using FastaIO

julia> FastaReader("somefile.fasta") do fr
           for (desc, seq) in fr
               println("$desc : $seq")
           end
       end

And for writing:

julia> using FastaIO

julia> FastaWriter("somefile.fasta") do fw
           for s in [">GENE1", "GCATT", ">GENE2", "ATTAGC"]
               write(fw, s)
           end
       end

Installation and usage

To install the module, use Julia's package manager: start pkg mode by pressing ] and then enter:

(v1.3) pkg> add FastaIO

Dependencies will be installed automatically. The module can then be loaded like any other Julia module:

julia> using FastaIO

Introductory notes

For both reading and writing, there are quick functions that process all the data at once: readfasta and writefasta. These require all the data to be held in memory at the same time, which may be impossible or undesirable for very large files. The preferred way is therefore to use the specialized types FastaReader and FastaWriter, which can process one entry (description + sequence data) at a time; the writer can even process one character at a time. Note, however, that these two types are not symmetric: the reader acts as an iterable object, while the writer behaves similarly to an IO stream.

The FASTA format

The FASTA format which is assumed by this module is as follows:

description lines must start with a > character, and cannot be empty
only one description line per entry is allowed
all characters must be ASCII
whitespace is not allowed within sequence data (except for newlines), nor at the beginning or end of the description
empty lines are ignored (note, however, that lines containing only whitespace will still trigger an error)

When writing, description lines longer than 80 characters will trigger a warning message; sequence data is formatted in lines of 80 characters each; extra whitespace is silently discarded. No other restriction is put on the content of the sequence data, except that the > character is forbidden.

When reading, almost no explicit checks are performed to test that the data actually conforms to these specifications.
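The line wrapping applied when writing can be checked with a small experiment. The following is only a sketch: the scratch file name and the entry name LONGSEQ are made up for illustration, and the file is inspected with plain readlines from Base rather than with FastaIO itself.

```julia
using FastaIO

# Hypothetical scratch file, used only for this illustration.
path = tempname() * ".fasta"

# A single entry whose sequence is 200 characters long.
writefasta(path, [("LONGSEQ", "ACGT"^50)])

# Inspect the raw lines that ended up in the file.
lines = readlines(path)
println(lines[1])                # the description line, with the leading '>'
println(length.(lines[2:end]))   # lengths of the wrapped sequence lines
```

Joining the sequence lines back together should reproduce the original 200-character sequence.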

The sequence storage type

When reading FASTA files, the container type used to store the sequence data can be chosen (as an optional argument to readfasta or as a parametric type of FastaReader). The default is String, which is the most memory-efficient and the fastest; another performance-optimal option is Vector{UInt8}, which is a less friendly representation, but has the advantage of being mutable. Any other container T for which convert(::Type{T}, ::Vector{UInt8}) is defined can be used (e.g. Vector{Char}, or a more specialized Vector{AminoAcid} if you use BioSeq), but the conversion will generally slightly reduce the performance.
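As a rough illustration of the two performance-optimal storage types, the sketch below reads the same file twice. The scratch file and the idea of masking the first base with 'N' are assumptions made for the example, not something the module prescribes.

```julia
using FastaIO

# Hypothetical scratch file with two small entries.
path = tempname() * ".fasta"
writefasta(path, [("GENE1", "GCATT"), ("GENE2", "ATTAGC")])

# Default storage: each sequence is an immutable String.
as_strings = readfasta(path)

# Byte storage: each sequence is a mutable Vector{UInt8}.
as_bytes = readfasta(path, Vector{UInt8})

# Mutability allows in-place edits, e.g. masking the first base.
desc, seq = as_bytes[1]
seq[1] = UInt8('N')
println(desc, " => ", String(copy(seq)))
```

The copy before String(...) keeps the byte vector intact, since String may take ownership of the vector it is given.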

Reading files

FastaIO.readfasta — Function

readfasta(file::Union{String,IO}, [sequence_type::Type = String])

This function parses a whole FASTA file at once and stores it into memory. The result is a Vector{Any} whose elements are tuples consisting of (description, sequence), where description is a String and sequence contains the sequence data, stored in a container type defined by the sequence_type optional argument (see The sequence storage type section for more information).
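Since the result is a plain vector of (description, sequence) tuples, it can be reshaped with ordinary Julia tools. A minimal sketch, assuming a small scratch file and a hypothetical by_name lookup table:

```julia
using FastaIO

# Hypothetical scratch file with two small entries.
path = tempname() * ".fasta"
writefasta(path, [("GENE1", "GCATT"), ("GENE2", "ATTAGC")])

# Load everything into memory at once.
entries = readfasta(path)

# Build a lookup table from description to sequence.
by_name = Dict(desc => seq for (desc, seq) in entries)
println(by_name["GENE2"])
```

This is convenient for small files; for very large ones, iterating a FastaReader avoids holding every entry in memory.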

FastaIO.FastaReader — Method

FastaReader{T}(file::Union{AbstractString,IO})

This creates an object which is able to parse FASTA files, one entry at a time. file can be a plain text file or a gzip-compressed file (it will be autodetected from the content). The type T determines the output type of the sequences (see The sequence storage type section for more information) and it defaults to String.

The data can be read out by iterating the FastaReader object:

for (name, seq) in FastaReader("somefile.fasta")
    # do something with name and seq
end

As shown, the iterator returns a tuple containing the description (always a String) and the data, whose type is set when creating the FastaReader object (e.g. FastaReader{Vector{UInt8}}(filename)).

The FastaReader type has a field num_parsed which contains the number of entries parsed so far.

Other ways to read out the data are via the readentry and readfasta functions.
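The num_parsed field makes it possible to count entries while streaming. The sketch below assumes a hypothetical helper named count_entries and a scratch file created just for the demonstration.

```julia
using FastaIO

# Hypothetical helper: counts entries without accumulating any sequence data.
function count_entries(path::AbstractString)
    n = 0
    FastaReader(path) do fr
        for (desc, seq) in fr
            # Only the current entry is held here; nothing is stored.
        end
        n = fr.num_parsed   # number of entries parsed by this reader
    end
    return n
end

# Hypothetical scratch file with three small entries.
path = tempname() * ".fasta"
writefasta(path, [("GENE1", "GCATT"), ("GENE2", "ATTAGC"), ("GENE3", "GGC")])
println(count_entries(path))
```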

FastaIO.FastaReader — Method

FastaReader(f::Function, filename::AbstractString, [sequence_type::Type = String])

This format of the constructor is useful for do-notation, i.e.:

FastaReader(filename) do fr
    # read out the data from fr, e.g.
    for (name, seq) in fr
        # do something with name and seq
    end
end

which ensures that the close function is called and is therefore recommended; otherwise the file is only closed once the FastaReader object is garbage-collected.

FastaIO.readentry — Function

readentry(fr::FastaReader)

This function can be used to read entries one at a time:

fr = FastaReader("somefile.fasta")
name, seq = readentry(fr)
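When only the first entry is needed, readentry can be combined with an explicit close. This is a sketch under the assumption that a try/finally block is an acceptable way to release the reader when do-notation is not used; the scratch file is hypothetical.

```julia
using FastaIO

# Hypothetical scratch file with two small entries.
path = tempname() * ".fasta"
writefasta(path, [("GENE1", "GCATT"), ("GENE2", "ATTAGC")])

fr = FastaReader(path)
first_entry = try
    readentry(fr)   # parses only the first entry
finally
    close(fr)       # release the file explicitly
end

name, seq = first_entry
println("$name : $seq")
```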

Writing files

FastaIO.writefasta — Method

writefasta(filename::String, data, [mode::String = "w"])

This function dumps data to a FASTA file, auto-formatting it so as to follow the specifications detailed in the section titled The FASTA format. The data can be anything which is iterable and which produces (description, sequence) tuples upon iteration, where the description must be convertible to a String and the sequence can be any iterable object which yields elements convertible to ASCII characters (e.g. a String, a Vector{UInt8} etc.).

Examples:

writefasta("somefile.fasta", [("GENE1", "GCATT"), ("GENE2", "ATTAGC")])
writefasta("somefile.fasta", ["GENE1" => "GCATT", "GENE2" => "ATTAGC"])

If the filename ends with .gz, the result will be a gzip-compressed file.

The mode flag determines how the file is opened; use "a" to append the data to an existing file.
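The append mode and the .gz handling can be tried out together. The following is a sketch with hypothetical scratch file names; it reads the results back with readfasta, which relies on the gzip autodetection described for FastaReader.

```julia
using FastaIO

# Plain-text file: write one entry, then append a second one.
plain = tempname() * ".fasta"
writefasta(plain, [("GENE1", "GCATT")])          # default mode "w"
writefasta(plain, [("GENE2", "ATTAGC")], "a")    # append to the same file

# Gzip-compressed file: selected by the .gz extension.
gz = tempname() * ".fasta.gz"
writefasta(gz, [("GENE1", "GCATT"), ("GENE2", "ATTAGC")])

appended = readfasta(plain)
from_gz = readfasta(gz)
println(length(appended), " entries in the plain file, ", length(from_gz), " in the gzip file")
```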

FastaIO.writefasta — Method

writefasta([io::IO = stdout], data)

This version of the function writes to an already opened IO stream, defaulting to stdout.
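Writing to an in-memory stream is a convenient way to see the formatted output without touching the file system. A minimal sketch, assuming an IOBuffer from Base as the target stream:

```julia
using FastaIO

io = IOBuffer()   # in-memory stream standing in for any already opened IO
writefasta(io, [("GENE1", "GCATT"), ("GENE2", "ATTAGC")])

# Collect what was written and display it.
text = String(take!(io))
print(text)
```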

FastaIO.FastaWriter — Type

FastaWriter(filename::AbstractString, [mode::String = "w"])
FastaWriter([io::IO = stdout])
FastaWriter(f::Function, args...)

This creates an object which is able to write formatted FASTA files which conform to the specifications detailed in the section titled The FASTA format, via the write and writeentry functions.

The third form allows the use of do-notation:

FastaWriter("somefile.fasta") do fw
    # write the file
end

which is strongly recommended, since it ensures that the close function is called at the end of writing. This is crucial, as failing to do so may result in incomplete files. The finalizer also calls close, so this still happens automatically when the FastaWriter object goes out of scope and is garbage-collected, but there is no guarantee that this will happen if Julia exits.

If the filename ends with .gz, the result will be gzip-compressed.

The mode flag can be used to set the opening mode of the file; use "a" to append to an existing file.

The FastaWriter object has an entry::Int field which stores the number of the entry which is currently being written.

FastaIO.writeentry — Function

writeentry(fw::FastaWriter, description::AbstractString, sequence)

This function writes one entry to the FASTA file, following the specifications detailed in the section titled The FASTA format. The description is without the initial '>' character. The sequence can be any iterable object whose elements are convertible to ASCII characters.

Example:

FastaWriter("somefile.fasta") do fw
    for (desc,seq) in [("GENE1", "GCATT"), ("GENE2", "ATTAGC")]
        writeentry(fw, desc, seq)
    end
end

Base.write — Method

write(fw::FastaWriter, item)

This function extends Base.write and streams items to a FASTA file, which will be formatted according to the specifications detailed in the section titled The FASTA format.

When using this method, description lines are marked by the fact that they begin with a '>' character; anything else is assumed to be part of the sequence data.

If item is a Vector, write will be called iteratively over it; if it is a String, a newline will be appended to it and it will be dumped. For example the following code:

FastaWriter("somefile.fasta") do fw
    for s in [">GENE1", "GCA", "TTT", ">GENE2", "ATTAGC"]
        write(fw, s)
    end
end

will result in the file:

>GENE1
GCATTT
>GENE2
ATTAGC

If item is neither a Vector nor a String, it must be convertible to an ASCII character, and it will be piped into the file. For example the following code:

data = """
  >GENE1
  GCA
  TTT
  >GENE2
  ATT
  AGC
  """

FastaWriter("somefile.fasta") do fw
    for ch in data
        write(fw, ch)
    end
end

will result in the same file as above.
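The Vector case mentioned earlier can be sketched in the same spirit. The example below passes a whole vector of strings in a single write call and targets an IOBuffer (an assumption made here so that the output can be inspected directly).

```julia
using FastaIO

io = IOBuffer()        # in-memory stream, so no file is needed
fw = FastaWriter(io)
try
    # Strings starting with '>' open a new entry; the rest is sequence data.
    write(fw, [">GENE1", "GCA", "TTT", ">GENE2", "ATTAGC"])
finally
    close(fw)          # finalize the output
end

text = String(take!(io))
print(text)
```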

Base.close — Method

close(fw::FastaWriter)

This function extends Base.close. It should always be called explicitly to finalize the FastaWriter once writing has finished, unless do-notation was used when creating it.
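When do-notation is not practical, one possible pattern is to pair the constructor with try/finally so that close runs even if writing fails. This is a sketch with a hypothetical scratch file, not a pattern mandated by the module.

```julia
using FastaIO

# Hypothetical scratch file.
path = tempname() * ".fasta"

fw = FastaWriter(path)
try
    writeentry(fw, "GENE1", "GCATT")
    writeentry(fw, "GENE2", "ATTAGC")
finally
    close(fw)   # ensures the file is complete even if an error occurred above
end

written = readfasta(path)
println(written)
```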