FASTA index (FAI files)

FASTX.jl supports FASTA index (FAI) files. When a FASTA file is indexed with a FAI file, one can seek records by their name, or extract parts of records easily.

See the FAI specification here: http://www.htslib.org/doc/faidx.html
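For orientation, each line of a FAI file for a FASTA file holds five tab-separated fields: the sequence name, its length in bases, the byte offset of its first base, the number of bases per line, and the number of bytes per line including the line terminator. The sketch below splits one such line into named fields. The helper parse_fai_line is hypothetical and written only for illustration; it is not part of FASTX.jl, which does its own parsing in Index.

```julia
# Hypothetical helper for illustration only; not part of FASTX.jl.
# Splits one FAI line into its five tab-separated fields.
function parse_fai_line(line::AbstractString)
    fields = split(line, '\t')
    length(fields) == 5 || error("expected 5 tab-separated fields, got $(length(fields))")
    name = String(fields[1])
    # LENGTH, OFFSET, LINEBASES, LINEWIDTH are all non-negative integers
    len, offset, linebases, linewidth = parse.(Int, fields[2:5])
    return (; name, len, offset, linebases, linewidth)
end

entry = parse_fai_line("seqname\t9\t2\t6\t8")
println(entry.name, " has ", entry.len, " bases, ", entry.linebases, " per line")
```

Here entry.len is 9 and entry.linewidth is 8, matching the second and fifth columns of the input line.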

Making an Index

A FASTA index (of type Index) can be constructed from an IO object representing a FAI file:

julia> io = IOBuffer("seqname\t9\t2\t6\t8");

julia> Index(io) isa Index
true

Or from a path representing a FAI file:

julia> Index("../test/data/test.fasta.fai");

Alternatively, a FASTA file can be indexed to produce an Index using faidx.

julia> faidx(IOBuffer(">abc\nTAGA\nTA"))
Index:
  abc	6	5	4	5

A FASTA file can also be indexed and the index written directly to a FAI file by passing the path of the FASTA file, as an AbstractString, to faidx:

julia> rm("../test/data/test.fasta.fai") # remove existing fai

julia> ispath("../test/data/test.fasta.fai")
false

julia> faidx("../test/data/test.fasta");

julia> ispath("../test/data/test.fasta.fai")
true

Note that the restrictions on FASTA files for indexing are stricter than those of FASTX.jl's FASTA parser, so not all FASTA files that can be read can be indexed:

julia> str = ">\0\n\0";

julia> first(FASTAReader(IOBuffer(str))) isa FASTARecord
true

julia> Index(IOBuffer(str))
ERROR
[...]

Writing a FAI file

If you have an Index object, you can simply write it to an IO:

julia> index = open(i -> Index(i), "../test/data/test.fasta.fai");

julia> filename = tempname();

julia> open(i -> write(i, index), filename, "w");

julia> index2 = open(i -> Index(i), filename);

julia> string(index) == string(index2)
true

Attaching an Index to a Reader

When opening a FASTA.Reader, you can attach an Index by passing the index keyword. The value can be an Index, which is used directly; an IO, from which an Index is parsed; or an AbstractString, which is interpreted as the path to a FAI file:

julia> str = ">abc\nTAG\nTA";

julia> idx = faidx(IOBuffer(str));

julia> rdr = FASTAReader(IOBuffer(str), index=idx);

You can also add an index to an existing reader using the index! function:

FASTX.FASTA.index! (Function)
index!(r::FASTA.Reader, ind::Union{Nothing, Index, IO, AbstractString})

Set the index of r, and return r. If ind isa Union{Nothing, Index}, directly set the index to ind. If ind isa IO, parse the index from the FAI-formatted IO first. If ind isa AbstractString, treat it as the path to a FAI file to parse.

See also: Index, FASTA.Reader
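As a minimal sketch of attaching an index after construction, the following opens a reader without an index and then calls index! on it. The names are qualified with FASTA. only to avoid assuming which of them are exported, and the sample data is made up for illustration.

```julia
using FASTX

fasta = ">abc\nTAG\nTA"

# Build the index from the FASTA data itself.
idx = FASTA.faidx(IOBuffer(fasta))

# Open a reader without an index, then attach one afterwards.
rdr = FASTA.Reader(IOBuffer(fasta))
FASTA.index!(rdr, idx)

# With the index attached, index-based operations such as seekrecord work.
FASTA.seekrecord(rdr, "abc")
seq = sequence(String, first(rdr))
println(seq)
```

Since index! returns the reader, the call can also be chained, for example when wrapping reader construction in a helper function.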

Seeking using an Index

With an Index attached to a Reader, you can do the following operations in O(1) time. In these examples, we will use the following FASTA file:

>seq1 sequence
TAGAAAGCAA
TTAAAC
>seq2 sequence
AACGG
UUGC
  • Seek to a Record using its identifier:
julia> seekrecord(reader, "seq2");

julia> record = first(reader); sequence(record)
"AACGGUUGC"
  • Directly extract a record using its identifier:
julia> record = reader["seq1"];

julia> description(record)
"seq1 sequence"
  • Extract a sequence directly without loading the whole record into memory. This is useful for huge sequences like chromosomes:
julia> extract(reader, "seq1", 3:5)
"GAA"
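
The constant-time lookups above are possible because the FAI fields let the byte position of any base be computed arithmetically. The sketch below illustrates that arithmetic on the example file; it is not FASTX.jl's actual implementation. The helper byte_offset is hypothetical, and the FAI values for seq1 (length 16, offset 15, 10 bases per line, 11 bytes per line) were worked out by hand from the file shown above, assuming \n line endings.

```julia
# Hypothetical helper for illustration only; not part of FASTX.jl.
# Returns the zero-based byte offset of the 1-based sequence position `pos`,
# given the OFFSET, LINEBASES and LINEWIDTH fields of a FAI entry.
function byte_offset(offset::Integer, linebases::Integer, linewidth::Integer, pos::Integer)
    full_lines, within_line = divrem(pos - 1, linebases)
    return offset + full_lines * linewidth + within_line
end

fasta = ">seq1 sequence\nTAGAAAGCAA\nTTAAAC\n>seq2 sequence\nAACGG\nUUGC"
data = codeunits(fasta)

# Hand-derived FAI fields for seq1: offset 15, linebases 10, linewidth 11.
# Julia arrays are 1-based, hence the `+ 1` when indexing into `data`.
fetch(range) = String([data[byte_offset(15, 10, 11, p) + 1] for p in range])

println(fetch(3:5))   # same bases as extract(reader, "seq1", 3:5)
println(fetch(9:12))  # spans the line break between the two sequence lines
```

Mapping each position separately keeps the sketch short; a real implementation would more likely read the byte span once and skip line terminators.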

FASTX.jl does not yet support indexing FASTQ files.

Reference:

FASTX.FASTA.faidx (Function)
faidx(io::IO)::Index

Build a FASTA.Index by indexing the FASTA-formatted data read from io.

See also: Index

Examples

julia> ind = faidx(IOBuffer(">ab\nTA\nT\n>x y\nGAG\nGA"))
Index:
  ab	3	4	2	3
  x	5	14	3	4

faidx(fnapath::AbstractString, [idxpath::AbstractString], check=true)

Index the FASTA file at fnapath and write the index to idxpath. If idxpath is not given, it defaults to fnapath * ".fai". If check is true, throw an error if the output file already exists.

See also: Index

FASTX.FASTA.seekrecord (Function)
seekrecord(reader::FASTAReader, i::Union{AbstractString, Integer})

Seek Reader to the i'th record. The next iterated record will be the i'th record. i can be the identifier of a sequence, or the 1-based record number in the Index.

The Reader needs to be indexed for this to work.

FASTX.FASTA.extract (Function)
extract(reader::Reader, name::AbstractString, range::Union{Nothing, UnitRange})

Extract a subsequence given by index range from the sequence named in a Reader with an index. Returns a String. If range is nothing (the default value), return the entire sequence.
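
A minimal sketch of both call forms, assuming the reader is built from an in-memory copy of the two-record example file used earlier on this page and indexed with faidx. The names are qualified with FASTA. only to avoid assuming which of them are exported.

```julia
using FASTX

fasta = ">seq1 sequence\nTAGAAAGCAA\nTTAAAC\n>seq2 sequence\nAACGG\nUUGC"

# Index the data, then open a reader over the same data with the index attached.
idx = FASTA.faidx(IOBuffer(fasta))
reader = FASTA.Reader(IOBuffer(fasta); index=idx)

# With a range: only the requested bases are returned.
part = FASTA.extract(reader, "seq1", 3:5)

# Without a range (range === nothing): the entire sequence is returned.
whole = FASTA.extract(reader, "seq1")

println(part)
println(whole)
```

Both calls return a String with line terminators removed, so the result can be used directly or converted to another sequence type.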

FASTX.FASTA.Index (Type)
Index(src::Union{IO, AbstractString})

FASTA index object, which allows constant-time seeking of FASTA files by name. The index is assumed to be in FAI format.

Notable methods:

  • Index(::Union{IO, AbstractString}): Read FAI file from IO or file at path
  • write(::IO, ::Index): Write index in FAI format
  • faidx(::IO)::Index: Index FASTA file
  • seekrecord(::Reader, ::AbstractString): Go to position of seq
  • extract(::Reader, ::AbstractString): Extract part of sequence

Note that the FAI specs are stricter than FASTX.jl's definition of FASTA, such that some valid FASTA records may not be indexable. See the specs at: http://www.htslib.org/doc/faidx.html

See also: FASTA.Reader

Examples

julia> src = IOBuffer("seqname\t9\t14\t6\t8\nA\t1\t3\t1\t2");

julia> fna = IOBuffer(">A\nG\n>seqname\nACGTAC\r\nTTG");

julia> rdr = FASTA.Reader(fna; index=src);

julia> seekrecord(rdr, "seqname");

julia> sequence(String, first(rdr))
"ACGTACTTG"