BGZF is a compression format that supports random access using virtual file offsets.
See the SAM/BAM file format specs for the details of BGZF: https://samtools.github.io/hts-specs/SAMv1.pdf.
# The first argument is a filename or an IO object (e.g. IOStream).
stream = BGZFStream("data.bgz")
# BGZFStream is a subtype of IO and works like a usual IO object.
byte = read(stream, UInt8)
# do something...
# BGZFStream is also seekable with a VirtualOffset.
seek(stream, VirtualOffset(0, 2))
# The current virtual file offset is available.
The BGZFStreams.jl package exports three types and a function to the package user:
IOstream of the BGZF file format
VirtualOffset: data offset in a BGZF file
BGZFDataError: an error type thrown when reading a malformed byte stream
virtualoffset(stream): returns the current virtual file offset of
BGZFStream type wraps an underlying
IO object and transparently inflate
(for reading) or deflate (for writing) data. Since it is a subtype of
instance of it behaves like other
IO objects, but the
seek method takes a
virtual offset instead of a normal file offset as its second argument.
VirtualOffset type represents a 64-bit virtual file offset as described in
the specification of the SAM file format. That is, the most significant 48-bit
integer of a virtual offset is a byte offset to the BGZF file to the beginning
position of a BGZF block and the least significant 16-bit integer is a byte
offset to the uncompressed byte(s).
BGZFDataError type is a subtype of
Exception and used to throw an
exception when invalid data are read.
virtualoffset(stream::BGZFStream) returns the current virtual file offset.
More specifically, it returns the virtual offset of the next reading byte while
reading and the next writing byte while writing.
A major bottleneck of processing a BGZF file is the inflation and deflation
process. The throughput of reading data is ~100 MiB/s, which is quite slower
than that of raw reading from a file. In order to alleviate this problem, this
package supports parallelized inflation when reading compressed data. This
requires the support of multi-threading introduced in Julia 0.5. The
JULIA_NUM_THREADS environmental variable sets the number of threads used for
bash-3.2$ JULIA_NUM_THREADS=2 julia -q
julia> using Base.Threads
We appreciate contributions from users including reporting bugs, fixing issues, improving performance and adding new features.
Take a look at the contributing files detailed contributor and maintainer guidelines, and code of conduct.
We also welcome financial contributions in full transparency on our open collective. Anyone can file an expense. If the expense makes sense for the development the core contributors and the person who filed the expense will be reimbursed.
Thank you to all our backers and sponsors!
Love our work and community? Become a backer.
Does your company use BioJulia? Help keep BioJulia feature rich and healthy by sponsoring the project. Your logo will show up here with a link to your website.