# Internals

## FilePathsBase Integration
This package uses FilePathsBase.jl to infer the type of path to an object to read and the appropriate way to read it. An `AbstractPath` object is passed to a `Fetcher` object, which provides a simple, minimal interface for fetching contiguous blocks of data from the source. The `FileManager` object is used to track which files are referred to by the parquet metadata.
### `Parquet2.FileManager` — Type

    FileManager

Data structure containing references to `CacheVector` objects providing an interface to access any file in a parquet file directory tree. It is agnostic to the directory schema.
## Implementing New Fetchers

The package can be easily extended by implementing new `Fetcher` objects. These provide, primarily, two methods: `Base.length` and `Parquet2.fetch`. The former gives the total size of the source object in bytes, while the latter reads contiguous blocks of data from it.
### `Parquet2.Fetcher` — Type

    Fetcher

Abstract type for retrieving byte arrays from arbitrary sources. Used by `CacheVector` for retrieving data. Each `Fetcher` represents a single object (i.e. addressable in a contiguous range) of a file-system or file-system-like interface.
#### Fetcher Interface

**Required methods**

- `Parquet2.fetch(f, ab::AbstractUnitRange)`: retrieve a `Vector{UInt8}` identical to bytes `a:b` (1-based index, inclusive) of the underlying object.
- `Base.length(f)`: the total number of bytes of the object the fetcher fetches data from.

**Optional methods**

- `Fetcher(p::AbstractPath; kw...)`: defines an appropriate fetcher for the path type.
- `PageBuffer(f, a::Integer, b::Integer)`: an object for accessing views of the underlying data. The provided buffer must be a `Vector{UInt8}`. The definition defaults to `PageBuffer(fetch(f, a:b), a, b)`. See `PageBuffer`.
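The required methods above can be sketched for a hypothetical custom fetcher as follows. Only `Parquet2.Fetcher`, `Parquet2.fetch`, and `Base.length` come from the documented interface; the `RangeFileFetcher` name, its fields, and the file-based implementation are invented for illustration:

```julia
using Parquet2

# Hypothetical fetcher reading byte ranges directly from a local file.
struct RangeFileFetcher <: Parquet2.Fetcher
    path::String
    nbytes::Int
end

RangeFileFetcher(path::AbstractString) = RangeFileFetcher(String(path), filesize(path))

# Required: total number of bytes of the underlying object.
Base.length(f::RangeFileFetcher) = f.nbytes

# Required: return bytes a:b (1-based, inclusive) as a Vector{UInt8}.
function Parquet2.fetch(f::RangeFileFetcher, ab::AbstractUnitRange)
    open(f.path, "r") do io
        seek(io, first(ab) - 1)
        read(io, length(ab))
    end
end
```

With these two methods defined, a `CacheVector` can use the new fetcher like any other source.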
## Data Caching

This package does everything as lazily as possible by default. While files on the local file system are memory mapped by default, if memory mapping is disabled or if reading from a remote file system, only the top-level metadata is read in fully. Everything else is read in by blocks. By default these are approximately $100~\textrm{MB}$ each (exactly `100*1024^2` bytes). The `max_subsets` argument can be provided to the `Dataset` to dictate the maximum number of these blocks that can be loaded at any given time (by default this is effectively unlimited).
The caching is achieved by a `CacheVector`, which provides an interface for the data buffers from which all data is read. A `CacheVector` can be accessed like a normal `AbstractVector`, but internally it caches only subsets of the data at a time.

The `CacheVector` uses `Fetcher` objects to fetch the blocks, and in this way provides an interface for loading data by blocks from any source that supports this. Sources that do not can be loaded in as a single vector.

`CacheVector` has extremely inefficient indexing for individual indices, but scales favorably when indexed by large contiguous blocks. When loading parquet files, `CacheVector`s are only used for loading entire pages at a time (usually smaller than the blocks). Note that `CacheVector` indexing is also much less efficient when hitting block boundaries, so some alignment improves the performance of this package.
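For instance, block-wise versus element-wise access might look as follows (a sketch; the `VectorFetcher(buf)` constructor is assumed here, and the sizes are arbitrary):

```julia
using Parquet2

buf = rand(UInt8, 10^6)

# Assumed: VectorFetcher provides the Fetcher interface over an in-memory array.
f = Parquet2.VectorFetcher(buf)
cv = Parquet2.CacheVector(f; subset_length=2^16, max_subsets=4)

# Efficient: one large contiguous block (spans only a few cached subsets).
block = cv[1_000:200_000]

# Inefficient: element-wise indexing pays the cache lookup on every access.
byte = cv[12_345]
```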
### `Parquet2.CacheVector` — Type

    CacheVector <: AbstractVector{UInt8}

An array that automatically caches blocks of data. The envisioned use case is selectively reading subsets of a large remote file. Contains a `Fetcher` object for retrieving data to be cached.

Note that the performance of indexing a `CacheVector` is quite poor for single indices. It is intended that a `CacheVector` be indexed by the largest subsets possible.

**Constructors**

    CacheVector(f::Fetcher; subset_length=default_subset_length(f), max_subsets=typemax(Int), preload=false)

**Arguments**

- `f`: a `Fetcher` object for retrieving data.
- `subset_length::Integer`: the length of each subset. This controls the size of blocks retrieved in a single call.
- `max_subsets::Integer`: the number of subsets that can be stored before they start to be evicted.
## API

### `Parquet2.IOFetcher` — Type

    IOFetcher <: Fetcher

Provides a `Fetcher` interface for an `IO` stream object.
### `Parquet2.GenericFetcher` — Type

    GenericFetcher

Wraps a function which fetches byte arrays to provide a `Fetcher` interface.
### `Parquet2.VectorFetcher` — Type

    VectorFetcher <: Fetcher

Provides a `Fetcher` interface for an in-memory (or memory-mapped) array.
### `Parquet2.ParquetType` — Type

    ParquetType

Describes a type specified by the parquet standard metadata.
### `Parquet2.PageBuffer` — Type

    PageBuffer

Represents a view into a byte buffer that guarantees the underlying data is a `Vector{UInt8}`.
### `Parquet2.PageIterator` — Type

    PageIterator

Object for iterating through the pages of a column. Executing the iteration is essentially binary schema discovery and may invoke reading from the data source. Normally, once a full iteration has been performed, `Page` objects are stored by the `Column`, making future access cheaper, and this object can be discarded.
### `Parquet2.PageLoader` — Type

    PageLoader

Object which wraps a `Column` and a `Page` for loading data. This is the object from which all parquet data beneath the metadata is ultimately loaded.

**Development Notes**

We badly want to get rid of this. The main reason this is not possible is that in the original `DataPageHeader` the length of the repetition and definition levels is not knowable a priori. This has the consequence that reading from the page is stateful, i.e. one needs to know where the data starts, and this is only possible after reading the levels in the legacy format. Since it will presumably never be possible to drop support for `DataPageHeader`, it will presumably never be possible to eliminate this frustration.
### `Parquet2.BitUnpackVector` — Type

    BitUnpackVector{𝒯}

A vector type that unpacks underlying data into values of type `𝒯` when indexed.
### `Parquet2.PooledVector` — Type

    PooledVector <: AbstractVector

A simple implementation of a "pooled" (or "dictionary-encoded") rank-1 array, providing read-only access. The underlying references and value pool are required to have the form naturally returned when reading from a parquet file.
### `Parquet2.ParqRefVector` — Type

    ParqRefVector <: AbstractVector

An array wrapper for an `AbstractVector` which acts as a reference array for the wrapped vector for its dictionary encoding.

Indexing this returns a `UInt32` reference, unless the underlying vector is `missing` at that index, in which case it returns `missing`.
### `Parquet2.decompressedpageview` — Function

    decompressedpageview

Creates the view of page data, handling decompression appropriately. For `DataPageHeaderV2` this must be handled carefully, since the levels bytes are not compressed. For the old data page format, this simply decompresses the entire buffer. Note that this can allocate if decompression is needed; otherwise it is just a view.

TODO: This is currently inefficient for `DataPageHeaderV2`, as it causes extra allocations by concatenating the levels.
### `Parquet2.bitpack` — Function

    bitpack(v::AbstractVector, w)
    bitpack(io::IO, w)

Pack the first `w` bits of each value of `v` into the bytes of a new `Vector{UInt8}` buffer.
### `Parquet2.bitpack!` — Function

    bitpack!(o::AbstractVector{UInt8}, a, v::AbstractVector, w::Integer)
    bitpack!(io::IO, v::AbstractVector, w::Integer)

Pack the first `w` bits of each value of `v` into bytes in the vector `o`, starting from index `a`. If the values of `v` have any non-zero bits beyond the first `w`, they will be truncated.

**WARNING**: the bytes of `o` to be written to must be initialized to zero, or the result may be corrupt.
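To make the packing order concrete, here is a minimal reimplementation of the documented behavior (illustrative only, not the package's code; it assumes the least-significant-bit-first packing that parquet uses):

```julia
# Pack the low `w` bits of each value of `v` into a fresh byte buffer,
# filling each byte from its least significant bit upward.
function bitpack_demo(v::AbstractVector{<:Integer}, w::Integer)
    o = zeros(UInt8, cld(length(v) * w, 8))
    bit = 0  # number of bits already written
    for x in v
        for j in 0:(w - 1)
            if (x >> j) & 1 == 1
                o[bit ÷ 8 + 1] |= UInt8(1) << (bit % 8)
            end
            bit += 1
        end
    end
    return o
end

bitpack_demo([1, 2, 3], 2)  # 0b11_10_01 packed LSB-first -> [0x39]
```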
### `Parquet2.bitmask` — Function

    bitmask(𝒯, α, β)
    bitmask(𝒯, β)

Create a bit mask of type `𝒯 <: Integer` where bits `α` to `β` (inclusive) are `1` and the rest are `0`, where bit `1` is the least significant bit. If only one argument is given, it is taken as the end position `β`.
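The semantics can be reproduced with a few lines of ordinary Julia (a sketch matching the documented behavior, not the package's implementation):

```julia
# Bits α through β (1 = least significant) set, all others clear.
# `widen` avoids overflow when the mask spans the full width of 𝒯.
bitmask_demo(::Type{𝒯}, α::Integer, β::Integer) where {𝒯<:Integer} =
    𝒯(((one(widen(𝒯)) << (β - α + 1)) - 1) << (α - 1))
bitmask_demo(::Type{𝒯}, β::Integer) where {𝒯<:Integer} = bitmask_demo(𝒯, 1, β)

bitmask_demo(UInt8, 2, 4)  # 0b00001110
```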
### `Parquet2.bitjustify` — Function

    bitjustify(k, α, β)

Move bits `α` through `β` (inclusive) to the least significant bits of an integer of type `k`.
### `Parquet2.bitwidth` — Function

    bitwidth(n::Integer)

Compute the width in bits needed to encode the integer `n`, truncating leading zeros. For example, `1` has a width of 1, `3` has a width of 2, `8` has a width of 4, et cetera.
### `Parquet2.bytewidth` — Function

    bytewidth(n::Integer)

Compute the width in bytes needed to encode the integer `n`, truncating leading zeros beyond the nearest byte boundary. For example, anything expressible as a `UInt8` has a byte width of 1, anything expressible as a `UInt16` has a byte width of 2, et cetera.
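Both quantities are straightforward to reproduce with Julia's `leading_zeros` (a sketch of the documented behavior for nonnegative inputs, not the package's code):

```julia
# Bits needed to represent n with leading zeros stripped (0 for n == 0).
bitwidth_demo(n::Integer) = 8 * sizeof(n) - leading_zeros(n)

# Bytes needed, rounded up to the nearest byte boundary (at least 1).
bytewidth_demo(n::Integer) = max(1, cld(bitwidth_demo(n), 8))

bitwidth_demo(8)     # 4
bytewidth_demo(256)  # 2
```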
### `Parquet2.readfixed` — Function

    readfixed(io, 𝒯, N, v=zero(𝒯))
    readfixed(w::AbstractVector{UInt8}, 𝒯, N, i=1, v=zero(𝒯))

Read a `𝒯 <: Integer` from the first `N` bytes of `io`. This is for reading integers which have had their leading zeros truncated.
### `Parquet2.writefixed` — Function

    writefixed(io::IO, x::Integer)

Write the integer `x` using the minimal number of bytes needed to accurately represent it, i.e. by writing `bytewidth(x)` bytes.
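The round trip can be sketched in plain Julia (illustrative only; it assumes the little-endian byte order parquet uses for plain-encoded integers, and the `_demo` names are not part of the package):

```julia
# Write nonnegative x using only its minimal little-endian bytes.
function writefixed_demo(io::IO, x::Integer)
    nb = max(1, cld(8 * sizeof(x) - leading_zeros(x), 8))
    for k in 0:(nb - 1)
        write(io, UInt8((x >> (8k)) & 0xff))
    end
    return nb  # number of bytes written
end

# Read a 𝒯 back from N little-endian bytes (leading zeros truncated).
function readfixed_demo(io::IO, ::Type{𝒯}, N::Integer) where {𝒯<:Integer}
    v = zero(𝒯)
    for k in 0:(N - 1)
        v |= 𝒯(read(io, UInt8)) << (8k)
    end
    return v
end

io = IOBuffer()
writefixed_demo(io, 0x00010203)  # writes 3 bytes
seekstart(io)
readfixed_demo(io, UInt32, 3)    # 0x00010203
```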
### `Parquet2.HybridIterator` — Type

    HybridIterator

An iterable object for iterating over the parquet "hybrid encoding" described in the parquet format specification. Each item in the collection is an `AbstractVector` with decoded values.
### `Parquet2.encodehybrid_bitpacked` — Function

    encodehybrid_bitpacked(io::IO, v::AbstractVector, w=bitwidth(maximum(v)); write_preface=true, additional_bytes=0)

Bit-pack `v` and encode it to `io` such that it can be read with `decodehybrid`. This encodes all data in `v` as a single bitpacked run.

If `write_preface`, the `Int32` indicating the number of payload bytes will be written, with `additional_bytes` additional payload bytes.

**WARNING**: Parquet's horribly confusing encoding format does not appear to support arbitrary combinations of bitpacked encoding with run-length encoding, because the number of bitpacked values cannot in general be uniquely determined.
### `Parquet2.encodehybrid_rle` — Function

    encodehybrid_rle(io::IO, x::Integer, n::Integer; write_preface=false, additional_bytes=0)

Run-length encode a sequence of `n` copies of `x` to `io`.

If `write_preface`, the `Int32` indicating the number of payload bytes will be written, with `additional_bytes` additional payload bytes.

**WARNING**: This cannot be combined arbitrarily with `encodehybrid_bitpacked`; see that function's documentation.

    encodehybrid_rle(io::IO, v::AbstractVector{<:Integer})

Write the vector `v` to `io` using the parquet run-length encoding.
### `Parquet2.maxdeflevel` — Function

    maxdeflevel(r::SchemaNode, p)

Compute the maximum definition level for the node at path `p` from the root node `r`.

### `Parquet2.maxreplevel` — Function

    maxreplevel(r::SchemaNode, p)

Compute the maximum repetition level for the node at path `p` from the root node `r`.
### `Parquet2.leb128encode` — Function

    leb128encode(n::Unsigned)
    leb128encode(io::IO, n::Unsigned)

Encode the integer `n` as a byte array according to the LEB128 encoding scheme.

### `Parquet2.leb128decode` — Function

    leb128decode(𝒯, v, k)

Decode `v` (from index `k`) to an integer of type `𝒯 <: Unsigned` according to the LEB128 encoding scheme.

Returns `o, j` where `o` is the decoded value and `j` is the index of `v` after reading (i.e. the encoded bytes are contained in the data from `k` to `j-1`, inclusive).

    leb128decode(𝒯, io)

Decode from `io` to an integer of type `𝒯 <: Unsigned` according to the LEB128 encoding scheme.
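LEB128 is simple enough to show end to end; this sketch follows the documented return convention of `leb128decode` (demo code, not the package's implementation):

```julia
# Encode: 7 data bits per byte, high bit set on every byte but the last.
function leb128encode_demo(n::Unsigned)
    out = UInt8[]
    while true
        b = UInt8(n & 0x7f)
        n >>= 7
        n == 0 && (push!(out, b); return out)
        push!(out, b | 0x80)
    end
end

# Decode from index k; returns (value, index after the last byte read).
function leb128decode_demo(::Type{𝒯}, v::AbstractVector{UInt8}, k::Integer=1) where {𝒯<:Unsigned}
    o = zero(𝒯)
    shift = 0
    j = k
    while true
        b = v[j]
        j += 1
        o |= 𝒯(b & 0x7f) << shift
        b & 0x80 == 0 && return (o, j)
        shift += 7
    end
end

leb128encode_demo(UInt32(300))                # [0xac, 0x02]
leb128decode_demo(UInt32, UInt8[0xac, 0x02])  # (0x0000012c, 3)
```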
### `Parquet2.OptionSet` — Type

    OptionSet

Abstract type for storing options for reading or writing parquet data. See `ReadOptions` and `WriteOptions`.

### `Parquet2.ReadOptions` — Type

    ReadOptions <: OptionSet

A struct containing all options relevant for reading parquet files. Specific options are documented in `Dataset`.

### `Parquet2.WriteOptions` — Type

    WriteOptions <: OptionSet

A struct containing all options relevant for writing parquet files. Specific options are documented in `FileWriter`.