Docstrings · ArrowTypes.jl

ArrowTypes.ArrowTypes — Module

The ArrowTypes module provides the ArrowTypes.ArrowType interface trait that objects can define in order to signal how they should be serialized in the arrow format.

ArrowTypes.ArrowKind — Type

ArrowTypes.ArrowKind(T)

For a given type T, define its "arrow type kind", or the general category of arrow types it should be treated as. Must be one of:

ArrowTypes.NullKind: Missing is the only type defined as NullKind
ArrowTypes.PrimitiveKind: <:Integer, <:AbstractFloat, along with Arrow.Decimal, and the various Arrow.ArrowTimeType subtypes
ArrowTypes.BoolKind: only Bool
ArrowTypes.ListKind: any AbstractString or AbstractArray
ArrowTypes.FixedSizeListKind: NTuple{N, T}
ArrowTypes.MapKind: any AbstractDict
ArrowTypes.StructKind: any NamedTuple or plain struct (mutable or otherwise)
ArrowTypes.UnionKind: any Union
ArrowTypes.DictEncodedKind: array types that implement the DataAPI.refarray interface

The ArrowKinds listed above translate to different ways of physically storing data in the arrow data format. See the docs for each to judge whether it might be an appropriate fit for a custom type. Note that custom types need to satisfy any additional "interface methods" required by the various ArrowKind types. By default, a Julia type declared with primitive type ... is considered a PrimitiveKind, and one declared with struct or mutable struct is considered a StructKind. Also note that types will rarely need to define ArrowKind; it is much more common to define ArrowType(T) and toarrow(x::T) to transform T to a natively supported arrow type, which will already have its ArrowKind defined.
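As a quick illustration of the categories above, the following sketch queries ArrowKind for a few built-in types and for one plain struct. It assumes the ArrowTypes package is installed; Coordinate is a hypothetical type introduced only for this example, and the kinds one should expect follow the list above.

```julia
using ArrowTypes

# Hypothetical plain struct; with no extra definitions it should fall under StructKind.
struct Coordinate
    lat::Float64
    lon::Float64
end

# Each call returns an instance of one of the ArrowKind subtypes listed above.
for T in (Missing, Int64, Float64, Bool, String, Vector{Int}, Coordinate)
    kind = ArrowTypes.ArrowKind(T)
    println(rpad(string(T), 16), " => ", typeof(kind))
end
```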

ArrowTypes.BoolKind — Type

BoolKind data is stored with values packed down to individual bits; so instead of a traditional Bool being 1 byte/8 bits, 8 Bool values would be packed into a single byte
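The bit packing described above can be sketched in plain Julia. This is only an illustration of the idea, not the package's implementation; packbits and getbit are hypothetical helper names, and the least-significant-bit-first ordering is an assumption about the layout.

```julia
# Pack Bools into bytes, 8 values per byte, least-significant bit first (assumed order).
function packbits(values::AbstractVector{Bool})
    bytes = zeros(UInt8, cld(length(values), 8))
    for (i, v) in enumerate(values)
        if v
            byte, bit = divrem(i - 1, 8)
            bytes[byte + 1] |= UInt8(1) << bit
        end
    end
    return bytes
end

# Read the i-th (1-based) Bool back out of the packed bytes.
function getbit(bytes::Vector{UInt8}, i::Integer)
    byte, bit = divrem(i - 1, 8)
    return (bytes[byte + 1] >> bit) & 0x01 == 0x01
end

flags = [true, false, true, true, false, false, false, true, true]
packed = packbits(flags)                    # 9 Bools fit into 2 bytes
unpacked = [getbit(packed, i) for i in 1:length(flags)]
```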

ArrowTypes.DictEncodedKind — Type

DictEncodedKind data store a small pool of unique values in one buffer, along with a full-length buffer of integer indices into that value pool
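A minimal sketch of the pool-plus-indices idea, in plain Julia with no package dependencies. The helpers dictencode and dictdecode are hypothetical names, and the use of zero-based Int32 indices is an assumption made for the illustration.

```julia
# Split a vector into a pool of unique values and one index per original element.
function dictencode(values::AbstractVector{T}) where {T}
    pool = T[]
    lookup = Dict{T,Int32}()
    indices = Int32[]
    for v in values
        idx = get!(lookup, v) do
            push!(pool, v)
            Int32(length(pool) - 1)   # zero-based position in the pool (assumed convention)
        end
        push!(indices, idx)
    end
    return pool, indices
end

# Rebuild the original vector from the pool and the indices.
dictdecode(pool, indices) = [pool[i + 1] for i in indices]

colors = ["red", "green", "red", "red", "blue", "green"]
pool, indices = dictencode(colors)
```

With many repeated values, the pool stays small while the index buffer grows with the data, which is where the space saving comes from.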

ArrowTypes.FixedSizeListKind — Type

FixedSizeListKind data are stored in a single contiguous buffer; individual elements can be computed based on the fixed size of the lists

ArrowTypes.ListKind — Type

ListKind data are stored in two separate buffers: the first contains all the original data elements flattened into one long buffer; the second contains offsets into the first buffer that mark which elements make up each original array element
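The two-buffer layout can be sketched as follows in plain Julia. The helper names flattenlists and getlist are hypothetical, and the convention of n + 1 zero-based offsets (so that element i spans offsets[i] up to offsets[i + 1]) is an assumption made for this illustration.

```julia
# Flatten a vector of vectors into one data buffer plus an offsets buffer.
function flattenlists(lists::AbstractVector{<:AbstractVector{T}}) where {T}
    data = T[]
    offsets = Int32[0]
    for list in lists
        append!(data, list)
        push!(offsets, Int32(length(data)))   # end of this list in the flat buffer
    end
    return data, offsets
end

# Recover the i-th (1-based) list from the two buffers.
getlist(data, offsets, i) = data[(offsets[i] + 1):offsets[i + 1]]

lists = [[1, 2, 3], Int[], [4, 5]]
data, offsets = flattenlists(lists)
```

An empty list shows up as two equal consecutive offsets, so no extra marker is needed for it.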

ArrowTypes.MapKind — Type

MapKind data are stored similarly to ListKind: elements are flattened, and a second offsets buffer contains the length data for each individual element

ArrowTypes.NullKind — Type

NullKind data is actually not physically stored since the data is constant; just the length is needed

ArrowTypes.PrimitiveKind — Type

PrimitiveKind data is stored as plain bits in a single contiguous buffer

ArrowTypes.StructKind — Type

StructKind data are stored in separate buffers for each field of the struct
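As a rough picture of the per-field layout, the sketch below splits a vector of structs into one plain vector per field and rebuilds it again. Point is a hypothetical type, and ordinary Julia vectors stand in for the arrow buffers.

```julia
# Hypothetical struct used only for this illustration.
struct Point
    x::Float64
    y::Float64
end

points = [Point(1.0, 2.0), Point(3.0, 4.0), Point(5.0, 6.0)]

# One buffer per field, each as long as the original vector.
xs = [p.x for p in points]
ys = [p.y for p in points]

# Rebuild the structs by walking the field buffers in lockstep.
rebuilt = [Point(x, y) for (x, y) in zip(xs, ys)]
```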

ArrowTypes.UnionKind — Type

UnionKind data are stored either in a separate, compacted buffer for each union type (dense), or in full-length buffers for each union type (sparse)

ArrowTypes.ArrowType — Function

ArrowTypes.ArrowType(T) = S

Interface method to define the natively supported arrow type S that a given type T should be converted to before serializing. Useful when a custom type wants a "serialization hook" or otherwise needs to be transformed/converted into a natively supported arrow type for serialization. If a type defines ArrowType, it must also define a corresponding ArrowTypes.toarrow(x::T) method which does the actual conversion from T to S. Note that custom structs defined like struct T or mutable struct T are natively supported in serialization, so unless additional transformation/customization is desired, a custom type T can serialize with no ArrowType definition (by default, each field of a struct is serialized, using the results of fieldnames(T) and getfield(x, i)). Note that defining these methods only deals with custom serialization to the arrow format; to be able to deserialize custom types at all, see the docs for ArrowTypes.arrowname, ArrowTypes.arrowmetadata, ArrowTypes.JuliaType, and ArrowTypes.fromarrow.
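A minimal sketch of the serialization hook described above, assuming the ArrowTypes package is installed. Celsius is a hypothetical wrapper type introduced for illustration; only ArrowType and toarrow are defined here, so this covers the serialization direction alone.

```julia
using ArrowTypes

# Hypothetical custom type: a thin wrapper around a Float64.
struct Celsius
    degrees::Float64
end

# Declare that Celsius should be serialized as a plain Float64 ...
ArrowTypes.ArrowType(::Type{Celsius}) = Float64
# ... and provide the conversion that actually produces that Float64.
ArrowTypes.toarrow(x::Celsius) = x.degrees

native = ArrowTypes.toarrow(Celsius(21.5))
```

Without these two definitions, Celsius would presumably still serialize as a struct with a single degrees field; the hook merely swaps that for a bare Float64 column.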

ArrowTypes.JuliaType — Function

ArrowTypes.JuliaType(::Val{Symbol(name)}, ::Type{S}, arrowmetadata::String) = T

Interface method to define the custom Julia logical type T that a serialized metadata label should be converted to when deserializing. When reading arrow data, if a logical type label is encountered for a column, the reader calls ArrowTypes.JuliaType(Val(Symbol(name)), S, arrowmetadata) to see if a Julia type has been "registered" for deserialization. The name used when defining the method must correspond to the name used when defining ArrowTypes.arrowname(::Type{T}) = Symbol(name). The use of Val(Symbol(...)) allows overloading a method on a specific logical type label. The 2nd argument, S, is the native arrow serialized type. This can be useful for parametric Julia types that wish to correctly parameterize their custom type based on what was serialized. The 3rd argument, arrowmetadata, is any metadata that was stored when the logical type was serialized, as the result of calling ArrowTypes.arrowmetadata(T). Note that the 2nd and 3rd arguments are optional when overloading if unneeded. When defining ArrowTypes.arrowname and ArrowTypes.JuliaType, you may also want to implement ArrowTypes.fromarrow in order to customize how a custom type T should be constructed from the native arrow data type. See its docs for more details.
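The pieces involved in a round trip can be sketched together as follows, assuming the ArrowTypes package is installed. UserId and the constant USERID_NAME are hypothetical names for this example; the one-argument JuliaType overload relies on the note above that the 2nd and 3rd arguments are optional.

```julia
using ArrowTypes

# Hypothetical custom type stored as a plain Int64.
struct UserId
    value::Int64
end

# Logical type label, using the conventional Julia-specific prefix.
const USERID_NAME = Symbol("JuliaLang.UserId")

# Serialization side: native type, conversion, and label.
ArrowTypes.ArrowType(::Type{UserId}) = Int64
ArrowTypes.toarrow(x::UserId) = x.value
ArrowTypes.arrowname(::Type{UserId}) = USERID_NAME

# Deserialization side: map the label back to the Julia type.
ArrowTypes.JuliaType(::Val{USERID_NAME}) = UserId
# Spelled out for clarity; the documented default T(x) would behave the same here.
ArrowTypes.fromarrow(::Type{UserId}, x::Int64) = UserId(x)

roundtrip = ArrowTypes.fromarrow(UserId, ArrowTypes.toarrow(UserId(42)))
```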

ArrowTypes.arrowmetadata — Function

ArrowTypes.arrowmetadata(T) => String

Interface method to provide additional logical type metadata when serializing extension types. ArrowTypes.arrowname provides the logical type name, which may be all that's needed to return a proper Julia type from ArrowTypes.JuliaType, but some custom types may, for example, have type parameters that aren't inferred from or based on fields. In order to fully recreate these kinds of types when deserializing, these type parameters can be stored by defining ArrowTypes.arrowmetadata(::Type{T}) = "type_param". The stored string is then available when overloading ArrowTypes.JuliaType(::Val{Symbol(name)}, ::Type{S}, arrowmetadata::String).
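A sketch of how a non-field type parameter might be carried through the metadata string, assuming the ArrowTypes package is installed. Quantity and QUANTITY_NAME are hypothetical, and the example assumes the parameter U is always a Symbol so that it can be stored as a String and rebuilt with Symbol(meta).

```julia
using ArrowTypes

# Hypothetical type whose parameter U cannot be recovered from its single field.
struct Quantity{U}
    value::Float64
end

const QUANTITY_NAME = Symbol("JuliaLang.Quantity")

ArrowTypes.ArrowType(::Type{Quantity{U}}) where {U} = Float64
ArrowTypes.toarrow(x::Quantity) = x.value
ArrowTypes.arrowname(::Type{Quantity{U}}) where {U} = QUANTITY_NAME
# Store the unit parameter as metadata (assumes U is a Symbol).
ArrowTypes.arrowmetadata(::Type{Quantity{U}}) where {U} = String(U)

# Rebuild the fully parameterized type from the stored metadata string.
ArrowTypes.JuliaType(::Val{QUANTITY_NAME}, ::Type{S}, meta::String) where {S} = Quantity{Symbol(meta)}
ArrowTypes.fromarrow(::Type{Quantity{U}}, x::Float64) where {U} = Quantity{U}(x)

meta = ArrowTypes.arrowmetadata(Quantity{:meters})
T = ArrowTypes.JuliaType(Val(QUANTITY_NAME), Float64, meta)
restored = ArrowTypes.fromarrow(T, 2.5)
```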

ArrowTypes.arrowname — Function

ArrowTypes.arrowname(T) = Symbol(name)

Interface method to define the logical type "label" for a custom Julia type T. Names are global for an entire arrow dataset, and conventionally, custom types just use their type name along with a Julia-specific prefix; for example, for a custom type Foo, I would define ArrowTypes.arrowname(::Type{Foo}) = Symbol("JuliaLang.Foo"). This ensures other language implementations won't get confused and can safely ignore the logical type label. When arrow stores non-native data, it must still be stored as a native data type, but can have type metadata tied to the data that labels the original logical type it originated from. This enables the conversion of native data back to the logical type when deserializing, as long as the deserializer has the same definitions that were in place when the data was serialized. Namely, the current Julia session will need the appropriate ArrowTypes.JuliaType and ArrowTypes.fromarrow definitions in order to know how to convert the native data to the original logical type. See the docs for those interface methods in order to ensure a complete implementation. Also see the accompanying ArrowTypes.arrowmetadata docs on providing additional metadata about a custom logical type that may be necessary to fully re-create a Julia type (e.g. non-field-based type parameters).

ArrowTypes.default — Function

There are a couple of places when writing arrow buffers where we need to write a "dummy" value; it doesn't really matter what we write, but we need to write something of a specific type. So each supported writing type needs to define default.

ArrowTypes.fromarrow — Function

ArrowTypes.fromarrow(::Type{T}, x::S) => T

Interface method that provides a "deserialization hook" for a custom type T to be constructed from the native arrow type S. The T and S types must correspond to the definitions used in ArrowTypes.ArrowType(::Type{T}) = S. This is a paired method with ArrowTypes.toarrow.

The default definition is ArrowTypes.fromarrow(::Type{T}, x) = T(x), so if that works for a custom type already, no additional overload is necessary. A few ArrowKinds have/allow slightly more custom overloads for their fromarrow methods:

ListKind{true}: for String types, they may overload fromarrow(::Type{T}, ptr::Ptr{UInt8}, len::Int) = ... to avoid materializing a String
StructKind:
- May overload fromarrow(::Type{T}, x...) where individual fields are passed as separate positional arguments; so if my custom type Interval has two fields first and last, then I'd overload like ArrowTypes.fromarrow(::Type{Interval}, first, last) = .... Note the default implementation is ArrowTypes.fromarrow(::Type{T}, x...) = T(x...), so if your type already accepts all arguments in a constructor no additional fromarrow method should be necessary (default struct constructors have this behavior).
- Alternatively, may overload fromarrowstruct(::Type{T}, ::Val{fnames}, x...), where fnames is a tuple of the field names corresponding to the values in x. This approach is useful when you need to implement deserialization in a manner that is agnostic to the field order used by the serializer. When implemented, fromarrowstruct takes precedence over fromarrow in StructKind deserialization.
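Both StructKind options can be sketched for the Interval example from the text, assuming an ArrowTypes version that provides fromarrowstruct. The field types and the NamedTuple-based lookup are choices made for this illustration, not something the package prescribes.

```julia
using ArrowTypes

# The Interval type from the text, with Float64 fields assumed for the example.
struct Interval
    first::Float64
    last::Float64
end

# Positional form: fields arrive in the order used by the serializer.
# (The documented default T(x...) would already cover this simple case.)
ArrowTypes.fromarrow(::Type{Interval}, first, last) = Interval(first, last)

# Order-agnostic form: pair the values with their field names, then pick by name.
function ArrowTypes.fromarrowstruct(::Type{Interval}, ::Val{fnames}, x...) where {fnames}
    fields = NamedTuple{fnames}(x)
    return Interval(fields.first, fields.last)
end

a = ArrowTypes.fromarrow(Interval, 1.0, 2.0)
# Same interval, but with the fields supplied in the opposite order.
b = ArrowTypes.fromarrowstruct(Interval, Val((:last, :first)), 2.0, 1.0)
```

The second form costs a little more code but keeps working if the serialized field order ever differs from the struct definition.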

ArrowTypes.toarrow — Function

ArrowTypes.toarrow(x::T) => S

Interface method to perform the actual conversion from an object x of type T to the type S. T and S must match the types used when defining ArrowTypes.ArrowType(::Type{T}) = S. Hence, S is the natively supported arrow type that T is converted to in order to enable serialization. See ArrowTypes.ArrowType docs for more details. This enables custom objects to be serialized as a natively supported arrow data type.