ArrowTypes.ArrowTypes
— Module
The ArrowTypes module provides the ArrowTypes.ArrowType interface trait that objects can define in order to signal how they should be serialized in the arrow format.
ArrowTypes.ArrowKind
— Type
ArrowTypes.ArrowKind(T)
For a given type T, define its "arrow type kind", or the general category of arrow types it should be treated as. Must be one of:
- ArrowTypes.NullKind: Missing is the only type defined as NullKind
- ArrowTypes.PrimitiveKind: <:Integer, <:AbstractFloat, along with Arrow.Decimal, and the various Arrow.ArrowTimeType subtypes
- ArrowTypes.BoolKind: only Bool
- ArrowTypes.ListKind: any AbstractString or AbstractArray
- ArrowTypes.FixedSizeListKind: NTuple{N, T}
- ArrowTypes.MapKind: any AbstractDict
- ArrowTypes.StructKind: any NamedTuple or plain struct (mutable or otherwise)
- ArrowTypes.UnionKind: any Union
- ArrowTypes.DictEncodedKind: array types that implement the DataAPI.refarray interface
The ArrowKinds listed above translate to different ways to physically store data as supported by the arrow data format. See the docs for each for an idea of whether they might be an appropriate fit for a custom type. Note that custom types need to satisfy any additional "interface methods" as required by the various ArrowKind types. By default, if a type in Julia is declared like primitive type ... it is considered a PrimitiveKind, and if it is declared as struct or mutable struct it is considered a StructKind. Also note that types will rarely need to define ArrowKind; much more common is to define ArrowType(T) and toarrow(x::T) to transform T to a natively supported arrow type, which will already have its ArrowKind defined.
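As a rough illustration of the defaults described above, the following sketch queries ArrowKind for a few built-in types and for two custom types. Point and Word24 are hypothetical names made up for this example, and the sketch assumes the ArrowTypes package is installed.

```julia
using ArrowTypes

# Built-in types already have an ArrowKind defined.
ArrowTypes.ArrowKind(Missing)           # a NullKind
ArrowTypes.ArrowKind(Int64)             # a PrimitiveKind
ArrowTypes.ArrowKind(Bool)              # a BoolKind
ArrowTypes.ArrowKind(String)            # a ListKind
ArrowTypes.ArrowKind(Vector{Float64})   # a ListKind
ArrowTypes.ArrowKind(Dict{String,Int})  # a MapKind

# Hypothetical custom types, used only for illustration.
struct Point
    x::Float64
    y::Float64
end
primitive type Word24 24 end

ArrowTypes.ArrowKind(Point)   # plain struct: treated as a StructKind
ArrowTypes.ArrowKind(Word24)  # primitive type: treated as a PrimitiveKind
```

Neither custom type needed an ArrowKind definition of its own, which matches the note that defining ArrowKind is rarely necessary.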
ArrowTypes.BoolKind
— Type
BoolKind data is stored with values packed down to individual bits; so instead of a traditional Bool being 1 byte/8 bits, 8 Bool values would be packed into a single byte.
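The bit packing described above can be sketched in plain Julia. This is only an illustration of the layout idea, not the code Arrow.jl uses; pack_bits and get_bit are hypothetical helper names, and the sketch assumes least-significant-bit-first ordering within each byte, as in the arrow format.

```julia
# Pack a vector of Bool into bytes, 8 values per byte, least-significant bit first.
function pack_bits(values::AbstractVector{Bool})
    bytes = zeros(UInt8, cld(length(values), 8))
    for (i, v) in enumerate(values)
        if v
            byte = (i - 1) ÷ 8 + 1   # 1-based index of the byte holding value i
            bit = (i - 1) % 8        # 0-based bit position inside that byte
            bytes[byte] |= UInt8(1) << bit
        end
    end
    return bytes
end

# Read the i-th (1-based) Bool back out of the packed bytes.
function get_bit(bytes::Vector{UInt8}, i::Integer)
    byte = (i - 1) ÷ 8 + 1
    bit = (i - 1) % 8
    return (bytes[byte] >> bit) & 0x01 == 0x01
end

flags = [true, false, true, true, false, false, false, true, true, false]
packed = pack_bits(flags)   # 10 Bool values fit in 2 bytes: [0x8d, 0x01]
```

Ten values that would take 10 bytes as a Vector{Bool} take 2 bytes once packed.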
ArrowTypes.DictEncodedKind
— Type
DictEncodedKind data store a small pool of unique values in one buffer, with a full-length buffer of integer offsets into the small value pool.
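To make the pool-plus-offsets idea concrete, here is a minimal plain-Julia sketch. dict_encode and dict_decode are hypothetical helpers written for this example (they are not part of ArrowTypes), and the choice of 0-based Int32 offsets is an assumption made to mirror the arrow format.

```julia
# Split a vector into a pool of unique values and 0-based offsets into that pool.
function dict_encode(values::AbstractVector{T}) where {T}
    pool = T[]
    lookup = Dict{T,Int32}()
    indices = Int32[]
    for v in values
        # Assign the next free pool slot the first time a value is seen.
        idx = get!(lookup, v) do
            push!(pool, v)
            Int32(length(pool) - 1)
        end
        push!(indices, idx)
    end
    return pool, indices
end

# Rebuild the original values from the pool and the offsets.
dict_decode(pool, indices) = [pool[i + 1] for i in indices]

colors = ["red", "green", "red", "blue", "green", "red"]
pool, indices = dict_encode(colors)
# pool    == ["red", "green", "blue"]
# indices == Int32[0, 1, 0, 2, 1, 0]
```

The encoding pays off when the pool is much smaller than the column, as with repeated string labels.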
ArrowTypes.FixedSizeListKind
— Type
FixedSizeListKind data are stored in a single contiguous buffer; the position of each individual element can be computed from the fixed size of the lists.
ArrowTypes.ListKind
— Type
ListKind data are stored in two separate buffers; one buffer contains all the original data elements flattened into one long buffer; the 2nd buffer contains offsets into the 1st buffer that mark how many elements make up each original array element.
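The two-buffer layout can be sketched in plain Julia as follows. flatten_lists and get_list are hypothetical helpers for illustration only, and the sketch assumes 0-based Int32 offsets with one more entry than there are lists, which is how the arrow format lays out list offsets.

```julia
# Flatten a vector of vectors into one data buffer plus an offsets buffer.
function flatten_lists(lists::AbstractVector{<:AbstractVector{T}}) where {T}
    data = T[]
    offsets = Int32[0]
    for list in lists
        append!(data, list)
        push!(offsets, Int32(length(data)))  # end of this list in the flat buffer
    end
    return data, offsets
end

# Recover the i-th (1-based) list from the two buffers.
function get_list(data, offsets, i)
    start = offsets[i] + 1   # convert the 0-based offset to a 1-based index
    stop = offsets[i + 1]
    return data[start:stop]
end

lists = [[1, 2, 3], Int[], [4, 5]]
data, offsets = flatten_lists(lists)
# data    == [1, 2, 3, 4, 5]
# offsets == Int32[0, 3, 3, 5]
```

The length of list i is offsets[i + 1] - offsets[i], so the empty list in the middle shows up as two equal consecutive offsets.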
ArrowTypes.MapKind
— Type
MapKind data are stored similarly to ListKind, where elements are flattened, and a 2nd offsets buffer contains the individual list element length data.
ArrowTypes.NullKind
— Type
NullKind data is not physically stored, since the data is constant; only the length is needed.
ArrowTypes.PrimitiveKind
— Type
PrimitiveKind data is stored as plain bits in a single contiguous buffer.
ArrowTypes.StructKind
— Type
StructKind data are stored in separate buffers for each field of the struct.
ArrowTypes.UnionKind
— Type
UnionKind data are stored either in a separate, compacted buffer for each union type (dense), or in full-length buffers for each union type (sparse).
ArrowTypes.ArrowType
— Function
ArrowTypes.ArrowType(T) = S
Interface method to define the natively supported arrow type S that a given type T should be converted to before serializing. Useful when a custom type wants a "serialization hook" or otherwise needs to be transformed/converted into a natively supported arrow type for serialization. If a type defines ArrowType, it must also define a corresponding ArrowTypes.toarrow(x::T) method which does the actual conversion from T to S. Note that custom structs defined like struct T or mutable struct T are natively supported in serialization, so unless additional transformation/customization is desired, a custom type T can serialize with no ArrowType definition (by default, each field of a struct is serialized, using the results of fieldnames(T) and getfield(x, i)). Note that defining these methods only deals with custom serialization to the arrow format; to be able to deserialize custom types at all, see the docs for ArrowTypes.arrowname, ArrowTypes.arrowmetadata, ArrowTypes.JuliaType, and ArrowTypes.fromarrow.
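A minimal sketch of the ArrowType/toarrow pairing, assuming the ArrowTypes package is installed. Celsius is a hypothetical wrapper type invented for this example; the sketch covers the serialization side only.

```julia
using ArrowTypes

# Hypothetical custom type that should be written as a plain Float64 column.
struct Celsius
    degrees::Float64
end

# Declare the natively supported arrow type that Celsius converts to ...
ArrowTypes.ArrowType(::Type{Celsius}) = Float64
# ... and provide the conversion that actually produces it.
ArrowTypes.toarrow(x::Celsius) = x.degrees

S = ArrowTypes.ArrowType(Celsius)           # Float64
native = ArrowTypes.toarrow(Celsius(21.5))  # 21.5
```

Because Float64 already has an ArrowKind, no ArrowKind definition is needed for Celsius itself.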
ArrowTypes.JuliaType
— Function
ArrowTypes.JuliaType(::Val{Symbol(name)}, ::Type{S}, arrowmetadata::String) = T
Interface method to define the custom Julia logical type T that a serialized metadata label should be converted to when deserializing. When arrow data is read and a logical type label is encountered for a column, the reader calls ArrowTypes.JuliaType(Val(Symbol(name)), S, arrowmetadata) to see if a Julia type has been "registered" for deserialization. The name used when defining the method must correspond to the same name used when defining ArrowTypes.arrowname(::Type{T}) = Symbol(name). The use of Val(Symbol(...)) is to allow overloading a method on a specific logical type label. The 2nd argument S passed to JuliaType is the native arrow serialized type. This can be useful for parametric Julia types that wish to correctly parameterize their custom type based on what was serialized. The 3rd argument arrowmetadata is any metadata that was stored when the logical type was serialized as the result of calling ArrowTypes.arrowmetadata(T). Note the 2nd and 3rd arguments are optional when overloading if unneeded. When defining ArrowTypes.arrowname and ArrowTypes.JuliaType, you may also want to implement ArrowTypes.fromarrow in order to customize how a custom type T should be constructed from the native arrow data type. See its docs for more details.
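The registration described above might look like the following sketch, assuming the ArrowTypes package is installed. Percent and the label "JuliaLang.Percent" are hypothetical, chosen only for illustration. The example overloads the one-argument form of JuliaType, relying on the note that the 2nd and 3rd arguments are optional, and calls the interface functions directly rather than reading a real arrow file.

```julia
using ArrowTypes

# Hypothetical custom type stored as a plain Float64.
struct Percent
    value::Float64
end

const PERCENT_NAME = Symbol("JuliaLang.Percent")

# Serialization side: native type, conversion, and logical type label.
ArrowTypes.ArrowType(::Type{Percent}) = Float64
ArrowTypes.toarrow(x::Percent) = x.value
ArrowTypes.arrowname(::Type{Percent}) = PERCENT_NAME

# Deserialization side: map the label back to the Julia type.
ArrowTypes.JuliaType(::Val{PERCENT_NAME}) = Percent

# What a reader would do on encountering the label for a Float64 column:
T = ArrowTypes.JuliaType(Val(PERCENT_NAME), Float64, "")
restored = ArrowTypes.fromarrow(T, 0.25)   # default fromarrow calls Percent(0.25)
```

No fromarrow overload is needed here because the default struct constructor already accepts the native value.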
ArrowTypes.arrowmetadata
— Function
ArrowTypes.arrowmetadata(T) => String
Interface method to provide additional logical type metadata when serializing extension types. ArrowTypes.arrowname provides the logical type name, which may be all that's needed to return a proper Julia type from ArrowTypes.JuliaType, but some custom types may, for example, have type parameters that aren't inferred from or based on fields. In order to fully recreate these kinds of types when deserializing, these type parameters can be stored by defining ArrowTypes.arrowmetadata(::Type{T}) = "type_param". The stored metadata is then available when overloading ArrowTypes.JuliaType(::Val{Symbol(name)}, S, arrowmetadata::String).
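As a sketch of the non-field type parameter case, assuming the ArrowTypes package is installed: Quantity is a hypothetical type whose unit lives only in the type parameter U (a Symbol), so it has to travel through arrowmetadata to survive a round trip. The interface functions are called directly here instead of going through an arrow file.

```julia
using ArrowTypes

# Hypothetical type: the unit is a type parameter, not a field.
struct Quantity{U}
    value::Float64
end

const QUANTITY_NAME = Symbol("JuliaLang.Quantity")

ArrowTypes.ArrowType(::Type{<:Quantity}) = Float64
ArrowTypes.toarrow(x::Quantity) = x.value
ArrowTypes.arrowname(::Type{<:Quantity}) = QUANTITY_NAME

# Store the type parameter as a metadata string when serializing ...
ArrowTypes.arrowmetadata(::Type{Quantity{U}}) where {U} = String(U)
# ... and use it to rebuild the fully parameterized type when deserializing.
ArrowTypes.JuliaType(::Val{QUANTITY_NAME}, ::Type{Float64}, meta::String) =
    Quantity{Symbol(meta)}

q = Quantity{:m}(3.0)
meta = ArrowTypes.arrowmetadata(typeof(q))                # "m"
T = ArrowTypes.JuliaType(Val(QUANTITY_NAME), Float64, meta)
restored = ArrowTypes.fromarrow(T, ArrowTypes.toarrow(q))
```

Without the metadata string, the reader would only see a Float64 column labeled JuliaLang.Quantity and could not tell which unit parameter to restore.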
ArrowTypes.arrowname
— Function
ArrowTypes.arrowname(T) = Symbol(name)
Interface method to define the logical type "label" for a custom Julia type T. Names will be global for an entire arrow dataset, and conventionally, custom types will just use their type name along with a Julia-specific prefix; for example, for a custom type Foo, I would define ArrowTypes.arrowname(::Type{Foo}) = Symbol("JuliaLang.Foo"). This ensures other language implementations won't get confused and can safely ignore the logical type label. When arrow stores non-native data, it must still be stored as a native data type, but can have type metadata tied to the data that labels the original logical type it originated from. This enables the conversion of native data back to the logical type when deserializing, as long as the deserializer has the same definitions that were in place when the data was serialized. Namely, the current Julia session will need the appropriate ArrowTypes.JuliaType and ArrowTypes.fromarrow definitions in order to know how to convert the native data to the original logical type. See the docs for those interface methods in order to ensure a complete implementation. Also see the accompanying ArrowTypes.arrowmetadata docs around providing additional metadata about a custom logical type that may be necessary to fully re-create a Julia type (e.g. non-field-based type parameters).
ArrowTypes.default
— Function
There are a couple of places when writing arrow buffers where we need to write a "dummy" value; it doesn't really matter what we write, but we need to write something of a specific type. So each supported writing type needs to define default.
ArrowTypes.fromarrow
— Function
ArrowTypes.fromarrow(::Type{T}, x::S) => T
Interface method that provides a "deserialization hook" for a custom type T to be constructed from the native arrow type S. The T and S types must correspond to the definitions used in ArrowTypes.ArrowType(::Type{T}) = S. This is a paired method with ArrowTypes.toarrow.
The default definition is ArrowTypes.fromarrow(::Type{T}, x) = T(x), so if that works for a custom type already, no additional overload is necessary. A few ArrowKinds have/allow slightly more custom overloads for their fromarrow methods:
- ListKind{true}: for String types, they may overload fromarrow(::Type{T}, ptr::Ptr{UInt8}, len::Int) = ... to avoid materializing a String
- StructKind:
  - May overload fromarrow(::Type{T}, x...) where individual fields are passed as separate positional arguments; so if my custom type Interval has two fields first and last, then I'd overload like ArrowTypes.fromarrow(::Type{Interval}, first, last) = .... Note the default implementation is ArrowTypes.fromarrow(::Type{T}, x...) = T(x...), so if your type already accepts all arguments in a constructor no additional fromarrow method should be necessary (default struct constructors have this behavior).
  - Alternatively, may overload fromarrowstruct(::Type{T}, ::Val{fnames}, x...), where fnames is a tuple of the field names corresponding to the values in x. This approach is useful when you need to implement deserialization in a manner that is agnostic to the field order used by the serializer. When implemented, fromarrowstruct takes precedence over fromarrow in StructKind deserialization.
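The StructKind case with separate field arguments could be sketched like this, assuming the ArrowTypes package is installed. The Interval type below is a hypothetical variant that exposes only a keyword constructor, chosen because the default fromarrow(::Type{T}, x...) = T(x...) would then fail and an overload becomes necessary.

```julia
using ArrowTypes

# Hypothetical struct whose only constructor takes keyword arguments,
# so the positional call Interval(first, last) does not exist.
struct Interval
    first::Float64
    last::Float64
    Interval(; first, last) = new(first, last)
end

# Each field arrives as its own positional argument, in field order.
ArrowTypes.fromarrow(::Type{Interval}, first, last) =
    Interval(; first=first, last=last)

iv = ArrowTypes.fromarrow(Interval, 1.0, 2.0)
```

A struct that keeps its default positional constructor would not need this method at all.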
ArrowTypes.toarrow
— Function
ArrowTypes.toarrow(x::T) => S
Interface method to perform the actual conversion from an object x of type T to the type S. T and S must match the types used when defining ArrowTypes.ArrowType(::Type{T}) = S. Hence, S is the natively supported arrow type that T is converted to in order to enable serialization. See ArrowTypes.ArrowType docs for more details. This enables custom objects to be serialized as a natively supported arrow data type.