Parquet2

A pure Julia implementation of the Apache Parquet format.

Installation

using Pkg; Pkg.add("Parquet2")

or, in the REPL's Pkg mode: ]add Parquet2

Basic Usage

using Parquet2, Tables, TableOperations, DataFrames  # TableOperations is needed for the column selection below

ds = Parquet2.Dataset("/path/to/file")

sch = Tables.schema(ds)  # view table schema

t = Tables.columntable(ds)  # load as a NamedTuple of columns

df = DataFrame(ds; copycols=false)  # load entire dataset as a DataFrame

df1 = DataFrame(ds[1]; copycols=false)  # load first RowGroup as a DataFrame


# load *only* columns (col1, col2) as a DataFrame
dfc = ds |> TableOperations.select(:col1, :col2) |> DataFrame


using AWSS3  # for recognizing S3 URLs
s3ds = Parquet2.Dataset("s3://path/to/file")

# can load multi-file datasets
dsd = Parquet2.Dataset("/path/to/directory/")
# multi-file datasets don't load every file by default
append!(dsd, A="1", B="alpha")  # can append by partition columns
# or read it all eagerly (WARNING! don't do this for gigantic directories)
dsd = Parquet2.Dataset("/path/to/directory/"; load_initial=true)

# write a file
df = DataFrame(A=1:5, B=randn(5))
Parquet2.writefile("/path/to/file", df)

# write a file to S3
Parquet2.writefile("s3://path/to/file", df)
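
The read and write calls above can be combined into a quick local round trip. The following is a minimal sketch that uses only the calls already shown in this README; the temporary directory and the file name `example.parquet` are illustrative assumptions, not part of the package.

```julia
using Parquet2, DataFrames

# Build a small table to write (same shape as the example above).
df = DataFrame(A=1:5, B=randn(5))

# Write to a scratch location; the directory and file name are arbitrary choices.
path = joinpath(mktempdir(), "example.parquet")
Parquet2.writefile(path, df)

# Read the file back and materialize it as a DataFrame.
ds = Parquet2.Dataset(path)
df_back = DataFrame(ds; copycols=false)

# The column names and values should survive the round trip.
println(names(df_back))
println(df_back.A == df.A && df_back.B == df.B)
```

A check like this is a cheap way to confirm that a given column type is written and read back as expected before pointing the same code at a larger dataset or an S3 path.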

For more information, please see the documentation.