Docstrings · DataFrames.jl

DataFrames.AbstractDataFrame — Type

AbstractDataFrame

An abstract type for which all concrete types expose an interface for working with tabular data.

An AbstractDataFrame is a two-dimensional table with Symbols or strings for column names.

DataFrames.jl defines two types that are subtypes of AbstractDataFrame: DataFrame and SubDataFrame.

Indexing and broadcasting

An AbstractDataFrame can be indexed by passing two indices specifying row and column selectors. The allowed indices are a superset of the indices that can be used for standard arrays. You can also access a single column of an AbstractDataFrame using the getproperty and setproperty! functions. Columns can be selected using integers, Symbols, or strings. In broadcasting, an AbstractDataFrame behaves similarly to a Matrix.

A detailed description of getindex, setindex!, getproperty, setproperty!, broadcasting and broadcasting assignment for data frames is given in the "Indexing" section of the manual.
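
The selectors described above can be sketched as follows. This is a minimal illustration rather than a complete tour of the indexing rules; the data frame contents are made up, and the `!` and `:` row selectors are the no-copy and copy forms covered in the "Indexing" section of the manual.

```julia
using DataFrames

df = DataFrame(a=1:3, b=[10.0, 20.0, 30.0])

cell = df[2, :b]        # row and column selectors: a single cell
col_copy = df[:, "a"]   # `:` as the row selector returns a copy of the column
col_int = df[!, 1]      # `!` returns the stored column without copying
col_prop = df.a         # getproperty: the same vector as df[!, :a]

df.b = [1.5, 2.5, 3.5]  # setproperty! replaces the column

shifted = df .+ 1       # broadcasting works element-wise, as for a Matrix
```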

DataFrames.AsTable — Type

AsTable(cols)

A type having a special meaning in source => transformation => destination selection operations supported by combine, select, select!, transform, transform!, subset, and subset!.

If AsTable(cols) is used in source position it signals that the columns selected by the wrapped selector cols should be passed as a NamedTuple to the function.

If AsTable is used in destination position it means that the result of the transformation operation is a vector of containers (or a single container if ByRow(transformation) is used) that should be expanded into multiple columns using keys to get column names.

Examples

julia> df1 = DataFrame(a=1:3, b=11:13)
3×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1     11
   2 │     2     12
   3 │     3     13

julia> df2 = select(df1, AsTable([:a, :b]) => ByRow(identity))
3×1 DataFrame
 Row │ a_b_identity
     │ NamedTuple…
─────┼─────────────────
   1 │ (a = 1, b = 11)
   2 │ (a = 2, b = 12)
   3 │ (a = 3, b = 13)

julia> select(df2, :a_b_identity => AsTable)
3×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1     11
   2 │     2     12
   3 │     3     13

julia> select(df1, AsTable([:a, :b]) => ByRow(nt -> map(x -> x^2, nt)) => AsTable)
3×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1    121
   2 │     4    144
   3 │     9    169
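
The examples above all wrap the function in ByRow. As a small complementary sketch (the column name :dot is an arbitrary choice for illustration), when AsTable is used in source position without ByRow the function receives a single NamedTuple whose fields are the whole columns:

```julia
using DataFrames

df1 = DataFrame(a=1:3, b=11:13)

# nt is (a = [1, 2, 3], b = [11, 12, 13]); the function reduces it to one value.
res = combine(df1, AsTable([:a, :b]) => (nt -> sum(nt.a .* nt.b)) => :dot)
```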

DataFrames.DataFrame — Type

DataFrame <: AbstractDataFrame

An AbstractDataFrame that stores a set of named columns.

The columns are normally AbstractVectors stored in memory, particularly a Vector, PooledVector or CategoricalVector.

Constructors

DataFrame(pairs::Pair...; makeunique::Bool=false, copycols::Bool=true)
DataFrame(pairs::AbstractVector{<:Pair}; makeunique::Bool=false, copycols::Bool=true)
DataFrame(ds::AbstractDict; copycols::Bool=true)
DataFrame(; kwargs..., copycols::Bool=true)

DataFrame(table; copycols::Union{Bool, Nothing}=nothing)
DataFrame(table, names::AbstractVector;
          makeunique::Bool=false, copycols::Union{Bool, Nothing}=nothing)
DataFrame(columns::AbstractVecOrMat, names::AbstractVector;
          makeunique::Bool=false, copycols::Bool=true)

DataFrame(::DataFrameRow; copycols::Bool=true)
DataFrame(::GroupedDataFrame; copycols::Bool=true, keepkeys::Bool=true)

Keyword arguments

copycols : whether vectors passed as columns should be copied; by default set to true and the vectors are copied; if set to false then the constructor will still copy the passed columns if it is not possible to construct a DataFrame without materializing new columns. Note the copycols=nothing default in the Tables.jl compatible constructor; it is provided as certain input table types may have already made a copy of columns or the columns may otherwise be immutable, in which case columns are not copied by default. To force a copy in such cases, or to get mutable columns from an immutable input table (like Arrow.Table), pass copycols=true explicitly.
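makeunique : if false (the default), an error will be raised if duplicate column names are found; if true, duplicate names will be suffixed with _i (i starting at 1 for the first duplicate)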

(note that not all constructors support these keyword arguments)

Details on behavior of different constructors

It is allowed to pass a vector of Pairs, a list of Pairs as positional arguments, or a list of keyword arguments. In this case each pair is considered to represent a column name to column value mapping and column name must be a Symbol or string. Alternatively a dictionary can be passed to the constructor in which case its entries are considered to define the column name and column value pairs. If the dictionary is a Dict then column names will be sorted in the returned DataFrame.

In all the constructors described above column value can be a vector which is consumed as is or an object of any other type (except AbstractArray). In the latter case the passed value is automatically repeated to fill a new vector of the appropriate length. As a particular rule values stored in a Ref or a 0-dimensional AbstractArray are unwrapped and treated in the same way.
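
A short sketch of the recycling rule, with made-up column names: a scalar is repeated to match the length of the vector columns, and a value wrapped in a Ref is unwrapped first, which makes it possible to repeat a vector as a single value.

```julia
using DataFrames

# b is a scalar; c wraps a vector in a Ref so that the vector itself is repeated.
df = DataFrame(a=[1, 2, 3], b=0, c=Ref([10, 20]))
```

Here every row of column c holds the vector [10, 20], while column b holds three zeros.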

It is also allowed to pass a vector of vectors or a matrix as the first argument. In this case the second argument must be a vector of Symbols or strings specifying column names, or the symbol :auto to generate column names x1, x2, ... automatically. Note that in this case if the first argument is a matrix and copycols=false the columns of the created DataFrame will be views of columns of the source matrix.
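
The effect of copycols=false on a matrix source can be checked with a small sketch (the matrix and column names are illustrative only): a later write to the matrix is visible through the view columns but not through the copied ones.

```julia
using DataFrames

m = [1 0; 2 0]

df_view = DataFrame(m, [:a, :b], copycols=false)  # columns are views of m
df_copy = DataFrame(m, [:a, :b])                  # default: columns are copied

m[1, 1] = 100   # visible through df_view, not through df_copy
```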

If a single positional argument is passed to a DataFrame constructor then it is assumed to be of a type that implements the Tables.jl interface, which is then used to materialize the returned DataFrame.

If two positional arguments are passed, where the second argument is an AbstractVector, then the first argument is taken to be a table as described in the previous paragraph, and column names of the resulting data frame are taken from the vector passed as the second positional argument.

Finally it is allowed to construct a DataFrame from a DataFrameRow or a GroupedDataFrame. In the latter case the keepkeys keyword argument specifies whether the resulting DataFrame should contain the grouping columns of the passed GroupedDataFrame and the order of rows in the result follows the order of groups in the GroupedDataFrame passed.
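
As a minimal sketch of the GroupedDataFrame case (the data is made up, and sort=true is passed to groupby only to make the group order predictable), the rows of the result follow the order of groups and keepkeys=false drops the grouping column:

```julia
using DataFrames

df = DataFrame(g=[2, 1, 2, 1], x=1:4)
gdf = groupby(df, :g, sort=true)   # groups ordered by key: g == 1, then g == 2

with_keys = DataFrame(gdf)                     # grouping column kept
without_keys = DataFrame(gdf, keepkeys=false)  # only non-grouping columns
```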

Notes

The DataFrame constructor by default copies all column vectors passed to it. Pass the copycols=false keyword argument (where supported) to reuse vectors without copying them.

By default an error will be raised if duplicates in column names are found. Pass the makeunique=true keyword argument (where supported) to accept duplicate names, in which case they will be suffixed with _i (i starting at 1 for the first duplicate).

If an AbstractRange is passed to a DataFrame constructor as a column it is always collected to a Vector (even if copycols=false). As a general rule AbstractRange values are always materialized to a Vector by all functions in DataFrames.jl before being stored in a DataFrame.
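
The three notes above can be illustrated together. This is a sketch with arbitrary sample values, using only the keyword arguments listed in the constructor signatures:

```julia
using DataFrames

v = [1, 2, 3]

df_copy = DataFrame(a=v)                    # default: v is copied
df_alias = DataFrame(a=v, copycols=false)   # v itself is stored as the column

# Duplicate names are accepted and suffixed when makeunique=true.
df_dup = DataFrame([:a => [1, 2], :a => [3, 4]], makeunique=true)

# A range is collected to a Vector even when copycols=false.
df_rng = DataFrame(a=1:3, copycols=false)
```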

DataFrame can store only columns that use 1-based indexing. Attempting to store a vector using non-standard indexing raises an error.

The DataFrame type is designed to allow column types to vary and to be changed dynamically even after it is constructed. Therefore DataFrames are not type stable. For performance-critical code that requires type stability, either use the functionality provided by the select/transform/combine functions, use the Tables.columntable and Tables.namedtupleiterator functions, use barrier functions, or provide type assertions to the variables that hold columns extracted from a DataFrame.
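
Two of the options listed above, a barrier function and a type assertion, might look like the following sketch. The function name sum_squares is hypothetical and chosen only for this illustration.

```julia
using DataFrames

# Barrier function: it is compiled for the concrete type of `v`,
# so the loop inside is type stable even though `df.x` is not.
function sum_squares(v::AbstractVector)
    total = zero(eltype(v))
    for x in v
        total += x^2
    end
    return total
end

df = DataFrame(x=[1, 2, 3])

s1 = sum_squares(df.x)     # one dynamic dispatch, then a fast loop

x = df.x::Vector{Int}      # type assertion on the extracted column
s2 = sum(abs2, x)
```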

Metadata: this function preserves all table and column-level metadata. As a special case if a GroupedDataFrame is passed then only :note-style metadata from parent of the GroupedDataFrame is preserved.

Examples

julia> DataFrame((a=[1, 2], b=[3, 4])) # Tables.jl table constructor
2×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      3
   2 │     2      4

julia> DataFrame([(a=1, b=0), (a=2, b=0)]) # Tables.jl table constructor
2×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      0
   2 │     2      0

julia> DataFrame("a" => 1:2, "b" => 0) # Pair constructor
2×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      0
   2 │     2      0

julia> DataFrame([:a => 1:2, :b => 0]) # vector of Pairs constructor
2×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      0
   2 │     2      0

julia> DataFrame(Dict(:a => 1:2, :b => 0)) # dictionary constructor
2×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      0
   2 │     2      0

julia> DataFrame(a=1:2, b=0) # keyword argument constructor
2×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      0
   2 │     2      0

julia> DataFrame([[1, 2], [0, 0]], [:a, :b]) # vector of vectors constructor
2×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      0
   2 │     2      0

julia> DataFrame([1 0; 2 0], :auto) # matrix constructor
2×2 DataFrame
 Row │ x1     x2
     │ Int64  Int64
─────┼──────────────
   1 │     1      0
   2 │     2      0

DataFrames.DataFrameColumns — Type

DataFrameColumns{<:AbstractDataFrame}

A vector-like object that allows iteration over columns of an AbstractDataFrame.

Indexing into DataFrameColumns objects using integer, Symbol or string returns the corresponding column (without copying). Indexing into DataFrameColumns objects using a multiple column selector returns a subsetted DataFrameColumns object with a new parent containing only the selected columns (without copying).

DataFrameColumns supports most of the AbstractVector API. The key differences are that it is read-only and that the keys function returns a vector of Symbols (and not integers as for normal vectors).

In particular findnext, findprev, findfirst, findlast, and findall functions are supported, and in findnext and findprev functions it is allowed to pass an integer, string, or Symbol as a reference index.
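
A brief sketch of these properties, assuming the DataFrameColumns object is obtained with the eachcol function and using a made-up data frame:

```julia
using DataFrames

df = DataFrame(a=1:2, b=["x", "y"])
cols = eachcol(df)

first_col = cols[:a]      # the stored column, not a copy
second_col = cols["b"]    # strings work as well
col_names = keys(cols)    # Symbols rather than integers
subset = cols[[:b]]       # multiple column selector: a new DataFrameColumns

# Iterate over name => column pairs.
for (name, col) in pairs(cols)
    println(name, " => ", eltype(col))
end
```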

DataFrames.DataFrameRow — Type

DataFrameRow{<:AbstractDataFrame, <:AbstractIndex}

A view of one row of an AbstractDataFrame.

A DataFrameRow is returned by getindex or view functions when one row and a selection of columns are requested, or when iterating the result of the call to the eachrow function.

The DataFrameRow constructor can also be called directly:

DataFrameRow(parent::AbstractDataFrame, row::Integer, cols=:)

A DataFrameRow supports the iteration interface and can therefore be passed to functions that expect a collection as an argument. Its element type is always Any.

Indexing is one-dimensional like specifying a column of a DataFrame. You can also access the data in a DataFrameRow using the getproperty and setproperty! functions and convert it to a Tuple, NamedTuple, or Vector using the corresponding functions.

If the selection of columns in a parent data frame is passed as : (a colon) then DataFrameRow will always have all columns from the parent, even if they are added or removed after its creation.

Examples

julia> df = DataFrame(a=repeat([1, 2], outer=[2]),
                      b=repeat(["a", "b"], inner=[2]),
                      c=1:4)
4×3 DataFrame
 Row │ a      b       c
     │ Int64  String  Int64
─────┼──────────────────────
   1 │     1  a           1
   2 │     2  a           2
   3 │     1  b           3
   4 │     2  b           4

julia> df[1, :]
DataFrameRow
 Row │ a      b       c
     │ Int64  String  Int64
─────┼──────────────────────
   1 │     1  a           1

julia> @view df[end, [:a]]
DataFrameRow
 Row │ a
     │ Int64
─────┼───────
   4 │     2

julia> eachrow(df)[1]
DataFrameRow
 Row │ a      b       c
     │ Int64  String  Int64
─────┼──────────────────────
   1 │     1  a           1

julia> Tuple(df[1, :])
(1, "a", 1)

julia> NamedTuple(df[1, :])
(a = 1, b = "a", c = 1)

julia> Vector(df[1, :])
3-element Vector{Any}:
 1
  "a"
 1
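
Because a DataFrameRow is a view, writing to it changes the parent, and a row created with : as the column selector picks up columns added later. A minimal sketch with made-up data:

```julia
using DataFrames

df = DataFrame(a=[1, 2], b=["x", "y"])
row = df[1, :]        # all columns selected with `:`

row.a = 10            # setproperty! writes through to the parent

df.c = [true, false]  # column added after the row was created
row_names = names(row)
```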

DataFrames.DataFrameRows — Type

DataFrameRows{D<:AbstractDataFrame} <: AbstractVector{DataFrameRow}

Iterator over rows of an AbstractDataFrame, with each row represented as a DataFrameRow.

A value of this type is returned by the eachrow function.

DataFrames.GroupKey — Type

GroupKey{T<:GroupedDataFrame}

Key for one of the groups of a GroupedDataFrame. Contains the values of the corresponding grouping columns and behaves similarly to a NamedTuple, but using it to index its GroupedDataFrame is more efficient than using the equivalent Tuple and NamedTuple, and much more efficient than using the equivalent AbstractDict.

Instances of this type are returned by keys(::GroupedDataFrame) and are not meant to be constructed directly.

Indexing fields of GroupKey is allowed using an integer, a Symbol, or a string. It is also possible to access the data in a GroupKey using the getproperty function. A GroupKey can be converted to a Tuple, NamedTuple, a Vector, or a Dict. When converted to a Dict, the keys of the Dict are Symbols.

See keys(::GroupedDataFrame) for more information.
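
The following sketch shows a GroupKey in use. The data is illustrative, and sort=true is passed to groupby only so that the first group is predictable.

```julia
using DataFrames

df = DataFrame(g=["a", "b", "a"], x=1:3)
gdf = groupby(df, :g, sort=true)

key = first(keys(gdf))     # GroupKey of the first group

by_property = key.g        # getproperty
by_position = key[1]       # integer index
by_name = key["g"]         # string index

as_tuple = Tuple(key)
as_namedtuple = NamedTuple(key)

group = gdf[key]           # index the GroupedDataFrame with its own key
same_group = gdf[("a",)]   # the equivalent Tuple also works
```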

DataFrames.GroupKeys — Type

GroupKeys{T<:GroupedDataFrame} <: AbstractVector{GroupKey{T}}

A vector containing all GroupKey objects for a given GroupedDataFrame.

See keys(::GroupedDataFrame) for more information.

DataFrames.GroupedDataFrame — Type

GroupedDataFrame

The result of a groupby operation on an AbstractDataFrame; a view into the AbstractDataFrame grouped by rows.

Not meant to be constructed directly, see groupby.

One can get the names of columns used to create GroupedDataFrame using the groupcols function. Similarly the groupindices function returns a vector of group indices for each row of the parent data frame.

After its creation, a GroupedDataFrame reflects the grouping of rows that was valid at its creation time. Therefore grouping columns of its parent data frame must not be mutated, and rows must not be added to or removed from it. To safeguard the user against such cases, if the number of rows in the parent data frame changes then trying to use the GroupedDataFrame will throw an error. However, one can add columns to or remove columns from the parent data frame without invalidating the GroupedDataFrame, provided that columns used for grouping are not changed.
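
A small sketch of groupcols and groupindices, followed by adding a non-grouping column to the parent. The data and the column name :y_sum are made up, and sort=true is used only to fix the group order.

```julia
using DataFrames

df = DataFrame(g=["a", "b", "a"], x=1:3)
gdf = groupby(df, :g, sort=true)

cols = groupcols(gdf)      # names of the grouping columns
idx = groupindices(gdf)    # group number for each row of the parent

df.y = df.x .* 2           # adding a non-grouping column keeps gdf usable
totals = combine(gdf, :y => sum => :y_sum)
```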

DataFrames.RepeatedVector — Type

RepeatedVector{T} <: AbstractVector{T}

An AbstractVector that is a view into another AbstractVector with repeated elements.

NOTE: Not exported.

Constructor

RepeatedVector(parent::AbstractVector, inner::Int, outer::Int)

Arguments

parent : the AbstractVector that's repeated
inner : the number of times each element is repeated
outer : the number of times the whole vector is repeated after being expanded by inner

inner and outer have the same meaning as similarly named arguments to repeat.

Examples

RepeatedVector([1, 2], 3, 1)   # [1, 1, 1, 2, 2, 2]
RepeatedVector([1, 2], 1, 3)   # [1, 2, 1, 2, 1, 2]
RepeatedVector([1, 2], 2, 2)   # [1, 1, 2, 2, 1, 1, 2, 2]

DataFrames.StackedVector — Type

StackedVector <: AbstractVector

An AbstractVector that is a linear, concatenated view into another set of AbstractVectors.

NOTE: Not exported.

Constructor

StackedVector(d::AbstractVector)

Arguments

d : an AbstractVector containing one or more AbstractVectors to concatenate

Examples

StackedVector(Any[[1, 2], [9, 10], [11, 12]])  # [1, 2, 9, 10, 11, 12]

DataFrames.SubDataFrame — Type

SubDataFrame{<:AbstractDataFrame, <:AbstractIndex, <:AbstractVector{Int}} <: AbstractDataFrame

A view of an AbstractDataFrame. It is returned by a call to the view function on an AbstractDataFrame if collections of rows and columns are specified.

A SubDataFrame is an AbstractDataFrame, so most functions that work on a DataFrame should also work on it. Such functions include describe, summary, nrow, size, combine, stack, and innerjoin.

If the selection of columns in a parent data frame is passed as : (a colon) then SubDataFrame will always have all columns from the parent, even if they are added or removed after its creation.

Examples

julia> df = DataFrame(a=repeat([1, 2, 3, 4], outer=[2]),
                      b=repeat([2, 1], outer=[4]),
                      c=1:8)
8×3 DataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      2      1
   2 │     2      1      2
   3 │     3      2      3
   4 │     4      1      4
   5 │     1      2      5
   6 │     2      1      6
   7 │     3      2      7
   8 │     4      1      8

julia> sdf1 = view(df, :, 2:3) # column subsetting
8×2 SubDataFrame
 Row │ b      c
     │ Int64  Int64
─────┼──────────────
   1 │     2      1
   2 │     1      2
   3 │     2      3
   4 │     1      4
   5 │     2      5
   6 │     1      6
   7 │     2      7
   8 │     1      8

julia> sdf2 = @view df[end:-1:1, [1, 3]]  # row and column subsetting
8×2 SubDataFrame
 Row │ a      c
     │ Int64  Int64
─────┼──────────────
   1 │     4      8
   2 │     3      7
   3 │     2      6
   4 │     1      5
   5 │     4      4
   6 │     3      3
   7 │     2      2
   8 │     1      1

julia> sdf3 = groupby(df, :a)[1]  # indexing a GroupedDataFrame returns a SubDataFrame
2×3 SubDataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      2      1
   2 │     1      2      5

Base.Iterators.only — Method

only(df::AbstractDataFrame)

If df has a single row, return it as a DataFrameRow; otherwise throw an ArgumentError.

Metadata: this function preserves table-level and column-level :note-style metadata.
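
A short sketch of both outcomes, using made-up data frames; the try/catch block is there only to show the error case without stopping the script.

```julia
using DataFrames

row = only(DataFrame(a=[1], b=["x"]))   # exactly one row: a DataFrameRow

# More than one row: an ArgumentError is thrown.
threw = try
    only(DataFrame(a=[1, 2]))
    false
catch err
    err isa ArgumentError
end
```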

Base.Iterators.partition — Method

Iterators.partition(df::AbstractDataFrame, n::Integer)

Iterate over the data frame df n rows at a time, returning each block as a SubDataFrame.

Examples

julia> collect(Iterators.partition(DataFrame(x=1:5), 2))
3-element Vector{SubDataFrame{DataFrame, DataFrames.Index, UnitRange{Int64}}}:
 2×1 SubDataFrame
 Row │ x
     │ Int64
─────┼───────
   1 │     1
   2 │     2
 2×1 SubDataFrame
 Row │ x
     │ Int64
─────┼───────
   1 │     3
   2 │     4
 1×1 SubDataFrame
 Row │ x
     │ Int64
─────┼───────
   1 │     5
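
A typical use is processing a data frame in fixed-size chunks. The sketch below, with an arbitrary chunk size of 2, computes one sum per block:

```julia
using DataFrames

df = DataFrame(x=1:5)

chunk_sums = Int[]
for chunk in Iterators.partition(df, 2)
    # chunk is a SubDataFrame with at most 2 rows
    push!(chunk_sums, sum(chunk.x))
end
```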

Base.Iterators.partition — Method

Iterators.partition(dfr::DataFrameRows, n::Integer)

Iterate over the DataFrameRows object dfr n rows at a time, returning each block as a DataFrameRows over a view of the corresponding rows of the parent of dfr.

Examples

julia> collect(Iterators.partition(eachrow(DataFrame(x=1:5)), 2))
3-element Vector{DataFrames.DataFrameRows{SubDataFrame{DataFrame, DataFrames.Index, UnitRange{Int64}}}}:
 2×1 DataFrameRows
 Row │ x     
     │ Int64 
─────┼───────
   1 │     1
   2 │     2
 2×1 DataFrameRows
 Row │ x     
     │ Int64 
─────┼───────
   1 │     3
   2 │     4
 1×1 DataFrameRows
 Row │ x     
     │ Int64 
─────┼───────
   1 │     5

Base.allunique — Function

allunique(df::AbstractDataFrame, cols=:)

Return true if none of the rows of df are duplicated. Two rows are duplicates if all their columns contain equal values (according to isequal) for all columns in cols (by default, all columns).

Arguments

df : AbstractDataFrame
cols : a selector specifying the column(s) or their transformations to compare. Can be any column selector or transformation accepted by select.