Documentation

DataFrameMacros.DataFrameMacros (Module)

DataFrameMacros offers macros which transform expressions for DataFrames functions that use the source .=> function .=> sink mini-language. The supported functions are @transform/@transform!, @select/@select!, @groupby, @combine, @subset/@subset!, @sort/@sort! and @unique.

All macros have signatures of the form:

@macro(df, args...; kwargs...)

Each positional argument in args is converted to a source .=> function .=> sink expression for the transformation mini-language of DataFrames. By default, all macros execute the given function by-row; only @combine executes by-column. Broadcasting is applied automatically across all column specifiers, so you can directly use multi-column specifiers such as {All()}, {Not(:x)}, {r"columnname"} and {startswith("prefix")}.

For example, the following pairs of expressions are equivalent:

transform(df, :x .=> ByRow(x -> x + 1) .=> :y)
@transform(df, :y = :x + 1)

select(df, names(df, All()) .=> ByRow(x -> x ^ 2))
@select(df, {All()} ^ 2)

combine(df, :x .=> (x -> sum(x) / 5) .=> :result)
@combine(df, :result = sum(:x) / 5)
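
The equivalence can be checked directly. The following is a minimal sketch on a small, made-up DataFrame; the variable names are illustrative only.

```julia
using DataFrames
using DataFrameMacros

df = DataFrame(x = 1:3)

# Explicit mini-language call and its macro counterpart (both by-row).
manual = transform(df, :x => ByRow(x -> x + 1) => :y)
via_macro = @transform(df, :y = :x + 1)
println(manual == via_macro)

# @combine passes the whole column to the expression, so sum(:x) sees all of x.
combined = @combine(df, :result = sum(:x) / 5)
println(combined.result)
```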

Column references

Each positional argument must be of the form [sink =] some_expression. Columns can be referenced within sink or some_expression using a Symbol, a String, or an Int. Any column identifier that is not a Symbol must be wrapped in {}. Wrapping in {} also lets you use variables or expressions that evaluate to column identifiers.

The five expressions in the following code block are equivalent.

using DataFrames
using DataFrameMacros

df = DataFrame(x = 1:3)

@transform(df, :y = :x + 1)
@transform(df, :y = {"x"} + 1)
@transform(df, :y = {1} + 1)
col = :x
@transform(df, :y = {col} + 1)
cols = [:x, :y, :z]
@transform(df, :y = {cols[1]} + 1)

Multi-column references

You can also use multi-column specifiers. For example @select(df, sqrt({Between(2, 4)})) acts as if the function sqrt is applied along each column that belongs to the group selected by Between(2, 4). Because the source-function-sink complex is connected by broadcasted pairs like source .=> function .=> sink, you can use multi-column specifiers together with single-column specifiers in the same expression. For example, @select(df, {All()} + :x) would compute df.some_column + df.x for each column in the DataFrame df.

If you use {{}}, the multi-column expression is not broadcast, but passed as a tuple so you can aggregate over it. For example, sum({{All()}}) calculates the sum of all columns once, while sum({All()}) would apply sum to each column separately.
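
As a rough illustration of the difference between {} and {{}}, the sketch below uses a made-up three-column DataFrame. It assumes the default by-row behavior of @select described earlier; the names broadcasted and row_totals are hypothetical.

```julia
using DataFrames
using DataFrameMacros

df = DataFrame(x = 1:3, y = 4:6, z = 7:9)

# {Not(:x)} is broadcast: one output column for y and one for z,
# each added to x row by row. Column names are generated automatically.
broadcasted = @select(df, {Not(:x)} + :x)

# {{All()}} is not broadcast: the values of all columns in a row
# arrive as one tuple, so sum runs once per row.
row_totals = @select(df, :total = sum({{All()}}))

println(size(broadcasted))
println(row_totals.total)
```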

Sink names in multi-column scenarios

For most complex function expressions, DataFrames concatenates all names of the columns that you used to create a new sink column name, which looks like col1_col2_function. It's common that you want to use a different naming scheme, but you can't write @select(df, :x = {All()} + 1) because then every new column would be named x and that is disallowed. There are several options to deal with the problem of multiple new columns:

  • You can use a Vector of strings or symbols such as ["x", "y", "z"] = sqrt({All()}). The length has to match the number of columns in the multi-column specifier(s). This is the most direct way to specify multiple names, but it doesn't leverage the names of the used columns dynamically.

  • You can use DataFrameMacros' string shortcut syntax. If there's a string literal with one or more {} brackets, it's treated as an anonymous function that takes in column names and splices them into the string. {} is equivalent to {1}, but you can access further names with {2} and so on, if more than one column is used in the function. In the example above, you could rename all columns with @select(df, "sqrt_of_{}" = sqrt({All()})).

  • You can use {1}, {2}, etc. in any expression to refer to the first, second, etc. column name. Like with the shortcut string syntax, {} is the same as {1}.

    For example:

julia> df = DataFrame(a_1 = 1:3, b_1 = 4:6)
3×2 DataFrame
 Row │ a_1    b_1   
     │ Int64  Int64 
─────┼──────────────
   1 │     1      4
   2 │     2      5
   3 │     3      6

julia> @transform(df, "result_" * split({}, "_")[1] = sqrt({All()}))
3×4 DataFrame
 Row │ a_1    b_1    result_a  result_b 
     │ Int64  Int64  Float64   Float64  
─────┼──────────────────────────────────
   1 │     1      4   1.0       2.0
   2 │     2      5   1.41421   2.23607
   3 │     3      6   1.73205   2.44949
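
The first two naming options can be compared side by side. This is a small sketch with an invented two-column DataFrame; explicit and templated are illustrative variable names.

```julia
using DataFrames
using DataFrameMacros

df = DataFrame(a = [1, 4], b = [9, 16])

# Option 1: spell out one sink name per selected column.
explicit = @select(df, ["root_a", "root_b"] = sqrt({All()}))

# Option 2: string shortcut, where {} is replaced by the source column name.
templated = @select(df, "sqrt_of_{}" = sqrt({All()}))

println(names(explicit))
println(names(templated))
```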

Passing multiple expressions

Multiple expressions can be passed as multiple positional arguments, or alternatively as separate lines in a begin/end block. You can call the macro with or without parentheses. The following expressions are equivalent:

@transform(df, :y = :x + 1, :z = :x * 2)
@transform df :y = :x + 1 :z = :x * 2
@transform df begin
    :y = :x + 1
    :z = :x * 2
end
@transform(df, begin
    :y = :x + 1
    :z = :x * 2
end)

Modifier macros

You can modify the behavior of all macros using modifier macros. These are not real macros; they only signal changed behavior for a positional argument to the outer macro.

macro          meaning
@byrow         Switch to by-row processing.
@bycol         Switch to by-column processing.
@passmissing   Wrap the function expression in passmissing.
@astable       Collect all :symbol = expression expressions into a NamedTuple of the form (; symbol = expression, ...) and set the sink to AsTable.
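
A short sketch of @byrow, assuming the by-column default of @combine stated above: the modifier switches a single argument back to by-row processing. The DataFrame and column names are made up for illustration.

```julia
using DataFrames
using DataFrameMacros

df = DataFrame(x = 1:3)

# Without a modifier, @combine hands the whole column to the expression.
total = @combine(df, :total = sum(:x))

# With @byrow, the expression is evaluated once per row instead.
doubled = @combine(df, :doubled = @byrow :x * 2)

println(total.total)
println(doubled.doubled)
```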

Example @bycol

To compute a centered column with @transform, you need access to the whole column at once and signal this with the @bycol modifier.

using Statistics
using DataFrames
using DataFrameMacros

julia> df = DataFrame(x = 1:3)
3×1 DataFrame
 Row │ x     
     │ Int64 
─────┼───────
   1 │     1
   2 │     2
   3 │     3

julia> @transform(df, :x_centered = @bycol :x .- mean(:x))
3×2 DataFrame
 Row │ x      x_centered 
     │ Int64  Float64    
─────┼───────────────────
   1 │     1        -1.0
   2 │     2         0.0
   3 │     3         1.0

Example @passmissing

Many functions need to be wrapped in passmissing to correctly return missing if any input is missing. This can be achieved with the @passmissing modifier macro.

julia> df = DataFrame(name = ["alice", "bob", missing])
3×1 DataFrame
 Row │ name    
     │ String? 
─────┼─────────
   1 │ alice
   2 │ bob
   3 │ missing 

julia> @transform(df, :name_upper = @passmissing uppercasefirst(:name))
3×2 DataFrame
 Row │ name     name_upper 
     │ String?  String?    
─────┼─────────────────────
   1 │ alice    Alice
   2 │ bob      Bob
   3 │ missing  missing    

Example @astable

In DataFrames, you can return a NamedTuple from a function and have it expanded automatically into separate columns by using AsTable as the sink value. To simplify this process, you can use the @astable modifier macro, which gathers all statements of the form :symbol = expression in the function body into a NamedTuple and sets the sink argument to AsTable.

julia> df = DataFrame(name = ["Alice Smith", "Bob Miller"])
2×1 DataFrame
 Row │ name        
     │ String      
─────┼─────────────
   1 │ Alice Smith
   2 │ Bob Miller

julia> @transform(df, @astable begin
           s = split(:name)
           :first_name = s[1]
           :last_name = s[2]
       end)
2×3 DataFrame
 Row │ name         first_name  last_name  
     │ String       SubString…  SubString… 
─────┼─────────────────────────────────────
   1 │ Alice Smith  Alice       Smith
   2 │ Bob Miller   Bob         Miller

The @astable modifier also works with tuple destructuring syntax, so the previous example can be shortened to:

@transform(df, @astable :first_name, :last_name = split(:name))

DataFrameMacros.@combine (Macro)
@combine(df, args...; kwargs...)

The @combine macro builds a DataFrames.combine call. Each expression in args is converted to a src .=> function .=> sink construct that conforms to the transformation mini-language of DataFrames.

Keyword arguments kwargs are passed down to combine but have to be separated from the positional arguments by a semicolon ;.
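
For instance, a keyword argument of DataFrames.combine such as keepkeys can be forwarded after the semicolon. The sketch below uses an invented grouped DataFrame and assumes the keyword is passed through unchanged, as described above.

```julia
using DataFrames
using DataFrameMacros

df = DataFrame(g = [1, 1, 2], x = [1, 2, 3])
gdf = groupby(df, :g)

# Default: the grouping column g is kept in the result.
with_keys = @combine(gdf, :total = sum(:x))

# keepkeys = false is handed to combine and drops the grouping column.
without_keys = @combine(gdf, :total = sum(:x); keepkeys = false)

println(names(with_keys))
println(names(without_keys))
```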

The transformation logic for all DataFrameMacros macros is explained in the DataFrameMacros module docstring, accessible via ?DataFrameMacros.

DataFrameMacros.@select! (Macro)
@select!(df, args...; kwargs...)

The @select! macro builds a DataFrames.select! call. Each expression in args is converted to a src .=> function .=> sink construct that conforms to the transformation mini-language of DataFrames.

Keyword arguments kwargs are passed down to select! but have to be separated from the positional arguments by a semicolon ;.

The transformation logic for all DataFrameMacros macros is explained in the DataFrameMacros module docstring, accessible via ?DataFrameMacros.

@subset argument

You can pass a @subset expression as the second argument to @select!, between the input argument and the source-function-sink expressions. Then, the call is equivalent to first taking a subset of the input with view = true, then calling select! on the subset and returning the mutated input. If the input is a GroupedDataFrame, the parent DataFrame is returned.

df = DataFrame(x = 1:5, y = 6:10)
@select!(df, @subset(:x > 3), :y = 20, :z = 3 * :x)

DataFrameMacros.@select (Macro)
@select(df, args...; kwargs...)

The @select macro builds a DataFrames.select call. Each expression in args is converted to a src .=> function .=> sink construct that conforms to the transformation mini-language of DataFrames.

Keyword arguments kwargs are passed down to select but have to be separated from the positional arguments by a semicolon ;.

The transformation logic for all DataFrameMacros macros is explained in the DataFrameMacros module docstring, accessible via ?DataFrameMacros.

DataFrameMacros.@subset! (Macro)
@subset!(df, args...; kwargs...)

The @subset! macro builds a DataFrames.subset! call. Each expression in args is converted to a src .=> function .=> sink construct that conforms to the transformation mini-language of DataFrames.

Keyword arguments kwargs are passed down to subset! but have to be separated from the positional arguments by a semicolon ;.

The transformation logic for all DataFrameMacros macros is explained in the DataFrameMacros module docstring, accessible via ?DataFrameMacros.

DataFrameMacros.@subset (Macro)
@subset(df, args...; kwargs...)

The @subset macro builds a DataFrames.subset call. Each expression in args is converted to a src .=> function .=> sink construct that conforms to the transformation mini-language of DataFrames.

Keyword arguments kwargs are passed down to subset but have to be separated from the positional arguments by a semicolon ;.
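
As an illustration, the skipmissing keyword of DataFrames.subset can be forwarded this way. The sketch uses a made-up column containing a missing value; kept is an illustrative name.

```julia
using DataFrames
using DataFrameMacros

df = DataFrame(x = [1, missing, 3])

# The row-wise condition evaluates to missing for the second row.
# skipmissing = true is handed to subset, which then treats that row as false.
kept = @subset(df, :x > 1; skipmissing = true)

println(nrow(kept))
println(kept.x)
```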

The transformation logic for all DataFrameMacros macros is explained in the DataFrameMacros module docstring, accessible via ?DataFrameMacros.

DataFrameMacros.@transform! (Macro)
@transform!(df, args...; kwargs...)

The @transform! macro builds a DataFrames.transform! call. Each expression in args is converted to a src .=> function .=> sink construct that conforms to the transformation mini-language of DataFrames.

Keyword arguments kwargs are passed down to transform! but have to be separated from the positional arguments by a semicolon ;.

The transformation logic for all DataFrameMacros macros is explained in the DataFrameMacros module docstring, accessible via ?DataFrameMacros.

@subset argument

You can pass a @subset expression as the second argument to @transform!, between the input argument and the source-function-sink expressions. Then, the call is equivalent to first taking a subset of the input with view = true, then calling transform! on the subset and returning the mutated input. If the input is a GroupedDataFrame, the parent DataFrame is returned.

df = DataFrame(x = 1:5, y = 6:10)
@transform!(df, @subset(:x > 3), :y = 20, :z = 3 * :x)
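
To see the effect on the parent DataFrame, the call can be followed by an inspection of the mutated columns. This sketch assumes the DataFrames behavior for transform! on a subset view: existing columns are updated only in the selected rows, and newly created columns are filled with missing elsewhere.

```julia
using DataFrames
using DataFrameMacros

df = DataFrame(x = 1:5, y = 6:10)

# Only rows with x > 3 are touched; df itself is mutated and returned.
@transform!(df, @subset(:x > 3), :y = 20, :z = 3 * :x)

println(df.y)  # existing column, overwritten in the selected rows only
println(df.z)  # new column, expected to hold missing outside the subset
```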

DataFrameMacros.@transform (Macro)
@transform(df, args...; kwargs...)

The @transform macro builds a DataFrames.transform call. Each expression in args is converted to a src .=> function .=> sink construct that conforms to the transformation mini-language of DataFrames.

Keyword arguments kwargs are passed down to transform but have to be separated from the positional arguments by a semicolon ;.

The transformation logic for all DataFrameMacros macros is explained in the DataFrameMacros module docstring, accessible via ?DataFrameMacros.

DataFrameMacros.@unique (Macro)
@unique(df, args...; kwargs...)

The @unique macro builds a DataFrames.unique call. Each expression in args is converted to a src .=> function .=> sink construct that conforms to the transformation mini-language of DataFrames.

Keyword arguments kwargs are passed down to unique but have to be separated from the positional arguments by a semicolon ;.

The transformation logic for all DataFrameMacros macros is explained in the DataFrameMacros module docstring, accessible via ?DataFrameMacros.