FLoops: fold for humans™


FLoops.jl provides the macro @floop, which generates fast, generic sequential and parallel iteration over complex collections.

Furthermore, a loop written with @floop can be executed with any compatible executor. See FoldsThreads.jl for various thread-based executors optimized for different kinds of loops. FoldsCUDA.jl provides an executor for GPUs. FLoops.jl also provides a simple distributed executor.

Update notes

FLoops.jl 0.2 defaults to a parallel loop; i.e., it uses a parallel executor (e.g., ThreadedEx) when the executor is not specified and the explicit sequential form @floop begin ... end is not used.

That is to say, @floop without @reduce, such as

@floop for i in eachindex(ys, xs)
    ys[i] = f(xs[i])
end

is now executed in parallel by default.
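If you rely on sequential semantics (e.g., the loop body is not thread-safe), you can opt out by passing an executor explicitly. A minimal sketch, assuming FLoops 0.2 or later:

```julia
using FLoops

xs = [1.0, 2.0, 3.0]
ys = similar(xs)

# SequentialEx forces single-threaded execution even though
# this loop has no @reduce and would otherwise run in parallel.
@floop SequentialEx() for i in eachindex(ys, xs)
    ys[i] = 2 * xs[i]
end
```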


Parallel loop

@floop is a superset of Threads.@threads (see below) and in particular supports complex reduction with additional syntax @reduce:

julia> using FLoops  # exports @floop macro

julia> @floop for (x, y) in zip(1:3, 1:2:6)
           a = x + y
           b = x - y
           @reduce s += a
           @reduce t += b
       end

julia> (s, t)
(15, -3)

For more examples, see parallel loops tutorial.

Sequential (single-thread) loop

Simply wrap a for loop and its initialization part with @floop begin ... end:

julia> @floop begin
           s = 0
           for x in 1:3
               s += x
           end
       end

For more examples, see sequential loops tutorial.

Advantages over Threads.@threads

@floop is a superset of Threads.@threads and has a couple of advantages:

  • @floop supports various input collection types including arrays, dicts, sets, strings, and many iterators from Base.Iterators such as zip and product. More precisely, @floop can generate high-performance parallel iterations for any collection that supports the SplittablesBase.jl interface.
  • With FoldsThreads.NondeterministicEx, @floop can even parallelize iterations over non-parallelizable input collections (although this is beneficial only for heavier workloads).
  • FoldsThreads.jl provides multiple alternative thread-based executors (= loop execution backend) that can be used to tune the performance without touching the loop itself.
  • FoldsCUDA.jl provides a simple GPU executor.
  • @reduce syntax for supporting complex reduction in a forward-compatible manner
    • Note: threadid-based reduction (commonly used in conjunction with @threads) may not be forward-compatible with versions of Julia that support migrating tasks across threads.
  • There is a trick for "changing" the effective number of threads without restarting Julia using the basesize option.
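For instance, basesize can be passed to a thread-based executor to control how many elements each base case handles, which bounds how many tasks are spawned. A sketch (the exact scheduling effect depends on the executor):

```julia
using FLoops

# With basesize = 100, each base case processes at least 100 elements,
# limiting task count regardless of Threads.nthreads().
@floop ThreadedEx(basesize = 100) for x in 1:1000
    @reduce s += x
end
# s == 500500
```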

The relative disadvantages are that @floop is much newer than Threads.@threads and has much more flexible internals; both points can contribute to undiscovered bugs.

How it works

@floop works by converting the native Julia for loop syntax to foldl defined by Transducers.jl. Unlike foldl defined in Base, foldl defined by Transducers.jl is powerful enough to cover the for loop semantics and more.
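As a rough Base-only analogy (this is not the actual Transducers.jl machinery, just an illustration of the fold correspondence):

```julia
# The reducing loop
#     @floop for x in 1:10
#         @reduce s += x
#     end
# corresponds roughly to a left fold over the collection:
s = foldl((acc, x) -> acc + x, 1:10; init = 0)
# s == 55
```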


A singleton type that indicates ScratchSpace is deallocated.

ScratchSpace(f::F, value::T)

An object for appropriately caching the value returned from f. The value is thrown away when serialized.

Construct this by x = ScratchSpace(f, f()). At the use site, update the variable with x = allocated(x) and then fetch the value by x.value.


Set assistant mode; i.e., what to do when FLoops.jl finds a problematic usage pattern.

Assistant modes:

  • :ignore: do nothing
  • :warn: print warning once
  • :warn_always: print warning always
  • :error: throw an error

FLoops.assistant(false) is equivalent to FLoops.assistant(:ignore) and FLoops.assistant(true) is equivalent to FLoops.assistant(:warn).
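For example, one might make the assistant strict while testing and silence it afterwards (a usage sketch based on the modes listed above):

```julia
using FLoops

FLoops.assistant(:error)   # fail loudly on problematic patterns, e.g. in CI
# ... run tests ...
FLoops.assistant(:ignore)  # back to silent mode
```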


Recursively copy expr::Expr but not ReduceOpSpec or InitSpec.

remove_at_simd(__module__, ex) -> (ex′, simd)

Return a tuple with following items:

  • ex′: ex with the first top-level macrocall to Base.@simd removed.
  • simd: true if @simd for ... end is found; :ivdep if @simd ivdep for ... end is found; false otherwise.

Macros that happen to have the name @simd but are not identical to Base.@simd are ignored.

@floop begin
    s₁ = initialization of s₁
    s₂  # pre-initialized variable
    for x in xs, ...
        ...
    end
end

@floop begin ... end expects a (possibly empty) series of assignments or variable declarations (as in s₂ above) followed by a for loop.

When there are no induction variables, begin ... end can be omitted:

@floop for x in xs, ...
    ...
end

Use @reduce for parallel execution:

@floop for x in xs, ...
    @reduce ...
end

@floop can also take an executor argument (which should be an instance of an executor such as SequentialEx, ThreadedEx, or DistributedEx) or nothing (indicating an appropriate parallel executor):

@floop executor for x in xs, ...
    @reduce ...
end

See the module docstring of FLoops for examples.

@init begin
    pv₁ = init₁
    ...
    pvₙ = initₙ
end

Initialize private variables pvᵢ with initializer expression initᵢ for each task. This can be used for mutating objects in a data race-free manner.
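For example, a per-task scratch buffer can be allocated with @init and mutated freely inside the loop body (a sketch; the buffer size and computation are illustrative):

```julia
using FLoops

xs = randn(100)

@floop for x in xs
    # Each task gets its own buffer, so mutating it is data race-free.
    @init buf = Vector{Float64}(undef, 3)
    buf .= (x, 2x, 3x)
    @reduce s += sum(buf)
end
```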

@reduce() do (acc₁ [= init₁]; x₁), ..., (accₙ [= initₙ]; xₙ)
@reduce(acc₁ ⊗₁= x₁, ..., accₙ ⊗ₙ= xₙ)
@reduce(acc₁ .⊗₁= x₁, ..., accₙ .⊗ₙ= xₙ)
@reduce(acc₁ = ⊗₁(init₁, x₁), ..., accₙ = ⊗ₙ(initₙ, xₙ))
@reduce(acc₁ .= (⊗₁).(init₁, x₁), ..., accₙ .= (⊗ₙ).(initₙ, xₙ))

Declare how accumulators are updated in the sequential basecase and how the resulting accumulators from two basecases are combined.

The arguments accᵢ and xᵢ must be symbols except for xᵢ of the last three forms in which an expression can be used at xᵢ.

In the first form,

function ((acc₁, acc₂, ..., accₙ), (x₁, x₂, ..., xₙ))
    ...  # body of the `do` block
    return (acc₁, acc₂, ..., accₙ)
end
should be an associative function.

In the last three forms, every binary operation ⊗ᵢ should be an associative function.

If initᵢ is specified, the tuple (init₁, init₂, ..., initₙ) should be the identity of the related associative function. accᵢ = initᵢ is evaluated for each basecase (each Task) in the beginning.
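For example, -Inf is the identity of max (max(-Inf, x) == x for any x), so each basecase can safely start from that value. A sketch using the fourth form above:

```julia
using FLoops

xs = [3.0, 1.0, 4.0, 1.0, 5.0]

# Each basecase begins with m = -Inf, the identity of max,
# so combining basecases with max is always correct.
@floop for x in xs
    @reduce(m = max(-Inf, x))
end
# m == 5.0
```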

Consider a loop with the following form

@floop for ...
    # code computing (x₁, x₂, ..., xₙ)
    @reduce() do (acc₁ = init₁; x₁), ..., (accₙ = initₙ; xₙ)
        # code updating (acc₁, acc₂, ..., accₙ) using (x₁, x₂, ..., xₙ)
    end
end

This is converted to

acc₁ = init₁
...
accₙ = initₙ
for ...
    # code computing (x₁, x₂, ..., xₙ)
    # code updating (acc₁, acc₂, ..., accₙ) using (x₁, x₂, ..., xₙ)
end

for computing (acc₁, acc₂, ..., accₙ) of each basecase. The accumulators accᵢ of two basecases are combined using "code updating (acc₁, acc₂, ..., accₙ) using (x₁, x₂, ..., xₙ)" where (x₁, x₂, ..., xₙ) are replaced with (acc₁, acc₂, ..., accₙ) of the next basecase. Note that "code computing (x₁, x₂, ..., xₙ)" is not used for combining the basecases.


@reduce() do (vmax=-Inf; v), (imax=0; i)
    if isless(vmax, v)
        vmax = v
        imax = i
    end
end

@reduce(s += y, p *= y)

@reduce(xs = append!!(EmptyVector(), x), ys = append!!(EmptyVector(), y))