Tullio is a very flexible einsum macro. It understands many array operations written in index notation, for example:
@tullio M[x,y,c] := N[x+i, y+j,c] * K[i,j] # sum over i,j, and create M
@tullio S[x] = P[x,y] * log(Q[x,y] / R[y]) # sum over y, and write into S
@tullio A[i,j] += B[i,k,l] * C[l,j] * D[k,j] # sum over k,l, and add to values in A
@tullio (*) Z[j] := X[ind[k],j] * exp(-Y[k]) # product over k
Used by itself the macro writes ordinary nested loops much like Einsum.@einsum
.
One difference is that it can parse more expressions (such as the convolution M
, and worse).
Another is that it will use multi-threading (via Threads.@spawn
) and recursive tiling, on large enough arrays.
But it also co-operates with various other packages, provided they are loaded before the macro is called:
It uses
LoopVectorization.@avx
to speed many things up. (Disable withavx=false
.) On a good day this will match the speed of OpenBLAS for matrix multiplication.It uses
TensorOperations.@tensor
on expressions which this understands. (Disable withtensor=false
.) These must be Einstein-convention contractions of one term; none of the examples above qualify.It uses
KernelAbstractions.@kernel
to make a GPU version. (Disable withcuda=false
.) This is somewhat experimental, and may not be fast.
The macro also tries to provide a gradient for use with Tracker or Zygote.
(Disable with grad=false
, or nograd=A
.) This is done in one of two ways:
By default it takes a symbolic derivative of the right hand side expression. When using
@tensor
, this writes another@tensor
expression for each input array, otherwise it simply fills in all the gradient arrays at once. (Only for reductions over+
ormin
/max
.)The option
grad=Dual
uses instead ForwardDiff to differentiate the right hand side (only for reductions over+
). This allows for more complicated expressions.
The expression need not be just one line, for example:
@tullio out[x, y] := @inbounds(begin # sum over k
a,b = off[k]
mat[mod(x+a), mod(y+b)]
end) (x in axes(mat,1), y in axes(mat,2)) grad=Dual nograd=off
Here the macro cannot infer the range of the output's indices x,y
, so they must be provided explicitly.
(If writing into an existing array, with out[x,y] = begin ...
or +=
, then ranges would be taken from there.)
Because it sees assignment being made, it does not attempt to sum over a,b
, and it assumes that indices could go out of bounds so does not add @inbounds
for you.
(Although in fact mod(x+a) == mod(x+a, axes(mat,1))
is safe.)
It will also not be able to take a symbolic derivative, but dual numbers will work fine.
Pipe operators |>
or <|
indicate functions to be performed outside the sum, for example:
@tullio lse[j] := log <| exp(mat[i,j]) # vec(log.(sum(exp.(mat), dims=1)))
The option @tullio verbose=true
will cause it to print index ranges, symbolic derivatives,
and notices when it is unable to use the packages mentioned above.
And verbose=2
will print everything.
Notation
using Pkg; Pkg.add("Tullio") # now registered
using Tullio
A = [abs2(i - 11) for i in 1:21]
# Downsample -- range of i is that allowed by both terms:
@tullio D[i] := (A[2i] + A[2i+1])/2 # 1:10 == intersect(1:10, 0:10)
# Shifts -- range of i calculated in terms of that given for j:
@tullio M[i,j] := A[i+j-1] (j in 1:15) # i in 1:7
@tullio M[i+_,j] := A[i+j] (j in 1:15) # i in 0:6, automatic shift "i+_"
using OffsetArrays # Convolve a filter:
K = OffsetArray([1,-1,2,-1,1], -2:2)
@tullio C[i] := A[i+j] * K[j] # j ∈ -2:2 implies i ∈ 3:19
# Index by the values in K
@tullio D[i,j] := A[2K[j]+i] ÷ K[j] # extrema(K)==(-1,2) implies i ∈ 3:17
# Wrapped & padded:
@tullio M[i,j] := A[mod(i+j)] (j in 1:15, i in 1:15) # wraps around, mod(i+j, axes(A,1))
@tullio M[i,j] := A[clamp(i+j)] (j in 1:15, i in 1:15) # instead repeats "100"
@tullio M[i+_,j] := A[pad(i+j, 3)] (j in 1:15) # fills with zeros
using FFTW # Functions of the indices are OK:
S = [0,1,0,0, 0,0,0,0]
fft(S) ≈ @tullio F[k] := S[x] * exp(-im*pi/8 * (k-1) * x) (k ∈ axes(S,1))
# Finalisers <| or |> are applied after sum (the two are equivalent):
@tullio N2[j] := sqrt <| M[i,j]^2 # N2 ≈ map(norm, eachcol(M))
@tullio n3[_] := A[i]^3 |> (_)^(1/3) # n3[1] ≈ norm(A,3), with _ anon. func.
# Reduction over any function:
@tullio (*) P[i] := A[i+k] (k in 0:2) # product
@tullio (max) X[i,_] := D[i,j] # maximum(D, dims=2), almost
# Access to fields & arrays -- this uses j ∈ eachindex(first(N).c)
N = [(a=i, b=i^2, c=fill(i^3,3)) for i in 1:10]
@tullio T[i,j] := (N[i].a // 1, N[i].c[j])
# Functions which create arrays are evaluated once:
@tullio R[i,j] := abs.((rand(Int8, 5)[i], rand(Int8, 5)[j]))
using NamedDims, AxisKeys # Dimension names, plus pretty printing:
@tullio M[row=i, col=j, z=k] := A[i+j-1] (j in 1:15, k in 1:2)
@tullio S[i] := M[col=j-i, z=k, row=i+1] # sum over j,k
Threads & SIMD
using Tullio, LoopVectorization, NNlib, BenchmarkTools
# Batched matmul with batch index first in B, defined with @avx loops:
bmm_rev(A, B) = @tullio C[i,k,b] := A[i,j,b] * B[b,k,j] # (sum over j)
A = randn(20,30,500); B = randn(500,40,30);
bmm_rev(A, B) ≈ NNlib.batched_mul(A, permutedims(B, (3,2,1))) # true
@btime bmm_rev($A, $B); # 317.526 μs, same speed as un-permuted bmm
@btime NNlib.batched_mul($A, permutedims($B, (3,2,1))); # 1.478 ms, with MKL
# Complete reduction, without first materialising X .* log.(Y')
sum_opp(X, Y=X) = @tullio s := X[i,j] * log(Y[j,i])
X = rand(1000,1000);
@btime sum_opp($X) # 499.814 μs (173 allocations: 14.20 KiB)
@btime sum($X .* log.(transpose($X))) # 8.759 ms (2 allocations: 7.63 MiB)
Derivatives & GPU
using Tullio
mul(A, B) = @tullio C[i,k] := A[i,j] * B[j,k]
A = rand(3,40); B = rand(40,500);
A * B ≈ mul(A, B) # true
using Tracker # or Zygote
ΔA = Tracker.gradient((A,B) -> sum(mul(A, B)), A, B)[1]
ΔA ≈ ones(3,500) * B' # true
using CUDA, KernelAbstractions # Now defined with a GPU version:
mul(A, B) = @tullio C[i,k] := A[i,j] * B[j,k]
cu(A * B) ≈ mul(cu(A), cu(B)) # true
cu(ΔA) ≈ Tracker.gradient((A,B) -> sum(mul(A, B)), cu(A), cu(B))[1] # true
# Reduction over min/max:
Tracker.gradient(x -> (@tullio (max) res := x[i]^3), [1,2,3,-2,-1,3])[1]
Larger expressions
using Tullio, OffsetArrays
# A convolution with cyclic indices
mat = zeros(10,10,1); mat[2,2] = 101; mat[10,10] = 1;
@tullio kern[i,j] := 1/(1+i^2+j^2) (i in -3:3, j in -3:3)
@tullio out[x,y,c] := begin
xi = mod(x+i, axes(mat,1)) # xi = ... means that it won't be summed,
# yj = mod(y+j, axes(mat,2))
@inbounds trunc(Int, mat[xi, mod(y+j), c] * kern[i,j]) # and disables automatic @inbounds,
end (x in 1:10, y in 1:10) # and prevents range of x from being inferred.
# A stencil?
offsets = [(a,b) for a in -2:2 for b in -2:2 if a>=b] # vector of tuples
@tullio out[x,y,1] = begin
a,b = offsets[k]
i = clamp(x+a, extrema(axes(mat,1))...)
# j = clamp(y+b, extrema(axes(mat,2))...) # can be written clamp(y+b)
@inbounds mat[i, clamp(y+b), 1] * 10
end # ranges of x,y read from out[x,y,1]
# Applying a vector of functions
fs = [sin, cos, tan]
xs = randn(3,100)
@tullio ys[r,c] := (fs[r])(xs[r,c])
using Zygote, ForwardDiff
rowmap(fs, xs) = @tullio ys[r,c] := (fs[r])(xs[r,c]) grad=Dual nograd=fs
Zygote.gradient(sum∘rowmap, fs, ones(3,2))
[f'(1) for f in fs] # agrees
Options
The default setting is:
@tullio threads=true fastmath=true avx=true tensor=true cuda=256 grad=Base verbose=false A[i,j] := ...
threads=false
turns off threading, whilethreads=64^3
sets a threshold size at which to divide the work (replacing the macro's best guess).avx=false
turns off the use ofLoopVectorization
, whileavx=4
inserts@avx unroll=4 for i in ...
.grad=false
turns off gradient calculation, andgrad=Dual
switches it to useForwardDiff
(which must be loaded).nograd=A
turns of the gradient calculation just forA
, andnograd=(A,B,C)
does this for several arrays.tensor=false
turns off the use ofTensorOperations
.- Assignment
xi = ...
removesxi
from the list of indices: its range is note calculated, and it will not be summed over. It also disables@inbounds
since this is now up to you. verbose=true
prints things like the index ranges inferred, and gradient calculations.verbose=2
prints absolutely everything.A[i,j] := ...
makes a new array, whileA[i,j] = ...
andA[i,j] += ...
write into an existing one.A[row=i, col=j] := ...
makes a newNamedDimsArray
.@tullio (*) A[i,j] := ...
is a product, as is@tullio A[i,j] *= ...
. For other reductions,@tullio (f) A[i,j] ^= ...
is an in-place update.init=0.0
gives the initial value for reductions. For+
,*
,min
,min
,&
,|
it has sensible defaults, for other reductions uses zero.
Implicit:
- Indices without shifts must have the same range everywhere they appear, but those with shifts (even
A[i+0]
) run over the intersection of possible ranges. - Shifted output indices must start at 1, unless
OffsetArrays
is visible in the calling module. - The use of
@avx
, and the calculation of gradients, are switched off by sufficiently complex syntax (such as arrays of arrays). - Gradient hooks are attached for any or all of
ReverseDiff
,Tracker
&Zygote
. These packages need not be loaded when the macro is run. - Gradients are only defined for reductions over
(+)
(default) andmin
,max
. - GPU kernels are only constructed when both
KernelAbstractions
andCUDA
are visible. The defaultcuda=256
is passed tokernel(CUDA(), 256)
. - The CPU kernels from
KernelAbstractions
are called only whenthreads=false
; they are not at present very fast, but perhaps useful for testing.
Extras:
A[i] := i^2 (i in 1:10)
is how you specify a range for indices when this can't be inferred.A[i] := B[i, $col] - C[i, 2]
is how you fix one index to a constant (to preventcol
being summed over).A[i] := $d * B[i]
is the preferred way to include other constants. Note that no gradient is calculated ford
.- Within indexing,
A[mod(i), clamp(j)]
both mapsi
&j
to lie withinaxes(A)
, and disables inference of their ranges fromA
. - Similarly,
A[pad(i,3)]
extends the range ofi
, inserting zeros outside ofA
. Instead of zero,pad=NaN
uses this value as padding. The implementation of this (andmod
,clamp
) is not very fast at present. - On the left, when making a new array, an underscore like
A[i+_] :=
inserts whatever shift is needed to makeA
one-based. Tullio.@printgrad (x+y)*log(x/z) x y z
prints out how symbolic derivatives will be done.
Elsewhere
Back-end friends & relatives:
LoopVectorization.jl is used here, if available.
Gaius.jl and PaddedMatrices.jl build on that.
GPUifyLoops.jl and KernelAbstractions.jl generate GPU-compatible kernels.
ThreadsX.jl does threaded reductions, and much else.
Strided.jl does multi-threaded broadcasting.
Front-end near-lookalikes:
Einsum.jl makes simple loops. See tests/einsum.jl where
using Tullio: @einsum
is an almost-seamless replacement.TensorOperations.jl and OMEinsum.jl identify patterns on which they can call various basic operations.
TensorCast.jl expresses everything as Julia array operations, broadcasting and reduction. (OMEinsum.jl also treats some cases as a special lazy broadcast-reduction.)
Things you can't run:
Tortilla.jl seems to exist, publicly, only in this very nice talk.
ArrayMeta.jl was a Julia 0.5 take on some of this.
Tokamak.jl was another, see readme here.