Tullio.jl


This is a package for writing array operations in index notation, such as:

@tullio M[x,y,c] := N[x+i, y+j,c] * K[i,j]     # sum over i,j, and create M

@tullio S[x] = P[x,y] * log(Q[x,y] / R[y])     # sum over y, and write into S

@tullio A[i,j] += B[i,k,l] * C[l,j] * D[k,j]   # sum over k,l, and add to values in A

@tullio (*) Z[j] := X[ind[k],j] * exp(-Y[k])   # product over k

Used by itself, the macro writes ordinary nested loops much like Einsum.@einsum. One difference is that it can parse more expressions (such as the convolution defining M above, and worse). Another is that it will use multi-threading (via Threads.@spawn), dividing large enough arrays into blocks. But it also co-operates with various other packages, provided they are loaded before the macro is called:

  • It can use LoopVectorization.@avx to speed many things up. (Disable with avx=false.)

  • It can use KernelAbstractions.@kernel to make a GPU version. (Disable with cuda=false.)

  • It can use TensorOperations.@tensor on expressions which this understands. (Disable with tensor=false.) These must be Einstein-convention contractions of one term; none of the examples above qualify.
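
For example, a plain matrix multiplication is a one-term Einstein-convention contraction, so with TensorOperations loaded it may be handed to @tensor. A minimal sketch (the array sizes are arbitrary):

using Tullio, TensorOperations

A, B = rand(4,5), rand(5,6)
@tullio C[i,k] := A[i,j] * B[j,k]                 # one term, one contracted index: may be passed to @tensor
@tullio C2[i,k] := A[i,j] * B[j,k]  tensor=false  # the same, using the macro's own loops
C ≈ C2 # true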

The macro also tries to provide a gradient for use with any of Tracker, Zygote or ReverseDiff. (Disable with grad=false.) This is done in one of two ways:

  • By default it takes a symbolic derivative of the right hand side expression. When using @tensor, this writes another @tensor expression for each input array, otherwise it simply fills in all the gradient arrays at once. (Only for reductions over + or min/max.)

  • The option grad=Dual uses instead ForwardDiff to differentiate the right hand side (only for reductions over +). This allows for more complicated expressions.
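
A minimal sketch of both paths, assuming Zygote (and, for grad=Dual, ForwardDiff) is loaded:

using Tullio, Zygote, ForwardDiff

f(x) = @tullio s := x[i]^2                 # symbolic derivative of the right hand side
g(x) = @tullio s := x[i]^2  grad=Dual      # the same, differentiated with dual numbers

Zygote.gradient(f, [1.0, 2.0, 3.0])[1]     # ≈ [2.0, 4.0, 6.0]
Zygote.gradient(g, [1.0, 2.0, 3.0])[1]     # ≈ [2.0, 4.0, 6.0]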

The expression need not be just one line, for example:

@tullio out[x,y,n] := begin              # sum over a,b
        i = mod(x+a, axes(mat,1))
        j = mod(y+b, axes(mat,2))
        @inbounds mat[i,j,n] * abs(kern[a,b])
    end (x in axes(mat,1), y in axes(mat,2)) grad=Dual

Here the macro cannot infer the range of the output's indices x,y, so they must be provided explicitly. (If writing into an existing array, with out[x,y,n] = begin ... or +=, then ranges would be taken from there.) It knows that it should not sum over indices i,j, but since it can't be sure of their ranges, it will not add @inbounds in such cases. It will also not be able to take a symbolic derivative here, but dual numbers will work fine.

The option @tullio verbose=true will cause it to print index ranges, symbolic derivatives, and notices when it is unable to use the packages mentioned above.

Notation
using Pkg; pkg"add Tullio" # now registered
using Tullio
A = [abs2(i - 11) for i in 1:21]

# Downsample -- range of i is that allowed by both terms:
@tullio D[i] := (A[2i] + A[2i+1])/2  # 1:10 == intersect(1:10, 0:10)

# Shifts -- range of i calculated in terms of that given for j:
@tullio M[i,j] := A[i+j-1]  (j in 1:15)  # i in 1:7

using OffsetArrays # Convolve a filter:
K = OffsetArray([1,-1,2,-1,1], -2:2)
@tullio C[i] := A[i+j] * K[j]  # j ∈ -2:2 implies i ∈ 3:19

# Index by the values in K
@tullio D[i,j] := A[2K[j]+i] ÷ K[j] # extrema(K)==(-1,2) implies i ∈ 3:17

using FFTW # Functions of the indices are OK:
S = [0,1,0,0, 0,0,0,0]
fft(S) ≈ @tullio F[k] := S[x] * exp(-im*pi/8 * (k-1) * x)  (k ∈ axes(S,1))

# Reduction over any function:
@tullio (*) P[i] := A[i+k]  (k in 0:2) # product
@tullio (max) X[i,_] := D[i,j]         # maximum(D, dims=2), almost

# Access to fields & arrays -- this uses j ∈ eachindex(first(N).c)
N = [(a=i, b=i^2, c=fill(i^3,3)) for i in 1:10]
@tullio T[i,j] := (N[i].a // 1, N[i].c[j])

# Functions which create arrays are evaluated once:
@tullio R[i,j] := abs.((rand(Int8, 5)[i], rand(Int8, 5)[j]))

using NamedDims, AxisKeys # Dimension names, plus pretty printing:
@tullio M[row=i, col=j, z=k] := A[i+j-1]  (j in 1:15, k in 1:2)
@tullio S[i] := M[col=j-i, z=k, row=i+1] # sum over j,k
Threads & SIMD
using Tullio, LoopVectorization, NNlib, BenchmarkTools

# Batched matmul with batch index first in B, defined with @avx loops:
bmm_rev(A, B) = @tullio C[i,k,b] := A[i,j,b] * B[b,k,j]  # (sum over j)

A = randn(20,30,500); B = randn(500,40,30);
bmm_rev(A, B) ≈ NNlib.batched_mul(A, permutedims(B, (3,2,1))) # true

@btime bmm_rev($A, $B); # 317.526 μs, same speed as un-permuted bmm
@btime NNlib.batched_mul($A, permutedims($B, (3,2,1))); # 1.478 ms, with MKL

# Complete reduction, without first materialising X .* log.(Y')
sum_opp(X, Y=X) = @tullio s := X[i,j] * log(Y[j,i])

X = rand(1000,1000);
@btime sum_opp($X)                    #   499.814 μs (173 allocations: 14.20 KiB)
@btime sum($X .* log.(transpose($X))) # 8.759 ms (2 allocations: 7.63 MiB)
Derivatives & GPU
using Tullio
mul(A, B) = @tullio C[i,k] := A[i,j] * B[j,k] 

A = rand(3,40); B = rand(40,500);
A * B ≈ mul(A, B) # true

using Tracker
ΔA = Tracker.gradient((A,B) -> sum(mul(A, B)), A, B)[1]
ΔA ≈ ones(3,500) * B' # true

using CUDA, KernelAbstractions # Now defined with a GPU version:
mul(A, B) = @tullio C[i,k] := A[i,j] * B[j,k]

cu(A * B) ≈ mul(cu(A), cu(B)) # true

cu(ΔA) ≈ Tracker.gradient((A,B) -> sum(mul(A, B)), cu(A), cu(B))[1] # true

# Reduction over min/max:
Tracker.gradient(x -> (@tullio (max) res := x[i]^3), [1,2,3,-2,-1,3])[1]
Larger expressions
mat = zeros(10,10,1); mat[1,1] = 101;
@tullio kern[i,j] := 1/(1+i^2+j^2)  (i in -2:2, j in -2:2)

@tullio out[x,y,c] := begin
    xi = mod(x+i, axes(mat,1)) # xi = ... means that it won't be summed,
    yj = mod(y+j, axes(mat,2))
    @inbounds trunc(Int, mat[xi, yj, c] * kern[i,j]) # and disables automatic @inbounds,
end (x in 1:10, y in 1:10) # and prevents range of x from being inferred.

fs = [sin, cos, tan]
xs = randn(3,100)

@tullio ys[r,c] := (fs[r])(xs[r,c]) # works, but not the gradient

function rowmap(fs, xs)
    axes(fs,1) == axes(xs,1) || error()
    @tullio ys[r,c] := begin
        @inbounds f = getindex($fs, r) # not writing fs[r] avoids trying to make Δfs
        @inbounds f(xs[r,c]) # assignment f = ... has again disabled @inbounds.
    end grad=Dual
end

using Zygote, ForwardDiff
Zygote.gradient(sum∘rowmap, fs, ones(3,2))
[f'(1) for f in fs]
Options

The default setting is: @tullio threads=true fastmath=true avx=true tensor=true cuda=256 grad=Base verbose=false A[i,j] := ...

  • threads=false turns off threading, while threads=64^3 sets a threshold size at which to divide the work (replacing the macro's best guess).
  • avx=false turns off the use of LoopVectorization, while avx=4 inserts @avx unroll=4 for i in ....
  • grad=false turns off gradient calculation, and grad=Dual switches it to use ForwardDiff (which must be loaded).
  • tensor=false turns off the use of TensorOperations.
  • Assignment xi = ... removes xi from the list of indices: its range is not calculated, and it will not be summed over. It also disables @inbounds, since this is now up to you.
  • verbose=true prints things like the index ranges inferred, and gradient calculations. verbose=2 prints absolutely everything.
  • A[i,j] := ... makes a new array, while A[i,j] = ... and A[i,j] += ... write into an existing one. A[row=i, col=j] := ... makes a new NamedDimsArray.
  • @tullio (*) A[i,j] := ... is a product, as is @tullio A[i,j] *= ....
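
Putting a few of these options together, a sketch (the data is arbitrary):

using Tullio

A = 1 .+ rand(10)
@tullio (*) p := A[i]  threads=false verbose=true  # product of A, single-threaded, printing what was inferred
p ≈ prod(A) # true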

Implicit:

  • Indices without shifts must have the same range everywhere they appear, but those with shifts (even A[i+0]) run over the intersection of possible ranges.
  • Shifted output indices must start at 1, unless OffsetArrays is visible in the calling module (see the sketch after this list).
  • The use of @avx, and the calculation of gradients, are switched off by sufficiently complex syntax (such as arrays of arrays).
  • Gradient hooks are attached for any or all of ReverseDiff, Tracker & Zygote. These packages need not be loaded when the macro is run.
  • Gradients are only defined for reductions over (+) (default) and min, max.
  • GPU kernels are only constructed when both KernelAbstractions and CuArray are visible. The default cuda=256 is passed to kernel(CUDA(), 256).
  • The CPU kernels from KernelAbstractions are called only when threads=false; they are not at present very fast, but perhaps useful for testing.
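
For instance, a sketch of the shifted-output-index rule above (the contents of A are arbitrary):

using Tullio, OffsetArrays

A = collect(1:10)
@tullio B[i] := A[i+5]  # i ∈ -4:5, so B should be created as an OffsetArray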

Extras:

  • A[i] := i^2 (i in 1:10) is how you specify a range for indices when this can't be inferred.
  • A[i] := B[i, $col] - C[i, 2] is how you fix one index to a constant (to prevent col being summed over).
  • A[i] := $d * B[i] is the preferred way to include other constants (see the sketch after this list). Note that no gradient is calculated for d.
  • Tullio.@printgrad (x+y)*log(x/z) x y z prints out how symbolic derivatives will be done.
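
A sketch combining the two $ items above (the values of d and col are arbitrary):

using Tullio

B = rand(10,4); C = rand(10,4)
d, col = 10, 3
@tullio A[i] := $d * (B[i, $col] - C[i, 2])  # d and col are fixed constants, not summed over; no gradient for d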
Elsewhere

Back-end friends & relatives:

  • LoopVectorization.jl supplies the @avx macro which this can use to speed up loops.

  • KernelAbstractions.jl supplies the @kernel machinery used for the GPU (and threaded CPU) versions.

  • TensorOperations.jl supplies @tensor, which takes over suitable one-term Einstein-convention contractions.

Front-end near-lookalikes:

  • Einsum.jl makes simple loops. See test/einsum.jl, where using Tullio: @einsum is an almost-seamless replacement.

  • TensorOperations.jl and OMEinsum.jl identify patterns on which they can call various basic operations.

  • TensorCast.jl expresses everything as Julia array operations, broadcasting and reduction. (OMEinsum.jl also treats some cases as a special lazy broadcast-reduction.)

Things you can't run: