API reference
Macros
LoopVectorization.@avx — Macro
@avx
Annotate a for loop, or a set of nested for loops whose bounds are constant across iterations, to optimize the computation. For example:
function AmulBavx!(C, A, B)
    @avx for m ∈ 1:size(A,1), n ∈ 1:size(B,2)
        Cₘₙ = zero(eltype(C))
        for k ∈ 1:size(A,2)
            Cₘₙ += A[m,k] * B[k,n]
        end
        C[m,n] = Cₘₙ
    end
end
The macro models the set of nested loops, and chooses an ordering of the three loops to minimize predicted computation time.
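To check that the reordered loops still compute the same product, the example can be compared against Julia's built-in matrix multiplication. This is a minimal sketch: the matrix sizes are arbitrary, and it assumes a LoopVectorization release that provides `@avx` under that name.

```julia
using LoopVectorization

# Same kernel as in the example above, repeated so this block is self-contained.
function AmulBavx!(C, A, B)
    @avx for m ∈ 1:size(A,1), n ∈ 1:size(B,2)
        Cₘₙ = zero(eltype(C))
        for k ∈ 1:size(A,2)
            Cₘₙ += A[m,k] * B[k,n]
        end
        C[m,n] = Cₘₙ
    end
    return C
end

# Arbitrary, deliberately non-square sizes (an assumption for illustration).
A = rand(8, 5)
B = rand(5, 7)
C = zeros(8, 7)

AmulBavx!(C, A, B)

# Compare against the built-in product; ≈ allows for floating-point
# reassociation introduced by vectorization.
println(C ≈ A * B)
```

The comparison uses `≈` rather than `==` because the macro is free to change the order of floating-point operations.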
It can also be applied to broadcasts:
julia> using LoopVectorization
julia> a = rand(100);
julia> b = @avx exp.(2 .* a);
julia> c = similar(b);
julia> @avx @. c = exp(2a);
julia> b ≈ c
true
Extended help
Advanced users can customize the implementation of the @avx-annotated block using keyword arguments:
@avx inline=false unroll=2 body
where body is the code of the block (e.g., for ... end).
inline is a Boolean. When true, body will be directly inlined into the function (via a forced-inlining call to _avx_!). When false, it won't force inlining of the call to _avx_!, instead letting Julia's own inlining engine determine whether the call to _avx_! should be inlined. (Typically, it won't.) Sometimes not inlining can lead to substantially worse code generation and regressions of more than 40%, even in very large problems (2-d convolutions are a case where this has been observed). There are circumstances where inline=true is faster and others where inline=false is faster, so the best setting may require experimentation. By default, the macro tries to guess. Currently the algorithm is simple: roughly, if there are more than two dynamically sized loops and no convolutions, it will probably not force inlining. Otherwise, it probably will.
check_empty (default is false) determines whether the macro checks if any of the iterators are empty. If false, you must ensure yourself that they are not empty; otherwise the behavior of the loop is undefined and (as with @inbounds) segmentation faults are likely.
unroll is an integer that specifies the loop unrolling factor, or a tuple (u₁, u₂) = (4, 2) signaling that the generated code should unroll more than one loop. u₁ is the unrolling factor for the first unrolled loop and u₂ for the next (if present), but these apply to the loop ordering and unrolling chosen by LoopVectorization, not the order in body. uᵢ=0 (the default) indicates that LoopVectorization should pick its own value, and uᵢ=-1 disables unrolling for the corresponding loop.
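The keyword arguments described above are written between the macro name and the loop. The following sketch passes `check_empty=true` and `unroll=2` to a simple reduction. The helper name `dot_avx` and the input length are assumptions made for illustration, and whether these settings help performance depends on the problem.

```julia
using LoopVectorization

# Hypothetical helper: a dot product with explicit @avx keyword arguments.
function dot_avx(a, b)
    s = zero(eltype(a))
    # check_empty=true asks the macro to check for empty iterators;
    # unroll=2 requests an unrolling factor of 2 for the unrolled loop.
    @avx check_empty=true unroll=2 for i ∈ eachindex(a, b)
        s += a[i] * b[i]
    end
    return s
end

a = rand(1000)
b = rand(1000)

# Compare with a plain broadcast-and-sum reference.
println(dot_avx(a, b) ≈ sum(a .* b))
```

`inline=true` or `inline=false` can be passed in the same position; as noted above, which one is faster may require experimentation.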
The @avx macro also checks the array arguments using LoopVectorization.check_args to try to determine whether they are compatible with the macro. If check_args returns false, a fallback loop annotated with @inbounds and @fastmath is generated. Note that VectorizationBase provides functions such as vadd and vmul that will ignore @fastmath, preserving IEEE semantics both within @avx and @fastmath. check_args currently returns false for some wrapper types like LinearAlgebra.UpperTriangular, requiring you to use their parent. Triangular loops aren't yet supported.
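The compatibility check can also be called directly to see how a given array is classified. The sketch below queries `LoopVectorization.check_args` for a dense matrix and for an `UpperTriangular` wrapper, then runs an `@avx` loop on the wrapper's `parent`. The helper `sum_avx` is hypothetical, and the values returned by `check_args` may differ between package versions, so they are printed rather than assumed.

```julia
using LoopVectorization, LinearAlgebra

A = rand(4, 4)
U = UpperTriangular(A)

# check_args is called with the module prefix, as in the text above.
dense_ok   = LoopVectorization.check_args(A)
wrapper_ok = LoopVectorization.check_args(U)
println("dense matrix compatible:         ", dense_ok)
println("UpperTriangular wrapper compatible: ", wrapper_ok)

# Hypothetical helper: sum every entry of a dense matrix.
function sum_avx(X)
    s = zero(eltype(X))
    @avx for n ∈ 1:size(X,2), m ∈ 1:size(X,1)
        s += X[m,n]
    end
    return s
end

# parent(U) is the underlying dense matrix, so this sums all of its
# entries, including those below the diagonal that U itself hides.
println(sum_avx(parent(U)) ≈ sum(A))
```

Because `parent` exposes the full storage, any triangular structure has to be handled by the caller.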
LoopVectorization.@_avx — Macro
@_avx
This macro transforms loops similarly to @avx. While @avx punts to a generated function to enable type-based analysis, @_avx works on just the expressions. This requires that it make a number of default assumptions. Use of @avx is preferred.
This macro accepts the inline and unroll keyword arguments like @avx, but ignores the check_empty argument.
map-like constructs
LoopVectorization.vmap — Function
vmap(f, a::AbstractArray)
vmap(f, a::AbstractArray, b::AbstractArray, ...)
SIMD-vectorized map, applying f to each element of a (or paired elements of a, b, ...) and returning a new array.
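As a small illustration of the two call forms, the sketch below applies simple arithmetic functions with `vmap` and compares the results with Base's `map`. The functions and array lengths are arbitrary choices for this example.

```julia
using LoopVectorization

a = rand(1000)
b = rand(1000)

# One-argument form: f is applied to each element of a.
y1 = vmap(x -> 2x + 1, a)

# Multi-argument form: f is applied to paired elements of a and b.
y2 = vmap((x, y) -> x * y + 1, a, b)

# Both should agree with Base.map up to floating-point rounding.
println(y1 ≈ map(x -> 2x + 1, a))
println(y2 ≈ map((x, y) -> x * y + 1, a, b))
```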
LoopVectorization.vmap! — Function
vmap!(f, destination, a::AbstractArray)
vmap!(f, destination, a::AbstractArray, b::AbstractArray, ...)
Vectorized map!, applying f to each element of a (or paired elements of a, b, ...) and storing the result in destination.
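The in-place form can be sketched the same way: the destination is allocated once and then overwritten. The function and sizes below are arbitrary choices for illustration.

```julia
using LoopVectorization

a = rand(1000)
b = rand(1000)
dest = similar(a)   # preallocated output with the same shape and eltype as a

# Store x + 2y for each pair of elements into dest.
vmap!((x, y) -> x + 2y, dest, a, b)

# Compare with the equivalent broadcast.
println(dest ≈ a .+ 2 .* b)
```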
LoopVectorization.vmapnt — Function
vmapnt(f, a::AbstractArray)
vmapnt(f, a::AbstractArray, b::AbstractArray, ...)
A "non-temporal" variant of vmap. This can improve performance in cases where the destination (here, the returned array) will not be needed soon.
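A usage sketch for the allocating non-temporal variant, with an arbitrary function and array length chosen for illustration. It only checks that the result matches `map`; any speedup depends on the hardware and on how soon the result is read.

```julia
using LoopVectorization

a = rand(10^6)

# Same call shape as vmap; only the way results are written to memory differs.
b = vmapnt(x -> 3x - 1, a)

println(b ≈ map(x -> 3x - 1, a))
```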
LoopVectorization.vmapnt! — Function
vmapnt!(::Function, dest, args...)
This is a vectorized map implementation using nontemporal store operations. This means that the write operations to the destination will not go to the CPU's cache. If you will not immediately be reading from these values, this can improve performance because the writes won't pollute your cache. This can especially be the case if your arguments are very long.
julia> using LoopVectorization, BenchmarkTools
julia> x = rand(10^8); y = rand(10^8); z = similar(x);
julia> f(x,y) = exp(-0.5abs2(x - y))
f (generic function with 1 method)
julia> @benchmark map!(f, $z, $x, $y)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 439.613 ms (0.00% GC)
median time: 440.729 ms (0.00% GC)
mean time: 440.695 ms (0.00% GC)
maximum time: 441.665 ms (0.00% GC)
--------------
samples: 12
evals/sample: 1
julia> @benchmark vmap!(f, $z, $x, $y)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 178.147 ms (0.00% GC)
median time: 178.381 ms (0.00% GC)
mean time: 178.430 ms (0.00% GC)
maximum time: 179.054 ms (0.00% GC)
--------------
samples: 29
evals/sample: 1
julia> @benchmark vmapnt!(f, $z, $x, $y)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 144.183 ms (0.00% GC)
median time: 144.338 ms (0.00% GC)
mean time: 144.349 ms (0.00% GC)
maximum time: 144.641 ms (0.00% GC)
--------------
samples: 35
evals/sample: 1
LoopVectorization.vmapntt — Function
vmapntt(f, a::AbstractArray)
vmapntt(f, a::AbstractArray, b::AbstractArray, ...)
A threaded variant of vmapnt.
LoopVectorization.vmapntt! — Function
vmapntt!(::Function, dest, args...)
Like vmapnt!, but uses Threads.@threads for parallel execution.
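The threaded variants are called exactly like their single-threaded counterparts. The sketch below reuses the function `f` from the `vmapnt!` benchmark above on smaller, arbitrarily sized arrays. Threading only has an effect when Julia is started with more than one thread (for example with `julia -t 4`).

```julia
using LoopVectorization

x = rand(10^6)
y = rand(10^6)
z = similar(x)

# Same function as in the vmapnt! benchmark above.
f(x, y) = exp(-0.5abs2(x - y))

# In-place, threaded, non-temporal map.
vmapntt!(f, z, x, y)

# Allocating counterpart.
w = vmapntt(f, x, y)

println("threads available: ", Threads.nthreads())
println(z ≈ map(f, x, y))
println(w ≈ z)
```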
filter-like constructs
LoopVectorization.vfilter — Function
vfilter(f, a::AbstractArray)
SIMD-vectorized filter, returning an array containing the elements of a for which f returns true.
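A short sketch of `vfilter` with a simple threshold predicate, compared against Base's `filter`. The predicate and array length are arbitrary choices for illustration.

```julia
using LoopVectorization

a = rand(1000)

# Keep only the elements greater than 0.5.
kept = vfilter(x -> x > 0.5, a)

# The result should match Base.filter with the same predicate.
println(kept == filter(x -> x > 0.5, a))
```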
LoopVectorization.vfilter! — Function
vfilter!(f, a::AbstractArray)
SIMD-vectorized filter!, removing the elements of a for which f returns false.