VectorizationBase.jl
VectorizationBase.CACHE_SIZE
VectorizationBase.AbstractStridedPointer
VectorizationBase.GroupedStridedPointers
VectorizationBase.MM
VectorizationBase.Unroll
VectorizationBase.align
VectorizationBase.bitselect
VectorizationBase.grouped_strided_pointer
VectorizationBase.ifmahi
VectorizationBase.ifmalo
VectorizationBase.inv_approx
VectorizationBase.lazymul
VectorizationBase.offset_ptr
VectorizationBase.pause
VectorizationBase.promote_div
VectorizationBase.unrolled_indicies
VectorizationBase.vinv_fast
VectorizationBase.vrangeincr
VectorizationBase.CACHE_SIZE
— Constant
L₁, L₂, L₃, and L₄ cache sizes.
VectorizationBase.AbstractStridedPointer
— Type
abstract type AbstractStridedPointer{T,N,C,B,R,X,O} end

T: element type
N: dimensionality
C: contiguous dim
B: batch size
R: rank of strides
X: strides
O: offsets
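To illustrate how these parameters might combine, here is a scalar Python sketch of a strided address computation. The formula (byte address = pointer + Σ_d X[d] · (i[d] − O[d])) and the helper name are my assumptions for illustration, not the package's code:

```python
def element_address(ptr, strides, offsets, index):
    # Hypothetical scalar model: `strides` are byte strides (X), `offsets`
    # are the per-axis index offsets (O), `index` is the requested index.
    return ptr + sum(x * (i - o) for x, o, i in zip(strides, offsets, index))

# Column-major Float64 matrix with 10 rows and 1-based indexing:
addr = element_address(0, (8, 80), (1, 1), (3, 2))  # byte offset of A[3, 2] -> 96
```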
VectorizationBase.GroupedStridedPointers
— Type
P are the pointers; I contains indexes into strides (dynamic) and X (static); X are the static strides.
VectorizationBase.MM
— Type
The name MM refers to MM registers such as XMM, YMM, and ZMM. MMX, from the original MMX SIMD instruction set, is a [meaningless initialism](https://en.wikipedia.org/wiki/MMX_(instruction_set)#Naming).

The MM{W,X} type is used to represent SIMD indexes of width W with stride X.
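Reading MM{W,X} as the arithmetic index sequence i, i+X, …, i+(W-1)·X, the represented lanes can be sketched in Python (illustrative values only; the helper name is mine):

```python
def mm_lanes(i, W, X):
    # The W lane indexes represented by an MM{W,X} starting at i with stride X.
    return [i + X * j for j in range(W)]

mm_lanes(10, 4, 2)  # -> [10, 12, 14, 16]
```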
VectorizationBase.Unroll
— Type
AU: unrolled axis
F: factor, step size per unroll
N: how many times it is unrolled
AV: vectorized axis
W: vector width
M: bitmask indicating whether each factor is masked
i::I: index
VectorizationBase.align
— Function
align(x::Union{Int,Ptr}, [n])
Return the aligned memory address with minimum increment. align assumes n is a power of 2.
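The round-up computation the docstring describes is the standard power-of-two alignment trick, sketched here in Python (the default n = 64 is an assumption for the example, not the package's default):

```python
def align(x, n=64):
    # Round x up to the next multiple of n. Assumes n is a power of 2,
    # so -n acts as a mask clearing the low bits. (n=64 is an assumed default.)
    return (x + n - 1) & -n

align(13, 16)  # -> 16
align(16, 16)  # -> 16 (already aligned, no increment)
```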
VectorizationBase.bitselect
— Method
bitselect(m::Unsigned, x::Unsigned, y::Unsigned)
If you have AVX512, setbits of vector arguments will select bits according to the mask m, selecting from x where m is 0 and from y where m is 1. For scalar arguments, or for vector arguments without AVX512, setbits additionally requires that y be 0 wherever m is 1; that is, it requires that ((y ⊻ m) & m) == m.
VectorizationBase.grouped_strided_pointer
— Method
G is a tuple of the form tuple(tuple((Aind, A's dim), (Aind, A's dim)), ()); it gives the groups.
VectorizationBase.ifmahi
— Method
ifmahi(v1, v2, v3)
Multiply unsigned integers v1 and v2, adding the upper 52 bits of the product to v3.
Requires VectorizationBase.AVX512IFMA to be fast.
VectorizationBase.ifmalo
— Method
ifmalo(v1, v2, v3)
Multiply unsigned integers v1 and v2, adding the lower 52 bits of the product to v3.
Requires VectorizationBase.AVX512IFMA to be fast.
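A scalar Python model of the 52-bit multiply-add semantics described by the two entries above (one plausible reading of per-lane behavior; the real AVX512IFMA instructions operate on 64-bit lanes, multiplying the low 52 bits of each operand):

```python
MASK52 = (1 << 52) - 1

def ifmalo(v1, v2, v3):
    # Add the low 52 bits of the 104-bit product of the low 52 bits
    # of v1 and v2 to the accumulator v3.
    return (((v1 & MASK52) * (v2 & MASK52)) & MASK52) + v3

def ifmahi(v1, v2, v3):
    # Add the high 52 bits of that same 104-bit product to v3.
    return (((v1 & MASK52) * (v2 & MASK52)) >> 52) + v3

ifmalo(3, 5, 100)            # -> 115
ifmahi(1 << 26, 1 << 26, 0)  # -> 1  (the product 2^52 has high part 1, low part 0)
```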
VectorizationBase.inv_approx
— Method
Fast approximate reciprocal.
Guaranteed accurate to at least 2^-14 ≈ 6.103515625e-5.
Useful for special function implementations.
VectorizationBase.lazymul
— Method
Basically:

if I ∈ [3, 5, 7, 9]
    c[(I - 1) >> 1]
else
    b * I
end

because c = b .* [3, 5, 7, 9].
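The lookup described above can be checked in Python; note that the docstring's c[(I - 1) >> 1] is 1-based Julia indexing, so the Python index is shifted by one:

```python
b = 7
c = [b * k for k in (3, 5, 7, 9)]  # c = b .* [3, 5, 7, 9]
for I in (3, 5, 7, 9):
    # Julia's c[(I - 1) >> 1] becomes c[((I - 1) >> 1) - 1] with 0-based indexing.
    assert c[((I - 1) >> 1) - 1] == b * I
```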
VectorizationBase.offset_ptr
— Method
An omnibus offset constructor.

The general motivation for generating the memory addresses as LLVM IR rather than combining multiple llvmcall Julia functions is that we want to minimize the inttoptr and ptrtoint calculations as we go back and forth. These can get in the way of some optimizations, such as memory address calculations. It is particularly important for gathers and scatters, as these functions take a Vec{W,Ptr{T}} argument to load/store a Vec{W,T} to/from. If sizeof(T) < sizeof(Int), converting the <W x $(typ)*> vectors of pointers in LLVM to integer vectors as they're represented in Julia will likely make them too large to fit in a single register, splitting the operation into multiple operations and forcing a corresponding split of the Vec{W,T} vector as well. This would all be avoided by not promoting/widening the <W x $(typ)*> into a vector of Ints.

For this last issue, an alternate workaround would be to wrap a Vec of 32-bit integers with a type that defines it as a pointer for use with internal llvmcall functions, but I haven't really explored this optimization.
VectorizationBase.pause
— Method
pause()
For use in spin-and-wait loops, like spinlocks.
VectorizationBase.promote_div
— Method
Promote, favoring <:Signed or <:Unsigned of the first argument.
VectorizationBase.unrolled_indicies
— Method
Returns a vector of expressions for a set of unrolled indices.
VectorizationBase.vinv_fast
— Method
vinv_fast(x)
More accurate version of inv_approx, using 1 (Float32) or 2 (Float64) Newton iterations to achieve reasonable accuracy. Requires an x86 CPU for Float32 support, and AVX512F for Float64. Otherwise, it falls back on vinv(x).

y = 1 / x
Use a Newton iteration: yₙ₊₁ = yₙ - f(yₙ)/f′(yₙ)
f(yₙ) = 1/yₙ - x
f′(yₙ) = -1/yₙ²
yₙ₊₁ = yₙ + (1/yₙ - x) * yₙ² = yₙ + yₙ - x * yₙ² = 2yₙ - x * yₙ² = yₙ * (2 - x * yₙ)
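The refinement step yₙ₊₁ = yₙ * (2 - x * yₙ) can be demonstrated in Python; the seed below merely simulates inv_approx's ~2⁻¹⁴-accurate estimate, since the hardware instruction is not available here:

```python
def newton_recip_step(x, y):
    # One Newton iteration for the reciprocal: yₙ₊₁ = yₙ * (2 - x * yₙ).
    return y * (2.0 - x * y)

x = 3.0
y = (1.0 / x) * (1.0 + 6.0e-5)  # stand-in seed with ~2^-14 relative error
for _ in range(2):              # the Float64 path uses 2 iterations
    y = newton_recip_step(x, y)
# The relative error roughly squares each step: 6e-5 -> ~4e-9 -> ~1e-17.
```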
VectorizationBase.vrangeincr
— Method
vrange(::Val{W}, i::I, ::Val{O}, ::Val{F})

W: vector width
i::I: dynamic offset
O: static offset
F: static multiplicative factor
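One plausible reading of these parameters, as the lane values i + F·(O + j) for j = 0, …, W-1 (this combination is my assumption for illustration; the docstring does not spell out the formula):

```python
def vrange(W, i, O, F):
    # Hypothetical lane values from vector width W, dynamic offset i,
    # static offset O, and static multiplicative factor F.
    return [i + F * (O + j) for j in range(W)]

vrange(4, 100, 1, 2)  # -> [102, 104, 106, 108] under this reading
```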