bitonic_merge(input_a::Vec{N,T}, input_b::Vec{N,T}) where {N,T}

Merges two SIMD.Vec objects of the same type and size using a bitonic sort network. The inputs are assumed to be sorted. Returns a pair of vectors with the first and second halves of the merged sequence.

bitonic_merge_interleaved(output_a, output_b, input_a, input_b, Val(N))

Merges K pairs of vectors using a bitonic sort network, with interleaved execution. The inputs are assumed to be sorted. Inputs and outputs are acessed directly in memory.


Regular Comb sort followed by Insertion sort when the interval becomes 1.

chipsort_medium!(data, Val(V), Val(J), Val(K))

Sort a medium array in-place. The data is divided between K blocks of J SIMD vectors of size V. The array should typically fit inside lower levels of cache memory (L1 or L2), for instance 8192 Int32 values on an conventional desktop computer. Our recipe is:

  • Generate small sorted vectors.
  • Use vectorized Comb sort.
  • Transpose in-place.
  • Finish with insertion sort over the nearly sorted array.


julia> chipsort_medium!(Int32[1:2^13...] .* 0xf0ca .%0x10000, Val(8), Val(8), Val(64))'
1×8192 LinearAlgebra.Adjoint{Int32,Array{Int32,1}}:
 30  32  34  36  38  70  72  74  76  108  110  112  114  116  …  65488  65490  65492  65494  65496  65528  65530  65532  65534
chipsort_small!(data, Val(V), Val(J))

Sort a small array using a vectorized sorting network and bitonic merge networks. The initial sort is over J SIMD vectors of size V. The array should typically fit inside register memory, for instance 64 Int32 values on an AVX2 machine.


julia> chipsort_small!(Int32[1:16...] .* Int32(1729) .%0x100, Val(4), Val(4))'
1×16 LinearAlgebra.Adjoint{Int32,Array{Int32,1}}:
 4  8  12  16  67  71  75  79  130  134  138  142  193  197  201  205

Regular version of Comb sort.

Insertion sort, adapted from the Julia codebase.


This brave function merges 4, 8 vectors or even more. Relies on the great ability from SIMD.jl and LLVM to handle large vectors. Performance not yet tested.

sort_net(input::Vararg{T, L}) where {L,T}

Applies a sorting network of size L to the input elements, returning a sorted tuple.

The elements must support the min and max functions. In the case of SIMD.Vec objects each "lane" across the vectors will be sorted. Therefore with L vectors of size N this function will produce N sorted sequences of size L, after the data is transposed (see transpose_vecs).

Sort blocks from the input array, optionally transposing to generate sorted sequences of size V (default), or further merging those into sequences of size V*L.

In-place transpose of a 3 dimensional array VKJ into VJK.

transpose_vecs(input::Vararg{Vec{N,T}, L}) where {L,N,T}

Transposes a matrix of L vectors of size N into N vectors of size L. Sizes should be powers of 2.

Find the cycle seeds to transpose a matrix with K rows and J columns into J rows and K columns.

Returns the stream with the smallest first element. If the two streams are empty, returns nothing.