BitPermutations.jl

Efficient routines for repeated bit permutations.

Introduction

Permutations of n bits can be performed in O(log(n)) operations which reshuffle the individual bits in parallel. To have each reshuffling layer efficient, they have to be chosen from sets of operations which are efficient on the CPU. Precomputing these operations is slighly non-trivial, so this package may be useful only if you need to compute the application of a given permutation to a large number of words.

Installation

BitPermutations is a Julia Language package. To install this package, please open Julia's interactive session (known as REPL) and press the ] key in the REPL to use the package mode. Then type the following command

pkg> add BitPermutations

Usage

To define a permutation of the bits in type T, construct a BitPermutation{T}.

Here is an example with T = UInt8. Bits are ordered from LSB to MSB.

using BitPermutations

v = [2, 6, 5, 8, 4, 7, 1, 3]

p = BitPermutation{UInt8}(v)

we can then apply the permutation on any bitstring of type UInt8 by using bitpermute. If a functional approach is your jam, you can equivalently call the BitPermutation directly

x = 0x08          # 0 0 0 1 0 0 0 0 (LSB -> MSB)

bitpermute(x, p)  # 0 0 0 0 1 0 0 0, or 0x10

p(x)              # Idem

To inspect the result, we can use bitstring, or we can use a Bits (defined by the package). It is basically a faster BitVector (defined in Base), since its size is fixed (but is mutable).

bits = Bits(x)

[bits[v], Bits(bitpermute(x, p))]

The neat thing of the underlying network is that the inverse permutation can be computed at the same cost. This can be performed using invbitpermute or calling the adjoint permutation

invbitpermute(x, p) === p'(x)

[bits[invperm(v)], Bits(invbitpermute(x, p))]

Internally, a Vector of bit masks and shifts are stored and then applied sequentially at each call to bitpermute. If you need to apply the permutation to an array of data, you can use the following syntax

xs = rand(T, 10)

p.(xs)                  # Regular permutation element-wise (also `bitpermute.(xs, p)`)

bitpermute(xs, p)       # Idem

bitpermute!(p, xs)      # Idem, but in-place

p'.(xs)                 # Inverse permutation element-wise (also `invbitpermute.(xs, p)`)

invbitpermute(xs, p)    # Idem

invbitpermute!(xs, p)   # Idem, but in-place

Internally, the broadcasted loop gets slices such that each stage is performed in parallel. This leads to a significant performance increase. As usual, if you do not need the original vector after the permutation step, the in-place version is faster, as it saves some allocations.

Benchmarks

The functionalities mentioned above are best summarized in the plot below. The results were obtained on an Intel Icelake processor running Julia 1.9.2 with --threads=1 --optimize=3 --check-bounds=no.

Benchmark results

To accurately measure the repeatedly perform the same permutation N different times, similarly to

x = rand(T)

for _ in 1:N
    x = bitpermute(x, p)
end

while the broadcasted permutation is performed on an array in an element-wise fashion

xs = rand(T, N)

bitpermute!(p, xs)

In both cases we choose N = 10_000, and divide the median execution time by N. Considering single permutations, Beneš networks are consistently faster than doing naïve permutations of BitVectors by a factor of 10, approximately. As discussed later on, for types UInt32 and UInt64, choosing a GRPNetwork can lead to significant speedups on processors which support BMI2 instructions. Indeed, this leads to a speedup of around 20 compared to BitVectors, since it exploits advanced processor intrinsincs at each stage. For other types however, the fallback implementation of these primitive operations is rather slow and should be avoided. If you're lucky enough to have AVX-512 intrinsics available on your processor, then using the AVXCopyGather is the method of choice for UInt16, UInt32 and UInt64, since it can perform arbitrary permutations in a few nanoseconds.

This benchmark was performed for random permutations, and is somewhat of a worst-case scenario, since each of these network has typically the most number of stages. If your permutations are not completely random but have some structure, it is possible that you might achieve even larger speedups.

The most dramatic speedup are, however, for array-wise permutations. In this case a BenesNetwork can permute bitstrings in a couple nanoseconds, yielding speedups of more than two orders of magnitude. This is because the order of the operations is rearranged: the single layers of the network are not applied in sequence for each bit string, but are applied globally on each element. This allows the processor to do the same operation over and over again, and potentially even use AVX instructions (although the same speedups are observed on Apple silicon). I am unsure why the GRPNetwork is not faster than BenesNetwork for UInt32 and UInt64: I believe the reason is that the compiler uses AVX instructions, but the PDEP/PEXT intrisics are not vectorized. The AVXCopyGather algorithm does not have a drastic improvement using broadcasting, since the single permutation is already exploiting vectorized operations, but it remains competitive compared to the other algorithms. For UInt128, the bitstrings exceed the processor word size, and everything is a bit slower. The speedup will also depend on the choice of N. Your mileage may vary, especially depending on whether or not the array fits in cache.

The full benchmark routine can be found in benchmark/benchmarks.jl, while the script for plotting the results is benchmark/plot.jl.

Backends

For a more in-depth introduction to performing bit permutations, the wonderful https://programming.sirrida.de/bit_perm.html is well worth reading.

Three different backends to perform the permutation are implemented: rearranged Beneš networks, GRP networks, and AVX-512 bit shuffles. The latter two are faster only because they exploit CPU instrisics, special operations that are available on certain x86-64 processors, in particular the AVX-512 and BMI2 instruction set. If the AVX-512 instruction set is detected, a AVXCopyGather{T} is used by default if T is 16, 32, or 64 bits long. Otherwise, if the BMI2 instruction set is detected, and T is 32 or 64 bits long, a GRPNetwork{T} is chosen For all other cases, the default choice is a BenesNetwork{T}. The backend can be chosen manually when constructing the permutation, for example

p = BitPermutation{UInt32}(GRPNetwork, v)

Using intrinsics can be disabled by setting ENV["BIT_PERMUTATIONS_USE_INTRINSICS"] = false before loading the package or setting

export BIT_PERMUTATIONS_USE_INTRINSICS=false

before launching Julia.

Beneš networks

A Beneš network is a series of butterfly or delta-swap operations, in which each node can be swapped or not with the corresponding node shifted by a fixed amount δ. These operations are arranged in pairs, in a nested fashion, with the shifts chosen to be powers of two.

Beneš network

Above is an example of a network with 8 nodes, with 3 different stages (pairs of reshuffling) which have as δ respectively 4, 2, 1. The two innermost operations can always be fused into a single one. Somewhat remarkably, in this way one can perform any arbitrary permutation. It is also relatively efficient, as each delta-swap should take around 6 cycles on modern processors. The construction of the masks for the swaps is explained in: Donald E. Knuth, The Art of Computer Programming, Vol. 4, Fascicle 1A, (Addison-Wesley, Stanford, 2008), available as pre-fascicle here. The general idea is to view each stage (pair of masks) as a repartition of the nodes into two sets. We can then construct a graph in which each edge corresponds to the constraint of having two nodes in different partitions, both on the input and output side. If we can 2-color the graph, we can then route each node to their corresponding partition, or color. Fortunately, the graph is bipartite and it is very easy to do so: we iterate through the cycles of the graph and assign alternating colors to the nodes we visit. One would wish to optimize the network in such a way that most masks are trivial (i. e. no swaps). Unfortunately I do not know of any other way that exhaustively checking all possible log(n)! arrangements. This can be disabled by passing the keyword argument rearrage=false to the constructor.

GRP networks

GRP networks work in a similar way to Beneš network, except that each layer is a different reshuffling, known as GRP or sheeps-and-goats (see also TAOCP). GRP networks are typically shallower, but the reshuffling operation is only efficient if specific instructions are available in hardware, as they can be performed in 8 cycles. The PEXT/PDEP instructions used for the GRP reshuffling is supported by Intel starting from the Haswell processors (released in 2013) and by AMD from the Zen 3 generation (released in 2020). On older AMD architectures, PEXT/PDEP is implemented in microcode and are reportedly slower. On such machines you may want to experiment which method is faster and possibly disable calls to BMI2 with ENV["BIT_PERMUTATIONS_USE_INTRINSICS"] = false. Fallback operations are implemented but are typically much slower then butterfly operations. The construction of the masks follows the algorithm in: R. Lee, Z. Shi, X. Yang, Efficient permutation instructions for fast software cryptography, IEEE Micro (2001); which is well explained here.

AVX-512 bit shuffles

As mentioned in this blog post, new x86-64 processors have very powerful built-in instructions that can perform arbitrary bit shuffles. These types of operations exploit SIMD-vectorized operations on 128-, 256-, or 512-bit registers. By duplicating the data to permute, we can then extract the bits we need using a mask. Despite the copying overhead, this turns out to be the fastest options for UInt16, UInt32 and UInt64. Unfortunately such operations are consistently available only from Intel's Icelake generation (starting from 2019).

Enhancements

Several improvements could be made. Here I just list the first ones off the top of my head:

Use ARM64-specific instructions It would be nice to improve performance on ARM64 processors by exploiting intrinsics for these processors, specifically Apple silicon. In SVE there is some kind of PEXT/PDEP equivalent here, as well as GRP operations, but unfortunately they don't seem to be supported by Apple. There are also some interesting possibilities described here.
Preconditioning A simple way of reducing the depth of the network is to try cheap operations like bitreverse and bitshift before or after the network to try to save masks. This is what is done here.
Lookup tables Small permutations could benefit from doing sub-permutations with a precomputed table. One could use pext to extract say 8 bits at a time, look up their permutation in a table of 256 elements, and join the results with |. This approach is fast but scales linearly with size, both in time and space, so it is interesting for permutations on UInt8s, UInt16s and possibly UInt32s.
PSHUFB The PSHUFB instruction is part of SSE3 and can perform arbitrary byte reshufflings. It could be used to compress a bunch of layers or combined with lookup tables for some very promising speedups.
Rearrangement Finding the best possible arrangement of the stages of a Beneš network (such that the most masks are trivial) is currently done by exhaustive search. It is likely there is a better way of constructing the masks, but I am not aware of any.
Better fallbacks Faster software fallbacks for pext/pdep exist, ~~like zp7~~ (uses x86 intrinsics anyway).
Code refactoring It should be possible to take a more functional approach and define a permutation as a series of transformations T ↦ T, but I'm not sure how to do that while preserving type inference and performance. This would allow for more generic algorithms and extensions.

Compatibility

This package is compatible with Julia 1.6 and above.