For the BLE From Scratch project, I need a cryptographically secure random number generator (RNG). There are several options, including simply using the true RNG found in the nRF52 chips, which uses a physical process to generate truly random (unpredictable) numbers. However, this RNG cannot produce numbers on demand and draws current while it runs.
Therefore I went looking for a pseudo RNG (PRNG) that is re-seedable, meaning that it can be fed new true random data as a seed even after startup to increase the entropy of the numbers generated. Such a PRNG can be constructed from the Keccak sponge function used in SHA-3, as described in this paper: https://keccak.team/files/SpongePRNG.pdf. This has the advantage of always having random numbers at our disposal, while still being able to increase the entropy of those numbers by feeding in data from the true RNG when it is available.
The algorithm at the core of the Keccak sponge function is described here: https://keccak.team/keccak_specs_summary.html
The most important part is the mixing function, which is replicated here from the link above:
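Keccak-f[b](A) {
  for i in 0..n-1
    A = Round[b](A, RC[i])
  return A
}

Round[b](A,RC) {
  # θ step
  C[x] = A[x,0] xor A[x,1] xor A[x,2] xor A[x,3] xor A[x,4],  for x in 0..4
  D[x] = C[x-1] xor rot(C[x+1],1),                            for x in 0..4
  A[x,y] = A[x,y] xor D[x],                                   for (x,y) in (0..4,0..4)

  # ρ and π steps
  B[y,2*x+3*y] = rot(A[x,y], r[x,y]),                         for (x,y) in (0..4,0..4)

  # χ step
  A[x,y] = B[x,y] xor ((not B[x+1,y]) and B[x+2,y]),          for (x,y) in (0..4,0..4)

  # ι step
  A[0,0] = A[0,0] xor RC

  return A
}

(Pseudocode as given in the keccak.team specifications summary; all index arithmetic on x and y is modulo 5.)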
Basically it is a mixing function that mixes data into a state matrix of 25 elements. The elements in the state matrix – or sponge – can vary in precision and thereby in the entropy the sponge is able to absorb. The precision can be as low as 4 bits and at most 64 bits. For SHA-3 a width of 64 bits is used, but for a PRNG a width of 8 bits is enough, meaning that the PRNG can make do with only 25 bytes of state. This is an advantage in an embedded environment where RAM is scarce.
Since the sponge function state can vary in precision, it makes sense to write a generic version that can take any unsigned integer type as its underlying type and thereby precision, which is the approach taken here.
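As a minimal sketch of the idea (the name Sponge and the member layout are assumptions – see sponge.hh for the real code):

#include <cstdint>

template <typename T>
struct Sponge {
    T A[5][5];      // the 5x5 state matrix of lanes of type T
    void mix();     // the Keccak-p mixing function described below
};

using Sponge8  = Sponge<uint8_t>;   // 25 bytes of state – embedded friendly
using Sponge64 = Sponge<uint64_t>;  // the full SHA-3 width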
TL;DR – the full source code for KRaNG – the secure random number generator – can be found here: https://github.com/bdpedersen/ble-from-scratch/tree/master/krang
The Keccak mixing function (Keccak-p)
Looking at the algorithm, most of the operations allow for using the same code for all bitwidths. However one operation – the rotate left (or rot() in the description above) – does not behave the same for all bitwidths. Therefore we have to make a rotate function that is specialized for each of the types:
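A sketch of these specializations (the exact code is in sponge.hh); the 8-bit and 32-bit variants are shown here, the remaining widths are analogous:

#include <cstdint>

template <typename T>
T rol(T x, unsigned n);   // only the specializations are defined

template <>
inline uint8_t rol<uint8_t>(uint8_t x, unsigned n) {
    n &= 7;                        // rotations are modulo the bitwidth
    if (n == 0) return x;          // avoid the undefined shift by the full width
    return (uint8_t)((x << n) | (x >> (8 - n)));
}

template <>
inline uint32_t rol<uint32_t>(uint32_t x, unsigned n) {
    n &= 31;
    if (n == 0) return x;
    return (x << n) | (x >> (32 - n));
}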
GCC and Clang will recognize (x << n) | (x >> (32 - n)) as a rotation operation on a 32-bit number – and likewise for any other bitwidth when the type and the constant in the final term are replaced accordingly – and will emit a rotate instruction (if available). However, this construct only works if n is less than the bitwidth and not 0: for n == 0 the right shift would be by the full bitwidth, which is undefined behaviour according to the C++ standard. Hence we mask the rotation amount so that we stay below the bitwidth and check for 0. These two operations could seem expensive, but as we will see below they disappear during optimization.
Looking further at the algorithm, all indices into the vectors and matrices are modulo 5. This is an issue for a naïve implementation using loops and the modulo operator. In reality we want to fully unroll all loops so that indices into the matrices can be pre-computed. This is especially an advantage on the ARM architecture where base+immediate offset loads and stores are available and support a large range of offsets.
If we want to force loop unrolling with for loops we need compiler-specific extensions, and I would prefer a solution that works across compilers.
Template loops
We can force unrolling by using template metaprogramming, where a loop is expressed as a recursive template. It is easiest to implement template loops that decrement the index from a maximum (given as template parameter) to 0. Such a template can be seen here:
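// A sketch of the recursive loop template; the version in loops.hh may differ in details
template <int N>
struct Loop1D {
    template <typename Functor>
    static inline void run(Functor f) {
        f(N);                    // run this iteration
        Loop1D<N - 1>::run(f);   // recurse towards 0
    }
};

template <>
struct Loop1D<0> {
    template <typename Functor>
    static inline void run(Functor f) {
        f(0);                    // last iteration – no further recursion
    }
};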
The first template is the generic template, which expands to the run function and inserts a recursion in that function.
The last template is a specialization that stops the recursion when 0 is reached. If we consider Loop1D<2>, it will expand to this:
void run2(Functor f){
    f(2); // First level
    run1(f);
}

void run1(Functor f){
    f(1);
    run0(f);
}

void run0(Functor f){
    f(0); // This is generated by the specialized template, hence no recursion occurs here.
}
As all these can be inlined the compiler will collapse these functions to:
f(2);
f(1);
f(0);
If the Functor f can be further inlined, the code will be generated and optimized with the argument as constant. This allows pre-computation of the terms that depend on x.
Note that the functor itself is a generic type. It only expects f to be a callable object taking a single integer parameter x.
In the following I would recommend having the full sponge.hh at hand as a reference.
Looking at the second loop in the algorithm we have an inner operation looking like this:
D[x] = C[(x+4)%5] ^ rol<T>(C[(x+1)%5],1);
We can use a C++ lambda function to construct a functor that captures C and D by reference and takes only x as a parameter. This allows us to write the loop like this:
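// Sketch: unrolled x = 4..0 using the Loop1D template from above
Loop1D<4>::run([&C, &D](int x) {
    D[x] = C[(x + 4) % 5] ^ rol<T>(C[(x + 1) % 5], 1);
});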
The compiler will inline the run function and the lambda expression and expand this to:
D[4] = C[(4+4)%5] ^ rol<T>(C[(4+1)%5],1);
D[3] = C[(3+4)%5] ^ rol<T>(C[(3+1)%5],1);
D[2] = C[(2+4)%5] ^ rol<T>(C[(2+1)%5],1);
D[1] = C[(1+4)%5] ^ rol<T>(C[(1+1)%5],1);
D[0] = C[(0+4)%5] ^ rol<T>(C[(0+1)%5],1);
This allows the compiler to optimize away the expensive % operations and use direct indices into D and C. This inlining happens already at -O1.
Similar unrolling can be done for the 2D loops, but the template structure is a bit more complicated, as the recursion for the inner loop needs to be restartable. This calls for a start template, an intermediate worker template, an inner-loop stop template and a general stop template. These can be seen in loops.hh and will not be explained further, but a rough sketch of the structure is shown below.
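For illustration, a hypothetical sketch of such a structure could look like this (the actual implementation in loops.hh may well differ):

template <int X, int Y, int YMAX>
struct Loop2DWorker {                        // intermediate worker template
    template <typename Functor>
    static inline void run(Functor f) {
        f(X, Y);
        Loop2DWorker<X, Y - 1, YMAX>::run(f);
    }
};

template <int X, int YMAX>
struct Loop2DWorker<X, 0, YMAX> {            // inner-loop stop template
    template <typename Functor>
    static inline void run(Functor f) {
        f(X, 0);
        Loop2DWorker<X - 1, YMAX, YMAX>::run(f);  // restart the inner loop
    }
};

template <int YMAX>
struct Loop2DWorker<0, 0, YMAX> {            // general stop template
    template <typename Functor>
    static inline void run(Functor f) {
        f(0, 0);
    }
};

template <int XMAX, int YMAX>
struct Loop2D {                              // start template
    template <typename Functor>
    static inline void run(Functor f) {
        Loop2DWorker<XMAX, YMAX, YMAX>::run(f);
    }
};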
Constant tables
Looking at the fourth loop:
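In the C++ version the inner operation looks roughly like this (the exact expression in sponge.hh is assumed):

B[y][(2*x + 3*y) % 5] = rol<T>(A[x][y], rix[x][y]);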
We see a table lookup rix[][] used to determine the rotation amounts for the mixing. By making this table a static constexpr and using immediate indexing via the template loops, we actually allow these values to be fed as immediates to the compiler. On ARM – where an immediate rotation is available as part of many other operations – this is a clear advantage:
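A sketch of the table using the standard Keccak rotation offsets (the exact declaration in sponge.hh is assumed):

static constexpr unsigned rix[5][5] = {
    { 0, 36,  3, 41, 18},   // x = 0
    { 1, 44, 10, 45,  2},   // x = 1
    {62,  6, 43, 15, 61},   // x = 2
    {28, 55, 25, 21, 56},   // x = 3
    {27, 20, 39,  8, 14}    // x = 4
};

These are the 64-bit offsets; narrower lane widths reuse the same table, as explained next.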
To get this to work generically with only one table, we use the fact that rotations are modulo the bitwidth. This is the reason for the mask and the check for zero in the rotate functions we saw earlier. During inlining these are optimized away, as the rotation amount ends up being a constant determined at compile time.
As can be seen from this example, it is possible to write readable and straightforward code even if it gets heavily optimized underneath. Thinking about programming in terms of how the compiler will understand and transform the code is a bit difficult to begin with, but looking at the assembly output and benchmarking will bring you a long way. And the result ends up being much faster – in this case the speedup is 3-4 times over the naïve implementation – while still being portable.
Caveats
There are a number of caveats in stretching the compiler into these less-used corners, especially on embedded platforms. In particular, using the RC table as constexpr works in some cases and not in others, hence I had to leave that optimization out.
Introducing KRaNG - Keccak Random Number Generator
Implementing the sponge-based RNG described in the keccak.team paper is straightforward once the sponge structure is in place. It amounts to two functions – one to introduce new entropy and one to extract random numbers. These two can be seen below:
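A minimal sketch of the two operations, loosely following the duplex-style construction from the paper (the names feed and fetch, the rate of 16 bytes and the exact absorb/squeeze policy are assumptions – the real code is in the krang repository linked above):

#include <cstdint>

static Sponge<uint8_t> sponge;        // the 25-byte sponge sketched earlier
static const unsigned RATE = 16;      // bytes absorbed/squeezed per permutation

// Introduce new entropy: XOR seed bytes into the rate part of the state,
// running the mixing function whenever a full block has been absorbed.
void feed(const uint8_t* seed, unsigned len) {
    uint8_t* s = &sponge.A[0][0];     // view the 5x5 state as 25 bytes
    unsigned pos = 0;
    for (unsigned i = 0; i < len; i++) {
        s[pos++] ^= seed[i];
        if (pos == RATE) { sponge.mix(); pos = 0; }
    }
    sponge.mix();                     // spread the fresh entropy over the whole state
}

// Extract random numbers: read bytes from the rate part of the state,
// running the mixing function between blocks.
void fetch(uint8_t* out, unsigned len) {
    const uint8_t* s = &sponge.A[0][0];
    unsigned pos = 0;
    for (unsigned i = 0; i < len; i++) {
        if (pos == RATE) { sponge.mix(); pos = 0; }
        out[i] = s[pos++];
    }
    sponge.mix();                     // advance the state before the next request
}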
