lørdag den 31. juli 2021

Secure random numbers – template metaprogramming for a good reason

For the BLE From Scratch project, I need a cryptographically secure random number generator (RNG). There are several options, including just using the true RNG found in the nRF52 chips, which uses a physical process to generate truly random (unpredictable) numbers. However this RNG can not produce numbers at will and costs current while it runs. 

Therefore I went looking for a Pseudo RNG that is re-seedable, meaning that it can be fed new true random data as seed also after startup to increase the entropy of the numbers generated. Such a PRNG can be constructed from the Keccak sponge function used in SHA-3 as described in this paper: https://keccak.team/files/SpongePRNG.pdf. This has the advantage of always having random numbers at disposal, but also to be able to increase the entropy of these numbers by feeding in data from the true RNG when available.

 

The algorithm at the core of the Keccak sponge function is described in here: https://keccak.team/keccak_specs_summary.html


The most important part is the mixing function which has been replicated here from the link above:


Basically it is a mixing function that mixes data into a state matrix of 25 elements. The elements in the state matrix – or sponge – can vary in precision and thereby in the entropy it is able to absorb. The precision can be down to 4 bit and maximum 64 bit. For SHA-3 a width of 64 bit is used, but for a PRNG  a bit width of 8 bits is enough, meaning that the PRNG can do with only 25 bytes of state. This is an advantage in an embedded environment where RAM is scarce.

 

Since the sponge function state can vary in precision it makes sense to write a generic version that can take any unsigned integer and use as underlying type and thereby precision, which is the approach taken here.

 

TL;DR – the full source code for KRaNG – the secure random number generator – can be found here: https://github.com/bdpedersen/ble-from-scratch/tree/master/krang

 

 

The Keccak mixing function (Keccak-p)

Looking at the algorithm, most of the operations allow for using the same code for all bitwidths. However one operation – the rotate left (or rot() in the description above) – does not behave the same for all bitwidths. Therefore we have to make a rotate function that is specialized for each of the types:

 

GCC and Clang will recognize the (x << n) | (x >> (32-n)) as a rotation operation on a 32 bit number and replacing the type and constant in the final term any other number. It will then emit a rotate instruction (if available). However this construct only works if n is less than the bitwidth and not 0 (as shifting by 0 is implementation specific according to the C++ standard). Hence we mask the rotation amount so that we stay below the bitwidth and check for 0. These two operations could seem expensive, but as we will see below they disappear during optimization.

 

Looking further at the algorithm, all indices into the vectors and matrices are modulo 5. This is an issue for a naïve implementation using loops and the modulo operator. In reality we want to fully unroll all loops so that indices into the matrices can be pre-computed. This is especially an advantage on the ARM architecture where base+immediate offset loads and stores are available and support a large range of offsets. 

If we want to force loop unrolling with for loops we need compiler specific extensions, and I would prefer a solution that is cross-compiler.

 

Template loops

We can force unrolling by using template metaprogramming, where a loop is expressed as a recursive template. It is easiest to implement template loops that decrement the index from a maximum (given as template parameter) to 0. Such a template can be seen here:

 

 

The first template is a generic template that will expand to the run function + insert a recursion in that function. 

 

The last template is a specialization that will stop the recursion when 0 is reached. If we consider Loop1D<2> it will expand to this

 

void run2 (Functor f){

   f(2); // First level

   run1();

}

 

Void run1(Functor f){

  f(1);

  run0(); 

}

 

Void run0(Functor f){

  f(0); // This is generated by the specialized template, hence no recursion occurs here.

} 

 

As all these can be inlined the compiler will collapse these functions to:

 

f(2);

f(1);

f(0);

 

If the Functor f can be further inlined, the code will be generated and optimized with the argument as constant. This allows pre-computation of the terms that depend on x.

 

Note that the functor itself is a generic type. It only expects f to be a callable object taking a single integer parameter x. 

 

In the following I would recommend to have the full sponge.hh as a reference. 

 

Looking at the second loop in the algorithm we have an inner operation looking like this:

 

D[x] = C[(x+4)%5] ^ rol<T>(C[(x+1)%5],1);


We can use C++ lambda functions to construct a functor that takes only x as parameter by capturing C and D by reference and then take x as parameter. This allows us to write the loop like this:

 

 

The compiler will inline the run function and the lambda expression and expand this to:

 

D[4] = C[(4+4)%5] ^ rol<T>(C[(4+1)%5],1);
D[3] = C[(3+4)%5] ^ rol<T>(C[(3+1)%5],1);

D[2] = C[(2+4)%5] ^ rol<T>(C[(2+1)%5],1);
D[1] = C[(1+4)%5] ^ rol<T>(C[(1+1)%5],1);

D[0] = C[(0+4)%5] ^ rol<T>(C[(0+1)%5],1);

This allows it to optimize away the expensive % operation and use direct indices into D and C. This inlining happens already when -O1 optimization is used. 

 

Similar unrolling can be done for the 2D loops, but the template structure is a bit more complicated for this, as recursion for the inner loop need to be restart-able. This calls for a start template, an intermediate worker template, an inner loop stop template and a general stop template. These can be seen in loops.hh, but will not be explained further.

Constant tables

Looking at the fourth loop:

 

 

We see a table lookup rix[][] used to determine the rotation amounts for the mixing. By making this table a static constexpr and using immediate indexing via the template loops, we actually allow these values to be fed as immediates to the compiler. On ARM – which has immediate rotation as an implicit operation for many other operations, this is a clear advantage: 

 

To get this to work genericly with only one table, we use the fact that rotations are modulo the bitwidth. This is the reason for the mask and check for zero in the rotate functions we saw earlier. During inlining these will be optimized away as the rotate argument end up being constant and determined at compile time.

 

As it can be seen from this example, it is possible to write readable and straight forward code even if it gets heavily optimized underneath. However, thinking programming in terms of how the compiler will understand and transform the code is a bit difficult to begin with, but looking at assembly output and benchmarking will bring you a long way. And the result ends up being much faster (in this case the speedup is 3-4 times over the naïve implementation) and still portable.

 

Caveats

There is a number of caveats in stretching the compiler into these less used corners, especially on embedded platforms. In particular using the RC table as constexpr works in some cases and not in others, hence I had to leave this optimization out. 

 

Introducing KRaNG - Keccak Random Number Generator

Implementing the sponge based RNG described in the keccak.team paper is straight-forward once the sponge structure has been implemented. It amounts to two functions – one to introduce new entropy and one to extract random numbers. These two can be seen below

 

onsdag den 14. juli 2021

BLE from scratch part 5 - interfacing characteristics to the real world

In the last installment I showed how to make services and characteristics to the real world. But reading out static data from a device is not very useful. Most if not all BLE devices need to either monitor some device or control some device.

In this post I will show how to get BLE to interact with the real world, but first we need something to interact with:

We will make a LED that can be turned on and off by writing to a characteristic, and we will make a button that can notify a gatt client on how many times it has been pushed.

We first define the GPIOs where the LED and button is attached:

Then we define the variables that will hold the LED state and button count, as well as the characteristic handles that will be associated with these variables:

We then create a function that will set the LED GPIO up correctly, and another function that will update the LED according to the led_state variable:

Note that the LED is driven with an open-drain circuit, so when the port is set to 0, the led will turn on because the lower output transistor opens. When the port is set to 1, the output is tri-stated so that no current will flow and the LED is turned off.

For this project we need to insert code for handling button events into the mainloop. We therefore introduce a function chain that allows plugging in functionality in the main loop without actually changing it. The code to do this is shown here:

Likewise we need to be able to handle write events when they occur in order to capture data changes that come via the BLE interface. This is also done by function chaining as the write event is common for all characteristics, and each characteristic that needs to execute code when written therefore has to be visited:

Finally it is important that we know if we are connected to a device, as the soft device will crash if we try to notify a characteristic without a connection being available. Hence we have introduced a function ble_conn_handle() to query the connection handle. This will return 0xFFFF if no connection is available.

Armed with this functionality we can turn to the button handling. This part is a bit more involved, primarily because I wanted to use interrupts to detect the button push. Note that the method I use in this blog is not the way to do it if you want to have a low power button handler. It lacks debouncing and can therefore double-trigger on a push. But it serves to show how the NVIC is set up when running under the SoftDevice and how critical sections are used with the SoftDevice as well.

 We use the GPIOTE module to generate interrupts based on GPIO events. First the button is configured as input and then the module is set up to listen to a high-to-low transition the following way:

The GPIOTE interrupt must be enabled in the NVIC at priority 3, which is the application interrupt priority in the soft device. This is done with this code, which first disables interrupts. then enables the GPIOTE interrupt in the NVIC and sets priority. Note that these operations are done as calls through the softdevice, as direct manipulation of the NVIC would cause a hard fault. Then the code enables the GPIOTE to generate interrupts and then enables interrupts in the NVIC again. The last line will install a local extension to the mainloop handler that will look at the data generated by the button interrupt.

The GPIOTE has an interrupt handler that is defined like this:

It is declared extern "C" as it is defined in a C++ file. Without this declaration, the linker would not be able to find it. The interrupt handler increments a global variable, clears the button event from the GPIOTE hardware and then clears the interrupt in the NVIC. Again this is done via a call to the SoftDevice.

The last bit is a function that runs every time the mainloop runs. This function works as follows:

First it will look at the button push counter to see if the button was pushed since last mainloop iteration. This is done by fetching the button counter in a critical section. If so, it will check whether the peripheral is connected to a central. If not, it will just write a debug message and return. This is done because notifications when not connected can cause a fault in the SoftDevice. 

[ui_service.cc:41-64]

Then - if the button was pushed, it will check if notifications are actually enabled on the button count characteristic. If it is, it will update the variable that backs it, and notify the SoftDevice of the change. If not, the button push is ignored. Note that you could have updated the backing variable and put it back for normal reading instead - that is just not the behavior in this case.

In the end we install the service much like the DeviceInfo service. This is done as follows:

Using these building blocks we can test the LED and button functionality with LightBlue. These building blocks are what allows any BLE device to interface with the real world.

Full source code can be found here