Tuesday, August 3, 2021

Flash DB - a small C++ database for flash memory

Background - why another flash storage system?

In Bluetooth LE applications there is a need for persistent storage of data as soon as bonding is used. This is because the standard requires an LE device (either central or peripheral) to persistently store at least two things: for each bonded device, its identity together with its Long Term Key (LTK); and for each connection created with a known bonded device, the configuration of server and client characteristic descriptors, associated with that device and stored so that it can be reused upon the next connection.

These requirements call for a very simple relational database, where a number of device identities for bonded devices (0..N) can be stored and associated 1:1 with a set of configuration data that can be updated at each connection. Such a structure is implemented in the Flash DataBase (FDB) that I have written in conjunction with the BLE From Scratch project. This simple database addresses a number of needs unique to small embedded applications:

  • It is small and lightweight compared to e.g. SQLite
  • It stores data in a way that preserves flash health by not erasing flash pages unless absolutely necessary.
  • It only consumes program code for things that are actually used – e.g. search terms are expressed as C++ lambda expressions rather than with an interpreted language.

FDB does not try to be a solution to bigger database needs – in fact, in its default configuration it only allows storing up to 4 kByte of data, but that is also enough for most needs in an embedded BLE device.

This is a long blogpost, so TL;DR: The source code for FDB can be found here: https://github.com/bdpedersen/ble-from-scratch/tree/master/fdb. The test code shows how a simple 1:1 relation can be set up and how to use the data type system.

General design

FDB stores data in data records that each have a unique 32-bit ID, an 8-bit data type, a length of up to 256 4-byte words and a checksum allowing FDB to verify each record individually when loading it. There are two reserved IDs in the system: 0xFFFFFFFF which means not found and 0x00000000 which means that the record has been deleted. These two IDs are chosen due to the unique properties of flash memory: 

Flash memory is special in that you can't write 1s to it. When you want to write to flash memory, it first needs to be erased. Erasing means setting all bits to 1. After this operation each bit can be changed from a 1 to a 0, but not back to 1 without a new erase. Flash memory is page oriented, which means that individual bits or bytes can't be erased independently. You have to erase at least a full page, which on the nRF52 is 4 kByte.

Typically this leads to data being written to flash by first reading the full page into a RAM buffer and modifying some data there, after which the page is erased and the data written back. This is called read-modify-write. The problem with this approach is that if a power loss occurs between the erase and the writeback, not only is the new data lost – all data on that page is lost. For an embedded application with little data to begin with, that could mean all stored data.

Furthermore, most people have heard that "flash has a limited lifetime and can only endure so many writes before it fails". Actually it is not (so much) the writing that wears out the flash; it is the erase cycles, so if these can be avoided the flash will have a much longer lifespan.

FDB gets around the problems with read-modify-write in the way data records are written and deleted. Note that you cannot update data in a record in place; you can only write a new record and delete the old one.

FDB uses two pages of flash at all times. One is the active page, and the other is put to use when the first one fills up.

When a new data record is to be written to the system, FDB looks for the first free area in the active page. Since this area is always erased and not written, it will look like a data record with an ID of 0xFFFFFFFF. This is why this value is used to signify not found. If there is enough free space in the active page, the new data record is simply written and then given an ID that is one bigger than the largest ID that was already in use (to keep it unique). 

If we need to delete a data record, the ID is simply set to 0. This is possible because any 1s in the ID can be changed to 0 by a new write. This is why 0x00000000 signifies a deleted record. The length and checksum fields are kept intact so that the beginning of the next record can be found even if the item was deleted. FDB has no central registry of the locations of the data records, so it needs to be able to trust the length fields of the records in the active page.

If a write fails due to missing space, FDB will compact the record space. This is done by verifying and then copying all records that are not deleted to the unused flash page, and then erasing the previous page. An active page starts with a 32-bit page counter that is incremented every time the page is activated (meaning all data was successfully copied to it). That way the system can deduce which state it is in even if power is lost at any point in time. Pages are only erased during compaction, which greatly increases the lifetime of the flash pages. Apart from allowing FDB to keep track of the data state, the page counter can also be used as an indicator of flash health: when it approaches 10000, health may be compromised – no flash will make it anywhere near the 4 billion erases the counter can represent.

FDB consists of three layers:

  1. A storage layer, that handles the physical interface to the flash
  2. A record layout layer that allows read, write and verify of data records, as well as low-overhead deletion without flash page erases in most cases. This layer implements the logic described above.
  3. An index layer that allows fast retrieval of data and fast searching for records.

These three layers are described in the following. Note that only the interfaces will be shown – the actual implementation can be found in the source repository.

Storage layer

The storage layer provides a basic interface to the physical flash device. It makes a few assumptions about the hardware (which could be relaxed if needed):

  • Flash can be read as memory mapped data
  • Both the active and the spare flash page are memory mapped as one contiguous memory region
  • Write can take place without erasing
  • Page erase can be initiated without writing

The nRF52 internal flash – both as raw hardware and when running via the SoftDevice API – fulfills the above requirements. It should also be possible to write a backend for other devices with internal flash, and for STM32 devices with external SPI flash using DCR memory mapping.

The interface is quite simple; the full version can be found in the repository.
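The original post shows the interface as an image. As a sketch of how such an interface could look – the method names here are my assumptions, not necessarily those used in the repository – here is a minimal version, together with a host-side RAM backend that simulates the flash semantics described above (writes can only clear bits, only an erase sets them back to 1):

```cpp
#include <cstdint>
#include <cstring>

// Sketch of a storage-layer interface under the assumptions listed above.
// Method names are illustrative; see the repository for the real interface.
class Storage {
public:
    virtual ~Storage() {}
    virtual const uint32_t* GetBase() const = 0;  // memory-mapped view of both pages
    // offset and length are in 32-bit words, matching the alignment requirement
    virtual void Write(uint32_t offset, const uint32_t* data, uint32_t words) = 0;
    virtual void ErasePage(uint32_t page) = 0;    // erase one page (all bits to 1)
};

// Host-side simulation that mimics flash semantics: a write can only
// clear bits (logical AND), and only an erase can set them back to 1.
class RamStorage : public Storage {
public:
    static constexpr uint32_t PAGE_WORDS = 1024;  // 4 kByte page, as on nRF52
    RamStorage() { memset(mem, 0xFF, sizeof(mem)); }
    const uint32_t* GetBase() const override { return mem; }
    void Write(uint32_t offset, const uint32_t* data, uint32_t words) override {
        for (uint32_t i = 0; i < words; i++)
            mem[offset + i] &= data[i];           // bits can only go 1 -> 0
    }
    void ErasePage(uint32_t page) override {
        memset(mem + page * PAGE_WORDS, 0xFF, PAGE_WORDS * sizeof(uint32_t));
    }
private:
    uint32_t mem[2 * PAGE_WORDS];                 // active page + spare page, contiguous
};
```

The AND in Write() is what makes the "delete by writing ID 0" trick below work: 0 can always be written over any existing ID without an erase.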

Note that all data is written as 32-bit words and the offset in the Write() method is also in 32-bit words. This is because nRF52 (and possibly other chips) allows 4-byte aligned writes only.

Layout layer

A flash page containing data always starts with the 4-byte page counter, followed by one or more data records. A record starts with a header whose fields are described below.
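The header diagram from the original post is not reproduced here, but an assumed layout matching the description (a 32-bit ID, an 8-bit data type, a length in 32-bit words and a 16-bit checksum – field order and exact widths are my reconstruction, not the authoritative definition) could look like this:

```cpp
#include <cstdint>

// Assumed record header layout, reconstructed from the description.
// The post says the length can be up to 256 words, so how 256 is
// encoded in 8 bits is left to the real implementation.
struct RecordHdr {
    uint32_t id;        // 0xFFFFFFFF = free/not found, 0x00000000 = deleted
    uint8_t  dtype;     // application-defined data type
    uint8_t  length;    // payload length in 32-bit words
    uint16_t checksum;  // Adler16 over the payload
};

static_assert(sizeof(RecordHdr) == 8, "header should be exactly two 32-bit words");
```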

Adler16 is the Fletcher16 checksum but with 251 (largest 8-bit prime) as modulo rather than 255. This allows slightly better detection of single bit and small burst errors in small amounts of data than Fletcher-16, which is what we need in this application.
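The checksum is easy to reconstruct from that definition: Fletcher16 with both running sums taken modulo 251 instead of 255. A straightforward implementation could look like this (how the two sums are packed into the 16-bit result is my choice, not necessarily FDB's):

```cpp
#include <cstdint>
#include <cstddef>

// Fletcher16 with modulo 251 (the largest prime below 256) instead of
// 255 - i.e. the "Adler16" described in the post. The packing of the
// two sums into the result is an assumption.
uint16_t adler16(const uint8_t* data, size_t len) {
    uint16_t sum1 = 0, sum2 = 0;
    for (size_t i = 0; i < len; i++) {
        sum1 = (uint16_t)((sum1 + data[i]) % 251); // simple sum mod 251
        sum2 = (uint16_t)((sum2 + sum1) % 251);    // sum of sums mod 251
    }
    return (uint16_t)((sum2 << 8) | sum1);
}
```

Using a prime modulus means every byte position contributes distinctly to the second sum, which is what improves detection of small burst errors compared to the power-of-two-minus-one modulus of plain Fletcher16.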

The header contains the ID, the data type (where semantics is decided by the application) and the length in 32 bit words. If byte data needs to be written, the byte length must be stored explicitly by the application.

By traversing these RecordHdr structures, all records can be located in the data store layout. An application should never use the RecordHdr directly, instead it should use the Layout class to query the data structure and read/manipulate records.

When searching for records they can be found in two ways:

  1. By searching directly by ID. This is useful for reverse relations (you have a record that is related to another by ID and want the related record). It is also used internally to re-acquire a record that has moved in memory.
  2. By searching for records with a specific data type. This is used for enumerating records of a given type.

Finding by ID is done via a dedicated find method.

Enumeration is done by two methods: one that finds the first record of a given type, and one that finds the next.

They return a Record struct. Only the ID field is meant to be accessed publicly, the other two are for keeping track of the data in memory. If a query doesn’t return any records (e.g. if a findNext() can’t find any more records) then the ID field is set to NOT_FOUND.

It is possible to delete a record using the Delete method ...

... and to insert one using the Insert method.

Note that the Insert method allows one, and optionally two, data buffers to be written into the record. The API is constructed this way to minimize copying. Using this approach it is possible to have a record type with a fixed header (e.g. an ID) and a variable-size payload (e.g. a string or data buffer), so that variable-size data can be associated with other records. Insert() may fail due to lack of free space in the flash page, in which case the ID field of the returned record is set to NOT_FOUND. Otherwise the Record refers to the newly inserted data.

Due to the way IDs work in fdb, the ID of the most recently inserted record can be retrieved by GetLastUsedID(), which will return NOT_FOUND if the database is empty.

Data is accessed by first getting the length and then a pointer to the data, using two accessor methods.

Note that you should never access the item field in the Record directly but use these methods instead. They retrieve the information using the fastest method possible, taking into account that data may have moved since it was last queried. Never use the pointer returned by GetData() outside the calling scope, i.e. never store the pointer permanently in an object or static variable: the data it points to may move during a Compact() call, invalidating the pointer. Note that the pointer points to the first of the two possible data buffers that were inserted, and the length is the joint length of the two. It is up to a higher-level protocol to determine or store the boundary between them.

The Compact() method can be called to purge deleted records and free up space. It causes a flash page erase, so it should only be used if an Insert() operation fails.

It is also possible to query the amount of free space and to get a debug status.
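Collected in one place, the Layout interface discussed in this section could be declared roughly like this. This is a sketch: the names and signatures are reconstructed from the prose and are my assumptions – the authoritative versions are in the repository:

```cpp
#include <cstdint>

const uint32_t NOT_FOUND = 0xFFFFFFFFu; // reserved "not found" ID
const uint32_t DELETED   = 0x00000000u; // reserved "deleted" ID

// Handle to a record; only id is meant for public use.
struct Record {
    uint32_t id;          // NOT_FOUND if a query returned nothing
    const uint32_t* item; // internal: cached location in flash (do not use directly)
    uint32_t generation;  // internal: detects that data moved (name assumed)
};

// Sketch of the Layout class interface, reconstructed from the description.
class Layout {
public:
    Record FindByID(uint32_t id);                // direct lookup, e.g. reverse relations
    Record findFirst(uint8_t dtype);             // start enumerating records of a type
    Record findNext(const Record& prev);         // continue an enumeration
    uint32_t Delete(Record& rec);                // zeroes the ID field; no page erase
    Record Insert(uint8_t dtype,                 // write one or two buffers as one record
                  const uint32_t* buf1, uint32_t len1,
                  const uint32_t* buf2 = nullptr, uint32_t len2 = 0);
    uint32_t GetLastUsedID();                    // ID of the most recent insert
    uint32_t GetLength(const Record& rec);       // payload length in 32-bit words
    const uint32_t* GetData(const Record& rec);  // pointer valid only in calling scope
    void Compact();                              // copy live records, erase the old page
    uint32_t GetFreeSpace();                     // remaining words in the active page
};
```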

Index layer

While it is perfectly possible to use FDB via the Layout class only, a number of operations are made easier by the Index template class. An index can be seen as a searchable, variable-sized array of data records, each record represented by a struct or other data type with a size divisible by 4. The class maintains its relation to the underlying persistent store automatically.

Index is a template class; its full definition can be found in the repository.
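A sketch of its interface, reconstructed from the description in the following paragraphs (names and exact signatures are assumptions, not the authoritative API):

```cpp
#include <cstdint>

class Layout; // the layout-layer class described above

// Sketch of the Index template interface, reconstructed from the prose.
template <typename T, int MaxEntries, uint8_t Dtype>
class Index {
public:
    // A copied-out entry: the data itself plus the offset it was found at.
    struct Result {
        int offset;  // valid only until Delete()/DeleteIndex() is called
        T   data;    // always valid; a copy, so it can be passed around
    };

    explicit Index(Layout& layout);

    Result GetEntry(int offset);             // entry at a numbered offset
    uint32_t GetEntryId(int offset);         // underlying record ID at an offset
    Result GetEntryWithID(uint32_t id);      // reverse lookup by record ID

    // Filtered search; Matcher is any callable taking a T and returning bool.
    template <typename Matcher>
    Result GetEntryMatching(Matcher m, int start = 0);

    uint32_t Append(const T& entry);         // add an entry, returns its record ID
    uint32_t Delete(int offset);             // delete an entry, returns its record ID
};
```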

Data is copied out of an Index so that the result of an operation can be passed around. To get the record at a given numbered offset you use GetEntry which will return a Result with the data. A result contains the offset of the data, as well as the data itself. If you need the ID of the data at a given offset, you call GetEntryId(offset). The Index is searchable using GetEntryMatching() where a functor returning bool can be used to filter. Start can be set explicitly if a search needs to be re-started at some offset.

The data in a Result is always valid, but the index is only valid until Delete() or DeleteIndex() is called.

It is also possible to search by ID using GetEntryWithID()

To create an Index you first declare the underlying data type and decide which number from 0-255 should represent it in the underlying records' datatype field. Suppose we have an index of paired devices identified by their Bluetooth address (for example's sake – the real world is not so simple); we could write:

typedef struct {
 uint8_t bt_addr[16];
} device_t; 

const uint8_t device_t_dtype=1;
const int pairing_table_size=5; 

And then build a pairing table of up to 5 devices by instantiating an index like this (given a Layout instance in layout):

Index<device_t,pairing_table_size,device_t_dtype> pairing_table(layout);

We could then also make a cache of auxiliary data associated to each device by ID. That index would look like this:

const uint8_t aux_dtype=2;

Index<uint32_t,pairing_table_size,aux_dtype> aux_data(layout);

We can append a new device after pairing by calling Append() with a device_t as argument, and we can store auxiliary data associated with it by calling the Append() overload taking both its ID and the auxiliary data buffer as arguments.

Lastly, we can copy out the variable part of the data by first allocating a buffer (on the heap or stack, e.g. with bfs_alloca()) of the size returned by GetExtendedSize(), and then copying it out using GetExtendedData().

Using these primitives a simple relational database can be built. Chained deletes need to be done manually, but they are easy to do since the Delete methods return the ID of the deleted record, so relations can be looked up in other Index instances with GetEntryMatching()

Saturday, July 31, 2021

Secure random numbers – template metaprogramming for a good reason

For the BLE From Scratch project, I need a cryptographically secure random number generator (RNG). There are several options, including just using the true RNG found in the nRF52 chips, which uses a physical process to generate truly random (unpredictable) numbers. However, this RNG cannot produce numbers on demand and draws current while it runs.

Therefore I went looking for a pseudo-RNG that is re-seedable, meaning that it can be fed new true-random data as a seed even after startup to increase the entropy of the numbers generated. Such a PRNG can be constructed from the Keccak sponge function used in SHA-3, as described in this paper: https://keccak.team/files/SpongePRNG.pdf. This has the advantage of always having random numbers at its disposal, while still being able to increase their entropy by feeding in data from the true RNG when available.

 

The algorithm at the core of the Keccak sponge function is described here: https://keccak.team/keccak_specs_summary.html


The most important part is the mixing function described on that page.
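For reference, the round function from that page in compact pseudocode (A is the 5×5 state matrix, r[x][y] the rotation offsets, RC the round constant for the current round; all x/y indexing is modulo 5):

```
Round(A, RC):
  # theta step
  C[x] = A[x,0] xor A[x,1] xor A[x,2] xor A[x,3] xor A[x,4],   for x in 0..4
  D[x] = C[x-1] xor rot(C[x+1], 1),                            for x in 0..4
  A[x,y] = A[x,y] xor D[x],                                    for all (x,y)

  # rho and pi steps
  B[y, 2x+3y] = rot(A[x,y], r[x,y]),                           for all (x,y)

  # chi step
  A[x,y] = B[x,y] xor ((not B[x+1,y]) and B[x+2,y]),           for all (x,y)

  # iota step
  A[0,0] = A[0,0] xor RC
```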


Basically it is a mixing function that mixes data into a state matrix of 25 elements. The elements in the state matrix – or sponge – can vary in precision, and thereby in the amount of entropy the sponge is able to absorb. The precision can be as low as 4 bits and at most 64 bits. For SHA-3 a width of 64 bits is used, but for a PRNG a width of 8 bits is enough, meaning that the PRNG can make do with only 25 bytes of state. This is an advantage in an embedded environment where RAM is scarce.

 

Since the sponge function state can vary in precision, it makes sense to write a generic version that can take any unsigned integer as the underlying type and thereby set the precision, which is the approach taken here.

 

TL;DR – the full source code for KRaNG – the secure random number generator – can be found here: https://github.com/bdpedersen/ble-from-scratch/tree/master/krang


The Keccak mixing function (Keccak-p)

Looking at the algorithm, most of the operations can use the same code for all bitwidths. However, one operation – the rotate left (rot() in the description above) – does not behave the same for all bitwidths. Therefore we have to make a rotate function that is specialized for each of the types:
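A generic rotate-left along the lines described could look like this (my reconstruction; the version in the repository may differ in details):

```cpp
#include <cstdint>

// Generic rotate-left. The mask keeps the rotation amount below the
// bitwidth, and the check for 0 avoids shifting by the full width,
// which would be undefined behaviour in C++.
template <typename T>
static inline T rol(T x, unsigned n) {
    const unsigned bits = sizeof(T) * 8;
    n &= bits - 1;            // rotations are modulo the bitwidth
    if (n == 0) return x;     // avoid x >> bits (undefined behaviour)
    return (T)((x << n) | (x >> (bits - n)));
}
```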

 

GCC and Clang will recognize (x << n) | (x >> (32-n)) as a rotation of a 32-bit number – and likewise with the type and the constant in the final term replaced for other widths – and will emit a rotate instruction (if available). However, this construct only works if n is greater than 0 and less than the bitwidth: if n were 0, the final term would shift by the full bitwidth, which is undefined behaviour according to the C++ standard. Hence we mask the rotation amount so that we stay below the bitwidth, and check for 0. These two operations might seem expensive, but as we will see below they disappear during optimization.

 

Looking further at the algorithm, all indices into the vectors and matrices are modulo 5. This is an issue for a naïve implementation using loops and the modulo operator. In reality we want to fully unroll all loops so that indices into the matrices can be pre-computed. This is especially an advantage on the ARM architecture where base+immediate offset loads and stores are available and support a large range of offsets. 

If we want to force loop unrolling with for loops we need compiler specific extensions, and I would prefer a solution that is cross-compiler.

 

Template loops

We can force unrolling by using template metaprogramming, where a loop is expressed as a recursive template. It is easiest to implement template loops that decrement the index from a maximum (given as a template parameter) down to 0.


The first template is a generic template that expands to the run function and inserts a recursion in that function.

 

The last template is a specialization that stops the recursion when 0 is reached. If we consider Loop1D<2>, it will expand to this:

 

void run2(Functor f){
   f(2); // First level
   run1(f);
}

void run1(Functor f){
   f(1);
   run0(f);
}

void run0(Functor f){
   f(0); // This is generated by the specialized template, hence no recursion occurs here.
}

 

As all these can be inlined the compiler will collapse these functions to:

 

f(2);

f(1);

f(0);

 

If the Functor f can be further inlined, the code will be generated and optimized with the argument as constant. This allows pre-computation of the terms that depend on x.

 

Note that the functor itself is a generic type. It only expects f to be a callable object taking a single integer parameter x. 

 

In the following, I recommend keeping the full sponge.hh at hand as a reference.

 

Looking at the second loop in the algorithm we have an inner operation looking like this:

 

D[x] = C[(x+4)%5] ^ rol<T>(C[(x+1)%5],1);


We can use C++ lambda functions to construct a functor that takes only x as parameter, by capturing C and D by reference. This allows us to write the loop like this:
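Putting the pieces together – with minimal versions of the rol() helper and the Loop1D template described in this post repeated here so the snippet is self-contained – the D-computation could be written like this:

```cpp
#include <cstdint>

// Minimal versions of the helpers described in this post, repeated so
// the snippet is self-contained.
template <typename T>
static inline T rol(T x, unsigned n) {
    const unsigned bits = sizeof(T) * 8;
    n &= bits - 1;
    if (n == 0) return x;
    return (T)((x << n) | (x >> (bits - n)));
}

template <int N>
struct Loop1D {
    template <typename F>
    static inline void run(F f) { f(N); Loop1D<N - 1>::run(f); }
};
template <>
struct Loop1D<0> {
    template <typename F>
    static inline void run(F f) { f(0); }
};

// The theta D-step, unrolled via the template loop. C and D are
// captured by reference; x arrives as a compile-time constant in
// each unrolled instance, so the %5 folds away.
template <typename T>
void theta_d(const T C[5], T D[5]) {
    Loop1D<4>::run([&C, &D](int x) {
        D[x] = C[(x + 4) % 5] ^ rol<T>(C[(x + 1) % 5], 1);
    });
}
```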


The compiler will inline the run function and the lambda expression and expand this to:

 

D[4] = C[(4+4)%5] ^ rol<T>(C[(4+1)%5],1);
D[3] = C[(3+4)%5] ^ rol<T>(C[(3+1)%5],1);
D[2] = C[(2+4)%5] ^ rol<T>(C[(2+1)%5],1);
D[1] = C[(1+4)%5] ^ rol<T>(C[(1+1)%5],1);
D[0] = C[(0+4)%5] ^ rol<T>(C[(0+1)%5],1);

This allows the compiler to optimize away the expensive % operation and use direct indices into D and C. This inlining happens already at -O1.

 

Similar unrolling can be done for the 2D loops, but the template structure is a bit more complicated, as the recursion for the inner loop needs to be restartable. This calls for a start template, an intermediate worker template, an inner-loop stop template and a general stop template. These can be seen in loops.hh but will not be explained further.

Constant tables

Looking at the fourth loop:


We see a table lookup, rix[][], used to determine the rotation amounts for the mixing. By making this table a static constexpr and using immediate indexing via the template loops, we allow these values to be fed to the compiler as immediates. On ARM – where many instructions can apply an immediate rotation to an operand – this is a clear advantage.

 

To get this to work generically with only one table, we use the fact that rotations are modulo the bitwidth. This is the reason for the mask and the check for zero in the rotate function we saw earlier. During inlining these are optimized away, as the rotate argument ends up being a constant determined at compile time.

 

As this example shows, it is possible to write readable and straightforward code even if it gets heavily optimized underneath. Thinking about programming in terms of how the compiler will understand and transform the code is a bit difficult to begin with, but looking at assembly output and benchmarking will take you a long way. The result ends up much faster (in this case the speedup is 3-4 times over the naïve implementation) and still portable.

 

Caveats

There are a number of caveats in stretching the compiler into these less-used corners, especially on embedded platforms. In particular, using the RC table as constexpr works in some cases and not in others, so I had to leave that optimization out.

 

Introducing KRaNG - Keccak Random Number Generator

Implementing the sponge-based RNG described in the keccak.team paper is straightforward once the sponge structure is in place. It amounts to two functions: one to introduce new entropy and one to extract random numbers.
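The structure of those two functions can be sketched as follows. Note that this is only a structural sketch: the permute() here is a stand-in placeholder, not the real Keccak-p, and the rate and class name are illustrative assumptions – see the krang repository for the real implementation:

```cpp
#include <cstdint>
#include <cstring>

// Structural sketch of the sponge-based PRNG from the SpongePRNG paper:
// feed() absorbs entropy into the state, fetch() squeezes bytes out.
// permute() is a placeholder mixing function, NOT the real Keccak-p.
class SpongePrng {
public:
    SpongePrng() : pos(0) { memset(state, 0, sizeof(state)); }

    // Absorb new (true-)random data, increasing the entropy of the pool.
    void feed(const uint8_t* seed, unsigned len) {
        for (unsigned i = 0; i < len; i++) {
            state[i % RATE] ^= seed[i];             // XOR into the outer part
            if (i % RATE == RATE - 1) permute();    // mix when the rate is full
        }
        permute();
        pos = 0;
    }

    // Squeeze out one pseudo-random byte.
    uint8_t fetch() {
        if (pos == RATE) { permute(); pos = 0; }    // re-mix when exhausted
        return state[pos++];
    }

private:
    enum { STATE = 25, RATE = 16 }; // 25 lanes of 8 bits; rate is illustrative

    void permute() {                // placeholder mixing, NOT Keccak-p
        uint8_t t = state[STATE - 1];
        for (unsigned i = 0; i < STATE; i++) {
            t = (uint8_t)(t * 5 + state[i] + i);
            state[i] = (uint8_t)((t << 3) | (t >> 5));
        }
    }

    uint8_t state[STATE];
    unsigned pos;
};
```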

 

Wednesday, July 14, 2021

BLE from scratch part 5 - interfacing characteristics to the real world

In the last installment I showed how to create services and characteristics. But reading out static data from a device is not very useful: most, if not all, BLE devices need to monitor or control something.

In this post I will show how to get BLE to interact with the real world, but first we need something to interact with:

We will make an LED that can be turned on and off by writing to a characteristic, and a button that can notify a GATT client of how many times it has been pushed.

We first define the GPIOs where the LED and button are attached.

Then we define the variables that will hold the LED state and button count, as well as the characteristic handles that will be associated with these variables:

We then create a function that will set the LED GPIO up correctly, and another function that will update the LED according to the led_state variable:

Note that the LED is driven with an open-drain circuit: when the port is set to 0, the LED turns on because the low-side output transistor conducts. When the port is set to 1, the output is tri-stated so that no current flows and the LED turns off.

For this project we need to insert code for handling button events into the mainloop. We therefore introduce a function chain that allows plugging functionality into the main loop without actually changing it.
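The function chain is a generic pattern; a minimal sketch of the idea (my own reconstruction, not the exact code from the repository) could look like this:

```cpp
// Sketch of a mainloop function chain: handlers are kept in a singly
// linked list, so functionality can be plugged into the mainloop
// without changing the mainloop itself.
struct MainloopHook {
    void (*fn)(void*);   // handler to run each mainloop iteration
    void* ctx;           // opaque context passed to the handler
    MainloopHook* next;
};

static MainloopHook* mainloop_chain = nullptr;

// Install a hook; the hook object must outlive any call to
// mainloop_run_hooks() (typically it has static lifetime).
void mainloop_install(MainloopHook* hook, void (*fn)(void*), void* ctx) {
    hook->fn = fn;
    hook->ctx = ctx;
    hook->next = mainloop_chain;
    mainloop_chain = hook;
}

// Called once per mainloop iteration; visits every installed handler.
void mainloop_run_hooks() {
    for (MainloopHook* h = mainloop_chain; h != nullptr; h = h->next)
        h->fn(h->ctx);
}
```

The same pattern is reused below for chaining write-event handlers, where each characteristic that needs to react to a write installs itself in the chain.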

Likewise we need to be able to handle write events when they occur in order to capture data changes that come via the BLE interface. This is also done by function chaining as the write event is common for all characteristics, and each characteristic that needs to execute code when written therefore has to be visited:

Finally, it is important to know whether we are connected to a device, as the SoftDevice will crash if we try to notify a characteristic without a connection being available. Hence we have introduced a function ble_conn_handle() to query the connection handle. It returns 0xFFFF if no connection is available.

Armed with this functionality we can turn to the button handling. This part is a bit more involved, primarily because I wanted to use interrupts to detect the button push. Note that the method I use in this post is not the way to do it if you want a low-power button handler, and it lacks debouncing, so it can double-trigger on a push. But it serves to show how the NVIC is set up when running under the SoftDevice, and how critical sections are used with the SoftDevice as well.

We use the GPIOTE module to generate interrupts based on GPIO events. First the button is configured as an input, and then the module is set up to listen for a high-to-low transition.
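A sketch of what this configuration could look like, assuming the standard nRF52 MDK register and HAL definitions (BUTTON_PIN stands for whatever GPIO number the button uses):

```cpp
// Sketch: configure the button pin as input with pull-up, then set up
// GPIOTE channel 0 to generate an event on a high-to-low transition.
nrf_gpio_cfg_input(BUTTON_PIN, NRF_GPIO_PIN_PULLUP);
NRF_GPIOTE->CONFIG[0] =
      (GPIOTE_CONFIG_MODE_Event      << GPIOTE_CONFIG_MODE_Pos)
    | (BUTTON_PIN                    << GPIOTE_CONFIG_PSEL_Pos)
    | (GPIOTE_CONFIG_POLARITY_HiToLo << GPIOTE_CONFIG_POLARITY_Pos);
```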

The GPIOTE interrupt must be enabled in the NVIC at priority 3, which is the application interrupt priority under the SoftDevice. This is done with code that first disables interrupts, then enables the GPIOTE interrupt in the NVIC and sets its priority. Note that these operations are done as calls through the SoftDevice, as direct manipulation of the NVIC would cause a hard fault. The code then enables the GPIOTE to generate interrupts and re-enables interrupts in the NVIC. The last line installs a local extension to the mainloop handler that looks at the data generated by the button interrupt.

The GPIOTE interrupt is served by a dedicated interrupt handler.
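Following the description below, such a handler could be sketched like this (assuming the nRF MDK and SoftDevice NVIC wrapper headers; the variable name is illustrative):

```cpp
// Sketch of the GPIOTE interrupt handler. It must be extern "C" so the
// linker can match it against the vector table entry when compiled as C++.
static volatile uint32_t button_count; // read from the mainloop in a critical section

extern "C" void GPIOTE_IRQHandler(void) {
    button_count++;                       // count the push
    NRF_GPIOTE->EVENTS_IN[0] = 0;         // clear the event in the peripheral
    sd_nvic_ClearPendingIRQ(GPIOTE_IRQn); // clear the NVIC via the SoftDevice
}
```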

It is declared extern "C" as it is defined in a C++ file; without this declaration, the linker would not be able to find it. The interrupt handler increments a global variable, clears the button event in the GPIOTE hardware and then clears the interrupt in the NVIC – again via a call to the SoftDevice.

The last bit is a function that runs every time the mainloop runs. This function works as follows:

First it will look at the button push counter to see if the button was pushed since last mainloop iteration. This is done by fetching the button counter in a critical section. If so, it will check whether the peripheral is connected to a central. If not, it will just write a debug message and return. This is done because notifications when not connected can cause a fault in the SoftDevice. 

[ui_service.cc:41-64]

Then - if the button was pushed - it checks whether notifications are actually enabled on the button count characteristic. If they are, it updates the variable that backs the characteristic and notifies the SoftDevice of the change. If not, the button push is ignored. Note that you could have updated the backing variable anyway so the new value is available for normal reads - that is just not the behavior in this case.

In the end we install the service much like the DeviceInfo service. This is done as follows:

Using these building blocks we can test the LED and button functionality with LightBlue. These building blocks are what allow any BLE device to interface with the real world.

Full source code can be found here


Saturday, January 9, 2021

BLE From Scratch part 4: Adding a service

OK - so COVID-19 and other life circumstances struck - and I had to leave this project alone for some time. Now it is time to pick it up again, and look at how to add services and characteristics in the S112 SoftDevice. Other than Services and Characteristics being the way you communicate with your BLE peripheral, I will not go into details on exactly how they are supposed to work conceptually. For that I refer to the Bluetooth Specification Volume 3 part G. Admittedly this chapter can be sort of hard to read, but read it forwards, then backwards, then forwards again, and you should get the picture. 

The basic relationships between the GATT primitives form a hierarchy: a GATT server contains services, each service contains characteristics, and each characteristic consists of an attribute value plus optional descriptors.



In order for the SoftDevice to work properly, each service and descriptor has to be added in the order it appears in the database. For the first simple example we are going to build the Device Information service specified by Bluetooth (documentation can be found here). All characteristics are optional, and we are going to implement four of them: the Manufacturer Name string, the Model Number string and the Firmware/Software Revision strings. We do this because iOS expects these to be defined if the DeviceInfo service is supported.

By looking at the 16-bit UUIDs document (found here) we can find the following UUIDs:

  • Device Information Service has the UUID 0x180A
  • Manufacturer Name has the UUID 0x2A29
  • Model Number has the UUID 0x2A24
  • Firmware Revision has the UUID 0x2A26
  • Software Revision has the UUID 0x2A28
Note that these are short UUIDs, as opposed to the custom service and characteristics that we are advertising. Also note that this document is not that user-friendly, as it lists in UUID order; the search function is therefore your friend.

To define a new service we only need to define the UUID for the service.


Each characteristic needs two sets of metadata: one for the characteristic itself, and one for configuring the attribute value. Since all the characteristics here are of the same type (public, read-only, fixed length) we only need two sets of metadata in total.

The first set is the metadata for the characteristic itself.
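A sketch of how this could look with the S112 SoftDevice API (the variable name is my choice; the structure and field names are from the SoftDevice headers):

```cpp
// Sketch: metadata for a public, read-only characteristic. Zero-
// initialization leaves all other properties at 0, i.e. disabled.
ble_gatts_char_md_t char_md = {};
char_md.char_props.read = 1;  // read-only: only the read property is set
// char_md.char_props.write, .notify etc. stay 0; the pointer members
// (p_char_user_desc, p_char_pf, p_cccd_md, p_sccd_md, ...) stay NULL
```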


Here we only set the read property to 1; all other properties need to be 0, which is the default with partial initialization. This makes the characteristic read-only. The other properties are commented out but can be used as a template for future configurations. It should be noted that the SoftDevice groups a number of characteristic concepts in the definition of this metadata. They map to the Bluetooth specification Volume 3 Part G in the following way:

  • The char_props (Characteristic properties) are documented in section 3.3.1.1
  • The char_ext_props (Extended Characteristic Properties) are documented in section 3.3.3.1
  • The p_char_user_desc can point to a string containing a human readable description of the characteristic. If not set to NULL, char_user_desc_max_size and char_user_desc_size must be set as well. This is documented in 3.3.3.2
  • p_char_pf can be set to point to a format specifier as described in 3.3.3.5. If set, the format specifier has to be of the type described here
  • The *_md fields are used if the metadata for the user description, CCCD (section 3.3.3.3) or SCCD (section 3.3.3.4) should differ from the metadata for the attribute value. This would usually not be the case
Then we have the attribute value metadata.
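A sketch of this second set of metadata, again using the SoftDevice API (variable name is my choice):

```cpp
// Sketch: attribute value metadata for a public, read-only value
// stored in user (application) memory - here, directly in flash.
ble_gatts_attr_md_t attr_md = {};
BLE_GAP_CONN_SEC_MODE_SET_OPEN(&attr_md.read_perm); // security mode 1, level 1
attr_md.vloc = BLE_GATTS_VLOC_USER;                 // value lives in application memory
// .vlen, .rd_auth and .wr_auth stay 0: fixed length, no read/write authorization
```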


We set the read permission to security mode 1, level 1, which means publicly readable without encryption. The security levels are explained here: Security Levels

You can set separate permissions for read and write; however, in most cases it makes sense to set them to the same value. Since the characteristics are read-only, we only set access for the read part.

We store the value in user space, as indicated by setting .vloc to BLE_GATTS_VLOC_USER. This makes particular sense for read-only values, as it allows us to store the value directly in flash memory.

If the value had variable length, we would set .vlen to 1. If we had to get permission from the application to read or write, we would set .rd_auth or .wr_auth to 1 respectively.

Now the actual attribute can be set up (manufacturer name as example).
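A sketch of this setup (the string and variable names are placeholders of my choosing; attr_md refers to the attribute metadata described above):

```cpp
// Sketch: the manufacturer name attribute, with the value stored
// directly in flash via BLE_GATTS_VLOC_USER.
static const char manufacturer_name[] = "ACME"; // placeholder string

ble_uuid_t uuid;
uuid.uuid = 0x2A29;              // Manufacturer Name String
uuid.type = BLE_UUID_TYPE_BLE;   // standard 16-bit Bluetooth UUID

ble_gatts_attr_t attr = {};
attr.p_uuid    = &uuid;
attr.p_attr_md = &attr_md;                      // the attribute metadata above
attr.init_len  = sizeof(manufacturer_name) - 1; // exclude the terminating NUL
attr.init_offs = 0;
attr.max_len   = sizeof(manufacturer_name) - 1; // fixed length: init == max
attr.p_value   = (uint8_t*)manufacturer_name;
```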


We declare the UUID as a standard Bluetooth UUID, and then we set up the attribute with its UUID, the attribute metadata, the initial length, the max length and the offset of the value pointed to by p_value.
If this had been a variable-length attribute, the initial and max lengths could have differed, and the length could have been stored in e.g. the first byte of the value, so that the offset would have been 1 instead.

Once we have declared the data for all the characteristics, we register them with the SoftDevice - first installing the service, and then associating the characteristics with the service via its handle.
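These two steps can be sketched with the SoftDevice GATTS calls like this (char_md and attr refer to the metadata and attribute structures described above; error handling is omitted):

```cpp
// Sketch: install the service, then attach a characteristic to it.
uint16_t service_handle;
ble_uuid_t service_uuid;
service_uuid.uuid = 0x180A;             // Device Information Service
service_uuid.type = BLE_UUID_TYPE_BLE;

sd_ble_gatts_service_add(BLE_GATTS_SRVC_TYPE_PRIMARY,
                         &service_uuid, &service_handle);

ble_gatts_char_handles_t handles;       // discarded afterwards in this case
sd_ble_gatts_characteristic_add(service_handle, &char_md, &attr, &handles);
```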


In this case we throw the handles for the characteristics away, as we don't need them after creation. This will change in the next installment, where we will look at dynamic characteristics that interact with the physical world.

For now the last thing we need to do is to install the service. We do that in ble_start() right before we start advertising: