Category Archives: Test Tools

Serial and SIMD implementation of the Xoshiro256+ random number generator – Part 1 Implementation and Usage

The Xoshiro256PlusSIMD project provides a C++ implementation of Xoshiro256+ random number generator that matches the performance of the reference C implementation of David Blackman and Sebastiano Vigna ( Xoshiro256+ combines high speed, small memory space requirements for stored state and excellent statistical quality. For cryptographic use cases or use cases where absolutely the best statistical quality is required – maybe consider a different RNG like the Mersenne Twist. For any any other conventional simulation or testing use case, Xoshiro256+ should be perfectly fine statistically and better than a whole lot of other slower alternatives.

This implementation is a header-only library and provides the following capabilities:

  • Single 64 bit unsigned random value
  • Single 64 bit unsigned random value reduced to a [lower, upper) range
  • Four 64 bit unsigned random values
  • Four 64 bit unsigned random values reduced to a [lower, upper) range
  • Single double length real random value in a range of (0,1)
  • Single double length real random value in a (lower, upper) range
  • Four double length real random values in a range of (0,1)
  • Four double length real random values in a (lower, upper) range

Implementation Details

For platforms supporting the AVX2 instruction set, the RNG can be configured to use AVX2 instructions or not on an instance by instance basis. AVX2 instructions are only used for the four-wide operations, there is no advantage using them for single value generation.

The four-wide operations use a different random seed per value and the the seed for single value generation is distinct as well. The same stream of values will be returned by the serial and AVX2 implementations. It might be faster for the serial implementation to use only a single seed across all the four values – each increasing index being the next value in a single series, instead of each of the four values having its unique series. The downside of that approach is that the serial implementation would return different four wide values than the AVX2 implementation. The AVX2 implementation must use distinct seeds for each of the four values.

The random series for each of the four-wide values are separated by 2^192 values – i.e. a Xoshiro256+ ‘long jump’ separates the seed for each of the four values. For clarity, the Xoshiro256+ has a state space of 2^256.

The reduction of the uint64s to an integer range takes uint32 bounds. This is a significant reduction in the size of the random values but permits reduction while avoiding taking a modulus. If you have a need for random integer values beyond uint32 sizes, I’d suggest taking the full 64 bit values and applying your own reduction algorithm. The modulus approach to reduction is slower than the approach in the code which uses shifts and a multiply.

Finally, the AVX versions are coded explicitly with AVX intrinsics, there is no reliance on the vageries of compiler vectorization. The SIMD version could be written such that gcc should unroll loops and vectorize but others have reported that it is necessary to tweak optimization flags to get the unrolling to work. For these implementations, all that is needed is to have the -mavx2 compiler option and the AVX2_AVAILABLE symbol defined.


The class Xoshiro256Plus is a template class and takes an SIMDInstructionSet enumerated value as its only template parameter. SIMDInstructionSet may be ‘NONE’, ‘AVX’ or ‘AVX2’. The SIMD acceleration requires the AVX2 instruction set and uses ‘if contexpr’ to control code generation at compile time. There is also a preprocessor symbol AVX2_AVAILABLE which must be defined to permit AVX2 instances of the RNG to be created. It it completely reasonable to have the AVX2 instruction set available but still use an RNG instance with no SIMD acceleration.

#define __AVX2_AVAILABLE__

#include "Xoshiro256Plus.h"

constexpr size_t NUM_SAMPLES = 1000;
constexpr uint64_t SEED = 1;

typedef SEFUtility::RNG::Xoshiro256Plus Xoshiro256PlusSerial;
typedef SEFUtility::RNG::Xoshiro256Plus Xoshiro256PlusAVX2;

bool InsureFourWideRandomStreamsMatch()
    Xoshiro256PlusSerial serial_rng(SEED);
    Xoshiro256PlusAVX2 avx_rng(SEED);

    for (auto i = 0; i < NUM_SAMPLES; i++)
        auto next_four_serial = serial_rng.next4( 200, 300 );
        auto next_four_avx = avx_rng.next4( 200, 300 );

        if(( next_four_serial[0] != next_four_avx[0] ) ||
           ( next_four_serial[1] != next_four_avx[1] ) ||
           ( next_four_serial[2] != next_four_avx[2] ) ||
           ( next_four_serial[3] != next_four_avx[3] ))
        { return false; }

    return true;

HeapWatcher : Memory Leak Detector for Automated Testing

This project provides a simple tool for tracking heap allocations between start/finish points in C++ code. It is intended for use in unit test and perhaps some feature tests. It is not a replacement for Valgrind or other memory debugging tools – the primary intent is to provide an easy-to-use tool that can be added to unit tests built with GoogleTest or Catch2 to find leaks and provide partial or full stack dumps of leaked allocations.

The project can be found in github at:


The C standard library functions of malloc(), calloc(), realloc() and free() are ‘weak symbols‘ in glibc and can be replaced by user-supplied functions with the same signatures supplied in a user static library or shared object. This tool wraps the c standard library calls and then tracks all allocations and frees in a map. The ‘book-keeping’ is performed in a separate thread to (1) limit the need for mutexes or critical sections to protect shared state and (2) limit the run-time performance impact on the code under test. The functions in HeapWatcher are not intrusive in that they simply delegate to the glibc functions and then track allocations in a separate data structure. Allocation tracking can be paused in any thread being tracked and there is a facility to capture stack traces for ‘intentional leaks’ and then ignore those for tracking purposes.

There exists a single global static instance of HeapWatcher which can be accessed with the SEFUtility::HeapWatcher::get_heap_watcher() function.

Additionally, there are a pair of multi-threaded test fixtures provided in the project. One fixture launches workload threads and requires the user to manage the heap watcher. The second test fixture integrates the heap watcher and tracks all allocations made while the instance of the fixture itself is in scope.

For memory intensive applications running on many cores, the single tracker thread may be insufficient. All allocation records go into a queue, will not be lost and will eventually be processed. Potential problems can arise if the application allocates faster than the single thread can keep up and the queue used for passing the records to the tracker thread grows to the point that it exhausts system memory. When the HeapWatcher stops, the memory snapshot it returns is the result of processing all allocation records – so it should be correct.

Including into a Project

Probably the easiest way to use HeapWatcher is to include it through the fetch mechanism provided by CMake:




The CMake specification for HeapWatcher will build the library which muct be linked into your peoject. In addition, for the call stack decoding to work properly, the following linker option must be included in your project as well:


HeapWatcher is not a header-only project, the linker must have concrete instances of malloc(), calloc(), realloc() and free() to link to the rest of the code under test. Given the ease of including the library with CMake, this doesn’t present much of a problem overall.

Using HeapWatcher

Only a single header file HeapWatcher.hpp must be included in any file wishing to use the tool. This header contains all the data structures and classes needed to use the tool. The HeapWatcher class itself is fairly simple and the call to retrieve the global instance is trivial :

namespace SEFUtility::HeapWatcher
    class HeapWatcher
            virtual void start_watching() = 0;
            virtual HeapSnapshot stop_watching() = 0;

            [[nodiscard]] virtual PauseThreadWatchGuard pause_watching_this_thread() = 0;
            virtual uint64_t capture_known_leak(std::list<std::string>& leaking_symbols, std::function<void()> function_which_leaks) = 0;
            [[nodiscard]] virtual const KnownLeaks known_leaks() const = 0;

            [[nodiscard]] virtual const HeapSnapshot snapshot() = 0;
            [[nodiscard]] virtual const HighLevelStatistics high_level_stats() = 0;

    HeapWatcher& get_heap_watcher();

Note the namespace declaration. There are a number of other classes declared in the HeapWatcher.cpp header for the HeapSnapshot and to provide the pause watching capability. A simple example of using HeapWatcher in a Catch2 test appears below:

void OneLeak() { int* new_int = static_cast(malloc(sizeof(int))); }

void OneLeakNested() { OneLeak(); }
TEST_CASE("Basic HeapWatcher Tests", "[basic]")
    SECTION("One Leak Nested", "[basic]")


        auto leaks(SEFUtility::HeapWatcher::get_heap_watcher().stop_watching());

        REQUIRE(leaks.open_allocations().size() == 1);

        REQUIRE_THAT(leaks.open_allocations()[0].stack_trace()[0].function(), Catch::Matchers::Equals("OneLeak()"));

        REQUIRE(leaks.high_level_statistics().number_of_mallocs() == 1);
        REQUIRE(leaks.high_level_statistics().number_of_frees() == 0);
        REQUIRE(leaks.high_level_statistics().number_of_reallocs() == 0);
        REQUIRE(leaks.high_level_statistics().bytes_allocated() == sizeof(int));
        REQUIRE(leaks.high_level_statistics().bytes_freed() == 0);

Capturing Known Leaks

In various third party libraries there exist intentional leaks. A good example is the leak of a pointer for thread local storage for each thread created by the pthread library. There is a leak from the symbol ‘dl_allocate_tls‘ that appears to remain even after std::thread::join() is called. This appears not infrequently in Valgrind reports as well. Given the desire to make this a library for automated testing, there is the capability to capture and then ignore allocations from certain functions or methods. An example appears below:

SECTION("Known Leak", "[basic]")
    std::list<std::string> leaking_symbol({"KnownLeak()"});

    REQUIRE( SEFUtility::HeapWatcher::get_heap_watcher().capture_known_leak(leaking_symbol, []() { KnownLeak(); }) == 1 );

    REQUIRE(SEFUtility::HeapWatcher::get_heap_watcher().known_leaks().addresses().size() == 2);



    auto leaks(SEFUtility::HeapWatcher::get_heap_watcher().stop_watching());

    REQUIRE(leaks.open_allocations().size() == 2);

The capture_known_leak() method takes two arguments: 1) a std::list<std::string> containing one or more symbols which if located in a stack trace will cause the allocation associated with the trace to be ignored and 2) a function (or lambda) which will evoke one or more leaks associated with the symbols passed in the first argument. The leaking function need not be just adjacent to the malloc, it may be further up the call stack but the allocation will only be ignored if it appears at the same number of frames above the memory allocation as at the time the leak was captured.

This approach of actively capturing the leak at runtime is effective for dealing with ASLR (Address Space Layout Randomization) and does not require loading of shared libraries or other linking or loading gymnastics.

Pausing Allocation Tracking

The PauseThreadWatchGuard instance returned by a call to HeapWatcher::pause_watching_this_thread() is a scope based mechanism for suspending heap activity tracking in a thread. For example, the above snippet can be modified to not log the leak in OneLeakNested() by obtaining a guard and putting the leaking call into the same scope as the guard:


      auto pause_watching = SEFUtility::HeapWatcher::get_heap_watcher().pause_watching_this_thread();


    auto leaks(SEFUtility::HeapWatcher::get_heap_watcher().stop_watching());

    REQUIRE(leaks.open_allocations().size() == 0);

Once the guard instance goes out of scope, HeapWatcher will again start tracking allocations in the thread.

Test Fixtures

Two test fixtures are included with HeapWatcher and both are intended to ease the creation of multi-threaded unit test cases, which are useful for detecting race conditions or dead locks. The test fixtures feature the ability to add functions or lambdas for ‘workload functions’ and then start all of those ‘workload functions’ simultaneously. Alternatively, ‘workload functions’ may be given a random start delay in seconds (as a double so it may be fractions of a second as well). This permits stress testing with a lot of load started at one time or allows for load to ramp over time.

The SEFUtility::HeapWatcher::ScopedMultithreadedTestFixture class starts watching the heap on creation and takes a function or lambda which will be called with a HeapSnapshot when all threads have completed, to permit testing the final heap state. This test fixture effectively hides the HeapWatcher instructions whereas the SEFUtility::HeapWatcher::MultithreadedTestFixture class requires the user to wrap the test fixture with the HeapWatcher start and stop.

Examples of both test fixtures appear below. First is an example of MultithreadedTestFixture :

    SECTION("Torture Test, One Leak", "[basic]")
        constexpr int64_t num_operations = 2000000;
        constexpr int NUM_WORKERS = 20;

        SEFUtility::HeapWatcher::MultithreadedTestFixture test_fixture;


                                  std::bind(&RandomHeapOperations, num_operations));  //  NOLINT(modernize-avoid-bind)
        test_fixture.add_workload(1, &OneLeak);



        auto leaks = SEFUtility::HeapWatcher::get_heap_watcher().stop_watching();

        REQUIRE(leaks.open_allocations().size() == 1);

An example of ScopedMultiThreadedTestFixture follows :

    SECTION("Two Workloads, Few Threads, one Leak", "[basic]")
        constexpr int NUM_WORKERS = 5;

        SEFUtility::HeapWatcher::ScopedMultithreadedTestFixture test_fixture(
            [](const SEFUtility::HeapWatcher::HeapSnapshot& snapshot) { REQUIRE(snapshot.numberof_leaks() == 5); });

        test_fixture.add_workload(NUM_WORKERS, &BuildBigMap);
        test_fixture.add_workload(NUM_WORKERS, &OneLeak);




HeapWatcher and the multithreaded test fixture classes are intended to help developers create tests which check for memory leaks either in simple procedural test cases written with GoogleTest or Catch2 or in more complex multi-threaded tests in those same base frameworks.

Managing Privileges for Automated Raspberry Pi GPIO Testing

Many RPi libraries manipulate the GPIO pins by mapping various GPIO control memory blocks into the process address space. For GPIO input/output pins only, the Raspberry Pi OS kernel supports the /dev/gpiomem device which can be accessed from user space. For any other GPIO functions, such as setting up a PWM output, other memory blocks must be accessed through /dev/mem.

Typically, the base address of the desired control block is mapped into the process address space using the mmap command which takes a file descriptor for a /dev/mem device. User space processes cannot open the /dev/mem device, so a common workaround is to run the process as root using sudo. Additionally, GPIO ISR handling typically has much higher fidelity when the ISR dispatching thread runs at one of the ‘real-time’ thread schedules. This too requires elevated privileges.

For many use cases, like automated CI/CD running a test process under sudo is a less than optimal approach and certainly violates the ‘Principle of Least Privilege’. Typically, these kinds of impediments result in skipping automated testing -or- using workarounds like putting root passwords in response files.

Bluntly, there is no way to provide elevated privileges to a process without incurring some security risk and for clarity the approach described below is not strictly *secure* – I feel it is better and more constrained than most of the alternatives I have found. Certainly for hobbyists and individuals working on a benchtop, this is probably more than ‘good enough’.


The RPi maps the GPIO controls into physical memory at 0x3F000000 for BCM2835 (Models 2 &3) and 0x7E200000 for BCM2711 (Model 4) based RPis. To do this, a snippet of code like the following is used :


uint32_t peripheral_base = 0xFE000000;
uint32_t gpio_offset = 0x200000;
uint32_t gpio_block_size = 0xF4;

int dev_mem_fd = open("/dev/mem", O_RDWR | O_SYNC );
void*  gpios = mmap( nullptr, gpio_block_size, MAPPED_MEMORY_PROTECTION, MAPPED_MEMORY_FLAGS, dev_mem_fd, peripheral_base + gpio_offset );

The /dev/mem device cannot be opened if the process does not have the CAP_SYS_RAWIO capability. There are a lot of other operations that are permitted for processes with that capability – but an ability to map physical memory into a virtual address space opens up a whole plethora of potential compromises.

Unfortunately, without this privilege any app needing to mount /dev/mem will have to be run with sudo – which is difficult to manage in an automated pipeline or even when running unit tests within an IDE, like using Catch2 in VSCode.


Linux permits capabilities to be assigned to files, so it is possible to provide the CAP_SYS_RAWIO capability to specific files – for instance the unit test app created by a makefile. To do this, the following will suffice:

sudo setcap cap_sys_rawio tests

However, every time the tests file is rebuilt, the capabilities must be re-assigned – so we have not really made much progress, there is still a need for the user to intervene and provide a root password after every build.

To workaround this, the interactive user could grant herself the CAP_SETFCAP capability and then the snippet above can be run without requiring sudo. Giving a process CAP_SETFCAP capability is just one small step away from simply running as root, so we should strive for something better.

It is possible to permit a user or group to execute commands with sudo but without requiring a password by adding entries to a sudoers file. In fact, this capability can be fairly tightly constrained to very specific command patterns. These files can be placed under /etc/sudoers.d/ and will be picked up by the sudo processor. An example appears below :

steve ALL=(root) NOPASSWD:  /sbin/setcap cap_sys_rawio+eip /home/sef/dev/unit_tests/tests
steve ALL=(root) NOPASSWD: !/sbin/setcap cap_sys_rawio+eip /home/sef/dev/unit_tests/tests*..*
steve ALL=(root) NOPASSWD: !/sbin/setcap cap_sys_rawio+eip /home/sef/dev/unit_tests/tests*[ ]*

In the example, the first line permits user steve to run /sbin/setcap cap_sys_rawio+eip /home/sef/dev/unit_tests/tests without having to supply a password. The next two lines have an exclamation point to negate the operation and effectively eliminate any permutations of the prior line which could be used to grant the capability to a different, unintended file.

This combination puts us in a place where any process running under steve can provide the CAP_SYS_RAWIO capability to only the /home/sef/dev/unit_tests/tests file without having to supply the root password. Clearly, if steve‘s account is compromised it would possible for someone to gain root – but the attacker would have to do a lot more work to get there if privileges had been provided indiscriminately or if the root password were placed in a response file.

Doing the above gets us close, but there is one more step needed. The /dev/mem file is owned by root and can only be accessed by root. Assigning capabilities granularly elevates privileges in the interactive process but that process is still not root. To resolve this final stumbling block, we can modify the ACL for /dev/mem to permit the interactive user to access it. An example of how to do this appears below :

sudo setfacl -m u:steve:rw /dev/mem

This command will not persist through reboots, but needs to be executed only once after a reboot. It would be possible to make this assignment persistent if desired.

Putting It All Together

The good news is that all of the above can be *mostly* automated as part of a CMake specification. Only two infrequent manual steps are required.

The following example uses a pair of template files and some CMake specifications to create a specialized sudoers file and a specialized shell script for setting the ACLs properly for /dev/mem. Peronally, I put the templates in the misc subdirectory of my unit tests folder and the CMakeLists.txt file is in the unit test folder itself. For the purposes of this example, the templates must simply be in the subdirectory misc of the directory holding the CMakeLists.txt file.

#   Allow setcap execute without a password only for the CAP_SYS_RAWIO capability on
#       the tests file.  The negative patterns are intended to reduce the risk of anything 
#       other than just 'tests' being modified
#   Copy the generated file with the variables replaced into the /etc/sudoers.d directory

$ENV{USER} ALL=(root) NOPASSWD:  /sbin/setcap cap_sys_rawio+eip ${CMAKE_CURRENT_BINARY_DIR}/tests
$ENV{USER} ALL=(root) NOPASSWD: !/sbin/setcap cap_sys_rawio+eip ${CMAKE_CURRENT_BINARY_DIR}/tests*..*
$ENV{USER} ALL=(root) NOPASSWD: !/sbin/setcap cap_sys_rawio+eip ${CMAKE_CURRENT_BINARY_DIR}/tests*[ ]*

Adding the following to the CMakeLists.txt file will generate the final sudoers file. The CMake file command copies the generated file back into the source directory next to the template whilst also setting the file permissions appropriately. After the file is generated, it must be manually copied with the right file permissions to the /etc/sudoers.d/ directory, as that operation requires root privilege.

configure_file( ./misc/ ${CMAKE_CURRENT_BINARY_DIR}/misc/020_setcap_rawio_on_test_app )
file( COPY ${CMAKE_CURRENT_BINARY_DIR}/misc/020_setcap_rawio_on_test_app

Finally, adding the following to the CMakeLists.txt file will assign the CAP_SYS_RAWIO capability to the tests file every time it is generated.

add_custom_command(TARGET tests POST_BUILD
                   COMMAND sudo setcap cap_sys_rawio+eip ${CMAKE_CURRENT_BINARY_DIR}/tests)

To make the ACL assignment easier, a similar process is used. First, a template file which will be processed by CMake is needed :

sudo setfacl -m u:$ENV{USER}:rw /dev/mem

Then, the right magic in CMakeLists.txt to process the template file :

configure_file( ./misc/ ${CMAKE_CURRENT_BINARY_DIR}/misc/ )

This will create a shell script with the proper substitutions for the interactive user. This script needs to be executed once per session which seems a reasonable compromise. Alternatively, the sudoers file could be enriched to permit the command to be executed without a password and then even the process of permitting the interactive user access to /dev/mem can be used in automated scripts.


As mentioned in the introduction, ISRs servicing GPIO interrupts will typically need to run with realtime scheduling for reasonable performance. The main risk that concerns me is *missing interrupts* and beyond a couple kilohertz on an RPi4 it is easy to lose interrupts. Realtime scheduling in Linux can be applied at the thread level using pthread_setschedparam in something like the following:

if (realtime_scheduling_)
    struct sched_param thread_scheduling;

    thread_scheduling.sched_priority = 5;

    int result = pthread_setschedparam(pthread_self(), SCHED_FIFO, &thread_scheduling);

    if( result != 0 )
        SPDLOG_ERROR( "Unable to set ISR Thread scheduling policy - file may need cap_sys_nice capability.  Error: {}", result );

Using realtime scheduling is something of a risk as poorly designed code can starve the rest of the system, or perhaps more frequently a user looking for more responsivity can ‘nice’ their processes to the detriment of other processes. Therefore, the CAP_SYS_NICE capability is required to execute the above snippet.

The templates above can be enriched to include CAP_SYS_NICE, but there are a few details that *really matter*. The nastiest little complication is the difference in how the comma (i.e. ‘,’) is used by the setcap command and how the comma is interpreted in a sudoers file. In both cases it is a separator, to separate multiple capabilities for setcap and to separate different commands in the sudoers file. Therefore, within the setcap command in the sudoers file, the comma must be escaped with a backslash. The following is the template from above including the CAP_SYS_NICE capability.

#   Allow setcap execute without a password only for the CAP_SYS_RAWIO and CAP_SYS_NICE capabilities
#       on on the tests file.  The negative patterns are intended to reduce the risk of anything
#       other than just 'tests' being modified
#   Copy the generated file with the variables replaced into the /etc/sudoers.d directory

$ENV{USER} ALL=(root) NOPASSWD:  /sbin/setcap cap_sys_rawio\,cap_sys_nice+eip ${CMAKE_CURRENT_BINARY_DIR}/tests
$ENV{USER} ALL=(root) NOPASSWD: !/sbin/setcap cap_sys_rawio\,cap_sys_nice+eip ${CMAKE_CURRENT_BINARY_DIR}/tests*..*
$ENV{USER} ALL=(root) NOPASSWD: !/sbin/setcap cap_sys_rawio\,cap_sys_nice+eip ${CMAKE_CURRENT_BINARY_DIR}/tests*[ ]*

As shown above, the comma separating capabilities is escaped with a backslash. Similarly, the setcap command used to assign capabilities to the tests file will have to be modified in the CMakeLists.txt specification.


Hopefully the content in this post will help not only manage permissions necessary for developing GPIO applications on the RPi but also provide some insight into how CMake can be used to generate various kinds of files from templates for specific use cases. I use these in my CMake files in VSCode while developing remotely on RPi 3s and 4s and it is certainly a lot more fluid a development experience than having to enter the root password all the time.