Category Archives: High Performance Computing

Serial and SIMD implementation of the Xoshiro256+ random number generator – Part 1 Implementation and Usage

The Xoshiro256PlusSIMD project provides a C++ implementation of Xoshiro256+ random number generator that matches the performance of the reference C implementation of David Blackman and Sebastiano Vigna ( Xoshiro256+ combines high speed, small memory space requirements for stored state and excellent statistical quality. For cryptographic use cases or use cases where absolutely the best statistical quality is required – maybe consider a different RNG like the Mersenne Twist. For any any other conventional simulation or testing use case, Xoshiro256+ should be perfectly fine statistically and better than a whole lot of other slower alternatives.

This implementation is a header-only library and provides the following capabilities:

  • Single 64 bit unsigned random value
  • Single 64 bit unsigned random value reduced to a [lower, upper) range
  • Four 64 bit unsigned random values
  • Four 64 bit unsigned random values reduced to a [lower, upper) range
  • Single double length real random value in a range of (0,1)
  • Single double length real random value in a (lower, upper) range
  • Four double length real random values in a range of (0,1)
  • Four double length real random values in a (lower, upper) range

Implementation Details

For platforms supporting the AVX2 instruction set, the RNG can be configured to use AVX2 instructions or not on an instance by instance basis. AVX2 instructions are only used for the four-wide operations, there is no advantage using them for single value generation.

The four-wide operations use a different random seed per value and the the seed for single value generation is distinct as well. The same stream of values will be returned by the serial and AVX2 implementations. It might be faster for the serial implementation to use only a single seed across all the four values – each increasing index being the next value in a single series, instead of each of the four values having its unique series. The downside of that approach is that the serial implementation would return different four wide values than the AVX2 implementation. The AVX2 implementation must use distinct seeds for each of the four values.

The random series for each of the four-wide values are separated by 2^192 values – i.e. a Xoshiro256+ ‘long jump’ separates the seed for each of the four values. For clarity, the Xoshiro256+ has a state space of 2^256.

The reduction of the uint64s to an integer range takes uint32 bounds. This is a significant reduction in the size of the random values but permits reduction while avoiding taking a modulus. If you have a need for random integer values beyond uint32 sizes, I’d suggest taking the full 64 bit values and applying your own reduction algorithm. The modulus approach to reduction is slower than the approach in the code which uses shifts and a multiply.

Finally, the AVX versions are coded explicitly with AVX intrinsics, there is no reliance on the vageries of compiler vectorization. The SIMD version could be written such that gcc should unroll loops and vectorize but others have reported that it is necessary to tweak optimization flags to get the unrolling to work. For these implementations, all that is needed is to have the -mavx2 compiler option and the AVX2_AVAILABLE symbol defined.


The class Xoshiro256Plus is a template class and takes an SIMDInstructionSet enumerated value as its only template parameter. SIMDInstructionSet may be ‘NONE’, ‘AVX’ or ‘AVX2’. The SIMD acceleration requires the AVX2 instruction set and uses ‘if contexpr’ to control code generation at compile time. There is also a preprocessor symbol AVX2_AVAILABLE which must be defined to permit AVX2 instances of the RNG to be created. It it completely reasonable to have the AVX2 instruction set available but still use an RNG instance with no SIMD acceleration.

#define __AVX2_AVAILABLE__

#include "Xoshiro256Plus.h"

constexpr size_t NUM_SAMPLES = 1000;
constexpr uint64_t SEED = 1;

typedef SEFUtility::RNG::Xoshiro256Plus Xoshiro256PlusSerial;
typedef SEFUtility::RNG::Xoshiro256Plus Xoshiro256PlusAVX2;

bool InsureFourWideRandomStreamsMatch()
    Xoshiro256PlusSerial serial_rng(SEED);
    Xoshiro256PlusAVX2 avx_rng(SEED);

    for (auto i = 0; i < NUM_SAMPLES; i++)
        auto next_four_serial = serial_rng.next4( 200, 300 );
        auto next_four_avx = avx_rng.next4( 200, 300 );

        if(( next_four_serial[0] != next_four_avx[0] ) ||
           ( next_four_serial[1] != next_four_avx[1] ) ||
           ( next_four_serial[2] != next_four_avx[2] ) ||
           ( next_four_serial[3] != next_four_avx[3] ))
        { return false; }

    return true;