Introduction

This page describes some new pseudorandom number generators (PRNGs) we (David Blackman and I) have been working on recently, and a shootout comparing them with other generators. Details about the generators can be found in our paper. Information about my previous xorshift-based generators can be found here, but they have been entirely superseded by the new ones, which are faster and better. As part of our study, we developed a very strong test for Hamming-weight dependencies which gave a number of surprising results.

64-bit Generators

xoshiro256** (XOR/shift/rotate) is our all-purpose, rock-solid generator (not a cryptographically secure generator, though, like all PRNGs in these pages). It has excellent (sub-ns) speed, a state space (256 bits) that is large enough for any parallel application, and it passes all tests we are aware of.

If, however, one has to generate only 64-bit floating-point numbers (by extracting the upper 53 bits) xoshiro256+ is a slightly (≈15%) faster generator with analogous statistical properties. For general usage, one has to consider that its lowest bits have low linear complexity and will fail linearity tests; however, low linear complexity can have hardly any impact in practice, and certainly has no impact at all if you generate floating-point numbers using the upper bits (we computed a precise estimate of the linear complexity of the lowest bits).

If you are tight on space, xoroshiro128** (XOR/rotate/shift/rotate) and xoroshiro128+ have the same speed and use half of the space; the same comments apply. They are suitable only for low-scale parallel applications; moreover, xoroshiro128+ exhibits a mild dependency in Hamming weights that generates a failure after 8 TB of output in our test. We believe this slight bias cannot affect any application.

Finally, if for any reason (which reason?) you need more state, we provide in the same vein xoshiro512** / xoshiro512+ and xoroshiro1024** / xoroshiro1024* (see the paper).

All generators, being based on linear recurrences, provide jump functions that make it possible to simulate any number of calls to the next-state function in constant time, once a suitable jump polynomial has been computed. We provide ready-made jump functions for a number of calls equal to the square root of the period, to make it easy to generate non-overlapping sequences for parallel computations.
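
For example, here is a minimal sketch of how the ready-made jump function could be used to hand a distinct subsequence to each of a number of parallel threads; next(), jump() and the global state s are assumed to come from the reference xoshiro256 code:

    #include <stdint.h>

    extern uint64_t s[4];       /* the 256-bit state (reference code) */
    extern uint64_t next(void);
    extern void jump(void);     /* equivalent to 2^128 calls to next() */

    /* Fill states[0..n-1] with starting points 2^128 steps apart, so
       that the subsequences used by the threads cannot overlap. */
    void make_streams(uint64_t states[][4], int n) {
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < 4; j++) states[i][j] = s[j];
            jump();
        }
    }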

We suggest using SplitMix64 to initialize the state of our generators starting from a 64-bit seed, as research has shown that initialization must be performed with a generator radically different in nature from the one initialized to avoid correlation on similar seeds.
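
For example, a minimal sketch of this seeding procedure for a 256-bit state (splitmix64() follows the public-domain SplitMix64 reference code):

    #include <stdint.h>

    /* SplitMix64: a fixed-increment version of Java 8's SplittableRandom. */
    static uint64_t splitmix64(uint64_t *state) {
        uint64_t z = (*state += 0x9e3779b97f4a7c15);
        z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9;
        z = (z ^ (z >> 27)) * 0x94d049bb133111eb;
        return z ^ (z >> 31);
    }

    /* Fill a 256-bit state (e.g., that of xoshiro256**) from a 64-bit seed. */
    void seed_state(uint64_t s[4], uint64_t seed) {
        for (int i = 0; i < 4; i++) s[i] = splitmix64(&seed);
    }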

32-bit Generators

xoshiro128** is our 32-bit all-purpose, rock-solid generator, whereas xoshiro128+ is its counterpart for floating-point generation. They are the 32-bit counterparts of xoshiro256** and xoshiro256+, so similar comments apply. Their state is too small for large-scale parallelism: their intended usage is inside embedded hardware or GPUs. For an even smaller scale, you can use xoroshiro64** and xoroshiro64*. We do not believe that at this point in time a 32-bit generator with a larger state can be of any use (but there are 32-bit xoroshiro generators of much larger size).

All 32-bit generators pass all tests we are aware of, with the exception of linearity tests (binary rank and linear complexity) for xoshiro128+ and xoroshiro64*: in this case, due to the smaller number of output bits, the low linear complexity of the lowest bits is sufficient to trigger BigCrush failures when the output is bit-reversed. Analogously to the 64-bit case, generating 32-bit floating-point numbers using the upper bits will not use any of the bits with low linear complexity.
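
For example, a 32-bit output can be converted to a float in [0..1) by keeping only the upper 24 bits (a float has 24 significant binary digits), in perfect analogy with the 53-bit double conversion discussed below; to_float is just a name chosen here for illustration:

    #include <stdint.h>

    /* Keep the 24 high bits and scale them to [0, 1). */
    static inline float to_float(uint32_t x) {
        return (x >> 8) * (1.f / (UINT32_C(1) << 24));
    }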

16-bit Generators

We do not suggest any particular 16-bit generator, but it is possible to design relatively good ones using our techniques. For example, Parallax has embedded in their Propeller 2 microcontroller multiple 16-bit xoroshiro32++ generators (for information about the ++ scrambler, see the paper).

A PRNG Shootout

We provide here a shootout of a few recent 64-bit PRNGs that are quite widely used. The purpose is that of providing a consistent, reproducible assessment of two properties of the generators: speed and quality. The code used to perform the tests and all the output from statistical test suites is available for download.

Speed

The speed reported in this page is the time required to emit 64 random bits, and the number of clock cycles required to generate a byte (thanks to the PAPI library). If a generator is 32-bit in nature, we glue two consecutive outputs. Note that we do not report results using GPUs or SSE instructions: for that to be meaningful, we should have implementations for all generators. Otherwise, with suitable hardware support we could just use AES in counter mode and get 64 secure bits in 1.12ns. The tests were performed on an Intel® Core™ i7-7700 CPU @ 3.60GHz (Kaby Lake).

A few caveats: to ease replicability, I distribute a harness performing the measurement; you just have to define a next() function and include the harness. The only realistic suggestion, however, is to try different generators in your application and see what happens.
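
For example, a next() function for xoshiro256+ (a sketch following the reference code on this site; seeding of the state s is assumed to happen elsewhere) is short enough to paste directly above the harness:

    #include <stdint.h>

    static inline uint64_t rotl(const uint64_t x, int k) {
        return (x << k) | (x >> (64 - k));
    }

    static uint64_t s[4]; /* the state, to be seeded (e.g., with SplitMix64) */

    uint64_t next(void) {
        const uint64_t result = s[0] + s[3]; /* the + scrambler */
        const uint64_t t = s[1] << 17;
        s[2] ^= s[0];
        s[3] ^= s[1];
        s[1] ^= s[2];
        s[0] ^= s[3];
        s[2] ^= t;
        s[3] = rotl(s[3], 45);
        return result;
    }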

Quality

This is probably the most elusive property of a PRNG. Here quality is measured using the powerful BigCrush suite of tests. BigCrush is part of TestU01, a monumental framework for testing PRNGs developed by Pierre L'Ecuyer and Richard Simard (“TestU01: A C library for empirical testing of random number generators”, ACM Trans. Math. Softw. 33(4), Article 22, 2007).

We run BigCrush starting from 100 equispaced points of the state space of the generator and collect failures, that is, tests in which the p-value statistic falls outside the interval [0.001..0.999]. A failure is systematic if it happens at all points.

Note that TestU01 is a 32-bit test suite; thus, two 32-bit integer values are passed to the test suite for each generated 64-bit value. Floating-point numbers are instead generated by dividing the unsigned output of the generator by \(2^{64}\). Since this implies a bias towards the high bits (which is anyway a known characteristic of TestU01), we also run the test suite on the bit-reversed generator. More details about the whole process can be found in this paper.
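
As an illustration, a minimal BigCrush driver along these lines might look as follows (a sketch: next() is the 64-bit generator under test, assumed to be defined and seeded elsewhere, and the TestU01 headers and library must be installed):

    #include <stdint.h>
    #include "unif01.h"
    #include "bbattery.h"

    extern uint64_t next(void);

    /* TestU01 consumes 32 bits at a time: feed the two halves of each
       64-bit output on alternate calls, high half first. */
    static unsigned int next32(void) {
        static uint64_t buffer;
        static int half = 0;
        if ((half = !half)) return (unsigned int)((buffer = next()) >> 32);
        return (unsigned int)buffer;
    }

    int main(void) {
        unif01_Gen *gen = unif01_CreateExternGenBits("xoshiro256**", next32);
        bbattery_BigCrush(gen);
        unif01_DeleteExternGenBits(gen);
        return 0;
    }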

Besides BigCrush, we analyzed our generators using a test for Hamming-weight dependencies described in our paper. As we already remarked, our only generator failing the test (but only after 8 TB of output) is xoroshiro128+.

We report the period of each generator and its footprint in bits: a generator gives “bang-for-the-buck” if the base-2 logarithm of the period is close to the footprint. Note that the footprint has always been padded to a multiple of 64 bits, and it can be significantly larger than expected because of padding and cyclic access indices.

PRNG                                 Footprint (bits)  Period        BigCrush Systematic Failures  ns/64 bits  cycles/B
xoroshiro128+                        128               2^128 − 1     —                             0.79        0.36
xoroshiro128**                       128               2^128 − 1     —                             0.95        0.43
xoshiro256+                          256               2^256 − 1     —                             0.79        0.36
xoshiro256**                         256               2^256 − 1     —                             0.90        0.41
xoshiro512+                          512               2^512 − 1     —                             0.96        0.43
xoshiro512**                         512               2^512 − 1     —                             1.07        0.48
xoroshiro1024*                       1088              2^1024 − 1    —                             1.20        0.54
xoroshiro1024**                      1088              2^1024 − 1    —                             1.27        0.57
SplitMix64                           64                2^64          —                             1.25        0.56
Lehmer128                            128               2^126         —                             1.30        0.60
PCG RXS M XS 64 (LCG)                64                2^64          —                             1.43        0.64
Ran                                  192               ≈2^191        —                             1.91        0.86
MT19937-64 (Mersenne Twister)        20032             2^19937 − 1   LinearComp                    2.55        1.26
SFMT19937 (uses SSE2 instructions)   20032             2^19937 − 1   LinearComp                    1.51        0.66
Tiny Mersenne Twister (64 bits)      128               2^127 − 1     —                             3.82        1.72
WELL1024a                            1088              2^1024 − 1    MatrixRank, LinearComp        8.96        4.09

The following table instead compares two ways of generating floating-point numbers: the 521-bit dSFMT, which directly generates floating-point numbers with 52 significant bits, and xoshiro256+ followed by a standard conversion of its upper bits to a floating-point number with 53 significant bits (see below).

PRNG                                                               Footprint (bits)  Period       BigCrush Systematic Failures  ns/double
xoshiro256+ (returns 53 significant bits)                          256               2^256 − 1    —                             1.15
dSFMT (uses SSE2 instructions, returns only 52 significant bits)   704               2^521 − 1    MatrixRank, LinearComp        0.95

xoshiro256+ is ≈20% slower than the dSFMT, but it has a doubled range of output values, does not need any extra SSE instructions (it can be programmed in Java, etc.), has a much smaller footprint, and does not fail any test.

Remarks

A long period does not imply high quality

This is a common misconception. The generator x++ has period \(2^k\), for any \(k\geq0\), provided that x is represented using \(k\) bits: nonetheless, it is a horrible generator. The generator returning \(k-1\) zeroes followed by a one has period \(k\).

It is however important that the period is long enough. A first heuristic rule of thumb is that if you need to use \(t\) values, you need a generator with period at least \(t^2\). Moreover, if you run \(n\) independent computations starting at random seeds, the sequences used by each computation should not overlap. We can stay on the safe side and require that the period is long enough so that the probability that \(n\) sequences of \(t^2\) elements starting at random positions overlap is very low.

Now, given a generator with period \(P\), the probability that \(n\) subsequences of length \(L\) starting at random points in the state space overlap is \[ 1 - \left( 1 - \frac{nL}{P-1}\right)^{n-1} \approx 1 - \left(e^{-Ln}\right)^{\frac{n-1}{P-1}} \approx \frac{Ln^2}P, \] assuming that \(P\) is large and \(nL/P\) is close to zero. If your generator has period \(2^{256}\) and you run on \(2^{64}\) cores (you will never have them) a computation using \(2^{64}\) pseudorandom numbers (you will never have the time) the probability of overlap would be less than \(2^{-64}\).
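
For instance, plugging \(P = 2^{256}\), \(n = 2^{64}\) and \(L = 2^{64}\) into the estimate above gives \[ \frac{Ln^2}{P} = \frac{2^{64}\cdot\left(2^{64}\right)^2}{2^{256}} = 2^{-64}. \]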

In other words: any generator with a period beyond \(2^{256}\) has a period that is sufficient for every imaginable application. Unless there are other motivations (e.g., provably increased quality), a generator with a larger period is only a waste of memory (as it needs a larger state), of cache lines, and of precious high-entropy random bits for seeding (unless you're using small seeds, but then it's not clear why you would want a very long period in the first place—the computation above is valid only if you seed all bits of the state with independent, uniformly distributed random bits).

In case the generator provides a jump function that lets you skip through chunks of the output in constant time, even a period of \(2^{128}\) can be sufficient, as it provides \(2^{64}\) non-overlapping sequences of length \(2^{64}\).

Equidistribution

Every 64-bit generator of ours with n bits of state scrambled with * or ** is n/64-dimensionally equidistributed: every n/64-tuple of consecutive 64-bit values appears exactly once in the output, except for the zero tuple (and this is the largest possible dimension). Generators based on the + scrambler are however only (n/64 − 1)-dimensionally equidistributed: every (n/64 − 1)-tuple of consecutive 64-bit values appears exactly \(2^{64}\) times in the output, except for a missing zero tuple. The same considerations apply to 32-bit generators.

Generating uniform doubles in the unit interval

A standard double (64-bit) floating-point number in IEEE floating point format has 52 bits of significand, plus an implicit bit at the left of the significand. Thus, the representation can actually store numbers with 53 significant binary digits.

Because of this fact, in C99 a 64-bit unsigned integer x should be converted to a 64-bit double using the expression

    #include <stdint.h>

    /* Keep the 53 high bits of x and scale them to [0, 1). */
    (x >> 11) * (1. / (UINT64_C(1) << 53))

In Java, the same result can be obtained with

    (x >>> 11) * 0x1.0p-53

This conversion guarantees that all dyadic rationals of the form \(k / 2^{53}\) will be equally likely. Note that this conversion prefers the high bits of x, but you can alternatively use the lowest bits.

An alternative, faster multiplication-free operation is

    #include <stdint.h>

    static inline double to_double(uint64_t x) {
        /* Build a double in [1, 2): exponent 0x3FF, significand = 52 high bits of x. */
        const union { uint64_t i; double d; } u = { .i = UINT64_C(0x3FF) << 52 | x >> 12 };
        return u.d - 1.0;
    }

The code above cooks up, by bit manipulation, a real number in the interval [1..2), and then subtracts one to obtain a real number in the interval [0..1). If x is chosen uniformly among 64-bit integers, d is chosen uniformly among dyadic rationals of the form \(k / 2^{52}\). This is the same technique used by generators providing doubles directly, such as the dSFMT.

This technique is supposed to be fast (it is not on my recent hardware, however), but you will be generating half the values you could actually generate. The same problem plagues the dSFMT. All doubles generated will have the lowest significand bit set to zero (I must thank Raimo Niskanen from the Erlang team for making me notice this—a previous version of this site did not mention this issue).

In Java you can obtain an analogous result using suitable static methods:

    Double.longBitsToDouble(0x3FFL << 52 | x >>> 12) - 1.0

To adhere to the principle of least surprise, my implementations now use the multiplicative version, everywhere.

Interestingly, these are not the only notions of “uniformity” you can come up with. Another possibility is that of generating a 1074-bit integer, normalizing it, and returning the nearest value representable as a 64-bit double (this is the theory; in practice, you will almost never use more than two integers per double, as the remaining bits would not be representable). This approach guarantees that every representable double could in principle be generated, albeit not every returned double will appear with the same probability. A reference implementation can be found here. Note that unless your generator has at least 1074 bits of state and suitable equidistribution properties, the code above will not do what you expect (e.g., it might never return zero).