Optimizing SIMD/GPU-Friendly Random Number Generators

April 22, 2025
SIMD, GPU, Random Number Generator, Vectorization, Parallelization, Algorithm Optimization
This article explores strategies for optimizing random number generators (RNGs) for SIMD and GPU environments, focusing on algorithm selection, vectorization, parallelization, and balancing quality with speed.

Optimizing a SIMD/GPU-friendly random number generator (RNG) involves leveraging parallel processing capabilities to generate multiple random numbers simultaneously. Here are key considerations and strategies for optimization:

1. Algorithm Selection

Choose an RNG algorithm that is both fast and suitable for parallelization. Common choices include:

  • Linear Congruential Generators (LCG): Simple, fast, and easy to vectorize, but statistically weak; in particular, with a power-of-two modulus the low-order bits have short periods.
  • Park-Miller-Carta: A Lehmer (multiplicative) LCG with multiplier 48271 and prime modulus 2^31 - 1, implemented with Carta's division-free reduction; fast and simple, and often used in real-time applications.
  • Mersenne Twister: Offers high statistical quality but is slower, and its large state (624 32-bit words) makes it a poor fit for per-thread SIMD/GPU use.
  • Xoshiro/Xoroshiro: Modern generators designed for both speed and quality, with small state; well suited to SIMD/GPU environments.
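As a concrete reference point for the Xoshiro family, here is one scalar step of xoshiro128++, following Blackman and Vigna's published algorithm. The struct name is illustrative; the only requirement is that the state is seeded to anything other than all zeros.

```cpp
#include <cstdint>

// One stream of xoshiro128++ (Blackman & Vigna): 128-bit state, 32-bit output.
// Seed s to any value except all zeros.
struct Xoshiro128pp {
  uint32_t s[4];

  static uint32_t rotl(uint32_t x, int k) { return (x << k) | (x >> (32 - k)); }

  uint32_t next() {
    const uint32_t result = rotl(s[0] + s[3], 7) + s[0];  // output scrambler
    const uint32_t t = s[1] << 9;
    s[2] ^= s[0];                                         // linear state transition
    s[3] ^= s[1];
    s[1] ^= s[2];
    s[0] ^= s[3];
    s[2] ^= t;
    s[3] = rotl(s[3], 11);
    return result;
  }
};
```

Because each step is just shifts, rotates, and XORs on four words of state, the same code maps directly onto one SIMD lane or one GPU thread per stream.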

2. Vectorization

Use SIMD (Single Instruction, Multiple Data) instructions to generate multiple random numbers in parallel. For example:

  • AVX/SSE Intrinsics: Utilize AVX2 (256-bit) or SSE (128-bit) intrinsics to process eight or four 32-bit values at a time, respectively.
  • SIMD Wrappers: Use libraries or custom wrappers (e.g., JUCE's SIMDRegister) to simplify SIMD operations.
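To make the lane-width difference concrete, the sketch below adds four 32-bit integers with one SSE2 instruction and eight with one AVX2 instruction. It assumes an x86-64 CPU; the function names add4/add8 are illustrative, and the target attribute lets the AVX2 path compile without a global -mavx2 flag.

```cpp
#include <immintrin.h>
#include <cstdint>

// Four 32-bit additions in one SSE2 instruction (baseline on x86-64).
void add4(const uint32_t* a, const uint32_t* b, uint32_t* out) {
  __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
  __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
  _mm_storeu_si128(reinterpret_cast<__m128i*>(out), _mm_add_epi32(va, vb));
}

// Eight 32-bit additions in one AVX2 instruction.
__attribute__((target("avx2")))
void add8(const uint32_t* a, const uint32_t* b, uint32_t* out) {
  __m256i va = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a));
  __m256i vb = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(b));
  _mm256_storeu_si256(reinterpret_cast<__m256i*>(out), _mm256_add_epi32(va, vb));
}
```

An RNG step is vectorized the same way: the scalar update is rewritten lane-wise, so one instruction advances four or eight independent streams.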

3. Parallelization on GPU

For GPU optimization, consider the following:

  • Thread Independence: Ensure each thread generates its own sequence of random numbers without dependencies on other threads.
  • Memory Coalescing: Optimize memory access patterns to minimize latency and maximize throughput.
  • Kernel Design: Design GPU kernels to efficiently generate and store random numbers, possibly using shared memory for intermediate results.
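The thread-independence pattern above can be sketched in portable C++, with std::thread standing in for GPU threads: each thread derives its own seed from its index, owns its state privately, and writes only its own slice of the output, so no synchronization is needed. The seed constant and function names are illustrative; on CUDA, libraries such as cuRAND provide the same per-thread-state pattern, with each kernel thread seeding from its global thread index.

```cpp
#include <cstdint>
#include <thread>
#include <vector>

// Simple 32-bit LCG step; each "thread" owns an independent state.
static uint32_t lcgNext(uint32_t& state) {
  state = state * 1664525u + 1013904223u;  // Numerical Recipes constants
  return state;
}

// Each thread seeds from its index and fills its own slice of the output:
// no shared RNG state, no locks.
void fillRandom(std::vector<uint32_t>& out, int numThreads, int perThread) {
  out.resize(static_cast<size_t>(numThreads) * perThread);
  std::vector<std::thread> threads;
  for (int t = 0; t < numThreads; ++t) {
    threads.emplace_back([&out, t, perThread] {
      uint32_t state = 0x9e3779b9u * static_cast<uint32_t>(t + 1);  // per-thread seed
      for (int i = 0; i < perThread; ++i)
        out[static_cast<size_t>(t) * perThread + i] = lcgNext(state);
    });
  }
  for (auto& th : threads) th.join();
}
```

Note that seeding streams from adjacent thread indices is only a sketch; production code should use sequence splitting, jump-ahead, or a counter-based generator to guarantee non-overlapping streams.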

4. Quality vs. Speed Trade-off

Balance the need for speed with the quality of random numbers. For applications requiring high statistical quality, consider algorithms like Mersenne Twister or Xoshiro. For real-time applications where speed is critical, simpler algorithms like LCG or Park-Miller may be more appropriate.

5. Implementation Example

Here’s a simplified example of vectorizing the Park-Miller-Carta generator using AVX2 intrinsics:

#include <immintrin.h>

// Eight 32-bit lanes in one AVX2 register.
struct Int8v {
  __m256i v;
  Int8v(int a) : v{ _mm256_set1_epi32(a) } {}
  Int8v(__m256i a) : v{ a } {}
  Int8v operator*(Int8v b) { return _mm256_mullo_epi32(v, b.v); } // low 32 bits of each product
  Int8v operator+(Int8v b) { return _mm256_add_epi32(v, b.v); }
  Int8v operator&(Int8v b) { return _mm256_and_si256(v, b.v); }
  Int8v operator>>(int n) { return _mm256_srli_epi32(v, n); }     // logical right shift
  Int8v operator<<(int n) { return _mm256_slli_epi32(v, n); }
};

// Advance eight Park-Miller states at once: state * 48271 mod (2^31 - 1),
// using Carta's division-free reduction (2^31 is congruent to 1 mod 2^31 - 1).
Int8v nextRandom(Int8v state) {
  const Int8v A{ 48271 };
  auto low = (state & 0x7fff) * A;   // A * low 15 bits of state
  auto high = (state >> 15) * A;     // A * high 16 bits of state
  state = low + ((high & 0xffff) << 15) + (high >> 16);
  return (state & 0x7fffffff) + (state >> 31);
}

This example advances eight independent Park-Miller streams in lockstep, one per 32-bit lane of a 256-bit AVX2 register.
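The reduction in the vector code relies on the identity that 2^31 is congruent to 1 modulo 2^31 - 1, which lets the modulo be replaced by shifts, masks, and adds. A scalar version of the same step, checked against a straightforward 64-bit formulation, is sketched below; the function names are illustrative.

```cpp
#include <cstdint>

// Straightforward Park-Miller step using a 64-bit multiply and modulo.
uint32_t parkMiller64(uint32_t state) {
  return static_cast<uint32_t>(static_cast<uint64_t>(state) * 48271u % 0x7fffffffu);
}

// Carta-style step: the same result using only 32-bit multiplies, shifts, adds.
uint32_t parkMillerCarta(uint32_t state) {
  uint32_t low  = (state & 0x7fff) * 48271u;                  // A * low 15 bits
  uint32_t high = (state >> 15) * 48271u;                     // A * high 16 bits
  uint32_t x = low + ((high & 0xffff) << 15) + (high >> 16);  // fold 2^31 -> 1
  return (x & 0x7fffffff) + (x >> 31);                        // final fold
}
```

Because every operation here is a 32-bit multiply, shift, mask, or add, the scalar step translates one-for-one into the SIMD operators defined above, and equally well into per-thread GPU code.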

6. Further Reading

For more detailed discussions and implementations, refer to the following resources:

Sources

  • Building a Fast, SIMD/GPU-friendly Random Number Generator For ... : an article on designing an algorithm that balances simplicity, speed, statistical quality, and ease of use for SIMD/GPU programming.
  • Parallel Random Number Generation for SIMD/SIMT (PDF, CERN Indico): a report presenting design considerations, implementation details, and preliminary computing performance of parallel PRNG engines that support SIMD.