I was looking into this problem (viewtopic.php?f=11&t=11559) and ended up writing a new implementation of the Sobol sampler in OpenCL.

The common way to parallelize a sampler is simply to give each thread its own local sampler. This isn't a big issue on CPUs, where you have 6-12 samplers instead of 1. However, it is a problem on modern GPUs, where you can easily have half a million threads.

The problem is the speed at which the sample space is explored: with 10 threads it is 10 times slower (i.e. not a big deal), but with 500,000 threads it is 500,000 times slower! I have developed a new implementation of the Sobol sampler that explores the sample space as if there were only a single sampler, making the exploration thousands of times faster on GPUs.
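A toy sketch of the idea in Python, not the actual OpenCL code. It assumes one possible scheme: every thread derives its sample from a single global index (pass_index * thread_count + thread_id) into one shared sequence. The function names are hypothetical, and the base-2 radical inverse (which matches the first Sobol dimension) stands in for the full Sobol sequence. With this indexing, a single pass over all threads already places one sample in every stratum of [0, 1).

```python
def radical_inverse_base2(index: int) -> float:
    """Van der Corput radical inverse in base 2 (stand-in for Sobol dimension 0)."""
    result = 0.0
    scale = 0.5
    while index > 0:
        if index & 1:
            result += scale
        index >>= 1
        scale *= 0.5
    return result


def shared_sequence_sample(thread_id: int, pass_index: int, thread_count: int) -> float:
    """Hypothetical scheme: all threads draw from ONE global sequence.

    Thread t consumes global indices t, t + N, t + 2N, ... so the threads
    together walk the sequence exactly like a single sampler would.
    """
    global_index = pass_index * thread_count + thread_id
    return radical_inverse_base2(global_index)


def occupied_strata(samples, strata_count: int) -> int:
    """Count how many of the equal-width strata of [0, 1) contain a sample."""
    return len({int(s * strata_count) for s in samples})


thread_count = 1024  # assumed power of two, to keep the toy example exact

# One sample per thread, first and second pass.
first_pass = [shared_sequence_sample(t, 0, thread_count) for t in range(thread_count)]
second_pass = [shared_sequence_sample(t, 1, thread_count) for t in range(thread_count)]

first_pass_coverage = occupied_strata(first_pass, thread_count)
second_pass_coverage = occupied_strata(second_pass, thread_count)
print(first_pass_coverage, second_pass_coverage)
```

In a per-thread setup, each thread would instead start its own sequence from index 0, so after one pass every sequence has been explored by only one sample.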

The result is a clearly noticeable reduction in rendering noise. This is a rendering with the old Sobol sampler (64 samples per pixel):

and this is with the new Sobol sampler (64 samples per pixel):

P.S. The new implementation also addresses the original issue linked at the beginning.

P.P.S. If only I could find a way to obtain the same result with the Metropolis sampler...