Chiaroscuro wrote:you weren't kidding about the Windows performance hit.

(or could it just be Windows' timer?)
Linux 64bit is about 2 times faster than Windows 7 64bit on my hardware. I'm not sure why; I think it is a combination of the GCC optimizer being a lot better than the Visual C++ optimizer and the Windows scheduler being pure crap. Check my last screenshot: my 5870 is about as fast as your 2x4890 and our CPUs are equivalent in speed too, yet I can throw 8 threads at it without losing too much GPU workload while you can just barely use 2 (!).
Did I say the Windows scheduler is crap?
You could do a test with a Linux live CD to check how much you could gain with Linux on your hardware.
Chiaroscuro wrote:EDIT: I've been doing more tests, and it doesn't appear to be your stats, it appears to be a catch-22 situation with trying to utilize all of the processors in my PC. You can see a little bit of the pattern in the attachment, one single GPU performs well... bring in two and there's some loss; then start bringing in CPU threads, one at a time, and for every gain in performance from a native thread I get some performance loss from the dual GPUs. It's like they trade off each other instead of adding up together. And the more native threads, the worse the impact on the dual GPUs, to the point that it performs nearly as well with a single GPU with the max native threads. So I reach a practical plateau long before reaching the theoretical one. Sucks.

Up to now the contribution of the CPU was quite negligible in terms of samples/sec, so it wasn't really a problem to dedicate 1 core to each GPU. However, now that native threads are so fast, it is becoming a suboptimal solution. So I'm exploring various solutions to keep the GPU busy at less cost for the CPU:
1) generate more rays for each step: for instance, trace more shadow rays instead of only one; this would greatly increase the GPU load (4 triangle light sources in the luxball scene = 3 more rays traced per step = 5 rays traced instead of 2 = 2.5 times the load on the GPU). It would produce fewer samples/sec but with a lot less noise;
2) move the image pipeline to the GPU (in order to have less load on the CPU, add true image filtering, etc.);
3) move ray setup and result collection to the GPU too.
I'm doing #1 now; #2 is very interesting and I'll be looking into it soon; #3 could be ridiculously fast but is not applicable to LuxrenderGPU, so I'm not planning to move in that direction.
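To make the arithmetic in #1 explicit, here is a tiny sketch in Python. It assumes each step traces one path ray plus one shadow ray per sampled light source; the function name rays_per_step is made up for the illustration and is not taken from the renderer's code.

```python
def rays_per_step(shadow_rays: int) -> int:
    """Rays traced per step: one path ray plus the given number of shadow rays.

    Hypothetical helper, only meant to illustrate the ray count.
    """
    return 1 + shadow_rays


# Current behaviour: a single shadow ray per step.
baseline = rays_per_step(1)

# Option #1: one shadow ray for each of the 4 triangle light sources in luxball.
all_lights = rays_per_step(4)

# Relative GPU load per step compared to the current behaviour.
load_factor = all_lights / baseline

print(baseline, all_lights, load_factor)  # prints: 2 5 2.5
```

So each step keeps the GPU busy about 2.5 times longer for the same amount of CPU-side ray setup, which is the whole point of the change.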
P.S. The difference you are seeing in the Sponza scene is exactly the path depth fix I was talking about. v1.2 was doing the first 32 passes with a cut path depth, resulting in a direct-lighting-only rendering (this is useful in "low latency" mode to increase responsiveness, but totally useless in "high bandwidth" mode, where it was artificially boosting the samples/sec).
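Roughly, the idea can be sketched like this in Python. This is only an illustration and not the actual code: the function name, the low_latency flag and the preview depth of 1 are assumptions; only the 32 passes figure comes from what I described above.

```python
def max_path_depth(pass_index: int, full_depth: int, low_latency: bool,
                   preview_passes: int = 32, preview_depth: int = 1) -> int:
    """Pick the path depth limit for a given rendering pass.

    Hypothetical sketch: preview_depth=1 stands in for "direct lighting only",
    and the early cut is applied only when low_latency is set.
    """
    if low_latency and pass_index < preview_passes:
        # Early passes: cut the paths short for a faster first image.
        return min(preview_depth, full_depth)
    # High bandwidth mode, or later passes: always use the full depth.
    return full_depth


print(max_path_depth(0, 8, low_latency=True))    # prints: 1
print(max_path_depth(0, 8, low_latency=False))   # prints: 8
print(max_path_depth(32, 8, low_latency=True))   # prints: 8
```

With the cut applied unconditionally, the first 32 passes trace much shorter paths, so the samples/sec figure looks better than it really is for a full-depth render.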