Jens just tested my current patch on his 6-core Xeon with hyperthreading enabled (so 12 threads), using the simple scene from the initial post.
hybrid-bidir reference: 1min, 12T, 46.58s/p, 359kS/s, 266%, 105MC/s, GPU Load 7%
hybrid-bidir my patch: 1min, 12T, 77.08s/p, 655.03kS/s, 266%, 174MC/s, GPU Load 13%
plain bidir reference: 1min, 12T, 63.50s/p, 548.63kS/s, 294%, 161MC/s no-hyb
plain bidir my patch: 1min, 12T, 76.55s/p, 661.35kS/s, 294%, 194MC/s no-hyb
So that is roughly a 65% increase in the hybrid case (going by s/p and MC/s) and a fair 20% increase for the plain sampler renderer.
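For anyone who wants to check the percentages, here is a minimal sketch that recomputes them from the s/p and MC/s columns above. The helper name `percent_increase` and the `runs` layout are only for illustration and are not part of the patch or the renderer.

```python
def percent_increase(reference, patched):
    """Relative gain of the patched value over the reference, in percent."""
    return (patched / reference - 1.0) * 100.0

# (reference, patched) pairs copied from the 1-minute, 12-thread runs above.
runs = {
    "hybrid-bidir": {"s/p": (46.58, 77.08), "MC/s": (105, 174)},
    "plain bidir": {"s/p": (63.50, 76.55), "MC/s": (161, 194)},
}

for name, metrics in runs.items():
    for metric, (reference, patched) in metrics.items():
        gain = percent_increase(reference, patched)
        print(f"{name:13s} {metric:5s} {gain:+.1f}%")
```

Both columns agree within each mode, which is where the 65% and 20% figures come from.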
