## Memory usage and performance of Hybrid Path Renderer

I have spent some time working on hybrid rendering (with the path surface integrator) to optimize both memory usage and performance. This is also preparatory work for hybrid rendering with the BiDir surface integrator.

LuxRender v0.8

LuxRender v0.8 introduced both the new concept of Renderer and the first implementation of a hybrid (i.e. CPU+GPU) Renderer. LuxRender v0.8 uses a RayBuffer of 8192 rays for each rendering thread, plus 8192 SurfaceIntegratorState objects to fill the RayBuffer with rays to send to the GPU. For technical reasons, the memory required to store 8192 SurfaceIntegratorState for each rendering thread is really huge.

Reducing the size of SurfaceIntegratorState class

The first and most obvious route I took was trying to reduce the memory footprint of the SurfaceIntegratorState class. I reduced (with some effort) the size from 616 to 552 bytes. I soon realized this was a pretty pointless route: there is a vast ramification of dependencies starting from SurfaceIntegratorState and reaching Sample, BSDF, Sampler, SamplerData and more. To get a noticeable improvement, I would have to reduce the size of tons of classes.

Reducing the number of SurfaceIntegratorState used

So I took another route, something I was already using in SLG: I start with just a very small number of SurfaceIntegratorState (I start with 2 in SLG) and I increase the number only if they are unable to fill the RayBuffer. This can work because a SurfaceIntegratorState can produce multiple rays (i.e. one shadow ray and a ray to estimate the next path vertex), so you don't really need 8192 SurfaceIntegratorState to produce 8192 rays.
This was a successful route: the code raises the number of SurfaceIntegratorState up to about 4500 and then stops, because they are enough to fill the RayBuffer all the time. This effectively cut the amount of RAM used roughly in half.
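The growth policy described above can be sketched roughly as follows. This is a toy model, not LuxRender's code: `grow_state_pool`, `avg_rays_per_state` and `step` are hypothetical names, and the idea that every state yields a fixed average number of rays per pass is a simplifying assumption.

```python
def grow_state_pool(ray_buffer_size, avg_rays_per_state, initial_states=2, step=128):
    """Grow the pool until one pass over all states can fill the RayBuffer.

    Toy model: every state is assumed to produce `avg_rays_per_state` rays
    per pass (for example a shadow ray plus a ray towards the next vertex).
    """
    if avg_rays_per_state <= 0 or step <= 0:
        raise ValueError("avg_rays_per_state and step must be positive")
    states = initial_states
    while states * avg_rays_per_state < ray_buffer_size:
        states += step  # not enough rays yet: allocate more states
    return states


if __name__ == "__main__":
    # With about 1.8 rays per state, far fewer than 8192 states are needed.
    print(grow_state_pool(8192, 1.8))
```

The point of the sketch is only that the pool stops growing as soon as the buffer can be filled, so the final size depends on how many rays each state produces rather than on the RayBuffer size alone.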

If you run LuxRender with the "-V" option, it prints some DEBUG messages about the dynamic resizing of the set of SurfaceIntegratorState:

[Lux 2011-Jun-17 22:40:50 DEBUG : 0] New allocated IntegratorStates: 1024 => 1152 [RayBuffer size = 16384]
[Lux 2011-Jun-17 22:40:50 DEBUG : 0] New allocated IntegratorStates: 1024 => 2176 [RayBuffer size = 16384]

Further reducing the number of SurfaceIntegratorState used

Given the success of this route, I have further improved the code by adding support for tracing multiple shadow rays and for the ALL_UNIFORM (and AUTO) light strategies. For instance, by using the ALL_UNIFORM light strategy and just 4 shadow rays (instead of the default 1), I can further reduce the number of SurfaceIntegratorState required, from ~4500 to ~1150 (however, please note that the memory used by a single SurfaceIntegratorState is slightly increased in order to store the multiple shadow rays).
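A quick way to read these figures is to work out how many rays each state contributes on average. The snippet below is plain arithmetic on the numbers quoted above, assuming both state counts refer to the same 8192-ray RayBuffer; `implied_rays_per_state` is a hypothetical helper name.

```python
def implied_rays_per_state(ray_buffer_size, states_needed):
    """Average number of rays each state must supply to fill the buffer."""
    return ray_buffer_size / states_needed


# Approximate state counts quoted in the post (assumed 8192-ray RayBuffer).
one_shadow_ray = implied_rays_per_state(8192, 4500)
four_shadow_rays = implied_rays_per_state(8192, 1150)

print(f"{one_shadow_ray:.2f} rays/state with 1 shadow ray")
print(f"{four_shadow_rays:.2f} rays/state with ALL_UNIFORM and 4 shadow rays")
```

Under that assumption, each state supplies a bit under 2 rays in the first case and about 7 in the second, which is why so many fewer states are needed.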

Using the saved memory to improve performance

Once I had saved a noticeable amount of RAM, I started adding options that improve performance by using more memory. The first one is "integer raybuffersize", which sets the size of the RayBuffer to a number larger than 8192. This helps to reduce the per-buffer overhead of transfers over the PCIe bus, etc. I use 64k rays in SLG but this number doesn't fit the LuxRender architecture; 8k/16k work pretty well on my hardware but your mileage may vary.
The second step was to add the "integer statebuffercount" option: it allows the use of multiple sets of SurfaceIntegratorState to overlap the CPU and GPU work. This option is quite useful: a value of 2 helps to improve performance.
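To see why a second set of states helps, here is an idealized timing model of the overlap. It is only a sketch under stated assumptions: `render_time`, `cpu_time` and `gpu_time` are hypothetical names, each pass is assumed to cost a fixed CPU time (generating rays) and a fixed GPU time (tracing them), and transfer costs are ignored.

```python
def render_time(passes, cpu_time, gpu_time, state_buffer_count=1):
    """Idealized wall-clock time for `passes` CPU-generate / GPU-trace rounds.

    With one set of states the CPU and GPU strictly alternate. With two or
    more sets the CPU can prepare the next buffer while the GPU traces the
    current one, so after the first fill the slower stage sets the pace.
    """
    if passes <= 0:
        return 0.0
    if state_buffer_count <= 1:
        return passes * (cpu_time + gpu_time)
    return cpu_time + (passes - 1) * max(cpu_time, gpu_time) + gpu_time


if __name__ == "__main__":
    print(render_time(10, 1.0, 1.0, state_buffer_count=1))
    print(render_time(10, 1.0, 1.0, state_buffer_count=2))
```

In this model a value above 2 brings no further gain, which matches the suggestion of using 2; real behaviour will of course depend on the hardware and on how evenly the two stages are balanced.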

Let's see the combined result of all the changes. I used a matte LuxBall5 with the following options as a benchmark:

Renderer "hybrid"
   "integer opencl.platform.index" [-1]
   "string opencl.devices.select" ["100"]
   "integer statebuffercount" [2]
   "integer raybuffersize" [16000]
SurfaceIntegrator "path"
   "integer maxdepth" [8]
   "string lightstrategy" ["all"]
   "integer shadowraycount" [4]

This is a rendering with standard path tracing (28.29k samples/sec 1.74M contribution/sec):

And this with the new hybrid code (43.62k samples/sec 2.84M contribution/sec):

Hybrid rendering is about 60% faster (roughly 54% more samples/sec and 63% more contributions/sec).

Multi-GPUs support

I have also added support for multiple GPUs and options to select which GPUs to use. However, this is not yet particularly useful because LuxRender has some trouble feeding more than one GPU.

Renderer "hybrid"
   "integer opencl.platform.index" [-1]
   "string opencl.devices.select" ["100"]
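The post does not spell out how the "opencl.devices.select" string is interpreted. Assuming it is read as one 0/1 flag per OpenCL device, in enumeration order (an assumption on my part, not something stated here), a value such as "100" would pick only the first of three devices. The helper below, `selected_devices`, is a hypothetical illustration of that reading and not LuxRender code.

```python
def selected_devices(select):
    """Return the indices of the devices flagged with '1'.

    Assumption: the string holds one character per device, '1' meaning
    "use this device" and '0' meaning "skip it".
    """
    if any(ch not in "01" for ch in select):
        raise ValueError("selection string must contain only '0' and '1'")
    return [index for index, ch in enumerate(select) if ch == "1"]


if __name__ == "__main__":
    print(selected_devices("100"))  # first device only, under this assumption
    print(selected_devices("110"))  # first and second device
```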

Hybrid BiDir

The good news is that hybrid BiDir is going to be even better than hybrid Path tracing in terms of the number of rays generated by each single state (i.e. not only multiple shadow rays but also all the rays to connect eye and light paths).

### Re: Memory usage and performance of Hybrid Path Renderer

Thanks for the overview Dade, fantastic work!
### Re: Memory usage and performance of Hybrid Path Renderer

I can has hybrid bidir?
### Re: Memory usage and performance of Hybrid Path Renderer

U kun haz. Dade mentioned that a few days ago.
### Re: Memory usage and performance of Hybrid Path Renderer

SATtva wrote:U kun haz. Dade mentioned that a few days ago.

And Dade quietly slips it in the code. http://src.luxrender.net/lux/rev/99e2e0801bff
### Re: Memory usage and performance of Hybrid Path Renderer

binarycortex wrote:
SATtva wrote:U kun haz. Dade mentioned that a few days ago.

And Dade quietly slips it in the code. http://src.luxrender.net/lux/rev/99e2e0801bff

It is just the start; writing everything is going to take some time.

The idea is however quite simple: trace the light and eye paths on the CPU (!), then generate in a single shot all the shadow rays for direct lighting and the eye path/light path connections. The generated (long) list of rays will then be traced on the GPUs.

For instance, with a light/eye path of length = 6, I trace 6 + 6 = 12 rays on the CPU and 6 + 6 + 6 * 6 = 48 rays on the GPU (assuming all path vertices are on a diffuse surface, etc.), with all rays generated in a single shot (this is particularly important for memory usage and performance). So I can fill a RayBuffer of 8192 rays with only ~170 SurfaceIntegratorState. The hybrid path tracer needs about 4500 states to fill the same RayBuffer (if using only 1 shadow ray).
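The arithmetic above generalizes to any path length. The sketch below just re-evaluates the same expressions; the function names are hypothetical, and the GPU count is the best case from the post where every vertex sits on a diffuse surface.

```python
import math


def bidir_ray_counts(path_length):
    """Rays per state for eye and light paths of `path_length` vertices each."""
    n = path_length
    cpu_rays = n + n          # eye path + light path, traced on the CPU
    gpu_rays = n + n + n * n  # same expression as in the post: 6 + 6 + 6 * 6
    return cpu_rays, gpu_rays


def states_to_fill_buffer(ray_buffer_size, path_length):
    """States needed so that their GPU rays fill one RayBuffer."""
    _, gpu_rays = bidir_ray_counts(path_length)
    return math.ceil(ray_buffer_size / gpu_rays)


if __name__ == "__main__":
    print(bidir_ray_counts(6))             # (12, 48)
    print(states_to_fill_buffer(8192, 6))  # 171, i.e. the "~170" quoted above
```

Because the connection term grows with the square of the path length, longer paths need even fewer states per RayBuffer in this best case.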

This is somewhat similar to the CBPT idea; however, I avoid having to evaluate BSDFs on the GPU (not practical with LuxRender and a price I don't really want to pay for hybrid rendering).

### Re: Memory usage and performance of Hybrid Path Renderer

Wow, good stuff Dade. Does it work yet, in part at least?
### Re: Memory usage and performance of Hybrid Path Renderer

So..not only does it add bidirectional path tracing support, it's also more memory efficient?

Also, does it become faster/more efficient with increasing path depth?
-Jason

J the Ninja

### Re: Memory usage and performance of Hybrid Path Renderer

binarycortex wrote:Wow, good stuff Dade. Does it work yet, in part at least?

I just had a quick glance at the code, and I now realize that not only is it incomplete, but the connecting pieces are missing.
