jeanphi wrote:A thread pool API is mostly useful if we completely move to a data parallel architecture. This is the case for hybrid rendering but most of Lux is mostly independent parallel sequences. That's a huge change and I don't know what can come out of it. I think a data parallel architecture might be more difficult to understand and thus maintain.
I'm not planning on rewriting everything; I simply want to discuss it. I hit the limits of "independent parallel sequences" in SPPM because of the different passes and all the synchronization involved.
A data parallel architecture is usually simpler. Take a look at the new SPPM code: it is straightforward and looks like single-threaded code, except that the functions containing a for loop are called differently. But you have a point: I'm not sure how this approach can be applied to the current integrator code.
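To make the "looks like single-threaded code" point concrete, here is a minimal sketch. The `parallel_for` helper below is hypothetical: it is not the Lux API and not TBB's signature, just a stand-in built on `std::thread`, and `scale_serial` / `scale_parallel` are made-up example passes.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical helper (not the Lux or TBB API): run body(i) for every i in
// [begin, end), split into contiguous chunks across num_threads workers.
inline void parallel_for(std::size_t begin, std::size_t end,
                         std::size_t num_threads,
                         const std::function<void(std::size_t)> &body) {
    if (begin >= end) return;
    const std::size_t count = end - begin;
    num_threads = std::max<std::size_t>(1, std::min(num_threads, count));
    const std::size_t chunk = (count + num_threads - 1) / num_threads;

    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < num_threads; ++t) {
        const std::size_t lo = begin + t * chunk;
        const std::size_t hi = std::min(end, lo + chunk);
        if (lo >= hi) break;  // no work left for the remaining workers
        workers.emplace_back([lo, hi, &body] {
            for (std::size_t i = lo; i < hi; ++i) body(i);
        });
    }
    // The join is the only synchronization point, and it lives in the helper.
    for (std::thread &w : workers) w.join();
}

// Single-threaded version of a pass.
inline void scale_serial(std::vector<float> &v, float s) {
    for (std::size_t i = 0; i < v.size(); ++i) v[i] *= s;
}

// Data-parallel version: same loop body, the loop is just called differently.
inline void scale_parallel(std::vector<float> &v, float s, std::size_t threads) {
    parallel_for(0, v.size(), threads, [&](std::size_t i) { v[i] *= s; });
}
```

This sketch spawns threads on every call; a real pool would keep its workers alive between calls, but the calling code would look the same.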
Since this is a thread pool to parallelize for() loops, how many places are there that this can come into play?
Unusual integrators with multiple passes, such as SPPM, hybrid Bidir, or a future merge between Bidir and SPPM. The current model was a poor fit for SPPM because it leads to significant drops in CPU usage. Also, in those cases the code is simpler because the synchronization points appear only in the API and do not clutter the code (please count the barrier->wait() calls in the previous SPPM code to get an idea). Every one of them was important, and a missing one could have led to weird behavior.
Here we also have some really huge scenes composed of very big models that take a long time to load. Since the PLY files can be loaded in parallel, on the "war machine" we have here this could give our users feedback about their scene in 10 seconds instead of two minutes.
We also use a lot of photon maps internally to replace lights (I should commit that to the main LuxRender repository one day). Each photon map can contain millions of photons, and each photon map must be scaled, projected back, and inserted into a data structure.
Some scenes here take 10 minutes to load, whereas on our dual-Xeon machines with 16 threads they might take only 40 seconds if a parallel_for is used.
That is why I initially looked at TBB and similar libraries and decided to work on improving SPPM, but the main idea behind it is to accelerate the loading process.
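As a rough illustration of synchronization living in the API rather than in the pass code, here is a sketch with two dependent passes. `run_pass` and `two_passes` are hypothetical names invented for this example and are not taken from the SPPM code; the point is only that the second pass can safely read what the first wrote without any explicit barrier->wait() in the loop bodies.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: runs pass(worker_id, worker_count) on each worker
// and returns only once every worker has finished.
template <typename Pass>
void run_pass(std::size_t workers, Pass pass) {
    std::vector<std::thread> pool;
    for (std::size_t w = 0; w < workers; ++w)
        pool.emplace_back([=] { pass(w, workers); });
    for (std::thread &t : pool) t.join();  // implicit barrier between passes
}

// Two dependent passes: pass 2 reads neighbours written during pass 1,
// so pass 1 must be complete before pass 2 starts.
inline std::vector<int> two_passes(std::size_t n, std::size_t workers) {
    std::vector<int> a(n), b(n);
    run_pass(workers, [&a, n](std::size_t w, std::size_t count) {
        for (std::size_t i = w; i < n; i += count)
            a[i] = static_cast<int>(i) * 2;
    });
    run_pass(workers, [&a, &b, n](std::size_t w, std::size_t count) {
        for (std::size_t i = w; i < n; i += count)
            b[i] = a[i] + a[(i + 1) % n];
    });
    return b;
}
```

Forgetting a barrier is not possible here because the pass bodies never see one; the cost is that every pass ends with a full join.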
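For the loading case, something along these lines is what I have in mind. This is a sketch under assumptions: `Mesh`, `load_mesh` and `load_all` are hypothetical names, and `load_mesh` is a placeholder that does not actually parse a PLY file. Note that going from 10 minutes to 40 seconds is a 15x speedup on 16 threads, so that estimate assumes the files load almost completely independently.

```cpp
#include <cstddef>
#include <future>
#include <string>
#include <vector>

// Hypothetical result type for one loaded model.
struct Mesh {
    std::string source;
    std::size_t vertex_count;
};

// Placeholder standing in for a real PLY parser (assumption: loading one
// file does not depend on any other file).
inline Mesh load_mesh(const std::string &path) {
    return Mesh{path, path.size()};
}

// Start one asynchronous task per file, then collect results in input order.
inline std::vector<Mesh> load_all(const std::vector<std::string> &paths) {
    std::vector<std::future<Mesh>> pending;
    pending.reserve(paths.size());
    for (const std::string &p : paths)
        pending.push_back(std::async(std::launch::async, load_mesh, p));

    std::vector<Mesh> meshes;
    meshes.reserve(paths.size());
    for (std::future<Mesh> &f : pending) meshes.push_back(f.get());
    return meshes;
}
```

One task per file is the simplest form; with thousands of files a parallel_for over the file list with a bounded number of workers would be the better fit, which is the reason for wanting it in the API.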
Lux is already creating multiple render threads, so you wouldn't necessarily want to use additional sub-threads within the existing render threads, right? So it strikes me that its benefit will mostly be limited to startup, which isn't really a big win, unless you want to do a major rewrite of Lux's threading model and get rid of the current render thread paradigm. Granted, it can't hurt to add to startup. But I'm curious what other places would benefit from being re-written to use this.
My point was not to rewrite Lux NOW, but to ask a few questions about what's important for Lux. To be honest, if the answer to this discussion is "We don't care at all about changing the number of threads", I will just change my SPPM code to use TBB or GCD instead. It will go quicker, but it will be less simple to use.
My point was also to discuss which way to go for the future of the parallel API. Developers may be more interested in using a parallel_for in their code during warmup if one is available in the Lux API.
[quote]
I will also add that I regularly make use of reducing thread counts in LuxRender. E.g., when I'm testing scene setups, I sometimes have two or three instances going each working on its own test, and reduce the thread counts on each so they don't thrash the cores. They are all of equal priority (which I always set to "low" from an OS perspective so I can keep working in other apps), so just changing their process priorities doesn't work in this case.
[/quote]
As far as I know, the main purpose of those APIs (such as TBB and GCD) is to be smart about thread usage. I saw that TBB does not launch new threads if I use a parallel_for inside a thread, and I have read in the documentation that TBB scales itself to the system load (though I have not read how).