LuxrenderGPU

Re: LuxrenderGPU

Postby KyungSoo » Thu Feb 11, 2010 4:25 am

Dade wrote:
KyungSoo wrote:
Chiaroscuro wrote:That's interesting... one and two GPUs can be mostly utilized (although it still only says 85%?), but then at three the efficiency really starts to drop (individually; together they plateau) from there on (I wonder if the bus also gets busy?). Can you imagine if all 8 could be mostly utilized, you'd be getting over 12 million samples per second on that scene. Seems like it's going to be a challenge.


I think it is possible if the LuxRay developers move some CPU tasks to the GPU, something like rayBuffer feeding.
Removing that CPU dependency would also let the remaining CPU tasks run more efficiently.


While this is doable (I have done it in SmallptGPU; by the way, you could give SmallptGPU2 a spin for a GPU-only vs. CPU+GPU comparison), it is not applicable to Luxrender, so it is a bit out of my scope. I cannot port all the Luxrender code to the GPU. Not only would it require an insane amount of time, it would not even work (too many material/texture/light source types, etc.). I'm looking for a solution where I can port 1% of the code (i.e. only the ray intersection code) and still keep 99% of the classic Luxrender features.

Please note that your system is highly asymmetrical (1 CPU + 8 GPUs), so it isn't really a surprise that a CPU+GPU architecture doesn't scale well there. Most users have just 1 CPU + 1 GPU or 1 CPU + 2 GPUs. Your system configured with 2 CPUs would produce the awesome results Chiaroscuro was talking about (i.e. 12M samples/sec).

P.S. Are your 8 GPUs installed on the same PCIe 2.0 bus? All 16x? 8x? Like Chiaroscuro, I wonder whether PCI bus performance has some influence. Both ATI and NVIDIA provide a sample application in their SDKs to evaluate PCI performance. It would be quite interesting to patch the application to work with multiple GPUs and see how the available PCI bus bandwidth changes with multiple GPUs.


I understand your near-future plan, thanks.

As for the PCIe bandwidth issue, of course overall performance depends on PCIe bandwidth, too.
And that is another reason why we should make the GPU independent (working without being fed by the CPU).
The following screen dump was made on a machine with all 16x slots and a high-end CPU.

full_spec.png
KyungSoo
 
Posts: 374
Joined: Tue Feb 09, 2010 2:49 am
Location: Soeul, KOREA

Re: LuxrenderGPU

Postby Dade » Thu Feb 11, 2010 5:49 am

Just to add some information to the discussion, this is the result of a profiling session of a single native thread:

Code:
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
time   seconds   seconds    calls   s/call   s/call  name   
71.17     58.98    58.98 99512438     0.00     0.00  QBVHAccel::Intersect(Ray const&, RayHit*) const
15.07     71.47    12.49 66519226     0.00     0.00  Path::AdvancePath(Scene*, Sampler*, RayBuffer const*, SampleBuffer*)
  6.46     76.82     5.35     1540     0.00     0.00  PathIntegrator::FillRayBuffer(RayBuffer*)
  2.52     78.91     2.09 221341484     0.00     0.00  RandomSampler::GetLazyValue(Sample*)
  1.30     79.99     1.08 22296905     0.00     0.00  Path::Init(Scene*, Sampler*)
  1.28     81.05     1.06     1540     0.00     0.01  PathIntegrator::AdvancePaths(RayBuffer const*)
  0.81     81.72     0.67 22296908     0.00     0.00  RandomSampler::GetNextSample(Sample*)
  0.54     82.17     0.45     1540     0.00     0.04  NativeIntersectionDevice::TraceRays(RayBuffer*)
  0.16     82.30     0.13                             std::deque<RayBuffer*, std::allocator<RayBuffer*> >::_M_reallocate_map(unsigned long, bool)
  0.14     82.42     0.12        1     0.12     0.22  QBVHAccel::BuildTree(unsigned int, unsigned int, unsigned int*, BBox*, Point*, BBox const&, BBox const&, int, int, int)
  0.07     82.48     0.06 11198604     0.00     0.00  Union(BBox const&, BBox const&)
  0.07     82.54     0.06        2     0.03     0.10  ply_read(t_ply_*)
  0.06     82.59     0.05  5291168     0.00     0.00  Union(BBox const&, Point const&)


So about 70% of the time is spent inside the intersection code (i.e. on the GPU in the case of the OpenCL code) and 30% on the CPU. This also gives an idea of the optimal CPU/GPU ratio in a system to achieve maximum performance with LuxrenderGPU/SmallLuxGPU at the moment.
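A rough way to read those numbers, as a sketch under stated assumptions and not something taken from the sources: the only measured input is the 71.17% figure for QBVHAccel::Intersect from the flat profile above. The `gpu_speedup` parameter (how many times faster one GPU traces rays than one CPU thread) is hypothetical, and the model assumes the CPU and GPU work fully overlaps with no transfer cost.

```python
# Toy model: how many GPUs one CPU thread can keep busy.
# INTERSECT_PCT is copied from the "% time" column of the flat profile.
# gpu_speedup is a hypothetical parameter, not a measured value.

INTERSECT_PCT = 71.17  # QBVHAccel::Intersect


def cpu_gpu_split(intersect_pct):
    """Return (gpu_share, cpu_share) as fractions of single-thread run time."""
    if not 0.0 < intersect_pct < 100.0:
        raise ValueError("intersect_pct must be between 0 and 100")
    gpu_share = intersect_pct / 100.0
    return gpu_share, 1.0 - gpu_share


def gpus_per_cpu_thread(intersect_pct, gpu_speedup):
    """GPUs per CPU thread at which neither side waits for the other.

    Per unit of work a CPU thread spends cpu_share on non-intersection code,
    while a GPU spends gpu_share / gpu_speedup on intersections. The two are
    balanced when n_gpu / n_cpu equals the ratio of those two times.
    """
    if gpu_speedup <= 0:
        raise ValueError("gpu_speedup must be positive")
    gpu_share, cpu_share = cpu_gpu_split(intersect_pct)
    return (gpu_share / gpu_speedup) / cpu_share


if __name__ == "__main__":
    for speedup in (1.0, 5.0, 20.0):  # hypothetical GPU-vs-thread speedups
        ratio = gpus_per_cpu_thread(INTERSECT_PCT, speedup)
        print(f"speedup {speedup:>4}: {ratio:.2f} GPUs per CPU thread")
```

In this model the 7:3 reading only holds when a GPU is no faster than a single CPU thread at intersection; the faster the GPU, the more CPU threads are needed per GPU to keep it fed.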
Dade
Developer
 
Posts: 4854
Joined: Sat Apr 19, 2008 6:04 pm
Location: Italy

Re: LuxrenderGPU

Postby vildanovak » Thu Feb 11, 2010 5:56 am

Guys, is there a wiki manual on how to build lux? I'd love to build this awesomeness and try it out...
I have Ubuntu 9.10 and Win 7, so if anyone has a build, could they upload it?
vildanovak
 
Posts: 53
Joined: Sat Dec 19, 2009 6:45 am

Re: LuxrenderGPU

Postby KyungSoo » Thu Feb 11, 2010 8:40 am

Dade wrote:Just to add some information to the discussion, this is the result of a profiling session of a single native thread:

Code:
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
time   seconds   seconds    calls   s/call   s/call  name   
71.17     58.98    58.98 99512438     0.00     0.00  QBVHAccel::Intersect(Ray const&, RayHit*) const
15.07     71.47    12.49 66519226     0.00     0.00  Path::AdvancePath(Scene*, Sampler*, RayBuffer const*, SampleBuffer*)
  6.46     76.82     5.35     1540     0.00     0.00  PathIntegrator::FillRayBuffer(RayBuffer*)
  2.52     78.91     2.09 221341484     0.00     0.00  RandomSampler::GetLazyValue(Sample*)
  1.30     79.99     1.08 22296905     0.00     0.00  Path::Init(Scene*, Sampler*)
  1.28     81.05     1.06     1540     0.00     0.01  PathIntegrator::AdvancePaths(RayBuffer const*)
  0.81     81.72     0.67 22296908     0.00     0.00  RandomSampler::GetNextSample(Sample*)
  0.54     82.17     0.45     1540     0.00     0.04  NativeIntersectionDevice::TraceRays(RayBuffer*)
  0.16     82.30     0.13                             std::deque<RayBuffer*, std::allocator<RayBuffer*> >::_M_reallocate_map(unsigned long, bool)
  0.14     82.42     0.12        1     0.12     0.22  QBVHAccel::BuildTree(unsigned int, unsigned int, unsigned int*, BBox*, Point*, BBox const&, BBox const&, int, int, int)
  0.07     82.48     0.06 11198604     0.00     0.00  Union(BBox const&, BBox const&)
  0.07     82.54     0.06        2     0.03     0.10  ply_read(t_ply_*)
  0.06     82.59     0.05  5291168     0.00     0.00  Union(BBox const&, Point const&)


So about 70% of the time is spent inside the intersection code (i.e. on the GPU in the case of the OpenCL code) and 30% on the CPU. This also gives an idea of the optimal CPU/GPU ratio in a system to achieve maximum performance with LuxrenderGPU/SmallLuxGPU at the moment.


I'm afraid I don't agree that the optimal CPU/GPU ratio would be 3:7.
Could you share the result of a profiling session of a single GPU thread, too?
KyungSoo
 
Posts: 374
Joined: Tue Feb 09, 2010 2:49 am
Location: Soeul, KOREA

Re: LuxrenderGPU

Postby Meelis » Thu Feb 11, 2010 11:23 am

Does it mean that the tested CPU(s) + mobo + memory, working at full load, did 30% of the work and the GPU(s) did 70%?
And if the GPU's workload isn't 100%, then the GPU is too fast for that (mobo + memory + CPU(s))?

If so, then the optimal setup is X networked nodes, each with 1 GPU video card (maybe a 2-GPU card) + 1 CPU + mobo + ...
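That reading can be put into a tiny model. This is only a sketch: it assumes the CPU side can produce rays at some fixed rate and each GPU can consume them at some fixed rate, and both rates below are made-up numbers, not measurements from this thread. It reproduces the pattern Chiaroscuro described, where per-GPU utilization drops as GPUs are added while the total plateaus.

```python
# Toy model of a CPU feeding rays to several GPUs.
# cpu_feed_rate and gpu_rate are hypothetical (e.g. in millions of rays/sec).


def gpu_utilization(n_gpus, cpu_feed_rate, gpu_rate):
    """Fraction of time each GPU is busy when the CPU feeds all of them."""
    if n_gpus <= 0 or cpu_feed_rate <= 0 or gpu_rate <= 0:
        raise ValueError("all arguments must be positive")
    demand = n_gpus * gpu_rate  # what the GPUs could consume together
    return min(1.0, cpu_feed_rate / demand)


def aggregate_rate(n_gpus, cpu_feed_rate, gpu_rate):
    """Total rays/sec actually traced: capped by the slower side."""
    if n_gpus <= 0 or cpu_feed_rate <= 0 or gpu_rate <= 0:
        raise ValueError("all arguments must be positive")
    return min(cpu_feed_rate, n_gpus * gpu_rate)


if __name__ == "__main__":
    CPU_FEED = 3.0  # hypothetical: CPU can prepare 3M rays/sec
    GPU_RATE = 1.5  # hypothetical: one GPU can trace 1.5M rays/sec
    for n in range(1, 9):
        util = gpu_utilization(n, CPU_FEED, GPU_RATE)
        total = aggregate_rate(n, CPU_FEED, GPU_RATE)
        print(f"{n} GPU(s): utilization {util:.0%}, total {total:.1f}M rays/sec")
```

The model ignores PCIe transfer time and driver overhead, so it only illustrates the CPU-bound case.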
Meelis
 
Posts: 888
Joined: Sat Oct 17, 2009 2:16 am

Re: LuxrenderGPU

Postby KyungSoo » Thu Feb 11, 2010 12:30 pm

Meelis wrote:Does it mean that the tested CPU(s) + mobo + memory, working at full load, did 30% of the work and the GPU(s) did 70%?
And if the GPU's workload isn't 100%, then the GPU is too fast for that (mobo + memory + CPU(s))?

If so, then the optimal setup is X networked nodes, each with 1 GPU video card (maybe a 2-GPU card) + 1 CPU + mobo + ...


For now, it is an optimal configuration; I agree with that.
But it is not the best solution we can pursue.

For example, if you plan to build a large render farm,
you will prefer more powerful and compact nodes, if you have the option,
since that will also reduce the cost of space and electricity.
KyungSoo
 
Posts: 374
Joined: Tue Feb 09, 2010 2:49 am
Location: Soeul, KOREA

Re: LuxrenderGPU

Postby rp181 » Thu Feb 11, 2010 9:19 pm

Two things. First: awesome job!

Second, I'm trying to build lux-opencl from the source code repository. When running make, I get:


Am I missing a dependency?

EDIT: nvm, found this page: http://www.luxrender.net/wiki/index.php ... g_on_Linux

Compiling now, I'll see how it goes :)

EDIT EDIT:

I reached 100%, but I get this:
Code:
[100%] Building CXX object CMakeFiles/pylux.dir/python/binding.o
make[2]: *** No rule to make target `/usr/lib/libboost_python.so', needed by `libpylux.so'.  Stop.
make[1]: *** [CMakeFiles/pylux.dir/all] Error 2
make: *** [all] Error 2


When I run luxrender, it opens, but while loading it exits with a segmentation fault.
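The make error above says the pylux target expects a file at exactly /usr/lib/libboost_python.so. A quick check, as a sketch only: the script below lists whatever libboost_python* files exist in a few common library directories. The directory list is an assumption, and so is the idea that the library may be installed under a suffixed name (something like libboost_python-mt.so) instead of the plain one.

```python
# List Boost.Python libraries in some common locations.
# SEARCH_DIRS is an assumption; adjust it for your distribution.
import glob
import os

SEARCH_DIRS = ["/usr/lib", "/usr/lib64", "/usr/local/lib"]


def find_boost_python(search_dirs):
    """Return sorted paths matching libboost_python* in the given directories.

    Directories that do not exist simply contribute no matches.
    """
    hits = []
    for directory in search_dirs:
        hits.extend(glob.glob(os.path.join(directory, "libboost_python*")))
    return sorted(hits)


if __name__ == "__main__":
    found = find_boost_python(SEARCH_DIRS)
    if found:
        print("Boost.Python libraries found:")
        for path in found:
            print("  " + path)
    else:
        print("No libboost_python* found in: " + ", ".join(SEARCH_DIRS))
```

If the list is empty, the Boost.Python development files are probably not installed; if it shows only suffixed names, the path the build expects and the installed file name do not match.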
rp181
 
Posts: 16
Joined: Sun Sep 13, 2009 3:03 pm

Re: LuxrenderGPU

Postby KyungSoo » Fri Feb 12, 2010 2:57 am

rp181 wrote:When I run luxrender, it opens, but while loading it exits with a segmentation fault.


Did you (1) run "sudo make install" and then "luxrender", or (2) just run "./luxrender"?
If you tried the latter (2), it may work with the first approach (1).
KyungSoo
 
Posts: 374
Joined: Tue Feb 09, 2010 2:49 am
Location: Soeul, KOREA

Re: LuxrenderGPU

Postby agremlin » Fri Feb 12, 2010 5:14 am

Hello :)
I compiled and ran luxGpu, and I have some questions (problems):
1. The material preview doesn't work (if I replace luxconsole (GPU) with luxconsole (CPU), all is fine);
2. In some scenes, some integrators don't work (in one scene the first doesn't work, in another the second doesn't).

Could the problem be in the configuration files?

Pentium D 805 2.66, 2GB, GeForce 9500 GT
Gentoo, nvidia-drivers-190.29

P.S. LuxRender is now very fast, even on my computer. Thanks!
agremlin
 
Posts: 23
Joined: Fri Feb 12, 2010 4:53 am
Location: Kaliningrad

Re: LuxrenderGPU

Postby Dade » Fri Feb 12, 2010 5:25 am

agremlin wrote:I compiled and ran luxGpu, and I have some questions (problems):
1. The material preview doesn't work (if I replace luxconsole (GPU) with luxconsole (CPU), all is fine);
2. In some scenes, some integrators don't work (in one scene the first doesn't work, in another the second doesn't).

Could the problem be in the configuration files?


Most of the configuration is hard-coded in the sources. Some of the scene parameters are totally ignored or replaced with ad-hoc implementations for OpenCL (i.e. the sampler, the integrator and the accelerator). At the moment, it is as if the engine is mostly "disconnected" from the interface. As I wrote in the first post, it is not yet usable by end users, but it shows we are on the right path and we are obtaining the first results (there is still a load of work to do :D ).
Dade
Developer
 
Posts: 4854
Joined: Sat Apr 19, 2008 6:04 pm
Location: Italy
