using gcc 4.6.1 and -flto



Postby daidai67 » Fri Jul 01, 2011 8:23 am

Hi,

I've just compiled everything for the first time and tried some gcc switches on my brand new Arch Linux installation. Since I have time to lose, I decided to benchmark every combination of the switches -flto, -fPIC, -fdata-sections and -ffunction-sections, timing several demo scenes until each reached 100 S/P. I checked that the output is identical (with a fixed seed) to that of the default build options, although I believe you are on the safe side as long as you don't use -ffast-math or -Ofast.
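For anyone who wants to repeat this, here is a minimal sketch of how the sixteen on/off combinations of those four switches could be enumerated. How the flags are then passed to the build (for example through CXXFLAGS) is an assumption and is left out; the helper name `flag_combinations` is made up for illustration.

```python
from itertools import product

# The four switches named in the post; each is either on or off in a build.
SWITCHES = ["-flto", "-fPIC", "-fdata-sections", "-ffunction-sections"]


def flag_combinations(switches):
    """Return every on/off combination of the given switches as a list of lists."""
    combos = []
    for mask in product([False, True], repeat=len(switches)):
        combos.append([s for s, enabled in zip(switches, mask) if enabled])
    return combos


combos = flag_combinations(SWITCHES)
for flags in combos:
    # Assumption: the empty combination stands for the default build options.
    print(" ".join(flags) or "(default build)")
```

Each printed line is one build configuration to compile and time with the same scene and seed.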

First thoughts:

Using gcc 4.6.1 + -march=native (core2) gives me a 2% speedup compared to the regular luxrender build available from the official site.

Using -fPIC to compile luxconsole and luxrender gives a 3% penalty in most cases (I understand it is mandatory for the shared lib for pylux, but why trade render time for a bit of compile time in luxconsole...)

Using -flto + -fuse-linker-plugin + -fwhole-program makes the compilation take ages, especially the link step, as the optimizations are performed at the very end (you can even use -flto-partition=none for slightly better code quality..), but the result is impressive: another +2%, and also a 36% decrease in executable size (8.7 MB to 5.7 MB).

Using -fdata-sections and -ffunction-sections + -Wl,--gc-sections (dead code elimination) at link time reduces the size further to 5.3 MB and gives another +1% performance (maybe fewer cache misses??), although it should not have any impact on performance...

Using -funroll-loops or -finline-limit=10000 didn't do anything performance-wise...

To sum up: gcc 4.6.1 + no -fPIC + -flto + -fdata-sections and -ffunction-sections + -Wl,--gc-sections gives me roughly a +8% performance increase and a 40% smaller executable, not bad... But at the cost of an awful increase in compile and link time (I didn't measure it, but certainly +500%).
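As a rough sketch of how such percentages could be derived from raw measurements: the functions below assume that "performance" means throughput, so the speedup is computed from the time each build needs to reach the same sample count. That definition and both function names are assumptions, not something stated in the post.

```python
def speedup_percent(t_base, t_new):
    """Throughput gain in percent, given the time each build takes for the same work."""
    return (t_base / t_new - 1.0) * 100.0


def size_reduction_percent(size_base, size_new):
    """Relative size decrease in percent."""
    return (1.0 - size_new / size_base) * 100.0


# Executable sizes quoted above: 8.7 MB for the default build, 5.3 MB for the final one.
print(f"{size_reduction_percent(8.7, 5.3):.1f}% smaller")
```

With the rounded sizes quoted above this gives a reduction of about 39%, in line with the "40%" figure.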

BTW I tested only non-OpenCL code (no hybrid), only luxconsole/luxrender, with some of the demos (including the two included in the archives), so I assume not every code path has been exercised. I didn't see any artifacts or fireflies, and the output was 100% the same as the default render, but problems can appear very late in a render.

I can give you the raw numbers if you wish...

Regards,
daidai67
 
Posts: 10
Joined: Wed Dec 31, 2008 2:25 am

Re: using gcc 4.6.1 and -flto

Postby SATtva » Fri Jul 01, 2011 9:34 am

These are really interesting observations, thanks for sharing. I haven't updated the Linux compilation environment since the 0.8 release, but I'm going to in 7-10 days.
Linux builds packager
To ask is a moment's shame; not to ask is a lifetime's shame.
SATtva
Developer
 
Posts: 5547
Joined: Tue Apr 07, 2009 12:19 pm
Location: from Siberia with love

Re: using gcc 4.6.1 and -flto

Postby jeanphi » Fri Jul 01, 2011 9:45 am

Hi,

Thanks a lot for taking the time to benchmark those new gcc options. It sure is interesting.

Jeanphi
jeanphi
Developer
 
Posts: 6624
Joined: Mon Jan 14, 2008 7:21 am

Re: using gcc 4.6.1 and -flto

Postby foobarbarian » Fri Jul 01, 2011 5:38 pm

Hai daidai67

thanks for testing this!

I've long been interested in setting up a way to automatically test different compile options. And your work
highlights how important that would be.
As you pointed out, there is a lot of CPU power to be gained. So in the end you are helping to make lux greener :)

For Windows, the difference between the Intel compiler and MSVC is quite large, roughly 20%!

As for -fPIC: it does add another step of indirection to function calls. Not only does that take CPU time, it also
takes away a wee bit of locality. But I wouldn't have thought that the difference would be noticeable for lux.

Depending on the material used most time is spent in a tight inner loop that is inlined.
This changes when using calculated materials - more time is then used in the noise functions etc.

Could you post the diff for reference to know where to change the seeds?

Additionally, if you have the time to dig into this, one of the more CPU-intensive files (depending on scene/render)
is not optimized because of a GCC bug which might or might not have been fixed in recent gcc versions.
Comment out this line: SET_SOURCE_FILES_PROPERTIES(accelerators/qbvhaccel.cpp COMPILE_FLAGS "-O1")
in CMakeLists.txt

I tried to reopen the bug case in mantis for this issue to request more information, but it has been
closed without any more details, so I don't know how to check what exactly is broken.

cheers
amir

ps. From http://gcc.gnu.org/gcc-4.6/changes.html

Parallelism is controlled with -flto=n (where n specifies the number of compilations to execute in parallel). GCC can also cooperate with a GNU make job server by specifying the -flto=jobserver option and adding + to the beginning of the Makefile rule executing the linker.
Classical LTO mode can be enforced by -flto-partition=none. This may result in small code quality improvements.
foobarbarian
Developer
 
Posts: 36
Joined: Sat Mar 19, 2011 10:21 am

Re: using gcc 4.6.1 and -flto

Postby jeanphi » Fri Jul 01, 2011 6:06 pm

foobarbarian wrote:Could you post the diff for reference to know where to change the seeds?

You can use the -f option.

foobarbarian wrote:Additionally if you have the time to dig into this, one of the more cpu intensive files (depending on scene/render)
is not optimized because of a GCC bug which might or might not have been fixed in recent gcc versions.
Comment out this line: SET_SOURCE_FILES_PROPERTIES(accelerators/qbvhaccel.cpp COMPILE_FLAGS "-O1")
in CMakeLists.txt

I tried to reopen the bug case in mantis for this issue to request more information, but it has been
closed without any more details, so I don't know how to check what exactly is broken.

No need to reopen the issue to test it. Enable full optimizations on the QBVH accelerator and try to render a scene like the mandatory test scene from the testscenes repository. If it's broken, you'll see through some meshes. Lots of scenes actually exhibit this behaviour.

Jeanphi
jeanphi
Developer
 
Posts: 6624
Joined: Mon Jan 14, 2008 7:21 am

Re: using gcc 4.6.1 and -flto

Postby daidai67 » Fri Jul 01, 2011 6:34 pm

How should it be broken? I didn't see anything wrong with the school corridor scene, which uses the qbvh accelerator, at least with gcc 4.6.1 from Arch Linux (64-bit), so I guess the bug has been corrected..

BTW, if you are a masochist you can try http://stderr.org/doc/acovea/html/acoveaga.html, which tries to find the very best combination of compilation options, but it will take years to find out. I guess some people lead a strange life, implementing a genetic algorithm to test gcc...

Anyway, gcc 4.6.1 + no PIC + -flto + the -fxxx-sections flags really seems to be a winner, but I guess it will force you to recompile everything separately for pylux as a shared library (yes, PIC is unfortunately mandatory for shared libraries..)

PS: yes, I used the -f option and simply compared the renders pixel by pixel to check that no switch broke anything (for instance, -ffast-math is not an option to use...)
PS2: I didn't see any performance boost from using -flto-partition=none
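A minimal sketch of such a pixel-by-pixel check, assuming both renders have already been decoded into raw interleaved pixel buffers of the same size (how the images are loaded is left out, and the function name is made up for illustration):

```python
def count_differing_pixels(reference, candidate, channels=3):
    """Count pixels whose channel values differ between two raw pixel buffers."""
    if len(reference) != len(candidate):
        raise ValueError("buffers must have the same length")
    if len(reference) % channels != 0:
        raise ValueError("buffer length must be a multiple of the channel count")
    differing = 0
    for i in range(0, len(reference), channels):
        # Compare one whole pixel (all of its channels) at a time.
        if reference[i:i + channels] != candidate[i:i + channels]:
            differing += 1
    return differing


# Two tiny made-up RGB buffers (2 pixels each), purely for illustration.
ref = bytes([10, 20, 30, 40, 50, 60])
changed = bytes([10, 20, 30, 40, 50, 61])
print(count_differing_pixels(ref, ref))      # identical buffers
print(count_differing_pixels(ref, changed))  # second pixel differs
```

With a fixed seed, any non-zero count against the default build would mean a switch changed the result.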
daidai67
 
Posts: 10
Joined: Wed Dec 31, 2008 2:25 am

Re: using gcc 4.6.1 and -flto

Postby jeanphi » Fri Jul 01, 2011 6:38 pm

daidai67 wrote:How should it be broken? I didn't see anything wrong with the school corridor scene, which uses the qbvh accelerator, at least with gcc 4.6.1 from Arch Linux (64-bit), so I guess the bug has been corrected..

That's because we force the compilation of accelerators/qbvhaccel.cpp with -O1 which disables quite a lot of optimizations. Otherwise it is miscompiled.

Jeanphi
jeanphi
Developer
 
Posts: 6624
Joined: Mon Jan 14, 2008 7:21 am

Re: using gcc 4.6.1 and -flto

Postby daidai67 » Fri Jul 01, 2011 6:47 pm

I did see the comment in the CMakeLists; I was just wondering what "ruined the render" would mean and how I could quickly detect the miscompilation (I haven't searched mantis for the original bug report yet, but I will). I must retest, but the render was looking quite good, so I assumed the bug was no longer there..

BTW, -flto should not be used with gcc < 4.6, otherwise there is a good chance of a compiler crash.. Link-time code optimization is an old optimization technique in MSVC; it was about time gcc caught up...
daidai67
 
Posts: 10
Joined: Wed Dec 31, 2008 2:25 am

Re: using gcc 4.6.1 and -flto

Postby jeanphi » Fri Jul 01, 2011 6:53 pm

Hi,

If you haven't removed the -O1 option applied to QBVH, then the bug is not there because optimizations are disabled, not because it has been fixed in gcc.

Jeanphi
jeanphi
Developer
 
Posts: 6624
Joined: Mon Jan 14, 2008 7:21 am

Re: using gcc 4.6.1 and -flto

Postby jeanphi » Fri Jul 01, 2011 7:15 pm

Hi,

I just tested the removal of -O1 with gcc 4.5.2 on Ubuntu and it seems to work with a 2% boost. We should probably investigate the removal of that tweak, but we'll have to be very cautious with that and properly document known good compilers.

Jeanphi
jeanphi
Developer
 
Posts: 6624
Joined: Mon Jan 14, 2008 7:21 am
