## using gcc 4.6.1 and -flto


### Re: using gcc 4.6.1 and -flto

jeanphi wrote: Hi,

If you haven't removed the -O1 option applied to QBVH, then the bug is not there because optimizations are disabled, not because it has been fixed in gcc.

Jeanphi

No, I commented out the whole line, so QBVH was compiled with the default options (-O3 ...). I specifically checked the compile command, set CMake to be verbose, etc. Furthermore, with -flto the real optimization is performed at the final link step (you must also pass -flto -Oxx when linking).
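The two-step shape described here (compile with -flto, then repeat -flto and the optimization level at link time) can be sketched as a small command builder. This is only an illustration: the file names `qbvhaccel.cpp` and `main.cpp` and the output name `luxconsole_lto` are hypothetical, and the commands are printed, not executed.

```python
def lto_commands(sources, output, opt="-O3", compiler="g++"):
    """Return (compile_cmds, link_cmd) for a two-step LTO build."""
    # Hypothetical naming rule: foo.cpp -> foo.o
    objects = [src.rsplit(".", 1)[0] + ".o" for src in sources]
    compile_cmds = [
        [compiler, "-flto", opt, "-c", src, "-o", obj]
        for src, obj in zip(sources, objects)
    ]
    # With LTO the optimizer runs again at link time,
    # so -flto and the -O level are repeated on the link line.
    link_cmd = [compiler, "-flto", opt, *objects, "-o", output]
    return compile_cmds, link_cmd


compile_cmds, link_cmd = lto_commands(
    ["qbvhaccel.cpp", "main.cpp"], "luxconsole_lto"
)
for cmd in compile_cmds + [link_cmd]:
    print(" ".join(cmd))
```

Dropping the optimization flags from the link command is the mistake this layout is meant to make visible.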

Listing every compiler (especially unofficial gcc variants) will not be easy, and finding old gcc versions will be tricky. I guess that setting and supporting a minimum version would be easier (I guess the oldest one still alive is "Apple" gcc 4.2.x on the Mac). Does the bug also affect Windows? Setting -O1 there optimizes for size, which is "roughly" equivalent to -Os, a higher optimization level than -O1 in the gcc world.
daidai67

### Re: using gcc 4.6.1 and -flto

Hi,

The -O1 is gcc only. Currently the Mac build doesn't use the compile tweak.
I don't think the Windows build is affected by this, but it is currently not available from CMake.

Jeanphi
jeanphi

### Re: using gcc 4.6.1 and -flto

Btw, we use Clang on OS X for the core libs; Apple GCC is only used for the apps themselves, and even that is only because Qt doesn't play nice with Clang.
-Jason Stuff

J the Ninja

### Re: using gcc 4.6.1 and -flto

This is the mantis entry:

http://www.luxrender.net/mantis/view.php?id=693

The reason I reopened the case was that profiling showed a lot of time is spent inside that file.
I thought it might be the best place to gather additional information to try to find the
specific bug in the gcc bug tracker. After all, such a big problem must have been noticed/tracked/fixed.

Then we'd know exactly which GCC versions are broken and which are not.
foobarbarian

### Re: using gcc 4.6.1 and -flto

foobarbarian wrote: The reason I did reopen the case was that profiling showed a lot of time is spent inside that file.

This is pretty much expected: it is the "ray tracing" part of the code, and you can expect a ray tracer to spend most of its time there. The truth is that LuxRender should spend even more time there: other parts of the code are quite expensive compared to the cost of tracing rays. It is also true that this is a general trend seen over the last few years in just about every "ray tracer" (i.e. sampling, shading and the image pipeline are more expensive than in the past).

Note: QBVH is mostly hand-written SSE code; it is nearly as if it were written in assembler, so the compiler optimization level doesn't change the performance much.

Dade

### Re: using gcc 4.6.1 and -flto

Dade wrote: Note: QBVH is mostly hand written SSE code, it is nearly like if it was written in assembler so the level of compiler optimization doesn't change too much the level of performance.

This is from memory, as my eval copy of the Intel suite has run out, but I'll try to get my facts straight.
Give the eval a try if you haven't yet: it provides an amazing amount of information while
still maintaining a somewhat coherent view. Shame it's not free for GPL work on Windows.

While most of QBVHAccel is hand-coded, the Intel compiler is still able to add a sizable boost to QBVHAccel::Intersect
over MSVC10 Pro builds: roughly 20% for the whole LuxMark benchmark (CPU only), which does not contain any computationally expensive material.

It does so mainly via more clever instruction reordering to keep the pipelines filled. I don't think it did any auto vectorization on its own there.

Still, there is an issue with branch mispredictions in that part of the code: the execution units were far from 100% utilized.

Also, while the code in QBVHAccel::Intersect from LuxRays does look uglier, it should offer better performance, as it will most likely end up
with fewer branches.

On a side note, I've now read multiple times that loop unrolling will hurt performance on Sandy Bridge generation CPUs, but so far I haven't had a chance to test that myself.

If we are talking about a 7-day render for a frame or multiple frames, adding 2% performance means more than 3 hours saved.
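The arithmetic behind that figure can be checked quickly. The sketch below covers two readings of "2% performance", which is an assumption on my part since the post does not say which one is meant: either the render takes 2% less time, or throughput rises by 2%.

```python
RENDER_DAYS = 7
GAIN = 0.02  # the 2% figure from the post

total_hours = RENDER_DAYS * 24  # 168 hours

# Reading 1: the render simply takes 2% less wall-clock time.
saved_simple = total_hours * GAIN

# Reading 2: throughput rises by 2%, so the same work takes total / 1.02.
saved_throughput = total_hours - total_hours / (1 + GAIN)

print(f"{saved_simple:.2f} h saved, or {saved_throughput:.2f} h saved")
```

Both readings come out above 3 hours (about 3.36 h and 3.29 h respectively), so the claim holds either way.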

cheers
amir
foobarbarian

### Re: using gcc 4.6.1 and -flto

J the Ninja wrote: Btw, we use Clang on OS X for the core libs, Apple GCC is only used for the apps themselves, and even that is only because Qt doesn't play nice with Clang.

Just out of curiosity, and since I don't have a Mac at home to test on myself (only at work), did the use of Clang give a measurable performance boost? I've heard that Clang's optimization is not on the same level as recent gcc. BTW, the Mac lags behind with 4.2. And while rebuilding gcc is easy, even on the Mac, it takes a lot of time to get 4.6.1 running on that platform, and getting framework support is a real nightmare when using pure GNU tools.

regards,
daidai
daidai67

### Re: using gcc 4.6.1 and -flto

It did, something like 10% actually, IIRC. Flipping on link-time optimization pushes this up another 3-5% or so, but for some reason the resulting build only works on newer Macs (first-gen C2D-era chips can't run it). It might be a bug in Clang slipping SSE4 instructions in or something (no idea really, that explanation just makes some amount of sense).
-Jason Stuff

J the Ninja

### Re: using gcc 4.6.1 and -flto

Dade wrote: Note: QBVH is mostly hand written SSE code, it is nearly like if it was written in assembler so the level of compiler optimization doesn't change too much the level of performance.

I agree, but this is not incompatible with gaining a boost from a higher optimization level, since I guess the 2% gain is mostly due to the automatic inlining that -O3 turns on. The code itself is possibly not faster, but the code "flow" may be. And 2% + 2% + 2% + 2% begins to add up to a lot of %.
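To put a number on how small gains stack up, here is a minimal sketch. It assumes the four 2% gains are independent and multiply, which is an idealization; real optimizations often overlap.

```python
def combined_gain(gains):
    """Combine fractional speedups multiplicatively, e.g. 0.02 for 2%."""
    factor = 1.0
    for g in gains:
        factor *= 1.0 + g
    return factor - 1.0


total = combined_gain([0.02] * 4)
print(f"{total:.2%}")  # prints 8.24%
```

Under that assumption the result is slightly more than the plain sum of 8%.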

I think a profile-guided optimization pass should be even better, and it should even be easy to do, as the render process is quite simple: input data -> render engine -> result. The x264 encoder uses this kind of build procedure for its releases, and it works well.
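A profile-guided build with gcc has three phases: an instrumented build with -fprofile-generate, one or more training runs, and a rebuild with -fprofile-use. The sketch below only lists the commands. The source file, binary name and scene files are hypothetical placeholders, and a real Lux build would go through CMake rather than a single compiler call.

```python
def pgo_steps(source, binary, training_scenes, compiler="g++", opt="-O3"):
    """List the commands for a three-phase PGO build (nothing is executed)."""
    steps = []
    # 1. Instrumented build: the binary writes profile data when it runs.
    steps.append([compiler, opt, "-fprofile-generate", source, "-o", binary])
    # 2. Training runs on representative input (hypothetical scene files).
    for scene in training_scenes:
        steps.append(["./" + binary, scene])
    # 3. Rebuild, letting the compiler use the recorded profiles.
    steps.append([compiler, opt, "-fprofile-use", source, "-o", binary])
    return steps


steps = pgo_steps("render.cpp", "render_pgo", ["scene_a.lxs", "scene_b.lxs"])
for step in steps:
    print(" ".join(step))
```

The quality of the result depends on how well the training scenes cover the hot code paths, which is why the choice of scenes matters.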

I will try to make a prototype once the CMake files have stabilized and/or if I have some time this week. The first step would be to gather a collection of test scenes that covers 80-90% of the real hot code paths, to get accurate results. Since I'm very new to the LuxRender world (and to 3D modeling too), do you have any clue where to find such scenes and which render settings I should use?

regards,
Didier
daidai67

### Re: using gcc 4.6.1 and -flto

I spent this evening applying daidai67's results to my environment. Unfortunately, without any success at all.

Before going forward, I benchmarked luxconsole compiled with gcc-4.5.2, cloog-ppl-0.15.10, ld-2.20.1 and glibc-2.11.3 and the current set of compiler flags:
-march=amdfam10 -mabm -msse4a -fprefetch-loop-arrays -O3 -pipe -mfpmath=sse -ftree-vectorize -fno-math-errno -fno-signed-zeros -fno-trapping-math -fassociative-math -fno-rounding-math -fno-signaling-nans -fcx-limited-range -DBOOST_DISABLE_ASSERTS -floop-interchange -floop-strip-mine -floop-block -fsee -ftree-loop-linear -ftree-loop-distribution -ftree-loop-im -fivopts -ftracer -DHAVE_PTHREAD_H

On a 1090T (5 cores out of 6), the luxtime test scene yielded 88.070 kS/s and schoolcorridor 46.220 kS/s.

Then I upgraded the toolchain to gcc-4.6.1 and ld-2.21.1 (glibc and cloog-ppl were left intact). I recompiled the Lux dependencies with the new toolchain and these flags (the same flags that were used previously, except that -fPIC was removed):
-march=amdfam10 -mabm -msse4a -fprefetch-loop-arrays -O3 -pipe

And finally I compiled luxconsole with the following sets of flags (after removing the reduced optimization level for QBVH):
-march=amdfam10 -mabm -msse4a -fprefetch-loop-arrays -O3 -pipe -mfpmath=sse -ftree-vectorize -fno-math-errno -fno-signed-zeros -fno-trapping-math -fassociative-math -fno-rounding-math -fno-signaling-nans -fcx-limited-range -DBOOST_DISABLE_ASSERTS -floop-interchange -floop-strip-mine -floop-block -fsee -ftree-loop-linear -ftree-loop-distribution -ftree-loop-im -fivopts -ftracer -DHAVE_PTHREAD_H -flto -fuse-linker-plugin -fwhole-program -fdata-sections -ffunction-sections -Wl,--gc-sections

-march=amdfam10 -mabm -msse4a -fprefetch-loop-arrays -O3 -pipe -mfpmath=sse -ftree-vectorize -fno-math-errno -fno-signed-zeros -fno-trapping-math -fassociative-math -fno-rounding-math -fno-signaling-nans -fcx-limited-range -DBOOST_DISABLE_ASSERTS -floop-interchange -floop-strip-mine -floop-block -fsee -ftree-loop-linear -ftree-loop-distribution -ftree-loop-im -fivopts -ftracer -DHAVE_PTHREAD_H -flto -fuse-linker-plugin -fwhole-program

-march=amdfam10 -O3 -pipe -DBOOST_DISABLE_ASSERTS -DHAVE_PTHREAD_H -flto -fuse-linker-plugin -fwhole-program -fdata-sections -ffunction-sections -Wl,--gc-sections

Each luxconsole binary was in fact slightly smaller with LTO (12-13 MB vs 15.5 MB); however, performance was also lower: in the range of 84.5-88.0 kS/s (luxtime) and 44.5-46.0 kS/s (schoolcorridor) across the different sets of flags. Really close, but clearly no improvement over gcc-4.5.2 without LTO.
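Expressed as relative changes against the gcc-4.5.2 baseline, the figures above can be compared with a short script. It uses only the numbers quoted in this post; the dictionary layout is just one convenient way to hold them.

```python
def pct_change(baseline, value):
    """Relative change of value against baseline, in percent."""
    return (value - baseline) / baseline * 100.0


# kS/s with gcc-4.5.2, no LTO
baselines = {"luxtime": 88.070, "schoolcorridor": 46.220}
# (lowest, highest) kS/s observed with gcc-4.6.1 and LTO
lto_ranges = {"luxtime": (84.5, 88.0), "schoolcorridor": (44.5, 46.0)}

changes = {}
for scene, base in baselines.items():
    low, high = lto_ranges[scene]
    changes[scene] = (pct_change(base, low), pct_change(base, high))
    print(f"{scene}: {changes[scene][0]:+.2f}% to {changes[scene][1]:+.2f}%")
```

Every value in both ranges is negative, so even the best LTO build stays just below the baseline.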

daidai67, would you be so kind as to show the actual list of flags you're passing for the Lux compilation? Or please point out where I failed so miserably.

P.S. I also tried LTO with the dependencies, but the final results were almost the same.
Linux builds packager

SATtva
