GCC vs Intel CC

Post by Dade » Sat May 03, 2008 2:10 pm

I have spent some time playing with compiler options and the two main compilers available under Linux: GCC and Intel CC. The results are quite interesting.

GCC

I'm using "gcc (GCC) 4.2.3 (Ubuntu 4.2.3-2ubuntu7)" on a quad-core Intel CPU at 2.4GHz, rendering the fullcornell-metropolis.lxs mandatory test.

Options: "-O2 -Wall -DLUX_USE_OPENGL -DHAVE_PTHREAD_H"

These are the default options in the CVS sources. Rendering performance:

/[4]00:00:05 76599 samples/sec 1.14021 samples/pix
-[4]00:00:10 76943 samples/sec 2.19203 samples/pix
\[4]00:00:15 77255 samples/sec 3.25186 samples/pix
|[4]00:00:20 76813 samples/sec 4.30034 samples/pix
/[4]00:00:25 76104 samples/sec 5.35618 samples/pix
-[4]00:00:30 77277 samples/sec 6.41456 samples/pix

I will take this as the baseline for computing the speedup obtained with the other compiler options.

Speedup factor: 1.0

Options: "-O3 -march=prescott -mfpmath=sse -ftree-vectorize -funroll-loops -Wall -DLUX_USE_OPENGL -DHAVE_PTHREAD_H"

After the first test I enabled nearly all of the advanced optimization options available. The results are quite good: a nice speedup.

/[4]00:00:05 88009 samples/sec 1.29311 samples/pix
-[4]00:00:10 88025 samples/sec 2.49972 samples/pix
\[4]00:00:15 88677 samples/sec 3.70596 samples/pix
|[4]00:00:20 88080 samples/sec 4.91487 samples/pix
/[4]00:00:25 87941 samples/sec 6.11684 samples/pix
-[4]00:00:30 88875 samples/sec 7.32542 samples/pix

Speedup factor: 1.1419 (+14.19%)
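
The speedup factors quoted in this thread appear to match the ratio of the samples/pix reached at the 00:00:30 mark to the baseline value (for example 7.32542 / 6.41456, which is about 1.142 and is reported truncated as 1.1419). A minimal C++ sketch of that calculation, assuming this is how the figures were obtained; the function names are hypothetical and not part of Luxrender:

```cpp
#include <cassert>

// Ratio of the samples/pix reached after the same wall-clock time by two
// builds. Hypothetical helper, not part of the Luxrender sources.
double speedup_factor(double baseline_samples_per_pix, double samples_per_pix) {
    assert(baseline_samples_per_pix > 0.0);  // a baseline of zero makes no sense
    return samples_per_pix / baseline_samples_per_pix;
}

// The same figure expressed as a percentage gain over the baseline.
double speedup_percent(double baseline_samples_per_pix, double samples_per_pix) {
    return (speedup_factor(baseline_samples_per_pix, samples_per_pix) - 1.0) * 100.0;
}
```

For instance, speedup_factor(6.41456, 7.32542) takes the 30-second samples/pix of the "-O2" baseline and of the "-O3" build as inputs.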

Options: "-O3 -march=prescott -mfpmath=sse -ftree-vectorize -funroll-loops -Wall -DLUX_USE_OPENGL -DHAVE_PTHREAD_H -DLUX_USE_SSE"

Then I enabled the SSE code available in Luxrender. However, the result is a bit disappointing (it is slightly slower than before); that code needs some attention.

/[4]00:00:05 86699 samples/sec 1.28474 samples/pix
-[4]00:00:10 86679 samples/sec 2.46978 samples/pix
\[4]00:00:15 85877 samples/sec 3.65329 samples/pix
|[4]00:00:20 86178 samples/sec 4.83898 samples/pix
/[4]00:00:25 87708 samples/sec 6.02384 samples/pix
-[4]00:00:30 86302 samples/sec 7.21171 samples/pix

Speedup factor: 1.1242 (+12.42%)

Options: "-O3 -march=prescott -mfpmath=sse -ftree-vectorize -funroll-loops -ffast-math -Wall -DLUX_USE_OPENGL -DHAVE_PTHREAD_H"

Then I enabled "-ffast-math". In some cases this option can produce results that are not fully compliant with the IEEE standard (however, it works fine for everyday work). It offers another small speedup.
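
One reason "-ffast-math" can break IEEE compliance is that it lets the compiler regroup floating-point operations as if they were associative, which they are not. A small C++ sketch (hypothetical function names) showing two groupings of the same sum that give different results when compiled without the option:

```cpp
#include <cassert>

// Adds left to right: (a + b) + c.
float sum_left_to_right(float a, float b, float c) {
    return (a + b) + c;
}

// Adds right to left: a + (b + c).
float sum_right_to_left(float a, float b, float c) {
    return a + (b + c);
}

// Self-check for a build WITHOUT -ffast-math: the small term survives in one
// grouping and is absorbed by the huge term in the other.
void check_grouping_matters() {
    assert(sum_left_to_right(1e20f, -1e20f, 1.0f) == 1.0f);
    assert(sum_right_to_left(1e20f, -1e20f, 1.0f) == 0.0f);
}
```

Once the compiler is free to pick either grouping, the result of such an expression is no longer pinned down by the source code.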

/[4]00:00:05 90559 samples/sec 1.33743 samples/pix
-[4]00:00:10 90514 samples/sec 2.5676 samples/pix
\[4]00:00:15 88163 samples/sec 3.79639 samples/pix
|[4]00:00:20 90550 samples/sec 5.03864 samples/pix
/[4]00:00:25 90027 samples/sec 6.26448 samples/pix
-[4]00:00:30 89043 samples/sec 7.4906 samples/pix

Speedup factor: 1.1677 (+16.77%)

Options:
Pass 1 => "-O3 --coverage -march=prescott -mfpmath=sse -ftree-vectorize -funroll-loops -ffast-math -Wall -DLUX_USE_OPENGL -DHAVE_PTHREAD_H"
Pass 2 => "-O3 -fbranch-probabilities -march=prescott -mfpmath=sse -ftree-vectorize -funroll-loops -ffast-math -Wall -DLUX_USE_OPENGL -DHAVE_PTHREAD_H"

Then I used a nice feature of most advanced compilers: you collect statistics about the execution (with an executable compiled in "Pass 1"), then compile the code again and let the compiler optimize with the help of the collected stats (i.e. "Pass 2"). The results are quite good.
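
To make the two-pass idea concrete, here is a hedged C++ sketch of the kind of code that can benefit: a loop with a heavily skewed branch, whose real-world bias the compiler can only learn from a profiling run. The function is hypothetical, and the file name in the build comments is made up; the flags in the comments are the "--coverage" and "-fbranch-probabilities" options used above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Build sketch (hot_loop.cpp is a made-up file name):
//   pass 1: g++ -O3 --coverage hot_loop.cpp -o hot_loop
//           ./hot_loop        (representative run, writes the .gcda counters)
//   pass 2: g++ -O3 -fbranch-probabilities hot_loop.cpp -o hot_loop

// Fraction of samples above a threshold. If the profiled run shows that the
// branch is almost never taken, the compiler can lay out the code for the
// common path. Hypothetical example, not Luxrender code.
double rejected_fraction(const std::vector<float>& samples, float threshold) {
    assert(!samples.empty());  // avoid dividing by zero
    std::size_t rejected = 0;
    for (float s : samples) {
        if (s > threshold) {   // rare path in the profiled run
            ++rejected;
        }
    }
    return static_cast<double>(rejected) / static_cast<double>(samples.size());
}
```

The profiling run in pass 1 should use a scene that resembles real workloads, since the second pass optimizes for whatever behaviour was recorded.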

/[4]00:00:05 93331 samples/sec 1.38725 samples/pix
-[4]00:00:10 93885 samples/sec 2.66017 samples/pix
\[4]00:00:15 93759 samples/sec 3.93007 samples/pix
|[4]00:00:20 93210 samples/sec 5.21265 samples/pix
/[4]00:00:25 94213 samples/sec 6.4936 samples/pix
-[4]00:00:30 94096 samples/sec 7.7726 samples/pix

Speedup factor: 1.2117 (+21.17%)

This was the best result I was able to achieve with GCC. Let's try the Intel compiler.

Intel CC Compiler

Mmmm, how can I describe this compiler? ... IT SMOKES !!! :mrgreen: I will jump directly to the best result obtained. However, keep in mind that the Intel compiler's optimization manual alone is more than 250 pages, so it may well be possible to obtain better results.

Options:
Pass 1 => "-prof-gen -prof-dir /tmp -O3 -ipo -mtune=core2 -xT -unroll -fp-model fast=2 -rcd -no-prec-div -DLUX_USE_OPENGL -DHAVE_PTHREAD_H '-D"__sync_fetch_and_add(ptr,addend)=_InterlockedExchangeAdd(const_cast<void*>(reinterpret_cast<volatile void*>(ptr)), addend)"'"
Pass 2 => "-prof-use -prof-dir /tmp -O3 -ipo -mtune=core2 -xT -unroll -fp-model fast=2 -rcd -no-prec-div -DLUX_USE_OPENGL -DHAVE_PTHREAD_H '-D"__sync_fetch_and_add(ptr,addend)=_InterlockedExchangeAdd(const_cast<void*>(reinterpret_cast<volatile void*>(ptr)), addend)"'"

Here too I'm using the two-pass technique.
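
The long -D option in the Intel command lines maps GCC's __sync_fetch_and_add builtin onto _InterlockedExchangeAdd. Both perform an atomic addition and return the value held before the addition. A minimal sketch of that behaviour using std::atomic from modern C++ (a stand-in that postdates this thread; the function names are hypothetical):

```cpp
#include <atomic>
#include <cassert>

// Atomically adds `addend` to `counter` and returns the value the counter
// held BEFORE the addition, like __sync_fetch_and_add(ptr, addend).
long fetch_and_add(std::atomic<long>& counter, long addend) {
    return counter.fetch_add(addend);
}

// Self-check of the return-old-value behaviour.
void check_fetch_and_add() {
    std::atomic<long> samples(10);
    assert(fetch_and_add(samples, 5) == 10);  // previous value is returned
    assert(samples.load() == 15);             // counter now holds the sum
}
```

This kind of operation is typically what lets several rendering threads bump a shared counter without a lock.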

/[4]00:00:05 109541 samples/sec 1.56615 samples/pix
-[4]00:00:10 111306 samples/sec 3.08352 samples/pix
\[4]00:00:15 110849 samples/sec 4.60267 samples/pix
|[4]00:00:20 111510 samples/sec 6.11914 samples/pix
/[4]00:00:25 106899 samples/sec 7.61249 samples/pix
-[4]00:00:30 110389 samples/sec 9.13803 samples/pix

It is fast, damn fast !

Speedup factor: 1.4245 (+42.45%)

I uploaded the Luxrender build compiled with GCC here (http://davibu.interfree.it/luxrender/lu ... dition-gcc) and the one compiled with Intel CC here (http://davibu.interfree.it/luxrender/lu ... dition-icc; keep in mind it should work only on the latest Pentium 4 and Core 2 CPUs).

In my opinion we should maintain a binary distribution of the RC/final versions of Luxrender compiled with Intel CC; it is damn fast. I can compile the Linux version.

Post by latheix » Sat May 03, 2008 6:20 pm

Hi David and all the developer team,

This is my first post on this forum, but I have followed your work carefully since the first day. Thanks for your work and for this excellent jewel :)
Permit me to introduce myself.
I'm not a developer, but I have been passionate about the free software spirit and the quality of its community for 10 years now. I work in a (small) architecture firm (in France, near Vannes), and I wish to install a lot of GPL software, like this excellent renderer, in my office.
Currently we work with Nemetschek Allplan (data shared via Samba on a Debian server), exporting our work to Blender with a modified 3ds Python import script (this one -> http://blender-archi.tuxfamily.org/Scripts, Import section) and, I hope, Luxrender for the final presentation (a good workflow which will allow us to drop Nemetschek C4D).

This is an impressive benchmark. I'm currently running 64-bit Ubuntu Gutsy on a Q6600 quad-core, and I installed the latest Intel CC compiler following this tutorial
http://www.ubuntugeek.com/howto-install-intel-c-compiler-10-on-ubuntu-feisty-fawn.html to benchmark some of my scenes.

David, could you please explain which files I need to modify in the current CVS tree to compile correctly?

Thanks for reading.

Christian

Post by Radiance » Sun May 04, 2008 1:17 am

Hey,

Yeah, I've got a copy of ICC 10.1 for win32 here, installed in my VS2005.

I've had similar results in the past, but I haven't made any recent builds with it as it's a laborious process;
I was planning to do it when we release our final (for win32 platforms).

The vectorizer works quite well,
especially on stuff that got committed recently.

I've done a few commits recently with 'coherency' in mind.
E.g. caching often-used stuff in arrays and using loops to generate them when they empty/fill up
causes the vectorizer to SSE-vectorize the loop automatically, etc.
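
A rough illustration of the pattern described here: precompute a batch of values into a plain array with a simple counted loop, hand them out one by one, and refill when the array runs empty. The struct, its size and the placeholder formula are all hypothetical and not taken from the Lux sources; whether a given compiler actually vectorizes the loop depends on its version and flags.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical cache of precomputed values, refilled in batches.
struct ValueCache {
    static constexpr std::size_t kSize = 64;
    float values[kSize];
    std::size_t next;  // index of the next unused value
    float base;        // first input of the next batch

    // Starts "empty" so that the first take() triggers a refill.
    ValueCache() : next(kSize), base(0.0f) {}

    // Counted loop over contiguous floats with no branches and no dependency
    // between iterations: a typical candidate for auto-vectorization.
    void refill() {
        for (std::size_t i = 0; i < kSize; ++i) {
            values[i] = (base + static_cast<float>(i)) * 0.5f;  // placeholder formula
        }
        base += static_cast<float>(kSize);
        next = 0;
    }

    // Returns the next cached value, regenerating the batch when it is used up.
    float take() {
        if (next == kSize) {
            refill();
        }
        assert(next < kSize);
        return values[next++];
    }
};
```

The point of the design is that the expensive work happens in one tight loop per batch instead of being scattered across many individual calls.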

You can also build universal binaries with it, and include support for all platforms (e.g. Pentium III up to Core 2),
which will give you a large binary (like 12MB) which contains all code optimized for each different CPU type,
and selects the right one to use on the machine it's launched on (like a Mac PPC/Intel universal binary).

PGO compilation (2-pass, as you call it) is something I've had in mind to play with to see the difference;
thanks for your tests and the work you've spared me :)

I think our first final release should be supplied in both GCC and highly optimized ICC builds :)

But be careful and TEST your binaries very well, on many scenes, as I've noticed a lot of problems with ICC's optimizations:
it sometimes seriously messes up floating point precision, resulting in artifacts, rounding of low luminance values to 0, etc.

greetz,
radiance

Post by Dade » Sun May 04, 2008 9:28 am

latheix wrote: David, could you please explain which files I need to modify in the current CVS tree to compile correctly?


Hi Christian, I wrote a wiki page about the procedure for single-pass compilation (it offers about the same performance as the two-pass method): http://www.luxrender.net/wiki/index.php/How_to_compile_the_sources#Compile_with_Intel_Compiler_.28single_pass_method.29

It should be the same on 64bit platforms. Let me know if it works.

Cheers,
David

Post by alenofx » Sun May 04, 2008 9:06 pm

Hi,

I've made some tests with GCC 4.3.0 (fast-math, vectorization, loop unrolling, etc. enabled) and ICC 10.1.015 too, both on Arch Linux 32bit and 64bit.
64bit builds are a lot faster (the performance of the 32bit + ICC build is very similar to that of the 64bit + GCC build), and the fastest is, obviously, the 64bit + ICC build.

Note: for 64bit compatibility I had to add these compiler flags:
-DBOOST_NO_INTRINSIC_INT64_T
"-D'__builtin_vsnprintf(__out, __size, __fmt, __args)'='__builtin_vsnprintf(__out, __size, __fmt, (char *) __args)'"

All tests were made on my iMac Core 2 Duo, with great performance improvements over the GCC 4.0.1 build on OSX (the default GCC version of Xcode 3.0). :D

Post by Radiance » Mon May 05, 2008 1:11 am

Hi guys,

Maybe someone can write a WIKI page with various compilation options so that we can use this information on various platforms for our release builds...

greetz,
radiance

Post by Dade » Mon May 05, 2008 9:00 am

alenofx wrote: Hi,

I've made some tests with GCC 4.3.0 (fast-math, vectorization, loop unrolling, etc. enabled) and ICC 10.1.015 too, both on Arch Linux 32bit and 64bit.
64bit builds are a lot faster (the performance of the 32bit + ICC build is very similar to that of the 64bit + GCC build), and the fastest is, obviously, the 64bit + ICC build.

Note: for 64bit compatibility I had to add these compiler flags:
-DBOOST_NO_INTRINSIC_INT64_T
"-D'__builtin_vsnprintf(__out, __size, __fmt, __args)'='__builtin_vsnprintf(__out, __size, __fmt, (char *) __args)'"

All tests were made on my iMac Core 2 Duo, with great performance improvements over the GCC 4.0.1 build on OSX (the default GCC version of Xcode 3.0). :D


I'm very interested in the 32bit vs 64bit comparison. So 64bit is noticeably faster than 32bit? Good, it is time to upgrade my Ubuntu installation!

Post by Dade » Mon May 05, 2008 9:04 am

Radiance wrote: Maybe someone can write a WIKI page with various compilation options so that we can use this information on various platforms for our release builds...


I updated http://www.luxrender.net/wiki/index.php ... _method.29 with alenofx's information (btw, alenofx, feel free to add anything you want to that page; it can be a good help for people trying to compile the sources on their own). It looks like a good place to collect this kind of information.

Post by jeanphi » Mon May 05, 2008 9:58 am

Hi,

Even if it brings a nice speedup, you should not use -ffast-math (some of the suboptions it activates should be OK, though) or the ICC equivalent. Lux code needs correct behaviour with non-numerical values like NaN or infinity, which is almost guaranteed to be missing with this option (I'm almost sure infinity is not honored anymore).
I think the most notable point is that the hand-made SSE optimization is worse than the automatic code produced by GCC (and I guess ICC too). So maybe we don't have to spend time trying to add more of it (unless MSVC is unable to do this by itself).
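
To illustrate the concern about NaN and infinity: code like the hypothetical guard below relies on the compiler honouring non-finite values. GCC's -ffast-math turns on -ffinite-math-only, which allows the optimizer to assume that NaN and infinity never occur, so a check like this may be simplified away. A sketch with made-up names, not Lux code:

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Returns the value unchanged, or 0 when it is NaN or infinite, so that a
// single bad sample cannot poison a pixel. Hypothetical helper.
float sanitize_radiance(float value) {
    if (std::isnan(value) || std::isinf(value)) {
        return 0.0f;
    }
    return value;
}

// Self-check for a build WITHOUT -ffast-math / -ffinite-math-only.
void check_sanitize_radiance() {
    const float infinite = std::numeric_limits<float>::infinity();
    const float not_a_number = std::numeric_limits<float>::quiet_NaN();
    assert(sanitize_radiance(1.5f) == 1.5f);
    assert(sanitize_radiance(infinite) == 0.0f);
    assert(sanitize_radiance(not_a_number) == 0.0f);
}
```

With the finite-math assumption in force, the behaviour of such guards is no longer something the source code can guarantee, which is why the individual suboptions are worth evaluating separately.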

Jeanphi

Post by alenofx » Wed Jun 25, 2008 3:41 pm

Note: the -DBOOST_NO_INTRINSIC_INT64_T flag is needed for 64bit compatibility with GCC as well (the 0.5 release does not include this flag in the default compiler options).