GCC
I'm using "gcc (GCC) 4.2.3 (Ubuntu 4.2.3-2ubuntu7)" on a 4-core Intel CPU @ 2.4GHz, rendering the fullcornell-metropolis.lxs mandatory test.
Options: "-O2 -Wall -DLUX_USE_OPENGL -DHAVE_PTHREAD_H"
These are the default options in the CVS sources. Rendering performance:
/[4]00:00:05 76599 samples/sec 1.14021 samples/pix
-[4]00:00:10 76943 samples/sec 2.19203 samples/pix
\[4]00:00:15 77255 samples/sec 3.25186 samples/pix
|[4]00:00:20 76813 samples/sec 4.30034 samples/pix
/[4]00:00:25 76104 samples/sec 5.35618 samples/pix
-[4]00:00:30 77277 samples/sec 6.41456 samples/pix
I will take this result as the baseline for computing the speedup with other compile options.
Speedup factor: 1.0
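For anyone who wants to compare runs without copying numbers by hand, here is a minimal sketch that parses the progress lines. The field meanings are my reading of the output, not something confirmed by the Luxrender sources: a spinner character, a bracketed count (presumably the number of render threads), elapsed time, samples/sec and samples/pix. The names parse_progress and LINE are made up for this example.

```python
import re

# Progress lines copied from the baseline (-O2) run above.
LOG = r"""
/[4]00:00:05 76599 samples/sec 1.14021 samples/pix
-[4]00:00:10 76943 samples/sec 2.19203 samples/pix
\[4]00:00:15 77255 samples/sec 3.25186 samples/pix
|[4]00:00:20 76813 samples/sec 4.30034 samples/pix
/[4]00:00:25 76104 samples/sec 5.35618 samples/pix
-[4]00:00:30 77277 samples/sec 6.41456 samples/pix
"""

# Assumed layout: <spinner>[<count>]HH:MM:SS <rate> samples/sec <spp> samples/pix
LINE = re.compile(
    r"^.\[(\d+)\](\d+):(\d+):(\d+)\s+(\d+) samples/sec\s+([\d.]+) samples/pix$"
)

def parse_progress(text):
    """Turn progress lines into dicts; lines that do not match are skipped."""
    rows = []
    for line in text.strip().splitlines():
        match = LINE.match(line.strip())
        if match is None:
            continue
        count, hours, minutes, seconds, rate, spp = match.groups()
        rows.append({
            "count": int(count),  # assumed to be the thread count
            "elapsed_s": int(hours) * 3600 + int(minutes) * 60 + int(seconds),
            "samples_per_sec": int(rate),
            "samples_per_pix": float(spp),
        })
    return rows

rows = parse_progress(LOG)
mean_rate = sum(r["samples_per_sec"] for r in rows) / len(rows)
print(f"{len(rows)} lines, mean {mean_rate:.0f} samples/sec, "
      f"final {rows[-1]['samples_per_pix']} samples/pix")
```

The samples/pix value on the last line (00:00:30) is the one that matters for the comparisons in this post.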
Options: "-O3 -march=prescott -mfpmath=sse -ftree-vectorize -funroll-loops -Wall -DLUX_USE_OPENGL -DHAVE_PTHREAD_H"
After the first test I enabled nearly all of the advanced optimization options available. The results are quite good: a nice speedup.
/[4]00:00:05 88009 samples/sec 1.29311 samples/pix
-[4]00:00:10 88025 samples/sec 2.49972 samples/pix
\[4]00:00:15 88677 samples/sec 3.70596 samples/pix
|[4]00:00:20 88080 samples/sec 4.91487 samples/pix
/[4]00:00:25 87941 samples/sec 6.11684 samples/pix
-[4]00:00:30 88875 samples/sec 7.32542 samples/pix
Speedup factor: 1.1419 (+14.19%)
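The speedup factors in this post match the ratio of samples/pix reached after 30 seconds, cut off after four decimals. A minimal sketch of that arithmetic follows; the function names are made up for this example, and the truncation (rather than rounding) is inferred from the numbers, not from any stated method.

```python
import math

BASELINE_SPP = 6.41456   # -O2 build, samples/pix at 00:00:30
OPTIMIZED_SPP = 7.32542  # -O3 -march=prescott ... build, samples/pix at 00:00:30

def speedup(samples_per_pix, baseline=BASELINE_SPP):
    """Ratio of samples/pix reached in the same wall-clock time."""
    return samples_per_pix / baseline

def truncate(value, digits=4):
    """Cut off (not round) after `digits` decimal places."""
    scale = 10 ** digits
    return math.floor(value * scale) / scale

factor = truncate(speedup(OPTIMIZED_SPP))
print(f"Speedup factor: {factor} (+{(factor - 1) * 100:.2f}%)")
```

Feeding the final samples/pix of the other runs into speedup() reproduces their factors in the same way.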
Options: "-O3 -march=prescott -mfpmath=sse -ftree-vectorize -funroll-loops -Wall -DLUX_USE_OPENGL -DHAVE_PTHREAD_H -DLUX_USE_SSE"
Then I enabled the SSE code available in Luxrender (-DLUX_USE_SSE). However, the result is a bit disappointing: it is slightly slower than before, so that code needs some attention.
/[4]00:00:05 86699 samples/sec 1.28474 samples/pix
-[4]00:00:10 86679 samples/sec 2.46978 samples/pix
\[4]00:00:15 85877 samples/sec 3.65329 samples/pix
|[4]00:00:20 86178 samples/sec 4.83898 samples/pix
/[4]00:00:25 87708 samples/sec 6.02384 samples/pix
-[4]00:00:30 86302 samples/sec 7.21171 samples/pix
Speedup factor: 1.1242 (+12.42%)
Options: "-O3 -march=prescott -mfpmath=sse -ftree-vectorize -funroll-loops -ffast-math -Wall -DLUX_USE_OPENGL -DHAVE_PTHREAD_H"
Then I enabled "-ffast-math". In some cases this option can produce results that are not fully compliant with the IEEE standard (however, it works fine for everyday work). It offers another small speedup.
/[4]00:00:05 90559 samples/sec 1.33743 samples/pix
-[4]00:00:10 90514 samples/sec 2.5676 samples/pix
\[4]00:00:15 88163 samples/sec 3.79639 samples/pix
|[4]00:00:20 90550 samples/sec 5.03864 samples/pix
/[4]00:00:25 90027 samples/sec 6.26448 samples/pix
-[4]00:00:30 89043 samples/sec 7.4906 samples/pix
Speedup factor: 1.1677 (+16.77%)
Options:
Pass 1 => "-O3 --coverage -march=prescott -mfpmath=sse -ftree-vectorize -funroll-loops -ffast-math -Wall -DLUX_USE_OPENGL -DHAVE_PTHREAD_H"
Pass 2 => "-O3 -fbranch-probabilities -march=prescott -mfpmath=sse -ftree-vectorize -funroll-loops -ffast-math -Wall -DLUX_USE_OPENGL -DHAVE_PTHREAD_H"
Then I used a nice feature of most advanced compilers: profile-guided optimization. You collect statistics about the execution by running an executable compiled with the "Pass 1" options, then compile the code again with the "Pass 2" options so the compiler can optimize with the help of the collected stats. The results are quite good.
/[4]00:00:05 93331 samples/sec 1.38725 samples/pix
-[4]00:00:10 93885 samples/sec 2.66017 samples/pix
\[4]00:00:15 93759 samples/sec 3.93007 samples/pix
|[4]00:00:20 93210 samples/sec 5.21265 samples/pix
/[4]00:00:25 94213 samples/sec 6.4936 samples/pix
-[4]00:00:30 94096 samples/sec 7.7726 samples/pix
Speedup factor: 1.2117 (+21.17%)
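The two GCC passes above differ in a single flag, so the workflow is: build with the Pass 1 flags, run a representative render so the instrumented binary writes its profile data, then rebuild with the Pass 2 flags. A small sketch that derives both flag strings from one shared list; COMMON, PASS_FLAG and gcc_flags are names made up for this example, and how the flags get passed to the Luxrender build is not covered here.

```python
# Flags shared by both passes, taken from the Options lines above.
COMMON = [
    "-O3", "-march=prescott", "-mfpmath=sse", "-ftree-vectorize",
    "-funroll-loops", "-ffast-math", "-Wall",
    "-DLUX_USE_OPENGL", "-DHAVE_PTHREAD_H",
]

# The only flag that differs between the two passes.
PASS_FLAG = {
    1: "--coverage",              # instrument the build to collect statistics
    2: "-fbranch-probabilities",  # reuse the collected statistics
}

def gcc_flags(pass_number):
    """Return the flag string for pass 1 (instrument) or pass 2 (optimize)."""
    return " ".join([COMMON[0], PASS_FLAG[pass_number]] + COMMON[1:])

for n in (1, 2):
    print(f"Pass {n} => {gcc_flags(n)}")
```

Keeping the shared flags in one place avoids the two passes drifting apart, which matters because the profile data is only useful if the second build matches the first.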
The two-pass build was the best result I was able to achieve with GCC. Let's try the Intel compiler.
Intel CC Compiler
Mmmm, how can I describe this compiler? ... IT SMOKES!!!
Options:
Pass 1 => "-prof-gen -prof-dir /tmp -O3 -ipo -mtune=core2 -xT -unroll -fp-model fast=2 -rcd -no-prec-div -DLUX_USE_OPENGL -DHAVE_PTHREAD_H '-D"__sync_fetch_and_add(ptr,addend)=_InterlockedExchangeAdd(const_cast<void*>(reinterpret_cast<volatile void*>(ptr)), addend)"'"
Pass 2 => "-prof-use -prof-dir /tmp -O3 -ipo -mtune=core2 -xT -unroll -fp-model fast=2 -rcd -no-prec-div -DLUX_USE_OPENGL -DHAVE_PTHREAD_H '-D"__sync_fetch_and_add(ptr,addend)=_InterlockedExchangeAdd(const_cast<void*>(reinterpret_cast<volatile void*>(ptr)), addend)"'"
I'm using the two-pass technique here too.
/[4]00:00:05 109541 samples/sec 1.56615 samples/pix
-[4]00:00:10 111306 samples/sec 3.08352 samples/pix
\[4]00:00:15 110849 samples/sec 4.60267 samples/pix
|[4]00:00:20 111510 samples/sec 6.11914 samples/pix
/[4]00:00:25 106899 samples/sec 7.61249 samples/pix
-[4]00:00:30 110389 samples/sec 9.13803 samples/pix
It is fast, damn fast!
Speedup factor: 1.4245 (+42.45%)
I uploaded the Luxrender binary compiled with gcc here (http://davibu.interfree.it/luxrender/lu ... dition-gcc) and the one compiled with Intel CC here (http://davibu.interfree.it/luxrender/lu ... dition-icc); keep in mind that the Intel CC build should work only on recent Pentium 4 and Core 2 CPUs.
In my opinion we should maintain a binary distribution of the RC/final versions of Luxrender compiled with Intel CC, since it is damn fast. I can compile the Linux version.
