| View Issue Details | ||||||||
| ID | Project | Category | View Status | Date Submitted | Last Update | ||||
| 0001005 | LuxRender | Core | public | 2011-03-30 06:01 | 2013-05-20 12:54 | ||||
| Reporter | sramij | ||||||||
| Assigned To | Dade | ||||||||
| Priority | normal | Severity | major | Reproducibility | have not tried | ||||
| Status | closed | Resolution | fixed | ||||||
| Platform | OS | OS Version | |||||||
| Product Version | 0.8RC2 | ||||||||
| Target Version | 1.3 | Fixed in Version | |||||||
| Summary | 0001005: OpenCL local work group size is not calculated correctly | ||||||||
| Description | At pathgpu.cpp:957, the benchmark queries the OpenCL SDK for the CL_KERNEL_WORK_GROUP_SIZE of this kernel. Later on, the returned value is used as is for the local work group size parameter of the NDRange call. This is not right, because the local work group size must evenly divide the GlobalWorkSize, and the benchmark does not validate that the value returned by the SDK does so. Ideally, the benchmark should take the value X returned by the CL_KERNEL_WORK_GROUP_SIZE query and calculate a new value Y such that: ((Y <= X) and (GLOBAL_WORK_SIZE % Y = 0)) | ||||||||
| Tags | No tags attached. | ||||||||
| Mercurial Changeset # | |||||||||
| Requires Documentation Update | No | ||||||||
| Requires Exporter Update | |||||||||
| Attached Files | |||||||||
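The condition proposed in the description, (Y <= X) and (GLOBAL_WORK_SIZE % Y = 0), can be sketched as a small helper. This is only an illustration: `LargestDividingLocalSize` is a hypothetical name, not a function from LuxMark, and `maxLocalSize` stands for the value X returned by the CL_KERNEL_WORK_GROUP_SIZE query.

```cpp
#include <cstddef>

// Hypothetical helper (not from the LuxMark sources): returns the largest Y
// with 1 <= Y <= maxLocalSize and globalSize % Y == 0.
// maxLocalSize plays the role of X, the CL_KERNEL_WORK_GROUP_SIZE result.
std::size_t LargestDividingLocalSize(std::size_t globalSize, std::size_t maxLocalSize) {
    if (globalSize == 0 || maxLocalSize == 0)
        return 0; // no valid local size exists for empty input

    // Y can never exceed the global size itself.
    std::size_t y = (maxLocalSize < globalSize) ? maxLocalSize : globalSize;

    // Walk down until Y divides the global size; Y == 1 always divides it.
    while (globalSize % y != 0)
        --y;
    return y;
}
```

For example, a global size of 1000 with X = 256 yields 250. Note that a prime global size makes this search fall back to 1, which is one reason to pad the global size instead.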
Notes |
(0002895) Dade (developer) 2011-04-06 05:30 |
You are talking about SLG, not LuxRender, right? Or LuxMark? The workgroup size is always forced to 64 in all the example scenes and in any scene saved with the Blender exporter. The forced value can be changed by the user. This is because the workgroup size suggested by OpenCL has never proven optimal (i.e. it always ended up performing worse than 64) in any test I have done. 64 is the usual workgroup size for any ATI hardware. NVIDIA could work with 32, but 64 works well there too. |
|
(0002914) sramij (reporter) 2011-04-09 23:59 |
No, I am talking about the LuxMark benchmark. Actually, I don't know what SLG is. OpenCL doesn't receive 64 as the local work group size for the NDRange call; it receives the same value previously returned by the CL_KERNEL_WORK_GROUP_SIZE query. |
|
(0003921) Dade (developer) 2013-05-16 03:33 |
I fixed this problem by rounding up the GLOBAL_WORK_SIZE to a multiple of WORKGROUP_SIZE. The kernel already includes the code to do nothing if the global ID is outside the real GLOBAL_WORK_SIZE. |
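A minimal sketch of the rounding described in this note, assuming plain host-side C++. The names `RoundUpGlobalSize` and `WorkItemIsActive` are hypothetical and are not taken from the LuxRays sources; the second function only imitates on the host the bounds check the note says the kernel already performs.

```cpp
#include <cstddef>

// Hypothetical helper: pads taskCount (the real GLOBAL_WORK_SIZE) up to the
// next multiple of workGroupSize so the local size divides the global size.
std::size_t RoundUpGlobalSize(std::size_t taskCount, std::size_t workGroupSize) {
    if (workGroupSize == 0)
        return taskCount; // nothing to align to
    const std::size_t remainder = taskCount % workGroupSize;
    return (remainder == 0) ? taskCount : taskCount + (workGroupSize - remainder);
}

// Host-side stand-in for the kernel guard: padded work-items whose global ID
// lies beyond the real task count must do nothing.
bool WorkItemIsActive(std::size_t globalId, std::size_t taskCount) {
    return globalId < taskCount;
}
```

With 1000 tasks and a workgroup size of 64, the padded global size is 1024, and the 24 extra work-items (IDs 1000 to 1023) are inactive.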
|
(0003922) sramij (reporter) 2013-05-16 04:20 |
So why are you querying for KernelWorkGroupSize from the underlying framework? |
|
(0003923) jensverwiebe (developer) 2013-05-16 06:57 edited on: 2013-05-16 10:03 |
Hi Dade, I once forced the CPU workgroup size to 1 for a good reason: there is a bug where Intel CPUs report a possible workgroup size of 1024. Somehow this always results in wrong work-items etc. At least there was never a benefit from using more than 1. With your change I now get a calculated size of 128 for my Xeon CPU, which again results in only 1/10th of the speed I had before. See some hints here: http://wiki.tiker.net/OpenCLOddities Quote: "Apple, CPU - Only allows one work item per work group. (mapping to one thread per CPU)" This is still a valid fact at the moment, at least on OSX/Apple. The AMD SDK, even though it allows larger workgroups on the CPU, does not take any advantage of them. Unless you have other experiences with newer SDKs, we should again use a workgroup size of 1 for CPUs on Apple. (I have a code snippet ready to get the CPU count if needed.) Jens |
|
(0003927) Dade (developer) 2013-05-17 01:18 edited on: 2013-05-17 01:23 |
Sramij, I query the framework for a valid workgroup size, then I round up the task count (i.e. GLOBAL_WORK_SIZE) to a valid number when queuing the kernel execution. What is the problem? An example of what I did: http://src.luxrender.net/luxrays/file/af1954657805/src/slg/engines/pathocl/pathoclthread.cpp#l1386 |
|
(0003929) Dade (developer) 2013-05-17 01:21 |
Jens, the Apple driver really shouldn't return a workgroup size that leads to 1/10th of the performance. That said, I agree that we can easily solve this problem with an "#ifdef __APPLE__" so that 1 is the default workgroup size for CPU devices on the Apple platform. |
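The "#ifdef __APPLE__" idea could look roughly like the sketch below. `DefaultCpuWorkGroupSize` is a hypothetical name used only for illustration and `queriedSize` stands for whatever the driver suggested; the actual change committed to LuxRays may differ.

```cpp
#include <cstddef>

// Hypothetical helper: picks the default workgroup size for CPU devices.
// On Apple platforms it ignores the driver's suggestion and returns 1;
// elsewhere it passes the queried value through unchanged.
std::size_t DefaultCpuWorkGroupSize(std::size_t queriedSize) {
#ifdef __APPLE__
    (void)queriedSize; // unused on Apple: the default is forced to 1
    return 1;
#else
    return queriedSize;
#endif
}
```

A user-supplied setting such as opencl.cpu.workgroup.size would presumably still take precedence over this default.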
|
(0003932) jensverwiebe (developer) 2013-05-17 02:49 edited on: 2013-05-17 03:18 |
Dade, the huge speed loss can partly be explained by the fact that the SLG4-based LuxMark reduced my CPU benchmarks significantly anyway (around 1/5 slower), but I never complained because it is all WIP. I should show some numbers here soon ...
EDIT: render-hdr.cfg ( luxball )
- luxmark/slg4 opencl.cpu.workgroup.size = 1 -> 3336
- auto set ( 128 ? ) -> 484
- old luxmark 2.0b2 ( with my fix ): 4225
Jens |
|
(0003937) jensverwiebe (developer) 2013-05-19 06:15 edited on: 2013-05-19 06:16 |
I added an Apple conditional and set the wg default to 1 again. Back to around 3300 in the benchmark, but I am still wondering where the rest of the performance gets lost. From 4200 to 3300 is still a significant speed loss ... Jens
Issue History |
|||
| Date Modified | Username | Field | Change |
| 2011-03-30 06:01 | sramij | New Issue | |
| 2011-03-30 06:02 | sramij | Severity | minor => major |
| 2011-04-06 05:23 | Dade | Assigned To | => Dade |
| 2011-04-06 05:23 | Dade | Status | new => assigned |
| 2011-04-06 05:30 | Dade | Note Added: 0002895 | |
| 2011-04-09 23:59 | sramij | Note Added: 0002914 | |
| 2011-05-24 12:55 | jeanphi | Target Version | => 1.0 |
| 2012-08-21 06:51 | jeanphi | Target Version | 1.0 => 1.1 |
| 2012-08-21 10:16 | jeanphi | Target Version | 1.1 => |
| 2013-02-25 05:22 | jeanphi | Target Version | => 1.3 |
| 2013-05-16 03:33 | Dade | Note Added: 0003921 | |
| 2013-05-16 03:33 | Dade | Status | assigned => feedback |
| 2013-05-16 04:20 | sramij | Note Added: 0003922 | |
| 2013-05-16 04:20 | sramij | Status | feedback => assigned |
| 2013-05-16 04:20 | sramij | Status | assigned => feedback |
| 2013-05-16 06:57 | jensverwiebe | Note Added: 0003923 | |
| 2013-05-16 07:00 | jensverwiebe | Note Edited: 0003923 | View Revisions |
| 2013-05-16 07:02 | jensverwiebe | Note Edited: 0003923 | View Revisions |
| 2013-05-16 07:03 | jensverwiebe | Note Edited: 0003923 | View Revisions |
| 2013-05-16 07:09 | jensverwiebe | Note Edited: 0003923 | View Revisions |
| 2013-05-16 07:12 | jensverwiebe | Note Edited: 0003923 | View Revisions |
| 2013-05-16 07:12 | jensverwiebe | Note Edited: 0003923 | View Revisions |
| 2013-05-16 07:14 | jensverwiebe | Note Edited: 0003923 | View Revisions |
| 2013-05-16 07:15 | jensverwiebe | Note Edited: 0003923 | View Revisions |
| 2013-05-16 10:03 | jensverwiebe | Note Edited: 0003923 | View Revisions |
| 2013-05-17 01:18 | Dade | Note Added: 0003927 | |
| 2013-05-17 01:21 | Dade | Note Added: 0003929 | |
| 2013-05-17 01:23 | Dade | Note Edited: 0003927 | View Revisions |
| 2013-05-17 02:49 | jensverwiebe | Note Added: 0003932 | |
| 2013-05-17 02:49 | jensverwiebe | Note Edited: 0003932 | View Revisions |
| 2013-05-17 03:18 | jensverwiebe | Note Edited: 0003932 | View Revisions |
| 2013-05-19 06:15 | jensverwiebe | Note Added: 0003937 | |
| 2013-05-19 06:15 | jensverwiebe | Note Edited: 0003937 | View Revisions |
| 2013-05-19 06:16 | jensverwiebe | Note Edited: 0003937 | View Revisions |
| 2013-05-20 12:54 | Dade | Status | feedback => closed |
| 2013-05-20 12:54 | Dade | Resolution | open => fixed |