[Mesa-dev] [PATCH 7/7] softpipe: add support for compute shaders.

Wed Apr 27 16:13:59 UTC 2016

On 27/04/16 02:46, Roland Scheidegger wrote:
> Am 27.04.2016 um 03:05 schrieb Dave Airlie:
>> On 27 April 2016 at 11:00, Dave Airlie <airlied at gmail.com> wrote:
>>>>> So far I've set the execmask to 1 active channel, I'm contemplating
>>>>> changing that
>>>>> though and using less machines.
>>>> Ah yes, I think that would indeed be desirable.
>>>
>>> I'll look into it, though it's not that trivial, since you might have a 1x20x1
>>> layout, also having to make sure each thread gets the correct system values.
> Looks doable though. I'm mostly asking because the whole point of
> compute shaders is things running in parallel, and while that wouldn't
> really run in parallel it would at least slightly look like it...
>
>>>
>>>>
>>>>>
>>>>> Any ideas how to implement this in llvm? :-) 1024 CPU threads?
>>>> I suppose 1024 is really the minimum work size you have to support?
>>>> But since things are always run 4-wide (or 8-wide) that would "only" be
>>>> 256 (or 128) threads. That many threads sound a bit suboptimal to me
>>>> (unless you really have a boatload of cpu cores), but why not - I
>>>> suppose you can always pause some of the threads, not all need to be
>>>> active at the same time.
>>>> Though I wonder what the opencl-on-cpu guys do...
>>>
>>> pocl appears to spawn a number of threads and split the work out amongst
>>> them in the X direction.
>>>
>>> However I'm not seeing how they handle barriers, or if they handle
>>> them correctly at all.
>>
>> Okay newer versions of pocl seem to have some sort of thread scheduler,
>> that schedule workgroups across up to 8 threads, however I can't see how
>> they deal with barriers still.
>
> Yes the problem with barriers is what I had in mind too. Otherwise could
> just create worker threads, which pick up whatever work items are left.
>
> Roland

Regarding llvmpipe, the simple solution does indeed seem to be to use one OS
thread per register's worth of invocations.

The second, intermediate, solution is to use a fixed number of threads
(i.e., equal to the number of CPUs), each using very large vectors (i.e.,
1024/num-cpus wide), and let LLVM deal with breaking those vectors into
smaller units.
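
To put rough numbers on the two schemes above, here is a small sketch of
the arithmetic. The helper names are made up for illustration and are not
existing llvmpipe functions; both round up so that a work group size that
does not divide evenly is still fully covered.

```c
#include <assert.h>

/* Hypothetical helper: OS threads needed when each thread runs one SIMD
 * register's worth of invocations (the "simple" scheme). */
static unsigned threads_per_register(unsigned invocations, unsigned simd_width)
{
    assert(simd_width > 0);
    return (invocations + simd_width - 1) / simd_width; /* round up */
}

/* Hypothetical helper: logical vector width each thread would use when
 * the work group is split across one thread per CPU (the "intermediate"
 * scheme). LLVM would then legalize this into native-width vectors. */
static unsigned vector_width_per_cpu(unsigned invocations, unsigned num_cpus)
{
    assert(num_cpus > 0);
    return (invocations + num_cpus - 1) / num_cpus; /* round up */
}
```

For a 1024-invocation work group this gives 256 threads at 4-wide and 128
at 8-wide, matching the figures Roland quoted, while the per-CPU scheme on
an 8-core machine would ask LLVM for 128-wide vectors.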

Emitting LLVM IR in such a way that it's able to stop/resume execution in
the middle of a thread seems hard (though not impossible, since we already
deal with execution masks, so it would be mostly a matter of spilling
all input/temp registers and the execution mask to/from malloc'ed memory).
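
A plain C analogy of the spill/resume idea, just to show the shape of it.
Everything here is hypothetical (the struct, the toy kernel, the single
barrier); real generated code would have to spill whatever registers are
live at each barrier, plus the execution mask, rather than one int.

```c
#include <assert.h>
#include <stdlib.h>

#define NUM_INVOCATIONS 8

/* Hypothetical spill area: everything one invocation needs to resume. */
struct saved_state {
    int resume_point; /* 0 = start, 1 = after the barrier, 2 = done */
    int temp;         /* a "temp register" that is live across the barrier */
};

/* Toy kernel: write shared[id], hit a barrier, then read a neighbour's
 * slot. Returns 1 once the invocation has finished, 0 if it yielded. */
static int run_until_barrier(struct saved_state *s, unsigned id,
                             int *shared, int *out)
{
    switch (s->resume_point) {
    case 0:
        s->temp = (int)id * 10;
        shared[id] = s->temp;
        s->resume_point = 1; /* state is spilled; yield at the barrier */
        return 0;
    case 1:
        out[id] = s->temp + shared[(id + 1) % NUM_INVOCATIONS];
        s->resume_point = 2;
        return 1;
    default:
        return 1;
    }
}

/* Runs the whole work group on one OS thread. Returns 0 on success. */
static int run_work_group(int *out)
{
    struct saved_state *states = calloc(NUM_INVOCATIONS, sizeof(*states));
    int shared[NUM_INVOCATIONS];
    unsigned done = 0;

    if (!states)
        return -1;
    /* Each pass runs every invocation up to its next barrier, so nobody
     * gets past a barrier before everyone has reached it. */
    while (done < NUM_INVOCATIONS) {
        done = 0;
        for (unsigned id = 0; id < NUM_INVOCATIONS; id++)
            done += run_until_barrier(&states[id], id, shared, out);
    }
    free(states);
    return 0;
}
```

The hard part in practice is that the code generator has to produce the
equivalent of that switch and the spill struct automatically for arbitrary
shaders, including barriers inside control flow.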

Another solution might be to integrate some third-party library that
implements so-called green/user-space threads (e.g., via setjmp/longjmp,
or something else).  I don't know any such library off-hand, and getting
it to work on all OSes might be far from trivial.  My gut feeling is that
this would be the most promising option long term: no need to have
thousands of OS threads, and no need to increase the complexity of LLVM
code generation.
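
For what it's worth, a minimal sketch of the user-space thread idea. Note
this uses the POSIX ucontext calls (getcontext/makecontext/swapcontext)
instead of setjmp/longjmp, and assumes a Linux/glibc-style system; they
are not available everywhere, which is exactly the portability concern.
The fiber count, stack size, and toy kernel are made up for illustration.

```c
#include <assert.h>
#include <ucontext.h>

#define NUM_FIBERS 4
#define STACK_SIZE (64 * 1024)

static ucontext_t scheduler_ctx;
static ucontext_t fiber_ctx[NUM_FIBERS];
static char fiber_stack[NUM_FIBERS][STACK_SIZE];
static int fiber_done[NUM_FIBERS];
static int shared[NUM_FIBERS];
static int result[NUM_FIBERS];

/* Barrier: yield to the scheduler. The scheduler resumes fibers
 * round-robin, so every fiber reaches the barrier before any continues
 * (assuming all fibers execute the same number of barriers). */
static void fiber_barrier(int id)
{
    swapcontext(&fiber_ctx[id], &scheduler_ctx);
}

/* Toy kernel: publish a value, barrier, read a neighbour's value. */
static void kernel(int id)
{
    shared[id] = id + 1;
    fiber_barrier(id);
    result[id] = shared[(id + 1) % NUM_FIBERS];
    fiber_done[id] = 1;
    /* Returning switches to uc_link, i.e. back to the scheduler. */
}

/* Runs all fibers on the calling OS thread. Returns 0 on success. */
static int run_fibers(void)
{
    int remaining = NUM_FIBERS;

    for (int i = 0; i < NUM_FIBERS; i++) {
        if (getcontext(&fiber_ctx[i]) != 0)
            return -1;
        fiber_ctx[i].uc_stack.ss_sp = fiber_stack[i];
        fiber_ctx[i].uc_stack.ss_size = STACK_SIZE;
        fiber_ctx[i].uc_link = &scheduler_ctx;
        fiber_done[i] = 0;
        makecontext(&fiber_ctx[i], (void (*)(void))kernel, 1, i);
    }
    while (remaining > 0) {
        remaining = 0;
        for (int i = 0; i < NUM_FIBERS; i++) {
            if (fiber_done[i])
                continue;
            swapcontext(&scheduler_ctx, &fiber_ctx[i]);
            if (!fiber_done[i])
                remaining++;
        }
    }
    return 0;
}
```

Here four invocations share one OS thread and the barrier is just a
context switch, so the generated shader code would not need to know how to
suspend itself; it would only need to call out to something like
fiber_barrier().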

Jose