[Mesa-dev] [PATCH 7/7] softpipe: add support for compute shaders.

Roland Scheidegger sroland at vmware.com
Wed Apr 27 17:48:32 UTC 2016


On 27.04.2016 18:13, Jose Fonseca wrote:
> On 27/04/16 02:46, Roland Scheidegger wrote:
>> On 27.04.2016 03:05, Dave Airlie wrote:
>>> On 27 April 2016 at 11:00, Dave Airlie <airlied at gmail.com> wrote:
>>>>>> So far I've set the execmask to 1 active channel; I'm contemplating
>>>>>> changing that, though, and using fewer machines.
>>>>> Ah yes, I think that would indeed be desirable.
>>>>
>>>> I'll look into it, though it's not that trivial, since you might
>>>> have a 1x20x1 layout and also have to make sure each thread gets
>>>> the correct system values.
>> Looks doable though. I'm mostly asking because the whole point of
>> compute shaders is things running in parallel, and while that wouldn't
>> really run in parallel it would at least slightly look like it...
>>
>>>>
>>>>>
>>>>>>
>>>>>> Any ideas how to implement this in llvm? :-) 1024 CPU threads?
>>>>> I suppose 1024 is really the minimum work size you have to support?
>>>>> But since things are always run 4-wide (or 8-wide) that would
>>>>> "only" be
>>>>> 256 (or 128) threads. That many threads sounds a bit suboptimal to me
>>>>> (unless you really have a boatload of CPU cores), but why not - I
>>>>> suppose you can always pause some of the threads, not all need to be
>>>>> active at the same time.
>>>>> Though I wonder what the opencl-on-cpu guys do...
>>>>
>>>> pocl appears to spawn a number of threads and split the work out
>>>> amongst
>>>> them in the X direction.
>>>>
>>>> However I'm not seeing how they handle barriers, or if they handle
>>>> them correctly at all.
>>>
>>> Okay, newer versions of pocl seem to have some sort of thread scheduler
>>> that schedules workgroups across up to 8 threads; however, I still can't
>>> see how they deal with barriers.
>>
>> Yes, the problem with barriers is what I had in mind too. Otherwise you
>> could just create worker threads which pick up whatever work items are left.
>>
>> Roland
> 
> Regarding llvmpipe, the simple solution indeed seems to be to use one OS
> thread per register's worth of invocations.
> 
> The second, intermediate solution is to use the same number of threads
> (i.e., equal to the number of CPUs), each using very large vectors (i.e.,
> 1024/num-cpus wide), and let LLVM deal with breaking those vectors into
> smaller units.
Are you sure llvm can actually deal with such massive vectors (not just
in theory but in practice too)?
But even if it can, I don't think that would be all that useful. It's
likely going to result in huge shaders and massive amounts of spilling,
not to mention that divergent control flow is going to be terrible.
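
To make the divergence point concrete: with execution masks both sides of
a branch run for every lane and only the stores are predicated. A
scalarized C sketch (purely illustrative, not llvmpipe code; the 1024-lane
width and the mask array are placeholders):

/* Masked if/else over a very wide "vector": both sides execute for all
 * lanes, the mask only selects which result is kept. */
#define WIDTH 1024   /* hypothetical vector width */

static void
masked_branch(float dst[WIDTH], const float src[WIDTH], const int mask[WIDTH])
{
   float then_val[WIDTH], else_val[WIDTH];

   for (int lane = 0; lane < WIDTH; lane++)   /* "then" side, all lanes */
      then_val[lane] = src[lane] * 2.0f;

   for (int lane = 0; lane < WIDTH; lane++)   /* "else" side, all lanes */
      else_val[lane] = src[lane] + 1.0f;

   for (int lane = 0; lane < WIDTH; lane++)   /* select per execution mask */
      dst[lane] = mask[lane] ? then_val[lane] : else_val[lane];
}

At that width the two temporaries alone are 4 KB each, which is roughly
where the spilling concern comes from.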

> 
> Emitting LLVM IR in such a way that it's able to stop/resume execution in
> the middle of a thread seems hard (though not impossible, since we already
> deal with execution masks, so it would be mostly a matter of spilling all
> input/temp registers and execution masks to/from malloc'ed memory).
Theoretically doable, but only as long as there's no real control flow, I
think. Otherwise it looks pretty impossible to me.
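
FWIW, without control flow the spill/resume idea more or less amounts to
splitting the shader at each barrier and looping over the invocations per
phase, with everything live across the barrier kept in a per-invocation
context. A hand-written C sketch of what the generated code would be
equivalent to (the phase functions, the context struct and LOCAL_SIZE are
made up for illustration):

/* One barrier, so the shader is split into two phases; values that live
 * across the barrier are spilled into a per-invocation context. */
#include <stdlib.h>

#define LOCAL_SIZE 1024   /* hypothetical workgroup size */

struct invocation_ctx {
   unsigned local_id;
   float temp0;           /* temporary that lives across the barrier */
};

static void phase0(struct invocation_ctx *c) { c->temp0 = c->local_id * 2.0f; }
static void phase1(struct invocation_ctx *c) { (void)c; /* ... uses c->temp0 ... */ }

static void
run_workgroup(void)
{
   struct invocation_ctx *ctx = malloc(LOCAL_SIZE * sizeof(*ctx));

   for (unsigned i = 0; i < LOCAL_SIZE; i++) {   /* everything before the barrier */
      ctx[i].local_id = i;
      phase0(&ctx[i]);
   }
   /* the barrier is now implicit: phase0 has finished for all invocations */
   for (unsigned i = 0; i < LOCAL_SIZE; i++)     /* everything after the barrier */
      phase1(&ctx[i]);

   free(ctx);
}

Divergent control flow around the barrier is exactly what breaks this
simple phase splitting.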

> 
> 
> Another solution might be to integrate some third-party library that
> implements so-called green/user-space threads (e.g., via setjmp/longjmp,
> or something else).  I don't know any such library off-hand, and getting
> it to work on all OSes might be far from trivial.  My gut feeling is that
> this would be the most promising option long term: no need to have
> thousands of OS threads, and no need to increase the complexity of LLVM
> code generation.

That looks like a reasonable solution. I'm not really sure, though, that
the overhead of kernel threads is really all that bad compared to
user-space threads (so the 256 or so ordinary threads, which I think is
the most we'd need, might be just fine).
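
Just to make the green-thread idea a bit more concrete, here's a rough
sketch using POSIX ucontext fibers: each invocation of a workgroup becomes
a fiber on a single OS thread, and barrier() simply yields back to a
round-robin scheduler. All names are made up, it assumes barriers sit in
uniform control flow (which compute shaders require anyway), and ucontext
is of course exactly the kind of thing that isn't portable everywhere:

#include <stdlib.h>
#include <ucontext.h>

#define NUM_INVOC  64          /* fibers per workgroup (illustrative) */
#define STACK_SIZE (64 * 1024)

static ucontext_t sched_ctx;   /* the scheduler's context */
static ucontext_t fiber_ctx[NUM_INVOC];
static int        fiber_done[NUM_INVOC];
static int        current;

static void
barrier(void)
{
   /* suspend this invocation; the scheduler resumes the next one */
   swapcontext(&fiber_ctx[current], &sched_ctx);
}

static void
fiber_entry(void)
{
   int id = current;
   /* ... run the shader for invocation 'id' up to the barrier ... */
   barrier();
   /* ... run the rest of the shader ... */
   fiber_done[id] = 1;
}

static void
run_workgroup(void)
{
   int remaining = NUM_INVOC;

   for (int i = 0; i < NUM_INVOC; i++) {
      getcontext(&fiber_ctx[i]);
      fiber_ctx[i].uc_stack.ss_sp = malloc(STACK_SIZE);
      fiber_ctx[i].uc_stack.ss_size = STACK_SIZE;
      fiber_ctx[i].uc_link = &sched_ctx;   /* resumed when the fiber returns */
      makecontext(&fiber_ctx[i], fiber_entry, 0);
      fiber_done[i] = 0;
   }

   while (remaining) {   /* round-robin until every fiber has finished */
      for (int i = 0; i < NUM_INVOC; i++) {
         if (fiber_done[i])
            continue;
         current = i;
         swapcontext(&sched_ctx, &fiber_ctx[i]);
         if (fiber_done[i]) {
            free(fiber_ctx[i].uc_stack.ss_sp);
            remaining--;
         }
      }
   }
}

A whole workgroup stays on one OS thread this way; whether the swapcontext
calls end up meaningfully cheaper than kernel context switches is exactly
the question above.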

Roland



