[Mesa-dev] [PATCH 7/7] softpipe: add support for compute shaders.

Wed Apr 27 20:45:12 UTC 2016

On 27/04/16 18:48, Roland Scheidegger wrote:
> Am 27.04.2016 um 18:13 schrieb Jose Fonseca:
>> On 27/04/16 02:46, Roland Scheidegger wrote:
>>> Am 27.04.2016 um 03:05 schrieb Dave Airlie:
>>>> On 27 April 2016 at 11:00, Dave Airlie <airlied at gmail.com> wrote:
>>>>>>> So far I've set the execmask to 1 active channel, I'm contemplating
>>>>>>> changing that
>>>>>>> though and using less machines.
>>>>>> Ah yes, I think that would indeed be desirable.
>>>>>
>>>>> I'll look into it, though it's not that trivial, since you might
>>>>> have a 1x20x1
>>>>> layout, also having to make sure each thread gets the correct system
>>>>> values.
>>> Looks doable though. I'm mostly asking because the whole point of
>>> compute shaders is things running in parallel, and while that wouldn't
>>> really run in parallel it would at least slightly look like it...
>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>> Any ideas how to implement this in llvm? :-) 1024 CPU threads?
>>>>>> I suppose 1024 is really the minimum work size you have to support?
>>>>>> But since things are always run 4-wide (or 8-wide) that would
>>>>>> "only" be
>>>>>> 256 (or 128) threads. That many threads sound a bit suboptimal to me
>>>>>> (unless you really have a boatload of cpu cores), but why not - I
>>>>>> suppose you can always pause some of the threads, not all need to be
>>>>>> active at the same time.
>>>>>> Though I wonder what the opencl-on-cpu guys do...
>>>>>
>>>>> pocl appears to spawn a number of threads and split the work out
>>>>> amongst
>>>>> them in the X direction.
>>>>>
>>>>> However I'm not seeing how they handle barriers, or if they handle
>>>>> them correctly at all.
>>>>
>>>> Okay newer versions of pocl seem to have some sort of thread scheduler,
>>>> that schedule workgroups across up to 8 threads, however I can't see how
>>>> they deal with barriers still.
>>>
>>> Yes the problem with barriers is what I had in mind too. Otherwise could
>>> just create worker threads, which pick up whatever work items are left.
>>>
>>> Roland
>>
>> Regarding llvmpipe, the simple solution seems indeed to be to use one os
>> thread for one register worth.
>>
>> The second, intermediate, solution is to use a fixed number of threads
>> (ie, equal to the number of CPUs), each using very large vectors (ie,
>> 1024/num-cpus), and let LLVM deal with breaking those vectors into smaller
>> units.
> Are you sure llvm can actually deal with such massive vectors (not just
> in theory but in practice too)?

I believe that LLVM handles vectors larger than native.  But I'm not
sure it's bug free. There's also the problem that lp_bld_arit.c and
friends will never emit any intrinsics for non-native vectors.

> But even if it can, I don't think that would be all that useful. It's
> likely going to result in huge shaders, massive amounts of spilling, not
> to mention divergent control flow is going to be terrible.

Right.

>>
>> Emitting LLVM IR in such a way that it's able to stop/resume execution in the
>> middle of a thread seems hard (though not impossible, since we already
>> deal with execution masks, so it would be mostly a matter of spilling
>> all input/temp registers and execution masks to/from malloc memory).
> Theoretically doable, but only as long as there's no real control flow I
> think. Otherwise looks pretty impossible to me.

I don't see why control flow is a (bigger) problem here: as long as one 
lane needs to block on the barrier, then all lanes need to be spilled.

Basically one could do the same sort of tricks 
http://dunkels.com/adam/pt/expansion.html does, but with LLVM IR instead 
of macros.
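
To make that concrete, here is a rough sketch in plain C of the kind of
stop/resume state machine I have in mind. It is hand-written rather than
generated, and all the names (struct invocation, run_invocation, run_group)
are made up for illustration; nothing like this exists in llvmpipe today.
Everything that must survive the barrier is kept in a struct instead of in
locals, which is the "spilling" mentioned above.

```c
#include <assert.h>

/* Hypothetical per-invocation state: anything that has to survive a
 * barrier lives here rather than in C locals (the "spilled" registers). */
struct invocation {
   int resume_point;   /* 0 = start, 1 = just after the barrier */
   int tmp;            /* a spilled temporary */
   int result;
};

enum { BLOCKED = 0, DONE = 1 };

/* One invocation of a toy shader with a single barrier in the middle.
 * Returns BLOCKED when it stops at the barrier, DONE when it finishes. */
static int run_invocation(struct invocation *inv, int input)
{
   switch (inv->resume_point) {
   case 0:
      inv->tmp = input * 2;         /* work before the barrier */
      inv->resume_point = 1;
      return BLOCKED;               /* barrier: hand control back */
   case 1:
      inv->result = inv->tmp + 1;   /* work after the barrier */
      inv->resume_point = 2;
      return DONE;
   default:
      return DONE;                  /* already finished */
   }
}

/* Run n invocations so that all of them reach the barrier before any
 * of them continues past it. */
static void run_group(struct invocation *invs, const int *inputs, int n)
{
   int pass, i;

   for (pass = 0; pass < 2; pass++) {
      for (i = 0; i < n; i++) {
         int status = run_invocation(&invs[i], inputs[i]);
         assert(status == (pass == 0 ? BLOCKED : DONE));
         (void)status;
      }
   }
}
```

With zero-initialized invocations and inputs {1, 2, 3}, run_group() leaves
results 3, 5 and 7. The real thing would need one resume point per barrier
and would have to spill whole vectors plus the execution mask, but the
control structure would be the same.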

>>
>>
>> Another solution might be to integrate some third-party library that
>> implements so-called green/user-space threads (e.g., via setjmp/longjmp,
>> or something else).  I don't know any such library off-hand, and getting
>> it to work on all OSes might be far from trivial.  My gut feeling is that
>> this would be the most promising option long term: no need to have
>> thousands of OS threads, and no need to increase the complexity of LLVM
>> code generation.
>
> That looks like a reasonable solution.

> I'm not really sure though the
> overhead of kernel threads is really all that bad compared to user-space
> threads (so, 256 ordinary threads or so which I think is the most we'd
> need might be just fine).

I've never seen any other process creating so many threads... The
thought of all those threads in gdb or IDE debuggers is a bit scary.
And I can imagine people who don't particularly care for llvmpipe
internals, but happen to use the SW renderer, might pause when they see
all those threads.

At the very least we'll need to use a more adequate stack size instead of
the defaults, otherwise we'll need 512MB - 2GB just for stack.  And defer
creating all those extra threads until truly needed.

I did a quick search about potential libraries for this, a good starting 
point seems to be  https://swtch.com/libtask/ which is MIT licensed.

IIUC, all it takes is a bit of assembly to switch context.  But it
seems OSes also have handy primitives to do this sort of thing:

- http://linux.die.net/man/3/swapcontext

- Windows 
https://msdn.microsoft.com/en-gb/library/windows/desktop/ms686919(v=vs.85).aspx

llvmpipe threads don't do tricky things like I/O, so maybe instead of a
third-party library, with merely a few simple wrappers abstracting the OS
primitives we might get everything needed.
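
A rough sketch of what such a wrapper could look like on top of the
getcontext/makecontext/swapcontext family, with a barrier implemented as
"yield back to the scheduler". This is only an illustration under my own
assumptions: the names (run_fibers, fiber_main, barrier), the fixed count of
four fibers and the 64KB stacks are all invented, it handles exactly one
barrier, and it is single-threaded. The ucontext functions are available in
glibc but are marked obsolescent in POSIX, so a Windows build would need the
fiber API instead.

```c
#include <assert.h>
#include <stdlib.h>
#include <ucontext.h>

#define NUM_FIBERS 4
#define FIBER_STACK_SIZE (64 * 1024)   /* placeholder, not measured */

static ucontext_t scheduler_ctx;
static ucontext_t fiber_ctx[NUM_FIBERS];
static int current_fiber;
static int shared[NUM_FIBERS];   /* stands in for shared memory */
static int result[NUM_FIBERS];

/* Barrier: save this fiber's context and switch to the scheduler. */
static void barrier(void)
{
   assert(current_fiber >= 0 && current_fiber < NUM_FIBERS);
   swapcontext(&fiber_ctx[current_fiber], &scheduler_ctx);
}

/* Toy "shader": write own slot, barrier, then read the neighbour's slot. */
static void fiber_main(void)
{
   int id = current_fiber;   /* kept on this fiber's private stack */

   shared[id] = id + 1;
   barrier();
   result[id] = shared[(id + 1) % NUM_FIBERS];
   /* Returning switches to uc_link, i.e. back to the scheduler. */
}

/* Run all fibers up to the barrier, then run all of them to completion.
 * Returns 0 on success, -1 on allocation or context setup failure. */
static int run_fibers(void)
{
   char *stacks[NUM_FIBERS] = { NULL };
   int i, pass, ret = 0;

   for (i = 0; i < NUM_FIBERS && ret == 0; i++) {
      stacks[i] = malloc(FIBER_STACK_SIZE);
      if (!stacks[i] || getcontext(&fiber_ctx[i]) != 0) {
         ret = -1;
         break;
      }
      fiber_ctx[i].uc_stack.ss_sp = stacks[i];
      fiber_ctx[i].uc_stack.ss_size = FIBER_STACK_SIZE;
      fiber_ctx[i].uc_link = &scheduler_ctx;
      makecontext(&fiber_ctx[i], fiber_main, 0);
   }

   /* One barrier means two passes over the fibers. */
   for (pass = 0; pass < 2 && ret == 0; pass++) {
      for (i = 0; i < NUM_FIBERS; i++) {
         current_fiber = i;
         swapcontext(&scheduler_ctx, &fiber_ctx[i]);
      }
   }

   for (i = 0; i < NUM_FIBERS; i++)
      free(stacks[i]);
   return ret;
}
```

After run_fibers() returns 0, result[] holds 2, 3, 4, 1: every fiber sees
its neighbour's write because nobody passes the barrier until everybody has
reached it. A general version would keep resuming fibers until all of them
have finished instead of hard-coding two passes.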

Jose