[Mesa-dev] [PATCH 7/7] softpipe: add support for compute shaders.

Wed Apr 27 22:03:05 UTC 2016

On 27.04.2016 at 22:45, Jose Fonseca wrote:
> On 27/04/16 18:48, Roland Scheidegger wrote:
>> On 27.04.2016 at 18:13, Jose Fonseca wrote:
>>> On 27/04/16 02:46, Roland Scheidegger wrote:
>>>> On 27.04.2016 at 03:05, Dave Airlie wrote:
>>>>> On 27 April 2016 at 11:00, Dave Airlie <airlied at gmail.com> wrote:
>>>>>>>> So far I've set the execmask to 1 active channel. I'm contemplating
>>>>>>>> changing that, though, and using fewer machines.
>>>>>>> Ah yes, I think that would indeed be desirable.
>>>>>>
>>>>>> I'll look into it, though it's not that trivial, since you might have
>>>>>> a 1x20x1 layout, also having to make sure each thread gets the correct
>>>>>> system values.
>>>> Looks doable though. I'm mostly asking because the whole point of
>>>> compute shaders is things running in parallel, and while that wouldn't
>>>> really run in parallel it would at least slightly look like it...
>>>>
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Any ideas how to implement this in llvm? :-) 1024 CPU threads?
>>>>>>> I suppose 1024 is really the minimum work size you have to support?
>>>>>>> But since things are always run 4-wide (or 8-wide) that would "only" be
>>>>>>> 256 (or 128) threads. That many threads sounds a bit suboptimal to me
>>>>>>> (unless you really have a boatload of cpu cores), but why not - I
>>>>>>> suppose you can always pause some of the threads, not all need to be
>>>>>>> active at the same time.
>>>>>>> Though I wonder what the opencl-on-cpu guys do...
>>>>>>
>>>>>> pocl appears to spawn a number of threads and split the work out amongst
>>>>>> them in the X direction.
>>>>>>
>>>>>> However I'm not seeing how they handle barriers, or if they handle
>>>>>> them correctly at all.
>>>>>
>>>>> Okay, newer versions of pocl seem to have some sort of thread scheduler
>>>>> that schedules workgroups across up to 8 threads, however I still can't
>>>>> see how they deal with barriers.
>>>>
>>>> Yes, the problem with barriers is what I had in mind too. Otherwise we
>>>> could just create worker threads, which pick up whatever work items are
>>>> left.
>>>>
>>>> Roland
>>>
>>> Regarding llvmpipe, the simple solution seems indeed to be to use one OS
>>> thread per register's worth of invocations.
>>>
>>> The second, intermediate, solution is to use only as many threads as there
>>> are CPUs, each using very large vectors (i.e., 1024/num-cpus elements), and
>>> let LLVM deal with breaking those vectors into smaller units.
>> Are you sure llvm can actually deal with such massive vectors (not just
>> in theory but in practice too)?
> 
> I believe that LLVM handles vectors larger than native.  But I'm not
> sure it's bug free. There's also the problem that lp_bld_arit.c and
> friends will never emit any intrinsics for non native vectors.
It definitely does handle vectors larger than native, I'm just wondering
if it really works with huge sizes...
We do have some mostly preliminary support for larger-than-native
intrinsics but it's indeed not really fully fledged.

> 
>> But even if it can, I don't think that would be all that useful. It's
>> likely going to result in huge shaders, massive amounts of spilling, not
>> to mention divergent control flow is going to be terrible.
> 
> Right.
> 
>>>
>>> Emitting LLVM IR in such a way that it's able to stop/resume execution in
>>> the middle of a thread seems hard (though not impossible, since we already
>>> deal with execution masks, so it would be mostly a matter of spilling
>>> all input/temp registers and execution masks to/from malloc'ed memory).
>> Theoretically doable, but only as long as there's no real control flow I
>> think. Otherwise looks pretty impossible to me.
> 
> I don't see why control flow is a (bigger) problem here: as long as one
> lane needs to block on the barrier, then all lanes need to be spilled.
I was more worried about the fact you need to continue control flow
where you stopped execution.

> 
> Basically one could do the same sort of tricks
> http://dunkels.com/adam/pt/expansion.html does, but with LLVM IR instead
> of macros.
But yes, that looks like it would work. That switch() is quite nifty (or
should I say crazy)...
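
To make the idea a bit more concrete, here is a minimal sketch of that
switch() trick in plain C rather than LLVM IR. Everything in it (struct
cs_group_state, cs_run, cs_dispatch, the 4-lane/2-group sizes) is made up for
illustration and is not existing llvmpipe code. Each "group" stands for one
register's worth of invocations; at the barrier it saves its resume point and
temps and returns, and the dispatcher re-enters it once every group has
arrived. Divergent control flow around the barrier is deliberately not handled.

```c
#define CS_LANES  4   /* invocations per vector (assumed) */
#define CS_GROUPS 2   /* vectors per workgroup (assumed)  */

enum cs_status { CS_BLOCKED, CS_DONE };

/* Hypothetical state that has to survive a barrier. */
struct cs_group_state {
   int resume_point;      /* which case label to continue from */
   int base_id;           /* invocation id of lane 0 */
   int temp[CS_LANES];    /* stand-in for spilled temp registers */
};

/* Toy shader: write own slot, barrier, read the neighbour's slot. */
static enum cs_status
cs_run(struct cs_group_state *s, int *shared, int *out)
{
   const int n = CS_LANES * CS_GROUPS;
   int lane;

   switch (s->resume_point) {
   case 0:
      for (lane = 0; lane < CS_LANES; lane++) {
         s->temp[lane] = s->base_id + lane;
         shared[s->temp[lane]] = s->temp[lane] * 10;
      }
      s->resume_point = 1;          /* barrier: remember where we stopped */
      return CS_BLOCKED;
   case 1:
      for (lane = 0; lane < CS_LANES; lane++)
         out[s->temp[lane]] = shared[(s->temp[lane] + 1) % n];
      s->resume_point = 2;
      return CS_DONE;
   default:
      return CS_DONE;
   }
}

/* Run all groups up to the barrier, then resume them until all are done. */
static void
cs_dispatch(int *out)
{
   int shared[CS_LANES * CS_GROUPS];
   struct cs_group_state g[CS_GROUPS];
   int i, done;

   for (i = 0; i < CS_GROUPS; i++) {
      g[i].resume_point = 0;
      g[i].base_id = i * CS_LANES;
   }
   do {
      done = 0;
      for (i = 0; i < CS_GROUPS; i++)
         if (cs_run(&g[i], shared, out) == CS_DONE)
            done++;
   } while (done < CS_GROUPS);
}
```

Note that lane 3 of the first group reads a slot written by the second group,
so the result is only right because nobody continues past the barrier until
all groups have reached it.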

> 
>>>
>>>
>>> Another solution might be to integrate some third-party library that
>>> implements so-called green/user-space threads (e.g., via setjmp/longjmp,
>>> or something else).  I don't know any such library off-hand, and getting
>>> it to work on all OSes might be far from trivial.  My gut feeling is that
>>> this would be the most promising option long term: no need to have
>>> thousands of OS threads, and no need to increase the complexity of LLVM
>>> code generation.
>>
>> That looks like a reasonable solution.
> 
>> I'm not really sure though the
>> overhead of kernel threads is really all that bad compared to user-space
>> threads (so, 256 ordinary threads or so which I think is the most we'd
>> need might be just fine).
> 
> I've never seen any other process creating so many threads... The
> thought of all those threads in gdb or IDE debuggers is a bit scary. And
> I can imagine people that don't particularly care for llvmpipe
> internals, but happen to use a SW renderer, might pause when they see all
> those threads.
Hmm yes that would look scary.

> 
> 
> At the very least we'll need to use a more adequate stack size instead of
> the defaults, otherwise we'll need 512M - 2GB just for stack.  And defer
> creating all those extra threads until truly needed.
Actually stack size can't be that small. At the very least, we need
enough stack to be able to spill the temp reg file. That's 4096 entries
max, 4 elements with 4 bytes each. With 8-wide shaders, that's already
half a MB. (Of course, typically the jit shader should never have to
really spill all of it).
Though at least on 64bit that shouldn't be much of a problem (it's just
virtual memory after all, not always backed by physical pages).
I absolutely agree though that the threads should only be created when needed.
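
For reference, the arithmetic behind that half-MB figure as a tiny C helper.
The function names are made up for illustration; the numbers are the ones
from above (4096 temps, 4 channels, 4 bytes per channel) plus the 1024
invocations per workgroup mentioned earlier in the thread.

```c
#include <stddef.h>

/* Worst-case bytes needed to spill the whole temp register file of one
 * thread running a simd_width-wide shader. */
static size_t
cs_temp_spill_bytes(size_t entries, size_t channels,
                    size_t bytes_per_channel, size_t simd_width)
{
   return entries * channels * bytes_per_channel * simd_width;
}

/* Same thing summed over all threads needed for one workgroup. */
static size_t
cs_workgroup_spill_bytes(size_t invocations, size_t simd_width)
{
   size_t threads = invocations / simd_width;

   return threads * cs_temp_spill_bytes(4096, 4, 4, simd_width);
}
```

cs_temp_spill_bytes(4096, 4, 4, 8) is 524288 bytes (512 KiB) per thread, and
256 KiB for 4-wide. Summed over a 1024-invocation workgroup the worst case is
64 MiB either way, since 4-wide just means twice as many threads with half
the spill space each.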

> 
> 
> I did a quick search about potential libraries for this, a good starting
> point seems to be  https://swtch.com/libtask/ which is MIT licensed.
> 
> IIUC, all it takes is a bit of assembly to switch context.  But it
> seems OSes also have handy primitives to do this sort of thing:
> 
> - http://linux.die.net/man/3/swapcontext
> 
> - Windows
> https://msdn.microsoft.com/en-gb/library/windows/desktop/ms686919(v=vs.85).aspx
> 
> 
> 
> llvmpipe threads don't do tricky things like I/O, so maybe instead of a
> third-party library, merely a few simple wrappers abstracting the OS
> primitives might get us everything needed.
> 

Yes, I suppose that would ultimately be the best solution.
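
A rough sketch of what the barrier could look like on top of the ucontext
primitives (getcontext/makecontext/swapcontext, so Linux/glibc only; Windows
would need the fiber equivalent behind the same wrapper). All names, the task
count and the 64 KiB stack size are assumptions for illustration, not existing
llvmpipe code. Each task stands for one register's worth of invocations, and
the barrier simply switches back to the scheduler, which resumes everybody on
the next pass.

```c
#include <stdlib.h>
#include <ucontext.h>

#define CS_NUM_TASKS  4             /* user-space threads (assumed) */
#define CS_STACK_SIZE (64 * 1024)   /* per-task stack (assumed)     */

static ucontext_t cs_sched_ctx;
static ucontext_t cs_task_ctx[CS_NUM_TASKS];
static int cs_task_done[CS_NUM_TASKS];
static int cs_shared[CS_NUM_TASKS];
static int cs_result[CS_NUM_TASKS];

/* Barrier: save this task's context and switch back to the scheduler. */
static void
cs_barrier(int id)
{
   swapcontext(&cs_task_ctx[id], &cs_sched_ctx);
}

/* Toy shader: write own slot, barrier, read the neighbour's slot. */
static void
cs_task(int id)
{
   cs_shared[id] = (id + 1) * 100;
   cs_barrier(id);
   cs_result[id] = cs_shared[(id + 1) % CS_NUM_TASKS];
   cs_task_done[id] = 1;            /* returning continues at uc_link */
}

/* Returns 0 on success, -1 if a context or stack could not be set up. */
static int
cs_run_workgroup(void)
{
   void *stacks[CS_NUM_TASKS] = { NULL };
   int i, remaining, ret = 0;

   for (i = 0; i < CS_NUM_TASKS && ret == 0; i++) {
      stacks[i] = malloc(CS_STACK_SIZE);
      if (!stacks[i] || getcontext(&cs_task_ctx[i]) != 0) {
         ret = -1;
         break;
      }
      cs_task_ctx[i].uc_stack.ss_sp = stacks[i];
      cs_task_ctx[i].uc_stack.ss_size = CS_STACK_SIZE;
      cs_task_ctx[i].uc_link = &cs_sched_ctx;
      cs_task_done[i] = 0;
      makecontext(&cs_task_ctx[i], (void (*)(void))cs_task, 1, i);
   }

   /* Each pass runs every unfinished task up to its next barrier (or end). */
   remaining = (ret == 0) ? CS_NUM_TASKS : 0;
   while (remaining > 0) {
      remaining = 0;
      for (i = 0; i < CS_NUM_TASKS; i++) {
         if (!cs_task_done[i])
            swapcontext(&cs_sched_ctx, &cs_task_ctx[i]);
         if (!cs_task_done[i])
            remaining++;
      }
   }

   for (i = 0; i < CS_NUM_TASKS; i++)
      free(stacks[i]);
   return ret;
}
```

This only works as a barrier because every task hits the same number of
barriers, which a valid shader has to guarantee anyway. It all runs on a
single OS thread; spreading workgroups over a few real threads would be a
separate step on top.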

Roland