[Mesa-dev] [PATCH 7/7] softpipe: add support for compute shaders.
jfonseca at vmware.com
Wed Apr 27 20:45:12 UTC 2016
On 27/04/16 18:48, Roland Scheidegger wrote:
> Am 27.04.2016 um 18:13 schrieb Jose Fonseca:
>> On 27/04/16 02:46, Roland Scheidegger wrote:
>>> Am 27.04.2016 um 03:05 schrieb Dave Airlie:
>>>> On 27 April 2016 at 11:00, Dave Airlie <airlied at gmail.com> wrote:
>>>>>>> So far I've set the execmask to 1 active channel; I'm contemplating
>>>>>>> changing that though and using fewer machines.
>>>>>> Ah yes, I think that would indeed be desirable.
>>>>> I'll look into it, though it's not that trivial, since you might
>>>>> have a 1x20x1 layout, and you also have to make sure each thread
>>>>> gets the correct system values.
>>> Looks doable though. I'm mostly asking because the whole point of
>>> compute shaders is things running in parallel, and while that wouldn't
>>> really run in parallel it would at least slightly look like it...
>>>>>>> Any ideas how to implement this in llvm? :-) 1024 CPU threads?
>>>>>> I suppose 1024 is really the minimum work size you have to support?
>>>>>> But since things are always run 4-wide (or 8-wide) that would
>>>>>> "only" be
>>>>>> 256 (or 128) threads. That many threads sound a bit suboptimal to me
>>>>>> (unless you really have a boatload of cpu cores), but why not - I
>>>>>> suppose you can always pause some of the threads, not all need to be
>>>>>> active at the same time.
>>>>>> Though I wonder what the opencl-on-cpu guys do...
>>>>> pocl appears to spawn a number of threads and split the work out
>>>>> among them in the X direction.
>>>>> However I'm not seeing how they handle barriers, or if they handle
>>>>> them correctly at all.
>>>> Okay, newer versions of pocl seem to have some sort of thread scheduler
>>>> that schedules workgroups across up to 8 threads; however, I still can't
>>>> see how they deal with barriers.
>>> Yes, the problem with barriers is what I had in mind too. Otherwise one
>>> could just create worker threads, which pick up whatever work items are left.
>> Regarding llvmpipe, the simple solution seems indeed to be to use one OS
>> thread per register's worth of invocations.
>> The second, intermediate, solution is to use the same number of threads
>> (i.e., equal to the number of CPUs), each using very large vectors (i.e.,
>> 1024/num-cpus wide), and let LLVM deal with breaking those vectors into
>> smaller, native-sized ones.
> Are you sure llvm can actually deal with such massive vectors (not just
> in theory but in practice too)?
I believe that LLVM handles vectors larger than the native width, but I'm
not sure it's bug-free. There's also the problem that lp_bld_arit.c and
friends will never emit any intrinsics for non-native vectors.
> But even if it can, I don't think that would be all that useful. It's
> likely going to result in huge shaders, massive amounts of spilling, not
> to mention divergent control flow is going to be terrible.
>> Emitting LLVM IR in such a way that it's able to stop/resume execution in
>> the middle of a thread seems hard (though not impossible, since we already
>> deal with execution masks, so it would be mostly a matter of spilling all
>> input/temp registers and execution masks to/from malloc'ed memory).
> Theoretically doable, but only as long as there's no real control flow I
> think. Otherwise looks pretty impossible to me.
I don't see why control flow is a (bigger) problem here: as soon as one
lane needs to block on the barrier, all lanes need to be spilled.
Basically one could do the same sort of tricks
http://dunkels.com/adam/pt/expansion.html does, but with LLVM IR instead.
>> Another solution might be to integrate some thirdparty library that
>> implements so called green/user-space threads (e.g, via setjmp/longjmp,
>> or something else). I don't know any such library off-hand, and getting
>> one to work on all OSes might be far from trivial. My gut feeling is that
>> this would be the most promising option long term: no need to have
>> thousands of OS threads, and no need to increase the complexity of the
>> LLVM code generation.
> That looks like a reasonable solution.
> I'm not really sure though that the
> overhead of kernel threads is really all that bad compared to user-space
> threads (so 256 ordinary threads or so, which I think is the most we'd
> need, might be just fine).
I've never seen any other process creating so many threads... The
thought of all those threads in gdb or IDE debuggers is a bit scary.
And I can imagine that people who don't particularly care about llvmpipe
internals, but happen to use the SW renderer, might pause when they see all
those threads.
At the very least we'll need to use a more adequate stack size instead of
the defaults, otherwise we'll need 512 MB - 2 GB just for stacks. And we
should defer creating all those extra threads until truly needed.
I did a quick search for potential libraries for this; a good starting
point seems to be https://swtch.com/libtask/ which is MIT licensed.
IIUC, all it takes is a bit of assembly to switch contexts. But it seems
OSes also provide handy primitives to do this sort of thing.
llvmpipe threads don't do tricky things like I/O, so maybe instead of a
third-party library, a few simple wrappers abstracting the OS primitives
would get us everything we need.