[Mesa-dev] Introducing OpenSWR: High performance software rasterizer

Wed Oct 21 17:05:16 PDT 2015

On 22.10.2015 at 00:41, Rowley, Timothy O wrote:
> 
>> On Oct 20, 2015, at 2:03 PM, Roland Scheidegger <sroland at vmware.com> wrote:
>>
>> Certainly looks interesting...
>> From a high level point of view, seems quite similar to llvmpipe (both
>> tile based, using llvm for jitting shaders, ...). Of course llvmpipe
>> isn't well suited for this kind of workload (the most important use
>> case is desktop compositing, so a couple dozen vertices per frame but
>> millions of pixels...). Making vertex loads scale is something which
>> just wasn't worth the effort so far (there's not actually that many
>> people working on llvmpipe), albeit we realize that the completely
>> non-parallel nature of it currently actually can hinder scaling quite a
>> bit even for "typical" workloads (not desktop compositing, but "simple"
>> 3d apps) once you've got enough cores/threads (8 or so), but that's
>> something we're not worried too much about.
>> I think requiring exactly llvm 3.6 probably isn't going to work if you want to
>> upstream this. A minimum version of 3.6 is fine, but the general rule is
>> things should still work with newer versions (including current
>> development version, seems like you're using c++ interface of llvm quite
>> a bit so that's probably going to require some #ifdef mess). Albeit I
>> guess if you just don't try to build the driver with non-released
>> versions that's probably ok (but will limit the ability for some people
>> to try out your driver).
> 
> Some differences between llvmpipe and swr based on my understanding of llvmpipe’s architecture:
> 
> threading model
> 	llvmpipe: single threaded vertex processing, up to 16 rasterization threads
The limit is actually pretty much arbitrary. Though since vertex
processing is single threaded, there are definitely practical scaling
limits (and having more threads than render tiles wouldn't show any
advantage).

> 	swr: common thread pool that picks up frontend or backend work as available
> vertex processing
> 	llvmpipe: entire draw call processed in a single pass
> 	swr: large draws chopped into chunks that can be processed in parallel
> frontend/backend coupling
> 	llvmpipe: separate binning pass in single threaded frontend
> 	swr: frontend vertex processing and binning combined in a single pass
There are definite advantages to swr there. llvmpipe's binning pass
isn't really separate from vertex processing, so this being
single-threaded is more of a result of vertex processing also being
handled in the same frontend thread (though of course if it were
multithreaded some extra logic would be needed for things to stay
correctly in order).
Part of it is due to draw really being separate from llvmpipe (it can
be and is used by other drivers), so the "interface" between vs and fs is
rather simple. But certainly it's not like this is set in stone, rather
no one had the time to do something a bit more scalable there...
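
To illustrate the ordering point, here is a minimal sketch of chopping a draw into chunks that are shaded in parallel while the output stays in submission order. This is not llvmpipe or swr code: `shade_vertex`, `process_draw` and the chunk size are hypothetical, and it spawns one thread per chunk for brevity where a real driver would use a thread pool.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for a jitted vertex shader.
static int shade_vertex(int v) { return v * 2 + 1; }

// Split [0, vertex_count) into chunks, shade each chunk on its own thread,
// and return the results in the original draw order.
std::vector<int> process_draw(std::size_t vertex_count, std::size_t chunk_size)
{
    if (chunk_size == 0)
        chunk_size = 1; // avoid an endless loop on a zero chunk size

    std::vector<int> out(vertex_count);
    std::vector<std::thread> workers;

    for (std::size_t first = 0; first < vertex_count; first += chunk_size) {
        const std::size_t last = std::min(first + chunk_size, vertex_count);
        // Each chunk writes only its own slice of `out`, so no locking is
        // needed and the results land in submission order.
        workers.emplace_back([&out, first, last] {
            for (std::size_t i = first; i < last; ++i)
                out[i] = shade_vertex(static_cast<int>(i));
        });
    }
    for (auto &w : workers)
        w.join();
    return out;
}
```

Calling `process_draw(10, 4)` shades three chunks (4, 4 and 2 vertices). Binning would need a similar "write into a slot reserved by submission order" rule, which is the extra logic mentioned above.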

> primitive assembly and binning
> 	llvmpipe: scalar c code
There's actually some jit code there plus some manual sse code (though
there is still a c fallback). Albeit it is indeed not quite as parallel as
I'd like (it only works on a single primitive at a time).
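
For contrast, working on several primitives at once usually means a structure-of-arrays layout, so one loop iteration maps to one SIMD lane. The following is only a sketch under that assumption; `TriangleBatch` and `signed_areas_x2` are hypothetical names and do not come from either driver.

```cpp
#include <cstddef>
#include <vector>

// Structure-of-arrays batch: element i of every array belongs to triangle i.
struct TriangleBatch {
    std::vector<float> x0, y0, x1, y1, x2, y2;
};

// Twice the signed area of every triangle in the batch. The loop body has
// no cross-iteration dependency, so a compiler is free to vectorize it.
std::vector<float> signed_areas_x2(const TriangleBatch &b)
{
    const std::size_t n = b.x0.size();
    std::vector<float> area(n);
    for (std::size_t i = 0; i < n; ++i) {
        area[i] = (b.x1[i] - b.x0[i]) * (b.y2[i] - b.y0[i])
                - (b.x2[i] - b.x0[i]) * (b.y1[i] - b.y0[i]);
    }
    return area;
}
```

The sign of the result gives the winding, which is the kind of per-primitive setup value that can be computed for a whole batch in one pass.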

> 	swr: x86 avx/avx2 working on vector of primitives
> fragment processing
> 	llvmpipe: single jitted shader combining depth/fragment/stencil/blend on 16x16 block
It is working on a 4x4 block actually, but otherwise that's right.

> 	swr: separate jitted fragment and blend shaders, plus templated depth test
> in-memory representation
> 	llvmpipe: direct access to render targets
> 	swr: hot-tile working representation with load and/or store at required times
This is actually an interesting difference, of course also tied to
llvmpipe integrating everything together into the fragment shader.
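
A rough sketch of the hot-tile idea as I understand it from the description: work happens on a small tile-sized copy that is loaded from and stored back to the render target only when required. Everything here (`HotTile`, `load_tile`, `store_tile`, the 4x4 tile size, the 32-bit pixel format) is an assumption for illustration and not swr's actual data structure.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kTileDim = 4; // hypothetical tile size, kept small

// Render target as a row-major array of 32-bit pixels. Width and height
// are assumed to be multiples of kTileDim.
struct RenderTarget {
    std::size_t width;
    std::size_t height;
    std::vector<std::uint32_t> pixels;
};

// Working copy of one tile plus a flag saying whether it needs a store.
struct HotTile {
    std::array<std::uint32_t, kTileDim * kTileDim> data{};
    bool dirty = false;
};

// Index of pixel (x, y) of tile (tx, ty) inside the render target.
static std::size_t rt_index(const RenderTarget &rt, std::size_t tx,
                            std::size_t ty, std::size_t x, std::size_t y)
{
    return (ty * kTileDim + y) * rt.width + (tx * kTileDim + x);
}

// Copy tile (tx, ty) from the render target into the working copy.
void load_tile(const RenderTarget &rt, std::size_t tx, std::size_t ty,
               HotTile &tile)
{
    for (std::size_t y = 0; y < kTileDim; ++y)
        for (std::size_t x = 0; x < kTileDim; ++x)
            tile.data[y * kTileDim + x] = rt.pixels[rt_index(rt, tx, ty, x, y)];
    tile.dirty = false;
}

// Backend work touches only the working copy.
void write_pixel(HotTile &tile, std::size_t x, std::size_t y,
                 std::uint32_t value)
{
    tile.data[y * kTileDim + x] = value;
    tile.dirty = true;
}

// Write the working copy back, but only if something changed.
void store_tile(RenderTarget &rt, std::size_t tx, std::size_t ty,
                HotTile &tile)
{
    if (!tile.dirty)
        return;
    for (std::size_t y = 0; y < kTileDim; ++y)
        for (std::size_t x = 0; x < kTileDim; ++x)
            rt.pixels[rt_index(rt, tx, ty, x, y)] = tile.data[y * kTileDim + x];
    tile.dirty = false;
}
```

With direct access there is no load/store step at all; the trade-off is that the working format is then tied to the render target's layout.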

So yes, these are all definitely significant architectural differences
to llvmpipe. But most of it (the combined fragment shader / backend
jit code being the exception) is not really due to a conscious design decision - I'd
happily accept patches to make it possible to do vertex processing in
parallel :-).

> As you say, we do use LLVM’s C++ API.  While that has some advantages, it’s not guaranteed to be stable and can/does make nontrivial changes.  3.6 to 3.7 made some change to at least the GEP instruction which we could work around if necessary for upstreaming.
IMHO you should really try to keep up at least with llvm releases (and
ideally llvm head). Otherwise you make it a pain to build not just for
users but developers alike (and if stuff doesn't get at least built, it
has a tendency to break quite often when there's gallium interface
changes etc.).
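
The usual way to cope is a version guard around the affected calls. A minimal sketch follows: `LLVM_VERSION_MAJOR` and `LLVM_VERSION_MINOR` are the macros LLVM provides in `llvm/Config/llvm-config.h`, while `SWR_LLVM_VERSION`, `GepApi` and `gep_api_variant` are hypothetical names, and the branch bodies are placeholders rather than real GEP code.

```cpp
// Fallback values so this sketch compiles without LLVM installed. In a
// real build these come from <llvm/Config/llvm-config.h>.
#ifndef LLVM_VERSION_MAJOR
#define LLVM_VERSION_MAJOR 3
#endif
#ifndef LLVM_VERSION_MINOR
#define LLVM_VERSION_MINOR 7
#endif

// Hypothetical helper: pack major/minor into one comparable number,
// e.g. 3.6 -> 306, 3.7 -> 307.
#define SWR_LLVM_VERSION (LLVM_VERSION_MAJOR * 100 + LLVM_VERSION_MINOR)

enum class GepApi { Pre37, From37 };

// Hypothetical wrapper: one place that knows which API flavour to use, so
// the #ifdef does not spread through the rest of the driver.
inline GepApi gep_api_variant()
{
#if SWR_LLVM_VERSION >= 307
    // Placeholder for the code path written against 3.7 and later.
    return GepApi::From37;
#else
    // Placeholder for the code path written against 3.6.
    return GepApi::Pre37;
#endif
}
```

Keeping the guards inside a handful of wrappers like this makes it cheaper to follow each new llvm release.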

Roland

> 
> -Tim
>